Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.
First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.
Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.
Coverage includes the following:
• Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop
• Understanding YARN resource management, HDFS storage, and I/O
• Designing data ingestion, movement, and organization for modern enterprise data platforms
• Defining SQL engine strategies to meet strict SLAs
• Considering security, data isolation, and scheduling for multitenant environments
• Deploying Hadoop as a service in the cloud
• Reviewing the essential concepts, capabilities, and terminology of virtualization
• Applying current best practices, guidelines, and key metrics for Hadoop virtualization
• Managing multiple Hadoop frameworks and products as one unified system
• Virtualizing master and worker nodes to maximize availability and performance
• Installing and configuring Linux for a Hadoop environment
Transformation
Summary
CHAPTER 2 Hadoop Fundamental Concepts
Types of Data in Hadoop
Use Cases
What Is Hadoop?
Hadoop Distributions
Hadoop Frameworks
NoSQL Databases
What Is NoSQL?
A Hadoop Cluster
Hadoop Software Processes
Hadoop Hardware Profiles
Roles in the Hadoop Environment
Summary
CHAPTER 3 YARN and HDFS
A Hadoop Cluster Is Distributed
Hadoop Directory Layouts
Hadoop Operating System Users
The Hadoop Distributed File System
YARN Logging
The NameNode
The DataNode
Block Placement
NameNode Configurations and Managing Metadata
Rack Awareness
Block Management
The Balancer
Maintaining Data Integrity in the Cluster
Quotas and Trash
YARN and the YARN Processing Model
Running Applications on YARN
Resource Schedulers
Benchmarking
TeraSort Benchmarking Suite
Summary
CHAPTER 4 The Modern Data Platform
Designing a Hadoop Cluster
Enterprise Data Movement
Summary
CHAPTER 5 Data Ingestion
Extraction, Loading, and Transformation (ELT)
Sqoop: Data Movement with SQL Sources
Flume: Streaming Data
Oozie: Scheduling and Workflow
Falcon: Data Lifecycle Management
Kafka: Real-time Data Streaming
Summary
CHAPTER 6 Hadoop SQL Engines
Where SQL Was Born
SQL in Hadoop
Hadoop SQL Engines
Selecting the SQL Tool For Hadoop
Now Getting Groovy with Hive and Pig
Hive
HCatalog
Pig
Summary
CHAPTER 7 Multitenancy in Hadoop
Securing the Access
Authentication
Auditing
Authorization
Data Protection
Isolating the Data
Isolating the Process
Summary
Part II: Introduction to Virtualization
CHAPTER 8 Virtualization Fundamentals
Why Virtualize Hadoop?
Introduction to Virtualization
Summary
References
CHAPTER 9 Best Practices for Virtualizing Hadoop
Running Virtualized Hadoop with Purpose and Discipline
The Discipline of Purpose Starts with a Clear Target
Virtualizing Different Tiers of Hadoop
Industry Best Practices
Summary
Part III: Virtualizing Hadoop
CHAPTER 10 Virtualizing Hadoop
How Are Hadoop Ecosystems Going to Be Managed?
Building an Enterprise Hadoop Platform That Is Agile and Flexible
Clarification of Terms
The Journey from Bare-Metal to Virtualization
Why Consider Virtualizing Hadoop?
Benefits of Virtualizing Hadoop
Virtualized Hadoop Can Run as Fast or Faster Than Native
Coordination and Cross-Purpose Specialization Is the Future
Barriers Can Be Organizational
Virtualization Is Not an All or Nothing Option
Rapid Provisioning and Improving Quality of Development and Test Environments
Improve High Availability with Virtualization
Use Virtualization to Leverage Hadoop Workloads
Hadoop in the Cloud
Big Data Extensions
The Path to Virtualization
The Software-Defined Data Center
Virtualizing the Network
vRealize Suite
Summary
References
CHAPTER 11 Virtualizing Hadoop Master Servers
Virtualizing Servers in a Hadoop Cluster
Virtualizing the Environment Around Hadoop
Virtualizing the Master Hadoop Servers
Virtualizing Without the SAN
Summary
CHAPTER 12 Virtualizing the Hadoop Worker Nodes
A Brief Introduction to the Worker Nodes in Hadoop
Deployment Models for Hadoop Clusters
The Combined Model
The Separated Model
Network Effects of the Data-Compute Separation
The Shared-Storage Approach to the Data-Compute Separated Model
Resources
CHAPTER 13 Deploying Hadoop as a Service in the Private Cloud
The Cloud Context
Stakeholders for Hadoop
Overview of the Solution Architecture
Summary
References
CHAPTER 14 Understanding the Installation of Hadoop
Map the Right Solutions to the Right Use Case
Thoughts About Installing Hadoop
Configuring Repositories
Installing HDP 2.2
Environment Preparation
Setting Up the Hadoop Configuration
Starting HDFS and YARN
Start YARN
Verifying MapReduce Functionality
Installing and Configuring Hive
Installing and Configuring MySQL Database
Installing and Configuring Hive and HCatalog
Summary
CHAPTER 15 Configuring Linux for Hadoop
Supported Linux Platforms
Different Deployment Models
Linux Golden Templates
Building a Linux Enterprise Hadoop Platform
Selecting the Linux Distribution
Optimal Linux Kernel Parameters and System Settings
epoll
Disable Swap Space
Disable Security During Install
IO Scheduler Tuning
Check Transparent Huge Pages Configuration
Limits.conf
Partition Alignment for RDMs
File System Considerations
Lazy Count Parameter for XFS
Mount Options
I/O Scheduler
Disk Read and Write Options
Storage Benchmarking
Java Version
Set Up NTP
Enable Jumbo Frames
Additional Network Considerations
Summary
Appendix A Hadoop Cluster Creation: A Prerequisite Checklist
Appendix B Big Data/Hadoop on VMware vSphere Reference Materials
Deployment Guides
Reference Architectures
Customer Case Studies
Performance
vSphere Big Data Extensions (BDE)
Other vSphere Features and Big Data