Ready to Work Across Large Data Sets With the Best Hadoop Books

This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
What You Will Learn
- Store large datasets with the Hadoop Distributed File System (HDFS)
- Run distributed computations with MapReduce (a minimal word-count sketch follows this list)
- Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
- Load data from relational databases into HDFS, using Sqoop
- Perform large-scale data processing with the Pig query language
- Analyze datasets with Hive, Hadoop’s data warehousing system
- Take advantage of HBase for structured and semi-structured data and ZooKeeper for building distributed systems
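To make the MapReduce bullet above concrete, here is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API that the book covers. It is an illustrative sketch rather than an excerpt from any book; input and output paths are assumed to arrive as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner shrinks map output before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```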
With the almost unfathomable increase in web traffic over recent years, driven by millions of connected users, businesses are gaining access to massive amounts of complex, unstructured data from which to draw insight.
When Hadoop was introduced by Yahoo in 2007, it brought a paradigm shift in how this data was stored and analyzed. Hadoop allowed small and medium-sized companies to store huge amounts of data on cheap commodity servers in racks, and Big Data has since allowed businesses to make decisions based on quantifiable analysis.
Hadoop is now deployed at major organizations such as Amazon, IBM, Cloudera, and Dell, to name a few. This book introduces you to Hadoop and to concepts such as ‘MapReduce’, ‘Rack Awareness’, ‘YARN’ and ‘HDFS Federation’, which will help you get acquainted with the technology.
Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.
What You Will Learn
- Learn fundamental components such as MapReduce, HDFS, and YARN
- Explore MapReduce in depth, including steps for developing applications with it
- Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
- Learn two data formats: Avro for data serialization and Parquet for nested data
- Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
- Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
- Learn the HBase distributed database and the ZooKeeper distributed configuration service
- Understand the challenges of securing distributed systems, particularly Hadoop
- Use best practices for preparing Hadoop cluster hardware as securely as possible
- Get an overview of the Kerberos network authentication protocol (a client login sketch follows this list)
- Delve into authorization and accounting principles as they apply to Hadoop
- Learn how to use mechanisms to protect data in a Hadoop cluster, both in transit and at rest
- Integrate Hadoop data ingest into the enterprise-wide security architecture
- Ensure that security architecture reaches all the way to end-user access
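As a small illustration of the Kerberos and data-protection topics listed above, the sketch below shows how a Java client might log in to a secured cluster from a keytab before touching HDFS. It is a hedged example, not material from the book; the principal name and keytab path are placeholders, and the security properties normally come from the cluster's core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On a secured cluster this is usually already set in core-site.xml.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Placeholder principal and keytab; substitute real values for your realm.
    UserGroupInformation.loginUserFromKeytab(
        "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

    // After the login, HDFS calls run as the authenticated principal.
    try (FileSystem fs = FileSystem.get(conf)) {
      System.out.println("Home directory: " + fs.getHomeDirectory());
    }
  }
}
```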
It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop.
This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available.
What You Will Learn
- Thoroughly updated for Hadoop 2
- Write YARN applications
- Integrate real-time technologies like Storm, Impala, and Spark
- Predictive analytics using Mahout and R
- Readers should know a programming language such as Java and have basic familiarity with Hadoop
Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials, and extend it to meet your unique challenges.
What You Will Learn
- Understanding Hadoop and the Hadoop Distributed File System (HDFS)
- Importing data into Hadoop, and processing it there (see the HDFS copy sketch after this list)
- Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts
- Making the most of Apache Pig and Apache Hive
- Implementing and administering YARN
- Taking advantage of the full Hadoop ecosystem
- Managing Hadoop clusters with Apache Ambari
- Working with the Hadoop User Environment (HUE)
- Scaling, securing, and troubleshooting Hadoop environments
- Integrating Hadoop into the enterprise
- Deploying Hadoop in the cloud
- Getting started with Apache Spark
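The data-import bullet above ("Importing data into Hadoop") can be as small as a few calls against the HDFS Java API. The sketch below is illustrative only: the NameNode URI and both file paths are placeholders, not values from the book.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml if present
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
      // Copy a local file into HDFS so MapReduce, Hive, or Pig can process it there.
      fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));
      System.out.println("Copied: " + fs.exists(new Path("/data/raw/events.log")));
    }
  }
}
```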
Ready to use statistical and machine-learning techniques across large datasets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of the deployment, operations, or software development usually associated with distributed computing, you’ll focus on the particular analyses you can build, the data warehousing techniques that Hadoop provides, and the higher-order data workflows this framework can produce.
What You Will Learn
- Understand core concepts of Hadoop and cluster computing
- Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
- Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase (see the HBase sketch after this list)
- Use Sqoop to ingest data from relational databases and Apache Flume to ingest streaming log data
- Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
- Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib
- Get a high-level overview of HDFS and MapReduce: why they exist and how they work
- Plan a Hadoop deployment, from hardware and OS selection to network requirements
- Learn setup and configuration details with a list of critical properties
- Manage resources by sharing a cluster across multiple groups
- Get a runbook of the most common cluster maintenance tasks
- Monitor Hadoop clusters--and learn troubleshooting with the help of real-world war stories
- Use basic tools and techniques to handle backup and catastrophic failure
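To give a flavor of the HBase item in the list above, here is a brief, hedged sketch using the standard HBase Java client (the Connection/Table API). The table name, column family, and row key are invented for illustration, not taken from the book, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("visits"))) { // hypothetical table

      // Write one cell: row key "user123", column family "d", qualifier "count".
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("42"));
      table.put(put);

      // Read it back.
      Result result = table.get(new Get(Bytes.toBytes("user123")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"));
      System.out.println("count = " + Bytes.toString(value));
    }
  }
}
```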
This book is a practical, detailed guide to building and implementing Hadoop-based solutions, with code-level instruction in the popular Wrox tradition. With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them.
About This Book
- The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications
- Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions
- Includes detailed, real-world examples and code-level guidelines
- Explains when, why, and how to use these tools effectively
- Written by a team of Hadoop experts in the programmer-to-programmer Wrox style
Hadoop 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers and to navigate the powerful technologies that complement it.
What You Will Learn
- Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce
- Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses
- Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters
- Exploring the Hadoop Distributed File System (HDFS)
- Understanding the essentials of MapReduce and YARN application programming
- Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase
- Observing application progress, controlling jobs, and managing workflows
- Managing Hadoop efficiently with Apache Ambari–including recipes for HDFS to NFSv3 gateway, HDFS snapshots, and YARN configuration
- Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark
In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.
What You Will Learn
- Understand Hadoop’s architecture from an administrator’s standpoint
- Create simple and fully distributed clusters
- Run MapReduce and Spark applications in a Hadoop cluster
- Manage and protect Hadoop data and high availability
- Work with HDFS commands, file permissions, and storage management
- Move data, and use YARN to allocate resources and schedule jobs
- Manage job workflows with Oozie and Hue
- Secure, monitor, log, and optimize Hadoop
- Benchmark and troubleshoot Hadoop
Pro Apache Hadoop, Second Edition brings you up to speed on Hadoop – the framework of big data. Revised for Hadoop 2.0, the book covers the very latest developments such as YARN (aka MapReduce 2.0), new HDFS high-availability features, and increased scalability in the form of HDFS Federation.
About This Book
- Covers all that is new in Hadoop 2.0
- Written by a professional involved in Hadoop since day one
- Takes you quickly to the seasoned pro level on the hottest cloud-computing framework
What You Will Learn
- Let Hadoop take care of distributing and parallelizing your software
- Solve big-data problems the MapReduce way, by breaking a big problem into chunks
- Create small-scale solutions that can be flung across thousands upon thousands of nodes
- Analyze large data volumes in a short amount of wall-clock time
Q. What makes this book important right now?
A. Hadoop has quickly become the standard for processing and analyzing Big Data. In order to integrate a new Hadoop deployment into your existing environment, you will need to transfer data stored in relational databases into Hadoop. Sqoop optimizes data transfers between Hadoop and databases through a command-line interface with some 60 parameters. In this book, we'll focus on applying those parameters in common use cases to help you deploy and use Sqoop in your environment.
Q. What do you hope that readers of your book will walk away with?
A. One recipe at a time, this book guides you through basic commands not requiring prior Sqoop knowledge all the way to very advanced use cases. These recipes are detailed enough not only to enable you to deploy them within your environment but also to understand Sqoop's inner workings.
Q. Can you give us a little taste of the contents?
A. Imagine a scenario where you are incrementally importing records from MySQL into Hadoop. When you resume the import and notice that some records have been modified, you also want to include those updated records. How do you drop the older copies of records when records have been updated and then merge in the newer copies?
This sounds like a use-case for using the last modified incremental mode. Internally, the last modified import consists of two standalone MapReduce jobs. The first job will import the delta of changed data similarly to the way normal import does. This import job will save data in a temporary directory on HDFS. The second job will take both the old and new data and will merge them together into the final output, preserving only the last updated value for each row.
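For readers who want to see roughly what that looks like, here is a hedged sketch (not code from the book) of launching such an incremental import programmatically, assuming Sqoop 1.x and its Sqoop.runTool entry point. All connection details, table names, columns, and paths below are placeholders; in practice the same flags are usually passed to the sqoop command-line tool.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class LastModifiedImport {
  public static void main(String[] args) {
    // Placeholder connection, table, and column names for illustration only.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/shop",
        "--username", "etl",
        "--password-file", "/user/etl/.mysql-password",
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--incremental", "lastmodified",      // import rows changed since the last run
        "--check-column", "updated_at",       // timestamp column used to find changed rows
        "--last-value", "2014-01-01 00:00:00",
        "--merge-key", "order_id"             // merge updated rows over the older copies
    };
    int exitCode = Sqoop.runTool(importArgs, new Configuration());
    System.exit(exitCode);
  }
}
```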
About This Book
- Optimize your MapReduce job performance
- Identify your Hadoop cluster's weaknesses
- Tune your MapReduce configuration
What You Will Learn
- Learn about the factors that affect MapReduce performance
- Utilize the Hadoop MapReduce performance counters to identify resource bottlenecks
- Size your Hadoop cluster's nodes
- Set the number of mappers and reducers correctly
- Optimize mapper and reducer task throughput and code size using compression and Combiners
- Understand the various tuning properties and best practices to optimize clusters (a configuration sketch follows this list)
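As a taste of what tuning like this looks like in code, here is an illustrative Java sketch (not derived from the book) that sets a few well-known MRv2 properties: compressing intermediate map output, enlarging the sort buffer, and sizing the reducer count. The specific values are arbitrary examples, not recommendations, and the combiner class is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobSetup {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output to cut shuffle traffic (Hadoop 2 / MRv2 property names).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.set("mapreduce.map.output.compress.codec", SnappyCodec.class.getName());
    // A larger in-memory sort buffer means fewer spills to disk during the map phase.
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    Job job = Job.getInstance(conf, "tuned job");
    job.setNumReduceTasks(8); // size the reducer count to the cluster, not the default of 1
    // A combiner acts as a mini-reducer on map output, shrinking data before the shuffle:
    // job.setCombinerClass(MySumReducer.class); // hypothetical reducer class
    return job;
  }
}
```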
MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems
Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you’re using. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop.
What You Will Learn
- Summarization patterns: get a top-level view by summarizing and grouping data
- Filtering patterns: view data subsets such as records generated from one user (see the mapper sketch after this list)
- Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
- Join patterns: analyze different datasets together to discover interesting relationships
- Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
- Input and output patterns: customize the way you use Hadoop to load or store data
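To show the shape of one of these patterns, here is a small, illustrative map-only mapper for the filtering pattern above (it is not code from the book). The log-line format and the "user=" field are assumptions; with the reducer count set to zero, matching records are written straight to HDFS.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filtering pattern: keep only the records generated by one user, drop everything else.
public class UserFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private static final String TARGET_USER = "user=alice"; // hypothetical field and value

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().contains(TARGET_USER)) {
      context.write(value, NullWritable.get()); // emit matching records unchanged
    }
    // Non-matching records are simply not emitted.
  }
}
```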
By combining Apache Hadoop and Solr you can build super-efficient, high-speed enterprise search engines, and this book takes you through every stage of the process with a practical tutorial. It is written specifically for Java programmers.
Overview
- Understand the different approaches of making Solr work on Big Data as well as the benefits and drawbacks
- Learn from interesting, real-life use cases for Big Data search along with sample code
- Work with distributed enterprise search without prior knowledge of Hadoop and Solr
What You Will Learn
- Understand Apache Hadoop, its ecosystem, and Apache Solr
- Learn different industry-based architectures while designing Big Data enterprise search and understand their applicability and benefits
- Write MapReduce tasks for indexing your data (a SolrJ indexing sketch follows this list)
- Fine-tune the performance of your Big Data search while scaling your data
- Increase your awareness of new technologies on the market today that combine Hadoop and Solr
- Use Solr as a NoSQL database
- Configure your Big Data instance to perform in the real world
- Address the key features of a distributed Big Data system such as ensuring high availability and reliability of your instances
- Integrate Hadoop and Solr in your industry by means of use cases
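As a minimal taste of Solr indexing from Java (independent of the Hadoop-based indexing pipelines the book describes), here is a hedged SolrJ sketch, assuming SolrJ 6 or later. The Solr URL, core name, and field names are placeholders, and the core is assumed to already exist.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDocument {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and core name; adjust for your installation.
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "hadoop-solr-1");                          // hypothetical fields
      doc.addField("title", "Scaling search with Hadoop and Solr");
      solr.add(doc);   // send the document to Solr
      solr.commit();   // make it visible to searches
    }
  }
}
```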
If you are a Big Data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. This book is for Java programmers with little to moderate knowledge of Hadoop MapReduce. This is also a one-stop reference for developers and system admins who want to quickly get up to speed with using Hadoop v2. It would be helpful to have a basic knowledge of software development using Java and a basic working knowledge of Linux.
About This Book
- Process large and complex datasets using next generation Hadoop
- Install, configure, and administer MapReduce programs and learn what's new in MapReduce v2
- More than 90 Hadoop MapReduce recipes presented in a simple and straightforward manner, with step-by-step instructions and real-world examples
What You Will Learn
- Configure and administer Hadoop YARN, MapReduce v2, and HDFS clusters
- Use Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2 to solve your big data problems easily and effectively
- Solve large-scale analytics problems using MapReduce-based applications
- Tackle complex problems such as classifications, finding relationships, online marketing, recommendations, and searching using Hadoop MapReduce and other related projects
- Perform massive text data processing using Hadoop MapReduce and other related projects
- Deploy your clusters to cloud environments