18 Best Apache Spark books for data scientists and engineers

Apache Spark is an open-source cluster-computing framework. It provides a range of technologies and tools for designing, building, and running large-scale data applications. Its core provides distributed task dispatching, scheduling, and basic I/O functionality through APIs. On top of that, it offers Spark SQL for structured data processing, MLlib for machine learning, Spark Streaming for streaming analytics, and GraphX for graph processing. It also integrates with external systems such as Apache Kafka for building data streaming applications.

The Spark ecosystem is so large that it can be hard to decide where to start. A set of good books can show you the right way to learn Spark. To help you get started, I've gathered some books that should meet your needs. So pick one up and get going!

Here are some of the best books on Apache Spark:


Learning Spark: Lightning-Fast Big Data Analysis
Author: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
Published at: 27/02/2015
ISBN: 1449358624

Learning Spark is an important book for data scientists and engineers. In this guide, you'll learn how to tackle big datasets quickly using Spark's simple APIs. It provides effective techniques for expressing parallel jobs in just a few lines of code. It helps you understand and use Spark for applications ranging from simple batch jobs to stream processing and machine learning. You'll also pick up important concepts about Spark SQL, Spark Streaming, setup, and Maven coordinates.

What you'll learn:

  • Working with distributed datasets
  • Handling in-memory caching
  • Using Spark's interactive shell
  • Using Spark's built-in libraries including Spark SQL, Spark Streaming, and MLlib
  • Deploying interactive, batch, and streaming applications
  • Using one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
  • Connecting to data sources including HDFS, Hive, JSON, and S3
  • Performing data partitioning
  • Working with shared variables


Apache Spark in 24 Hours, Sams Teach Yourself
Author: Jeffrey Aven
Published at: 27/08/2016
ISBN: 0672338513

Apache Spark in 24 Hours is another good book on this list to learn Spark. This book follows a step-by-step approach to teach you the basic and advanced concepts of Spark. In small lessons, it helps you quickly learn how to deploy, program, optimize, manage, integrate, and extend Spark. It also covers powerful solutions for cloud computing, real-time stream processing, and machine learning. With this book, you'll gain enough skills to build real-world applications with Spark.

What you'll learn:

  • Fundamentals of Apache Spark
  • Deploying Spark locally or in the cloud
  • Using Spark's interactive shell
  • Developing Spark applications with Scala and functional Python
  • Working with Resilient Distributed Datasets (RDDs) for caching, persistence, and output
  • Using Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
  • Building Spark-based machine learning and graph-processing applications
  • Optimizing Spark solution performance
  • Programming with the Spark API, including transformations and actions


High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
Author: Holden Karau, Rachel Warren
Published at: 16/06/2017
ISBN: 1491943203

High Performance Spark is a helpful resource for developers building large-scale data applications. If you're developing applications that process large data sets and looking for ways to improve their performance, then this book is for you. It's a practical guide that explains effective techniques for reducing data infrastructure costs and developer hours. It shows how to write Spark queries that run faster and handle larger data sets. After completing this guide, you'll have a more comprehensive understanding of Spark.

What you'll learn:

  • Working with datasets, DataFrames, and Spark SQL
  • The choice between data joins in Core Spark and Spark SQL
  • Techniques for effective transformations with RDD
  • Working with key/value data
  • Writing high-performance Spark code without Scala or the JVM
  • Testing and validation of applications
  • Using Spark MLlib and Spark ML machine learning libraries
  • Spark components and packages


Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Author: Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
Published at: 06/07/2017
ISBN: 1491972955

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.

If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.

With this book, you will:

  • Familiarize yourself with the Spark programming model
  • Become comfortable within the Spark ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public data sets
  • Discover which machine learning tools make sense for particular problems
  • Acquire code that can be adapted to many uses


Spark GraphX in Action
Author: Michael Malak, Robin East
Published at: 03/07/2016
ISBN: 1617292524

Spark GraphX in Action is a comprehensive book for learning Spark's GraphX graph processing API in practice. It is an example-based book that demonstrates the uses of GraphX tools in real-world applications. It starts with an introduction to building big data graphs from regular data. It then explores the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines, and provides effective techniques for solving these problems. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

What you'll learn:

  • Fundamentals of graphs
  • Using the GraphX API
  • Working with GraphX's built-in algorithms
  • Building your own graph algorithm
  • Performing machine learning with graphs
  • Visualizing your graphs
  • Monitoring the performance of your application


Apache Spark 2.x Cookbook: Cloud-ready recipes for analytics and data science
Author: Rishi Yadav
Published at: 31/05/2017
ISBN: 1787127265

Apache Spark 2.x Cookbook is a recipe-based book: each recipe presents a practical solution to a specific problem. This guide helps you learn Spark in a practical way. The recipes cover almost all aspects of Spark, including RDDs, DataFrames, and Datasets for operating on schema-aware data. It also provides solutions for real-time streaming using sources such as Twitter Stream and Apache Kafka. It walks you through machine learning in Spark, including supervised learning, unsupervised learning, and recommendation engines. After completing the structured recipes in this book, you'll be able to analyze and manipulate large and complex data sets with Spark.

What you'll learn:

  • Installing and configuring Apache Spark in the cloud
  • Setting up a development environment for Spark
  • Operating on data in Spark with schemas
  • Working with real-time streaming analytics using Spark Streaming & Structured Streaming
  • Performing supervised learning and unsupervised learning using MLlib
  • Developing a recommendation engine using MLlib
  • Processing graphs using GraphX and GraphFrames libraries
  • Developing applications to solve complex big data problems


Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark
Author: Zubair Nabi
Published at: 14/06/2016
ISBN: 1484214803

This book is specially designed to teach you how to apply Spark Streaming to implement a wide variety of real-time streaming applications. It demonstrates how to build real-time applications using practical examples. It follows an application-first approach where each lesson provides specific industry-based examples to give you hands-on experience of production-grade design and implementation. With this guide, you'll gain enough knowledge and skills to build applications for social media, the sharing economy, finance, online advertising, telecommunication, and IoT.

What you'll learn:

  • Understanding Spark Streaming application development
  • Working with the low-level details of discretized streams
  • Using Graphite, collectd, and Nagios for production-grade deployments
  • Collecting data from diverse sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
  • Integrating your applications with HBase, Cassandra, and Redis
  • Implementing real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
  • Working with machine learning, predictive analytics, and recommendations
  • Meshing batch processing with stream processing via the Lambda architecture



Frank Kane's Taming Big Data with Apache Spark and Python
Author: Frank Kane
Published at: 30/06/2017
ISBN: 1787287947

This is another good book for data scientists to learn how to deal with a large amount of data. This book shows the effective ways to analyze large data sets using Spark RDD. It also focuses on developing and running effective Spark jobs quickly using Python. It contains over 15 real-world examples of Big Data processing with Spark. These examples really help you to get ideas and implement them in your own projects. If you have experience working with Python and Spark, then this book is an ideal resource for you to handle large-scale data.

What you'll learn:

  • Converting Big Data problems into Spark problems
  • Installing and configuring Apache Spark on your computer or on a cluster
  • Analyzing large data sets across many CPUs using Spark's Resilient Distributed Datasets
  • Understanding how Spark can be distributed across computing clusters
  • Using the Spark streaming module to process continuous streams of data in real time 
  • Working with machine learning on Spark using the MLlib library
  • Performing complex network analysis using Spark's GraphX library


Mastering Apache Spark 2.x - Second Edition: Scale your machine learning and deep learning systems with SparkML, DeepLearning4j and H2O
Author: Romeo Kienzler
Published at: 26/07/2017
ISBN: 1786462745

Mastering Apache Spark 2.x is an advanced-level book. It is ideal for developers who have some experience with Spark and want to extend their skills to an advanced level. This book combines instructions and practical examples to teach you the most up-to-date Spark functionality. With these examples, you'll learn how to use Spark to process large volumes of data in minimal time. It covers advanced Spark concepts such as graph processing, machine learning, stream processing, and SQL. To read this book, you need a basic knowledge of Linux, Hadoop, and Spark. A reasonable knowledge of Scala will be helpful.

What you'll learn:

  • An overview of the Spark ecosystem
  • Building highly optimised unified batch processing with Spark SQL
  • Performing real-time data processing using structured streaming
  • Working with large-scale graph processing and analysis with GraphX and GraphFrames
  • Exploring advanced machine learning and deep learning with MLlib, SparkML, SystemML, H2O, and DeepLearning4J
  • Applying Apache Spark in Elastic deployments
  • Extending your knowledge of Scala, R, and Python for your data science projects


Building Data Streaming Applications with Apache Kafka: Design, develop and streamline applications using Apache Kafka, Storm, Heron and Spark
Author: Manish Kumar, Chanchal Singh
Published at: 18/08/2017
ISBN: 1787283984

Building Data Streaming Applications with Apache Kafka is a comprehensive guide that teaches you how to design and build enterprise-grade streaming applications using Apache Kafka and other big data tools. It starts by explaining the types of messaging systems and then provides a thorough introduction to Apache Kafka and its internal details. The second part of the book walks you through designing streaming applications using various frameworks and tools such as Apache Spark, Apache Storm, and more. After you grasp the basics, you'll go through more advanced topics in Apache Kafka such as capacity planning and security. With this guide, you'll gain the information and skills you need to use Apache Kafka comfortably to design efficient streaming data applications.

What you'll learn:

  • Fundamentals of Apache Kafka
  • Understanding the basic building blocks of a streaming application
  • Designing effective streaming applications with Kafka using Spark, Storm, and Heron
  • Understanding the importance of low latency and high throughput
  • Understanding fault-tolerant messaging systems
  • Dealing with capacity while deploying your Kafka application
  • Implementing the best security practices in your applications


Apache Spark Graph Processing
Author: Rindra Ramamonjison
Published at: 10/09/2015
ISBN: 1784391808

About This Book

  • Find solutions for every stage of data processing from loading and transforming graph data to
  • Improve the scalability of your graphs with a variety of real-world applications with complete Scala code.
  • A concise guide to processing large-scale networks with Apache Spark.

What You'll Learn

  • Write, build and deploy Spark applications with the Scala Build Tool.
  • Build and analyze large-scale network datasets
  • Analyze and transform graphs using RDD and graph-specific operations
  • Implement new custom graph operations tailored to specific needs.
  • Develop iterative and efficient graph algorithms using message aggregation and Pregel abstraction
  • Extract subgraphs and use them to discover common clusters
  • Analyze graph data and solve various data science problems using real-world datasets


Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis
Author: Mohammed Guller
Published at: 25/12/2015
ISBN: 1484209656

Big Data Analytics with Spark is a step-by-step guide for learning Spark, an open-source, fast, and general-purpose cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. In addition, this book will help you become a much sought-after Spark expert.

What you'll learn:

  • An introduction to Spark and related big-data technologies
  • Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, and MLlib
  • Using Spark for different types of big data analytics projects
  • Working with batch, interactive, graph, and stream data analysis
  • Working with machine learning
  • The basics of functional programming in Scala
  • Other big data technologies that are used along with Spark, like Hive, Avro, Kafka and so on



Learning Real-time Processing with Spark Streaming
Author: Sumit Gupta
Published at: 01/10/2015
ISBN: 1783987669

About This Book

  • Process live data streams more efficiently with better fault recovery using Spark Streaming
  • Implement and deploy real-time log file analysis
  • Learn about integration with advanced Spark libraries – GraphX, Spark SQL, and MLlib

What You Will Learn

  • Install and configure Spark and Spark Streaming to execute applications
  • Explore the architecture and components of Spark and Spark Streaming to use them as a base for other libraries
  • Process distributed log files in real-time to load data from distributed sources
  • Apply transformations on streaming data to use its functions
  • Integrate Apache Spark with various advanced libraries like MLlib and GraphX
  • Apply production deployment scenarios to deploy your application


Spark Cookbook
Author: Rishi Yadav
Published at: 03/08/2015
ISBN: 1783987065

About this book

  • Become an expert at graph processing using GraphX
  • Use Apache Spark as your single big data compute platform and master its libraries
  • Learn with recipes that can be run on a single machine as well as on a production cluster of thousands of machines

What you'll learn

  • Install and configure Apache Spark with various cluster managers
  • Set up development environments
  • Perform interactive queries using Spark SQL
  • Get to grips with real-time streaming analytics using Spark Streaming
  • Master supervised learning and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Develop a set of common applications or project types, and solutions that solve complex big data problems
  • Use Apache Spark as your single big data compute platform and master its libraries


Apache Spark 2.x for Java Developers: Explore big data at scale using Apache Spark 2.x Java APIs
Author: Sourav Gulati, Sumit Kumar
Published at: 26/07/2017
ISBN: 1787126498

About the book

  • Perform big data processing with Spark―without having to learn Scala!
  • Use the Spark Java API to implement efficient enterprise-grade applications for data processing and analytics
  • Go beyond mainstream data processing by adding querying capability, Machine Learning, and graph processing using Spark

What you'll learn

  • Process data using different file formats such as XML, JSON, CSV, and plain and delimited text, using the Spark core library
  • Perform analytics on data from various data sources such as Kafka and Flume


Learning Apache Spark 2.0
Author: Muhammad Asif Abbasi
Published at: 28/03/2017
ISBN: 1785885138

About the Book

Learning Apache Spark 2 is a superb introduction to Apache Spark 2 for beginners, covering everything you need to know about big data analytics & fast data processing. Learn how to install Apache Spark, write & build your first Spark program, and work through real-world examples easily and confidently.

What you'll learn

  • Overview of Apache Spark architecture & installation
  • Spark SQL
  • Transformations and Actions with Spark RDDs
  • Spark Streaming
  • Machine Learning with Spark
  • GraphX
  • Operating in Clustered Mode
  • Understand Spark Streaming & Machine Learning
  • Build a recommendation system


Apache Spark 2.x Machine Learning Cookbook: Over 100 recipes to simplify machine learning model implementations with Spark
Author: Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
Published at: 22/09/2017
ISBN: 1783551607

About this book

  • Solve the day-to-day problems of data science with Spark
  • This unique cookbook consists of exciting and intuitive numerical recipes
  • Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data

What you'll learn

  • Get to know how Scala and Spark go hand-in-hand for developers when developing ML systems with Spark
  • Build a recommendation engine that scales with Spark
  • Find out how to build unsupervised clustering systems to classify data in Spark
  • Build machine learning systems with the Decision Tree and Ensemble models in Spark
  • Deal with the curse of high-dimensionality in big data using Spark
  • Implement Text analytics for Search Engines in Spark
  • Streaming Machine Learning System implementation using Spark


Machine Learning with Spark - Tackle Big Data with Powerful Spark Machine Learning Algorithms
Author: Nick Pentreath
Published at: 08/12/2014
ISBN: 1783288515

About this book

  • Follow real-world examples to learn how to develop your own machine learning systems with Spark
  • A practical tutorial with real-world use cases allowing you to develop your own machine learning systems with Spark
  • Combine various techniques and models into an intelligent machine learning system
  • Explore and use Spark's powerful range of features to load, analyze, clean, and transform your data

What you'll learn

  • Create your first Spark program in Scala, Java, and Python
  • Set up and configure a development environment for Spark on your own computer, as well as on Amazon EC2
  • Access public machine learning datasets and use Spark to load, process, clean, and transform data
  • Use Spark's machine learning library to implement programs utilizing well-known machine learning models including collaborative filtering, classification, regression, clustering, and dimensionality reduction
  • Write Spark functions to evaluate the performance of your machine learning models
  • Deal with large-scale text data, including feature extraction and using text data as input to your machine learning models
  • Explore online learning methods and use Spark Streaming for online learning and model evaluation


Thanks for reading this post. If you have any thoughts, don't hesitate to leave a comment. Also, please subscribe to our newsletter to get more updates.