Apache Spark Tuning and Best Practices (Spark 110): 5 half-day Live-Online Class (Americas)

San Francisco, California
Monday, January 28, 2019
Databricks
Monday, January 28, 2019 7:00 AM -
Friday, February 01, 2019 11:00 AM (Pacific Time)

Databricks Inc. (Online-only delivery)
(866) 330-0121
160 Spear Street
San Francisco, California 94105
United States

Overview

This 5 half-day live online instructor-led class will be delivered Monday, January 28 through Friday, February 1, from 7am to 11am PST.


Location: Online

This 5 half-day online course is primarily for data engineers, software engineers, DevOps engineers, IT operations staff, and team leads, but it is directly applicable to analysts, architects, data scientists, and technical managers interested in troubleshooting and optimizing Apache Spark applications.

Price per student: $2500 USD

This course provides a deeper understanding of how to tune Spark applications, covering general best practices, anti-patterns to avoid, and techniques for troubleshooting Spark applications and queries.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them after the class ends.

Learning Objectives

After taking this class, students will be able to:

Understand the role of memory in Spark applications
Properly use broadcast variables and, in particular, broadcast joins, to increase the performance of DataFrame operations
Explain how the Catalyst Query Optimizer works to increase query performance
Better manage Spark’s partitioning and shuffling behavior
Properly size a Spark cluster for different kinds of workflows

Topics

Spark Memory Usage
Using the Spark UI and Spark logs to determine how much memory your application is using
Understanding how Tungsten (used by DataFrames and Datasets) dramatically improves memory use, compared to the RDD API
Why it’s important that DataFrames never be partially cached, even if it means spilling the cache to disk
The benefits of co-located data
Tuning JVM garbage collection for Spark

Broadcast Variables
How broadcast variables can affect performance
Why broadcast joins are useful
How to force Spark to do a broadcast join
When not to force a broadcast join

Catalyst
Avoiding Catalyst anti-patterns, such as Cartesian products and partially cached DataFrames
Efficient use of the Datasets API within a query plan
Understanding how encoders and decoders affect Catalyst optimizations
How and when to write a custom Catalyst optimizer rule

Tuning Shuffling
When does shuffling occur?
Understanding how shuffling affects repartitioning
Understanding shuffling impact on network I/O
Narrow vs. wide transformations
Spark configuration settings that affect shuffling
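Several of the shuffle-related settings covered here live in `spark-defaults.conf` (or can be set via `SparkConf`). The values below are illustrative starting points only, not recommendations:

```
# spark-defaults.conf -- illustrative values only; tune per workload
# Number of partitions used for shuffles in Spark SQL (default 200)
spark.sql.shuffle.partitions   400
# Compress map-side shuffle output (default true)
spark.shuffle.compress         true
# Per-partition buffer for shuffle file writes (default 32k)
spark.shuffle.file.buffer      64k
# Max map output fetched concurrently per reduce task (default 48m)
spark.reducer.maxSizeInFlight  96m
```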

Cluster Sizing
How a lack of memory affects how you should size your disks
The importance of properly defined schemas on memory use
Hardware provisioning
How to decide how much memory to allocate to each machine
Network considerations
How to decide how many CPU cores each machine will need
FIFO scheduler vs. fair scheduler
Details

Prerequisites

Applicable experience with Apache Spark projects.
A strong understanding of the DataFrame/Dataset APIs.
Basic programming experience in an object-oriented or functional language (Python or Scala) is required.

Requirements

Chrome or Firefox web browser (Internet Explorer and Safari are not supported)

Internet access with unfettered connections to the following domains:
*.databricks.com - required
*.slack.com - highly recommended
spark.apache.org - required
drive.google.com - helpful but not required

Training will be conducted via GoToTraining. You can test whether you are able to join GoToTraining here: https://support.logmeininc.com/gotomeeting/get-ready
