Apache Spark For Machine Learning & Data Science (Spark 301): 5 half-day Live-Online Class (Americas)

San Francisco, California
Monday, March 04, 2019 7:00 AM -
Friday, March 08, 2019 11:00 AM (Pacific Time)

Databricks Inc. - online delivery only
(866) 330-0121
160 Spear Street
San Francisco, California 94105
United States

This hands-on, instructor-led interactive 5 half-day Live-Online Spark 301 training targets experienced Data Scientists wishing to perform data analysis at scale using Apache Spark. The class will be held on the following days and times:

Monday, March 4, 2019 to Friday, March 8, 2019, from 7:00am to 11:00am Pacific Time each day.

Location: Online

This course covers an overview of Apache Spark and hands-on projects: extract-transform-load (ETL) operations, exploratory data analysis (EDA), building machine learning models, evaluating models, and performing cross-validation.

All hands-on labs run on Databricks Community Edition, a free cloud-based Spark environment. This allows participants to maximize their time using open source Apache Spark to solve real problems, rather than dealing with the complexities of setting up Spark clusters. Labs can easily be ported to run on open source Apache Spark after class.

Intended Audience

Data scientists with experience in machine learning and Scala or Python programming, who want to adapt traditional machine learning tasks to run at scale using Apache Spark.


$2500 per person


All participants need a laptop with an updated version of Chrome or Firefox (Internet Explorer and Safari are not supported) and an internet connection that can support the use of GoToTraining. GoToTraining is the platform on which the class will be delivered. Prior to class, each registrant will receive GoToTraining log-in instructions.

For more information and to confirm your computer can run GoToTraining, please check here: https://support.logmeininc.com/gotomeeting/get-ready

Course Learning Objectives

General Spark:

  • Improve performance through judicious use of caching and applying best practices.
  • Troubleshoot slow-running DataFrame queries using explain plans and the Spark UI.
  • Visualize how jobs are broken into stages and tasks and executed within Spark.
  • Troubleshoot errors and program crashes using executor logs, driver stack traces, and local-mode runtimes.
  • Troubleshoot Spark jobs using the administration UIs and logs inside Databricks.
  • Find answers to common Spark and Databricks questions using the documentation and other resources.

Extracting, Processing and Analyzing Data:

  • Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with DataFrames.
  • Extract structured data from unstructured data sources by parsing using Datasets (where possible) or RDDs (if not possible with Datasets), with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
  • Extend the capabilities of DataFrames using user defined functions (UDFs and UDAFs) in Python and Scala.
  • Resolve missing fields in DataFrame rows using filtering and imputation.
  • Apply best practices for data analytics using Spark.
  • Perform exploratory data analysis (EDA) using DataFrames and Datasets to:
    • Compute descriptive statistics
    • Identify data quality issues
    • Better understand a dataset
Visualizing Data:

  • Integrate visualizations into a Spark application using Databricks and popular visualization libraries (d3, ggplot, matplotlib).
  • Develop dashboards to provide “at-a-glance” summaries and reports.

Machine Learning:

  • Learn to apply various regression and classification models, both supervised and unsupervised.
  • Train analytical models with Spark ML estimators, including linear regression, decision trees, logistic regression, and k-means.
  • Use Spark ML transformers to pre-process a dataset prior to training, including standardization, normalization, one-hot encoding, and binarization.
  • Build Spark ML pipelines that chain transformations, estimators, and evaluation of analytical models.
  • Evaluate model accuracy by dividing data into training and test datasets and computing metrics using Spark ML evaluators.
  • Tune training hyper-parameters by integrating cross-validation into Spark ML pipelines.
  • Compute using Spark MLlib functionality not present in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions. (Optional Module)
  • Troubleshoot and tune machine learning algorithms in Spark.
  • Understand and build a general ML pipeline for Spark.

About Databricks

Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 40,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a virtual analytics platform to simplify data integration, real-time experimentation, and robust deployment of production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact info@databricks.com.

