Apache Spark For Machine Learning & Data Science (Spark 301): 2.5 day Instructor Led Public Class (Reston, VA) w/ online opt

Reston, Virginia
Wednesday, February 27, 2019
Wednesday, February 27, 2019 10:00 AM -
Friday, March 01, 2019 6:00 PM (Eastern Time)

Sunset Learning Institute
(888) 888-5251
12120 Sunset Hills Road
Suite 100
Reston, Virginia 20190
United States



This 2.5 day, onsite and online instructor-led course will be delivered on: Wednesday, February 27 (2pm - 6pm); Thursday, February 28 (10am - 6pm); and Friday, March 1, 2019 (10am - 6pm EST).

Location: Reston, VA or online - your choice

This course covers an overview of Apache Spark and hands-on projects involving extract-transform-load (ETL) operations, exploratory data analysis (EDA), building and evaluating machine learning models, and performing cross validation.

All hands-on labs are run on Databricks Community Edition, a free cloud-based Spark environment. This lets participants spend their time using open source Apache Spark to solve real problems rather than dealing with the complexities of setting up Spark clusters. Labs can easily be ported to run on open source Apache Spark after class.

Intended Audience

Data scientists with experience in machine learning and Scala or Python programming, who want to adapt traditional machine learning tasks to run at scale using Apache Spark.


$2500 per person


All participants need a laptop with an updated version of Chrome or Firefox (Internet Explorer and Safari are not supported) and an internet connection that can support GoToTraining. GoToTraining will be the platform on which the class is delivered. Prior to class, each registrant will receive GoToTraining log-in instructions.

For more information, and to confirm that your computer can run GoToTraining, go to: https://support.logmeininc.com/gotomeeting/get-ready

Course Learning Objectives

General Spark:

  • Improve performance through judicious use of caching and applying best practices.
  • Troubleshoot slow-running DataFrame queries using the explain plan and the Spark UI.
  • Visualize how jobs are broken into stages and tasks and executed within Spark.
  • Troubleshoot errors and program crashes using executor logs, driver stack traces, and local-mode runtimes.
  • Troubleshoot Spark jobs using the administration UIs and logs inside Databricks.
  • Find answers to common Spark and Databricks questions using the documentation and other resources.

Extracting, Processing and Analyzing Data:

  • Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with DataFrames.
  • Extract structured data from unstructured data sources by parsing using Datasets (where possible) or RDDs (if not possible with Datasets), with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
  • Extend the capabilities of DataFrames using user defined functions (UDFs and UDAFs) in Python and Scala.
  • Resolve missing fields in DataFrame rows using filtering and imputation.
  • Apply best practices for data analytics using Spark.
  • Perform exploratory data analysis (EDA) using DataFrames and Datasets to:
    • Compute descriptive statistics
    • Identify data quality issues
    • Better understand a dataset

Visualizing Data:

  • Integrate visualizations into a Spark application using Databricks and popular visualization libraries (d3, ggplot, matplotlib).
  • Develop dashboards to provide “at-a-glance” summaries and reports.

Machine Learning:

  • Learn to apply various supervised models (regression and classification) and unsupervised models (clustering).
  • Train analytical models with Spark ML estimators including: linear regression, decision trees, logistic regression, and k-means.
  • Use Spark ML transformers to perform pre-processing on a dataset prior to training, including: standardization, normalization, one-hot encoding, and binarization.
  • Build Spark ML pipelines that chain transformations, estimation, and evaluation of analytical models.
  • Evaluate model accuracy by dividing data into training and test datasets and computing metrics using Spark ML evaluators.
  • Tune training hyper-parameters by integrating cross-validation into Spark ML pipelines.
  • Compute using Spark MLlib functionality not present in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions. (Optional Module)
  • Troubleshoot and tune machine learning algorithms in Spark.
  • Understand and build a general ML pipeline for Spark.

About Databricks

Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project providing 10x more code than any other company. The company has also trained over 40,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a virtual analytics platform to simplify data integration, real-time experimentation, and robust deployment of production applications. Databricks is venture-backed by Andreessen Horowitz and NEA. For more information, contact info@databricks.com.

