PyDSLA: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Video Available Now:

Talk by Miklos Christine, solutions engineer at Databricks

Apache Spark is the next big data processing tool for data scientists. As seen in the recent Stack Overflow analysis, it’s the hottest big data technology on their site! In this talk, I’ll use the PySpark interface to leverage the speed and performance of Apache Spark. I’ll focus on the end-to-end workflow for getting data into a distributed platform and using Spark to process that data for advanced analytics. I’ll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I’ll explain the performance differences between Scala and Python, and how Spark has bridged that gap. I’ll use PySpark as the interface to the platform and walk through a demo to showcase the APIs.

Talk Overview:

  • Spark’s architecture: what’s out now and what’s in Spark 2.0
  • Spark APIs: the most commonly used Spark APIs
  • Common misconceptions and proper techniques for using Spark.

Demo:

  • Walk through ETL of the Reddit dataset.
  • Spark SQL analytics and visualizations of the dataset using Matplotlib
  • Sentiment Analysis on Reddit Comments

Speaker:

Miklos Christine is a solutions engineer at Databricks, where he helps customers deploy and use Apache Spark to build batch and streaming applications. Miklos was previously a systems engineer at Cloudera, where he helped strategic customers deploy and use the Apache Hadoop ecosystem in production. He has contributed to several projects in the open source community and holds a BS in electrical engineering and computer sciences from the University of California, Berkeley.

Date: May 5, 2016

Timeline:
– 6:30pm arrival, food/drinks and networking
– 7:00pm talk starts

You must have a confirmed RSVP, and please arrive by 6:55pm at the latest. Please RSVP here on Eventbrite. If the event is already full, sign up for the DataScience.LA priority mailing list to increase your chance of getting in next time (we announce all events there first, before publicizing them on www.datascience.la or meetup.com).

Venue: Venice Arts, 1702 Lincoln Boulevard, Los Angeles, CA 90291

Thanks to OpenMail for hosting and for providing food/drinks!
