Databricks Certified Spark Developer: Mastering Big Data Processing

So, I recently completed the XSeries course by UC Berkeley on Apache Spark, and I was all set to take the Databricks Spark Developer Certification exam. And guess what? I cleared it last week! Woohoo!

The course on edX was great for brushing up on my Spark skills (I had worked with Spark MLlib before), but the Databricks certification was definitely on a whole new level. The questions were all hands-on and assumed a fair understanding of Python, Java, and Scala. You really need a deep understanding of Spark Core to clear the certification. Some of the important concepts that come to mind are RDD transformations and actions, lineage, checkpointing, memory usage, and SchemaRDD. Phew!
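To give a flavor of the first few of those concepts, here's a minimal PySpark sketch (my own illustration, not from the exam) showing lazy transformations, lineage via `toDebugString()`, checkpointing, and an action. It assumes the exam-era RDD API; the app name and checkpoint directory are just placeholders:

```python
from pyspark import SparkContext

# A local SparkContext; in the Spark shell this is provided for you as `sc`.
sc = SparkContext("local[2]", "lineage-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # must be set before checkpoint()

nums = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing runs yet, Spark only records the lineage.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the recorded lineage (the chain of parent RDDs).
print(evens.toDebugString())

# Checkpointing persists the RDD to reliable storage and truncates its lineage.
evens.checkpoint()

# Actions like collect() or count() are what actually trigger computation.
print(evens.collect())  # [4, 16, 36, 64, 100]

sc.stop()
```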

If you're planning to take the certification exam, here's how I prepared:

  1. I completed the XSeries course on Spark by UC Berkeley on edX (it's in archive mode, but you can still access the content).
  2. I read the book "Learning Spark" at least twice. It's a short book, only about 250 pages, but it really helped me get started.
  3. I tried out all the code examples from the book; you can fork the GitHub repository if you want to do the same.
  4. I watched the videos from Spark Summit 2014 and practiced all the questions. Don't be intimidated by the length of the videos. Much of the footage is unedited, so the actual learning time is probably half the running time.
  5. I read the Spark Programming Guide and did hands-on coding for all the practice problems, mostly in Scala and Python. This step filled in a lot of the gaps for me (see the word-count sketch after this list).
  6. I made sure I had a strong understanding of lambda functions, since they show up in almost every Spark transformation (as shown below).
  7. Finally, if you're starting from scratch, take it easy and give yourself at least 6 months to understand Apache Spark. Once you're comfortable developing small applications or doing data analysis, start preparing for the certification.
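Since lambda functions (step 6) come up constantly, here's a small hands-on example in the spirit of the Programming Guide's exercises: the classic word count, built entirely from lambdas. It assumes a running SparkContext `sc`, e.g. from the pyspark shell:

```python
# Word count, the "hello world" of the Spark Programming Guide, written
# entirely with lambda functions passed to RDD transformations.
lines = sc.parallelize([
    "spark makes big data simple",
    "big data needs big tools",
])

counts = (lines
          .flatMap(lambda line: line.split())   # one record per word
          .map(lambda word: (word, 1))          # pair RDD of (word, 1)
          .reduceByKey(lambda a, b: a + b))     # sum counts per key

print(sorted(counts.collect()))
# [('big', 3), ('data', 2), ('makes', 1), ('needs', 1),
#  ('simple', 1), ('spark', 1), ('tools', 1)]
```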

So, what's my next step? I'm not quite sure yet, but I think I'll start looking into Apache Spark GitHub commits and engaging in some more data analysis and modeling. Cheers!