After completing the XSeries course on Apache Spark on edX, I had been planning to take the Databricks Spark Developer Certification exam. I finally took and cleared it last week.
The course on edX was a perfect way to ramp up my understanding of Spark (I had done some work with Spark MLlib before), but the Databricks certification is definitely a level higher. The questions were fully hands-on and assumed a fair understanding of Python, Java, and Scala.
Moreover, you need a deep understanding of Spark Core to clear the certification. Some of the important concepts, off the top of my head: RDD transformations and actions, lineage, checkpointing, memory usage, and SchemaRDD.
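To make those concrete, here is a minimal PySpark sketch (RDD-era API) exercising transformations vs. actions, lineage, and checkpointing; the RDD contents and the checkpoint directory are illustrative choices of mine, not anything from the exam:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "cert-prep")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # must be set before checkpoint()

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing executes until an action is called.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions trigger evaluation of the whole lineage.
print(squares.collect())                    # [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))   # 220

# Lineage: Spark records how each RDD was derived so it can recompute
# lost partitions; toDebugString() prints that dependency chain.
print(squares.toDebugString())

# Checkpointing writes the RDD to stable storage and truncates its lineage;
# persisting first avoids recomputing it when the checkpoint job runs.
squares.persist()
squares.checkpoint()
squares.count()  # an action is needed to actually materialize the checkpoint
```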

This is how I prepared for the exam, and I hope it helps others as well:

  1. Complete the XSeries course on Spark by UC Berkeley on edX (it's in archive mode, but you can still access the content).
  2. Read the book Learning Spark at least twice. It's a short book, roughly 250 pages, but it will get you started.
  3. Try all the code from the book; you can fork the GitHub repository.
  4. Watch the videos from Spark Summit 2014 and practice all the questions. Don't be intimidated by the length of the videos; most of them are unedited, so the actual learning time is probably half the running time.
  5. Read the Spark Programming Guide and do hands-on coding for all of the practice problems. I did most of them in Scala and Python. Do this as the last step, to fill any gaps.
  6. Have a strong understanding of lambda functions (see the short sketch after this list).
  7. Lastly, if you're starting from scratch, I would advise you to take it easy and give yourself at least six months to understand Apache Spark. Once you're comfortable developing small applications or doing data analysis, start preparing for the certification.
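
On point 6: almost every RDD transformation takes a small anonymous function, so being fluent with lambdas pays off on hands-on questions. A quick word-count sketch in PySpark (the input sentences are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "lambda-practice")

lines = sc.parallelize([
    "spark passes lambdas to transformations",
    "lambdas are small anonymous functions",
])

counts = (lines
          .flatMap(lambda line: line.split())   # one element per word
          .map(lambda word: (word, 1))          # build a pair RDD
          .reduceByKey(lambda a, b: a + b))     # merge counts per key

print(counts.collect())
```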

So what's my next step? I'm not entirely sure, but the plan is to start looking into Apache Spark GitHub commits and engage in some more data analysis and modeling. Cheers.