My summer internship on the PySpark team was a whirlwind of exciting events. The PySpark team develops the Python APIs of the open source Apache Spark library and Databricks Runtime. Over the course of the 12 weeks, I drove a project to implement a new built-in PySpark test framework. I also contributed to an open source Databricks Labs project called English SDK for Apache Spark, which I presented in person at the 2023 Data + AI Summit (DAIS).
From improving the PySpark test experience to lowering the barrier of entry to Spark with the English SDK, my summer was all about making Spark more accessible.
PySpark Test Framework
My primary internship project focused on starting a built-in PySpark test framework (SPARK-44042), which is included in the upcoming Apache Spark release. The code is also available in the open source Spark repo.
Why a built-in test framework?
Prior to our built-in PySpark test framework, there was a lack of Spark-provided tools to help developers write their own tests. Developers could turn to blog posts and online forums, but it was difficult to piece together so many disparate sources. This project aimed to consolidate testing resources under the official Spark repository to simplify the PySpark developer experience.
The beauty of open source is that this is just the beginning! We're excited for the open source community to contribute to the test framework, improving the existing features and adding new ones. Please check out the PySpark Test Framework SPIP and the Spark JIRA board! The framework's initial features are described below.
DataFrame Equality Test Util (SPARK-44061): The assertDataFrameEqual util function allows for equality comparison between two DataFrames, or between lists of rows. Configurable features include the ability to customize the approximate precision level for float values, and the ability to choose whether to take row order into account. This util function is useful for testing DataFrame transformation functions.
Schema Equality Test Util (SPARK-44216): The assertSchemaEqual function allows for comparison between two DataFrame schemas. The util supports nested schema types. By default, it ignores the "nullable" flag in complex types (StructType, MapType, ArrayType) when asserting equality.
Improved Error Messages (SPARK-44363): One of the most common pain points among Spark developers is debugging confusing error messages.
Take the following error message for example, where I try asserting equality between unequal schemas using the built-in unittest assertEqual method. From the error message, we can tell that the schemas are unequal, but it's confusing to see exactly where they differ and how to correct them.
The new test util functions include detailed, color-coded test error messages, which clearly indicate differences between unequal DataFrame schemas and data in DataFrames.
English SDK for Apache Spark
Overview of the English SDK
Something I learned this summer is that things move very quickly at Databricks. Sure enough, I spent much of the summer on a brand new project that was actually only started during the internship: the English SDK for Apache Spark!
The idea behind the English SDK is summed up by the tweet above from Reynold Xin, Chief Architect at Databricks and a co-founder of Spark. With recent advancements in generative AI, what if we could use English as a programming language and generative AI as the compiler to produce PySpark and SQL code? By doing this, we lower the barrier of entry to Spark development, democratizing access to powerful data analytics tools. The English SDK simplifies complex coding tasks, allowing data analysts to focus more on deriving insights from their data.
The English SDK also offers many powerful capabilities, such as generating plots, searching the web to ingest data into a DataFrame, and describing DataFrames in plain English.
For example, say I have a DataFrame called github_df with data about PRs in the OSS Spark repo, and I want to see the moving average of PRs over time and how it relates to Spark release dates. All I have to do is ask the English SDK:
github_df.ai.plot("show the 7 day moving average and annotate the spark versions with a red line")
And it returns the plot for me:
The English SDK project was unveiled at the 2023 Data + AI Summit. If you're interested in learning more about it, please check out the full blog post. This is also an ongoing open source project, and we welcome all contributions and feedback; just open an issue on the GitHub repo!
One of the coolest parts of my summer was presenting a demo of the English SDK at the annual Databricks Data + AI Summit (DAIS). This year's event was held in San Francisco from June 26-29, and there were over 30,000 virtual and in-person attendees!
The conference was a multi-day event, and I also attended the other summit sessions. I sat in on keynote sessions (and was a bit starstruck seeing some speakers!), collected lots of Dolly-themed swag, and learned from world-class experts at breakout sessions. The energy at the event was infectious, and I'm so grateful I had this experience.
Open Source Design Process
Since the PySpark Test Framework was a new initiative, I saw the ins and outs of the software design process, from writing a design doc to attending customer meetings. I also got to experience unique aspects of the open source Apache Spark design process, including writing a Spark Project Improvement Proposal (SPIP) document and hosting online discussions about the initiative. I received lots of great feedback from the open source community, which helped me iterate on and improve the initial design. Happily, when the voting period came around, the initiative passed with ten +1s!
This internship project was a unique look into the full-time OSS developer experience, as it strengthened not only my technical coding skills but also soft skills such as communication, writing, and teamwork.
Team Bonding Events
The Spark OSS team is very international, with team members spanning the globe. Our team stand-up meetings were always lively, and I loved getting to know everyone throughout the summer. Fortunately, I also got to meet many of my international teammates in person during the week of DAIS!
My intern cohort also happened to become some of my favorite people ever. We all got really close over the summer, from eating lunch together at the office to exploring San Francisco on the weekends. Some of my favorite memories from the summer include hiking Lands End Trail, getting delicious dim sum in Chinatown, and taking a weekend trip to Yosemite.
My 12-week internship at Databricks was an amazing experience. I was surrounded by exceptionally brilliant yet humble team members, and I'll always remember the many lessons they shared with me.
Special thanks to my mentor Allison Wang, my manager Xiao Li, and the entire Spark team for their invaluable mentorship and guidance. Thanks also to Hyukjin Kwon, Gengliang Wang, Matthew Powers, and Allan Folting, who I worked closely with on the English SDK for Apache Spark and PySpark Test Framework projects.
If you want to work on cutting-edge projects alongside industry leaders, I highly encourage you to apply to work at Databricks! Visit the Databricks Careers page to learn more about job openings across the company.