This month, I am re-introducing myself to Python coding. To make this week interesting, I went off the beaten path and completed DataCamp’s A New Era of Data Analysis in Baseball course, which uses Major League Baseball’s Statcast data in a Jupyter notebook.
Statcast uses radar technology to track every baseball in every ballpark 20,000 times per second. This provides incredible insight into all phases of the game, including pitch type, ball speed off the bat, and fielders’ arm strength.
Here’s what I learned about Python this week:
# Load in Statcast data using pandas
import pandas as pd
judge = pd.read_csv("datasets/judge.csv")
# Display the last five rows of a dataframe
judge.tail()
# Filter only for events in 2017, then count the non-missing events
judge_events_2017 = judge["events"][judge["game_date"].str.contains("2017")]
print("Aaron Judge batted ball event totals, 2017:")
print(judge_events_2017.count())
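To see what that filter is doing, here is a minimal sketch on a made-up four-row table. The rows are invented for illustration and are not real Statcast data. It assumes game_date is stored as text, which is what read_csv gives you unless you ask it to parse dates.

```python
import pandas as pd

# Toy stand-in for the Statcast export; these values are made up.
toy = pd.DataFrame({
    "game_date": ["2016-09-30", "2017-04-02", "2017-07-04", "2017-10-01"],
    "events": ["strikeout", "home_run", None, "single"],
})

# Boolean mask: True where the date string contains "2017"
in_2017 = toy["game_date"].str.contains("2017")

# Keep only the events column for the matching rows
events_2017 = toy.loc[in_2017, "events"]

# count() skips missing values, while len() counts every selected row
print(events_2017.count())
print(len(events_2017))
```

The difference matters with pitch-level data, because rows where no event was recorded are left out by count().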
# Using the seaborn package and regplot function for multiple graphs in one figure
import matplotlib.pyplot as plt
import seaborn as sns
# judge_hr and stanton_hr are assumed to hold each player's home-run rows (not shown in this post)
fig1, axs1 = plt.subplots(ncols=2, sharex=True, sharey=True)
sns.regplot(x="launch_speed", y="launch_angle", fit_reg=False, color="tab:blue", data=judge_hr, ax=axs1[0]).set_title("Aaron Judge\nHome Runs, 2015-2017")
sns.regplot(x="launch_speed", y="launch_angle", fit_reg=False, color="tab:blue", data=stanton_hr, ax=axs1[1]).set_title("Giancarlo Stanton\nHome Runs, 2015-2017")
# Combine two dataframes via pandas concat
judge_stanton_hr = pd.concat([judge_hr, stanton_hr])
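One thing worth knowing about concat is that it stacks the rows but keeps each dataframe's original index labels. A small sketch with two invented dataframes (the names and speeds are placeholders, not course data) shows the effect and the ignore_index option that renumbers the rows:

```python
import pandas as pd

# Two tiny made-up dataframes with the same columns
a = pd.DataFrame({"player": ["Judge", "Judge"], "release_speed": [95.1, 88.4]})
b = pd.DataFrame({"player": ["Stanton"], "release_speed": [91.7]})

# Default behaviour: rows are stacked, index labels are carried over as-is
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# ignore_index=True builds a fresh 0..n-1 index instead
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```

For a boxplot the duplicate labels are harmless, but they can surprise you later if you select rows by label.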
# Use seaborn to create a boxplot of a single column
sns.boxplot(x=judge_stanton_hr["release_speed"]).set_title("Home Runs, 2015-2017")
# Apply a function to each row (axis=1) and store the result in a new column
judge_strike_hr["zone_x"] = judge_strike_hr.apply(assign_x_coord, axis=1)
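The assign_x_coord function itself is not shown above, so here is a hypothetical version to make the apply call concrete. It assumes a zone column numbered 1 to 9 across a three-by-three strike zone grid; the real helper may differ.

```python
import pandas as pd

def assign_x_coord(row):
    """Hypothetical helper: map a strike zone number (1-9) to a column 1-3."""
    if row["zone"] in (1, 4, 7):
        return 1  # left third
    if row["zone"] in (2, 5, 8):
        return 2  # middle third
    if row["zone"] in (3, 6, 9):
        return 3  # right third
    return None   # anything outside the 3x3 grid is left unassigned

# Made-up zone numbers for illustration
toy = pd.DataFrame({"zone": [1, 5, 9, 4]})

# axis=1 passes each row to the function, one at a time
toy["zone_x"] = toy.apply(assign_x_coord, axis=1)
print(toy)
```

For a plain lookup like this, mapping the zone column through a dictionary with Series.map would also work; apply with axis=1 is handy when the function needs more than one column from the row.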
# Using matplotlib, plot a 2d histogram
plt.hist2d(x="zone_x", y="zone_y", data=judge_strike_hr, bins=3, cmap="Blues")
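A quick way to check what hist2d does with bins=3 is to run it on a handful of made-up coordinates and look at the counts it returns. The values below are invented, and the Agg backend is my choice so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up grid coordinates, each between 1 and 3
toy = pd.DataFrame({
    "zone_x": [1, 1, 2, 3, 3, 3],
    "zone_y": [1, 1, 2, 3, 3, 1],
})

# hist2d returns the bin counts, the bin edges, and the image it drew
counts, xedges, yedges, image = plt.hist2d(
    x="zone_x", y="zone_y", data=toy, bins=3, cmap="Blues"
)

# counts[i, j] is the number of points in x-bin i and y-bin j
print(counts)
```

With three bins per axis, each cell of the returned array lines up with one square of the strike zone grid.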
That’s all for this week.