Python pandas DataFrame - Tutorial 2

Jay Kim
Oct 4, 2017

Group by

pandas has a 'groupby' method that groups the data by a categorical column, and it can be combined with numerical aggregations such as mean and sum.

I calculated the average runtime and domestic total gross for each rating. Be careful: if you aggregate more than one column, you need double brackets around the column names, because the inner brackets form a list of columns.
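The original screenshot is not available, so here is a minimal sketch of that calculation. The column names follow the ones used in this post; the four rows and their values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": ["PG", "PG", "R", "R"],
    "Runtime": [90, 110, 100, 120],
    "DomesticTotalGross": [100, 300, 200, 400],
})

# Double brackets: the inner list names the two columns to aggregate.
avg_by_rating = df.groupby("Rating")[["Runtime", "DomesticTotalGross"]].mean()
print(avg_by_rating)
```

With a single column you can use single brackets, which gives back a Series instead of a DataFrame.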

Older versions of pandas aggregate only the numerical columns even when you don't specify any columns. In pandas 2.0 and later, pass 'numeric_only=True' to get the same behavior. Here is an example.

I was able to calculate the totals for each rating without specifying which columns to sum, and pandas summed only the numerical columns.
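A small sketch of the same idea, again with made-up rows. I pass 'numeric_only=True' explicitly here: since pandas 2.0 the default is False, so spelling it out should give the numeric-only result on both old and new versions.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": ["PG", "PG", "R", "R"],
    "Runtime": [90, 110, 100, 120],
    "DomesticTotalGross": [100, 300, 200, 400],
})

# No columns are selected; numeric_only=True leaves the text column 'Title' out.
totals = df.groupby("Rating").sum(numeric_only=True)
print(totals)
```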

The 'Rating' column becomes the index of the result unless you pass 'as_index=False' to groupby.
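To see the difference side by side, here is a short sketch on invented data. The variable names 'indexed' and 'flat' are just labels I picked for the two results.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Rating": ["PG", "PG", "R", "R"],
    "Runtime": [90, 110, 100, 120],
})

# Default: 'Rating' becomes the index of the result.
indexed = df.groupby("Rating")[["Runtime"]].mean()

# as_index=False: 'Rating' stays a regular column and the index is 0, 1, ...
flat = df.groupby("Rating", as_index=False)[["Runtime"]].mean()

print(indexed)
print(flat)
```

The second form is often handier when you want to merge or plot the result, because 'Rating' is still an ordinary column.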

Filter by column values

When you compare a column against a value, pandas returns True or False for each row. You can use these True and False values to filter rows by column value.

df.loc selects rows by label, and df.iloc selects rows by position. 'loc' accepts the True and False values directly and returns only the True rows; 'iloc' accepts them only as a plain Boolean array, not as a pandas Series.
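Here is a rough sketch of a True/False mask in action, using invented rows. One detail worth knowing: 'loc' takes the Boolean Series as it is, while 'iloc' wants a plain array, so I convert the mask with 'to_numpy()' first.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": ["PG-13", "R", "PG-13", "PG"],
})

# Comparing a column to a value gives one True/False per row.
mask = df.Rating == "PG-13"
print(mask.tolist())

# loc accepts the Boolean Series directly.
by_label = df.loc[mask]

# iloc needs a plain Boolean array rather than a Series.
by_position = df.iloc[mask.to_numpy()]

print(by_label)
```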

Only PG-13 movies remain after filtering. The index numbers are not consecutive because only the PG-13 rows are displayed; filtering does not reset the index.
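The index behavior is easy to check on a tiny invented table. If you do want consecutive numbers again, 'reset_index(drop=True)' is one way to get them; that step is my addition and was not part of the original example.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Rating": ["PG-13", "R", "PG-13", "PG", "PG-13"],
})

pg13 = df.loc[df.Rating == "PG-13"]
print(pg13.index.tolist())  # original row labels are kept

# Optional: renumber the rows from 0 and discard the old labels.
pg13_renumbered = pg13.reset_index(drop=True)
print(pg13_renumbered.index.tolist())
```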

Q: Find the director who had the highest domestic total gross.

df.DomesticTotalGross.max() returns the highest value in the DomesticTotalGross column.

df.loc[. . . . .].Director returns the value in the Director column for the row selected with df.loc.
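Putting the two pieces together might look like the sketch below. The condition inside 'df.loc' is my assumption about how the rows are selected, not necessarily the exact expression from the original screenshot, and the directors and numbers are invented.

```python
import pandas as pd

# Hypothetical data: column names follow the post, the values are invented.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Director": ["Director X", "Director Y", "Director Z"],
    "DomesticTotalGross": [100, 400, 250],
})

# Highest value in the column.
top_gross = df.DomesticTotalGross.max()

# Keep the rows whose gross equals that maximum, then read the Director column.
top_director = df.loc[df.DomesticTotalGross == top_gross].Director
print(top_director.tolist())
```

The result is a Series rather than a single string, so if two movies tie for the highest gross you get both directors back.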

As a practical use of filtering, here is a final example.

import matplotlib.pyplot as plt

# Labels for the ratings. One subplot is drawn per label in the loop below.
ratings = ['G', 'PG', 'PG-13', 'R']

plt.figure(figsize=(20, 20))
for i, rating in enumerate(ratings):
    # Filter once per rating and reuse the result for both axes.
    subset = df.loc[df.Rating == rating]
    plt.subplot(len(ratings), 1, i + 1)
    plt.plot_date(subset.date, subset.DomesticTotalGross, label=rating)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.legend(fontsize=25, loc='upper left')

plt.tight_layout()