The Internet Movie Database (IMDb) provides information about movies, such as total budgets, lengths, actors, and user ratings. It is publicly available here. The examples below show basic methods for exploring and visualizing the imdb.csv dataset.
The downloaded file contains 4 tab-separated columns and 313,012 lines.
import pandas as pd
import numpy as np
import csv
from itertools import islice
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The %matplotlib inline command allows us to display plots within the notebook instead of creating figure files.
Python provides the csv module to read and write CSV files. The csv.reader function returns a Python object which will iterate over lines in the given file. Each line is returned as a list of strings, so that we can access a particular column using the list index. If we want to ignore the first line, we can use islice. It is like slicing a list, but it can slice an iterator (e.g. file stream). For instance, islice(reader, 0, 5) means "give me the first 5 items from the reader." islice(reader, 1, 5) means "give me the 4 items starting from the second item."
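# Open the file in a with-block so that it is closed automatically.
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    # Print the first 5 rows, then the second column of each row.
    for row in islice(reader, 0, 5):
        print(row)
        print(row[1])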
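To see how islice behaves without touching the real file, here is a minimal sketch that reads a small in-memory sample instead. The sample rows and the column names (Title, Year, Rating, Votes) are made up for illustration and are not taken from imdb.csv.

```python
import csv
import io
from itertools import islice

# Hypothetical tab-separated sample standing in for imdb.csv.
sample = (
    "Title\tYear\tRating\tVotes\n"
    "Movie A\t1994\t8.1\t1200\n"
    "Movie B\t1995\t6.4\t300\n"
    "Movie C\t1999\t7.0\t45\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")

# islice(reader, 1, 3) skips item 0 (the header) and yields items 1 and 2.
rows = list(islice(reader, 1, 3))
for row in rows:
    # Every field is a string, so numeric columns need an explicit conversion.
    print(row[0], int(row[1]))
```

Note that csv.reader never converts types: the year comes back as the string '1994' until you call int() on it.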
It's also very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options, and there are a few that you need to be careful about. For example:
- delimiter or sep: the data file may use a comma, tab, or any other character to separate fields. You can't read the data properly if this option is incorrect.
- header: some data files have a "header" row that contains the names of the columns. If you read a header row as data, or treat the first data row as a header when the file has none, you'll have problems.
- na_values or na_filter: often the dataset is incomplete and contains missing data (NA, NaN (not a number), etc.). It's very important to take care of them properly.

You don't need to create dictionaries and other data structures. Pandas just imports the whole table into a data structure called DataFrame.
df = pd.read_csv('imdb.csv', delimiter='\t')
df.head()
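The options above can be tried on a small in-memory table. This is only a sketch: the sample text, the column names, and the choice of "NA" as the missing-value marker are assumptions made for illustration, not properties of imdb.csv.

```python
import io
import pandas as pd

# Hypothetical sample: tab-separated, no header row, "NA" marks a missing rating.
raw = "Movie A\t1994\t8.1\t1200\nMovie B\t1995\tNA\t300\n"

sample_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                                   # fields are separated by tabs
    header=None,                                # the first line is data, not column names
    names=["Title", "Year", "Rating", "Votes"], # supply the column names ourselves
    na_values=["NA"],                           # treat "NA" as missing
)

print(sample_df)
print(sample_df["Rating"].isna().sum())  # number of missing ratings
```

If header=None were left out, pandas would use the first movie as the column names and silently drop it from the data.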
There are different options for looking at the data in the first few rows.
df.head(2)
df['Year'].head(3)
df[['Year','Rating']].head(3)
df[:10]
df[['Year','Rating']][:10]
The min() and max() functions give the range of a column, and the value_counts() method counts how many times each value appears.
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()
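One detail worth knowing is that value_counts() orders its result by count, not by value. The sketch below uses a made-up list of years to show this, together with sort_index() for chronological order.

```python
import pandas as pd

# Toy data, not taken from imdb.csv.
years = pd.Series([1994, 1995, 1994, 1999, 1994, 1995], name="Year")

counts = years.value_counts()   # most frequent year first
by_year = counts.sort_index()   # reordered chronologically

print(counts.head())
print(by_year)
```

The chronological version is usually the one you want when plotting the number of movies per year.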
You can calculate the average rating and the average number of votes:
print( np.mean(df['Rating']), np.mean(df['Votes']) )
print( df['Rating'].mean() )
To get the median rating of movies from the 1990s, first select only the movies from that decade, then call median() on the selection:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]
movie_nineties.head()
print(movie_nineties['Rating'].median(), movie_nineties['Votes'].median())
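The two-mask selection can also be written with Series.between(), which is inclusive on both ends by default. The sketch below checks on toy data that both forms select the same rows. The values are invented for illustration.

```python
import pandas as pd

# Toy data, not taken from imdb.csv.
toy = pd.DataFrame({
    "Year":   [1989, 1990, 1995, 1999, 2000],
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
})

# Two comparisons combined with & (the parentheses are required).
mask_ops = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)

# The same condition expressed with between().
mask_between = toy["Year"].between(1990, 1999)

nineties = toy[mask_between]
print(nineties["Rating"].median())
```

Either form works. The & operator is more general, while between() reads more naturally for simple ranges.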
Finally, if we want to know the top-rated movies of the 1990s, we can use the sort_values() method:
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]
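As a small illustration on toy data, the sketch below takes the top rows in two ways: sort_values() followed by head(), and nlargest(). Positional slices start at 0, so the highest-rated row is included only when the slice starts there. The titles and ratings are made up.

```python
import pandas as pd

# Toy data, not taken from imdb.csv.
toy = pd.DataFrame({
    "Title":  ["A", "B", "C", "D"],
    "Rating": [6.5, 9.1, 7.8, 8.4],
})

# Sort the whole table, then keep the first two rows.
top_sorted = toy.sort_values("Rating", ascending=False).head(2)

# Ask directly for the two largest ratings.
top_nlargest = toy.nlargest(2, "Rating")

print(top_sorted)
print(top_nlargest)
```

nlargest() avoids sorting the full table, which can matter for a file with hundreds of thousands of rows.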
You can also compute several summary statistics for a specific subset of the data. For example, for the ratings of movies from the 1990s only, the code below finds the 10th percentile, median, mean, and 90th percentile.
print("The 10th percentile is", np.percentile(movie_nineties['Rating'], 10))
print()
print( "The median", movie_nineties['Rating'].median() )
print( "The mean", movie_nineties['Rating'].mean() )
print()
print("The 90th percentile is", np.percentile(movie_nineties['Rating'], 90))
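Pandas has its own quantile() method, which takes a fraction between 0 and 1 rather than a percentage. The sketch below compares it with np.percentile() on toy ratings and also shows how the two differ when a value is missing. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from imdb.csv.
ratings = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0], name="Rating")

p10_np = np.percentile(ratings, 10)   # percentile on a 0-100 scale
p10_pd = ratings.quantile(0.10)       # quantile on a 0-1 scale

# Several quantiles in one call.
summary = ratings.quantile([0.10, 0.50, 0.90])
print(summary)

# With a missing value, quantile() skips it but np.percentile() returns nan.
with_gap = pd.Series([2.0, np.nan, 6.0])
median_pd = with_gap.quantile(0.5)
median_np = np.percentile(with_gap, 50)
print(median_pd, median_np)
```

If a column may contain missing values, the pandas methods (or np.nanpercentile()) are the safer choice.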
Pandas also provides some easy ways to draw plots by using matplotlib. DataFrame and Series objects have several plotting methods, such as hist():
df['Year'].hist()
movie_nineties['Rating'].hist()
You can also plot a histogram of ratings using the pyplot.hist() function.
plt.hist(df['Rating'], bins=10)
You can also try to make some style changes to the plot:
plt.hist(movie_nineties['Rating'], bins = 20, color = "red")
plt.xlabel('Ratings')
plt.ylabel('Number')
plt.title('Histogram of Movie Ratings from the 90s')
plt.grid(True)
plt.show()
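To look at the numbers behind the bars, np.histogram() computes bin counts and bin edges without drawing anything. plt.hist() returns the same kind of counts and edges alongside the drawn bars. The sketch below uses made-up ratings and 3 bins to keep the output small.

```python
import numpy as np

# Toy data, not taken from imdb.csv.
ratings = np.array([1.0, 2.5, 3.0, 4.5, 5.0, 5.5, 7.0, 8.0, 9.5, 10.0])

# Split the range [min, max] into 3 equal-width bins and count values in each.
counts, edges = np.histogram(ratings, bins=3)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.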
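Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib; it just typically requires more work. We can use the distplot() function to plot the histogram. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(), so recent versions emit a deprecation warning.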
sns.distplot(df['Rating'])
You can adjust the plot in different ways, such as changing the number of bins or removing the KDE curve:
sns.distplot(df['Rating'], bins = 10, kde= False)