The Internet Movie Database (IMDb) provides information about movies, such as total budgets, lengths, actors, and user ratings. It is publicly available here. The examples below show basic methods for exploring and visualizing the imdb.csv dataset.

The downloaded file contains four tab-separated columns:

  1. Title: title of the movie;
  2. Year: release year;
  3. Rating: average IMDb user rating;
  4. Votes: number of IMDb users who rated this movie.

There are 313,012 lines in the file.

In [33]:
import pandas as pd  
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

The %matplotlib inline magic command lets us display plots within the notebook instead of creating figure files.

Python provides the csv module to read and write CSV files. The csv.reader function returns an iterator over the lines in the given file. Each line is returned as a list of strings, so we can access a particular column by its list index. If we want to ignore the first line, we can use islice from the itertools module. It works like slicing a list, but it can slice any iterator (e.g. a file stream). For instance, islice(reader, 0, 5) means "give me the first 5 items from the reader," and islice(reader, 1, 5) means "give me the 4 items starting from the second item."

In [34]:
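from itertools import islice

# Open the tab-separated file; the with block closes it when the loop ends
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    # Print the first five rows (header included) and the Year column of each
    for row in islice(reader, 0, 5):
        print(row)
        print(row[1])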
['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
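To see how the islice arguments behave without the real file, here is a minimal sketch that skips the header row. The in-memory sample and its three rows are made up for illustration; only csv.reader and itertools.islice come from the text above.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory stand-in for a tab-separated file with a header row.
sample = io.StringIO("Title\tYear\nA\t1994\nB\t2006\nC\t2013\n")
reader = csv.reader(sample, delimiter="\t")

# Start at index 1 to skip the header; stop=None reads to the end of the iterator.
rows = list(islice(reader, 1, None))

# Every field arrives as a string, so numeric columns need an explicit conversion.
years = [int(row[1]) for row in rows]

print(rows)
print(years)
```

Because the reader is an iterator, the rows consumed by islice are gone afterwards; reading them again requires reopening the file or rewinding the stream.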

It's also very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options, and a few of them need particular care:

  1. delimiter or sep: the data file may use a comma, a tab, or some other character to separate fields. The data will not be split into columns correctly if this option is wrong.
  2. header: some data files have a "header" row that contains the names of the columns. If a header row is read as data, or the first data row of a headerless file is used as the header, the resulting table will be wrong.
  3. na_values or na_filter: datasets are often incomplete and contain missing data (NA, NaN (not a number), etc.). na_values lists additional strings that should be treated as missing, and na_filter turns the detection of missing values on or off.
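The three options can be tried on a tiny table before reading the real file. This is only a sketch: the two sample rows and the "?" missing-value marker are assumptions made for illustration, not properties of imdb.csv.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; "?" marks a missing rating.
raw = (
    "Title\tYear\tRating\tVotes\n"
    "A\t1994\t5.4\t5\n"
    "B\t2006\t?\t61\n"
)

sample_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",          # fields are separated by tabs
    header=0,          # the first line holds the column names
    na_values=["?"],   # treat "?" as missing, in addition to the defaults
)

print(sample_df)
print(sample_df["Rating"].isna().sum())  # number of missing ratings
```

Without na_values=["?"], the Rating column would be read as strings rather than numbers, so later calculations such as the mean would fail.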

You don't need to build dictionaries or other data structures by hand. Pandas imports the whole table into a data structure called a DataFrame.

In [35]:
df = pd.read_csv('imdb.csv', delimiter='\t')
df.head()
Out[35]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

There are several ways to look at the first few rows, or at selected columns of those rows.

In [36]:
df.head(2)
Out[36]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
In [37]:
df['Year'].head(3)
Out[37]:
0    1994
1    2006
2    2013
Name: Year, dtype: int64
In [38]:
df[['Year','Rating']].head(3)
Out[38]:
Year Rating
0 1994 5.4
1 2006 6.1
2 2013 7.1
In [39]:
df[:10]
Out[39]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13
5 #LawstinWoods 2013 7.0 6
6 #lovemilla 2013 6.7 17
7 #nitTWITS 2011 7.1 9
8 $#*! My Dad Says 2010 6.3 4349
9 $1,000,000 Chance of a Lifetime 1986 6.4 16
In [40]:
df[['Year','Rating']][:10]
Out[40]:
Year Rating
0 1994 5.4
1 2006 6.1
2 2013 7.1
3 2014 6.8
4 2012 5.5
5 2013 7.0
6 2013 6.7
7 2011 7.1
8 2010 6.3
9 1986 6.4
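The same selections can also be written with the iloc and loc indexers, which make it explicit whether rows are chosen by position or by label. The sketch below uses a made-up four-row table rather than the real dataset.

```python
import pandas as pd

# Hypothetical four-row table mirroring some columns of imdb.csv.
small = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [1994, 2006, 2013, 2014],
    "Rating": [5.4, 6.1, 7.1, 6.8],
})

# iloc selects by position: rows 0, 1 and 2 (the end of the slice is excluded).
first_three = small.iloc[:3]

# loc selects by label: rows labelled 0 through 2 (the end of the slice is included).
by_label = small.loc[:2, ["Year", "Rating"]]

print(first_three)
print(by_label)
```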

The value_counts() method counts how many times each value appears and returns the counts sorted from most to least frequent.

In [41]:
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()
1874 1874 2017 2017
Out[41]:
2011    13944
2012    13887
2013    13048
2010    12931
2009    12268
Name: Year, dtype: int64
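Since value_counts() orders the result by frequency, the years above are not in chronological order. A small sketch with made-up years shows how sort_index() can reorder the counts by year instead:

```python
import pandas as pd

# Hypothetical release years, for illustration only.
years = pd.Series([2011, 2012, 2011, 2010, 2011, 2012], name="Year")

counts = years.value_counts()   # ordered by count, largest first
by_year = counts.sort_index()   # reordered by the year itself

print(counts)
print(by_year)
```

The chronological ordering is usually the more convenient one when the counts are going to be plotted against time.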

You can calculate the average rating and the average number of votes, either with np.mean() or with the mean() method of a column:

In [42]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )
6.296195341377723 1691.2317746021706
In [43]:
print( df['Rating'].mean() )
6.296195341377723

To get the median rating of movies from the 1990s, first select only the movies released in that decade:

In [44]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]
In [45]:
movie_nineties.head()
Out[45]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
23 'N Sync TV 1998 7.5 11
33 't Zal je gebeuren... 1998 6.0 7
34 't Zonnetje in huis 1993 6.1 148
42 .COM 1999 3.8 5
In [46]:
print(movie_nineties['Rating'].median(), movie_nineties['Votes'].median())
6.3 32.0
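The selection works by combining two boolean Series with &, which keeps only the rows where both conditions are true. As a sketch on a made-up four-row table, the same filter can also be written with the between() method, whose bounds are inclusive by default:

```python
import pandas as pd

# Hypothetical table with one movie just outside each end of the decade.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [1989, 1990, 1999, 2000],
    "Rating": [5.0, 6.0, 7.0, 8.0],
})

# Element-wise AND of two boolean masks; the parentheses are required.
mask = (movies["Year"] >= 1990) & (movies["Year"] <= 1999)
nineties = movies[mask]

# Equivalent selection with between(), which includes both endpoints.
same = movies[movies["Year"].between(1990, 1999)]

print(nineties)
print(nineties["Rating"].median())
```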

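Finally, if we want to know the top-rated movies of the 1990s, we can use the sort_values() method. Note that the slice [1:10] starts at the second row, so the output lists nine movies and omits the highest-rated one: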

In [47]:
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[1:10]
Out[47]:
Title Year Rating Votes
202778 Nicole's Revenge 1995 9.5 13
38899 The Beatles Anthology 1995 9.4 3822
39429 The Civil War 1990 9.4 4615
218444 Pink Floyd: P. U. L. S. E. Live at Earls Court 1994 9.3 3202
279320 The Shawshank Redemption 1994 9.3 1511933
72171 Bardot 1992 9.2 5
42590 The Sopranos 1999 9.2 163406
29419 Otvorena vrata 1994 9.1 2337
3955 Baseball 1994 9.1 2463
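Several of the titles above have only a handful of votes, so their high ratings rest on very little data. One possible refinement, sketched here on made-up rows, is to require a minimum number of votes before sorting. The threshold of 1000 is an arbitrary assumption, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: two obscure titles and two widely rated ones.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [9.5, 9.3, 9.2, 8.0],
    "Votes": [13, 1500000, 5, 20000],
})

min_votes = 1000  # arbitrary cutoff chosen for illustration

# Keep only movies with enough votes, then sort from highest to lowest rating.
popular = movies[movies["Votes"] >= min_votes]
top = popular.sort_values("Rating", ascending=False).head(10)

print(top)
```

Using head(10) returns the ten highest-rated rows, including the first one.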

You can also compute several summary statistics for a chosen subset, for example the 10th percentile, median, mean, and 90th percentile of the ratings of movies from the 1990s:

In [48]:
print("The 10th percentile is", np.percentile(movie_nineties['Rating'], 10))
print()
print( "The median", movie_nineties['Rating'].median() )
print( "The mean", movie_nineties['Rating'].mean() )
print() 
print("The 90th percentile is", np.percentile(movie_nineties['Rating'], 90))
The 10th percentile is 4.3

The median 6.3
The mean 6.183766787465408

The 90th percentile is 7.9
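To make the percentile calculation concrete, here is a small worked sketch on five made-up ratings. With their default settings, both np.percentile() and the pandas quantile() method interpolate linearly between neighbouring values; the only difference is that quantile() takes a fraction between 0 and 1 instead of a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, already sorted to make the arithmetic easy to follow.
ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# The 10th percentile sits at position 0.10 * (5 - 1) = 0.4,
# i.e. 40% of the way from the first value (1.0) to the second (2.0).
p10 = np.percentile(ratings, 10)
p90 = np.percentile(ratings, 90)

# Same statistic through pandas, expressed as a fraction.
q10 = ratings.quantile(0.10)

print(p10, p90, q10)
```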

Pandas also provides some easy ways to draw plots with matplotlib. DataFrame and Series objects have several plotting methods, such as hist():

In [49]:
df['Year'].hist()
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x24fc8d8c8d0>
In [50]:
movie_nineties['Rating'].hist()
Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x24fc4786208>

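You can also plot a histogram of ratings using the pyplot.hist() function. It returns the count in each bin, the bin edges, and the drawn patch objects, which is the tuple shown in the output.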

In [51]:
plt.hist(df['Rating'], bins=10)
Out[51]:
(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)
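The first two arrays in that output are the bin counts and the bin edges. The sketch below reproduces them without drawing anything by calling np.histogram() on ten made-up ratings; the values are chosen so that the range matches the 1.0 to 9.9 span seen above.

```python
import numpy as np

# Hypothetical ratings spanning 1.0 to 9.9.
ratings = np.array([1.0, 2.5, 2.5, 4.0, 5.5, 5.5, 5.5, 7.0, 8.5, 9.9])

# Ten equal-width bins between the minimum and maximum value.
counts, edges = np.histogram(ratings, bins=10)

print(counts)   # number of ratings falling in each bin
print(edges)    # 11 edges delimit 10 bins
```

Ten bins always need eleven edges, and each bin here is (9.9 - 1.0) / 10 = 0.89 wide, which is why the edges in the output above step from 1.0 to 1.89 to 2.78 and so on.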

You can also change the style of the plot, for example the number of bins, the color, the axis labels, the title, and the grid:

In [52]:
plt.hist(movie_nineties['Rating'], bins = 20, color = "red")
plt.xlabel('Ratings')
plt.ylabel('Number')
plt.title('Histogram of Movie Ratings from the 90s')
plt.grid(True)
plt.show()

Seaborn is built on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib directly; it typically just requires more work. We can use the distplot() function to plot the histogram. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot().

In [53]:
sns.distplot(df['Rating'])
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x24fc645bd30>

You can adjust the plot in different ways, such as changing the number of bins or removing the kernel density estimate (KDE) curve:

In [54]:
sns.distplot(df['Rating'], bins = 10, kde= False)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x24fc1b597b8>