The Internet Movie Database (IMDb) provides information about movies, such as total budgets, lengths, actors, and user ratings. It is publicly available here. The examples below show basic methods for exploring and visualizing the imdb.csv dataset.

The downloaded file contains four tab-separated columns:

Title: title of the movie;
Year: release year;
Rating: average IMDb user rating;
Votes: number of IMDb users who rated this movie.

There are 313,012 lines in the file.

import pandas as pd  
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

%matplotlib inline is a Jupyter magic command that lets us display plots within the notebook instead of creating figure files.

Python provides the csv module to read and write CSV files. The csv.reader function returns an object that iterates over the lines of the given file. Each line is returned as a list of strings, so we can access a particular column by its list index. If we want to ignore the first line, we can use islice from the itertools module. It is like slicing a list, but it works on any iterator (e.g. a file stream). For instance, islice(reader, 0, 5) means "give me the first 5 items from the reader," and islice(reader, 1, 5) means "give me the 4 items starting from the second item."

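from itertools import islice

# Open the tab-separated file; the with-block closes it automatically.
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    # Print the first five rows (header included) and each row's Year column.
    for row in islice(reader, 0, 5):
        print(row)
        print(row[1])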

['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
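
To skip the header row, the start index passed to islice can be set to 1. The following is a minimal sketch that uses an in-memory string, built from the rows printed above, in place of imdb.csv so that it runs on its own; the variable names (sample, rows, years) are illustrative and not part of the original notebook.

```python
import csv
import io
from itertools import islice

# In-memory stand-in for the first lines of imdb.csv (tab-separated).
sample = (
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
    "#Bikerlive\t2014\t6.8\t11\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")

# Start at index 1 to skip the header; stop=None means "until the end".
rows = list(islice(reader, 1, None))

# csv.reader yields strings, so numeric columns must be converted explicitly.
years = [int(row[1]) for row in rows]
print(years)
```

Because every field arrives as a string, any arithmetic on Year, Rating, or Votes needs an explicit int() or float() conversion first.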

It's also very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options, and a few of them need particular care. For example:

delimiter or sep: the data file may use a comma, tab, or any weird character to separate fields. You can't read data properly if this option is incorrect.
header: some data files have a "header" row that contains the names of the columns, and some do not. If the header row is read as data, or if the first data row of a headerless file is taken as the header, you'll have problems.
na_values or na_filter: often the dataset is incomplete and contains missing data (NA, NaN (not a number), etc.). It's very important to take care of them properly.
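
As a rough illustration of these options, the sketch below reads a small in-memory table instead of imdb.csv. The "?" marker for a missing rating is a made-up assumption for the example; the real file may mark missing values differently or not at all.

```python
import io
import pandas as pd

# Small tab-separated sample; the "?" rating is invented to illustrate missing data.
sample = (
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t?\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
)

# sep: the field separator; header=0: the first line holds the column names;
# na_values: extra strings that should be treated as missing.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", header=0, na_values=["?"])

print(sample_df.dtypes)
print(sample_df["Rating"].isna().sum())  # number of missing ratings
```

Without na_values, the "?" would be kept as text and the whole Rating column would be read as strings rather than numbers.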

You don't need to create dictionaries and other data structures. Pandas just imports the whole table into a data structure called DataFrame.

df = pd.read_csv('imdb.csv', delimiter='\t')
df.head()

There are different options for looking at the data in the first few rows.

df.head(2)

df['Year'].head(3)

0    1994
1    2006
2    2013
Name: Year, dtype: int64

df[['Year','Rating']].head(3)

df[:10]

df[['Year','Rating']][:10]

The min() and max() functions give the range of release years, and the value_counts() method counts how many times each value appears.

print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()

1874 1874 2017 2017

2011    13944
2012    13887
2013    13048
2010    12931
2009    12268
Name: Year, dtype: int64
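
value_counts() returns a Series whose index holds the distinct values and whose entries are sorted by count, largest first. To view the counts in chronological order instead, sort_index() can be chained. A minimal sketch on toy data (the values below are illustrative, not taken from imdb.csv):

```python
import pandas as pd

# Toy stand-in for df["Year"]; the values are illustrative only.
years = pd.Series([2011, 2012, 2011, 2013, 2011, 2012], name="Year")

counts = years.value_counts()   # sorted by count, largest first
print(counts)

by_year = counts.sort_index()   # same counts, ordered by year
print(by_year)
```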

You can calculate the average rating and the average number of votes:

print( np.mean(df['Rating']), np.mean(df['Votes']) )

6.296195341377723 1691.2317746021706

print( df['Rating'].mean() )

6.296195341377723

To get the median rating of movies in the 1990s, you first select only the movies from that decade:

geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]

movie_nineties.head()

print(movie_nineties['Rating'].median(), movie_nineties['Votes'].median())

6.3 32.0
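
The two masks above are combined with &, the element-wise "and" for boolean Series. When the comparisons are written inline, each one needs its own parentheses, because & binds more tightly than >= and <=. Series.between() is an equivalent shorthand that is inclusive at both ends by default. A sketch on a few rows taken from the tables in this section (the name toy is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["!Next?", "#1 Single", "'N Sync TV", ".COM", "#7DaysLater"],
    "Year": [1994, 2006, 1998, 1999, 2013],
    "Rating": [5.4, 6.1, 7.5, 3.8, 7.1],
})

# Inline form: parentheses around each comparison are required.
inline = toy[(toy["Year"] >= 1990) & (toy["Year"] <= 1999)]

# Shorthand: between() includes both endpoints by default.
shorthand = toy[toy["Year"].between(1990, 1999)]
print(shorthand)
```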

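Finally, if we want to know the top-rated movies of the 1990s, we can use the sort_values() method. Note that the slice [1:10] starts at the second row, so the first row of the sorted result is not shown; use [:10] to include it.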

sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[1:10]

You can also compute several summary statistics for a specific subset of the data. For example, for the ratings of movies from the 1990s, we can find the 10th percentile, median, mean, and 90th percentile:

print("The 10th percentile is", np.percentile(movie_nineties['Rating'], 10))
print()
print( "The median", movie_nineties['Rating'].median() )
print( "The mean", movie_nineties['Rating'].mean() )
print() 
print("The 90th percentile is", np.percentile(movie_nineties['Rating'], 90))

The 10th percentile is 4.3

The median 6.3
The mean 6.183766787465408

The 90th percentile is 7.9
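
The same statistics are also available directly on a pandas Series. The sketch below uses toy ratings (illustrative values, not the real dataset) to show quantile(), which takes fractions between 0 and 1 rather than percentages, and describe(), which prints several summary statistics at once.

```python
import numpy as np
import pandas as pd

# Toy ratings; illustrative values only.
ratings = pd.Series([5.4, 6.1, 7.1, 6.8, 5.5, 7.5, 6.0, 3.8])

# quantile() takes fractions in [0, 1]; np.percentile() takes percentages in [0, 100].
p10, p50, p90 = ratings.quantile([0.10, 0.50, 0.90])
print(p10, p50, p90)

# describe() reports count, mean, std, min, quartiles, and max in one call.
print(ratings.describe())
```

With their default settings, both functions interpolate linearly between data points, so ratings.quantile(0.10) and np.percentile(ratings, 10) should agree.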

Pandas also provides some easy ways to draw plots, using matplotlib under the hood. DataFrame and Series objects have several plotting methods, such as hist():

df['Year'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x24fc8d8c8d0>

movie_nineties['Rating'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x24fc4786208>

You can also plot a histogram of ratings using the pyplot.hist() function.

plt.hist(df['Rating'], bins=10)

(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)
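
The output above is the tuple that plt.hist() returns: the count in each bin, the bin edges, and the drawn patch objects. With 10 bins there are 11 edges. The same counts and edges can be computed without drawing anything by calling np.histogram(); a minimal sketch on toy ratings (illustrative values only):

```python
import numpy as np

# Toy ratings; illustrative values only.
ratings = np.array([5.4, 6.1, 7.1, 6.8, 5.5, 7.5, 6.0, 3.8, 9.3, 1.0])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(ratings, bins=10)
print(counts)
print(edges)
```

By default the bins span the range from the smallest to the largest value, which is why the edges in the output above run from 1.0 to 9.9.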

You can also try to make some style changes to the plot:

change the color from blue to whatever you want
- http://matplotlib.org/users/pyplot_tutorial.html#working-with-text
- http://matplotlib.org/api/colors_api.html
add labels to the x and y axes
change the number of bins to 20

plt.hist(movie_nineties['Rating'], bins = 20, color = "red")
plt.xlabel('Ratings')
plt.ylabel('Number')
plt.title('Histogram of Movie Ratings from the 90s')
plt.grid(True)
plt.show()

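Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib alone, but that typically requires more work. We can use the distplot() function to plot the histogram. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(), so newer versions may warn or fail on the calls below.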

sns.distplot(df['Rating'])

<matplotlib.axes._subplots.AxesSubplot at 0x24fc645bd30>

You can adjust your plot in different ways, such as changing the number of bins or removing the kernel density estimate (KDE) curve:

sns.distplot(df['Rating'], bins = 10, kde= False)

<matplotlib.axes._subplots.AxesSubplot at 0x24fc1b597b8>

Output of sorted_by_rating[1:10]:

	Title	Year	Rating	Votes
202778	Nicole's Revenge	1995	9.5	13
38899	The Beatles Anthology	1995	9.4	3822
39429	The Civil War	1990	9.4	4615
218444	Pink Floyd: P. U. L. S. E. Live at Earls Court	1994	9.3	3202
279320	The Shawshank Redemption	1994	9.3	1511933
72171	Bardot	1992	9.2	5
42590	The Sopranos	1999	9.2	163406
29419	Otvorena vrata	1994	9.1	2337
3955	Baseball	1994	9.1	2463

Output of df.head():

	Title	Year	Rating	Votes
0	!Next?	1994	5.4	5
1	#1 Single	2006	6.1	61
2	#7DaysLater	2013	7.1	14
3	#Bikerlive	2014	6.8	11
4	#ByMySide	2012	5.5	13

Output of movie_nineties.head():

	Title	Year	Rating	Votes
0	!Next?	1994	5.4	5
23	'N Sync TV	1998	7.5	11
33	't Zal je gebeuren...	1998	6.0	7
34	't Zonnetje in huis	1993	6.1	148
42	.COM	1999	3.8	5