What are the best and worst Friends seasons according to IMDb users? How can we plot this information cleanly and effectively, so that the plot also shows the best and worst episodes (overall and for each season), the average rating for each season, and so forth? Let’s see how we can achieve this with just a few lines of R code.
According to "Can I use IMDb data in my software?", we are not allowed to use data mining, robots, screen scraping, or similar online data gathering and extraction tools on the IMDb website. However, IMDb does provide daily refreshed data files for download.
For the plot we want to create, we will need both title.episode.tsv.gz and title.ratings.tsv.gz. Documentation for these data files can be found here.
library(tidyverse)

episodes <- read_tsv('data/title.episode.tsv', na = "\\N", quote = '')
ratings <- read_tsv('data/title.ratings.tsv', na = "\\N", quote = '')
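The `na = "\\N"` argument is there because the IMDb files mark missing values with a literal `\N`. Here is a minimal sketch of its effect, using a tiny made-up file in the same shape as title.episode.tsv (the identifiers are invented, not real IMDb entries):

```r
library(readr)

# Toy file shaped like title.episode.tsv; every row here is made up.
demo_path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "tconst\tparentTconst\tseasonNumber\tepisodeNumber",
    "tt9000001\ttt9000000\t1\t2",
    "tt9000002\ttt9000000\t\\N\t\\N"
  ),
  demo_path
)

# na = "\\N" turns the literal \N marker into NA, so the season and
# episode columns are parsed as numbers instead of text.
# quote = "" disables quote handling, so stray quote characters in a
# field are read literally.
demo <- read_tsv(demo_path, na = "\\N", quote = "")
demo
```

As a side note, `read_tsv()` can also read a `.gz` file directly, so unzipping the downloads first is optional.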
title.ratings.tsv provides ratings information about every episode (or movie or whatever) on IMDb. But it just has three attributes: an identifier, the average rating and the number of votes.
## # A tibble: 6 x 3
##   tconst    averageRating numVotes
##   <chr>             <dbl>    <dbl>
## 1 tt0000001           5.8     1470
## 2 tt0000002           6.4      177
## 3 tt0000003           6.6     1096
## 4 tt0000004           6.5      106
## 5 tt0000005           6.2     1803
## 6 tt0000006           5.6       95
We also need to know which of the identifiers in that file belong to Friends episodes. Luckily, title.episode.tsv contains this information (along with the season and episode number for each episode, which we also need for our little study).
## # A tibble: 6 x 4
##   tconst    parentTconst seasonNumber episodeNumber
##   <chr>     <chr>               <dbl>         <dbl>
## 1 tt0041951 tt0041038               1             9
## 2 tt0042816 tt0989125               1            17
## 3 tt0042889 tt0989125              NA            NA
## 4 tt0043426 tt0040051               3            42
## 5 tt0043631 tt0989125               2            16
## 6 tt0043693 tt0989125               2             8
With this in place, we need the identifier for Friends. One option would be to look it up in a third data file, but in this case we can simply find the IMDb webpage for Friends and see that the series identifier is tt0108778. By the way, we also see that Friends lasted 10 seasons with a total of 236 episodes. This will be useful in just a second to double-check that we are working with the correctly filtered data.
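For completeness, here is a rough sketch of the "third file" route. It assumes the lookup file is title.basics.tsv.gz from the same IMDb downloads, with its `tconst`, `titleType`, `primaryTitle` and `startYear` columns. The small table below is a stand-in: only the Friends row is real, the other rows are invented so the filter has something to exclude.

```r
library(dplyr)
library(tibble)

# Stand-in for title.basics.tsv with only the columns used here.
# Only the Friends row is real; the other two rows are invented.
basics <- tribble(
  ~tconst,     ~titleType,  ~primaryTitle, ~startYear,
  "tt0108778", "tvSeries",  "Friends",     1994,
  "tt9000001", "movie",     "Friends",     1971,
  "tt9000002", "tvEpisode", "Old Friends", 2001
)

# Titles are not unique, so narrow the match by type and start year too.
friends_id <- basics %>%
  filter(titleType == "tvSeries", primaryTitle == "Friends", startYear == 1994) %>%
  pull(tconst)

friends_id
```

Filtering on the title alone would likely return several rows in the real file, which is why checking the webpage by hand is the quicker option here.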
Now that we know the identifier for Friends, let’s use it to filter() all the episodes and keep only the ones we are interested in.
friends_episodes <- episodes %>%
  filter(parentTconst == 'tt0108778')

friends_episodes$seasonNumber <- as.factor(friends_episodes$seasonNumber)
We can see we get 236 rows, as expected :)
## [1] 236
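The same kind of check can be written down so that the script stops if the filtered data ever looks wrong. This is only a sketch on a made-up episode table (two invented series, one of them with missing season data); for the real data the expected values would be 236 rows and 10 seasons.

```r
library(dplyr)
library(tibble)

# Made-up episode table: two invented series, one with missing numbers.
episodes_demo <- tribble(
  ~tconst,     ~parentTconst, ~seasonNumber, ~episodeNumber,
  "tt9000011", "tt9000010",   1,             1,
  "tt9000012", "tt9000010",   1,             2,
  "tt9000013", "tt9000010",   2,             1,
  "tt9000021", "tt9000020",   NA,            NA
)

series_demo <- episodes_demo %>%
  filter(parentTconst == "tt9000010")

# Fail early if the filtered data does not look as expected.
stopifnot(
  nrow(series_demo) == 3,
  n_distinct(series_demo$seasonNumber) == 2,
  !anyNA(series_demo$seasonNumber),
  !anyNA(series_demo$episodeNumber)
)

# Number of episodes in each season.
per_season <- series_demo %>% count(seasonNumber)
per_season
```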
Let’s add ratings information to each episode using left_join(). We will also need a new column with the overall episode number (so, from 1 to 236). To achieve this, we first arrange() our data by season and episode number and then add_column() with the values 1:nrow(friends_episodes).
friends_ratings <- friends_episodes %>%
  left_join(ratings, by = 'tconst') %>%
  arrange(seasonNumber, episodeNumber) %>%
  add_column(overallEpisodeNumber = 1:nrow(friends_episodes))

friends_ratings %>% glimpse()
## Observations: 236
## Variables: 7
## $ tconst               <chr> "tt0583459", "tt0583647", "tt0583653", "tt0…
## $ parentTconst         <chr> "tt0108778", "tt0108778", "tt0108778", "tt0…
## $ seasonNumber         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ episodeNumber        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ averageRating        <dbl> 8.4, 8.1, 8.2, 8.2, 8.5, 8.2, 9.0, 8.2, 8.3…
## $ numVotes             <dbl> 5356, 3950, 3718, 3586, 3564, 3445, 4447, 3…
## $ overallEpisodeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
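Two details of that pipeline are easy to miss: left_join() keeps every episode even when it has no rating (the rating columns are then NA), and the overall episode number is only meaningful because the rows were sorted first. The sketch below shows both on made-up data. It uses `mutate(overallEpisodeNumber = row_number())`, an alternative I am suggesting rather than what the code above does, which avoids referring back to `friends_episodes` inside the pipe.

```r
library(dplyr)
library(tibble)

# Made-up episodes, deliberately out of order.
episodes_demo <- tribble(
  ~tconst,     ~seasonNumber, ~episodeNumber,
  "tt9000013", 2,             1,
  "tt9000011", 1,             1,
  "tt9000012", 1,             2
)

# Made-up ratings; tt9000012 has no rating on purpose.
ratings_demo <- tribble(
  ~tconst,     ~averageRating, ~numVotes,
  "tt9000011", 8.4,            5000,
  "tt9000013", 8.9,            4200
)

ratings_joined <- episodes_demo %>%
  left_join(ratings_demo, by = "tconst") %>%      # unrated episodes get NA
  arrange(seasonNumber, episodeNumber) %>%        # sort before numbering
  mutate(overallEpisodeNumber = row_number())     # 1, 2, 3, ... in sorted order

ratings_joined
```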
And that is all the data we need to finally create our plot.
source("pudding_theme.R")

ggplot(friends_ratings) +
  aes(x = overallEpisodeNumber, y = averageRating, color = seasonNumber) +
  scale_color_manual(values = rep(colors[1:2], 5)) +
  stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) +
  geom_point(aes(size = numVotes)) +
  scale_size_continuous(range = c(.2, 5)) +
  scale_x_continuous(breaks = seq(0, 225, 25)) +
  ylim(7, 10) +
  labs(title = "Friends Ratings (236 Episodes, 10 Seasons)",
       x = "Overall Episode Number",
       y = "Average Rating by IMDb Users",
       caption = credits_imdb) +
  theme(legend.position = "none") +
  custom_theme
Some interesting points:
With color = seasonNumber and scale_color_manual(values = rep(colors[1:2], 5)) we can represent all 10 seasons using two alternating colors.
With stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) we can plot the average rating for each season: an intercept-only linear model fitted separately for each color group is simply that group's mean.
With geom_point(aes(size = numVotes)) we can represent the rating for each episode with a bigger or smaller dot based on the number of votes that episode received (and we control the minimum and maximum sizes with scale_size_continuous(range = c(.2, 5))).
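To see why `formula = y ~ 1` draws the season average, here is a small sketch on made-up ratings. It compares the intercept of an intercept-only `lm()` with the plain mean, and also shows one way to pull out the best and worst episode of each season as a table (the column name `meanRating` is my own choice).

```r
library(dplyr)
library(tibble)

# Made-up ratings for two short seasons.
ratings_demo <- tribble(
  ~seasonNumber, ~episodeNumber, ~averageRating,
  "1",           1,              8.0,
  "1",           2,              9.0,
  "2",           1,              8.5,
  "2",           2,              9.5,
  "2",           3,              8.4
)

# Average rating per season, computed directly.
season_means <- ratings_demo %>%
  group_by(seasonNumber) %>%
  summarise(meanRating = mean(averageRating), .groups = "drop")

# Intercept-only linear model for season 1: its single coefficient
# is the mean of that season's ratings.
fit <- lm(averageRating ~ 1, data = filter(ratings_demo, seasonNumber == "1"))
season1_intercept <- unname(coef(fit))

# Best and worst episode of each season.
best_per_season <- ratings_demo %>%
  group_by(seasonNumber) %>%
  slice_max(averageRating, n = 1) %>%
  ungroup()

worst_per_season <- ratings_demo %>%
  group_by(seasonNumber) %>%
  slice_min(averageRating, n = 1) %>%
  ungroup()

season_means
```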
colors[1:2] is just a vector with specific blue and red colors, and custom_theme is a ggplot2 theme that customizes this plot (basically font families, sizes, colors, and similar things). As you can see, I keep them in a separate pudding_theme.R file for easier code reuse and more clarity (but feel free to ask if you want to have a look at it).
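Since pudding_theme.R is not included here, the following is a hypothetical stand-in so the plotting code has something to run against. The color values, caption text and theme settings are all placeholders of mine, not the author's actual choices.

```r
library(ggplot2)

# Hypothetical replacements for the objects defined in pudding_theme.R.
colors <- c("#1f77b4", "#d62728")         # a blue and a red, picked arbitrarily
credits_imdb <- "Source: IMDb datasets"   # placeholder caption text

# A plain theme() call (not a complete theme), so adding it last does not
# undo an earlier theme(legend.position = "none").
custom_theme <- theme(
  plot.title = element_text(face = "bold"),
  panel.grid.minor = element_blank()
)

# Ten seasons, two alternating colors: values are assigned to the
# factor levels of seasonNumber in level order.
season_colors <- rep(colors[1:2], 5)

# Tiny made-up data set to check that the pieces fit together.
demo <- data.frame(
  overallEpisodeNumber = 1:4,
  averageRating = c(8.4, 8.1, 8.9, 8.6),
  seasonNumber = factor(c(1, 1, 2, 2))
)

p <- ggplot(demo) +
  aes(x = overallEpisodeNumber, y = averageRating, color = seasonNumber) +
  geom_point() +
  scale_color_manual(values = season_colors) +
  labs(caption = credits_imdb) +
  theme(legend.position = "none") +
  custom_theme
```

One thing to watch for when writing your own version: if `custom_theme` were built on a complete theme such as `theme_minimal()`, adding it after `theme(legend.position = "none")` would bring the legend back, so it would need to come first.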
Finally, remember we can save our plots (e.g. as SVG) by assigning the result of the ggplot() call to a variable (e.g. p) and then using ggsave():
ggsave(filename = "friends-plot.svg", plot = p, width = 10, height = 8)
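A minimal end-to-end sketch of that last step, on a throwaway plot. It writes a PDF into a temporary folder instead of an SVG, because as far as I know ggsave() needs the svglite package installed for SVG output, while the PDF device ships with R. The file name is just an example.

```r
library(ggplot2)

# Throwaway plot with made-up values, stored in a variable.
p <- ggplot(data.frame(x = 1:3, y = c(8.1, 8.5, 9.0))) +
  aes(x = x, y = y) +
  geom_point()

# ggsave() picks the output format from the file extension;
# width and height are in inches by default.
out_file <- file.path(tempdir(), "friends-plot-demo.pdf")
ggsave(filename = out_file, plot = p, width = 10, height = 8)

file.exists(out_file)
```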