Friends (TV Series) Best and Worst Seasons According to IMDb Users


What are the best and worst Friends seasons according to IMDb users? How can we plot this information cleanly and effectively, so that the plot also shows the best and worst episodes (overall and within each season), the average rating for each season, and so forth? Let’s see how we can achieve this with just a few lines of R code.

According to IMDb’s “Can I use IMDb data in my software?” page, we are not allowed to use data mining, robots, screen scraping, or similar online data gathering and extraction tools on the IMDb website. However, IMDb does provide data files, refreshed daily, that are available for download.

For the plot we want to create, we will need both title.episode.tsv.gz and title.ratings.tsv.gz. IMDb provides documentation for these data files. The code below assumes both files have been downloaded and decompressed into a data/ directory.

library(tidyverse)
episodes <- read_tsv('data/title.episode.tsv', na = "\\N", quote = '')
ratings  <- read_tsv('data/title.ratings.tsv', na = "\\N", quote = '')
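
If you would rather skip the manual decompression step, read_tsv() can read files ending in .gz directly. The following is a minimal sketch using a tiny, made-up stand-in file (the rows and tconst values are toy data, not real IMDb records), just to show that the same na and quote arguments work on a compressed file.

```r
library(readr)

# Write a tiny stand-in for title.ratings.tsv.gz to a temporary file.
# The rows are toy data; "\N" marks a missing value, as in the IMDb files.
tmp <- tempfile(fileext = ".tsv.gz")
con <- gzfile(tmp, "w")
writeLines(c("tconst\taverageRating\tnumVotes",
             "tt9999901\t8.5\t100",
             "tt9999902\t\\N\t\\N"), con)
close(con)

# read_tsv() decompresses files ending in .gz transparently
toy_ratings <- read_tsv(tmp, na = "\\N", quote = "", show_col_types = FALSE)
toy_ratings
```

With the real downloads, this would amount to pointing read_tsv() at the .tsv.gz path instead of the decompressed file.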

Basically, title.ratings.tsv provides rating information for every title on IMDb (episodes, movies, and so on), but it has only three attributes: an identifier, the average rating, and the number of votes.

head(ratings)
## # A tibble: 6 x 3
##   tconst    averageRating numVotes
##   <chr>             <dbl>    <dbl>
## 1 tt0000001           5.8     1470
## 2 tt0000002           6.4      177
## 3 tt0000003           6.6     1096
## 4 tt0000004           6.5      106
## 5 tt0000005           6.2     1803
## 6 tt0000006           5.6       95

We also need to know which of the identifiers in that file belong to Friends episodes. Luckily, title.episode.tsv contains this information (and the season and episode number for each episode as well, which we also need for our little study).

head(episodes)
## # A tibble: 6 x 4
##   tconst    parentTconst seasonNumber episodeNumber
##   <chr>     <chr>               <dbl>         <dbl>
## 1 tt0041951 tt0041038               1             9
## 2 tt0042816 tt0989125               1            17
## 3 tt0042889 tt0989125              NA            NA
## 4 tt0043426 tt0040051               3            42
## 5 tt0043631 tt0989125               2            16
## 6 tt0043693 tt0989125               2             8

With this in place, we need to know the identifier for Friends. A good option could be to use a third file to find that out. But in this case we can just look up the IMDb webpage for Friends and find that the identifier for this series is tt0108778. By the way, we also see that Friends lasted 10 seasons with a total of 236 episodes. This will be useful in just a second to double-check that we are working with the correctly filtered data.
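
For reference, the "third file" route could look roughly like this. IMDb's title.basics.tsv.gz lists each title's tconst, titleType, and primaryTitle (among other columns). The sketch below uses a small inline tibble as a stand-in for that file, and the second row is invented purely for illustration.

```r
library(dplyr)

# Toy stand-in for a few rows of title.basics.tsv.gz (only three of its columns).
# The second row is made up to show why filtering on titleType matters.
basics <- tibble(
  tconst       = c("tt0108778", "tt9999999"),
  titleType    = c("tvSeries", "movie"),
  primaryTitle = c("Friends", "Friends")
)

# Keep only TV series named "Friends" and extract the identifier
friends_id <- basics %>%
  filter(titleType == "tvSeries", primaryTitle == "Friends") %>%
  pull(tconst)
friends_id
```

In the real file several titles can share a name, so an extra condition (for example on startYear) may be needed to end up with a single identifier.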

Now that we know the identifier for Friends, let’s use it to filter() all the episodes and keep only the ones we are interested in.

friends_episodes <- episodes %>% filter(parentTconst == 'tt0108778')
friends_episodes$seasonNumber <- as.factor(friends_episodes$seasonNumber)

We can see we get 236 rows, as expected :)

nrow(friends_episodes)
## [1] 236

Let’s add rating information to each episode using left_join(), which keeps every episode even if no rating is found for it. We will also need a new column with the overall episode number (so, from 1 to 236). To achieve this, we first arrange() our data by season and episode number and then add_column() with values 1:236.

friends_ratings <- friends_episodes %>% 
                     left_join(ratings, by = 'tconst') %>% 
                     arrange(seasonNumber, episodeNumber) %>%
                     add_column(overallEpisodeNumber = 1:nrow(friends_episodes))
friends_ratings %>% glimpse()
## Observations: 236
## Variables: 7
## $ tconst               <chr> "tt0583459", "tt0583647", "tt0583653", "tt0…
## $ parentTconst         <chr> "tt0108778", "tt0108778", "tt0108778", "tt0…
## $ seasonNumber         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ episodeNumber        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ averageRating        <dbl> 8.4, 8.1, 8.2, 8.2, 8.5, 8.2, 9.0, 8.2, 8.3…
## $ numVotes             <dbl> 5356, 3950, 3718, 3586, 3564, 3445, 4447, 3…
## $ overallEpisodeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
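
Before plotting, the questions from the introduction (season averages, best and worst episode per season) can also be answered numerically. Here is a minimal sketch on a toy tibble that mimics the columns of friends_ratings; the values are made up, and the variable names season_means, best, and worst are just illustrative.

```r
library(dplyr)

# Toy data with the same column names as friends_ratings (values are made up)
toy <- tibble(
  seasonNumber  = factor(c(1, 1, 1, 2, 2, 2)),
  episodeNumber = c(1, 2, 3, 1, 2, 3),
  averageRating = c(8.4, 8.1, 9.0, 8.7, 8.2, 8.5)
)

# Average rating per season
season_means <- toy %>%
  group_by(seasonNumber) %>%
  summarise(meanRating = mean(averageRating))

# Highest- and lowest-rated episode within each season
best  <- toy %>% group_by(seasonNumber) %>% slice_max(averageRating, n = 1) %>% ungroup()
worst <- toy %>% group_by(seasonNumber) %>% slice_min(averageRating, n = 1) %>% ungroup()

season_means
best
worst
```

Replacing toy with friends_ratings should give the same summaries for the real data. Note that slice_max() and slice_min() keep ties by default, so a season may return more than one row.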

And that is all the data we need to finally create our plot.

source("pudding_theme.R")

ggplot(friends_ratings) +
  aes(x = overallEpisodeNumber, y = averageRating, color = seasonNumber) +
  scale_color_manual(values = rep(colors[1:2], 5)) +
  stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) +
  geom_point(aes(size = numVotes)) +
  scale_size_continuous(range = c(.2, 5)) +
  scale_x_continuous(breaks = seq(0, 225, 25)) +
  ylim(7, 10) +
  labs(title = "Friends Ratings (236 Episodes, 10 Seasons)",
       x = "Overall Episode Number", 
       y = "Average Rating by IMDb Users", 
       caption = credits_imdb) +
  theme(legend.position = "none") + 
  custom_theme

Some interesting points:

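  • With color = seasonNumber and scale_color_manual(values = rep(colors[1:2], 5)) we can represent all 10 seasons using just two alternating colors, so consecutive seasons are easy to tell apart.
  • With stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) we can plot the average rating for each season: an intercept-only linear model fits the mean, and because color = seasonNumber groups the data, one horizontal line is drawn per season.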
  • With geom_point(aes(size = numVotes)) we can represent the rating for each episode with a bigger or smaller dot based on the number of votes that episode received (and we control maximum and minimum sizes with scale_size_continuous(range = c(.2, 5))).
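
To see why formula = y ~ 1 yields the average, here is a quick check in base R with a few made-up ratings: the only coefficient of an intercept-only linear model is the mean of the response.

```r
# Made-up ratings for a single season
ratings_toy <- c(8.4, 8.1, 8.2, 9.0)

# Intercept-only linear model: the fitted intercept is the sample mean
fit <- lm(ratings_toy ~ 1)
intercept <- coef(fit)[["(Intercept)"]]

intercept
mean(ratings_toy)
```

Both values printed at the end should agree up to floating-point rounding.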

Please note that colors[1:2] is just a vector with specific blue and red colors, and custom_theme is a ggplot2 theme that customizes this plot (basically font families, sizes, colors, and other similar things). As you can see, I keep them in a separate pudding_theme.R file for easier code reuse and more clarity (but feel free to ask if you want to have a look at it).
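
Since pudding_theme.R is not shown here, the following hypothetical stand-in defines the three objects the plot code expects (colors, credits_imdb, and custom_theme) so the plot can be run. The hex codes, caption text, and theme settings are arbitrary assumptions, not the original values.

```r
library(ggplot2)

# Hypothetical stand-ins; the actual pudding_theme.R is not shown in this post.
colors <- c("#1f77b4", "#d62728")          # a blue and a red (arbitrary choices)
credits_imdb <- "Source: IMDb data files"  # placeholder caption text

# A partial theme: it only overrides the listed elements
custom_theme <- theme(
  text             = element_text(size = 12),
  plot.title       = element_text(face = "bold"),
  panel.grid.minor = element_blank()
)
```

A partial theme() is used on purpose: the plot adds custom_theme after theme(legend.position = "none"), and a complete theme such as theme_minimal() in that position would reset the legend setting.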

Finally, remember that we can save our plots (e.g. as SVG) by assigning the result of the ggplot() call to a variable (e.g. p) and then using the ggsave() function.

ggsave(filename = "friends-plot.svg", plot = p, width = 10, height = 8)
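
As a complete illustration of the assign-then-save pattern, here is a small sketch with a toy data frame. It writes a PNG to a temporary directory rather than an SVG, because saving SVG with ggsave() relies on the svglite package being installed. The data and file name are made up.

```r
library(ggplot2)

# Toy data standing in for friends_ratings
toy <- data.frame(episode = 1:4, rating = c(8.4, 8.1, 8.2, 9.0))

# Assign the plot to a variable instead of printing it
p <- ggplot(toy) +
  aes(x = episode, y = rating) +
  geom_point()

# Save it; ggsave() picks the device from the file extension
out <- file.path(tempdir(), "toy-plot.png")
ggsave(filename = out, plot = p, width = 10, height = 8)
file.exists(out)
```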