Hip hop and basketball have always had a unique and close relationship, so it is not surprising when player names and various terms from the sport appear in rap lyrics. This post works through the R code needed to find NBA player names in the lyrics of ~4000 songs from 21 hip hop artists and to plot the resulting patterns.

I’m writing this mainly to document several tricks and hacks for working with and ultimately plotting large(ish) volumes of text-based data. The choice of artists is a mix of rappers that I listen to, and others that seemed well-represented in the Genius lyrics data.

The overall workflow is split into getting a hold of lyrics and NBA player names and finding out: which artists are mentioning which players in their songs, and which players get mentioned most.

Getting the lyrics

Until last year, lyrics could be accessed from the Genius website (a massive collection of song lyrics and crowd-sourced annotations) via the Genius API with the geniusr package. Because of changes to the Genius API and its legal terms, we can no longer fetch lyrics directly with geniusr without resorting to legally gray web scraping (if you must, the relevant geniusr functions can be patched with the advice in this issue and they’ll work fine, but see here for more information).

This little example uses many cool packages, so let’s set them up first.

# bball player names in hip hop
# Load libraries ----
library(geniusr)   # [github::ewenme/geniusr] v1.2.0.9000
library(nbastatR)  # [github::abresler/nbastatR] v0.1.1506
library(rvest)     # CRAN v1.0.2
library(xml2)      # CRAN v1.3.3
library(rlang)     # CRAN v1.0.1
library(tibble)    # CRAN v3.1.6
library(dplyr)     # CRAN v1.0.8
library(purrr)     # CRAN v0.3.4
library(stringr)   # CRAN v1.4.0
library(tidytext)  # CRAN v0.3.2
library(tidyr)     # CRAN v1.2.0
library(forcats)   # CRAN v0.5.1
library(ggplot2)   # CRAN v3.3.5
library(artyfarty) # [github::datarootsio/artyfarty] v0.0.1
library(ggfittext) # CRAN v0.9.1
library(ggtext)    # CRAN v0.1.1
library(scico)     # CRAN v1.3.0
library(extrafont) # CRAN v0.17
library(fuzzyjoin) # CRAN v0.1.6

Back when the API still produced tibbles with lyrics, I searched for the ID of each of the following artists, then used each artist ID to obtain a vector with all of that artist’s unique song IDs.

The artist names:

c("Migos", "Ghostface Killah", "2 Chainz", "G-Unit", "Beastie Boys", "Army of the Pharaohs", "Flatbush Zombies",
"R.A. The Rugged Man", "Das EFX", "Jedi Mind Tricks", "Gang Starr", "Mobb Deep", "Wu-Tang Clan", "Kool G Rap", 
"DMX", "MF DOOM", "213", "Goodie Mob", "Sage Francis")         

Here are examples for two of the artists. Each vector of song IDs gets named consistently, so we can combine them all later (I didn’t iterate here so I could check that I was getting the correct matches for each search term).

search_artist("outerspace") # 1836
OSall <- geniusr::get_artist_songs_df(1836)
allOSsong_ids <- OSall %>% pull(song_id)

geniusr::search_artist("gang starr") # 220
GSTall <- geniusr::get_artist_songs_df(220)
allGSTsong_ids <- GSTall %>% pull(song_id)

Once we have found all the artists we want, we can get the vector objects from our environment with mget.

### combine all artists
all_song_ids <- flatten_chr(mget(ls(pattern = "^all")))
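As a minimal sketch of this collect-by-name trick, here is a toy version in base R. The object names and IDs are made up for illustration, and the tighter regular expression is an assumption on my part: a bare "^all" would also pick up any other object whose name starts with "all" (including the combined vector itself if the code is re-run).

```r
# Toy stand-ins for the per-artist ID vectors (made-up IDs, illustration only)
allAAsong_ids <- c("101", "102")
allBBsong_ids <- c("201", "202", "203")

# ls() lists the object names matching the pattern; mget() fetches those
# objects as a named list; unlist() flattens the list into one vector.
# The pattern requires letters/digits between "all" and "song_ids", so an
# object called all_song_ids would not be matched.
id_vectors <- mget(ls(pattern = "^all[A-Za-z0-9]+song_ids$"))
combined_ids <- unlist(id_vectors, use.names = FALSE)

length(combined_ids)
```

Because ls() returns names in alphabetical order, the combined vector follows the alphabetical order of the object names rather than the order in which the artists were searched.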

To get all the lyrics, we can iterate over the final vector of song IDs (with a multi-statement lambda function to add some time between requests and not spam the server).

# iterate and get all song lyrics
all_lyrsdf <- map_df(all_song_ids,~ {
  Sys.sleep(sample(seq(0.5,1,0.25),1)) # pause 0.5, 0.75 or 1 s between requests
  get_lyrics_id(song_id = .x)})
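With thousands of requests, a single failed call would abort the whole loop. One possible safeguard, sketched here under assumptions, is to wrap the fetching function with purrr::possibly(). The fake_get_lyrics() function below is a hypothetical stand-in for the real network call, and the IDs are made up.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical stand-in for a network call: fails for one of the IDs
fake_get_lyrics <- function(song_id) {
  if (song_id == "2") stop("request failed")
  tibble(song_id = song_id, line = paste("line from song", song_id))
}

# possibly() returns a fallback value (here NULL) instead of raising an error
safe_get_lyrics <- possibly(fake_get_lyrics, otherwise = NULL)

toy_lyrics <- map(c("1", "2", "3"), ~ {
  Sys.sleep(0.01)        # short pause; the real loop waits 0.5 to 1 s
  safe_get_lyrics(.x)
}) %>%
  bind_rows()            # NULL results are dropped when binding

toy_lyrics
```

The failed ID simply does not appear in the result, so comparing the returned song IDs with the input vector shows which requests would need a retry.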

The resulting tibble has >200,000 rows, one per line for roughly 4000 songs. Here’s a random sample of 10 rows.

all_lyrsdf %>% slice_sample(n=10)
# A tibble: 10 × 6
   line            section_name section_artist song_name artist_name song_id
   <chr>           <chr>        <chr>          <chr>     <chr>       <chr>  
 1 I know what it… Verse 3      Young Buck     Footprin… G-Unit      20762  
 2 Dracos and MAC… Verse 3      Offset & Quavo Racks 2 … Migos       5555420
 3 I don't trust … Hook         Quavo          Trust No… Migos       2340835
 4 Leaving my dre… Verse 3      Offset         White Sa… Migos       3468920
 5 All you know i… Verse 3      2 Chainz       I Said Me 2 Chainz    4346684
 6 Pickin my mark… Havoc        Mobb Deep      It’s Over Mobb Deep   33514  
 7 You should see… Chorus       The Notorious… The Dang… R.A. The R… 136500 
 8 Close your ear… Intro        Havoc          So Long   Mobb Deep   33536  
 9 I'm off style … Verse 3      Ghostface Kil… Ron ONe…  Wu-Tang Cl… 491657 
10 A lousy condit… Yeeeaah, yo… Louis Logic    Over the… Louis Logic 29385  

For reference, this post by Tom MacNamara also shows how to get and visualize lyrics from the same data source.

Text analysis

For this exercise, let’s split up (tokenize) all the lines into bigrams (consecutive sequences of two words) with unnest_tokens from tidytext. For the next step, it’s also convenient to add two new columns holding the first and second word of each bigram.

# tokenize 
lyric_bigrams <- all_lyrsdf %>% unnest_tokens(BGlyric,line, token = "ngrams", n=2) %>% 
  filter(!is.na(BGlyric)) %>% separate(BGlyric,into=c("w1","w2"),sep = " ",remove = FALSE)
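To see what the tokenizing step produces, here is a small sketch on two made-up lines (not taken from the lyrics data). It uses the same calls as above. Note that unnest_tokens lowercases the text by default.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two made-up lines; the second is too short to form a bigram
toy_lines <- tibble::tibble(
  song_id = "1",
  line    = c("Ball like Kobe Bryant", "Yeah")
)

toy_bigrams <- toy_lines %>%
  unnest_tokens(BGlyric, line, token = "ngrams", n = 2) %>%
  filter(!is.na(BGlyric)) %>%   # drop lines with fewer than two words
  separate(BGlyric, into = c("w1", "w2"), sep = " ", remove = FALSE)

toy_bigrams
```

The four-word line yields three overlapping bigrams, which is why the bigram table ends up with many more rows than the lyrics table has lines.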

To reduce the number of comparisons, and because people’s names are the whole point of this exercise, we can drop every bigram in which either word appears in a custom list of stop words (extremely common words not useful for analysis, such as “the”, “of”, “to”, etc.) read from a generic text file.

# stopwords
stopWords <- tibble(word=readr::read_lines("minimal-stop.txt"))
lir <- lyric_bigrams %>% filter(!w1 %in% stopWords$word & !w2 %in% stopWords$word)
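A toy version of the stop word filter, with a hypothetical three-word stop list standing in for the contents of the text file (which are not shown in this post):

```r
library(dplyr)

# Made-up bigrams, already split into their two words
toy_bigrams <- tibble::tibble(
  w1 = c("the",   "kobe",   "like"),
  w2 = c("court", "bryant", "the")
)

# Hypothetical stop list; the real one is read from a text file
toy_stop <- c("the", "of", "to")

# Keep a bigram only when neither of its words is a stop word
kept <- toy_bigrams %>%
  filter(!w1 %in% toy_stop, !w2 %in% toy_stop)

kept
```

One consequence of this design is that any name containing a word on the stop list would be dropped before matching, so the stop list is best kept minimal.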

Player names

The dictionary_bref_players() function from nbastatR gets us a dictionary of NBA player names (from the Basketball Reference website) in tibble form. It needs no arguments; to simplify the process, we then remove names from the BAA league.

# NBA player names
playerNames <- nbastatR::dictionary_bref_players()
playerNames <- playerNames %>% filter(!is.na(countSeasons))

Fuzzy joining

For a more flexible merge of the two columns (the player names and the lyric bigrams), the fuzzyjoin package implements various methods for fuzzy string matching, to allow for minor variations in spelling. For no particular reason, I matched the columns with Levenshtein distance and a maximum distance of one. Be aware that all these comparisons consume a lot of memory.

joineddfs <- 
  stringdist_left_join(playerNames,
                       lir,by=c("namePlayerBREF"="BGlyric"),max_dist=1,
                       method=c("lv"),
                       ignore_case=TRUE) 
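To get a feel for what a maximum Levenshtein distance of one allows, here is a small sketch using base R's adist(), which computes the same kind of edit distance (insertions, deletions and substitutions). The candidate bigrams are made up for illustration.

```r
# One player name (lowercased) against a handful of made-up bigrams
player     <- "steve nash"
candidates <- c("steve nash", "steve cash", "steven ash",
                "stevie nash", "steve nashville")

# adist() returns a matrix of edit distances; flatten it to a plain vector
d <- as.vector(adist(player, candidates))

# Bigrams that a max_dist = 1 join would accept
within_one <- candidates[d <= 1]

data.frame(candidates, d)
```

An exact match has distance zero, while a single substituted or inserted letter still passes the threshold. This tolerates small spelling variations but can also let through unrelated phrases that happen to be one letter away from a name.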

After some cleaning and deduplicating, for the main visualization we can count how many times each player was mentioned. I manually filtered out eight homonyms (I doubt that lyrics mentioning Michael Jackson or Mel Gibson referred to a Knicks point guard (1987-1990) or a guard for the Lakers in 1964, respectively). I most likely missed some.

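# clean and deduplicate
playermentions <- joineddfs %>% filter(!is.na(BGlyric)) %>% 
  distinct(song_id,BGlyric,.keep_all = TRUE)
# count mentions per player (this step is not shown in the original code;
# reconstructed here on the assumption that it is a simple sorted count)
mentions_count <- playermentions %>% count(namePlayerBREF,sort = TRUE)
# remove homonyms
mentions_countF <- 
  mentions_count %>% filter(!namePlayerBREF %in% c("Michael Jackson","Dan King",
                                                   "Bill Smith","Ed Horton","Larry Sanders",
                                                   "Bobby Brown","Mel Gibson","Harry Davis"))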

The final data for plotting results from a series of hacks to rank and arrange the names according to their number of mentions (cur_group_id() is our friend here), and then to ‘conditionally’ wrap some of the names so they fit nicely in the plot. My improvised approach for categories with many names was to stack them side by side with lead, slice out every other row (note the %% operator), then put things back together.

# prepare data for plotting
freqsPL <- 
  mentions_countF %>% add_count(n,name = "ncat") %>% 
  mutate(nr=forcats::fct_inorder(as.factor(n))) %>% 
  group_by(nr) %>% 
  arrange(desc(namePlayerBREF),.by_group = TRUE) %>% 
  mutate(rank = cur_group_id()) %>% 
  mutate(forticks=paste0(n)) 

freqsPLtop <- freqsPL %>% filter(n>2) %>% 
  mutate(namePlayerBREF=str_wrap(namePlayerBREF,width = 12))
freqsPLmid <- freqsPL %>% filter(n==2) %>% 
  mutate(namePlayerBREF=paste(namePlayerBREF," ", lead(namePlayerBREF))) %>% 
  slice(which(row_number() %% 2 == 1)) 
freqsPLbottom <- freqsPL %>% filter(n==1) %>% 
  mutate(namePlayerBREF=paste(namePlayerBREF," ", lead(namePlayerBREF))) %>% 
  slice(which(row_number() %% 2 == 1)) 

allfreqsPL <- bind_rows(freqsPLtop,freqsPLmid,freqsPLbottom) %>% 
  # drop the dangling "NA" (and its padding) left by lead() on unpaired names
  mutate(namePlayerBREF=str_remove(namePlayerBREF,"\\s+NA$"))
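The ranking and pairing hacks are easier to follow on a tiny example. This sketch uses made-up player labels and counts. It builds the ordered factor with base factor() instead of forcats, which is a substitution on my part that should behave the same way here.

```r
library(dplyr)

# Made-up counts: two players with 3 mentions, three players with 1
toy_counts <- tibble::tibble(
  player = c("A One", "B Two", "C Three", "D Four", "E Five"),
  n      = c(3L, 1L, 1L, 1L, 3L)
)

# Rank the count categories: cur_group_id() numbers the groups in the
# order of the factor levels, so the largest count gets rank 1
ranked <- toy_counts %>%
  arrange(desc(n)) %>%
  mutate(nr = factor(n, levels = unique(n))) %>%
  group_by(nr) %>%
  mutate(rank = cur_group_id())

# Pair names side by side: glue each name to the next one with lead(),
# keep only odd-numbered rows, then strip the "NA" left on an unpaired name
paired <- ranked %>%
  filter(n == 1) %>%
  mutate(player = paste(player, lead(player), sep = "   ")) %>%
  slice(which(row_number() %% 2 == 1)) %>%
  mutate(player = sub("\\s+NA$", "", player))

paired$player
```

With three names in the category, the first two end up on one row and the third is left on its own, which halves the height of the stack in the final plot.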
  

Plotting

Before plotting, let’s set up a gradient background for the plot panel following this entry, and a separate tibble for a colorful annotation using ggtext.

# gradient background grob
g <- grid::rasterGrob(c("#272822","black"), width=unit(1,"npc"), height = unit(1,"npc"), 
                      interpolate = TRUE) 
# custom text annotation
forsubtitle <- 
  tibble(label = "<span style = 'color: #0F58FF;'>**3988**</span> songs from <span style = 'color: #0F58FF;'>**21**</span> artists <br> <span style = 'color: #F95C09;'>**5022**</span> player names",
         x = 0.7, y = 35)

The data now looks like this:

> allfreqsPL
# A tibble: 90 × 6
# Groups:   nr [8]
   namePlayerBREF          n  ncat nr     rank forticks
   <chr>               <int> <int> <fct> <int> <chr>   
 1 "Michael\nJordan"       9     1 9         1 9       
 2 "Scottie\nPippen"       8     1 8         2 8       
 3 "Shaquille\nO'Neal"     6     3 6         3 6       
 4 "Reggie\nMiller"        6     3 6         3 6       
 5 "LeBron James"          6     3 6         3 6       
 6 "Yao Ming"              5     6 5         4 5       
 7 "Steve Nash"            5     6 5         4 5       
 8 "Shawn Kemp"            5     6 5         4 5       
 9 "Paul Pierce"           5     6 5         4 5       
10 "Gilbert\nArenas"       5     6 5         4 5       
# … with 80 more rows  

Now we can plot the mentions as text stacked inside bars, using ggfittext for dynamic resizing. The ‘Rock Salt’ font is from Google Fonts, downloaded to my Linux system using Typecatcher and shown using extrafont.

ggplot(allfreqsPL,aes(x=factor(rank),y= n,label=namePlayerBREF))+
  annotation_custom(g, xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) + 
  geom_bar_text(position="stack",min.size = 2,
                family="Rock Salt",reflow = FALSE,
                grow = TRUE, place="left",outside = TRUE)+
  scale_x_discrete(breaks=1:8,labels=unique(freqsPL$forticks),
                   expand = expansion(add=c(0.01,0.5)))+
  scale_y_continuous(expand = c(0.01,0.01))+
  geom_richtext(data=forsubtitle, aes(x=x,y=y,label=label),
                color="white",family="Lato Thin",size=9,
                fill = NA, label.color = NA,hjust= 0)+
  theme(panel.grid.major = element_blank(),
        axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.background=element_blank(),
        panel.border = element_blank(),
        plot.background=element_rect(fill = "black"),
        plot.title = element_text(vjust = -0.5,size=32, hjust=0.03,  
                                  family = "Lato Medium",color="white"),
        axis.text.x = element_text(family="Lato Heavy",size=19),
        axis.ticks = element_blank(),
        axis.title.x = element_text(color="white",size=17))+
  labs(x="mentions",y="",
       title="NBA players mentioned in hip hop songs")

As a complement, we can figure out the number of distinct players mentioned by the different artists.

# to get player per artist
lyrsplyrs <- 
  joineddfs %>% 
  filter(!is.na(BGlyric)) %>% 
  select(namePlayerBREF,song_id,song_name,artist_name) %>% 
  mutate(song_id=as.character(song_id)) %>% 
  left_join(all_lyrsdf,by=c("song_id","song_name","artist_name")) %>% 
  filter(!namePlayerBREF %in% c("Michael Jackson","Dan King",
                                "Bill Smith","Ed Horton","Larry Sanders",
                                "Bobby Brown","Mel Gibson","Harry Davis"))

artists_nPlayers <- 
  lyrsplyrs %>% distinct(artist_name,song_id,namePlayerBREF) %>% 
  group_by(artist_name) %>% 
  distinct(namePlayerBREF) %>% count(artist_name) %>% 
  arrange(desc(n))  
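The same count can also be written with n_distinct(), which may be easier to read. This is a sketch of an equivalent formulation on made-up data, not the code used for the results in this post.

```r
library(dplyr)

# Made-up mentions: Artist A names two different players, Artist B one
toy_mentions <- tibble::tibble(
  artist_name    = c("Artist A", "Artist A", "Artist A", "Artist B"),
  song_id        = c("1", "2", "2", "3"),
  namePlayerBREF = c("Player X", "Player X", "Player Y", "Player X")
)

# One row per artist with the number of distinct players mentioned
toy_nPlayers <- toy_mentions %>%
  group_by(artist_name) %>%
  summarise(n = n_distinct(namePlayerBREF)) %>%
  arrange(desc(n))

toy_nPlayers
```

A player mentioned in several songs by the same artist is counted once, and summarise() returns an ungrouped tibble here, which avoids carrying the grouping into later steps.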

After some basic wrangling we get this two-column tibble with artists and the total number of players mentioned.

 artists_nPlayers
# A tibble: 21 × 2
# Groups:   artist_name [21]
   artist_name              n
   <chr>                <int>
 1 Migos                   67
 2 Ghostface Killah        28
 3 2 Chainz                16
 4 G-Unit                  14
 5 Beastie Boys            13
 6 Army of the Pharaohs    11
 7 Flatbush Zombies        10
 8 R.A. The Rugged Man      8
 9 Das EFX                  7
10 Jedi Mind Tricks         7
# … with 11 more rows  

To plot these values we can use geom_segment:

artists_nPlayers %>% filter(n>1) %>% 
  ggplot()+
  geom_segment(aes(x=fct_reorder(artist_name,n),
                   xend=fct_reorder(artist_name,n),y=0,yend=n,color=n))+
  geom_text(aes(x=fct_reorder(artist_name,n),y=n+0.5,label=n),color="gray",
            family="Lato Medium",size=5)+
  coord_flip()+
  scale_color_scico(palette = 'nuuk',direction = -1,guide='none')+
  scale_y_continuous(expand = expansion(add=c(0,1)))+
  labs(title="Number of players mentioned")+
  theme(
    plot.title =  element_text(color="white",family="Lato Medium"),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    axis.text.y = element_text(family = "Rock Salt",color="white",size=18),
    axis.text.x = element_blank(),
    plot.background = element_rect(fill="black"),
    panel.background = element_rect(fill="black"))     

Lastly, for those of us interested in the lines that actually contain player names, we can produce a tibble of artists, song names, and lines by cleaning up the special characters in the line variable and keeping only the rows which contain a name.

# clean weird characters, filter lines to keep mentions only
linesplyrs <- 
lyrsplyrs %>% mutate(linecln=str_remove_all(line,"[^\\w\\s]")) %>% 
  filter(stringi::stri_detect_regex(linecln,namePlayerBREF)) %>% 
  distinct(song_id,.keep_all = TRUE)
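One thing to keep in mind with this filter is that the punctuation is stripped from the line but not from the player name, and that the name is used as a regular expression. As a hedged sketch of one possible variation (not what produced the results here), both columns can be cleaned the same way and compared as fixed, case-insensitive strings with stringr. The lines below are made up.

```r
library(dplyr)
library(stringr)

# Same cleaning rule as above: keep only word characters and whitespace
clean <- function(x) str_remove_all(x, "[^\\w\\s]")

# Made-up lines paired with a player name that contains an apostrophe
toy <- tibble::tibble(
  line           = c("Dunk it like Shaquille O'Neal!", "Nothing to see here"),
  namePlayerBREF = c("Shaquille O'Neal", "Shaquille O'Neal")
)

matched <- toy %>%
  mutate(linecln = clean(line),
         namecln = clean(namePlayerBREF)) %>%
  # fixed() treats the name literally, so "." or "'" have no special meaning
  filter(str_detect(linecln, fixed(namecln, ignore_case = TRUE)))

matched$linecln
```

Without cleaning the name as well, the cleaned line has lost its apostrophe while the name has not, so a name like this one would never be detected.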

A random sample of the lines with player names:

> linesplyrs %>% select(song_name,artist_name,linecln) %>% 
+   slice_sample(n=6) %>% as_tibble()
# A tibble: 6 × 3
  song_name                            artist_name      linecln                                                         
  <chr>                                <chr>            <chr>                                                           
1 Put ’Em In The Grave (Funeral Remix) Jedi Mind Tricks I take my Glock and I point god point guard like Brevin Knight   
2 The Black Diamonds                   Ghostface Killah Like Mike Harris                                                
3 Can’t Go Out Sad                     Migos            Quavo Paul Pierce em whip a ball Wilson                         
4 Drip (Remix)                         Migos            No Vince Carter fifteen with the carbon                         
5 Look at My Dab                       Migos            Michael Jordan Im perfecting my craft                           
6 Represent the Real                   Das EFX          This for my block handlin rock like Kenny Anderson

Nice!

This simplistic approach was for a limited number of artists, and by matching full names as they appear in the player dictionary (without considering nicknames, partial matches, or abbreviations) I’m missing out on many more mentions. Still, this code should document a few things I needed to learn such as ranking groups, slicing every other row, text sizing, and colorful annotations.

As usual, feel free to contact me with any questions or comments.