Hip hop and basketball have always had a unique and close relationship, so it is not surprising when player names and various terms from the sport appear in rap lyrics. This post works through the R code needed to find NBA player names in the lyrics of ~4000 songs from 20 hip hop artists and to plot the resulting patterns.
I’m writing this mainly to document several tricks and hacks for working with and ultimately plotting large(ish) volumes of text-based data. The choice of artists is a mix of rappers that I listen to, and others that seemed well-represented in the Genius lyrics data.
The overall workflow is split into two parts: getting hold of the lyrics and the NBA player names, and then finding out which artists mention which players in their songs and which players get mentioned most.
Getting the lyrics
Until last year, lyrics could be accessed from the Genius website (a massive collection of song lyrics and crowd-sourced annotations) via the Genius API with the geniusr package. At present, because of changes to the Genius API and legal terms, we can no longer fetch lyrics directly with geniusr without resorting to legally gray web scraping (if you must, the relevant geniusr functions can be patched with the advice in this issue and they’ll work fine - but see here for more information).
This little example uses many cool packages, so let’s set them up first.
Back when the API still produced tibbles with lyrics, I searched for the ID of each of the following artists, and used each artist ID to obtain a vector with all of that artist's unique song IDs:
The artist names:
Here are examples for two of the artists. Each vector of song IDs gets named consistently, so we can combine them all later (I didn’t iterate here so I could check that I was getting the correct matches for each search term).
Once we have found all the artists we want, we can get the vector objects from our environment with mget.
To get all the lyrics, we can iterate over the final vector of song IDs (with a multi-statement lambda function to add some time between requests and not spam the server).
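As a minimal sketch of this step (not the original code), the snippet below collects consistently named vectors with `mget`. The `songids_` prefix and the ID values are made up for illustration; the real vectors came from the Genius API.

```r
# Hypothetical song ID vectors; the names and values are placeholders
songids_nas <- c(101L, 102L, 103L)
songids_common <- c(201L, 202L)

# Work in the current environment (the global environment at top level)
env <- environment()

# Find every object whose name starts with the shared prefix
id_names <- ls(envir = env, pattern = "^songids_")

# mget() returns a named list with one element per object name
id_list <- mget(id_names, envir = env)

# Flatten the list into a single vector of unique song IDs
all_song_ids <- unique(unlist(id_list, use.names = FALSE))
```

Naming the vectors with a common prefix is what makes the `pattern` argument of `ls` useful here: new artists can be added without touching the combining step.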
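The iteration could look roughly like the sketch below. Because the lyrics endpoint is no longer available, `fetch_lyrics()` is a hypothetical stand-in that returns a placeholder row; the pause length is also an assumption.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the real request; returns one placeholder row
fetch_lyrics <- function(song_id) {
  data.frame(
    song_id = song_id,
    line = paste("placeholder line for song", song_id)
  )
}

song_ids <- c(101L, 102L, 103L)

# Multi-statement anonymous function: pause first, then fetch
lyrics_list <- map(song_ids, function(id) {
  Sys.sleep(0.1)  # short pause between requests (assumed value)
  fetch_lyrics(id)
})

# Stack the per-song results into one data frame
all_lyrics <- bind_rows(lyrics_list)
```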
The resulting tibble has >200,000 rows, one per line for roughly 4000 songs. Here’s a random sample of 10 rows.
For reference, this post by Tom MacNamara also shows how to get and visualize lyrics from the same data source.
For this exercise, let’s split up (tokenize) all the lines into bigrams (consecutive sequences of two words) with unnest_tokens from tidytext. For the next step, it’s also convenient to keep the two words of each bigram in separate columns of their own.
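Here is a small sketch of the tokenizing step on two invented lines. The column names (`line`, `bigram`, `word1`, `word2`) are assumptions rather than the names used in the original data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy lyrics; the artists and lines are invented for illustration
lyrics <- tibble(
  artist = c("Artist A", "Artist B"),
  line = c("Shooting like Ray Allen", "Feeling like Kobe Bryant tonight")
)

bigrams <- lyrics %>%
  # One row per pair of consecutive words (lowercased by default)
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  # Keep the bigram and also store each word in its own column
  separate(bigram, into = c("word1", "word2"), sep = " ", remove = FALSE)
```

A line with k words yields k - 1 bigrams, which is why the tokenized table is much longer than the original one.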
To reduce the number of comparisons and because people’s names are the whole point of this exercise, we can filter out the rows in the lyrics data which match any of the words in a custom list of stop words (extremely common words not useful for analysis, such as “the”, “of”, “to”, etc.) from a generic text file.
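A sketch of the stop word filter, assuming the bigram words live in columns called `word1` and `word2`. The stop word vector is typed in by hand here; in practice it would be read from the text file, for example with `readLines()` on a path of your choosing.

```r
library(dplyr)

# Hypothetical stop word list; the original came from a generic text file
custom_stop_words <- c("the", "of", "to", "like", "a")

bigrams <- tibble(
  word1 = c("shooting", "like", "ray"),
  word2 = c("like", "ray", "allen")
)

# Drop a bigram if either of its words is a stop word
bigrams_filtered <- bigrams %>%
  filter(
    !word1 %in% custom_stop_words,
    !word2 %in% custom_stop_words
  )
```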
The dictionary_bref_players() function from nbastatR will get us a dictionary of NBA player names (from the Basketball Reference website) in tibble form. No arguments are needed, but we can remove names from the BAA (the NBA's predecessor league) to simplify the process.
For a more flexible merge of the two columns (the player names and the lyric bigrams), the fuzzyjoin package implements various methods for fuzzy string matching, to allow for minor variations in spelling. For no particular reason, I matched the columns with Levenshtein distance and a maximum distance of one. Be aware that all these comparisons consume a lot of memory.
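The fuzzy merge can be sketched as follows with `stringdist_inner_join()`. The toy tables and the column names `bigram` and `player` are assumptions, and the player names are assumed to be lowercased already so they are comparable with the tokenized lyrics.

```r
library(dplyr)
library(fuzzyjoin)

# Toy inputs; names are lowercased to match the tokenized lyrics
lyric_bigrams <- tibble(
  bigram = c("ray allen", "kobe bryant", "kobe bryan", "last night")
)
players <- tibble(
  player = c("ray allen", "kobe bryant", "tim duncan")
)

# Keep pairs whose Levenshtein distance is at most one
matches <- stringdist_inner_join(
  lyric_bigrams, players,
  by = c("bigram" = "player"),
  method = "lv",
  max_dist = 1,
  distance_col = "dist"
)
```

Every bigram is compared against every player name, so the number of comparisons grows with the product of the two table sizes. Filtering stop words beforehand keeps that product manageable.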
After some cleaning and deduplicating, for the main visualization we can count how many times each player was mentioned. I manually filtered out eight homonyms (I doubt that lyrics mentioning Michael Jackson or Mel Gibson referred to a Knicks point guard (1987-1990) or a guard for the Lakers in 1964, respectively). I most likely missed some.
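Counting mentions after excluding namesakes might look like the sketch below. The `matches` table and the `namesakes` vector are illustrative; only two of the eight excluded names are given in the text, so only those appear here.

```r
library(dplyr)

# Toy matches; artists are placeholders
matches <- tibble(
  artist = c("Artist A", "Artist A", "Artist B", "Artist B"),
  player = c("kobe bryant", "michael jackson", "kobe bryant", "ray allen")
)

# Names that almost certainly refer to someone other than the NBA player
namesakes <- c("michael jackson", "mel gibson")

mention_counts <- matches %>%
  filter(!player %in% namesakes) %>%
  count(player, sort = TRUE, name = "n_mentions")
```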
The final data for plotting results from a series of hacks to rank and arrange the names according to their number of mentions (cur_group_id() is our friend here), and then to ‘conditionally’ wrap some of the names so they fit nicely in the plot. My improvised approach for categories with many names was to stack them side by side with lead, slice out every other row (note the %% operator), then put things back together.
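The ranking and "every other row" tricks can be illustrated on a toy table. This is a sketch of the general idea rather than the original code; the column names and the comma separator are assumptions.

```r
library(dplyr)

mention_counts <- tibble(
  player = c("kobe bryant", "michael jordan", "ray allen",
             "tim duncan", "steve nash"),
  n_mentions = c(5, 5, 2, 2, 2)
)

paired <- mention_counts %>%
  group_by(n_mentions) %>%
  mutate(
    # One integer per group, assigned in ascending order of n_mentions
    rank_id = cur_group_id(),
    # Place each name next to the following name in the same group
    neighbour = lead(player),
    label = if_else(
      is.na(neighbour), player,
      paste(player, neighbour, sep = ", ")
    )
  ) %>%
  # Keep rows 1, 3, 5, ... so each name appears in exactly one label
  filter(row_number() %% 2 == 1) %>%
  ungroup() %>%
  select(rank_id, n_mentions, label)
```

Groups with an odd number of names end with a row whose `lead()` value is `NA`, which is why the label falls back to the single name in that case.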
Before the plotting, let’s set up a gradient background for the plot panel following this entry, and a separate tibble for a colorful annotation using ggtext.
The data now looks like this:
Now we can plot the mentions as text stacked inside bars, using ggfittext for dynamic resizing. The ‘Rock Salt’ font is from Google Fonts, downloaded to my Linux system using Typecatcher and shown using extrafont.
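A stripped-down sketch of text fitted inside bars with ggfittext is shown below. The data are invented, the custom font and gradient background are left out, and `geom_bar_text()` is one possible way to do this, not necessarily the geom used for the original figure.

```r
library(ggplot2)
library(ggfittext)

# Toy plotting data: one bar per mention count, labelled with player names
plot_data <- data.frame(
  n_mentions = factor(c(1, 2, 5)),
  n_players = c(3, 2, 1),
  label = c(
    "ray allen, tim duncan, steve nash",
    "kobe bryant, michael jordan",
    "lebron james"
  )
)

p <- ggplot(plot_data, aes(x = n_mentions, y = n_players, label = label)) +
  geom_col(fill = "grey85") +
  # Shrink and wrap each label so it fits inside its bar
  geom_bar_text(reflow = TRUE, place = "middle", min.size = 4) +
  labs(x = "Number of mentions", y = "Number of players")
```

Labels that would need to shrink below `min.size` are dropped rather than drawn, so it is worth checking the rendered plot for missing names.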
As a complement, we can figure out the number of distinct players mentioned by the different artists.
After some basic wrangling we get this two-column tibble with artists and the total number of players mentioned.
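The wrangling for this summary can be as short as the sketch below, assuming a table of matches with `artist` and `player` columns (both names are assumptions, and the rows are invented).

```r
library(dplyr)

matches <- tibble(
  artist = c("Artist A", "Artist A", "Artist A", "Artist B"),
  player = c("kobe bryant", "kobe bryant", "ray allen", "kobe bryant")
)

# Count each player once per artist, however often they are mentioned
players_per_artist <- matches %>%
  group_by(artist) %>%
  summarise(n_players = n_distinct(player)) %>%
  arrange(desc(n_players))
```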
To plot these values we can use geom_segment.
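One way to use `geom_segment` for this kind of summary is a lollipop-style chart, sketched here with invented numbers. The styling choices are assumptions and do not reproduce the original figure.

```r
library(ggplot2)

# Toy summary table; the counts are placeholders
players_per_artist <- data.frame(
  artist = c("Artist A", "Artist B", "Artist C"),
  n_players = c(12, 7, 3)
)

p <- ggplot(players_per_artist, aes(y = reorder(artist, n_players))) +
  # One horizontal segment per artist, from zero to the player count
  geom_segment(aes(x = 0, xend = n_players,
                   yend = reorder(artist, n_players))) +
  # A point at the end of each segment marks the value
  geom_point(aes(x = n_players), size = 3) +
  labs(x = "Distinct players mentioned", y = NULL)
```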
Lastly, for those of us interested in the lines that actually contain player names, we can produce a tibble of artists, song names, and lines by cleaning up the special characters in the line variable and keeping only the rows which contain a name.
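A sketch of that clean-up, assuming columns named `artist`, `song`, and `line` and a short vector of matched player names. The regular expression used to strip special characters is one reasonable choice, not necessarily the original one.

```r
library(dplyr)
library(stringr)

# Toy lyrics; artists, songs, and lines are invented
lyrics <- tibble(
  artist = c("Artist A", "Artist B", "Artist C"),
  song = c("Song 1", "Song 2", "Song 3"),
  line = c(
    "Shooting like Ray Allen!!",
    "Nothing to see here...",
    "(Kobe Bryant) with the fadeaway"
  )
)

player_names <- c("ray allen", "kobe bryant")

# Case-insensitive pattern that matches any of the names
name_pattern <- regex(str_c(player_names, collapse = "|"), ignore_case = TRUE)

lines_with_names <- lyrics %>%
  # Remove everything except letters, digits, whitespace, and apostrophes
  mutate(line = str_replace_all(line, "[^[:alnum:][:space:]']", "")) %>%
  filter(str_detect(line, name_pattern))
```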
A random sample of the lines with player names:
This simplistic approach was for a limited number of artists, and by matching full names as they appear in the player dictionary (without considering nicknames, partial matches, or abbreviations) I’m missing out on many more mentions. Still, this code should document a few things I needed to learn such as ranking groups, slicing every other row, text sizing, and colorful annotations.
As usual, feel free to contact me with any questions or comments.