With Mammal March Madness happening this month, I’ve been seeing a lot of common names for mammalian species in my Twitter feed. This year in particular, two of the divisions are based directly on common names: Adjective Mammals (e.g. spectacled bear, pouched rat, clouded leopard, etc.) and Two Animals One Mammal (e.g. bearcat, tiger quoll, hog badger, etc.). I recently had to figure out how to do text analysis for another project (in which I counted the most frequently used words in the titles of hundreds of papers), so I wondered if I could apply the same analysis code to the common names for mammals. It turns out I could.
This post has two parts: Part One is a straightforward text analysis of word frequency, and Part Two is a nifty approach to quantifying name lengths.
Part 1: What are the most frequent words in the common names of thousands of mammalian species?
I’m doing this post for common names in English because these made for the largest dataset. Personally, I rarely use common names and we don’t really have nearly as many in Spanish - although some of the ones we have are pretty cool (e.g. tlacuachín, chungungo, and viejo de monte).
For this post we’ll use the tidytext R package and a massive list of common names for mammals that’s available thanks to the IUCN Red List assessments. All the code here should be fully reproducible, although you will probably need to install various packages first.
Once we download the IUCN data (the same data used in another post), we use unnest_tokens() to split up the common names and end up with a row for every token (in this case, every word). With the words in this long format, we can easily quantify them using count(). Pretty cool and pretty simple.
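As a rough illustration of the tokenize-then-count step, here is a minimal sketch on a tiny made-up table. The `mammals` data frame and its column names are assumptions for the example, not the actual IUCN download.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the IUCN table
mammals <- tibble(
  species = c("Tremarctos ornatus", "Neofelis nebulosa",
              "Ursus arctos", "Cricetomys gambianus"),
  common_name = c("Spectacled Bear", "Clouded Leopard",
                  "Brown Bear", "Gambian Pouched Rat")
)

word_counts <- mammals %>%
  # one row per word; unnest_tokens() lowercases tokens by default
  unnest_tokens(word, common_name) %>%
  # tally each word, most frequent first
  count(word, sort = TRUE)

word_counts
```

Because the tokens are lowercased, "Bear" in two different names is counted as the same word.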
I chose an arbitrary number of 20 top words to plot, using ggalt and hrbrthemes to make crisp and minimalist lollipop charts.
Unsurprisingly, the top words for all orders reflect the names of the most diverse groups (rodents, bats, shrews, and primates).
To gain more insight, we can join the list of top words with the parts_of_speech data frame that comes with tidytext. This dataset contains hundreds of thousands of English words from the Moby Project by Grady Ward, with each one tagged as “Noun”, “Adverb”, “Adjective”, or “Verb”, among other options. For some of the top terms there were multiple matches (for example, “flying” as an adjective, a noun, and a verb), but we can keep the first match using slice(). I also fixed missing or mismatched terms manually using case_when().
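The join, keep-first-match, and manual-fix steps might look something like the sketch below. The `top_words` counts are invented, and the particular override shown in case_when() is only an example of the kind of manual correction described, not the actual list of fixes.

```r
library(dplyr)
library(tidytext)

# Hypothetical top words with made-up counts
top_words <- tibble(
  word = c("bat", "flying", "white"),
  n    = c(3L, 2L, 1L)
)

tagged <- top_words %>%
  # parts_of_speech ships with tidytext and has columns `word` and `pos`
  left_join(parts_of_speech, by = "word") %>%
  group_by(word) %>%
  slice(1) %>%          # keep only the first tag when a word has several
  ungroup() %>%
  mutate(pos = case_when(
    word == "flying" ~ "Adjective",  # example manual override (assumption)
    TRUE ~ pos                       # otherwise keep the joined tag
  ))

tagged
```

Words that are absent from the Moby list come through the join with a missing tag, which is where additional case_when() branches would be useful.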
I wrote a function to get the top n words, mainly as a way to document how non-standard evaluation works for unnest_tokens_ because I couldn’t find anything in the help files. Hint: it takes the arguments as character vectors.
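The sketch below shows one way such a helper could be written. It is not the original function: instead of unnest_tokens_(), it uses the current unnest_tokens() and turns the column name, passed as a character string, into a symbol with rlang::sym(). The function name `top_n_words` and the toy data are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: the text column is given as a character string
top_n_words <- function(df, text_col, n_top = 20) {
  df %>%
    # convert the string to a symbol and unquote it for unnest_tokens()
    unnest_tokens(word, !!rlang::sym(text_col)) %>%
    count(word, sort = TRUE) %>%
    head(n_top)   # keep the n_top most frequent words
}

# Made-up example data
bats_and_shrews <- tibble(
  common_name = c("Fruit Bat", "Vampire Bat", "Ghost Bat", "Water Shrew")
)

top_two <- top_n_words(bats_and_shrews, "common_name", n_top = 2)
top_two
```

Taking the column as a string makes the helper easy to call from other code, for example when the column name itself is stored in a variable.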
Among the top 20 words and ordered by frequency:
Nouns outnumbered adjectives (55% vs. 45%).
Tails, noses, and ears are the most common features used to describe species.
White, red, and black are the most common colors used to describe species.
Long, flying, and lesser are the most common descriptors of different attributes or animals.
Even when expanding to the top 50 words, people’s last names did not make it into the list, and the only place name is “African” at number 41. I kept looking, and “Thomas’s” appears in a 10-way tie at number 125, with 33 species that include “Thomas’s” in their common name (e.g. Thomas’s Shrew Tenrec, Microgale thomasi, and Thomas’s Giant Deer Mouse, Megadontomys thomasi). This is mainly because so many species have been named in dedication to British zoologist Oldfield Thomas.
We can then use faceting to repeat the plot for the six most speciose orders (>150 species), but with far fewer top words (for visibility).
We see that there is essentially no overlap in the most frequent words, but this is probably not the best way to visualize the high degree of mismatch. Instead, we can condense much more information into a heatmap, in this case created using Rebecca Barter’s superheat package.
To make the heatmap, we take advantage of dplyr’s grouping capabilities before running the function to get the top n words. Afterwards all we need to do is change the data from long to wide format (using tidyr::spread in the same way we would have used reshape2::cast), and wrangle the row names into the order we want to use.
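The grouping and reshaping steps can be sketched roughly as follows, starting from per-order word counts. The counts are invented, and the sketch stops at the matrix that a heatmap function would take as input rather than calling superheat itself.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-order word counts in long format
counts_long <- tibble(
  order = c("Rodentia", "Rodentia", "Rodentia",
            "Chiroptera", "Chiroptera", "Chiroptera"),
  word  = c("rat", "mouse", "vole", "bat", "eared", "fruit"),
  n     = c(10L, 8L, 3L, 12L, 5L, 2L)
)

wide <- counts_long %>%
  group_by(order) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # top 2 words within each order
  ungroup() %>%
  # long to wide: one column per order, 0 where a word is not in that order's top list
  spread(key = order, value = n, fill = 0L)

# Move the words into row names so the result is a plain numeric matrix
word_matrix <- as.matrix(wide[, -1])
rownames(word_matrix) <- wide$word
word_matrix
```

Rows of the matrix can then be reordered by indexing with a character vector of words in the desired order.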
The heatmap shows how there are no shared top words between the six most diverse orders, and also the frequency of each one.
Finally, I wrote a very crude function to generate new common names by just mashing up some of the popular words following three simple formulas (adjective adjective noun, noun noun, adjective nounnoun).
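A name masher along these lines could be as simple as the sketch below. The word lists and the function name `mash_name` are made up for illustration, and the original function may well work differently.

```r
# Hypothetical word pools; in practice these would come from the tagged top words
adjectives <- c("white", "long", "lesser", "red")
nouns      <- c("bat", "mouse", "shrew", "rat")

mash_name <- function(adjectives, nouns) {
  # pick one of the three formulas at random
  formula <- sample(c("adj_adj_noun", "noun_noun", "adj_nounnoun"), 1)
  switch(formula,
    adj_adj_noun = paste(c(sample(adjectives, 2), sample(nouns, 1)),
                         collapse = " "),
    noun_noun    = paste(sample(nouns, 2), collapse = " "),
    # two nouns glued together with no space, preceded by an adjective
    adj_nounnoun = paste(sample(adjectives, 1),
                         paste(sample(nouns, 2), collapse = ""))
  )
}

set.seed(1)  # for a repeatable draw
new_names <- replicate(5, mash_name(adjectives, nouns))
new_names
```

Sampling without replacement within each formula avoids names that repeat the same word twice.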
Most of the output makes no sense, but there were some funny ones, and I decided to draw a few with my epic MS Paint skills. Below is a sample of 100.
This function could be vastly improved by using more tags other than noun and adjective. For example, it’s possible to follow this post by Maëlle Salmon to separate animal names from other nouns.
100 random mashed up names
Part 2: Name lengths
Now we will quantify the length of different common names, in terms of both words and characters.
Before doing that, it’s worth noting that of the 5567 species in the dataset, 5350 have at least one common name listed. Also, 2407 species have more than one common name. The trick there was to use stringi’s stri_detect() to keep only the rows that contain commas and then count the number of rows.
To count the number of names for species that have many, I kept it simple and used stringi to count the number of commas (plus one).
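Both comma tricks can be sketched on a small made-up vector. This version uses the fixed-pattern variants stri_detect_fixed() and stri_count_fixed() from stringi, and the `common_names` strings are assumptions for the example.

```r
library(stringi)

# Hypothetical comma-separated common names, one string per species
common_names <- c(
  "Grey Wolf, Timber Wolf, Arctic Wolf",
  "Kob",
  "Wapiti, Tule Elk"
)

# Species with more than one common name: the string contains a comma
has_many <- stri_detect_fixed(common_names, ",")
n_with_many <- sum(has_many)

# Number of names per species: number of commas plus one
n_names <- stri_count_fixed(common_names, ",") + 1
n_names
```

This relies on commas being used only as separators, so a name that itself contained a comma would be overcounted.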
Two species are tied with the highest number of common names (9): the grey wolf (aka. Timber Wolf, Arctic Wolf, Gray Wolf, Mexican Wolf, Plains Wolf, Common Wolf, Tundra Wolf, Wolf) and the wapiti (aka. Siberian Wapiti, McNeill’s Deer, Merriam’s Wapiti, Shou, Izubra/Manchurian Wapiti, Tien Shan Wapiti, Tule Elk, Alashan Wapiti).
Once we’ve counted the number of names, let’s check it out in histogram format.
To count the length of the common names in terms of characters, I had to make a choice about which name to keep for those species that had many. The simplest option was to keep the first one provided, using the separate() function in tidyr.
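Keeping the first name and measuring its length might look like the sketch below. The toy table and the new column names (`first_name`, `n_chars`, `n_words`) are assumptions for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical table with comma-separated common names
mammals <- tibble(
  species      = c("Kobus kob", "Canis lupus"),
  common_names = c("Kob", "Grey Wolf, Timber Wolf, Arctic Wolf")
)

first_names <- mammals %>%
  # split on commas, keep only the first piece, silently drop the rest
  separate(common_names, into = "first_name", sep = ",", extra = "drop") %>%
  mutate(
    n_chars = nchar(first_name),                    # characters, spaces included
    n_words = lengths(strsplit(first_name, " "))    # space-separated words
  )

first_names
```

Note that nchar() counts spaces and punctuation, so a hyphenated or multi-word name is longer than the sum of its letters.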
The mammal with the shortest common name is the Kob (Kobus kob), an antelope found across sub-Saharan Africa. The species with the longest common name is the Black-crowned Central American Squirrel Monkey (Saimiri oerstedii), whose name is self-explanatory.
Now let’s look at the distribution of the number of characters, but using a density plot instead of a histogram.
That’s all. If you found this helpful, please let me know, and also contact me if you find any mistakes in the code.