This post was discussed in the Rweekly Highlights Podcast (Episode 118). Listen here.
From small one-off scripts to massive interactive apps, most workflows use packages that help us extend the capabilities of a programming language. Sometimes we only need some example data or a few additional functions. Other times we want to add grammars, data structures, printing methods, graphical devices, or object-oriented programming systems.
Rather than only working interactively, it helps to use scripts and save our source code for repeatability and transparency. When a package provides something that fits our needs, we generally call it from a script, so properly documenting how we use packages within scripts can make code easier to understand, debug, and share.
Can you figure out which packages here don’t actually exist?
One easy way to add information about the packages we use is with code comments on package load calls. In R, anything after the commenting symbol “#” is ignored during evaluation. It’s best practice not to overuse comments and to write self-explanatory code, but recording certain details about the packages we’re calling in a script can be useful.
The code above can easily be enhanced with the title and source of each package (as installed on my machine), which is potentially helpful.
The two made-up packages were xonkicks and labbo.
To easily create code comments about the package load calls in a script, we can use the annotater package (read more about the package here). The functions build code comments using information already supplied by a package in its DESCRIPTION file or within its internal lists of functions and bundled datasets. This can be quite helpful for sharing code online, teaching, or making sense of existing scripts.
At present, annotater can add the following details to package load calls in scripts or markdown (Rmd/Qmd) files (see the sketch after this list):
- package title
- package version
- package repository source (e.g. CRAN, GitHub, Bioconductor, etc.)
- which functions from each package are being used
- which datasets from each package are being used
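To illustrate, here is a sketch of what annotated load calls can look like. The package titles come from each package’s DESCRIPTION file; the version numbers below are placeholders, and the exact comment format may differ from annotater’s current output.

```r
# before: plain load calls
library(dplyr)
library(ggplot2)

# after annotating (placeholder versions; format is illustrative)
library(dplyr)   # A Grammar of Data Manipulation, CRAN vX.Y.Z
library(ggplot2) # Create Elegant Data Visualisations Using the
                 # Grammar of Graphics, CRAN vX.Y.Z
```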
The number of possible package annotations has grown thanks to feedback and community contributions, but I still had a pressing question:
What are people commenting about their loaded packages?
With so much public code available, it is now possible to search for .R, .Rmd, and .Qmd files online, look for package load calls (e.g., library(data.table)), and then check for code comments after these lines (e.g., library(data.table) # using dev version).
A good source of data for this can be the weekly GitHub snapshots available on Google’s BigQuery cloud platform. This 3TB+ dataset includes the content of 163 million files, all searchable with regular expressions.
This post by Vlad Kozhevnikov explains how to write SQL for BigQuery to select the IDs of all relevant files and then select the content of R files. After that we can run another query to search the script contents for code comments after library() calls.
The queries for .R files look like this (replace with your own BigQuery project, dataset, and table names); I repeated the process for .R, .Rmd, and .Qmd files.
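As a rough sketch of the query logic (not Vlad’s exact SQL; the table names below point at the public github_repos dataset and the regular expressions are simplified):

```r
# query stored as a string to send to BigQuery later;
# dataset/table names are illustrative, swap in your own
sql_rfiles <- "
SELECT f.repo_name, f.path, c.content
FROM `bigquery-public-data.github_repos.files` AS f
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON f.id = c.id
WHERE REGEXP_CONTAINS(f.path, r'\\.R$')
  AND REGEXP_CONTAINS(c.content, r'library\\(')
"
```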
Many thanks to fellow instructors Steve Condylios and Amit Kohli for pointing me in the right direction!
My latest query found 500,641 .R files, 44,839 .Rmd files, and 242 .Qmd files in the GitHub data snapshot (March 14, 2023), of which 3,968 included at least one code comment after a library() call.
A good proportion of these 545,722 files won’t be scripts that people are meant to see or interact with (for example, .R files in Shiny apps). Still, the 0.7% of files with comments about the loaded packages hold some interesting information about what we as users are commenting on.
Also following Vlad’s post, I was able to connect to BigQuery from R using bigrquery and download the tables to objects in my R environment. Once in R, some cleaning and parsing is needed to get each file’s library calls and respective comments (if any) into a tidier structure.
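A minimal sketch of that step, assuming the query string from above and a placeholder project ID:

```r
library(bigrquery)

bq_auth()  # authenticate with the Google account tied to the project

# run the query and download the results into an R data frame
# ("my-gcp-project" is a placeholder project ID)
query_job <- bq_project_query("my-gcp-project", sql_rfiles)
rfiles <- bq_table_download(query_job)
```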
All the R code for this post is in the Gist at the end in case anyone is interested. This approach uses regex to parse code, but also other cool functions to treat code as code rather than strings. In short, I had to do the following (sketched in code after the list):
- get only the lines with library calls in the field that stored each script’s content
- split each call into a separate row
- separate comments into their own variable
- clean up spaces and unmatched brackets (from wrapped library() calls)
- parse the calls and extract the first argument of each library() call, to get package names and not the other arguments people sometimes use
- calculate various summary statistics
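Here is a minimal sketch of the extraction logic (the helper names are mine and the regex is simplified; the Gist has the real code):

```r
library(stringr)

# pull out library() lines and split off any trailing comment
extract_lib_comments <- function(script_text) {
  lines <- str_subset(unlist(strsplit(script_text, "\n", fixed = TRUE)),
                      "library\\(")
  parts <- str_match(lines, "^\\s*(library\\([^)]*\\))\\s*(#.*)?$")
  data.frame(call    = parts[, 2],
             comment = str_trim(str_remove(parts[, 3], "^#\\s*")),
             stringsAsFactors = FALSE)
}

# treat code as code: parse a call and take its first argument, so
# library(dplyr, warn.conflicts = FALSE) still yields "dplyr"
pkg_from_call <- function(call_string) {
  parsed <- tryCatch(parse(text = call_string)[[1]],
                     error = function(e) NULL)
  if (is.null(parsed) || length(parsed) < 2) return(NA_character_)
  as.character(parsed[[2]])
}

pkg_from_call("library(data.table)")  # "data.table"
```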
General overview
For the 3,968 files (out of more than 500,000) that included at least one code comment after a package load call, the mean number of packages loaded was 6 per script (minimum 1, maximum 203!), and within these scripts over half (55%) of the loaded packages have a code comment.
Let’s see a set of commented calls sampled at random:
pkgname | comment |
---|---|
roxygen2 | If not availble install these packages with ‘install.packages(…)’ |
tidyverse | load c(dplyr, tidyr, stringr, readr) due to system doesn’t work. |
scales | install.packages(“scales”) |
GenomicRanges | Required for defining regions |
devtools | session_info |
magrittr | Pipes |
forecast | varios metodos para pronóstico |
gridExtra | for multiple plots on a page |
car | using external libraries |
rgl | Need to install this package first |
reshape2 | For converting wide to long |
MplusAutomation | for extracting gh5 |
ggplot2 | for plotting color plots with legend |
ggplot2 | For plotting |
DataCombine | for the slide function |
irr | fleiss’ kappa |
corpcor | for fast computation of pseudoinverse |
ggplot2 | For plotting |
dplyr | alternative, this also loads %>% |
scales | For formating values in graphs |
Now let’s see the comments that repeat the most regardless of which package they’re meant for:
comment | n |
---|---|
Pipes | 206 |
enables piping : %>% | 93 |
For graphing | 84 |
data manipulation | 49 |
for data manipulation | 38 |
Even with this small sample we can already see some patterns and shared themes in these comments. For example:
- Notes about installation
- What the package is being called for
- Which function or functions from the package are being used
- Pipes
Skimming the comments, a decent proportion describe what a package was used for in general or mention the functions or datasets of interest. 22% of all comments start with the words “for”, “para”, or “pour” (case insensitive). Here’s another random sample:
pkgname | comment |
---|---|
bcmaps | for BC regional district map |
tmap | for maps |
car | for initial parameter estimate |
factoextra | for fviz_cluster(), get_eigenvalue(), eclust() |
clusterSim | for Davies-Bouldin’s cluster separation measure |
e1071 | Para la función svm |
plotrix | for plotCI(), cld() |
edgeR | for DGE |
parallel | for examining machine |
Boruta | for feature importance |
missForest | for missForest() |
grid | For ‘unit’ function in tick length setting |
Amelia | for missmap() |
tidyr | for data handling |
MASS | For the data set # Use smoke as the faceting variable |
Epi | for AUC evaluator |
minpack.lm | for non-linear regression package |
plyr | For the function “each” |
MASS | for LDA |
grid | for gList for likert plots w/ histograms |
~15% of comments had some kind of note about installation instructions or the package source (matched the pattern “instal”, “CRAN”, or “github”).
We can even group the data by package to find the most commented packages. For reference, these are the ten packages with the most comments (excluding duplicate comments):
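With the calls and comments in a tidy data frame (assumed here to be called lib_comments, one row per load call with pkgname and comment columns), this is a simple count:

```r
library(dplyr)

lib_comments |>
  distinct(pkgname, comment) |>   # drop duplicate comments
  count(pkgname, sort = TRUE) |>
  slice_head(n = 10)
```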
pkgname | n |
---|---|
ggplot2 | 208 |
dplyr | 190 |
plyr | 92 |
MASS | 87 |
tidyverse | 74 |
reshape2 | 71 |
stringr | 64 |
scales | 64 |
data.table | 60 |
lubridate | 55 |
Many of these are tidyverse packages, and sometimes the comments made note of that. I did not look at when these scripts were written, but they seem to note important changes such as the melt/cast → gather/spread → pivot transition. See some comments for tidyr that mention reshaping in general:
pkgname | comment |
---|---|
tidyr | for reshaping data (e.g., ‘gather’) |
tidyr | instead of reshape2::melt |
tidyr | key package for reshaping |
tidyr | reshaping data |
tidyr | library(Hmisc); library(reshape2); |
tidyr | data reshaping |
tidyr | because we’ll need to reshape the data |
tidyr | For reshaping dataframes, specifically nesting |
tidyr | this library contains the function that we’ll need in order to reshape the dataframe |
I had noted that there are comments in different languages, so we can try to identify them with Google’s Compact Language Detector as implemented in the cld3 package. The function is vectorized and easy to use.
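For example, applied to a few of the comments from the tables above (output shown as I’d expect it, not verbatim):

```r
library(cld3)

detect_language(c("For plotting",
                  "Para cargar el paquete dplyr",
                  "varios metodos para pronóstico"))
#> something along the lines of: "en" "es" "es"
```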
commentLanguage | n | percent |
---|---|---|
en | 4820 | 0.72 |
no | 324 | 0.05 |
pt | 103 | 0.02 |
fr | 90 | 0.01 |
es | 88 | 0.01 |
ja | 85 | 0.01 |
The vast majority of annotations appear to be in English, followed by Norwegian, Portuguese, French, Spanish, and Japanese. Because the comments are so brief and many include function names that aren’t real words, I highly doubt the accuracy of some of these detected languages. I looked at some of the comments that were supposedly in Norwegian and they were mostly short comments with function names, not words or sentences in any particular language. Still, I was interested in the Spanish-language comments, and I found some similar to what I’ve written in the past. It’s pretty interesting to see how people describe packages.
pkgname | comment |
---|---|
dplyr | paquete que veremos mañana y que por ahota lo necesitan otras funciones |
dplyr | ya incluido en tidyverse |
corrplot | grafico de correlacion |
data.table | manipulacion de datos mayor rapidez que dplyr |
NombreDelPaquete | Para usar el paquete |
dplyr | Para cargar el paquete dplyr |
stringr | Para la conversión de tipos numéricos a cadenas |
psych | incluye describe() |
IRdisplay | despliegue de resultados en jupyter para R |
fdrtool | para half normal |
Hmisc | algunas funciones de interes para describir datos |
randomForest | construir enms |
readxl | read_excel |
stringr | creo que ni la usé |
foreign | Solo para cargar el conjunto de datos de prueba en formato arff |
ggplot2 | manipulacion de graficos - tidyverse |
curl | Descargar desde internet |
I particularly like the comment “creo que ni la usé” (“I think I didn’t even use this package”) because it makes the tools from annotater seem helpful. If I have a big script and am unsure whether or not I used a function or dataset from a package, the addins can check that for me.
As a quick exercise, here are seven comments sampled randomly for each of five packages in the data, drawn as network graphs. I think that even without labeling the package names we could figure out what each one is based on the comments.
There is a lot to explore in these data (TidyTuesday potential??), but I’m quite happy with the overlap between what I found and the comments that can be created using the tools that I’ve been working on.
Exploration code: