What are people commenting about their loaded packages?

This post was discussed in the Rweekly Highlights Podcast (Episode 118). Listen here.

From small one-off scripts to massive interactive apps, most workflows use packages that help us extend the capabilities of a programming language. Sometimes we only need some example data or a few additional functions. Other times we want to add grammars, data structures, printing methods, graphical devices, or OOP programming systems.

Rather than only working interactively, it helps to use scripts and save our source code for repeatability and transparency. When we need something from a package that fits our needs, we generally call this package from a script, so properly documenting how we use packages within scripts can make code easier to understand, debug, and share.

Can you figure out which packages here don’t actually exist?

# packages I may or may not need for this task
library(e1071)
library(yulab.utils)
library(wk)
library(windex)
library(xonkicks)
library(whisker)
library(labbo)
library(stringx)
library(castor)

One easy way to add information about the packages we use is with code comments for package load calls. In R, anything after the commenting symbol “#” will be ignored during evaluation. It’s best practice to not overuse comments and to write self-explanatory code, but recording certain details about the packages we’re calling in a script can be useful.

The code above can be easily enhanced with the title and source for each package (as installed on my machine), and this is potentially helpful.

# packages with title and source as comments
library(e1071) # Misc Functions of the Department of Statistics, Probability Theory Group CRAN v1.7-9
library(yulab.utils) # Supporting Functions for Packages Maintained by 'YuLab-SMU', CRAN v0.0.4
library(wk) # Lightweight Well-Known Geometry Parsing, CRAN v0.6.0
library(windex) # Analysing Convergent Evolution using the Wheatsheaf Index, CRAN v2.0.3
library(xonkicks) # not installed
library(whisker) #  for R, Logicless Templating, CRAN v0.4
library(labbo)   # not installed
library(stringx) # Drop-in Replacements for Base String Functions Powered by'stringi', [github::gagolews/stringx] v0.2.4
library(castor) # Efficient Phylogenetics on Large Trees, CRAN v1.7.8

The two made up packages were xonkicks and labbo.

To easily create code comments about the package load calls in a script, we can use the annotater package (read more about the package here). The functions build code comments using information already supplied by a package in its DESCRIPTION file or within its internal lists of functions and bundled datasets. This can be quite helpful for sharing code online, teaching, or making sense of existing scripts.

At present, annotater can add the following details to package load calls in scripts or markdown (Rmd/Qmd) files:

package title
package version
package repository source (e.g. CRAN, GitHub, Bioconductor, etc.)
which functions from each package are being used
which datasets from each packages are being used

The number of possible package annotations has grown thanks to feedback and community contributions, but I still had a pressing question:

What are people commenting about their loaded packages?

With so much public code available, it is now possible to search for .R, .Rmd and .Qmd files online, look for package load calls (e.g., library(data.table)), and then check for code comments after these lines (e.g., library(data.table) # using dev version).

A good source of data for this can be the weekly GitHub snapshots available on Google’s BigQuery cloud platform. This 3TB+ dataset includes the content of 163 million files, all searchable with regular expressions.

This post by Vlad Kozhevnikov explains how to write SQL for BigQuery to select the IDs of all relevant files and then select content of R files. After that we can run another query to search the script contents for code comments after library() calls.

The queries for .R files look like this (change with your respective BigQuery project, dataset, and table names). I repeated this for .R, .Rmd, and .Qmd files.

-- Find all files with .R extension (case insensitive)
SELECT *
FROM `bigquery-public-data.github_repos.files`
WHERE lower(RIGHT(path, 2)) = '.r'

-- Match to get file contents
SELECT *
FROM `bigquery-public-data.github_repos.contents`
WHERE id IN (select id from `YOURBQprojectID.yourDATASET.yourTABLE`)

– Regex match library calls with a comment after whitespace (except for newlines)
SELECT *
FROM `YOURBQprojectID.yourDATASET.yourTABLE`
WHERE REGEXP_CONTAINS(content, r'library\(.*\)[ \t]?#')

Many thanks to fellow instructors Steve Condylios and Amit Kohli for pointing me in the right direction!

My latest query found 500,641 .R files, 44,839 Rmd, and 242 Qmd files in the GitHub Data snapshot (March 14, 2023) of which 3968 included at least one code comment after a library() call.

A good proportion of these 545,722 files won’t be scripts that people are meant to see or interact with (for example, .R files in Shiny apps). Still, the 0.7% of files with comments about the loaded packages will have some interesting information about what us as users are commenting.

Also following Vlad’s post, I was able to connect with BigQuery from R using bigrquery and download the tables to objects in my R environment. Once in R, there is some cleaning and parsing to get each file’s library calls and respective comments (if any) into a tidier structure.

All the R code for this post is at the Gist at the end in case anyone is interested. This approach uses regex to parse code but also other cool functions to treat code as code rather than strings. In short, I had to:

get only the lines with library calls in the field that stored all the script’s content
split each call into a separate row
separate comments into their own variable
clean up spaces and unmatched brackets (from wrapped library() calls)
parse the calls and extract the first argument of the library() calls to get package names and not other arguments people sometimes use.
various summary statistics

General overview

For the 3968 (out of >500,000 files) that included at least one code comment after a library load call, the mean number of packages loaded was 6 per script (minimum 1, maximum 203!) and across these scripts over half (55%) of the packages mentioned in each one have a code comment.

Let’s see a set of commented calls sampled at random:

pkgname	comment
roxygen2	If not availble install these packages with ‘install.packages(…)’
tidyverse	load c(dplyr, tidyr, stringr, readr) due to system doesn’t work.
scales	install.packages(“scales”)
GenomicRanges	Required for defining regions
devtools	session_info
magrittr	Pipes
forecast	varios metodos para pronóstico
gridExtra	for multiple plots on a page
car	using external libraries
rgl	Need to install this package first
reshape2	For converting wide to long
MplusAutomation	for extracting gh5
ggplot2	for plotting color plots with legend
ggplot2	For plotting
DataCombine	for the slide function
irr	fleiss’ kappa
corpcor	for fast computation of pseudoinverse
ggplot2	For plotting
dplyr	alternative, this also loads %>%
scales	For formating values in graphs

Now let’s see the comments that repeat the most regardless of which package they’re meant for:

comment	n
Pipes	206
enables piping : %>%	93
For graphing	84
data manipulation	49
for data manipulation	38

Even with this small sample we can already see some patterns and shared themes in these comments. For example:

Notes about installation
What the package is being called for
Which function or functions from the package are being used
Pipes

Skimming the comments, a decent proportion of them are describing what a package was used for in general or mentioning the functions or datasets of interest. 22% of all comments start with the words “for”, “para”, and “pour” (case insensitive). Here’s another random sample:

pkgname	comment
bcmaps	for BC regional district map
tmap	for maps
car	for initial parameter estimate
factoextra	for fviz_cluster(), get_eigenvalue(), eclust()
clusterSim	for Davies-Bouldin’s cluster separation measure
e1071	Para la función svm
plotrix	for plotCI(), cld()
edgeR	for DGE
parallel	for examining machine
Boruta	for feature importance
missForest	for missForest()
grid	For ‘unit’ function in tick length setting
Amelia	for missmap()
tidyr	for data handling
MASS	For the data set # Use smoke as the faceting variable
Epi	for AUC evaluator
minpack.lm	for non-linear regression package
plyr	For the function “each”
MASS	for LDA
grid	for gList for likert plots w/ histograms

~15% of comments had some kind of note about installation instructions or package source (matched the pattern “instal” or ‘CRAN’ or ‘github’).

We can even group the data by package find the most commented packages. For reference, these are the ten packages with the most comments (excludes duplicate comments):

pkgname	n
ggplot2	208
dplyr	190
plyr	92
MASS	87
tidyverse	74
reshape2	71
stringr	64
scales	64
data.table	60
lubridate	55

Many of these are tidyverse packages and sometimes the comments made note of that. I did not look at when these scripts are from, but they seem to cover and make note of important changes such as the melt/cast-gather/spread-pivot transition. See some comments for tidyr that mention reshaping in general:

pkgname	comment
tidyr	for reshaping data (e.g., ‘gather’)
tidyr	instead of reshape2::melt
tidyr	key package for reshaping
tidyr	reshaping data
tidyr	library(Hmisc); library(reshape2);
tidyr	data reshaping
tidyr	because we’ll need to reshape the data
tidyr	For reshaping dataframes, specifically nesting
tidyr	this library contains the function that we’ll need in order to reshape the dataframe

I had noted there are comments in different languages, so we can try to identify them with Google’s Compact Language Detector as implemented in the cld3 pacakge. The function is vectorized and easy to use.

n	percent
en	4820	0.72
no	324	0.05
pt	103	0.02
fr	90	0.01
es	88	0.01
ja	85	0.01

The vast majority of annotations appear to be in English, followed by Norwegian, Portugese, French, Spanish, and Japanese. Because the comments are so brief and many include function names that aren’t real words, I highly doubt the accuracy some of these detected languages. I looked at some the comments that were supposedly in Norweigian and they were mostly short comments with function names and not words or sentences in any particular language. Still, I was interested in the Spanish-language comments and I found some similar to what I’ve written in the past. It’s pretty interesting to see how people describe packages.

pkgname	comment
dplyr	paquete que veremos mañana y que por ahota lo necesitan otras funciones
dplyr	ya incluido en tidyverse
corrplot	grafico de correlacion
data.table	manipulacion de datos mayor rapidez que dplyr
NombreDelPaquete	Para usar el paquete
dplyr	Para cargar el paquete dplyr
stringr	Para la conversión de tipos numéricos a cadenas
psych	incluye describe()
IRdisplay	despliegue de resultados en jupyter para R
fdrtool	para half normal
Hmisc	algunas funciones de interes para describir datos
randomForest	construir enms
readxl	read_excel
stringr	creo que ni la usé
foreign	Solo para cargar el conjunto de datos de prueba en formato arff
ggplot2	manipulacion de graficos - tidyverse
curl	Descargar desde internet

I particularly like the comment “creo que ni la usé” or “I think I didn’t even use this package” because it makes the tools from annotater seem helpful. If I have a big script and am unsure whether or not I used a function or dataset from a package, the addins can check that for me.

As a quick exercise, here are seven comments sampled randomly for five of the packages in the data drawn as network graphs. I think that even without labeling the package name we could figure out what each one is based on the comments.

There is a lot to explore in these data (TidyTuesday potential??), but I’m quite happy with the overlap between what I found and the comments that can be created using the the tools that I’ve been working on.

Exploration code:

Share on

Twitter Facebook LinkedIn

What are people commenting about their loaded packages?

Luis D. Verde Arregoitia

What are people commenting about their loaded packages?

General overview

Share on

You may also enjoy

Nueva publicación - Conservación de primates

New publication on primate conservation

R is not RStudio

R no es RStudio