2024 LLMs/genAI + R roundup


[Hex logo tile, created with hexsession]

This year has seen significant progress in the genAI/LLM space, and lately many of these tools have been integrated with R in various ways. This is nice because not everything needs to be a standalone chat box in the web browser, a format that seems prone to misuse (e.g., using tools like ChatGPT as a search engine for some reason).

To help me keep track of what’s happening, I’ve put together this (potentially incomplete) list of relevant LLM+R resources.

First, it’s worth mentioning this perspective in Methods in Ecology and Evolution: “Harnessing large language models for coding, teaching and inclusion to empower research in ecology and evolution”, led by Natalie Cooper and part of a special issue on the use of LLMs in ecology and evolution. These papers summarize the pros and cons of genAI/LLMs in the context of research and teaching, and their conclusions apply well beyond research and the biological sciences.


Now, in no particular order, here is a roundup of some notable developments that I’ve come across online.

pal

pal by Simon Couch provides easy-to-use assistants that can edit, document, or explain code. The package provides an addin that works in both RStudio and Positron. It has a very cute package logo, and it seems like a good option for writing boilerplate code and automating some of the more tedious, repetitive tasks. Works with multiple underlying models. Haven’t tried it yet.

  • Here’s a tutorial in Spanish for using pal with custom assistants that can have different roles.

elmer

elmer is a new tidyverse-adjacent package by Hadley Wickham and Joe Cheng for interacting with various models from R. elmer creates R6 chat objects that remember context, and we can use them interactively through a console or browser-based chat box, or programmatically within R functions.

Seems promising, quite flexible, and all the activity in the GitHub repo suggests a very active development process, a solid dev team, and lots of community input.
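To make the idea of a chat object that remembers context a bit more concrete, here is a minimal base R sketch. It is not elmer's API or implementation: `new_chat()` and `fake_model()` are hypothetical names, the object is a plain closure rather than an R6 class, and the "model" is an offline stub that only reports how many messages it received. The point is just that every turn is appended to a history that gets sent along with the next prompt.

```r
# Hypothetical stand-in for a model call (assumption, not a real API):
# it reports how many messages it was given, so the example runs offline.
fake_model <- function(messages) {
  n_user <- sum(vapply(messages, function(m) m$role == "user", logical(1)))
  paste0("reply #", n_user, " (saw ", length(messages), " messages)")
}

# Constructor for a minimal chat object that keeps its own history.
new_chat <- function(model = fake_model, system_prompt = NULL) {
  history <- list()
  if (!is.null(system_prompt)) {
    history[[1]] <- list(role = "system", content = system_prompt)
  }

  chat <- function(prompt) {
    # Store the user turn, send the full history, then store the reply.
    history[[length(history) + 1]] <<- list(role = "user", content = prompt)
    reply <- model(history)
    history[[length(history) + 1]] <<- list(role = "assistant", content = reply)
    reply
  }

  list(chat = chat, turns = function() length(history))
}

session <- new_chat(system_prompt = "You are a terse R tutor.")
first <- session$chat("What does vapply() do?")
second <- session$chat("And how is it different from sapply()?")
```

The second call sees the system prompt, the first question, and the first reply in addition to the new question, which is what lets a follow-up like "And how is it different..." make sense to a real model.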

mall

mall is part of the mlverse ecosystem of open source Data Science and Machine Learning libraries. Rather than taking a chat-based approach, mall applies LLMs row-wise to the columns of a data frame. Built-in prompts include translation, summarization, extraction, and sentiment analysis of text strings.

mall uses Ollama and is implemented for both R and Python. I will likely be using it to analyze the comments I collected about loaded packages, which I talked about in my posit::conf(2024) talk.
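As a rough illustration of the row-wise idea, here is a base R sketch that applies one instruction to every value in a text column and stores the answers in a new column. This is not how mall is implemented and it does not use mall's functions: `apply_llm()` and `fake_llm()` are hypothetical, and the "model" is a keyword rule so that the example runs without Ollama or any network access.

```r
# Hypothetical stand-in for a local model (assumption): a keyword rule
# applied to the text only. A real backend would send the instruction
# and the text to an LLM and return its answer.
fake_llm <- function(instruction, text) {
  text <- tolower(text)
  if (grepl("love|great", text)) {
    "positive"
  } else if (grepl("broke|bad", text)) {
    "negative"
  } else {
    "neutral"
  }
}

# Apply the same instruction to each row of `column`, one call per row,
# and add the results as `new_column`.
apply_llm <- function(data, column, instruction, new_column, model = fake_llm) {
  data[[new_column]] <- vapply(
    data[[column]],
    function(text) model(instruction, text),
    character(1),
    USE.NAMES = FALSE
  )
  data
}

comments <- data.frame(
  comment = c(
    "I love this package",
    "The update broke my script",
    "It loads on startup"
  ),
  stringsAsFactors = FALSE
)

result <- apply_llm(
  comments,
  column = "comment",
  instruction = "Classify the sentiment as positive, negative, or neutral.",
  new_column = "sentiment"
)
```

Because the output is just another column, the result stays a regular data frame that can be filtered, counted, or joined like any other.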

gemini.R

gemini.R by Jinhwan Kim connects R with Google’s Gemini model via the Gemini API. With a valid API key, the gemini() function takes text prompts and the gemini_image() function can work with images and text prompts.

The package also provides an RStudio addin for creating Roxygen documentation.

GitHub Copilot

GitHub Copilot is a widely used and well-documented coding assistant that offers code completions and can turn natural language prompts into code suggestions. Works on GitHub.com and inside most IDEs.

Subscription-based. I have not tried it yet.

This recent talk by Yanina Bellini for RLadies Rome provides a good overview, and Yani’s slides also mention important considerations about the AI skill threat.

continue

A VS Code extension that works nicely in Positron. Supports multiple models for chat and code completion. Easy to provide local files and folders for context, and nice integration with the source editor.

Tried it out after Julia Silge mentioned it on the Super Data Science podcast. Works nicely, and I used it in Positron with Anthropic’s Claude 3.5 Sonnet model to write and edit the repetitive CSS, HTML, and JavaScript code that powers the hexsession package. It also helped me with the nested for loops that play a big role in forgts.

ensure

ensure by Simon Couch helps write code for unit tests using the testthat package. Works through an RStudio addin, and the documentation mentions that the model has been made aware of testthat syntax and the tidy style guide for code. Will be trying it out for my more recent packages that still have poor test coverage.

codeium

Another cool VS Code extension that works well in Positron. Provides autocomplete and chat. Free for individuals. The base model uses Llama 3.1 70B, but paid tiers can choose other models. I’ve tried this out in Positron with good results for code refactoring, at least when playing around with small scripts.

llmR

llmR by Angelo D’Ambrosio provides a unified API to interact with various LLMs and providers through functions with consistent grammar and syntax. Provides easy switching between models, plus logging. Seems similar to elmer. Will try it soon.

  • A similar package called LLMR recently appeared on CRAN, but I could not find much material about it.
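The general "one interface, many providers, with logging" pattern can be sketched in a few lines of base R. This does not reflect llmR's or LLMR's actual functions: `new_client()`, the `alpha` and `beta` providers, and the call log are all hypothetical, and the providers are plain functions standing in for real API clients.

```r
# Hypothetical providers (assumption): plain functions standing in for
# real API clients, so the example runs offline.
providers <- list(
  alpha = function(prompt) paste("[alpha]", prompt),
  beta  = function(prompt) paste("[beta]", toupper(prompt))
)

# Build a client whose ask() has the same signature for every provider
# and records which provider handled each call.
new_client <- function(providers) {
  calls <- character(0)

  ask <- function(prompt, provider = names(providers)[1]) {
    if (!provider %in% names(providers)) {
      stop("Unknown provider: ", provider)
    }
    calls <<- c(calls, provider)
    providers[[provider]](prompt)
  }

  list(ask = ask, history = function() calls)
}

client <- new_client(providers)
a <- client$ask("hello")                     # uses the first provider
b <- client$ask("hello", provider = "beta")  # same call, different backend
```

Switching models is then a matter of changing one argument, and `client$history()` gives a simple record of which backend answered each prompt.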

lang

lang is also part of the mlverse. This package uses LLMs to translate R documentation and display it in the help pane of RStudio or Positron. Having participated in various translation initiatives, I am very wary of machine-translated function documentation and how it may affect new learners.

The best part of this package (in my opinion) is the infrastructure that helps package developers translate documentation which, after editing, can be shipped as part of a package with multilingual help files.


If I missed anything, let me know and I’ll add it here!