Regular expressions and working with strings

R-Ladies St. Louis
February 22, 2023

About me

  • Mammals, conservation, macroecology
  • Evolution, ecomorphology
  • Phylogenetic Comparative Methods
  • Biogeography, R as a GIS
  • Certified trainer (posit::tidyverse & Software Carpentry)

What are Regular Expressions?

  • Also abbreviated as regex, R.E., or regexp (singular)

  • A concise language for describing patterns of text


Specially encoded 🧶strings of characters that match patterns in other text strings

Regular expressions

In practice, a computer language with its own terminology and syntax

  • Input as a text string that compiles into a mini program built specifically to identify a pattern.

  • Can be used to match, search, replace, or split text

Have you used regular expressions? If so, what for?

Possible patterns

  • “dog” but not “dogs”

  • “dogs” but only if the match is at the start of the string

  • digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

  • “modeling” or “modelling” (alternate spellings of the same word)

  • words ending in “at” (words that begin or end with a specific pattern)

  • strings that start with digits

  • dates

More possible patterns

  • zip codes

  • numbers inside square brackets and also the brackets

  • valid Twitter handles (start with @, no spaces or symbols, <16 characters)

  • UPPERCASE words

Why Learn Regular Expressions?

  • Underutilized, valuable skill

  • Save time when we need to match, extract, or transform text by matching words, characters, and patterns

  • Regular expressions can often replace dozens of lines of code

  • Not specific to any particular programming language

What we need

  • strings
  • patterns


Generally, we describe patterns and look for them in input strings

🧶Strings

A collection of characters that make up one element of a vector:

test_string <- "This sentence is a string."


We can store multiple strings in a character vector:

pets <- c("dogs", "cat", "parrot", "pig")

Strings

Names, row names, column names, and values in a data frame can also be strings:

drink    price
Coffee   3.50
Tea      2.99
Juice    3.20

Note: strings in R are case sensitive

Uppercase and lowercase letters are treated as distinct

"rat"=="rat"
[1] TRUE
"rat"=="rAT"
[1] FALSE

Regex searches are case sensitive by default

Our options:

  • Build case-insensitive regular expressions

  • Turn off case sensitivity in the matching

  • Modify the input text before matching
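The three options above can be sketched in R. This is only an illustration: it assumes base R's grepl() (chosen so it runs without extra packages), and the animals vector is made up for the example.

```r
# Hypothetical input vector, for illustration only
animals <- c("rat", "Rat", "RAT", "cat")

# Option 1: build case-insensitivity into the pattern with character sets
opt1 <- grepl("[Rr][Aa][Tt]", animals)

# Option 2: turn off case sensitivity in the matching function
opt2 <- grepl("rat", animals, ignore.case = TRUE)

# Option 3: modify the input (lowercase it) before matching
opt3 <- grepl("rat", tolower(animals))

print(opt1)  # TRUE TRUE TRUE FALSE
print(opt2)  # TRUE TRUE TRUE FALSE
print(opt3)  # TRUE TRUE TRUE FALSE
```

All three approaches give the same answer here; which one is most convenient depends on whether the original capitalization needs to be kept.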

Regex - Getting started

  1. Describe a pattern
  2. Provide an input string
  3. Feed into a function (for matching, replacing, splitting, etc.)


The matching is done by the regex engine. We do not access the engine directly, but functions that take regular expressions as arguments call on it whenever it is needed.

Getting started

Regexp


dog

Input string


“The dog is fat.”

Match


The dog is fat.

To search for a specific sequence of characters, the regexp (regular expression) we need is simply that sequence of characters

Find a d followed by an o followed by a g (all lowercase, all together, and in that order when read from left to right).
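For readers who prefer to try this in R rather than an online tester, here is a minimal sketch of the same literal match. It assumes base R functions (grepl(), regexpr(), regmatches()), which are one of several ways to do this.

```r
input <- "The dog is fat."

# Is the literal sequence d-o-g present anywhere in the input?
found <- grepl("dog", input)

# Where does the match start, and what text was matched?
hit <- regexpr("dog", input)
start_pos <- as.integer(hit)
matched <- regmatches(input, hit)

print(found)      # TRUE
print(start_pos)  # 5
print(matched)    # "dog"
```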

Online regex testers

Is the regular expression matching anything?

Regex Testers

Practice - Regex

Navigate to rubular, regex101, regexr, or regexpal

Input the following text into the test string field:

cat
hat
CAT
Manhattan
MOUSE
housekeeping

Let’s try different regexes in the pattern box to see what happens

Literal Characters and Metacharacters

Strings and regular expressions are made of characters


For regular expressions, characters can be grouped into two classes depending on their behavior

Literal characters

If input string is “dog” and regex is dog


We will get a match whenever the characters d, o, and g occur consecutively in the input text


"dog" tells the regex engine: find a d, immediately followed by an o, immediately followed by a g together and in that particular order

Literal characters (continued)

d, o and g are examples of literal characters


They stand for exactly what they are: d in the regex matches a “d” in the input text, o matches an “o” in text, and so on.



The power and flexibility of regular expressions comes from their ability to describe more complex patterns.


If a text pattern can be described verbally, we can most likely write a regular expression to match it.

Metacharacters

To match more than literal strings of text, we use metacharacters: a small subset of characters that have special functions when they appear in a regular expression.


Metacharacters do not stand for themselves; they are interpreted in a special way.


Metacharacters include []^$.|?*+(){} and the backslash \, which are reserved for special matching purposes.

Wildcards

Stand in for unknown characters

. matches any single character

f..l matches “fill” and “fool”, but not “flail”

.top matches “stop” and “isotope”, but not “topple”
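The wildcard examples above can be checked with a short sketch in R. It assumes base R's grepl(); the words vector simply collects the words from the examples.

```r
words <- c("fill", "fool", "flail", "stop", "isotope", "topple")

# f, then any two characters, then l
f_l <- grepl("f..l", words)

# any one character followed by "top"
x_top <- grepl(".top", words)

print(f_l)    # TRUE TRUE FALSE FALSE FALSE FALSE
print(x_top)  # FALSE FALSE FALSE TRUE TRUE FALSE
```

“topple” fails because the wildcard needs a character in front of “top”, and there is none.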

Character sets

Match any one of the characters enclosed in square brackets

[ ] matches a single character from the set

[cb]at matches “cat” and “bat”, but not “rat”

[CK]at[iey] matches “Caty”, “Kati”, “Katy”, and “Kate”
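A quick way to confirm how these character sets behave, sketched with base R's grepl() (an assumption made for illustration). “Kato” is an extra, made-up non-matching case.

```r
pets <- c("cat", "bat", "rat")
# c or b, followed by "at"
set1 <- grepl("[cb]at", pets)

kat_names <- c("Caty", "Kati", "Katy", "Kate", "Kato")
# C or K, then "at", then exactly one of i, e, or y
set2 <- grepl("[CK]at[iey]", kat_names)

print(set1)  # TRUE TRUE FALSE
print(set2)  # TRUE TRUE TRUE TRUE FALSE
```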

Negation tokens

[^]    match characters not in the specified character set

^ must be the first character inside the brackets

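[^aoeiou] matches any character that is not a lowercase vowel (consonants, but also digits, spaces, and uppercase letters)

[^R] matches any single character except capital R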

Character ranges

Indicate a series of sequential characters inside character sets

Dash - inside a character set abbreviates alphabetical or numeric sequences

[A-D] matches any single uppercase letter from A to D (A, B, C, or D)

[5-8] matches any single digit between 5 and 8

[A-Za-z] matches any single uppercase or lowercase letter (unaccented)

Character sets are case sensitive
Character ranges can also be negated with ^
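Ranges and negation can be tried out with the sketch below. It assumes base R with perl = TRUE, so that ranges follow character codes rather than the locale's sort order; the test values and the phone-style string are made up.

```r
x <- c("B", "b", "7", "3", "R", "sky")

upper_a_to_d <- grepl("[A-D]", x, perl = TRUE)  # uppercase A through D only
five_to_eight <- grepl("[5-8]", x, perl = TRUE) # digits 5 through 8
not_capital_r <- grepl("[^R]", x, perl = TRUE)  # any character other than R

# A negated range: remove everything that is not a digit
digits_only <- gsub("[^0-9]", "", "Tel: 555-0199", perl = TRUE)

print(upper_a_to_d)   # TRUE FALSE FALSE FALSE FALSE FALSE
print(five_to_eight)  # FALSE FALSE TRUE FALSE FALSE FALSE
print(not_capital_r)  # TRUE TRUE TRUE TRUE FALSE TRUE
print(digits_only)    # "5550199"
```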

Anchors

Specify the relative position of the pattern being matched

 ^     starts with

 $     ends with
Note: ^ outside a pair of square brackets is an anchor (inside, as the first character, it negates the set)

^mil matches “milkshake” but not “family”

ing$ matches “going” but not “ingest”
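The effect of anchoring is easiest to see next to an unanchored pattern. A minimal sketch, assuming base R's grepl():

```r
words <- c("milkshake", "family", "going", "ingest")

starts_mil <- grepl("^mil", words)  # "mil" only at the start
ends_ing <- grepl("ing$", words)    # "ing" only at the end
any_mil <- grepl("mil", words)      # no anchor: "mil" anywhere

print(starts_mil)  # TRUE FALSE FALSE FALSE
print(ends_ing)    # FALSE FALSE TRUE FALSE
print(any_mil)     # TRUE TRUE FALSE FALSE
```

Without the anchor, “family” matches too, because it contains “mil” in the middle.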

Practice

  • Write a regexp that can match tender, timber, tailor, and taller

  • Match possible misspellings of ‘herbivore’ using character sets

Practice

  • Which of these regular expressions matches food at the beginning of a string?
  1. ^food
  2. food
  3. $food
  4. food^

Quantifiers

Specify how many times a character or character class must appear in the input for a match to be found

 ?    Zero or one
 *    Zero or more occurrences
 +    One or more occurrences
 {}   Exactly the specified number of occurrences

Quantifiers apply to the preceding character, character set, or group

Quantifiers

modell?ing matches “modeling” and “modelling”

zero or one els (l)

ya*y! matches “yy!”, “yay!”, “yaaay!”, “yaaaaaay!”, etc.

zero or more aes (a)

Quantifiers

no+ matches “no”, “nooo”, “noooooo”, etc., but not “n”

one or more oes (o)

e{2} matches “keep” and “bee” but not “meat”

exactly two ees (e)
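The four quantifiers can be compared side by side in R. This sketch assumes base R's grepl(); “modellling” and “yo!” are made-up non-matching cases added for contrast.

```r
# ?  zero or one l after "model"
q_optional <- grepl("modell?ing", c("modeling", "modelling", "modellling"))

# *  zero or more a's between the two y's
q_star <- grepl("ya*y!", c("yy!", "yay!", "yaaay!", "yo!"))

# +  one or more o's after n
q_plus <- grepl("no+", c("no", "nooo", "n"))

# {2}  exactly two e's in a row
q_exact <- grepl("e{2}", c("keep", "bee", "meat"))

print(q_optional)  # TRUE TRUE FALSE
print(q_star)      # TRUE TRUE TRUE FALSE
print(q_plus)      # TRUE TRUE FALSE
print(q_exact)     # TRUE TRUE FALSE
```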

Quiz

  • Use a quantifier to match cute, cuuute, cuuuuuute, and cuuuuuuuuute

  • How can we match Computer, computer, Computers, and computers?

  1. [cC]omputers?
  2. Computers+
  3. [cC]omputer[s]+

Alternation

Alternation tokens separate a series of alternatives

 |     either or


dog|bird matches “dog” or “bird”

gr(a|e)y matches “gray” and “grey”

Note the alternation enclosed in parentheses
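The sketch below shows why the grouping matters. It assumes base R's grepl(); “groy”, “grab”, and “key” are made-up test words.

```r
alt_words <- grepl("dog|bird", c("dog", "bird", "cat"))

# Grouped: "gr", then a or e, then "y"
alt_grouped <- grepl("gr(a|e)y", c("gray", "grey", "groy"))

# Ungrouped: the alternatives are the whole pieces "gra" or "ey"
alt_ungrouped <- grepl("gra|ey", c("grab", "key"))

print(alt_words)      # TRUE TRUE FALSE
print(alt_grouped)    # TRUE TRUE FALSE
print(alt_ungrouped)  # TRUE TRUE
```

Without the grouping, the | splits the entire pattern, so unrelated words such as “grab” and “key” also match.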

Special sequences and escapes

 \     signals a shorthand sequence or gives special characters a literal meaning

Escapes

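hello\\? matches “hello?”

The question mark is treated as a literal character. In a regex the escape is \?, but in an R string the backslash itself must be escaped first, so we write "\\?"

Most metacharacters inside a character set are stripped of their special nature (for example, [?] matches a literal question mark)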
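To see the difference between an escaped and an unescaped question mark, here is a small sketch. It assumes base R's grepl(), and the greetings vector is made up.

```r
greetings <- c("hello?", "hello", "hell")

# Escaped: the ? is a literal question mark
escaped <- grepl("hello\\?", greetings)

# Unescaped: the ? is a quantifier making the "o" optional
unescaped <- grepl("hello?", greetings)

# Inside a character set the ? is also literal
in_set <- grepl("hello[?]", greetings)

print(escaped)    # TRUE FALSE FALSE
print(unescaped)  # TRUE TRUE TRUE
print(in_set)     # TRUE FALSE FALSE
```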

Shorthand sequences

Refer to commonly-used character sets

\w  letters, digits, and underscore
\d  digits
\t  tab
\n  new line
\s  whitespace (space, tab, new line)
\b  word boundary

Predefined character classes help us avoid malformed character sets
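A short sketch of the shorthand sequences in R. It assumes base R with perl = TRUE; the input values are made up, and the backslashes are doubled because the patterns are written as R strings.

```r
x <- c("34", "(34)", "rat", "two words", "under_score")

has_digit <- grepl("\\d", x, perl = TRUE)  # contains a digit
has_space <- grepl("\\s", x, perl = TRUE)  # contains whitespace

# Remove every word character, leaving only what \w does not match
leftover <- gsub("\\w", "", "under_score!", perl = TRUE)

print(has_digit)  # TRUE TRUE FALSE FALSE FALSE
print(has_space)  # FALSE FALSE FALSE TRUE FALSE
print(leftover)   # "!"
```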

Word boundaries

\b

Match positions between a word character (letter, digit or underscore) and a non-word character (usually a space or the start/end of a string).

Before a sequence of word characters

\bcase matches “case” and “two cases” but not “suitcase”

After a sequence of word characters

org\b matches “cyborg rebellion” but not “organic”
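The word boundary examples can be reproduced with the sketch below, assuming base R's grepl() with perl = TRUE.

```r
# \b before "case": "case" must begin a word
before <- grepl("\\bcase", c("case", "two cases", "suitcase"), perl = TRUE)

# \b after "org": "org" must end a word
after <- grepl("org\\b", c("cyborg rebellion", "organic"), perl = TRUE)

print(before)  # TRUE TRUE FALSE
print(after)   # TRUE FALSE
```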

Word characters

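\w matches any single word character (letter, digit, or underscore). Useful in combination with word boundaries and quantifiers.


Equivalent to [a-zA-Z0-9_] for ASCII text (some engines, including the one behind stringr, also treat accented letters such as Ö as word characters)

\w matches letters (uppercase and lowercase), numbers, and underscores in: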

F33d.%.the_mÖusE pl@ase!!#

Practice

  • Enter “That atmospheric sensor is at the university” as the test string in a regex tester.

  • Explain the matches obtained with the following three regular expressions.

  1. at
  2. \bat
  3. at\b

Anchor, wildcard with quantifier (0 or more)

^can.* matches “canine”, “canadian”, and “canolioli”, but not “a canister”


Wildcard and quantifier

A.*x matches text that starts with “A” and ends with “x”


Shorthand sequence (space) and quantifier

\s{3} matches three consecutive whitespace characters (for example, three spaces)

Anchors, character set, and quantifier (one or more)

^[a-z]+$ matches a string made up entirely of lowercase letters

Word characters, quantifiers, word boundaries, and anchors

\w+\b$ matches the last word in a string

“Fix the car”
“12 eggs”

^\w+\b matches the first word in a string

“Fix the car.”
“12 eggs”
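Several of the combined patterns above can be tried in one go. This is a sketch under stated assumptions: it uses base R (grepl(), regexpr(), regmatches()) with perl = TRUE, and the input strings are taken from the examples.

```r
sentences <- c("Fix the car", "12 eggs")

# ^\w+\b : the run of word characters at the start of the string
first_word <- regmatches(sentences, regexpr("^\\w+\\b", sentences, perl = TRUE))

# \w+\b$ : the run of word characters at the very end of the string
last_word <- regmatches(sentences, regexpr("\\w+\\b$", sentences, perl = TRUE))

# ^[a-z]+$ : the whole string is lowercase letters only
all_lower <- grepl("^[a-z]+$", c("mice", "BABY", "2011_April"), perl = TRUE)

# ^can.* : "can" at the start, followed by anything
starts_can <- grepl("^can.*", c("canine", "a canister"), perl = TRUE)

print(first_word)  # "Fix" "12"
print(last_word)   # "car" "eggs"
print(all_lower)   # TRUE FALSE FALSE
print(starts_can)  # TRUE FALSE
```

If a string ends in punctuation, \w+\b$ finds nothing, because the end of the string no longer follows a word character.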


Modified from ‘Regular Expressions’ concept map by Greg Wilson

Regex in R

We can match column names and values in character strings with regular expressions

📦 stringr

Cohesive set of functions for string manipulation

  • Function names start with str_

  • All functions take a vector of strings as the first argument (pipe-friendly)

regex() modifier to control matching behavior

ignore_case=TRUE will make matches case insensitive

stringr examples

Matches?

str_detect(string = c("catalog", "battlecat", "apple"), 
           pattern = "cat")
[1]  TRUE  TRUE FALSE

Output is a logical vector (TRUE or FALSE) of the same length as our input vector

stringr examples

Which elements contain matches?

str_which(string = c("catalog", "battlecat", "apple"), 
          pattern = "cat")
[1] 1 2

Output is the index for each of the matching elements

stringr examples

Replacing matches

str_replace(string = c("colour", "neighbour", "honour"),
            pattern = "ou",
            replacement = "o")
[1] "color"    "neighbor" "honor"   

stringr examples

Case insensitive matching

str_replace(string = c("colOur", "neighboUr", "honOUr"),
            pattern = regex("ou", ignore_case = TRUE),   
            replacement = "o") 
[1] "color"    "neighbor" "honor"   

Demo - stringr

Let’s match these REs against the test vector below using str_detect. Can we explain the matches?

Regular expressions
1. ^dog
2. ^[a-z]+$
3. \d

test_vector <- c("Those dogs are small.","dogs and cats",
                 "34","(34)","rat","watchdog","placemat",
                 "BABY","2011_April","mice")

Using regular expressions in data manipulation


Select, subset, keep, or discard rows and columns

Substitute or recode values

Extract or remove substrings

Cleaning data with regex

To select variables with 📦 dplyr and tidyr, we can:

  • write out their names

  • refer to them by position

  • specify ranges of contiguous variables

  • use 📦 tidyselect helper functions

📦 tidyselect helpers


matches(): takes regular expressions, and selects variables that match a given pattern

starts_with(): Starts with a prefix

ends_with(): Ends with a suffix

contains(): Contains a literal string

Selecting columns by name

penguins data from 📦 palmerpenguins

names(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
penguins %>% 
  select(species, bill_length_mm, flipper_length_mm) %>% 
  sample_n(3)
# A tibble: 3 × 3
  species   bill_length_mm flipper_length_mm
  <fct>              <dbl>             <int>
1 Chinstrap           45.6               194
2 Gentoo              50.7               223
3 Gentoo              45.8               219

Selecting columns by matches in variable names

penguins %>% 
  select(species, matches("length")) %>% 
  sample_n(3)
# A tibble: 3 × 3
  species bill_length_mm flipper_length_mm
  <fct>            <dbl>             <int>
1 Adelie            42.8               195
2 Gentoo            44.9               212
3 Adelie            42                 190

Match values and filter rows

Mammals sleep dataset (msleep) from 📦 ggplot2

msleep %>% select(name,genus) %>% sample_n(4)
# A tibble: 4 × 2
  name            genus       
  <chr>           <chr>       
1 Gray seal       Haliochoerus
2 Tenrec          Tenrec      
3 Giraffe         Giraffa     
4 Giant armadillo Priodontes  

Match values and filter rows

Filter to keep rats only

msleep %>% 
  select(name,genus) %>% 
  filter(str_detect(string = name,pattern = "rat"))
# A tibble: 5 × 2
  name                      genus     
  <chr>                     <chr>     
1 African giant pouched rat Cricetomys
2 Round-tailed muskrat      Neofiber  
3 Laboratory rat            Rattus    
4 Cotton rat                Sigmodon  
5 Mole rat                  Spalax    

Practice

🐀 After running the code below, how can we exclude muskrats from the matches?

msleep %>% 
  select(name,genus) %>% 
  filter(str_detect(string = name,pattern = "rat"))

Wrap up

  • Regex == super helpful
  • We often don’t need to write REs from scratch
  • More regex features to explore (lookaheads, lookbehinds, capture groups, backreferences)

Thank you!

Questions? Comments?