Regular expressions and working with strings

R-Ladies St. Louis
February 22, 2023

Luis D. Verde Arregoitia

About: me

@LuisDVerde
@LuisDVA
liomys.mx
luis@liomys.mx

Mammals, conservation, macroecology
Evolution, ecomorphology
Phylogenetic Comparative Methods
Biogeography, R as a GIS
Certified trainer (posit::tidyverse & Software Carpentry)

What are Regular Expressions?

Also abbreviated as regex, R.E., or regexp (singular)
A concise language for describing patterns of text

Specially encoded 🧶strings of characters that match patterns in other text strings

Regular expressions

In practice, a computer language with its own terminology and syntax

Input as a text string that compiles into a mini program built specifically to identify a pattern.
Can be used to match, search, replace, or split text

Have you used regular expressions? If so, what for?

Possible patterns

“dog” but not “dogs”
“dogs” but only if the match is at the start of the string
digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
“modeling” or “modelling” (alternate spellings of the same word)
words ending in “at” (words that begin or end with a specific pattern)
strings that start with digits
dates

More possible patterns

zip codes
numbers inside square brackets and also the brackets
valid Twitter handles (start with @, no spaces or symbols, <16 characters)
UPPERCASE words

Why Learn Regular Expressions?

Underutilized, valuable skill
Save time when we need to match, extract, or transform text by matching words, characters, and patterns
Regular expressions can often replace dozens of lines of code
Not specific to any particular programming language

What we need

strings
patterns

Generally, we describe patterns and look for them in input strings

🧶Strings

A collection of characters that make up one element of a vector:

test_string <- "This sentence is a string."

We can store multiple strings in a character vector:

pets <-  c("dogs","cat","parrot","pig")

Strings

Names, row names, column names, and values in a data frame can also be strings:

drink	price
Coffee	3.50
Tea	2.99
Juice	3.20

Note: strings in R are case sensitive

Uppercase and lowercase letters are treated as distinct

"rat"=="rat"

[1] TRUE

"rat"=="rAT"

[1] FALSE

Regex searches are case sensitive by default

Our options:

Build case-insensitive regular expressions
Turn off case sensitivity in the matching
Modify the input text before matching
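The three options above could look roughly like this with stringr, the package introduced later in these slides. The `animals` vector is made up for illustration and is not part of the original material.

```r
library(stringr)

# Hypothetical input, invented for this sketch
animals <- c("Dog", "dog", "DOG", "cat")

# Option 1: build case insensitivity into the pattern with character sets
opt1 <- str_detect(animals, "[Dd][Oo][Gg]")

# Option 2: turn off case sensitivity in the matching with regex()
opt2 <- str_detect(animals, regex("dog", ignore_case = TRUE))

# Option 3: modify the input text (lowercase it) before matching
opt3 <- str_detect(str_to_lower(animals), "dog")

opt1  # TRUE TRUE TRUE FALSE, and the same for opt2 and opt3
```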

Regex - Getting started

Describe a pattern
Provide an input string
Feed into a function (for matching, replacing, splitting, etc.)

The matching is done by the regex engine. We do not access the engine directly, but functions that take regular expressions as arguments call on it whenever it is needed.
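As a minimal sketch of these three steps, the same pattern can be handed to different stringr functions to match, replace, or split. The input sentence and the variable names are assumptions made up for this example.

```r
library(stringr)

pattern <- " and "                   # 1. describe a pattern
input <- "cats and dogs and mice"    # 2. provide an input string (made up)

# 3. feed both into a function; each call hands the pattern to the regex engine
has_match <- str_detect(input, pattern)              # TRUE
replaced <- str_replace_all(input, pattern, " or ")  # "cats or dogs or mice"
pieces <- str_split(input, pattern)[[1]]             # "cats" "dogs" "mice"
```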

Getting started

Regexp

dog

Input string

“The dog is fat.”

Match

The dog is fat.

To search for a specific sequence of characters, the regexp (regular expression) we need is simply that sequence of characters

Find a d followed by an o followed by a g (all lowercase, all together, and in that order when read from left to right).
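The same literal match can be tried in R. This is only a sketch using stringr functions (covered later in the slides) on the input string shown above.

```r
library(stringr)

input <- "The dog is fat."

found <- str_detect(input, "dog")    # TRUE: d, o, g occur together and in order
matched <- str_extract(input, "dog") # "dog": the text that was matched
loc <- str_locate(input, "dog")      # matrix with start = 5 and end = 7
upper <- str_detect(input, "Dog")    # FALSE: the match is case sensitive
```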

Online regex testers

Is the regular expression matching anything?

rubular by Michael Lovitt
regex101 by Firas Dib
regexr by Grant Skinner
regexpal - part of Dan’s Developer Tools

Regex Testers

Practice - Regex

Navigate to rubular, regex101, regexr, or regexpal

Input the following text into the test string field:

cat
hat
CAT
Manhattan
MOUSE
housekeeping

Let’s try different regexes in the pattern box to see what happens

Literal Characters and Metacharacters

Strings and regular expressions are made of characters

For regular expressions, characters can be grouped into two classes depending on their behavior

Literal characters

If input string is “dog” and regex is dog

We will get a match whenever the characters d, o, and g occur consecutively in the input text

"dog" tells the regex engine: find a d, immediately followed by an o, immediately followed by a g together and in that particular order

Literal characters (continued)

d, o and g are examples of literal characters

They stand for exactly what they are: d in the regex matches a “d” in the input text, o matches an “o” in text, and so on.

The power and flexibility of regular expressions comes from their ability to describe more complex patterns.

If a text pattern can be described verbally, we can most likely write a regular expression to match it.

Metacharacters

To match more than literal strings of text, we use a small subset of characters that have special functions when they appear in a regular expression.

Metacharacters do not stand for themselves; they are interpreted in a special way.

Metacharacters include: []^$.|?*+(), which are reserved for unique matching purposes.

Wildcards

Stand in for unknown characters

. matches any single character

f..l matches “fill” and “fool”, but not “flail”

.top matches “stop” and “isotope”, but not “topple”
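A quick way to check the wildcard examples above is to run them through `str_detect()` from stringr. This is a small sketch, not part of the original slides.

```r
library(stringr)

# Each dot stands in for exactly one character
wild1 <- str_detect(c("fill", "fool", "flail"), "f..l")     # TRUE TRUE FALSE
wild2 <- str_detect(c("stop", "isotope", "topple"), ".top") # TRUE TRUE FALSE

# The dot needs a character to match, so the match in "isotope" is "otop"
part <- str_extract("isotope", ".top")                      # "otop"
```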

Character sets

Match any one of the characters enclosed in square brackets

[ ] match a set of characters

[cb]at matches “cat”, and “bat”, but not “rat”

[CK]at[iey] matches “Caty”, “Kati”, “Katy”, and “Kate”
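The character set examples can be checked the same way. In this sketch, "Kato" is an extra made-up string added to show a non-match.

```r
library(stringr)

# One character from the set, followed by "at"
set1 <- str_detect(c("cat", "bat", "rat"), "[cb]at")  # TRUE TRUE FALSE

# Two sets in one pattern; "Kato" fails because "o" is not in [iey]
set2 <- str_detect(c("Caty", "Kati", "Katy", "Kate", "Kato"), "[CK]at[iey]")
set2  # TRUE TRUE TRUE TRUE FALSE
```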

Negation tokens

[^] match characters not in the specified character set

^ must be the first character inside the brackets

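[^aoeiou] matches any single character that is not a lowercase vowel (consonants, but also digits, spaces, and uppercase letters)

[^R] matches any single character except capital R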
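The sketch below, with made-up inputs, shows what negated sets pick up. A negated set matches one character at a time and accepts anything outside the set, not only letters.

```r
library(stringr)

# Everything that is not a lowercase vowel, including the space and digits
not_vowel <- str_extract_all("Regex 101", "[^aoeiou]")[[1]]
not_vowel  # "R" "g" "x" " " "1" "0" "1"

# TRUE only when the string contains at least one character other than "R"
not_r <- str_detect(c("R", "r", "RR", "R2"), "[^R]")
not_r  # FALSE TRUE FALSE TRUE
```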

Character ranges

Indicate a series of sequential characters inside character sets

Dash - inside a character set abbreviates alphabetical or numeric sequences

[A-D] matches any single letter from A,B,C, or D (uppercase)

[5-8] matches any single digit between 5 and 8

[A-Za-z] matches any single unaccented letter, uppercase or lowercase

character sets are case sensitive
character ranges can also be negated with ^
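To see ranges in action, the following sketch extracts matching characters from an invented string (the seat label is an assumption, not from the slides).

```r
library(stringr)

seat <- "Seat C6, row 12"  # hypothetical input

# "S" is uppercase but falls outside the A to D range
letters_ad <- str_extract_all(seat, "[A-D]")[[1]]        # "C"

# Only digits from 5 to 8 are kept
digits_58 <- str_extract_all(seat, "[5-8]")[[1]]         # "6"

# A negated range: anything that is not an unaccented letter
non_letters <- str_extract_all("Seat C6", "[^A-Za-z]")[[1]]  # " " "6"
```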

Anchors

Specify the relative position of the pattern being matched

^ starts with

$ ends with
note: ^ outside a pair of square brackets is an anchor

^mil matches “milkshake” but not “family”

ing$ matches “going” but not “ingest”
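The two anchor examples above can be verified with a short stringr sketch:

```r
library(stringr)

# ^ ties the pattern to the start of the string
starts_mil <- str_detect(c("milkshake", "family"), "^mil")  # TRUE FALSE

# $ ties the pattern to the end of the string
ends_ing <- str_detect(c("going", "ingest"), "ing$")        # TRUE FALSE
```

Without the anchors, both "mil" and "ing" would be found in all four words.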

Practice

Write a regexp that can match tender, timber, tailor, and taller
Match possible misspellings of ‘herbivore’ using character sets

Practice

Which of these regular expressions matches food at the beginning of a string?

^food

food

$food

food^

Quantifiers

Specify how many times a character or character class must appear in the input for a match to be found

?   Zero or one
*   Zero or more occurrences
+   One or more occurrences
{n}   Exactly n occurrences

quantifiers apply to the preceding character, character set, or group

Quantifiers

modell?ing matches “modeling” and “modelling”

zero or one els (l)

ya*y! matches “yy!”, “yay!”, “yaaay!”, “yaaaaaay!”, etc.

zero or more aes (a)

Quantifiers

no+ matches “no”, “nooo”, “noooooo”, etc, but not “n”

one or more oes (o)

e{2} matches “keep” and “bee” but not “meat”

exactly two ees (e)
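The four quantifier examples above can be checked together. This sketch adds a few made-up non-matching strings ("modellling", "yo!") to show where each pattern stops matching.

```r
library(stringr)

# ? : zero or one "l" after "model"
q_optional <- str_detect(c("modeling", "modelling", "modellling"), "modell?ing")
q_optional  # TRUE TRUE FALSE

# * : zero or more "a"
q_star <- str_detect(c("yy!", "yay!", "yaaay!", "yo!"), "ya*y!")
q_star      # TRUE TRUE TRUE FALSE

# + : one or more "o"
q_plus <- str_detect(c("n", "no", "nooo"), "no+")
q_plus      # FALSE TRUE TRUE

# {2} : exactly two consecutive "e"
q_exact <- str_detect(c("keep", "bee", "meat"), "e{2}")
q_exact     # TRUE TRUE FALSE
```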

Quiz

Use a quantifier to match cute, cuuute, cuuuuuute, and cuuuuuuuuute
How can we match Computer, computer, Computers, and computers?

[cC]omputers?

Computers+

[cC]omputer[s]+

Alternation

Alternation tokens separate a series of alternatives

| either or

dog|bird matches “dog” or “bird”

gr(a|e)y matches “gray” and “grey”

note the alternation enclosed in parentheses
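A short sketch of alternation in stringr, using made-up inputs, shows why grouping the alternatives matters: without it, the `|` splits the whole pattern in two.

```r
library(stringr)

# Grouped alternation: "gr", then "a" or "e", then "y"
grouped <- str_detect(c("gray", "grey", "groy"), "gr(a|e)y")  # TRUE TRUE FALSE

# Ungrouped, the pattern means "gra" OR "ey", so "hockey" matches
ungrouped <- str_detect("hockey", "gra|ey")                   # TRUE
with_group <- str_detect("hockey", "gr(a|e)y")                # FALSE

# The leftmost alternative found in the input is returned first
first_pet <- str_extract("a bird and a dog", "dog|bird")      # "bird"
```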

Special sequences and escapes

\ signals a shorthand sequence or gives special characters a literal meaning

Escapes

hello\\? matches “hello?”

question mark treated as a literal character; in an R string the backslash itself must be escaped, so the regex hello\? is typed as "hello\\?"

Most metacharacters lose their special meaning inside a character set
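The sketch below compares the escaped, unescaped, and character-set versions of the pattern on three made-up strings.

```r
library(stringr)

greetings <- c("hello?", "hello", "hell")  # hypothetical input

# Escaped: the question mark must literally be present
escaped <- str_detect(greetings, "hello\\?")  # TRUE FALSE FALSE

# Unescaped: ? is a quantifier, so this means "hell" plus an optional "o"
unescaped <- str_detect(greetings, "hello?")  # TRUE TRUE TRUE

# Inside a character set, ? is treated as a literal character
in_set <- str_detect(greetings, "hello[?]")   # TRUE FALSE FALSE
```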

Shorthand sequences

Refer to commonly-used character sets

\w letters, underscore, and numbers
\d digits
\t tab
\n new line
\s whitespace (space, tab, or new line)
\b word boundary

Predefined character classes help us avoid malformed character sets
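In R strings each shorthand needs a doubled backslash. The following is a small sketch with invented inputs, using stringr functions that are introduced later in the slides.

```r
library(stringr)

label <- "Order 66, aisle_9"  # hypothetical input

digits <- str_extract_all(label, "\\d")[[1]]   # "6" "6" "9"
words <- str_extract_all(label, "\\w+")[[1]]   # "Order" "66" "aisle_9"

# \s counts the space, the tab, and the new line
n_space <- str_count("a b\tc\nd", "\\s")       # 3
```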

Word boundaries

Match positions between a word character (letter, digit or underscore) and a non-word character (usually a space or the start/end of a string).

Before a sequence of word characters

\bcase matches “case” and “two cases” but not “suitcase”

After a sequence of word characters

org\b matches “cyborg rebellion” but not “organic”
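Both word boundary examples can be checked with `str_detect()`. Note the doubled backslash needed inside an R string.

```r
library(stringr)

# Boundary before the pattern: "case" must begin a word
wb_before <- str_detect(c("case", "two cases", "suitcase"), "\\bcase")
wb_before  # TRUE TRUE FALSE

# Boundary after the pattern: "org" must end a word
wb_after <- str_detect(c("cyborg rebellion", "organic"), "org\\b")
wb_after   # TRUE FALSE
```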

Word characters

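\w matches any word character (letter, digit, or underscore). Useful in combination with word boundaries and quantifiers.

Equivalent to [a-zA-Z0-9_] for unaccented text; some engines, including the one stringr uses, also count accented letters such as Ö as word characters

\w matches letters (uppercase and lowercase), numbers, and underscores in: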

F33d.%.the_mÖusE pl@ase!!#

Practice

Enter “That atmospheric sensor is at the university” as the test string in a regex tester.
Explain the matches obtained with the following three regular expressions.

at
\bat
at\b

Anchor, wildcard with quantifier (0 or more)

^can.* matches “canine”, “canadian”, and “canolioli”, but not “a canister”

Wildcard and quantifier

A.*x matches an “A” followed by any characters up to the last “x”; add anchors (^A.*x$) to require strings that start with “A” and end with “x”

Shorthand sequence (space) and quantifier

\s{3} matches three consecutive whitespace characters (for example, three spaces)

Anchors, character set, and quantifier (one or more)

^[a-z]+$ matches a string made up entirely of lowercase letters

Word characters, quantifiers, word boundaries, and anchors

\w+\b$ matches the last word in a string

“Fix the car”
“12 eggs”

^\w+\b matches the first word in a string

“Fix the car.”
“12 eggs”
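Several of the combined patterns above can be tried with stringr. This is a sketch only: the strings "Rat", "rat", and "two rats" are made up, and the rest come from the examples above.

```r
library(stringr)

# First word: anchored to the start of the string
first_word <- str_extract(c("Fix the car.", "12 eggs"), "^\\w+\\b")  # "Fix" "12"

# Last word: anchored to the end of the string
last_word <- str_extract(c("Fix the car", "12 eggs"), "\\w+\\b$")    # "car" "eggs"

# A trailing period is not a word character, so nothing can sit right before $
with_period <- str_extract("Fix the car.", "\\w+\\b$")               # NA

# Whole string must be lowercase letters
lower_only <- str_detect(c("rat", "Rat", "two rats"), "^[a-z]+$")    # TRUE FALSE FALSE

# Anchor plus wildcard with quantifier
starts_can <- str_detect(c("canine", "a canister"), "^can.*")        # TRUE FALSE
```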

Modified from ‘Regular Expressions’ concept map by Greg Wilson

Regex in R

We can match column names and values in character strings with regular expressions

📦 `stringr`

Cohesive set of functions for string manipulation

Function names start with str_
All functions take a vector of strings as the first argument (pipe-friendly)

regex() modifier to control matching behavior

ignore_case=TRUE will make matches case insensitive

`stringr` examples

Matches?

library(stringr)

str_detect(string = c("catalog", "battlecat", "apple"),
           pattern = "cat")

[1]  TRUE  TRUE FALSE

Output is a logical vector (TRUE or FALSE) of the same length as the input vector

`stringr` examples

Which elements contain matches?

str_which(string = c("catalog", "battlecat", "apple"), 
          pattern = "cat")

[1] 1 2

Output is the index for each of the matching elements

`stringr` examples

Replacing matches

str_replace(string = c("colour", "neighbour", "honour"),
            pattern = "ou",
            replacement = "o")

[1] "color"    "neighbor" "honor"

`stringr` examples

Case insensitive matching

str_replace(string = c("colOur", "neighboUr", "honOUr"),
            pattern = regex("ou", ignore_case = TRUE),   
            replacement = "o")

[1] "color"    "neighbor" "honor"

Demo - `stringr`

Let’s match these REs against the test vector below using str_detect. Can we explain the matches?

Regular expressions
1. ^dog
2. ^[a-z]+$
3. \d

test_vector <- c("Those dogs are small.","dogs and cats",
                 "34","(34)","rat","watchdog","placemat",
                 "BABY","2011_April","mice")

Using regular expressions in data manipulation

Select, subset, keep, or discard rows and columns

Substitute or recode values

Extract or remove substrings

Cleaning data with regex

To select variables with 📦 dplyr and tidyr, we can:

write out their names
refer to them by position
specify ranges of contiguous variables
use 📦 tidyselect helper functions

📦 `tidyselect` helpers

matches(): takes regular expressions, and selects variables that match a given pattern

starts_with(): Starts with a prefix

ends_with(): Ends with a suffix

contains(): Contains a literal string
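The four helpers can be compared side by side on a toy data frame. In this sketch the column names are borrowed from the penguins data used in the next slides, the single row of values is made up, and the helpers are called through dplyr, which re-exports them from tidyselect.

```r
library(dplyr)

# Toy data frame: one invented row, column names as in palmerpenguins
df <- data.frame(species = "Adelie",
                 bill_length_mm = 39.1,
                 bill_depth_mm = 18.7,
                 flipper_length_mm = 181L,
                 body_mass_g = 3750L)

by_prefix <- names(select(df, starts_with("bill")))   # bill_length_mm, bill_depth_mm
by_suffix <- names(select(df, ends_with("_mm")))      # the three *_mm columns
by_literal <- names(select(df, contains("length")))   # bill_length_mm, flipper_length_mm

# Only matches() interprets its argument as a regular expression
by_regex <- names(select(df, matches("^b.*_(mm|g)$")))
by_regex  # "bill_length_mm" "bill_depth_mm" "body_mass_g"
```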

Selecting columns by name

penguins data from 📦 palmerpenguins

names(penguins)

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

penguins %>% 
  select(species, bill_length_mm, flipper_length_mm) %>% 
  sample_n(3)

# A tibble: 3 × 3
  species   bill_length_mm flipper_length_mm
  <fct>              <dbl>             <int>
1 Chinstrap           45.6               194
2 Gentoo              50.7               223
3 Gentoo              45.8               219

Selecting columns by matches in variable names

penguins %>% 
  select(species, matches("length")) %>% 
  sample_n(3)

# A tibble: 3 × 3
  species bill_length_mm flipper_length_mm
  <fct>            <dbl>             <int>
1 Adelie            42.8               195
2 Gentoo            44.9               212
3 Adelie            42                 190

Match values and filter rows

Mammals sleep dataset (msleep) from 📦 ggplot2

msleep %>% select(name,genus) %>% sample_n(4)

# A tibble: 4 × 2
  name            genus       
  <chr>           <chr>       
1 Gray seal       Haliochoerus
2 Tenrec          Tenrec      
3 Giraffe         Giraffa     
4 Giant armadillo Priodontes

Match values and filter rows

Filter to keep rats only

msleep %>% 
  select(name,genus) %>% 
  filter(str_detect(string = name,pattern = "rat"))

# A tibble: 5 × 2
  name                      genus     
  <chr>                     <chr>     
1 African giant pouched rat Cricetomys
2 Round-tailed muskrat      Neofiber  
3 Laboratory rat            Rattus    
4 Cotton rat                Sigmodon  
5 Mole rat                  Spalax

Practice

🐀 After running the code below, how can we exclude muskrats from the matches?

msleep %>% 
  select(name,genus) %>% 
  filter(str_detect(string = name,pattern = "rat"))

Wrap up

Regex == super helpful
We often don’t need to write REs from scratch
More regex features to explore (lookaheads, lookbehinds, capture groups, backreferences)

Thank you!

Questions? Comments?
