R-Ladies St. Louis
February 22, 2023
tidyverse
& Software Carpentry)
Also abbreviated as regex, R.E., or regexp (singular)
A concise language for describing patterns of text
In practice, a computer language with its own terminologies and syntax
Input as a text string that compiles into a mini program built specifically to identify a pattern.
Can be used to match, search, replace, or split text
“dog” but not “dogs”
“dogs” but only if the match is at the start of the string
digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
“modeling” or “modelling” (alternate spellings of the same word)
words ending in “at” (words that begin or end with a specific pattern)
strings that start with digits
dates
zip codes
numbers inside square brackets and also the brackets
valid Twitter handles (start with @, no spaces or symbols, <16 characters)
UPPERCASE words
Underutilized, valuable skill
Save time when we need to match, extract, or transform text by matching words, characters, and patterns
Regular expressions can often replace dozens of lines of code
Not specific to any particular programming language
Generally, we describe patterns and look for them in input strings
A collection of characters that make up one element of a vector:
We can store multiple strings in a character vector:
Names, row names, column names, and values in a data frame can also be strings:
drink | price |
---|---|
Coffee | 3.50 |
Tea | 2.99 |
Juice | 3.20 |
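The three ideas above can be sketched in a few lines of base R. The object names (`drink`, `drinks`, `menu`) are made up for illustration; the values come from the table above.

```r
# A single string: a character vector of length one
drink <- "Coffee"

# Several strings stored together in one character vector
drinks <- c("Coffee", "Tea", "Juice")

# Column names and the values of character columns are strings too
menu <- data.frame(
  drink = drinks,
  price = c(3.50, 2.99, 3.20),
  stringsAsFactors = FALSE
)

is.character(drink)       # is it a string?
length(drinks)            # number of strings in the vector
names(menu)               # column names are strings
is.character(menu$drink)  # so are the values in the drink column
```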
Uppercase and lowercase letters are treated as distinct
Our options:
Build case-insensitive regular expressions
Turn off case sensitivity in the matching
Modify the input text before matching
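A minimal sketch of the three options, assuming the stringr package is installed. The `animals` vector is a made-up example.

```r
library(stringr)

animals <- c("cat", "CAT", "Cat", "hat")

# Default behavior: matching is case sensitive
default_match <- str_detect(animals, "cat")

# Option 1: build case insensitivity into the pattern with character sets
set_match <- str_detect(animals, "[Cc][Aa][Tt]")

# Option 2: turn off case sensitivity in the matching
ignore_match <- str_detect(animals, regex("cat", ignore_case = TRUE))

# Option 3: modify the input text before matching
lower_match <- str_detect(str_to_lower(animals), "cat")

default_match
set_match
```

Only the first element matches by default; each of the three options also matches "CAT" and "Cat".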
The matching is done by the regex engine. We do not access the engine directly, but functions that take regular expressions as arguments call on it whenever it's needed.
dog
“The dog is fat.”
The dog is fat.
To search for a specific sequence of characters, the regexp (regular expression) we need is simply that sequence of characters
Find a d followed by an o followed by a g (all lowercase, all together, and in that order when read from left to right).
Is the regular expression matching anything?
Navigate to rubular, regex101, regexr, or regexpal
Input the following text into the test string field:
cat
hat
CAT
Manhattan
MOUSE
housekeeping
Let’s try different regexes in the pattern box to see what happens
Strings and regular expressions are made of characters
For regular expressions, characters can be grouped into two classes depending on their behavior
If the input string is “dog” and the regex is dog
We will get a match whenever the characters d, o, and g occur consecutively in the input text
"dog" tells the regex engine: find a d, immediately followed by an o, immediately followed by a g, together and in that particular order
d, o and g are examples of literal characters
They stand for exactly what they are: d in the regex matches a “d” in the input text, o matches an “o” in text, and so on.
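Literal matching can be tried directly in R. This is a small sketch assuming stringr is installed; it reuses the example sentence and the test strings from the earlier slides.

```r
library(stringr)

sentence <- "The dog is fat."

# Is there a d immediately followed by an o and then a g?
has_dog <- str_detect(sentence, "dog")

# Where does the match start and end? (character positions)
dog_pos <- str_locate(sentence, "dog")

# The literal pattern "hat" against the regex tester strings
words <- c("cat", "hat", "CAT", "Manhattan", "MOUSE", "housekeeping")
has_hat <- str_detect(words, "hat")

has_dog
dog_pos
words[has_hat]
```

Note that "Manhattan" matches because the letters h, a, t appear together inside it.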
The power and flexibility of regular expressions comes from their ability to describe more complex patterns.
If a text pattern can be described verbally, we can most likely write a regular expression to match it.
To match more than literal strings of text, we use metacharacters: a small subset of characters that have special functions when they appear in a regular expression.
Metacharacters do not stand for themselves; they are interpreted in a special way.
Metacharacters include: []^$.|?*+(), which are reserved for unique matching purposes.
Stand in for unknown characters
.
match any character once
f..l matches “fill” and “fool”, but not “flail”
.top matches “stop” and “isotope”, but not “topple”
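The wildcard examples above can be checked with `str_detect()`, assuming stringr is installed:

```r
library(stringr)

# f..l: an f, any two characters, then an l
fl <- str_detect(c("fill", "fool", "flail"), "f..l")

# .top: any one character followed by "top"
top <- str_detect(c("stop", "isotope", "topple"), ".top")

# Which part of "isotope" is matched?
top_piece <- str_extract("isotope", ".top")

fl
top
top_piece
```

"topple" fails because nothing comes before "top", and the dot must match exactly one character.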
Match any one of the characters enclosed in square brackets
[ ]
match a set of characters
[cb]at matches “cat”, and “bat”, but not “rat”
[CK]at[iey] matches “Caty”, “Kati”, “Katy”, and “Kate”
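[^] matches any character not in the specified character set
^ must be the first character inside the brackets
[^aoeiou] matches any character that is not a lowercase vowel (consonants, but also digits, spaces, and uppercase letters)
[^R] matches any character except capital R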
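A short sketch of character sets and negated sets, assuming stringr is installed. The extra test strings ("Kato", "regex 101", "R-Ladies") are added here for illustration.

```r
library(stringr)

# [cb]at: a c or a b, followed by "at"
cb <- str_detect(c("cat", "bat", "rat"), "[cb]at")

# [CK]at[iey]: one character from each set, with "at" in between
ck <- str_detect(c("Caty", "Kati", "Katy", "Kate", "Kato"), "[CK]at[iey]")

# [^aoeiou]: every character that is NOT a lowercase vowel
not_vowel <- str_extract_all("regex 101", "[^aoeiou]")[[1]]

# [^R]: remove everything except capital R
only_r <- str_remove_all("R-Ladies", "[^R]")

cb
ck
not_vowel
only_r
```

The negated set keeps the space and the digits, which shows that it matches more than consonants.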
Indicate a series of sequential characters inside character sets
Dash - inside a character set abbreviates alphabetical or numeric sequences
[A-D] matches any single uppercase letter from A, B, C, or D
[5-8] matches any single digit between 5 and 8
[A-Za-z] matches all alphabetical characters
character sets are case sensitive
character ranges can also be negated with ^
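Ranges can be explored the same way. A minimal sketch, assuming stringr is installed; the input strings are invented for illustration.

```r
library(stringr)

# [A-D]: one uppercase letter from A to D (case sensitive)
a_to_d <- str_detect(c("A", "D", "E", "a"), "[A-D]")

# [5-8]: pull out every digit between 5 and 8
five_to_eight <- str_extract_all("Room 4578, floor 9", "[5-8]")[[1]]

# [A-Za-z]: every unaccented letter, upper or lower case
letters_only <- str_extract_all("R-Ladies 2023", "[A-Za-z]")[[1]]

# Negated range: drop everything that is not a digit
digits_only <- str_remove_all("Feb 22, 2023", "[^0-9]")

a_to_d
five_to_eight
letters_only
digits_only
```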
Specify the relative position of the pattern being matched
^ starts with
$ ends with
note: ^ outside a pair of square brackets is an anchor
^mil matches “milkshake” but not “family”
ing$ matches “going” but not “ingest”
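The anchor examples above, sketched with stringr (assumed installed). The unanchored comparison is added to show what the anchor changes.

```r
library(stringr)

# ^mil: "mil" only at the start of the string
starts_mil <- str_detect(c("milkshake", "family"), "^mil")

# Without the anchor, "mil" is found anywhere in the string
any_mil <- str_detect(c("milkshake", "family"), "mil")

# ing$: "ing" only at the end of the string
ends_ing <- str_detect(c("going", "ingest"), "ing$")

starts_mil
any_mil
ends_ing
```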
Write a regexp that can match tender, timber, tailor, and taller
Match possible misspellings of ‘herbivore’ using character sets?
Match “food” at the beginning of a string?
Specify how many times a character or character class must appear in the input for a match to be found
? Zero or one
* Zero or more occurrences
+ One or more occurrences
{} Exactly the specified number of occurrences
quantifiers apply to the preceding character
modell?ing matches “modeling” and “modelling”
zero or one els (l)
ya*y! matches “yy!”, “yay!”, “yaaay!”, “yaaaaaay!”, etc.
zero or more aes (a)
no+ matches “no”, “nooo”, “noooooo”, etc., but not “n”
one or more oes (o)
e{2} matches “keep” and “bee” but not “meat”
exactly two ees (e)
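Each quantifier above can be checked against its examples. A small sketch assuming stringr is installed; "modellling" and "yoy!" are extra non-matching strings added for contrast.

```r
library(stringr)

# ?  zero or one of the preceding character
q_optional <- str_detect(c("modeling", "modelling", "modellling"), "modell?ing")

# *  zero or more
q_star <- str_detect(c("yy!", "yay!", "yaaay!", "yoy!"), "ya*y!")

# +  one or more
q_plus <- str_detect(c("no", "nooo", "n"), "no+")

# {2}  exactly two in a row
q_exact <- str_detect(c("keep", "bee", "meat"), "e{2}")

q_optional
q_star
q_plus
q_exact
```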
Use a quantifier to match cute, cuuute, cuuuuuute, and cuuuuuuuuute
How can we match Computer, computer, Computers, and computers?
Alternation tokens separate a series of alternatives
| either or
dog|bird matches “dog” or “bird”
gr(a|e)y matches “gray” and “grey”
note the alternation enclosed in parentheses
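A sketch of alternation with and without grouping, assuming stringr is installed. The strings "cat", "groy", and "graph" are added here to show non-matches and the effect of the parentheses.

```r
library(stringr)

# dog|bird: either alternative
either <- str_detect(c("dog", "bird", "cat"), "dog|bird")

# gr(a|e)y: the parentheses limit the alternation to one letter
grouped <- str_detect(c("gray", "grey", "groy"), "gr(a|e)y")

# Without parentheses the pattern means "gra" OR "ey"
ungrouped_graph <- str_detect("graph", "gra|ey")
grouped_graph <- str_detect("graph", "gr(a|e)y")

either
grouped
ungrouped_graph
grouped_graph
```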
\ signals a shorthand sequence or gives special characters a literal meaning
hello\\? matches “hello?”
question mark treated as a literal character, but in R escape the backslash first
Most metacharacters inside a character set are stripped of their special nature, so hello[?] also matches “hello?”
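The difference between an escaped and an unescaped question mark can be seen in a short sketch (stringr assumed installed). The input "hell" is added to show what the unescaped pattern does.

```r
library(stringr)

greetings <- c("hello?", "hello")

# Escaped: \\? in an R string becomes \? in the regex, a literal question mark
escaped <- str_detect(greetings, "hello\\?")

# Unescaped: ? is a quantifier, so the final o becomes optional
unescaped_hell <- str_detect("hell", "hello?")

# Inside a character set the question mark is already literal
in_set <- str_detect(greetings, "hello[?]")

escaped
unescaped_hell
in_set
```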
Refer to commonly-used character sets
\w letters, underscore, and numbers
\d digits
\t tab
\n new line
\s whitespace (space, tab, or new line)
\b word boundary
Predefined character classes help us avoid malformed character sets
\b
Match positions between a word character (letter, digit or underscore) and a non-word character (usually a space or the start/end of a string).
Before a sequence of word characters
\bcase matches “case” and “two cases” but not “suitcase”
After a sequence of word characters
org\b matches “cyborg rebellion” but not “organic”
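Shorthand classes and word boundaries need a doubled backslash when written as R strings. A minimal sketch assuming stringr is installed; the "12 eggs, 3 cups" string is made up.

```r
library(stringr)

# \\bcase: "case" at the start of a word
case_start <- str_detect(c("case", "two cases", "suitcase"), "\\bcase")

# org\\b: "org" at the end of a word
org_end <- str_detect(c("cyborg rebellion", "organic"), "org\\b")

# \\d+: runs of one or more digits
numbers <- str_extract_all("12 eggs, 3 cups", "\\d+")[[1]]

case_start
org_end
numbers
```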
\w matches any word character (letter, digit, or underscore). Useful in combination with word boundaries and quantifiers.
For unaccented (ASCII) text, equivalent to [a-zA-Z0-9_]
\w matches letters (upper- and lowercase), numbers, and underscores in:
F33d.%.the_mÖusE pl@ase!!#
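A sketch of what `\w` picks out of the string above, assuming stringr is installed. The Ö is written as the escape `\u00d6` so the code does not depend on file encoding. With stringr's regex engine (ICU), `\w` also matches accented letters, which the explicit ASCII set does not.

```r
library(stringr)

x <- "F33d.%.the_m\u00d6usE pl@ase!!#"

# Runs of word characters: letters, digits, and underscores
word_runs <- str_extract_all(x, "\\w+")[[1]]

# The explicit ASCII set stops at the accented letter
ascii_runs <- str_extract_all(x, "[a-zA-Z0-9_]+")[[1]]

word_runs
ascii_runs
```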
Enter “That atmospheric sensor is at the university” as the test string in a regex tester.
Explain the matches obtained with the following three regular expressions?
^can.* matches “canine”, “canadian”, and “canolioli”, but not “a canister”
A.*x matches text that starts with “A” and ends with “x”
\s{3} matches three spaces
^[a-z]+$ matches a string made up entirely of lowercase letters
\w+\b$ matches the last word in a string
“Fix the car”
“12 eggs”
^\w+\b matches the first word in a string
“Fix the car.”
“12 eggs”
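The combined patterns above can be run as follows, assuming stringr is installed. The sentence used for `A.*x` and the strings for `\s{3}` and `^[a-z]+$` are invented for illustration.

```r
library(stringr)

# ^can.*: strings that start with "can"
can_start <- str_detect(c("canine", "canadian", "canolioli", "a canister"), "^can.*")

# A.*x: from an A to the last x that follows it (the * is greedy)
a_to_x <- str_extract("Alex fixed the box today", "A.*x")

# \\s{3}: three whitespace characters in a row
three_spaces <- str_detect(c("a   b", "a b"), "\\s{3}")

# ^[a-z]+$: only lowercase letters from start to end
all_lower <- str_detect(c("lowercase", "Lowercase", "lower case"), "^[a-z]+$")

# Last and first word in a string
last_word <- str_extract(c("Fix the car", "12 eggs"), "\\w+\\b$")
first_word <- str_extract(c("Fix the car.", "12 eggs"), "^\\w+\\b")

a_to_x
last_word
first_word
```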
Modified from ‘Regular Expressions’ concept map by Greg Wilson
We can match column names and values in character strings with regular expressions
stringr
Cohesive set of functions for string manipulation
Function names start with str_
All functions take a vector of strings as the first argument (pipe-friendly)
regex() modifier to control matching behavior
ignore_case=TRUE will make matches case insensitive
stringr examples
Matches?
Output is a logical vector (TRUE or FALSE) of the same length as our input vector
stringr examples
Which elements contain matches?
Output is the index of each matching element
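The original slide code is not shown here, so this is a stand-in sketch of the two functions these descriptions fit, `str_detect()` and `str_which()`, with `str_subset()` added for comparison. It assumes stringr is installed and uses a made-up `snacks` vector.

```r
library(stringr)

snacks <- c("apple", "banana", "pear", "pineapple")

# Matches? One TRUE or FALSE per element
detected <- str_detect(snacks, "apple")

# Which elements contain matches? Their positions in the vector
positions <- str_which(snacks, "apple")

# The matching elements themselves
matching <- str_subset(snacks, "apple")

detected
positions
matching
```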
stringr examples
Replacing matches
stringr examples
Case insensitive matching
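The original slide code is not shown here. As a stand-in, this sketch replaces matches with `str_replace()` and `str_replace_all()`, then repeats the replacement with `regex(ignore_case = TRUE)`. It assumes stringr is installed and uses a made-up `pets` vector.

```r
library(stringr)

pets <- c("Dog days", "dog and DOG", "cat")

# Replace the first case-sensitive match in each string
first_only <- str_replace(pets, "dog", "cat")

# Replace every match, ignoring case
all_any_case <- str_replace_all(pets, regex("dog", ignore_case = TRUE), "cat")

# Case insensitive detection uses the same modifier
any_dog <- str_detect(pets, regex("dog", ignore_case = TRUE))

first_only
all_any_case
any_dog
```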
stringr
Let’s match these REs against the test vector below using str_detect. Can we explain the matches?
Regular expressions
1. ^dog
2. ^[a-z]+$
3. \d
Select, subset, keep, or discard rows and columns
Substitute or recode values
Extract or remove substrings
To select variables with 📦 dplyr and tidyr, we:
write out their names
refer to them by position
specify ranges of contiguous variables
use 📦 tidyselect helper functions
tidyselect helpers
matches(): takes regular expressions, and selects variables that match a given pattern
starts_with(): Starts with a prefix
ends_with(): Ends with a suffix
contains(): Contains a literal string
penguins data from 📦 palmerpenguins
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
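A sketch of the tidyselect helpers applied to these column names, assuming dplyr is installed. To avoid depending on palmerpenguins, `toy` is a hypothetical one-row data frame that only borrows the column names printed above. With the real data, the same helpers would go inside `select(penguins, ...)`.

```r
library(dplyr)

penguin_cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
                  "flipper_length_mm", "body_mass_g", "sex", "year")

# Stand-in data frame: one row of NA values with the penguins column names
toy <- data.frame(matrix(NA, nrow = 1, ncol = length(penguin_cols)))
names(toy) <- penguin_cols

# Literal helpers
bill_cols <- names(select(toy, starts_with("bill")))
mm_cols <- names(select(toy, ends_with("_mm")))
length_cols <- names(select(toy, contains("length")))

# matches() takes a regular expression: names ending in _mm or _g
measure_cols <- names(select(toy, matches("_(mm|g)$")))

bill_cols
length_cols
measure_cols
```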
Mammals sleep dataset (msleep) from 📦 ggplot2
Filter to keep rats only
🐀 After running the code below, how can we exclude muskrats from the matches?
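The original slide code is not shown here, so this is only one possible approach, assuming dplyr and stringr are installed. `mammals` is a small hypothetical data frame with rat-like names, not the real msleep data. Word boundaries around "rat" are one way to leave out muskrats.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the name column of msleep
mammals <- data.frame(
  name = c("Cotton rat", "Mole rat", "Muskrat",
           "Round-tailed muskrat", "Laboratory rat", "Dog"),
  stringsAsFactors = FALSE
)

# "rat" anywhere in the name also keeps the muskrats
rat_anywhere <- filter(mammals, str_detect(name, "rat"))

# "rat" as a whole word leaves the muskrats out
rat_word <- filter(mammals, str_detect(name, "\\brat\\b"))

rat_anywhere$name
rat_word$name
```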
Questions? Comments?
R-Ladies theme for Quarto Presentations. Template available on GitHub.