Here’s a quick example of how we can use dplyr
and tidyr
functions to complete sequences in a data frame given start and end values stored in separate columns. This was originally asked in a Spanish-Language R Facebook group and makes for a good use case of pivoting plus the complete()
and full_seq()
functions from tidyr
which I had never written about nor used in my teaching materials.
For a tabular object like this one below, we want to expand the sequence (days) for each category, repeating the values of longitude as needed.
category | start_day | end_day | longitude |
---|---|---|---|
A | 6 | 8 | 12 |
B | 18 | 23 | 15 |
C | 19 | 21 | 11 |
D | 2 | 6 | 13 |
We want to end up with this:
category | day | longitude |
---|---|---|
A | 6 | 12 |
A | 7 | 12 |
A | 8 | 12 |
B | 18 | 15 |
B | 19 | 15 |
B | 20 | 15 |
B | 21 | 15 |
B | 22 | 15 |
B | 23 | 15 |
C | 19 | 11 |
C | 20 | 11 |
C | 21 | 11 |
D | 2 | 13 |
D | 3 | 13 |
D | 4 | 13 |
D | 5 | 13 |
D | 6 | 13 |
Here’s how to complete the data frame with the missing combinations of data. Let’s set up our data first.
Our initial data looks like this:
In this initial structure, the data is in wide format so we need to pivot the data so that start and end days for each group are together in the same variable.
Pivot longer like so:
We now have the key-value pairs and the data in long format.
Now we need to complete the sequence. For example: for category A we need separate rows for day 6, 7, and 8. For group B we need rows for days 18, 19, 20, 21, 22, and 23, and so on.
To create the full sequence of values in a vector, we can use tidyr::full_seq()
. This function takes a numeric vector and a value describing the gaps or increments that need to be filled in.
full_seq(c(1,3,7),1)
will return the same as 1:7
.
complete()
is a nice wrapper for another tidyr
function: expand()
. complete
is pipe-friendly and operates on data frames (including grouped data frames).
On the grouped data frame, let’s complete the new day variable in increments of one, then to get the desired result lets fill()
in the missing values in the longitude column and clean up.
The result:
Pretty cool! As always feel free to contact me with feedback or questions.