Here’s a brief approach for putting information recorded only in the file names of multiple tabular files into the data rectangle at the time of import.

The need for this arose because it is fairly common for government agencies to split related data into multiple files for individual download, with each file for example: containing data for a specific year or region. However, this can make it difficult to combine all the separate files into a global dataset when important variables are not recorded explicitly in the data.

For this example, let’s work with the 🚗 gtcars dataset from the gt package. This dataset is like mtcars but for very fancy and premium cars.

We’ll group gtcars by multiple varibles, split the data into groups, and export each group to a separate csv file without the grouping variables but with this information in the file names. Then, we’ll import all the files, splitting the file names and adding the relevant information back into the data rectangle. This will allow us to combine all the content into a single dataset.

📂 Create a folder with multiple csv files

First, load everything we need.

# load gt and various tidyverse pkgs
library(gt)      # CRAN v0.10.1 
library(dplyr)   # CRAN v1.1.4
library(purrr)   # CRAN v1.0.2
library(glue)    # CRAN v1.7.0
library(stringr) # CRAN v1.5.1
library(readr)   # CRAN v2.1.5
library(fs)      # CRAN v1.6.4

data(gtcars) # get the data 

Now, we can group the data by country of origin, year, and drive train and then a) split the data into a list of tibbles (one for each group), and b) use the group keys to build a vector of file names based on the grouping information used to split the data.

# list of tibbles
gtcars_groups <- gtcars |> 
  group_split(ctry_origin, year, drivetrain)

# tibble of the grouping variables
gtgroups <- 
gtcars |> 
  group_by(ctry_origin, year, drivetrain) |> 
  group_keys() 

After minor changes like replacing spaces, the vector of filenames can be built with rowwise glueing of the values in the grouping variables.

grpnames <- 
    gtgroups |> 
      rowwise() |> 
      mutate(gluedvars = glue_collapse(across(everything(), as.character), sep = "_")) |> 
      mutate(gluedvars= str_replace(gluedvars," ","-")) |> 
      pull(gluedvars)

🗋 Let’s use this vector of filenames to name the list of tibbles, and then export each tibble to a separate csv file using walk2 for its side effects.

names(gtcars_groups) <- grpnames
gtcars_groups <- 
  gtcars_groups |> map(select,-c(ctry_origin,year,drivetrain))
# to disk  
  walk2(gtcars_groups,names(gtcars_groups),
        \(x,y) write_csv(x,paste0("carfiles/",y,".csv")))

We should now have in our working directory: a folder (“carfiles” in this example) full of csv files.

Now let’s write a function that will strip the filename, split it into fragments (one for each variable), and add these fragments as values into the data rectangle as part of the import step. Using read.csv here to get around guessing and type conversion.

filenameTovars <- function(filepath, var_names = NULL) {
  # strip the filename
  filename <- basename(filepath)
  filename <- tools::file_path_sans_ext(filename)
  
  # split the filename 
  filename_parts <- str_split(filename, "_", simplify = TRUE)
  
  # import
  data <- read_csv(filepath)
  
  # if no var_names provided
  if (is.null(var_names))  {
    var_names <- paste0("v", seq_along(filename_parts))
  }

  # filename fragments as new variables
  as_tibble(bind_cols(
    data,
    setNames(as.list(filename_parts), var_names) # a df that will be recycled
  ))
}

Nice little function above, it can take an optional vector with the names of the variables being added and if none are provided, it will use “v1”, “v2”, etc. Hard-coded “_” as separator here, but that could be an argument too.

For one of the files chosen at random, let’s compare the output from reading the csv normally vs applying the new function with and without the vector of variable names.

# read the csv normally
read_csv("carfiles/Italy_2015_rwd.csv")

# read with path only
filenameTovars("carfiles/Italy_2015_rwd.csv") 

# read with names for new columns
filenameTovars("carfiles/Italy_2015_rwd.csv",
        var_names = c("ctry_origin","year","drivetrain")) 

The csv file without the data from the filename.

# A tibble: 6 × 12
  mfr   model trim  bdy_style    hp hp_rpm   trq trq_rpm mpg_c mpg_h trsmn
  <chr> <chr> <chr> <chr>     <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <chr>
1 Ferr… 458 … Base… coupe       597   9000   398    6000    13    17 7a   
2 Ferr… 458 … Base  converti…   562   9000   398    6000    13    17 7a   
3 Ferr… Cali… Base… converti…   553   7500   557    4750    16    23 7a   
4 Ferr… F12B… Base… coupe       731   8250   509    6000    11    16 7a   
5 Ferr… LaFe… Base… coupe       949   9000   664    6750    12    16 7a   
6 Lamb… Hura… LP 6… coupe       610   8250   413    6500    16    20 7a   
# ℹ 1 more variable: msrp <dbl>

Making this information explicit, but the new variables are named v1, v2, and v3.

# A tibble: 6 × 15
  mfr   model trim  bdy_style    hp hp_rpm   trq trq_rpm mpg_c mpg_h trsmn
  <chr> <chr> <chr> <chr>     <int>  <int> <int>   <int> <int> <int> <chr>
1 Ferr… 458 … Base… coupe       597   9000   398    6000    13    17 7a   
2 Ferr… 458 … Base  converti…   562   9000   398    6000    13    17 7a   
3 Ferr… Cali… Base… converti…   553   7500   557    4750    16    23 7a   
4 Ferr… F12B… Base… coupe       731   8250   509    6000    11    16 7a   
5 Ferr… LaFe… Base… coupe       949   9000   664    6750    12    16 7a   
6 Lamb… Hura… LP 6… coupe       610   8250   413    6500    16    20 7a   
# ℹ 4 more variables: msrp <int>, v1 <chr>, v2 <chr>, v3 <chr>

Passing the names for the new columns:

# A tibble: 6 × 15
  mfr   model trim  bdy_style    hp hp_rpm   trq trq_rpm mpg_c mpg_h trsmn
  <chr> <chr> <chr> <chr>     <int>  <int> <int>   <int> <int> <int> <chr>
1 Ferr… 458 … Base… coupe       597   9000   398    6000    13    17 7a   
2 Ferr… 458 … Base  converti…   562   9000   398    6000    13    17 7a   
3 Ferr… Cali… Base… converti…   553   7500   557    4750    16    23 7a   
4 Ferr… F12B… Base… coupe       731   8250   509    6000    11    16 7a   
5 Ferr… LaFe… Base… coupe       949   9000   664    6750    12    16 7a   
6 Lamb… Hura… LP 6… coupe       610   8250   413    6500    16    20 7a   
# ℹ 4 more variables: msrp <int>, ctry_origin <chr>, year <chr>,
#   drivetrain <chr>

Finally, we can use purrr to iterate through all the files in a folder and apply the custom function. Note the new recommended approach of using list_rbind rather than map_df.

map(allpaths,\(x) filenameTovars(x,
  var_names = c("ctry_origin","year","drivetrain"))) |> 
  list_rbind() |> type_convert() 

Our output is essentially the same gtcars object we started with.

# A tibble: 47 × 15
   mfr   model    trim    bdy_style    hp hp_rpm   trq trq_rpm mpg_c mpg_h
   <chr> <chr>    <chr>   <chr>     <int>  <int> <int>   <int> <int> <int>
 1 Audi  R8       4.2 (M… coupe       430   7900   317    4500    11    20
 2 BMW   i8       Mega W… coupe       357   5800   420    3700    28    29
 3 Audi  RS 7     Quattr… hatchback   560   5700   516    1750    15    25
 4 Audi  S6       Premiu… sedan       450   5800   406    1400    18    27
 5 Audi  S7       Presti… hatchback   450   5800   406    1400    17    27
 6 Audi  S8       Base S… sedan       520   5800   481    1700    15    25
 7 BMW   6-Series 640 I … coupe       315   5800   330    1400    20    30
 8 BMW   M4       Base C… coupe       425   5500   406    1850    17    24
 9 BMW   M5       Base S… sedan       560   6000   500    1500    15    22
10 BMW   M6       Base C… coupe       560   6000   500    1500    15    22
# ℹ 37 more rows
# ℹ 5 more variables: trsmn <chr>, msrp <int>, ctry_origin <chr>,
#   year <dbl>, drivetrain <chr>

Hopefully others out there find this useful!