Here’s a brief approach for putting information recorded only in the file names of multiple tabular files into the data rectangle at the time of import.
The need for this arose because it is fairly common for government agencies to split related data into multiple files for individual download, with each file for example: containing data for a specific year or region. However, this can make it difficult to combine all the separate files into a global dataset when important variables are not recorded explicitly in the data.
For this example, let’s work with the 🚗 gtcars dataset from the gt package. This dataset is like mtcars but for very fancy and premium cars.
We’ll group gtcars by multiple varibles, split the data into groups, and export each group to a separate csv file without the grouping variables but with this information in the file names. Then, we’ll import all the files, splitting the file names and adding the relevant information back into the data rectangle. This will allow us to combine all the content into a single dataset.
📂 Create a folder with multiple csv files
First, load everything we need.
Now, we can group the data by country of origin, year, and drive train and then a) split the data into a list of tibbles (one for each group), and b) use the group keys to build a vector of file names based on the grouping information used to split the data.
After minor changes like replacing spaces, the vector of filenames can be built with rowwise glueing of the values in the grouping variables.
🗋 Let’s use this vector of filenames to name the list of tibbles, and then export each tibble to a separate csv file using walk2 for its side effects.
We should now have in our working directory: a folder (“carfiles” in this example) full of csv files.
Now let’s write a function that will strip the filename, split it into fragments (one for each variable), and add these fragments as values into the data rectangle as part of the import step. Using read.csv here to get around guessing and type conversion.
Nice little function above, it can take an optional vector with the names of the variables being added and if none are provided, it will use “v1”, “v2”, etc. Hard-coded “_” as separator here, but that could be an argument too.
For one of the files chosen at random, let’s compare the output from reading the csv normally vs applying the new function with and without the vector of variable names.
The csv file without the data from the filename.
Making this information explicit, but the new variables are named v1, v2, and v3.
Passing the names for the new columns:
Finally, we can use purrr to iterate through all the files in a folder and apply the custom function. Note the new recommended approach of using list_rbind rather than map_df.
Our output is essentially the same gtcars object we started with.