File Management With The {fs} Package
As data scientists, we often have to deal with lots of tedious tasks. One such task is interacting with the file system on our computer or the remote machine we’re working with. Thankfully, the {fs} package has a bunch of convenience functions that make our lives a whole lot easier.
Let’s check out a few examples. And if videos are more your thing, you can also watch the video version of this blog post on YouTube.
Assemble paths
Check out this data set.
library(tidyverse)
library(fs)
original_tib <- tibble(
  dir = c('some/path/blub', 'bla/here/', 'direct/'),
  file_names = c('file_a.csv', 'file_b.csv', 'file_c.txt')
)
original_tib
## # A tibble: 3 × 2
##   dir            file_names
##   <chr>          <chr>
## 1 some/path/blub file_a.csv
## 2 bla/here/      file_b.csv
## 3 direct/        file_c.txt
Here, assembling a path in the form directory/file_name.ext can be tricky. Some directories have a trailing / and some don’t. So, working with paste0() or glue::glue() would be challenging. Thankfully, the path() function from the {fs} package doesn’t care whether a trailing / is there or not.
original_tib |>
  mutate(path = path(dir, file_names))
## # A tibble: 3 × 3
##   dir            file_names path
##   <chr>          <chr>      <fs::path>
## 1 some/path/blub file_a.csv some/path/blub/file_a.csv
## 2 bla/here/      file_b.csv bla/here/file_b.csv
## 3 direct/        file_c.txt direct/file_c.txt
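To make the difference concrete, here is a small side-by-side sketch using the same directories and file names as plain vectors. The variable names (dirs, files, pasted, pasted_sep, joined) are just mine for illustration.

```r
library(fs)

dirs  <- c('some/path/blub', 'bla/here/', 'direct/')
files <- c('file_a.csv', 'file_b.csv', 'file_c.txt')

# paste0() glues the strings together verbatim,
# so the first entry ends up without a separator
pasted <- paste0(dirs, files)

# Pasting with sep = '/' fixes the first entry
# but doubles the slash wherever one was already there
pasted_sep <- paste(dirs, files, sep = '/')

# path() inserts exactly one separator in every case
joined <- path(dirs, files)

pasted[1]      # "some/path/blubfile_a.csv"
pasted_sep[2]  # "bla/here//file_b.csv"
joined
```

Neither of the two paste variants gets all three rows right, which is exactly the annoyance path() takes off our hands.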
Remove and set extensions
We can even modify file extensions really easily. That’s convenient when we want to take input from CSV files and then turn the data into images using the same file names.
original_tib |>
  mutate(
    path = path(dir, file_names),
    out_path = path_ext_set(path, 'png')
  )
## # A tibble: 3 × 4
##   dir            file_names path                      out_path
##   <chr>          <chr>      <fs::path>                <fs::path>
## 1 some/path/blub file_a.csv some/path/blub/file_a.csv some/path/blub/file_a.png
## 2 bla/here/      file_b.csv bla/here/file_b.csv       bla/here/file_b.png
## 3 direct/        file_c.txt direct/file_c.txt         direct/file_c.png
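Setting an extension has two siblings in {fs} that are worth knowing: path_ext() reads the extension and path_ext_remove() drops it. Here is a minimal sketch on a single path; the variable p is only there for the demo.

```r
library(fs)

p <- path('bla/here', 'file_b.csv')

# Read the extension (returned without the leading dot)
ext <- path_ext(p)

# Drop the extension entirely
no_ext <- path_ext_remove(p)

# Swap the extension, as in the example above
as_png <- path_ext_set(p, 'png')

ext     # "csv"
no_ext  # bla/here/file_b
as_png  # bla/here/file_b.png
```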
Get directory info
You can get information on a directory as a tree in the console. Here, I’m using a directory called raw-input inside my working directory to demonstrate that.
dir_tree('raw-input')
## raw-input
## ├── a
## │   └── dat.csv
## ├── b
## │   └── dat.csv
## └── c
##     └── dat.csv
You can also get lots of information on these files.
dir_info('raw-input')
## # A tibble: 3 × 18
##   path        type     size permissions modification_time   user  group device_id
##   <fs::path>  <fct>   <fs:> <fs::perms> <dttm>              <chr> <chr>     <dbl>
## 1 raw-input/a direc…     4K rwxrwxr-x   2025-03-29 09:02:24 albe… albe…     66307
## 2 raw-input/b direc…     4K rwxrwxr-x   2025-03-29 09:04:33 albe… albe…     66307
## 3 raw-input/c direc…     4K rwxrwxr-x   2025-03-29 09:04:35 albe… albe…     66307
## # ℹ 10 more variables: hard_links <dbl>, special_device_id <dbl>, inode <dbl>,
## #   block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
## #   access_time <dttm>, change_time <dttm>, birth_time <dttm>
But in a lot of cases, it will probably suffice to just get the file paths.
dir_ls('raw-input')
## raw-input/a raw-input/b raw-input/c
To descend into nested directories, though, you’ll need to set recurse = TRUE in this function.
dir_ls('raw-input', recurse = TRUE)
## raw-input/a raw-input/a/dat.csv raw-input/b raw-input/b/dat.csv
## raw-input/c raw-input/c/dat.csv
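A recursive listing returns directories and files alike. If you only want the files, dir_ls() also accepts a type argument, and glob gives you a wildcard filter. The sketch below rebuilds a throwaway copy of the folder layout in a temporary directory (that temp folder and the names root, only_files and csv_only are my assumptions, so that the code runs anywhere).

```r
library(fs)

# Throwaway folder that mimics raw-input: a/dat.csv, b/dat.csv, c/dat.csv
root <- file_temp('raw-input-demo')
dir_create(path(root, c('a', 'b', 'c')))
file_create(path(root, c('a', 'b', 'c'), 'dat.csv'))

# type = 'file' drops the directories from the recursive listing
only_files <- dir_ls(root, recurse = TRUE, type = 'file')

# glob filters by wildcard pattern
csv_only <- dir_ls(root, recurse = TRUE, glob = '*.csv')

# Show the hits relative to the demo folder
path_rel(only_files, start = root)
```

Both calls should return the same three dat.csv files here, since the demo folder contains nothing else.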
Iterate over file paths
Usually, you don’t want to stop after finding the desired paths; you want to iterate over them. For this, you can save the output of dir_ls() into a vector and iterate through it using the map() or walk() function. Here, the function I use inside of walk() will
- load the data using the specified path,
- create a ggplot from it, and
- save the image.
The tricky thing here is that I want to save the files in an output directory that has the same structure as the raw-input directory. That’s why I also need to create the necessary paths and directories inside the function.
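# Find all CSV files below raw-input
csv_files <- dir_ls(
  'raw-input',
  recurse = TRUE,
  regexp = '\\.csv$'
)
csv_files |>
  walk(
    \(file_path) {
      # Load the data and build the plot
      plt <- read_csv(file_path) |>
        ggplot(aes(col_a, col_b)) +
        geom_point(size = 10, col = 'dodgerblue4')
      # Mirror the input path under output/ with a .png extension
      out_path <- file_path |>
        path_ext_set('.png') |>
        str_replace('^raw-input', 'output')
      # Make sure the target directory exists before saving
      dir_create(path_dir(out_path))
      # Pass the plot explicitly instead of relying on the last plot
      ggsave(filename = out_path, plot = plt)
    }
  )
## Rows: 3 Columns: 3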
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): col_a, col_b, col_c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Saving 6 x 4 in image
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): col_a, col_b, col_c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Saving 6 x 4 in image
## Rows: 3 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): col_a, col_b, col_c
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Saving 6 x 4 in image
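A quick side note on the path handling before we look at the result: swapping the raw-input prefix can also be done with {fs} helpers alone, using path_rel() instead of a string replacement. This is just an alternative sketch, and mirror_path() is a hypothetical helper of mine, not a function from the package.

```r
library(fs)

# Hypothetical helper: map an input path to its mirrored output path
mirror_path <- function(file_path, from = 'raw-input', to = 'output', ext = 'png') {
  # Path relative to the input root, e.g. a/dat.csv
  rel <- path_rel(file_path, start = from)
  # Re-root it under the output directory and swap the extension
  path_ext_set(path(to, rel), ext)
}

mirrored <- mirror_path(c('raw-input/a/dat.csv', 'raw-input/b/dat.csv'))
mirrored
```

The upside of this variant is that the result stays an fs path, so it can go straight into path_dir() and dir_create().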
Splendid. This should have worked, and you can now see the output directory and the plots in the file tree.
dir_tree()
## .
## ├── index.qmd
## ├── index.rmarkdown
## ├── output
## │   ├── a
## │   │   └── dat.png
## │   ├── b
## │   │   └── dat.png
## │   └── c
## │       └── dat.png
## └── raw-input
##     ├── a
##     │   └── dat.csv
##     ├── b
##     │   └── dat.csv
##     └── c
##         └── dat.csv
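I used walk() above because I only cared about the side effect of saving images. If you want to keep what the function returns, map() is the better fit. Here is a minimal sketch under a few assumptions of mine: it writes two small dummy CSV files into a temporary folder and reads them with base R's read.csv() to keep the dependencies light.

```r
library(fs)
library(purrr)

# Throwaway folder with two small dummy CSV files
root <- file_temp('map-demo')
dir_create(path(root, c('a', 'b')))
walk(
  path(root, c('a', 'b'), 'dat.csv'),
  \(p) write.csv(data.frame(col_a = 1:3, col_b = 4:6), p, row.names = FALSE)
)

csv_files <- dir_ls(root, recurse = TRUE, glob = '*.csv')

# map() keeps the return values; the list is named by file path
tables <- map(csv_files, read.csv)

# Typed variants like map_int() return a vector instead of a list
row_counts <- map_int(tables, nrow)
row_counts
```

Because dir_ls() returns a named vector, the resulting list carries the file paths as names, which makes it easy to trace each table back to its source file.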