Tutorial: Git data backend

This tutorial shows how to use gittargets with the Git-based data versioning backend. Before proceeding, please read the README file or documentation website front page for an overview of the package.

Installation

Please begin with the installation instructions on the documentation website. In addition, if your targets pipeline generates large data files, consider installing Git LFS. The Git data backend in gittargets automatically opts into Git LFS, so you should not need to do any manual configuration to reap the performance benefits.

Remotes

This backend uses local Git only. It is possible to push the snapshotted data store to a cloud service like GitHub, GitLab, or Bitbucket, but this is the user’s responsibility. Pipelines usually generate large data files, and GitHub and its peers have file size limitations. Also, gittargets automatically opts into Git LFS locally (unless git_lfs is FALSE in tar_git_init()), and Git LFS on the cloud is a paid service.
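
If you do decide to push the data store yourself, one possible approach is sketched below. It assumes the data store lives in the default _targets/ folder, that the git command line tool is on your PATH, and that the remote URL you supply points to a repository you control. The helper name push_data_store() is hypothetical and is not part of gittargets. The sketch calls command-line Git rather than an R wrapper so that any Git LFS hooks installed on your machine get a chance to run.

```r
# Hypothetical helper, not part of gittargets: push the data branch that is
# currently checked out in the data store to a remote you manage yourself.
push_data_store <- function(url, store = "_targets", remote = "origin") {
  # Run a git command inside the data store and capture its output lines.
  git <- function(...) {
    system2("git", c("-C", shQuote(store), ...), stdout = TRUE)
  }
  # Register the remote if the data repository does not know it yet.
  if (!(remote %in% git("remote"))) {
    git("remote", "add", remote, shQuote(url))
  }
  # Find the current data branch (named "code=<SHA1>") and push it.
  branch <- git("rev-parse", "--abbrev-ref", "HEAD")
  git("push", "--set-upstream", remote, shQuote(branch))
}
```

Nothing runs until you call the function, e.g. push_data_store("<your remote URL>"). It only pushes one data branch at a time, so you would call it again after checking out or creating other snapshots.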

Overall workflow

The most important steps of the Git data backend are as follows. The rest of this vignette walks through these steps in greater depth.

  1. Create the source code and run the pipeline at least once so the data store exists.
  2. tar_git_init(): initialize a Git/Git-LFS repository for the data store.
  3. Bring the pipeline up to date (e.g. with tar_make()) and commit any changes to the source code.
  4. tar_git_snapshot(): create a data snapshot for the current code commit.
  5. Develop the pipeline. Create new code commits and code branches early and often, and create data snapshots at key strategic milestones.
  6. tar_git_checkout(): revert the data to the appropriate prior snapshot.
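
As a rough sketch, steps 2 through 4 above could be bundled into one helper. The function name snapshot_cycle() is hypothetical, and the sketch assumes the working directory is the project root, the code repository already exists, the source code lives in _targets.R, and the data store is the default _targets/ folder with its Git repository in _targets/.git.

```r
# Hypothetical helper, not part of gittargets: one pass through the
# "run, commit code, snapshot data" cycle.
snapshot_cycle <- function(message, script = "_targets.R") {
  # Bring the pipeline up to date so the data store reflects the code.
  targets::tar_make()
  # Initialize the data repository the first time only
  # (assumes the data repository sits at _targets/.git).
  if (!dir.exists(file.path("_targets", ".git"))) {
    gittargets::tar_git_init()
  }
  # Commit the source code so the data snapshot has a code commit to attach to.
  gert::git_add(script)
  gert::git_commit(message)
  # Create the data snapshot for the code commit we just made.
  gittargets::tar_git_snapshot()
}
```

Calling snapshot_cycle("Begin analyzing the airquality dataset") would then perform the cycle once. The later sections run each of these calls individually, which is the better way to learn what each one does.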

Create code

To begin development, we write a _targets.R file for a targets pipeline. targets can handle large, complex pipelines for machine learning, Bayesian data analysis, and much more. However, this tutorial focuses on a much simpler pipeline for the sake of pedagogical simplicity.

# _targets.R
library(targets)
list(
  tar_target(data, datasets::airquality),
  tar_target(result, summary(data))
)

Run pipeline

With our target script in hand, we run the pipeline.[1]

tar_make()
#> • start target data
#> • built target data
#> • start target result
#> • built target result
#> • end pipeline: 0.121 seconds

We inspect the output with tar_read().

tar_read(result)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

Commit code

We usually iterate between writing code and running the pipeline until we have a decent set of results. After that, we commit the code to a Git repository, which may or may not live on GitHub.[2] Happy Git with R is a great way to learn Git, and the gert package is a convenient way to interact with Git from R.

library(gert)
git_init()
git_add("_targets.R")
git_commit("Begin analyzing the airquality dataset")
git_branch_create("airquality")

Snapshot data

Before we snapshot the data, we should check that the code is up to date in the Git repository and the targets are up to date in the pipeline. The tar_git_status() function is an easy way to do this.[3]

tar_git_status()
#> 
#> ── Data Git status ─────────────────────────────────────────────────────────────
#> ✖ No Git repository for the data store.
#> ! Create one with `gittargets::tar_git_init()`.
#> 
#> ── Code Git status ─────────────────────────────────────────────────────────────
#> # A tibble: 1 × 3
#>   file      status staged
#>   <chr>     <chr>  <lgl> 
#> 1 _targets/ new    FALSE
#> 
#> ── Outdated targets ────────────────────────────────────────────────────────────
#> ✔ All targets are up to date.

Our code and pipeline look ready for a data snapshot. First, we initialize the data repository with tar_git_init(). tar_git_init() writes a .gitattributes file in the data store that automatically opts into Git LFS. If you have Git LFS but do not wish to use it, please remove the .gitattributes after calling tar_git_init().

tar_git_init()
#> ✔ Created data store Git repository
#> ✔ Wrote to _targets/.gitattributes for git-lfs: <https://git-lfs.github.com>.
#> ✔ Created stub commit without data.
#> • Run tar_git_snapshot() to put the data files under version control.

Then, we create our first data commit with tar_git_snapshot().[4]

tar_git_snapshot()
#> • Creating data branch code=e3337eb3d9bda9e8c43cbd5b01e21add1d2f21a8.
#> • Staging data files.
#> ✔ Staged 6 files in the data store.
#> • Committing data changes.
#> ✔ Created new data snapshot 659e6450ec5bb88a0137bcc0e29fdd6a86e4ce8a.
#> • Packing references.

Snapshot model

In the Git data backend, a data snapshot is a special kind of Git commit (gray boxes above). Each data commit is part of a data branch (vertical dashed lines above), and each data branch is specific to the current code commit (green and brown boxes above). In fact, each data branch name is of the form "code=<SHA1>", where <SHA1> is the Git SHA1 hash of the corresponding code commit. You can always create a data snapshot, but it will supersede any prior data snapshot you already have for the current code commit. To revert to a previous data snapshot for a given code commit, you will need to manually enter the data repository and check out the relevant data commit.
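
To make the naming scheme concrete, here is a minimal sketch that builds a data branch name from a code commit hash and lists the data commits on that branch. Both helper names are hypothetical and not part of gittargets. The second helper assumes the working directory is the code repository, the data store is the default _targets/ folder, and gert is installed.

```r
# Hypothetical helpers, not part of gittargets.

# Build the data branch name that corresponds to a code commit hash.
data_branch_name <- function(sha) {
  paste0("code=", sha)
}

# List the data commits on the data branch of the current code commit,
# which is where you would look up an earlier data commit to check out.
data_commits_for_head <- function(store = "_targets") {
  sha <- gert::git_commit_info()$id  # hash of the current code commit
  ref <- paste0("refs/heads/", data_branch_name(sha))
  gert::git_log(ref = ref, repo = store)
}

data_branch_name("e3337eb3d9bda9e8c43cbd5b01e21add1d2f21a8")
#> [1] "code=e3337eb3d9bda9e8c43cbd5b01e21add1d2f21a8"
```

The branch name printed above matches the one reported by tar_git_snapshot() in the previous section.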

Repeat

Development typically happens in cycles: develop the code, run the pipeline, commit the code, snapshot the data, and repeat. Not all code commits need a data snapshot, especially if the targets pipeline generates a lot of data. But even then, it is helpful to snapshot the data at key milestones, e.g. if an alternative research question comes up and it is desirable to create a new Git branch for the code. For example, suppose we wish to apply the same pipeline to a different dataset. The code changes:

# _targets.R
library(targets)
list(
  tar_target(data, datasets::UKgas), # different dataset
  tar_target(result, summary(data))
)

We run the pipeline and inspect the new output.

tar_make()
#> • start target data
#> • built target data
#> • start target result
#> • built target result
#> • end pipeline: 0.112 seconds
tar_read(result)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    84.8   153.3   220.9   337.6   469.9  1163.9

We put the code in a new Git branch.

git_branch_create("UKgas")
git_add("_targets.R")
#> # A tibble: 2 × 3
#>   file       status   staged
#>   <chr>      <chr>    <lgl> 
#> 1 _targets.R modified TRUE  
#> 2 _targets/  new      FALSE
git_commit("Switch to UKgas dataset")
#> [1] "0df27a7f4e6e36f6c21b52127e6720aa3fa0a80d"

Finally, we create a data snapshot for the new code commit.

tar_git_snapshot()
#> • Creating data branch code=0df27a7f4e6e36f6c21b52127e6720aa3fa0a80d.
#> • Staging data files.
#> ✔ Staged 5 files in the data store.
#> • Committing data changes.
#> ✔ Created new data snapshot 75189cfa14ba7d84d8f8697d1af49c657eb63447.
#> • Packing references.

View log

Now, suppose we want to switch the project back to the original dataset (airquality). To transition completely, we need to revert both the code and the data. If we only revert the code, then the data store will still reflect the UKgas dataset, and none of our targets will be up to date. At this point, it is a good time to pause and check the gittargets log to see which code commits have available data snapshots.[5]

tar_git_log()
#> # A tibble: 2 × 6
#>   message_code   messa…¹ time_code           time_data           commi…² commi…³
#>   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
#> 1 Switch to UKg… Switch… 2022-09-06 16:16:02 2022-09-06 16:16:02 0df27a… 75189c…
#> 2 Begin analyzi… Begin … 2022-09-06 16:15:57 2022-09-06 16:16:00 e3337e… 659e64…
#> # … with abbreviated variable names ¹​message_data, ²​commit_code, ³​commit_data
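
If the log is long, a small lookup can answer whether a particular code commit has a snapshot. The sketch below is a hypothetical helper, not part of gittargets. It only assumes a data frame with a commit_code column like the one printed above, and it treats the hash you pass in as a prefix so that abbreviated hashes also work.

```r
# Hypothetical helper, not part of gittargets: check whether a code commit
# (full or abbreviated hash) appears in a log shaped like tar_git_log() output.
has_snapshot <- function(log, sha) {
  any(startsWith(log$commit_code, sha))
}

# A stand-in for the log above, using the two code commits from this tutorial.
example_log <- data.frame(
  commit_code = c(
    "0df27a7f4e6e36f6c21b52127e6720aa3fa0a80d",
    "e3337eb3d9bda9e8c43cbd5b01e21add1d2f21a8"
  )
)
has_snapshot(example_log, "e3337eb")
#> [1] TRUE
has_snapshot(example_log, "abcdef0")
#> [1] FALSE
```

In a real project you would pass tar_git_log() as the first argument instead of example_log.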

Check out code

To check out the old airquality code, we use gert::git_branch_checkout().

git_branch_checkout("airquality")

But because we did not revert the data, our results still reflect the UKgas dataset.

tar_read(result)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    84.8   153.3   220.9   337.6   469.9  1163.9

Thus, all our targets are out of date.

tar_outdated()
#> [1] "result" "data"

Check out data

To bring our targets back up to date with the airquality data, instead of beginning a potentially long computation with tar_make(), we can check out the data snapshot that matches our current code commit.

tar_git_checkout()
#> ✔ Checked out data snapshot 659e6450ec5bb88a0137bcc0e29fdd6a86e4ce8a.
#> • Code commit: code=e3337eb3d9bda9e8c43cbd5b01e21add1d2f21a8
#> • Message: Begin analyzing the airquality dataset
#> • Resetting to HEAD of checked-out snapshot.

Now, our results reflect the airquality dataset we previously analyzed.

tar_read(result)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

And all our targets are up to date.

tar_outdated()
#> character(0)

Merges

It is common to merge code branches into one another. When this happens, a new merge commit is created in the code repository, and the data repository remains unchanged. In fact, the only change is that the code repository is now at a new commit that has no data snapshot yet. If you wish, simply create a new data snapshot with tar_git_snapshot(). If the code commit immediately prior had an up-to-date data snapshot of its own, then the new snapshot for the merge commit should cost little storage or runtime.
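
A minimal sketch of that sequence follows. The helper name merge_and_snapshot() is hypothetical, and the sketch assumes the merge completes without conflicts, the working directory is the project root, and the data repository was already initialized with tar_git_init().

```r
# Hypothetical helper, not part of gittargets: merge a code branch into the
# current branch, then snapshot the data for the resulting code commit.
merge_and_snapshot <- function(branch) {
  # Merge the other code branch into the one currently checked out.
  gert::git_merge(branch)
  # Rebuild any targets that the merged code invalidated.
  targets::tar_make()
  # Record a data snapshot for the new code commit.
  gittargets::tar_git_snapshot()
}
```

If the merge changes nothing that the pipeline depends on, tar_make() should skip every target and the snapshot should be quick.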

Custom data files

Only files inside the targets data store are tracked in a gittargets data snapshot. If your pipeline requires custom external files, you may put them in a folder called _targets/user/. That way, gittargets will automatically put them under data version control in the next snapshot.
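
For instance, a small helper along these lines could copy an external file into that folder. The name stash_user_file() is hypothetical and not part of gittargets, and the default store path assumes the standard _targets/ data store.

```r
# Hypothetical helper, not part of gittargets: copy an external file into
# the user/ folder of the data store so the next snapshot includes it.
stash_user_file <- function(path, store = "_targets") {
  user_dir <- file.path(store, "user")
  # Create _targets/user/ if it does not exist yet.
  dir.create(user_dir, recursive = TRUE, showWarnings = FALSE)
  destination <- file.path(user_dir, basename(path))
  file.copy(path, destination, overwrite = TRUE)
  # Return the new location invisibly for convenience.
  invisible(destination)
}
```

After a call to stash_user_file() with the path of your external file, the next tar_git_snapshot() should pick up the copy along with the rest of the data store.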

Performance

If your targets pipeline generates large data files, consider installing Git LFS. Once you install Git LFS, it should just work on your project right out of the box because tar_git_init() writes the following to _targets/.gitattributes:

objects filter=lfs diff=lfs merge=lfs -text

In addition, every data snapshot with tar_git_snapshot() creates a new Git branch. With thousands of commits and thus thousands of branches, performance may suffer unless you ensure pack_refs is TRUE in the function call (default).[6]


  1. https://books.ropensci.org/targets/hpc.html describes heavy-duty alternatives to tar_make().

  2. Alternatives to GitHub include GitLab and Bitbucket.

  3. Helper functions tar_git_status_code(), tar_git_status_targets(), and tar_git_status_data() each generate a piece of the tar_git_status() output.

  4. Ordinarily, tar_git_snapshot() runs tar_git_status() and prompts the user to confirm the snapshot. But in this example, we skip this step.

  5. If you chose not to call tar_git_snapshot() for some code commits, then not all your code commits will have available data snapshots.

  6. Alternatively, you can call tar_git_snapshot(pack_refs = FALSE) and then afterwards run git pack-refs --all (https://git-scm.com/docs/git-pack-refs) in the command line with your working directory inside _targets/.