Reproducibility with seeker

Using the seeker package together with Docker, it’s easy to make fetching and processing of sequencing and microarray data completely reproducible. First pull the latest version of the socker image (ghcr.io/hugheylab/socker), which has seeker and its dependencies already installed.
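
As a minimal sketch of that first step, the image can be pulled with the docker CLI. The image name is the one used in the example shell scripts in this document; omitting a tag makes docker pull the "latest" tag. The check for the docker command is only a convenience added here.

```shell
# Image name as used in the example shell scripts.
image='ghcr.io/hugheylab/socker'

# Pull the image only if the docker CLI is available on this machine.
if command -v docker >/dev/null 2>&1; then
  docker pull "$image" || echo "pull failed for $image" >&2
else
  echo "docker not found on PATH" >&2
fi
```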

RNA-seq data

The seeker package includes an example YAML file, R script, and shell script for fetching and processing a subset of an RNA-seq dataset. Here we’ll download the files from GitHub, using R, to avoid having to install the package locally:

urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('PRJNA600892.yml', 'run_seeker.R', 'run_seeker.sh')) {
  download.file(paste0(urlBase, filename), filename)}

PRJNA600892.yml:

study: 'PRJNA600892' # [string]
metadata:
  run: TRUE # [logical]
  bioproject: 'PRJNA600892' # [string]
  include:
    # [named list or NULL]
    colname: 'run_accession' # [string]
    values: ['SRR10876945', 'SRR10876946'] # [vector]
  # exclude # [named list or NULL]
    # colname # [string]
    # values # [vector]
fetch:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # overwrite # [logical or NULL]
  # keepSra # [logical or NULL]
  # prefetchCmd # [string or NULL]
  # prefetchArgs # [character vector or NULL]
  # fasterqdumpCmd # [string or NULL]
  # fasterqdumpArgs # [character vector or NULL]
  # pigzCmd # [string or NULL]
  # pigzArgs # [character vector or NULL]
trimgalore:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
  # pigzCmd # [string or NULL]
fastqc:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
salmon:
  run: TRUE # [logical]
  indexDir: '~/refgenie_genomes/alias/mm10/salmon_partial_sa_index/default' # [string]
  # sampleColname # [string or NULL]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
multiqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
tximport:
  run: TRUE # [logical]
  tx2gene:
    # [named list or NULL]
    organism: 'mmusculus' # [string]
    # version # [number or NULL]
    # filename # [string or NULL]
  countsFromAbundance: 'lengthScaledTPM' # [string]
  # ignoreTxVersion # [logical or NULL]

run_seeker.R:

doParallel::registerDoParallel()

cArgs = commandArgs(TRUE)
yamlPath = cArgs[1L]
parentDir = cArgs[2L]

params = yaml::read_yaml(yamlPath)
seeker::seeker(params, parentDir)

run_seeker.sh:

#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c \
    "source /home/rstudio/miniconda3/etc/profile.d/conda.sh \
      && conda activate seeker \
      && refgenie pull mm10/salmon_partial_sa_index \
      && Rscript run_seeker.R PRJNA600892.yml ." \
  &> PRJNA600892_progress.log

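Now run the shell script. The script redirects output with &>, which is a bash feature, so run it with bash rather than a POSIX sh such as dash:

bash run_seeker.sh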

The output will appear in your working directory. You can follow seeker()’s progress using the log file, PRJNA600892_progress.log. To process a different dataset, modify the YAML file and shell script accordingly. Note that this example uses the “salmon_partial_sa_index” asset from refgenie to minimize computational requirements; for actual use we recommend “salmon_sa_index”.
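
One way to switch the example to the recommended index is to replace the asset name in both the YAML file and the shell script. The sketch below is not part of seeker: use_full_index is a hypothetical helper, and it assumes that refgenie stores “salmon_sa_index” under the same directory layout as the partial index, so that a plain substitution in indexDir yields a valid path. Check the path after pulling the asset.

```shell
# Hypothetical helper: back up a file to <file>.bak, then replace the
# partial index asset name with the full one everywhere it appears.
use_full_index() {
  cp "$1" "$1.bak" &&
    sed 's/salmon_partial_sa_index/salmon_sa_index/g' "$1.bak" > "$1"
}

# Apply to the example files if they are present in the working directory.
for f in PRJNA600892.yml run_seeker.sh; do
  if [ -f "$f" ]; then
    use_full_index "$f"
  fi
done
```

After this, the refgenie pull line in run_seeker.sh and the indexDir entry in the YAML file both refer to salmon_sa_index, and the originals remain available as .bak files.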

Microarray data

The seeker package also includes an example YAML file, R script, and shell script for fetching and processing a microarray dataset. Download the files to your working directory, again using R:

urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('GSE25585.yml', 'run_seeker_array.R', 'run_seeker_array.sh')) {
  download.file(paste0(urlBase, filename), filename)}

GSE25585.yml:

study: 'GSE25585'
geneIdType: 'entrez'

run_seeker_array.R:

cArgs = commandArgs(TRUE)

params = yaml::read_yaml(cArgs[1L])
parentDir = cArgs[2L]

seeker::seekerArray(
  study = params$study, geneIdType = params$geneIdType,
  platform = params$platform, parentDir)

run_seeker_array.sh:

#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c "Rscript run_seeker_array.R GSE25585.yml ." \
  &> GSE25585_progress.log

Now run the shell script. As with the RNA-seq example, the script uses the bash-specific &> redirection, so run it with bash:

bash run_seeker_array.sh

The output will appear in your working directory. You can follow seekerArray()’s progress using the log file, GSE25585_progress.log. To process a different dataset, modify the YAML file and shell script accordingly.
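
A minimal sketch for checking on a run from a second terminal is shown here. show_progress is a hypothetical helper, not part of seeker; it prints the last lines of a log file, 20 by default. The same approach applies to PRJNA600892_progress.log from the RNA-seq example.

```shell
log='GSE25585_progress.log'

# Hypothetical helper: print the last N lines (default 20) of a log file,
# or report on stderr that the log does not exist yet.
show_progress() {
  if [ -f "$1" ]; then
    tail -n "${2:-20}" "$1"
  else
    echo "no log yet: $1" >&2
    return 1
  fi
}

show_progress "$log" || true
# To stream updates continuously instead, run: tail -f "$log"
```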

Jake Hughey

2022-07-22
