To generate a pagoo object, the only mandatory data structure is a data.frame, which must contain 3 basic columns. Let’s illustrate this with a toy dataset.
First, load pagoo:
library(pagoo) # Load package
tgz <- system.file('extdata', 'toy_data.tar.gz', package = 'pagoo')
untar(tarfile = tgz, exdir = tempdir()) # Decompress example dataset
files <- list.files(path = tempdir(), full.names = TRUE, pattern = 'tsv$|fasta$') # List files
files
## [1] "/tmp/RtmpVE7Q3u/case_clusters_meta.tsv"
## [2] "/tmp/RtmpVE7Q3u/case_df.tsv"
## [3] "/tmp/RtmpVE7Q3u/case_orgs_meta.tsv"
## [4] "/tmp/RtmpVE7Q3u/organismA.fasta"
## [5] "/tmp/RtmpVE7Q3u/organismB.fasta"
## [6] "/tmp/RtmpVE7Q3u/organismC.fasta"
## [7] "/tmp/RtmpVE7Q3u/organismD.fasta"
## [8] "/tmp/RtmpVE7Q3u/organismE.fasta"
The file we need now is case_df.tsv. Let’s load it and look at its structure:
data_file <- grep("case_df.tsv", files, value = TRUE)
data <- read.table(data_file, header = TRUE, sep = '\t', quote = '')
head(data)
##      gene       org cluster                             annot
## 1 gene081 organismA   OG001  Thioesterase superfamily protein
## 2 gene122 organismB   OG001          Thioesterase superfamily
## 3 gene299 organismC   OG001  Thioesterase superfamily protein
## 4 gene186 organismD   OG001  Thioesterase superfamily protein
## 5 gene076 organismE   OG001          Thioesterase superfamily
## 6 gene352 organismA   OG002 Inherit from proNOG: Thioesterase
So it is a data.frame with 4 columns: the first holds the name of each gene, the second the organism to which each gene belongs, the third the cluster to which each gene was assigned in the pangenome reconstruction, and the last annotation metadata for each gene. Of the 4 columns, the first 3 are required, and pagoo will look for columns named “gene”, “org”, and “cluster”. Any further columns are optional, and you can add as many as you want (or none) to attach metadata to each gene.
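A quick way to confirm that a table meets this requirement before building the object is to inspect its column names. The sketch below uses only base R; the helper name check_pagoo_columns is hypothetical (it is not part of pagoo), and the rows are made up to mimic the layout shown above.

```r
# Hypothetical helper (not part of pagoo): return the required
# column names that are absent from a gene table.
check_pagoo_columns <- function(df, required = c("gene", "org", "cluster")) {
  setdiff(required, colnames(df))
}

# Made-up rows mimicking the layout of case_df.tsv
toy <- data.frame(
  gene    = c("gene081", "gene122"),
  org     = c("organismA", "organismB"),
  cluster = c("OG001", "OG001"),
  annot   = c("Thioesterase superfamily protein", "Thioesterase superfamily")
)

# Full table: no required column is missing
missing_ok <- check_pagoo_columns(toy)

# Table without the cluster column: the helper reports it
missing_bad <- check_pagoo_columns(toy[, c("gene", "org")])
```

An empty result means the three required columns are present; extra columns such as annot are ignored by the check, in line with their optional status.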
With only this data (even ignoring the fourth column, which is metadata), you can start working with pagoo:
pg <- pagoo(data = data)
But let’s continue by adding more.
The next 2 .tsv files contain metadata for each cluster and for each organism, respectively, and are passed as optional arguments.
# Organism metadata
orgs_file <- grep("case_orgs_meta.tsv", files, value = TRUE)
orgs_meta <- read.table(orgs_file, header = TRUE, sep = '\t', quote = '')
head(orgs_meta)
##         org sero  country
## 1 organismA    a Westeros
## 2 organismB    b Westeros
## 3 organismC    c Westeros
## 4 organismD    a    Essos
## 5 organismE    b    Essos
In this data.frame we have a column named org, which is mandatory if you provide this argument. The other columns are metadata associated with each organism. Beware that the organisms provided in this table (orgs_meta$org) must coincide with the names provided in the data$org field, so that each variable is mapped correctly.
The last file contains metadata associated with each cluster of orthologous genes:
# Cluster metadata
clust_file <- grep("case_clusters_meta.tsv", files, value = TRUE)
clust_meta <- read.table(clust_file, header = TRUE, sep = '\t', quote = '')
head(clust_meta)
##   cluster   kegg  cog
## 1   OG001   <NA>    S
## 2   OG002   <NA>    S
## 3   OG003   <NA> <NA>
## 4   OG004   <NA>    D
## 5   OG005 K01990    V
## 6   OG006   <NA>    V
Again, the clust_meta$cluster column must contain the same elements as the data$cluster column, so that one can be mapped onto the other.
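Both this requirement and the analogous one for orgs_meta$org come down to comparing two sets of names, which can be done before creating the object. A minimal sketch in base R, with made-up vectors standing in for data$org and orgs_meta$org (the variable names are illustrative only):

```r
# Made-up stand-ins for data$org and orgs_meta$org
data_org <- c("organismA", "organismB", "organismC", "organismA")
meta_org <- c("organismA", "organismB", "organismD")

# Names present in the gene table but absent from the metadata
no_metadata <- setdiff(unique(data_org), meta_org)

# Names present in the metadata but absent from the gene table
no_genes <- setdiff(meta_org, unique(data_org))

# TRUE only when both vectors name exactly the same organisms
same_orgs <- setequal(data_org, meta_org)
```

The same comparison applies unchanged to clust_meta$cluster and data$cluster.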
With all this data the pagoo object will look much more complete. But you can still add sequence information to the pangenome, which makes it much more useful and interesting to work with.
In this made-up dataset we have 5 organisms, so if you decide to add sequences to the pangenome you must provide them for all 5 organisms. The data needed is a DNA multifasta file for each organism, in which each sequence is a gene whose name can be mapped to the data$gene column. You must first load the sequences into a list, and name each list element after its organism as given in data$org (as well as orgs_meta$org). The list would therefore have one named element per organism, each holding the gene sequences of that organism.
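A rough sketch of the expected shape, using base R only. The sequences are made up, and plain named character vectors stand in for the DNAStringSet objects used in this tutorial, purely to show how the names are arranged:

```r
# Made-up sequences; named character vectors are stand-ins for
# DNAStringSet objects and only illustrate the naming scheme.
sq_sketch <- list(
  organismA = c(gene081 = "ATGACC", gene352 = "ATGGGA"),
  organismB = c(gene122 = "ATGACA")
)

# One list element per organism, named as in data$org
org_names <- names(sq_sketch)

# Within each element, sequence names map to data$gene
gene_names_A <- names(sq_sketch$organismA)
```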
In the case of the example we are working on:
# Sequences
fasta_files <- grep("[.]fasta", files, value = TRUE) # List fasta files
names(fasta_files) <- sub('[.]fasta', '', basename(fasta_files)) # Name them
# Read fasta files with Biostrings:
library(Biostrings)
sq <- lapply(fasta_files, readDNAStringSet)
class(sq) # Is list?
## [1] "list"
length(sq) # One list element per organism
## [1] 5
names(sq) # Names are the same as in data$org
## [1] "organismA" "organismB" "organismC" "organismD" "organismE"
class(sq[[1]]) # Class of each element of the list
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
And we have a list of DNAStringSet objects (Biostrings package). Now we can load a fairly complete pagoo object (you could still add more metadata to genes, clusters, or organisms):
p <- pagoo(data = data, # Required data
           org_meta = orgs_meta, # Organism's metadata
           cluster_meta = clust_meta, # Cluster's metadata
           sequences = sq) # Sequences
All of the above data preparation and class loading may seem difficult and time-consuming, but with real working datasets it will rarely be needed. We explain it here to give full details of how the software works, but the package also provides functions that automatically read output files from pangenome reconstruction software into pagoo, avoiding any formatting or manipulation of data.
Currently pagoo supports input from roary (Page et al., 2015), which has been the standard and most cited software for pangenome reconstruction, and panaroo (Tonkin-Hill et al., 2020). It is worth noting that, as roary became the most used software in this field, other tools such as PEPPAN, PIRATE and panaroo include scripts to convert their output to roary’s format. To work with roary’s output, please refer to the ?roary_2_pagoo documentation. Although panaroo also includes a tool for this conversion, we provide a function to load its native output format; see ?panaroo_2_pagoo. For both functions you will only need the .gff files used as input for roary or panaroo, and the gene_presence_absence.csv file.
Also, we have created our own pangenome reconstruction software called pewit (Ferrés et al., still unpublished), which automatically generates a pagoo-like object to perform downstream analyses. This object contains all the methods and fields pagoo provides, plus a set of methods and fields exclusive to this software.
Other good pangenome reconstruction software already exists, such as PanX (Ding et al., 2018), micropan (Snipen & Liland, 2015), and GET_HOMOLOGUES (Contreras-Moreira & Vinuesa, 2013), among others. We plan to support some of them in the future.
After object creation, you may want to add new metadata, either from new information or as a result of later analyses. pagoo objects include a method to add columns of metadata to each gene, each cluster, or each organism. To illustrate it, we will add a new column named host to the $organisms field, with made-up information about the host from which each genome was isolated.
host_df <- data.frame(org = p$organisms$org, host = c("Cow", "Dog", "Cat", "Cow", "Sheep"))
p$add_metadata(map = "org", host_df)
p$organisms
## DataFrame with 5 rows and 4 columns
##         org        sero     country        host
##    <factor> <character> <character> <character>
## 1 organismA           a    Westeros         Cow
## 2 organismB           b    Westeros         Dog
## 3 organismC           c    Westeros         Cat
## 4 organismD           a       Essos         Cow
## 5 organismE           b       Essos       Sheep
To allow pagoo to map the data correctly, the values in the first column of the metadata table must be present in p$organisms$org, and its column header must also be named org.
As said, you can add gene or cluster metadata following the same idea.
Once loaded, a pagoo object has two methods for saving it and reloading it in a new R session. The first is to save it as flat (text) files:
tmp <- paste(tempdir(), "pangenome", sep = "/")
p$write_pangenome(dir = tmp)
list.files(tmp, full.names = TRUE)
## [1] "/tmp/RtmpVE7Q3u/pangenome/clusters.tsv"
## [2] "/tmp/RtmpVE7Q3u/pangenome/data.tsv"
## [3] "/tmp/RtmpVE7Q3u/pangenome/organisms.tsv"
This creates a directory with 3 text files. The advantage of this approach is that you can analyze the files outside R. The disadvantage, from a reproducibility point of view, is that reading text can be less stable, since classes or numeric precision can be lost, and you cannot save the state of the object at a given time. Only the available organisms/genes/clusters are saved, so if you reload the class from the tsv files, any previously dropped organisms/genes/clusters will no longer be available. (For information about dropping and recovering organisms, see the “Subset” tutorial.)
If you want to save the object and continue working with it in another R session, we recommend saving it as an R object with the RDS methods provided:
rds <- paste(tempdir(), "pangenome.RDS", sep = "/")
p$save_pangenomeRDS(file = rds)
p2 <- load_pangenomeRDS(rds)
This method is more stable (compatible between pagoo versions), secure (it keeps the same metadata classes, and precision isn’t lost), and convenient (the exact state of the saved object is restored: dropped organisms/genes/clusters stay hidden but can still be recovered, and other object state configuration, e.g. core_level, is also saved).
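The difference between the text and RDS routes with respect to classes and precision can be illustrated with base R alone. The sketch below uses write.table()/read.table() and saveRDS()/readRDS() on a made-up table as stand-ins; it does not call pagoo's own methods and makes no assumption about how they are implemented.

```r
# Made-up table: a factor whose level order is not alphabetical and
# numbers that have no exact short decimal representation.
df <- data.frame(
  org   = c("organismA", "organismB", "organismC"),
  sero  = factor(c("a", "b", "c"), levels = c("c", "b", "a")),
  value = c(1 / 3, 2 / 3, 1)
)

# Text round trip (base R stand-in for a flat-file export)
tsv <- tempfile(fileext = ".tsv")
write.table(df, tsv, sep = "\t", quote = FALSE, row.names = FALSE)
from_text <- read.table(tsv, header = TRUE, sep = "\t", quote = "")

# RDS round trip
rds_file <- tempfile(fileext = ".RDS")
saveRDS(df, rds_file)
from_rds <- readRDS(rds_file)

# Compare each copy with the original
rds_exact      <- identical(from_rds, df)               # classes and values
text_same_sero <- identical(from_text$sero, df$sero)    # factor definition
text_same_val  <- identical(from_text$value, df$value)  # numeric precision
```

In this sketch only the RDS copy is an exact replica: the text copy does not carry the original factor definition, and its numbers are limited to the digits that were written to the file.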