Quick Start

If you want a quick look at pagoo and to start playing with pangenome objects, this short tutorial introduces the core concepts. Let’s start by loading a Campylobacter spp. dataset included in the package.
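The loading step can be sketched as follows. This is a minimal sketch that assumes the example dataset is shipped as an RDS file named campylobacter.RDS in the package's extdata directory, and that load_pangenomeRDS() is the function that restores it. Both the file name and the function name are assumptions here and should be checked against the pagoo documentation.

```r
library(pagoo)  # load the package

# Assumption: the example dataset is stored as 'extdata/campylobacter.RDS'
# inside the installed package.
rds <- system.file("extdata", "campylobacter.RDS", package = "pagoo")

# Assumption: load_pangenomeRDS() restores a pangenome object from that file.
campy <- load_pangenomeRDS(rds)
```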

Now that the object (campy) is loaded, we can start querying it. pagoo was designed around the idea that, in a pangenome, each individual gene belongs to a given organism and is assigned to a cluster of orthologs. These three variables are interconnected, but each can carry its own specific metadata. For example, an individual gene has coordinates inside a genome, which makes no sense for a whole cluster; likewise, an organism has a host from which it was isolated, but that information does not apply to an individual gene.
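To make the idea concrete, here is a small base R sketch of three linked tables. All names and values are toy data invented for illustration (they are not taken from the campy dataset), and this is not how pagoo stores things internally; it only shows how gene-level rows can pick up organism-level and cluster-level metadata through shared keys.

```r
# Toy organism-level metadata (hypothetical values).
organisms <- data.frame(
  org  = c("orgA", "orgB"),
  host = c("Human", "Bovine")
)

# Toy cluster-level metadata (hypothetical values).
clusters <- data.frame(
  cluster   = c("group0001", "group0002"),
  Pfam_Arch = c("2-Hacid_dh_C", "ZnuA")
)

# Toy gene-level table: each gene belongs to one organism and one cluster,
# and carries metadata of its own (here, a start coordinate).
genes <- data.frame(
  cluster = c("group0001", "group0001", "group0002"),
  org     = c("orgA", "orgB", "orgA"),
  gene    = c("orgA_00150", "orgB_00210", "orgA_01290"),
  start   = c(1200L, 3400L, 9100L)
)

# Join gene rows with organism and cluster metadata via the shared keys.
annotated <- merge(merge(genes, organisms, by = "org"), clusters, by = "cluster")
annotated[, c("gene", "cluster", "org", "start", "host", "Pfam_Arch")]
```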

Basic Fields

These three variables are stored as three separate tables that can be queried:

campy$organisms

## DataFrame with 7 rows and 8 columns
##          org          id      strain      year     country        host
##     <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6         FR15   2008/170h      2008      France       Human
## 2 16244_6_18        FR27   2012/185h      2012      France       Human
## 3 17059_2_16         AR1      99/801      1999   Argentina      Bovine
## 4 17059_2_23         AR8      04/875      2004   Argentina      Bovine
## 5 17059_2_27        AR12      06/195      2006   Argentina      Bovine
## 6 17150_1_73         CA1   001A-0374      2005      Canada       Human
## 7 17059_2_42         TW6        1830      2008      Taiwan       Human
##        source   accession
##   <character> <character>
## 1       Feces   ERS672247
## 2       Blood   ERS672259
## 3     Prepuce   ERS739235
## 4       Fetus   ERS739242
## 5          VM   ERS739246
## 6       Blood   ERS686652
## 7       Blood   ERS739261

(Tip: To see all fields and methods, type campy$ in any R console and press the [TAB] key twice.)

This dataset consists of 7 Campylobacter spp. genomes. Each organism has one row with its associated metadata. The first column, org, identifies the organism.

campy$clusters

## DataFrame with 2588 rows and 2 columns
##        cluster              Pfam_Arch
##       <factor>            <character>
## 1    group0001           2-Hacid_dh_C
## 2    group0002 2-Hacid_dh_C;2-Hacid..
## 3    group0003 2-Hacid_dh_C;ACT;2-H..
## 4    group0004        2Fe-2S_thioredx
## 5    group0005    4HB_MCP_1;MCPsignal
## ...        ...                    ...
## 2584 group2584              zf-RING_7
## 2585 group2585               zf-TFIIB
## 2586 group2586                   ZinT
## 2587 group2587                   ZnuA
## 2588 group2588               ZT_dimer

The $clusters field returns a table with metadata associated with each cluster of orthologs; in this case, the Pfam domain architecture (second column).

The last and most important field is $genes, which returns a list of DataFrames with information about each individual gene, grouped by cluster. We leave it to the reader to inspect this field.

campy$genes

The first 3 columns (cluster, org, and gene) are the glue that interconnects the three “variables”.

Another useful field is $pan_matrix, which returns a matrix with gene abundance for each cluster (columns), and each organism (rows).
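As an illustration of what such a matrix contains, the following base R sketch counts genes per organism and cluster from a toy gene table. The data are invented for the example, and this is not necessarily how pagoo computes $pan_matrix internally.

```r
# Toy gene table (hypothetical values): orgA has two genes in group0001.
genes <- data.frame(
  cluster = c("group0001", "group0001", "group0001", "group0002"),
  org     = c("orgA", "orgA", "orgB", "orgA"),
  gene    = c("orgA_00150", "orgA_00151", "orgB_00210", "orgA_01290")
)

# Cross-tabulate: organisms in rows, clusters in columns, gene counts in cells.
pan_matrix <- unclass(table(genes$org, genes$cluster))
pan_matrix
```

A value greater than 1 indicates several genes from the same organism in one cluster, and 0 indicates that the cluster is absent from that organism.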

Basic Methods

pagoo objects contain basic methods to analyze the pangenome, from general statistics to some basic plotting capabilities. Some of these methods can also take arguments.

For example:

campy$dist(method = "bray")

##             16244_6_6 16244_6_18 17059_2_16 17059_2_23 17059_2_27 17150_1_73
## 16244_6_18 0.06594656                                                       
## 17059_2_16 0.12122816 0.12500000                                            
## 17059_2_23 0.09622745 0.09632517 0.07632399                                 
## 17059_2_27 0.09245937 0.11310008 0.10311629 0.08230990                      
## 17150_1_73 0.08203991 0.09034444 0.13624408 0.12275937 0.12999735           
## 17059_2_42 0.08920705 0.09927089 0.14532148 0.12682137 0.13706919 0.09518600
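To show what these values mean, here is a base R sketch of the standard Bray-Curtis dissimilarity on a toy abundance matrix. It assumes that method = "bray" refers to this standard definition, sum(|x - y|) / sum(x + y), applied to the rows of the pan matrix. The matrix and the helper bray_curtis() are hypothetical and only for illustration.

```r
# Toy pan matrix (hypothetical): organisms in rows, clusters in columns.
pm <- matrix(
  c(1, 2, 0,
    1, 1, 1,
    0, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("orgA", "orgB", "orgC"),
                  c("group0001", "group0002", "group0003"))
)

# Bray-Curtis dissimilarity between two abundance vectors.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a symmetric matrix with all pairwise dissimilarities.
n <- nrow(pm)
d <- matrix(0, n, n, dimnames = list(rownames(pm), rownames(pm)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(pm[i, ], pm[j, ])
  }
}

# Show it in the same lower-triangle layout as a 'dist' object.
as.dist(d)
```

Values close to 0 indicate organisms with very similar gene content, and values close to 1 indicate organisms that share little.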

Or:

campy$gg_barplot()

Sequence Manipulation

One of the main advantages of using pagoo is how easily sequences can be manipulated. Sequences are stored as a DNAStringSetList (a list of DNAStringSet objects, one per cluster) from the Biostrings package.

campy$sequences

## DNAStringSetList of length 2588
## [["group0001"]] 16244_6_6__16244_6_6_00150=ATGGCGATAACAGTTTATTACGACAAAGATTGCG...
## [["group0002"]] 16244_6_6__16244_6_6_01290=ATGAAAATAGTATGCTTAGATGCCGACACGCTTG...
## [["group0003"]] 16244_6_6__16244_6_6_01710=ATGAAAACAGTTATAGTTTGCGATGCAATACATC...
## [["group0004"]] 16244_6_6__16244_6_6_01754=ATGAAATTCGAATTTACTCATGAGCAATTATCGG...
## [["group0005"]] 16244_6_6__16244_6_6_00049=ATGTCAAATTTAACTACTAACTTAACTACCAAAA...
## [["group0006"]] 16244_6_6__16244_6_6_01069=ATGAATTATTTTGAGAATTTAAAAGTTTCAACAA...
## [["group0007"]] 16244_6_6__16244_6_6_01612=ATGCGAATTAGAATTTATTATGAAGATACCGATG...
## [["group0008"]] 16244_6_6__16244_6_6_01679=ATGATGAAAGATATGGGCGAGCCACGTATAAAAA...
## [["group0009"]] 16244_6_18__16244_6_18_01216=ATGGGGCTTACTACGAGTACGACAAAGTATAT...
## [["group0010"]] 16244_6_6__16244_6_6_00758=ATGAAAAGAGTGGTTATAAAAGTAGGCTCTCACG...
## ...
## <2578 more elements>
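For readers new to this container, the sketch below builds a small DNAStringSetList by hand with Biostrings and shows a few common operations. The sequences and names are toy data invented for the example; only the naming pattern (organism and gene joined by a double underscore) mimics the output above.

```r
library(Biostrings)

# Toy sequences (hypothetical), one DNAStringSet per cluster.
seqs <- DNAStringSetList(
  DNAStringSet(c(orgA__orgA_00150 = "ATGGCGATAACA",
                 orgB__orgB_00210 = "ATGGCGATTACA")),
  DNAStringSet(c(orgA__orgA_01290 = "ATGAAAATAGTATGC"))
)
names(seqs) <- c("group0001", "group0002")

# Extract all sequences of one cluster as a DNAStringSet.
g1 <- seqs[["group0001"]]
g1

# Number of sequences per cluster.
elementNROWS(seqs)

# Sequence lengths within the extracted cluster.
width(g1)
```

The same list-style subsetting should apply to the real object, for example campy$sequences[["group0001"]], since the output above shows it to be a DNAStringSetList.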