geneHapR
is designed for gene haplotype statistics,
phenotype association and visualization.
Input files frequently present challenges to analysis. A common problem I encounter is that chromosome names and gene IDs are not standardized among VCF, FASTA and GFF files. That means additional work for users. I suggest reading these files into R, synchronizing the names and then proceeding with downstream analyses.
Large VCF or GFF file is another
important consideration when using geneHapR.
Usually, the
annotation file (GFF) and variants call format (VCF) files are usually
at chromosome or genome level. Thus those files may be huge. However, a
gene size usually range from hundreds to thousands bp. It means the
subset for a gene haplotype analysis only occupied a very small portion
in the huge file. Import this tiny files rather than huge one will save
much of time on personal computer. It’s convienient to extract a
gene/range from origin files with software in linux. In case the user
known a little or nothing about linux, I have developed a function names
as filterLargeVCF()
(see VCF filtration)
for this purpose.
geneHapR
is schemed to submit to CRAN. If accepted, this
package could be installed with
install.packages("geneHapR")
. geneHapR
has not
published yet, if you use geneHapR
in your study, please
contact Zhang RenLiang
(Maintainer) (email: zhang_renliang@163.com) or Jia GuanQing
(jiaguanqing@caas.cn)
install.packages("geneHapR")
The first step is library the geneHapR
packages. I will
use the test data inside this package as an example for how to perform
statistics of a gene/range, visualization and phenotype association
analysis.
library(geneHapR)
There are two options to conduct a gene haplotype analysis starts from a VCF file or DNA sequences file. Thus a VCF file or DNA sequences file is necessary. However, the GFF, phenos and accession groups are strongly recommend for visualization and phenotype associations.
The import functions takes file name/path as input.
import_vcf()
could import VCF file with surfix of “.vcf”
and “.vcf.gz”. import_gff()
import file format default as
“GFF” and import_seqs()
file format default as “fasta”.
# import vcf file
<- import_vcf("your_vcf_file_path.vcf")
vcf
# import gziped vcf file
<- import_vcf("your_vcf_file_path.vcf.gz") vcf
# import GFFs
<- import_gff("your_gff_file_path.gff", format = "GFF") gff
# import DNA sequences in fasta format
<- import_seqs("your_DNA_seq_file_path.fa", format = "fasta") seqs
# import phynotype data
<- import_AccINFO("your_pheno_file_path.txt")
pheno pheno
## GrainWeight.2021 GrainWeight.2022
## C1 16.76 18.76
## C2 6.66 8.66
## C3 7.80 16.30
## C4 19.73 23.73
## C5 11.95 16.95
## C6 12.43 30.45
# import accession group/location information
<- import_AccINFO("accession_group_file_path.txt") AccINFO
## Sensitive Type latitude longitude
## C1 Sensitive Mordern cultivar 65.216 33.677
## C2 Resistance Mordern cultivar 65.216 33.677
## C3 Mid Mordern cultivar 89.941 24.218
## C4 Sensitive Mordern cultivar 89.941 24.218
## C5 Resistance Mordern cultivar 89.941 24.218
## C6 Mid Mordern cultivar 25.231 42.761
Be aware that the phenotype and accession group are effectively
tables. There are more than one ways to import a table format file with
R
.
Just Note that: a. the accession/individual names
located in first column; b. the first row contents
phenotype/accession_group names; c. NA
is allowed, it’s not
a wise option to replace NA
by 0
.
eg.
# import pheno from space ' ' delimed table
<- read.table("your_pheno_file_path.csv", header = TRUE, row.names = 1, comment.char = "#")
pheno
# import pheno from ',' delimed table
<- read.csv("your_pheno_file_path.csv", header = TRUE, comment.char = "#") pheno
There is a little work need to be done before haplotype calculations: (1) VCF filtration and (2) DNA sequences alignment.
There are three modes to filter a vcfR
object after
import VCF into ‘R’: a. by position; b. by annotation; c. by both of
them.
# filter VCF by position
<- filter_vcf(vcf, mode = "POS",
vcf_f1 Chr = "scaffold_1",
start = 4300, end = 5890)
# filter VCF by annotation
<- filter_vcf(vcf, mode = "type",
vcf_f2 gff = gff,
type = "CDS")
# filter VCF by position and annotation
<- filter_vcf(vcf, mode = "both",
vcf_f3 Chr = "scaffold_1",
start = 4300, end = 5890,
gff = gff,
type = "CDS")
It’s a time consuming work to import and manipulate a very large file
with ‘R’ on personal computer. It’ll be more efficiency to extract the
target ranges from origin VCF with filterLargeVCF()
before
import. If your VCF file is just a few ‘MB’, this step was not necessary
at all.
Note: if extract more than one ranges, length of
output file names (VCFout
) must be equal with
Chr
and POS
.
# new VCF file will be saved to disk
# extract a single gene/range from a large vcf
filterLargeVCF(VCFin = "Ori.vcf.gz",
VCFout = "filtered.vcf.gz",
Chr = "scaffold_8",
POS = c(19802,24501),
override = TRUE)
# extract multi genes/ranges from large vcf
filterLargeVCF(VCFin = "Ori.vcf.gz", # surfix should be .vcf.gz or .vcf
VCFout = c("filtered1.vcf.gz", # surfix should be .vcf.gz or .vcf
"filtered2.vcf.gz",
"filtered3.vcf.gz"),
Chr = c("scaffold_8",
"scaffold_8",
"scaffold_7"),
POS = list(c(19802,24501),
c(27341,28949),
c(38469,40344)),
override = TRUE) # if TRUE, existed file will be override without warning
The origin DNA sequences must be aligned and trimmed due to haplotype
calculation need all sequences have same length. Those operations could
be done with geneHapR
. I still suggest users align and trim
DNA sequences with Mega software and then save the
result as FASTA format before import them into ‘R’.
# sequences alignment
<- allignSeqs(seqs, quiet = TRUE)
seqs
# sequences trim
<- trimSeqs(seqs, minFlankFraction = 0.1)
seqs seqs
As mentioned before, haplotype could be calculated from VCF or
sequences with vcf2hap()
or seqs2hap()
. The
genotype of most sites should be known and homozygous, still, a few site
are unknown or heterozygous due to chromosome variant or error cased by
sequencing or SNP calling or gaps or other reasons. It’s a hard decision
whether to drop accessions/individuals contains heterozygous or unknown
sites for every haplotype analysis. Hence, I leave the choice to
users.
Calculate haplotype result from VCF.
<- vcf2hap(vcf,
hapResult hapPrefix = "H",
hyb_remove = TRUE,
na.drop = TRUE)
hapResult
## Accessions:
## All: 37
## Removed: 4 C12, C16, C29, C33
## Remain: 33
##
## hap frequences:
## H001 H002 H003 H004 H005 H006 H007 H008 H009
## 10 8 4 4 2 2 1 1 1
##
## Options:
## hapPrefix: H
## CHROM: scaffold_1
## POS: 4300-6856
## hyb_remove: YES
## NA_remove: YES
##
## # A tibble: 37 × 11
## Hap `4300` `4345` `4879` `4910` `4950` `5037` `5209` `5213` `6856`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CHR scaffold_1 scaffold_1 scaff… scaff… scaff… scaff… scaff… scaff… scaff…
## 2 POS 4300 4345 4879 4910 4950 5037 5209 5213 6856
## 3 INFO <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 ALLELE G/C T/A,GG T/A,G GCCTA… T/AA,… A/AA,… A/AC C/G A/G
## 5 H001 G T T GCCTA T A A C A
## 6 H001 G T T GCCTA T A A C A
## 7 H001 G T T GCCTA T A A C A
## 8 H001 G T T GCCTA T A A C A
## 9 H001 G T T GCCTA T A A C A
## 10 H001 G T T GCCTA T A A C A
## # … with 27 more rows, and 1 more variable: Accession <chr>
Calculate haplotype result from aligned DNA sequences.
<- seqs2hap(seqs,
hapResult Ref = names(seqs)[1],
hapPrefix = "H",
hyb_remove = TRUE,
na.drop = TRUE,
maxGapsPerSeq = 0.25)
hapResult
Before visualization, there were a few details need to be adjusted. eg. add annotations and adjust position of “ATG”
hapResult
While hapResult
was calculated from vcfR
object, the INFO was taken from @fix
field. The VCF INFO may missing some annotations. or contents
format was inappropriate to display. Further more, INFO
contents nothing if hapResult
was generated from sequences.
Here, we can introduce/replace the origin INFO by
addINFO()
.
Note that: length of values
must be
equal with number of sites.
Let’s see how mant sites contains in the hapResult
.
# Chech number of sites conclude in hapResult
sites(hapResult)
## [1] 9
Now we replace the old INFO field with new tag named as “PrChange”.
# add annotations to INFO field
<- addINFO(hapResult,
hapResult tag = "PrChange",
values = rep(c("C->D", "V->R", "G->N"),3),
replace = TRUE)
Here, we add a tag named as “CDSChange” followed the old INFO.
# To replace the origin INFO by set 'replace' as TRUE
<- addINFO(hapResult,
hapResult tag = "CDSChange",
values = rep(c("C->A", "T->C", "G->T"),3),
replace = FALSE)
This function was only used to adjust the position of “ATG” to 0 and hence convert the gene on negative strand to positive strand.
Be note that: GFF and hapResult need to adjust position of ATG with the same parameters.
# set ATG position as zero in gff
<- gffSetATGas0(gff = gff, hap = hapResult,
newgff geneID = "test1G0387",
Chr = "scaffold_1",
POS = c(4300, 7910))
# set position of ATG as zero in hapResult/hapSummary
<- hapSetATGas0(gff = gff, hap = hapResult,
newhap geneID = "test1G0387",
Chr = "scaffold_1",
POS = c(4300, 7910))
hapResult
summary and visualizationOnce we have the hapResult
object, can we summarize and
visualize our hapResult
by interact with annotations and
phenotypes.
Now, we have the hapResult
object with INFOs we want
display in next step. The hap_summary()
function convert
the object of hapResult
class, which is a long table
format, into a short table belong to hapSummary
class. In
hapResult
each row represent a accession, while each row
represents a hap in hapSummary
.
<- hap_summary(hapResult)
hapSummary hapSummary
##
## Accssions: 33
## Sites: 9
## Indels: 5
## SNPs: 4
##
## Haplotypes: 9
## H001 10 C8, C9, C11, C14, C18, C25, ...
## H002 8 C5, C15, C17, C19, C22, C32, ...
## H003 4 C6, C20, C23, C37
## H004 4 C7, C13, C24, C30
## H005 2 C4, C21
## H006 2 C10, C27
## H007 1 C1
## H008 1 C2
## H009 1 C3
##
## Options:
## hapPrefix: H
## CHROM: scaffold_1
## POS: 4300-6856
## hyb_remove: YES
## NA_remove: YES
##
## # A tibble: 13 × 12
## Hap `4300` `4345` `4879` `4910` `4950` `5037` `5209` `5213` `6856`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CHR scaffold_1 scaff… scaff… scaff… scaff… scaff… scaff… scaff… scaff…
## 2 POS 4300 4345 4879 4910 4950 5037 5209 5213 6856
## 3 INFO PrChange=C->D… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha… PrCha…
## 4 ALLELE G/C T/A,GG T/A,G GCCTA… T/AA,… A/AA,… A/AC C/G A/G
## 5 H001 G T T GCCTA T A A C A
## 6 H002 C T T A T A A G A
## 7 H003 G A A A AA AA AC G A
## 8 H004 G T T A T A A G A
## 9 H005 G T G GCCTA T GGG A C A
## 10 H006 G T T A T A A G G
## 11 H007 C GG G A GG CC A G A
## 12 H008 G GG T A T CC A G G
## 13 H009 G T T A GG GGG A G G
## # … with 2 more variables: Accession <chr>, freq <int>
Let’s see how to visualization of our haplotype results.
At first let’s display the hapSummary
as a table. In
this table like figure we can see all the variants and their positions,
haplotypes and their frequencies.
plotHapTable(hapSummary)
Also we can add an annotation, “CDSChange”, to the table by assign
the INFO_tag
. It’s your responsibility to verify whether
the INFO_tag was existed in the INFO field.
# add one annotation
plotHapTable(hapSummary,
hapPrefix = "H",
INFO_tag = "CDSChange",
tag_name = "CDS",
displayIndelSize = 1,
angle = 45,
replaceMultiAllele = TRUE,
ALLELE.color = "grey90")
Now let’s add another INFO_tag
named as “PrChange”.
# add multi annotation
plotHapTable(hapSummary,
hapPrefix = "H",
INFO_tag = c("CDSChange", "PrChange"),
displayIndelSize = 1,
angle = 45,
replaceMultiAllele = TRUE,
ALLELE.color = "grey90")
Parameter tag_name
was used to replace the character if
INFO_tag
was too long.
# add multi annotation
plotHapTable(hapSummary,
hapPrefix = "H",
INFO_tag = c("CDSChange", "PrChange"),
tag_name = c("CDS", "Pr"),
displayIndelSize = 1,
angle = 45,
replaceMultiAllele = TRUE,
ALLELE.color = "grey90")
I think it’s a good idea to figure out where are the variants by marking them on gene model.
displayVarOnGeneModel(gff,
hapSummary,Chr = "scaffold_1",
startPOS = 4300, endPOS = 7910,
type = "pin", cex = 0.7,
CDS_h = 0.05, fiveUTR_h = 0.02, threeUTR_h = 0.01)
hapNet
calculation and visualizationThe hapNet
could be generated from object of
hapSummary
class. The accession group information could be
attached in this step.
<- get_hapNet(hapSummary,
hapNet AccINFO = AccINFO,
groupName = "Type")
Once we have the hapNet
object, we can plot it with
‘R’.
# plot haploNet
plotHapNet(hapNet,
size = "freq", # circle size
scale = "log2", # scale circle with 'log10(size + 1)'
cex = 0.8, # size of hap symbol
col.link = 2, # link colors
link.width = 2, # link widths
show.mutation = 2, # mutation types one of c(0,1,2,3)
legend = c(-13,-2)) # legend position
Finally, let’s see which haplotype has superiority at particular area by interact with phynotype.
Here are two options, merged or separated, to organized the heatmap
of p-values and violin plot. The figure as an object of
ggplot2
, which means user could add/modified figure
elements with ggplot2
.
Here is an example for merged arrangement:
<-hapVsPheno(hapResult,
results hapPrefix = "H",
title = "This is title",
mergeFigs = TRUE,
pheno = pheno,
phenoName = "GrainWeight.2021",
minAcc = 3)
plot(results$figs)
An example for separated plot:
<- hapVsPheno(hap = hapResult,
results hapPrefix = "H",
title = "This is title",
pheno = pheno,
phenoName = "GrainWeight.2021",
minAcc = 3,
mergeFigs = FALSE)
plot(results$fig_pvalue)
plot(results$fig_Violin)
I believe the function of hapVsPhenos()
will be useful
when you have a lot of phenotype need to be associated with haplotype
results.
Note that: the pheno name will be added between the file name and surfix.
hapVsPhenos(hapResult,
pheno,outPutSingleFile = TRUE,
hapPrefix = "H",
title = "Seita.0G000000",
file = "mypheno.tiff",
width = 12,
height = 8,
res = 300)
Now we get the haplotype result and find a few main haplotypes. There is a new question emerged: how did those main haplotypes distributed, are they related to geography?
# library(mapdata)
# library(maptools)
hapDistribution(hapResult,
AccINFO = AccINFO,
LON.col = "longitude",
LAT.col = "latitude",
hapNames = c("H001", "H002", "H003"),
legend = TRUE)