You can check that gwasrapidd is able to connect to https://www.ebi.ac.uk by making a
connection attempt with the function is_ebi_reachable():
is_ebi_reachable()
It returns TRUE if the connection is possible, or FALSE otherwise. If the connection is
not possible, use the parameter chatty = TRUE to learn at what point the connection is
failing:
is_ebi_reachable(chatty = TRUE)
The GWAS Catalog is mapped against Ensembl, dbSNP and a specific assembly version of
the human genome. You can get this information with get_metadata():
get_metadata()
#> $ensembl_release_number
#> [1] 107
#>
#> $genome_build_version
#> [1] "GRCh38.p13"
#>
#> $dbsnp_version
#> [1] 154
#>
#> $usage_start_date
#> [1] "2022-07-12 13:00:04 UTC"
The four main retrieval functions get_studies(), get_associations(), get_variants(),
and get_traits() allow you to search by multiple values for the same search criterion.
You only need to pass a vector of queries to each search criterion parameter. Here are
some simple examples.
Get studies by study identifiers (GCST002420 or GCST000392):
get_studies(study_id = c('GCST002420', 'GCST000392'))
#> An object of class "studies"
#> Slot "studies":
#> # A tibble: 2 × 13
#> study_id repor…¹ initi…² repli…³ gxe gxg snp_c…⁴ quali…⁵ imputed pooled
#> <chr> <chr> <chr> <chr> <lgl> <lgl> <int> <chr> <lgl> <lgl>
#> 1 GCST002… Binge … 206 Eu… 70 Eur… FALSE FALSE 8466825 <NA> TRUE FALSE
#> 2 GCST000… Type 1… 7,514 … 4,267 … FALSE FALSE 841622 <NA> TRUE FALSE
#> # … with 3 more variables: study_design_comment <chr>,
#> # full_pvalue_set <lgl>, user_requested <lgl>, and abbreviated variable
#> # names ¹reported_trait, ²initial_sample_size, ³replication_sample_size,
#> # ⁴snp_count, ⁵qualifier
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "genotyping_techs":
#> # A tibble: 2 × 2
#> study_id genotyping_technology
#> <chr> <chr>
#> 1 GCST002420 Genome-wide genotyping array
#> 2 GCST000392 Genome-wide genotyping array
#>
#> Slot "platforms":
#> # A tibble: 3 × 2
#> study_id manufacturer
#> <chr> <chr>
#> 1 GCST002420 Affymetrix
#> 2 GCST000392 Illumina
#> 3 GCST000392 Affymetrix
#>
#> Slot "ancestries":
#> # A tibble: 4 × 4
#> study_id ancestry_id type number_of_individuals
#> <chr> <int> <chr> <int>
#> 1 GCST002420 1 initial 929
#> 2 GCST002420 2 replication 828
#> 3 GCST000392 1 initial 16559
#> 4 GCST000392 2 replication 13279
#>
#> Slot "ancestral_groups":
#> # A tibble: 4 × 3
#> study_id ancestry_id ancestral_group
#> <chr> <int> <chr>
#> 1 GCST002420 1 European
#> 2 GCST002420 2 European
#> 3 GCST000392 1 European
#> 4 GCST000392 2 European
#>
#> Slot "countries_of_origin":
#> # A tibble: 2 × 5
#> study_id ancestry_id country_name major_area region
#> <chr> <int> <chr> <chr> <chr>
#> 1 GCST002420 1 <NA> <NA> <NA>
#> 2 GCST002420 2 <NA> <NA> <NA>
#>
#> Slot "countries_of_recruitment":
#> # A tibble: 5 × 5
#> study_id ancestry_id country_name major_area region
#> <chr> <int> <chr> <chr> <chr>
#> 1 GCST002420 1 U.S. Northern America <NA>
#> 2 GCST002420 2 U.S. Northern America <NA>
#> 3 GCST000392 1 U.K. Europe Northern Europe
#> 4 GCST000392 2 U.K. Europe Northern Europe
#> 5 GCST000392 2 Denmark Europe Northern Europe
#>
#> Slot "publications":
#> # A tibble: 2 × 7
#> study_id pubmed_id publication_date publication title autho…¹ autho…²
#> <chr> <int> <date> <chr> <chr> <chr> <chr>
#> 1 GCST002420 24882193 2014-04-19 J Affect Disord Bipo… Winham… 0000-0…
#> 2 GCST000392 19430480 2009-05-10 Nat Genet Geno… Barret… <NA>
#> # … with abbreviated variable names ¹author_fullname, ²author_orcid
Get associations by variant identifiers (rs3798440 or rs7329174):
get_associations(variant_id = c('rs3798440', 'rs7329174'))
#> An object of class "associations"
#> Slot "associations":
#> # A tibble: 6 × 17
#> associatio…¹ pvalue pvalu…² pvalu…³ pvalu…⁴ multi…⁵ snp_i…⁶ snp_t…⁷ stand…⁸
#> <chr> <dbl> <chr> <int> <int> <lgl> <lgl> <chr> <dbl>
#> 1 24299710 3e-10 <NA> 3 -10 FALSE TRUE novel NA
#> 2 16617 1e- 8 <NA> 1 -8 FALSE FALSE novel NA
#> 3 26451 8e- 9 <NA> 8 -9 FALSE FALSE novel NA
#> 4 26394 6e- 6 <NA> 6 -6 FALSE FALSE known NA
#> 5 17433639 3e- 6 (Chine… 3 -6 FALSE FALSE known NA
#> 6 92481688 3e- 6 <NA> 3 -6 FALSE FALSE known NA
#> # … with 8 more variables: range <chr>, or_per_copy_number <dbl>,
#> # beta_number <dbl>, beta_unit <chr>, beta_direction <chr>,
#> # beta_description <chr>, last_mapping_date <dttm>,
#> # last_update_date <dttm>, and abbreviated variable names ¹association_id,
#> # ²pvalue_description, ³pvalue_mantissa, ⁴pvalue_exponent,
#> # ⁵multiple_snp_haplotype, ⁶snp_interaction, ⁷snp_type, ⁸standard_error
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "loci":
#> # A tibble: 7 × 4
#> association_id locus_id haplotype_snp_count description
#> <chr> <int> <int> <chr>
#> 1 24299710 1 NA SNP x SNP interaction
#> 2 24299710 2 NA SNP x SNP interaction
#> 3 16617 1 NA Single variant
#> 4 26451 1 NA Single variant
#> 5 26394 1 NA Single variant
#> 6 17433639 1 NA Single variant
#> 7 92481688 1 NA Single variant
#>
#> Slot "risk_alleles":
#> # A tibble: 7 × 7
#> association_id locus_id variant_id risk_allele risk_frequ…¹ genom…² limit…³
#> <chr> <int> <chr> <chr> <dbl> <lgl> <lgl>
#> 1 24299710 1 rs3798440 A NA TRUE FALSE
#> 2 24299710 2 rs9350602 C NA TRUE FALSE
#> 3 16617 1 rs7329174 G NA NA NA
#> 4 26451 1 rs7329174 G NA NA NA
#> 5 26394 1 rs7329174 G NA NA NA
#> 6 17433639 1 rs7329174 <NA> 0.211 FALSE FALSE
#> 7 92481688 1 rs7329174 G NA FALSE FALSE
#> # … with abbreviated variable names ¹risk_frequency, ²genome_wide,
#> # ³limited_list
#>
#> Slot "genes":
#> # A tibble: 10 × 3
#> association_id locus_id gene_name
#> <chr> <int> <chr>
#> 1 24299710 1 MYO6
#> 2 24299710 2 MYO6
#> 3 16617 1 ELF1
#> 4 26451 1 WBP4
#> 5 26451 1 ELF1
#> 6 26451 1 microRNA2276
#> 7 26451 1 SLC25A15
#> 8 26394 1 ELF1
#> 9 17433639 1 ELF1
#> 10 92481688 1 <NA>
#>
#> Slot "ensembl_ids":
#> # A tibble: 10 × 4
#> association_id locus_id gene_name ensembl_id
#> <chr> <int> <chr> <chr>
#> 1 24299710 1 MYO6 ENSG00000196586
#> 2 24299710 2 MYO6 ENSG00000196586
#> 3 16617 1 ELF1 ENSG00000120690
#> 4 26451 1 WBP4 ENSG00000120688
#> 5 26451 1 ELF1 ENSG00000120690
#> 6 26451 1 microRNA2276 <NA>
#> 7 26451 1 SLC25A15 ENSG00000102743
#> 8 26394 1 ELF1 ENSG00000120690
#> 9 17433639 1 ELF1 ENSG00000120690
#> 10 92481688 1 <NA> <NA>
#>
#> Slot "entrez_ids":
#> # A tibble: 10 × 4
#> association_id locus_id gene_name entrez_id
#> <chr> <int> <chr> <chr>
#> 1 24299710 1 MYO6 4646
#> 2 24299710 2 MYO6 4646
#> 3 16617 1 ELF1 1997
#> 4 26451 1 WBP4 11193
#> 5 26451 1 ELF1 1997
#> 6 26451 1 microRNA2276 <NA>
#> 7 26451 1 SLC25A15 10166
#> 8 26394 1 ELF1 1997
#> 9 17433639 1 ELF1 1997
#> 10 92481688 1 <NA> <NA>
Get associations by traits (braces or binge eating or gambling):
get_associations(efo_trait = c('braces', 'binge eating', 'gambling'))
#> An object of class "associations"
#> Slot "associations":
#> # A tibble: 14 × 17
#> associati…¹ pvalue pvalu…² pvalu…³ pvalu…⁴ multi…⁵ snp_i…⁶ snp_t…⁷ stand…⁸
#> <chr> <dbl> <chr> <int> <int> <lgl> <lgl> <chr> <dbl>
#> 1 15608 4e-7 (brace… 4 -7 FALSE FALSE novel NA
#> 2 44592 9e-7 <NA> 9 -7 FALSE FALSE novel NA
#> 3 44589 1e-6 <NA> 1 -6 FALSE FALSE novel NA
#> 4 44590 4e-6 <NA> 4 -6 FALSE FALSE novel NA
#> 5 27460823 1e-6 <NA> 1 -6 FALSE FALSE novel NA
#> 6 27460811 1e-7 <NA> 1 -7 FALSE FALSE novel NA
#> 7 27460817 7e-7 <NA> 7 -7 FALSE FALSE novel NA
#> 8 27460805 3e-8 <NA> 3 -8 FALSE FALSE novel NA
#> 9 27460830 1e-6 <NA> 1 -6 FALSE FALSE novel NA
#> 10 27460844 1e-8 <NA> 1 -8 FALSE FALSE novel NA
#> 11 27460858 3e-7 <NA> 3 -7 FALSE FALSE novel NA
#> 12 27460864 3e-7 <NA> 3 -7 FALSE FALSE novel NA
#> 13 27460870 1e-6 <NA> 1 -6 FALSE FALSE novel NA
#> 14 27460851 9e-8 <NA> 9 -8 FALSE FALSE novel NA
#> # … with 8 more variables: range <chr>, or_per_copy_number <dbl>,
#> # beta_number <dbl>, beta_unit <chr>, beta_direction <chr>,
#> # beta_description <chr>, last_mapping_date <dttm>,
#> # last_update_date <dttm>, and abbreviated variable names ¹association_id,
#> # ²pvalue_description, ³pvalue_mantissa, ⁴pvalue_exponent,
#> # ⁵multiple_snp_haplotype, ⁶snp_interaction, ⁷snp_type, ⁸standard_error
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "loci":
#> # A tibble: 14 × 4
#> association_id locus_id haplotype_snp_count description
#> <chr> <int> <int> <chr>
#> 1 15608 1 NA Single variant
#> 2 44592 1 NA Single variant
#> 3 44589 1 NA Single variant
#> 4 44590 1 NA Single variant
#> 5 27460823 1 NA Single variant
#> 6 27460811 1 NA Single variant
#> 7 27460817 1 NA Single variant
#> 8 27460805 1 NA Single variant
#> 9 27460830 1 NA Single variant
#> 10 27460844 1 NA Single variant
#> 11 27460858 1 NA Single variant
#> 12 27460864 1 NA Single variant
#> 13 27460870 1 NA Single variant
#> 14 27460851 1 NA Single variant
#>
#> Slot "risk_alleles":
#> # A tibble: 14 × 7
#> association_id locus_id variant_id risk_allele risk_fre…¹ genom…² limit…³
#> <chr> <int> <chr> <chr> <dbl> <lgl> <lgl>
#> 1 15608 1 rs1535480 <NA> NA NA NA
#> 2 44592 1 rs6006893 <NA> NA NA NA
#> 3 44589 1 rs10198175 <NA> NA NA NA
#> 4 44590 1 rs13233490 <NA> NA NA NA
#> 5 27460823 1 rs182107583 C 0.04 FALSE FALSE
#> 6 27460811 1 rs7904579 G 0.37 FALSE FALSE
#> 7 27460817 1 rs1950038 T 0.3 FALSE FALSE
#> 8 27460805 1 rs726170 T 0.12 FALSE FALSE
#> 9 27460830 1 rs76087671 T 0.05 FALSE FALSE
#> 10 27460844 1 rs111940429 T 0.04 FALSE FALSE
#> 11 27460858 1 rs7337127 T 0.15 FALSE FALSE
#> 12 27460864 1 rs145763646 A 0.1 FALSE FALSE
#> 13 27460870 1 rs73057489 C 0.07 FALSE FALSE
#> 14 27460851 1 rs17810023 T 0.02 FALSE FALSE
#> # … with abbreviated variable names ¹risk_frequency, ²genome_wide,
#> # ³limited_list
#>
#> Slot "genes":
#> # A tibble: 15 × 3
#> association_id locus_id gene_name
#> <chr> <int> <chr>
#> 1 15608 1 <NA>
#> 2 44592 1 PRR5
#> 3 44589 1 APOB
#> 4 44590 1 PER4
#> 5 27460823 1 LOC101929321
#> 6 27460811 1 CUBN
#> 7 27460817 1 Intergenic
#> 8 27460805 1 PRR5
#> 9 27460805 1 ARHGAP8
#> 10 27460830 1 intergenic
#> 11 27460844 1 AC096669.1
#> 12 27460858 1 intergenic
#> 13 27460864 1 SLC25A26
#> 14 27460870 1 intergenic
#> 15 27460851 1 RP11-250B2.3
#>
#> Slot "ensembl_ids":
#> # A tibble: 16 × 4
#> association_id locus_id gene_name ensembl_id
#> <chr> <int> <chr> <chr>
#> 1 15608 1 <NA> <NA>
#> 2 44592 1 PRR5 ENSG00000186654
#> 3 44589 1 APOB ENSG00000084674
#> 4 44590 1 PER4 <NA>
#> 5 27460823 1 LOC101929321 <NA>
#> 6 27460811 1 CUBN ENSG00000107611
#> 7 27460817 1 Intergenic <NA>
#> 8 27460805 1 PRR5 ENSG00000186654
#> 9 27460805 1 ARHGAP8 ENSG00000241484
#> 10 27460830 1 intergenic <NA>
#> 11 27460844 1 AC096669.1 ENSG00000225588
#> 12 27460858 1 intergenic <NA>
#> 13 27460864 1 SLC25A26 ENSG00000282739
#> 14 27460864 1 SLC25A26 ENSG00000144741
#> 15 27460870 1 intergenic <NA>
#> 16 27460851 1 RP11-250B2.3 <NA>
#>
#> Slot "entrez_ids":
#> # A tibble: 15 × 4
#> association_id locus_id gene_name entrez_id
#> <chr> <int> <chr> <chr>
#> 1 15608 1 <NA> <NA>
#> 2 44592 1 PRR5 55615
#> 3 44589 1 APOB 338
#> 4 44590 1 PER4 <NA>
#> 5 27460823 1 LOC101929321 101929321
#> 6 27460811 1 CUBN 8029
#> 7 27460817 1 Intergenic <NA>
#> 8 27460805 1 PRR5 55615
#> 9 27460805 1 ARHGAP8 23779
#> 10 27460830 1 intergenic <NA>
#> 11 27460844 1 AC096669.1 <NA>
#> 12 27460858 1 intergenic <NA>
#> 13 27460864 1 SLC25A26 115286
#> 14 27460870 1 intergenic <NA>
#> 15 27460851 1 RP11-250B2.3 <NA>
Get traits by PubMed identifiers (24882193 or 22780124):
get_traits(pubmed_id = c('24882193', '22780124'))
#> An object of class "traits"
#> Slot "traits":
#> # A tibble: 3 × 3
#> efo_id trait uri
#> <chr> <chr> <chr>
#> 1 EFO_0005924 binge eating http://www.ebi.ac.uk/efo/EFO_0005924
#> 2 EFO_0000289 bipolar disorder http://www.ebi.ac.uk/efo/EFO_0000289
#> 3 EFO_0004699 gambling behaviour http://www.ebi.ac.uk/efo/EFO_0004699
The only search parameters that are not vectorised are user_requested and
full_pvalue_set from get_studies(). These parameters are not vectorised because they
take boolean values (TRUE or FALSE), so only one of the two values makes sense as a
query at a given time.
There are two levels of trait description in the GWAS Catalog: (EFO) trait and reported trait.
Studies are assigned one or more terms from the Experimental Factor Ontology (EFO), i.e., an EFO trait, or simply trait, that best represents the phenotype under investigation.
In addition, each study is also assigned a free-text reported trait. This is written by the GWAS Catalog curators and reflects the authors' language; where necessary, it includes a more specific and detailed description of the experimental design, e.g., interaction studies or studies with a background trait.
As an example, take the study with accession identifier GCST000206 by EM Behrens et al.
(2008). We can get the EFO trait with get_traits() and the reported trait with
get_studies():
The (EFO) trait for the Behrens study is juvenile idiopathic arthritis:
efo_trait <- get_traits(study_id = 'GCST000206')
efo_trait@traits$trait
#> [1] "juvenile idiopathic arthritis"
whereas the reported trait is Arthritis (juvenile idiopathic):
study <- get_studies(study_id = 'GCST000206')
study@studies$reported_trait
#> [1] "Arthritis (juvenile idiopathic)"
The REST API response for variants contains an element named genomic contexts. This
element is mapped onto the table genomic_contexts of a variants S4 object.
There is a server-side bug affecting the column chromosome_position of the
genomic_contexts table: the chromosome position returned is that of the variant and not
that of the gene (genomic context), as it should be.
The GWAS Catalog team is aware of this bug and plans to fix it eventually. For the time
being, do not rely on chromosome_position of the genomic_contexts table.
To search for variants within a genomic range, you may use the function get_variants()
with the parameter genomic_range.
For example, to search for variants on chromosome Y in the interval
14692000–14695000, start by defining a list of three elements, chromosome, start and
end, that specify your genomic range:
# 'chromosome' names are case sensitive, and should be uppercase.
# 'start' and 'end' positions should be integer vectors.
my_genomic_range <- list(
  chromosome = 'Y',
  start = 14692000L,
  end = 14695000L)
Now you can use my_genomic_range to retrieve the variants:
chr_Y_variants <- get_variants(genomic_range = my_genomic_range)
chr_Y_variants@variants[c('variant_id', 'functional_class')]
#> # A tibble: 1 × 2
#> variant_id functional_class
#> <chr> <chr>
#> 1 rs2115848 intron_variant
To search in multiple regions, construct your genomic range list with those locations, just like in the previous example. For example, let’s now search for variants on chromosomes X and Y, both in the range 13000000–15000000:
my_genomic_range <- list(
  chromosome = c('X', 'Y'),
  start = c(13000000L, 13000000L),
  end = c(15000000L, 15000000L))
chr_XY_variants <- get_variants(genomic_range = my_genomic_range)
chr_XY_variants@variants[c('variant_id',
                           'chromosome_name',
                           'chromosome_position')]
#> # A tibble: 22 × 3
#> variant_id chromosome_name chromosome_position
#> <chr> <chr> <int>
#> 1 rs111689944 X 13494655
#> 2 rs73633565 X 13459192
#> 3 rs5980075 X 14932409
#> 4 rs142204301 X 13358691
#> 5 rs2361151 X 13503433
#> 6 rs35164803 X 14911253
#> 7 rs12558341 X 13695943
#> 8 rs138331350 X 14841981
#> 9 rs66819623 X 13936278
#> 10 rs61273829 X 14150876
#> # … with 12 more rows
#> # ℹ Use `print(n = ...)` to see more rows
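Because a malformed range list is easy to write (for example, positions stored as doubles rather than integers), it can help to check the list before sending the query. The following is a minimal sketch in base R; check_genomic_range() is a hypothetical helper, not a gwasrapidd function, and it only encodes the constraints stated in the comments of the examples above.

```r
# Hypothetical helper (not part of gwasrapidd): sanity-check a genomic range
# list before passing it to get_variants(genomic_range = ...).
check_genomic_range <- function(x) {
  stopifnot(
    is.list(x),
    setequal(names(x), c("chromosome", "start", "end")),
    is.character(x$chromosome),
    # Chromosome names are case sensitive and should be uppercase.
    all(x$chromosome == toupper(x$chromosome)),
    # Positions should be integer vectors (note the L suffix).
    is.integer(x$start),
    is.integer(x$end),
    # One start and one end per chromosome entry.
    length(x$chromosome) == length(x$start),
    length(x$start) == length(x$end),
    all(x$start <= x$end)
  )
  invisible(TRUE)
}

my_genomic_range <- list(
  chromosome = c("X", "Y"),
  start = c(13000000L, 13000000L),
  end = c(15000000L, 15000000L))

# Passes silently; an invalid list would stop with an error instead.
check_genomic_range(my_genomic_range)
```

If the check passes, the list can be handed to get_variants(genomic_range = my_genomic_range) as in the examples above.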
To search for variants within a cytogenetic band you can use the parameter
cytogenetic_band of get_variants().
Here is an example, again for chromosome Y, using the cytogenetic band 'Yq11.221' as
query:
my_variants <- get_variants(cytogenetic_band = 'Yq11.221')
my_variants@variants[c('variant_id',
                       'chromosome_name',
                       'chromosome_position')]
#> # A tibble: 3 × 3
#> variant_id chromosome_name chromosome_position
#> <chr> <chr> <int>
#> 1 rs2115848 Y 14692972
#> 2 rs2032658 Y 13470103
#> 3 rs2032624 Y 12914512
How do you know which cytogenetic bands you can query? We provide a dataset (a data
frame) named cytogenetic_bands that you can use:
# ?cytogenetic_bands for more details.
cytogenetic_bands
#> # A tibble: 862 × 8
#> cytogenetic_band chromosome start end length assem…¹ stain last_…²
#> <chr> <chr> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 1p36.33 1 1 2300000 2.3e6 GRCh38 gneg Thu Ju…
#> 2 1p36.32 1 2300001 5300000 3 e6 GRCh38 gpos… Thu Ju…
#> 3 1p36.31 1 5300001 7100000 1.8e6 GRCh38 gneg Thu Ju…
#> 4 1p36.23 1 7100001 9100000 2 e6 GRCh38 gpos… Thu Ju…
#> 5 1p36.22 1 9100001 12500000 3.4e6 GRCh38 gneg Thu Ju…
#> 6 1p36.21 1 12500001 15900000 3.4e6 GRCh38 gpos… Thu Ju…
#> 7 1p36.13 1 15900001 20100000 4.2e6 GRCh38 gneg Thu Ju…
#> 8 1p36.12 1 20100001 23600000 3.5e6 GRCh38 gpos… Thu Ju…
#> 9 1p36.11 1 23600001 27600000 4 e6 GRCh38 gneg Thu Ju…
#> 10 1p35.3 1 27600001 29900000 2.3e6 GRCh38 gpos… Thu Ju…
#> # … with 852 more rows, and abbreviated variable names ¹assembly,
#> # ²last_download_date
#> # ℹ Use `print(n = ...)` to see more rows
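Since each band is described by a chromosome and a start and end position, the same table can also be used the other way round: to find the band that contains a given position. Here is a minimal sketch in base R. It runs on a toy copy of the first three rows printed above so that it works without the package, and band_at() is a hypothetical helper, not a gwasrapidd function.

```r
# Toy copy of the first three rows printed above; the real dataset ships
# with gwasrapidd as `cytogenetic_bands`.
bands <- data.frame(
  cytogenetic_band = c("1p36.33", "1p36.32", "1p36.31"),
  chromosome = c("1", "1", "1"),
  start = c(1L, 2300001L, 5300001L),
  end = c(2300000L, 5300000L, 7100000L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: name(s) of the band(s) containing a position.
band_at <- function(bands, chromosome, position) {
  hit <- bands$chromosome == chromosome &
    bands$start <= position &
    position <= bands$end
  bands$cytogenetic_band[hit]
}

band_at(bands, "1", 6000000L)
#> [1] "1p36.31"
```

The returned band name can then be used as the cytogenetic_band query of get_variants().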
Say you want to search for all variants in the shorter arm (p) of chromosome 21. You
can take advantage of the cytogenetic_bands dataset to get all the corresponding
cytogenetic band names:
# Install package dplyr if you do not have it.
chr21_p_bands <- dplyr::filter(cytogenetic_bands, grepl('^21p', cytogenetic_band)) %>%
  dplyr::pull(cytogenetic_band)
chr21_p_bands
#> [1] "21p13" "21p12" "21p11.2" "21p11.1"
Now search by cytogenetic_band:
my_variants <- get_variants(cytogenetic_band = chr21_p_bands)
my_variants@variants[c('variant_id',
                       'chromosome_name',
                       'chromosome_position',
                       'chromosome_region')]
#> # A tibble: 4 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs552964654 21 9053097 21p11.2
#> 2 rs240444 21 10510446 21p11.2
#> 3 rs10439884 21 10540506 21p11.2
#> 4 rs75275574 21 9040338 21p11.2
You can get the total length of a chromosome by using the provided dataset
cytogenetic_bands. Here is an example for chromosome 15:
# Install dplyr first.
dplyr::filter(cytogenetic_bands, chromosome == '15') %>%
  dplyr::summarise(chromosome = dplyr::first(chromosome),
                   start = min(start),
                   end = max(end)
  )
#> # A tibble: 1 × 3
#> chromosome start end
#> <chr> <int> <int>
#> 1 15 1 101991189
Currently, {gwasrapidd} has no built-in way to tell which results came from which value
of a vectorised query. For example, if you search for variants by EFO identifier
(efo_id):
my_efo_ids <- c('EFO_0005543', 'EFO_0004762')
my_variants <- get_variants(efo_id = my_efo_ids)
my_variants@variants$variant_id
#> [1] "rs11706832" "rs498872" "rs72714270" "rs11599775" "rs12803321"
#> [6] "rs7125115" "rs10069690" "rs12752552" "rs648044" "rs111976262"
#> [11] "rs11598018" "rs4252707" "rs688755" "rs634537" "rs78378222"
#> [16] "rs4975538" "rs2297433" "rs55705857" "rs320337" "rs10852606"
#> [21] "rs2297440" "rs1938964" "rs1106639" "rs12076373" "rs3772190"
#> [26] "rs2235573" "rs10842893" "rs7572263" "rs2736100" "rs80351950"
#> [31] "rs5839764" "rs77633900" "rs3751667" "rs7107785" "rs72714236"
#> [36] "rs10131032" "rs4774756" "rs4977756" "rs9841110" "rs6010620"
#> [41] "rs11233250" "rs75061358" "rs10927065" "rs1275600" "rs11979158"
#> [46] "rs78355601" "rs10411345" "rs144085478" "rs12214617" "rs181216459"
#> [51] "rs118086804" "rs12456390" "rs4714729" "rs12199215" "rs2639990"
#> [56] "rs9472155" "rs144160960" "rs34528081" "rs4082730" "rs7763358"
#> [61] "rs55864163" "rs9787438" "rs150821445" "rs12445232" "rs6598475"
#> [66] "rs10934631" "rs6475938" "rs6479877" "rs6921438" "rs10886366"
#> [71] "rs10761741" "rs114694170" "rs77961527" "rs112215592" "rs117580153"
#> [76] "rs73872715" "rs13206436" "rs4782371" "rs4513773" "rs186725382"
#> [81] "rs139893147" "rs10153304" "rs34524635" "rs7043199" "rs187918360"
#> [86] "rs11639051" "rs1740073" "rs71779653" "rs34881325" "rs181558074"
#> [91] "rs2375981" "rs59706856" "rs2304058" "rs73418461" "rs10761750"
#> [96] "rs8045833" "rs10761731" "rs7030781" "rs180936035" "rs550057"
#> [101] "rs9332599" "rs75455100" "rs143479231" "rs10822155" "rs7767396"
#> [106] "rs191332118" "rs186066666" "rs11392719" "rs6993770" "rs61818787"
#> [111] "rs144820908" "rs61829244" "rs6722871" "rs10738760"
So it is not immediately obvious which variants resulted from the query 'EFO_0005543'
and which from 'EFO_0004762'.
A possible workaround is to make multiple independent queries and save your results in a list whose names are the respective queries:
# Install purrr first.
# Add names to my_efo_ids
names(my_efo_ids) <- my_efo_ids
my_variants <- purrr::map(my_efo_ids, ~ get_variants(efo_id = .x))
Now you can see which variants are associated with each EFO identifier.
For 'EFO_0005543' we got the following variants:
my_variants[['EFO_0005543']]@variants$variant_id
#> [1] "rs11706832" "rs498872" "rs72714270" "rs11599775" "rs12803321"
#> [6] "rs7125115" "rs10069690" "rs12752552" "rs648044" "rs111976262"
#> [11] "rs11598018" "rs4252707" "rs688755" "rs634537" "rs78378222"
#> [16] "rs4975538" "rs2297433" "rs55705857" "rs320337" "rs10852606"
#> [21] "rs2297440" "rs1938964" "rs1106639" "rs12076373" "rs3772190"
#> [26] "rs2235573" "rs10842893" "rs7572263" "rs2736100" "rs80351950"
#> [31] "rs5839764" "rs77633900" "rs3751667" "rs7107785" "rs72714236"
#> [36] "rs10131032" "rs4774756" "rs4977756" "rs9841110" "rs6010620"
#> [41] "rs11233250" "rs75061358" "rs10927065" "rs1275600" "rs11979158"
And for 'EFO_0004762':
my_variants[['EFO_0004762']]@variants$variant_id
#> [1] "rs78355601" "rs10411345" "rs144085478" "rs12214617" "rs181216459"
#> [6] "rs118086804" "rs12456390" "rs4714729" "rs12199215" "rs2639990"
#> [11] "rs9472155" "rs144160960" "rs34528081" "rs4082730" "rs7763358"
#> [16] "rs55864163" "rs9787438" "rs150821445" "rs12445232" "rs6598475"
#> [21] "rs10934631" "rs6475938" "rs6479877" "rs6921438" "rs10886366"
#> [26] "rs10761741" "rs114694170" "rs77961527" "rs112215592" "rs117580153"
#> [31] "rs73872715" "rs13206436" "rs4782371" "rs4513773" "rs186725382"
#> [36] "rs139893147" "rs10153304" "rs34524635" "rs7043199" "rs187918360"
#> [41] "rs11639051" "rs1740073" "rs71779653" "rs34881325" "rs181558074"
#> [46] "rs2375981" "rs59706856" "rs2304058" "rs73418461" "rs10761750"
#> [51] "rs8045833" "rs10761731" "rs7030781" "rs180936035" "rs550057"
#> [56] "rs9332599" "rs75455100" "rs143479231" "rs10822155" "rs7767396"
#> [61] "rs191332118" "rs186066666" "rs11392719" "rs6993770" "rs61818787"
#> [66] "rs144820908" "rs61829244" "rs6722871" "rs10738760"
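Once the results are kept per query, they can be flattened into a single lookup table with one row per query and variant pair. The sketch below uses base R only and a small stand-in list holding the first three variant identifiers of each output above; with the real objects, such a list could be built by extracting @variants$variant_id from each element of the named list. The names ids_by_efo and lookup are illustrative.

```r
# Stand-in for the per-query results: the first three variant IDs of each
# output above, keyed by the EFO identifier that returned them.
ids_by_efo <- list(
  EFO_0005543 = c("rs11706832", "rs498872", "rs72714270"),
  EFO_0004762 = c("rs78355601", "rs10411345", "rs144085478")
)

# stack() turns a named list of vectors into a two-column data frame:
# the values, and the name of the list element each value came from.
lookup <- stack(ids_by_efo)
names(lookup) <- c("variant_id", "efo_id")
lookup$efo_id <- as.character(lookup$efo_id)

# Which query returned rs498872?
lookup$efo_id[lookup$variant_id == "rs498872"]
#> [1] "EFO_0005543"
```

A long table like this is convenient for joining the query value back onto other tables by variant_id.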
The four main retrieval functions get_studies(), get_associations(), get_variants()
and get_traits() all allow you to search by multiple criteria at once. You can then
combine results in an OR or AND fashion using the parameter set_operation.
Use set_operation = 'union' to combine results in an OR fashion:
my_variants_OR <- get_variants(
  efo_trait = 'triple-negative breast cancer',
  gene_name = 'MDM4',
  set_operation = 'union')
my_variants_OR@variants[c('variant_id',
                          'chromosome_name',
                          'chromosome_position',
                          'chromosome_region')]
#> # A tibble: 46 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs3747636 1 204434531 1q32.1
#> 2 rs10793765 1 204580247 1q32.1
#> 3 rs116661163 1 204641544 1q32.1
#> 4 rs12083887 1 118339066 1p12
#> 5 rs12143943 1 204602943 1q32.1
#> 6 rs1008833 1 204457167 1q32.1
#> 7 rs2137255 1 204457245 1q32.1
#> 8 rs2290854 1 204546897 1q32.1
#> 9 rs3789044 1 204619973 1q32.1
#> 10 rs12036042 1 204500186 1q32.1
#> # … with 36 more rows
#> # ℹ Use `print(n = ...)` to see more rows
The code above retrieves variants whose associated efo_trait is equal to
'triple-negative breast cancer' or that are associated with gene 'MDM4'.
Alternatively, we may use set_operation = 'intersection' to combine results in an AND
fashion:
my_variants_AND <- get_variants(
  efo_trait = 'triple-negative breast cancer',
  gene_name = 'MDM4',
  set_operation = 'intersection')
my_variants_AND@variants[c('variant_id',
                           'chromosome_name',
                           'chromosome_position',
                           'chromosome_region')]
#> # A tibble: 1 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs4245739 1 204549714 1q32.1
With set_operation = 'intersection', as in the code above, we get variants whose
associated efo_trait is equal to 'triple-negative breast cancer' and that are
associated with gene 'MDM4', i.e., only variants meeting both conditions simultaneously
are retrieved.
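As a mental model, each criterion can be thought of as producing its own set of identifiers, with set_operation deciding how those sets are merged. The following base R sketch illustrates that idea on placeholder identifiers. It is a conceptual illustration only and makes no claim about how gwasrapidd implements this internally; combine_ids() and the identifiers v1 to v4 are made up for the example.

```r
# Conceptual model only: merge per-criterion ID sets with a set operation.
# combine_ids() is a hypothetical function, not part of gwasrapidd.
combine_ids <- function(results, set_operation = c("union", "intersection")) {
  set_operation <- match.arg(set_operation)
  op <- if (set_operation == "union") base::union else base::intersect
  # Fold the chosen operation over all per-criterion sets.
  Reduce(op, results)
}

# Placeholder identifiers, one set per search criterion.
by_criterion <- list(
  efo_trait = c("v1", "v2", "v3"),
  gene_name = c("v3", "v4")
)

combine_ids(by_criterion, "union")         # OR:  "v1" "v2" "v3" "v4"
combine_ids(by_criterion, "intersection")  # AND: "v3"
```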
Please note that almost all search criteria to be used with the retrieval functions are vectorised, meaning that you can use multiple values with the same search criterion. In these cases results are always combined in an OR fashion.
In the following example, we use the gene name as the only search criterion. If we pass a vector of gene names, we get all variants that are associated with either gene (OR).
my_variants <- get_variants(gene_name = c('RNU6-367P', 'ABHD5'))
my_variants@variants[c('variant_id',
                       'chromosome_name',
                       'chromosome_position',
                       'chromosome_region')]
#> # A tibble: 28 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs7625896 3 44021069 3p21.33
#> 2 rs142242702 3 44122660 3p21.32
#> 3 rs1554654 3 44002852 3p21.33
#> 4 rs7619427 3 44035549 3p21.33
#> 5 rs11718455 3 44015406 3p21.33
#> 6 rs73076675 3 43924731 3p21.33
#> 7 rs113706999 3 44117664 3p21.32
#> 8 rs6441814 3 44007622 3p21.33
#> 9 rs4017425 3 43987272 3p21.33
#> 10 rs79644353 3 44135707 3p21.32
#> # … with 18 more rows
#> # ℹ Use `print(n = ...)` to see more rows
In this case we retrieved 28 variants. Please note that the set_operation parameter
does not affect this result. The set_operation parameter only controls the function
behaviour when combining results from different criteria, e.g., when using efo_trait
and gene_name.
To retrieve variants that are associated with both genes, RNU6-367P and ABHD5, you need
to run these queries separately and then intersect the results using the intersect()
function, i.e., combine them in an AND fashion. Here we start by retrieving variants
associated with gene RNU6-367P:
my_variants1 <- get_variants(gene_name = 'RNU6-367P')
my_variants1@variants[c('variant_id',
                        'chromosome_name',
                        'chromosome_position',
                        'chromosome_region')]
#> # A tibble: 14 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs7625896 3 44021069 3p21.33
#> 2 rs142242702 3 44122660 3p21.32
#> 3 rs1554654 3 44002852 3p21.33
#> 4 rs7619427 3 44035549 3p21.33
#> 5 rs11718455 3 44015406 3p21.33
#> 6 rs73076675 3 43924731 3p21.33
#> 7 rs113706999 3 44117664 3p21.32
#> 8 rs6441814 3 44007622 3p21.33
#> 9 rs4017425 3 43987272 3p21.33
#> 10 rs79644353 3 44135707 3p21.32
#> 11 rs7619544 3 43852053 3p21.33
#> 12 rs189710954 3 43800315 3p21.33
#> 13 rs35283240 3 43990836 3p21.33
#> 14 rs62251261 3 44106761 3p21.32
There are 14 variants associated with gene RNU6-367P. Now, for gene ABHD5:
my_variants2 <- get_variants(gene_name = 'ABHD5')
my_variants2@variants[c('variant_id',
                        'chromosome_name',
                        'chromosome_position',
                        'chromosome_region')]
#> # A tibble: 19 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs191867523 3 43774449 3p21.33
#> 2 rs115421670 3 43737217 3p21.33
#> 3 rs6772840 3 43812828 3p21.33
#> 4 rs11720728 3 43832420 3p21.33
#> 5 rs142404191 3 43641812 3p21.33
#> 6 rs192944990 3 43826178 3p21.33
#> 7 rs182923613 3 43791731 3p21.33
#> 8 rs73087085 3 43687796 3p21.33
#> 9 rs75594032 3 43744082 3p21.33
#> 10 rs1468602 3 43701242 3p21.33
#> 11 rs4082244 3 135010695 3q22.2
#> 12 rs740838 3 43722823 3p21.33
#> 13 rs73076675 3 43924731 3p21.33
#> 14 rs141365045 3 43691004 3p21.33
#> 15 rs4017425 3 43987272 3p21.33
#> 16 rs17075898 3 43710695 3p21.33
#> 17 rs7619544 3 43852053 3p21.33
#> 18 rs189710954 3 43800315 3p21.33
#> 19 rs35283240 3 43990836 3p21.33
There are 19 variants associated with gene ABHD5. To find the variants associated with
both genes, you can intersect the two variants objects using gwasrapidd::intersect():
variants_intersect <- gwasrapidd::intersect(my_variants1, my_variants2)
variants_intersect@variants[c('variant_id',
                              'chromosome_name',
                              'chromosome_position',
                              'chromosome_region')]
#> # A tibble: 5 × 4
#> variant_id chromosome_name chromosome_position chromosome_region
#> <chr> <chr> <int> <chr>
#> 1 rs73076675 3 43924731 3p21.33
#> 2 rs4017425 3 43987272 3p21.33
#> 3 rs7619544 3 43852053 3p21.33
#> 4 rs189710954 3 43800315 3p21.33
#> 5 rs35283240 3 43990836 3p21.33
So only 5 variants are associated with both genes RNU6-367P and ABHD5.
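The counts reported in this section can be cross-checked offline with plain set operations on the variant identifiers. The sketch below copies the 14 and 19 identifiers from the two outputs above into character vectors and uses base::intersect() and base::union(); it does not query the GWAS Catalog, and the vector names are illustrative.

```r
# Variant IDs copied from the RNU6-367P output above (14 variants).
rnu6_367p <- c(
  "rs7625896", "rs142242702", "rs1554654", "rs7619427", "rs11718455",
  "rs73076675", "rs113706999", "rs6441814", "rs4017425", "rs79644353",
  "rs7619544", "rs189710954", "rs35283240", "rs62251261")

# Variant IDs copied from the ABHD5 output above (19 variants).
abhd5 <- c(
  "rs191867523", "rs115421670", "rs6772840", "rs11720728", "rs142404191",
  "rs192944990", "rs182923613", "rs73087085", "rs75594032", "rs1468602",
  "rs4082244", "rs740838", "rs73076675", "rs141365045", "rs4017425",
  "rs17075898", "rs7619544", "rs189710954", "rs35283240")

both <- base::intersect(rnu6_367p, abhd5)  # AND: variants shared by both genes
either <- base::union(rnu6_367p, abhd5)    # OR: variants linked to either gene

length(both)    # 5, the size of the intersection shown above
length(either)  # 28, the size of the vectorised gene_name query result
```

The two lengths agree with the 5 variants returned by gwasrapidd::intersect() and the 28 variants returned when both gene names were passed in a single vector.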