Find all SNPs in linkage disequilibrium
1
0
Entering edit mode
8.8 years ago
johanne.hh • 0

I am trying to expand a set of SNPs (identified by their rsIDs), to include all SNPs in linkage disequilibrium (LD) with the input snps.

I am able to run the script of LD computation from this tutorial as well as the script on this page.

I therefore assume I have installed ensembl correctly on my computer.

I used this as a starting point for my code of getting all SNPs in LD. The code I am now testing is the following. It is run with an input file of SNP rsIDs.

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
    );

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation' );
$variation_adaptor->db->use_vcf(1); # To get 1000G phase 3 data also

while (<>) {
    chomp; # Remove \n from input file line names
    my $variation = $variation_adaptor->fetch_by_name($_);
    print $variation->stable_id(), "\n";

    my @vfs = @{ $variation->get_all_VariationFeatures() };

    foreach my $vf (@vfs){
        print "get ld values\n";
        my $ld = $vf->get_all_LD_values();
        print "get ld variations\n";
        my @ldvs = @{ $ld->get_variations() };

        print "for each ld variation\n";
        foreach my $ldv (@ldvs) {
            print $ldv->stable_id();
        }
    }   
}

But when I try to run it, I get the following printout. It seems the failing line code is my $ld = $vf->get_all_LD_values, based on the printouts preceding the error:

$ perl ldCalculation.pl ldSNPs.txt
rs157580
get ld values
Use of uninitialized value $gt[1] in hash element at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm line 716
.
.
Use of uninitialized value in string ne at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm line 617.
.
.
Number of genotypes supported by the program (500) exceeded
.
.
Can't call method "get_all_VariationFeatures" on an undefined value at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm line 870, <OUT> line 29120.

. in the terminal means the line above is repeated multiple times.

Could you please help me with running the script properly? I am fairly new to perl and the ENSEMBL api, so I might have done some basic coding errors (did try calling the variation functions with and without () with same result).

Also, I have rsIDs in sorted order right now. Would the script also work with rsIDs in an arbitrary order?

Alternative ways of expanding the original set with SNPs in LD are also of interest! Some of the api calls seem extremely slow, and the original SNP set might be quite large, 20k+ for instance.

ensembl snp ld SNP • 3.0k views
ADD COMMENT
0
Entering edit mode

you could also convert to plink and run ld pruning with plink 1.9

ADD REPLY
0
Entering edit mode
8.8 years ago
johanne.hh • 0

After mailing with the ensembl dev team, I was able to resolve this issue. In case any other experience similar problems, the help I got from Anja Thormann is post below. The correspondence will probably also be available on their mailing list:

A bug caused these errors of this format:

Use of uninitialized value $gt[1] in hash element at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm line 716.

Use of uninitialized value in string ne at /Users/Johanne/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/LDFeatureContainerAdaptor.pm line 617.

They pushed a bug fix to the ensembl-variation release/83 branch, and pulling the changes resolved the issue.

She also tipped me to include failed variants. This is similar to setting the use_vcf flag:

$variation_adaptor->db->include_failed_variations(1)

Regarding the original code that failed, as posted above, I got this feedback from Thormann:

I would recommend using the LDFeatureContainerAdaptor. I have written a small script to show you how to use the adaptor. In order to avoid exceeding the number of genotypes as is printed in the error message, you should define a population and variation feature for which you want to compute LD data. I print all the populations for which we can compute LD data in the beginning of the script. We have been working on speeding up our LD computation. The improvements are going into the next release/84 which will be out in March.

..

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation' );
my $ldfc_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'ldfeaturecontainer');
my $population_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'population');
$variation_adaptor->db->use_vcf(1); # To get 1000G phase 3 data also

my $ld_populations = $population_adaptor->fetch_all_LD_Populations();
foreach my $ld_population (@$ld_populations) {
  print $ld_population->name, "\n";
}

my $variation_name = 'rs157580';
my $variation = $variation_adaptor->fetch_by_name($variation_name);
my @vfs = @{ $variation->get_all_VariationFeatures() };

foreach my $vf (@vfs) {
  foreach my $ld_population (@$ld_populations) {
    print $ld_population->name, "\n";
    my $ldfc = $ldfc_adaptor->fetch_by_VariationFeature($vf, $ld_population);
    foreach my $ld_hash (@{$ldfc->get_all_ld_values}) {
      my $d_prime = $ld_hash->{d_prime};
      my $r2 = $ld_hash->{r2};
      my $variation_name1 = $ld_hash->{variation1}->variation_name;
      my $variation_name2 = $ld_hash->{variation2}->variation_name;
      print "$variation_name1 $variation_name2 d_prime=$d_prime r2=$r2\n";
    }
  }
}
ADD COMMENT

Login before adding your answer.

Traffic: 1375 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6