Question

Problem With Ensembl Variant Effect Predictor Stand Alone Perl Tool

1

14.0 years ago

Nasir ▴ 20

Hi All

I would be grateful for your help with this problem.

I am annotating SNPs in VCF files from the 1000 Genomes Project using the Ensembl Variant Effect Predictor standalone Perl script, variant_effect_predictor.pl. Sometimes I get the correct output file, but at other times I run into the following problems: (1) each output file takes a long time to generate; (2) not all variants are annotated, and some SNPs are missing from the output file; and (3) sometimes I get no output file at all, only the following error:

$ perl variant_effect_predictor.pl -i ABCA12.vcf -format vcf -hgnc -sift b -polyphen b -condel b -o ABCA12phase.vep

Could not connect to database homo_sapiens_core_62_37g as user anonymous using [DBI:mysql:database=homo_sapiens_core_62_37g;host=ensembldb.ensembl.org;port=5306] as a locator: Lost connection to MySQL server at 'reading initial communication packet', system error: 0 at /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/DBConnection.pm line 290, <GEN0> line 186.

-------------------- EXCEPTION --------------------
MSG: Could not connect to database homo_sapiens_core_62_37g as user anonymous using [DBI:mysql:database=homo_sapiens_core_62_37g;host=ensembldb.ensembl.org;port=5306] as a locator: Lost connection to MySQL server at 'reading initial communication packet', system error: 0
STACK Bio::EnsEMBL::DBSQL::DBConnection::connect /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/DBConnection.pm:299
STACK Bio::EnsEMBL::DBSQL::DBConnection::db_handle /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/DBConnection.pm:618
STACK Bio::EnsEMBL::DBSQL::DBConnection::prepare /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/DBConnection.pm:647
STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:509
STACK Bio::EnsEMBL::DBSQL::BaseFeatureAdaptor::_slice_fetch /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/BaseFeatureAdaptor.pm:495
STACK Bio::EnsEMBL::DBSQL::BaseFeatureAdaptor::fetch_all_by_Slice_constraint /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/BaseFeatureAdaptor.pm:316
STACK Bio::EnsEMBL::DBSQL::TranscriptAdaptor::fetch_all_by_Slice /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/DBSQL/TranscriptAdaptor.pm:372
STACK Bio::EnsEMBL::Slice::get_all_Transcripts /usr/local/lib/perl/5.10.1/Bio/EnsEMBL/Slice.pm:2398
STACK Bio::EnsEMBL::Variation::VariationFeature::get_all_TranscriptVariations /usr/local/share/perl/5.10.1/Bio/EnsEMBL/Variation/VariationFeature.pm:382
STACK main::print_consequences variant_effect_predictor.pl:233
STACK main::main variant_effect_predictor.pl:205
STACK toplevel variant_effect_predictor.pl:44
Ensembl API version = 62

I am not able to decipher this error message and would be grateful for suggestions on how to deal with the above problems.

ensembl variant • 8.7k views

ADD COMMENT • link updated 14.0 years ago by Pi ▴ 520 • written 14.0 years ago by Nasir ▴ 20

0

Are you working behind a firewall?

ADD REPLY • link 14.0 years ago by Pierre Lindenbaum 166k

0

In case you are interested, I've already annotated that using my own tool (SnpEff: http://snpeff.sourceforge.net/). The process takes 20 minutes or so.

Here are the results: http://www.mcb.mcgill.ca/~pcingola/1k_genomes/1000_Genomes_snpEff.txt.gz

ADD REPLY • link 14.0 years ago by Pablo ★ 1.9k

0

And here is the summary page http://www.mcb.mcgill.ca/~pcingola/1k_genomes/1000_Genomes_snpEff_summary.html

ADD REPLY • link 14.0 years ago by Pablo ★ 1.9k

0

Thank you, Pablo. In fact, I moved to your very useful tool (snpEff) because of the slow progress I was making with the Variant Effect Predictor.

ADD REPLY • link 14.0 years ago by Nasir ▴ 20

0

Great, send me an email if you have any questions.

ADD REPLY • link 14.0 years ago by Pablo ★ 1.9k

score 4 · Answer 1 · 2011-04-22

You are querying a remote server, and each query carries a lot of overhead, so it is slow. You have several options:

a. Replicate the Ensembl MySQL database locally and query your local server.

b. Split your SNP list and query the server in parallel. This increases the load on the server, but it has worked well for me and I have not seen performance penalties with this approach.

c. Try another tool that downloads the Ensembl databases locally and builds a data structure in memory. ANNOVAR is one option; it can annotate millions of SNPs in less than an hour on a regular machine.
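For option (b), the splitting step could look roughly like the sketch below. It is only an illustration: the function name `split_vcf`, the chunk size, and the output file naming are assumptions made here, not part of any Ensembl tool. It copies the VCF header lines (those starting with `#`) into every chunk so that each piece is still a valid input file.

```python
from pathlib import Path


def split_vcf(vcf_path, out_dir, records_per_chunk=500):
    """Split a VCF into chunk files, repeating the header in each chunk.

    Hypothetical helper: the name, chunk size and file naming scheme are
    illustrative choices. Header lines are assumed to precede all records.
    """
    vcf_path = Path(vcf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    header = []   # lines starting with '#'
    chunk = []    # records collected for the chunk being built
    paths = []    # chunk files written so far

    def flush():
        # Write header + current records to the next numbered chunk file.
        path = out_dir / f"{vcf_path.stem}.part{len(paths) + 1:03d}.vcf"
        path.write_text("".join(header + chunk))
        paths.append(path)
        chunk.clear()

    with vcf_path.open() as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line)
            elif line.strip():
                chunk.append(line)
                if len(chunk) == records_per_chunk:
                    flush()

    if chunk:  # write the final, possibly shorter, chunk
        flush()
    return paths
```

Each returned path can then be passed to the script with `-i` in a separate process. Keep the number of simultaneous jobs small, since every job opens its own connections to the public server.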

Did you notice that there is a limit on the number of SNPs you can send to Ensembl? The limit is 1000 SNPs. Could that be what is causing the problem?
From the error message, it seems that the socket linking your local machine to the server is broken. The next time that happens, try connecting with the mysql client to see whether you get any extra information in the error message that may help you troubleshoot the problem.
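Before reaching for the mysql client, a quick reachability check can rule out a firewall or a dropped route. The sketch below is a minimal illustration using only the Python standard library; the function name `can_connect` is made up here, and the host and port in the usage comment are the ones shown in the error message. It only tests that a TCP connection can be opened, so a `True` result does not prove that the MySQL handshake will succeed.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it checks reachability only and does not speak
    the MySQL protocol, so it cannot detect server-side handshake failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example (host and port taken from the error message above):
#   can_connect("ensembldb.ensembl.org", 5306)
```

If this returns `False` from your machine but `True` from a machine outside your network, a firewall is the likely cause.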

Ram · Answer 2 · 2011-04-28

Hello,

I just wanted to add that the 1000-variant limit applies only to the online version, not to the downloaded script.

If you have a very large amount of data, you can also try running the script in whole-genome mode - please refer to the README file that comes with the script for guidance before doing this. ftp://ftp.ensembl.org/pub/misc-scripts/Variant_effect_predictor_2.0/

Ram · Answer 3 · 2011-04-28

3

14.0 years ago

Willm ▴ 30

Hello,

1) As Fiona stated, you can try using whole-genome mode (add the -w flag to your command line). You should ensure that the data you have is suitable - the file should be ordered by chromosome and position, and ideally should represent a contiguous region (e.g. a gene, set of genes or a whole chromosome). You can refer to the README for more information about this.

2) If SNPs are missing from the output it means that they do not overlap or fall near any Ensembl-annotated transcripts - you can consider them to be intergenic with no predicted consequence.

3) As drio stated, you are querying a remote database, so connection issues can and will occasionally occur. To eliminate these, consider setting up a local copy of the human core Ensembl database.
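The ordering requirement in point 1 can be checked before a long run. The sketch below is one possible reading of "ordered by chromosome and position": records for each chromosome form a single block, and positions never decrease within a block. That interpretation and the function name `first_unsorted_record` are assumptions made for illustration; the README remains the authority on what whole-genome mode expects.

```python
def first_unsorted_record(lines):
    """Return (line_number, line) for the first out-of-order record, else None.

    Assumed ordering rule (an interpretation, not taken from the README):
    each chromosome appears in one contiguous block, and positions are
    non-decreasing within that block. Header and blank lines are skipped.
    """
    seen_chroms = set()
    prev_chrom = None
    prev_pos = None

    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split(None, 2)  # CHROM and POS are the first two columns
        chrom, pos = fields[0], int(fields[1])

        if chrom != prev_chrom:
            if chrom in seen_chroms:
                # This chromosome already had a block earlier in the file.
                return number, line.rstrip("\n")
            seen_chroms.add(chrom)
            prev_chrom = chrom
        elif pos < prev_pos:
            # Position went backwards within the same chromosome.
            return number, line.rstrip("\n")
        prev_pos = pos

    return None
```

It accepts any iterable of lines, so an open file handle works: `first_unsorted_record(open("ABCA12.vcf"))` returns `None` when the file passes the check.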

ADD COMMENT • link 14.0 years ago by Willm ▴ 30

0

Hello,

My dataset contains 5 SNVs, but only 3 of them have been annotated by the Variant Effect Predictor. One of the two unannotated positions is a known SNP, but the script, run with default parameters, does not return any result for it. Here is one line:

20      57206550        .       G       A       30.88   PASS    AC=2;AF=1.00;AN=2;DP=3;Dels=0.00;HRun=3;HaplotypeScore=0.0000;MQ=35.51;MQ0=0;QD=10.29;SB=-0.01;sumGLbyD=21.03   GT:AD:DP:GQ:PL  1/1:0,3:2:6.01:62,6,0

Is it possible to obtain at least the dbSNP ID for this type of variant using the Ensembl APIs?

Best regards,
S.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by User 6048 • 0

score 0 · Answer 4 · 2011-04-28

Is there an error in the parse_line function?

pileup: chr1 60 T A

    if(
        ($config->{input_format} =~ /pileup/i) ||
        (
            $data[0] =~ /(chr)?\w+/ &&
            $data[1] =~ /\d+/ &&
            $data[2] =~ /^[ACGTN-]+$/ &&
            $data[3] =~ /^[ACGTNRYSWKM*+\/-]+$/
        )
    ) {
        my @return = ();

        if($data[2] ne "*"){
            my $var;

            if($data[2] =~ /^[A|C|G|T]$/) {    # index in question
                $var = $data[2];               # index in question
            }
            else {
                ($var = unambiguity_code($data[3])) =~ s/$data[2]//ig;

Shouldn't the two marked occurrences of $data[2] be $data[3], which contains the alternate allele (genotype)?
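To make the expected behaviour concrete, here is a small sketch of the logic as I would expect it to work. It is not the script's actual code: it assumes, as in the example line above, that the third column is the reference base and the fourth is the consensus genotype, and the name `pileup_variant_allele` is made up for this illustration. The ambiguity table covers only the two-base IUPAC codes accepted by the regular expression in the excerpt.

```python
# Two-base IUPAC ambiguity codes (standard nucleotide nomenclature).
IUPAC_TWO_BASE = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def pileup_variant_allele(ref, genotype):
    """Return the non-reference allele(s) for a pileup SNP line.

    Hypothetical helper: assumes 'ref' is the reference base (column 3)
    and 'genotype' is the consensus call (column 4).
    """
    ref = ref.upper()
    genotype = genotype.upper()

    # Unambiguous call: the genotype column itself is the variant allele.
    if genotype in ("A", "C", "G", "T"):
        return genotype

    # Ambiguous call: expand the code, then drop the reference base.
    bases = IUPAC_TWO_BASE.get(genotype)
    if bases is None:
        raise ValueError(f"unsupported genotype code: {genotype!r}")
    return bases.replace(ref, "")
```

For the line `chr1 60 T A`, `pileup_variant_allele("T", "A")` gives `A`. Testing and assigning the reference column instead would yield `T`, which is the reference base and not a variant.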