Question

Dealing with adjacent SNPs in VCF file

0

Entering edit mode

7.6 years ago

wpierrick ▴ 90

Hi everyone,

I have VCF files containing SNPs called on different samples from mRNA sequencing data. There's usually the SNP position, the reference and alternative allele and, among other things, the effect of the mutation on the corresponding amino acid.

Here's the thing. If 2 SNPs are located next (1 or 2 bp apart) to each other, the caller doesn't necessarily output the right a.a as it considers only one SNP at a time. That might be right for a sample showing only one of the two mutations, but if a sample have 2 (or 3 actually) adjacent mutations, it can lead to false results in the predicted amino acid.

I am developing a script to tackle this issue but was wondering two things:

Is there any script / software already written to solve this problem?
If not, I will continue with my in-house script. Another problem arise: I have the position of the two SNPs, but have to find the 3rd base on the codon or the nearby ones (depending on the reading frame, it can cover only 1 or 2 codons). I can't really look it directly on the reference fasta file as I won't know the reading frame. So I have to use the GFF, to get the transcript, account for all the introns/exons lengths and then calculate the real reading frame to then get the actual missing bases and finally check for any new amino acid outputted considering the 2 adjacent SNPs. It's quite complicated, especially as my programming skills aren't great. Any easier method you could think of?

It would be easier to just manually check all the close SNPs with a genome browser and a codon table, but the whole point of that is to automate things a bit (when you have dozens of pairs it can get tedious).

Thanks a lot for your help!

SNP • 3.3k views

ADD COMMENT • link updated 4.4 years ago by Biostar 20 • written 7.6 years ago by wpierrick ▴ 90

1

Entering edit mode

You could have a look at GATK ReadBackedPhasing tool and the --maxGenomicDistanceForMNP argument https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php

I'd better state that you'll need the original Bam as well.

ADD REPLY • link 7.6 years ago by Garan ▴ 690

0

Entering edit mode

Thanks! I don't have access to GATK just yet, but I'll have a look if I can get it. It seems very promising indeed. Cheers,

ADD REPLY • link 7.6 years ago by wpierrick ▴ 90

score 1 · Answer 1 · 2017-05-05

1

Entering edit mode

7.6 years ago

Pierre Lindenbaum 164k

Is there any script / software already written to solve this problem?

yes

CooVar: Co-occurring variant analyzer

https://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-5-615

. Currently available programs for variant annotation depend on external databases or annotate multiple variants affecting the same transcript independently, which limits program use to organisms available in these databases or results in potentially incorrect or incomplete annotations.

and I've written one:

https://github.com/lindenb/jvarkit/wiki/VCFCombineTwoSnvs

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks a lot ! This is great.

I've toyed a bit with Coovar and it seems really complete. I just need to sort out the multiple output files and see how double mutations affecting a codon are shown.

I'll try your script later today. If I understand correctly, the command line should look like this: -k myVCFfile -tmpdir tmp/ -R myfastafile.fa -B theBAMfiles.bam The script doesn't need the informations stored in the GTF/GFF ? How does it deals with multiple samples on a VCF file ? My VCF file is like that :

chr1_17928_SNP  Gcin01g04980    NON_SYNONYMOUS  NON_SYNONYMOUS[T](gene:Bcin01g04980|transcript:Bcin01g04980.1|P->S:225) C   T   C/C C/C C/C C/C C/C C/T 234 233 232 219 233 221 234 233 232 219 233 23  0   0   0   0   0   198     
hr1_1792869_SNP Gcin01g04980    NON_SYNONYMOUS  NON_SYNONYMOUS[T](gene:Bcin01g04980|transcript:Bcin01g04980.1|P->L:225) C   T   C/C C/T C/T C/T C/T C/T 240 236 233 220 232 220 240 96  66  80  30  25  0   140 166 140 202 194

But I guess I will have to reorder/clean it. :)

ADD REPLY • link 7.6 years ago by wpierrick ▴ 90

0

Entering edit mode

should look like this: -k myVCFfile

no, option k is a knownGene file like http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz

The script doesn't need the informations stored in the GTF/GFF ?

no it uses a UCSC knownGene structure. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.sql and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz (much more simpe than GTF/GFF=

does it deals with multiple samples on a VCF file ?

as far as I remember , it just considers the REF/ALT columns, not the genotype. But you can also provide a BAM to get a count of reads carrying both mutations for each sample.

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks for your reply. So it only works with human right? I'm working on a fungi specie, I just realized I didn't mention it in my first post, my apologies.

I'll keep that tool when I'll be working with human though!

ADD REPLY • link 7.6 years ago by wpierrick ▴ 90