Convert indel list with [-/A] notation to VCF with adjacent base
3
1
Entering edit mode
5.9 years ago

I couldn't find a previous answer to this question, but have to believe something exists!

I'm trying to take a list of mutations that looks like this:

chr1    120995  120996  -/TAT
chr1    1090526 1090526 A/C
chr1    2856178 2856178 G/C
chr1    3975090 3975090 G/A
chr1    4420314 4420315 TA/-

And convert it to a bare-bones VCF so that I can run some annotation tools. Doing SNVs is easy enough - it's just shuffling some fields, adding a 0/1 GT column, etc.

Indels are more tricky, since proper VCF needs the adjacent base:

-/TAT   ->    A      ATAT
TA/-    ->    GTA    G

I'm sure I could whip something up hacky with samtools faidx, but someone has to have solved this problem already, right?

vcf faidx bed • 2.4k views
ADD COMMENT
0
Entering edit mode

Indels are more tricky, since proper VCF needs the adjacent base

I think bcftools norm should do the trick with something like (not tested at all):

awk 'code to shuffle columns into vcf' $infile \
| bcftools norm -c s -f ref.fa > outfile.vcf
ADD REPLY
0
Entering edit mode
5.9 years ago

One option seems to be converting to VEP format, with something like this:

cut -f 1-5 $YOURFILE | perl -nae 'if($F[3] eq "-"){$F[2]=$F[1];$F[1]=$F[1]+1;};print join("\t",(@F[0..2],$F[3] . "/" . $F[4])) . "\n"'

(assumes input is tsv like chr1 123 124 - AA)

Then annotating with VEP to get a nicely formed VCF. Still interested in other answers that don't require a full annotation step.

ADD COMMENT
0
Entering edit mode
5.9 years ago

My first thought was also bcftools norm like dariober suggested. But I couldn't make it work, to handle - or . as REF/ALT.

So here is a solution with awk, samtools faidx and parallel.

First prepare the header data for a valid vcf file:

$ echo "##fileformat=VCFv4.2" > output.vcf  
$ awk '{print "##contig=<ID="$1",length="$2">" >> "output.vcf"}' genome.fa.fai
$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"  >> output.vcf

Now let's do some magic with awk, samtools faidx and parallel. What has to be done, is to get the reference base before the position reported in your file, to convert -/TAT to A/ATAT or TA/- to GTA/G

$ awk -v OFS="\t" '{print $1, $2-1, $2, $4}' mutations.txt | parallel --colsep "\t" 'samtools faidx genome.fa {1}:{2}-{2} | awk -v OFS="\t" -v chr={1} -v start0={2} -v start1={3} -v genotype={4} -f makevcf.awk'

The makevcf.awk looks like this:

NR==2 { 
    split(genotype, gt, "/");
    if (gt[1] == "-") {
        print chr, start0, ".", $1, $1gt[2], ".", ".", ".";
    } else if (gt[2] == "-") {
        print chr, start0, ".", $1gt[1], $1, ".", ".", ".";
    } else {
        print chr, start1, ".", gt[1], gt[2], ".", ".", ".";
    }
}

I would recommend to run bcftools sortand bcftools norm on the new vcf file:

$ bcftools sort output.vcf|bcftools norm -f genome.fa > final.vcf

fin swimmer

EDIT: I assume that the first value of the given genotype is REF and the second ALT. But is this correct?

ADD COMMENT
0
Entering edit mode
5.6 years ago
or.yaacov • 0

Hi, I couldn't find one so I wrote a simple R script that does exactly that- converting from a (A -> "-") notation to proper VCF notation (GA -> G) with a leading base:

https://github.com/orr161/VCF-indel-converter (available as a notebook w instructions too)

Basically, using BSgenome (of Bioconductor) to create a function that can fetch the leading base/sequence from hg19 (can be changed):

findbase <- function(chr, pos, len=0) { letter <- toString(Hsapiens[[chr]][pos:(pos+len)]) return(letter) }

and then iterating over the file as a dataframe.

Or.

ADD COMMENT

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6