vcf to bam
5
1
Entering edit mode
9.3 years ago

If I have a .vcf file, how can I convert it to .bam file?

vcf bam • 16k views
ADD COMMENT
1
Entering edit mode

In general, VCF files are derived from bam files. It would be more helpful if you can put the context of your question here. For eg. why you need to convert vcf to bam?

ADD REPLY
0
Entering edit mode

to use it in ANGSD for Tajima's D analysis. I've collected my data from 1000 genome for a specific population

ADD REPLY
0
Entering edit mode

Do you have bedtools? Also, what format is your vcf?

ADD REPLY
0
Entering edit mode

VCF=variant call format

ADD REPLY
0
Entering edit mode

yes, I do. my .vcf is in ##fileformat=VCFv4.1

ADD REPLY
0
Entering edit mode

I also tried it with bedtools. I first converted my .vcf file to .bed file using plink. Then I used bedToBam to convert it to .bam file. But ANGSD couldn't read the bam file. Even I couldn't view it with samtools. Its really depressing!

ADD REPLY
0
Entering edit mode

Thank you to all. I want to use the derived .bam file in ANGSD for Tajima's D analysis.

ADD REPLY
1
Entering edit mode

vcftools has a --TajimaD <integer> option, maybe it is what you want?

ADD REPLY
0
Entering edit mode

what if I want to perform PBS (population branch statistics). Is there any software that can take vcf file?

ADD REPLY
2
Entering edit mode

To keep the forum tidy, open a new question instead of using the space reserved for answers to your original question.

ADD REPLY
5
Entering edit mode
9.3 years ago

EDIT: found that one in GATK3 https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_simulatereads_SimulateReadsForVariants.php ( now https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/simulatereads/SimulateReadsForVariants.java )

Given a set of variants, this tool will generate simulated reads that support the input variants.

I wrote an experimental tool to create a BAM from a VCF: http://lindenb.github.io/jvarkit/VcfToBam.html

$ java -jar dist/vcf2bam.jar -R ref.fa  samtools.vcf.gz 2> /dev/null | grep -v "100="

@HD VN:1.5  SO:unsorted
@SQ SN:seg1 LN:5101
@SQ SN:seg2 LN:4000
@RG ID:NA12878  SM:NA12878  LB:illumina DS:NA12878
@RG ID:NA12891  SM:NA12891  LB:illumina DS:NA12891
@RG ID:NA12892  SM:NA12892  LB:illumina DS:NA12892
@CO Generated with -R /home/lindenb/src/ngsxml/test/ref/ref.fa /home/lindenb/src/ngsxml/OUT/Projects/Proj1/VCF/samtools/Proj1.samtools.vcf.gz
0000003609  83  seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003610  147 seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003611  147 seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003612  83  seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003615  83  seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003616  147 seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003617  147 seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003627  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003629  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003630  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003633  147 seg1    1453    60  95=1X4= =   953 -500    TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003635  83  seg1    1453    60  95=1X4= =   953 -500    TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003640  147 seg1    1454    60  94=1X5= =   954 -500    TAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
ADD COMMENT
0
Entering edit mode

Pierre Lindenbaum do you know how to run the GATK tool you mentiond? I tried to download several legacy versions of GATK3 and also one of GATK2 but none of them seem to have the mentioned tool. The GATK tool seems to be a little bit more sophisticated since it takes a user-defined read-error into account. But correct me if I'm wrong here and every other tool posted in this thread does this as well. Furthermore it seems that none of the tools takes care of mating reads.

ADD REPLY
5
Entering edit mode
9.3 years ago
John 13k

I have also written a tool to go from VCF to BAM, called VCF2BIN.py

#!/usr/bin/env python
import re
import sys
import pysam
import string
import random
exit()

inputVCF = sys.argv[-1]
chromosomes = []
with open(inputVCF, 'rb') as f:
    for row in f:
        if row[0] != '#': break
        if row[2:8] == 'contig':
            for datum in row[10:-2].split(','):
                if datum.split('=')[0] == 'ID': ID = datum.split('=')[1]
                if datum.split('=')[0] == 'length': length = int(datum.split('=')[1])
            chromosomes.append( {'LN':length, 'SN':ID} )
header = { 'HD': {'VN': '1.0'}, 'SQ': chromosomes }
outfile = pysam.AlignmentFile('./useless.bam', "wh", header=header)
for x in range(0,10000):
    chromosome = random.choice(chromosomes)
    readLen = random.randrange(40,80)
    a = pysam.AlignedSegment()
    a.query_name = "read_" + ''.join(random.choice(string.digits) for blah in range(10))
    a.query_sequence=''.join(random.choice(['A','C','G','T']) for blah in range(readLen))
    a.flag = random.randrange(1,4096)
    a.reference_id = chromosomes.index(chromosome)
    a.reference_start = random.randrange(1,chromosome['LN']-readLen)
    a.mapping_quality = 20
    a.cigar = ((0,readLen),)
    a.next_reference_id = a.reference_id
    a.next_reference_start=777
    a.template_length=readLen
    a.query_qualities = pysam.fromQualityString("I"*readLen)
    a.tags = (("NM", 1),("RG", "L1"))
    outfile.write(a)
outfile.close()

Usage:

> VCF2BIN ./path/to/vcf.vcf

Example Output:

@HD     VN:1.0
@SQ     SN:chr1 LN:199141577
@SQ     SN:chr1_GL456210_random LN:172527
@SQ     SN:chr1_GL456211_random LN:235147
@SQ     SN:chr1_GL456212_random LN:152998
@SQ     SN:chr1_GL456221_random LN:209299
@SQ     SN:chr2 LN:184859577
@SQ     SN:chr3 LN:162759764
@SQ     SN:chr4 LN:159434297
...
@SQ     SN:chrUn_GL456396       LN:40612
@SQ     SN:chrUn_JH584304       LN:71575
@SQ     SN:chrX LN:176147421
@SQ     SN:chrX_GL456233_random LN:366350
@SQ     SN:chrY LN:91586428
@SQ     SN:chrY_JH584300_random LN:65102
@SQ     SN:chrY_JH584301_random LN:80372
@SQ     SN:chrY_JH584302_random LN:150326
read_9201514959 1567    chr4_GL456350_random    66052   20      62M     =       778     62      TCGCTCNCNACGAGTATCTCGAGNCGGNGNGATAAAGAATTNNCGGTCCNCTTGCCANCGTN  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:1  RG:Z:L1
read_9162951900 1957    chrY_JH584302_random    5022    20      73M     =       778     73      GGNCANAACCTTCTNCACTGNTCTAATCGCNCGTANNTTACCTTTTCNCTAAGTANATNCTNTGGCTTCACTA       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_0771262576 1928    chr9    41692370        20      47M     =       778     47      TNTCNANTTNTGGTCCNCCGTACAANTGNACAANNNCANTAATNAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1  RG:Z:L1
read_2008154340 2796    chrUn_GL456383  42319   20      41M     =       778     41      CNTGTNCCGACTCATGCAGGNGATNNTGTATGGGAGCTNCG       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_5327352072 2985    chrUn_GL456367  11976   20      48M     =       778     48      CNCACNGCTTNANACTGATCGNNCNCNNNNAGNACAATTNGANAGTNC        IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        NM:i:1  RG:Z:L1
read_2515573039 3633    chr4_JH584295_random    5998    20      74M     =       778     74      TNGTTTCTAGCGCNNTGAGNGAGNNGACATCANANNGCACGCTTGNTAGACANCNAGTGGTGGCCTGGACTCTC      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII      NM:i:1  RG:Z:L1
read_9171559801 783     chrUn_GL456389  17972   20      61M     =       778     61      CNGGGTNCCNTTAACCCTTNATTTNAANANCATNTTAAGNGGANCGTTCGTNTTTCNAACC   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII   NM:i:1  RG:Z:L1
read_5146518951 909     chr4_JH584295_random    9303    20      41M     =       778     41      GTACANGAGNNTCANNCTACNACNGAGATCCNGGCTTATNA       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_5977278670 83      chrUn_GL456385  43659   20      74M     =       778     74      CGNGGNCCCAAATCGTCANCAGCTGGAANGNACAAGNGGTGGANGGTTTGTTTNTGNTATNAAGNTATNGTGNC      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII      NM:i:1  RG:Z:L1
...

This tool is significantly faster than all other answers provided, because it uses aggressive subsampling of the VCF. In fact it only samples the header. This will not be addressed in the next version, but v2.0 will support parallelism on Casio graphical calculators, and the option to automatically write the output to /dev/null where it belongs.

ADD COMMENT
1
Entering edit mode

Finally all my Casio calculators will be put back into use! Thanks.

ADD REPLY
2
Entering edit mode
9.3 years ago
Shicheng Guo ★ 9.6k

It depend what you want to use the Bam for. VCF is derived from Bam while VCF lost some information of Bam which include 1) read size 2) read depth. There, if you want to do it, the bam will not contain these two important information. and such Bam can be called Vcf drove simulated Bam file.

ADD COMMENT
0
Entering edit mode

I want to use the derived .bam file in ANGSD for Tajima's D analysis.

ADD REPLY
1
Entering edit mode
9.3 years ago
bioguy24 ▴ 230

To get a vcf derived bam try:

bedToBam -ubam -g file.genome -i snp.vcf  > snp.bam
ADD COMMENT
0
Entering edit mode
9.3 years ago

Conceptually you can't. a .vcf file contains only variants, and a .bam file contains reads coming from a sequencer, usually aligned against a reference. If you want to get the corresponding .bam files you should be able to get them from the same source you got the .vcf files. If you still want to generate a .bam file containing with the reads that would contain the variants on that .vcf file you can use a few software solutions available, although it will depend a lot on what you are really wanting that .bam file to.

ADD COMMENT

Login before adding your answer.

Traffic: 3385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6