If I have a .vcf file, how can I convert it to .bam file?
If I have a .vcf file, how can I convert it to .bam file?
EDIT: found that one in GATK3 https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_simulatereads_SimulateReadsForVariants.php ( now https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/simulatereads/SimulateReadsForVariants.java )
Given a set of variants, this tool will generate simulated reads that support the input variants.
I wrote an experimental tool to create a BAM from a VCF: http://lindenb.github.io/jvarkit/VcfToBam.html
$ java -jar dist/vcf2bam.jar -R ref.fa samtools.vcf.gz 2> /dev/null | grep -v "100="
@HD VN:1.5 SO:unsorted
@SQ SN:seg1 LN:5101
@SQ SN:seg2 LN:4000
@RG ID:NA12878 SM:NA12878 LB:illumina DS:NA12878
@RG ID:NA12891 SM:NA12891 LB:illumina DS:NA12891
@RG ID:NA12892 SM:NA12892 LB:illumina DS:NA12892
@CO Generated with -R /home/lindenb/src/ngsxml/test/ref/ref.fa /home/lindenb/src/ngsxml/OUT/Projects/Proj1/VCF/samtools/Proj1.samtools.vcf.gz
0000003609 83 seg1 1449 60 99=1X = 949 -500 TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003610 147 seg1 1449 60 99=1X = 949 -500 TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003611 147 seg1 1449 60 99=1X = 949 -500 TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003612 83 seg1 1449 60 99=1X = 949 -500 TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003615 83 seg1 1450 60 98=1X1= = 950 -500 GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003616 147 seg1 1450 60 98=1X1= = 950 -500 GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003617 147 seg1 1450 60 98=1X1= = 950 -500 GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003627 147 seg1 1452 60 96=1X3= = 952 -500 TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003629 147 seg1 1452 60 96=1X3= = 952 -500 TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003630 147 seg1 1452 60 96=1X3= = 952 -500 TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003633 147 seg1 1453 60 95=1X4= = 953 -500 TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
0000003635 83 seg1 1453 60 95=1X4= = 953 -500 TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12892 NM:i:1
0000003640 147 seg1 1454 60 94=1X5= = 954 -500 TAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTCA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII RG:Z:NA12891 NM:i:1
Pierre Lindenbaum do you know how to run the GATK tool you mentiond? I tried to download several legacy versions of GATK3 and also one of GATK2 but none of them seem to have the mentioned tool. The GATK tool seems to be a little bit more sophisticated since it takes a user-defined read-error into account. But correct me if I'm wrong here and every other tool posted in this thread does this as well. Furthermore it seems that none of the tools takes care of mating reads.
I have also written a tool to go from VCF to BAM, called VCF2BIN.py
#!/usr/bin/env python
import re
import sys
import pysam
import string
import random
exit()
inputVCF = sys.argv[-1]
chromosomes = []
with open(inputVCF, 'rb') as f:
for row in f:
if row[0] != '#': break
if row[2:8] == 'contig':
for datum in row[10:-2].split(','):
if datum.split('=')[0] == 'ID': ID = datum.split('=')[1]
if datum.split('=')[0] == 'length': length = int(datum.split('=')[1])
chromosomes.append( {'LN':length, 'SN':ID} )
header = { 'HD': {'VN': '1.0'}, 'SQ': chromosomes }
outfile = pysam.AlignmentFile('./useless.bam', "wh", header=header)
for x in range(0,10000):
chromosome = random.choice(chromosomes)
readLen = random.randrange(40,80)
a = pysam.AlignedSegment()
a.query_name = "read_" + ''.join(random.choice(string.digits) for blah in range(10))
a.query_sequence=''.join(random.choice(['A','C','G','T']) for blah in range(readLen))
a.flag = random.randrange(1,4096)
a.reference_id = chromosomes.index(chromosome)
a.reference_start = random.randrange(1,chromosome['LN']-readLen)
a.mapping_quality = 20
a.cigar = ((0,readLen),)
a.next_reference_id = a.reference_id
a.next_reference_start=777
a.template_length=readLen
a.query_qualities = pysam.fromQualityString("I"*readLen)
a.tags = (("NM", 1),("RG", "L1"))
outfile.write(a)
outfile.close()
> VCF2BIN ./path/to/vcf.vcf
@HD VN:1.0
@SQ SN:chr1 LN:199141577
@SQ SN:chr1_GL456210_random LN:172527
@SQ SN:chr1_GL456211_random LN:235147
@SQ SN:chr1_GL456212_random LN:152998
@SQ SN:chr1_GL456221_random LN:209299
@SQ SN:chr2 LN:184859577
@SQ SN:chr3 LN:162759764
@SQ SN:chr4 LN:159434297
...
@SQ SN:chrUn_GL456396 LN:40612
@SQ SN:chrUn_JH584304 LN:71575
@SQ SN:chrX LN:176147421
@SQ SN:chrX_GL456233_random LN:366350
@SQ SN:chrY LN:91586428
@SQ SN:chrY_JH584300_random LN:65102
@SQ SN:chrY_JH584301_random LN:80372
@SQ SN:chrY_JH584302_random LN:150326
read_9201514959 1567 chr4_GL456350_random 66052 20 62M = 778 62 TCGCTCNCNACGAGTATCTCGAGNCGGNGNGATAAAGAATTNNCGGTCCNCTTGCCANCGTN IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_9162951900 1957 chrY_JH584302_random 5022 20 73M = 778 73 GGNCANAACCTTCTNCACTGNTCTAATCGCNCGTANNTTACCTTTTCNCTAAGTANATNCTNTGGCTTCACTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_0771262576 1928 chr9 41692370 20 47M = 778 47 TNTCNANTTNTGGTCCNCCGTACAANTGNACAANNNCANTAATNAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_2008154340 2796 chrUn_GL456383 42319 20 41M = 778 41 CNTGTNCCGACTCATGCAGGNGATNNTGTATGGGAGCTNCG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_5327352072 2985 chrUn_GL456367 11976 20 48M = 778 48 CNCACNGCTTNANACTGATCGNNCNCNNNNAGNACAATTNGANAGTNC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_2515573039 3633 chr4_JH584295_random 5998 20 74M = 778 74 TNGTTTCTAGCGCNNTGAGNGAGNNGACATCANANNGCACGCTTGNTAGACANCNAGTGGTGGCCTGGACTCTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_9171559801 783 chrUn_GL456389 17972 20 61M = 778 61 CNGGGTNCCNTTAACCCTTNATTTNAANANCATNTTAAGNGGANCGTTCGTNTTTCNAACC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_5146518951 909 chr4_JH584295_random 9303 20 41M = 778 41 GTACANGAGNNTCANNCTACNACNGAGATCCNGGCTTATNA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
read_5977278670 83 chrUn_GL456385 43659 20 74M = 778 74 CGNGGNCCCAAATCGTCANCAGCTGGAANGNACAAGNGGTGGANGGTTTGTTTNTGNTATNAAGNTATNGTGNC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1 RG:Z:L1
...
This tool is significantly faster than all other answers provided, because it uses aggressive subsampling of the VCF. In fact it only samples the header. This will not be addressed in the next version, but v2.0 will support parallelism on Casio graphical calculators, and the option to automatically write the output to /dev/null where it belongs.
It depend what you want to use the Bam for. VCF is derived from Bam while VCF lost some information of Bam which include 1) read size 2) read depth. There, if you want to do it, the bam will not contain these two important information. and such Bam can be called Vcf drove simulated Bam file.
To get a vcf derived bam try:
bedToBam -ubam -g file.genome -i snp.vcf > snp.bam
Conceptually you can't. a .vcf file contains only variants, and a .bam file contains reads coming from a sequencer, usually aligned against a reference. If you want to get the corresponding .bam files you should be able to get them from the same source you got the .vcf files. If you still want to generate a .bam file containing with the reads that would contain the variants on that .vcf file you can use a few software solutions available, although it will depend a lot on what you are really wanting that .bam file to.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
In general, VCF files are derived from bam files. It would be more helpful if you can put the context of your question here. For eg. why you need to convert vcf to bam?
to use it in ANGSD for Tajima's D analysis. I've collected my data from 1000 genome for a specific population
Do you have bedtools? Also, what format is your vcf?
VCF=variant call format
yes, I do. my .vcf is in ##fileformat=VCFv4.1
I also tried it with bedtools. I first converted my .vcf file to .bed file using plink. Then I used bedToBam to convert it to .bam file. But ANGSD couldn't read the bam file. Even I couldn't view it with samtools. Its really depressing!
Thank you to all. I want to use the derived .bam file in ANGSD for Tajima's D analysis.
vcftools has a
--TajimaD <integer>
option, maybe it is what you want?what if I want to perform PBS (population branch statistics). Is there any software that can take vcf file?
To keep the forum tidy, open a new question instead of using the space reserved for answers to your original question.