Question

.bam file of human reference genome (hg19)

0

9.8 years ago

Max ▴ 150

I have a number of exome (and some WGS) sequences of tumors with no matched blood sequence data. To call somatic mutations, my approach is to compare the tumor sequence to the human reference genome, filter out variants at known SNP sites, and assume the rest are somatic mutations. Not perfect, but reasonable.

The problem I'm having is finding a control .bam file to use with mutation callers like MuTect, SomaticSniper, etc. Basically, I need a .bam file corresponding to the reference human genome (mostly assembly hg19 for the tumor data) to compare to the tumor .bams, but I don't know how to create one from the reference FASTA in the absence of read coordinates. Is there a straightforward way to get an input BAM file that uses the consensus sequence of hg19?

samtools • 5.3k views

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Max ▴ 150

0

Wouldn't a more sensible approach be to simply use GATK/samtools/FreeBayes/etc. to call variants and then filter the resulting VCF file against dbSNP and 1000 Genomes variants?
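
The filtering step can be sketched with plain awk. This is only an illustration under assumptions: the function name `filter_known_sites` and the file names are hypothetical, both VCFs are uncompressed and use the same chromosome naming (`1` vs `chr1`), and a call is dropped only on an exact CHROM/POS/REF/ALT match. Multi-allelic records would need to be split first, and a dedicated VCF tool is the better choice for real data.

```shell
# Hypothetical helper: print the records of a call set that are absent
# from a known-sites VCF, matching on CHROM, POS, REF and ALT exactly.
# $1 = known-sites VCF (e.g. a dbSNP export), $2 = VCF of tumor calls.
filter_known_sites() {
  awk -F'\t' '
    # First file: remember every known site, skipping header lines.
    NR == FNR { if ($0 !~ /^#/) known[$1 FS $2 FS $4 FS $5] = 1; next }
    # Second file: pass header lines through unchanged.
    /^#/ { print; next }
    # Second file: keep only records not seen in the known-sites file.
    { key = $1 FS $2 FS $4 FS $5; if (!(key in known)) print }
  ' "$1" "$2"
}

# Example call (file names are placeholders):
# filter_known_sites dbsnp.vcf tumor_calls.vcf > tumor_calls.filtered.vcf
```

What remains after this kind of filter is still a mix of rare germline variants and somatic mutations, so it is a starting point rather than a somatic call set.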

9.8 years ago by Devon Ryan 104k

0

Actually, I agree, that sounds like a better idea.

9.8 years ago by Brian Bushnell 20k

0

I don't think I can do that with most of the variant callers I like to use; they all require a reference .bam file.

9.8 years ago by Max ▴ 150

1

All of the callers you like to use are designed around having matched control samples. You don't have that, and using a consensus BAM file is not going to help you appreciably. Use the right tool for the job; don't try to shoehorn every problem into the same pipeline.

9.8 years ago by Devon Ryan 104k

0

Devon has the right answer. Those callers are not appropriate for your data. They make specific assumptions that depend on having a matched normal, and if you try to use them the results will be disappointing at best and flat-out wrong at worst.

9.8 years ago by Chris Miller 22k

Ram · Answer 1 · 2015-02-17

1

9.8 years ago

Brian Bushnell 20k

If you want a BAM corresponding to the human reference, I suggest generating synthetic reads and mapping them. You can do that with BBTools in three steps (index the reference, generate reads, map them):

bbmap.sh ref=hg19.fasta
randomreads.sh reads=450000000 length=100 paired out=synth.fq.gz minq=15 midq=30 maxq=40
bbmap.sh in=synth.fq.gz out=mapped.bam

Note that writing BAM output requires samtools to be on your PATH; otherwise you have to write SAM (just specify mapped.sam instead of mapped.bam) and then convert it to BAM.
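
As a small sketch of that check, the helper below picks the output name depending on whether samtools can be found. The function name `pick_mapped_output` is hypothetical and only illustrates the idea; it does not run any mapping itself.

```shell
# Hypothetical helper: choose a .bam name when samtools is on the PATH,
# otherwise fall back to a .sam name that can be converted later.
pick_mapped_output() {
  prefix=${1:-mapped}
  if command -v samtools >/dev/null 2>&1; then
    echo "${prefix}.bam"
  else
    echo "${prefix}.sam"
  fi
}

# Intended use with the mapping step above:
#   bbmap.sh in=synth.fq.gz out="$(pick_mapped_output mapped)"
# If SAM was written, convert it once samtools is available:
#   samtools view -b -o mapped.bam mapped.sam
```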

To filter out variants using this method, I recommend using the same mapping program and same settings for the synthetic and real data.
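
For a sense of scale, here is a rough back-of-the-envelope calculation for the `randomreads.sh` settings above. The genome size of about 3.1 Gbp for hg19 is an approximation, the function name `estimate_coverage` is hypothetical, and whether `reads=` counts single reads or pairs is not confirmed here; if it counts pairs, double the figures.

```shell
# Hypothetical helper: integer estimate of sequencing depth.
# $1 = number of reads, $2 = read length (bp), $3 = genome size (bp).
estimate_coverage() {
  reads=$1
  read_len=$2
  genome_size=$3
  total_bases=$((reads * read_len))
  echo "$((total_bases / genome_size))"
}

# 450 million reads of 100 bp against ~3.1 Gbp (approximate hg19 size)
estimate_coverage 450000000 100 3100000000   # prints 14
```

That is 45 billion bases, each stored with a quality character, so at least 90 GB of FASTQ before compression. Lowering `reads=` reduces both the depth and the file sizes proportionally.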

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Brian Bushnell 20k

0

Hi! I tried this method to create a synthetic normal BAM, but the files I got are extremely big. synth.fq.gz is 112 GB, and the BAM file is currently over 116 GB while the mapping is still running. Is that normal, or did I do something wrong or miss something? Thank you.

6.7 years ago by danab • 0