Variant calling in genomic chunks
6.7 years ago

Is variant calling done on a per-position basis? I've read recommendations to split the BAM file by chromosome and call the chromosomes in parallel for speed.

Could I further segment each chromosome into chunks and do calling on each chunk? For example if I: 1) Split a chromosome into 1MB segments. 2) Parallel variant call each 1MB segment. 3) Concatenate the VCF.

Would the resulting concatenated file be correct? Would I be missing any information that might be shared among sites on the same chromosome that variant callers use?

variant calling vcf • 2.4k views

If you plan on calling SNPs only, then I don't see a problem. However, if you are looking for structural variants, there would be missing data at the edges of the chunks, especially for SVs spanning multiple chunks. Additionally, how do you plan to keep track of each chunk's size and exact position? With additional liftUp files?


Thanks for the reply. I see what you mean with the SVs and possibly even indels. I am not interested in SVs for now, but do want to preserve indel information if I can.

The BAM files I am working with are low coverage. I guess I'll have to write a script to chunk the BAM file based on coverage "islands", where neighbouring islands are at least 100 bp apart.
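A minimal sketch of the island idea, assuming the covered intervals have already been extracted from the BAM file (for example from a coverage track) as 0-based, half-open `(chrom, start, end)` tuples. The function name `merge_into_islands`, the `min_gap` parameter and the example intervals are hypothetical and only serve as illustration.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end) in 0-based, half-open coordinates (an assumption of this sketch)
Interval = Tuple[str, int, int]


def merge_into_islands(intervals: Iterable[Interval], min_gap: int = 100) -> List[Interval]:
    """Merge covered intervals on the same chromosome that are fewer than
    min_gap bases apart, so the resulting islands are at least min_gap apart."""
    islands: List[Interval] = []
    # Plain tuple sorting orders by chromosome name (lexicographically), then start.
    for chrom, start, end in sorted(intervals):
        if islands and islands[-1][0] == chrom and start - islands[-1][2] < min_gap:
            prev_chrom, prev_start, prev_end = islands[-1]
            # Extend the current island; max() handles nested intervals.
            islands[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            islands.append((chrom, start, end))
    return islands


# Made-up intervals, purely for illustration.
covered = [("chr1", 100, 250), ("chr1", 300, 400), ("chr1", 600, 700), ("chr2", 50, 120)]
islands = merge_into_islands(covered)
print(islands)
```

Each island could then be handed to a separate calling process. Whether 100 bp is enough context for indels depends on the caller and the read length, so the threshold is something to test rather than take as given.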

6.7 years ago

Brad Chapman has described this procedure previously on his blog (search for "Parallelism by genomic regions"). This at least used to be part of the bcbio-nextgen workflow, though I don't know if it's still included. That procedure is slightly more involved, since it finds appropriate segment boundaries rather than using fixed 1MB chunks, but the principle is the same.

6.7 years ago

Hello Damian,

Is variant calling done on a per-position basis?

This depends on your variant caller. GATK's UnifiedGenotyper and samtools mpileup do so, as far as I know. GATK's HaplotypeCaller performs local de novo assembly, and freebayes builds haplotypes from the reads around a candidate site. So these variant callers need information from the region around a suspected variant.

Instead of splitting the BAM file, I would use the option most variant callers offer to restrict calling to a given genomic region. That way every process has all the alignment information it needs.
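As a rough sketch of the region approach, the snippet below generates fixed-size region strings that could be passed to a caller's region option, one per parallel job. It assumes the common `chrom:start-end` syntax with 1-based, inclusive coordinates as used by samtools; other callers may expect a different convention, so check the documentation of the tool you use. The function name `make_regions` and the chromosome name and length are made up for illustration.

```python
from typing import Dict, List


def make_regions(chrom_lengths: Dict[str, int], chunk_size: int = 1_000_000) -> List[str]:
    """Return 'chrom:start-end' strings (1-based, inclusive) that tile each
    chromosome in chunks of at most chunk_size bases."""
    regions: List[str] = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, chunk_size):
            # The last chunk is clipped to the chromosome length.
            end = min(start + chunk_size - 1, length)
            regions.append(f"{chrom}:{start}-{end}")
    return regions


# Hypothetical chromosome of 2.5 Mb; real lengths would come from the BAM header or a .fai index.
regions = make_regions({"chrA": 2_500_000})
print(regions)
```

Because each process reads from the full BAM file and only restricts where it calls, the reported positions should stay in chromosome coordinates, and the per-region VCFs can be concatenated in region order without any coordinate lifting. Variants that sit right at a chunk boundary may still deserve a check, for example by comparing against an unchunked run on a small test region.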

fin swimmer
