Question

Parallelizing Alignment By Splitting Chromosome In Separate Files

0

11.4 years ago

skm770 ▴ 150

The easiest way to speed up the alignment seems to be to split the reference FASTA file by chromosome and then align against each chromosome separately. But is this the right way to do it? Are we not bound to get reads that should have mapped to the X chromosome but map to the Y chromosome instead?

thanks

bwa alignment parallel • 5.9k views

ADD COMMENT • link updated 11.4 years ago by Pierre Lindenbaum 166k • written 11.4 years ago by skm770 ▴ 150

3

Doing so will surely decrease your search space, but it will lead to a lot of false-positive mappings. For example, a read may originate from the mitochondrial genome but still align to chromosome 1 with a few mismatches. As Sean Davis said, splitting your read (FASTQ) file and mapping each chunk against the whole genome is a much better option.

ADD REPLY • link 11.4 years ago by Ashutosh Pandey 12k

0

Just an aside: splitting the FASTA file by chromosome could be quite counterproductive with an aligner whose run time scales linearly with the number of reads but is not very sensitive to the size of the genome. With 24 chromosomes, such an aligner has to process every read 24 times, so the split-by-chromosome approach could take quite a bit longer than not splitting.

ADD REPLY • link 11.4 years ago by Sean Davis 27k

score 6 · Answer 1 · 2013-12-18

6

11.4 years ago

Sean Davis 27k

The more common approach--that avoids the problem you note in your question--is to split the read files (typically FASTQ) into smaller chunks, align, and then merge.
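
For illustration, the splitting step might look like the sketch below. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function name `split_fastq`, the chunk size, and the `chunk_` prefix are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: split a FASTQ file into chunks holding a fixed number of reads.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
# split_fastq, the chunk size and the output prefix are hypothetical names.
split_fastq() {
    local fastq="$1" reads_per_chunk="$2" prefix="$3"
    # 4 lines per record, so each chunk gets a multiple of 4 lines
    # and no record is cut in half. Output files: <prefix>aaa, <prefix>aab, ...
    split -l "$((reads_per_chunk * 4))" -a 3 "$fastq" "$prefix"
}

# Example call (not run here):
# split_fastq reads.fastq 1000000 chunk_
```

Each chunk is then aligned against the whole reference, and the per-chunk alignments are merged afterwards. For paired-end data, both files would need the same chunk size so that mates stay in matching chunks.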

ADD COMMENT • link 11.4 years ago by Sean Davis 27k

score 2 · Answer 2 · 2013-12-18

2

11.4 years ago

Pierre Lindenbaum 166k

You would get some false positives by doing this. I split my results just after bwa using my tool https://github.com/lindenb/jvarkit/wiki/SplitBam

And, as Sean said, if your FASTQ files are divided into small chunks, you would merge all the chr1_001.bam chr1_002.bam chr1_003.bam ... after SplitBam.

Example:

bwa sampe (...) |\
java -jar dist/splitbam.jar \
    VALIDATION_STRINGENCY=LENIENT  \
    OUT_FILE_PATTERN=TESTSPLITBAM/__CHROM__.bam \
    REF=human_g1k_v37.fasta \
    ADD_MOCK_RECORD=true \
    GENERATE_EMPTY_BAM=true \
    GP=split_g1k_v37_01.txt 


[Fri Jul 26 13:25:56 CEST 2013] Executing as lindenb@master on Linux 2.6.32-358.6.2.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_19-mockbuild_2013_04_17_19_18-b00; Picard version: null
INFO    2013-07-26 13:25:56    SplitBam    reading stdin
INFO    2013-07-26 13:25:56    SplitBam    opening TESTSPLITBAM/CHROMS_01_09.bam
INFO    2013-07-26 13:25:57    SplitBam    opening TESTSPLITBAM/CHROMS_10_0Y.bam
INFO    2013-07-26 13:25:58    SplitBam    opening TESTSPLITBAM/CHROMS_OTHER.bam
INFO    2013-07-26 13:35:58    SplitBam    closing group CHROMS_01_09
INFO    2013-07-26 13:35:59    SplitBam    closing group CHROMS_10_0Y
INFO    2013-07-26 13:35:59    SplitBam    closing group CHROMS_OTHER
INFO    2013-07-26 13:36:00    SplitBam    closing group Unmapped
Runtime.totalMemory()=1916600320
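
The per-chromosome merge described above could be sketched roughly as follows. This is a dry run that only prints one `samtools merge` command per chromosome. It assumes chunk files named `<chrom>_<3-digit chunk>.bam` (as in `chr1_001.bam`) that are already coordinate-sorted; the function name `print_merge_commands` and the `merged/` output directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one `samtools merge` command per chromosome for chunked BAMs
# named <chrom>_<NNN>.bam (e.g. chr1_001.bam, chr1_002.bam).
# print_merge_commands and the merged/ directory are hypothetical names.
print_merge_commands() {
    local dir="$1" f chrom
    # Collect the distinct chromosome prefixes (everything before the last "_").
    for f in "$dir"/*_[0-9][0-9][0-9].bam; do
        [ -e "$f" ] || continue      # glob matched nothing
        f="${f##*/}"                 # strip the directory part
        echo "${f%_*}"               # strip the _NNN.bam suffix
    done | sort -u | while read -r chrom; do
        # The glob expands in sorted order, so chunks are listed 001, 002, ...
        echo samtools merge "merged/${chrom}.bam" "$dir"/"${chrom}"_[0-9][0-9][0-9].bam
    done
}

# Example call (not run here):
# print_merge_commands TESTSPLITBAM
```

Printing the commands first makes it easy to check the grouping, or to hand the lines to a job scheduler so that each chromosome is merged in parallel.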

ADD COMMENT • link 11.4 years ago by Pierre Lindenbaum 166k

2

Some variant callers take a "region" parameter, allowing parallelized variant calling on a single BAM file.
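
As a rough sketch of how such regions could be produced, the function below turns a FASTA index (the `.fai` file written by `samtools faidx`) into fixed-size windows of the form `chrom:start-end`. The function name `make_regions` and the window size are hypothetical choices, and the name of the region option differs between callers.

```shell
#!/usr/bin/env bash
# Sketch: turn a FASTA index (.fai) into fixed-size regions "chrom:start-end"
# (1-based, inclusive), one per line.
# make_regions and the window size are hypothetical choices.
make_regions() {
    local fai="$1" window="$2"
    # .fai columns: 1 = sequence name, 2 = sequence length.
    awk -v w="$window" '{
        for (start = 1; start <= $2; start += w) {
            end = start + w - 1
            if (end > $2) end = $2   # clip the last window to the sequence end
            printf "%s:%d-%d\n", $1, start, end
        }
    }' "$fai"
}

# Example call (not run here):
# make_regions human_g1k_v37.fasta.fai 10000000 > regions.txt
```

Each line of the output can then be passed to a separate variant-calling job on the same BAM file.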

ADD REPLY • link 11.4 years ago by Sean Davis 27k

0

Good point, but an operation like sorting + merging will be faster with one BAM per chromosome.

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 166k

0

Won't variants be called differently from split BAMs than from one huge BAM? (Depending on the algorithm chosen, of course.)

ADD REPLY • link 11.4 years ago by Biomonika (Noolean) 3.2k

0

No, as long as all the reads for a given chromosome are in the same file.

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 166k

0

Good point, thanks.

ADD REPLY • link 11.4 years ago by Biomonika (Noolean) 3.2k