Parallelizing Alignment By Splitting Chromosome In Separate Files
11.0 years ago
skm770 ▴ 150

The easiest way to speed up alignment seems to be to split the reference FASTA file by chromosome and then align against each chromosome separately. But is this the right way to do it? Aren't we bound to get reads that should have mapped to the X chromosome but map to the Y chromosome instead?

Thanks

bwa alignment parallel • 5.7k views

Doing so will certainly shrink your search space, but it will lead to a lot of false-positive mappings. For example, a read that originates from the mitochondrial genome may still align to chromosome 1 with a few mismatches when chromosome 1 is the only target available. As Sean Davis said, splitting your read (FASTQ) file and mapping each piece against the whole genome is a much better option.


Just an aside: splitting the reference FASTA by chromosome could be quite counterproductive with an aligner whose run time scales linearly with the number of reads but is not very sensitive to the size of the genome. With 24 chromosomes, such an aligner has to process every read 24 times, so the split-by-chromosome approach could take quite a bit longer than not splitting.
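
To make the arithmetic concrete, here is a rough sketch that only counts read alignments under each strategy. It assumes run time is proportional to the number of reads aligned and independent of reference size; the 100 million read count is a made-up figure and `read_passes` is a hypothetical helper, not part of any aligner.

```shell
#!/bin/sh
# Count how many read alignments each strategy performs.
# Assumption: cost grows with reads aligned, not with reference size.
read_passes() {
    reads=$1        # number of reads in the FASTQ file
    references=$2   # number of separate reference files aligned against
    echo $((reads * references))
}

# Hypothetical data set of 100 million reads.
whole_genome=$(read_passes 100000000 1)     # one pass against the full genome
per_chromosome=$(read_passes 100000000 24)  # every read aligned 24 times

echo "whole genome:   $whole_genome read alignments"
echo "per chromosome: $per_chromosome read alignments"
```

Under that assumption the per-chromosome split does 24 times the total work, so even with 24 jobs running at once it would not finish sooner than a single unsplit run.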

11.0 years ago

The more common approach, which avoids the problem you note in your question, is to split the read files (typically FASTQ) into smaller chunks, align each chunk against the whole genome, and then merge the results.
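
A minimal sketch of that split, align, merge workflow. It assumes single-end reads in an uncompressed FASTQ file with exactly four lines per record, a reference already indexed with `bwa index`, and `bwa` plus `samtools` 1.x on the PATH. The function names, chunk prefix, and file names are illustrative and not from this thread.

```shell
#!/bin/sh
# Split a FASTQ file into chunks of a fixed number of reads.
# Assumes 4 lines per record (no wrapped sequence lines).
split_fastq() {
    fastq=$1; reads_per_chunk=$2; prefix=$3
    # Chunk size is a multiple of 4 lines, so no record is cut in half.
    # split names the pieces <prefix>aa, <prefix>ab, ...
    split -l $((reads_per_chunk * 4)) "$fastq" "$prefix"
}

# Align every chunk against the WHOLE reference, sort, then merge.
align_and_merge() {
    ref=$1; prefix=$2; out=$3
    for chunk in "$prefix"??; do
        # One background job per chunk; on a real machine you would cap
        # this at the number of cores or submit each chunk to a scheduler.
        bwa mem "$ref" "$chunk" | samtools sort -o "$chunk.sorted.bam" - &
    done
    wait  # block until every chunk has been aligned and sorted
    samtools merge "$out" "$prefix"??.sorted.bam
}
```

For example, `split_fastq reads.fastq 1000000 chunk_` followed by `align_and_merge ref.fa chunk_ merged.bam`. For paired-end data, both mate files would need to be split with the same chunk size so that mates end up in matching chunks.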

11.0 years ago

You would get some false positives by doing this. Instead, I split my results by chromosome right after bwa, using my tool SplitBam: https://github.com/lindenb/jvarkit/wiki/SplitBam


And, as Sean said, if your FASTQ files are divided into small chunks, you would merge all the per-chromosome files (chr1_001.bam, chr1_002.bam, chr1_003.bam, ...) after SplitBam.

Example:

bwa sampe (...) |\
java -jar dist/splitbam.jar \
    VALIDATION_STRINGENCY=LENIENT  \
    OUT_FILE_PATTERN=TESTSPLITBAM/__CHROM__.bam \
    REF=human_g1k_v37.fasta \
    ADD_MOCK_RECORD=true \
    GENERATE_EMPTY_BAM=true \
    GP=split_g1k_v37_01.txt 


[Fri Jul 26 13:25:56 CEST 2013] Executing as lindenb@master on Linux 2.6.32-358.6.2.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_19-mockbuild_2013_04_17_19_18-b00; Picard version: null
INFO    2013-07-26 13:25:56    SplitBam    reading stdin
INFO    2013-07-26 13:25:56    SplitBam    opening TESTSPLITBAM/CHROMS_01_09.bam
INFO    2013-07-26 13:25:57    SplitBam    opening TESTSPLITBAM/CHROMS_10_0Y.bam
INFO    2013-07-26 13:25:58    SplitBam    opening TESTSPLITBAM/CHROMS_OTHER.bam
INFO    2013-07-26 13:35:58    SplitBam    closing group CHROMS_01_09
INFO    2013-07-26 13:35:59    SplitBam    closing group CHROMS_10_0Y
INFO    2013-07-26 13:35:59    SplitBam    closing group CHROMS_OTHER
INFO    2013-07-26 13:36:00    SplitBam    closing group Unmapped
Runtime.totalMemory()=1916600320
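
The per-chromosome merge mentioned above could look roughly like the sketch below. It assumes the split files follow a `<chrom>_<chunk>.bam` naming scheme (chr1_001.bam, chr1_002.bam, ...), sit in one directory, and are all sorted the same way. `merge_commands` is a hypothetical helper that only prints the `samtools merge` commands, so they can be inspected before running.

```shell
#!/bin/sh
# Print one `samtools merge` command per chromosome for files named
# <chrom>_<chunk>.bam inside the given directory.
merge_commands() {
    dir=$1
    for f in "$dir"/*_[0-9]*.bam; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
        base=${f##*/}             # drop the directory part
        echo "${base%_*}"         # drop _<chunk>.bam, leaving the chromosome name
    done | sort -u | while read -r chrom; do
        # The glob is left unexpanded here; it expands when the command is run.
        echo "samtools merge $dir/$chrom.bam $dir/${chrom}_[0-9]*.bam"
    done
}
```

Piping the output to `sh` (for example `merge_commands TESTSPLITBAM | sh`) would run the merges one after another; the commands are independent, so they could also be run in parallel.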

Some variant callers take a "region" parameter, allowing parallelized variant calling on a single BAM file.
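
As one possible illustration, `bcftools mpileup` accepts a region with `-r`, so a single coordinate-sorted, indexed BAM can be processed one contig at a time. The sketch below assumes a FASTA index (`.fai`, as written by `samtools faidx`) next to the reference; `region_commands` and the file names are hypothetical, and the function only prints the commands rather than running them.

```shell
#!/bin/sh
# Print one variant-calling command per contig listed in the FASTA index.
# The first column of a .fai file is the contig name.
region_commands() {
    ref=$1; bam=$2; outdir=$3
    cut -f1 "$ref.fai" | while read -r contig; do
        printf 'bcftools mpileup -f %s -r %s %s | bcftools call -mv -Oz -o %s/%s.vcf.gz\n' \
            "$ref" "$contig" "$bam" "$outdir" "$contig"
    done
}
```

Each printed line is independent of the others, so the lines can be run serially with `region_commands ref.fa sample.bam calls | sh` or handed to a scheduler to run concurrently. The per-contig VCFs could then be combined, for instance with `bcftools concat`.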


Good point, but operations such as sorting and merging will be faster with one BAM per chromosome.


Won't variant calling on split BAMs give different results than on one huge BAM (depending on the algorithm chosen, of course)?


No, as long as all the reads for a given chromosome are in the same file.


Good point, thanks.

