Split Read Mapping By Bwa And Bowtie2
4
16
Entering edit mode
11.8 years ago
Abhi ★ 1.6k

Hey Guys

Can someone help me understand the current status of how single reads which split maps to two different far apart (same or diff chromosome) are handled by BWA and Bowtie2. Also good to know any other mapper which caters to these corner cases specifically.

Example:

> ---------------------------------------------> read (100 bp)
> 
> ----------------> (first 40 bp maps to chr 2)            --------------------->(next 60bp maps to chr 13)

We are basically trying to zoom in on the break points in structural variation.

Thanks! -Abhi

mapping sequencing bwa bowtie2 • 18k views
ADD COMMENT
12
Entering edit mode
11.8 years ago
lh3 33k

Proper split alignment needs some special treatments. Most traditional aligners such as blast, blat and ssaha2 will give you contained hits you do not care, while most NGS aligners assume full-length match and will miss fragments. A proper split aligner should surpress contained hits while retaining the maximal non-overlapping fragments. To my limited knowledge, BWA-SW is the first that attempts to produce proper split alignments for sequence reads.

I think you may try BWA-SW to see how it works. For 100bp reads, it may not be very sensitive to 50bp fragments, but for SVs, you do not need ultra-high sensitivity. Analyzing the BWA-SW output is fairly straightforward: if the alignment is split, bwa-sw will give two or more SAM lines. The confidence of the split alignment is measured by the minimal mapping quality between the two fragments of the same read. I have not tried yaha, but from its paper, it seems a right tool, too. The developers has put more efforts in the context of SV discovery (while I have not). It is worth trying.

As to other mappers, BWA, Bowtie1, Soap2, Gsnap, GEM and YOABS assume full-length match. They won't do split alignment. Bowtie2 is not designed with split alignment in mind, but for 100bp reads, it may work in simple scenarios (e.g. both fragments are fairly unique; remember to use --local). PRISM and Pindel require an anchor mate. They won't find alignments across chromosomes. I heard that mosaik can do proper split alignment, but I do not know the details. Smalt might work for split alignment, but I am not sure, either.

EDIT: BTW, you may also try a new algorithm on the bwa mem branch that has a new component called mem. In theory, it should be more sensitive than bwa-sw for your data, but that branch is only a week old and is unstable.

ADD COMMENT
1
Entering edit mode

If I understand correctly, short read aligner like bwa or bowtie were based on seed align mode? means first 33 or 28bp as seed to search the reference database, the remaining bases were used for calculating mismatch etc? Please correct me if I am wrong. Thanks.

ADD REPLY
1
Entering edit mode

Not true for bwa. BWA restricts #mismatches in the seed, but it has to require full-length match. That is different from the standard seed-and-extend. Don't know much about bowtie. From its paper and documentation, it seems to extend allowing arbitrary mismatches in the rest of reads, but if that is true, bowtie should be very sensitive to 100bp reads, but in practice it seems not.

ADD REPLY
0
Entering edit mode

thanks for the details Heng

ADD REPLY
4
Entering edit mode
11.8 years ago

BWA and bowtie2 might not be the test tools for the job. BWASW and these might be useful to look into:

http://www.ncbi.nlm.nih.gov/pubmed/22829624 (yaha)

http://www.ncbi.nlm.nih.gov/pubmed/22851530 (PRISM)

ADD COMMENT
0
Entering edit mode

thanks Sean for the quickest reply :)

ADD REPLY
2
Entering edit mode
11.8 years ago
Ryan Thompson ★ 3.6k

Tophat-fusion (or tophat2 in fusion mode) might be an appropriate tool for the job.

ADD COMMENT
1
Entering edit mode
11.8 years ago

I can recommend segemehl for split read mapping. It allows multiple splits, which is of interest when using long reads from 454 machines and it can also find fusion splits mapping to different chromosomes.

ADD COMMENT
0
Entering edit mode

good to know. I can try this and may be compare to bwa-sw if time permits...

ADD REPLY
0
Entering edit mode

have you ever used segemehl for split?I have met some troubles about it.

ADD REPLY
0
Entering edit mode

When I run testrealign.x for extracting splice junctions, it happened:

[xsy@ibiopark1 ~]$ testrealign.x -d chr1.fa -q mymap_chr1_5.sam -n
[SEGEMEHL] Mon Dec 22 22:55:47 2014: reading database sequences.
[SEGEMEHL] Mon Dec 22 22:55:52 2014: 1 database sequences found.
[SEGEMEHL] Mon Dec 22 22:55:52 2014: reading query files.
*** glibc detected *** testrealign.x: realloc(): invalid pointer: 0x00007fff4842d80b ***
======= Backtrace: =========
/lib64/libc.so.6[0x31b4a76166]
/lib64/libc.so.6(realloc+0x2e2)[0x31b4a7bfd2]
testrealign.x[0x433e79]
testrealign.x[0x459ea1]
testrealign.x[0x45c10a]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x31b4a1ed1d]
testrealign.x[0x401fc9]
======= Memory map: ========
00400000-00470000 r-xp 00000000 fd:00 811622                             /usr/local/bin/testrealign.x
00670000-00671000 rw-p 00070000 fd:00 811622                             /usr/local/bin/testrealign.x
00671000-00672000 rw-p 00000000 00:00 0
01bd1000-01c04000 rw-p 00000000 00:00 0                                  [heap]
31b4200000-31b4220000 r-xp 00000000 fd:00 2098592                        /lib64/ld-2.12.so
31b441f000-31b4420000 r--p 0001f000 fd:00 2098592                        /lib64/ld-2.12.so
31b4420000-31b4421000 rw-p 00020000 fd:00 2098592                        /lib64/ld-2.12.so
31b4421000-31b4422000 rw-p 00000000 00:00 0
31b4600000-31b4602000 r-xp 00000000 fd:00 2098597                        /lib64/libdl-2.12.so
31b4602000-31b4802000 ---p 00002000 fd:00 2098597                        /lib64/libdl-2.12.so
31b4802000-31b4803000 r--p 00002000 fd:00 2098597                        /lib64/libdl-2.12.so
31b4803000-31b4804000 rw-p 00003000 fd:00 2098597                        /lib64/libdl-2.12.so
31b4a00000-31b4b8b000 r-xp 00000000 fd:00 2098593                        /lib64/libc-2.12.so
31b4b8b000-31b4d8a000 ---p 0018b000 fd:00 2098593                        /lib64/libc-2.12.so
31b4d8a000-31b4d8e000 r--p 0018a000 fd:00 2098593                        /lib64/libc-2.12.so
31b4d8e000-31b4d8f000 rw-p 0018e000 fd:00 2098593                        /lib64/libc-2.12.so
31b4d8f000-31b4d94000 rw-p 00000000 00:00 0
31b4e00000-31b4e17000 r-xp 00000000 fd:00 2098598                        /lib64/libpthread-2.12.so
31b4e17000-31b5017000 ---p 00017000 fd:00 2098598                        /lib64/libpthread-2.12.so
31b5017000-31b5018000 r--p 00017000 fd:00 2098598                        /lib64/libpthread-2.12.so
31b5018000-31b5019000 rw-p 00018000 fd:00 2098598                        /lib64/libpthread-2.12.so
31b5019000-31b501d000 rw-p 00000000 00:00 0
31b5600000-31b5683000 r-xp 00000000 fd:00 2098594                        /lib64/libm-2.12.so
31b5683000-31b5882000 ---p 00083000 fd:00 2098594                        /lib64/libm-2.12.so
31b5882000-31b5883000 r--p 00082000 fd:00 2098594                        /lib64/libm-2.12.so
31b5883000-31b5884000 rw-p 00083000 fd:00 2098594                        /lib64/libm-2.12.so
31b5a00000-31b5a15000 r-xp 00000000 fd:00 2098596                        /lib64/libz.so.1.2.3
31b5a15000-31b5c14000 ---p 00015000 fd:00 2098596                        /lib64/libz.so.1.2.3
31b5c14000-31b5c15000 r--p 00014000 fd:00 2098596                        /lib64/libz.so.1.2.3
31b5c15000-31b5c16000 rw-p 00015000 fd:00 2098596                        /lib64/libz.so.1.2.3
31bf200000-31bf21d000 r-xp 00000000 fd:00 2098636                        /lib64/libtinfo.so.5.7
31bf21d000-31bf41d000 ---p 0001d000 fd:00 2098636                        /lib64/libtinfo.so.5.7
31bf41d000-31bf421000 rw-p 0001d000 fd:00 2098636                        /lib64/libtinfo.so.5.7
31c1a00000-31c1a22000 r-xp 00000000 fd:00 2097244                        /lib64/libncurses.so.5.7
31c1a22000-31c1c21000 ---p 00022000 fd:00 2097244                        /lib64/libncurses.so.5.7
31c1c21000-31c1c22000 rw-p 00021000 fd:00 2097244                        /lib64/libncurses.so.5.7
3ca4c00000-3ca4c16000 r-xp 00000000 fd:00 2098623                        /lib64/libgcc_s-4.4.7-20120601.so.1
3ca4c16000-3ca4e15000 ---p 00016000 fd:00 2098623                        /lib64/libgcc_s-4.4.7-20120601.so.1
3ca4e15000-3ca4e16000 rw-p 00015000 fd:00 2098623                        /lib64/libgcc_s-4.4.7-20120601.so.1
7ffb6e0b6000-7ffb7b2df000 rw-p 00000000 00:00 0
7ffb951a9000-7ffb951ae000 rw-p 00000000 00:00 0
7ffb951bb000-7ffb951be000 rw-p 00000000 00:00 0
7fff48419000-7fff4842e000 rw-p 00000000 00:00 0                          [stack]
7fff48450000-7fff48451000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted (core dumped)
ADD REPLY
0
Entering edit mode

Is your sam file sorted?

ADD REPLY
0
Entering edit mode

I have sorted the sam file.

First I convert sam file to bam file, then use samtools sort to sort the bam file, then convert bam file back to sam file.

ADD REPLY
0
Entering edit mode

And I want to know that because I can not build indices for whole genome, that is to say, I can not match my reads to the whole genome. And I think it will miss many important information about split sites.

ADD REPLY
0
Entering edit mode

I always get this error:

testrealign.x: libs/memory.c:18: bl_realloc: Assertion `ptr != ((void *)0)' failed.
Aborted (core dumped)

code:

testrealign.x -d chr6.fa chr7.sam chr8.sam chr9.sam chr10.fa -q mymap_chr6_10.sorted.sam -n
ADD REPLY
0
Entering edit mode

Just to assure that we are talking about the same thing:

  1. You created an index for chr1

    ./segemehl.x -d chr1.fa -x chr1.idx
    
  2. You mapped your reads against chr1

    ./segemehl.x -d chr1.fa -i chr1.idx -q reads.fastq -S | samtools view -bS - | samtools sort -o - deleteme | samtools view -h - > mapped.sam #(​also -p if you are working with paired-end data)
    
  3. You call testrealign and get the error mentioned in you post

    ./testrealign.x -d chr1.fa -q mapped.sam -n
    

Is that correct? I just want to be sure that you called the tools correctly.

ADD REPLY
0
Entering edit mode

Thank you. it is correct.

and I have an another question,that is: if I want to build indices for the whole human chromosomes,what can I do?

Because my machines with only 16GB of RAM.I can not creat index for whole human genome chromosomes.

ADD REPLY
0
Entering edit mode

Can you provide a small down-sampled test-set of your data that produces this error?

With segemehl and 16GB of RAM it is not possible to map your reads against the complete human genome. You can either use another split-read mapper (tophat or STAR), or you can buy a HPC with enough memory.

ADD REPLY
1
Entering edit mode

Tophat and STAR are designed for RNA-seq, not really for DNA-seq. As I tried their older versions a year ago, they produced many false "gene fusions" from simulated DNA-seq reads without fusions. segemehl uses about 64-75GB as I remember. It would be good to have a machine with >=128GB RAM anyway. Not that expensive these days. BTW, on a 16GB machine, you cannot run STAR as it requires ~30GB RAM.

ADD REPLY
0
Entering edit mode

What is the difference between RNA-seq and DNA-seq?

Because I want to find tans-splicing sites for complete human chromosome and what can I do if I only have a machine with 16GB RAM?

tophat-fusion?it works

ADD REPLY
0
Entering edit mode

RNA-seq starts with RNA (which is discontinuous on the chromosome) while DNA-seq starts with DNA (which is roughly continuous on the chromosome). The two alignment problems are a bit different since RNA-seq alignment needs to account for intron-exon boundaries and DNA-seq does not.

ADD REPLY
0
Entering edit mode

thank you very much.

ADD REPLY
0
Entering edit mode

segemehl with 64G can work?I want to be sure that it is enough.Because I am considering buy a HPC with 64G memory.

ADD REPLY
0
Entering edit mode

I think I have used wrong sort ways.

my code is:

samtools sort -n mymap_chr11_15.bam mymap_chr11_15.sorted.bam

-n is wrong

ADD REPLY
0
Entering edit mode

how many memory for complete human genome?

ADD REPLY
0
Entering edit mode

You can find that information in the publication:

Otto C, Stadler PF, Hoffmann S: 'Lacking alignments? The next generation sequencing mapper segemehl revisited', Bioinformatics. 2014 Jul 1;30(13):1837-43 (2014)

ADD REPLY
0
Entering edit mode

Can I ask you a question?

After I run segemehl for split read mapping,I don't know how to do with it.

Can you give me some suggestion about the downstream splice site detection?

ADD REPLY
0
Entering edit mode

Dear syxbestmayer,

the downstream analysis of split-reads is not trivial.

In an answer above, there are the commands of how you can get the bed files containing three different junction types: normal, trans and circular.

./segemehl.x -d hg19.fa -i chr1.idx -q reads.fastq -S | samtools view -bS - | samtools sort -o - deleteme | samtools view -h - > mapped.sam
./testrealign.x -d hg19.fa -q mapped.sam -n

In the publication

  • Hoffmann S, Otto C, Doose G, Tanzer A, Langenberger D, Christ S, Kunz M, Holdt L, Teupser D, Hackermüeller J, Stadler PF: 'A multi-split mapping algorithm for circular RNA, splicing, trans-splicing, and fusion detection', Genome Biology, 15:R34, doi:10.1186/gb-2014-15-2-r34 (2014)

and in the manual, you can find information about how the algorithm works and what information the bed files hold.


If you are too lazy to read and understand all these information, there will be a workshop in early march this year. Here, one of the developers of segemehl will explain his and other split-read mappers (STAR and Tophat). Furthermore, you will be trained in how to analyze the results of all three mapping tools. It will be explained, how the output files look like, how you can get the splice junctions and what you can do with that information (isoform detection, differential expression analysis, fusion gene detection, circularized RNA detection, etc.).

Note: This course will not be a promotion event for segemehl! We assured that all three tool are treated equally. They all give very similar results, but have different pros and cons. The decision of which tool one wants to use in the end, is completely in the hand of the participants. But our trainer (Christian Otto; co-developer of segemehl) is probably one of the best choices you can get for that task.

If you are interested, take a look at our workshop website with more information: http://www.ecseq.com/workshops/workshop_2015-01.html

ADD REPLY

Login before adding your answer.

Traffic: 2415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6