Hello,
I've ran TopHat and aligned paired end (PE) reads to a reference genome.
TopHat outputs two bam files: accepted_hits.bam and unmapped.bam.
I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped.
An interesting fact that may have to do with the specifics of TopHat, is that the only sam flags in the file are 69 and 133. According to this nice tool, this means:
69 - read paired (0x1), read unmapped (0x4), first in pair (0x40)
133 - read paired (0x1), read unmapped (0x4), second in pair (0x80)
Pairs where both reads are unmapped have two records, one with flag 69 and one with flag 133. For example:
SOLEXA1:1:1:10:1400 69 * 0 255 * * 0 0 GGTGCATCTACTTCCCTAATTTTGGCTGAAACTTCAATCCGAANGGTGTTNNTCCGAGATC AB8CBBBBBBBBBABBBBABBBBBB@ABABA@AAABBB@AA?8%+=7:>=%%<?A=@>@;;
SOLEXA1:1:1:10:1400 133 * 0 255 * * 0 0 NGGAACAANACCNTTCGGATTNAAGTTTNAGNCNAANTTAGGGNAGTT %/;9;;;/%-;0%0;9;;;############################# ZT:A:N
While pairs where only one read is unmapped have only one record with either 69 or 133:
SOLEXA1:1:1:10:141 133 * 0 255 * * 0 0 NGGTACTACAAAAATTGAAGTNGACTATTACNGTCACTCTCCTGTTTC %.;8;;9;;;9;;9;;;;;9,%/99;;;;;/%.8;;9;##########
Is there any way, using samtools (or any other tool) to extract the pairs I'm interested in? I guess I could just script this, but I suspect there might be a better way to go.
Thanks a lot!
You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.
Aligners will write un-mapped reads to separate files (saving you this step). BBMap (
outu=
) and HISTAT2 (--un
option) are a couple of options.STAR
(--outReadsUnmapped Fastx
) can also do this. So consider using a newer aligner.