In the tophat folder (generated by running tophat), there is a file called align_summary.txt
. In the example below, it reports that out of 92923521 aligned pairs, 9.6% have multiple alignments, which means that there must be a way to generate a file containing only the uniquely aligned pairs (those that have only a single match as a pair). The accepted_hits.bam
file, if I understand correctly, contains both, unique and multiple alignments pairs. Which command can I use to extract only the uniquely aligned pairs?
Also, if the junctions.bed
file contains splice junctions which uniquely mapped based on alignments as well as multimappers, which command can I use to extract only the uniquely aligned? And since this is .bed rather than .bam format, is the pairing information utilized for determining the uniqueness of the splice junction read?
In the following 3 posts the code 255 (in both accepted_hits.bam
and junctions.bed
) is explained as either meaning that the mapping quality score is not available or that read has a single match in the reference: Tophat Rna-Seq Mapping Quality, https://biostar.usegalaxy.org/p/4077, Junctions.Bed File Produced From Tophat --- what is the final word on this? And if it means single match, does it refer to reads independently of pairing or also if as a pair the match is unique even if each read on its own could have more matches?
Another solution for accepted_hits.bam
file is: samtools view -bq 1 file.bam > unique.bam
, where unique reads = single read mapping at one best position, suggested in the following posts - How To Filter Mapped Reads With Samtools and href="http://seqanswers.com/forums/archive/index.php/t-6189.html --- here too, if as a pair the match is unique, even if each read on its own could have more matches, is it accepted as unique?
Alternatively, the following two posts suggest that using -g/--max-multihits
1 when running tophat will generate accepted_hits.bam file with reads that have only a single match in the genome, though I do not know whether this command will also include reads which have only 1 match as a pair but not each on its own? What Does The Tophat File Named 'Accepted_Hits.Bam' Include? and http://seqanswers.com/forums/showthread.php?t=8548
I assume that for all of the above, for a single match in the reference the reads/pairs tophat alignment run need to be set to disallow any mismatches, because if allowed, then more would get multiple matches and get excluded. On the other hand, if read/pair has 0 matches, it is preferable to allow at least 1 mismatch. How could such combined approach be implemented? The following post discusses the thresholds for alignment quality and I wonder if this could be used along with other approaches, so that reads pairs that have no matches would be allowed to have one mismatch? How to extract unique mapped results from Bowtie2 bam results? samtools view -b -q 10 foo.bam > foo.filtered.bam
, where -q
value represents the likelihood that an alignment is correct.
I do not have the expertise to evaluate all these approaches, and would appreciate advice. Thank you.
Left reads:
Input: 128979165
Mapped: 98314933 (76.2% of input)
of these: 11898655 (12.1%) have multiple alignments (9004 have >20)
Right reads:
Input: 128979165
Mapped: 95536410 (74.1% of input)
of these: 10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.
Aligned pairs: 92923521
of these: 8913959 ( 9.6%) have multiple alignments
and: 1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.
Regarding the BAM file, the newer versions of tophat2 will give a score of 50 for alignments that don't multimap (this used to be 255). Yes, 255 is defined in the specification as "unavailable", but what the specification says and people do aren't always the same. So, just
samtools view -bq 4 accepted_hits.bam > filtered.bam
to filter things. Note that -q 1 won't work, since tophat2 will give MAPQ scores of between 0 and 3 to multimappers (depending on how many times they map).I don't know about the BED files, so I'm just making this a comment to not dissuade others from answering.
BTW, you might switch to STAR. The author is pretty quick with responses over on seqanswers.
So,
will select only unique mapping reads from the tophat2
accepted_hits.bam
.While
works for bwa file to select unique mapping reads.
Yup!
Thanks!
How do you filter discordant reads from
filtered.bam
. For eg: in thealign_summary.txt
provided by trakhtenberg, there are about 2% of discordant reads. Tophat with options--no-discordant
always fails, probably need to do through samtools. Any suggestions. Thanks!You want bit 2 to be set.
"You want bit 2 to be set". You mean during tophat or samtools ? Could you please show an example command.
thanks!
With samtools, so
samtools view -bq 4 -f 2 ...
.Thank you. I also found on this post that the reads from
junctions.bed
are actually included in theaccepted_hits.bam
(tagged by 'N' characters in the CIGAR string, 6th column), so I only need to filter the BAM file.Although not critical to this discussion, I have another question which may influence the identification of uniquely aligned reads, and I posted it here
An N CIGAR operation means spliced, so that's correct. BTW, if you were to use STAR, you would find that it annotates its junctions with the number of uniquely aligned reads supporting them (it also gives you total reads). This is convenient for filtering novel junctions.
Thank you for clarifying the meaning of N CIGAR. I will soon start practicing with STAR too.