Hi,
Simple question but I cannot figure how to do it : How many reads are aligned using tophat 2 ? with paired-end data.
Thanks,
N.
Hi,
Simple question but I cannot figure how to do it : How many reads are aligned using tophat 2 ? with paired-end data.
Thanks,
N.
If you want to distinguish between the left and the right mate of your reads, you can just count the mappings like this:
samtools view –f 0x40 –F 0x4 run.bam | cut -f1 | sort | uniq | wc -l (left mate)
and
samtools view –f 0x80 –F 0x4 run.bam | cut -f1 | sort | uniq | wc -l (right mate)
-f 0x40 or -f 0x80 returns only the mappings of the left, or right mates. If the IDs for your left and right mates are the same, this is the only way to find out, which side you are counting.
-F 0x4 returns only mapped reads (most of the mapping tools return all reads, even they didn't find a mapping position).
(all reads in the fastq file of the left mates) / (mapping left mates) = % of mapped left mates
(all reads in the fastq file of the right mates) / (mapping right mates) = % of mapped right mates
Remember: If only one mate mappes to the genome, it is called. Thus, it is important to distinguish the mates.
Well, if you allow multiple mappings of one read, samtools flagstat will count all x mappings and thus destroy your statistics! To make it correctly, you should sort and uniq your mapping IDs.
But if you do not allow multiple mappings, you are fine with flagstat. But then you loose all duplicated genes and run into problems with identical domains of some genes. That would then destroy your DiffExp analysis for some genes.
;) You actually cannot decide where the read comes from. There are some ideas about using the context to make that decision... If there are in one gene no reads in the flanking of this read and the other there are reads, you can decide to choose the latter location as source of the read. But in the end... you don't know for sure.
:) So its a bias you essentially can't avoid. And since you are comparing the same entity for differential expression, be it genes or exons, you can safely assume the same amount of bias in other replicates and/or experimental conditions as well. In essence, given the demerits, and the fact that we compare same entities, I don't see the purpose of looking at multiple mapped reads.
As long, as you are not interested in duplicated genes, or small ncRNAs, or repeat associated regions, you are fine with unique hits. It completely depends on your analysis and point of view. I think it is very important to keep them, since it gives you information... even you don't know how to handle them correctly... for now. But think about e.g. zinc finger proteins
I get your point and that at the end, it is very opinionated. However, read sizes obtained from sequencers will continue to increase and there will be less ambiguity over repetitive regions, if any, I believe (my opinion :D ). Already, for example, bowtie uses BWT for seed and extension using dynamic programming. Nevertheless, always good to know other perspectives and keep them in mind! Appreciate the conversation! :)
A very simple statistics on your reads that might be what you are looking for is samtools flagstat
#!/usr/bin/perl -w
use strict;
my $bam = shift;
my $tmp = "/tmp/tmpNameList";
use List::Util qw(sum);
`samtools view $bam | cut -f1 > $tmp`;
`wc -l $tmp`;
open NAMES, $tmp;
my %hash;
while (<NAMES>) {
chomp;
my $id = $_;
$hash{$id} = 1;
}
my @values = values %hash;
my $readsNum = sum @values;
print "File: $bam\n";
print "Sum: $readsNum\n";
Maybe you could just use the hash of perl, If the sort | uniq
command is source consuming. Treat the id as the key, sign its corresponding value as 1
. Iteratively reading the id list, the same id still get the same value in light of multiple alignment would yield many same ids. After the hash construction, simply sum up all the value would give you how many reads actually in bam.
I think samtools -c
command would yield the alignment number, which is often confused with reads number. Once you allow multiple alignment during your reads alignment, the number by samtools -c
would be greatly larger than the real reads number in a bam file.
Even the pari-end alignement file could be counted with this easy perl script, as long as the id of each pair are same, just like the output of STAR
. In other words, in the fastq, you would see abcdefg/1 for read1, abcdef/2 for read2 for example, in STAR
the /1 and /2 are erased, leaving abcdefg for the read id.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
in the output folder you should find the file align_summary.txt with this information