Question

How to interpret reads mapped to genome from samtools flagstat vs TopHat align_summary.txt ?

1

Entering edit mode

8.6 years ago

Vasu ▴ 790

Hi all,

I am now working on bam files of paired-end RNA-Seq samples. And I want to extract paired end reads mapped to genome.

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

Left reads:
      Input     :  41017921
       Mapped   :  39931054 (97.4% of input)
        of these:   3803778 ( 9.5%) have multiple alignments (71292 have >20)
Right reads:
      Input     :  41017921
       Mapped   :  39409647 (96.1% of input)
        of these:   3751917 ( 9.5%) have multiple alignments (71338 have >20)
96.7% overall read mapping rate.

Aligned pairs:  38811976
 of these:   3700400 ( 9.5%) have multiple alignments
             2438200 ( 6.3%) are discordant alignments
88.7% concordant pair alignment rate.

Is "Aligned pairs" is number of reads mapped to genome? Because when I did the same with "samtools" I got a different number.

samtools flagstat accepted_hits.bam

111314375 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111314375 + 0 mapped (100.00%:-nan%)
111314375 + 0 paired in sequencing
55986752 + 0 read1
55327623 + 0 read2
47776282 + 0 properly paired (42.92%:-nan%)
109061240 + 0 with itself and mate mapped
2253135 + 0 singletons (2.02%:-nan%)
15869532 + 0 with mate mapped to a different chr
2586750 + 0 with mate mapped to a different chr (mapQ>=5)

Please tell me Which is the right one and also give some suggestions how to extract reads mapped to rRNA regions.

Thank you!

RNA-Seq • 4.5k views

ADD COMMENT • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

That is what should be in the accepted_hits.bam file that TopHat generated. Do you want to extract just those reads from the original raw data file? Do you expect reads just from rRNA and nothing else to be present?

BTW: There is some disconnect between the title of your post and what you are asking in the body of the post.

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

Sorry I had updated now.

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

You could take the accepted_hits.bam files and use samtools fastq to generate the fastq files.

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

I have fastq files....Please check my question. The summary of tophat shows alignedpairs (mapped reads) and with the bame file I used samtools to check mapped reads. Both number are different. Which is the right one?

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

1

Entering edit mode

See this thread for additional explanation: http://seqanswers.com/forums/showthread.php?t=44433

Also this: How To Interpret/Reconcile The Counts Obtained Via Samtools Flagstat

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

Thank you very much !!

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

Could you please tell me how to count the reads mapped to rRNA regions?

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

This will depend on the species. Some reference genomes (e.g., that from mouse) lack rRNA sequences. Others, such as the fruit fly, have a separate contig just for rRNA.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

This is aligned against human genome.

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

For human samples I separately align against Rn45S. There's sequence in the GRCh38 reference, but I think there are multiple copies or various integrity, so quantification is difficult otherwise.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

As @Devon said, if you have a GTF file with annotation for rRNA you could use that to count. Otherwise you may need to align separately to rRNA repeat sequence for your genome.
If these samples were ribo-depleted hopefully you don't have many reads that will fall into this category.

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

Yes the samples are rRNA depleted but still there may be some present. So, how can I count the reads aligned to rRNA regions?

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

If you wish to be absolutely sure then align the reads to human rDNA repeat.

If you had aligned your data using Ensembl version of human genome then that may have rRNA annotated in the GTF file. In that case you can use featureCounts (and that GTF file) to count the reads aligning to rRNA. UCSC version of the human genome does not have rRNA annotated.

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

yes I have GTF file and also bam file. what should be the parameters I need to give to use featurecounts? Or Can I use bedtools?

ADD REPLY • link 8.6 years ago by Vasu ▴ 790

0

Entering edit mode

Here is a link for how to use featureCounts. Remember this will only work if your GTF file has entries for rRNA. Otherwise you will get read counts for other genes.

ADD REPLY • link 8.6 years ago by GenoMax 147k