Entering edit mode
8.6 years ago
Vasu
▴
790
Hi all,
I am now working on bam files of paired-end RNA-Seq samples. And I want to extract paired end reads mapped to genome.
Basically, from the summary of tophat below, I want to get the mapped pairs reads.
Left reads:
Input : 41017921
Mapped : 39931054 (97.4% of input)
of these: 3803778 ( 9.5%) have multiple alignments (71292 have >20)
Right reads:
Input : 41017921
Mapped : 39409647 (96.1% of input)
of these: 3751917 ( 9.5%) have multiple alignments (71338 have >20)
96.7% overall read mapping rate.
Aligned pairs: 38811976
of these: 3700400 ( 9.5%) have multiple alignments
2438200 ( 6.3%) are discordant alignments
88.7% concordant pair alignment rate.
Is "Aligned pairs" is number of reads mapped to genome? Because when I did the same with "samtools" I got a different number.
samtools flagstat accepted_hits.bam
111314375 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111314375 + 0 mapped (100.00%:-nan%)
111314375 + 0 paired in sequencing
55986752 + 0 read1
55327623 + 0 read2
47776282 + 0 properly paired (42.92%:-nan%)
109061240 + 0 with itself and mate mapped
2253135 + 0 singletons (2.02%:-nan%)
15869532 + 0 with mate mapped to a different chr
2586750 + 0 with mate mapped to a different chr (mapQ>=5)
Please tell me Which is the right one and also give some suggestions how to extract reads mapped to rRNA regions.
Thank you!
That is what should be in the accepted_hits.bam file that TopHat generated. Do you want to extract just those reads from the original raw data file? Do you expect reads just from rRNA and nothing else to be present?
BTW: There is some disconnect between the title of your post and what you are asking in the body of the post.
Sorry I had updated now.
You could take the accepted_hits.bam files and use
samtools fastq
to generate the fastq files.I have fastq files....Please check my question. The summary of tophat shows alignedpairs (mapped reads) and with the bame file I used samtools to check mapped reads. Both number are different. Which is the right one?
See this thread for additional explanation: http://seqanswers.com/forums/showthread.php?t=44433
Also this: How To Interpret/Reconcile The Counts Obtained Via Samtools Flagstat
Thank you very much !!
Could you please tell me how to count the reads mapped to rRNA regions?
This will depend on the species. Some reference genomes (e.g., that from mouse) lack rRNA sequences. Others, such as the fruit fly, have a separate contig just for rRNA.
This is aligned against human genome.
For human samples I separately align against Rn45S. There's sequence in the GRCh38 reference, but I think there are multiple copies or various integrity, so quantification is difficult otherwise.
As @Devon said, if you have a GTF file with annotation for rRNA you could use that to count. Otherwise you may need to align separately to rRNA repeat sequence for your genome.
If these samples were ribo-depleted hopefully you don't have many reads that will fall into this category.
Yes the samples are rRNA depleted but still there may be some present. So, how can I count the reads aligned to rRNA regions?
If you wish to be absolutely sure then align the reads to human rDNA repeat.
If you had aligned your data using Ensembl version of human genome then that may have rRNA annotated in the GTF file. In that case you can use featureCounts (and that GTF file) to count the reads aligning to rRNA. UCSC version of the human genome does not have rRNA annotated.
yes I have GTF file and also bam file. what should be the parameters I need to give to use featurecounts? Or Can I use bedtools?
Here is a link for how to use featureCounts. Remember this will only work if your GTF file has entries for rRNA. Otherwise you will get read counts for other genes.