Why I see very high mapping percentage, Is this normal or anything wrong in generating libraries?
0
0
Entering edit mode
6.6 years ago
Vasu ▴ 790

Hi,

With Rna-seq data I have used hisat2 for aligning reads to genome. To check the mapping percentage I used samtools flagstat sample.sorted.bam

When I looked at few samples I see big number of reads passed and mapped.

Samples           Total_reads              Mapped_reads
Sample1         1076.13 M reads     1035.27 M reads
Sample2          1273.74 M reads     1175.18 M reads
Sample3            768.89 M reads        760.35 M reads
Sample4             105.84 M reads        101.81 M reads
Sample5             506.19 M reads         496.32 M reads

1) Is that big number reads mapped for few samples common? Why I see very high mapping percentage? Libraries were generated with Ribosomal rna depletion kit.

2) For few samples I see almost 45k genes among 53k showing 0 read counts. Don't know why.

3) I would also like to check reads mapped to ribosomal genes or ribosomal RNA's. Map I know how to check that?

RNA-Seq hisat2 samtools mapping • 2.4k views
ADD COMMENT
1
Entering edit mode
  1. There is no norm for a certain % alignment. One would want to have as high a % as possible but at the same time you don't want rRNA/DNA contamination/PCR overamplification.
  2. Have you tried to see what the distribution is counts is for the rest of 8K reads? Inspect your alignments in IGV to see if you are bale to see any strange patterns.
  3. You can get the rDNA repeat sequence (e.g. for human) and align your data to it (if your genome does not include those sequences).
ADD REPLY
0
Entering edit mode

When I looked at counts_matrix using featureCounts, I see around 540 rRNA genes based on gene_type among 58k genes. All these rRNA genes have 0,1,10 counts for all the samples. As I see rRNA genes, does it mean some contamination happened while generating libraries?

ADD REPLY
0
Entering edit mode

Ribodepletion may be imperfect at times and may not remove entire rRNA present. Depending on aligner you used (and how you chose to handle the multi-mappers) you may have more or less counts for rRNA. You could of course omit considering those counts in downstream analysis but not having any counts associated with 45K genes (out of 53K) sounds a bit odd.

What genome are these samples from? How many reads did you have per sample to begin with?

ADD REPLY
0
Entering edit mode

It is human genome. 45k genes among 58K showed 0 counts only for couple of samples. For the rest of samples I see atleast 15-20k genes with 0 counts. Total reads per sample is given above in the question.

ADD REPLY
0
Entering edit mode

Ah I see. The reads are extremely unbalanced. A billion plus for two samples and only a tenth for another. Still not having any counts for 30-40% of genes seems a rather high number to me.

Have you examined the alignments to see if they look reasonable? If you have some idea about the experiment do basic visual QC checks out (e.g. a deletion has no counts associated with it, up-regulation where you expect it, no pileups outside of exons etc).

ADD REPLY
0
Entering edit mode

Yes I did QC check. Also mapping quality after alignment. Mean quality score, per seq quality score shows positive results. For per base seq content I saw a bias towards 3' end and so I trimmed 3 bases on the 3' end during alignment. For Per seq GC content all the samples show a failure results. There are spikes instead of normal distribution for all the samples. No, Adapter content.

What I should do now?

ADD REPLY
0
Entering edit mode

Without an understanding of what this experiment is about it may be hard for anyone here to offer constructive suggestions.

BTW; I was referring to post-alignment QC checks. You are mostly referring to the pre-alignment QC.

ADD REPLY
0
Entering edit mode

This experiment is mainly to get lncRNAs and to study them. For post alignment QC check what to do?

ADD REPLY
0
Entering edit mode

Use Qualimap on your BAM files.

ADD REPLY
0
Entering edit mode

Yes, I did rseqc (mapping statistics) for one of the sample.

Total records:                          444421196

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        246823532
Unmapped reads:                         9448726
mapq < mapq_cut (non-unique):           49838905

mapq >= mapq_cut (unique):              138310033
Read-1:                                 69300754
Read-2:                                 69009279
Reads map to '+':                       69157969
Reads map to '-':                       69152064
Non-splice reads:                       107295331
Splice reads:                           31014702
Reads mapped in proper pairs:           125282854
Proper-paired reads map to different chrom:0
ADD REPLY
0
Entering edit mode

~50% of reads are multi-mapping in this example. Are you counting lncRNA's or trying to discover new ones?

ADD REPLY
0
Entering edit mode

counting lncRnas and pc genes. will do downstream analysis. Do u think it is fine from the above statistics? Also would like to discover new lncRNAS.

ADD REPLY
0
Entering edit mode

Hi, I'm using qualimap bamqc with this command qualimap bamqc -bam *.sorted.bam -p strand-specific-reverse -gff annotation.gtf -outformat HTML --java-mem-size=4G.

For few samples it gave the output. But suddenly for one of the sample this is what I see

Selected tool: bamqc Available memory (Mb): 32 Max memory (Mb): 3817 Starting bam qc.... Loading sam header... Loading locator... Loading reference... Number of windows: 400, effective number of windows: 593 Chunk of reads size: 1000 Number of threads: 20 Initializing regions from annotation.gtf..... Found 2617914 regions Filling region references... Processed 50 out of 593 windows... Processed 100 out of 593 windows... Processed 150 out of 593 windows... Processed 200 out of 593 windows...

WARNING: out of memory! Qualimap allows to set RAM size using special argument: --java-mem-size Check more details using --help command or read the manual.

What I should do now? Could you please also answer to my previous comment.

ADD REPLY
0
Entering edit mode

Hi could you please answer to my above comments

ADD REPLY
0
Entering edit mode

2.What are those genes without read counts ? Sometimes very small ncRNA such as miRNAs and pirRNAs are in the annotation but not well captured in total RNA-seq. So for those classes of RNA, it should be ok to have 0 read counts. On the other hand, if you have 0 read count for most protein coding genes, then something went wrong (probably).

ADD REPLY
0
Entering edit mode

There are very less protein coding genes with 0 counts among all the samples. Most of the genes with 0 counts are rRNA, snRNA etc...

ADD REPLY

Login before adding your answer.

Traffic: 3010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6