Hello everyone,
I have run into a tricky problem while processing total stranded RNA-seq data. I used Trimmomatic to trim adapter sequences and low-quality reads, then used STAR to map the clean reads to GRCh38. However, my unique mapping rate is much lower than that of my colleague, who simply maps the raw data with STAR. He thinks the reads have already been trimmed by the Illumina system, so there is no need to trim again. All parameters are the same between us; all are defaults. This does not seem reasonable. Shouldn't trimming give a better mapping result? I tested several samples and they all give the same result. Here is the mapping result for one sample, without and with trimming.
Without trimming:
Started job on | Nov 28 13:28:58
Started mapping on | Nov 28 13:29:25
Finished on | Nov 28 14:42:17
Mapping speed, Million of reads per hour | 122.99
Number of input reads | 149364482
Average input read length | 276
UNIQUE READS:
Uniquely mapped reads number | 115685706
Uniquely mapped reads % | 77.45%
Average mapped length | 276.34
Number of splices: Total | 43804823
Number of splices: Annotated (sjdb) | 42830914
Number of splices: GT/AG | 43313281
Number of splices: GC/AG | 252366
Number of splices: AT/AC | 27822
Number of splices: Non-canonical | 211354
Mismatch rate per base, % | 0.34%
Deletion rate per base | 0.01%
Deletion average length | 1.92
Insertion rate per base | 0.01%
Insertion average length | 1.66
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 31227061
% of reads mapped to multiple loci | 20.91%
Number of reads mapped to too many loci | 55837
% of reads mapped to too many loci | 0.04%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.53%
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
With trimming:
Started job on | Nov 29 10:41:38
Started mapping on | Nov 29 10:45:48
Finished on | Nov 29 11:19:15
Mapping speed, Million of reads per hour | 258.01
Number of input reads | 143843180
Average input read length | 239
UNIQUE READS:
Uniquely mapped reads number | 73720415
Uniquely mapped reads % | 51.25%
Average mapped length | 244.17
Number of splices: Total | 24431193
Number of splices: Annotated (sjdb) | 23918159
Number of splices: GT/AG | 24164196
Number of splices: GC/AG | 117470
Number of splices: AT/AC | 13993
Number of splices: Non-canonical | 135534
Mismatch rate per base, % | 0.34%
Deletion rate per base | 0.01%
Deletion average length | 1.94
Insertion rate per base | 0.01%
Insertion average length | 1.57
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 11930748
% of reads mapped to multiple loci | 8.29%
Number of reads mapped to too many loci | 42329
% of reads mapped to too many loci | 0.03%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 40.37%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
If the mapping result of trimmed data is worse, is it necessary to do trimming?
Strictly speaking, it is not necessary to trim, since STAR should take care of any extraneous sequence by soft-clipping.
Basically, when you do counting, you have about 20% of reads you can't use without trimming, as compared to 8.2% with trimming.
Thanks genomax. Can you make it clearer whether or not I need to trim in my current situation? The % of reads mapped to multiple loci is higher without trimming, but the uniquely mapped reads % is higher too. That seems unreasonable.
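One way to see where the reads went is to tally the percentage categories from the two STAR summaries side by side. Below is a minimal Python sketch; the percentages are copied by hand from the two logs posted above, and the dictionary keys and the `category_shift` helper are just illustrative names, not anything STAR provides.

```python
# Percentages copied from the two STAR summaries posted in this thread.
untrimmed = {
    "unique": 77.45,
    "multi": 20.91,
    "too_many_loci": 0.04,
    "unmapped_mismatches": 0.00,
    "unmapped_too_short": 1.53,
    "unmapped_other": 0.07,
}
trimmed = {
    "unique": 51.25,
    "multi": 8.29,
    "too_many_loci": 0.03,
    "unmapped_mismatches": 0.00,
    "unmapped_too_short": 40.37,
    "unmapped_other": 0.06,
}


def category_shift(before, after):
    """Change in percentage points per category (after minus before)."""
    return {name: round(after[name] - before[name], 2) for name in before}


shift = category_shift(untrimmed, trimmed)
for name, delta in shift.items():
    print(f"{name:22s} {delta:+.2f}")
```

For this sample, the drop in unique (about 26 points) plus multi-mapped (about 13 points) reads shows up almost entirely as an increase of about 39 points in "unmapped: too short", so that is probably the category to investigate first.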
Do you know if your data needs to be trimmed (i.e. has some extraneous sequence)? If that is not the case you may be adding some bias.
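A quick way to check for extraneous sequence is to count how many raw reads contain the adapter. The snippet below is only a rough sketch: it assumes plain 4-line FASTQ records and uses the common Illumina TruSeq adapter prefix `AGATCGGAAGAGC` as a placeholder, so swap in the adapter from your own library prep kit. `adapter_fraction` is a made-up helper name.

```python
import io

# Assumption: common Illumina TruSeq adapter prefix; replace it with the
# sequence that matches your own library prep kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(fastq_lines, adapter=ADAPTER_PREFIX, max_reads=100_000):
    """Fraction of the first max_reads reads whose sequence contains adapter.

    Assumes 4-line FASTQ records (header, sequence, '+', qualities).
    """
    total = 0
    hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        total += 1
        if adapter in line:
            hits += 1
        if total >= max_reads:
            break
    return hits / total if total else 0.0


# Tiny in-memory example: one read with the adapter, one without.
demo = io.StringIO(
    "@r1\nACGTACGTAGATCGGAAGAGCACAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(adapter_fraction(demo))
```

For a real file you would pass an open file handle instead of `demo` (use `gzip.open(path, "rt")` for `.fastq.gz`). A fraction close to zero would suggest there is little adapter left to trim.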
Hi genomax, thanks. That's also what I thought. I think the unique mapping rate is the top priority for assessing data quality. If it is higher without trimming, then I will just skip the trimming step. If things go wrong during mapping, I will trace back and check it.