Hi all,
I am analyzing for the first time small RNA-Seq data. The goal, at the end, is to get counts of miRNA, but also other non coding RNA-Seq would be interesting.
Reads are from human, paired-end. I have in total almost 50 samples.
I have some doubts, regarding percentage of reads mapped (between 25% and 45% per sample) and percentage of reads assigned by using featureCounts (very low, around 5-10%).
I will tell you step by step how I performed the analysis, hoping some of you can tell me if there is something wrong. I really think some mistakes could have been done, so I am in "listening" mode :D
Also, on the bottom, I will copy the output of a STAR log and of a featureCounts log.
THANKS A LOT FOR ANY FEEDBACK.
Sergio
FIRST STEP: trimming by using Trim Galore 0.6.6 trim_galore --output /path/trimmed_reads --fastqc --paired --length 18 --small_rna --cores 6 reads_1.fq.gz reads_2.fq.gz
(default adapters used by Trim Galore are fine. I checked and it's exactly what they used for the experiment) Results will be at this point trimmed_reads_1.fq.gz and trimmed_reads_2.fq.gz
SECOND STEP: alignment by using STAR 2.7.5a
STAR --genomeDir /genome_reference/GRCh38/STAR_2.7_indices --readFilesIn trimmed_reads_1.fq trimmed_reads_2.fq --outFileNamePrefix /path_to_output_folder/aligned_reads --readFilesCommand zcat --outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1 --outSAMtype BAM SortedByCoordinate --runThreadN 6 Results will be aligned_reads.bam
THIRD STEP, part a: counting by featureCount 1.6.0 to detect miRNA
featureCounts -t miRNA -p -g 'Name' -a /path/to/miRbase/v22.1/hsa.gtf -o /path/to/output/folder/miRNA_counts.txt aligned_reads_miRNA.bam
THIRD STEP, part b: counting by featureCount 1.6.0 to detect all small noncoding RNA
featureCounts -t miRNA -p -a /path/to/gencode.v36.annotation_hg38_where_I_selected_only_small_RNA.gtf -o /path/to/output/folder/miRNA_counts.txt aligned_reads_ncRNA.bam
HERE THE STAR LOG OF A SAMPLE AND THE featureCounts Log for cases a and b, of the same sample:
star log: Started job on | Jul 07 13:22:08 Started mapping on | Jul 07 13:26:08 Finished on | Jul 07 13:37:19 Mapping speed, Million of reads per hour | 157.02
Number of input reads | 29267386
Average input read length | 58
UNIQUE READS:
Uniquely mapped reads number | 8418899
Uniquely mapped reads % | 28.77%
Average mapped length | 49.10
Number of splices: Total | 75064
Number of splices: Annotated (sjdb) | 75064
Number of splices: GT/AG | 74264
Number of splices: GC/AG | 535
Number of splices: AT/AC | 73
Number of splices: Non-canonical | 192
Mismatch rate per base, % | 0.27%
Deletion rate per base | 0.00%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.11
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 13873946
% of reads mapped to multiple loci | 47.40%
Number of reads mapped to too many loci | 6847549
% of reads mapped to too many loci | 23.40%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 41163
% of reads unmapped: too many mismatches | 0.14%
Number of reads unmapped: too short | 1456
% of reads unmapped: too short | 0.00%
Number of reads unmapped: other | 84373
% of reads unmapped: other | 0.29%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
featureCounts log case a:
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file /path ... ||
|| Features : 2883 ||
|| Meta-features : 2652 ||
|| Chromosomes/contigs : 24 ||
|| ||
|| Process BAM file /path. ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Total fragments : 96167215 ||
|| Successfully assigned fragments : 6053052 (6.3%) ||
|| Running time : 9.35 minutes ||
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
featureCounts log case b:
-----------------------------------------------------------------------------
|| Threads : 1 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Strand specific : no ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Load annotation file /path/ ... ||
|| Features : 7506 ||
|| Meta-features : 7506 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file /path/... ||
|| Paired-end reads are included. ||
|| Assign fragments (read pairs) to features... ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Read re-ordering is performed. ||
|| ||
|| Total fragments : 96167215 ||
|| Successfully assigned fragments : 6758205 (7.0%) ||
|| Running time : 8.82 minutes
Hi Sergio, Concerning the low unique mapping rate, if you look carefully, you will see that almost all your reads are mapped (>99%), but many of them map to multiple loci. It happens a lot with tRNAs fragments and other small RNA species so it is not unexpected in a smallRNA-seq experiment.
Now concerning the output of featureCounts, you are discaring multimapping reads from the start (leaving you with 28% of total reads that are uniquely mapped to work with), then you only count them on miRNAs, ending up with about 7% reads assigned. That means that about 1/4 of your uniquely mapped smallRNA-seq reads map on miRNA, which I think is not bad since there is a lot more than miRNAs in a small RNA fraction.
Of course, more quality control is needed before affirming that this data is fine (for instance, look in a genome browser if you see the expected expression patterns), but I don't see a big problem just from the mapping or read assignment rate.
Hi Carlo, thank you very much for your clear reply!