Hello,
I am new to RNA-seq analysis and was hoping the community could help me understand a few things about the analysis. I have 16 human samples (8 pre-treatment and 8 post-treatment), and I am trying to identify genes differentially expressed between these two groups.
I aligned my data with STAR 2.5.2a using these parameters:
STAR --runThreadN 16 --runMode alignReads --genomeDir star-genome \
--readFilesIn R1.fastq R2.fastq --outSAMtype BAM SortedByCoordinate \
--twopassMode Basic --outSAMattrIHstart 0 --outReadsUnmapped Fastx \
--quantMode GeneCounts TranscriptomeSAM --outWigType wiggle
I read that, with STAR, an alignment rate of 80-90% is expected for human data. For my data sets, 13 out of 16 samples have lower alignment rates (58-75%), and for these samples the percentage of reads mapped to multiple loci ranges from 18% to 35%. The library prep protocol was TruSeq Stranded RNA-seq with rRNA depletion.
1) a. Is the higher multi-mapping due to insufficient rRNA depletion? Below (at the end of the post) is the output for one of the samples. For this sample I checked how many reads mapped to one of the rRNA loci, chrUn_GL000220: GL000220.1 161802 479866 0. Is this number, 479866, too high? I have read on forums that some people recommend proceeding with the analysis without worrying about rRNA, while others say filtering out rRNA is a good idea. For my output below, is it okay to ignore the rRNA reads (or the 30% multi-mapping) and move on with further analysis? Why?
b. What other reasons could there be for high multi-mapping?
c. Should I adjust some parameters in the STAR command to get a better alignment?
2) When deciding the next step based on the numbers (# of input reads, % uniquely mapped, % multi-mapped), when is it acceptable to proceed with differential expression analysis (DEA)? What kind of numbers will give enough power for downstream analysis?
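As a rough way to put the GL000220.1 count from question 1 in proportion, here is a minimal sketch that divides it by the number of input reads reported in the log below. It assumes the four fields are in `samtools idxstats` order (contig name, contig length, mapped records, unmapped records); the function name `contig_share` is made up for illustration. Because the count is of alignment records rather than distinct reads, and multi-mapped alignments may be included, the result should only be read as an upper bound.

```python
def contig_share(idxstats_line, n_input_reads):
    """Return (contig name, mapped records / input reads).

    Assumes idxstats-style fields: name, length, mapped, unmapped.
    """
    name, _length, mapped, _unmapped = idxstats_line.split()
    return name, int(mapped) / n_input_reads


# Values copied from this post; 24316914 is "Number of input reads".
name, share = contig_share("GL000220.1 161802 479866 0", 24316914)
print(f"{name}: at most {share:.2%} of input reads")
```

This only covers the single contig I checked; other rRNA copies elsewhere in the genome are not counted by it.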
Number of input reads | 24316914
Average input read length | 150
UNIQUE READS:
Uniquely mapped reads number | 14992526
Uniquely mapped reads % | 61.65%
Average mapped length | 150.14
Number of splices: Total | 7431072
Number of splices: Annotated (sjdb) | 7422873
Number of splices: GT/AG | 7373882
Number of splices: GC/AG | 41453
Number of splices: AT/AC | 4380
Number of splices: Non-canonical | 11357
Mismatch rate per base, % | 0.60%
Deletion rate per base | 0.01%
Deletion average length | 1.55
Insertion rate per base | 0.00%
Insertion average length | 1.41
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7311806
% of reads mapped to multiple loci | 30.07%
Number of reads mapped to too many loci | 59278
% of reads mapped to too many loci | 0.24%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.60%
% of reads unmapped: other | 0.44%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
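For comparing these numbers across all 16 samples, a minimal sketch along the following lines could pull the percentages out of each log. It assumes the `key | value` layout shown above (as in STAR's `Log.final.out`); the helper names `parse_star_log` and `pct` are hypothetical, and the embedded text is just a subset of the lines from this post.

```python
def parse_star_log(text):
    """Parse 'key | value' lines into a dict of stripped strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def pct(value):
    """Convert a string like '61.65%' to the float 61.65."""
    return float(value.rstrip("%"))


# Subset of the log above, copied verbatim from this post.
log_text = """\
Number of input reads | 24316914
Uniquely mapped reads number | 14992526
Uniquely mapped reads % | 61.65%
% of reads mapped to multiple loci | 30.07%
% of reads mapped to too many loci | 0.24%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.60%
% of reads unmapped: other | 0.44%
"""

stats = parse_star_log(log_text)
# Every key containing '%' is one of the mutually exclusive read categories.
total = sum(pct(v) for k, v in stats.items() if "%" in k)
unique_fraction = (
    int(stats["Uniquely mapped reads number"]) / int(stats["Number of input reads"])
)
print(f"categories sum to {total:.2f}%, unique fraction {unique_fraction:.4f}")
```

To use it on real files, the `log_text` string would be replaced by the contents of each sample's log.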
Thanks!
Hello bandita.adhikari,
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!
Thank you very much! Since this is one of my first few posts, I overlooked the formatting. Apologies. I will use the formatting bar from now on.