Hi, I am new to bioinformatics and would love some help, please. We did bulk paired-end RNA-seq with Rattus norvegicus muscle tissue (48 files, N=24). An omics centre did the library prep and the sequencing for us, using MGI Tech. I ran FastQC and BLASTed the overrepresented sequences. I have read a lot on forums and in the FastQC resources, and I know not to take these results too seriously, but I would still like guidance and want to ensure I understand correctly before I proceed to STAR alignment and DE analysis. Most of the results seem good, but a few things caught my eye:
1) About 10 samples (N=10/24; 20/48 files) have overrepresented sequences that match R. norvegicus mitoRNA or mRNA. There is always a small secondary peak on the per-sequence GC content graph, plus a warning (Fig. 1). Qs: 1a) Is this of any concern? 1b) I assume these indicate highly expressed genes and that I should just ignore the warnings?
2) The two files for one sample (N=1/24) show a huge secondary peak (Fig. 2), and the ~25 overrepresented sequences all match rRNA from R. norvegicus (the files indicate ~32722582 seqs total and the overrepresented seqs make up ~5%). I am truly not sure what to do here, as this only happened in one sample. Qs: 2a) Why is this occurring in only one sample? 2b) How should I proceed with this sample?
3) A few files have GC content graphs similar to the above, but the overrepresented sequence(s) match either nothing ("No significant similarity found"), a random plant that is not part of the rats' diet, a mould, or a random rodent/mammal (e.g., Abelia forrestii, Paradiachea cylindrica, elephant, etc.; Fig. 3). Qs: 3a) Are the sequence(s) that have no match novel transcripts? 3b) I assume that aligning without trimming should be fine, as these non-Rattus seqs won't be mapped?
4) The two files for one sample (N=1/24) have overrepresented sequences belonging to E. coli and to R. norvegicus mitoRNA or mRNA (Fig. 4). Qs: 4a) As above, I assume that aligning without trimming should be fine, as these bacterial seqs won't be mapped? 4b) As above, I assume that the non-E. coli overrepresented sequences are just highly expressed genes?
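To put rough numbers on points 1) to 4), the overrepresented-sequence table can be read from the fastqc_data.txt file that FastQC writes alongside each report. The following is only a minimal Python sketch: the function names are hypothetical, and it assumes the module is laid out as tab-separated Sequence, Count, Percentage and Possible Source columns between a ">>Overrepresented sequences" line and ">>END_MODULE".

```python
def overrepresented_summary(fastqc_data_text):
    """Collect rows of the 'Overrepresented sequences' module and sum their percentages.

    Assumes tab-separated columns: Sequence, Count, Percentage, Possible Source.
    Returns (rows, total_percentage).
    """
    rows = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if not in_module:
            continue  # still inside some other module
        if line.startswith(">>END_MODULE"):
            break  # end of the module we care about
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a data row in the assumed layout
        source = fields[3] if len(fields) > 3 else ""
        rows.append((fields[0], int(fields[1]), float(fields[2]), source))
    total_pct = sum(pct for _, _, pct, _ in rows)
    return rows, total_pct


def approx_reads(total_sequences, percentage):
    """Convert a percentage of the library into an approximate read count."""
    return round(total_sequences * percentage / 100)
```

With the figures quoted in point 2), approx_reads(32722582, 5) comes to roughly 1.6 million reads, which gives a feel for how much of that library the rRNA-like sequences take up.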
I pseudoaligned with Salmon and ran DESeq2 without trimming to quickly assess the data. Just for info, the PCA plot looks really good, with distinct clusters for my groups.
Conclusion: I should be fine to not perform trimming prior to STAR alignment?
*I have attached some screenshots; I hope they show up once posted.
Thanks a lot in advance!
Thanks for the reply and the help! I ran SortMeRNA then re-ran FastQC. Here's what both GC content graphs look like now. So I guess rRNA was the culprit.
I aligned non_rRNA_reads_R1.fq and _R2.fq with STAR and the alignment rate is very low: 1.88%, whereas before SortMeRNA it was 70% for that sample (all other samples are >85%). I am confused. Would you know why this happens? I made sure to align the non_rRNA_reads and not the aligned_reads from SortMeRNA.
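One thing that could be checked here is whether the two filtered files still list the same reads in the same order, since a paired-end aligner expects record N of R1 to be the mate of record N of R2. This is only a guess at the cause, not a confirmed diagnosis. A minimal Python sketch of such a check follows; the function names are hypothetical, and it assumes plain 4-line FASTQ records with no blank lines and read names that may end in /1 or /2.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_ids(handle):
    """Yield the read name of each 4-line FASTQ record.

    Assumes well-formed 4-line records; drops the leading '@', any
    description after whitespace, and a trailing /1 or /2 mate suffix.
    """
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue  # only the first line of each record holds the name
        parts = line.strip().split()
        name = parts[0][1:] if parts else ""
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]
        yield name


def check_pairing(r1_path, r2_path, max_report=5):
    """Compare read names position by position between R1 and R2."""
    n1 = n2 = mismatched = 0
    examples = []
    with open_fastq(r1_path) as h1, open_fastq(r2_path) as h2:
        for a, b in zip_longest(read_ids(h1), read_ids(h2)):
            if a is not None:
                n1 += 1
            if b is not None:
                n2 += 1
            if a != b:  # different names, or one file ran out of reads
                mismatched += 1
                if len(examples) < max_report:
                    examples.append((a, b))
    return {
        "r1_reads": n1,
        "r2_reads": n2,
        "mismatched": mismatched,
        "first_mismatches": examples,
    }
```

Calling check_pairing on non_rRNA_reads_R1.fq and its R2 counterpart would return the read count of each file and the number of positions where the names disagree. Unequal counts or many mismatches would suggest the pairs got out of sync during filtering, while zero mismatches would point to a different cause.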