After downloading my PI's RNA-sequencing data, I ran FastQC on it and got mediocre results. I took a subset of my data and trimmed it using trim_galore, then ran FastQC on the trimmed data (parallelized with GNU parallel) via:
cd /Users/patrick/Desktop/sampleRNAseqdata
parallel trim_galore --paired --fastqc -o trim_galore/ {} {=s/_R1_/_R2_/=} ::: *_R1_001.fastq.gz
After trimming the data, however, the FastQC reports were even worse than those of the untrimmed data. Have I done something wrong, or is this to be expected? I've attached an image showing the original FastQC reports on the left and then the trimmed files' FastQC reports on the right. Assuming there's no fix for this, would there be any reason for me to trim the other files in the database? Or should I just begin aligning the reads to a reference genome?
Those indicators are decided by limits set in a file (I think it is called limits.txt or similar). You can change those limits. That said, those limits are for plain genomic sequencing. If your experiment is anything other than that, then invariably some item on that list will fail. That result has to be taken in the context of the experiment you are doing.
It is hard to see anything useful in those shots, so you will need to post larger versions of the things you are concerned about and tell us what this sequencing is for. If you have not seen these useful blog posts by the authors of FastQC, take some time to browse them.
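As a rough sketch of what changing those limits could look like: the helper below (set_limit is a made-up name, not part of FastQC) prints a copy of a limits file with one threshold changed. It assumes the file uses whitespace-separated "module level value" lines such as "duplication warn 70", that the copy is passed back with FastQC's --limits option, and that the paths shown are placeholders for wherever your install keeps its configuration.

```shell
# set_limit FILE MODULE LEVEL VALUE
# Hypothetical helper: print FILE with the threshold for MODULE/LEVEL
# replaced by VALUE. Comment lines and all other entries pass through.
set_limit() {
  awk -v m="$2" -v l="$3" -v v="$4" '
    $1 == m && $2 == l { print m "\t" l "\t" v; next }
    { print }
  ' "$1"
}

# Example usage (placeholder paths, adjust to your install):
#   set_limit /path/to/FastQC/Configuration/limits.txt duplication warn 50 > my_limits.txt
#   fastqc --limits my_limits.txt sample_R1_001.fastq.gz
```

Relaxing a threshold only changes the pass/warn/fail icon, not the underlying data, so it is mainly useful for silencing flags you already know are expected for RNA-seq.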
Thanks for the reply, Geno. Since there's not much useful information in the screenshots I provided (my bad!), I've uploaded the full FastQC reports at this link if you'd be willing to look them over. And I will definitely have to check out some of those blog posts; thanks 🙂 In terms of the purpose of this data, it's RNA-seq data from one cohort that went through an experience and has some symptoms and another cohort that went through that experience and doesn't have symptoms. (I'm being purposefully vague just in case my PI has some reason to keep this experiment secret or something.) So the goal is to check for gene-expression biomarkers that could predict who will develop symptoms.
What are those overrepresented sequences?
It seemed that the overrepresented sequences in the trimmed files were the same as those in the originals. Those sequences' absolute counts dropped, but their relative representation seemed to rise as a result of trimming. If you'd be willing to look them over, here's a Google Drive link to the FastQC reports of the subset of data that I trimmed as well as their untrimmed counterparts. I didn't notice any red flags when comparing the QC reports, but I'm very new to all of this. But putting the overrepresented sequences aside, my question is: why did none of these metrics improve? Shouldn't some of these indicators improve in at least a few of the files? Would it be conceivable that the lab that sequenced this data had already removed the lowest-quality reads (maybe the bottom 10%)? And that my trimming the data a second time could have removed some acceptable reads? It sounds unlikely to me personally, but I can't think of anything else. Any thoughts are appreciated 😬🙂
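To illustrate how a sequence's absolute count can fall while its share of the library rises, here is a small arithmetic sketch. The numbers are invented for illustration only (they are not taken from the reports in this thread), and pct is a hypothetical helper name.

```shell
# pct COUNT TOTAL: percentage of TOTAL reads that carry the sequence
pct() { awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * c / t }'; }

# Made-up example: before trimming, 5000 of 1000000 reads match
pct 5000 1000000

# After trimming, some reads were discarded: 4800 of 800000 reads match
pct 4800 800000
```

The count drops from 5000 to 4800, yet the percentage goes up because the total number of reads shrank proportionally more than the matching reads did.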
I don't know about trim_galore specifically, but read trimmers generally trim very low-quality bases from the ends of reads, rather than removing whole reads that have a medium-poor average quality, although most trimmers can be configured to do this as well.
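A minimal sketch of that idea, not trim_galore's actual algorithm: the hypothetical function below (trim_3prime is a made-up name) walks back from the 3' end of a read and drops bases until it hits one at or above a quality cutoff. It assumes Phred+33 encoded qualities; real trimmers typically use a more robust running-sum variant and also handle adapters.

```shell
# trim_3prime SEQ QUAL CUTOFF
# Illustrative only: remove trailing bases whose Phred+33 quality is
# below CUTOFF, then print the trimmed sequence and qualities (tab-separated).
trim_3prime() {
  printf '%s\n%s\n' "$1" "$2" | awk -v cutoff="$3" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR == 1 { seq = $0 }
    NR == 2 {
      len = length($0)
      # Step back from the 3-prime end while the base quality is too low
      while (len > 0 && ord[substr($0, len, 1)] - 33 < cutoff) len--
      printf "%s\t%s\n", substr(seq, 1, len), substr($0, 1, len)
    }'
}

# Last three bases have quality "#" (Phred 2), so they are trimmed
trim_3prime ACGTACGT 'IIIII###' 20

# Uniformly mediocre read (Phred 20 throughout) is kept whole
trim_3prime ACGTACGT '55555555' 20
```

Note that the second read survives untouched: end trimming shortens reads but does not throw out reads for having a mediocre average, which is one reason per-file summary metrics may barely move after trimming.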