Hi,
I am new to transcriptome work - I have 150 bp PE data from two strains, 3 biological reps of each from a HiSeq4000 run. We are planning on doing genome-guided assembly and ultimately DGE analysis.
My data passed in all the FastQC analyses except for failures (red) in ‘per base sequence content’, ‘sequence duplication levels’, & ‘kmer content’ – all of which I have read are not applicable to RNA Seq data. Some level of adapter content is present (yellow coding). The only difference that comes up between samples is related to overrepresented sequences in some samples, & I’ve blasted them against the genome, & they are, indeed, found in the genome.
1) The more I’ve been reading about interpreting FastQC reports, I’m thinking I may not need (much/any?) quality trimming based on the raw sequence FastQC reports. I think the quality is good (for the analyses applicable to RNA Seq data) & that I don’t need quality trimming. Would you agree? I've read that it is better to keep quality trimming to a minimum if possible.
2) I do have adapters present in all reads, which means some of the library inserts are small, so I will need to trim to remove adapters. Seems like I’ve seen papers where they say this isn’t necessary? Or maybe it depends on the downstream analyses?
3) Last question – how to decide on a minimum length cut-off. Clearly some of the inserts are small. I am not quite sure how to decide minlen. What is appropriate for 150 bp PE data? In looking at the QC reports, I don’t think anything there helps me decide. Is this true? Is there a commonly accepted minlen used for 150 bp reads?
Thanks!
Quality-based trimming should not be needed for most recent data. Library kits and prep methods are now mature enough.
If your data has some adapter contamination, most aligners will handle it by soft clipping the adapters when they align the data. If you need to do any de novo assembly work, you should scan/trim your data first.
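If you want to check how much an aligner actually soft clipped, one option is to look at the CIGAR strings in the alignments. Here is a minimal Python sketch, assuming you have already pulled the CIGAR field out of your SAM/BAM records; the function name `soft_clipped_bases` is made up for illustration and is not part of any aligner or library.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_bases(cigar):
    """Return (left, right) soft-clipped base counts for one CIGAR string.

    Hypothetical helper for illustration. Hard clips (H) are ignored so that
    soft clips sitting just inside them are still detected.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)  # e.g. "*" for an unmapped read
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)

# A 150 bp read where the last 30 bases (e.g. adapter read-through) were clipped.
print(soft_clipped_bases("120M30S"))
```

Lots of reads with sizeable clips at the 3' end would be consistent with adapter read-through, although soft clipping can also come from other causes such as low-quality ends or unannotated splice junctions.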
Ideally you should not have inserts smaller than 150 bp in standard RNA-seq libraries, but if you do, you can decide what minimum length you want to keep (40-50 bp is reasonable). Remember that shorter reads are less likely to align uniquely, and reads that multi-map are usually not counted in downstream processing.
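To make the minimum-length idea concrete, here is a toy Python sketch of adapter trimming followed by a length filter on read pairs. It assumes exact matching against the first 12 bases of the common Illumina adapter (`AGATCGGAAGAG`) and a cut-off of 50 bp; the function names are hypothetical. Real trimmers also handle mismatches and partial adapter matches at the read end, so treat this as an illustration of the logic rather than something to run on real data.

```python
ADAPTER = "AGATCGGAAGAG"  # assumption: common Illumina adapter prefix

def trim_adapter(seq, qual, adapter=ADAPTER):
    """Cut the read at the first exact adapter match (toy logic, no mismatches)."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]

def filter_pair(r1, r2, minlen=50):
    """Trim both mates; drop the whole pair if either mate ends up shorter than minlen.

    r1 and r2 are (sequence, quality) tuples. Returns the trimmed pair or None.
    """
    t1 = trim_adapter(*r1)
    t2 = trim_adapter(*r2)
    if len(t1[0]) < minlen or len(t2[0]) < minlen:
        return None
    return t1, t2

# Mate 1 keeps 60 bases after trimming, mate 2 only 30, so the pair is dropped.
mate1 = ("A" * 60 + ADAPTER, "I" * 72)
mate2 = ("C" * 30 + ADAPTER, "I" * 42)
print(filter_pair(mate1, mate2))
```

Dropping the pair as a whole keeps the two FASTQ files in sync, which paired-end aligners expect.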
Aligners will soft clip adapters while mapping, but maybe they will mess up assembly?
I did say in my comment above
Adapters are present in all reads? All of your sequences are so short they read through to the other side?
No, not all sequences. Assuming the Y axis label on the adapter content graph in FastQC is % of sequences, it peaks at around 10% of sequences with adapters.
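For what it's worth, the FastQC adapter content plot is a cumulative percentage: at each position it shows the share of reads in which the adapter has been seen at or before that position. A rough Python sketch of that calculation is below, assuming exact matching against `AGATCGGAAGAG` and reads already loaded as plain strings; the function name is hypothetical and this is not FastQC's actual code.

```python
def cumulative_adapter_percent(reads, adapter="AGATCGGAAGAG"):
    """For each read position, return the % of reads whose first adapter
    match starts at or before that position (cumulative curve)."""
    if not reads:
        return []
    max_len = max(len(r) for r in reads)
    first_seen = [0] * max_len
    for read in reads:
        idx = read.find(adapter)
        if idx != -1:
            first_seen[idx] += 1
    curve = []
    running = 0
    for count in first_seen:
        running += count
        curve.append(100.0 * running / len(reads))
    return curve

# Four toy 16 bp reads, two of which contain the adapter.
adapter = "AGATCGGAAGAG"
reads = ["AAAA" + adapter, "CC" + adapter + "GG", "T" * 16, "G" * 16]
print(cumulative_adapter_percent(reads)[-1])
```

Because the curve is cumulative, its value at the last position is the overall fraction of reads containing the adapter, so a plot topping out near 10% means roughly 10% of reads have adapter somewhere in them.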