I have a set of 100 samples (50 tumor and 50 matched normal) from which I'd ultimately like to annotate variants. But I am having some difficulty automating the initial pre-processing quality-control step.
I am using FastQC to make sure sequence content, sequence quality, sequence representation (no over-representation of adapters), and k-mer content are all adequate for alignment. The samples will obviously vary in quality (e.g., different over-represented adapters), so I am wondering: can this step be automated for each unique file?
EDIT: samples are all RNA-seq
Short answer: most probably yes. Long answer: with a bit of detail on how you're processing each file, we can work on automating the process.
Thank you for your reply. As far as pre-alignment quality control goes, I manually load my FASTQ files into FastQC to make sure the average sequence quality score is at least 25. I would like to trim any overrepresented sequences (usually from adapters) and long mononucleotide repeats (length threshold not yet defined). Because these are RNA-seq libraries, the wonky per-base sequence content at the 5' end (from random-hexamer priming) will be tolerable.
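For example, the batch-QC part could look roughly like the sketch below. This is a minimal sketch, not a finished pipeline: it assumes `fastqc` is on your PATH, the directory names (`fastq/`, `fastqc_out/`) are placeholders, and "average quality" is read off the Per sequence quality scores histogram in each report.

```python
#!/usr/bin/env python3
"""Batch-run FastQC and flag files whose mean quality falls below a cutoff."""
import subprocess
import zipfile
from pathlib import Path

FASTQ_DIR = Path("fastq")      # assumption: where the .fastq.gz files live
OUT_DIR = Path("fastqc_out")   # assumption: where FastQC reports should go
MIN_MEAN_Q = 25                # the cutoff you mentioned

OUT_DIR.mkdir(exist_ok=True)
fastqs = sorted(FASTQ_DIR.glob("*.fastq.gz"))

# FastQC accepts multiple inputs, so one call covers all 100 files.
subprocess.run(["fastqc", "-o", str(OUT_DIR), *map(str, fastqs)], check=True)

for fq in fastqs:
    # For sample.fastq.gz, FastQC writes sample_fastqc.zip into OUT_DIR.
    zip_path = OUT_DIR / (fq.name.replace(".fastq.gz", "") + "_fastqc.zip")
    with zipfile.ZipFile(zip_path) as zf:
        data_name = next(n for n in zf.namelist() if n.endswith("fastqc_data.txt"))
        lines = zf.read(data_name).decode().splitlines()

    # Take the weighted mean of the 'Per sequence quality scores' histogram.
    in_module, total, weighted = False, 0.0, 0.0
    for line in lines:
        if line.startswith(">>Per sequence quality scores"):
            in_module = True
        elif in_module and line.startswith(">>END_MODULE"):
            break
        elif in_module and not line.startswith("#"):
            q, count = line.split("\t")
            total += float(count)
            weighted += float(q) * float(count)

    mean_q = weighted / total if total else 0.0
    status = "OK" if mean_q >= MIN_MEAN_Q else "FLAG"
    print(f"{fq.name}\tmean_q={mean_q:.1f}\t{status}")
```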
I do not know whether I should remove duplicates (which I assume would also clean up the k-mer content module) before or after alignment.
I also wonder whether there is a file of all known Illumina adapters that I can feed into the pipeline. Or better, can I pull the overrepresented sequences from each FASTQ file's report (these will most likely be adapters or PCR duplicates) to tailor the trimming process to each file?
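For the known-adapter list, Trimmomatic ships adapter FASTA files (e.g. TruSeq3-PE.fa) that can be fed to most trimmers. For the per-file idea, something like the sketch below could work. It assumes `cutadapt` is installed, reuses the FastQC zips from the sketch above, and naively treats every overrepresented hit as a 3' adapter; in practice you would likely keep only hits whose "Possible Source" column names an adapter.

```python
#!/usr/bin/env python3
"""Per-file adapter trimming driven by each FastQC report."""
import subprocess
import zipfile
from pathlib import Path

FASTQ_DIR = Path("fastq")        # assumption: input reads
FASTQC_DIR = Path("fastqc_out")  # assumption: FastQC zips from the previous step
TRIM_DIR = Path("trimmed")       # assumption: where trimmed reads go
TRIM_DIR.mkdir(exist_ok=True)

def overrepresented_seqs(zip_path):
    """Return the overrepresented sequences listed in one FastQC zip."""
    with zipfile.ZipFile(zip_path) as zf:
        data_name = next(n for n in zf.namelist() if n.endswith("fastqc_data.txt"))
        lines = zf.read(data_name).decode().splitlines()
    seqs, in_module = [], False
    for line in lines:
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif in_module and line.startswith(">>END_MODULE"):
            break
        elif in_module and not line.startswith("#"):
            # Columns: Sequence, Count, Percentage, Possible Source
            seqs.append(line.split("\t")[0])
    return seqs

for fq in sorted(FASTQ_DIR.glob("*.fastq.gz")):
    zip_path = FASTQC_DIR / (fq.name.replace(".fastq.gz", "") + "_fastqc.zip")
    adapters = overrepresented_seqs(zip_path)
    if not adapters:
        print(f"{fq.name}: nothing flagged, skipping")
        continue
    cmd = ["cutadapt", "-o", str(TRIM_DIR / fq.name)]
    for seq in adapters:
        cmd += ["-a", seq]       # treat each hit as a 3' adapter
    cmd.append(str(fq))
    subprocess.run(cmd, check=True)
    print(f"{fq.name}: trimmed {len(adapters)} overrepresented sequence(s)")
```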
One point that may be helpful: in general, you should never de-duplicate quantitative assays like RNA-seq.
That's interesting. May I ask the reasoning behind your assertion? I thought that remaining PCR duplicates can skew variant calling. Should I mark them instead? Thanks for your continued assistance, Chris.
- RNA-seq reads will often start at the same position (the beginning of the transcript), especially in short transcripts. Since the dedup process works by marking reads that start at the same position, you will often be removing reads that actually come from unique molecules (and are not just duplicates from the amplification steps).
- If you want to do any kind of quantitation of transcript abundance (expression levels), this dedup process will skew things fairly dramatically.
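On the "mark them instead" question: marking (rather than removing) leaves every read in the BAM and only sets the duplicate flag, so a variant caller can honor the flag while quantitation tools ignore it. If you go that route for the variant-calling arm, a rough sketch with Picard MarkDuplicates might look like the following; it assumes a `picard` wrapper on your PATH (swap in `java -jar picard.jar MarkDuplicates ...` if you have the bare jar) and one sorted BAM per sample in a made-up `aligned/` directory.

```python
#!/usr/bin/env python3
"""Mark (not remove) duplicates on aligned BAMs for the variant-calling arm."""
import subprocess
from pathlib import Path

BAM_DIR = Path("aligned")   # assumption: one sorted BAM per sample
OUT_DIR = Path("markdup")   # assumption: where flagged BAMs go
OUT_DIR.mkdir(exist_ok=True)

for bam in sorted(BAM_DIR.glob("*.bam")):
    # REMOVE_DUPLICATES defaults to false, so reads stay in the file
    # and only the duplicate flag is set.
    subprocess.run(
        [
            "picard", "MarkDuplicates",
            "I=" + str(bam),
            "O=" + str(OUT_DIR / bam.name),
            "M=" + str(OUT_DIR / (bam.stem + ".dup_metrics.txt")),
        ],
        check=True,
    )
```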