Question

Automating pre-alignment (and post-alignment) quality control

8.8 years ago

umn_bist ▴ 390

I have a set of 100 samples (50 tumor and 50 matched normal) in which I would ultimately like to call and annotate variants, but I am having difficulty automating the initial pre-processing quality control step.

I am using FastQC to check that sequence content, sequence quality, over-represented sequences (e.g., adapters), and k-mer content are all adequate for alignment. The samples will obviously vary in quality (e.g., different over-represented adapters), but can this step be automated for each file?

EDIT: samples are all RNA-seq

RNA-Seq FastQC Quality Control • 4.0k views

ADD COMMENT • link updated 8.8 years ago by harold.smith.tarheel ★ 5.0k • written 8.8 years ago by umn_bist ▴ 390

Short answer, most probably yes. Long answer: With a bit of detail on how you're processing each file, we can work on automating the process.

ADD REPLY • link 8.8 years ago by Ram 44k

Thank you for your reply. For pre-alignment quality control, I manually load my FASTQ files into FastQC to make sure the average sequence quality score is at least 25. I would like to trim any overrepresented sequences (usually from adapters) and long mononucleotide repeats (the length threshold is not yet defined). Because these are RNA-seq reads, the skewed per-base sequence content at the 5' end is tolerable.

I do not know if I should remove duplicates (which I assume would also fix the k-mer content check) before or after alignment.

I also wonder if there is a file with all known Illumina adapters that I can feed into the pipeline. Or better, can I pull overrepresented sequences from each unique fastq file (which will most likely be adapters or PCR duplicates) to tailor the trimming process for each unique fastq file?

ADD REPLY • link 8.8 years ago by umn_bist ▴ 390
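The manual checks described in this post (mean quality of at least 25, and pulling overrepresented sequences per file) could in principle be read from the `fastqc_data.txt` report that FastQC writes for each input. The following is a minimal sketch, not a tested pipeline: the function names are hypothetical, the embedded report excerpt is made up for illustration, and the mean is computed from the "Per sequence quality scores" histogram, which is one reasonable reading of "average sequence quality".

```python
def parse_fastqc_modules(text):
    """Split fastqc_data.txt content into {module name: (status, rows)}."""
    modules = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            name = None
        elif line.startswith(">>"):
            # Module header looks like ">>Module name<TAB>pass|warn|fail"
            name, _, status = line[2:].partition("\t")
            modules[name] = (status, [])
        elif name is not None and line and not line.startswith("#"):
            modules[name][1].append(line.split("\t"))
    return modules


def mean_sequence_quality(modules):
    """Weighted mean of the per-sequence mean-quality histogram."""
    _, rows = modules["Per sequence quality scores"]
    total = sum(float(row[1]) for row in rows)
    return sum(float(row[0]) * float(row[1]) for row in rows) / total


def overrepresented_sequences(modules):
    """Return (sequence, possible source) pairs reported by FastQC."""
    _, rows = modules.get("Overrepresented sequences", ("", []))
    return [(row[0], row[3]) for row in rows]


# Made-up excerpt in the layout of fastqc_data.txt, for illustration only.
EXAMPLE = "\n".join([
    "##FastQC\t0.11.9",
    ">>Per sequence quality scores\tpass",
    "#Quality\tCount",
    "20\t10.0",
    "30\t30.0",
    ">>END_MODULE",
    ">>Overrepresented sequences\twarn",
    "#Sequence\tCount\tPercentage\tPossible Source",
    "AGATCGGAAGAGC\t500\t1.25\tHypothetical adapter label",
    ">>END_MODULE",
])
```

With reports extracted for every sample, a loop could flag files where `mean_sequence_quality(...)` falls below 25 and collect the sequences from `overrepresented_sequences(...)` for review before deciding what to trim.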

One point that may be helpful: in general, you should never de-duplicate quantitative assays like RNA-seq.

ADD REPLY • link 8.8 years ago by Chris Miller 22k

That's interesting. May I ask the reasoning behind your assertion? I thought that leftover PCR duplicates can skew variant calling. Should I mark them instead? Thanks for your continued assistance, Chris.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390

- RNAseq reads will often start at the same position (the beginning of the transcript), especially in short transcripts. Since the dedup process works by marking reads that start at the same position, you will often be removing reads that are actually from unique molecules (and not just a duplicate from the amplification steps).

- If you want to do any kind of quantitation of transcript abundance (expression levels), this dedup process will skew things fairly dramatically.

ADD REPLY • link 8.8 years ago by Chris Miller 22k
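The two points above can be illustrated with a toy example. This is a deliberately simplified sketch with made-up data: real duplicate-marking tools work on alignment coordinates (and both mates for paired-end data), whereas here a read is just a hypothetical `(transcript, start, strand)` tuple.

```python
from collections import Counter


def count_reads(reads):
    """Count reads per transcript; each read is (transcript, start, strand)."""
    return Counter(transcript for transcript, _, _ in reads)


def dedup_by_position(reads):
    """Keep one read per (transcript, start, strand), mimicking position-based dedup."""
    return sorted(set(reads))


# Toy data (made up): a short, highly expressed transcript whose reads can
# only start at a couple of positions, and a long, lowly expressed one.
reads = (
    [("short_tx", start, "+") for start in (0, 0, 0, 0, 1, 1, 1, 1)]
    + [("long_tx", start, "+") for start in (5, 120, 480, 900)]
)

raw = count_reads(reads)
deduped = count_reads(dedup_by_position(reads))

print("before dedup:", raw["short_tx"], "vs", raw["long_tx"])
print("after dedup: ", deduped["short_tx"], "vs", deduped["long_tx"])
```

In this toy case the short transcript has twice as many reads as the long one before deduplication and half as many afterwards, even though none of its reads were constructed as PCR copies.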

Ram · Answer 1 · 2016-01-27

8.8 years ago

harold.smith.tarheel ★ 5.0k

FastQC is better suited to visualization of the quality metrics, but not really designed for automation. Have you considered using BBDuk from the BBMap package? It provides options for adapter trimming (including lists of standard TruSeq adapters), quality trimming, deduplication (although that's not recommended for RNA-Seq), and reporting, and is easily piped into an alignment workflow.

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by harold.smith.tarheel ★ 5.0k
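As a rough illustration of the BBDuk suggestion, the sketch below assembles a single paired-end `bbduk.sh` call that does adapter trimming, quality trimming, and average-quality filtering. The adapter-trimming parameters (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) follow the commonly cited BBDuk usage; the function name, file names, thresholds, and the assumption that the data are paired-end are all illustrative and should be adjusted to the actual samples.

```python
def build_bbduk_command(in1, in2, out1, out2, adapters, stats,
                        trimq=10, min_avg_quality=10, min_length=25):
    """Assemble a paired-end BBDuk command line (thresholds are placeholders)."""
    return [
        "bbduk.sh",
        f"in={in1}", f"in2={in2}",
        f"out={out1}", f"out2={out2}",
        f"ref={adapters}",            # e.g. the adapters.fa shipped with BBMap
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo",  # adapter trimming
        "qtrim=rl", f"trimq={trimq}",  # trim low-quality bases from both ends
        f"maq={min_avg_quality}",      # discard reads below this average quality
        f"minlen={min_length}",        # discard reads that end up too short
        f"stats={stats}",              # per-reference match statistics
    ]


command = build_bbduk_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "clean_R1.fastq.gz", "clean_R2.fastq.gz",
    "adapters.fa", "sample.bbduk_stats.txt",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once per sample. The `maq` option is what removes whole reads below a chosen average quality, while `qtrim`/`trimq` only trims bases.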

FastQC is a command-line program that can be automated as easily as any other program. I think the default, when it is run with no arguments, is to open a GUI, which may be what you are thinking about. That is also very useful, but not if you specifically want to process 100 files. My typical workflow is to run FastQC before and after trimming with Trimmomatic to compare, though it isn't always necessary to do any trimming, depending on the usage and quality of the data.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by SES 8.6k
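For example, a batch run over a directory of FASTQ files might look like the sketch below. It assumes `fastqc` is on the `PATH`; the function names and directory layout are hypothetical. The `--outdir`, `--threads`, and `--extract` options are standard FastQC command-line flags, and passing file names is what keeps FastQC from opening its GUI.

```python
import subprocess
from pathlib import Path


def find_fastq_files(directory):
    """Return sorted FASTQ paths (*.fastq, *.fq, optionally gzipped) in a directory."""
    patterns = ("*.fastq", "*.fq", "*.fastq.gz", "*.fq.gz")
    found = set()
    for pattern in patterns:
        found.update(Path(directory).glob(pattern))
    return sorted(found)


def build_fastqc_command(fastq_files, outdir, threads=4):
    """Build a non-interactive FastQC command line for a batch of files."""
    return [
        "fastqc",
        "--outdir", str(outdir),
        "--threads", str(threads),
        "--extract",  # unpack reports so fastqc_data.txt can be parsed later
        *[str(f) for f in fastq_files],
    ]


def run_fastqc(directory, outdir, threads=4):
    """Run FastQC on every FASTQ file found in `directory`."""
    files = find_fastq_files(directory)
    if not files:
        raise FileNotFoundError(f"no FASTQ files found in {directory}")
    # FastQC expects the output directory to exist already.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(files, outdir, threads), check=True)
```

Calling `run_fastqc("raw_reads", "qc/pre_trim")` before trimming and again on the trimmed files gives the pre- and post-trimming comparison described above (both directory names are placeholders).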

FastQC is a useful program, and of course it can be automated. But it's designed as a reporting tool, and lacks the functionality (e.g., adapter and quality trimming) described in the OP's second post (and confirmed by your own use of Trimmomatic). He appears to be new to the field (apologies if not true), so I was pointing out an alternative to FastQC that might be better suited to his objective.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by harold.smith.tarheel ★ 5.0k

Thanks for your reply. Yes, I am new to this field. FastQC was the first tool I used to assess whether adapter trimming and removal of low-quality reads/bases was necessary, so I framed my needs in terms of FastQC.

I will definitely keep your suggestion in mind. The package seems to be able to remove adapters, but can it also remove reads below a user-defined quality score? Also, is there a publicly available "gold standard" list of all known Illumina adapter sequences that I can feed into the tool for trimming?

I found a wrapper called Trim Galore that may fit my needs and I was curious if you had any experience/input regarding this tool. Thanks again for your help.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by umn_bist ▴ 390
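For readers weighing the Trim Galore option mentioned here: it is a wrapper around Cutadapt and FastQC that, when no adapter is specified, tries to auto-detect the adapter type from the reads, which is close to the per-file tailoring asked about earlier. The sketch below pairs read files and builds one call per sample. The `_R1`/`_R2` naming convention, function names, and output directory are assumptions; `--paired`, `--quality`, `--fastqc`, and `--output_dir` are Trim Galore options.

```python
def pair_fastq_files(paths):
    """Pair read-1 and read-2 files, assuming names differ only by _R1 / _R2."""
    paths = {str(p) for p in paths}
    pairs = []
    for read1 in sorted(p for p in paths if "_R1" in p):
        read2 = read1.replace("_R1", "_R2", 1)
        if read2 in paths:  # skip read-1 files without a mate
            pairs.append((read1, read2))
    return pairs


def build_trim_galore_command(read1, read2, outdir, min_quality=20):
    """Build a paired-end Trim Galore call that also runs FastQC on the output."""
    return [
        "trim_galore",
        "--paired",
        "--quality", str(min_quality),  # Phred cutoff for 3' quality trimming
        "--fastqc",                     # run FastQC on the trimmed files
        "--output_dir", str(outdir),
        str(read1), str(read2),
    ]
```

Each command list can be handed to `subprocess.run(..., check=True)` in a loop over `pair_fastq_files(...)`, giving one trimmed pair and one post-trimming FastQC report per sample.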

A description of some of BBDuk's functions can be found here. A list of all Illumina adapters is included as part of the package, or you can download Illumina's customer sequence letter from their website.

I have not used Trim Galore so cannot advise.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by harold.smith.tarheel ★ 5.0k