Hi all,
I am new to RNA-seq data analysis and just starting some QC analysis.
I ran FastQC with default settings and got the report, and I would like some comments and suggestions for further QC steps. Please bear with me if the questions are stupid.
The base quality is pretty good. The problem is that the data has very high duplication levels. I read through the documentation and found that possible reasons include PCR amplification, adapter contamination, etc. Any suggestions?
There are also many over-represented sequences and Kmer Content in the report. Any comments?
Another question is about the adapter content. I was told the adapter they used for trimming is CTGTCTCTTATACACATCT, but a certain number of reads contain universal adapter sequence such as AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA.
Thanks in advance
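One quick way to see which of the two adapters is really in the reads is to count how many reads contain the start of each sequence. Below is a minimal Python sketch of that check. The function names, the dictionary labels and the 12-base prefix length are my own assumptions for illustration, and it only finds exact matches, so treat the counts as a rough lower bound.

```python
from itertools import islice

# The two sequences quoted in the post; the labels are illustrative only.
ADAPTERS = {
    "reported_trim_adapter": "CTGTCTCTTATACACATCT",
    "universal_adapter": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA",
}


def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for line in islice(lines, 1, None, 4):
        yield line.strip()


def count_adapter_hits(reads, adapters, prefix_len=12):
    """Count reads containing the first prefix_len bases of each adapter.

    prefix_len=12 is an arbitrary choice: a short prefix also catches
    adapters that are cut off by the end of the read.
    """
    hits = {name: 0 for name in adapters}
    total = 0
    for seq in reads:
        total += 1
        for name, adapter in adapters.items():
            if adapter[:prefix_len] in seq:
                hits[name] += 1
    return total, hits


# Usage (hypothetical file name):
# with open("sample_R1.fastq") as handle:
#     total, hits = count_adapter_hits(fastq_sequences(handle), ADAPTERS)
#     print(total, hits)
```

If one of the counts is close to zero, that adapter is probably not worth trimming.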
Thanks for your answer, @Devon. I am not sure. What I am doing is part of a big project in a co-clinical trial.
Does a peaky GC content distribution indicate a contaminated library, for example adapter (AGATCGG...) contamination?
Is this human RNA? Are you sure you don't have bacterial contamination?
@apelin20 Sorry for forgetting to mention that. Yes, it is a sample with bacterial contamination! I would like to use the sample to identify the bacterial integration site from the RNA-seq data.
The middle peaks aren't from adapters, but they're from something else. Perhaps apelin20 is correct suggesting bacterial contamination (that'd also explain the weird GC distribution). Blasting a few of the top sequences should help in determining this.
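To get a few of the top sequences out for BLAST, something like the following Python sketch could work. It counts exact duplicate reads, keeps the most frequent ones and writes them as FASTA with their count and GC fraction in the header. All function names and the header layout are assumptions made up for this example, and FastQC's own overrepresented-sequence list is computed differently, so the two will not match exactly.

```python
from collections import Counter
from itertools import islice


def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for line in islice(lines, 1, None, 4):
        yield line.strip()


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def top_sequences(reads, n=10):
    """Return the n most frequent read sequences as (sequence, count) pairs."""
    return Counter(reads).most_common(n)


def to_fasta(top):
    """Format (sequence, count) pairs as FASTA text ready to paste into BLAST."""
    lines = []
    for i, (seq, count) in enumerate(top, start=1):
        lines.append(f">seq{i}_count{count}_gc{gc_fraction(seq):.2f}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Usage (hypothetical file names):
# with open("sample_R1.fastq") as handle:
#     top = top_sequences(fastq_sequences(handle), n=10)
# with open("top_sequences.fasta", "w") as out:
#     out.write(to_fasta(top))
```

Note that Counter keeps every distinct read in memory, so for a large run it is probably safer to feed it a subsample of the reads.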
I just want to say how knowledgeable you are! @apelin20 @Devon Ryan Thanks.
Based on that, do I need to clip adapters and trim low-quality bases before the analysis if I would like to identify the virus integration site? Thanks.
A bit of trimming won't hurt. Whether you can determine where the virus (I assume the earlier mention of bacterial integration was mistaken) integrated will depend on whether (1) the virus is getting included in a host transcript or (2) an integrated viral transcript includes host DNA. In either case, just trim adapter contamination and don't do any more than very light quality trimming.
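To make "just trim adapter contamination" concrete, here is a rough Python sketch of 3' adapter clipping with no quality trimming at all. It is illustrative only: the function name and the min_overlap default are assumptions, and it only handles exact matches, whereas a dedicated trimmer such as cutadapt or Trimmomatic also tolerates sequencing errors and would be the sensible choice for real data.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adapter (and everything after it) from one read.

    Returns the trimmed (seq, qual) pair, or the read unchanged if no
    adapter is found. Exact matching only; min_overlap=5 is an arbitrary
    guard against clipping a few bases that match by chance.
    """
    # Case 1: the whole adapter occurs somewhere inside the read.
    pos = seq.find(adapter)
    if pos == -1:
        # Case 2: the read ends partway through the adapter, so only a
        # prefix of the adapter is present at the 3' end. Try the longest
        # possible overlap first.
        longest = min(len(adapter) - 1, len(seq))
        for overlap in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                pos = len(seq) - overlap
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


# Usage with the adapter quoted earlier in the thread:
# seq, qual = trim_3prime_adapter(seq, qual, "CTGTCTCTTATACACATCT")
```

Reads that span a host/virus junction are exactly the ones you want to keep long, which is why the sketch leaves the base qualities alone.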
Why don't you do a preliminary assembly with Velvet/Oases (or another assembler), megablast the top 100 or 1000 transcripts, and see what's in your sample, as suggested by Devon Ryan? It doesn't seem like typical adapters are present in your reads, and it doesn't look like you have a quality problem. You can always do the trimming later (or now); first you need to figure out what is in your sample and where it is.