Question

RNA-seq fastqc report explanation

9.7 years ago

Lerong ▴ 130

Hi all,

I am new to RNA-seq data analysis and just starting some QC analysis.

I ran FastQC with default settings and got the report, and I would like some comments and suggestions for further QC steps. Please bear with me if the questions are stupid.

The base quality is pretty good. The problem is that the data has very high duplication levels. I read through the documentation and found that possible reasons are PCR amplification, adapter contamination, etc. Any suggestions?

The report also flags many overrepresented sequences and the Kmer Content module. Any comments?

Another question is about the adapter content. I was told the adapter they used for trimming is CTGTCTCTTATACACATCT, but a certain number of reads contain universal adapter sequence such as AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA.

Thanks in advance

RNA-Seq • 8.1k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Lerong ▴ 130

Ram · Answer 1 · 2015-03-18

9.7 years ago

Devon Ryan 104k

High levels of apparent duplication are normal in RNAseq and aren't due to actual PCR or optical duplicates. Any sort of rRNA or any other highly expressed transcript will cause the apparent duplication rate to jump.
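To illustrate why expression skew alone inflates the apparent duplication rate, here is a minimal sketch that counts exact duplicate sequences. The function name `duplication_rate` and the toy reads are made up for illustration, and this is a simplification rather than FastQC's actual estimate.

```python
from collections import Counter

def duplication_rate(reads):
    """Fraction of reads that are exact copies of another read in the input."""
    reads = list(reads)
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    return 1.0 - distinct / len(reads)

# Toy data (made up): one highly expressed transcript dominates the library,
# so most reads are identical even though no PCR duplicates were simulated.
reads = ["ACGTACGT"] * 8 + ["TTGGCCAA", "GGGTTTAA"]
print(f"apparent duplication: {duplication_rate(reads):.0%}")
```

A library with no PCR duplicates at all can still score high here simply because a few abundant transcripts produce the same fragments over and over.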

Something that does deserve a bit of looking into is the GC distribution. It's not normally so peaky. Similarly, there's a big jump in the prevalence of TGCCGACTA and CAAGTCGTC in the middle of reads. A spike of a hexamer (well, octamer in this case) at the start of reads is common, but I don't typically see that in the middle of reads for standard mRNAseq datasets. Was this a standard RNAseq experiment or were you doing something special (e.g., RIPseq)?
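If you want to look at the per-read GC distribution outside of FastQC, something like the following rough sketch would do. The helper names `gc_percent` and `gc_histogram` and the 5% bin width are my own choices for illustration, not anything FastQC exposes.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return None
    return 100.0 * sum(base in "GC" for base in called) / len(called)

def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for read in reads:
        gc = gc_percent(read)
        if gc is None:
            continue
        # Put 100% GC reads into the last bin instead of a bin of their own.
        lower_edge = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        hist[lower_edge] += 1
    return dict(sorted(hist.items()))

# Toy reads (made up) just to show the call.
print(gc_histogram(["ATATATAT", "GGCCGGCC", "ATGCATGC"]))
```

A single-organism mRNA library usually gives one broad hump; a second sharp peak is a hint that a second source of sequence with a different GC content is in the mix.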

For trimming, I actually trim off AGATCGG... for most datasets. Have a look here for what Scythe looks for.
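To check which adapter is actually present before trimming, you could count the reads containing each candidate. This is only a sketch: it looks for exact matches, the labels in the `adapters` dictionary are arbitrary names I picked, and the sequences are the one you were told about plus the start of the universal adapter you found.

```python
def adapter_hits(reads, adapters):
    """Count reads containing an exact copy of each adapter sequence."""
    hits = {name: 0 for name in adapters}
    for read in reads:
        for name, adapter in adapters.items():
            if adapter in read:
                hits[name] += 1
    return hits

# Labels are arbitrary; sequences come from the question above.
adapters = {
    "reported_adapter": "CTGTCTCTTATACACATCT",
    "universal_prefix": "AGATCGGAAGAGC",
}

# Toy reads (made up) to show the call.
reads = [
    "ACGTACGTACGTAGATCGGAAGAGCGTCGTG",
    "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAA",
]
print(adapter_hits(reads, adapters))
```

Exact matching undercounts reads where only the first few adapter bases made it into the read or where there are sequencing errors, so treat the numbers as a lower bound.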

ADD COMMENT • link 9.7 years ago by Devon Ryan 104k

Thanks for your answer, @Devon. I am not sure. What I am doing is part of a big project in a co-clinical trial.

Does a peaky GC content indicate a contaminated library, for example adapter (AGATCGG...) contamination?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Lerong ▴ 130

Is this human RNA? Are you sure you don't have bacterial contamination?

ADD REPLY • link 9.7 years ago by apelin20 ▴ 480

@apelin20, sorry for forgetting to mention that. Yes, it is a sample with bacterial contamination! I would like to use the sample to identify the bacterial integration site from the RNA-seq data.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Lerong ▴ 130

The middle peaks aren't from adapters; they're from something else. Perhaps apelin20 is correct in suggesting bacterial contamination (that would also explain the weird GC distribution). Blasting a few of the top sequences should help in determining this.
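One way to get "a few of the top sequences" into a form you can paste into BLAST is sketched below. The function name `top_sequences_fasta` and the header format are invented for illustration; FastQC's own overrepresented sequences table would serve the same purpose.

```python
from collections import Counter

def top_sequences_fasta(reads, n=5):
    """Return the n most frequent read sequences as FASTA text."""
    lines = []
    for rank, (seq, count) in enumerate(Counter(reads).most_common(n), start=1):
        lines.append(f">seq{rank}_count{count}")
        lines.append(seq)
    return "".join(line + "\n" for line in lines)

# Toy reads (made up) to show the call.
reads = ["AAAACCCC"] * 3 + ["GGGGTTTT"] * 2 + ["ACGTACGT"]
print(top_sequences_fasta(reads, n=2), end="")
```

The resulting text can be submitted to a BLAST search against a nucleotide database to see which organism the dominant sequences come from.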

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

I just want to say how knowledgeable you are! @apelin20 @Devon Ryan Thanks.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Lerong ▴ 130

Based on that, do I need to clip adapters and trim low-quality bases before the analysis if I would like to identify the virus integration site? Thanks.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Lerong ▴ 130

A bit of trimming won't hurt. Whether you can determine where the virus (I assume the earlier mention of bacterial integration was mistaken) integrated will depend on whether (1) the virus is getting included in a host transcript or (2) an integrated viral transcript includes host DNA. In either case, just trim adapter contamination and don't do any more than very light quality trimming.
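To make "trim adapters, then only very light quality trimming" concrete, here is a minimal sketch. The function `trim_read`, the threshold `min_q=5`, and the Phred+33 offset are assumptions for illustration; a dedicated trimmer such as cutadapt, Trimmomatic, or Scythe also handles partial and mismatched adapters, which this does not.

```python
def trim_read(seq, qual, adapter="AGATCGGAAGAGC", min_q=5, phred_offset=33):
    """Clip an exact adapter match and anything after it, then drop
    trailing bases whose quality is below min_q (light 3' trimming)."""
    cut = seq.find(adapter)
    if cut != -1:
        seq, qual = seq[:cut], qual[:cut]
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Toy read (made up): 8 bases of insert, then adapter read-through.
seq = "ACGTACGT" + "AGATCGGAAGAGC" + "TT"
qual = "I" * len(seq)
print(trim_read(seq, qual))
```

Keeping the quality threshold low matters for integration sites: reads spanning a host/virus junction are rare, and aggressive trimming shortens exactly the reads you need to span the breakpoint.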

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

Why don't you do a preliminary assembly with Velvet/Oases (or another assembler) and megablast the top 100 or 1000 transcripts to see what's in your sample, as suggested by Devon Ryan? It doesn't seem like typical adapters are present in your reads, and it doesn't look like you have a quality problem. You can always do the trimming later (or now); first you need to figure out what is in your sample and where it is.

ADD REPLY • link 9.7 years ago by apelin20 ▴ 480