Question

In RNA-Seq, how does AT content help us know whether the data is contaminated?

0

6.1 years ago

Inquisitive8995 ▴ 280

Hello, I would like to know whether the AT content of RNA-Seq data plays any role in determining whether the data is contaminated. If so, how would we detect that? I would also like to know a way to calculate it. I am new to RNA-Seq, so any help would be appreciated. Thanks.

RNA-Seq AT-content sequencing • 1.3k views

Updated 6.1 years ago by Charles Warden 8.3k • written 6.1 years ago by Inquisitive8995 ▴ 280

score 1 · Accepted Answer · 2018-09-19

1

6.1 years ago

Charles Warden 8.3k

If it is a polyA-enriched protocol and you see mostly polyT/polyA reads, you might have had low RNA input (or an unusually high number of target sequences).
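As a rough illustration of what "mostly polyT/polyA reads" could mean in practice, here is a minimal Python sketch that estimates the fraction of reads dominated by a single A or T base. The function names and the 0.9 cutoff are assumptions made for this example, not an established standard.

```python
def is_poly_at(seq, min_fraction=0.9):
    """Return True if A alone or T alone makes up at least min_fraction of seq.

    The 0.9 default is an arbitrary, illustrative cutoff.
    """
    seq = seq.upper()
    if not seq:
        return False
    return any(seq.count(base) / len(seq) >= min_fraction for base in "AT")


def poly_at_fraction(reads, min_fraction=0.9):
    """Return the fraction of reads that look like polyA or polyT."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(is_poly_at(read, min_fraction) for read in reads)
    return hits / len(reads)


# Toy reads: the first two are polyA/polyT-like, the last two are not.
reads = ["AAAAAAAAAA", "TTTTTTTTTA", "ACGTACGTAC", "GGCCGGCCAA"]
print(poly_at_fraction(reads))  # 0.5
```

A high value from a check like this only hints at a library problem; it does not by itself say anything about contamination.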

There may be other GC-centric metrics that other people find useful. If the genome was small and contamination was high, you might be able to BLAST some over-represented sequences that are not part of the Illumina library preparation.

For bacterial contamination, I would probably expect a lower alignment rate to your organism of interest (I'm assuming it is a vertebrate). If the alignment rate is low, you could try a de novo assembly of the unaligned reads in your samples (with a Bowtie alignment of contigs) and BLAST some of the contigs with the highest coverage. That said, this is something I would typically think of doing post-alignment, not based upon an early AT-enrichment check.
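Most aligners report the alignment rate in their log, but if only a SAM file is at hand, the rate and the unaligned reads can be recovered from the FLAG column. The following is a minimal sketch, assuming uncompressed SAM text; the function name is made up for this example, and each mate of a paired-end read is counted as a separate record.

```python
import io

UNMAPPED = 0x4          # SAM FLAG bit: read is unmapped
SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment


def alignment_summary(sam_handle):
    """Return (alignment_rate, unaligned) where unaligned is a list of (name, sequence)."""
    total = aligned = 0
    unaligned = []
    for line in sam_handle:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read record only once
        total += 1
        if flag & UNMAPPED:
            unaligned.append((fields[0], fields[9]))
        else:
            aligned += 1
    rate = aligned / total if total else 0.0
    return rate, unaligned


# Toy SAM content: two primary alignments, one unmapped read, one secondary alignment.
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
    "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tGGCC\tIIII\n"
    "r1\t256\tchr2\t50\t0\t4M\t*\t0\t0\tACGT\tIIII\n"
)
rate, unaligned = alignment_summary(example)
print(rate, unaligned)
```

The unaligned reads collected this way are the ones that would go into the de novo assembly and BLAST steps described above.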

0

Hi, thanks for your answer. The alignment rate is only around 30%. My next step was to run BLAST to check it. Do you know a way to check the AT content, or should I just count the bases in my assembled file?

Reply • 6.1 years ago by Inquisitive8995 ▴ 280

0

FastQC will report over-represented sequences, which can include Illumina library sequences (if you ended up sequencing a bunch of adapter sequences, for example) as well as polyA/polyT sequences. It will also give some overall GC content information.
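If you want to count the bases yourself, AT content is simply the complement of GC content once ambiguous bases are ignored. Here is a minimal Python sketch, assuming a standard four-line-per-record FASTQ file; the function names are invented for this example.

```python
import io


def read_fastq_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) from a FASTQ handle."""
    for index, line in enumerate(handle):
        if index % 4 == 1:
            yield line.strip()


def at_content(sequences):
    """Return the AT percentage over all A/C/G/T bases; N and other symbols are ignored."""
    at = gc = 0
    for seq in sequences:
        seq = seq.upper()
        at += seq.count("A") + seq.count("T")
        gc += seq.count("G") + seq.count("C")
    total = at + gc
    return 100.0 * at / total if total else 0.0


# Toy FASTQ with two reads: 5 A/T bases out of 11 called bases.
example = io.StringIO("@r1\nAATTGC\n+\nIIIIII\n@r2\nGGCCAN\n+\nIIIIII\n")
print(at_content(read_fastq_sequences(example)))
```

For a real file, the same functions can be fed an open handle, for example `gzip.open(path, "rt")` for a gzipped FASTQ, where `path` stands for your own file.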

FastQ Screen can also help with identifying the origin for some sequences.

With a 30% alignment rate, you may have to further reduce your number of unaligned reads for a de novo assembly (or run multiple assemblies on different subsamples of reads, since some may happen to be better at assembling informative regions for potential contamination). However, that does sound like a low alignment rate.
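Drawing several different subsamples of the unaligned reads, as suggested above, can be done reproducibly with a seeded random generator. This is a minimal sketch that holds all reads in memory; the function name, subsample size, and seeds are placeholders chosen for the example.

```python
import random


def subsample(reads, n, seed):
    """Return n reads chosen uniformly at random without replacement.

    If there are n or fewer reads, all of them are returned.
    """
    reads = list(reads)
    if len(reads) <= n:
        return reads
    rng = random.Random(seed)  # a fixed seed makes the subsample reproducible
    return rng.sample(reads, n)


# Toy data standing in for unaligned read records.
unaligned = [f"read{i}" for i in range(1000)]

# Three independent subsamples, each of which could go into its own assembly.
subsets = [subsample(unaligned, 100, seed) for seed in (1, 2, 3)]
print([len(s) for s in subsets])  # [100, 100, 100]
```

For very large FASTQ files, a streaming approach or a dedicated subsampling tool would be more practical than loading every read into a list.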

Reply • 6.1 years ago by Charles Warden 8.3k

0

Thanks, will try that. Also, can you tell me what the optimal AT content percentage would be for good human transcriptome data?

Reply • 6.1 years ago by Inquisitive8995 ▴ 280

1

I am not sure about the optimal AT percentage, but I would usually expect an alignment rate greater than 90% (and I think the polyA/polyT over-represented sequences will probably be at less than 1% each).

Reply • 6.1 years ago by Charles Warden 8.3k