Illumina Rnaseq: Adaptor Sequence Followed By Polya?
11.2 years ago

I have strand-specific RNA-seq Illumina data (TruSeq library preparation kit). After noticing adaptor contamination via FastQC, I realized that the adaptor* sequence is always followed by a polyA tail. I wonder why.

In one of the old posts, Jeremy mentions something called "fakePolyA": C: How to interpret the kmer enrichment plot of a FastQC output

This can be seen only in /1 reads and looks like this: GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAA[here continues variable sequence that I think comes from the real sample:)]

Interestingly, almost all sequences containing the adaptor have this A+ tail:

grep -E "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG[A]+" R1_001.fastq | wc -l
5108

grep -E "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG" R1_001.fastq | wc -l
5119
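The same two counts can also be taken per read rather than per line. The sketch below is only an illustration: the function names and the `read_fastq_sequences` helper are hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```python
import re

# Adaptor sequence copied from the grep commands above.
ADAPTOR = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"


def count_adaptor_reads(sequences, adaptor=ADAPTOR):
    """Return (reads containing the adaptor, reads where it is followed by >= 1 A)."""
    followed_by_a = re.compile(re.escape(adaptor) + "A+")
    with_adaptor = 0
    with_polya = 0
    for seq in sequences:
        if adaptor in seq:
            with_adaptor += 1
            if followed_by_a.search(seq):
                with_polya += 1
    return with_adaptor, with_polya


def read_fastq_sequences(path):
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.rstrip("\n")
```

Calling `count_adaptor_reads(read_fastq_sequences("R1_001.fastq"))` should give numbers comparable to the two grep counts, since grep only differs by also scanning header and quality lines.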

Any idea will be appreciated:-)

*I found adaptor sequences at this site: http://supportres.illumina.com/documents/myillumina/6378de81-c0cc-47d0-9281-724878bb1c30/2012-09-18_illuminacustomersequenceletter.pdf

EDIT: I have cut the first four characters of the sequences in the fastq file and joined them with the polyA (polyA tails longer than 6 bp were trimmed):

GATCAAAAAA

[image: FastQC quality plot for the sequence above]

It seems that the quality scores are pretty good. They are definitely not zero.

EDIT2:

Here are the actual fastq substrings. I am sorry: I wrote earlier that I had provided the part between the adaptor and the polyA (which would be CTTGAAAAAA). I actually cut the first 4 characters (GATC) and joined them with the polyA (this results in GATCAAAAAA). This, however, does not affect the polyA quality scores from the fastq files :)
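One way to read the polyA quality scores straight from a FASTQ record, without going through FastQC, is sketched below. The function name is hypothetical, and the Phred+33 offset is an assumption: older Illumina pipelines wrote Phred+64, in which case `offset=64` would be needed.

```python
# Adaptor sequence copied from the grep commands in this post.
ADAPTOR = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"


def polya_tail_qualities(seq, qual, adaptor=ADAPTOR, offset=33):
    """Return Phred scores of the run of As directly after the adaptor.

    Returns None if the adaptor is absent and [] if no A follows it.
    offset=33 assumes Phred+33 encoding (an assumption, see above).
    """
    start = seq.find(adaptor)
    if start == -1:
        return None
    tail_start = start + len(adaptor)
    tail_end = tail_start
    # Extend the window while the read keeps reporting A.
    while tail_end < len(seq) and seq[tail_end] == "A":
        tail_end += 1
    return [ord(char) - offset for char in qual[tail_start:tail_end]]
```

Under Phred+33, a quality character of `#` decodes to 2 and `!` decodes to 0, so a tail of unsupported base calls would show up as a list of very small numbers, while confidently called As would give values in the 30s or higher.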

rnaseq adaptor • 6.3k views

Could you also report the corresponding quality scores for some of these polyA tails?


I have just updated my post. Thanks for the suggestion.


Interesting. If I understand correctly, this indicates that the As aren't being put there because the basecaller couldn't find any intensity for the cycle.


OK, so this seems to be a different scenario from what I had in mind when writing my answer below. Back to the drawing board.


Can you give us the actual fastq quality values, not a FastQC figure?


Yes, I have just edited my post and provided the corresponding fastq substrings.

11.2 years ago

I have noticed this as well and just assumed that an A is reported whenever you have reached the end of the template and there is nothing left to sequence. Thus these would be "fake As". I'd be happy to be corrected on this by anyone who knows better.


Yep, that's the explanation I've come up with. If you look, the quality values are all zero (or an empty value such as #). It's just a nucleotide that's called in anger...
