I have strand-specific RNA-Seq Illumina data (TruSeq library preparation kit). After noticing adaptor contamination in FastQC, I realized that the adaptor* sequence is always followed by a polyA tail. I wonder why.
In one of the old posts, Jeremy mentions something called "fakePolyA": C: How to interpret the kmer enrichment plot of a FastQC output
This can be seen only in /1 reads and looks like this: GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAA[here continues variable sequence that I think comes from the real sample:)]
Interestingly, almost all sequences have this A+ tail:
grep -E "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG[A]+" R1_001.fastq | wc -l
5108
grep -E "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG" R1_001.fastq | wc -l
5119
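The two grep counts above scan every line of the file, including header and quality lines. As a cross-check, here is a minimal Python sketch that looks only at the sequence line of each record. It assumes a plain, uncompressed FASTQ file with exactly four lines per record; the helper names `read_fastq` and `count_adapter_hits` are made up for illustration.

```python
import re

# Adaptor sequence as quoted in the question.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def count_adapter_hits(lines, adapter=ADAPTER):
    """Return (reads containing the adaptor, reads where it is followed by A+)."""
    followed_by_a = re.compile(re.escape(adapter) + "A+")
    with_adapter = 0
    with_polya = 0
    for _, seq, _ in read_fastq(lines):
        if adapter in seq:
            with_adapter += 1
            if followed_by_a.search(seq):
                with_polya += 1
    return with_adapter, with_polya
```

To run it on the file from the question, something like `with open("R1_001.fastq") as fh: print(count_adapter_hits(fh))` should do.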
Any idea will be appreciated:-)
*I found adaptor sequences at this site: http://supportres.illumina.com/documents/myillumina/6378de81-c0cc-47d0-9281-724878bb1c30/2012-09-18_illuminacustomersequenceletter.pdf
EDIT: I have cut the first four characters of the fastq file and joined it with polyA (polyA tails longer than 6 bp were trimmed):
GATCAAAAAA
It seems that the scores are pretty good. They are definitely not zero.
EDIT2:
Here are the actual fastq substrings. I am sorry that I wrote before that I provided the part between the adaptor and the polyA (which would be CTTGAAAAAA). I actually cut the first 4 characters (GATC) and joined them with the polyA (this results in GATCAAAAAA). This, however, does not affect the polyA quality scores from the fastq files:)
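For anyone who wants to pull the polyA quality scores straight from the fastq file rather than from a FastQC figure, here is a minimal Python sketch. It assumes Phred+33 encoding (an assumption about this data set, not something confirmed here), and the function name `polya_qualities` is hypothetical.

```python
import re

# Adaptor sequence as quoted in the question.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"


def polya_qualities(seq, qual, adapter=ADAPTER, offset=33):
    """Return the Phred scores of the A run that directly follows the adaptor.

    Returns None if the adaptor is absent or is not followed by at least one A.
    offset=33 assumes Phred+33 quality encoding.
    """
    match = re.search(re.escape(adapter) + "(A+)", seq)
    if match is None:
        return None
    start, end = match.span(1)  # coordinates of the A run only
    return [ord(ch) - offset for ch in qual[start:end]]
```

Calling this on the sequence and quality lines of one record gives one integer per A in the tail, so consistently high values would be easy to tell apart from near-zero ones.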
Could you also report the corresponding quality scores for some of these polyA tails?
I have just updated my answer. Thanks for the suggestion.
Interesting. If I understand correctly, this indicates that the As aren't being put there because the basecaller couldn't find any intensity for the cycle.
OK, so this seems to be a different scenario from what I had in mind when writing my answer below. Back to the drawing board.
Can you give us the actual fastq quality values? Not a FastQC figure
Yes, I have just edited my answer and provided the corresponding fastq file.