Good for you for checking the quality of your data, but be advised that FastQC is extremely strict and will often flag patterns that are not the result of poor quality. Think of FastQC like a quality screen. If it flags something, it means you should take a closer look, but it doesn't necessarily mean that the item being flagged indicates bad data.
The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence.
Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite).
You'll see information about the adapters at the end of the fastQC report, but it's probably good to assume that you need to trim them off (at least unless the people who did the sequencing tell you otherwise).
The output from FastQC is most meaningful for sequencing DNA, so you often get warnings and errors when you use it to look at an RNAseq dataset.
I might tolerate random hexamer priming artefacts in the first 6-12 nucleotides of a read, but there has to be adapter contamination in there. I've never seen an RNA-SEq dataset with that amount of distortion in.
I agree that there is almost definitely adapter contamination. The second clue is that the quality scores are lower in the beginning of the read, go up (presumably after the adapter ends), then decrease gradually as you'd expect. Absent adapter contamination, you usually expect the beginning of the reads to have the highest quality and decrease from there. My guess is that the adapter sequences are synthesized with a lower accuracy rate than the reverse-transcribed RNA fragments.
In any case, you can just try to map it (or a subset) to your genome of interest and you'll quickly find out the answer.
ADD REPLY
• link
updated 5.0 years ago by
Ram
44k
•
written 10.7 years ago by
matted
7.8k
0
Entering edit mode
I have another fastq file from the same machine gave me the overpresented sequences
Overrepresented sequences
Sequence Count Percentage Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 42089 0.41441946173421285 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC 32502 0.32002331595631606 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT 23822 0.2345577328383288 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG 20383 0.20069642634722756 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAGATTCTGAAACCATTTACT 16026 0.15779624827751895 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT 15612 0.15371989442834305 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG 15338 0.1510220177262315 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGACTGACACGCTGTCCTTTCC 13227 0.13023655160156888 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC 12826 0.12628819920176326 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGGGTGTACTGGCTTCGAC 10313 0.10154453441195888 No Hit
Sequence Count Percentage Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG 45741 0.38499099869649644 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 43533 0.36640679360430645 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT 39535 0.3327565889129225 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC 32702 0.27524487088985433 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGACATTTGCTTCTGACATAG 24864 0.20927430951640075 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG 22127 0.18623763862087356 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC 19912 0.16759451621181518 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGCCGGGCGCGGTGGCGCACG 18570 0.15629922489219603 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT 16668 0.14029054822310844 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAACCAGAAGACCAACACG 12398 0.10435098493341123 No Hit
AAGCAGTGGTATCAACGCAGAGTACTTGGTGACGGCCTTGGTGCCCTCCG 12271 0.1032820564702282 No Hit
If you look at the sequences, there is only one adapter here: AAGCAGTGGTATCAACGCAGAGTACATGGG. The rest of the reads are genomic, probably from high abundance RNAs like ribosomal RNAs. In fact, if you just BLAST the three sequences you showed below, two of them are rRNA (after the barcode above). The second one doesn't have the barcode, which is confusing though.
I would attempt to trim with that sequence and see how many sequences had it. I would also try to find out the source of the sequence with the service provider (since it doesn't match any Illumina primer I'm aware of).
ADD REPLY
• link
updated 5.0 years ago by
Ram
44k
•
written 10.7 years ago by
matted
7.8k
0
Entering edit mode
Thank you, I just contacted the service provider, and will update it here.
Hi Matted, I got replied from the service provider:
Thanks you for your feedback. I am forwarding these information regarding to rRNA residual to our support team. You will hear from them soon. Sorry for the inconvenience.
Considering 'Random primer' method had been used to this project, there was no adapter in cDNA preparation step. The adapters appeared in fastq files, if any, should be standard illumina sequencing adapter at 3'end of reads (when fragments were shorter than 100bp). Usually FastQC can identify these sequences, and report them in overrepresented part.
To remove this illumina adapter, please apply following 13 bp sequence with cutadapt to both reads:
5' 'AGATCGGAAGAGC' 3'
you can find more details of illumina adapters here.
I then just grep this pattern to see how many sequences have this adpater
(EDIT by @RamRS: Added `` to highlight location of adapter sequences, switched from highlighting using a `` tag to just using plain text)
Some showed up at the end of the read (3'end), some showed up in the middle of the reads. How the cutadapt works? it will remove the adapter in the middle also? I check the pdf file, and I do not see AGATCGGAAGAGC in the file, and it did not show up in my fastqc report either.
Yeah, now that you mention it that is high and, in fact, does have a different sequence profile than the normal bias that I've run into. I suspect that you're right and there's still adapter in there.
No hit for the possible source. I planned to use trimmomatic to trim the adapter. along with the package, there is a folder named adapters containing the fasta files for the adapter sequences for HiSeq machine,
You're probably good to go then, but trimming won't hurt (you can get rid of the occasional low quality read, though you probably won't gain much since you seem to have decent quality).
Good for you for checking the quality of your data, but be advised that FastQC is extremely strict and will often flag patterns that are not the result of poor quality. Think of FastQC like a quality screen. If it flags something, it means you should take a closer look, but it doesn't necessarily mean that the item being flagged indicates bad data.