Hello, newbie here. I am reanalyzing the data from an article (GSE83931) for training purposes. I have two questions.
1- I ran FastQC on the sequences, followed by MultiQC. When I look at the FastQC reports individually, they do not show any adapter sequence (please see pic1). (The authors reported that they used Trimmomatic to remove adapters.) However, I can see adapter content in the MultiQC report (pic2). Both pictures belong to the same run.
How can we explain the discrepancy here?
2- They reported that the TruSeq3-SE.fa adapter sequences were removed with Trimmomatic. I used cutadapt instead. The adapter sequence I found online (based on the FastQC report) is: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
I used the following command line:
cutadapt -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag 'length=' SRR3734812.fastq.gz
Output:
This is cutadapt 1.18 with Python 3.7.6
Command line parameters: -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag length= SRR3734812.fastq.gz
Processing reads on 4 cores in single-end mode ...
Finished in 709.18 s (28 us/read; 2.16 M reads/minute).

=== Summary ===

Total reads processed:              25,562,072
Reads with adapters:                   783,598 (3.1%)
Reads that were too short:                   0 (0.0%)
Reads written (passing filters):    25,562,072 (100.0%)

Total basepairs processed: 2,556,207,200 bp
Total written (filtered):  2,553,044,075 bp (99.9%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA; Type: regular 3'; Length: 34; Trimmed: 783598 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

Bases preceding removed adapters:
  A: 24.0%
  C: 31.0%
  G: 29.6%
  T: 15.5%
  none/other: 0.0%

Overview of removed sequences
length  count   expect    max.err  error counts
3       529182  399407.4  0        529182
4       116588  99851.8   0        116588
5       39583   24963.0   0        39583
6       16724   6240.7    0        16724
7       14190   1560.2    0        14190
8       12594   390.0     0        12594
9       11809   97.5      0        11202 607
10      10917   24.4      1        10045 872
11      9490    6.1       1        9007 483
12      8432    1.5       1        8112 320
13      7396    0.4       1        7214 182
14      6684    0.1       1        2 6682
15      8       0.0       1        0 8
17      1       0.0       1        0 1
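For anyone reading the table above: the "expect" column appears to be the number of reads that would match that many adapter bases purely by chance. The sketch below reproduces those values under the assumption that each base is A, C, G or T with probability 1/4, so the expectation is total reads times 0.25 to the power of the match length. The function name is made up for illustration, and this is a reading of the numbers in the report rather than cutadapt's actual code.

```python
def expected_random_matches(total_reads: int, match_length: int) -> float:
    """Reads expected to match `match_length` adapter bases by chance alone,
    assuming the four bases are equally likely (an assumption, not a fact
    about this library)."""
    return total_reads * 0.25 ** match_length


TOTAL_READS = 25_562_072  # "Total reads processed" in the summary above

# A few (length, count) rows copied from "Overview of removed sequences".
observed = {3: 529182, 4: 116588, 5: 39583, 8: 12594, 12: 8432}

for length, count in observed.items():
    expect = expected_random_matches(TOTAL_READS, length)
    print(f"{length:>2} bp: observed {count:>7}, expected by chance {expect:>10.1f}")
```

Read this way, most of the 3 to 5 bp removals are in the range expected from random matches, while the removals of 8 bp and longer are far above chance and look like real adapter.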
After trimming, I ran FastQC again on the same file. Apparently it did something, as the sequence length is now 83-100 (pic3). When I compare the first 3-4 reads from before and after trimming, they look the same. How can I validate the trimming step?
A naïve question: should all reads have an adapter, or do only some of them? (The report says 3% of the reads have an adapter.) Although it is not mentioned in the article, could the authors have uploaded already-trimmed sequences to GEO?
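One way to check this without relying on the first few reads (only 3.1% of reads were trimmed, so the first 3-4 are unlikely to be among them) is to pair reads by ID and list those that got shorter. This is a minimal sketch, assuming both files keep reads in the same order and that the trimmed file only drops or shortens reads; the function names are hypothetical.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (read_id, sequence) from a plain or gzipped FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))  # one FASTQ record = 4 lines
            if len(record) < 4:
                return
            # ID is the header without '@', up to the first whitespace.
            yield record[0][1:].split()[0], record[1].rstrip("\n")


def compare_trimming(raw_path, trimmed_path, max_reads=None, n_examples=5):
    """Count reads that are shorter after trimming and keep a few examples.

    Each example is (read_id, bases removed from the 3' end).
    """
    raw_reads = read_fastq(raw_path)
    compared = shortened = 0
    examples = []
    for read_id, trimmed_seq in islice(read_fastq(trimmed_path), max_reads):
        # Advance the raw file until the same read ID is found; this skips
        # reads that the length filter removed from the trimmed file.
        for raw_id, raw_seq in raw_reads:
            if raw_id == read_id:
                break
        else:
            raise ValueError(f"{read_id} not found in {raw_path}")
        compared += 1
        if len(trimmed_seq) < len(raw_seq):
            shortened += 1
            if len(examples) < n_examples:
                examples.append((read_id, raw_seq[len(trimmed_seq):]))
    return {"compared": compared, "shortened": shortened, "examples": examples}
```

Calling compare_trimming("SRR3734812.fastq.gz", "SRR3734812_trim50.fastq.gz", max_reads=1_000_000) should then report roughly 3% of reads as shortened, and the removed suffixes in "examples" should start with the beginning of the adapter sequence.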
Thank you for your time!
After comparing my FastQC report with images (from a Google search) of FastQC reports with higher adapter content, I decided that my <1% "adapter content" is actually not adapter but rather something else. If anyone disagrees, please let me know!
There may still be a bit of adapter left. A program like bbduk.sh (GUIDE) can get rid of even the last base by using a method that overlaps R1/R2 reads (look at the tbo and tpe options). That said, you generally need not worry about this, since aligners will soft-clip any bases that do not match/map. Only if you are planning to do de novo assemblies may you need to worry about those.
Thank you, I appreciate it!