Hello,
I have two samples: a control and an infected sample. For each sample, I have two FASTQ files: R1 (forward) and R2 (reverse).
I initially ran FastQC and saw overrepresented sequences in my files. The R2 results for the non-infected sample are shown below:
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT 10012948 15.265471635087929 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTG 192028 0.29276073211831966 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTA 102871 0.15683436412264704 Clontech SMART CDS Primer II A (100% over 26bp)
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT 75230 0.11469363778855786 Clontech SMART CDS Primer II A (100% over 24bp)
AAGCAGTGGTATAAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT 75166 0.11459606510720113 Clontech SMART CDS Primer II A (96% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 70585 0.10761199552446307 Clontech SMARTer II A Oligonucleotide (100% over 25bp)
I also ran FastQC on my R1 reads, which were flagged for overrepresented sequences as well, but some hits were TruSeq and some were Clontech:
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCTCGT 772364 1.1775254134909172 TruSeq Adapter, Index 23 (97% over 37bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCTCGG 632984 0.9650304083736875 TruSeq Adapter, Index 23 (97% over 37bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCGCGG 220259 0.3358009566086663 TruSeq Adapter, Index 23 (97% over 37bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT 195017 0.29731768125230873 Clontech SMART CDS Primer II A (100% over 26bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCGCGT 185486 0.2827869745958852 TruSeq Adapter, Index 23 (97% over 37bp)
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT 95866 0.14615472923352238 Clontech SMART CDS Primer II A (100% over 24bp)
First, a sample wouldn't have two different types of adapters, so this confuses me.
I ran Trimmomatic anyway with the ILLUMINACLIP parameter:
java -jar /mnt/Active/Trimmomatic-0.39/trimmomatic-0.39.jar PE /mnt/Active/rna_seq/rsv_mock.CCACCACCTA-ATCGAATCCG.HKW73DSX3_CCACCACCTA-ATCGAATCCG_L004_R1.fastq.gz /mnt/Active/rna_seq/rsv_mock.CCACCACCTA-ATCGAATCCG.HKW73DSX3_CCACCACCTA-ATCGAATCCG_L004_R2.fastq.gz rsv_mock_R1_paired.fq.gz rsv_mock_R1_unpaired.fq.gz rsv_mock_R2_paired.fq.gz rsv_mock_R2_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True SLIDINGWINDOW:4:30 MINLEN:50
This removed the overrepresented sequences flag for the R1 file, yet the R2 file now has even more sequences flagged:
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 1128037 2.38682200926412 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTT 1073014 2.270398427931469 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 965710 2.0433530837786824 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT 866803 1.834075015355141 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 671489 1.4208086473925545 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTT 650235 1.3758371482441225 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT 444462 0.9404405031763581 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 404197 0.8552434855226643 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 210997 0.44645014612880746 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 106459 0.22525740226982716 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 52962 0.11206269586427249 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGGAGGCATTGAGGCAGCCAGC 48149 0.10187883280784063 Clontech SMARTer II A Oligonucleotide (100% over 25bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 47423 0.10034268392378298 Clontech SMARTer II A Oligonucleotide (100% over 25bp)
What exactly are these sequences? Could they be highly expressed genes rather than the primers that the FastQC report suggests? Why would there be more after running Trimmomatic?
Also, the "Per sequence GC content" plot no longer has the nice bell curve that it had before running Trimmomatic.
Your help will be greatly appreciated, thank you!
One is a primer and the other is an adapter; those are two different entities. Do you expect to see primer sequences in your data? It looks like the Clontech kit uses some kind of poly-A capture technology, which is probably represented by the poly-Ts you see in the results above.
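To make the "primer plus poly-T" reading of those flagged sequences concrete, here is a minimal Python sketch that splits a sequence into the shared 25-base prefix, the run of Ts that follows it, and whatever is left. The prefix is simply copied from the FastQC table in the question and is assumed to be the primer portion; the function name `split_primer_polyt` is hypothetical.

```python
import re

# Assumption: the 25-base prefix shared by the flagged reads in the question
# is treated here as the primer portion of the read.
PRIMER_PREFIX = "AAGCAGTGGTATCAACGCAGAGTAC"


def split_primer_polyt(seq, primer=PRIMER_PREFIX):
    """Return (primer_found, poly_t_length, remainder) for one sequence."""
    if not seq.startswith(primer):
        return False, 0, seq
    rest = seq[len(primer):]
    # Count the uninterrupted run of Ts directly after the primer.
    t_run = re.match(r"T*", rest).group(0)
    return True, len(t_run), rest[len(t_run):]


# Sequences shaped like the flagged ones: primer + Ts, primer + Ts + one
# other base, and primer followed by non-T sequence.
examples = [
    PRIMER_PREFIX + "T" * 25,
    PRIMER_PREFIX + "T" * 24 + "G",
    PRIMER_PREFIX + "ATGGGAAGCAGTGGTATCAACGCAG",
]

for s in examples:
    print(split_primer_polyt(s))
```

If most flagged reads come back as primer followed almost entirely by Ts, that is consistent with primer and poly-T carry-over rather than a highly expressed gene, although it does not prove it.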
You have not given us information about other parts of the FastQC report. How long are these reads? Did other parameters in FastQC look reasonable?
More than likely they are things that should get removed once you properly scan and trim your data. If you are willing, I suggest you give bbduk.sh a try. A guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/
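As a rough starting point, the sketch below only assembles a bbduk.sh command line for left-trimming the primer prefix; it does not run anything. The file names, the helper name `build_bbduk_command`, and the parameter values (k=23, mink=11, hdist=1, minlength=50) are assumptions for illustration and have not been tuned for this data set; check them against the guide linked above before use.

```python
import shlex


def build_bbduk_command(r1, r2, out1, out2, primer,
                        k=23, mink=11, hdist=1, minlength=50):
    """Assemble (but do not execute) a bbduk.sh call for 5' primer trimming."""
    return [
        "bbduk.sh",
        f"in1={r1}", f"in2={r2}",        # paired input files
        f"out1={out1}", f"out2={out2}",  # paired output files
        f"literal={primer}",             # primer given as a literal sequence
        "ktrim=l",                       # trim the match and bases to its left
        f"k={k}", f"mink={mink}",        # k-mer length, shorter k-mers at read tips
        f"hdist={hdist}",                # allowed mismatches per k-mer
        f"minlength={minlength}",        # discard reads shorter than this
    ]


# Hypothetical file names; the primer prefix is copied from the FastQC table.
cmd = build_bbduk_command(
    "R1.fastq.gz", "R2.fastq.gz",
    "R1_trimmed.fq.gz", "R2_trimmed.fq.gz",
    "AAGCAGTGGTATCAACGCAGAGTAC",
)
print(shlex.join(cmd))
```

This k-mer trimming step does not remove the poly-T stretch that follows the primer, so that would still need separate handling, and FastQC should be rerun on the output to see what remains.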