Hey all,
So I currently dealing with deep-sequencing data of 16S amplicons from multiple variable regions on the 16S gene. I used different sets of primers to amplify different regions and then sequenced them all together, I am trying to de-multiplex based on those primers using cutadapt and the following code:
cutadapt \
--pair-adapters \
--no-indels \
-g file:primers_16S_V1-9_qiagen_for.fasta \
-G file:primers_16S_V1-9_qiagen_rev.fasta \
-o {name1}_{name2}.1.fastq.gz -p {name1}_{name2}.2.fastq.gz \
1_S1_L001_R1_001.fastq.gz 1_S1_L001_R2_001.fastq.gz
where my primer files look like that:
primers_16S_V1-9_qiagen_for.fasta:
>V1V2_f
AGRGTTTGATYMTGGCTC
>V2V3_f
GGCGNACGGGTGAGTAA
>V3V4_f
CCTACGGGNGGCWGCAG
>V4V5_f
GTGYCAGCMGCCGCGGTAA
>V5V7_f
GGATTAGATACCCBRGTAGTC
>V7V9_f
YAACGAGCGMRACCC
primers_16S_V1-9_qiagen_rev.fasta:
>V1V2_r
CTGCTGCCTYCCGTA
>V2V3_r
WTTACCGCGGCTGCTGG
>V3V4_r
GACTACHVGGGTATCTAATCC
>V4V5_r
CCGYCAATTYMTTTRAGTTT
>V5V7_r
ACGTCRTCCCCDCCTTCCTC
>V7V9_r
GCTGCGTTCTTCATCGATGC
My primers are not necessarily anchored so they can appear at any place in the reads. When I demultiplex I almost don't get any matches to the V3V4f and V3V4r combination which is weird since I know it is a major part of my reads.
So I tried running the following code on the unknown reads only with the V3V4f and V3V4r combination:
cutadapt \
--pair-adapters \
--no-indels \
--action=none \
-g CCTACGGGNGGCWGCAG \
-G GACTACHVGGGTATCTAATCC \
-o {name1}_{name2}.1.fastq.gz -p {name1}_{name2}.2.fastq.gz \
unknown-unknown.1.fastq.gz unknown-unknown.2.fastq.gz
and lo and behold I get tens of thousands of reads matching the primer combination.
Does anybody have any idea what is happening here? I am sure I am missing something. Why are those matches not found in my first run of cutadapt ?
Sorry for the long post and thank u for any suggestions
Cheers