Hi,
Recently we sequenced the V3-V4 region of 16S for some samples, with half of the amplicons carrying switched adapters.
We used the forward primer 5' CCTACGGGNGGCWGCAG 3' and the reverse primer 5' GACTACHVGGGTATCTAATCC 3'.
So now we have both forward and reverse reads in both the R1 and R2 files.
For example, *R1.fastq:
@HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTCTGAAGAAGGCCTTCGGGTTGTAAAGGACTTTTGTCAGGGAAGAAAAGGGCGGGGTTAATACCCCTGTCTGATGACGGTACCTGAAGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCACGCAGGCGGTG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIHIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIHIIIIIIIIIIIIIIIIHHIIIIIGHHHGIIIICHII
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCTGCAGTGAGGAATATTGGTCAATGGGCGGGAGCCTGAACCAGCCAAGTAGCGTGCAGGATGACGGCCCTATGGGTTGTAAACTGCTTTTATGCGGGGATAAAGTGAGGGACGTGTCCTTCATTGCAGGTACCGCATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCTGGAGATTA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIHIIIIIIIIIIHIIIEHEHIIIIIIIIIIHIHHG@
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 1:N:0:TCTAGACTCGTCGCTA
CCTACGGGGGGCTGCAGTGGGGAATATTGCGCAATGGGGGCAACCCTGACGCAGCCATGCCGCGTGAATGAAGAAGGCCTTCGGGTTGTAAAGTTCTTTCGGTAGCGAGGAAGGCATTTAGTTTAATAAACTAAGTGATTGACGTTAACTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTTCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGG
+
DDDDDIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIEHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHIIHIIIIIIIIIIIIIIIIIIHIIIIIIIIIHIIIIHIIIIIIIIIIIIIIIIHIIIIIIIHHIIIIIIIIIIIIIGIIIIIHEIIIIIIIIEHIFHHHIIIIIIIGHIIIIH#
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 1:N:0:TCTAGACTCGTCGCTA
GACTACTAGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGTCATATCTATGCATTCCACCGCTACACCACGCATTCCGCCTACCTCATCTACACTCAAGCCCGCCAGTATCAATGGCAATTTAGGAGTTAAGCTCCTAGATTTCACCGCTGACTTAACAGGCCGCCTACGCACCCTTTAAACCCAATAAATCCG
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIHIIIIIHIDGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIHHIIIIIIIIGIIIIIIIHHHIIIIIGIHIIIIIIIIIIIIIIIIIIHHIIIIIHIIIIIIIIIFHIGHHHIHIIHIIE
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 1:N:0:TCTAGACTCGTCGCTA
GACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGGGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGCGACCCAGTATGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAATCACCGCCTGCGCGCCCTTTACGCCCAGTTATTCCG
And the same for *R2.fastq:
@HWI-1KL166:431:HWKJKBCXY:1:1101:5020:2038 2:N:0:TCTAGACTCGTCGCTA
GACTACCCGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGGGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCATTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGCGACCCAGTATGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAGTCACCGCCTGCGTGCCCTTTACGCCCAGTTATTCCGATTAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIICHIIIIIIIIIIIIIHIIIIIIIIIIHIIHIIIIHHIIIIIIIIIIIIIIIIIIIHHHIGIHHII-8DHHIEDHIIHIIIIHIGFHHIIHH?GHHIICHHI@
@HWI-1KL166:431:HWKJKBCXY:1:1101:6494:2219 2:N:0:TCTAGACTCGTCGCTA
GACTACTCGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGACATATCTAAGCATTTCACCGCTACACGACGAATTCCGCCAACGTTGTGCGTACTCAAGGAAACCAGTATGCGCTGCAATTCAGACGTTGAGCGTCTACATTTCACAACACACTTAATCTCCAGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIHIIIIIIIH1EHHIFHIIIIIIIIIIIIIIIIIIIIIIIIIIHIHIIIIIIIHIIIICHIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIHIIHHEHFHIIHHIIIIHHHEHHHIIGHIIIHEHHCHHIH
@HWI-1KL166:431:HWKJKBCXY:1:1101:11712:2422 2:N:0:TCTAGACTCGTCGCTA
GACTACCGGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTCAGCGTCAGTTGCGCTCCCGTCAGCTGCCTTCGCAATCGGAGTTCTTCGTCATATCTAAGCATTTCACCGCTACACGACGAATTCCGCCAACGTTGTGCGTACTCAAGGAAACCAGTATGCGCTGCAAGTCAGACGTTGAGCGTCTACATTTCACAACACACTTAATCTCCGGCCTACGCTCCCTTTAAACCCAATAAATCCGGATAA
+
DDDDDIIIIIIIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFGHIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIHIIHIIIIIIIIIIGIIIIIIGIIIGHHIIIIIGIIHHHIHFHIIIIIIIHIIIIIIGHHHHIIFHIIIGHIIIII@F@GHHIIHIIFHHHHHEEF
@HWI-1KL166:431:HWKJKBCXY:1:1101:14311:2424 2:N:0:TCTAGACTCGTCGCTA
CCTACGGGTGGCTGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGATGACTGTCTTATGGATTGTAAACTTCTTTTATACGGGAATAACAAGAGCCACGTGTGACTCCCTGCATGTACCGTATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGTTAAGTCC
+
ADDDDIIIHIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIHIIIIIIIIIIIIIIIDGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIHIIIHHHHIIIIIHIIIIIGIIHIGIIIIIIIIHGHHHIHIIIHIG?>CHHHIIDHIIIIGEFEHHGIHH@HCHHII?HHHFIHGHIIGHDCGH6@FH?6F#
@HWI-1KL166:431:HWKJKBCXY:1:1101:8278:3000 2:N:0:TCTAGACTCGTCGCTA
CCTACGGGTGGCTGCAGTGGGGAATATTGCGCAATGGGGGGAACCCTGACGCAGCCATGCCGCGTGAATGAAGAAGGCCTTCGGGTTGTAAAGTTCTTTCGGTATTGAGGAAGGAGTGTATGTTAATAGCATACATTATTGACGTTAAATACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATAACTGGGCGTAAAGGGCGCGCAGGCGGTGATTTA
+
Here the 4th and 5th records are actually reverse reads in R1 and forward reads in R2, respectively.
Now the challenging part is to separate the forward and reverse reads within each file. I have tried to extract the sequences matching the primer sequences (even using just the first few bases of each primer):
grep -A 2 -B 1 'CCTA' *L001_R1_001.fastq | sed '/^--/d' > out_R1.fq
grep -A 2 -B 1 'GACT' *L001_R1_001.fastq | sed '/^--/d' > out_R2.fq
But the two output files ended up with unequal numbers of sequences. The problem here is the mismatches in the primers.
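One way around this might be to test only the sequence line of each record, and to write each primer as a regular expression with the degenerate bases expanded, anchored to the start of the read. A short pattern such as 'CCTA' can also hit header or quality lines, or the middle of a read. The following is a minimal sketch, assuming plain 4-line FASTQ records and primers sitting at the very start of each read; the function name `count_primer_starts` and the file name in the usage line are made up for illustration.

```shell
# IUPAC codes expanded by hand: N=[ACGT], W=[AT], H=[ACT], V=[ACG]
FWD_RE='^CCTACGGG[ACGT]GGC[AT]GCAG'
REV_RE='^GACTAC[ACT][ACG]GGGTATCTAATCC'

# Count reads that start with the forward primer, the reverse primer,
# or neither. Only line 2 of every 4-line record is tested.
count_primer_starts() {
    awk -v fwd="$FWD_RE" -v rev="$REV_RE" '
        NR % 4 == 2 {
            if ($0 ~ fwd) f++
            else if ($0 ~ rev) r++
            else other++
        }
        END { printf "fwd=%d rev=%d other=%d\n", f, r, other }
    ' "$1"
}

# Usage (hypothetical file name):
# count_primer_starts sample_L001_R1_001.fastq
```

This is an exact match on the primer, so a read with a sequencing error inside the primer region is counted under "other".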
I have also tried bbduk.sh and seqtk: I extracted the sequences matching CCTA from the R1 file, then saved the sequence IDs so I could pull out the reads with the same names. But that didn't work for me.
cat out_R1.fq | awk '{if(NR%4==1) print ($0)}' > R1_id.txt
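One likely reason the ID lookup fails: the header line saved from R1 ends in " 1:N:0:..." while its mate in R2 ends in " 2:N:0:...", so full header lines never match between the two files. A sketch that keeps only the read name (the part before the first space, without the leading "@") is shown here. The function name `ids_matching` and the file names in the usage lines are hypothetical.

```shell
# Print the read name of every record whose sequence line matches the
# regular expression given as the second argument.
ids_matching() {
    awk -v re="$2" '
        NR % 4 == 1 { split($0, h, " "); name = substr(h[1], 2) }  # drop "@" and the comment
        NR % 4 == 2 && $0 ~ re { print name }
    ' "$1"
}

# Usage (hypothetical file names): names of R1 reads that start with the
# reverse primer, then their mates pulled from R2 with seqtk.
# ids_matching sample_R1.fastq '^GACTAC[ACT][ACG]GGGTATCTAATCC' > rev_in_R1.ids
# seqtk subseq sample_R2.fastq rev_in_R1.ids > mates_of_rev_R1.fastq
```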
With bbduk.sh, I used a ref file containing the primer bases:
bbduk.sh in=*_L001_R1_001.fastq out=sep_r1.fastq ref=ref.txt ow=t
But it didn't write anything to the output file, and the same happened with seqtk. I would appreciate any help with this.
So, in short, I want to separate the reverse reads from R1 and the corresponding forward reads from R2.
Apologies if I made it confusing :(
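The split described in the question could be sketched by reading R1 and R2 in lockstep and sorting each pair by which primer starts each mate, so the output files always stay in sync. This is only a sketch under stated assumptions: both files are 4-line FASTQ with the mates in the same order, the primers sit at the start of the reads, and the match is exact apart from the degenerate positions. The function name `split_by_orientation` and the output naming scheme are invented for illustration.

```shell
FWD_RE='^CCTACGGG[ACGT]GGC[AT]GCAG'       # N=[ACGT], W=[AT]
REV_RE='^GACTAC[ACT][ACG]GGGTATCTAATCC'   # H=[ACT], V=[ACG]

# split_by_orientation R1.fastq R2.fastq PREFIX
# normal:     R1 starts with the forward primer, R2 with the reverse primer
# switched:   R1 starts with the reverse primer, R2 with the forward primer
# unassigned: anything else
split_by_orientation() {
    awk -v r2="$2" -v p="$3" -v fwd="$FWD_RE" -v rev="$REV_RE" '
        {
            a[NR % 4] = $0                      # current line of R1
            if ((getline mate < r2) <= 0) {     # same line number from R2
                print "R2 has fewer lines than R1" > "/dev/stderr"
                bad = 1
                exit 1
            }
            b[NR % 4] = mate
        }
        NR % 4 == 0 {                           # a full record of each mate is buffered
            if      (a[2] ~ fwd && b[2] ~ rev) kind = "normal"
            else if (a[2] ~ rev && b[2] ~ fwd) kind = "switched"
            else                               kind = "unassigned"
            n[kind]++
            printf "%s\n%s\n%s\n%s\n", a[1], a[2], a[3], a[0] > (p "." kind "_R1.fastq")
            printf "%s\n%s\n%s\n%s\n", b[1], b[2], b[3], b[0] > (p "." kind "_R2.fastq")
        }
        END {
            if (!bad) printf "normal=%d switched=%d unassigned=%d\n", n["normal"], n["switched"], n["unassigned"]
        }
    ' "$1"
}

# Usage (hypothetical file names):
# split_by_orientation sample_R1.fastq sample_R2.fastq sample
```

Because whole pairs are written together, the R1 and R2 outputs of each class always contain the same number of records. Pairs with a sequencing error inside a primer end up in the "unassigned" files and would need a mismatch-tolerant tool to recover.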
Typically you have to align a read to a reference genome to know whether it's reverse or forward. And I cannot understand why you want to separate them in FASTQ.
Thank you for the comment. We tried the switch-tail (adapter-switched) method and want to know how it works, and whether it introduces any bias.