How to separate mixed-orientation raw Illumina sequences into forward and reverse FASTQ files?
6.1 years ago
will.wcb ▴ 20

Hey there,

I have received some paired-end Illumina MiSeq sequences in two files, R1 and R2. The problem is that each of these files contains a mix of forward and reverse reads. For instance:

R1

Sample1-seq1: barcode, forward primer, forward sequence

Sample1-seq2: reverse primer, reverse sequence

Sample2-seq3: barcode, forward primer, forward sequence

etc.

R2

Sample1-seq1: reverse primer, reverse sequence

Sample1-seq2: barcode, forward primer, forward sequence

Sample2-seq3: reverse primer, reverse sequence

etc.

How can I separate these into forward and reverse read files for use in QIIME2, for instance?

I will paste a few lines of each.

R1:

@D00420:195:HK5N5BCX2:2:1101:1200:2095 1:N:0:TGACCA
GGACTACGGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCTCCTCAGTGTCAGTTCCGGCCCAGAGCGCCGCCTTCGNNNNNNNNNNTCNNNNNNATANNNNNNNANNNNNNNNNNNNNNNNNNNNNTCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGACTTACTAAGCCACCTACGAGCTCTTTACGCCCAATAAATCC
+
GAGGGGGGGIIIIIIIIIIIGIIGIIIIGIIGGGGGIGIIIIIGIIIGGGGIIIGIIIIGGGGGGGGGGGGGGGGGIIIIG##########<<######<<<#######<#####################<<<##########################################################################777AGGGGAGAAGGGAGGGAAGGGGAGGGIIIG<.GA77AGAG
@D00420:195:HK5N5BCX2:2:1101:1327:2093 1:N:0:TGACCA
NCCTCCCTGTGTCAGCCGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGNNNNNNNNNNNTNNNNNNNAANN
+
#<GGGIGGIGGIIIIIIIIIIGGIIIIIIIIGIIGIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII###########<#######<<##
@D00420:195:HK5N5BCX2:2:1101:1946:2119 1:N:0:TGACCA
TCCTCGTAGTGTCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTCATGCAAGACAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGTGACTGCATGGCTGGAGTGCGGCAGAGGGGGATGGAATTCCGCGTGTANNNNNNNNNNNNNNAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGGCCTGCACTGACGCT
+
GAGAGGGGIAGGGGGGGAGGAGGGIGGGGIIGIGGGGIGGGGGGGGGGIGGGGGGGIIIIIIIIGGIGIGIIGIIGA<GAGGGAGGGIGGGGGGGGGGGGGGGGGGIIGG.GGGIIGGGGG.GGGIGAGGIGGAAGGGGGGGA.GGGGGIGIGGGGAGGGA<GIGAGAGGGIGIGGGGGGA##############.77AGGGGI.A.<<GGAGGGGGGGGIGG.<GGAGGIGGGI.77AGGGGAA7GGIG.
@D00420:195:HK5N5BCX2:2:1101:2658:2140 1:N:0:TGACCA
ACTACTAGGGTTTCTAATCCTGTTCGCTACCCACGCTTTCGCTCCTCAGCGTCAGGTAAGGCCCAGAGAGCCGCCTTCGCCACCGGTGTTCTTCCTGATATCTGCGCATTCCACCGCTACACCAGGAGTTCCGCTCTCCCCTGCCTACCTCTAGTCTGCCCGTATCGGAAGCAGGCTCGGAGTTAAGCTCCGAGTTTTCACTCCCGACGTGACGAACCGCCTACGAGCCCTTTACGCCCAATAATTCCGG
+
GGGGGGGIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIGGIIIGIIIIIGIIIGIIIIIIIIIGIIIIIIIIIGGIIIIIIIGIIIIGIIIIIIIIGGGGIIIIGGGIIGGIIIGGGGIGGIIIIGGIIIIGIIGGGIIGIGIIIIIIGIIIIIIIIIIGIGGGIIIIIIIAGGIGGGIIIIIIIIIIIIIIGIIIIIIIIIIGIIGGIIGIGIGIGGGGGGGGIIGGIIGGGIIIIIGIGGGGA
@D00420:195:HK5N5BCX2:2:1101:2763:2157 1:N:0:TGACCA
TCCGGCCGGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATCATTGGGCGTAAAGAGCGCGTAGGCGGCCCTGTAAGTCCGCTGTGAAAGTCAAGGGCTCAACCCTTGAATGCCGGTGGATACTGCAGGGCTAGAGTCCGGAAGAGGCGAGTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGCTCGCTGGGACGGTACTGACGCG
+
GGGGGIIGIIIIIIIIIIIIIIIIIGIIIIIIIIGIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGGGGGIIGIIIIIIIIIIIIGIIIIIIIGIIIIIIIIGIIIIIIIIIIIIIIIIGIIIIIGIIIIIIIIGIIIIIIIIIIIIIIIGGIIIIIIIIIIIIGIIIGIIIIIIIGIIIIIIIIIGIIIIIIIIIIIIIIGIIGGGGGIIGGG.

R2:

@D00420:195:HK5N5BCX2:2:1101:1200:2095 2:N:0:TGACCA
TCCTCCCTGTGCCAGCCGCCGCGGTAACACGTAGGGGGCA
+
GGGGGGIIIIIIIGGGG<GGIGGIIIIIGIIIIIIGIIGI
@D00420:195:HK5N5BCX2:2:1101:1327:2093 2:N:0:TGACCA
GGACTACGGGGGTTTCTAATCCTGTTTGCTCCCCACGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
AGAGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIG#####################################################################################################################################################################################################################
@D00420:195:HK5N5BCX2:2:1101:1946:2119 2:N:0:TGACCA
GGACTACAGGGGTTTCTAATCCTGTTTGCGCCCCACGCTTGCGTGCATGAGCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
GAAAGA.<A.G<<..<....<<<<G.A.A..<.<A..<...<AA..AAGGG.<A##########################################################################################################################
@D00420:195:HK5N5BCX2:2:1101:2658:2140 2:N:0:TGACCA
CGTGTCAGCAGTCGCGGTAATACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGNNNTCGTNNNNNGTTCNNNNCGTCNNNNGTGANNNCTCGGNGCTTNACTNNGAGCCTGCTTCCGATACGGGCAGACNAGAGGNAGGCAGGGGAGAGCGGAACTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGGCCTTACCTGACGCTGAGGAGG
+
GGAGGIIIIIGIIIIIIIIIIIIIGIIIIIIIIIIIIIGGGIIIIIIIGIIIIIIIIIIIII###<<GG#####<<GG####<<<G####<<GG###<<GGG#<<GG#<<G##<<AGGGGIIIIIIIIIIIIGGGIIG#<<GGG#<<GGGGGIIGGGGIGIGIIIIIIIIIIIGGIIIGGIIGGGGGGIGGGGGIIIIGGIGIGIIIIGGIIGGGGGGGGIA<GGGGGIIGGGAGAGGAAGGI.AGGIAG.
@D00420:195:HK5N5BCX2:2:1101:2763:2157 2:N:0:TGACCA
GGACTACACGGGTTTCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTACCGTCCCAGCGAGCTGCCTTCGCCATCGGTGTTCCTCCTGATATCTGCGCATTTCACCGCTACACCAGGAATTCCACTCGCCTCTTCCGGACTCTAGCCCTGCAGTATCCACCGGCATTCAAGGGTTGAGCCCTTGACTTTCACAGCGGACTTACAGGGCCGCCTACGCGCTCTTTACGCCCAATGATTCCG
+
AGGGGIIIIIIGGIIGIIIIIGGGIIIIIIIIIIIGIIIIIIIIIGGGIIGGIIIIIIIIGIIIGGGGIIIIIIIIIIGIIIGIIGGGIIGGGGGIGGGGGAGGGGGGIGIGIIIIGIGIIIIIIIGGIIGIGIGIGGGIIIIIGGGGGIIIIIIGGIIIGIIIIIIIGGIIGGGGIGIGGIIIGIIIIGGGIGIGGGGIGGIIIIGIGG<GGGGGGIIGGGAGAGGGGIGGGGIGGIIIIIAGGGGGGA.

Thank you very much for any help, even just to point me in the right direction.

sequence next-gen • 2.2k views
6.1 years ago
n,n ▴ 370

I am a little confused, since the example lines you posted look like the expected FASTQ output of a normal paired-end Illumina run: every sequence header in R1 carries 1:N:0 and every header in R2 carries 2:N:0, meaning forward reads are correctly placed in R1 and reverse reads in R2. However, if you want to make sure all reads are correctly placed, you can do the following:

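# Join each 4-line record onto one line, keep records whose header comment
# gives the wanted read number, then split the records back into 4 lines.
zcat R?.fastq.gz | paste - - - - | grep -E '^@[^ ]+ 1:[YN]:' | tr '\t' '\n' > correct_R1.fastq
zcat R?.fastq.gz | paste - - - - | grep -E '^@[^ ]+ 2:[YN]:' | tr '\t' '\n' > correct_R2.fastq
gzip correct_R1.fastq correct_R2.fastq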

Replace the zcat argument with the names of your actual files. The '?' stands in for the read number, so that both files are processed in one pass.
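
Before rewriting anything, it may be worth confirming what the headers actually say. The following is a minimal sketch, not part of the commands above: count_read_numbers is a hypothetical helper name, and it assumes 4-line FASTQ records with Illumina-style headers in which the read number is the first colon-separated field after the space.

```shell
# Hypothetical helper: tally the read-number field of every FASTQ header.
# Reads uncompressed FASTQ on stdin and assumes 4-line records with headers
# of the form "@<id> <read>:<filtered>:<control>:<index>".
count_read_numbers() {
    awk 'NR % 4 == 1 { split($2, f, ":"); n[f[1]]++ }
         END { for (r in n) print r, n[r] }' | sort
}

# Usage (file name is a placeholder):
#   zcat R1.fastq.gz | count_read_numbers
```

Each output line is a read number followed by how many headers carry it. A file that holds only read 1 records should produce a single line starting with 1.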

Hope this helps.

Thank you very much. I thought this to be the case as well, but I am inexperienced, and the documentation provided by the sequencing company instructed us that they would be mixed. I suppose that was out of date or incorrect. Really appreciate the help.
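
If the company's claim is worth double-checking, note that the header read number says nothing about amplicon orientation, so one rough test is to count how many reads in each file begin with a primer motif. This is only a sketch under assumptions: count_leading_motif is a hypothetical helper, the motif GGACTAC is copied from the start of several reads pasted in the question, and which primer it belongs to is a guess rather than something stated in the thread. Exact prefix matching is also crude, since it misses reads with a barcode, a leading N, or a truncated primer in front.

```shell
# Hypothetical helper: count reads that begin with a given motif.
# Reads uncompressed FASTQ (4-line records) on stdin; $1 is the motif.
count_leading_motif() {
    awk -v motif="$1" '
        NR % 4 == 2 { total++; if (index($0, motif) == 1) hits++ }
        END { printf "%d of %d reads start with %s\n", hits, total, motif }'
}

# Usage (file names are placeholders; GGACTAC is an assumed primer prefix):
#   zcat R1.fastq.gz | count_leading_motif GGACTAC
#   zcat R2.fastq.gz | count_leading_motif GGACTAC
```

If a sizeable share of reads in both files starts with the same primer motif, the files do contain both orientations, and a primer-aware tool would be needed to reorient them.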
