Split Multimer Sequences at motif in FASTQ
3.7 years ago
schmau ▴ 10

Hi,

I have a FASTQ file containing PCR multimers. I need to split each sequence at a known primer sequence, much like demultiplexing, but I want the primer to remain in the output:

My input looks like this (primer set off by spaces):

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0

GGGTCAGTAGCGGAC GGGAACTGCATCACGCAATACGACTCACTATA GGGTCAGTAGCGGAC......

+

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.......

output:

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0:1

GGGTCAGTAGCGGAC GGGAACTGCATCACGCAATACGACTCACTATA

+

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0:2

GGGTCAGTAGCGGAC......

+

FFFFFFFFFFFFFFFFFFFFFF.......

Thanks a lot!

sequencing next-gen fastq split motif • 919 views

with cutadapt:

input:

$ cat test.fq                                                                                                                                                                        

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1102:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1103:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCACGTCAGCTGGCGACTCACTATAGGGTCAGTAGCGGAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

output:

$ cutadapt -e 0 -g GGGTCAGTAGCGGAC...GGGTCAGTAGCGGAC  --action=retain --discard-untrimmed --quiet test.fq                                                                            

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1102:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

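Updated to remove the primer at the end. Note that -u -17 removes 17 bases from the 3' end, two more than the 15-base primer; -u -15 would keep the full sequence up to the primer: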

$ cutadapt -e 0  -g GGGTCAGTAGCGGAC...GGGTCAGTAGCGGAC  --action=retain --discard-untrimmed  --quiet test.fq | cutadapt -u -17 - --quiet

@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1102:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

-e 0 sets the allowed error rate to zero. The third read is discarded because its 3' primer differs from the expected sequence by one base.
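To see the mismatch in the third read, one can compare the last 15 bases of each read in test.fq with the primer. This is only a minimal sketch: the helper name `mismatches` is made up for illustration, and it is a plain Hamming distance, not cutadapt's actual alignment routine.

```python
PRIMER = "GGGTCAGTAGCGGAC"  # primer from the question

def mismatches(a, b):
    """Count positions where two equal-length strings differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# The three reads from test.fq, keyed by the tile number in their headers.
reads = {
    "1101": "GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC",
    "1102": "GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC",
    "1103": "GGGTCAGTAGCGGACGGGAACTGCACGTCAGCTGGCGACTCACTATAGGGTCAGTAGCGGAT",
}

# Compare the 3' end of every read with the primer.
tail_mismatches = {
    name: mismatches(seq[-len(PRIMER):], PRIMER) for name, seq in reads.items()
}
print(tail_mismatches)
```

Only the read from tile 1103 shows a nonzero count, which is consistent with it being dropped by --discard-untrimmed when no errors are allowed.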

3.7 years ago
GenoMax 147k

If I understand what you want, then the following should work. It uses bbduk.sh from the BBMap suite.

$ more test.fq
@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1102:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1103:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCACGTCAGCTGGCGACTCACTATAGGGTCAGTAGCGGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Do the trimming.

$ bbduk.sh -Xmx2g in=test.fq outu=stdout.fq literal=GGGTCAGTAGCGGAC ktrim=r restrictright=20 -da k=7

Version 38.35

0.020 seconds.
Initial:
Memory: max=2147m, total=2147m, free=2129m, used=18m

Added 9 kmers; time:    0.005 seconds.
Memory: max=2147m, total=2147m, free=2126m, used=21m

Input is being processed as unpaired
Started output streams: 0.012 seconds.
@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCATCACGCAATACGACTCACTATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1102:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAAACGTCGCACGCAATACGACTCACTATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00877:568:HVV57DSXY:4:1103:27724:1408 1:N:0
GGGTCAGTAGCGGACGGGAACTGCACGTCAGCTGGCGACTCACTATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Processing time:        0.005 seconds.

Input:                      3 reads         186 bases.
KTrimmed:                   3 reads (100.00%)   45 bases (24.19%)
Total Removed:              0 reads (0.00%)     45 bases (24.19%)
Result:                     3 reads (100.00%)   141 bases (75.81%)

Time:                           0.025 seconds.
Reads Processed:           3    0.12k reads/sec
Bases Processed:         186    0.01m bases/sec

Change stdout.fq to a file name to write the result out to a file instead of STDOUT. Use in1= in2= outu1= outu2= if you have paired-end data.
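If the numbered headers from the question are needed (one record per fragment, primer kept at the start of each), the split can also be sketched in plain Python. This is a minimal sketch under a few assumptions: exact primer matches only, unwrapped 4-line FASTQ records, and the function names `split_at_primer` and `split_fastq` are made up for illustration.

```python
import io
import re

PRIMER = "GGGTCAGTAGCGGAC"  # primer from the question

def split_at_primer(seq, qual, primer=PRIMER):
    """Cut seq and qual in front of every exact primer match; primers are kept."""
    cuts = [m.start() for m in re.finditer(re.escape(primer), seq)]
    if not cuts or cuts[0] != 0:
        cuts.insert(0, 0)  # keep anything before the first primer as its own piece
    cuts.append(len(seq))
    return [(seq[a:b], qual[a:b]) for a, b in zip(cuts, cuts[1:])]

def split_fastq(handle, primer=PRIMER):
    """Yield one FASTQ record per fragment (assumes plain 4-line records)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        pieces = split_at_primer(seq, qual, primer)
        for n, (s, q) in enumerate(pieces, start=1):
            # Append the fragment number to the header, as in the question.
            yield f"{header}:{n}\n{s}\n+\n{q}\n"

# First read of test.fq, built as primer + insert + primer.
insert = "GGGAACTGCATCACGCAATACGACTCACTATA"
read = PRIMER + insert + PRIMER
example = (
    "@A00877:568:HVV57DSXY:4:1101:27724:1408 1:N:0\n"
    f"{read}\n+\n{'F' * len(read)}\n"
)

records = list(split_fastq(io.StringIO(example)))
print("".join(records), end="")
```

For a real file, replace the io.StringIO example with an open file handle and write the yielded records to an output file. Reads without any primer match pass through as a single fragment numbered 1.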
