Split large fastq based on value in description line
3.0 years ago
mmarcell • 0

I have a large fastq.gz file with around 2 million reads. The description line of each read contains additional metadata separated by whitespace. One of these metadata fields is "barcode=". I would like to split my fastq.gz into separate fastq.gz files, one per barcode value following "barcode=". Any suggestions on how to do this?

Thanks in advance!

fastq • 1.5k views

Can you show a couple of example reads?


I've created a dummy read set:

@a8d6db31-13ec-4521-8ad8-7d751712375f runid=9a84c5 read=24005 ch=42 start_time=2010-12-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode01 barcode_alias=barcode01
ACACAACACAACACAACACAAC
+
&%&'()130/83*)))(&$$$#
@ab880098-ef6d-4066-8b84-b890eb0d180d runid=9a84c5 read=23768 ch=88 start_time=2000-01-04T04:21:18Z flow_cell_id=XXX111 barcode=barcode02 barcode_alias=barcode02
ACACAACACAACACAACACAAC
+
'(+,*))*+7++3:.-..'$##
@6220e674-81a9-4aa6-a7e7-35e838979c3c runid=9a84c5 read=18764 ch=453 start_time=2000-01-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode04 barcode_alias=barcode04
ACACAACACAACACAACACAAC
+
%&%$##$$$&))*,72/,,,+-
@6220e674-81a9-4aa6-a7e7-35e838978c3c runid=9a84c5 read=18764 ch=453 start_time=2000-01-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode04 barcode_alias=barcode04
ACACAACACAACACAACACAAC
+
&'''();{{{C8{{{{<45{{4
@941c2264-faf5-43e2-bfd4-1ff8c869a637 runid=9a84c5 read=20530 ch=403 start_time=2000-01-04T04:21:27Z flow_cell_id=XXX111 barcode=unclassified barcode_alias=unclassified
ACACAACACAACACAACACAAC
+
()-)((()-+''&&&'%&%$%#
@7caabf66-cd11-4edc-be40-53dca3a4c878 runid=9a84c5 read=27412 ch=233 start_time=2000-01-04T04:21:23Z flow_cell_id=XXX111 barcode=barcode02 barcode_alias=barcode02
ACACAACACAACACAACACAAC
+
$%%'((()-,,760-)('%%%&
@33a645f8-0ae8-42e8-9246-36b72ccc42da runid=9a84c5 read=26823 ch=306 start_time=2000-01-04T04:21:25Z flow_cell_id=XXX111 barcode=barcode03 barcode_alias=barcode03
ACACAACACAACACAACACAAC
+
$&%#%&,-/67'%&&&%$$%.0
3.0 years ago

Here's a seqkit option.

seqkit split -i --id-regexp "barcode=([[:alnum:]]+)" test.fastq

If your fastq files are large, make sure to set the number of CPU threads with -j. For paired-end fastq files, see the split2 subcommand.
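Before splitting, it can help to check which barcode values are actually present and how many reads each one has, so you know how many output files to expect. This is only a rough sketch with standard tools: count_barcodes is a made-up helper name and reads.fastq.gz is a placeholder file name. It assumes four-line fastq records and an alphanumeric value after "barcode=", as in the example reads above.

```shell
# Sketch: count reads per barcode value in a gzipped fastq.
# count_barcodes is a hypothetical helper, not part of seqkit.
count_barcodes() {
    # Header lines are lines 1, 5, 9, ... of a four-line fastq.
    # Quality lines can also start with "@", so select headers by
    # position rather than by their first character.
    gzip -dc "$1" \
        | awk 'NR % 4 == 1' \
        | grep -oE 'barcode=[[:alnum:]]+' \
        | sort \
        | uniq -c
}
```

Call it as `count_barcodes reads.fastq.gz`; each output line is a read count followed by the matching "barcode=..." token.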


Thank you very much! It works very well!

3.0 years ago
GenoMax 147k

While @Bob's solution will work, there are dedicated packages (porechop, qcat and last) to demultiplex nanopore reads. They will also help with removing adapters.


Thank you! Although seqkit worked fine, I will surely look into your suggestions as well.

3.0 years ago
5heikki 11k
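zcat file.gz | mawk 'NR%4==1{OF=substr($7,9)}{print > (OF ".fq")}'

The output is uncompressed fastq files, though. Header lines are picked by position (NR%4==1) rather than by a leading "@", because quality lines can also start with "@".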
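If compressed output is needed, one possible variation is to have awk pipe each barcode's records through its own gzip process. This is a sketch, not a tested pipeline for real data: split_by_barcode, its two arguments and the output naming are made up for illustration. It assumes four-line fastq records, a modest number of distinct barcodes (one open pipe each), and an output directory path without quotes or other shell metacharacters. It looks for the "barcode=" token anywhere in the header instead of relying on it being field 7.

```shell
# Sketch: split a gzipped fastq into one gzipped file per barcode value.
# split_by_barcode and its arguments are hypothetical names.
split_by_barcode() {
    local in_gz="$1" out_dir="$2"
    mkdir -p "$out_dir"
    gzip -dc "$in_gz" | awk -v dir="$out_dir" '
        NR % 4 == 1 {
            # Header line: find the barcode=... token wherever it sits.
            bc = "unknown"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^barcode=/) { bc = substr($i, 9); break }
            # Keep the value safe to use as part of a file name.
            gsub(/[^A-Za-z0-9_.-]/, "_", bc)
            # One gzip pipe per barcode; awk reuses the pipe while it is open.
            cmd = "gzip -c > \"" dir "/" bc ".fastq.gz\""
            cmds[cmd] = 1
        }
        # All four lines of a record go to the pipe chosen by its header.
        { print | cmd }
        # Close every pipe so each gzip finishes writing before awk exits.
        END { for (c in cmds) close(c) }
    '
}
```

For example, `split_by_barcode file.gz split_out` would write split_out/barcode01.fastq.gz, split_out/unclassified.fastq.gz and so on. Reads whose header has no "barcode=" token end up in unknown.fastq.gz.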
