Question

Split large fastq based on value in description line

3.0 years ago

mmarcell • 0

I have a large fastq.gz file with around 2 million reads. The description line of each read contains additional metadata fields separated by whitespace, one of which is "barcode=". I would like to split my fastq.gz into separate fastq.gz files based on the barcode value following "barcode=". Any suggestions on how to do this?

Thanks in advance!

fastq • 1.5k views

Can you show a couple of example reads?

Reply • 3.0 years ago by GenoMax 147k

I've created a dummy read set:

@a8d6db31-13ec-4521-8ad8-7d751712375f runid=9a84c5 read=24005 ch=42 start_time=2010-12-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode01 barcode_alias=barcode01
ACACAACACAACACAACACAAC
+
&%&'()130/83*)))(&$$$#
@ab880098-ef6d-4066-8b84-b890eb0d180d runid=9a84c5 read=23768 ch=88 start_time=2000-01-04T04:21:18Z flow_cell_id=XXX111 barcode=barcode02 barcode_alias=barcode02
ACACAACACAACACAACACAAC
+
'(+,*))*+7++3:.-..'$##
@6220e674-81a9-4aa6-a7e7-35e838979c3c runid=9a84c5 read=18764 ch=453 start_time=2000-01-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode04 barcode_alias=barcode04
ACACAACACAACACAACACAAC
+
%&%$##$$$&))*,72/,,,+-
@6220e674-81a9-4aa6-a7e7-35e838978c3c runid=9a84c5 read=18764 ch=453 start_time=2000-01-04T04:21:24Z flow_cell_id=XXX111 barcode=barcode04 barcode_alias=barcode04
ACACAACACAACACAACACAAC
+
&'''();{{{C8{{{{<45{{4
@941c2264-faf5-43e2-bfd4-1ff8c869a637 runid=9a84c5 read=20530 ch=403 start_time=2000-01-04T04:21:27Z flow_cell_id=XXX111 barcode=unclassified barcode_alias=unclassified
ACACAACACAACACAACACAAC
+
()-)((()-+''&&&'%&%$%#
@7caabf66-cd11-4edc-be40-53dca3a4c878 runid=9a84c5 read=27412 ch=233 start_time=2000-01-04T04:21:23Z flow_cell_id=XXX111 barcode=barcode02 barcode_alias=barcode02
ACACAACACAACACAACACAAC
+
$%%'((()-,,760-)('%%%&
@33a645f8-0ae8-42e8-9246-36b72ccc42da runid=9a84c5 read=26823 ch=306 start_time=2000-01-04T04:21:25Z flow_cell_id=XXX111 barcode=barcode03 barcode_alias=barcode03
ACACAACACAACACAACACAAC
+
$&%#%&,-/67'%&&&%$$%.0

Reply • 3.0 years ago by mmarcell • 0

score 2 · Answer 1 · 2021-12-17

3.0 years ago

rpolicastro 13k

Here's a seqkit option.

seqkit split -i --id-regexp "barcode=([[:alnum:]]+)" test.fastq

If your fastq files are large, set the number of CPU threads with -j. For paired-end fastq files, see the split2 subcommand.

Thank you very much! It works very well!

Reply • 3.0 years ago by mmarcell • 0

score 1 · Answer 2 · 2021-12-17

3.0 years ago

GenoMax 147k

While @rpolicastro's solution will work, there are dedicated packages (porechop, qcat and last) for demultiplexing nanopore reads. They can also help with removing adapters.

Thank you! Although seqkit worked fine I will surely look into your suggestions as well.

Reply • 3.0 years ago by mmarcell • 0

score 1 · Answer 3 · 2021-12-17

3.0 years ago

5heikki 11k

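zcat file.gz | mawk 'NR%4==1{OF=substr($7,9)".fq"}{print > OF}'

This takes the barcode from the 7th whitespace-separated field of each header line. Headers are identified as every fourth line rather than by a leading "@", because a quality line can also start with "@". The output is uncompressed fastq files, though.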
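
If compressed output is needed, the same idea can be extended by piping each barcode's reads through gzip from within awk. What follows is only a sketch under a few assumptions: records are strictly four lines each (no wrapped sequences), the number of distinct barcodes is small enough that one gzip process per barcode stays within open-file limits, and the output directory path contains no characters that are special inside shell double quotes. The function name split_by_barcode, the "no_barcode" fallback and the output file naming are illustrative choices, not part of any tool mentioned in this thread.

```shell
# Illustrative helper (hypothetical name): split_by_barcode INPUT.fastq.gz OUTDIR
split_by_barcode() {
    in_file="$1"
    out_dir="$2"
    mkdir -p "$out_dir" || return 1
    gzip -dc "$in_file" | awk -v out_dir="$out_dir" '
        # Assumes 4-line FASTQ records, so headers are lines 1, 5, 9, ...
        NR % 4 == 1 {
            bc = "no_barcode"    # fallback if the header has no barcode= field
            for (i = 2; i <= NF; i++) {
                # Search by key instead of relying on a fixed field position.
                if ($i ~ /^barcode=/) { bc = substr($i, 9); break }
            }
            gsub(/[^A-Za-z0-9_.-]/, "_", bc)    # keep the file name shell-safe
            cmd = "gzip > \"" out_dir "/" bc ".fastq.gz\""
        }
        # Every line of the record goes to the gzip pipe of the current barcode.
        { print | cmd }
    '
}
```

For example, `split_by_barcode reads.fastq.gz split_out` would write split_out/barcode01.fastq.gz, split_out/unclassified.fastq.gz and so on for the dummy read set above. awk closes the gzip pipes when it exits, so the output files are complete once the function returns.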
