Question

Sorting (big) FASTQ file

1

20 months ago

mathieu.bahin ▴ 90

Hi,

I'm trying to sort a big FASTQ file by read name, which doesn't seem like an exotic thing to do. I managed it with seqkit, but that doesn't scale to a big file (it crashes for lack of memory). Many approaches ranked

M08001:52:000000000-DJRKT:1:1101:10000:10368

before

M08001:52:000000000-DJRKT:1:1101:1720:15314

although I have to match this with the corresponding BAM, which was "correctly" sorted by samtools sort. My guess is that, since 10000 has one more digit than 1720, it comes first (probably because a digit sorts before a colon). I got this result with a bash solution based on sort and with BBMap, for example. I could code it myself (e.g. by sorting on each number between the colons), but I'm pretty astonished that this doesn't already exist. Any hints?

Cheers, Mathieu

sort FASTQ • 2.5k views

ADD COMMENT • link updated 20 months ago by ATpoint 86k • written 20 months ago by mathieu.bahin ▴ 90

1

What is the use case for doing this? Perhaps we can suggest an alternative. Are you trying to filter reads? If so, filterbyname.sh from BBMap would be the way to go.

ADD REPLY • link 20 months ago by GenoMax 148k

0

I'm picking info from the sequence in the FASTQ (the beginning of the sequence, since we are doing something particular and this part was removed before aligning) and info from the BAM tags produced by the STAR aligner (CR, UR, GX). My code worked for MiSeq sequencing but crashes on NovaSeq runs (memory). So I'm chunking the FASTQ and the BAM, but the chunks have to match, so I first have to sort the BAM and the FASTQ the same way.

ADD REPLY • link 20 months ago by mathieu.bahin ▴ 90

0

Is the BAM file being chunked first? If you have that file, then sorting it with samtools would give you the read names, and then it would be a matter of extracting them from the FASTQ file.

ADD REPLY • link 20 months ago by GenoMax 148k

0

What you would ideally do is sort the FASTQ file based on the order of the names in the BAM file. To my knowledge there isn't a command-line program that sorts a FASTQ file based on the order of names in a separate text file, but this shouldn't be too bad to do in Python.
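For what it's worth, the same idea can also be sketched with standard shell tools instead of Python. This is only a rough sketch under a few assumptions: `reorder_fastq_by_names` is a made-up helper name, the names file holds one read name per line in the desired order (without the leading `@`), the FASTQ is uncompressed with exactly four lines per record, and `sort` supports `-s`. Only the name-to-rank table is kept in memory; the reads themselves are streamed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reorder_fastq_by_names NAMES FASTQ
#   NAMES : text file, one read name per line, in the desired order
#   FASTQ : uncompressed FASTQ with exactly 4 lines per record
# Reads whose name is not listed in NAMES are dropped.
reorder_fastq_by_names() {
  local names=$1 fastq=$2
  paste - - - - < "$fastq" |
    awk -F'\t' -v OFS='\t' '
      # First input (NAMES): remember the first position of each name.
      NR == FNR { if (!($1 in rank)) rank[$1] = FNR; next }
      # Second input (linearized FASTQ on stdin): strip the leading "@"
      # and any comment after the first space, then prefix the rank.
      {
        name = substr($1, 2)
        sub(/ .*/, "", name)
        if (name in rank) print rank[name], $0
      }' "$names" - |
    LC_ALL=C sort -s -t $'\t' -k1,1n |   # order records by rank, keep ties stable
    cut -f 2- |                          # drop the rank column again
    tr '\t' '\n'                         # back to 4 lines per record
}
```

Usage would be something like `reorder_fastq_by_names names.txt in.fastq > reordered.fastq`. If the names file lists a name more than once, only its first position is used.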

ADD REPLY • link 20 months ago by rpolicastro 13k

0

Thanks. I thought about this option and gave it a try with seqtk but, unfortunately, it picked the sequences in the original FASTQ order (it is really a tool for extracting subsequences, but I tried to use it to reorder). But maybe there is another tool or, as you said, I could do it myself.

ADD REPLY • link 20 months ago by mathieu.bahin ▴ 90

2

20 months ago

mathieu.bahin ▴ 90

Hi all,

Thanks for all your quick answers. I was finally able to sort the FASTQ file in "human" order thanks to Pierre Lindenbaum's bash command. I didn't know about the sort "V" option, which does exactly what I need!

And thanks to Matthias Zepper for pointing out the reason for what happened in the first place; I hadn't identified it exactly.

Cheers, Mathieu

ADD COMMENT • link 20 months ago by mathieu.bahin ▴ 90

0

Great, thanks for following up! I moved Pierre Lindenbaum's comment to an answer; please accept it (green checkmark symbol) if it solved the issue.

ADD REPLY • link 20 months ago by ATpoint 86k

1

20 months ago

Matthias Zepper 5.0k

One can definitely say that putting the sequence of

M08001:52:000000000-DJRKT:1:1101:10000:10368

before

M08001:52:000000000-DJRKT:1:1101:1720:15314

is comprehensible. The two strings are compared character by character until the first mismatch (here the second digit of 10000 versus the second digit of 1720), and the tie is then resolved by putting 0 before 7.

How exactly two strings are ordered is subject to your locale settings, which you can change globally or set for each invocation by prepending the setting to the command, e.g. LC_COLLATE=C mycommand.sh.
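To make this concrete, here is a minimal sketch that runs the two read names from the question through a byte-wise sort and then through a version sort. It assumes GNU coreutils `sort` (the `-V` option is a GNU extension); the helper names `names`, `bytewise_sort` and `version_sort` are only illustrative.

```shell
#!/usr/bin/env bash
# The two read names from the question, in the order the asker expects.
names() {
  printf '%s\n' \
    'M08001:52:000000000-DJRKT:1:1101:1720:15314' \
    'M08001:52:000000000-DJRKT:1:1101:10000:10368'
}

# Byte-wise comparison: the names are identical up to "...1101:1",
# then '0' (from 10000) sorts before '7' (from 1720).
bytewise_sort() { LC_ALL=C sort; }

# Version sort (GNU sort -V): runs of digits are compared as numbers,
# so 1720 sorts before 10000.
version_sort() { LC_ALL=C sort -V; }

echo "byte-wise:"; names | bytewise_sort
echo "version:";   names | version_sort
```

The byte-wise run lists the 10000 read first, while the version sort lists the 1720 read first.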

I think you can achieve your desired sort order by experimenting with these settings:

LC_CTYPE, which determines which characters are letters, digits, whitespace or punctuation, as this differs between languages.
LC_COLLATE, the collation order, which determines how strings are compared and sorted.

To you the sort order seems odd, because you know that this string represents an ID composed of machine ID, flowcell ID, tile, coordinates etc., but for the name-sorting algorithm it is just one long string.

To change that, you would need to provide a custom LC_CTYPE in which the colon is classified as a space character, and an LC_COLLATE in which it is always collated after any numeric digit. However, I have never created a locale, so I can't help you with this. Check /usr/share/locale/ or /usr/share/i18n/locales/ on your system to see what those files need to look like.

ADD COMMENT • link 20 months ago by Matthias Zepper 5.0k

score 6 · Accepted Answer · 2023-04-28

6

20 months ago

Pierre Lindenbaum 164k

What about 'just' using sort? Something like:

gunzip -c in.fastq.gz | paste - - - - | awk -f script_to_add_a_sorting_column.awk | LC_ALL=C sort -t $'\t'  --buffer-size=XXXXXX -T /path/to/TMP -k1,1V | cut -f 2- | tr "\t" "\n" | gzip > out.fastq.gz
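The awk script is not spelled out above, so here is one possible reading of the whole pipeline as a sketch. Everything specific in it is an assumption: the helper name `sort_fastq_by_name`, the inline awk step standing in for `script_to_add_a_sorting_column.awk` (it uses the read name without the leading `@` and without the comment as the sort key), and GNU `sort` for `-V`. The buffer size is left at sort's default; `--buffer-size` can be added as in the command above. Whether `-V` reproduces samtools' name order in every corner case is not something this sketch checks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort_fastq_by_name IN.fastq.gz OUT.fastq.gz [TMPDIR]
sort_fastq_by_name() {
  local src=$1 dst=$2 tmpdir=${3:-${TMPDIR:-/tmp}}
  gunzip -c "$src" |
    paste - - - - |                     # one record per line, 4 tab-separated fields
    awk -F'\t' -v OFS='\t' '{
      key = substr($1, 2)               # drop the leading "@"
      sub(/ .*/, "", key)               # drop any comment after the name
      print key, $0                     # prepend the key as column 1
    }' |
    LC_ALL=C sort -t $'\t' -T "$tmpdir" -k1,1V |   # version sort on the key
    cut -f 2- |                         # remove the key column again
    tr '\t' '\n' |                      # back to 4 lines per record
    gzip > "$dst"
}
```

Called as `sort_fastq_by_name in.fastq.gz out.fastq.gz /path/to/TMP`, this keeps memory use bounded because sort spills to temporary files in the given directory.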

ADD COMMENT • link 20 months ago by Pierre Lindenbaum 164k