Hi,
I'm trying to sort a big FASTQ file by read name (which does not seem like an exotic thing to do). I managed it with seqkit, but this doesn't scale to a big file (it crashes for lack of memory). Many approaches ranked
M08001:52:000000000-DJRKT:1:1101:10000:10368
before
M08001:52:000000000-DJRKT:1:1101:1720:15314
although I have to match this with its corresponding BAM, which was "correctly" sorted by samtools sort. My guess is that these tools compare the names as plain text, character by character, so 10000 comes before 1720 because '0' sorts before '7', instead of comparing the fields between the colons as numbers. I got this result with a bash solution based on sort and with BBMap, for example. I could code it myself (e.g. by sorting on each number between the colons), but I'm pretty astonished this doesn't exist. Any hints?
Cheers, Mathieu
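The idea of sorting on each number between the colons can be sketched as a "natural" sort key in Python. This is only a minimal sketch: `natural_key` is a hypothetical helper name, and whether it reproduces the samtools order exactly (for instance for fields with leading zeros) is an assumption to check against the BAM.

```python
import re

_DIGITS = re.compile(r"(\d+)")

def natural_key(name):
    """Split a read name into text and digit runs; digit runs compare as ints."""
    parts = _DIGITS.split(name)
    # re.split with a capturing group alternates text, digits, text, ...
    # so the odd positions are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = [
    "M08001:52:000000000-DJRKT:1:1101:10000:10368",
    "M08001:52:000000000-DJRKT:1:1101:1720:15314",
]

plain_order = sorted(names)                     # text comparison: 10000 first
natural_order = sorted(names, key=natural_key)  # numeric fields: 1720 first
```

With plain text comparison the 10000 read comes first, while the key above puts the 1720 read first, which is the order described for the BAM.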
What is the use case for doing this? Perhaps we can suggest an alternative. Are you trying to filter reads? If so,
filterbyname.sh
from BBMap would be the way to go.
I'm picking info from the sequence in the FASTQ (the beginning of each read, since we are doing something unusual and this part was removed before aligning) and info from the BAM tags produced by the STAR aligner (CR, UR, GX). My code worked for MiSeq runs but crashes on NovaSeq ones (out of memory). So I'm chunking the FASTQ and the BAM, but the chunks have to match, so I first have to sort the BAM and the FASTQ the same way.
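Since the problem is memory, one possible way to sort the FASTQ itself is an external merge sort: sort fixed-size chunks in memory, write each to a temporary file, then merge the sorted chunks. The sketch below is not an existing tool; the function names and the `chunk_records` parameter are made up for illustration, and it assumes an uncompressed FASTQ with exactly four lines per record and a numeric-aware name order.

```python
import heapq
import itertools
import os
import re
import tempfile

_DIGITS = re.compile(r"(\d+)")

def natural_key(name):
    # Digit runs compare as integers, everything else as text.
    return [int(p) if i % 2 else p for i, p in enumerate(_DIGITS.split(name))]

def read_records(handle):
    """Yield (name, record_text) for 4-line FASTQ records (assumed unwrapped)."""
    while True:
        lines = list(itertools.islice(handle, 4))
        if len(lines) < 4:
            return
        name = lines[0][1:].split()[0]  # drop '@' and any comment after a space
        text = "".join(lines)
        if not text.endswith("\n"):     # last record may lack a final newline
            text += "\n"
        yield name, text

def sort_fastq_external(in_path, out_path, chunk_records=1_000_000):
    """Sort a FASTQ by read name while holding only one chunk in memory."""
    chunk_paths = []
    try:
        # Pass 1: sort each chunk in memory and write it to a temporary file.
        with open(in_path) as src:
            records = read_records(src)
            while True:
                chunk = list(itertools.islice(records, chunk_records))
                if not chunk:
                    break
                chunk.sort(key=lambda rec: natural_key(rec[0]))
                fd, path = tempfile.mkstemp(suffix=".fastq")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(text for _, text in chunk)
                chunk_paths.append(path)
        # Pass 2: lazily merge the sorted chunk files into the output.
        handles = [open(p) for p in chunk_paths]
        try:
            merged = heapq.merge(*(read_records(h) for h in handles),
                                 key=lambda rec: natural_key(rec[0]))
            with open(out_path, "w") as out:
                out.writelines(text for _, text in merged)
        finally:
            for h in handles:
                h.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

All chunk files are open at once during the merge, so for a very large run `chunk_records` has to be big enough to keep the number of chunks below the open-file limit of the system.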
Is the BAM file being chunked first? If you have that file, then running
samtools sort
on it would give you the read names in sorted order, and then it would be a matter of extracting them from the FASTQ file.
What you would ideally do is sort the FASTQ file based on the order of the names in the BAM file. To my knowledge there isn't a command-line program to sort a FASTQ file based on the order of names in a separate text file, but this shouldn't be too bad to do in Python.
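A rough sketch of what such a Python script could look like: index the byte offset of every record in the FASTQ once, then walk the list of names and copy the records out in that order. The function names are hypothetical, and the sketch assumes an uncompressed FASTQ with four lines per record and a text file with one read name per line (for example extracted from the sorted BAM, with duplicates removed).

```python
def index_fastq(path):
    """Map each read name to the byte offset of its 4-line record."""
    offsets = {}
    with open(path, "rb") as fq:
        pos = 0
        while True:
            header = fq.readline()
            if not header:
                break
            rest = [fq.readline() for _ in range(3)]
            name = header[1:].split()[0]  # drop '@' and any trailing comment
            offsets[name] = pos
            pos += len(header) + sum(len(line) for line in rest)
    return offsets

def reorder_fastq(fastq_path, names_path, out_path):
    """Write FASTQ records in the order the names appear in names_path."""
    offsets = index_fastq(fastq_path)
    with open(fastq_path, "rb") as fq, \
            open(names_path, "rb") as names, \
            open(out_path, "wb") as out:
        for line in names:
            name = line.strip()
            if not name or name not in offsets:
                continue  # skip blank lines and names absent from the FASTQ
            fq.seek(offsets[name])
            record = b"".join(fq.readline() for _ in range(4))
            if not record.endswith(b"\n"):  # last record may lack a newline
                record += b"\n"
            out.write(record)
```

Only the names and offsets are kept in memory, not the sequences, but that is still one dictionary entry per read, which may be too much for the largest runs. The random seeks can also be slow on a big file.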
Thanks. I thought about this option and gave it a try with seqtk but, unfortunately, it kept the original FASTQ order when picking the sequences (it is really a tool for extracting subsequences, but I tried to use it for reordering). Maybe there is another tool or, as you said, I could do it myself.