How can I efficiently sort a FASTQ file by entry ID, e.g. by read name for Illumina data? I'd prefer to use a Python library or a command-line C utility for this, and not awk/sed if possible. As far as I can tell, standard libraries like Biopython can't do this.
EDIT: I've been trying to make use of the solution posted here, but it does not seem to work. It still thinks that @42EBKAAXX090828:6:73:204:1871/2 comes before @42EBKAAXX090828:6:1:270:128/2, for example.
Example:
$ cat test.fq | perl mergelines.pl | sort --stable -k1,1 | perl splitlines.pl > sorted
$ head sorted
@42EBKAAXX090828:6:100:1699:328/2
TTATTGCTTAATATTTATCACTGCTGAGTCCCGTGGGGGTGTGGCTAAAAGAGGAGGGGTCTAGCTTTTTTTTTTG
+
-557459:<8<:7:;:798=<:=:;;8;8:;;58=77::8####################################
@42EBKAAXX090828:6:10:1077:1883/2
GGCCTTATAATTAATTAGAGGTAAAATTACACATGCAAACCTCCATAGACCGGTGTAAAATCCCTTAAACATTTAC
+
As you can see, it thinks that @42EBKAAXX090828:6:100:1699:328/2 comes before @42EBKAAXX090828:6:10:1077:1883/2, which is clearly wrong, since 10 is smaller than 100.
Does anyone know how to fix this? I can't tell it when the numeric part begins. That trick doesn't work.
If anyone else has efficient suggestions on how to do this I would be very interested to know.
Thanks
In case you want to either (1) pair it with the FASTQ file of another library, e.g. the FASTQ of the other mate in paired-end sequencing, or (2) pull out all unmapped reads relative to a BAM file, see http://biostar.stackexchange.com/questions/15049/getting-unmapped-reads-comparing-fastq-to-bam
Just out of curiosity: why would one need to sort a FASTQ file by sequence ID? Aren't sequencer-assigned IDs essentially arbitrary?
"@42EBKAAXX090828:6:100:1699:328/2" does come before "@42EBKAAXX090828:6:10:1077:1883/2" when compared as strings: the two IDs first differ right after "6:10", where one has "0" and the other has ":", and "0" sorts before ":". If you want to sort on the fields as integers, you have to split at the ":" and use string or integer comparison as appropriate for each field. That part is possible in Python but specific to the ID scheme.
Also if you want random access to a (large) FASTQ file, you could try Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions.
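The field-wise comparison described above (split the ID at ":" and compare numeric fields as integers) might look roughly like the sketch below. It uses only the standard library, not Biopython. The function names id_sort_key, read_fastq and sort_fastq are made up for illustration, and it assumes strict 4-line FASTQ records, IDs laid out like the ones in the question (fields separated by ":" with a trailing "/1" or "/2"), and a file small enough to hold in memory.

```python
import re
from itertools import islice


def id_sort_key(header):
    """Build a sort key from a header like '@42EBKAAXX090828:6:10:1077:1883/2'.

    Assumption: fields are separated by ':' or '/'. All-digit fields are
    compared as integers, everything else as plain strings.
    """
    read_id = header.lstrip("@").split()[0]
    fields = re.split(r"[:/]", read_id)
    # Tag each field so an integer is never compared directly with a string:
    # (0, number, "") for numeric fields, (1, 0, text) for the rest.
    return [(0, int(f), "") if f.isdecimal() else (1, 0, f) for f in fields]


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line records."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record: %r" % record[:1])
        yield tuple(record)


def sort_fastq(in_handle, out_handle):
    """Read every record into memory, sort by ID fields, write them back out."""
    records = sorted(read_fastq(in_handle), key=lambda rec: id_sort_key(rec[0]))
    for rec in records:
        out_handle.write("\n".join(rec) + "\n")
```

Calling sort_fastq(open("test.fq"), open("sorted.fq", "w")) would then put tile 10 before tile 100. For files too large for memory this approach will not do as written; one option would be to sort only the IDs and fetch the records through an index such as the Bio.SeqIO.index(...) mentioned above.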