I use the following commands to sort my FASTQ files by their sequence identifiers:
zcat 001_T1_1.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_1_sorted.fastq.gz
zcat 001_T1_2.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_2_sorted.fastq.gz
It is a bit slow when I try it on a single sample. Can it be made faster?
How can I make it run for all fastq.gz files in a directory with bash?
Check the sort options --parallel= and -S to increase the number of cores and the amount of memory used. For example,
sort --parallel=4 -S 8G
will use 8GB of RAM and 4 threads/cores.
This is a little bit faster:
... | LC_ALL=C sort -T /path/to/quick/filesystem -k1,1 -t $'\t' | ...
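Putting the suggestions in this thread together for a single file could look like the sketch below. It assumes GNU sort (needed for --parallel). The function name sort_fastq and the THREADS, MEM and TMP_DIR variables are hypothetical names chosen for illustration, and their defaults are not tuned values.

```bash
#!/usr/bin/env bash
# Sketch: sort one gzipped FASTQ by its header line with GNU sort.
# sort_fastq, THREADS, MEM and TMP_DIR are hypothetical names; defaults are illustrative.

sort_fastq() {
    local src="$1" dst="$2" tab
    tab=$(printf '\t')                 # same separator as $'\t' in bash

    gzip -dc "$src" |                  # equivalent to zcat
        paste - - - - |                # join each 4-line record into one tab-separated line
        LC_ALL=C sort \
            --parallel="${THREADS:-4}" \
            -S "${MEM:-8G}" \
            -T "${TMP_DIR:-/tmp}" \
            -t "$tab" -k1,1 |
        tr '\t' '\n' |                 # split records back into 4 lines
        gzip -c > "$dst"
}
```

It could be called as sort_fastq 001_T1_1.fastq.gz 001_T1_1_sorted.fastq.gz. Note that gzip compresses on a single thread, so the final compression step may limit the overall speed no matter how many threads sort is given.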
Something like this?
Individual files:
Multiple files:
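A minimal sketch of the multiple-files case, assuming every input ends in .fastq.gz and the sorted copies should be written next to the inputs with a _sorted suffix, as in the question. The function name sort_all is hypothetical, and the sort options are left at their defaults here.

```bash
#!/usr/bin/env bash
# Sketch: sort every *.fastq.gz in a directory, writing *_sorted.fastq.gz next to it.
# sort_all is a hypothetical helper name, not something from the thread.

sort_all() {
    local dir="${1:-.}" f out tab
    tab=$(printf '\t')
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue                            # no match: the glob stays literal
        case "$f" in *_sorted.fastq.gz) continue ;; esac   # skip outputs of earlier runs
        out="${f%.fastq.gz}_sorted.fastq.gz"
        echo "sorting $f" >&2
        gzip -dc "$f" | paste - - - - |
            LC_ALL=C sort -t "$tab" -k1,1 |
            tr '\t' '\n' | gzip -c > "$out"
    done
}
```

Calling sort_all /path/to/fastq_dir would process the files one after another. With 800 files it may be worth running several files at once instead of giving many threads to a single sort, but that is not shown here.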
I tried it for one sample with 8 cores and it is extremely slow!
Curious: what is the use case here? Which sequence identifier are you sorting on? The only things that should differ are the tile numbers and X/Y coordinates in the FASTQ header (unless you have data from multiple lanes in each file).
I have collected data from multiple lanes for PE reads, so for example 001_T1_1 contains data from different lanes, and similarly for 001_T1_2. I want to see whether my mapping improves after sorting, so I want to sort my FASTQ files by their sequence identifiers, based on https://edwards.flinders.edu.au/sorting-fastq-files-by-their-sequence-identifiers/, for all 800 files in a directory.
How would sorting fastq files help with checking mapping?