Use cat to sort multiple fastq files in a directory
2
1
Entering edit mode
2.8 years ago
dzisis1986 ▴ 70

I have the following to sort my fastq by their sequence identifiers:

zcat 001_T1_1.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_1_sorted.fastq.gz
zcat 001_T1_2.fastq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > 001_T1_2_sorted.fastq.gz

It is a bit slow when i am trying it for one sample. Can we make it faster?

How to make it run for all fastq.gz in a directory with bash ?

bash • 3.5k views
ADD COMMENT
1
Entering edit mode

Check the options --parallel= and -S to increase cores and memory usage.

ADD REPLY
0
Entering edit mode

How to make it run for all fastq.gz in a directory ?

ADD REPLY
0
Entering edit mode
sort --parallel=4 -S8G

will use 8GB of RAM and 4 threads/cores.

ADD REPLY
1
Entering edit mode

a little bit faster:

... | LC_ALL=C sort -T /path/to/quick/filesystem -k1,1 -t $'\t' | ...

ADD REPLY
0
Entering edit mode

Something like this ?

zcat *.fastq.gz | paste - - - - | LC_ALL=C sort -T /path/to/fastq  -k1,1 -t $'\t' | tr "\t" "\n" | gzip -c > *_sorted.fastq.gz
ADD REPLY
1
Entering edit mode

Individual files:

seqkit sort -j <threads> -iNn <input.fq.gz>  -o <output.fg.gz>

Multiple files:

$ parallel --plus --dry-run seqkit sort  -j <threads> -iNn {}  -o {..}_sorted.fastq.gz ::: *.fq.gz 
ADD REPLY
0
Entering edit mode

I tried it for one sample with 8 cores and it is extremely slow!

ADD REPLY
0
Entering edit mode

Curious as to what is the use case here? Which sequence identifier are you sorting on? Only thing that should be different are the tile numbers and X/Y coordinates in the fastq header (unless you have data from multiples lanes in each file).

ADD REPLY
0
Entering edit mode

i have collected data from multiple lanes for PE reads so for example for 001_T1_1 i have collected the data from different lanes and similar for 001_T1_2. I want to see if after sorting,my mapping is improved and i want to sort my FASTQ files by their sequence identifiers based on this https://edwards.flinders.edu.au/sorting-fastq-files-by-their-sequence-identifiers/ for all (800) files in a directory

ADD REPLY
0
Entering edit mode

I want to see if after sorting,my mapping is improved

How would sorting fastq files help with checking mapping?

ADD REPLY
1
Entering edit mode
2.8 years ago

with a Makefile

FASTQS=$(shell  /path/to/dir -type f -name "*fq.gz")

OUTDIR=NEWFASTQ

define run
$(addprefix $(OUTDIR)/,$(notdir $(1))) : $(1)
    mkdir -p $$(dir $$@)
    gunzip -c $$< | paste - - - - | LC_ALL=C sort -T $$(dir $$@) -k1,1 -t $$$$'\t' | tr "\t" "\n" > $$(addsuffix .tmp,$$@)
    mv $$(addsuffix .tmp,$$@) $$@
endef


all: $(addprefix $(OUTDIR)/,$(notdir $(FASTQS)))

$(foreach F,$(FASTQS),$(eval $(call run,$(F))))

invoke with make -j <number-of-parallel-jobs>

ADD COMMENT
2
Entering edit mode

enter image description here

ADD REPLY
1
Entering edit mode

of course... of course :-)

process sortFastq{
input:
   val(fastq) from fastq_ch
output:
   path("sorted.fq.gz") into sorted_fq
script:
"""
set -o pipefail
gunzip -c "${fastq}" | paste - - - - | LC_ALL=C sort -T . -k1,1 -t '\t' | tr "\t" "\\n" > "sorted.fq.gz 
"""
}
ADD REPLY
0
Entering edit mode

I used snakemake recently which i suppose is similar to nextflow. So i will try to write it there. Since i have to do all this in a server with permissions etc and i had problems with snakemake i was thinking something simpler in bash!! But thank you for your advise !!!

ADD REPLY
1
Entering edit mode
2.8 years ago

Unexpectedly the fastest way to do this is to import to a BAM file, sort by name, export back to FASTQ

First I shuffled 1 million sequences just to make the comparison fair.

Original solution:

time cat r1.fq | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" > t1.fq

Takes 27 seconds, and you have to do it twice, so about one minute for both pairs

now using samtools import

time samtools import -1 r1.fq -2 r2.fq | samtools sort -n | samtools fastq -1 p1.fq -2 p2.fq

takes 7 seconds now lets use 4 threads during sorting:

time samtools import -1 r1.fq -2 r2.fq | samtools sort -@ 4  -n | samtools fastq -1 p1.fq -2 p2.fq

runtime is down to 3 seconds

ADD COMMENT
0
Entering edit mode

Yes but in my case we have big fastq.gz files and this is for uncompressed fastq files

ADD REPLY
1
Entering edit mode

the reason I showed uncompressed files is to separate the performance of compression from the other steps: sorting/conversion

Incidentally, using compression makes the samtools solution more efficient compared to linearization and sorting, I am surprised again.

Just replace the names with gz names (samtools can import gz files and fastq will automatically compress the files if the names have the extension gz)

time samtools import -1 r1.fq.gz -2 r2.fq.gz | samtools sort -n | samtools fastq -1 p1.fq.gz -2 p2.fq.gz

still 7 seconds :-) whereas the version that lists gz file takes a few seconds longer than before:

time gzcat r1.fq.gz | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" | gzip -c > t1.fq

32 seconds

ADD REPLY
0
Entering edit mode

Samtools imports gz files and writes compressed files. In addition, newly created files (gz or unzipped) need space. How can you say this is for uncompressed fastq files without even trying?

ADD REPLY
0
Entering edit mode

I tried -c but it didnt work : samtools import -c -1 test_1.fastq.gz -2 test_2.fastq.gz | samtools sort -@ 4 -n | samtools fastq -1 test_1_sorted.fastq.gz -2 test_sorted.fastq.gz

ADD REPLY
0
Entering edit mode

it didnt work

What did not work?

ADD REPLY
0
Entering edit mode
samtools sort: failed to read header from "-"
Failed to read header for "-"
ADD REPLY
1
Entering edit mode

samtools import does not have -c flag, it raises an error, nothing gets piped into sort, which in turn fails,

if you scroll up on the error list you should see the error message:

import: invalid option -- 'c'

make sure to use the examples as typed and always double-check for typos - especially when you get an error ;-)

ADD REPLY
0
Entering edit mode
time samtools import -1 011_T1_1.fastq.gz -2 011_T1_2.fastq.gz | samtools sort -n | samtools fastq -1 011_T1_1_sorted.fastq.gz -2 011_T1_2_sorted.fastq.gz
Usage: samtools import <in.ref_list> <in.sam> <out.bam>
samtools sort: failed to read header from "-"
\Failed to read header for "-"
ADD REPLY
0
Entering edit mode

This worked for me and output files are in blocked GNU Zip Format (BGZF; gzip compatible).

samtools import -1 hcc1395_normal_rep1_r1.fastq.gz -2 hcc1395_normal_rep1_r2.fastq.gz | samtools sort -n | samtools fastq -1 sorted/hcc1395_normal_rep1_r1.fastq.gz -2 sorted/hcc1395_normal_rep1_r2.fastq.gz -c 9
ADD REPLY
0
Entering edit mode

I would recommend to stop running commands in a chain if you do not understand how it all works, run them one at a time

 samtools import -1 011_T1_1.fastq.gz -2 011_T1_2.fastq.gz > step1.sam

 ..etc..

one of your commands is incorrect, it raises an error and sends nothing further down the chain. is that simple, what you show is not real error, the error happens before that step

then when you figured out your error, pipe them back together

perhaps your samtools are old, perhaps the filenames are not correct etc, each will generate the error

ADD REPLY

Login before adding your answer.

Traffic: 1535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6