My multiple fasta looks something like that:
>chr2_1000-1020
NNNNNNNNNNNNNNNN
>chr2_1-20
NNNNNNNNNNNN
>chr2_20-40
...........
>chr2-200-220
And I want to sort all the sequences in numerical order. Any suggestions?
My multiple fasta looks something like that:
>chr2_1000-1020
NNNNNNNNNNNNNNNN
>chr2_1-20
NNNNNNNNNNNN
>chr2_20-40
...........
>chr2-200-220
And I want to sort all the sequences in numerical order. Any suggestions?
The following is not the best answer, but I cannot help mentioning my alpha-numeric sort, which I use daily. It is modified from unix sort and sorts "chr10,chr2" to "chr2,chr10" and also coordinates in the right way. You can use Pierre's command line to turn fasta into one line, I will just use bioawk for simplicity:
bioawk -c fastx '{print}' in.fa | sort -k1,1N | awk '{print ">"$1;print $2}'
where "sort" here is compiled from by modified sort. For me, this alpha-numeric sort has reduced a lot of tedious works related to coordinate sorting.
Edit 2.3.21 by ATpoint
In the original answer it was awk -c fastx
instead of bioawk
, I made that change now.
One might also try sort -k1,1V
to get a simple natural sort here in case that is of interest.
This simple command line should do the trick:
sed -e '/>/s/^/@/' -e '/>/s/$/#/' file.fasta | tr -d "\n" | tr "@" "\n" | sort -t "_" -k2n | tr "#" "\n" | sed -e '/^$/d'
Here is an explanation: first add special characters around the fasta header, remove new line symbols, sort numerically on the second part of the fasta header, put back new line characters, remove the empty line created during the process.
Linearize your sequence as described here. But use awk or perl to pad your chrom/position with the same number of zeros.
e.g:
>chr02_000000001-000000020 NNNN...
>chr02_000000020-000000040 NNNN...
>chr02_000000200-000000240 NNNN...
then sort on this first column... I'm too lazy to write the script tonight.
Here is a solution using cdbfasta that is fast and won't keep all the data in memory. The caveat is that cdbfasta has limitations with the size of the index it can create, but I don't know how many sequences you have so it's worth a shot. Assuming your file is named "myseqs.fasta" then:
grep ">" myseqs.fasta | sed 's/>//g' | sort -k1.6n > myseqs_idlist.txt
cdbfasta myseqs.fasta -o myseqs.fasta.index
cat myseqs_idlist.txt | cdbyank myseqs.fasta.index > myseqs_sorted.fasta
rm myseqs.fasta.index
Hello! Also if you want to sort by sequence length you can try this for fastq
<reads.fastq bioawk -c fastx '{print length($seq)"\t@"$name"\t"$seq"\t+\t"$qual }' | sort -V -r | awk '{print $2 "\n" $3 "\n" $4 "\n" $5}' > filtered.fastq
and this for fasta
<reads.fasta bioawk -c fastx '{print length($seq)"\t>"$name"\t"$seq }' | sort -V -r | awk '{print $2 "\n" $3}' > filtered.fasta
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This is interesting but it will not sort sequences such as the last one in the example correctly.
True. I considered the hyphen in the last one as a typo and replaced it with an underscore. am I wrong?