how to rearrange fasta file according to its length
7
1
Entering edit mode
9.3 years ago
Eva_Maria ▴ 190

Hi

I want to rearrange my fasta file according to length of sequence

next-gen sequence • 22k views
ADD COMMENT
3
Entering edit mode

Hi, welcome to Biostars.

In this forum, showing that one spent some time to search for a solution beforehand (what has been tried / which language...? ) is much appreciated. Moreover, whereas this request is quite clear, paying a bit attention to the form makes the forum easier and more pleasant to browse.

Feel free to edit your own question in order to fullfill these expectations. Thanks.

ADD REPLY
15
Entering edit mode
7.9 years ago

answering because

:-P

    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta  |\
    awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
    sort -k1,1n | cut -f 2- | tr "\t" "\n"

.

ADD COMMENT
0
Entering edit mode

I love the solutions using GNU utils. Wish I was better at awk!

Do you know if anyone has an at all comprehensive list of simple bioinfo file manipulations like this using just built in tools? It would be a really cool resource for people who aren't very 'sys-admin-y' and get stuck when installing things (or just don't have permissions!)

ADD REPLY
2
Entering edit mode

biostars + search field!

ADD REPLY
0
Entering edit mode

Maybe this one? stephenturner/oneliners: Useful bash one-liners for bioinformatics. https://github.com/stephenturner/oneliners#awk--sed-for-bioinformatics

ADD REPLY
0
Entering edit mode

Note that this solution sorts sequences by increasing length. To sort by decreasing length, simply modify the sort call with the r parameter: sort -k1,1nr

ADD REPLY
6
Entering edit mode
7.9 years ago

Using the BBMap package:

sortbyname.sh in=file.fa out=sorted.fa length descending

Default is to sort by name, but it can also sort by length or quality.

ADD COMMENT
3
Entering edit mode
7.8 years ago

Sorting by seq length using seqkit:

$ seqkit sort -l hairpin.fa

Filtering by seq length using seqkit seq:

# before filtering
$ seqkit stat hairpin.fa
file        format  type  num_seqs    sum_len  min_len  avg_len  max_len
hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354

# length >= 100
$ seqkit seq --min-len 100 hairpin.fa | seqkit stat
file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-     FASTA   RNA     10,975  1,565,486      100    142.6    2,354

Never worry about the installation of the seqkit (download), it provide sexecutable binary files for Linux/Windows/OS X. Just download, decompress and immediately use.

ADD COMMENT
2
Entering edit mode
7.8 years ago
st.ph.n ★ 2.7k

Here's a quick GUI in python

 #!/usr/bin/env python

import Tkinter, tkFileDialog
from Tkinter import *
from Bio import SeqIO

class App(object):
        def __init__(self):
                self.root = Tk()
                self.root.wm_title("Format Fasta")

                self.inp = StringVar(self.root)
                Label(self.root, text = "\nPlease provide the FASTA file containing your sequences.").pack()
                Button(self.root, text = "FASTA", command=lambda:self.inp.set(tkFileDialog.askopenfilename())).pack()

                self.output = StringVar(self.root)
                Label(self.root, text = "\nPlease enter a prefix for your output file.").pack()
                Entry(self.root, textvariable=self.output).pack()
                Label(self.root, text = "").pack()

                self.request = StringVar(self.root)
                Label(self.root, text = "\nPlease enter the min. length of a sequence to keep.").pack()
                Entry(self.root, textvariable = self.request).pack()
                Label(self.root, text = "").pack()

                Label(self.root, text = "").pack()
                Button(self.root, text = "Run", command = self.clickedrun).pack()
                Button(self.root, text = "Exit", command = sys.exit).pack()

                self.root.geometry("375x425")
                self.root.mainloop()

        def clickedrun(self):
                length = self.request.get()
                prefix = self.output.get()
                Label(self.root, text = "\nTrimming sequences to first " + length + " bp..", fg='blue').pack()
                inpfile = self.inp.get()
                outfile = prefix + '.fasta'
                with open(inpfile, 'rU') as f:
                        records = list(SeqIO.parse(f, "fasta"))
                with open(outfile, 'w') as out:
                        for r in range(len(records)):
                                if len(records[r].seq) > length:
                                        print >> out, '>' + records[r].id, '\n', records[r].seq
                Label(self.root, text="\nDone!", fg='blue').pack()


App()

Copy/paste, save as Python file. Click the 'FASTA' button to provide the path to the input fasta. Then there are two entry fields, one for the output file prefix, and another for the desired minimum length. Click 'Run', and you will get a new file in the same directory as your input file.

ADD COMMENT
1
Entering edit mode
9.3 years ago

I would use pyfaidx

Get length of each sequence and sort (ascending or descending):

faidx  --transform chromsizes test.fasta | sort -k2,2n > sorted_list

Then extract sequences in that order:

from pyfaidx import Fasta
sq = Fasta("test.fasta")

with open("sorted_list") as regions:
    for line in regions:
        cord=line.split()
        print ">"+sq[cord[0]].long_name
        print sq[cord[0]]

or you could use script given in this repository.

ADD COMMENT
0
Entering edit mode

thank u very much sir

I want to extract all sequences of length above 30 in a large fasta file

ADD REPLY
0
Entering edit mode
faidx  --transform chromsizes test.fasta | awk '{if ($2>=30) print }' | sort -k2,2n > sorted_list
ADD REPLY
0
Entering edit mode

You can also use faFilter to extract sequences. Other options are also available.

$./faFilter

-v - invert match, select non-matching records.
    -minSize=N - Only pass sequences at least this big.
    -maxSize=N - Only pass sequences this size or smaller.
    -maxN=N Only pass sequences with fewer than this number of N's
    -uniq - Removes duplicate sequence ids, keeping the first.
    -i    - make -uniq ignore case so sequence IDs ABC and abc count as dupes.
ADD REPLY
0
Entering edit mode

thank u sir

ADD REPLY
0
Entering edit mode

Hi, sorry if this is obvious. How do you sort the fasta file into descending order using he faidx tool?

ADD REPLY
0
Entering edit mode

You can sort fasta by using BBMAP

example cmd

./sortbyname.sh in=input.fa out=out.fa descending
ADD REPLY
0
Entering edit mode
9.3 years ago

A somewhat contrived way to do it with only Unix tools. Keep sequences longer than 20 and sort them in decreasing order of length:

MIN_LEN=20
INFILE=seqs.fa

awk -v RS=">" 'NR > 1{sub("\n", "\t", $0); gsub("\n", "_", $0); sub("_$", "", $0); print ">"$0}' $INFILE \
| awk -v MIN_LEN=$MIN_LEN -v FS="\t" -v OFS="\t" '{if(length($2) > MIN_LEN) {print $0, length($2)}}' \
| sort -k3,3nr \
| awk -v FS="\t" '{gsub("_", "\n", $2); print $1 "\n" $2}'

(Sequence names must not contain the tab character)

ADD COMMENT
0
Entering edit mode
7.8 years ago
AK ★ 2.2k

At first thought (FASTX-Toolkit + awk):

fasta_formatter -i input.fasta -t \
  | awk -F $'\t' '{print length($2) "\t" $0}' \
  | sort -k1,1nr \
  | awk '{print ">" $2 "\n" $3}' \
  | fasta_formatter -w 80
ADD COMMENT

Login before adding your answer.

Traffic: 1841 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6