Question

Sort Sequences In A Fasta File According To The peg List

1

Entering edit mode

8.8 years ago

k.kathirvel93 ▴ 310

Hi, Guys I have a fasta file with thousands of sequences with peg IDs. Can anyone please help me to sort out the sequences in fasta file according to their peg ID....... Thanks in advance !!

.fasta file looks...

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE

RNA-Seq sequence next-gen • 4.1k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by k.kathirvel93 ▴ 310

0

Entering edit mode

Thanks, Pierre.... I got linearised fasta with help of linearizefasta.awk. I am sry to say, Am new to awk. let me know what should I do next to get ouput like this....

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE

And I can't understand this awk command. Where should I mention my input file in this command.......Thanks again.....looking forward......

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by k.kathirvel93 ▴ 310

0

Entering edit mode

You are giving your input file in the provided link input.fa. After replacing input.fa with your file name add the Pierre's answer (now you understand 3 dots --> linearizing fasta | answer below)

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by venu 7.1k

0

Entering edit mode

Dear Venu,

Thanks for your idea, but it's making problem for my data......it's not working proberly......will you suggest something else...

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by k.kathirvel93 ▴ 310

0

Entering edit mode

it's not working properly- What is it showing? Any error on the screen?

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by venu 7.1k

0

Entering edit mode

Yes,

It's Running.....but the output.fa file is empty...

ADD REPLY • link 8.8 years ago by k.kathirvel93 ▴ 310

0

Entering edit mode

But it's working properly for me. You are using linux right?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by venu 7.1k

0

Entering edit mode

yes, am using ubuntu 14.04

ADD REPLY • link 8.8 years ago by k.kathirvel93 ▴ 310

Ram · Answer 1 · 2016-02-19

1

Entering edit mode

8.8 years ago

Pierre Lindenbaum 164k

linearize fasta

Use awk to extract the peg id (split by '.', get the last item) , sort , remove the first column containing the id and convert back to fasta

... | awk '{N=split($1,a,/\./); printf("%s\t%s\n",a[N],$0);}' | sort -t '(insert-a-tab-here)' -k1,1n | cut -f 2,3 | tr "\t" "\n"

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by Pierre Lindenbaum 164k

Ram · Answer 2 · 2016-02-19

1

Entering edit mode

8.8 years ago

venu 7.1k

Extract ids and sort them

grep '^>' file.fa | sort > ids.txt

Then a simple bash loop

while read id
do
    grep -A1 "$id" file.fa >> sorted_fasta.fa
done < ids.txt

And use -m 1 with grep if it is taking a long time (It stops searching entire file if the id is found once).

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by venu 7.1k

Ram · Answer 3 · 2016-02-19

1

Entering edit mode

8.8 years ago

GouthamAtla 12k

This should also work:

grep '^>' file.fa | sed 's/^>//' | sort > ids.txt
samtools faidx file.fa `cat ids.txt` > new.fa

It should be very fast.

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by GouthamAtla 12k

Ram · Answer 4 · 2016-02-19

1

Entering edit mode

8.8 years ago

Matt Shirley 10k

ADD COMMENT • link 8.8 years ago by Matt Shirley 10k

0

Entering edit mode

Dear Matt,

Here the Error, Pls check this out...

sorted_fasta.write('>' + peg_fasta[id].longname)
AttributeError: 'FastaRecord' object has no attribute 'longname'

What should I mention here in the longname position... Thanks again..

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by k.kathirvel93 ▴ 310

0

Entering edit mode

Sorry - I forgot the underscore (I shouldn't have given two separate interfaces in the packages "longname" and "long_name"...) and the updated code should work. The FastaRecord.long_name property actually reads the full sequence name from the FASTA file, instead of the FastaRecord.name attribute which is the record's name from the .fai index file, which will lose anything after the first whitespace character.

ADD REPLY • link updated 6.2 years ago by Ram 44k • written 8.8 years ago by Matt Shirley 10k

Ram · Answer 5 · 2016-02-19

Assuming the data is as in your example (no line breaks in sequences and all headers are in fig|6666666.167416.peg.$number format)

for next in $(grep "^>" file.fa | sort -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa

If you have proper GNU sort tool then:

for next in $(grep "^>" file.fa | sort --parallel $NumCoresYouHave -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa

Ram · Answer 6 · 2016-02-19

Just for fun, here's a superfast in-memory sort using numpy. It uses 16bytes per FASTA/FASTQ entry, and i'm confident you could get that down to just 12 or even 8 (depends on the input) with a minor tweak:

import numpy
import subprocess
inFile = "./someFile.fasta"
linesInFile = int(subprocess.check_output('wc -l ' + inFile, shell=True).strip().split()[0])
a = numpy.arange(linesInFile)                       # Struct to store line & ID
with open(inFile,'rb') as f:
    ID_ROW = False
    digits = str.isdigit                            # reduce lookups later on
    for lineNo,line in enumerate(f):
        ID_ROW = not ID_ROW                         # because its opposites day
        if ID_ROW:
            a[lineNo+1] = int(filter(digits, line)) # Keep the IDs numbers only
a.view('i8,i8').sort(order=['f1'], axis=0) # In-place sort rows by ID
a.view('i8,i8')['f0']                      # Order of rows in new file

On your demo data:

array([ 0,  4,  2,  8, 12, 14,  6, 10])

You don't really need the wc -l, since you can grow numpy arrays as you add data to them (manually or with MMU).

I haven't included a way to write out the sorted file as a new FASTA file because there are many ways to do that depending on the size of the data, what kind of hard drive you have, etc. Personally I would store the DNA in the numpy array too (in either 2bit or up2bit format), and then from the table you could write a new FASTA file at once. If that takes up too much RAM, one could find a tool (or write one) that takes an input file,a list of the order of the rows for the new file (which is what the above prints), and outputs the shuffled file.