extract multiple sequences from a fasta file with a list of numbers in header
3
0
Entering edit mode
9.9 years ago
barvazduck ▴ 20

Hi all,

I got a large fasta file that I need to extract multiple sequences from the header for each sequence is composed of several parameters such as Chromosome, Genetic distance etc, and also has an ID number in the end. I need to extract sequences based on the ID number.

I tried using 'grep':

grep -A 2 -wFf LIST.txt IN.fa > OUT.fa

But this matches also non-specific numbers. For example if I search for ID: 79695 I also get sequences with IDs such as 1379695 and 7969522.

Any idea how to solve this or other solutions?

Thanks

sequence • 14k views
ADD COMMENT
0
Entering edit mode

Can you post example of your header? Is it ID: 79695 or ID:79695 or just 79695?

ADD REPLY
0
Entering edit mode
>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ADD REPLY
0
Entering edit mode

This is probably simplest in either biopython or bioperl. They'll handle the presence of multi-line entries transparently and allow you to just use a regex to get out the ID (just store the target IDs in a hash/dict and query that).

ADD REPLY
6
Entering edit mode
9.9 years ago
robert.davey ▴ 280

As long as I'm understanding your question accurately, the simplest way would be to combine grep and sed. If your LIST.txt contains IDs, and you are specifically looking for "marker id: "+<exact ID match>, then the following should work:

grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) IN.fa > OUT.fa

Edit: apologies, missed the -w flag off in my retype. The previous answer failed when the start of the ID field contained a search ID.

ADD COMMENT
0
Entering edit mode

Thanks,

But this still wont get the exact ID number but also all the nested ones

ADD REPLY
0
Entering edit mode

EDIT: Fixed missing -w flag

Seems to work properly for me:

test.fa:

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

When LIST.txt:

davey:~/temp$ cat LIST.txt
7969

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
davey:~/temp$

I get no output. With LIST.txt:

davey:~/temp$ cat LIST.txt
1079696

I get:

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

What bash/grep versions are you using?

ADD REPLY
0
Entering edit mode

bash version 4.3.11(1)-release

grep 2.16

Anyway, it works partially because if you use 107969 then you get

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

Thanks for the efforts.

ADD REPLY
0
Entering edit mode

I've fixed my previous post. Sorry about the omission of the -w option.

ADD REPLY
0
Entering edit mode

Yeah I tried that, seems like something is wrong with my LIST.txt

ADD REPLY
0
Entering edit mode

Do you have non-standard (i.e. Windows) carriage returns in your file? You can use dos2unix to get rid of them.

It does indeed fail with the following:

1079696^M
1079695^M

If you open up your LIST.txt in vi, the status bar will help tell you if you have a DOS file or not: "testlist.txt" [dos] 2L, 18C

ADD REPLY
0
Entering edit mode

Actually it was a very stupid mistake, I just didn't use -w when I checked the result file using another grep line

ADD REPLY
4
Entering edit mode
9.9 years ago

Sort your list of ID LIST.txt

Linearize the fasta, extract the id after marker_id, sort on the first column, join with the LIST, convert back to fasta.

awk '/^>/ {i=index($0,"marker id:"); printf("%s%s\t%s\t",(N==0?"":"\n"),substr($0,i+11),$0);++N;next;} {printf("%s",$0);} END{printf("\n");}' input.fa|\
sort -t '    '  -k1,1 |\
join -t '    ' -1 1 -2 1 LIST.txt - |\
cut -f 2,3 |\
tr "\t" "\n"
ADD COMMENT
0
Entering edit mode
9.9 years ago
Bara'a ▴ 270

If you are aiming at retrieving certain sequences by ID for further manipulation, you can use the Bio.SeqIO.index() function of Bio.SeqIO module provided in Biopython .

This functions allows you to index moderately large sequence files without consuming much memory as it stores where each record is within the file and parse it on demand given identifiers .

A short Biopython code will do the work :

from Bio import SeqIO

def get_ID (identifier):
    parts= identifier.split (",")

    assert len (parts)==6 and parts[0]=="chr" and parts[1]=="genetic location:" and parts[2]=="bin num:" and parts[3]=="phase id:" and parts[4]=="scaf" and parts[5]=="marker id:"

    return parts [5]

file_dictionary=SeqIO.index(" your_file.fasta ", " fasta ", key_function=get_ID)
file_dictionary.keys()

handle=open(" selected.fasta" , "w")

for ID in [ "id1", "id2", "id3", ...  ] :
      handle.write( file_dictionary(ID).seq )

handle.close()

You can refer to Biopython tutorial section 5.4.2 for more details and: http://biopython.org/DIST/docs/tutorial/Tutorial.html

Also, you can find other indexing methods for extremely large sequence files in section 5.4.

Hope you find this useful :)

EDITED to suit your file's header format.

ADD COMMENT
1
Entering edit mode

I'd hate to write out the IDs by hand in your for loop :) Consider reading them in by file:

with open('LIST.txt','r') as ids:
  for ID in ids:
    ID = ID.rstrip()
    handle.write( file_dictionary(ID).seq )
ADD REPLY
1
Entering edit mode

I wrote that command assuming that she wants to retrieve very few sequences, so why bother reading them from a file?!

Anyways, All roads lead to Rome !! :D

ADD REPLY
1
Entering edit mode

Quite so! :)

ADD REPLY

Login before adding your answer.

Traffic: 1619 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6