Question

extract multiple sequences from a fasta file with a list of numbers in header

0

10.6 years ago

barvazduck ▴ 20

Hi all,

I have a large FASTA file from which I need to extract multiple sequences. The header of each sequence is composed of several parameters, such as chromosome, genetic distance, etc., and ends with an ID number. I need to extract sequences based on that ID number.

I tried using 'grep':

grep -A 2 -wFf LIST.txt IN.fa > OUT.fa

But this also matches non-specific numbers. For example, if I search for ID 79695, I also get sequences with IDs such as 1379695 and 7969522.

Any idea how to solve this or other solutions?

Thanks

sequence • 14k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20

0

Can you post an example of your header? Is it ID: 79695 or ID:79695 or just 79695?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by PoGibas 5.1k

0

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20

0

This is probably simplest in either biopython or bioperl. They'll handle the presence of multi-line entries transparently and allow you to just use a regex to get out the ID (just store the target IDs in a hash/dict and query that).
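
As a rough illustration of the regex-plus-lookup idea, here is a minimal sketch in plain Python. It swaps Biopython for a hand-rolled FASTA reader so that it runs without extra packages, and it assumes the header layout posted above (digits after "marker id:" at the end of the line). The names read_fasta, select_by_marker_id and the second example record are made up for the illustration.

```python
import re

# Assumption: the marker ID is the run of digits after "marker id:"
# at the very end of the header line, as in the header posted above.
MARKER_ID = re.compile(r"marker id:\s*(\d+)\s*$")

def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def select_by_marker_id(lines, wanted_ids):
    """Keep only records whose marker ID is exactly one of wanted_ids."""
    wanted = set(wanted_ids)  # set lookup, so large ID lists stay fast
    for header, seq in read_fasta(lines):
        match = MARKER_ID.search(header)
        if match and match.group(1) in wanted:
            yield header, seq

# Illustrative data only: the second record is invented for this example.
example = """>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGC
CGGCTAGCTACTATCG
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 79695
GTATCTGGCATCTTAC
""".splitlines()

for header, seq in select_by_marker_id(example, ["79695"]):
    print(header)
    print(seq)
```

Because the whole ID is compared against the set, 79695 cannot match 1079695, and the "bin num" field is never consulted. On a real file you would pass an open file handle instead of the example list.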

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Devon Ryan 105k

4

10.6 years ago

Pierre Lindenbaum 166k

Sort your list of IDs (LIST.txt).

Linearize the FASTA, extract the ID after "marker id:", sort on the first column, join with the list, and convert back to FASTA. The field separator for sort and join is a tab, written here as $'\t' (bash).

awk '/^>/ {i=index($0,"marker id:"); printf("%s%s\t%s\t",(N==0?"":"\n"),substr($0,i+11),$0);++N;next;} {printf("%s",$0);} END{printf("\n");}' input.fa|\
sort -t $'\t' -k1,1 |\
join -t $'\t' -1 1 -2 1 LIST.txt - |\
cut -f 2,3 |\
tr "\t" "\n"

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Pierre Lindenbaum 166k

0

10.6 years ago

Bara'a ▴ 270

If you want to retrieve certain sequences by ID for further manipulation, you can use the Bio.SeqIO module provided in Biopython.

The Bio.SeqIO.index() function lets you index moderately large sequence files without consuming much memory: it stores where each record is within the file and parses it on demand, given an identifier. Its key_function, however, only receives the record identifier (for FASTA, the first word of the header, here "chr7B,"), so it cannot see a marker ID at the end of the line. For this header format Bio.SeqIO.to_dict() is the easier fit, because its key_function receives the whole record, at the cost of holding all records in memory.

A short Biopython script will do the work:

from Bio import SeqIO

def get_ID(record):
    # The header has six comma-separated fields; the last one is "marker id: <number>".
    parts = record.description.split(",")
    assert len(parts) == 6 and parts[5].strip().startswith("marker id:")
    return parts[5].split(":")[1].strip()

file_dictionary = SeqIO.to_dict(SeqIO.parse("your_file.fasta", "fasta"), key_function=get_ID)

handle = open("selected.fasta", "w")

for ID in ["id1", "id2", "id3"]:  # replace with your marker IDs
    # Write the full record (header and sequence) in FASTA format.
    SeqIO.write(file_dictionary[ID], handle, "fasta")

handle.close()

For more details, see section 5.4.2 of the Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html

Also, you can find other indexing methods for extremely large sequence files in section 5.4.

Hope you find this useful :)

EDITED to suit your file's header format.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Bara'a ▴ 270

1

I'd hate to write out the IDs by hand in your for loop :) Consider reading them in by file:

with open('LIST.txt', 'r') as ids:
    for ID in ids:
        ID = ID.strip()
        # Index with [] (not a call) and write the whole record as FASTA text.
        handle.write(file_dictionary[ID].format("fasta"))

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

1

I wrote that command assuming that she wants to retrieve very few sequences, so why bother reading them from a file?!

Anyways, All roads lead to Rome !! :D

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Bara'a ▴ 270

1

Quite so! :)

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

Ram · Accepted Answer · 2014-12-22

6

10.6 years ago

robert.davey ▴ 280

As long as I'm understanding your question accurately, the simplest way would be to combine grep and sed. If your LIST.txt contains IDs, and you are specifically looking for "marker id: "+<exact ID match>, then the following should work, provided each sequence sits on a single line (hence -A 1):

grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) IN.fa > OUT.fa

Edit: apologies, I left the -w flag off when retyping. The previous version failed when a search ID matched the start of a longer ID.
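
To make the role of -w concrete, here is a small Python sketch that imitates the two matching modes. It is only an approximation of grep's behaviour, written with regex lookarounds; the function names and the three headers (built from the IDs mentioned in the question) are illustrative.

```python
import re

def matches_substring(header, marker_id):
    # Roughly what grep -F does without -w: a plain substring search.
    return f"marker id: {marker_id}" in header

def matches_whole_word(header, marker_id):
    # Roughly what grep -wF does: the match may not be directly preceded
    # or followed by a word character (letter, digit or underscore).
    pattern = r"(?<!\w)" + re.escape(f"marker id: {marker_id}") + r"(?!\w)"
    return re.search(pattern, header) is not None

# Illustrative headers using the IDs from the question.
prefix = ">chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: "
headers = [prefix + "79695", prefix + "7969522", prefix + "1379695"]

for h in headers:
    print(matches_substring(h, "79695"), matches_whole_word(h, "79695"), h)
```

The "marker id: " prefix already rules out 1379695, since the digits right after the prefix differ. Only the whole-word check also rules out 7969522, where the search ID is the start of a longer one.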

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

0

Thanks,

But this still doesn't return only the exact ID number; it also returns all the nested ones.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20

0

EDIT: Fixed missing -w flag

Seems to work properly for me:

test.fa:

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

With this LIST.txt:

davey:~/temp$ cat LIST.txt
7969

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
davey:~/temp$

I get no output. With LIST.txt:

davey:~/temp$ cat LIST.txt
1079696

I get:

davey:~/temp$ grep -A 1 -wFf <( sed -r 's/^/marker id: /' LIST.txt ) test.fa
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

What bash/grep versions are you using?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

0

bash version 4.3.11(1)-release

grep 2.16

Anyway, it only partially works: if you use 107969, you get

>chr7B, genetic location: 72.6 cM, bin num: 3629, phase id: 1, scaf14, marker id: 1079695
ATGCTAGCTAGCTAGCCGGCTAGCTACTATCGGCTATCGTACGTAGC
>chr7B, genetic location: 72.6 cM, bin num: 7969, phase id: 1, scaf14, marker id: 1079696
GTATCTGGCATCTTACTGACGGCGATCGATGCGCGCTAGCTAGCTAT

Thanks for the efforts.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20

0

I've fixed my previous post. Sorry about the omission of the -w option.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

0

Yeah I tried that, seems like something is wrong with my LIST.txt

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20

0

Do you have non-standard (i.e. Windows) carriage returns in your file? You can use dos2unix to get rid of them.

It does indeed fail with the following:

1079696^M
1079695^M

If you open up your LIST.txt in vi, the status bar will help tell you if you have a DOS file or not: "testlist.txt" [dos] 2L, 18C
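
If you would rather check this from a script, a minimal Python sketch along these lines could detect and tolerate DOS line endings. The helper names are hypothetical, and the byte string stands in for the contents of LIST.txt.

```python
def has_dos_line_endings(raw_bytes):
    """True if the data contains at least one CRLF line ending."""
    return b"\r\n" in raw_bytes

def clean_ids(raw_bytes):
    """Return the IDs, one per line, with any carriage returns removed."""
    text = raw_bytes.decode("ascii")
    # splitlines() treats \n, \r\n and \r as line breaks alike.
    return [line.strip() for line in text.splitlines() if line.strip()]

# Stand-in for the failing list shown above (each line ends in ^M).
dos_list = b"1079696\r\n1079695\r\n"

print(has_dos_line_endings(dos_list))
print(clean_ids(dos_list))
```

For a real file, read it in binary mode first, for example open("LIST.txt", "rb").read(), so that Python does not translate the line endings before you inspect them.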

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280

0

Actually, it was a very stupid mistake: I just didn't use -w when I checked the result file with another grep command.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by barvazduck ▴ 20