How to extract information from headers of fasta file
4
3
Entering edit mode
9.7 years ago
Crystal ▴ 70

Hi All,

I know a lot of people asked similar questions before. I want to specify my question. I have a database with 5000+ sequences. The format of the header for each sequence is

>AAA23421(AI041) fim41, [Escherichia coli]

>AAA23421 is the gene ID and AI041 is the VFID. I want to extract gene ID in one txt file and VFID in another txt file. The code I used before is:

grep "^>" file.fa | cut -c 2-9 > destination_file.txt
grep "^>" file.fa | cut -c 11-16 > destination_file.txt

because I thought all gene ID is the same length. BUT, I was wrong. So I can't extract right information. Is there any modification I can do to extract gene ID between > and ( and then extract VFID between ()? I have another database which I asked yesterday. The format of the headers (before I remove all the space) is

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)

Is there anyway I can only extract VFG0676 and (VF0142) together to a new txt file? Since some of VFGs do not have their corresponding VFs, so I'd like to extract them in two columns of the same file. PS: the lengths of the headers are definitely not the same. But all the VFG ID are in the front with same length and if they have VF ID, all the VF IDs are in () with same length.

Thank you

sequence • 10k views
ADD COMMENT
2
Entering edit mode
9.7 years ago

For the first:

grep "^>" foo.fa | sed 's/[>()]/\t/g' | cut -f 2,3 > destination_file.txt

for the second:

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+\((\w+)\)$/;printf "%s\t%s\n", @f' > destination_file.txt

Edit: There are likely nicer ways to do that in perl, it's not a language I particularly enjoy.

ADD COMMENT
0
Entering edit mode

The first one works fantastically!!!!!!!

The second one didn't work. :(

It is really complicated for me, and I can only copy everything you typed and can't figure out where the problem is.

ADD REPLY
1
Entering edit mode

For the second case, the following might work

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f' > destination_file.txt

The various question marks make things optional. This will print both the VFG... and VF... part if they're both there and use print nothing in place of the VF... portion if it's not there. That's at least true of the example I tested locally. There are likely edge cases I've not considered, so scroll through the results to ensure they look essentially correct.

ADD REPLY
0
Entering edit mode

It was weird. My output is 4KB, but the inside is blank. I can even scroll down in the txt file, but just couldn't see anything there. Are all the space in the code necessary or did I missed anything?

Thank you.

ADD REPLY
1
Entering edit mode

Almost every space in there is needed. If you get a bunch of blank lines (cat -A destination_file.txt | head to see line endings and such) then either you copied something over incorrectly or there's some kind of difference between either what perl is doing on each of our systems or my assumptions about the input file's format. You could always post the output of grep "^>" foo.fa > header_lines.txt somewhere and then I or someone else can look to see if this is a silly mistake on my end or what.

ADD REPLY
0
Entering edit mode

Thank you so much for your patience.

I tried the code twice (copied the code very carefully), and got the same results (all blanks)

I used cat -A destination_file.txt | head, and it showed a list of ^I$.

I never tried perl on linux terminal, is this the problem?

Also, how I can attach my files here? I've had the header_lines file ready.

Thank you.

ADD REPLY
0
Entering edit mode

You'll have to upload that file elsewhere, biostars doesn't allow attachments. If the file isn't too big then pastebin would probably work. There's also always, dropbox, google drive, etc.

ADD REPLY
0
Entering edit mode

I just added a link to the original post. It is the file of headers for the second database.

I really want to solve the problem. Otherwise, i have to manually extract the information during the weekend.

ADD REPLY
1
Entering edit mode

Oh, well the example in your post didn't match what was in the file, where all spaces were replaced with underscores. That's why nothing was working. Anyway, you can just download the results here. For future reference, the appropriate perl part of the command was:

perl -ne '@f=/^>(\w+?)_.+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f'
ADD REPLY
0
Entering edit mode

I'm so sorry for this confusion. I asked the question on how to remove space in headers two days ago and the code just helped me replace my original file.

I should try your code on my original file.

But, THANK YOU SO MUCH FOR HELPING ME!!

ADD REPLY
0
Entering edit mode

Ah, I'd missed that the VF... bit was optional in the second case (just reread your post). I'll have to think about that.

ADD REPLY
0
Entering edit mode

Hi,

I have another question, is it possible to extract the both VFG and VF together for the first database.

>AAA23421(AI041) fim41, [Escherichia coli]

Extract "AAA23421(AI041)" together from the header?

Thank you.

ADD REPLY
0
Entering edit mode
grep "^>" some_file | sed -e "s/>//g;s/ /\t/g" | cut -f 1

grep finds all lines starting with >. Sed then replaces all > with nothing and then all spaces with tabs. cut then displays only the first column. Note that I didn't test that, but it should more or less work.

ADD REPLY
0
Entering edit mode

I followed the code, but the output seemed like remove all space and extract all the information from the header.

This is the format looks like:

AAA23421(AI041) fim41,[Escherichiacoli];

Basically, what I need is all the information before the first column, there is a space between fim41 and ).

I have attached the link here, would you like to check it, please.

BTW, just a stupid question, what is t/g mean in the code?

Thanks
Crystal

ADD REPLY
1
Entering edit mode

It works on my computer:

AAA23421(AI041)
AAA23584(AI036)
AAA23585(AI036)
AAA23586(AI036)
AAA23598(AI036)
AAA23661(AI016)
AAA23662(AI016)
AAA23663(AI016)
AAA23780(AI036)
AAA57451(AI007)

\t is a tab. s/>//g means search (s) for > and replace it with nothing (what's between the 2nd and third /). The results of that are then put through s/ /\t/g, which means search (s) for a space ( ) and replace it with a tab (\t).

BTW, the g part of all of that just means "globally", which I guess isn't needed here. This is standard regex syntax. Regex is very confusing to learn, but extremely useful.

ADD REPLY
0
Entering edit mode

I know what the problem is!

I ran your code on my local computer terminal and it didn't work.

Then I ran yours on the server and it worked well! The same as the other code I modified!

So all problems have been solved. :)

ADD REPLY
1
Entering edit mode

BTW, here are the results.

ADD REPLY
0
Entering edit mode

Thank you so so so much, Devon!!!!!!!!!!!!

Yes, your code and the other one made sense to me and I tried to study from them.

But I don't know why both didn't work on my Mac terminal. Is this the system you used?

Again, I really appreciate your help these days.

Have a nice day!

ADD REPLY
0
Entering edit mode

Hi Devon,

Is there anyway that I can extract sequence with certain lengths (< 100 bp) from the fasta file?

I have some short sequences in my database and I want to find them out.

Thanks
Crystal

ADD REPLY
0
Entering edit mode

Sure, you can do that with biopython easily enough. Something along the line of:

for record in Seqio.parse(open("something.fa","r"), "fasta") :
    if(len(record.seq) < 100) :
        do stuff
ADD REPLY
0
Entering edit mode

Hi Devon,

I have a new file and I need to do similar thing to extract ID from the headers. The ID is like VFG000676(gi:4894323) (lef) anthrax toxin lethal factor precursor [Anthrax toxin (VF0142)] [Bacillus anthracis str. Sterne] Is there a way just extract information of of VF0142 in the third ()? Thanks

ADD REPLY
0
Entering edit mode

Use something similar with the re module, namely re.match().

ADD REPLY
0
Entering edit mode

Do I need to write a python code to run this re.match()?

ADD REPLY
0
Entering edit mode

Yes

ADD REPLY
0
Entering edit mode

if I want to specify I want to extract VF#### in the (). Should my code be m=re.match("()\(VF\wt)",element)?

ADD REPLY
0
Entering edit mode

m = re.search("(VF[\d]+)", element)

This will find VF followed by 1 or more digits.

ADD REPLY
0
Entering edit mode

But I also need to specify the VF should be within the (), otherwise, there is a chance to extract VFG#### from the beginning of the header. Also should I import re? I found a example of using re.match, it seems I also need to import re? I haven't had any experience using python. Sorry for the stupid questions.

ADD REPLY
0
Entering edit mode

VFG### isn't VF followed by one or more integers. Yes, you'll need to import re.

ADD REPLY
1
Entering edit mode

Great. So here is what I have so far:

> import re 
header=('header.txt'),
 for element in header:
> m=re.search("(VF[\d]+)",element) 


if m: 
 print(m)

How can I have a output.txt?

ADD REPLY
0
Entering edit mode

Quite close, once you have m:

of = open("output.txt", "w") if m is not None: of.write("{}".format(m.group(0))) of.close()

You'll need to modify header=('header.txt') to header = open('header.txt') as well. Otherwise, you exactly got the idea!

ADD REPLY
0
Entering edit mode

Thank you Devon. I tried the code using print, it worked well. However, then i figured out in some IDs, there was no VF###. Instead, they are SS###, AI### or CVF###. Then I tried to modify the code as:

m=re.search("(VF[\d]+)\(SS[\d]+)\(AI[\d]+)\(CVF[\d]+)",element)

But I got the error message as "unbalanced parentheses".

Also for your code about the output, should I separate them into several lines? Or should I just write after I define m:

   of=open("output.txt,"w")
   if m:
   of.write("{}".format(m.group(0))) of.close()

Thank you.

ADD REPLY
0
Entering edit mode

Instead of \), try |(. \ is an escape character, so it says, "Treat the next character as just a normal character and not something special, like an enclosure." So you ended up removing a bunch of the left parentheses.

BTW, you might find this site useful. You can play around with regular expressions and input a test string. I find that to be a more convenient way to use to get things like this working.

ADD REPLY
0
Entering edit mode

Thank you so much, Devon.

ADD REPLY
2
Entering edit mode
9.7 years ago

Using sed and group expression

grep '^>' input.fa | sed -r 's/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/g' > out.txt
ADD COMMENT
1
Entering edit mode

Hi Pierre, I suggest a few cosmetic modifications (remove the global flag g, target headers with /^>/):

sed -r '/^>/ s/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/' input.fa > out.txt
ADD REPLY
0
Entering edit mode

Is this code for the first database?

How about the second database, it seems more difficult to grep VFG id from the beginning and VF from the end.

ADD REPLY
1
Entering edit mode
9.7 years ago
Prakki Rama ★ 2.7k

I want to extract gene ID in one txt file

grep -Po -e ">.*?\(" fileName.fa | sed 's/[>(]//g' >file1.txt
  1. Find pattern between > and (
  2. Replace > and ( with null
  3. Redirect to a file

...and VFID in another txt file.

grep -Po -e "\(.*?\)" fileName.fa | sed 's/[()]//g' >file2.txt

Similar summary like the above.

...can only extract VFG0676 and (VF0142) together to a new txt file?

grep -Po -e "VF.*? |\(VF.*?\)" fileName.fa | sed -e :a -e '$!N;s/\n(//;ta' -e 'P;D' | tr ')' ' ' >file3.txt
  1. grep: Extract characters after VF until it hits space OR characters after VF inside the brackets (in this case it hits both, prints in two seperate lines)
  2. sed: If line starts with "(", append to previous line
  3. tr the ")" into null
  4. redirect to a file
ADD COMMENT
0
Entering edit mode

Your explanation really make sense to me, and at least I knew how did these code work.

Thank You.

ADD REPLY
0
Entering edit mode
9.7 years ago
awk  -F "(" '/^>/{print substr($1,2) > "GeneId.txt"; x=match($2,/\)/);print substr($2,1,x-1) > "VFID.txt"}' yourfile.fa

To get both the ids in the same file:

awk 'BEGIN{FS="(";OFS="\t"} /^>/{x=match($2,/\)/);print substr($1,2), substr($2,1,x-1)}' yourfile.fa > tab_separated_ids.txt

For your other database (getting both the ids in the same file):

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)
awk -v OFS="\t"  '/^>/{gsub(/\(|\)/,"",$NF);  print sub(print substr($1,2), $NF}'  database.fa > tab_separated_ids.txt
ADD COMMENT
0
Entering edit mode

It's typically unwise to assume constant length IDs, particularly when they can be separated by variable width descriptors.

ADD REPLY
1
Entering edit mode

yep.. Didn't see that.. Edited

ADD REPLY

Login before adding your answer.

Traffic: 2720 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6