Question

replace fasta headers with another name in a text file

6

10.4 years ago

Jemo ▴ 60

Hi everyone,

I have a fasta file and a text file with names on each row:

The fasta file looks like this:

>BQG3565;size=525
AGGCTT.....
>BGET752;size=3
TTGCCAG.....

and so on

The text file looks like this:

ANT_39
ANT_5676
ANT_3
...

and so on.

I would like to replace each header in the fasta file with the name from the corresponding row of the text file. I am a beginner in bioinformatics and was wondering if anyone could help me with this.

Many thanks!

perl • 28k views

ADD COMMENT • link updated 16 months ago by nr299 • 0 • written 10.4 years ago by Jemo ▴ 60

Ram · Answer 1 · 2014-06-10

7

10.4 years ago

Sukhi Singh 11k

How about this:

# fetch every alternate line (sequence in our case)
awk 'NR%2==0' fasta.fas > seq.fas

# merge line by line using headers from the text file
paste -d'\n' headerFile.txt seq.fas > output

Or, as a one-liner:

awk 'NR%2==0' fasta.fas | paste -d'\n' headerFile.txt - > output

ADD COMMENT • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Sukhi Singh 11k

0

But this assumes sequences span only one line, right?

ADD REPLY • link 10.4 years ago by dariober 15k

0

Yes, you are right: this will fail if the sequences span multiple lines.

ADD REPLY • link 10.4 years ago by Sukhi Singh 11k

0

How can this be done if the sequences span multiple lines?

ADD REPLY • link 22 months ago by iankeetkumar • 0
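
For sequences that span multiple lines, one option is to leave every sequence line alone and swap only the lines that start with ">". The following is a minimal Python sketch, not a tested pipeline. It assumes the names file has one name per line, in the same order as the FASTA records, and that names are stored without the leading ">". The function name rename_headers and the sample data are hypothetical.

```python
def rename_headers(fasta_lines, names):
    """Yield FASTA lines with each header replaced by the next name.

    Sequence lines are passed through unchanged, so a record may wrap
    over any number of lines. Raises ValueError if the names run out.
    """
    names = iter(names)
    for line in fasta_lines:
        if line.startswith('>'):
            name = next(names, None)
            if name is None:
                raise ValueError('more FASTA records than names')
            # Assumption: names are stored without '>', so add it back
            yield '>' + name.strip().lstrip('>') + '\n'
        else:
            yield line


# In-memory sample standing in for the two input files
fasta = ['>BQG3565;size=525\n', 'AGGCTT\n', 'TTGA\n', '>BGET752;size=3\n', 'TTGCCAG\n']
names = ['ANT_39\n', 'ANT_5676\n']
renamed = list(rename_headers(fasta, names))
print(''.join(renamed), end='')
```

Because the function accepts any iterable of lines, open file handles can be passed in place of the lists, and the result written out with writelines.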

Ram · Answer 2 · 2014-06-10

5

10.4 years ago

dariober 15k

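I haven't tested this at all. It's Python; see if it works:

fasta= open('seq.fa')           # input FASTA
newnames= open('newnames.txt')  # one new name per line, in record order
newfasta= open('seqnew.fa', 'w')

for line in fasta:
    if line.startswith('>'):
        # header line: write the next name instead (written as-is, so
        # include '>' in newnames.txt if the output should keep it)
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        # sequence line: copy unchanged
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()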

ADD COMMENT • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by dariober 15k

0

Hi, thanks for the helpful insights. I ran your suggested code after saving it as replace_name.py:

#!/usr/bin/env python

fasta= open('terS_non1.fasta')
newnames= open('terS_name.txt')
newfasta= open('terS_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

But it doesn't work; I get the following error message:

File "replace_name.py", line 3
SyntaxError: Non-ASCII character '\xe2' in file replace_name.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Jemo ▴ 60

1

In case anyone ever needs this code: the quotation marks in the script above aren't ASCII characters, which is what breaks it.

#!/usr/bin/env python

fasta= open('Galaxy58-[Extract_Genomic_DNA_on_data_46_and_data_37].fasta')
newnames= open('names_for_fasta_file.txt')
newfasta= open('trial_new_non1.fasta', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

This is the edited version, with plain ASCII quotation marks.

ADD REPLY • link 7.3 years ago by jjrin ▴ 40

1

Here's a modified version of jjrin's code that uses argparse so that you can use flags to indicate an input file, record replacement file, and an output file.

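import argparse

parser = argparse.ArgumentParser(description="program that replaces fasta records")
# The builtin `file` type exists only in Python 2, so accept paths and open them below
parser.add_argument("-i", help="input fasta", required=True)
parser.add_argument("-r", help="replacement records file", required=True)
parser.add_argument("-o", help="output file", required=True)
args = parser.parse_args()

with open(args.i) as fasta, open(args.r) as records, open(args.o, 'w') as newfasta:
    for line in fasta:
        if line.startswith('>'):
            # header line: write the next replacement record instead
            newname = records.readline()
            newfasta.write(newname)
        else:
            newfasta.write(line)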

ADD REPLY • link 4.8 years ago by Digsby ▴ 10

0

Here is a modified version that takes a tab-delimited lookup table of headers to replace. It works even if only a subset of the headers needs replacing, and even if the headers appear in a different order than the entries in the lookup table.

[edit] I tested this on very large FASTA files, including hg38, and it worked.

# Replace specific headers in a fa file using a custom-made, tab-delimited lookup table.
# Use grep "^>" fasta.fa to help generate that lookup table.
# Code based on the solution from "replace fasta headers with another name in a text file"

# Example lookup table line (the two columns are separated by a single tab):
# >old_line  >new_line

import argparse
import csv

parser = argparse.ArgumentParser(description="program that replaces fasta headers")
# The builtin `file` type exists only in Python 2, so accept a path and open it below
parser.add_argument("-i", help="input fasta", required=True)
parser.add_argument("-l", help="lookup table with replacement header lines", required=True)
parser.add_argument("-o", help="output fasta", required=True)
args = parser.parse_args()

# load lookup table into dict format: old header -> new header
lookup_dict = {}
with open(args.l) as lookup_handle:
    lookup_list = csv.reader(lookup_handle, delimiter='\t')
    for entry in lookup_list:
        if len(entry) >= 2:  # skip blank or incomplete rows
            lookup_dict[entry[0]] = entry[1]

# read in the fa line by line and replace the header if it is in the lookup table
with open(args.i) as fasta, open(args.o, 'w') as newfasta:
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith('>'):
            # keep the original header when it has no entry in the table
            line = lookup_dict.get(line, line)
        newfasta.write(line + "\n")

ADD REPLY • link 4.4 years ago by brismiller ▴ 60

0

Hi @brismiller, could this script be used with the Biopython module?

ADD REPLY • link 3.7 years ago by diego1530 ▴ 80

0

Hi @brismiller, I am trying to use this script, but my formatting is a bit different.

My fasta file headers are like this:

>704357
>2592645

But I want to replace them with the ids that have taxonomy attached like this:

>704357;tax=k:Bacteria,p:Planctomycetes,c:Planctomycetia,o:Pirellulales,f:Pirellulaceae,g:Pirellula,s:unclassified
>2592645;tax=k:Archaea,p:Crenarchaeota,c:MBGB,o:unclassified,f:unclassified,g:unclassified,s:unclassified

I know the problem is that the lookup function can't recognize the ids because they're embedded in the strings, but I'm not sure how to modify the script to overcome this issue.

Any help will be appreciated! Thanks

ADD REPLY • link 16 months ago by nr299 • 0
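
One possible approach, offered as a sketch rather than a tested solution: key the lookup on the bare ID (the text between ">" and the first ";") instead of on the whole header line. The function names header_id, build_id_lookup and annotate_headers are hypothetical, the taxonomy strings in the sample are shortened for illustration, and the sketch assumes the annotated headers are available one per line.

```python
def header_id(header):
    """Return the ID part of a header: text after '>' and before the first ';'."""
    return header.strip().lstrip('>').split(';', 1)[0]


def build_id_lookup(annotated_headers):
    """Map each bare ID to its full annotated header line."""
    return {header_id(h): h.strip() for h in annotated_headers if h.strip()}


def annotate_headers(fasta_lines, lookup):
    """Yield FASTA lines, swapping in the annotated header when the ID is known."""
    for line in fasta_lines:
        if line.startswith('>'):
            # Headers without a matching ID are kept unchanged
            yield lookup.get(header_id(line), line.rstrip('\n')) + '\n'
        else:
            yield line


# Shortened sample data standing in for the two input files
annotated = [
    '>704357;tax=k:Bacteria,p:Planctomycetes\n',
    '>2592645;tax=k:Archaea,p:Crenarchaeota\n',
]
fasta = ['>2592645\n', 'ACGT\n', '>704357\n', 'TTGA\n', '>999\n', 'GG\n']
result = list(annotate_headers(fasta, build_id_lookup(annotated)))
print(''.join(result), end='')
```

Matching on the ID rather than the full line means the order of the annotated headers does not matter, and records with no annotation pass through untouched.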

0

What editor did you use to copy and paste the script? If you used MS Word or similar, the file will contain non-ASCII characters (such as typographic quotation marks) that are easy to overlook but that Python will reject.

ADD REPLY • link 10.4 years ago by dariober 15k
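
One way to locate such characters, as a minimal sketch: scan the script line by line and report every character outside the ASCII range. The helper name find_non_ascii is hypothetical, and the sample below stands in for a script pasted from a word processor (the curly quotes are written as \u escapes so that the example itself stays pure ASCII).

```python
def find_non_ascii(lines):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits


# Sample script text with curly quotes on its third line
sample = [
    "#!/usr/bin/env python\n",
    "\n",
    "fasta= open(\u2018seq.fa\u2019)\n",
]
for lineno, col, ch in find_non_ascii(sample):
    print("line %d, column %d: %r" % (lineno, col, ch))
```

To check a real script under Python 3, the open file can be passed directly, for example find_non_ascii(open('replace_name.py', encoding='utf-8')).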

0

Can a similar script be written for bash as well?

ADD REPLY • link 22 months ago by iankeetkumar • 0

Ram · Answer 3 · 2014-06-10

3

10.4 years ago

Kenosis ★ 1.3k

Here's a Perl option:

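use strict;
use warnings;

my @arr;

# First file (textFile): collect the non-empty names, stopping at its end
while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

# Second file (fastaFile): print the next name in place of each header
while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}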

Usage:

perl script.pl textFile fastaFile [>outFile]

The last, optional parameter directs output to a file.

Hope this helps!

ADD COMMENT • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k

0

I tried it and it partly worked, but in the new file the headers are separated from their respective sequences.

My original fasta file would be something like this:

>650_16551;size=22371;
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

>bs5_4497;size=326624;
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

And when I execute your code I get something like this:

ANT_1
ANT_2
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRDFTTGAVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNLELEKVYWPYFL

EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

So the format of the new file is not like a fasta file. Any idea why?

Thanks!!

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Jemo ▴ 60

0

I got your desired output with the datasets you've just included. However, I'm not sure about your text file's formatting, so I've refactored the code block after the first while loop. Perhaps that will help.

I got the following from both versions:

ANT_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGIL
PCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFL
ANT_2
EPTLFGDRTYRFAQDVPSLLPAILLELKQFRKKAKKDMAAATGYEEVYNGKQLAYKISMNSVYGFTGAGKGILPCVPIAS
TTTFRGRAMIEETKNYVEKNFPGTJEOLLLEVMVEFDVGDLKGEEAVKYSWEIGEKAAEECSALFKKPNNLELEKVYWPYFL

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k

0

Thanks for the prompt reply, I really appreciate your inputs! My txt file contains just a single column with rows of ID names (ANT_1, ANT_2, etc...).

I re-executed your updated code, and for some reason I still get the same output as before.

Thanks!

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Jemo ▴ 60

0

You're most welcome!

I accidentally omitted the name of the Perl script from the directions and have now fixed the original post. You should do the following (the last parameter being optional):

perl script.pl textFile fastaFile [>outFile]

My apologies for this oversight.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k

0

Dear Kenosis,

It's really weird because I still get the same output file using your posted code, which I save as rename.pl:

#!/usr/local/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Then I execute my code:

perl rename.pl name.txt seq.fasta > new seq.fasta

Is it the new updated code?

Many thanks for your time!

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Jemo ▴ 60

0

The update only ensures that blank lines are skipped, just in case any exist.

You have:

perl rename.pl name.txt seq.fasta > new seq.fasta

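Did you mean new_seq.fasta? You do need the underscore in the name: with a space, the shell redirects the output to a file named new and passes seq.fasta to the script as an additional input file.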

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Kenosis ★ 1.3k

0

Thanks for noticing my typo, but yes I made sure to add the underscore in the name of the new file. Using your script as below, the output is still not formatted as it would need to be. Something must be missing.

#!/usr/bin/perl 
use strict;
use warnings;

my @arr;

while (<>) {
    chomp;
    push @arr, $_ if length;
    last if eof;
}

while (<>) {
    print /^>/ ? shift(@arr) . "\n" : $_;
}

Here's what the output looks like after executing the script:

Ant_1

Ant_2

Ant_3
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

It would be wonderful if it could have been in this format, instead:

Ant_1
VTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_2
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL
Ant_3
ELEKVYWPYFLVTSLGLCMKPSKWDRTYKFAQGVPSLYSILLELKQFRKKAKRDMAAATGSMKJOOWLGKQLAYKISMNSVYGFTGAGKGILPCVPIASTTTSRGRSMIEETKAYVEEHFPIOLKVRYGDTDSVMVEFDVGDRKGEEAIEYSWELGERAAEECSSLFKKPNNL

Please let me know if you have any advice or tricks that could improve the output.

Thanks!!

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Jemo ▴ 60

1

This was really helpful, as I am very new to bioinformatics. I used the Python script to change my fasta file headers. I had the same formatting problem at first and found that my text file had DOS line endings, which were incompatible with the Unix system I was using. To check, view the text file in the terminal with less name.txt: if your list appears as one contiguous line with the names separated by ^M, then it was created in DOS format. I converted the file to Unix format by re-saving it in TextWrangler with the line-ending setting changed, and then the script worked perfectly.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.0 years ago by emily.remnant ▴ 10
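
The line-ending conversion can also be done in Python. This is a minimal sketch that assumes the names file is small enough to read into memory; the function name normalize_newlines is hypothetical. It converts both DOS line endings (\r\n) and bare carriage returns (\r, which less displays as ^M) to Unix line endings (\n).

```python
def normalize_newlines(data):
    """Convert CRLF and bare CR line endings in a bytes object to LF."""
    # Replace CRLF first so that its CR is not converted a second time
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


# In-memory examples standing in for the contents of name.txt
dos_text = b'ANT_1\r\nANT_2\r\nANT_3\r\n'
cr_text = b'ANT_1\rANT_2\rANT_3\r'
print(normalize_newlines(dos_text))
print(normalize_newlines(cr_text))
```

To convert a real file, read it in binary mode (open('name.txt', 'rb')), pass the bytes through the function, and write the result back in binary mode ('wb') so that Python does not translate the line endings again.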