How to replace accession numbers with GI numbers in FASTA headers

9.6 years ago

marcelavillegasp ▴ 10

Hi everyone!

I need help with something. I am very new to bioinformatics.

I have a FASTA file with 32K reference sequences for a gene X. The headers are accession numbers, but I need to replace them with the corresponding GI numbers.

I already have a txt file listing the GI for each accession number (so I think I have already done the hardest part), but now I need to combine this information and change the headers of my FASTA file to the GI of each sequence.

I've tried with this script:

fasta= open('seq.fa')
newnames= open('newnames.txt')
newfasta= open('seqnew.fa', 'w')

for line in fasta:
    if line.startswith('>'):
        newname= newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

But this replaces each header with the whole corresponding row of the txt file (accession number and GI together), matched only by position in the file. How can I replace each accession number with the GI that I already have in a tab-delimited txt file?

My txt file is:

Accession Number         GI
AB079690        22212526
EF394164        126842524
EU113233        157361205

Thanks

Looks like you might have an answer here: Fasta sequence replacement based on header name

written 9.6 years ago by Matt Shirley 10k • updated 5.5 years ago by Ram 45k

9.6 years ago

Matt Shirley 10k

If you're interested in using a library you can take advantage of the key_function argument in pyfaidx:

	from pyfaidx import Fasta

	# Build a lookup table: accession number -> GI
	name_map = {}
	with open('newnames.txt') as newnames:
	    next(newnames)  # skip the header row
	    for line in newnames:
	        old, new = line.rstrip().split()
	        name_map[old] = new

	with open('seqnew.fa', 'w') as new_fasta:
	    # replace the fasta sequence names for lookup
	    fasta = Fasta('seq.fa', key_function=lambda x: name_map[x])
	    for seq in fasta:
	        # FASTA header lines must start with '>'
	        new_fasta.write('>' + seq.name + '\n')
	        for line in seq:
	            new_fasta.write(str(line) + '\n')

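If you would rather not install a library, the same idea can be sketched in plain Python, staying close to the script in the question. This is only a sketch under a few assumptions: the first whitespace-delimited token after `>` in each header is exactly the accession number as it appears in the txt file (no version suffix such as `.1`), the txt file has a single header row, and headers with no matching accession are left unchanged. The function names `load_name_map` and `rename_headers` are made up for illustration.

```python
def load_name_map(path):
    """Read a whitespace-delimited 'accession  GI' table into a dict."""
    name_map = {}
    with open(path) as handle:
        next(handle, None)  # skip the 'Accession Number  GI' header row
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # ignore blank or malformed rows
                name_map[fields[0]] = fields[1]
    return name_map


def rename_headers(fasta_in, fasta_out, name_map):
    """Copy fasta_in to fasta_out, swapping each accession header for its GI."""
    with open(fasta_in) as src, open(fasta_out, 'w') as dst:
        for line in src:
            if line.startswith('>'):
                # Assumption: the accession is the first token after '>'
                tokens = line[1:].split()
                accession = tokens[0] if tokens else ''
                gi = name_map.get(accession)
                if gi is not None:
                    dst.write('>' + gi + '\n')
                else:
                    dst.write(line)  # no match: keep the original header
            else:
                dst.write(line)  # sequence lines are copied unchanged
```

With the file names from the question, this would be called as `rename_headers('seq.fa', 'seqnew.fa', load_name_map('newnames.txt'))`. Looking each header up in a dictionary, rather than reading the next row of the txt file, means the order of the two files no longer has to match.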
