How to Append numbers on ALL headers in FASTA (e.g., >gene#1)?
3
0
Entering edit mode
3.4 years ago
sasa ▴ 10

Hi,

I want to append numbers on all headers (gene names), not only duplicates or multiplicates but also unique ones, in FASTA file. Does anyone have solutions? My file is pretty large so I am trying to find some code. Unfortunately, I can not currently run R scripts in our institute server, so I would prefer python, perl, etc. It is similar to the previous one, but a bit different. Here is my fasta:

>geneA
atcc
>geneB
aaat
>geneB
aaat
>geneB
aaat
>geneC
atgg
>geneC
atgg
>geneD 
atcg 

I need to append numbers like this:

>geneA#1
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD#1
atcg 

I have tried a python code but it only adds numbers to multiplicates and doesn't add #1 to all unique genes.

from Bio import SeqIO

records = set()
of = open("output.fa", "w")
for record in SeqIO.parse("myfasta.fa", "fasta"):
    ID = record.id
    num = 1
    while ID in records:
        ID = "{}#{}".format(record.id, num)
        num += 1
    records.add(ID)
    record.id = ID
    record.name = ID
    record.description = ID
    SeqIO.write(record, of, "fasta")
of.close()

The output is:

>geneA
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD
atcg 

I would appreciate it if anyone can provide a solution or any ideas.

Thanks!

FASTA perl Python • 1.5k views
ADD COMMENT
0
Entering edit mode

Thank you very much, everyone. I was able to covert my dataset.

ADD REPLY
2
Entering edit mode
3.4 years ago

linearize, sort on name, use awk to count the consecutive entries.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  in.fa |\
sort -t $'\t' -k1,1 |\
awk -F '\t' '/^>/{if(PREV==$1) {N++;} else {N=1} print $1 "#" N "\n" $2;PREV=$1;}'
ADD COMMENT
2
Entering edit mode
3.4 years ago
cfos4698 ★ 1.1k

You were pretty close. A minor modification to your biopython code yields what you want:

from Bio import SeqIO

records = set()
of = open("output.fa", "w")

with open("output.fa", "w") as of:
    for record in SeqIO.parse("myfasta.fa", "fasta"):
        ID = record.id
        print(ID)
        num = 1
        ID = "{}#{}".format(record.id, num)
        while ID in records:
            ID = "{}#{}".format(record.id, num)
            num += 1
        records.add(ID)
        record.id = ID
        record.name = ID
        record.description = ID
        SeqIO.write(record, of, "fasta")

Output:

>geneA#1
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD#1
atcg
ADD COMMENT
2
Entering edit mode
3.4 years ago

with Awk only and copy/pasted from here:

$ awk '/>/ {$1=$1"#"++a[$1]}1' gene.fa
ADD COMMENT
0
Entering edit mode

in python:

#! /usr/bin/env python3

import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

input=sys.argv[1]
output=sys.argv[2]

rpid=""
num=1
new_seq=[]

for record in SeqIO.parse(input, "fasta"):
    if record.id == rpid:
        new_num+=num
    else:
        new_num=num
    newrecord_id='{0}#{1}'.format(record.id,new_num)
    new_seq.append(SeqRecord(record.seq,newrecord_id, description=""))
    rpid=record.id
SeqIO.write(new_seq, output, format="fasta")

try it in shell (bash/zsh)

$ python append_duplicate_numbers.py gene.fa out.fa

out.fa is output and gene.fa is input.

ADD REPLY

Login before adding your answer.

Traffic: 2258 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6