Question

How to Append numbers on ALL headers in FASTA (e.g., >gene#1)?

0

Entering edit mode

3.4 years ago

sasa ▴ 10

Hi,

I want to append numbers on all headers (gene names), not only duplicates or multiplicates but also unique ones, in FASTA file. Does anyone have solutions? My file is pretty large so I am trying to find some code. Unfortunately, I can not currently run R scripts in our institute server, so I would prefer python, perl, etc. It is similar to the previous one, but a bit different. Here is my fasta:

>geneA
atcc
>geneB
aaat
>geneB
aaat
>geneB
aaat
>geneC
atgg
>geneC
atgg
>geneD 
atcg

I need to append numbers like this:

>geneA#1
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD#1
atcg

I have tried a python code but it only adds numbers to multiplicates and doesn't add #1 to all unique genes.

from Bio import SeqIO

records = set()
of = open("output.fa", "w")
for record in SeqIO.parse("myfasta.fa", "fasta"):
    ID = record.id
    num = 1
    while ID in records:
        ID = "{}#{}".format(record.id, num)
        num += 1
    records.add(ID)
    record.id = ID
    record.name = ID
    record.description = ID
    SeqIO.write(record, of, "fasta")
of.close()

The output is:

>geneA
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD
atcg

I would appreciate it if anyone can provide a solution or any ideas.

Thanks!

FASTA perl Python • 1.5k views

ADD COMMENT • link updated 3.4 years ago by cpad0112 21k • written 3.4 years ago by sasa ▴ 10

0

Entering edit mode

Thank you very much, everyone. I was able to covert my dataset.

ADD REPLY • link 3.4 years ago by sasa ▴ 10

score 2 · Accepted Answer · 2021-08-05

2

Entering edit mode

3.4 years ago

Pierre Lindenbaum 164k

linearize, sort on name, use awk to count the consecutive entries.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  in.fa |\
sort -t $'\t' -k1,1 |\
awk -F '\t' '/^>/{if(PREV==$1) {N++;} else {N=1} print $1 "#" N "\n" $2;PREV=$1;}'

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 164k

score 2 · Accepted Answer · 2021-08-06

You were pretty close. A minor modification to your biopython code yields what you want:

from Bio import SeqIO

records = set()
of = open("output.fa", "w")

with open("output.fa", "w") as of:
    for record in SeqIO.parse("myfasta.fa", "fasta"):
        ID = record.id
        print(ID)
        num = 1
        ID = "{}#{}".format(record.id, num)
        while ID in records:
            ID = "{}#{}".format(record.id, num)
            num += 1
        records.add(ID)
        record.id = ID
        record.name = ID
        record.description = ID
        SeqIO.write(record, of, "fasta")

Output:

>geneA#1
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD#1
atcg

score 2 · Accepted Answer · 2021-08-06

2

Entering edit mode

3.4 years ago

cpad0112 21k

with Awk only and copy/pasted from here:

$ awk '/>/ {$1=$1"#"++a[$1]}1' gene.fa

ADD COMMENT • link 3.4 years ago by cpad0112 21k

0

Entering edit mode

in python:

#! /usr/bin/env python3

import os
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

input=sys.argv[1]
output=sys.argv[2]

rpid=""
num=1
new_seq=[]

for record in SeqIO.parse(input, "fasta"):
    if record.id == rpid:
        new_num+=num
    else:
        new_num=num
    newrecord_id='{0}#{1}'.format(record.id,new_num)
    new_seq.append(SeqRecord(record.seq,newrecord_id, description=""))
    rpid=record.id
SeqIO.write(new_seq, output, format="fasta")

try it in shell (bash/zsh)

$ python append_duplicate_numbers.py gene.fa out.fa

out.fa is output and gene.fa is input.

ADD REPLY • link 3.4 years ago by cpad0112 21k