Renaming all FASTA headers in a file
9.6 years ago
branokdrung ▴ 10

Hi everyone,

I'm having a problem with FASTA headers that are too long. A program I'm using (TargetP) truncates them after the 20th character.

Example:

>ConsensusfromContig10000-snap_masked-ConsensusfromContig10000-abinit-gene-0.1-mRNA-1:cds:3144/1451-1467:0:+
MKKSGDIDEIWKSMQEDARPKPRLPPLPAAAPPAPAPPAPAPKAAAAQPAAASSSNAMVAVNGGASRAFDYSNANALQRDINSLGDEALGTRKRAAERLEAVIVGAEGEAAEATVRALTGDLFKPLLKRFADPGEK

After truncation, what remains are thousands of entries all named "ConsensusfromContig1".

Is there any software or script I can use to rename the headers so that they are about 20 characters long and can still be told apart? So far I have only found scripts that truncate long headers. The desired name for the example would be something like 10000|3144/1451-1467:0 .

I would be grateful for any help provided.

fasta • 5.0k views
9.6 years ago
Anima Mundi ★ 2.9k

In Python:

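# Keep only the last 18 characters of each header; print sequence lines unchanged.
with open('input.fa') as handle:
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Drop the leading '>' before slicing so it is never duplicated.
            print('>' + line[1:][-18:])
        else:
            print(line)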
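
Keeping only the tail of the header drops the contig number (for the example header, the last 18 characters are 3144/1451-1467:0:+), so two entries could in principle end up with the same name. Below is a minimal sketch for checking that before running TargetP. The function names and the example records are made up for illustration and are not part of the answer above.

```python
from collections import Counter


def shortened_headers(lines, keep=18):
    """Yield the last `keep` characters of every header line (without '>')."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line[1:][-keep:]


def duplicated(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up records: two headers that differ only near the start.
example = [
    ">seqA_xxxxxxxxxxxxxxxxxxxx_tail\n",
    "MKK\n",
    ">seqB_xxxxxxxxxxxxxxxxxxxx_tail\n",
    "MKS\n",
]

dups = duplicated(shortened_headers(example))
if dups:
    print("Shortened headers are not unique:", dups)
```

For a real file, pass the open file handle instead of the example list. An empty result means every shortened header is still unique.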
9.6 years ago
iraun 6.2k

If your header lines always have the same format, that is, they always contain the words "Contig" and "cds:", you can use this awk command:

awk '{if($1 ~ /^>/){split($1,a,"-"); split(a[1],b,"Contig");split($1,c,"cds:"); print ">"b[2]"|"c[2]}else{print}}' file
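
For readers who prefer Python, the same three splits can be sketched as below. This is an illustrative equivalent, not part of the original answer; the function name shorten_header is made up, and unlike awk (which silently prints an empty field) it raises IndexError when a header lacks "Contig" or "cds:".

```python
def shorten_header(header):
    """Build '>' + contig number + '|' + text after 'cds:' from one header line."""
    first_field = header.split()[0]               # awk's $1
    before_dash = first_field.split("-")[0]       # a[1]: text before the first '-'
    contig_id = before_dash.split("Contig")[1]    # b[2]: text after 'Contig'
    cds_part = first_field.split("cds:")[1]       # c[2]: text after 'cds:'
    return ">" + contig_id + "|" + cds_part


def rename_headers(lines):
    """Yield lines with headers shortened and sequence lines unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield shorten_header(line) if line.startswith(">") else line


example = (
    ">ConsensusfromContig10000-snap_masked-ConsensusfromContig10000"
    "-abinit-gene-0.1-mRNA-1:cds:3144/1451-1467:0:+"
)
print(shorten_header(example))  # >10000|3144/1451-1467:0:+
```

For the example header the new name is 24 characters long after the ">", so it is worth checking that the names are still unique once truncated to 20 characters.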

Thanks a lot! I never imagined it could be done so easily. I used your command in the following way:

awk '{if($1 ~ /^>/){split($1,a,"-"); split(a[1],b,"Contig");split($1,c,"cds:"); print ">"b[2]"|"c[2]}else{print}}' Cyanophora_paradoxa_MAKER_gene_predictions-022111-aa.fasta >> Cyanophora_paradoxa_MAKER_gene_predictions-022111-aa-newHeaders.fasta

It worked like a charm. Big thanks again!
