Dear Biostars friends, I am learning to program and have run into a problem I can't solve. I want to add _1, _2, _3, ... to the transcript IDs that belong to the same gene. My original file looks like this:
scaffold_1 transcript 55098 57492 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 exon 55098 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 transcript 55102 57490 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 exon 55102 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 transcript 55102 57480 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 exon 55102 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200"
scaffold_1 transcript 75108 76843 . + . gene_id "Seita.1G000300"; transcript_id "Seita.1G000300"
scaffold_1 exon 75108 76406 . + . gene_id "Seita.1G000300"; transcript_id "Seita.1G000300"
while the target file should look like this:
scaffold_1 transcript 55098 57492 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_1"
scaffold_1 exon 55098 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_1"
scaffold_1 transcript 55102 57490 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_2"
scaffold_1 exon 55102 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_2"
scaffold_1 transcript 55102 57480 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_3"
scaffold_1 exon 55102 55372 . + . gene_id "Seita.1G000200"; transcript_id "Seita.1G000200_3"
scaffold_1 transcript 75108 76843 . + . gene_id "Seita.1G000300"; transcript_id "Seita.1G000300_1"
scaffold_1 exon 75108 76406 . + . gene_id "Seita.1G000300"; transcript_id "Seita.1G000300_1"
Thanks for the help
Two remarks:
When you post some tabular file content like in this case, wrap it with the "code sample" option. It's the 5th button from the left of your message editor panel.
This is actually an easy task if you know a little bit of scripting. I would suggest you learn some Python, Perl, or Bash to achieve this result quickly. A dictionary keyed by gene (or a list of tuples) would help you. If you don't want to do that, you can use a counter that starts at 1 and increases at each new transcript line as long as the "gene_id" field is the same as on the line before, resetting to 1 when the gene changes. This requires your file to be sorted.
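For what it's worth, the dictionary idea could be sketched in Python roughly like this. It is only a minimal sketch, not a tested tool: the function name number_transcripts and the feature_col parameter are made up for illustration, and it assumes the layout shown in the question, where the feature type ("transcript" / "exon") is the second whitespace-separated column and every transcript line comes before its own exon lines. For a standard 9-column GTF the feature type would be the third column, so feature_col=2.

```python
import re
from collections import defaultdict

# Patterns for the two attributes shown in the question.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def number_transcripts(lines, feature_col=1):
    """Append _1, _2, ... to transcript_id, counting transcripts per gene.

    Hypothetical helper. feature_col is the 0-based index of the column
    holding the feature type (1 matches the sample in the question).
    Assumes each transcript line precedes its own exon lines.
    """
    counts = defaultdict(int)  # gene_id -> number of transcripts seen so far
    out = []
    for line in lines:
        line = line.rstrip("\n")
        gene = GENE_ID.search(line)
        # Pass through blank lines, comments, and lines without a gene_id.
        if not line or line.startswith("#") or gene is None:
            out.append(line)
            continue
        gene_id = gene.group(1)
        fields = line.split()
        # A new transcript line bumps the counter for its gene.
        if len(fields) > feature_col and fields[feature_col] == "transcript":
            counts[gene_id] += 1
        n = counts[gene_id]
        # Rewrite only the transcript_id attribute, leaving gene_id untouched.
        out.append(
            TRANSCRIPT_ID.sub(
                lambda m: 'transcript_id "{}_{}"'.format(m.group(1), n),
                line,
                count=1,
            )
        )
    return out
```

You would call it on the open file, e.g. number_transcripts(open("input.gtf")) with your own file name, and write the returned lines to a new file. Because the counter is stored per gene in a dictionary, this version does not strictly need the genes to be sorted, only each transcript to be followed by its exons.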