Dear all,
A student of mine and I have been trying to solve a problem (I doubt we are the only ones who have run into it), but we have not found a solution yet. So it seems a good idea to get some advice from the experts: you!
For a genome synteny analysis we need, for each of a set of species, a protein sequence file (.fasta) and an accompanying bed file (.bed) that describes the location of these proteins in the genome. The bed file should have the following columns: (1) chromosome/scaffold, (2) gene name, (3) start, (4) stop. All this information is in the gff file, and I have no problems extracting it. The issue, however, is that for the majority of the >50 insect species for which I need this data, the names of the protein sequences in the fasta file are not identical to the names given in the gff file. They need to be identical for the analysis to run without issues. Perhaps the names diverged because the various research groups processed the files differently downstream?
What do you think is the most efficient way to get all the protein-coding genes and pseudogenes present in the genome into a protein fasta file, with exactly the same names as given in the gff (or bed) file?
An example of the mismatch, as downloaded:
A sequence name from the .fasta file:
>HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein
MGNVKTLFCTLRPEVCTNKVAIVLGGLPGVTSETRAERPYFDDVSPRNVSAVVGQAAVLRCRAKHTGNRTVSWMRKRDLHILTSHIFTYTGDARFSVLHPEPSDDWDLKIDYVQPRDAGVYECQINTEPKINMAVMLNVEAAAASIWGSQDVYVKKGSTISLTCSVNVHSSPPSSASVLWYHGNAVVDFDSPRGGISLETEKTEGGTTSKLLVTKAALTDSGNYTCVPNNAHPASNILNKSTYVGTPKDK
A “gene line” from the gff file shows that the protein sequence names are different:
Hel_chr2_13 2993800 2996321 HEL_008367 . - maker gene . ID=HEL_008367;Name=HEL_008367;Alias=maker-Hel_chr2_13-snap-gene-30.30;
A correct example:
.fasta file:
>AT1G50680
MRLDDEPENALVVSSSPKTVVASGNVKYKGVVQQQNGHWGAQIYADHKRIWLGTFKSADEAATAYDSASIKLRSFDANSHRNFPWSTITLNEPDFQNCYTTETVLNMIRDGSYQHKFRDFLRIRSQIVASINIGGPKQARGEVNQESDKCFSCTQLFQKELTPSDVGKLNRLVIPKKYAVKYMPFISADQSEKEEGEIVGSVEDVEVVFYDRAMRQWKFRYCYWKSSQSFVFTRGWNSFVKEKNLKEKDVIAFYTCDVPNNVKTLEGQRKNFLMIDVHCFSDNGSVVAEEVSMTVHDSSVQVKKTENLVSSMLEDKETKSEENKGGFMLFGVRIECP*
>AT1G50690
MDPQVVVDKKSEEPDLKRQKLEEEEEEDCEEMSSYSESTCSFDSEDERLVEEEYQRSGYYDFDTTKQRRLVFCYPVIFEDSDVAHKPETDGDLVHRLSKIALQKYNDDKLENLELVRAVKANRKYGAGFIFYITFEAKDANSHTDPITFQAAVRYLRGIETVYRVHPKPLLDSTK*
.bed file:
Chr1 AT1G50680 18777601 18778614
Chr1 AT1G50690 18779606 18780693
So in short: how do I get identical names in both the protein sequence file and the gff/bed file? Is the only solution to extract the protein sequences from the genome file using the gff? If so, how? Or are there other ways?
Thanks so much in advance!
Is this data in GenBank or are these private files that you obtained from somewhere else?
Unfortunately the data does not come from a single source; it is a mix from various public databases and supplementary data files of papers. That's also why some of the files are perfectly fine, with protein names that match the GFF, but for a very large part the names are not identical.
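To see which of the files are fine and which are not, a quick per-species check of the FASTA names against the bed names could look like the sketch below. It is only an illustration: the function names are made up, the bed coordinates in the demo are invented, and the `-RA` style suffix rule is an assumption about MAKER-like transcript names that will not hold for every source.

```python
import io
import re

def fasta_ids(handle):
    """First whitespace-delimited token of every non-empty FASTA header."""
    return [line[1:].split()[0] for line in handle
            if line.startswith(">") and line[1:].strip()]

def bed_names(handle, name_col=1):
    """Gene names from a bed file laid out as: chromosome, gene name, start, stop."""
    return {line.split()[name_col] for line in handle if line.strip()}

def compare(fasta_handle, bed_handle, normalise=lambda name: name):
    """Split FASTA IDs into those found in the bed file and those not found."""
    ids = fasta_ids(fasta_handle)
    names = bed_names(bed_handle)
    matched = [i for i in ids if normalise(i) in names]
    unmatched = [i for i in ids if normalise(i) not in names]
    return matched, unmatched

def strip_transcript_suffix(name):
    """Hypothetical rule: drop a trailing '-RA', '-RB', ... transcript suffix."""
    return re.sub(r"-R[A-Z]$", "", name)

# Demo data: the header is the one quoted above, the bed coordinates are invented.
FASTA = ">HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein\nMGNVK\n"
BED = "Hel_chr2_13 HEL_007193 100 200\n"

raw_matched, raw_unmatched = compare(io.StringIO(FASTA), io.StringIO(BED))
fixed_matched, fixed_unmatched = compare(
    io.StringIO(FASTA), io.StringIO(BED), normalise=strip_transcript_suffix
)
print("as downloaded:", len(raw_matched), "matched,", len(raw_unmatched), "unmatched")
print("after suffix rule:", len(fixed_matched), "matched,", len(fixed_unmatched), "unmatched")
```

If a simple rule like this rescues nearly all names for a given species, renaming the FASTA headers is probably enough for that file; if not, that species is a candidate for re-extracting the proteins.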
@genomax And just to add: because the files come from various sources, each has its own specific issues, so a simple renaming fix for one does not help for another. Might it be best to simply extract the proteins from scratch using the gff file?
If the source of these varies then that may be the solution. Solutions mentioned here are worth a look: How to get proteins from GFF file resulted from MAKER annotation
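For what it is worth, extracting from scratch could look roughly like the sketch below, which names each protein after the gene ID in the GFF. It rests on several assumptions that may not hold for every file: a standard tab-separated 9-column GFF3 (the gene line quoted in the question has a different column order and would need reordering first), mRNA features carrying `ID` and `Parent`, CDS features carrying `Parent`, every coding sequence starting in phase 0, and the standard genetic code. All function names are hypothetical, and only the longest protein per gene is kept.

```python
import io
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def read_fasta(handle):
    """Read a FASTA file into {sequence name: uppercase sequence}."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name and line:
            seqs[name].append(line)
    return {k: "".join(v).upper() for k, v in seqs.items()}

def parse_attributes(field):
    """Turn 'ID=x;Parent=y;' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)

def proteins_from_gff(gff_handle, genome):
    """Translate the CDS of every transcript; return {gene ID: longest protein}."""
    gene_of = {}              # transcript ID -> gene ID
    cds = defaultdict(list)   # transcript ID -> [(start, end), ...], 1-based inclusive
    location = {}             # transcript ID -> (seqid, strand)
    for line in gff_handle:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, attr_field = cols
        attrs = parse_attributes(attr_field)
        if ftype in ("mRNA", "transcript"):
            gene_of[attrs["ID"]] = attrs.get("Parent", attrs["ID"])
        elif ftype == "CDS":
            for tid in attrs["Parent"].split(","):
                cds[tid].append((int(start), int(end)))
                location[tid] = (seqid, strand)
    proteins = {}
    for tid, parts in cds.items():
        seqid, strand = location[tid]
        # Join CDS segments in genomic order, then flip for minus-strand genes.
        dna = "".join(genome[seqid][s - 1:e] for s, e in sorted(parts))
        if strand == "-":
            dna = dna.translate(COMPLEMENT)[::-1]
        # Codons containing N or other ambiguity codes become X.
        prot = "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))
        gene = gene_of.get(tid, tid)
        if len(prot) > len(proteins.get(gene, "")):
            proteins[gene] = prot
    return proteins

# Tiny invented example: one two-exon gene on "+" and one single-exon gene on "-".
genome = read_fasta(io.StringIO(">chr1\nNNATGGCGTAGCTAANNTCATTTCATNN\n"))
rows = [
    ("chr1", "demo", "gene", 3, 15, ".", "+", ".", "ID=g1"),
    ("chr1", "demo", "mRNA", 3, 15, ".", "+", ".", "ID=g1-RA;Parent=g1"),
    ("chr1", "demo", "CDS", 3, 7, ".", "+", "0", "ID=c1;Parent=g1-RA"),
    ("chr1", "demo", "CDS", 12, 15, ".", "+", "1", "ID=c2;Parent=g1-RA"),
    ("chr1", "demo", "gene", 18, 26, ".", "-", ".", "ID=g2"),
    ("chr1", "demo", "mRNA", 18, 26, ".", "-", ".", "ID=g2-RA;Parent=g2"),
    ("chr1", "demo", "CDS", 18, 26, ".", "-", "0", "ID=c3;Parent=g2-RA"),
]
gff_text = "\n".join("\t".join(map(str, row)) for row in rows) + "\n"
proteins = proteins_from_gff(io.StringIO(gff_text), genome)
for gene, prot in proteins.items():
    print(f">{gene}\n{prot}")
```

The terminal `*` is kept, as in the Arabidopsis example in the question. Pseudogenes without CDS features produce no output here, and partial gene models with a non-zero starting phase would be translated out of frame, so for real data a dedicated tool is likely to be more robust than this sketch.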