Good day!
We are trying to extract protein sequences from a FASTA file obtained from NCBI. However, the IDs we have are transcript IDs rather than gene IDs. Is there an easy way to get the gene IDs from the FASTA file using the transcript IDs that we have?
For example, I have this protein sequence in FASTA format:
Then I have the transcript IDs: ENSGALG00000042750.1 ENSGALG00000032142.1
However, we need to get the gene ID from the FASTA file using the transcript ID.
Example:
ENSGALG00000042750.1 >> ENSGALP00000056694.1
ENSGALG00000032142.1 >> ENSGALP00000046506.1
So far, we have been manually entering the gene IDs into a text file based on the transcript IDs. However, there are almost 4,000 entries, which means we would need to enter 4,000 gene IDs by hand.
Thank you very much! It worked! May I also ask whether you have a script for extracting protein IDs using transcript IDs? Thank you!
Or rather the sequence ID. We already have a list of the transcript IDs of the genes we want; however, in order to extract the protein sequences, we need the seqID or proteinID (the ID after ">").
If you find the answer useful, please accept it so that it can be suggested to other users. The script works by grabbing fields from a GTF file. It has no information on protein IDs, so it cannot do that. Use BioMart instead: https://www.ensembl.org/biomart/martview/
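For readers who want to see what "grabbing fields from a GTF file" can look like, here is a minimal Python sketch. It is not the script referred to above. It assumes a tab-separated GTF whose ninth column holds `key "value";` pairs, including `gene_id` and `transcript_id`. The function names `transcript_to_gene` and `strip_version` are made up for illustration. Whether version suffixes such as `.1` need stripping depends on the annotation file, so treat that helper as optional.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def strip_version(stable_id):
    """Drop a trailing version suffix, e.g. 'X.1' -> 'X' (hypothetical helper)."""
    return stable_id.split(".")[0]


def transcript_to_gene(gtf_lines):
    """Build a transcript_id -> gene_id dict from an iterable of GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping
```

To use it, open the annotation file and pass the file handle, for example `with open("annotation.gtf") as fh: mapping = transcript_to_gene(fh)` (the file name is a placeholder), then look up each ID from your list in `mapping`.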
Thank you!