I created a BLAST database from the TrEMBL FASTA file downloaded from UniProt.
I used the -parse_seqids option. Querying my database with
blast/bin/blastdbcmd -db my_blastdb -dbtype prot -entry 'G3S368'
returns:
>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
RYSVGDNDSNNVSIIDTSTNSVVGTVNVGLSTYNVAFTPDGKKIYATNSRNNTTSVIDVTTNKVTATVPTGDHPTDIAVS
PDGNKVYITNTGSNDLSVIDVTTNKVTATVPVGDGPCGVAVTLDGKKAYVPNKRSNTVSVINATTNTVTATVPVGITPLG
VAVTPDGNKVYVTNAESGNVSVIDTATNKVTATVNTGKYYMNYPVEVVIVPFMDSNMTDQSIGATSNAT
On the UniProt website:
>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
MACLGFQRHKAQLNLATRTWPCTLLFFLLFIPVFCKAMHVAQPAVVLASSRGIASFVCEY
ASPGKATEVRVTVLRQADSQVTEVCAATYMMGNELTFLDDSICTGTSSGNQVNLTIQGLR
AMDTGLYICKVELMYPPPYYLGIGNGTQIYVIDPEPCPDSDFLLAFWVFFVKLSQSLFLL
SSIQVGTQYVLSSIMLKKRSPLTTGVYVKMPPTEPECEKQFQPYFIPIN
Any idea what is going on? I don't think this is the only record with this issue.
I checked, and the sequence in the FASTA file is the correct one.
Update
I've built the database a second time, and this time I did not run into the issue.
The sequence is the same in the database and in the original FASTA file.
This time, I didn't use the following option: -max_file_sz 5GB
Could this option have caused the issue?
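For anyone repeating the check described in the update, here is a minimal Python sketch for comparing two copies of a record. It assumes each copy is available as a single-record FASTA string (for example the blastdbcmd output and the text copied from the website). The function names are hypothetical, not part of BLAST+. Line wrapping is ignored, since the database output above is wrapped at 80 columns and the website copy at 60.

```python
def fasta_sequence(text):
    """Return the sequence of a single-record FASTA string, line breaks removed."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one record")
    # Join the wrapped sequence lines and normalise case.
    return "".join(lines[1:]).upper()


def same_sequence(record_a, record_b):
    """True if both records carry the same residues, regardless of wrapping."""
    return fasta_sequence(record_a) == fasta_sequence(record_b)
```

Calling same_sequence(db_record, web_record) on the two records shown in the question should return False, since the residues differ and not just the wrapping.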
I don't quite get what you are pointing to as the underlying problem. Is it that the IDs don't match? That the sequences are not the same? And how exactly is this a makeblastdb issue?
mxs
The sequences are different, but it's the same protein. Why does the database contain this sequence rather than the one UniProt shows? Is it an error introduced by makeblastdb? I've checked the FASTA file and found the record as UniProt has it. I wondered whether the FASTA file might contain two sequences with the same header.
It seems that's not the case: the FASTA file contains only one sequence for this identifier.
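One way to run the duplicate-header check without loading the whole TrEMBL file into memory is sketched below. It assumes UniProt-style headers of the form >db|ACCESSION|ENTRY_NAME, as in the record above. The function name is hypothetical.

```python
def count_accession(lines, accession):
    """Count FASTA headers whose accession (second '|' field) equals `accession`."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split("|")
            # UniProt headers look like: tr|G3S368|G3S368_GORGO description...
            if len(fields) >= 2 and fields[1] == accession:
                count += 1
    return count
```

Because it only iterates over lines, it can be given an open file handle directly, e.g. count_accession(open(path_to_fasta), "G3S368"), where path_to_fasta stands for wherever the downloaded file lives. A result greater than 1 would mean the accession appears more than once in the file.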
Can you locate G3S368 in your downloaded TrEMBL file? It is possible that something went wrong during makeblastdb, but that is highly unlikely. My guess is that some kind of mix-up happened between the website and the TrEMBL dump, given that you did not pre-process the downloaded data yourself.
In the FASTA file downloaded from UniProt, I found the following entry (around line 90585230):
You're right, I didn't do anything to the downloaded data. But the FASTA file is structured so that I can extract hit_id, hit_def and the sequence. I can't figure out what went wrong.
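To illustrate the header structure mentioned above, here is a small sketch that splits a UniProt FASTA header into its parts. The function name and the dictionary keys are hypothetical. How these parts map onto hit_id and hit_def in BLAST output is an assumption that depends on how the database was built.

```python
def parse_uniprot_header(header):
    """Split '>db|ACCESSION|ENTRY_NAME description' into its components."""
    header = header.lstrip(">").strip()
    # Everything before the first space is the identifier, the rest is the description.
    ident, _, description = header.partition(" ")
    parts = ident.split("|")
    if len(parts) != 3:
        raise ValueError("not a UniProt-style header: " + header)
    db, accession, entry_name = parts
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }
```

Applied to the header of the record in the question, this gives the accession G3S368 and the entry name G3S368_GORGO, with the rest of the line as the description.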