Hello everyone,
I am trying to BLAST protein sequences, which I extracted based on a .gff annotation, against a protein database from NCBI. The crop I have data on is from the Zingiberaceae, and I was wondering which database from this link I should use: https://ftp.ncbi.nlm.nih.gov/blast/db/ Hopefully someone could tell me or, even better, explain how to find the correct database so that I can do it myself in the future haha :)
They all look the same regarding update date and file size. Which of the 35 options should I choose, or should I blast against all of them?
Thanks in advance!
Niels
So you have protein sequence data from one species (and you know which species it is)? What are you trying to do with that sequence data? Are you looking to identify the proteins, or looking to get homologs/orthologs from other related species?
No, they don't. The files have the same dates because NCBI refreshes all databases at the same time. Their sizes are wildly different. Check this readme file to understand what the different databases include: https://ftp.ncbi.nih.gov/blast/blastftp.txt
Thanks a lot!! So I had a fully assembled genome. I used AUGUSTUS with the Arabidopsis model to annotate the genome, and now I want to functionally annotate it, i.e. predict the function of the genes that AUGUSTUS predicted.
Thank you for the list, but I still do not fully get how I should select a database. There seems to be one for H. sapiens, and I am assuming I should find a similar one that is closely related to my crop. In the link I posted in my opening post there are databases named refseq_protein.1 through refseq_protein.35.
I think I am not seeing the whole picture of how to select a database, or what these 35 different files are.
Could you explain it, please?
Niels:)
If there is a relatively complete proteome available in UniProt for a species near the one you are working with, then you could download that proteome, make a BLAST database out of it, and then get an idea of what your proteins may be doing functionally.
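In case it helps, here is a minimal sketch of those steps using the standard NCBI BLAST+ command-line tools; the file names (related_species_proteome.fasta.gz, predicted_proteins.faa) and the cut-offs are placeholders, not recommendations:

```bash
# Sketch only: build a protein BLAST database from a downloaded UniProt
# proteome and search the AUGUSTUS-predicted proteins against it.
# Requires the NCBI BLAST+ tools (makeblastdb, blastp) on your PATH.

# Uncompress the proteome FASTA downloaded from UniProt Proteomes.
gunzip related_species_proteome.fasta.gz

# Format it as a protein database.
makeblastdb -in related_species_proteome.fasta -dbtype prot -out related_proteome_db

# Search the predicted proteins; tabular output (-outfmt 6) is easy to parse.
blastp -query predicted_proteins.faa \
       -db related_proteome_db \
       -evalue 1e-5 \
       -max_target_seqs 5 \
       -num_threads 4 \
       -outfmt 6 \
       -out predicted_vs_proteome.blastp.tsv
```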
You could use the swissprot or refseq_protein database as the next larger superset before going on to nr. Since these pre-formatted databases are large, they are split into multiple files. You will need to download all pieces for a specific database (e.g. refseq_protein) and uncompress all file pieces in a single directory. You will then use the basename of the files (e.g. nr) with the -db option of blastp. There are other packages like maker and OrthoFinder
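As a rough illustration of that download-and-search step (the database choice, paths, and thresholds below are only examples), BLAST+ ships with update_blastdb.pl, which fetches and checks all the numbered pieces of a pre-formatted database for you:

```bash
# Sketch only: download all pieces of a pre-formatted NCBI database and
# search against it. update_blastdb.pl ships with BLAST+ and verifies the
# checksum of every refseq_protein.NN.tar.gz piece before decompressing.
mkdir -p ~/blastdb
cd ~/blastdb
update_blastdb.pl --decompress refseq_protein

# Point blastp at the database basename (refseq_protein), not at any single
# numbered piece; BLAST resolves all pieces through the alias file.
blastp -query predicted_proteins.faa \
       -db ~/blastdb/refseq_protein \
       -evalue 1e-5 \
       -outfmt 6 \
       -num_threads 8 \
       -out predicted_vs_refseq_protein.tsv
```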
that can help with annotations.
Thanks a lot!! I downloaded the proteome of the common ginger that we buy in every supermarket. I managed to go from a .gff file to an .xml output file containing the BLAST results, and I merged that into a .bed file with the positional information from the original .gff file. I did notice that a lot of the proteins are still uncharacterized, though.
It now looks like this in IGV:
[IGV screenshot]
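For anyone who wants to script a merge like the one described above, here is a rough sketch. It uses tabular BLAST output rather than XML, assumes GFF3-style ID= attributes, and assumes the protein IDs in the query FASTA match the gene IDs in the GFF; all file names are placeholders:

```bash
# Sketch only: keep the best BLAST hit per predicted protein and attach it to
# the gene coordinates from the AUGUSTUS GFF, producing a BED file for IGV.

# 1) Tabular BLAST results: query ID, subject ID, e-value.
blastp -query predicted_proteins.faa -db ginger_proteome_db \
       -outfmt "6 qseqid sseqid evalue" -max_target_seqs 1 -evalue 1e-5 \
       -out best_hits.tsv

# 2) Gene coordinates from the GFF: ID, chrom, 0-based start, end.
awk -F '\t' '$3 == "gene" {
        match($9, /ID=[^;]+/)
        id = substr($9, RSTART + 3, RLENGTH - 3)
        print id "\t" $1 "\t" $4 - 1 "\t" $5
     }' annotation.gff | sort -k1,1 > gene_coords.tsv

# 3) Join on the ID and rearrange into BED4 (chrom, start, end, name = ID_hit).
sort -k1,1 best_hits.tsv |
  join -t $'\t' gene_coords.tsv - |
  awk -F '\t' '{print $2 "\t" $3 "\t" $4 "\t" $1 "_" $5}' > annotated_genes.bed
```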
They say "$1000 genome" but fail to note the $1M annotation that is required afterwards. It is easy to sequence but much more difficult to annotate. If your aim is to fill this information in, then a lot of additional work is going to be required.