Please bear with my rookie questions:
I have thousands of gene names from Mus musculus (house mouse). What I want to do is get the sequence of each gene and then use BLAST for pairwise comparisons.
The gene names are like:
MATN1
1300002E11RIK
LYSMD1
IBTK
MS4A10
TRIM11
...
1. I first tried NCBI blastn with one of these gene names as the query against the whole online database; it returned 0 hits.
2. I then googled the gene name, which directed me to a database other than GenBank where I could download the corresponding sequence. Running blastn with that sequence against the whole online database returned good results.
3. When I search any one of these gene names in Entrez, I get results from many different species with the same gene name.
Questions:
a. Are the gene names not recognized by the NCBI database (because of 1 and 2)? Yet they seem to be recognized (because of 3).
b. With thousands of gene names, I cannot be expected to google each gene to get its sequence, or to search Entrez and manually pick out the mouse entry among many species, right?
How should I map these gene names to accession numbers, so that I can use (for example) getgenbank(AccessionNumber) in Matlab to batch-retrieve the sequences and then run a local pairwise blastn on them?
(Is this workflow OK, or is there a better way to compare two genes when all I have is their names?)
I am really confused as a rookie in bioinformatics. Please help! Thank you very much!!
UniProt maintains a list of synonyms for gene names. Gene names can be converted to UniProt IDs, and these IDs can then be used to get other related IDs with the ID mapping tool http://www.uniprot.org/uploadlists/ . Alternatively, Ensembl BioMart could be used.
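If you would rather stay within NCBI, another option is to query Entrez Gene with the organism restricted to mouse, which avoids the many-species problem from point 3. Below is a minimal Python sketch, not a tested pipeline. The helper names (build_gene_query, build_esearch_url, fetch_gene_ids) are made up for illustration, and the choice of the [Gene Name] and [Organism] field tags is my assumption about what fits your list. Note that the returned values are Entrez Gene IDs, not GenBank accession numbers, so a further lookup is still needed to get sequences.

```python
import json
import urllib.request
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbol, organism="Mus musculus"):
    """Build an Entrez search term limited to one organism (hypothetical helper)."""
    return f"{symbol.strip()}[Gene Name] AND {organism}[Organism]"


def build_esearch_url(symbol, organism="Mus musculus"):
    """Build the ESearch URL for the Entrez Gene database (hypothetical helper)."""
    params = {
        "db": "gene",
        "term": build_gene_query(symbol, organism),
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def fetch_gene_ids(symbol, organism="Mus musculus"):
    """Return the Entrez Gene IDs matching the symbol (needs network access)."""
    with urllib.request.urlopen(build_esearch_url(symbol, organism)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]
```

For thousands of names you would loop over the list, pause between requests to respect NCBI's rate limits, and check any symbol that returns zero or several IDs by hand, since old names such as 1300002E11RIK may have been renamed.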
Your question is missing a key detail: what parameters/options did you use for NCBI BLAST? Please also give the URL.