I am trying to set up protein BLAST alignment. I have a list of genes formatted as [EntrezGeneID_GeneSymbol], such as 395552_CCL1. I'm wondering if there's a way to convert these genes to FASTA entries so I can run protein BLAST+. Please let me know if you can help. I have a large number of genes, so I cannot do this process manually through NCBI's online BLAST interface.
Thank you! The Entrez Direct command worked like a charm. I also tried Batch Entrez, but I think my file was too large, since the site did not end up loading the second page.
Hi, I am wondering if there is a way to edit this command so that only the first isoform is returned? Thank you!
If you want only one isoform for every protein coding gene, I recommend using RefSeq Select to pick one isoform per gene. As of now, the scope of RefSeq Select is limited to human and mouse only. If you are interested in other organisms, you will have to come up with your own set of specifications for picking one isoform per gene. Perhaps I may be able to help you better if you can specify exactly what you are interested in. For example, you can use Entrez Direct to generate a 2-column table with protein accession and size:
and then add gene_id to that table to pick the longest isoform for each protein. Depending on how big your starting set of gene IDs is and the taxonomic scope, you will have to tailor your approach.
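The selection step described above can be sketched as follows. This is a minimal sketch, assuming the table has already been extended to three tab-separated columns (gene ID, protein accession, protein length); the function name `pick_longest_per_gene` and the example rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Input (stdin): tab-separated lines "gene_id<TAB>protein_accession<TAB>length".
# Output: one line per gene, keeping the row with the largest length.
pick_longest_per_gene() {
    # Sort by gene ID, then by length in descending numeric order,
    # so the longest protein of each gene comes first in its group.
    sort -t "$(printf '\t')" -k1,1 -k3,3nr |
        # Keep only the first row seen for each gene ID.
        awk -F '\t' '!seen[$1]++'
}

# Made-up rows, for illustration only (not real accessions).
printf '1\tNP_A\t120\n1\tNP_B\t340\n2\tNP_C\t95\n' | pick_longest_per_gene
```

With the made-up rows above, the row for NP_B is kept for gene 1 and the row for NP_C for gene 2. If two isoforms of a gene have the same length, which one is kept depends on the sort order, so a different tie-breaking rule may be needed.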
Thank you for the reply. My overarching goal is to use BLAST as an intermediate step to find similar genes between organisms, some of which would not be matched by comparing gene names alone, due to differing annotation between species. I am therefore blasting chicken genes against the mouse genome. My goal is to return only one alignment per gene, ignoring isoforms. My set of gene IDs is around 850 genes.
Additionally, due to the large number of FASTA entries in my output, my terminal times out when running BLAST. Is there a way to modify the code so it returns multiple FASTA files (one for each gene)? This way, I can pass these to BLAST via a loop and store output even if the program terminates early. Alternatively, if you have a better way to run large BLAST alignment queries locally, please feel free to share.
For mouse, you are better off just using RefSeq Select. You can use the query Mus musculus[Organism] AND RefSeq_Select[Filter] in the NCBI Protein portal for that; see here.
For the 850 chicken genes, you can download the largest protein of each gene into a separate FASTA file using Entrez Direct as follows:
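One possible shape for such a script is sketched below; it is not necessarily the command originally posted in this thread. It assumes Entrez Direct and BLAST+ are installed and that network access is available. The function names, the `fasta/` and `blast/` directories, and the BLAST database name are hypothetical. The helper `run_once` skips work whose output already exists, so an interrupted run can be restarted without redoing finished genes.

```shell
#!/usr/bin/env bash
# run_once OUT CMD [ARGS...]: run CMD only if OUT does not exist yet.
# Output is written to a temporary file first, so an interrupted run
# never leaves a partial OUT behind.
run_once() {
    local out=$1
    shift
    [ -e "$out" ] && return 0
    if "$@" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}

# Print the longest RefSeq protein of one gene ID as FASTA.
# Assumption: the protein summary exposes AccessionVersion and Slen.
longest_protein_fasta() {
    local acc
    acc=$(elink -db gene -id "$1" -target protein -name gene_protein_refseq |
          esummary |
          xtract -pattern DocumentSummary -element AccessionVersion Slen |
          sort -k2,2nr | head -n 1 | cut -f1)
    [ -n "$acc" ] && efetch -db protein -id "$acc" -format fasta
}

# $1 = file with one gene ID per line; writes fasta/<gene_id>.fa
fetch_all() {
    mkdir -p fasta
    while read -r gene_id; do
        [ -n "$gene_id" ] || continue
        # </dev/null keeps the EDirect tools from consuming the ID list.
        run_once "fasta/$gene_id.fa" longest_protein_fasta "$gene_id" < /dev/null ||
            echo "no protein retrieved for gene $gene_id" >&2
    done < "$1"
}

# $1 = name of a local protein BLAST database (hypothetical);
# writes one tabular result file per gene into blast/.
blast_all() {
    mkdir -p blast
    for fa in fasta/*.fa; do
        [ -e "$fa" ] || continue
        run_once "blast/$(basename "$fa" .fa).tsv" \
            blastp -query "$fa" -db "$1" -outfmt 6 -max_target_seqs 1 -max_hsps 1
    done
}
```

A run might then look like `fetch_all gene_id_list.txt` followed by `blast_all mouse_refseq_select`, where `mouse_refseq_select` stands for whatever database was built locally from the mouse proteins. Keeping one result file per gene means a timeout only costs the gene that was in progress.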
Hi, unfortunately the above script isn't working for my gene list. It returns errors such as:
WebEnv value not found in link output - WebEnv1
Db value not found in summary input
Failure of post to find data to load
Db value not found in fetch input
Here are a few entries from my input file of gene IDs: 396320 395771 396128 408047 417179 395970 422145 396526.
Will this work?
I am not sure why it isn't working for you. I am able to get the FASTA sequences for the genes you list. The gene_id_list.txt file should have one gene ID per line. It won't work if you have them as a space-delimited list.
I am wondering if there's a way to get FASTA sequences (just like in the command above) but using gene symbols rather than gene IDs?
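On the one-ID-per-line requirement for gene_id_list.txt: a space-delimited list can be converted with standard text tools. The sketch below is a minimal example; the function name `normalize_gene_ids` is hypothetical, and it also strips a `_SYMBOL` suffix such as the one in 395552_CCL1 from the first post.

```shell
#!/usr/bin/env bash
# Read gene IDs separated by spaces, tabs, or newlines from stdin and
# print one bare numeric ID per line.
normalize_gene_ids() {
    # Turn spaces and tabs into newlines, squeezing repeats;
    # then drop any "_SYMBOL" suffix and remove empty lines.
    tr -s ' \t' '\n\n' | sed -e 's/_.*$//' -e '/^$/d'
}

# Example using IDs quoted earlier in the thread.
printf '396320 395771 396128\n395552_CCL1\n' | normalize_gene_ids
```

In practice this could be run as `normalize_gene_ids < raw_ids.txt > gene_id_list.txt`, where `raw_ids.txt` is a placeholder for the original list.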
If you know which species the gene symbols are from and that they are the official symbols, you can use esearch with a query like this: "Gallus gallus"[Organism] AND GJB6[Gene Name]. Then you pipe the esearch results to elink just as above.
I tried this command in the command line and it works well. Is there a way to call this Bash script from R, to make it simpler for another researcher to use?
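For the gene-symbol lookup described above, the query can be built per symbol and passed to esearch. This is a minimal sketch, assuming Entrez Direct is installed and the symbols are official ones stored one per line; the function names are hypothetical, and only the query-building step is executed in the example.

```shell
#!/usr/bin/env bash
# Build an Entrez Gene query for one official symbol in one species.
# $1 = organism name, $2 = gene symbol
gene_query() {
    printf '"%s"[Organism] AND %s[Gene Name]\n' "$1" "$2"
}

# Print protein FASTA for every symbol in a file (one symbol per line).
# Needs EDirect and network access; defined here but not called.
# $1 = organism name, $2 = symbol file
fasta_for_symbols() {
    while read -r symbol; do
        [ -n "$symbol" ] || continue
        # </dev/null keeps esearch from consuming the symbol list.
        esearch -db gene -query "$(gene_query "$1" "$symbol")" < /dev/null |
            elink -target protein |
            efetch -format fasta
    done < "$2"
}

# Show the query used in the answer above.
gene_query "Gallus gallus" GJB6
```

The last line prints the same query string given in the answer. Calling `fasta_for_symbols "Gallus gallus" symbols.txt` (with `symbols.txt` as a placeholder file name) would return all linked proteins, so the isoform selection discussed earlier would still need to be applied.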
There's the rentrez package, which may do the trick. I haven't used it much myself, so I cannot say how well it works.