Hi all,
I have a tabular m8 file from running a Diamond annotation on a "reference" transcriptome for a non-model species (assembled with Trinity).
How do I get gene names from the GenBank IDs?
Example from the data sheet: gi|736186330|ref|XP_010770183.1|
The goal is to lift a differential gene expression analysis from transcript level to gene level, and to do that I would really like to use the gene names rather than the IDs given in the example.
Thank you!
The file consists of the following columns (maybe that will help in imagining what the data sheet looks like):
# qseqid means Query Seq-id
# sseqid means Subject Seq-id
# pident means Percentage of identical matches
# length means Alignment length
# mismatch means Number of mismatches
# gapopen means Number of gap openings
# qstart means Start of alignment in query
# qend means End of alignment in query
# sstart means Start of alignment in subject
# send means End of alignment in subject
# evalue means Expect value
# bitscore means Bit score
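For reference, here is a minimal Python sketch of how a file with these twelve columns could be read and how the accession could be pulled out of the sseqid field. It assumes the file is tab-separated and that subject IDs follow the old `gi|number|db|accession|` pattern from the example. The function names and the example row (including its numbers) are made up for illustration.

```python
import csv
from io import StringIO

# Column names of the 12-column tabular layout listed above.
M8_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def accession_from_sseqid(sseqid):
    """Return the accession from an ID like 'gi|736186330|ref|XP_010770183.1|'."""
    parts = [p for p in sseqid.split("|") if p]
    # Old-style NCBI IDs look like 'gi|<number>|<db>|<accession>|'.
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    # Anything else is returned unchanged (assumption: it is already an accession).
    return sseqid


def read_m8(handle):
    """Yield one dict per alignment row, keyed by the column names above."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield dict(zip(M8_COLUMNS, row))


# Hypothetical example row; the alignment values are invented.
example = ("TRINITY_DN10_c0_g1_i1\tgi|736186330|ref|XP_010770183.1|\t85.0\t200\t30\t0"
           "\t1\t600\t1\t200\t1e-50\t250\n")
hits = list(read_m8(StringIO(example)))
print(accession_from_sseqid(hits[0]["sseqid"]))
```

With a real file you would pass `open("your_file.m8")` (file name hypothetical) to `read_m8` instead of the `StringIO` example.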
You want to get the 'gene names' from your transcriptome that match the 'reference' from Ensembl? Or do I have this backwards? You can parse your Trinity FASTA to find the longest isoform per transcript cluster, use that as the representative of the cluster, and re-run BLAST, if you don't want to parse the BLAST result.
Please see this for how Trinity defines Genes vs. Transcripts in the output fasta.
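As a rough illustration of the gene vs. transcript naming: Trinity transcript IDs end in an isoform suffix (`_i1`, `_i2`, ...), and the part before it is the Trinity "gene". A minimal Python sketch, assuming IDs of the form `TRINITY_DN1000_c115_g5_i1` (the helper names and the example IDs are hypothetical):

```python
import re


def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix (_i<number>) from a Trinity transcript ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def group_by_gene(transcript_ids):
    """Map each Trinity gene ID to the list of its transcript (isoform) IDs."""
    genes = {}
    for tid in transcript_ids:
        genes.setdefault(trinity_gene_id(tid), []).append(tid)
    return genes


# Made-up IDs, only to show the grouping.
ids = ["TRINITY_DN1000_c115_g5_i1", "TRINITY_DN1000_c115_g5_i2", "TRINITY_DN7_c0_g1_i1"]
for gene, isoforms in group_by_gene(ids).items():
    print(gene, isoforms)
```

Note that this groups by Trinity's own gene clusters, which is not the same thing as the gene name of the best database hit.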
I am working with a non-model species with no available genome or transcriptome. I did a very deep sequencing and did a de novo assembly with Trinity. None of the available Ensembl references are close to the species I am working on.
I ran a Diamond annotation on the assembled Trinity transcripts and now have a list of NCBI IDs corresponding to each of the transcripts (those that the software was able to identify with the desired e-value cutoffs etc.).
What I am looking for is some sort of tool that can convert the NCBI IDs into the corresponding gene names.
For instance, convert gi|736186330|ref|XP_010770183.1| into PREDICTED: opsin-5-like for each of the transcripts (it is a long list, so there must be an easier way than doing each of them manually?)
Use NCBI eUtils. Something like:
esearch -db protein -query "XP_010770183" | efetch -db protein -format docsum -id XP_010770183 | grep Title
produces
<Title>PREDICTED: opsin-5-like [Notothenia coriiceps]</Title>
Edit: If you have access to the BLAST+ software and the nr BLAST database, then it would be easier to do
blastdbcmd -db /path_to/nr -entry XP_010770183 -outfmt %t
This will produce
PREDICTED: opsin-5-like [Notothenia coriiceps]
That sounds interesting! I will definitely check it out. Do you know if it is able to run "bulk IDs" as well, or is it only one at a time?
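On the bulk question: blastdbcmd has an -entry_batch option that takes a file with one ID per line, so the whole list can go through in one call. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes blastdbcmd is on the PATH and a local nr database is available. The helper names, the default `ids.txt` file name, and the choice of `%a %t` (accession followed by title) as output format are my own.

```python
import subprocess


def write_id_batch(accessions, path):
    """Write the unique accessions to a file, one per line, for -entry_batch."""
    unique = sorted(set(accessions))
    with open(path, "w") as out:
        out.write("\n".join(unique) + "\n")
    return unique


def blastdbcmd_batch_command(db_path, batch_path):
    """Build the blastdbcmd call; '%a %t' prints accession, a space, then the title."""
    return ["blastdbcmd", "-db", db_path, "-entry_batch", batch_path,
            "-outfmt", "%a %t"]


def parse_titles(output):
    """Turn 'accession title' lines into a dict of accession -> title."""
    titles = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        accession, _, title = line.partition(" ")
        titles[accession] = title
    return titles


def lookup_titles(accessions, db_path, batch_path="ids.txt"):
    """Run blastdbcmd once for all accessions (needs BLAST+ and a local database)."""
    write_id_batch(accessions, batch_path)
    result = subprocess.run(blastdbcmd_batch_command(db_path, batch_path),
                            capture_output=True, text=True, check=True)
    return parse_titles(result.stdout)
```

Something like `lookup_titles(my_accessions, "/path_to/nr")` would then return a dict that can be joined back onto the transcript IDs. IDs that are not present in the database may make blastdbcmd report errors, so some handling for that is probably needed.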
You could also run the Trinotate pipeline.
I would love to use Trinotate! :-) But unfortunately, I am only a guest on the server that I am using for data analyses and have no permission to install Trinotate and its dependencies (my laptop is not able to run the analyses on its own without exploding ;-) or at least it sounds like that when I try). The reason for using Diamond over Trinotate is that I have spent the past months trying to get Trinotate installed on the server, in cooperation with one of the bioinformaticians who runs it, but things are going very (very) slowly, and time does not allow me to be patient much longer. I am aware that there are better tools etc., but the bottom line is that I have to use the tools that are available on the server, or that my laptop can run, in order to get the job done in time.