Since you're blasting with ncbi-NR you're almost there. By changing the blastx output options to outfmt 6 you're receiving tabular output in avenae.out
, you can customise the output columns to also receive the species names.
For example, you could run:
blastx -query /home/nkarim/avenae/trinity_even_out_dir/Trinity.300.longest.fasta \
-db /home/nkarim/blast/db/nr \
-outfmt "6 qseqid sseqid sscinames scomnames staxid pident length mismatch gapopen qstart qend sstart send ppos evalue bitscore" \
-evalue 1e-3 \
-out /home/nkarim/blast/output/avenae.out
Now you have all these extra columns, sscinames
and scomnames
are useful for your end as these contain the scientific name and the common name of your hits.
The laziest way to get the 'top' hit of your queries is to keep only the first hit seen, that normally has the lowest e-value or the highest score :
awk '! a[$1]++' avenae.out > avenae_first_hit_only.out
This falls apart when you have several species with equally good hits, you should still check whether the first hits reported are actually the best hits (manually?). From this out file you also get the percentage of genes with at least one BLAST hit:
wc -l avenae_first_hit_only.out
That will report the number of genes with at least one hit. For GO terms, I'm not sure - you could run GO annotation for your own genes, there are a few online tools for that like PANNZER2 or eggnog-mapper where you can upload your proteins.
For Table S1 (Assembly stats), Trinity has scripts for that, see this older thread Where to find the Trinity assembly statistics?
I appreciate your help!
Firstly, I had already ran Trinity_Stats.pl. After reading various papers, I thought that these item was not enough in it. So I'm collecting additional data.
If I try it as you say, do I have to re-run BLAST? I would like to avoid re-running if I can, because it takes a long time.
And I tried it as you, an error message was outputed as follows. Was "qseqid" an error option? Or does it involve my environment? I use Anaconda.
I'm begginer at analysis of RNAseq, so sorry if these are an inappropriate question. Thank you.
Hm it looks like your job is removing the quotation marks around the blastx command?
works fine,
results in the error message you're seeing. Are you running this command via some job submission script? In which case you might want to use escapes (\") to keep the quotation marks.
You could pull out the common names with a script from the efetch API using your NR hits, might be a bit fiddly thoguh! I hven't done that in a while
Sorry for my late. It took a time for finisheing BLAST.
I forgot using escape(\"), you're right.
I got another problem, but I try to think I will ask question in difference place because the content is different.
Thank you very much for your cooperation! I was helped by your kind advice everytime!