I am trying to find the orthologs of the genes identified in my genome assembly in five other bird genomes (red-throated loon, Adélie penguin, northern fulmar, chicken, pigeon).
To do this I built a single combined BLAST database from the CDS files of these genomes: makeblastdb -in '"Columba_livia.cds" "Fulmarus_glacialis.cds" "Gallus_gallus.cds" "Gavia_stellata.cds"' -out Expanded.txt -parse_seqids -dbtype nucl #new db
However, I can't work out how to search for each gene in my assembly and return the best-hit ortholog from each of the five genomes using local BLAST. I am new to command-line BLAST, and my best guess didn't work: blastn -query COLOgenes.fasta -db Expanded.txt -task blastn -dust no -outfmt "6 qseqid sseqid pident length qstart qend sstart send sallgi evalue qseq sseq" -evalue 1e-6 -num_alignments 1 -max_hsps 1 -out outputExpanded1.blast.txt
The most important thing is to get the actual sequence for each ortholog (I should get five matched sequences, one from each genome), but BLAST only seems to have options for returning one subject sequence.
Any ideas would be appreciated! Thanks, Zach
Thanks very much for the ideas, and sorry about the late response. I have done what you suggested under 2) and now have five BLAST databases, one for each bird species. The question I am struggling with now is how to parse the five BLAST result files, each holding a single alignment per query, so that I can group the potential orthologous sequences from each file together with the query.
For example for the gene CELSR3, I have the query species CELSR3, MATCH1_CELSR3, MATCH2_CELSR3, MATCH3_CELSR3, MATCH4_CELSR3, and MATCH5_CELSR3, with each ortholog of CELSR3 aligned to the query in a separate file, but I need to group the results from each file together for each possible ortholog.
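To make the grouping concrete, here is a minimal Python sketch of one possible approach, not a definitive solution. It assumes each per-species result file is tab-separated with exactly the twelve columns from the -outfmt string above, and that the caller supplies the species labels and file paths. The function names (read_best_hits, group_hits, hits_to_fasta) are made up for illustration. Note that the sseq column holds only the aligned part of the subject, with gap characters, not the full CDS.

```python
from collections import defaultdict

# Column order copied from the -outfmt string used in the blastn command above.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qstart", "qend",
           "sstart", "send", "sallgi", "evalue", "qseq", "sseq"]


def read_best_hits(path):
    """Return {qseqid: row}, keeping the lowest-evalue row for each query."""
    best = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != len(COLUMNS):
                continue  # skip blank, comment, or malformed lines
            row = dict(zip(COLUMNS, fields))
            query = row["qseqid"]
            if (query not in best
                    or float(row["evalue"]) < float(best[query]["evalue"])):
                best[query] = row
    return best


def group_hits(results_by_species):
    """Merge per-species files into {qseqid: {species: row}}.

    results_by_species maps a species label to the path of its result file;
    both the labels and the paths are chosen by the caller.
    """
    grouped = defaultdict(dict)
    for species, path in results_by_species.items():
        for query, row in read_best_hits(path).items():
            grouped[query][species] = row
    return dict(grouped)


def hits_to_fasta(hits):
    """Format one gene's hits as FASTA text, one record per species.

    Gap characters are removed from sseq, so each record is the ungapped
    aligned region of the subject, not necessarily the whole CDS.
    """
    records = []
    for species in sorted(hits):
        row = hits[species]
        sequence = row["sseq"].replace("-", "")
        records.append(f">{species}|{row['sseqid']}\n{sequence}\n")
    return "".join(records)
```

With this, group_hits({"chicken": "chicken.blast.txt", "pigeon": "pigeon.blast.txt"}) (placeholder file names) would give a dictionary keyed by query gene, and hits_to_fasta(grouped["CELSR3"]) would give the matched sequences for CELSR3 across whichever species had a hit. The full query sequence would still need to be taken from the original query FASTA, since qseq is also only the aligned region.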
Are there any tricks for doing this?