Question

BLAST Multiple Subject sequences?

8.7 years ago

zgayk ▴ 90

I am trying to locate the orthologs of genes identified in a genome assembly in five other bird genomes (red-throated loon, adelie penguin, northern fulmar, chicken, pigeon).

To do this I made a blast database of each of the five genomes:

makeblastdb -in '"Columba_livia.cds" "Fulmarus_glacialis.cds" "Gallus_gallus.cds" "Gavia_stellata.cds"' -out Expanded.txt -parse_seqids -dbtype nucl #new db

However, I am so far at a loss about how to search for each gene in my assembly and then return the best hit ortholog in all five genomes using local blast. I am new to running command-line blast and my best guess didn't work:

blastn -query COLOgenes.fasta -db Expanded.txt -task blastn -dust no -outfmt "6 qseqid sseqid pident length qstart qend sstart send sallgi evalue qseq sseq" -evalue 1e-6 -num_alignments 1 -max_hsps 1 -out outputExpanded1.blast.txt

The most important thing is to get the actual sequence for each ortholog (should get five match sequences; one from each genome), but blast seems to only have options for returning one subject sequence.

Any ideas would be appreciated! Thanks, Zach

blast • 5.2k views

ADD COMMENT • link updated 8.7 years ago by pld 5.1k • written 8.7 years ago by zgayk ▴ 90

score 1 · Answer 1 · 2016-03-11

8.7 years ago

Damian Kao 16k

It looks like you are putting all your subject sequences (5 birds) into one blast database and then querying that database with parameters (-num_alignments 1 -max_hsps 1) that limit the output to a single alignment per query.

You have two options here:

1) If you want to learn some basic scripting skills, you can have blastn output all the possible subject hits and parse the resulting blast file for the top hits for each of the five species.

2) You can create individual blast databases for each of the five bird species instead of one big database. Then blast your query against the 5 databases separately with the parameter you have now to only output a single alignment.
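Option 1 could look roughly like the sketch below. It assumes the combined-database search was rerun without the single-alignment limit, that the output is tab-separated with qseqid and sseqid in the first two columns (as in the -outfmt string from the question), and that BLAST lists the hits for each query from best to worst. The `species_of` mapping from subject ID to species is hypothetical; it would have to be built from the FASTA files that went into the database.

```python
def top_hit_per_species(lines, species_of):
    """Return {query_id: {species: fields}} keeping the first (top) hit
    seen for each query in each species.

    lines      -- iterable of tab-separated BLAST output lines
    species_of -- hypothetical dict mapping subject ID -> species name
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        species = species_of.get(sseqid)
        if species is None:
            continue  # subject ID not in the mapping
        # setdefault keeps the first hit and ignores later, weaker ones
        best.setdefault(qseqid, {}).setdefault(species, fields)
    return best
```

Passing an open file handle for the BLAST output as `lines` should work, since the function only iterates over it line by line.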


Thanks very much for the ideas, and sorry about the late response. I have done what you suggested under 2) and now have five blast databases, one for each bird species. But now the question I am struggling with is how to parse the five blast output files, each with a single alignment to the query, so that I can group potential orthologous sequences from each blast file together with the query.

For example for the gene CELSR3, I have the query species CELSR3, MATCH1_CELSR3, MATCH2_CELSR3, MATCH3_CELSR3, MATCH4_CELSR3, and MATCH5_CELSR3, with each ortholog of CELSR3 aligned to the query in a separate file, but I need to group the results from each file together for each possible ortholog.

Are there any tricks for doing this?

ADD REPLY • link 8.6 years ago by zgayk ▴ 90
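One possible way to do the grouping described in this reply is sketched below. It assumes each per-species output file uses the 12-column -outfmt string from the question, so that sseqid is column 2 and sseq is column 12. Note that sseq holds only the aligned part of the subject, with "-" marking gaps, so the sketch strips the gaps. The function names and the FASTA header layout are made up for illustration.

```python
def group_hits_by_query(per_species_lines, sseqid_col=1, sseq_col=11):
    """Merge per-species BLAST tables into {query: {species: (sseqid, seq)}}.

    per_species_lines -- dict mapping species name -> iterable of
                         tab-separated BLAST output lines for that species
    """
    grouped = {}
    for species, lines in per_species_lines.items():
        for line in lines:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            query = fields[0]
            # remove alignment gap characters from the subject sequence
            hit = (fields[sseqid_col], fields[sseq_col].replace("-", ""))
            # keep only the first hit per query and species
            grouped.setdefault(query, {}).setdefault(species, hit)
    return grouped


def to_fasta(query, hits):
    """Format the grouped hits for one query as FASTA text."""
    records = []
    for species in sorted(hits):
        sseqid, seq = hits[species]
        records.append(f">{query}|{species}|{sseqid}\n{seq}\n")
    return "".join(records)
```

Writing the result of `to_fasta` to one file per gene would give, for example, a single CELSR3 file holding the match from each species.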

score 1 · Answer 2 · 2016-03-11

One approach is to use Best Reciprocal Hits: http://www.ncbi.nlm.nih.gov/pubmed/18042555/

Basically you blast your transcriptome against the references, take the best hits, and blast them against a database made out of your transcriptome. Orthologs would be the cases where you have reciprocal best hits: the best hit of a is b, and the best hit of b is a.

You do not need BLAST to return the sequences in its output; just get the hit IDs and you can retrieve the full sequences from the blast database using blastdbcmd.
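A minimal sketch of that retrieval step, assuming the hit IDs have been written one per line to a text file and that the database was built with -parse_seqids so IDs can be looked up. The database and file names in the usage note are placeholders, and the wrapper functions are hypothetical.

```python
import subprocess


def blastdbcmd_args(db, id_file, out_fasta):
    """Build the blastdbcmd argument list for a batch of sequence IDs."""
    return [
        "blastdbcmd",
        "-db", db,                # name of the BLAST database
        "-entry_batch", id_file,  # text file with one sequence ID per line
        "-out", out_fasta,        # FASTA file to write the sequences to
    ]


def fetch_sequences(db, id_file, out_fasta):
    """Run blastdbcmd; raises CalledProcessError if the tool fails."""
    subprocess.run(blastdbcmd_args(db, id_file, out_fasta), check=True)
```

For example, `fetch_sequences("Gallus_gallus", "best_hit_ids.txt", "best_hits.fasta")` would need blastdbcmd on the PATH and a database of that name in the working directory.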

Something like:

1) BLAST your transcriptome against your reference database. This is the "forward" blast run.
2) Collect the ID of the best hit for each query and use them to pull hit sequences from the reference database using blastdbcmd (these are the query sequences for the reverse run).
3) BLAST the reverse queries against a database made out of your transcriptome. This is the "reverse" blast run.
4) Find cases of best reciprocal hits; these are your potential orthologs.
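The last step can be sketched as follows, assuming both runs produced tab-separated output with the query ID in column 1 and the subject ID in column 2, and that the first line for each query is its best hit. The function names are illustrative only.

```python
def best_hits(lines):
    """Return {query_id: subject_id} using the first hit listed per query."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        best.setdefault(fields[0], fields[1])  # keep only the first hit
    return best


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit.

    forward -- {transcript_id: reference_id} from the forward run
    reverse -- {reference_id: transcript_id} from the reverse run
    """
    return sorted(
        (query, subject)
        for query, subject in forward.items()
        if reverse.get(subject) == query
    )
```

Running this once per reference species gives one list of candidate ortholog pairs for each species.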

With the approach you're using now, BLAST sees all of the input files as a single database. That is why you're not seeing five hits: blast only sees a single database and returns one best alignment from it. If you want orthologs for each reference species, you should make separate databases and run this process for each species.
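The per-species setup could be scripted along these lines. This is only a sketch: it builds the command lines without running them, takes the database name from the FASTA file name, and uses -max_target_seqs 1 in place of the -num_alignments 1 from the question, a swap that should be checked against your BLAST+ version. Only the four file names given in the question are used in the usage note, since the fifth was not listed.

```python
# Output columns copied from the blastn command in the question.
OUTFMT = ("6 qseqid sseqid pident length qstart qend "
          "sstart send sallgi evalue qseq sseq")


def per_species_commands(query_fasta, species_fastas, evalue="1e-6"):
    """Return makeblastdb and blastn argument lists, one pair per species."""
    commands = []
    for fasta in species_fastas:
        db = fasta.rsplit(".", 1)[0]  # e.g. "Gallus_gallus.cds" -> "Gallus_gallus"
        commands.append([
            "makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-parse_seqids", "-out", db,
        ])
        commands.append([
            "blastn", "-query", query_fasta, "-db", db,
            "-task", "blastn", "-dust", "no", "-outfmt", OUTFMT,
            "-evalue", evalue,
            "-max_target_seqs", "1",  # assumption: replaces -num_alignments 1
            "-max_hsps", "1",
            "-out", f"{db}.blast.txt",
        ])
    return commands
```

Each list returned for `per_species_commands("COLOgenes.fasta", ["Columba_livia.cds", "Fulmarus_glacialis.cds", "Gallus_gallus.cds", "Gavia_stellata.cds"])` can be passed to `subprocess.run(cmd, check=True)` to execute it.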

This is typically run on peptide sequences, so I'd generate predicted peptides for your transcriptome and use blastp against reference peptides.