Hi everyone,
I am trying to extract the coding portion of particular nuclear gene sequences from whole genome sequences from certain species (meaning, the desired DNA stretches are interspersed with non-coding introns). I have the nucleotide sequence for the protein-coding portion of the target genes. My approach so far has been blast-based, in which I:
- Download whole genomes from target species from NCBI, and turn them into a local blast database with makeblastdb
- Blast the fasta of the known gene sequence (the query) against the whole genome blast database with blastn
- Extract blast hits in fasta format using -outfmt 6 with the following code:
blastn -db $DBPATH -query $GENEFASTA -outfmt '6 sseqid sseq' \
| awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/-/, "", $2); print ">"$1,$2}'
The problem I'm encountering is that the blast hits (i.e. the gene sequences I want to extract from the whole genomes) are not contiguous in their genomes owing to introns. This would be fine if the stretches weren't overlapping and I could simply concatenate them together. However, the blast hits typically have variable amounts of overlap between the reads (I'm not really sure why, maybe due to blast hits extending into the introns?). For example, one gene might yield the following (produced with -outfmt "6 sseqid qstart qend sstart send sseq" to show a clear example):
sseqid qstart qend sstart send sseq
SCAF1 1 172 1571 1400 ......AGGTA
SCAF1 168 318 1255 1105 AGGTA.......
In this example, there are 5 bp of overlap between the end of the first portion of the gene and the start of the second portion of the gene. So, I want to basically merge the two reads to get the full nucleotide coding sequence (318 bp long) for this gene for this species, without duplicating the region of overlap between the two reads. But, I can't figure out the best way to do this!
It seems like a simple and logical solution to the problem would be to store these two hit sequences as a multifasta (what I've done with the code below step #3 above), then align them back to the query sequence, and merge the alignment of the two sequences as the final single fasta for this gene for this species. I need to do this with command line tools (not a GUI or the NCBI Blast web interface) because I have many genes from many pathways across many species. But as silly as it seems, I can't find a way to map multifastas to a "reference" (the query sequence) and take the consensus of just the aligned target sequences with any of the common MSA tools (mafft, clustal, etc)!
Does anyone know how to accomplish this alignment goal? Or do you have suggestions for alternative approaches to extracting nucleotide coding sequences from whole genome data?
Thanks in advance for any thoughts or suggestions.
Exonerate is indeed the perfect tool for this!
Wow, I've tested it out and exonerate works perfectly for this. Thanks so much for pointing me in the right direction!
Super - many bioinformatics challenges have a funny pattern of coming back to those same ones we faced 10 or 20 years ago :)