Extracting nucleotide gene sequence as a single fasta from whole genome data with blastn
2
1
Entering edit mode
3 months ago
Kristen ▴ 10

Hi everyone,

I am trying to extract the coding portion of particular nuclear gene sequences from whole genome sequences from certain species (meaning, the desired DNA stretches are interspersed with non-coding introns). I have the nucleotide sequence for the protein-coding portion of the target genes. My approach so far has been blast-based, in which I:

  1. Download whole genomes from target species from NCBI, and turn them into a local blast database with makeblastdb
  2. Blast the fasta of the known gene sequence (the query) against the whole genome blast database with blastn
  3. Extract blast hits in fasta format using -outfmt 6 with the following code:
    blastn -db $DBPATH -query $GENEFASTA -outfmt '6 sseqid sseq' \
| awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/-/, "", $2); print ">"$1,$2}'

The problem I'm encountering is that the blast hits (i.e. the gene sequences I want to extract from the whole genomes) are not contiguous in their genomes owing to introns. This would be fine if the stretches weren't overlapping and I could simply concatenate them together. However, the blast hits typically have variable amounts of overlap between the reads (I'm not really sure why, maybe due to blast hits extending into the introns?). For example, one gene might yield the following (produced with -outfmt "6 sseqid qstart qend sstart send sseq" to show a clear example):

sseqid   qstart   qend   sstart   send   sseq

SCAF1   1  172   1571   1400 ......AGGTA
SCAF1   168   318   1255   1105   AGGTA.......

In this example, there are 5 bp of overlap between the end of the first portion of the gene and the start of the second portion of the gene. So, I want to basically merge the two reads to get the full nucleotide coding sequence (318 bp long) for this gene for this species, without duplicating the region of overlap between the two reads. But, I can't figure out the best way to do this!

It seems like a simple and logical solution to the problem would be to store these two hit sequences as a multifasta (what I've done with the code below step #3 above), then align them back to the query sequence, and merge the alignment of the two sequences as the final single fasta for this gene for this species. I need to do this with command line tools (not a GUI or the NCBI Blast web interface) because I have many genes from many pathways across many species. But as silly as it seems, I can't find a way to map multifastas to a "reference" (the query sequence) and take the consensus of just the aligned target sequences with any of the common MSA tools (mafft, clustal, etc)!

Does anyone know how to accomplish this alignment goal? Or do you have suggestions for alternative approaches to extracting nucleotide coding sequences from whole genome data?

Thanks in advance for any thoughts or suggestions.

blast cds alignment whole-genome • 1.0k views
ADD COMMENT
3
Entering edit mode
3 months ago
Michael 55k

You will be better off when switching to a more specialized tool, e.g. exonerate or gmap. If you have a little time, I recommend to use exonerate with the model settings coding2genome and --ryo <format> to format your own fasta header and output the CDS, e.g.:

exonerate -c 50 --model coding2genome -q $1 -t $2 --refine full \
  --ryo ">%qi vs. %ti %td  %m aln.length: %qal score: %s %%-ident: %pi %%-sim: %ps strand: %tS  CDS sequence \n%tcs"

Try to get a later version that supports multithreading (-c 50)

ADD COMMENT
0
Entering edit mode

Exonerate is indeed the perfect tool for this!

ADD REPLY
0
Entering edit mode

Wow, I've tested it out and exonerate works perfectly for this. Thanks so much for pointing me in the right direction!

ADD REPLY
0
Entering edit mode

Super - many bioinformatics challenges have a funny pattern of coming back to those same ones we faced 10 or 20 years ago :)

ADD REPLY
0
Entering edit mode
3 months ago
Pei ▴ 230

It looked like that you are searching the ortholog genes in less-annotated species B for well-annotated coding genes in species A? Am I right?

maybe tblast (translated BLAST) will be better than blastn, as it was designed to align an intact protein to a DNA/genome so the result will included separated exons.

If you are dealing with different species, how to be sure that the 5bps (168 - 172) not occur twice in the less-annotated genome due to tandem duplication? It is not surprising that the sequence length of the SAME protein is different between two species, especially when they are distantly related. Did you manually check some sequence?

Not sure if these comments help. Thanks.

ADD COMMENT
1
Entering edit mode

Blast isn't generally a good solution for doing gene model reconstruction. This task is very common in gene prediction to align protein or transcript training data to assemblies. In general, you will want refinement of the alignments generated by heuristics to optimize gene models generated. Blast HSPs are no good to attempt to model correct intron-exon boundaries.

ADD REPLY
0
Entering edit mode

Good to know. I have not worked on such topics for a while. At that time we made use of Blast plus some in-house scripts.

Maybe I should try tools like Exonerate in the future.

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1276 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6