I am trying gene identification and annotation of a draft genome (de novo assembly) of a non-model species. I have downloaded protein (fasta) files of three closely related species and employed tblastn with the proteins as query and repeat masked draft (de novo) assembly as target and obtained output with outfmt 6.
This is a pipeline I have come across:
fastaindex genome.fa genome.index
fastafetch genome.fa genome.index seq > seq.fa
fastasubseq seq.fa start length > genomic.fa
exonerate -m p2g --showtargetgff -q $sample -t genomic.fa > $sample.exonerate.gff
egrep -w exon $sample.exonerate.gff > $sample.exon.gff
fastaseqfromGFF.pl genomic.fa $sample.exon.gff > $sampl.pred.nuc.fa
fastatranslate -F 1 $sample_pred.nuc.fa > $sample_pred.aa.fa
t_coffee $sample.pred.aa.fa $sample > $sample.tcoffee
Since the draft genome is fragmented there were multiple hits for any given protein. I have filtered the output of the tblastn step to have hits that satisfy the following pattern which guarantees that query hits have monotonically increasing coordinates and no overlap of query sequences (protein)
proteinSequenceID_n_th q_start_1 q_end_1 scaffold_i s_start_i s_end_i .........
proteinSequenceID_n_th q_start_2 q_end_2 scaffold_j s_start_j s_end_j .........
proteinSequenceID_n_th q_start_3 q_end_3 scaffold_k s_start_k s_end_k .........
.
.
.
proteinSequenceID_m_th q_start_1 q_end_1 scaffold_t s_start_t s_end_t .........
where q_end_1 <= q_start_2 and q_end_2 <= q_start_3. Also after initial filtering scaffold lengths are kept positive by switching s_start with s_end in case s_start > s_end (necessary for employing fastasubseq which only accepts positive length).
Now after I employed the entire pipeline in a loop for all the protein sequences (15k to 20k for each of the three species) I am left with numerous gff files, fa files (predicted amino acids) and tcoffee alignment files.
#!/usr/bin/env bash
noOverlapTblastnOutput=$1 #file with start and end of proteins and start-end of scaffolds(having hits) with no overlap and scaffold start < end
GENOME=$2 #Reference genome fasta file
GENOMEINDEX=$3 #Reference genome indexed file
PROTEINFILE=$4 #Multiple Proteins in fasta format with each sequence in single line only
IFS=$'\n'; for line in `cat $noOverlapTblastnOutput`; do qid=`echo $line | awk '{print $1}'`; sid=`echo $line | awk '{print $4}'`; sstart=`echo $line | awk '{print $5}'`; send=`echo $line | awk '{print $6}'`; Length=$((send - sstart)); fastafetch $GENOME $GENOMEINDEX $sid > scaffold.fa; fastasubseq scaffold.fa $sstart $Length > subsequence.fa; grep -A 1 ">"$qid $PROTEINFILE > queryFile.fa; exonerate -m p2g --showalignment FALSE --showtargetgff TRUE -q queryFile.fa -t subsequence.fa >> $qid".exonerate.gff"; done; unset IFS
IFS=$'\n'; for line in `cat $noOverlapTblastnOutput`; do qid=`echo $line | awk '{print $1}'`; sid=`echo $line | awk '{print $4}'`; sstart=`echo $line | awk '{print $5}'`; send=`echo $line | awk '{print $6}'`; Length=$((send - sstart)); fastafetch $GENOME $GENOMEINDEX $sid > scaffold.fa; fastasubseq scaffold.fa $sstart $Length > subsequence.fa; grep -A 1 ">"$qid $PROTEINFILE > queryFile.fa; egrep -w exon $qid".exonerate.gff" > $qid".exon.gff"; fastaseqfromGFF.pl subsequence.fa $qid".exon.gff" > $qid".pred.nuc.fa"; fastatranslate -F 1 $qid".pred.nuc.fa" > $qid".pred.nuc.aa.fa"; t_coffee $qid".pred.nuc.aa.fa" queryFile.fa > $qid".tcoffee"; done; unset IFS
Now I am lost in this sea of files as I need to find a non-redundant set of genes. Will anyone be kind enough to tell if the above pipeline is correct and if so how do I find the non-redundant set of genes in the following step.
Thanks in advance
Wouldn’t it be simpler to use an existing annotation pipeline? Even with it being non-model, most annotation tools should do a semi decent job I would expect.
Pardon my ignorance but I failed to find any pipeline explicitly for homology-based gene model prediction. The existing pipelines are mostly for de novo gene prediction, e.g. MAKER. The steps I am following roughly follows the steps given in most papers on draft genome/gene prediction (which do not of course mention every details, for brevity, I presume). Could you please suggest a pipeline and/or a paper that provides a detailed account of homology based gene- prediction? Thank you very much for your help.
Sorry for the late reply, this must have gotten buried in my notifications.
I can only give you limited examples, as I'm only familiar with microbial pipelines, but
prokka
, for example, can be given a list of 'trusted proteins' which to annotate from via homology in the first instance. It then iteratively uses Prodigal for de novo prediction of CDSs, and rounds of HMM/blast refinement to attribute annotations to them all.Its a sort of hybrid approach. I'm still not totally clear why you only want to use homology bases methods though?