Question

Maker Pipeline blast results

9.6 years ago

zgayk ▴ 90

Hello,

I am hoping to use maker on a small cluster (~6 compute nodes) to annotate a fairly fragmented de novo assembly that has some longer contigs. We have maker installed, but so far even though every program runs, RepeatMasker seems to be the only program finding matches. Namely, blastx and exonerate don't find any alignment matches even though they seem to be set up correctly in the maker control file.

What I was wondering was whether this is an artifact of the fragmented assembly or some sort of setup error? I find the former hard to believe considering I got at least 2-3 blast hits for each longer contig in the entire assembly using galaxy megablast. I think the error lies in the fact that I get 0 hits, but I am not sure why:

Widget::blastx:
/usr/bin/blastx -db /tmp/maker_sHnU1b/chickenproteomeuniprot%2Efasta.mpi.10.9 -query /tmp/maker_sHnU1b/0/scaffold_1035.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/zgayk/MakerExample2/Gaviaimmerheader.maker.output/Gaviaimmerheader_datastore/38/7C/scaffold_1035//theVoid.scaffold_1035/0/scaffold_1035.0.chickenproteomeuniprot%2Efasta.blastx.temp_dir/chickenproteomeuniprot%2Efasta.mpi.10.9.blastx
#-------------------------------#
deleted:0 hits
collecting blastx reports
flattening protein clusters
prepare section files
processing the chunk divide
preparing evidence clusters for annotations
Preparing evidence for hint based annotation
clustering transcripts into genes for annotations
Processing transcripts into genes
choosing best annotation set
Choosing best annotations
processing chunk output
processing contig output
examining contents of the fasta file and run log

Essentially, the .gff file produced for each contig is empty. If anyone knows how to fix this, I would be very appreciative.
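One way to confirm which evidence sources actually made it into a per-contig GFF file is to tally features by the source column (column 2 of GFF3). The following is only a minimal sketch: the function name and the demo records are hypothetical, and the source label in the demo is an assumption for illustration rather than output copied from this run.

```python
from collections import Counter


def count_features_by_source(gff_text):
    """Count GFF3 feature lines per source (column 2).

    Comment/directive lines (starting with '#') are skipped, and parsing
    stops at the '##FASTA' directive, after which only sequence follows.
    """
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 9:  # a well-formed GFF3 feature line has 9 columns
            counts[fields[1]] += 1
    return counts


# Hypothetical GFF3 content, for illustration only.
demo = (
    "##gff-version 3\n"
    "scaffold_x\trepeatmasker\tmatch\t10\t90\t.\t+\t.\tID=demo1\n"
    "scaffold_x\trepeatmasker\tmatch_part\t10\t90\t.\t+\t.\tID=demo2;Parent=demo1\n"
    "##FASTA\n"
    ">scaffold_x\nACGT\n"
)
counts = count_features_by_source(demo)
print(dict(counts))
```

If a tally like this shows only repeat features and nothing from the protein or EST alignments, that matches the "0 hits" reported in the log above.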

Zach Gayk

Annotation • 2.7k views

updated 22 months ago by Ram 44k • written 9.6 years ago by zgayk ▴ 90

Could you tell us:

What is your N50?
What value did you set for the min_contig parameter in maker_opts.ctl?
What kind of proteins (which database?) are you trying to align to your genome?
What kind of genome are you trying to annotate? Bird? Fungus?

As noted in maker_opts.ctl, trying to annotate a sequence shorter than 10 kb is often useless.
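For reference, N50 and the number of contigs above a length cutoff can be computed from a plain list of contig lengths. This is a rough sketch, assuming the lengths have already been extracted from the assembly; the function names and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def count_at_least(lengths, min_len):
    """Count how many contigs are at least min_len bases long."""
    return sum(1 for n in lengths if n >= min_len)


# Hypothetical contig lengths, for illustration only.
example = [12000, 8000, 5000, 900, 800, 300]
print(n50(example), count_at_least(example, 10_000))
```

Comparing the count at the 10 kb cutoff with the total number of contigs gives a quick sense of how much of an assembly is realistically annotatable.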

updated 22 months ago by Ram 44k • written 9.6 years ago by Juke34 8.9k

Hello, the assembly is fragmented:

Minimum     Number            Number            Total             Total             Scaffold
Scaffold    of                of                Scaffold          Contig            Contig  
Length      Scaffolds         Contigs           Length            Length            Coverage
--------    --------------    --------------    --------------    --------------    --------
    All          5,237,924         5,238,436       767,438,425       767,326,331      99.99%
     50          3,616,441         3,616,953       710,236,525       710,124,431      99.98%
    100          2,146,720         2,147,232       604,271,394       604,159,300      99.98%
    250            743,885           744,397       394,016,485       393,904,391      99.97%
    500            247,247           247,755       223,350,732       223,238,838      99.95%
   1 KB             62,044            62,409        98,533,822        98,431,583      99.90%
 2.5 KB              5,725             5,731        18,713,830        18,710,728      99.98%
   5 KB                231               231         1,310,589         1,310,589     100.00%

The assembly is from a bird: the common loon (Gavia immer). I used the chicken (Gallus gallus) proteome as protein data, along with chicken cDNA for EST evidence. I put the minimum contig length at 500. The contig N50 is 814 bp.

Most of the assembly is in small contigs shorter than 1 kb, and I was only going to use maker as a trial. I thought it might be possible to get valid annotations for the longer contigs at least, but if you think this is not feasible, let me know. The assembly was produced using abyss with pe read data and a k-mer size of 32. Then, because it was still so fragmented, I aligned the contigs to the available red-throated loon genome, and the result is what is shown above. I am not sure why the assembly remains this fragmented (we basically have no scaffolds), although it could be because the group that did the sequencing used only one pe library (8kb). If there are any suggestions as to why the assembly remains so fragmented, I would be very interested. Are we too limited by having only one insert library?
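For a trial run restricted to the longer contigs, the assembly FASTA could be pre-filtered by sequence length before handing it to maker. The sketch below uses only the Python standard library; the function names, the demo records, and the cutoff of 10 are hypothetical and chosen just to keep the example small.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(handle, min_len):
    """Return the records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) >= min_len]


# Hypothetical two-record FASTA, for illustration only.
demo = io.StringIO(">scaffold_a\nACGTACGTAC\nGT\n>scaffold_b\nACG\n")
kept = filter_by_length(demo, 10)
print([h for h, _ in kept])
```

On a real assembly, the same functions would be called on an opened FASTA file with a cutoff in the kilobase range, and the kept records written back out in FASTA format.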

Thanks,
Zach

updated 22 months ago by Ram 44k • written 9.6 years ago by zgayk ▴ 90

Ram · Answer 1 · 2015-05-11

9.6 years ago

Juke34 8.9k

Given the size of your contigs, your Maker result is not surprising. Moreover, the genes in that kind of genome are quite long.

I think you should focus on the assembly before trying to perform any annotation. You must significantly increase the size of your contigs! I suggest trying other assembly tools... but I'm not an expert in this field.

Good luck :)

updated 22 months ago by Ram 44k • written 9.6 years ago by Juke34 8.9k

Does anyone have any ideas as to why the assembly is so fragmented, and specifically why no scaffolds are being produced? If I'm to improve the assembly, I'll need to identify whether the current result is due to data that are inadequate for building long contigs (only one insert size) or perhaps to an error in the assembly process. I realize it is hard to determine from a distance, but any help would be appreciated.

Thanks,
Zach

updated 22 months ago by Ram 44k • written 9.6 years ago by zgayk ▴ 90