Hi,
And your scaffolds also don't have very long headers?
Maybe try:
grep '>' fasta_file.fa | wc -L
this should give the length of your largest header.
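If you would rather check this in a script (and see which headers are the offenders), here is a rough Python sketch. The function names are made up for illustration, and the default limit of 1000 is only taken from the "max is 1000" message quoted elsewhere in this thread. I have not checked whether the tool counts the leading ">" or not; this counts the whole line, like wc -L does.

```python
def max_header_length(lines):
    """Return the length of the longest FASTA header line (0 if there is none)."""
    return max(
        (len(line.rstrip("\n")) for line in lines if line.startswith(">")),
        default=0,
    )


def long_headers(lines, limit=1000):
    """Return (line_number, length) for every header longer than `limit`.

    `limit` is an assumption based on the error message in this thread.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">") and len(line) > limit:
            hits.append((number, len(line)))
    return hits
```

Both functions take any iterable of lines, so you can pass an open file handle, e.g. long_headers(open("fasta_file.fa")).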
One problem I had (I don't remember whether it was in the pre-processing step or in MAKER/BLAST itself) was that gi headers couldn't be processed properly. The | character, blank spaces and * gave errors.
Maybe make a small subset of your genome assembly (for example only 1 chromosome/scaffold/contig) and test whether using EST and protein data that do not have these characters in the headers works for you.
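To pull out such a small test subset, something like the following Python sketch should do. The function name first_records is hypothetical, and it simply takes the first n records in file order, which may or may not be the scaffold you want to test with.

```python
def first_records(lines, n=1):
    """Yield the lines belonging to the first `n` FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # stop at the header of record n + 1
        if seen:  # skip anything before the first header
            yield line
```

You could write the result out with open("subset.fa", "w").writelines(first_records(open("genome.fa"), 1)), where both file names are placeholders.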
sed 's/|//g' file_in.fa > file_out.fa # Remove every | character
sed '/^$/d' file_in.fa > file_out.fa # Remove blank lines
sed 's/\*//g' file_in.fa > file_out.fa # Remove every * character
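The same clean-up can be done in one pass with a small script. This is only a sketch under my own assumptions: it keeps just the first word of each header (so blank spaces disappear), replaces | and other unusual characters with an underscore instead of deleting them, drops blank lines, and strips * from sequence lines. The names clean_header and clean_fasta are made up for this example.

```python
import re


def clean_header(line):
    """Keep the first word of a header and replace unsafe characters with '_'."""
    words = line[1:].split()
    first_word = words[0] if words else ""
    # Anything outside letters, digits, '_', '.', '-' becomes an underscore.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", first_word).strip("_")
    return ">" + safe


def clean_fasta(lines):
    """Yield cleaned FASTA lines (without trailing newlines)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            yield clean_header(line)
        else:
            sequence = line.replace("*", "")  # drop stop-codon markers
            if sequence:
                yield sequence
```

Shortening headers this way can make two IDs identical, so it is worth checking for duplicates afterwards.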
One last remark: it might be a problem that you have set two paths for the alt_est files.
Have you tried concatenating both FASTA files into one and giving just one path? I never read anywhere that MAKER is able to handle multiple paths in its variables, but that might just be something I missed because I never needed to do it.
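Concatenating is easy with cat, but a plain cat can glue the last sequence line of one file onto the first header of the next if a file does not end with a newline. Here is a small Python sketch that guards against that; the function name and the file names in the usage line are placeholders.

```python
def concatenate_fasta(in_paths, out_path):
    """Write all input FASTA files into one file, one after another."""
    with open(out_path, "w") as out:
        for path in in_paths:
            with open(path) as handle:
                last = "\n"
                for line in handle:
                    out.write(line)
                    last = line
                # Make sure the next file's first header starts on its own line.
                if not last.endswith("\n"):
                    out.write("\n")
```

Usage would be along the lines of concatenate_fasta(["alt_est_1.fa", "alt_est_2.fa"], "alt_est_all.fa"), and then you give only the combined file in the opts file.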
Let me know if any of this worked. If not, I'm sorry, but I can't see anything else that is wrong here.
You need to provide more information. For example: what commands did you run, what error message did you get, and which subprogram fails in MAKER? With the amount of information you've given, I don't think anyone can help you.
200,000,000 scaffolds?! Is each read its own scaffold? This sounds a bit fishy to me.
Sorry...I meant that to read "base pairs", nice catch
From the errors, you might want to check: (1) whether all the BLAST executables are in your path (maker_exe.ctl will be auto-configured during installation); (2) whether the EST sequences are DNA and the proteins are amino acids; (3) whether all your input sequences have unique IDs (preferably short); if not, recode them to just numbers; (4) whether repeat masking is enabled (you'll never be able to complete predictions without masking all the repeats).
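For point (3), a quick way to look for duplicate IDs and to recode headers to plain numbers is sketched below. The function names are hypothetical, the "ID" is assumed to be the first word of the header, and the returned mapping is there so you can translate the numbers back to the original headers later.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return the sorted IDs (first word of each header) that occur more than once."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return sorted(name for name, count in Counter(ids).items() if count > 1)


def recode_ids(lines, prefix=""):
    """Replace every header with a running number; return (new_lines, mapping)."""
    mapping = {}
    new_lines = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count}"
            mapping[new_id] = line[1:].strip()  # remember the original header
            new_lines.append(">" + new_id + "\n")
        else:
            new_lines.append(line)
    return new_lines, mapping
```

Saving the mapping to a text file next to the recoded FASTA is probably a good idea, since the numbers on their own carry no information.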
What exactly is repeat masking and which options would I want to change?
I'm going to try maker with the newest version of BLAST tomorrow. Also I downloaded all of the ests/proteins directly from the respective NCBI databases using biopython, so I don't think there are any issues there, I confirmed they were all in fasta format with a quick script. I will also try recoding them to numbers, there may be duplicates (species sequences also being under order sequences)
Hi, a couple of questions to make it easier to help you.
In which format did you provide the ESTs and proteins (FASTA, FASTQ, something else)? Can you maybe give an example of each, for instance the output of
head -n 20 est_file > example_est.txt
head -n 20 protein_file > example_prot.txt
In my experience MAKER is really picky about the headers it can handle. I see, for example, a line that says
Title is very long: 1038 characters (max is 1000)
so maybe your headers are really huge.
What does your maker_opts.ctl file look like? Maybe there is a wrong path somewhere or you forgot to turn on some setting.
Also, to answer your question about RepeatMasker: it finds low-complexity repeats, transposons, etc. and masks them with N's before annotation. This way you get an annotated file with repeats, and it reduces computation time during the annotation of the rest of the genome.
Have you looked at the GMOD training? It explains all of that stuff in detail.
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
The ESTs and proteins were in FASTA format. I used usearch to create centroid FASTA files after downloading all available sequences from NCBI with Biopython. Additionally, I grep'd out all of the headers, and none immediately looked 1000 characters long, but I did not confirm this programmatically. Here is an example of each:
ests: http://pastebin.com/NAJ2fHY4
proteins: http://pastebin.com/9T6m0UGi
opts file: http://pastebin.com/803dEXRL
I replaced the paths for privacy, but they are all valid paths. For altEST, I separated two paths with a comma. I am currently running Maker with JUST the species nucleotide sequence file to see if it completes without errors (protein2genome and est2genome turned off). Thanks for the reply