Hi,
I'm using OMA standalone on my own data as well as data downloaded from NCBI. I'm working with CDS nucleotide (DNA) files. However, these CDS files have some gaps represented as Ns. I believe these occur where annotations cross contig boundaries or gaps in the genome assembly.
When running OMA standalone I'm getting many warnings like the ones below, and whilst I know they're probably just because of these gaps, I'm worried this will cause erroneous results, due to the X's being misaligned.
WARNING: IUPAC ambiguity characters for DNA/RNA not supported. Will replace them with 'X'
Pat index with 18353224 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXXXXXXXXXXX"
Pat index with 41395238 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXAAAATATATC"
So my main question is: does OMA standalone account for these gaps? Should I just leave the Ns in, or is there a better way to go about this? And is using the CDS better than using the full genome?
For context, I'm trying to get orthologous groups to help build a species tree and I'm working with insect genomes.
Thank you, Emma
Hi Adrian,
Thanks for your reply. I was confused by this because I checked my sequences for any characters other than A, T, C, G or N and did not find any. I did try dos2unix just in case, but this didn't make any difference. And I've definitely set "InputDataType := 'DNA';" in the parameters file.
Is it possible it's complaining about the headers in my CDS files? I might need to trim them. Although they are on a single line, they look like this:
Thank you,
Emma
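For anyone wanting to repeat the character check described above, here is a minimal Python sketch. It assumes plain-text FASTA input; the function name `unexpected_characters` and the example records are made up for illustration and are not part of OMA. It only strips line endings, so stray spaces or tabs inside sequence lines are reported as well.

```python
from collections import Counter

# Characters considered acceptable in a DNA CDS file for this check.
ALLOWED = set("ACGTN")

def unexpected_characters(fasta_lines, allowed=ALLOWED):
    """Count sequence characters (case-insensitive) that are not in `allowed`."""
    counts = Counter()
    for line in fasta_lines:
        line = line.rstrip("\r\n")  # drop line endings only, keep other whitespace
        if line.startswith(">"):
            continue  # header lines are not part of the sequence
        for ch in line.upper():
            if ch not in allowed:
                counts[ch] += 1
    return counts

# Hypothetical example records, not taken from the original files.
example = [">seq1 some description", "ATGNNNCGT", "ATGRYK TGA"]
print(dict(unexpected_characters(example)))
```

To run it on a real file, pass an open file handle instead of the list, e.g. `unexpected_characters(open("my_cds.fa"))`, where the file name is a placeholder.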
Hi Emma,
I don't think the length of the FASTA header is a problem per se. Please make sure the '>' is really the first character on the line, with no spaces before it. If you're still stuck, can you please send me an example directly or paste a link to a file here?
Best Adrian
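The '>' check suggested above could be sketched in Python roughly as follows. This is only an illustration: `misplaced_headers` is a hypothetical helper, not an OMA tool, and it only looks for ordinary leading whitespace (spaces or tabs) in front of the '>'.

```python
def misplaced_headers(fasta_lines):
    """Return (line_number, line) pairs where '>' is preceded by whitespace."""
    problems = []
    for number, line in enumerate(fasta_lines, start=1):
        stripped = line.lstrip()
        # A header that only starts with '>' after stripping has leading whitespace.
        if stripped.startswith(">") and not line.startswith(">"):
            problems.append((number, line.rstrip("\r\n")))
    return problems

# Hypothetical example lines, not taken from the original files.
example = [">ok_header", "ATGC", " >indented_header", "ATGC", "\t>tabbed_header"]
for number, line in misplaced_headers(example):
    print(f"line {number}: {line!r}")
```

An empty result means every header line really does begin with '>'.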
In any case, it might not be a bad idea to shorten those FASTA header lines. I'm not sure OMA has a problem with them, but I'm sure other programs do (so perhaps OMA does as well). Just remove the parts that are not informative for this analysis, but do keep the headers unique, of course!
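One possible way to do that shortening, as a rough Python sketch: it assumes the first whitespace-delimited token of each header is the informative, unique identifier, which may not hold for every file. The function name `shorten_headers` and the example headers are invented for illustration. It stops with an error instead of silently producing duplicate identifiers.

```python
def shorten_headers(fasta_lines):
    """Keep only the first token of each header line; fail on empty or duplicate IDs."""
    seen = set()
    shortened = []
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            fields = line[1:].split()
            short_id = fields[0] if fields else ""
            if not short_id or short_id in seen:
                raise ValueError(f"empty or duplicate identifier: {line!r}")
            seen.add(short_id)
            shortened.append(">" + short_id)
        else:
            shortened.append(line)  # sequence lines are passed through unchanged
    return shortened

# Hypothetical headers, not the ones from the original files.
example = [">gene1 [locus=X] [protein=Y]", "ATGNNNTGA", ">gene2 some description", "ATGCCCTGA"]
print("\n".join(shorten_headers(example)))
```

If the unique part of the header is somewhere other than the first token, the `fields[0]` line is the place to adapt.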