Hello, I was wondering if anyone knows the best way to do this. I have results from two ORF predictors, MGA and GeneMark. I now want to combine all the predicted ORFs. I would like to keep all overlapping ORFs and also the genes that are slightly longer (e.g. +10 bp). The MGA result is as follows:
NC_003552.1 mga cds 1 1500 133.183 - 0 gene_1; 11
NC_003552.1 mga cds 1501 2000 114.856 + 0 gene_2; 11
NC_003552.1 mga cds 1750 2000 81.7025 - 0 gene_3; 11
The GeneMark result is this:
NC_003552.1 gms cds 1 1503 133.183 - 0 gene_1; 11
NC_003552.1 gms cds 1501 1780 114.856 + 0 gene_2; 11
My desired result is this:
NC_003552.1 gms cds 1 1503 133.183 - 0 gene_1; 11
NC_003552.1 mga cds 1501 2000 114.856 + 0 gene_2; 11
NC_003552.1 mga cds 1750 2000 81.7025 - 0 gene_3; 11
I have tried bedtools intersect, but it didn't give me the right answer. Suggestions for better tools for this are welcome too. Thanks!
So you simply want to merge the two predictions together? I'm not convinced this is the way forward, as it does not guarantee that the 'merged' models will make any biological sense.
But I'm guessing you might know that already. So what is the goal of all this?
@Alex may provide a specific answer for this but you can take a look at his older answer here: How To Get Annotation For Bed File From Another Bed File
I do not know anything about these predictors and their output format. But I can certainly say BED tools will not work on them (most of the BED tools functions require BED regions to work).
From what I understand from your example, you just need union of all results?
cat MGA_results.txt GENEMARK_result.txt | sort -u > all_results.txt
Is this something you are looking for?
Thanks for the response! I actually made a mistake in the desired result; it's corrected now. I aim to get all of the ORFs predicted by MGA and GeneMark. However, when the same ORF is predicted in the same region by both, I'll give preference to the longer one:
In this case, I'll take the second and the third ones.
If there are overlapping ORFs or ORFs on the other strand (-), then I'll keep both:
This gives the final result:
Try this and let me know if it scales up (remove last two columns):
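For reference, the selection rule described in this thread can also be sketched in Python. This is an illustration under assumptions, not the command referred to above. The function names (`parse`, `same_orf`, `merge_predictions`) are made up for this example, and the definition of "same ORF" is a guess that happens to fit the example data: same sequence, same strand, and an identical start or end coordinate. Anything else, including overlaps on the opposite strand, is kept as a separate ORF.

```python
from collections import namedtuple

# One prediction line, e.g. "NC_003552.1 mga cds 1 1500 133.183 - 0 gene_1; 11"
Orf = namedtuple("Orf", "seqid source start end strand line")

def parse(text):
    """Parse whitespace-separated prediction lines into Orf records."""
    orfs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        f = line.split()
        orfs.append(Orf(f[0], f[1], int(f[3]), int(f[4]), f[6], line))
    return orfs

def same_orf(a, b):
    # ASSUMPTION: two predictions describe the same ORF when they are on the
    # same sequence and strand and share a start or an end coordinate.
    return (a.seqid == b.seqid and a.strand == b.strand
            and (a.start == b.start or a.end == b.end))

def merge_predictions(*prediction_sets):
    """Keep every ORF; when two are the 'same ORF', keep the longer one."""
    kept = []
    for orfs in prediction_sets:
        for orf in orfs:
            for i, other in enumerate(kept):
                if same_orf(orf, other):
                    # Replace only if strictly longer; ties keep the first seen.
                    if (orf.end - orf.start) > (other.end - other.start):
                        kept[i] = orf
                    break
            else:
                kept.append(orf)
    return sorted(kept, key=lambda o: (o.seqid, o.start, o.end))

MGA_TEXT = """\
NC_003552.1 mga cds 1 1500 133.183 - 0 gene_1; 11
NC_003552.1 mga cds 1501 2000 114.856 + 0 gene_2; 11
NC_003552.1 mga cds 1750 2000 81.7025 - 0 gene_3; 11
"""
GMS_TEXT = """\
NC_003552.1 gms cds 1 1503 133.183 - 0 gene_1; 11
NC_003552.1 gms cds 1501 1780 114.856 + 0 gene_2; 11
"""

result = merge_predictions(parse(MGA_TEXT), parse(GMS_TEXT))
for orf in result:
    print(orf.line)
```

On the example data from the question this prints the three lines of the desired result. The comparison is quadratic in the number of ORFs, so for large genomes it would probably need an index keyed on (sequence, strand, start) and (sequence, strand, end). The matching rule should also be adjusted if "same ORF" is meant to cover same-strand overlaps that share no boundary.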