getting a combination of all predicted genes
1
0
Entering edit mode
6.6 years ago
bitpir ▴ 250

Hello, I was wondering if anyone knows the best way to doing this. I have results from 2 ORF predictors, Mga and Genemark. I now want to combine all the predicted ORFs together. I would like to keep all overlapping ORFs and also the genes that are slightly longer (e.g. +10bp). The MGA result is as follow:

NC_003552.1 mga cds 1   1500    133.183 -   0   gene_1; 11  
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

The Genemark result is this:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 gms cds 1501    1780    114.856 +   0   gene_2; 11

My desired result is this:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

I have tried to use bedtools intersect but it didn't get me the right answer. Other better tools to doing this will be welcomed too! Thanks!

bedtools gff • 1.5k views
ADD COMMENT
0
Entering edit mode

you bluntly want to merge the 2 predictions together? I'm not convinced this is the way forward as this does not guarantee the 'merged' models will make any biological sense.

But I'm guessing you might know that already. So what is the goal of all this?

ADD REPLY
0
Entering edit mode

@Alex may provide a specific answer for this but you can take a look at his older answer here: How To Get Annotation For Bed File From Another Bed File

ADD REPLY
0
Entering edit mode

I do not know anything about these predictors and their output format. But I can certainly say BEDtools will not work on them (most of the BEDtools functions require BED regions to work).

From what I understand from your example, you just need union of all results?

cat MGA_results.txt GENEMARK_result.txt | sort -u > all_results.txt

Is this something you are looking for?

ADD REPLY
0
Entering edit mode

Thanks for the response! I actually made a mistake in the desired result, it's corrected now. I aim to get all of the orfs predicted by mga and genemark. However, when the same orf is predicted in the same region, I'll give preference to the longer one:

NC_003552.1 mga cds 1   1500    133.183 -   0   gene_1; 11  
NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 gms cds 1501    1780    114.856 +   0   gene_2; 11

In this case, I'll take the second and the third ones.

If there are overlapping ORFs or orfs in the other strand (-), then I'll keep both:

NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

This will result in the final result:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11
ADD REPLY
0
Entering edit mode

Try this and let me know if it scales up (remove last two columns):

$ cat gms.txt mga.txt | datamash  -s -g 1,3,4,7  --full min 4 max 5
    NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11  1   1503
    NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  1501    2000
    NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11  1750    2000
ADD REPLY
0
Entering edit mode
6.6 years ago

Union with bedops -u and perhaps filter the unioned set by overlap with bedmap --range or bedmap --echo-overlap-size operations, piping to awk to print qualifying records. I have to admit that the logic is a bit unclear to me but that's perhaps where I might start.

ADD COMMENT

Login before adding your answer.

Traffic: 1386 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6