Question

extract common CDS from two gff3 files (using confidence interval)

0

Entering edit mode

10.7 years ago

mahomarr • 0

Is there any script/program available to compare two gff3 files and find common matches? If possible, with a confidence interval, I would like to accept as equivalent a +2 confidence interval for "start" column, and -2 for "end" column.

e.g.

first gff3 file:

gene-1    GeneMark.hmm    CDS    2    181    -231.755620    +    0    ID=.
gene-2    GeneMark.hmm    CDS    2    295    -377.586623    -    0    ID=.
gene-3    GeneMark.hmm    CDS    39    278    -307.845998    +    0    ID=.
gene-4    GeneMark.hmm    CDS    2    133    -167.844314    -    0    ID=.
gene-5    GeneMark.hmm    CDS    140    298    -206.922922    -    0    ID=.

second gff3 file:

gene-1    mga    CDS    1    181    1.332138    +    2    ID=.
gene-2    mga    CDS    1    300    1.314849    -    2    ID=.
gene-3    mga    CDS    39    300    1.318473    +    3    ID=.
gene-4    mga    CDS    1    133    1.315638    -    2    ID=.
gene-4    mga    CDS    146    300    1.339628    -    2    ID=.
gene-5    mga    CDS    1    121    1.373858    -    3    ID=.
gene-5    mga    CDS    140    300    1.318199    -    2    ID=.

As you can see gene-1 from both files are equal (inside the intervals), gene-2 is outside due the end column (295-300, more than -2 difference, 298 would being acceptable).

And the second gff3 file found one more CDS in gene-4 and gene-5 than first file, so both of them must be compared with single gene-4 and gene-5 from first gff3 file.

Then the output should be the common CDS:

gene-1
gene-4
gene-5

Thanks in advance

python gene comparison model biopython • 3.3k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by mahomarr • 0

1

Entering edit mode

Quick answer: bedtools intersect can do what you want (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html). However, do you think that 2bp criteria is correct when some genes might be 10kb in length and some only 100bp (see bedtools intersect -f -r options)?

Have in mind that bedtools compares genomic intervals, but not genes. If you need to compare gene sequences use blast.

ADD REPLY • link 10.7 years ago by PoGibas 5.1k

0

Entering edit mode

yes, in this case 2bp criteria is correct. Thanks for your help.

ADD REPLY • link 10.7 years ago by mahomarr • 0

1

Entering edit mode

You need to define a bit better exactly how you want to handle the GFF3 file, as Pgibas mentioned. For example, do you only care about coding regions and not UTRs? What if only one CDS of a gene with 100 matches, do you still want that output?

In general, bedtools, bedops, or even just GenomicRanges in R could be used to do this sort of thing. For the latter, you'd make a two ranges of starts, with one extended by the 2bp window (this is not a confidence interval...don't use that phrase in this context as it has no meaning) on each side, and then findOverlaps. You'd then do the same with the end positions and intersect the results. That would be relatively efficient, though bedtools and bedops are usually faster.