Is there any script/program available to compare two gff3 files and find common matches? If possible, with a confidence interval, I would like to accept as equivalent a +2 confidence interval for "start" column, and -2 for "end" column.
e.g.
first gff3 file:
gene-1 GeneMark.hmm CDS 2 181 -231.755620 + 0 ID=.
gene-2 GeneMark.hmm CDS 2 295 -377.586623 - 0 ID=.
gene-3 GeneMark.hmm CDS 39 278 -307.845998 + 0 ID=.
gene-4 GeneMark.hmm CDS 2 133 -167.844314 - 0 ID=.
gene-5 GeneMark.hmm CDS 140 298 -206.922922 - 0 ID=.
second gff3 file:
gene-1 mga CDS 1 181 1.332138 + 2 ID=.
gene-2 mga CDS 1 300 1.314849 - 2 ID=.
gene-3 mga CDS 39 300 1.318473 + 3 ID=.
gene-4 mga CDS 1 133 1.315638 - 2 ID=.
gene-4 mga CDS 146 300 1.339628 - 2 ID=.
gene-5 mga CDS 1 121 1.373858 - 3 ID=.
gene-5 mga CDS 140 300 1.318199 - 2 ID=.
As you can see gene-1 from both files are equal (inside the intervals), gene-2 is outside due the end column (295-300, more than -2 difference, 298 would being acceptable).
And the second gff3 file found one more CDS in gene-4 and gene-5 than first file, so both of them must be compared with single gene-4 and gene-5 from first gff3 file.
Then the output should be the common CDS:
gene-1
gene-4
gene-5
Thanks in advance
Quick answer: bedtools intersect can do what you want (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html). However, do you think that 2bp criteria is correct when some genes might be 10kb in length and some only 100bp (see bedtools intersect -f -r options)?
Have in mind that bedtools compares genomic intervals, but not genes. If you need to compare gene sequences use blast.
yes, in this case 2bp criteria is correct. Thanks for your help.
You need to define a bit better exactly how you want to handle the GFF3 file, as Pgibas mentioned. For example, do you only care about coding regions and not UTRs? What if only one CDS of a gene with 100 matches, do you still want that output?
In general, bedtools, bedops, or even just GenomicRanges in R could be used to do this sort of thing. For the latter, you'd make a two ranges of starts, with one extended by the 2bp window (this is not a confidence interval...don't use that phrase in this context as it has no meaning) on each side, and then findOverlaps. You'd then do the same with the end positions and intersect the results. That would be relatively efficient, though bedtools and bedops are usually faster.
copy that, thanks you very much.