You could use BEDOPS gtf2bed
to losslessly convert your two GTF files to BED, and then BEDOPS bedops --not-element-of
to return elements from one file which do not overlap the second file:
$ gtf2bed < firstSet.gtf > firstSet.bed
$ gtf2bed < secondSet.gtf > secondSet.bed
$ bedops --not-element-of -1 firstSet.bed secondSet.bed > answer.bed
At the end, you have answer.bed
, a BED file that can be converted back to GTF via awk
, e.g.:
$ awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' answer.bed > filteredFirstSet.gtf
If you need to separate your GTF files by strand, you could do this with awk
, e.g.:
$ gtf2bed < firstSet.gtf \
| awk '$6=="+"' - \
> firstSetSense.bed
$ gtf2bed < firstSet.gtf \
| awk '$6=="-"' - \
> firstSetAntisense.bed
Repeat for secondSet.gtf
.
Run bedops --not-element-of
on the two *Sense.bed
files:
$ bedops --not-element-of -1 firstSetSense.bed secondSetSense.bed > answerSense.bed
Repeat for the two *Antisense.bed
files:
$ bedops --not-element-of -1 firstSetAntisense.bed secondSetAntisense.bed > answerAntisense.bed
Then take the multiset union of the two answer*.bed
files:
$ bedops --everything answerSense.bed answerAntisense.bed > answer.bed
Then awk
that merged result back to GTF, if needed.
Great ! Is it possible to specify a minimum overlap fraction ?
Yes, you can specify either bases or percentage ("fraction"). Take a look at the documentation or
bedops --help
.It seems that bedops do not take into account the strand of the feature.
A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr1 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80
B.bed chr1 90 195 A 100 + 90 195 0 3 25,30,20 5,35,90
bedops --not-element-of -1 A.bed B.bed gives me nothing but it should give me the second line of A.bed because the feature is on the negative strand and do not overlap the actual feature in B.bed (encoded on the + strand)
but when I change the chromosome of the second feature in A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80
bedops --not-element-of -1 A.bed B.bed gives me : chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80
Perhaps you could use
awk
to split the BED files by strand and runbedops
on each strand-specific subset. You can merge (multiset union) the results at the end withbedops --everything
, running the BED-to-GTF conversion step on that merged result. See the edited answer for one way to approach this.