Question

Filter Gtf With Another Gtf

0

Entering edit mode

11.5 years ago

Nicolas Rosewick 11k

Hi,

I've a two gtf annotation files and I want to filter one of them using the other one. The idea is to remove all the transcript that overlaps with a transcript contained in the second file. It has to be sens specific. I thought about bedtools but I don't know if such a thing is possible with it. Otherwise I will write a little script to do that but I don't want to reinvent the wheel..

Thanks

N.

gtf filter • 3.1k views

ADD COMMENT • link updated 9.6 years ago by Biostar 20 • written 11.5 years ago by Nicolas Rosewick 11k

score 2 · Answer 1 · 2013-05-07

2

Entering edit mode

11.5 years ago

Alex Reynolds 35k

You could use BEDOPS gtf2bed to losslessly convert your two GTF files to BED, and then BEDOPS bedops --not-element-of to return elements from one file which do not overlap the second file:

$ gtf2bed < firstSet.gtf > firstSet.bed
$ gtf2bed < secondSet.gtf > secondSet.bed
$ bedops --not-element-of -1 firstSet.bed secondSet.bed > answer.bed

At the end, you have answer.bed, a BED file that can be converted back to GTF via awk, e.g.:

$ awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' answer.bed > filteredFirstSet.gtf

If you need to separate your GTF files by strand, you could do this with awk, e.g.:

$ gtf2bed < firstSet.gtf \
    | awk '$6=="+"' - \
    > firstSetSense.bed

$ gtf2bed < firstSet.gtf \
    | awk '$6=="-"' - \
    > firstSetAntisense.bed

Repeat for secondSet.gtf.

Run bedops --not-element-of on the two *Sense.bed files:

$ bedops --not-element-of -1 firstSetSense.bed secondSetSense.bed > answerSense.bed

Repeat for the two *Antisense.bed files:

$ bedops --not-element-of -1 firstSetAntisense.bed secondSetAntisense.bed > answerAntisense.bed

Then take the multiset union of the two answer*.bed files:

$ bedops --everything answerSense.bed answerAntisense.bed > answer.bed

Then awk that merged result back to GTF, if needed.

ADD COMMENT • link 11.5 years ago by Alex Reynolds 35k

0

Entering edit mode

Great ! Is it possible to specify a minimum overlap fraction ?

ADD REPLY • link 11.5 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Yes, you can specify either bases or percentage ("fraction"). Take a look at the documentation or bedops --help.

ADD REPLY • link 11.5 years ago by Alex Reynolds 35k

0

Entering edit mode

It seems that bedops do not take into account the strand of the feature.

A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr1 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

B.bed chr1 90 195 A 100 + 90 195 0 3 25,30,20 5,35,90

bedops --not-element-of -1 A.bed B.bed gives me nothing but it should give me the second line of A.bed because the feature is on the negative strand and do not overlap the actual feature in B.bed (encoded on the + strand)

but when I change the chromosome of the second feature in A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

bedops --not-element-of -1 A.bed B.bed gives me : chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

ADD REPLY • link 11.5 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Perhaps you could use awk to split the BED files by strand and run bedops on each strand-specific subset. You can merge (multiset union) the results at the end with bedops --everything, running the BED-to-GTF conversion step on that merged result. See the edited answer for one way to approach this.

ADD REPLY • link 11.5 years ago by Alex Reynolds 35k