Filter Gtf With Another Gtf
1
0
Entering edit mode
11.5 years ago

Hi,

I've a two gtf annotation files and I want to filter one of them using the other one. The idea is to remove all the transcript that overlaps with a transcript contained in the second file. It has to be sens specific. I thought about bedtools but I don't know if such a thing is possible with it. Otherwise I will write a little script to do that but I don't want to reinvent the wheel..

Thanks

N.

gtf filter • 3.1k views
ADD COMMENT
2
Entering edit mode
11.5 years ago

You could use BEDOPS gtf2bed to losslessly convert your two GTF files to BED, and then BEDOPS bedops --not-element-of to return elements from one file which do not overlap the second file:

$ gtf2bed < firstSet.gtf > firstSet.bed
$ gtf2bed < secondSet.gtf > secondSet.bed
$ bedops --not-element-of -1 firstSet.bed secondSet.bed > answer.bed

At the end, you have answer.bed, a BED file that can be converted back to GTF via awk, e.g.:

$ awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' answer.bed > filteredFirstSet.gtf

If you need to separate your GTF files by strand, you could do this with awk, e.g.:

$ gtf2bed < firstSet.gtf \
    | awk '$6=="+"' - \
    > firstSetSense.bed

$ gtf2bed < firstSet.gtf \
    | awk '$6=="-"' - \
    > firstSetAntisense.bed

Repeat for secondSet.gtf.

Run bedops --not-element-of on the two *Sense.bed files:

$ bedops --not-element-of -1 firstSetSense.bed secondSetSense.bed > answerSense.bed

Repeat for the two *Antisense.bed files:

$ bedops --not-element-of -1 firstSetAntisense.bed secondSetAntisense.bed > answerAntisense.bed

Then take the multiset union of the two answer*.bed files:

$ bedops --everything answerSense.bed answerAntisense.bed > answer.bed

Then awk that merged result back to GTF, if needed.

ADD COMMENT
0
Entering edit mode

Great ! Is it possible to specify a minimum overlap fraction ?

ADD REPLY
0
Entering edit mode

Yes, you can specify either bases or percentage ("fraction"). Take a look at the documentation or bedops --help.

ADD REPLY
0
Entering edit mode

It seems that bedops do not take into account the strand of the feature.

A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr1 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

B.bed chr1 90 195 A 100 + 90 195 0 3 25,30,20 5,35,90

bedops --not-element-of -1 A.bed B.bed gives me nothing but it should give me the second line of A.bed because the feature is on the negative strand and do not overlap the actual feature in B.bed (encoded on the + strand)

but when I change the chromosome of the second feature in A.bed : chr1 100 200 A 100 + 100 200 0 3 20,30,20 1,30,80
chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

bedops --not-element-of -1 A.bed B.bed gives me : chr2 100 200 B 100 - 100 200 0 3 20,30,20 1,30,80

ADD REPLY
0
Entering edit mode

Perhaps you could use awk to split the BED files by strand and run bedops on each strand-specific subset. You can merge (multiset union) the results at the end with bedops --everything, running the BED-to-GTF conversion step on that merged result. See the edited answer for one way to approach this.

ADD REPLY

Login before adding your answer.

Traffic: 1029 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6