Combining two databases in a GTF file, bedtools gives weird result
6.5 years ago

I'm trying to put genes from GENCODE and NONCODE into one GTF file and filter out the NONCODE genes that are already in GENCODE. My bedtools command is:

bedtools intersect -wa -a NONCODE.gtf -b Gencode_exon.gtf -v -s -f 0.9

(I only take "exon" features from Gencode, otherwise it removes more transcripts from NONCODE)

This works okay for the most part but if I have overlapping transcripts in my NONCODE file, I get only one exon in the output:

(image not shown)

Does anybody have any idea why this happens and how to fix it? I can use another tool; I just need to make sure I get all the Gencode genes and only those NONCODE transcripts that don't almost completely (90%) overlap with Gencode. It has to be strand-specific too.

bedtools gtf noncode • 1.9k views

If you want the GENCODE transcript where there's overlap, then why are you using -wa with -a being your NONCODE list?

-a and -b should be flipped, no?

I also question the use of -v...

To get what you want, I think that you need -wao, with -a GENCODE and -b NONCODE


Thank you, Kevin! I tried that, but it doesn't really work the way I need. I guess that when bedtools works with a GTF file, it doesn't pay attention to the "gene", "transcript", and "exon" features and doesn't see the relations between them. If I use the GENCODE file with "gene" and "transcript" features, bedtools overlaps against the whole span, ignoring that the exons don't cover the whole gene and have gaps between them. If I use only "exon" features, I'll have to write another script that calculates the overlap of all the exons in a gene and, if they overlap by more than 90%, excludes the whole gene_ID from the original file. I was wondering if there is already a tool that does all that...
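For what it's worth, the "another script" idea could look roughly like the Python sketch below. It is only an illustration under several assumptions: exons are grouped by the transcript_id attribute (not gene_ID), coverage is measured against the union of all GENCODE exons on the same chromosome and strand rather than against a single GENCODE transcript, and all function names, transcript IDs, and coordinates are made up for the demo.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def read_exons(gtf_lines):
    """Yield (chrom, strand, start, end, transcript_id) for exon records.
    GTF coordinates are 1-based and inclusive."""
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            yield fields[0], fields[6], int(fields[3]), int(fields[4]), match.group(1)

def merge_intervals(intervals):
    """Merge overlapping or adjacent inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bases(exons, reference):
    """Bases of merged `exons` that fall inside merged `reference` intervals."""
    total = 0
    for start, end in exons:
        for ref_start, ref_end in reference:
            lo, hi = max(start, ref_start), min(end, ref_end)
            if lo <= hi:
                total += hi - lo + 1
    return total

def redundant_transcripts(noncode_lines, gencode_lines, min_frac=0.9):
    """Return NONCODE transcript IDs whose exonic bases are >= min_frac covered
    by GENCODE exons on the same chromosome and strand."""
    reference = defaultdict(list)
    for chrom, strand, start, end, _ in read_exons(gencode_lines):
        reference[(chrom, strand)].append((start, end))
    reference = {key: merge_intervals(val) for key, val in reference.items()}

    transcripts = defaultdict(list)
    for chrom, strand, start, end, tid in read_exons(noncode_lines):
        transcripts[(tid, chrom, strand)].append((start, end))

    redundant = set()
    for (tid, chrom, strand), exons in transcripts.items():
        exons = merge_intervals(exons)
        length = sum(end - start + 1 for start, end in exons)
        if covered_bases(exons, reference.get((chrom, strand), [])) / length >= min_frac:
            redundant.add(tid)
    return redundant

def filter_gtf(noncode_lines, redundant):
    """Keep every line whose transcript_id is not redundant (lines without one are kept)."""
    for line in noncode_lines:
        match = TRANSCRIPT_ID.search(line)
        if not (match and match.group(1) in redundant):
            yield line

def gtf(chrom, start, end, strand, tid):
    """Build one made-up exon line for the demo."""
    return f'{chrom}\tdemo\texon\t{start}\t{end}\t.\t{strand}\t.\tgene_id "g"; transcript_id "{tid}";\n'

gencode = [gtf("chr1", 100, 200, "+", "ENST_demo"), gtf("chr1", 300, 400, "+", "ENST_demo")]
noncode = [
    gtf("chr1", 100, 200, "+", "NON_A"), gtf("chr1", 300, 395, "+", "NON_A"),  # fully covered
    gtf("chr1", 100, 200, "+", "NON_B"), gtf("chr1", 500, 700, "+", "NON_B"),  # partly covered
    gtf("chr1", 100, 200, "-", "NON_C"),                                       # other strand
]
redundant = redundant_transcripts(noncode, gencode)
kept = list(filter_gtf(noncode, redundant))
print(sorted(redundant), len(kept))
```

With real files, the two lists would be replaced by open file handles; the kept NONCODE lines could then be concatenated with the full GENCODE GTF. Whether "90% overlap" should be judged against the union of GENCODE exons or against each GENCODE transcript separately is a design choice this sketch does not settle.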


@OP, some example data and the expected output would help us understand and resolve the issue.


Sorry, I'll try to explain again. I need to add the transcripts from NONCODE to the Gencode database, but I have to make sure that the ones which overlap with Gencode by more than 90% are not included (the transcripts in the circle should be excluded). The closest I've gotten to this is by using

bedtools intersect -wa -a NONCODE.gtf -b Gencode_exon.gtf -v -s -f 0.9

But that option excludes some original exons from the NONCODE transcripts (see exons with the arrows), and I need to add the original full transcripts, not modified.
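A likely explanation, as far as I understand bedtools, is that intersect treats every GTF line as an independent interval, so with -v -f 0.9 each NONCODE exon line is kept or dropped on its own, regardless of which transcript it belongs to. The toy Python sketch below only mimics that per-record logic; it is not bedtools itself, the coordinates and the transcript ID are made up, and half-open intervals are used for simplicity.

```python
# Each tuple stands for one NONCODE exon line: (transcript_id, start, end).
noncode_exons = [
    ("NON_demo.1", 100, 200),
    ("NON_demo.1", 300, 400),
    ("NON_demo.1", 500, 600),
]
# One hypothetical GENCODE exon on the same strand.
gencode_exons = [(300, 400)]

def covered_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A covered by interval B (half-open coordinates)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return overlap / (a_end - a_start)

def per_record_filter(a_records, b_records, min_frac=0.9):
    """Mimic '-v -f min_frac': keep an A record only if no single B record
    covers at least min_frac of it. Records are judged one at a time."""
    kept = []
    for tid, start, end in a_records:
        hit = any(covered_fraction(start, end, b_start, b_end) >= min_frac
                  for b_start, b_end in b_records)
        if not hit:
            kept.append((tid, start, end))
    return kept

kept = per_record_filter(noncode_exons, gencode_exons)
print(kept)
```

Here the middle exon is removed while the other two exons of the same transcript survive, which matches the "missing exons" symptom. Keeping transcripts intact would need a decision made per transcript_id rather than per line.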

(image: Gencode_NONCODE overlap)

I tried the suggested option, bedtools intersect -wao -a Gencode.gtf -b NONCODE.gtf -s, but it reports overlaps of every feature with every other feature (genes with exons, transcripts with exons, etc.) and would require a lot of downstream parsing. I don't know whether any other software can compute the overlap across all the exons of one transcript, but bedtools doesn't seem to do that.
