use bedtools to find overlaps of bed and gff files
1
2
Entering edit mode
9.8 years ago
biolab ★ 1.4k

Dear all

I am trying to use bedtools to compare a BED file with a GFF file.

The BED file is like below

Chr1    3641    5640    .    .    +

The GFF file is as follows

Chr1    TAIR10  gene    3631    5899    .       +       .       ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .       +       .       ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1

I used the command intersectBed -a A.bed -b B.gff -wa -wb to find overlaps.

However, my problem is: how to define intergenic, intron and intron-exon-overlapping or intergenic-gene-overlapping regions?

Thank you for any of your suggestions!

gff bedtools • 7.1k views
ADD COMMENT
0
Entering edit mode

I don't understand exactly your problem, what are you trying? Do you want to extract the features of the gff file that are included within your bed region?

ADD REPLY
0
Entering edit mode

Yes, exactly as you mentioned, I want to extract features of GFF file according to BED regions. My problem is: the GFF file does not have intron or intergenic features. More complex thing is: BED regions may have exon-intron overlap. THANKS!

ADD REPLY
0
Entering edit mode

You should really be able to figure this out yourself. For example, do you expect purely intergenic regions to have any overlaps (hint: no, you don't).

ADD REPLY
0
Entering edit mode

Hi Devon

Sorry for my unclear description. As airan said, I want to define every BED region. The difficulty is: the GFF file does not have intergenic and intron information. THANKS!

ADD REPLY
3
Entering edit mode
9.8 years ago
Vishaka Datta ▴ 100

You can find intergenic regions of your GFF file by using complementBed. You require a fasta index file as an input to this method though (usually a file with a .fai extension). The fasta index file can be created by using the faidx tool from SAMtools. Here's an example pipeline for finding overlaps of your BED file with intergenic regions :

Step 1: Create the .fai file from SAMTools faidx. This requires the genome sequence fasta file as input.

samtools faidx <your genome.fa>

This will create a file name <your genome.fa.fai>

Step 2: Run complementBed with this .fai file:

complementBed -i <your GFF gene annotation file> -g <your genome.fa.fai> > intergenic.bed

intergenic.bed now contains the chromosomal coordinates (only) of all intergenic regions. You can now run the intersectBed command of A.bed with intergenic.bed.

To generate a intronic regions bed file, you should either download the exon annotations for your genome and run complementBedon it, or download the intronic region file directly if available.

To run higher order intersections (like say, between three files), you can pass multiple files to the -b argument through wildcard characters. See the docs for help on this - I haven't tried it before.

Hope I understood your question right :)

ADD COMMENT
0
Entering edit mode

Hi, Vishaka,

Thank you very much for your detailed method. My question is not properly described, but you well guessed what I was asking. I am trying your method. Cheers!

ADD REPLY

Login before adding your answer.

Traffic: 2472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6