How to extract non-overlapping genes from a a bed file?
2
1
Entering edit mode
8.3 years ago

Hi,

I am entirely new to bioinformatics, therefore I sincerely apologize for asking a very naive question. I am interested in counting the number of genes per chromosome in the human genome. For this purpose I have downloaded the latest release of the 'gff' file from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.34_GRCh38.p8/ and have also sorted it using IGVtools from the Integrative Genomics Viewer. IGV also allows for exporting the features as 'bed' file, although there are a single base differences between the start positions of a gene in the 'gff' files and the 'bed' files generated in this way. While scanning through the 'bed' file I observed a lot of repetitions and overlaps. Now my questions are:

1) Is there any available tool for extracting non-overlapping genes from the bed file?

2) Is there any way to automate the selection of only one of the several overlapping genes?

3) Is the 'bed' file converted using the 'Export Features' tool in IGV reliable enough for further processing? What are the preferred alternatives?

I am sure such a trivial topic has already been discussed scores of times in your forum. I would appreciate if you could direct me to some such discussions. It would be wonderful if you could provide a solution that can be carried out using Windows. I sincerely thank you and apologize once again.

bed gff gene gene density • 2.8k views
ADD COMMENT
2
Entering edit mode
8.3 years ago

Devon Ryan's answer is almost certainly the quickest way to get the number of genes per chromosome. But answer your questions as you asked them:

The features you get in bed format have different co-ordinates to those in gff format because of the different co-ordinate systems in these two formats.

BED files are zero-based, half-open. That means that the first base of the chromosome is base 0, and that the end co-ordinate of an entry in a BED file is not included in the gene. I.e. an entry that looks like:

chr1    0    100

refers to the first 100 bases of chromosome 1, which is the bases labeled 0,1,2,3,4 ..... 97, 98 and 99.

GFF files use a 1-based, closed co-ordinate system. That means that the first base of the chromosome is base 1 and that the end co-ordinate of an entry in a GFF file IS included in the gene. I.e. nd entry that looks like:

chr1    protein_coding    exon    1    100

refers to the first 100 bases of chromosome 1, which is the bases labeled 1,2,3,4,5 ..... 98, 99, 100 which are the same bases the BED entry above referred to. This is the source of many "off-by-one" errors in bioinformatics. So the IGV out is correct, the start bases of BED files should be one different from the start bases of GFF files.

Many genes in the human genome do overlap, in all sorts of interesting ways - you have genes that are in the introns of other genes, genes that are on the other strand and whose ends overlap, or genes that share exons. Not matter how weird, if you can think of an way in which genes could be arranged, I can guarantee it happens somewhere in the genome.

There are many ways to get non-overlapping intervals. The tool gtf2gtf from the cgat package has a mode that will select only the longest gene where two genes overlap:

cgat gtf2gtf -I my_gtf_file.gtf --method=filter --filter-method=longest-gene

But the cgat package is a massive toolkit and might be a bit overkill for your situation.

ADD COMMENT
0
Entering edit mode

I do not have words to thank you for explaining the differences between the two kinds of files in such beautiful details, and also for sharing ideas about the concept of gene and whether they are 'overlapping'. I have installed Ubuntu using a Oracle VirtuaBox on my Windows laptop and am struggling to install the CGAT package in it. Hope I will succeed soon and would be able to try your suggestion. Also, I get from your message that may be I should ponder over whether to select non-overlapping genes at all. Thank you so much i.sudbery, I literally can't thank you enough.

ADD REPLY
0
Entering edit mode

I do not have words to thank you for explaining the differences between the two kinds of files in such beautiful details, and also for sharing ideas about the concept of gene and whether they are 'overlapping'. I have installed Ubuntu using a Oracle VirtuaBox on my Windows laptop and am struggling to install the CGAT package in it. Hope I will succeed soon and would be able to try your suggestion. Also, I get from your message that may be I should ponder over whether to select non-overlapping genes at all. Thank you so much i.sudbery, I literally can't thank you enough.

ADD REPLY
1
Entering edit mode
8.3 years ago

It sounds like IGV is doing something you're not expecting. You can more easily get what you want with biomart. For example, this query will yield a list with each chromosome appearing once for each gene on it. You can save that as a text file and sort file.txt | uniq -c to get what you want.

If you want to select non-overlapping genes, then get genes in BED format from biomart and use either bedtools or GenomicRanges in R. The latter allows you to more easily tweak how you want overlaps handled.

ADD COMMENT
0
Entering edit mode

Thank you so much for your attention to such a silly question and for your quick reply. Sorry for this late reply. I am still trying to get the task done. The GenomicRanges is quite a detailed toolkit and I am still trying to find my way through it. Hope I'll find a solution soon.

ADD REPLY

Login before adding your answer.

Traffic: 2209 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6