Finding gene density from reference genome using R

0

Entering edit mode

9.3 years ago

jamespoweraid2 • 0

Hi,

I was wondering if there is an easy way to find gene density (by counting the number of genes for example in a 10-Mbp window) in R across the whole genome?

Is this normally done by downloading the transcript data bed file?

Thank you

R biomaRt • 8.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.3 years ago by jamespoweraid2 • 0

5

Entering edit mode

9.3 years ago

Giovanni M Dall'Olio 28k

Yes it's very easy in R.

First of all, install and load the Homo.sapiens package:

> biocLite('Homo.sapiens')
> library(Homo.sapiens)
> human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> human.genes
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand   |     gene_id
           <Rle>              <IRanges>  <Rle>   | <character>
      1    chr19 [ 58858172,  58874214]      -   |           1
     10     chr8 [ 18248755,  18258723]      +   |          10
    100    chr20 [ 43248163,  43280376]      -   |         100
   1000    chr18 [ 25530930,  25757445]      -   |        1000
  10000     chr1 [243651535, 244006886]      -   |       10000
    ...      ...                    ...    ... ...         ...
   9991     chr9 [114979995, 115095944]      -   |        9991
   9992    chr21 [ 35736323,  35743440]      +   |        9992
   9993    chr22 [ 19023795,  19109967]      -   |        9993
   9994     chr6 [ 90539619,  90584155]      +   |        9994
   9997    chr22 [ 50961997,  50964905]      -   |        9997

The human.genes object contains the coordinates of all human genes.

Then, create another GRanges object with the coordinates of all the sliding windows:

human.windows = tileGenome(seqinfo(human.genes), tilewidth=100000, cut.last.tile.in.chrom=T)

You can add a column with the number of overlapping genes by using countOverlaps:

> human.windows$totgenes = countOverlaps(human.windows, human.genes)
> human.windows
GRanges object with 31419 ranges and 1 metadata column:
                seqnames           ranges strand   |  totgenes
                   <Rle>        <IRanges>  <Rle>   | <integer>
      [1]           chr1 [     1, 100000]      *   |         3
      [2]           chr1 [100001, 200000]      *   |         1
      [3]           chr1 [200001, 300000]      *   |         0
      [4]           chr1 [300001, 400000]      *   |         0
      [5]           chr1 [400001, 500000]      *   |         0
      ...            ...              ...    ... ...       ...
  [31415] chrUn_gl000245       [1, 36651]      *   |         0
  [31416] chrUn_gl000246       [1, 38154]      *   |         0
  [31417] chrUn_gl000247       [1, 36422]      *   |         0
  [31418] chrUn_gl000248       [1, 39786]      *   |         0
  [31419] chrUn_gl000249       [1, 38502]      *   |         0

Let's verify if this is correct. For example, the first window describes that the first 100,000 bases in chromosome 1 should contain three genes:

> subset(human.genes, seqnames=='chr1' & end < 100000)
GRanges object with 3 ranges and 1 metadata column:
            seqnames         ranges strand |     gene_id
               <Rle>      <IRanges>  <Rle> | <character>
  100287102     chr1 [11874, 14409]      + |   100287102
     653635     chr1 [14362, 29961]      - |      653635
      79501     chr1 [69091, 70008]      + |       79501

Let's take a window with more than 10 genes, and verify if this is correct:

	> subset(human.genes, totgenes>10)
	GRanges object with 57 ranges and 1 metadata column:
	seqnames ranges strand \| totgenes
	<Rle> <IRanges> <Rle> \| <integer>
	[1] chr1 [ 45200001, 45300000] * \| 11
	[2] chr1 [161100001, 161200000] * \| 12
	[3] chr1 [173800001, 173900000] * \| 15
	[4] chr2 [ 74600001, 74700000] * \| 11
	[5] chr2 [ 74700001, 74800000] * \| 12
	... ... ... ... ... ...
	[53] chr19 [54100001, 54200000] * \| 14
	[54] chr19 [54200001, 54300000] * \| 36
	[55] chr21 [46000001, 46100000] * \| 11
	[56] chrX [49100001, 49200000] * \| 11
	[57] chrX [49200001, 49300000] * \| 11
	> subset(seqnames=='chr19' & start> 54100001 & end < 54200000)
	GRanges object with 14 ranges and 1 metadata column:
	seqnames ranges strand \| gene_id
	<Rle> <IRanges> <Rle> \| <character>
	100302255 chr19 [54175222, 54175294] + \| 100302255
	100302265 chr19 [54191735, 54191821] + \| 100302265
	284379 chr19 [54102885, 54106751] - \| 284379
	503834 chr19 [54135310, 54140263] + \| 503834
	574459 chr19 [54169927, 54172508] + \| 574459
	... ... ... ... ... ...
	574464 chr19 [54185413, 54185499] + \| 574464
	574466 chr19 [54189723, 54189809] + \| 574466
	574467 chr19 [54194135, 54194219] + \| 574467
	574468 chr19 [54197647, 54197729] + \| 574468
	574469 chr19 [54198467, 54198547] + \| 574469

view raw biostars-169211.R hosted with ❤ by GitHub

In summary, you now have a human.windows object, where the totgenes column contains the count of genes per window. You can access it with human.windows$totgenes.

Note: you may want to merge all the overlapping genes into a single one. To do so, just use reduce(human.genes) at the beginning.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

hello, sorry to bring this question up after so long, but do you know if there ways to do the same thing with my own genomic sequences and annotation gff file rather than using what is incorporated in bioconductor? Thanks a lot!

ADD REPLY • link 8.7 years ago by milk841103 ▴ 10

0

Entering edit mode

sure, you can use makeTxDbFromGFF from GenomicRanges to create the same TxDB object as above, then follow the same instructions.

ADD REPLY • link 8.7 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

hey, sorry to bother you again but i'm having problems when looking for genes that fall into each bin. to take a step further from finding genes in each bin, I was also trying to find genes with copy number variations(CNV) in each bin. I have already have a list of genes (with chromosome names, starting and ending positions for each gene) and converted into a GRange object using makeGRangefromdataframe function (plus i also added length of each chromosome at seqlength column), and then used countOverlap function for it and the window vector. it gave me number of CNV-affected genes in each bin, but when i tried to add # of cnv genes in all bins, the sum was larger than the total number of CNV-affected genes from my original files. Are you able to tell if I did those steps right, or why the total number of CNV-genes did not match? Thank you so much!

ADD REPLY • link 8.6 years ago by milk841103 ▴ 10

6

Entering edit mode

9.3 years ago

Alex Reynolds 36k

BEDOPS is pretty useful for solving this problem.

If you're not stuck using R, you could use something like this to get chromosome sizes, depending on your genome build:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -e "SELECT chrom,0,size FROM chromInfo;" hg19 | sort-bed - > hg19.chromSizes.bed

To run the window-to-gene mapping step, you will first need a sorted BED file called genes.bed. Assuming you are working with human hg19, you could get that via the following:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c \
    | grep -w "gene" \
    | gtf2bed \
    > gencode.v19.genes.bed

Once you have both chromosome sizes and genes in BED format, you could generate 10 Mb windows across each chromosome with bedops --chop and pipe the result to bedmap --count to get the number of genes over each window within each chromosome:

$ bedops --chop 10000000 hg19.chromSizes.bed | bedmap --echo --count - gencode.v19.genes.bed > hg19.10MbWindowsWithCountsOfOverlappingGencodeV19Genes.bed

All of this would depend on your reference genome or build, and what annotations you want to use to get genes, but the general commands and process are all identical.

ADD COMMENT • link 9.0 years ago by Alex Reynolds 36k

0

Entering edit mode

9.3 years ago

Zev.Kronenberg 12k

This sounds like a job for bedtools.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Zev.Kronenberg 12k

Login before adding your answer.

	> subset(human.genes, totgenes>10)
	GRanges object with 57 ranges and 1 metadata column:
	seqnames ranges strand \| totgenes
	<Rle> <IRanges> <Rle> \| <integer>
	[1] chr1 [ 45200001, 45300000] * \| 11
	[2] chr1 [161100001, 161200000] * \| 12
	[3] chr1 [173800001, 173900000] * \| 15
	[4] chr2 [ 74600001, 74700000] * \| 11
	[5] chr2 [ 74700001, 74800000] * \| 12
	... ... ... ... ... ...
	[53] chr19 [54100001, 54200000] * \| 14
	[54] chr19 [54200001, 54300000] * \| 36
	[55] chr21 [46000001, 46100000] * \| 11
	[56] chrX [49100001, 49200000] * \| 11
	[57] chrX [49200001, 49300000] * \| 11
	> subset(seqnames=='chr19' & start> 54100001 & end < 54200000)
	GRanges object with 14 ranges and 1 metadata column:
	seqnames ranges strand \| gene_id
	<Rle> <IRanges> <Rle> \| <character>
	100302255 chr19 [54175222, 54175294] + \| 100302255
	100302265 chr19 [54191735, 54191821] + \| 100302265
	284379 chr19 [54102885, 54106751] - \| 284379
	503834 chr19 [54135310, 54140263] + \| 503834
	574459 chr19 [54169927, 54172508] + \| 574459
	... ... ... ... ... ...
	574464 chr19 [54185413, 54185499] + \| 574464
	574466 chr19 [54189723, 54189809] + \| 574466
	574467 chr19 [54194135, 54194219] + \| 574467
	574468 chr19 [54197647, 54197729] + \| 574468
	574469 chr19 [54198467, 54198547] + \| 574469