Having trouble with biomaRt, getBM and chromosome_end
8.1 years ago
R.Blues ▴ 160

Hello everyone,

I suspect this is a very silly mistake, but I cannot get this to work.

I am using the biomaRt package to obtain the chromosome length of different chromosomes of the human genome. The thing is, when retrieving other information such as the ensembl ID, it works well (using other filters). However, with this code, the programme never stops running. Why? What am I doing wrong?

library(biomaRt)
mart_h <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
test <- getBM(attributes = "chromosome_end", filters = "chromosome_name", values = 1:2, mart = mart_h)
test_2 <- getBM(attributes = "chromosome_end", filters = "chromosome_name", values = 1, mart = mart_h)

I know there are more efficient ways of obtaining these lengths, but I would prefer to use this package (this code is part of a pipeline).

Thank you very much, I am pretty sure it will be a very silly thing, but I am not able to solve this.

Have a nice day!

R biomart • 2.5k views
8.1 years ago
ddiez ★ 2.0k

An alternative and simpler way to obtain chromosome lengths is to use the GenomeInfoDb Bioconductor package:

library(GenomeInfoDb)
Seqinfo(genome="hg38")
Seqinfo object with 455 sequences (1 circular) from hg38 genome:
  seqnames         seqlengths isCircular genome
  chr1              248956422      FALSE   hg38
  chr2              242193529      FALSE   hg38
  chr3              198295559      FALSE   hg38
  chr4              190214555      FALSE   hg38
  chr5              181538259      FALSE   hg38
  ...                     ...        ...    ...
  chrUn_KI270753v1      62944      FALSE   hg38
  chrUn_KI270754v1      40191      FALSE   hg38
  chrUn_KI270755v1      36723      FALSE   hg38
  chrUn_KI270756v1      79590      FALSE   hg38
  chrUn_KI270757v1      71251      FALSE   hg38

The chromosome lengths are stored in the column seqlengths. Take a look at ?Seqinfo for details.
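As a minimal sketch of how the lengths could be pulled out for a pipeline: the `seqlengths()` accessor from GenomeInfoDb returns a named vector that can be indexed by sequence name. The variable names `si` and `chrom_lengths` are only illustrative, and the call needs internet access to fetch the hg38 information.

```r
library(GenomeInfoDb)

# Fetch the sequence information for hg38 (requires internet access)
si <- Seqinfo(genome = "hg38")

# seqlengths() returns a named vector; index it with UCSC-style names
chrom_lengths <- seqlengths(si)[c("chr1", "chr2")]
chrom_lengths
```

Going by the listing above, this should give 248956422 for chr1 and 242193529 for chr2.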

8.1 years ago
ddiez ★ 2.0k

I am not sure whether it is possible to get just chromosome lengths with biomaRt, but a small test points to the possible cause of your problem. Say you want to obtain the hgnc_symbol based on the external_gene_name. Then you do something like this:

getBM("hgnc_symbol", filters = "external_gene_name", values = "STAT1", mart_h)
  hgnc_symbol
1       STAT1

But if instead you want to get the chromosome_end:

foo <- getBM("chromosome_end", filters = "external_gene_name", values = "STAT1", mart_h)
head(foo)
  chromosome_end
1      190959382
2      190959427
3      190959445
4      190959462
5      190959498
6      190959506
dim(foo)
[1] 3843    1

Note the size of the object. At the moment I am not sure how to interpret this, but my guess is that you are getting something similar to this for a whole chromosome, which obviously takes a lot more time and may run into memory problems.


Oh, my God.

Then... what exactly am I retrieving? What is that object?

(Oh, and thank you very much!)


I haven't checked, but based on the values and the number of entries, I suspect these are the positions of the exon ends (STAT1 lies between 190 and 191 Mb on chr2 and is encoded by a variety of alternatively spliced transcripts).


Ahá. Well, I guess I will have to think of another way of getting this information. Do you know if there is any way to query it somehow or will I have to use the typical chrom.size files?

In any case, thank you both for your kind help. It is very nice to deal with people like you! :)


The most efficient option would probably be the typical chromosome sizes file, since those values are not expected to change from week to week. It is just a small data file that you save to disk and read when the script starts (or, less ideally, you hard-code the data in your tool). You can also browse Ensembl BioMart in your web browser to see which information is available, instead of using trial and error on the command line.
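The available attributes can also be inspected from within R. The following is a rough sketch, not a definitive recipe: `listAttributes()` is part of biomaRt, while `find_attributes()` is a hypothetical helper written only for this example. It needs a working connection to the Ensembl server.

```r
library(biomaRt)

# Hypothetical helper: keep the attributes whose name or description
# matches a pattern (case-insensitive)
find_attributes <- function(mart, pattern) {
  attrs <- listAttributes(mart)
  hits <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hits, c("name", "description", "page")]
}

mart_h <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# List every attribute that mentions "end", with its description
end_attrs <- find_attributes(mart_h, "end")
end_attrs
```

Reading the description column before running getBM should make it clearer what an attribute such as chromosome_end actually refers to.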


I agree with this. For common organisms, the GenomeInfoDb package is an alternative to using a locally stored file.


You should be able to parse this information from the sequence-length field of the sequence_report:

accession   sequence-name   sequence-length sequence-role   replicon-name   replicon-type   assembly-unit
CM000663.2  1   248956422   assembled-molecule  1   Chromosome  Primary Assembly
CM000664.2  2   242193529   assembled-molecule  2   Chromosome  Primary Assembly
CM000665.2  3   198295559   assembled-molecule  3   Chromosome  Primary Assembly
CM000666.2  4   190214555   assembled-molecule  4   Chromosome  Primary Assembly
CM000667.2  5   181538259   assembled-molecule  5   Chromosome  Primary Assembly
CM000668.2  6   170805979   assembled-molecule  6   Chromosome  Primary Assembly
CM000669.2  7   159345973   assembled-molecule  7   Chromosome  Primary Assembly
CM000670.2  8   145138636   assembled-molecule  8   Chromosome  Primary Assembly
CM000671.2  9   138394717   assembled-molecule  9   Chromosome  Primary Assembly
CM000672.2  10  133797422   assembled-molecule  10  Chromosome  Primary Assembly
CM000673.2  11  135086622   assembled-molecule  11  Chromosome  Primary Assembly
CM000674.2  12  133275309   assembled-molecule  12  Chromosome  Primary Assembly
CM000675.2  13  114364328   assembled-molecule  13  Chromosome  Primary Assembly
CM000676.2  14  107043718   assembled-molecule  14  Chromosome  Primary Assembly
CM000677.2  15  101991189   assembled-molecule  15  Chromosome  Primary Assembly
CM000678.2  16  90338345    assembled-molecule  16  Chromosome  Primary Assembly
CM000679.2  17  83257441    assembled-molecule  17  Chromosome  Primary Assembly
CM000680.2  18  80373285    assembled-molecule  18  Chromosome  Primary Assembly
CM000681.2  19  58617616    assembled-molecule  19  Chromosome  Primary Assembly
CM000682.2  20  64444167    assembled-molecule  20  Chromosome  Primary Assembly
CM000683.2  21  46709983    assembled-molecule  21  Chromosome  Primary Assembly
CM000684.2  22  50818468    assembled-molecule  22  Chromosome  Primary Assembly
CM000685.2  X   156040895   assembled-molecule  X   Chromosome  Primary Assembly
CM000686.2  Y   57227415    assembled-molecule  Y   Chromosome  Primary Assembly
...
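A minimal base R sketch of that parsing step, assuming the report is tab-separated and uses the column names shown above. A few rows are re-typed inline here so the example is self-contained; in a pipeline the same `read.delim()` call would point at the saved file instead (the file name is up to you).

```r
# A few rows of the report above, re-typed as tab-separated text
report_txt <- paste(
  "accession\tsequence-name\tsequence-length\tsequence-role\treplicon-name\treplicon-type\tassembly-unit",
  "CM000663.2\t1\t248956422\tassembled-molecule\t1\tChromosome\tPrimary Assembly",
  "CM000664.2\t2\t242193529\tassembled-molecule\t2\tChromosome\tPrimary Assembly",
  "CM000685.2\tX\t156040895\tassembled-molecule\tX\tChromosome\tPrimary Assembly",
  sep = "\n"
)

# Read everything as character so that names such as "1" and "X" are
# treated alike; check.names = FALSE keeps the hyphenated column names
report <- read.delim(text = report_txt, check.names = FALSE,
                     colClasses = "character")

# Keep only the fully assembled chromosomes
chroms <- report[report[["sequence-role"]] == "assembled-molecule", ]

# Named numeric vector: chromosome name -> length in bp
chrom_lengths <- setNames(as.numeric(chroms[["sequence-length"]]),
                          chroms[["sequence-name"]])
chrom_lengths[["1"]]  # 248956422
```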
