How to get a nucleotide at a specified position using a GenBank chromosome accession ID with biomaRt?
1
0
Entering edit mode
6.4 years ago
foxdie ▴ 20

What I have: I have analyzed a set of WES data with VarScan 2. I now have a somatic variant call .csv with "chrom", "position", "ref", "var", etc. as columns headers. E.g.:

chrom       position  ref   var   normal_reads1
CM000994.2  3681266     T     G             229
CM000994.2  6558171     A     C              11
.
.
.

What I need: I am trying to (1) get the reference nucleotide at a position as well as the flanking nucleotides (i.e. n–1 and n+1). I then need to concatenate them to make a trinucleotide string for each variant (e.g. "ATG", where ref="T", n–1="A", n+1="G").

My problem: The VarScan 2 output has, as far as I can tell, GenBank accession ID's as values under the "chrom" column (e.g. CM000994.2) I am working with mouse build mm10 and have set that as my biomaRt dataset. I am trying to use biomaRt to get the reference nucleotide, the n–1 nucleotide, and the n+1 nucleotide. However, I can't figure out how to use GenBank accession ID as a query input for biomaRt. Here is an example of what I am getting:

> getSequence(chromosome = CM000994.2, start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)

Error in getBM(c(seqType, type), filters = c("chromosome_name", "start",  : 
  object 'CM000994.2' not found

How do I either 1) access nucleotide positions using a GenBank chromosome accesion ID and a position, or 2) convert GenBank chromosome accession ID to a biomaRt-usable chromosome label? ANY help would be appreciated.

EDIT: Should I just use BIostrings+BSgenome+GenomicRanges for this?

biomaRt GenBank accession ID Bioconductor R • 2.2k views
ADD COMMENT
1
Entering edit mode
6.4 years ago
pbpanigrahi ▴ 430

The error says: object CM000994.2 not found. Try giving in double quote. It may solve. Let us know if it don't.

getSequence(chromosome = "CM000994.2", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)

"CM000994.2"

ADD COMMENT
0
Entering edit mode

I think OP has to remove .2 extension. However, OP didn't post what following objects are chromosome_name, mouse_dataset are. Some thing like this: data[,1]=gsub("\\.[0-9]$","",data[,1])

ADD REPLY
0
Entering edit mode

@cpad, based on the R error, "object 'CM000994.2' not found", it seems obvious to me that he has missed double or single quote since CM000994.2 is character string. If we try XYZ in R console..we will get error that object 'XYZ' not found.

I agree with you that following objects are chromosome_name, mouse_dataset are missing, based on which we can precisely know the error.

ADD REPLY
0
Entering edit mode

@pb: you are correct that OP has to quote chormosome name.

ADD REPLY
0
Entering edit mode

@pbpanigrahi, what do you mean:

following objects are chromosome_name, mouse_dataset are missing

? For reference, I have mouse_dataset = useDataset("mmusculus_gene_ensembl", mart = ensembl) if that is what you wanted to know. Let me know if that helps.

ADD REPLY
0
Entering edit mode

(1of3) Hi, thanks for the reply.

When I enter:

> getSequence(chromosome = "CM000994.2", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = "cdna", type = chromosome_name)

I get a similar object not found error again for type = chromosome_name:

Error in attributes %in% listAttributes(mart, what = "name") : 
object 'chromosome_name' not found

Even then, when I put "chromosome_name" in quotes, the query goes through, but it doesn't return any data:

> getSequence(chromosome = "CM000994", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = "cdna", type = "chromosome_name")
[1] cdna            chromosome_name
<0 rows> (or 0-length row.names)

I'm not sure what I'm doing wrong. It doesn't seem to actually be accessing the genome coordinates. Is it possible that it's simply not recognizing the GenBank ID, or would it give me a definitive error?

From Google/NCBI, I know that GenBank ID CM000994.2 is mouse chromosome 1, so I decided to do some sleuthing and went into the mouse_dataset@filters to see what chromosome labels ("options") are actually accepted for chromosome:

> colnames(mouse_dataset@filters)
[1] "name"            "description"     "options"        
[4] "fullDescription" "filters"         "type"           
[7] "operation"       "filters8"        "filters9" 

> mouse_dataset@filters$name[1:5]
[1] "chromosome_name" "start"           "end"            
[4] "band_start"      "band_end"
ADD REPLY
0
Entering edit mode
 > mouse_dataset@filters[1,]
         name              description
1 chromosome_name Chromosome/scaffold name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        options
1 [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,CHR_CAST_EI_MMCHR11_CTG4,CHR_CAST_EI_MMCHR11_CTG5,CHR_MG51_PATCH,CHR_MG65_PATCH,CHR_MG74_PATCH,CHR_MG89_PATCH,CHR_MG104_PATCH,CHR_MG117_PATCH,CHR_MG132_PATCH,CHR_MG153_PATCH,CHR_MG171_PATCH,CHR_MG184_PATCH,CHR_MG190_MG3751_PATCH,CHR_MG191_PATCH,CHR_MG209_PATCH,CHR_MG3172_PATCH,CHR_MG3231_PATCH,CHR_MG3251_PATCH,CHR_MG3490_PATCH,CHR_MG3496_PATCH,CHR_MG3530_PATCH,CHR_MG3561_PATCH,CHR_MG3562_PATCH,CHR_MG3609_PATCH,CHR_MG3618_PATCH,CHR_MG3627_PATCH,CHR_MG3648_PATCH,CHR_MG3656_PATCH,CHR_MG3683_PATCH,CHR_MG3686_PATCH,CHR_MG3700_PATCH,CHR_MG3712_PATCH,CHR_MG3714_PATCH,CHR_MG3829_PATCH,CHR_MG3833_MG4220_PATCH,CHR_MG3835_PATCH,CHR_MG3836_PATCH,CHR_MG3999_PATCH,CHR_MG4136_PATCH,CHR_MG4138_PATCH,CHR_MG4151_PATCH,CHR_MG4162_PATCH,CHR_MG4180_PATCH,CHR_MG4198_PATCH,CHR_MG4200_PATCH,CHR_MG4209_PATCH,CHR_MG4211_PATCH,CHR_MG4212_PATCH,CHR_MG4213_PATCH,CHR_MG4214_PATCH,CHR_MG4222_MG3908_PATCH,CHR_MG4243_PATCH,CHR_MG4248_PATCH,CHR_MG4249_PATCH,CHR_MG4254_PATCH,CHR_MG4255_PATCH,CHR_MG4259_PATCH,CHR_MG4261_PATCH,CHR_MG4264_PATCH,CHR_MG4265_PATCH,CHR_MG4266_PATCH,CHR_MG4281_PATCH,CHR_MG4288_PATCH,CHR_MG4308_PATCH,CHR_MG4310_MG4311_PATCH,CHR_MMCHR1_CHORI29_IDD5_1,CHR_PWK_PHJ_MMCHR11_CTG1,CHR_PWK_PHJ_MMCHR11_CTG2,CHR_PWK_PHJ_MMCHR11_CTG3,CHR_WSB_EIJ_MMCHR11_CTG1,CHR_WSB_EIJ_MMCHR11_CTG2,CHR_WSB_EIJ_MMCHR11_CTG3,GL456210.1,GL456211.1,GL456212.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456372.1,GL456381.1,GL456385.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584303.1,JH584304.1,MT,X,Y]
  fullDescription filters type operation
1                 filters text         =
                        filters8  filters9
1 mmusculus_gene_ensembl__gene__main name_1059
ADD REPLY
0
Entering edit mode

(3of3)So, from this I tried to run the following:

> getSequence(chromosome = "1", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = 'chromosome_name')
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              cdna
1 CCTCCTTCAGTCTTCCCCTCACATGGCCCATCTCCCCACACTTTACTTCACCTCTGAGAAGAGGGAGGCCCCTCACAGGTAGCCAACACACTCTATCACTTCAGGTCATTGCAGGACTAGGTACACTCTCCCTCTGAGGCTAGGCAAAACAGCTCAGTTTCAAGAACAGAATTAAAGTGAGAATTCTTAAATAAAAGACATAAGAATACTTTAAACCCTGCTCCAAACAGACACGTTAAGCAATCATGAACACTATTTTTGTTTTGTTTTGTTCTGTTTTACAATGCTTAGAAAATAAAGGAATGTATATGTTGTAGTTCTTATTGTGGTTGTCTAATTAAGACAACCCAATATCAACTAAAATAAATGAATAAATAAAAGAGCCGTAACACTAATATCCTCAGAGAATCTCCCAAACATGCATAACTCAGTCATTTTCATTGACTGGCTGGTGAACAACACACAAAAAAGGAGAATATCTGAGCCTTGCTCTCTTCCAGAGGCACATAGCCCAAGGAAGAAATACAGACTGACTGGGGTAGAAGTATTAAATATCCACACATGGCCTTACAACTGCTCAGTGCTGACAATATTTTTTTTAATTATTTTAAAAACTAACACAATGTTTCTTTGTTCACTGAATGATTTGTATATATATCCCCTTTATTCCTCTTCTATAGATTTTAGTATCTTCTTTCTTATCTATTTTATTCTCCTAATAACAAACTTCTATCTAGCATCTATCTATATATCTAACTAACTTTCATTTTATTTTGATTCTCAGTATGTTTCTCTTGCCTTCATTGGGGGCTCTGCTCCTTCTGGGTCTATCCCAAGAGAAAGAGTGGGGGAGGTTTCTAGAGTTTAAAACAAAGAGGTAGCACATTTTTTACTTTAAATATTCTAACTTTCTTTTTATTCATTTTCTTCTCTCCTATTTTTCTATCTATCTATTATGTTTGATATGTTTTGGTTATTTCTGCTATGATGAATCCTATTATTTCTATGATGGTTCAAGATGTGATTATTAGCACTTAGTTTTTATCTTATTCTTGTATGTGTTTTACTGCTACTTGTTGCTGATATTTTTATTTCTCCTTAATACATGGAGCTTTAATGGCCATGCCTCCTGTGTAGACGTTCTGCCACTAAGTGAAAGCCTCAGTCCAAAGAATAGGGGAGATAAAAACATCTCTATGTCAATGACTACTTTCCAAATAGATAAAGGAAGAGAATAGGAGTTCTGAAGACCCATTCGTGGACTTTCATGACTACAGCAAGTATAAAAAGAAGCAGCAAAAAGTACAAGGGCACACATTTTTGAGAAACTTCAGACAAAGCCTTTCAAAGCATGCATTACCAATAAAAACAAGAGGAATAATTACAGAGGGTACATCCTACACATTGCTCCATCAACATGATATCTTCCTGACTGTGCCCTTCCCCCAGCTTGTATAGGCTAAACAAACCCTGGTGGTTTTACAGCCTGGCTTCACCACATTTGCATTTTCTCGACCAGAGCTTTGAGAGGCTGACTACCTGCTGAGCTTGCTTTGCAACACAATCCAGCTGAAACAAAGAATAGATCTGTGTTTTTTCCTAATCAGAGCTGAAAATTAAACTTTGATCCTTGATC
chromosome_name
1               1

So, it returned something, but it returned the whole section of cDNA, which is not what I want. I simply want the nucleotide, n–1, and n+1 in the reference genome at the given positions! I'm pretty lost.

ADD REPLY

Login before adding your answer.

Traffic: 1795 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6