What I have: I have analyzed a set of WES data with VarScan 2. I now have a somatic variant call .csv with "chrom", "position", "ref", "var", etc. as columns headers. E.g.:
chrom position ref var normal_reads1
CM000994.2 3681266 T G 229
CM000994.2 6558171 A C 11
.
.
.
What I need: I am trying to (1) get the reference nucleotide at a position as well as the flanking nucleotides (i.e. n–1 and n+1). I then need to concatenate them to make a trinucleotide string for each variant (e.g. "ATG", where ref="T", n–1="A", n+1="G").
My problem: The VarScan 2 output has, as far as I can tell, GenBank accession ID's as values under the "chrom" column (e.g. CM000994.2) I am working with mouse build mm10 and have set that as my biomaRt dataset. I am trying to use biomaRt to get the reference nucleotide, the n–1 nucleotide, and the n+1 nucleotide. However, I can't figure out how to use GenBank accession ID as a query input for biomaRt. Here is an example of what I am getting:
> getSequence(chromosome = CM000994.2, start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)
Error in getBM(c(seqType, type), filters = c("chromosome_name", "start", :
object 'CM000994.2' not found
How do I either 1) access nucleotide positions using a GenBank chromosome accesion ID and a position, or 2) convert GenBank chromosome accession ID to a biomaRt-usable chromosome label? ANY help would be appreciated.
EDIT: Should I just use BIostrings+BSgenome+GenomicRanges for this?
I think OP has to remove .2 extension. However, OP didn't post what following objects are chromosome_name, mouse_dataset are. Some thing like this:
data[,1]=gsub("\\.[0-9]$","",data[,1])
@cpad, based on the R error, "object 'CM000994.2' not found", it seems obvious to me that he has missed double or single quote since CM000994.2 is character string. If we try XYZ in R console..we will get error that object 'XYZ' not found.
I agree with you that following objects are chromosome_name, mouse_dataset are missing, based on which we can precisely know the error.
@pb: you are correct that OP has to quote chormosome name.
@pbpanigrahi, what do you mean:
? For reference, I have
mouse_dataset = useDataset("mmusculus_gene_ensembl", mart = ensembl)
if that is what you wanted to know. Let me know if that helps.(1of3) Hi, thanks for the reply.
When I enter:
I get a similar
object not found
error again fortype = chromosome_name
:Even then, when I put "chromosome_name" in quotes, the query goes through, but it doesn't return any data:
I'm not sure what I'm doing wrong. It doesn't seem to actually be accessing the genome coordinates. Is it possible that it's simply not recognizing the GenBank ID, or would it give me a definitive error?
From Google/NCBI, I know that GenBank ID CM000994.2 is mouse chromosome 1, so I decided to do some sleuthing and went into the
mouse_dataset@filters
to see what chromosome labels ("options") are actually accepted for chromosome:(3of3)So, from this I tried to run the following:
So, it returned something, but it returned the whole section of cDNA, which is not what I want. I simply want the nucleotide, n–1, and n+1 in the reference genome at the given positions! I'm pretty lost.