Hello to everyone. After reading through the biomaRt Reference and Users guides, I have a question I'm hoping someone here can help with.

I am using Bioconductor's biomaRt package to query and store 500 nt sequences from the Ensembl gene database (mouse dataset) based on lists of unique MGI IDs. These are ultimately being used on a local instance of the MEME suite for motif analyses. Currently, I am working with lists containing from ~20 to ~450 MGI IDs.

In some, but not all, lists, some IDs are being duplicated in the output. For example, in a list of 425 MGI IDs, one ID is duplicated (MGI:1920713). In another list, two IDs are duplicated (MGI:102935 and MGI:1923008). From visually checking (i.e., looking at my "MGI_IDs_1.txt" file and my "geneIDs" and "seqs" variables), I can tell that the duplication occurs during the getSequence() step and not during any earlier step, including the readLines() step. Looking up those MGI IDs on MGI's site, I also can't find any reason why they would be singled out.

Does anyone have any ideas why? Thanks very much in advance.
Here's my R code:
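library("biomaRt")
# Connect to the Ensembl mart, mouse gene dataset
ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# Read the MGI IDs, one per line
geneIDs <- readLines("/home/ed/R/Projects/MGI_IDs_1.txt")
# Fetch 500 nt of upstream flanking sequence for each MGI ID
seqs <- getSequence(id = geneIDs, type = "mgi_id", seqType = "gene_flank", upstream = 500, mart = ensembl)
# Write the sequences out in FASTA format for MEME
exportFASTA(seqs, "/home/ed/R/Projects/list_1.fasta")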
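For anyone wanting to check this without eyeballing the variables, here is a minimal sketch of how the duplicated IDs could be listed programmatically. It does not query Ensembl: "seqs_demo" and "ids_demo" are hypothetical stand-ins with made-up IDs and sequences, and the column names "gene_flank" and "mgi_id" are an assumption about what getSequence() returns for the call above.

```r
# Hypothetical stand-in for the data frame returned by getSequence();
# the IDs and sequences are placeholders, not real data.
seqs_demo <- data.frame(
  gene_flank = c("ACGT", "TTGA", "ACGA", "GGCC"),
  mgi_id     = c("MGI:0000001", "MGI:0000002", "MGI:0000001", "MGI:0000003"),
  stringsAsFactors = FALSE
)
# Hypothetical stand-in for the vector produced by readLines()
ids_demo <- c("MGI:0000001", "MGI:0000002", "MGI:0000003")

# IDs that are already repeated in the input list (none in this example)
dup_in_input <- unique(ids_demo[duplicated(ids_demo)])

# IDs that appear in more than one returned row
dup_in_result <- unique(seqs_demo$mgi_id[duplicated(seqs_demo$mgi_id)])

# All rows for those IDs, so the returned sequences can be compared
dup_rows <- seqs_demo[seqs_demo$mgi_id %in% dup_in_result, ]
print(dup_rows)
```

With the real objects, the same two duplicated() checks would be run on "geneIDs" and on the ID column of "seqs". If the sequences in the duplicated rows differ, that would suggest the query returned more than one record for a single ID rather than repeating one record.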
Hi Ram. I'm not sure I follow exactly what you mean. If you are asking about the FASTA file which is created at the end, there is no white space after the header. Here's an example of the first few lines:
Can you give us the first few (~20) lines of the MGI_IDs file, please?
OK, this doesn't seem to be the problem I thought it was. Let's wait for someone with better or more specific inputs.