I have my own microarray data to annotate, but the official annotation file is too complex to use directly. The gene_id column looks like this:
> head(ann.df$gene_id,1)
[1] NR_046018 // RefSeq // Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002844 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000362751 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000450305 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000456328 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 
100 // 100 // 0 // --- // 0
135757 Levels: --- // --- // Antigenomic background control // --- // --- // --- // --- // --- // --- ...
I want to get Ensembl or Entrez IDs for the probesets and further divide them into biotypes using the biomaRt package, for DEG analysis with limma. So far I have extracted the first ID from each entry:
ann.df$gene_id <- sub('\\s*//.*', '', ann.df$gene_id)  # keep the first ID, without the trailing space
ann.clean <- ann.df[!grepl("Unassigned", ann.df$locus.type), ]  # also safe when nothing matches
head(ann.clean)
probeset_id chromosome_name start end locus.type gene_id
1 TC0100006432.hg.1 chr1 11869 14412 Multiple_Complex NR_046018
2 TC0100006433.hg.1 chr1 28046 29178 Coding spopoybu.aAug10-unspliced
3 TC0100006434.hg.1 chr1 29554 31109 Multiple_Complex NR_036267
4 TC0100006435.hg.1 chr1 52473 53312 Pseudogene OTTHUMT00000471235
5 TC0100006436.hg.1 chr1 62948 63887 Multiple_Complex ENST00000492842
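Since each original gene_id entry lists several records separated by " /// ", each with fields separated by " // ", one option is to pick the accession from a single preferred source instead of keeping whatever comes first. The following is only a sketch: extract_id() is a hypothetical helper, the example string is an abbreviated version of the entry shown above, and it assumes the second field of every record names the source (for example "ENSEMBL" or "RefSeq"). It has to be applied to the untruncated gene_id column.

```r
# Hypothetical helper: return the first accession whose source field
# (second field of a record) equals `source`, or NA if there is none.
extract_id <- function(x, source = "ENSEMBL") {
  records <- strsplit(x, " /// ", fixed = TRUE)[[1]]   # one element per record
  fields <- strsplit(records, " // ", fixed = TRUE)    # fields within each record
  for (f in fields) {
    if (length(f) >= 2 && trimws(f[2]) == source) {
      return(trimws(f[1]))
    }
  }
  NA_character_
}

# Abbreviated example entry (descriptions shortened for readability).
example <- paste(
  "NR_046018 // RefSeq // DDX11L1, non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0",
  "OTTHUMT00000002844 // Havana transcript // DDX11L1 // chr1 // 100 // 100 // 0 // --- // 0",
  "ENST00000450305 // ENSEMBL // DDX11L1 // chr1 // 100 // 100 // 0 // --- // 0",
  sep = " /// "
)

ensembl_id <- extract_id(example)            # Ensembl transcript ID, if present
refseq_id <- extract_id(example, "RefSeq")   # RefSeq accession, if present
```

On the full table this could be applied with something like vapply(as.character(ann.df$gene_id), extract_id, character(1), USE.NAMES = FALSE), which would give a column with a single ID type and NA for probesets that have no Ensembl record.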
The problem is that the gene_id values now come in several different ID formats, and many of them are lncRNA, miRNA or circRNA. I need to separate them by biotype first, so I tried biomaRt:
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
tem <- getBM(attributes = c("hgnc_symbol",
                            "ensembl_gene_id",
                            "gene_biotype",
                            "external_gene_name"),
             filters = ' ',  # ?
             values = ann.clean$gene_id,
             mart = ensembl,
             uniqueRows = TRUE)
I am not sure what I should use as the filter. Can anybody help? Many thanks.
The filter is what limits the database query; here it should be the type of your gene IDs. BioMart cannot mix IDs from different sources in a single filter. For this type of query, I would write a short script using the Ensembl Perl API. I suggest using only one source as the reference, to avoid inconsistencies due to differences in how the databases define a gene.
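If staying in R is preferable, the one-filter-per-ID-type restriction can be worked around by splitting the IDs by prefix and running one query per group. This is a rough sketch, not a complete solution: classify_id() and query_by_filter() are hypothetical helpers, the prefix rules cover only the obvious Ensembl and RefSeq patterns, and the filter names used (ensembl_transcript_id, ensembl_gene_id, refseq_mrna, refseq_ncrna) should be checked against listFilters() for the mart in use. IDs that match none of the patterns are left in an "other" group.

```r
# Hypothetical helper: map each ID to the biomaRt filter its prefix suggests.
classify_id <- function(ids) {
  out <- rep("other", length(ids))
  out[grepl("^ENST[0-9]+", ids)] <- "ensembl_transcript_id"
  out[grepl("^ENSG[0-9]+", ids)] <- "ensembl_gene_id"
  out[grepl("^NM_[0-9]+", ids)] <- "refseq_mrna"
  out[grepl("^NR_[0-9]+", ids)] <- "refseq_ncrna"
  out
}

# The five IDs from head(ann.clean) in the question.
ids <- c("NR_046018", "spopoybu.aAug10-unspliced", "NR_036267",
         "OTTHUMT00000471235", "ENST00000492842")
by_filter <- split(ids, classify_id(ids))

# One getBM() call per ID type. This needs the biomaRt package and network
# access, so it is defined here but not run.
query_by_filter <- function(by_filter, mart) {
  by_filter <- by_filter[names(by_filter) != "other"]
  results <- lapply(names(by_filter), function(f) {
    biomaRt::getBM(
      attributes = unique(c(f, "ensembl_gene_id", "gene_biotype",
                            "external_gene_name")),
      filters = f,
      values = by_filter[[f]],
      mart = mart
    )
  })
  names(results) <- names(by_filter)
  results
}
```

With a mart created as in the question, query_by_filter(by_filter, ensembl) would return one data frame per ID type; these can then be merged back onto the probeset table by the ID column. The "other" group still needs separate handling.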
Thanks for your reply. I don't know Perl, so I think I could use filters = c('chromosome_name', 'start', 'end') for the annotation instead. That raises another problem: I would have to use dlply() to convert my data into a list of the form:
Now I am stuck on how to do that conversion.
I know some of the IDs are RefSeq and some are Ensembl, but the rest use other schemes (pseudogenes, splice variants, miRNA), and I have no idea how to handle the genes annotated in those other forms.
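A possible alternative to building a list with dlply() is to turn each probeset's coordinates into a single "chromosome:start:end" string and pass those to one region-based filter. The sketch below makes several assumptions: make_regions() and query_regions() are hypothetical helpers, the filter is assumed to be called chromosomal_region (worth confirming with listFilters()), Ensembl chromosome names are assumed to lack the "chr" prefix, and the array coordinates are assumed to be on the same genome assembly as the mart being queried.

```r
# Two rows copied from head(ann.clean) in the question.
ann.example <- data.frame(
  probeset_id = c("TC0100006432.hg.1", "TC0100006433.hg.1"),
  chromosome_name = c("chr1", "chr1"),
  start = c(11869, 28046),
  end = c(14412, 29178),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build "chrom:start:end" strings, dropping the "chr" prefix.
make_regions <- function(df) {
  chrom <- sub("^chr", "", df$chromosome_name)
  # as.integer() avoids scientific notation for large coordinates
  sprintf("%s:%d:%d", chrom, as.integer(df$start), as.integer(df$end))
}

regions <- make_regions(ann.example)

# Region query. This needs the biomaRt package and network access, so it is
# defined here but not run.
query_regions <- function(regions, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name",
                   "chromosome_name", "start_position", "end_position"),
    filters = "chromosomal_region",
    values = regions,
    mart = mart
  )
}
```

The result does not say which region each gene came from, so the returned start_position and end_position columns would have to be matched back to the probeset coordinates afterwards, for example with an overlap join.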