How to get the ensembl ID and divide my genes into different biotypes?
5.2 years ago
jiamacro ▴ 20

I have my own microarray data to annotate, but the official annotation file is too complex to use as is. One entry looks like this:

> head(ann.df$gene_id,1)
[1] NR_046018 // RefSeq // Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002844 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000362751 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000450305 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000456328 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 
100 // 100 // 0 // --- // 0
135757 Levels: --- // --- // Antigenomic background control // --- // --- // --- // --- // --- // --- ...

I want to get their Ensembl IDs or Entrez IDs and then divide them into different biotypes using the biomaRt package, before running a DEG analysis with limma. So far I have extracted the first identifier from each entry:

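# Keep only the first identifier of each entry (the text before the first "//"), without trailing spaces
ann.df$gene_id <- trimws(sub("//.*", "", ann.df$gene_id))
# Drop unassigned probesets; grepl() stays correct even when nothing matches
ann.clean <- ann.df[!grepl("Unassigned", ann.df$locus.type), ]
head(ann.clean)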
        probeset_id chromosome_name start   end       locus.type                    gene_id
1 TC0100006432.hg.1            chr1 11869 14412 Multiple_Complex                 NR_046018 
2 TC0100006433.hg.1            chr1 28046 29178           Coding spopoybu.aAug10-unspliced 
3 TC0100006434.hg.1            chr1 29554 31109 Multiple_Complex                 NR_036267 
4 TC0100006435.hg.1            chr1 52473 53312       Pseudogene        OTTHUMT00000471235 
5 TC0100006436.hg.1            chr1 62948 63887 Multiple_Complex           ENST00000492842
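As an alternative to keeping only the first identifier, the Ensembl entries could be pulled out of the full annotation string directly. The following is a minimal sketch in base R; `extract_ensembl` is a hypothetical helper name, and it assumes entries are separated by `///` and fields by `//`, as in the output shown earlier. The example string is a shortened stand-in, not real array output.

```r
# Hypothetical helper: return the Ensembl transcript IDs (ENST...) and the
# gene_biotype tag found in one raw annotation string.
# Assumption: entries are separated by "///" and fields by "//".
extract_ensembl <- function(annotation) {
  entries <- strsplit(as.character(annotation), "\\s*///\\s*")[[1]]
  ids     <- trimws(sub("\\s*//.*", "", entries))   # first field of each entry
  keep    <- grepl("^ENST[0-9]+$", ids)             # Ensembl transcripts only
  hits    <- regmatches(entries[keep],
                        regexec("gene_biotype:([A-Za-z0-9_]+)", entries[keep]))
  biotype <- vapply(hits,
                    function(x) if (length(x) >= 2) x[2] else NA_character_,
                    character(1))
  unique(data.frame(transcript_id = ids[keep],
                    gene_biotype  = biotype,
                    stringsAsFactors = FALSE))
}

# Shortened example modelled on the entry above (illustrative only)
example <- paste(
  "NR_046018 // RefSeq // DDX11L1, non-coding RNA. // chr1 // 100",
  "ENST00000450305 // ENSEMBL // DDX11L1 [gene_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100",
  "ENST00000456328 // ENSEMBL // DDX11L1 [gene_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100",
  sep = " /// "
)
ensembl_hits <- extract_ensembl(example)
print(ensembl_hits)
```

Entries without an ENST identifier (for example the background controls) simply yield an empty data frame, so the helper can be applied to every row before deciding how to handle the remaining sources.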

The problem is that the gene_id values now come in different formats from different sources, and many of them are lncRNA, miRNA and circRNA. I need to separate them first, so I tried biomaRt:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
tem <- getBM(
  attributes = c(
    "hgnc_symbol",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  filters = "",                 # ? which filter should go here
  values = ann.clean$gene_id,
  mart = ensembl,               # pass the mart by name rather than positionally
  uniqueRows = TRUE)

I am not sure what I should use as the filter. Can anybody help? Many thanks.

R microarray annotation clariomdhuman • 1.5k views

A filter is what limits the database query; here it should correspond to your gene IDs. BioMart cannot mix IDs from different sources in a single filter. For this type of query, I would write a short script using the Ensembl Perl API. I suggest using only one source as the reference, to avoid inconsistencies caused by differences in how each database defines a gene.
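Staying within biomaRt, one way to respect the one-source-per-filter restriction is to group the identifiers by source and run one query per group. This is only a sketch: `classify_id` and `query_by_source` are hypothetical helper names, the prefix rules cover only the Ensembl and RefSeq patterns seen in this thread, and identifiers from other sources (Havana, AceView, NONCODE) are deliberately left unmatched.

```r
# Hypothetical helper: map each identifier to the biomaRt filter that matches
# its source. Anything unrecognised is returned as NA.
classify_id <- function(ids) {
  ids <- trimws(as.character(ids))
  out <- rep(NA_character_, length(ids))
  out[grepl("^ENST[0-9]+", ids)] <- "ensembl_transcript_id"
  out[grepl("^ENSG[0-9]+", ids)] <- "ensembl_gene_id"
  out[grepl("^NM_[0-9]+", ids)]  <- "refseq_mrna"
  out[grepl("^NR_[0-9]+", ids)]  <- "refseq_ncrna"
  out
}

# Identifiers taken from the head(ann.clean) output in the question
ids <- c("NR_046018", "spopoybu.aAug10-unspliced", "NR_036267",
         "OTTHUMT00000471235", "ENST00000492842")
by_filter <- split(ids, classify_id(ids))   # split() drops the NA (unmatched) IDs
print(by_filter)

# Hypothetical wrapper: one getBM() call per source. Not run here because it
# needs the biomaRt package and a network connection.
query_by_source <- function(by_filter, mart) {
  stopifnot(requireNamespace("biomaRt", quietly = TRUE))
  res <- lapply(names(by_filter), function(f) {
    biomaRt::getBM(
      attributes = unique(c(f, "ensembl_gene_id", "gene_biotype",
                            "external_gene_name")),
      filters = f,
      values  = by_filter[[f]],
      mart    = mart)
  })
  names(res) <- names(by_filter)
  res
}
```

With a mart created as in the question, `query_by_source(by_filter, ensembl)` would return one data frame per source. The results still need to be merged with care, since the same gene can be reached through more than one identifier.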


Thanks for your reply. I don't know Perl, so I think maybe I can use filters = c('chromosome_name', 'start', 'end') for my annotation. There is another problem: I have to use dlply() to convert my data into a list of the form:

  (1, 11869, 14412)

Now I am stuck on how to do that conversion.
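One possible way around the list conversion is the `chromosomal_region` filter, which takes one `chromosome:start:end` string per locus, so plain vectorised string formatting is enough. A minimal sketch follows; `make_regions` and `annotate_regions` are hypothetical helper names, and it assumes the array coordinates are on the same genome assembly as the Ensembl release being queried.

```r
# Two rows copied from the head(ann.clean) output in the question
ann <- data.frame(
  probeset_id     = c("TC0100006432.hg.1", "TC0100006433.hg.1"),
  chromosome_name = c("chr1", "chr1"),
  start           = c(11869, 28046),
  end             = c(14412, 29178),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build "chromosome:start:end" strings.
# Ensembl names chromosomes without the "chr" prefix and uses "MT" for chrM.
make_regions <- function(chr, start, end) {
  chr <- sub("^chr", "", as.character(chr))
  chr <- sub("^M$", "MT", chr)
  sprintf("%s:%d:%d", chr, as.integer(start), as.integer(end))
}

ann$region <- make_regions(ann$chromosome_name, ann$start, ann$end)
print(ann$region)

# Hypothetical wrapper around getBM(); not run here because it needs the
# biomaRt package and a network connection.
annotate_regions <- function(regions, mart) {
  stopifnot(requireNamespace("biomaRt", quietly = TRUE))
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name",
                   "chromosome_name", "start_position", "end_position"),
    filters = "chromosomal_region",
    values  = regions,
    mart    = mart)
}
```

The query result lists the genes found in the requested regions but does not say which region produced which gene, so mapping genes back to probesets would need a separate overlap step on the returned coordinates.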


I know some of them are RefSeq IDs and some are Ensembl IDs, while the rest use other identifier schemes and include pseudogenes, splice variants and miRNAs. I have no idea how to handle the genes with those other annotation formats.

5.2 years ago
Emily 24k

Depending on the array you used, you can filter by the probeset list in biomaRt.


Thanks for your reply. My data come from a Clariom D Human whole-transcriptome assay, and I did use the clariomdhumantranscriptcluster.db package to annotate it. However, even after deleting the NA values, there are still many NA values in the Ensembl and Entrez ID columns. I have searched a lot, and not many studies seem to use this microarray. I am not sure whether my array uses the same probeset list as the classical arrays, so I think I should download the official CSV file and annotate the data myself.


I think maybe I can use filters = c('chromosome_name', 'start', 'end') for my annotation. There is another problem: I have to use dlply() to convert my data into a list of the form (1, 11869, 14412), and I am stuck on how to do that conversion.


Yes, filtering by locus would work.


No, you're right, we don't have the probesets for that array, so filtering by probeset won't work.


You mean there is no good annotation package for the Clariom D Human assay, right? And the probeset list used in Clariom D Human is different from the others, so we cannot use the probeset list in biomaRt directly, right? My questions are:

1) The clariomdhuman transcriptcluster annotation CSV file is too complex to use directly because it combines more than one source, so I have to filter by locus instead.
2) There are still many probes that have an ENST code but no locus information. The number is not large (maybe 200 to 300 out of about 130,000 probes), but since I am not familiar with how your annotation pipeline works, I am still confused about how to interpret these data properly without over-interpreting them.
3) The assay I used is an Affymetrix product from 2016 that is not widely used. It measures mRNA, lncRNA, miRNA and circRNA at the same time. I previously ran the DEG analysis before annotation, followed by GO and KEGG enrichment analysis, and nothing was significant. I guess that is Simpson's paradox, and that I should annotate the data first and divide them into biotypes before the DEG analysis. Do you think this reasoning is correct?
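If the data are split by biotype before the DEG step, as proposed in point 3, the split itself could look like the sketch below. It uses a small simulated matrix and made-up biotype labels purely for illustration; `expr`, `biotype` and `expr_by_biotype` are hypothetical names, and the sketch takes no position on whether splitting is the right statistical choice.

```r
# Simulated expression matrix: 6 probes x 4 samples (illustrative values only)
set.seed(1)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("probe", 1:6), paste0("sample", 1:4)))

# Made-up biotype per probe, in row order; NA marks a probe without annotation
biotype <- c("protein_coding", "lncRNA", "protein_coding",
             "miRNA", "lncRNA", NA)

# Group probe names by biotype (split() drops the NA probe), then subset
# the matrix once per group, keeping the matrix shape for single-row groups
groups <- split(rownames(expr), biotype)
expr_by_biotype <- lapply(groups, function(p) expr[p, , drop = FALSE])

print(sapply(expr_by_biotype, nrow))
```

Each element of `expr_by_biotype` is an ordinary matrix with the original sample columns, so it could be passed to the same limma workflow that was previously run on the full matrix.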

Sorry for such a long question. My account level is low, so I can only post 5 times in 6 hours. I really appreciate your reply. Thanks.
