Question

What if you have annotation information from different resources?

5.8 years ago

jiamacro ▴ 20

I have my expression matrix and have downloaded the official annotation .csv file. The problem is that the mrna_assignment column contains annotation from several different sources, including RefSeq, Ensembl, lncRNAWiki, AceView and so on. Some probes have only NM_/NR_ accessions, some have only ENST identifiers, and others have neither.

 > head(ann.df$gene_id,1)
[1] NR_046018 // RefSeq // Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002844 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000362751 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000450305 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000456328 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 
100 // 100 // 0 // --- // 0
135757 Levels: --- // --- // Antigenomic background control // --- // --- // --- // --- // --- // --- ...

To analyze my data further, I need to convert all of this annotation to a single identifier type (ENST), so I downloaded .gtf files from both Ensembl and RefSeq. However, many probes still cannot be converted because their identifiers come from neither RefSeq nor Ensembl.

ann.df$gene_id <- sub ('//.*', '', ann.df$gene_id)
head(ann.df)
        probeset_id chromosome_name start   end       locus.type                    gene_id
1 TC0100006432.hg.1            chr1 11869 14412 Multiple_Complex                 NR_046018 
2 TC0100006433.hg.1            chr1 28046 29178           Coding spopoybu.aAug10-unspliced 
3 TC0100006434.hg.1            chr1 29554 31109 Multiple_Complex                 NR_036267 
4 TC0100006435.hg.1            chr1 52473 53312       Pseudogene        OTTHUMT00000471235 
5 TC0100006436.hg.1            chr1 62948 63887 Multiple_Complex           ENST00000492842
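
Because sub() keeps only the first accession of each probe, any ENST or RefSeq entry that appears later in the same string is discarded. A minimal sketch of an alternative, assuming entries are separated by " /// " and fields by " // " as in the output above: parse_assignment and pick_accession are hypothetical helper names, and the preference order (ENSEMBL first, then RefSeq) is only an illustrative choice.

```r
# Split one assignment string into a data frame of (accession, source).
# Entries are separated by "///" and fields within an entry by "//".
parse_assignment <- function(x) {
  entries <- strsplit(as.character(x), "\\s*///\\s*")[[1]]
  fields <- strsplit(entries, "\\s*//\\s*")
  data.frame(
    accession = vapply(fields, function(f) trimws(f[1]), character(1)),
    source = vapply(
      fields,
      function(f) if (length(f) > 1) trimws(f[2]) else NA_character_,
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# Return the first accession from the most preferred source, or NA.
pick_accession <- function(x, preferred = c("ENSEMBL", "RefSeq")) {
  tab <- parse_assignment(x)
  for (src in preferred) {
    hit <- tab$accession[!is.na(tab$source) & tab$source == src]
    if (length(hit) > 0) return(hit[1])
  }
  NA_character_
}

# Shortened version of the first record shown in the question.
x <- paste(
  "NR_046018 // RefSeq // DDX11L1 // chr1 // 100",
  "OTTHUMT00000002844 // Havana transcript // DDX11L1 // chr1 // 100",
  "ENST00000450305 // ENSEMBL // DDX11L1 // chr1 // 100",
  sep = " /// "
)
pick_accession(x)

# On the full table, applied to the unmodified column (before the sub() step):
# ann.df$enst <- vapply(as.character(ann.df$gene_id), pick_accession,
#                       character(1), USE.NAMES = FALSE)
```

Probes for which this returns NA are the ones that would still need a locus-based lookup.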

I then considered using the locus information (chromosome, start, end) to look up the matching ENST identifiers. However, biomaRt expects the filter values as a list, and the locus information for my whole data frame is too large to pass as a list.

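library(biomaRt)

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
tem <- getBM(
  attributes = c(
    "hgnc_symbol",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  filters = c('chromosome_name', 'start', 'end'),
  # values needs one element per filter, in the same order as filters;
  # Ensembl chromosome names have no "chr" prefix
  values = list(sub('^chr', '', ann.df$chromosome_name), ann.df$start, ann.df$end),
  mart = ensembl,
  uniqueRows = TRUE)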

I don't know how to deal with this now. Please help.

R clariomdhuman gtf annotation • 1.4k views

5.8 years ago by jiamacro ▴ 20

The filters and values arguments are related: filters names the field(s) we want to filter on, and values gives the values we want to keep for each of those fields.
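
To illustrate how filters and values pair up for a locus-based lookup, here is a minimal sketch. It assumes the Ensembl gene mart offers a chromosomal_region filter that accepts strings of the form chr:start:end with no "chr" prefix. make_regions and lookup_regions are hypothetical helper names, and the attribute list is just one plausible choice.

```r
# Build "chr:start:end" strings, one per probe.
# as.integer() keeps large coordinates from being printed as 1e+05.
make_regions <- function(chr, start, end) {
  paste(sub("^chr", "", chr), as.integer(start), as.integer(end), sep = ":")
}

# One filter, one vector of values: every region string is a value to keep.
# Needs the biomaRt package and network access, so it is only defined here.
lookup_regions <- function(regions, mart) {
  biomaRt::getBM(
    attributes = c(
      "ensembl_transcript_id",
      "ensembl_gene_id",
      "chromosome_name",
      "transcript_start",
      "transcript_end"
    ),
    filters = "chromosomal_region",
    values = regions,
    mart = mart
  )
}

# First locus from the question's table.
make_regions("chr1", 11869, 14412)

# Possible usage (not run here):
# ensembl <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# hits <- lookup_regions(
#   make_regions(ann.df$chromosome_name, ann.df$start, ann.df$end), ensembl)
```

The result does not say which region each transcript came from, so the returned transcript_start and transcript_end would still have to be compared with the probe coordinates afterwards. The genome assembly behind the mart should also match the one used by the annotation file.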

5.8 years ago by zx8754 12k

I know these arguments and what they mean. I have tried biomaRt, but values has to be a list, whereas my data is a data frame. The problem is that my data has more than 13,000 rows, which is too big to convert into a list.
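
On the size concern: a data frame is already a list of columns in R, so converting it costs very little, and the rows can be split into batches if a single request turns out to be too large. A minimal sketch on toy data follows. ann.toy stands in for ann.df using the first three loci from the question, split_rows is a hypothetical helper, and the batch size of 1000 is an arbitrary illustrative value.

```r
# Toy stand-in for ann.df (first three loci from the question).
ann.toy <- data.frame(
  chromosome_name = c("chr1", "chr1", "chr1"),
  start = c(11869L, 28046L, 29554L),
  end = c(14412L, 29178L, 31109L),
  stringsAsFactors = FALSE
)

# A data frame is a list of equal-length columns.
is.list(ann.toy)

# List with one element per filter, in filter order.
vals <- as.list(ann.toy[c("chromosome_name", "start", "end")])
length(vals)

# Split the rows into batches of at most batch_size rows.
split_rows <- function(df, batch_size = 1000) {
  split(df, ceiling(seq_len(nrow(df)) / batch_size))
}

# Inflate the toy data to 13,000 rows to mimic the size mentioned above.
big <- ann.toy[rep(seq_len(nrow(ann.toy)), length.out = 13000), ]
batches <- split_rows(big)
length(batches)
```

Each element of batches is a smaller data frame that can be queried on its own, with the per-batch results combined afterwards using do.call(rbind, ...).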

5.8 years ago by jiamacro ▴ 20