I have my expression matrix and have downloaded the official annotation .csv file. The problem is that the mrna_assignment column draws its annotation from several different sources, including RefSeq, Ensembl, lncRNAWiki, AceView and so on. Some probes have only NM_/NR_ IDs, some have only ENST IDs, and the others have neither.
> head(ann.df$gene_id,1)
[1] NR_046018 // RefSeq // Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002844 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000362751 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000450305 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000456328 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 
100 // 100 // 0 // --- // 0
135757 Levels: --- // --- // Antigenomic background control // --- // --- // --- // --- // --- // --- ...
To analyze my data further, I need to convert all of this annotation into ENST IDs, so I downloaded the .gtf files from both Ensembl and RefSeq. However, many probes still cannot be converted because their annotation comes from neither RefSeq nor Ensembl.
ann.df$gene_id <- trimws(sub('//.*', '', ann.df$gene_id))  # keep the first ID, drop the trailing space
head(ann.df)
probeset_id chromosome_name start end locus.type gene_id
1 TC0100006432.hg.1 chr1 11869 14412 Multiple_Complex NR_046018
2 TC0100006433.hg.1 chr1 28046 29178 Coding spopoybu.aAug10-unspliced
3 TC0100006434.hg.1 chr1 29554 31109 Multiple_Complex NR_036267
4 TC0100006435.hg.1 chr1 52473 53312 Pseudogene OTTHUMT00000471235
5 TC0100006436.hg.1 chr1 62948 63887 Multiple_Complex ENST00000492842
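Keeping only the first identifier discards any ENST ID that appears later in the same field (as in the first probe above, where ENST00000450305 follows the RefSeq and Havana entries). A minimal sketch of an alternative, assuming entries are separated by " /// " and fields by " // " as in the output shown earlier. The helper name pick_transcript_id is hypothetical, and the preference order (ENST first, then NM_/NR_) is an assumption:

```r
# Hypothetical helper: choose one transcript ID per probe from the raw
# annotation string, preferring Ensembl (ENST) over RefSeq (NM_/NR_).
pick_transcript_id <- function(x) {
  x <- as.character(x)  # the column may be a factor
  vapply(strsplit(x, " /// ", fixed = TRUE), function(entries) {
    # The ID is the first " // "-separated field of each entry.
    ids <- trimws(sub("//.*", "", entries))
    enst <- grep("^ENST", ids, value = TRUE)
    if (length(enst) > 0) return(enst[1])
    refseq <- grep("^(NM|NR)_", ids, value = TRUE)
    if (length(refseq) > 0) return(refseq[1])
    NA_character_  # neither Ensembl nor RefSeq is listed for this probe
  }, character(1))
}

# Small made-up example strings in the same layout as the annotation file.
raw <- c(
  "NR_046018 // RefSeq // x // chr1 /// ENST00000450305 // ENSEMBL // y // chr1",
  "OTTHUMT00000471235 // Havana transcript // z // chr1",
  "--- // --- // Antigenomic background control // ---"
)
print(pick_transcript_id(raw))
```

This would have to be applied to the original, untruncated column (before the sub() call), for example as ann.df$enst_id <- pick_transcript_id(ann.df$gene_id). Probes that return NA are the ones that still need a coordinate-based lookup.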
I then considered using the locus information to look up the ENST IDs. However, biomaRt expects the filter values as a list, and the locus information for my whole data frame is too big to be passed as a list.
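library(biomaRt)
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
tem <- getBM(
  attributes = c("hgnc_symbol",
                 "ensembl_gene_id",
                 "gene_biotype",
                 "external_gene_name"),
  filters = c('chromosome_name', 'start', 'end'),
  # values needs one list element per filter, in the same order as filters;
  # Ensembl names chromosomes "1" rather than "chr1"
  values = list(sub('^chr', '', ann.df$chromosome_name),
                ann.df$start,
                ann.df$end),
  mart = ensembl,
  uniqueRows = TRUE)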
I don't know how to deal with this now. Please help me.
The filter and values arguments are related: filter is the column name we want to filter on, and values is the set of values we want to keep in that column.
I know these arguments and what they mean. I have tried biomaRt, but values has to be a list rather than a data frame. The problem is that my data has more than 13,000 rows, which is too big to convert into a list.
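One possible way around the size concern, offered as a sketch rather than a tested solution: a data frame is already a list of columns, and the Ensembl gene mart has a chromosomal_region filter that takes "chromosome:start:end" strings, so each probe locus can become a single value and the values can be sent in batches. The helper names make_regions, split_batches and query_regions and the batch size of 500 are assumptions, and the probe coordinates are assumed to be on the same genome assembly as the Ensembl dataset being queried.

```r
# Build "chromosome:start:end" strings; Ensembl uses "1", not "chr1".
make_regions <- function(chrom, start, end) {
  paste(sub("^chr", "", chrom),
        format(start, scientific = FALSE, trim = TRUE),
        format(end, scientific = FALSE, trim = TRUE),
        sep = ":")
}

# Split a vector into consecutive batches of at most `size` elements.
split_batches <- function(x, size = 500) {
  split(x, ceiling(seq_along(x) / size))
}

# Query Ensembl one batch at a time and stack the results.
# Needs the biomaRt package and network access; not run here.
query_regions <- function(regions, mart, size = 500) {
  results <- lapply(split_batches(regions, size), function(batch) {
    biomaRt::getBM(
      attributes = c("ensembl_transcript_id", "ensembl_gene_id",
                     "external_gene_name", "chromosome_name",
                     "transcript_start", "transcript_end"),
      filters = "chromosomal_region",
      values = batch,
      mart = mart)
  })
  do.call(rbind, results)
}

# Coordinates taken from the first two probes in the table above.
print(make_regions(c("chr1", "chr1"), c(11869, 28046), c(14412, 29178)))
```

With a mart object such as the ensembl one created above, the call would look like query_regions(make_regions(ann.df$chromosome_name, ann.df$start, ann.df$end), ensembl). The result lists transcripts that fall in the queried regions but does not say which probe each one came from, so it would still need to be joined back to the probes by overlapping coordinates.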