Question

how to annotate RNA-seq

1

Entering edit mode

6.9 years ago

Learner ▴ 280

I have many files like this

ENSG00000000003.13  366
ENSG00000000005.5   26
ENSG00000000419.11  1905
ENSG00000000457.12  775
ENSG00000000460.15  377
ENSG00000000938.11  316
ENSG00000000971.14  272
ENSG00000001036.12  1726
ENSG00000001084.9   2479
ENSG00000001167.13  1166
ENSG00000001460.16  38
ENSG00000001461.15  298

They are all htseq-count.

I have two questions

1- how can I convert the ID to gene names? do you have any solution available (I can use any language) ? 2- what would you use for normalizing them?

RNA-Seq • 5.4k views

ADD COMMENT • link updated 6.9 years ago by Devon Ryan 104k • written 6.9 years ago by Learner ▴ 280

1

Entering edit mode

The normalization will depend of what you want to do next. By exemple, if you want to apply a differential expression analysis, some tools have there own normalization (Deseq2 use RLE, EdgeR use TMM...)

ADD REPLY • link 6.9 years ago by Bastien Hervé 5.9k

0

Entering edit mode

Hi Learner,

Please provide some feedback on your previous threads before opening new threads.
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

@WouterDeCoster please give a proper solution and I will absolutely like and accept it. I don't accept stuff which does not help. the main purpose of this web is to learn and help (at least this is in my case)

ADD REPLY • link 6.9 years ago by Learner ▴ 280

1

Entering edit mode

6.9 years ago

Devon Ryan 104k

You can download the mapping from biomart (you want the "gene stable ID version" and the "gene name"). You can merge them in R or any other language.
You probably want to use something like DESeq2, but without more details it's impossible to know what normalization would be appropriate.

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan thank you very much for your reply. The problem with biomart is that I have about 200000 gene ID and when I use it, it will shrink to 15000 so I lose like 5000 (I dont know why).

I basically want to find upregulated and down regulated genes from transcriptome (based on htseq-count) what do you suggest now?

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

There aren't 200000 genes, perhaps you meant 20000? Either way, presumably you're using an old annotation (e.g., from hg19), in which case you'll have to use the appropriate archive version of biomart.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan Yes 20000 roughly. I use hsapiens_gene_ensembl , can you please tell me how to find the newest version? and also how would you normalize them?

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

I linked to the most recent version, you're likely using an old one. Post some of the IDs that don't have a match and we can have a look.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan can you please tell me how to identify which one is recognised and which one not? I am using the above R code. If you can tell me I will post some of the IDs

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

some_data_frame$ID[which(!(some_data_frame$ID %in% annotation$ID))] or something along those lines.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan for example these ones

ENSG00000122718
ENSG00000130201
ENSG00000150076
ENSG00000150526
ENSG00000155640
ENSG00000166748
ENSG00000168260
ENSG00000168787
ENSG00000170590
ENSG00000170803
ENSG00000171484
ENSG00000172381
ENSG00000172774

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

As I suspected, you're using a very old annotation.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan can you please tell me how to make it new ? how to use the new one?

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

Just download the new version and use it. Note that you'll presumably need to realign your data if you aligned it to hg19 as well. If you don't want to do that then use a GTF file from hg19.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan newer version ? I downloaded the latest version and still the same issue. I cannot download the raw files to reanalyse. I only have access to htseq-count. As you know there is no way to annotate them, I have been asking for this from various people and no one had a clue how to annotate them. It seems like there is no solution for this problem at the moment.

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

If you've downloaded htseq-count files with hg19 IDs then you just need to download the hg19 ID->Gene mapping from the archived version of biomart.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan can you please show me how you convert the above genes? I appreciate your comment but I have been trying to do all what you said with no sucess. So I would appreciate if you can show me how you do it

ADD REPLY • link 6.9 years ago by Learner ▴ 280

0

Entering edit mode

Go here.
Select "Ensembl Genes 91", then "Human genes"
Under "Attributes", deselect "transcript ID" and select "gene name".
Save the results

Note that Ensembl seems to be having technical issues at the moment, so you might have to wait until the morning here in Europe.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

score 3 · Accepted Answer · 2018-01-18

To answer your first question :

Create a TSV file with your input => list.tsv

Remove ".13" etc from your ENSG (list.tsv) => list_ensembl_gid.tsv

awk -F '\t' -v OFS='\t' '{sub(/\.[0-9]*/, "", $1)} 1' list.tsv > list_ensembl_gid.tsv;

Then use R & Biomart to retrieve corresponding genes names

R
data = read.table("list_ensembl_gid.tsv")
colnames(data)[1] <- "ensembl_gene_id"
colnames(data)[2] <- "counts"

library('biomaRt')

hsapiens = useMart("ensembl",
                    dataset="hsapiens_gene_ensembl")

hsapiens_infos <- getBM(attributes=c('ensembl_gene_id',
                                     'external_gene_name'),
                        mart = hsapiens)

merge_infos <- merge(x = data,
                     y = hsapiens_infos,
                     by = "ensembl_gene_id",
                     all.x = TRUE)

head(merge_infos)

ensembl_gene_id counts external_gene_name
ENSG00000000003 366 TSPAN6
ENSG00000000005 26 TNMD
ENSG00000000419 1905 DPM1
ENSG00000000457 775 SCYL3
ENSG00000000460 377 C1orf112
ENSG00000000938 316 FGR