Entering edit mode
2.3 years ago
JACKY
▴
160
Say I have a dataframe of RNA-seq counts, with more than 15,000 genes (ensemble ID). I want to normalize this data from counts to TPM. To do so, I need to get the gene lengths.
I have tried multiple ways, the first is this, which seems to me kind of slumply, and also gave some worrying warnings:
exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
insect = intersect(rownames(counts_df),names(len))
geneLengths = len[insect]
counts_df = counts_df[insect,]
When I perform the normalization, I get a list of warnings:
Warning messages:
1: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
The second way does not seem to work for such a big number of genes:
ensembl_list <- rownames(counts_df)
getGeneLengthAndGCContent(ensembl_list, "hsa")
It crashed:
Connecting to BioMart ...
Downloading sequences ...
This may take a few minutes ...
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 88120121 bytes received
How do I get trust worthy gene lengths of counts RNA-seq data?
You should first ensure that calculating TPMs like this is valid for your analysis. Proper TPMs generally need to be calculated from transcript abundances generated from software such as Salmon. This is because using one gene length value ignores that isoforms of various lengths can be expressed by the gene, and the abundance of these isoforms can change between conditions.
What do you think of this code:
I don't think salmon would fit to my research question.
From my previous experiences in retrieving data in bulk from biomaRt/Ensembl - it is always better to partition your data into batches before retrieving data. So I would rather chop those 15000 genes in batches of 500 and retrieve data per batch. This way you could also put your code to sleep if you overshoot your hourly request quota (in R with
Sys.sleep('time in seconds')
) Also, I prefer using biomaRt'sgetBM()
function to retrieve such kind of data, in your case it would look something like this:Hello manaswwm , only now I noticed that you replied. The gene length is still an issue, I'm very skeptical about the gene lengths I got, so I tried the code you provided for a dataset that I have. Originaly the dataset came with ENTREZ IDs, I switched to ensemble and about 20,000 genes were left. Your code produced a very long list of genes with two columns, transcript_length and cds_length, and some genes accure more than once. The list is very long.. 175293 rows, and I don't have this much genes.
Am I using it wrong?
My dataset does not have duplicated genes..
Very probably what you have now is the following: every row is a specific transcript that belongs to a particular gene, hence you have one gene displayed on multiple rows. You are not able to see this as the table that you got back from biomaRt does not have the transcript id, but it is my fault, as I did not include the transcript id within the code snippet so here is the code that you could try instead:
The output table will still be 175293 rows long, but now you will see the transcript per gene column. Here you will need to decide on a "representative" transcript per gene to get the "representative" gene length - which could mean "transcript_length" if you do not care about the UTRs when calculating the gene length. One of the many ways of doing this is checking for the canonical_transcript per gene, an alternative is to check for the MANE_transcript per gene (more on both here).
For example, if you decide on choosing the canonical transcript then you could ask biomaRt to also mark the canonical transcript within the table with this code:
Again, you will get 175293 rows, but you could subset the rows to contain only the canonical transcripts and their corresponding genes. One of the ways to do is as follows:
You can use this gene-->transcript table to subset your original table (that contained all the transcripts) to get a gene-specific length. One way to do this is as follows:
First of all I would like to thank for your elaborated answer. I really appreciate it.
Let me show you what each code line produces.. in order:
and the fourth line produced an error :
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
And this is the genelengths that I prodcued, the ones that I'm skeptical about, I would be glad to hear what you think of this:
That is a bit weird, when I run this code on my machine I get something like this after running the fourth line which I imagine is something that you were looking for (having a single "gene length" per gene):
Can you paste the column names that you have for
gene_canonical_transcript_subset
andgene_lengths
? I see in the fourth figure the only column name is genelength1 which is technically not a column name of any of the two data frames. Could you also paste the exact line of code that you used to get the last figure?Optionally, to avoid R altogether, you could also use the web interface of biomart hosted here -https://www.ensembl.org/biomart/martview. Select the same attributes and filters from the code above, upload your list of gene IDs (in batches of 500) and you should have the same results
The column names of
gene_lengths
are :and for
gene_canonical_transcript_subset
:I produced the
genelengths1
(shown in the last figure) like this:genelength1
here is, I tried cross-referencing the first gene with the entry page for that gene in Ensembl and I could not find the number 6979.Accessing biomart through the weblink that I shared could be may be useful?
manaswwm So it worked, turns out I'm the one who overlooked a small typo, apologies. This is the result. Looking at the
transcript_length
column, as you can see it is very different from the gene lengths that i originaly have. Which should I use?Perhaps my doubts were right afterall and the gene lengths that I have are not accurate?
It is tough to say which one you should use as I am not sure what you have in the
genelength1
data frame. What you have frombiomaRt
are the lengths extracted directly from Ensembl, which I would personally prefer. Maybe you could manually check a couple of genes as an example, go to their corresponding Ensembl pages and check which of the two data frames have the correct length listed against the gene (or transcript for the data frame frombiomaRt
).manaswwm I appreciate your help sir, thank you so much. I will run the analysis using the new values, and see how the results will change. Maybe I'll update you if there are any good news :)