calculating the total length of cDNA for a given gene
10.6 years ago
Quak ▴ 520

I have a list of genes, and I would like to calculate the total length of cDNA for each. Is there a database or a table somewhere where I can look this up? Is there any way to query UCSC?

Alternatively, I looked into Ensembl through BioMart, and I see that "cDNA coding start" and "cDNA coding end" are listed in the Attributes. Should I sum them up to get the final cDNA length? How can I be sure that the number I get is correct and equal to the true cDNA length? Is there a database where I can double-check my result?

gene cDNA database • 8.9k views

I am not an expert, but I guess that cDNA length might differ for every gene isoform. Following this logic, checking cDNA length for every transcript is more correct.


Well basically I am after calculating how much of the gene is coding and how much is not coding. I think, based on your suggestion I should calculate cDNA length for every transcript of the gene and sum them up. Is that correct?!


'cDNA coding start' and 'cDNA coding end' will give you the positions of the coding start and end within the cDNA, which is not what you need.

There isn't an easy way to get this via BioMart, I'm afraid. Are you familiar with our Perl APIs?


So you mean, even the summation of coding length within cDNA is not equal to the coding length of the gene ?!

Basically, I have a set of genes, and I would like to group them based on their coding region length.


Forget about cDNA, that's just not the right term.


You might be right. I used cDNA because it was the closest term in BioMart to what I am looking for. I am after the "coding region" of each gene.


Use 'CDS length' in BioMart - that's exactly what you need.


Should I take the longest one? I see there are several entries for each gene.


Are you looking for the length of the mature mRNA? This could be easily calculated for each transcript. In my understanding, cDNA is a laboratory construct (reverse-transcribed mRNA), which could be fragmented. The length of the coding sequence is not the same as the length of the mRNA because of the UTRs.

10.6 years ago
Irsan ★ 7.8k
You cannot sum up the coding sequence lengths of all transcripts per gene, because many exons will be counted more than once. Just taking the longest one also isn't accurate, because you may miss (parts of) exons. The best way is to merge the overlapping exons of all transcripts per gene and sum their sizes, as described in this post: Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor
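
The merge-then-sum idea can be sketched in base R as below. This is only an illustration on a toy table with made-up coordinates, not the code from the linked post; the helper name `merged_length` and the column names `gene_id`, `start` and `end` are assumptions. On a real annotation, the `reduce()` function from the Bioconductor GenomicRanges package does the same merging.

```r
# Toy exon table (hypothetical coordinates): two transcripts of geneA share
# an overlapping exon; geneB has a single exon.
exons <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneB"),
  start   = c(100, 150, 400, 10),
  end     = c(200, 250, 450, 59),
  stringsAsFactors = FALSE
)

# Merge overlapping intervals (1-based, inclusive coordinates, as in GTF)
# and return the total number of bases they cover.
merged_length <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end) {
      # Overlaps the current block: extend it if this exon reaches further
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and open a new one
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Non-overlapping exon length per gene
gene_len <- sapply(split(exons, exons$gene_id),
                   function(g) merged_length(g$start, g$end))
print(gene_len)
```

For geneA the first two exons collapse into one block (100 to 250), so the shared bases are counted once instead of twice. Feeding the function CDS features instead of exon features would give the non-overlapping coding length per gene.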
10.6 years ago

If you can get a GTF for your species (e.g. for human you could get it from this ftp link), then this R script will give you a file of transcript IDs and their lengths. It simply sums the lengths of the exons within each transcript:

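# Read the GTF; read.table() can read the gzipped file directly
gtf <- read.table('Homo_sapiens.GRCh37.63.gtf.gz', sep = '\t', stringsAsFactors = FALSE, quote = '')
# Keep exon features only (column 3 is the feature type)
exons <- gtf[gtf$V3 == 'exon', ]
# Extract the transcript_id value from the attribute column (V9)
exons$transcript_id <- sub('.*transcript_id "', '', exons$V9, perl = TRUE)
exons$transcript_id <- sub('".*', '', exons$transcript_id, perl = TRUE)
# GTF coordinates are 1-based and inclusive, hence the + 1
exons$feature_length <- exons$V5 - exons$V4 + 1
# Sum exon lengths per transcript and write the result as a tab-separated file
cdna_len <- aggregate(list(cdna_length = exons$feature_length), by = list(transcript_id = exons$transcript_id), sum)
write.table(cdna_len, file = 'cdna_length.txt', sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)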
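
Since the question is really about coding versus non-coding length, the same per-transcript summing could be applied to the 'CDS' features of the GTF as well. The sketch below runs on a small made-up table rather than a real file; the helper `sum_by_tx` and the output column names are assumptions, and it assumes `transcript_id` has already been extracted as in the script above. In GTF files the stop codon is usually annotated as a separate feature, so a CDS length obtained this way may be 3 bases shorter than values reported elsewhere.

```r
# Toy GTF-like table (hypothetical coordinates): tx1 is coding, tx2 has no CDS.
gtf <- data.frame(
  V3 = c("exon", "CDS", "exon", "CDS", "exon"),
  V4 = c(100, 150, 300, 300, 500),
  V5 = c(200, 200, 400, 350, 520),
  transcript_id = c("tx1", "tx1", "tx1", "tx1", "tx2"),
  stringsAsFactors = FALSE
)
gtf$feature_length <- gtf$V5 - gtf$V4 + 1

# Sum the lengths of one feature type per transcript
sum_by_tx <- function(feature, col_name) {
  rows <- gtf[gtf$V3 == feature, ]
  out <- aggregate(rows$feature_length,
                   by = list(transcript_id = rows$transcript_id), FUN = sum)
  names(out)[2] <- col_name
  out
}

# Keep every transcript, including non-coding ones (all.x = TRUE)
lens <- merge(sum_by_tx("exon", "transcript_length"),
              sum_by_tx("CDS", "cds_length"),
              by = "transcript_id", all.x = TRUE)
lens$cds_length[is.na(lens$cds_length)] <- 0
# Whatever is not CDS is UTR (or the whole transcript, if it is non-coding)
lens$noncoding_length <- lens$transcript_length - lens$cds_length
print(lens)
```

Transcripts without any CDS feature get a coding length of 0 here instead of being dropped, which keeps non-coding transcripts visible when grouping by coding length.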