Question

Getting coding sequence length

0

4.5 years ago

zizigolu ★ 4.3k

Hi

I have a list of genes

Where can I get the coding sequence length of each gene (in nucleotides)?

I have a file with gene lengths, but I think those are whole-gene lengths, not the length of the coding sequence of each gene.

Can you help please?

NCBI Gene CDS • 4.3k views

ADD COMMENT • link updated 2.6 years ago by newbio17 ▴ 360 • written 4.5 years ago by zizigolu ★ 4.3k

1

You can also use Ensembl BioMart directly. For GRCh37 there is an archived Ensembl BioMart site: http://grch37.ensembl.org/biomart/martview

Select Human genes
In Attributes -> Structures -> Gene -> CDS Length (radio button)

The advantage is that other metadata, such as the gene name, can be pulled in at the same time.

ADD REPLY • link 4.5 years ago by microfuge ★ 2.0k

0

Without the actual CDS coordinates (structural annotations), it will be quite hard.

Don't you have a GFF or similar file with the annotations for all CDSs?

ADD REPLY • link 4.5 years ago by lieven.sterck 15k

0

No, I don't.

Do you mean I must extract that from a GFF file?

ADD REPLY • link 4.5 years ago by zizigolu ★ 4.3k

0

If the species is annotated, you could download the CDS sequences in FASTA format and count the number of bases for your genes of interest.
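
As a rough illustration of the counting step, the sketch below sums the sequence length of each record in a FASTA stream with awk. The function name fasta_lengths and the toy records are made up for this example. It assumes the record ID is the first word of the header line; in Ensembl CDS FASTA files that is typically the transcript ID, but check your own file.

```shell
# Sum sequence length per FASTA record; prints "<id>\t<length>".
# Assumption: the record ID is the first whitespace-delimited word of the header.
fasta_lengths() {
  awk '
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
    END  { if (id != "") print id "\t" len }
  '
}

# Toy input (hypothetical records): the first sequence is wrapped over two lines.
printf '>tx1 some description\nATGAAA\nTAA\n>tx2\nATGTAG\n' | fasta_lengths
```

On a real file this would be run as something like zcat your_cds.fa.gz | fasta_lengths, then filtered down to the transcripts of the genes of interest.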

ADD REPLY • link 4.5 years ago by husensofteng ▴ 410

0

This code is supposed to get CDS lengths for human, but I don't know why I get an error on the third line.

ens<-read.table("Homo_sapiens.GRCh38.84.gtf/Homo_sapiens.GRCh38.84.gtf",sep="\t",skip=3)
nn1<-which(ens[,3]=="CDS")
genes<-paste0("ENSG",gsub(".*ENSG","",as.character.factor(ens[nn1,9])))
genes<-gsub(";.*","",genes)
transcr<-paste0("ENST",gsub(".*ENST","",as.character.factor(ens[nn1,9])))
transcr<-gsub(";.*","",transcr)
len<-ens[nn1,5]-ens[nn1,4]
df<-cbind.data.frame(genes,transcr,len)
df1<-aggregate(df[,3],by=list(genes,transcr),FUN="sum")
write.csv(df1,"gene,transcript,CDS_length_ list,ens84,grch38.csv")

Getting error here

> genes<-paste0("ENSG",gsub(".*ENSG","",as.character.factor(ens[nn1,9])))

Error in as.character.factor(ens[nn1, 9]) :   attempting to coerce non-factor

ADD REPLY • link 4.5 years ago by zizigolu ★ 4.3k

0

First off, you don't need to skip the comment lines in the GTF file: read.table automatically skips lines that begin with comment.char, whose default value is #.

Also, with GTF files, you're better off using data.table and fread than data.frame and read.table.

Unless you're sure ens[nn1, 9] is a factor, don't use as.character.factor. Simply use as.character and let R figure out which method to dispatch to.
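
For comparison, the per-transcript sum that the R script is aiming for can also be sketched with awk, which avoids the factor/character question entirely. This is only a minimal sketch: the function name cds_length_per_transcript and the toy GTF lines are invented for illustration, and it assumes a tab-separated GTF whose ninth column contains gene_id "..." and transcript_id "..." attributes. It adds 1 to end minus start because GTF coordinates are 1-based and inclusive.

```shell
# Sum CDS feature lengths per transcript from a GTF stream.
# Prints "<gene_id>\t<transcript_id>\t<cds_length>" in order of first appearance.
cds_length_per_transcript() {
  awk -F'\t' '
    /^#/ { next }                      # skip header/comment lines
    $3 == "CDS" {
      gene = tx = ""
      if (match($9, /gene_id "[^"]+"/))       gene = substr($9, RSTART + 9,  RLENGTH - 10)
      if (match($9, /transcript_id "[^"]+"/)) tx   = substr($9, RSTART + 15, RLENGTH - 16)
      key = gene "\t" tx
      if (!(key in len)) order[++n] = key
      len[key] += $5 - $4 + 1          # GTF coordinates are 1-based and inclusive
    }
    END { for (i = 1; i <= n; i++) print order[i] "\t" len[order[i]] }
  '
}

# Toy GTF (hypothetical gene G1 with transcripts T1 and T2, not real annotation).
{
  printf '#!toy example header\n'
  printf 'chr1\tdemo\texon\t1\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
  printf 'chr1\tdemo\tCDS\t101\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'
  printf 'chr1\tdemo\tCDS\t301\t350\t.\t+\t2\tgene_id "G1"; transcript_id "T1";\n'
  printf 'chr1\tdemo\tCDS\t101\t160\t.\t+\t0\tgene_id "G1"; transcript_id "T2";\n'
} | cds_length_per_transcript
```

Note that in the GTF convention the stop codon is usually annotated as a separate stop_codon feature, so totals computed this way may come out 3 nt shorter than lengths counted from a CDS FASTA file.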

ADD REPLY • link 4.5 years ago by Ram 44k

0

Have you looked at: A: How to obtain the length of coding regions for the list of genes?

ADD REPLY • link 4.5 years ago by GenoMax 148k

0

My genome is GRCh37. Does that make a difference to the CDS lengths?

ADD REPLY • link 4.5 years ago by zizigolu ★ 4.3k

2

Get the GTF file for GRCh37 here.

zcat gencode.v34lift37.annotation.gtf.gz | awk '{if($3=="exon") print $10"\t"$5-$4+1}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt

ADD REPLY • link 4.5 years ago by GenoMax 148k

0

Just a side note:

It might be safer to group by transcript rather than by gene (i.e. print $12 instead of print $10): when a gene has multiple transcripts, summing exon lengths per gene counts shared exons more than once, so the total can change dramatically depending on the annotation being used. It could also be worthwhile to run grep ENSEMBL before awk in case your GENCODE annotation has HAVANA entries as well.
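
Once lengths are computed per transcript, they still need to be reduced to one value per gene. A minimal sketch of one possible convention, keeping the transcript with the longest total for each gene, is shown below. The function name longest_cds_per_gene, the three-column input layout (gene ID, transcript ID, length, tab-separated), and the toy IDs are all assumptions made for this example. Picking the longest transcript is a choice, not the only reasonable one.

```shell
# Keep, for each gene, the transcript with the largest length.
# Input and output: "<gene_id>\t<transcript_id>\t<length>" (tab-separated).
longest_cds_per_gene() {
  awk -F'\t' '
    !($1 in best) || $3 + 0 > best[$1] {
      if (!($1 in best)) order[++n] = $1   # remember first-seen gene order
      best[$1] = $3 + 0
      tx[$1]   = $2
    }
    END { for (i = 1; i <= n; i++) print order[i] "\t" tx[order[i]] "\t" best[order[i]] }
  '
}

# Toy table (hypothetical genes G1 and G2 with two transcripts each).
printf 'G1\tT1\t150\nG1\tT2\t60\nG2\tT3\t90\nG2\tT4\t300\n' | longest_cds_per_gene
```

If two transcripts of a gene tie, this keeps the one that appears first in the input.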