Getting coding sequence length
0
0
Entering edit mode
4.4 years ago
zizigolu ★ 4.3k

Hi

I have a list of genes

Where I can get the coding sequence length of each gene (in nucleotide) ?

I have a file having gene length but I guess that IS FOR whole gene length not coding sequence part of a gene

Can you help please?

NCBI Gene CDS • 4.2k views
ADD COMMENT
1
Entering edit mode

You can also use ensemble biomart directly. For Grch37 there is an archive ensembl biomart site http://grch37.ensembl.org/biomart/martview .

  • Select Human genes
  • In attributes -> Structures -> gene -> CDS Length Radio button

The advantage being other metadata can also be pulled in like gene name etc.

ADD REPLY
0
Entering edit mode

without the actual CDS coordinates ( structural annotations ) , it will be quite hard .

Don't you have a gff or such file with the annotations for all CDSs?

ADD REPLY
0
Entering edit mode

No I don't

You meant I must extract that from gif file?

ADD REPLY
0
Entering edit mode

if the species is annotated, you could get the CDS in fasta and count the number of bases there for your genes of interest.

ADD REPLY
0
Entering edit mode

This code suppose to get CDS length for human but I dob't know why I get error in second line

ens<-read.table("Homo_sapiens.GRCh38.84.gtf/Homo_sapiens.GRCh38.84.gtf",sep="\t",skip=3)
nn1<-which(ens[,3]=="CDS")
genes<-paste0("ENSG",gsub(".*ENSG","",as.character.factor(ens[nn1,9])))
genes<-gsub(";.*","",genes)
transcr<-paste0("ENST",gsub(".*ENST","",as.character.factor(ens[nn1,9])))
transcr<-gsub(";.*","",transcr)
len<-ens[nn1,5]-ens[nn1,4]
df<-cbind.data.frame(genes,transcr,len)
df1<-aggregate(df[,3],by=list(genes,transcr),FUN="sum")
write.csv(df1,"gene,transcript,CDS_length_ list,ens84,grch38.csv")

Getting error here

>genes<paste0("ENSG",gsub(".*ENSG","",as.character.factor(ens[nn1,9])))

Error in as.character.factor(ens[nn1, 9]) :   attempting to coerce non-factor
ADD REPLY
0
Entering edit mode

First off, you don't need to skip the comment lines in the GTF file, read.table automatically skips lines that begin with comment.char, whose default value is #.

Also, with GTF file, you're better off using data.table and fread than data.frame and read.table.

Unless you're sure ens[nn1, 9] is a factor, don't use as.character.factor. Simply use as.character and let R figure out which method to dispatch to.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

My genome is GRCH37, does this differ in CDS length?

ADD REPLY
2
Entering edit mode

Get GTF file for GRCh37 here.

zcat gencode.v34lift37.annotation.gtf.gz | awk '{if($3=="exon") print $10"\t"$5-$4}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt
ADD REPLY
0
Entering edit mode

Just a side note:

It might be safer to use transcript columns instead of gene columns (i.e. print $12 instead of print $10) since depending on the annotation an individual might be using, the sum of exon lengths could dramatically change if there are multiple transcripts for the same gene. It could be worthwhile to do grep ENSEMBL before awk in case your GENCODE annotation has HAVANA entries as well.

ADD REPLY
1
Entering edit mode

A, please stop adding answers. Add comments/comment-replies instead.

ADD REPLY

Login before adding your answer.

Traffic: 1980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6