4.4 years ago
zizigolu
Hi,
I have a list of genes. Where can I get the coding sequence (CDS) length of each gene, in nucleotides?
I have a file with gene lengths, but I guess those are whole-gene lengths, not the length of the coding sequence part of each gene.
Can you help, please?
You can also use Ensembl BioMart directly. For GRCh37 there is an archive Ensembl BioMart site: http://grch37.ensembl.org/biomart/martview
The advantage is that other metadata, such as the gene name, can be pulled in at the same time.
Without the actual CDS coordinates (structural annotations), it will be quite hard.
Don't you have a GFF or similar file with the annotations for all CDSs?
No, I don't.
Do you mean I must extract that from a GFF file?
If the species is annotated, you could get the CDS in FASTA format and count the number of bases there for your genes of interest.

This code is supposed to get the CDS length for human, but I don't know why I get an error in the second line.
Getting error here
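For the FASTA route suggested earlier in the thread, here is a minimal sketch of counting bases per record with `awk`. The function name `cds_lengths` and the file name in the usage note are hypothetical. It assumes one CDS per FASTA record, with the ID as the first word of the header line; in Ensembl-style CDS FASTA files that ID is typically a transcript ID, so a gene with several transcripts will appear several times.

```shell
# cds_lengths FASTA
# Print "<record id><TAB><sequence length>" for every record in a FASTA file.
cds_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # flush the previous record
            id = substr($1, 2)                # first word of the header, minus ">"
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")               # drop stray whitespace / CR
            len += length($0)                 # sequences may wrap over many lines
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

Usage would look like `cds_lengths my_cds.fa > cds_lengths.tsv`, after which the rows for the genes of interest can be picked out with `grep -F -f` against a list of IDs.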
First off, you don't need to skip the comment lines in the GTF file: `read.table` automatically skips lines that begin with `comment.char`, whose default value is `#`.

Also, with a GTF file, you're better off using `data.table` and `fread` than `data.frame` and `read.table`.

Unless you're sure `ens[nn1, 9]` is a factor, don't use `as.character.factor`. Simply use `as.character` and let R figure out which method to dispatch to.

Have you looked at: A: How to obtain the length of coding regions for the list of genes?
My genome is GRCh37. Does the CDS length differ for that assembly?
Get GTF file for GRCh37 here.
Just a side note:
It might be safer to use transcript columns instead of gene columns (i.e. `print $12` instead of `print $10`), since, depending on the annotation an individual might be using, the sum of exon lengths could change dramatically if there are multiple transcripts for the same gene. It could be worthwhile to do `grep ENSEMBL` before `awk` in case your GENCODE annotation has HAVANA entries as well.

A, please stop adding answers. Add comments/comment-replies instead.