10.1 years ago
pwg46
Please refer to the title. Are some of the CDS in Ensembl's cds file simply bad data?
No, it's not bad data. In addition to the non-AUG start, it could be because they are annotated as CDS 5' incomplete and the start was therefore left open. See this example. CDS 5' (or CDS 3') incomplete transcripts are manually annotated by the HAVANA team and displayed in Ensembl as part of the GENCODE gene set.
You can also check the FASTA deflines in the file, specifically the status and transcript type fields. Among protein_coding transcripts with status ensembl_havana_transcript:known, I count 58 CDS (about 0.25%) that do not start with ATG. Here's some R code with the details.
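library(Biostrings)  # provides readDNAStringSet()
url <- "ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz"
# download.file() needs an explicit destination file
download.file(url, destfile = "Homo_sapiens.GRCh38.cds.all.fa.gz", mode = "wb")
system("gunzip Homo_sapiens.GRCh38.cds.all.fa.gz")
cds <- readDNAStringSet("Homo_sapiens.GRCh38.cds.all.fa")
length(cds)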
[1] 99436
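# second defline field (source:status), e.g. "havana:known"
status <- gsub("[^ ]+ ([^ ]+).*", "\\1", names(cds) )
# TRUE when the CDS starts with ATG
atg <- substr(as.character(cds), 1, 3) == "ATG"
# transcript biotype, e.g. "protein_coding"
ttype <- gsub(".*transcript_biotype:([^ ]+).*", "\\1", names(cds) )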
table(status, atg)
                                     atg
status                               FALSE  TRUE
  ensembl_havana_transcript:known      102 22937
  ensembl_havana_transcript:novel       35  1611
  ensembl_havana_transcript:putative    37  1250
  ensembl:known                        266 10041
  ensembl:novel                         83   434
  havana:known                        4620 31920
  havana:novel                        2617  6072
  havana:putative                     4222 12728
table(atg[status=="ensembl_havana_transcript:known" & ttype=="protein_coding"])
FALSE  TRUE
   58 22511
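If you want to see what the two regular expressions above actually pull out of a defline without downloading the whole file, here is a minimal base R sketch. The two deflines and sequences are made up for illustration (hypothetical IDs and coordinates, written in the style of the release-77 headers), so treat the field layout as an assumption and check it against your own file.

```r
# Hypothetical deflines in the style of the release-77 cds file;
# the IDs, coordinates and sequences are invented for illustration.
deflines <- c(
  "ENST00000000001 ensembl_havana_transcript:known chromosome:GRCh38:1:100:900:1 gene:ENSG00000000001 gene_biotype:protein_coding transcript_biotype:protein_coding",
  "ENST00000000002 havana:putative chromosome:GRCh38:2:500:1500:-1 gene:ENSG00000000002 gene_biotype:protein_coding transcript_biotype:nonsense_mediated_decay"
)
seqs <- c("ATGGCCTAA", "CTGGCCAAG")

# Second space-delimited field holds source:status.
status <- gsub("[^ ]+ ([^ ]+).*", "\\1", deflines)
# Value that follows "transcript_biotype:".
ttype <- gsub(".*transcript_biotype:([^ ]+).*", "\\1", deflines)
# TRUE when the sequence begins with ATG.
atg <- substr(seqs, 1, 3) == "ATG"

parsed <- data.frame(status, ttype, atg)
parsed
```

If the header layout differs in another release, the status pattern is the one most likely to need adjusting, since it relies on field position rather than a key.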
It's known, but maybe it's more common than previously thought: http://en.wikipedia.org/wiki/Start_codon#Eukaryotes
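One rough way to look into this is to tabulate the first codon of the CDS that do not start with ATG. The sketch below uses a toy character vector (invented sequences, not Ensembl data); with the real file you would pass in the sequences as a character vector instead. Seeing mostly near-cognate codons such as CTG or GTG might hint at non-AUG initiation, while a broad spread of codons would be more consistent with 5' incomplete annotations, but that reading is an assumption and not something the counts alone can prove.

```r
# Toy CDS sequences, invented for illustration only.
seqs <- c("ATGGCCTAA", "CTGGCCAAG", "GTGAAATGA", "CTGTTTTAG", "GCCAAATGA")

# First three bases of each sequence.
first_codon <- substr(seqs, 1, 3)

# Keep only the sequences that do not start with ATG.
non_atg <- first_codon[first_codon != "ATG"]

# Count each non-ATG first codon, most frequent first.
codon_counts <- sort(table(non_atg), decreasing = TRUE)
codon_counts
```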