12.2% of protein coding transcripts in GRCh38.cds.all.fa (from Ensembl db) don't start with 'ATG'. Why is this?
10.1 years ago
pwg46 ▴ 540

Please refer to the title. Are some of the CDS in Ensembl's cds file simply bad data?

grch38 cds transcript ensembl • 3.9k views

It's known, but maybe it's more common than previously thought: http://en.wikipedia.org/wiki/Start_codon#Eukaryotes

10.1 years ago
Denise CS ★ 5.2k

No, it's not bad data. In addition to the non-AUG start, it could be because they are annotated as CDS 5' incomplete and the start was therefore left open. See this example. CDS 5' (or CDS 3') incomplete transcripts are manually annotated by the HAVANA team and displayed in Ensembl as part of the GENCODE gene set.
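
As a rough way to separate the two explanations in a FASTA file, one could flag each CDS by whether it starts with ATG, ends with a stop codon, and has a length that is a multiple of three. This is only a heuristic sketch in base R, not how HAVANA marks incomplete CDS; the annotation itself remains the authoritative source. The sequences below are hypothetical toy data, and the assumption that an incomplete CDS may have a length that is not a multiple of three should be checked against the real records.

```r
# Hypothetical toy CDS sequences, for illustration only.
# With real data, a named character vector of CDS sequences would go here.
seqs <- c(complete = "ATGGCCAAATAA",
          no_start = "GCCAAATAA",
          odd_len  = "CCAAATAA")

stops <- c("TAA", "TAG", "TGA")
n <- nchar(seqs)

# One row per CDS, one logical column per check
flags <- data.frame(
  starts_atg   = substr(seqs, 1, 3) == "ATG",
  ends_stop    = substr(seqs, n - 2, n) %in% stops,
  multiple_of3 = n %% 3 == 0,
  row.names    = names(seqs)
)
print(flags)
```

A CDS that lacks ATG but is otherwise in frame and ends in a stop codon is a candidate for either a non-AUG start or a 5' incomplete annotation; the flags alone cannot tell these apart.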

10.1 years ago
Chris S. ▴ 340

You can also check the FASTA deflines in the file, specifically the status and transcript types. Restricting to ensembl_havana_transcript:known transcripts with biotype protein_coding, I count 58 CDS (about 0.26%) that don't start with ATG. Here's some R code with details.

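library(Biostrings)  # provides readDNAStringSet()

url <- "ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz"
# download.file() needs a destination file; mode = "wb" keeps the gzip intact
download.file(url, destfile = "Homo_sapiens.GRCh38.cds.all.fa.gz", mode = "wb")
system("gunzip Homo_sapiens.GRCh38.cds.all.fa.gz")
cds <- readDNAStringSet("Homo_sapiens.GRCh38.cds.all.fa")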

length(cds)
[1] 99436

status <- gsub("[^ ]+ ([^ ]+).*", "\\1", names(cds) )
atg <-  substr(cds, 1,3)=="ATG"
ttype <- gsub(".*transcript_biotype:([^ ]+).*", "\\1", names(cds) )

table(status, atg)

                                    atg
status                               FALSE  TRUE
  ensembl_havana_transcript:known      102 22937
  ensembl_havana_transcript:novel       35  1611
  ensembl_havana_transcript:putative    37  1250
  ensembl:known                        266 10041
  ensembl:novel                         83   434
  havana:known                        4620 31920
  havana:novel                        2617  6072
  havana:putative                     4222 12728

table(atg[status=="ensembl_havana_transcript:known" & ttype=="protein_coding"])

FALSE  TRUE
   58 22511
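
To see what the non-ATG CDS actually begin with, the first codons can be tabulated. A minimal sketch in base R follows; the sequences are hypothetical toy data so the block runs on its own, and with the real file the vector would presumably come from as.character(cds). Frequent codons such as CTG or GTG would hint at non-AUG starts, while a flat spread over many codons would be more consistent with 5' incomplete CDS.

```r
# Hypothetical toy CDS sequences, for illustration only.
# With the real data, use: seqs <- as.character(cds)
seqs <- c(tx1 = "ATGGCCTAA", tx2 = "CTGGCCTAA", tx3 = "GCCGCCTAA",
          tx4 = "CTGAAATAG", tx5 = "ATGAAATAG")

first_codon <- substr(seqs, 1, 3)
non_atg <- first_codon[first_codon != "ATG"]

# Count first codons among the non-ATG CDS, most frequent first
codon_counts <- sort(table(non_atg), decreasing = TRUE)
print(codon_counts)

# Fraction of all CDS that do not start with ATG
frac_non_atg <- mean(first_codon != "ATG")
print(frac_non_atg)
```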