Hi all,
I needed a list of all human genes (coding and non-coding), so I went to https://ftp.ncbi.nih.gov/gene/DATA/ and downloaded gene2pubmed.gz
The list provides the following:
tax_id: the unique identifier provided by NCBI Taxonomy for the species or strain/isolate
GeneID: unique identifier for a gene
PubMed ID (PMID): unique identifier in PubMed for a citation
Opening the file in Notepad++, I see over 12 million lines, one for each unique GeneID. Now, as far as I've learned, there are just under 40,000 genes in humans (coding + non-coding); Nature.com says "...21,306 protein-coding genes and 21,856 non-coding genes — many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes." (https://www.nature.com/articles/d41586-018-05462-w)
So why are there so many records in the NCBI file? Are regulatory elements (promoters, enhancers, etc.) considered "non-coding genes"? I'm so confused...
Thank you, Jane
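One thing worth checking before drawing conclusions from the line count: going by the three columns listed above, each line pairs a tax_id and a GeneID with one PMID, so a gene with many citations would appear on many lines, and the tax_id column suggests the file is not limited to human. Below is a minimal sketch for counting the rows and the distinct GeneIDs for a single species. It assumes the file is tab-separated with a header line starting with "#", and that human is tax_id 9606 in NCBI Taxonomy; the function name summarize_gene2pubmed is made up for this example.

```python
import gzip


def summarize_gene2pubmed(path, tax_id="9606"):
    """Return (row_count, distinct_gene_count) for one tax_id.

    Assumes a gzip-compressed, tab-separated file whose columns are
    tax_id, GeneID, PubMed ID, with header lines starting with '#'.
    """
    rows = 0
    genes = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip the header (assumed to start with '#') and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed lines
            if fields[0] != tax_id:
                continue  # a different species or strain
            rows += 1
            genes.add(fields[1])
    return rows, len(genes)
```

Calling summarize_gene2pubmed("gene2pubmed.gz") on the downloaded file should give the number of human gene-citation rows and the number of distinct human GeneIDs; the second number is the one to compare with the gene counts quoted above. Note that a gene with no linked citation would not show up in this file at all, so it may not be a complete gene list either.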
They probably also include alternatively spliced genes.
That's a great suggestion, thank you!