2.6 years ago
Alewa
Seems weird. Could someone please help me understand why, and how to possibly resolve this?
[sta55@cbsukim tb_genes_fasta]$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efilter -feature gene | efetch -format gene_fasta | grep "^>" | wc -l
4008
[sta55@cbsukim tb_genes_fasta]$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efilter -feature gene | efetch -format fasta_cds_aa | grep "^>" | wc -l
3906
Background
I'm extracting the nucleotide sequences of M. tuberculosis genes and their corresponding CDS (protein) sequences. https://www.ncbi.nlm.nih.gov/nuccore/NC_000962#locus_448814763
At a guess, one explanation could be that some features annotated as genes are RNAs etc. which don't code for proteins, so there are more genes than CDSs (i.e. more functionally annotated "things" than just proteins).
Joe - thanks for chiming in. But in my case there were fewer CDSs than genes. Or maybe I'm not doing the gene filtering right? :(
That's what I said, no? You have fewer annotated CDSs than genes. Remember what "CDS" actually means: coding sequences.
This is usually taken to mean they give rise to a functional protein, but the definition of a gene is broader these days and can include non-coding RNAs.
Hence number of CDS + number of non-CDS functional elements = number of "genes".
Or more simply:
gene != CDS
This is still a guess on my part, as it could be due to any number of annotation artefacts etc., but I don't see an obvious problem here - the numbers you've retrieved make intuitive sense.
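One way to check this guess would be to tally the feature types annotated on the record and see whether the non-CDS ones (tRNA, rRNA, ncRNA and so on) account for the gap of 4008 - 3906 = 102. The sketch below is only an illustration: count_feature_types is a hypothetical helper name, and it assumes the input is an NCBI 5-column feature table, where feature lines are "start<TAB>stop<TAB>key" and qualifier lines begin with three tabs. The commented efetch call assumes that -format ft returns such a table for this accession.

```shell
# Hypothetical helper: tally feature keys in an NCBI 5-column feature table
# read from stdin. Only lines with a non-empty third tab-separated field are
# counted; header, qualifier and continuation lines have that field empty.
count_feature_types() {
    awk -F'\t' '$3 != "" { n[$3]++ } END { for (k in n) print n[k], k }' | sort -k2
}

# Assumed usage (needs EDirect and network access):
# efetch -db nuccore -id NC_000962.3 -format ft | count_feature_types
```

If the counts for the RNA features roughly add up to the difference between the two numbers above, that supports the gene != CDS explanation. Anything left over would point to genes without a translated CDS, or to an annotation quirk.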