Hello, I feel like I'm missing a simple concept or something regarding this:
I'm working with NCBI feature table files from bacterial genomes. In this type of tab-delimited text file, columns 11 and 12 are product_accession and non-redundant_refseq respectively. Most lines of the file contain ID strings in both of these columns (example: WP_000831330.1 & WP_000831330.1); however, some lines contain nothing in these columns even though they have a product name (protein name) assigned.
My question is: why do some of the annotated proteins not have either of these IDs assigned? Shouldn't all of them, in theory, have both IDs?
At a first guess, I would imagine it's because not everything that's in NCBI is in the RefSeq database, as RefSeq is a more complete and curated dataset. Or are you working with RefSeq data specifically to start with?
Hello, thanks for answering. Yes, I'm fetching the tables from the RefSeq database, and after your answer I checked how they look compared to GenBank. For some reason the GenBank tables don't have any IDs at all in these 2 columns on any line, lol. That's understandable for the RefSeq accession, but shouldn't they have a product accession even in GenBank? I'm really confused, although what you said may be right: some of the products I'm working with might simply not be in RefSeq.
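One way to narrow this down might be to pull out the rows where both columns are empty and see what they have in common. Below is a minimal Python sketch, not a definitive tool. It assumes the file is tab-delimited with header lines starting with `#`, that columns 11 and 12 are product_accession and non-redundant_refseq as described above, and (an assumption beyond what is stated in this thread) that the first two columns hold the feature type and its class. The function names are made up for illustration.

```python
import sys
from collections import Counter

# 0-based indices for columns 11 and 12 as described in the question.
PRODUCT_ACC = 10
NR_REFSEQ = 11


def rows_missing_ids(lines):
    """Yield rows (as lists of fields) where columns 11 and 12 are both empty."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        row = line.split("\t")
        # Pad short rows so indexing is safe if trailing empty fields were dropped.
        row += [""] * (NR_REFSEQ + 1 - len(row))
        if not row[PRODUCT_ACC].strip() and not row[NR_REFSEQ].strip():
            yield row


def tally_by_type(rows):
    """Count rows by (column 1, column 2); assumed to be feature type and class."""
    return Counter((row[0], row[1]) for row in rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py path/to/feature_table.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for key, count in tally_by_type(rows_missing_ids(handle)).most_common():
            print(key, count)
```

If the rows without IDs cluster in one or two feature/class combinations, that would suggest the gaps are tied to how those features are annotated rather than being random omissions.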