Hi all,
I want to extract a list of protein-coding genes from the seq_gene.md file.
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.1/mapview/seq_gene.md.gz
I have already selected all rows where the feature type is “GENE” and removed all rows for the ‘LOC’ genes (which I do not want).
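A minimal sketch of this kind of selection, run on a few made-up rows rather than the real file. The column names (feature_name, feature_type, and so on), the gene names and the coordinates are all assumptions for illustration and should be checked against the header line of the actual seq_gene.md:

```python
import csv
import io

# Toy stand-in for seq_gene.md. Column names and rows are assumptions,
# not copied from the real file.
SAMPLE = (
    "chromosome\tchr_start\tchr_stop\tchr_orient\tfeature_name\tfeature_type\n"
    "1\t1000\t5000\t+\tGENE_A\tGENE\n"
    "1\t1200\t4800\t+\tNM_000001.1\tRNA\n"
    "2\t7000\t9000\t-\tLOC100001\tGENE\n"
    "2\t20000\t26000\t-\tGENE_B\tGENE\n"
)


def select_genes(handle):
    """Keep GENE rows whose name does not start with 'LOC'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if row["feature_type"] == "GENE"
        and not row["feature_name"].startswith("LOC")
    ]


genes = select_genes(io.StringIO(SAMPLE))
print([g["feature_name"] for g in genes])  # ['GENE_A', 'GENE_B']
```

For the real file, the same function could be given a handle from gzip.open(path, "rt") instead of the in-memory sample.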
I want to remove any genes without good evidence and any pseudogenes.
Therefore my questions are:
1: Which evidence codes should I keep? Should I filter out genes with a partial evidence code, or keep them all?
2: Should I care about the chromosome orientation? Probably not, right?
I will later add a 20,000 base-pair extension to the chromosome ‘start’ and ‘stop’ positions of my final gene list and positionally map SNPs to these gene windows. If anyone knows a simpler way to do this, please let me know.
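To make the windowing step concrete, here is a rough sketch under stated assumptions: genes come as (chromosome, start, stop, name) tuples, SNPs as (id, chromosome, position) tuples, coordinates are 1-based and inclusive, and window starts are clamped at 1. The function names and the example genes and SNPs are hypothetical. It uses a plain linear scan, which is simple but slow for genome-wide SNP sets, so an interval tree or a dedicated interval tool would be the usual replacement at scale.

```python
FLANK = 20_000  # extension added on each side of the gene, in base pairs


def gene_windows(genes, flank=FLANK):
    """Return {chrom: [(win_start, win_stop, name), ...]}, starts clamped at 1."""
    windows = {}
    for chrom, start, stop, name in genes:
        lo, hi = min(start, stop), max(start, stop)
        windows.setdefault(chrom, []).append((max(1, lo - flank), hi + flank, name))
    return windows


def map_snps(snps, windows):
    """Return {snp_id: [names of genes whose window contains the SNP]}."""
    hits = {}
    for snp_id, chrom, pos in snps:
        hits[snp_id] = [
            name for lo, hi, name in windows.get(chrom, []) if lo <= pos <= hi
        ]
    return hits


# Hypothetical example data
genes = [
    ("1", 15_000, 30_000, "GENE_A"),
    ("1", 60_000, 65_000, "GENE_B"),
    ("2", 100_000, 120_000, "GENE_C"),
]
snps = [("rs_a", "1", 45_000), ("rs_b", "1", 90_000), ("rs_c", "2", 99_000)]

windows = gene_windows(genes)
print(map_snps(snps, windows))
# {'rs_a': ['GENE_A', 'GENE_B'], 'rs_b': [], 'rs_c': ['GENE_C']}
```

Note that a SNP can fall in more than one window once the flanks overlap, as rs_a does here, so the result keeps a list of genes per SNP rather than a single gene.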
Thanks for any comments!