extracting protein coding genes from NCBI seq_gene.md file

8.6 years ago

avari ▴ 110

Hi all,

I want to extract a list of protein coding genes from the seq_gene.md file.

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/BUILD.37.1/mapview/seq_gene.md.gz

I have already selected all rows where the feature name is “GENE” and removed all rows with the ‘LOC’ genes (which I do not want).
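For reference, the filtering step described above can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the column names `feature_type` and `feature_name`, and the choice of which column holds the value GENE, are assumptions that should be checked against the header line of the actual file, and the sample rows are made up.

```python
import csv
import io


def select_named_genes(lines, type_col="feature_type", name_col="feature_name"):
    """Yield rows typed as GENE whose symbol does not start with 'LOC'.

    type_col and name_col are assumed column names; adjust them to
    match the header of the real seq_gene.md file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row[type_col] != "GENE":
            continue  # skip non-gene features
        if row[name_col].startswith("LOC"):
            continue  # skip LOC placeholder genes
        yield row


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "chromosome\tchr_start\tchr_stop\tfeature_name\tfeature_type\n"
    "1\t1000\t5000\tGENE_A\tGENE\n"
    "1\t1000\t5000\tNM_000001.1\tRNA\n"
    "1\t9000\t9500\tLOC100001\tGENE\n"
)
kept = [row["feature_name"] for row in select_named_genes(sample)]
print(kept)
```

For the real file, the same function could be given a handle opened with `gzip.open(path, "rt")` instead of the in-memory sample.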

I want to remove any genes without good evidence and any pseudogenes.

Therefore my questions are:

1: Which evidence codes should I keep? Should I select genes with a partial evidence code or keep them all?

2: Should I care about the chromosome orientation? Probably not, right?

I will later add a 20,000 base pair extension to the chromosome ‘start’ and ‘stop’ positions of my final gene list and positionally map SNPs to these gene windows. If anyone knows a simpler way to do this, please let me know.
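The window step could look roughly like the sketch below. It assumes 1-based inclusive coordinates and that each gene is available as a `(symbol, chromosome, start, stop)` tuple; the gene symbols, positions, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

FLANK = 20_000  # extension added on each side of a gene, in base pairs


def build_windows(genes, flank=FLANK):
    """Map chromosome -> list of (window_start, window_stop, symbol).

    genes is an iterable of (symbol, chrom, start, stop) tuples with
    assumed 1-based inclusive coordinates.
    """
    windows = defaultdict(list)
    for symbol, chrom, start, stop in genes:
        lo, hi = min(start, stop), max(start, stop)
        # Clamp at 1 so a window never runs off the start of the chromosome.
        windows[chrom].append((max(1, lo - flank), hi + flank, symbol))
    return windows


def genes_for_snp(windows, chrom, pos):
    """Return the sorted symbols of all genes whose window contains pos."""
    return sorted(
        symbol for lo, hi, symbol in windows.get(chrom, []) if lo <= pos <= hi
    )


# Made-up genes, for illustration only.
genes = [
    ("GENE_A", "1", 100_000, 105_000),
    ("GENE_B", "1", 120_000, 130_000),
]
windows = build_windows(genes)
print(genes_for_snp(windows, "1", 90_000))
print(genes_for_snp(windows, "1", 110_000))
print(genes_for_snp(windows, "1", 160_000))
```

Because the extension here is the same on both sides, the resulting window does not depend on which strand the gene is on. The lookup is a plain linear scan per SNP, which is simple but may be slow for very large SNP lists.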

Thanks for any comments!

genes seq_gene.md SNPs protein coding genome • 1.6k views

updated 8.5 years ago by Biostar 20 • written 8.6 years ago by avari ▴ 110
