I have a list of several hundred human gene IDs, one per line, and I need to determine for each of them whether it is transcribed from the forward (plus) strand, with respect to the GRCh37.p13 reference sequence, or from the reverse-complement (minus) strand.
I submitted this list to Batch Entrez and selected the "Tabular (text)" format in Display Settings, and there was indeed an "orientation" column in the output, reading "plus" or "minus", respectively. There is one problem, however: it always shows the orientation for the most recent annotation release (currently release 107 for genome assembly GRCh38.p2), while I need the information from annotation release 105 (GRCh37.p13). This is important because, for example, the NIPA1 gene is on the forward strand in release 107 but on the reverse-complement strand in release 105, and examples like that are plentiful. I also tried adding AND GRCh37.p13[Assembly Name] to my search string, but it seems to have no effect: the "Tabular (text)" view still shows the orientation from the latest annotation release.
Can anyone please explain what I should do in this situation? The solution doesn't have to be limited to Entrez; I can write a parser script or even a web scraper if that is what it takes.
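For the parser-script route, here is a minimal sketch rather than a definitive solution. It assumes you have downloaded the GFF3 annotation file for annotation release 105 and that its "gene" rows carry the Entrez ID as Dbxref=GeneID:<number> in the attributes column, which is worth checking against your copy of the file. The function name strands_from_gff3 is made up for illustration. In GFF3, column 1 is the sequence ID and column 7 is the strand.

```python
import re

# Assumption: gene rows carry the Entrez ID as "GeneID:<digits>" inside Dbxref.
GENE_ID_RE = re.compile(r"GeneID:(\d+)")


def strands_from_gff3(lines, wanted_ids):
    """Return {gene_id: {seqid: strand}} for the wanted Entrez gene IDs.

    `lines` is any iterable of GFF3 text lines (for example an open file).
    Only rows whose feature type (column 3) is "gene" are considered.
    A gene can be annotated on more than one sequence (for example on a
    patch or alternate locus), so strands are kept per sequence ID
    instead of being collapsed into a single value.
    """
    wanted = {str(g) for g in wanted_ids}
    result = {}
    for line in lines:
        # Skip blank lines and GFF3 directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        match = GENE_ID_RE.search(cols[8])
        if match and match.group(1) in wanted:
            seqid, strand = cols[0], cols[6]
            result.setdefault(match.group(1), {})[seqid] = strand
    return result
```

To use it, read your gene IDs into a list, open the annotation file (with gzip.open(path, "rt") if it is compressed) and pass the file handle as `lines`. IDs that are missing from the returned dictionary were not found as "gene" rows, and IDs with more than one sequence ID need a decision about which sequence counts as the reference chromosome.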
I'm new to sequence analysis, so thanks for pointing me to BioMart; it seems like a great tool that's also easy to use. Poking around, I found they also have a great timeline of genome assemblies, which helps in choosing the appropriate versions.
Yeah, you'll find BioMart and Ensembl to be your go-to resources for most things. I use both of them far more than NCBI.
Thank you Devon! It works like a charm.