Annotate Illumina SNP file with Human GRCh37 genes file
10.4 years ago
gcastaigne ▴ 10

Hello everybody.

This is my first post, so don't hesitate to tell me if my explanations are not clear enough.

I would like to annotate an Illumina SNP file, and to do so I need to compare it against an annotated human genome file on the GRCh37 build (I don't care about the patch level; only the build matters).

To make the comparison efficient, I need several pieces of information from the human genome file.

I need at least:

  • HGNC symbol
  • GeneID
  • start gene position (bp)
  • end gene position (bp)
  • chromosomeID
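
For illustration, the five fields above are enough to test whether a SNP falls inside a gene. This is only a minimal sketch: the `Gene` record, the `genes_overlapping` helper, and all symbols and coordinates are made up for the example, and coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    symbol: str   # HGNC symbol, or a LOC placeholder symbol
    gene_id: int  # NCBI GeneID
    chrom: str    # chromosome name, e.g. "1" or "X"
    start: int    # gene start (bp), assumed 1-based inclusive, GRCh37
    end: int      # gene end (bp), assumed 1-based inclusive, GRCh37


def genes_overlapping(chrom: str, pos: int, genes: List[Gene]) -> List[Gene]:
    """Return every gene on `chrom` whose interval contains position `pos`."""
    return [g for g in genes if g.chrom == chrom and g.start <= pos <= g.end]


# Hypothetical records; names, IDs and coordinates are invented for the example.
genes = [
    Gene("GENE_A", 1001, "1", 1000, 2000),
    Gene("LOC0000001", 1002, "1", 1500, 3000),
    Gene("GENE_B", 1003, "2", 1000, 2000),
]

hits = genes_overlapping("1", 1800, genes)
print([g.symbol for g in hits])
```

A linear scan like this is fine for a handful of SNPs; for a whole Illumina array an interval index or a sorted search per chromosome would probably be a better fit.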

Getting this information is not really a problem; I found it in UCSC and BioMart.

But I have a problem with NCBI symbols starting with LOC (e.g. LOC100287633, LOC100128613, etc.).

I compared the NCBI and UCSC data: I can find every LOC symbol in NCBI, but not in UCSC or BioMart.

I know that many LOC symbols are "discontinued" or no longer updated; however, plenty of them are still reviewed in NCBI yet cannot be found in BioMart, UCSC, or other databases.

I could download them from NCBI, but their start and end positions (bp) have been updated to GRCh38, and I absolutely need the GRCh37 positions.

So my question is: do you know of a web or FTP link where I can download all this information in a single file, or at least the LOC information on the GRCh37 build?

Thanks for your answers!

Guillaume

GRCh37 SNP annotation gene LOC • 3.6k views

Hi Guillaume

Could you let me know how you output the HGNC symbol from UCSC? I am trying to do the same task as you, but I only need the genes known to HGNC. For example, I used track=UCSC Genes and selected "geneSymbol", but the output lists some genes not known to HGNC in the hg19.kgXref.geneSymbol column.

I also have trouble annotating intergenic SNPs. For example, SNP rs188746275 should be located between PABPC4L and PCDH18,

but the UCSC tables also list cDNA genes, so when I annotated it the SNP fell between BC032916 and BC031238, neither of which is known to HGNC or NCBI.

Many thanks if you could show me how to output the HGNC symbol.

Thanks!

Ake
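
One possible way to approach the intergenic case described above is to drop entries whose symbol is not in an HGNC-approved list before looking for the nearest gene on each side. The sketch below assumes the genes of one chromosome are available as `(start, end, symbol)` tuples; `flanking_genes`, the `approved` set, and all names and coordinates are hypothetical, not something exported by UCSC.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Optional, Tuple

GeneSpan = Tuple[int, int, str]  # (start, end, symbol) on a single chromosome


def flanking_genes(
    pos: int,
    genes: List[GeneSpan],
    approved: Optional[Iterable[str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (nearest gene ending before pos, nearest gene starting after pos)."""
    if approved is not None:
        keep = set(approved)
        genes = [g for g in genes if g[2] in keep]  # drop non-approved symbols

    by_end = sorted(genes, key=lambda g: g[1])
    by_start = sorted(genes, key=lambda g: g[0])
    ends = [g[1] for g in by_end]
    starts = [g[0] for g in by_start]

    i = bisect_left(ends, pos)     # count of genes whose end is < pos
    j = bisect_right(starts, pos)  # index of first gene whose start is > pos

    left = by_end[i - 1][2] if i > 0 else None
    right = by_start[j][2] if j < len(by_start) else None
    return left, right


# Invented coordinates: two mRNA-style entries sit between two approved genes.
genes = [
    (100, 200, "GENE_A"),
    (300, 350, "MRNA_X"),
    (600, 650, "MRNA_Y"),
    (800, 900, "GENE_B"),
]

print(flanking_genes(500, genes))
print(flanking_genes(500, genes, approved={"GENE_A", "GENE_B"}))
```

The function does not check whether the position lies inside a gene, so an overlap test should come first and this search should only be used for SNPs that turn out to be intergenic.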

10.4 years ago
Zhaorong ★ 1.4k

Check out the NCBI FTP archive for Annotation release 104, especially the GFF folder.
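
As a rough sketch of how the fields asked for in the question could be pulled out of such a GFF file: GFF3 lines have nine tab-separated columns with 1-based inclusive coordinates, and the last column holds `key=value` attributes. The code assumes that gene rows carry the symbol in `Name` and the GeneID inside `Dbxref`; that should be checked against the downloaded file. `parse_gene_line` and the sample line are made up for illustration.

```python
from typing import Dict, Optional, Union
from urllib.parse import unquote


def parse_gene_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Parse one GFF3 line; return gene fields, or None for non-gene lines."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return None

    # Column 9: semicolon-separated key=value pairs, percent-encoded.
    attrs = {}
    for field in cols[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)

    # Assumption: Dbxref looks like "GeneID:1234,HGNC:5678".
    xrefs = {}
    for xref in attrs.get("Dbxref", "").split(","):
        if ":" in xref:
            db, ident = xref.split(":", 1)
            xrefs[db] = ident

    return {
        "seqid": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "symbol": attrs.get("Name", ""),
        "gene_id": xrefs.get("GeneID", ""),
    }


# Invented example line, not copied from the real file.
sample = (
    "NC_000001.10\tRefSeq\tgene\t1000\t2000\t.\t+\t.\t"
    "ID=gene0;Name=LOC0000001;Dbxref=GeneID:1002;gbkey=Gene"
)
print(parse_gene_line(sample))
```

Applied line by line to the whole file, this would give one record per gene, LOC entries included, which can then be written out as a single table.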


Thanks for the link!

Do you know the difference between the GFF files in this folder? I can see two kinds of file, top_level and scaffolds.


Check the first column of each file and you'll see. :) The "top_level" file has coordinates on assembled chromosomes, i.e. NC_*, while the "scaffolds" file has coordinates on scaffolds (or contigs), e.g. NT_* and NW_*.
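
Since the first column holds RefSeq accessions rather than chromosome names, a small conversion step may be needed to get a chromosomeID. The sketch below relies on the usual RefSeq numbering for human chromosomes (NC_000001 to NC_000022 for the autosomes, NC_000023 for X, NC_000024 for Y, NC_012920 for the mitochondrion) and ignores the version suffix; `refseq_to_chrom` is a hypothetical helper name.

```python
from typing import Optional


def refseq_to_chrom(seqid: str) -> Optional[str]:
    """Map a human NC_* accession to a chromosome name; None for anything else."""
    accession = seqid.split(".", 1)[0]  # drop the version, e.g. ".10"
    if accession == "NC_012920":
        return "MT"
    number = accession[len("NC_"):]
    if not accession.startswith("NC_") or not number.isdigit():
        return None  # scaffolds and contigs such as NT_* and NW_*
    n = int(number)
    if 1 <= n <= 22:
        return str(n)
    return {23: "X", 24: "Y"}.get(n)


for seqid in ["NC_000001.10", "NC_000023.10", "NC_012920.1", "NT_077402.2"]:
    print(seqid, refseq_to_chrom(seqid))
```

Rows for which this returns `None` are not on an assembled chromosome and can be skipped when only chromosome coordinates are wanted.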


Thank you Zhaorong, I will try to do something with this. You saved me hours of searching! ;)
