Question

Is it possible to fetch 1000 genomes project v37 from Entrez?

0


9.8 years ago

bfeeny ▴ 50

Right now I am working with GRCh37.p13 RefSeq data from Entrez, which I query using efetch in Biopython. Here is a sample of the type of query I am running (chromosome 1):

net_handle = Entrez.efetch(db="nucleotide",id="NC_000001.10",rettype="fasta", retmode="text")

This result is drastically different from the reference chromosome 1 from the 1000 Genomes Project, specifically what's contained in their file human_g1k_v37.fasta.gz, which I obtained from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

I am analyzing SNPs, and my understanding is that Ancestry, 23andMe and others typically use 1000 Genomes data as their reference, which is why I am looking to use it as well. If this is incorrect and they in fact use another reference that you are aware of, please let me know.

Ideally, I would query the data from human_g1k_v37 via a RefSeq accession or some other means using the Entrez service. Does anyone know if this is possible?

Thanks,
Brian

ncbi 1000genomes biopython entrez • 2.2k views


Ram · Accepted Answer · 2015-04-29

3


9.8 years ago

Devon Ryan 105k

I downloaded NC_000001.10 from NCBI and the fasta file you mentioned from 1000 genomes:

$ samtools faidx human_g1k_v37.fasta 1 | grep -v ">" | tr -d '\n' | md5sum
1b22b98cdeb4a9304cb5d48026a85128  -
$ cat NC_000001.10.fasta | grep -v ">" | tr -d '\n' | md5sum
1b22b98cdeb4a9304cb5d48026a85128  -

So they are, in fact, identical except for the chromosome names, as expected. Presuming you need to check a good number of sequences, just download the fasta file and query that. It'll be faster anyway.
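For anyone who would rather stay in Python than use the shell pipeline, here is a minimal sketch of the same check using only the standard library. The function name fasta_sequence_md5 and its record_name parameter are made up for illustration, and it assumes the FASTA files are already decompressed. Like the pipeline above, it drops header lines and line breaks before hashing, so different line widths do not matter, but it is case-sensitive, so soft-masked (lowercase) sequence would hash differently.

```python
import hashlib


def fasta_sequence_md5(path, record_name=None):
    """Return the MD5 hex digest of the sequence letters in a FASTA file.

    Header lines and line breaks are skipped, so the line width of the
    file does not affect the result. If record_name is given
    (hypothetical parameter), only the record whose header starts with
    that name is hashed; otherwise every sequence line is hashed.
    """
    digest = hashlib.md5()
    keep = record_name is None
    with open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if record_name is not None:
                    # The record name is the first word after ">".
                    fields = line[1:].split()
                    keep = bool(fields) and fields[0] == record_name
                continue
            if keep:
                digest.update(line.encode("ascii"))
    return digest.hexdigest()
```

Calling fasta_sequence_md5("human_g1k_v37.fasta", record_name="1") and fasta_sequence_md5("NC_000001.10.fasta") and comparing the two return values corresponds to the two md5sum commands above. Because the file is read line by line, a whole chromosome never has to fit in memory at once.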


0


Devon, thank you for responding. I am rather new to working with all these tools and to bioinformatics in general. I did a very naive sdiff of the two files with a "stare and compare" approach, which led me astray: the two files have different line widths, and sdiff was abbreviating some of the information. I really like the md5 pipeline you showed me for making sure my data is the same. It's great to know the data is the same, and it makes a lot of sense.

written 9.8 years ago by bfeeny ▴ 50