Right now I am working with GRCh37.p13 RefSeq data from Entrez, which I query using efetch in Biopython. Here is a sample of the type of query I am running (chromosome 1):
net_handle = Entrez.efetch(db="nucleotide", id="NC_000001.10", rettype="fasta", retmode="text")
The result looks drastically different from the reference chromosome 1 from the 1000 Genomes Project, specifically what is contained in their file human_g1k_v37.fasta.gz,
which I obtained from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
I am analyzing SNPs, and my understanding is that Ancestry, 23andMe and others typically use 1000 Genomes data as their reference, which is why I am looking to use it as well. If this is incorrect and they in fact use another reference that you are aware of, please let me know.
Ideally, I would like to query the data from human_g1k_v37
via a RefSeq accession, or by some other means, using the Entrez service. Does anyone know if this is possible?
Thanks,
Brian
Devon, thank you for responding. I am rather new to working with all these tools and to bioinformatics in general. I did a very naive sdiff of the two files and took a "stare and compare" approach, which led me astray: the two files wrap their sequence lines at different widths, and sdiff was abbreviating some of the information. I really like the md5 pipeline you showed me for making sure my data is the same. It's great to know the data is the same, and it makes a lot of sense.
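For anyone else who hits the same line-width problem, here is a minimal Python sketch of the same idea: hash only the sequence letters, so that FASTA files wrapped at different widths can still be compared. This is not necessarily the pipeline Devon posted. The function name fasta_md5s and the ignore_case option are my own (hypothetical), and uppercasing before hashing is an assumption that case differences such as soft-masking should not count as differences.

```python
import hashlib
import io


def fasta_md5s(handle, ignore_case=True):
    """Return {record_name: md5 hex digest} for each record in a FASTA stream.

    Line breaks are dropped before hashing, so files wrapped at different
    widths give the same digest whenever the sequences match.
    (fasta_md5s and ignore_case are illustrative names, not a library API.)
    """
    digests = {}
    name, md5 = None, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                digests[name] = md5.hexdigest()
            parts = line[1:].split()
            # Use the first word of the header as the record name.
            name = parts[0] if parts else ""
            md5 = hashlib.md5()
        elif md5 is not None:
            # Assumption: case is not significant for the comparison.
            seq = line.upper() if ignore_case else line
            md5.update(seq.encode("ascii"))
    if name is not None:
        digests[name] = md5.hexdigest()
    return digests


# Toy data only: the same sequence, wrapped differently and with different headers.
ncbi_like = io.StringIO(">NC_000001.10 toy record\nACGTNN\nacgt\n")
g1k_like = io.StringIO(">1 toy record\nACGTNNACGT\n")
same = list(fasta_md5s(ncbi_like).values()) == list(fasta_md5s(g1k_like).values())
print(same)
```

For the real files, the handle could come from gzip.open(path, "rt") for human_g1k_v37.fasta.gz, or from the efetch handle directly. Since the record names differ between the two sources (NC_000001.10 versus 1), compare the digests rather than the dictionary keys.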