Hi!
I need to use some data from the 1000 Genomes Project. On 1000genomes.org's FTP server I found all the data in the form of 14 GB CRAM files. I tried to use CramTools to get something I could work with in Java, without success. I searched around and found the following on http://www.1000genomes.org/analysis:
Sequence reads were aligned to the human genome reference sequence. The copy of the FASTA file that is used for the alignments can be found on our ftp site here. It currently represents the full chromosomes of the GRCh37 build of the human reference.
I checked that FASTA file: it is around 3 GB, and that's exactly what I'm looking for. Is there a source to get all the data as FASTA files? I read the CRAM usage page and it did not help much. So what's the best way to get data like that into Java?
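Once you have a FASTA file, reading it in Java doesn't need a special library (though htsjdk, the Java library behind samtools, also provides FASTA and CRAM readers). Here is a minimal hand-rolled sketch; the class name `FastaReader` is my own, and for a 3 GB reference you would process one record at a time instead of keeping everything in memory as this toy version does:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class FastaReader {

    // Parse FASTA records into a map of sequence ID -> sequence.
    public static Map<String, String> parse(Reader in) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String header = null;
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    // New record: flush the previous one, keep only the ID
                    // (the token before the first whitespace).
                    if (header != null) records.put(header, seq.toString());
                    header = line.substring(1).split("\\s+")[0];
                    seq.setLength(0);
                } else {
                    seq.append(line.trim());
                }
            }
            if (header != null) records.put(header, seq.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">chr1 example record\nACGT\nACGT\n>chr2\nTTTT\n";
        Map<String, String> recs = parse(new StringReader(fasta));
        System.out.println(recs.get("chr1")); // ACGTACGT
        System.out.println(recs.get("chr2")); // TTTT
    }
}
```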
Please explain this: "I checked the fasta file and it is around 3gb big and that's exactly what I'm looking for. Is there a source to get all data as a fasta file?"
What data do you want to get? Variants, raw reads or something else?
Seconding Pierre: please state clearly which type and part of the data, in which format, you want to use, and for which purpose. Do you need the raw fastq data? ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index
Thank you for your answers!
I am trying to reproduce some results published in the paper "RCSI: Scalable similarity search in thousand(s) of genomes". They used data from the 1000 Genomes Project, and each genome they used was around 3.2 GB. I want to use the genomes for compression and substring search. I downloaded the FASTA files from the link you posted some days ago, but people told me that raw reads would be useless for that purpose. All the other genome data I found was either split into chromosomes or a 14 GB CRAM file. My point is that the FASTA file seems to be the kind of data they used in the paper.
This is a pure CS paper, irrelevant to real biological applications. I guess it just faked genomes by incorporating called variants into the reference genome; we never use the data that way. If you want to do something useful to us, download human assemblies here. There are only 10-20 independent assemblies, but they are realistic data. You can also try your methods on smaller genomes. (EDIT: the reason you can't find a 3.2 GB genome for each individual is that we don't need them.)
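For the record, the "faked genome" construction described above can be sketched as substituting called single-nucleotide variants into the reference string. The types and names below are illustrative, not from any real library; a real pipeline would consume VCF files and also handle indels, which change coordinates:

```java
import java.util.List;

public class ApplyVariants {

    // A called single-nucleotide variant: 0-based position in the
    // reference and the alternate base. (Illustrative type, not a
    // real VCF record.)
    public record Snv(int pos, char alt) {}

    // Produce a per-individual "genome" by substituting each called
    // SNV into a copy of the reference sequence.
    public static String apply(String reference, List<Snv> variants) {
        StringBuilder sb = new StringBuilder(reference);
        for (Snv v : variants) {
            sb.setCharAt(v.pos(), v.alt());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ref = "ACGTACGT";
        String individual = apply(ref, List.of(new Snv(1, 'T'), new Snv(6, 'A')));
        System.out.println(individual); // ATGTACAT
    }
}
```

Because every such genome differs from the reference only at the variant positions, the resulting collection is extremely redundant, which is exactly what makes it attractive for compression and similarity-search benchmarks.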
Thank you! This helps me very much. I will use that data.