Human genome full chromosomes build
9.1 years ago
priess1991 • 0

Hi!

I need to use some data from the 1000 Genomes Project. On the 1000genomes.org FTP server I found all the data in the form of 14 GB CRAM files. I tried to use cramtools to turn them into something I could use in Java, without success. After searching around, I found this on http://www.1000genomes.org/analysis:

"Sequence reads were aligned to the human genome reference sequence. The copy of the fasta file that is used for the alignments can be found on our ftp site here. It currently represents the full chromosomes of the GRCh37 build of the human reference."

I checked the FASTA file: it is around 3 GB, which is exactly what I'm looking for. Is there a source where I can get all the data as FASTA files? I read the CRAM usage page and it did not help much. So what is the best way to get data like that into Java?
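For the Java side of the question, a FASTA file is plain text and can be streamed line by line without any extra library. The following is only a minimal sketch: the class name `FastaLengths` and the method `sequenceLengths` are made up for illustration, and it assumes an uncompressed FASTA where header lines start with `>`. It records the length of each sequence rather than holding everything in memory.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any 1000 Genomes tooling.
class FastaLengths {

    /** Streams FASTA text and returns sequence name -> number of bases, in file order. */
    static Map<String, Long> sequenceLengths(BufferedReader reader) {
        Map<String, Long> lengths = new LinkedHashMap<>();
        String[] current = {null}; // name of the sequence currently being read

        reader.lines().forEach(line -> {
            if (line.startsWith(">")) {
                // Header line: keep only the first whitespace-separated token as the name.
                String name = line.substring(1).trim().split("\\s+")[0];
                current[0] = name;
                lengths.put(name, 0L);
            } else if (current[0] != null) {
                // Sequence line: add its bases to the running total for this sequence.
                lengths.merge(current[0], (long) line.trim().length(), Long::sum);
            }
        });
        return lengths;
    }

    public static void main(String[] args) {
        // Tiny in-memory example instead of a real multi-gigabyte file.
        String demo = ">chr1 demo\nACGT\nAC\n>chr2\nGG\n";
        System.out.println(sequenceLengths(new BufferedReader(new StringReader(demo))));
    }
}
```

For a real file, the reader could come from `Files.newBufferedReader(path)`, wrapped in a `GZIPInputStream` if the file is gzipped. Note that a Java `String` or array cannot hold more than about 2.1 billion elements, so a whole 3 GB genome will not fit in one; working per chromosome avoids that limit.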

genome • 1.9k views

Please explain this: "I checked the fasta file and it is around 3gb big and that's exactly what I'm looking for. Is there a source to get all data as a fasta file?"

What data do you want to get? Variants, raw reads or something else?

Seconding Pierre: please state clearly which type and part of the data you want to use, in which format, and for which purpose. Do you need the raw FASTQ data? See ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index

Thank you for your answers!

I am trying to reproduce some results published in the paper "RCSI: Scalable similarity search in thousand(s) of genomes". They used data from the human genome project, and each genome they used was around 3.2 GB. I want to use such genomes for compression and substring search. Some days ago I downloaded the fasta files from the link you posted, but people told me that raw reads would be useless for that purpose. All the other genome data I found was either split into chromosomes or packed in a 14 GB CRAM file. What I mean is that the FASTA file seems to be the kind of data they used in the paper.

This is a pure CS paper, irrelevant to real biological applications. I guess it just faked genomes by incorporating called variants into the reference genome. We never use the data that way. If you want to do something useful to us, download human assemblies here. There are only 10-20 independent assemblies, but they are realistic data. You can also try your methods on smaller genomes. (EDIT: the reason you can't find a 3.2 Gb genome for each individual is that we don't need them.)
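To make "incorporating called variants into the reference genome" concrete, here is a rough sketch of the idea for single-base substitutions only. It is not the paper's actual procedure (that is only guessed at above), and the names `ConsensusSketch` and `applySnvs` are invented for illustration. Positions are assumed to be 1-based, as in VCF; indels and heterozygous calls are ignored.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: build a per-individual sequence from a reference plus SNVs.
class ConsensusSketch {

    /** Returns a copy of the reference with each 1-based position replaced by its alternate base. */
    static String applySnvs(String reference, Map<Integer, Character> snvs) {
        StringBuilder sequence = new StringBuilder(reference);
        for (Map.Entry<Integer, Character> snv : snvs.entrySet()) {
            int index = snv.getKey() - 1; // convert 1-based position to 0-based index
            if (index < 0 || index >= sequence.length()) {
                throw new IllegalArgumentException("Position out of range: " + snv.getKey());
            }
            sequence.setCharAt(index, snv.getValue());
        }
        return sequence.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Character> snvs = new TreeMap<>();
        snvs.put(2, 'T'); // reference C at position 2 becomes T
        snvs.put(8, 'A'); // reference T at position 8 becomes A
        System.out.println(applySnvs("ACGTACGT", snvs)); // prints ATGTACGA
    }
}
```

On real data this would be applied one chromosome at a time, with the variants taken from a VCF file. Existing tools such as `bcftools consensus` cover the same task, including indels.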

Thank you! This helps me very much. I will use that data.
