Hi!
I need to use some data from the 1000 Genomes Project. On 1000genomes.org's FTP server I found all the data in the form of 14 GB CRAM files. I tried to use CramTools to get something I could work with in Java, without success. I searched around and found the following on http://www.1000genomes.org/analysis:
Sequence reads were aligned to the human genome reference sequence. The copy of the FASTA file that is used for the alignments can be found on our ftp site here. It currently represents the full chromosomes of the GRCh37 build of the human reference.
I checked that FASTA file: it is around 3 GB, and that's exactly what I'm looking for. Is there a source to get all the data as FASTA files? I read the CRAM usage page and it did not help much. So what's the best way to get data like that into Java?
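Once you have a FASTA file, reading it in Java doesn't need a special library (though htsjdk, the Java library behind samtools, also provides FASTA and CRAM readers). Here is a minimal hand-rolled sketch; the class name `FastaReader` is my own, and for a 3 GB reference you would process one record at a time instead of keeping everything in memory as this toy version does:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class FastaReader {

    // Parse FASTA records into a map of sequence ID -> sequence.
    public static Map<String, String> parse(Reader in) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String header = null;
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    // New record: flush the previous one, keep only the ID
                    // (the token before the first whitespace).
                    if (header != null) records.put(header, seq.toString());
                    header = line.substring(1).split("\\s+")[0];
                    seq.setLength(0);
                } else {
                    seq.append(line.trim());
                }
            }
            if (header != null) records.put(header, seq.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">chr1 example record\nACGT\nACGT\n>chr2\nTTTT\n";
        Map<String, String> recs = parse(new StringReader(fasta));
        System.out.println(recs.get("chr1")); // ACGTACGT
        System.out.println(recs.get("chr2")); // TTTT
    }
}
```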
Please explain this: "I checked the fasta file and it is around 3gb big and that's exactly what I'm looking for. Is there a source to get all data as a fasta file?"
What data do you want to get? Variants, raw reads or something else?
Seconding Pierre: please state clearly which type and part of the data, in which format, you want to use, and for which purpose. Do you need the raw fastq data? ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index
Thank you for your answers!
I am trying to reproduce some results published in the paper "RCSI: Scalable similarity search in thousand(s) of genomes". They used data from the 1000 Genomes Project, and each genome they used was around 3.2 GB. I want to use the genomes for compression and substring search. I downloaded the FASTA files from the link you posted some days ago, but people told me that raw reads would be useless for that purpose. All the other genome data I found was either split into chromosomes or a 14 GB CRAM file. My point is that the FASTA file seems to be the kind of data they used in the paper.
This is a pure CS paper, irrelevant to real biological applications. I guess it just faked genomes by incorporating called variants into the reference genome; we never use the data that way. If you want to do something useful to us, download human assemblies here. There are only 10-20 independent assemblies, but they are realistic data. You can also try your methods on smaller genomes. (EDIT: the reason you can't find a 3.2 GB genome for each individual is that we don't need them.)
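For the record, the "faked genome" construction described above can be sketched as substituting called single-nucleotide variants into the reference string. The types and names below are illustrative, not from any real library; a real pipeline would consume VCF files and also handle indels, which change coordinates:

```java
import java.util.List;

public class ApplyVariants {

    // A called single-nucleotide variant: 0-based position in the
    // reference and the alternate base. (Illustrative type, not a
    // real VCF record.)
    public record Snv(int pos, char alt) {}

    // Produce a per-individual "genome" by substituting each called
    // SNV into a copy of the reference sequence.
    public static String apply(String reference, List<Snv> variants) {
        StringBuilder sb = new StringBuilder(reference);
        for (Snv v : variants) {
            sb.setCharAt(v.pos(), v.alt());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ref = "ACGTACGT";
        String individual = apply(ref, List.of(new Snv(1, 'T'), new Snv(6, 'A')));
        System.out.println(individual); // ATGTACAT
    }
}
```

Because every such genome differs from the reference only at the variant positions, the resulting collection is extremely redundant, which is exactly what makes it attractive for compression and similarity-search benchmarks.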
Thank you! This helps me very much. I will use that data.