I have downloaded several fastq files for individuals in the 1000 Genomes Project. Typically these contain fewer than 10 million DNA sequences. Each sequence is 36 base pairs, so a file holds fewer than 360 million bases in total. If the reads are intended to overlap so that each part of the genome is covered three times, this represents fewer than 120 million bases of unique sequence. That is under 4% of the roughly 3.1 billion bases in the human genome.
E.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12878/sequence_read/SRR001783.filt.fastq.gz gives 9,457,109 Solexa-3623 DNA sequences of 36 bp each from the Sanger Institute.
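To make the arithmetic concrete, here is a minimal sketch of the calculation for this one file. The genome size of 3.1 billion bases is an approximate figure assumed for illustration, the three-fold depth is the assumption stated in the question, and the function name `coverage_summary` is hypothetical.

```python
# Assumed approximate haploid human genome size in base pairs (not from the file).
GENOME_SIZE_BP = 3_100_000_000


def coverage_summary(num_reads, read_length_bp, target_depth=3,
                     genome_size_bp=GENOME_SIZE_BP):
    """Estimate how much of a genome a set of fixed-length reads could cover."""
    total_bases = num_reads * read_length_bp
    return {
        # Raw number of bases sequenced in the file.
        "total_bases": total_bases,
        # Average depth if the reads were spread evenly over the whole genome.
        "genome_wide_depth": total_bases / genome_size_bp,
        # Fraction of the genome that could be covered at target_depth.
        "fraction_at_target_depth": total_bases / target_depth / genome_size_bp,
    }


# Read count and read length quoted for SRR001783.filt.fastq.gz.
summary = coverage_summary(num_reads=9_457_109, read_length_bp=36)
for name, value in summary.items():
    print(f"{name}: {value:,.4f}")
```

Under these assumptions the file holds about 340 million bases, which is roughly 0.11x average depth across the whole genome, or under 4% of the genome at three-fold depth.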
What am I missing?
Are the DNA sequences reported by Solexa-3623 only targeting the protein coding parts of the genome?
Any comments or help would be most welcome.
Thank you
Bill
The 1000 Genomes Project mostly does low-coverage and exome sequencing, so it isn't trying to get high coverage for many individuals. It takes a population-based approach, combining data across many samples, which is what makes this work.
You're welcome. If the responses are helpful to your research or understanding of the problem, then vote them up. If they are poor, vote them down. It is these votes that make BioStar useful for others who come along later with a similar question.