Question

The Logic Behind The Naming System In 1000Genomes FTP Arrangements

2

13.4 years ago

Delinquentme ▴ 200

I've got these links:

data/NA18603/sequence_read/ERR000103.filt.fastq.gz
data/NA18603/sequence_read/ERR000103_1.filt.fastq.gz
data/NA18603/sequence_read/ERR000103_2.filt.fastq.gz
data/NA18542/sequence_read/ERR000104.filt.fastq.gz
data/NA18542/sequence_read/ERR000104_1.filt.fastq.gz
data/NA18542/sequence_read/ERR000104_2.filt.fastq.gz
data/NA18582/sequence_read/ERR000105.filt.fastq.gz
data/NA18582/sequence_read/ERR000105_1.filt.fastq.gz
data/NA18582/sequence_read/ERR000105_2.filt.fastq.gz
data/NA18592/sequence_read/ERR000106.filt.fastq.gz
data/NA18592/sequence_read/ERR000106_1.filt.fastq.gz
data/NA18592/sequence_read/ERR000106_2.filt.fastq.gz
data/NA18605/sequence_read/ERR000107.filt.fastq.gz
data/NA18605/sequence_read/ERR000107_1.filt.fastq.gz
data/NA18605/sequence_read/ERR000107_2.filt.fastq.gz
data/NA18592/sequence_read/ERR000108.filt.fastq.gz
data/NA18592/sequence_read/ERR000108_1.filt.fastq.gz
data/NA18592/sequence_read/ERR000108_2.filt.fastq.gz
data/NA12234/sequence_read/ERR000130.filt.fastq.gz
data/NA12234/sequence_read/ERR000130_1.filt.fastq.g

Now I'm trying to figure out:
1) Are they all from the same human?
2) Why does the NA18* number change?
3) Why are there 3 versions of each ERR000* file? (I thought matched files came in 2s, for paired reads.)

genome • 3.0k views

ADD COMMENT • link updated 13.4 years ago by Bert Overduin ★ 3.7k • written 13.4 years ago by Delinquentme ▴ 200

score 4 · Answer 1 · 2011-07-23

13.4 years ago

Mitch Bekritsky ★ 1.3k

They are not all from the same person. The NA18* number is the ID of the individual being sequenced, so it changes whenever the run comes from a different sample. There are 3 sequence files per run: one for PE1 (`_1`), one for PE2 (`_2`), and a third, with no suffix, for reads left unpaired because one of the paired ends didn't pass QC. The QC protocol, as well as a lot of other information on 1000 Genomes sequencing data (including most of what I've told you here), is documented by the project. In the past, when I've had questions about 1000 Genomes sequencing info, I've found their FTP site to be a great resource.
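To make the naming concrete, here is a minimal Python sketch that splits paths like the ones in the question into sample ID, run accession, and file type. The regular expression is an assumption inferred from the listed paths, not an official specification, and `parse_path` and `group_by_sample` are hypothetical helper names chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the paths in the question:
#   data/<sample>/sequence_read/<run>[_1|_2].filt.fastq.gz
PATH_RE = re.compile(
    r"^data/(?P<sample>[^/]+)/sequence_read/"
    r"(?P<run>[A-Z]{3}\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)

# No suffix means the file holds reads without a surviving mate.
KIND_BY_SUFFIX = {"1": "PE1", "2": "PE2", None: "unpaired"}


def parse_path(path):
    """Return (sample, run, kind), or None if the path does not match."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    kind = KIND_BY_SUFFIX[match.group("mate")]
    return match.group("sample"), match.group("run"), kind


def group_by_sample(paths):
    """Map sample ID -> run accession -> list of file kinds seen."""
    grouped = defaultdict(lambda: defaultdict(list))
    for path in paths:
        parsed = parse_path(path)
        if parsed is None:
            continue  # skip truncated or unexpected paths
        sample, run, kind = parsed
        grouped[sample][run].append(kind)
    return grouped


if __name__ == "__main__":
    paths = [
        "data/NA18592/sequence_read/ERR000106.filt.fastq.gz",
        "data/NA18592/sequence_read/ERR000106_1.filt.fastq.gz",
        "data/NA18592/sequence_read/ERR000106_2.filt.fastq.gz",
        "data/NA18592/sequence_read/ERR000108_1.filt.fastq.gz",
    ]
    for sample, runs in group_by_sample(paths).items():
        for run, kinds in runs.items():
            print(sample, run, kinds)
```

Grouping this way also shows that one individual can have several runs: in the question's list, NA18592 appears under both ERR000106 and ERR000108.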

ADD COMMENT • link 13.4 years ago by Mitch Bekritsky ★ 1.3k

0

The sequence alignments I've run before were grouped by chromosome. So do these simply have all 23 chromosomes in each file?

ADD REPLY • link 13.4 years ago by Delinquentme ▴ 200

0

Yeah, should be. I've never checked the alignment files from 1K genomes, but generally, alignment output files are for all aligned regions. For 1K genomes, that means all autosomes, sex chromosomes, mitochondrial chromosome, and non-chromosomal supercontigs. The link to the 1K genome project's description of the alignment protocol is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt

ADD REPLY • link 13.4 years ago by Mitch Bekritsky ★ 1.3k

0

To be clear, that's all aligned regions in the one alignment output file...

ADD REPLY • link 13.4 years ago by Mitch Bekritsky ★ 1.3k

score 2 · Answer 2 · 2011-07-24

13.4 years ago

Bert Overduin ★ 3.7k

These kinds of questions are normally answered by reading the README file ... :)

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.sequence_data

Cheers, Bert

ADD COMMENT • link 13.4 years ago by Bert Overduin ★ 3.7k