Question

Human Whole Genome Sequence Data With Certain Read Length And Coverage

0

13.7 years ago

Steffi ▴ 590

Can anybody tell me how to search efficiently for human whole genome sequence data (FASTA/FASTQ files) with a certain read length and coverage?

I am aware of the 1000 Genomes Project, but I have not yet found a tabular listing of the read lengths used. Furthermore, I find the sample descriptions rather confusing.

Thanks
steffi

whole-genome • 2.6k views

updated 2.4 years ago by Ram 45k • written 13.7 years ago by Steffi ▴ 590

score 1 · Answer 1 · 2011-11-10

1

13.7 years ago

Istvan Albert 102k

I would say the task you describe is not possible at the moment. Read length is not usually considered important or essential enough to warrant a dedicated searchable field.

Also, I think that a fixed read length is just a limitation/characteristic of a particular sequencing technology rather than a permanent attribute that characterizes all data.

0

totally agree ! +1

13.5 years ago by toni ★ 2.2k

score 1 · Answer 2 · 2012-02-04

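Since the FTP site of the 1000 Genomes Project is accessible remotely via wget, you could read the second line of each FASTQ file and count its characters like this (note that wc -c also counts the trailing newline, so the read length is one less than the number printed):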

wget -qO- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index | awk '{print $1}' | \
grep -v FASTQ_FILE | head | while read file; do echo $file; wget \
-qO- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/$file | gunzip -c | head -n 2 | \
tail -n 1 | wc -c; done

data/NA19238/sequence_read/ERR000018.filt.fastq.gz
37
data/NA19238/sequence_read/ERR000019.filt.fastq.gz
37
data/NA19240/sequence_read/ERR000020.filt.fastq.gz
37
data/NA19240/sequence_read/ERR000020_1.filt.fastq.gz
37
data/NA19240/sequence_read/ERR000020_2.filt.fastq.gz
37
data/NA19238/sequence_read/ERR000021.filt.fastq.gz
37
data/NA19238/sequence_read/ERR000022.filt.fastq.gz
37
data/NA19238/sequence_read/ERR000023.filt.fastq.gz
41
data/NA19238/sequence_read/ERR000024.filt.fastq.gz
37
[...]
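
A possible variation on the one-liner above, as a minimal sketch rather than a tested pipeline: the two helper functions below are hypothetical names introduced here, not part of any 1000 Genomes tooling. The first reports the length of the first read in a FASTQ stream using awk, which does not count the line terminator the way wc -c does. The second gives a rough coverage estimate as read count times read length divided by genome size; the default genome size of about 3.1 Gbp is an approximation I am assuming for human, and the read count is something you would have to supply yourself.

```shell
#!/bin/sh
# Print the length of the first read in a FASTQ stream on stdin.
# awk's length() ignores the line terminator, unlike `wc -c`.
fastq_read_length() {
    awk 'NR == 2 { print length($0); exit }'
}

# Rough coverage estimate: (read count * read length) / genome size.
# Arguments: read_count read_length [genome_size]
# The default genome size of 3100000000 bp is an approximation (assumption).
estimate_coverage() {
    awk -v n="$1" -v len="$2" -v g="${3:-3100000000}" \
        'BEGIN { printf "%.2f\n", n * len / g }'
}

# Demo on a tiny inline FASTQ record (made-up data, no network needed).
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' | fastq_read_length
estimate_coverage 1000000000 36
```

To use it on a remote file, the same download step as above should work: pipe `wget -qO- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/$file | gunzip -c` into `fastq_read_length`. Keep in mind that this only inspects the first read, so it says nothing about files with mixed or trimmed read lengths.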