It is sometimes hard to find exactly what you want in the SRA. The suggestion to simulate is probably best, but it depends on what you want to do.
If you want to have more data than you can handle, check out the entire set of 1000 Genomes sequencing runs here (~39k sequencing runs total).
You'll have a row for each run like this (edited to come closer to fitting on the screen):
FASTQ_FILE CENTER_NAME SAMPLE_NAME POPULATION EXPERIMENT_ID INSTRUMENT_PLATFORM INSTRUMENT_MODEL LIBRARY_LAYOUT WITHDRAWN READ_COUNT BASE_COUNT
data/NA19238/sequence_read/ERR000018.filt.fastq.gz BGI NA19238 YRI ERX000014 ILLUMINA Illumina Genome Analyzer SINGLE 0 9280498 334097928
data/NA19238/sequence_read/ERR000019.filt.fastq.gz BGI NA19238 YRI ERX000014 ILLUMINA Illumina Genome Analyzer SINGLE 0 9571982 344591352
data/NA19240/sequence_read/ERR000020.filt.fastq.gz BGI NA19240 YRI ERX000016 ILLUMINA Illumina Genome Analyzer II PAIRED 0 149044 5365584
You can look at the READ_COUNT and BASE_COUNT columns to calculate the read length, maybe like this:
$ cut -f 1,24,25 20120522.sequence.index | awk -F'\t' '$2+0 > 0 {print $1"\t"$3/$2}'
data/NA19238/sequence_read/ERR000018.filt.fastq.gz 36
data/NA19238/sequence_read/ERR000019.filt.fastq.gz 36
data/NA19240/sequence_read/ERR000020.filt.fastq.gz 36
data/NA19240/sequence_read/ERR000020_1.filt.fastq.gz 36
...
You can look through the results to find runs that meet your specifications, or runs that could meet them if you trimmed the reads.
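If you would rather not scan the output by eye, the filtering can be sketched as a small shell function. This is only a rough sketch: `filter_runs` is a hypothetical helper name, and it assumes the index is tab-separated with the header on the first line (possibly prefixed with `#`). It looks columns up by the header names shown above instead of by position, skips withdrawn runs and rows with a zero read count, and prints runs with the requested layout whose mean read length is at least the minimum you give.

```shell
# Hypothetical helper: list runs with a given layout whose mean read
# length (BASE_COUNT / READ_COUNT) is at least a minimum value.
# Assumes a tab-separated index whose first line is the header.
filter_runs() {
    index_file=$1   # e.g. 20120522.sequence.index
    min_len=$2      # minimum mean read length, e.g. 36
    layout=$3       # SINGLE or PAIRED

    awk -F'\t' -v min_len="$min_len" -v layout="$layout" '
        NR == 1 {
            # Map each header name to its column number,
            # dropping a leading "#" if the header has one.
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^#+/, "", name)
                col[name] = i
            }
            next
        }
        {
            reads = $(col["READ_COUNT"]) + 0
            bases = $(col["BASE_COUNT"]) + 0
            # Skip withdrawn runs and rows without usable counts.
            if ($(col["WITHDRAWN"]) != 0 || reads == 0) next
            if ($(col["LIBRARY_LAYOUT"]) != layout) next
            mean_len = bases / reads
            if (mean_len >= min_len) print $(col["FASTQ_FILE"]) "\t" mean_len
        }
    ' "$index_file"
}
```

For example, `filter_runs 20120522.sequence.index 76 PAIRED` would list paired-end runs averaging at least 76 bases per read. Lower the minimum if you are willing to trim longer reads down to the length you need.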
Bowtie2 comes with a nice read simulator. Istvan is right. Understanding multiple hits and mapping oddities is a great exercise.