Does anyone know of any handy smaller datasets for just playing with mapping tools? I would like to reduce the file sizes so I can run things on my laptop before running bigger jobs. I was thinking of truncating the GENCODE GTF file and the hg38 reference FASTA to a single chromosome (say, chromosome 21 or 22). What I'd then be lacking is an experimental reads file (or a pair of files for paired-end) that only contains reads from a single chromosome. Is there someplace to find targeted experimental reads like this? I could make some synthetic ones, but I like the idea of using experimental data better. Thanks.
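For reference, the truncation step described above could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes uncompressed plain-text files, and that the chromosome is named identically (e.g. `chr21`) in the FASTA headers and in the first column of the GTF. The function names and the file names in the usage note are hypothetical.

```python
def subset_fasta(lines, chrom):
    """Yield only the lines of the FASTA record whose header names `chrom`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == chrom
        if keep:
            yield line


def subset_gtf(lines, chrom):
    """Yield GTF comment lines plus every feature located on `chrom`."""
    for line in lines:
        # Column 1 of a GTF record is the sequence name; '#' lines are header comments.
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line


def write_subset(subset_func, src_path, dst_path, chrom):
    """Stream src_path through subset_func and write the result to dst_path."""
    with open(src_path) as fin, open(dst_path, "w") as fout:
        fout.writelines(subset_func(fin, chrom))
```

Usage would look something like `write_subset(subset_fasta, "hg38.fa", "chr21.fa", "chr21")` and `write_subset(subset_gtf, "gencode.gtf", "chr21.gtf", "chr21")`, with the paths replaced by wherever the files actually live. Both functions stream line by line, so the full genome never has to fit in memory.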
A handy small reference would be the human mitochondrial genome, roughly 16.6 kb. It is included in the hg19 assembly as chrM and can be downloaded from UCSC. If you search around a bit, finding some mitochondrial sequencing data should not be a problem. Alternatively, use the E. coli genome, which is roughly 5 Mb. Finding E. coli NGS data should be even easier than finding chrM-seq data.
chrM will not be a good test case for many things, since it differs so much from nuclear DNA in composition, repetitive content, heteroplasmy (not diploid!), and so on.