Hi,
I'm building a system to generate fastq files as they would be output by the HiSeq 3000 and HiSeq 4000. We will use them to test our internal systems against some gold standards.
As we are working with microbiome data, a standard sample may contain around 1500 different species. Is there a bias in the output regarding how 'close' records from the same assembly appear to each other, or is the order truly random?
If record 1,1 is from assembly 2913, what are the odds that record 1,2 is from that assembly as well (assuming 1500 species with the same strain length)?
I couldn't find any paper on that so any help from experience would be great.
Thanks,
Yoni
Assemblies do not come in fastq format; you probably have your terminology mixed up somehow.
You might find the Flux-Simulator software useful, as it handles simulation of sequencing data: http://confluence.sammeth.net/display/SIM/Demo+-+Create+Fastq+file . It's a bit involved to get started with, but it'll save you a lot of work in the end.
If you're concerned about the distribution in the FASTQ file (I'm not sure why that would matter, since that wouldn't/shouldn't change how well the reads align), you could download a few datasets from GEO/SRA, align the reads, see where each read maps, and investigate the distribution of mapping location/mapping quality to see if there are any patterns. Again, I don't think any of it would matter at all for any practical purposes.
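If you do want to check this on real data, here is a minimal sketch of one way to do it, assuming you have already aligned the reads and extracted one reference/species label per read, in the same order as the reads appear in the FASTQ file. The function names and the toy labels are made up for illustration. It compares the observed fraction of neighbouring reads that share a label with the fraction you would expect if the same reads were shuffled at random. With 1500 equally abundant species, that expected value is roughly 1/1500.

```python
from collections import Counter


def adjacent_same_fraction(labels):
    """Observed fraction of consecutive read pairs that share a label."""
    if len(labels) < 2:
        raise ValueError("need at least two reads")
    same = sum(a == b for a, b in zip(labels, labels[1:]))
    return same / (len(labels) - 1)


def expected_same_fraction(labels):
    """Expected fraction if the same reads were in uniformly random order.

    For a random permutation, the chance that two neighbouring reads share
    a label is sum(c * (c - 1)) / (n * (n - 1)) over the label counts c.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two reads")
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Toy input (hypothetical): labels in FASTQ order, one per aligned read.
labels = ["sp1", "sp1", "sp2", "sp2", "sp3", "sp1"]
print("observed:", adjacent_same_fraction(labels))
print("expected if random:", expected_same_fraction(labels))
```

An observed value well above the expected one would suggest that reads from the same source cluster together in the file. On a real run you would want many reads and probably a permutation test before reading much into a difference.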
I'm building a tool, further down the pipeline, that should only get a sample of a fastq file to produce certain analyses. It matters a lot.
Thank you for the link, but as far as I can tell it doesn't simulate all the artifacts from a sequencing machine.
And what would those be?