Question

Is the content of FASTQ files truly random?

5.7 years ago

yonis • 0

Hi,

I'm building a system to generate fastq files as they are being output from HiSeq 3000 and HiSeq 4000. We will use it to test our internal systems against some gold standards.

Since we work with the microbiome, a standard sample may contain around 1500 different species. Is there a bias in the output regarding how 'close' records from the same assembly appear to each other, or is the order truly random?

If record 1,1 is from assembly 2913, what are the odds that record 1,2 is from that assembly as well (assuming 1500 species with the same strain length)?

I couldn't find any paper on that so any help from experience would be great.

Thanks,

Yoni

fastq hiseq illumina • 1.7k views

updated 5.7 years ago by Istvan Albert 102k • written 5.7 years ago by yonis • 0

Assemblies do not come in fastq format, you probably have your terminology mixed up somehow.

5.7 years ago by WouterDeCoster 47k

You might find the Flux-Simulator software (http://confluence.sammeth.net/display/SIM/Demo+-+Create+Fastq+file) useful, as it handles simulation of sequencing data. It's a bit involved to get started with, but it will save you a lot of work in the end.

If you're concerned about the distribution of reads within the FASTQ file (I'm not sure why that would matter, since it wouldn't/shouldn't change how well the reads align), you could download a few datasets from GEO/SRA, align the reads, see where each read maps, and investigate the distribution of mapping location and mapping quality to see if there are any patterns. Again, I don't think any of it would matter for any practical purpose.
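The check suggested above can be sketched as follows, assuming you have already aligned the reads and reduced each one to a source label kept in FASTQ order (the `labels` list here is a toy stand-in, and the function names are illustrative). It compares the observed fraction of neighbouring reads that share a source against the same statistic after shuffling the labels; a large gap would suggest the file order is not random with respect to source.

```python
import random


def adjacent_match_fraction(labels):
    """Fraction of neighbouring records that share the same source label."""
    if len(labels) < 2:
        return 0.0
    return sum(a == b for a, b in zip(labels, labels[1:])) / (len(labels) - 1)


def shuffled_baseline(labels, n_shuffles=200, seed=0):
    """Mean adjacent-match fraction over random permutations of the labels."""
    rng = random.Random(seed)
    work = list(labels)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(work)
        total += adjacent_match_fraction(work)
    return total / n_shuffles


# Toy labels: a file sorted by source, which is clearly not random.
labels = ["A"] * 50 + ["B"] * 50
observed = adjacent_match_fraction(labels)  # 98 of 99 neighbouring pairs match
baseline = shuffled_baseline(labels)        # roughly 0.5 for a 50/50 mix
```

On real data the labels would come from the alignment step, and reads that fail to map would need a label of their own or be dropped.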

5.7 years ago by manuel.belmadani ★ 1.4k

I'm building a tool, further down the pipeline, that should only receive a sample of a fastq file and produce certain analyses from it. So it matters a lot.

Thank you for the link, but as far as I can tell it doesn't simulate all the artifacts from a sequencing machine.

5.7 years ago by yonis • 0

all the artifacts from a sequencing machine.

And what would those be?

5.7 years ago by GenoMax 147k

5.7 years ago

GenoMax 147k

I'm building a system to generate fastq files as they are being output from HiSeq 3000 and HiSeq 4000.

AND

If record 1,1 is from assembly 2913, what are the odds that record 1,2 is

What do both of those sentences mean? You are trying to simulate fastq data?

Note: FASTQ data as generated by a sequencer (raw sequence) is in completely random order with respect to where the reads come from, until you do something to change that. If you are doing paired-end sequencing, then a matching pair of R1 and R2 reads represents sequence from the two ends of the DNA fragment being sequenced.

5.7 years ago by GenoMax 147k

Yes. I'm simulating a fastq file as it would be generated by HiSeq 3000 and HiSeq 4000. I wanted to know about the randomness of the data. Thanks.

5.7 years ago by yonis • 0

score 3 · Accepted Answer · 2019-04-01

5.7 years ago

Istvan Albert 102k

The samples will be randomly distributed within a FASTQ file.

But remember that this randomness is governed by the relative abundances of the different DNA sources.

The odds of getting two reads from the same genome next to one another are analogous to drawing balls of different colors from a bag that contains each color in a different proportion. Since the number of elements is large but also unknown, the most appropriate approach is to model it as sampling with replacement.
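The bag-of-balls analogy can be made concrete. A minimal sketch, assuming each read is drawn independently with probability equal to its source's share of the reads (the function names are illustrative, and the skewed community is hypothetical): the chance that two neighbouring records share a source is then the sum of the squared proportions, which for 1500 equally represented species is 1/1500.

```python
import random


def adjacent_same_source_probability(proportions):
    """P(two consecutive reads share a source) under independent draws."""
    total = sum(proportions)
    return sum((p / total) ** 2 for p in proportions)


def simulate_adjacent_matches(proportions, n_reads, seed=0):
    """Draw n_reads source labels with replacement and return the fraction
    of neighbouring pairs that come from the same source."""
    rng = random.Random(seed)
    labels = rng.choices(range(len(proportions)), weights=proportions, k=n_reads)
    matches = sum(a == b for a, b in zip(labels, labels[1:]))
    return matches / (n_reads - 1)


equal = [1.0] * 1500      # the equal-proportion case from the question
p_equal = adjacent_same_source_probability(equal)    # 1/1500

skewed = [0.5, 0.3, 0.2]  # hypothetical uneven community
p_skewed = adjacent_same_source_probability(skewed)  # 0.25 + 0.09 + 0.04
simulated = simulate_adjacent_matches(skewed, 200_000)
```

The simulated fraction should land close to the analytic value, which gives a quick sanity check for any generator that claims to emit reads in random order.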

The order of the reads will follow the physical layout of the flow cell. Sometimes that matters, since some regions of the flow cell may produce better quality data.
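Because record order follows the flow cell, the position of each read can be recovered from its header. A sketch, assuming the Casava 1.8+ style Illumina read name (`@instrument:run:flowcell:lane:tile:x:y`, optionally followed by a space and the read/filter fields). The example header is made up, the names `ReadPosition` and `parse_illumina_header` are illustrative, and older or SRA-reformatted headers will not match this layout.

```python
from typing import NamedTuple


class ReadPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_header(header: str) -> ReadPosition:
    """Extract lane, tile and x/y coordinates from a Casava 1.8+ read name."""
    name = header.lstrip("@").split()[0]  # drop '@' and the comment field
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    lane, tile, x, y = (int(v) for v in fields[3:7])
    return ReadPosition(lane, tile, x, y)


# Made-up header for illustration only.
example = "@K00123:45:HABCDEFXX:1:1101:1234:5678 1:N:0:ATCACG"
pos = parse_illumina_header(example)
```

A simulator that wants realistic ordering could emit records grouped by lane and tile in this way, while keeping the source of each read random within a tile.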

5.7 years ago by Istvan Albert 102k

Sure, I should have added to the question that they all have the same relative abundance.

5.7 years ago by yonis • 0

Having exactly the same abundance is uncommon unless the samples are created artificially. Note, too, that even if you had the same number of DNA molecules for each organism, the genomes most likely differ in size. A longer genome will produce more reads, so the resulting reads will not be in equal proportion.

PS. I have now noticed that you are generating these in a simulation. The second part still applies: genomes will produce reads in proportion to their lengths.
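To make the point about genome length concrete, here is a small sketch. It assumes the expected share of reads from each organism is proportional to (number of genome copies) × (genome length), which ignores GC content and library-prep biases; the organism names and genome sizes are hypothetical.

```python
def expected_read_fractions(genome_lengths, copies=None):
    """Expected fraction of reads per organism, assuming reads are sampled
    uniformly along all DNA present (no GC or library-prep bias)."""
    if copies is None:
        copies = {name: 1 for name in genome_lengths}  # equal copy numbers
    mass = {name: genome_lengths[name] * copies[name] for name in genome_lengths}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


# Hypothetical genome sizes in base pairs.
lengths = {"org_a": 2_000_000, "org_b": 6_000_000}
fractions = expected_read_fractions(lengths)
# With equal copy numbers, org_b yields three times as many reads as org_a.
```

These fractions can be passed directly as the sampling weights when drawing the source of each simulated read.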

5.7 years ago by Istvan Albert 102k