We are writing the data management plan and I need to estimate, for storage costs, the size of the data we'll get per sample (as fastq.gz).
I found a very rough estimate in this comment, which matches what I know from previous experience, but I wonder whether more accurate calculators are available.
Some variables that I think are important (see the back-of-envelope sketch after the list):
- Read length
- Paired
- Targeted number of reads
- Adaptors
- UMI
- Technology (RRBS, WGS, RNAseq, 16S, metagenomics, metaproteomics)
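Roughly, what I have in mind so far is something like the sketch below. The per-base figure and the gzip ratio are my own assumptions from typical Illumina data, not measured values, and it ignores adapters, UMIs, and technology-specific quirks:

```python
# Very rough per-sample fastq.gz estimate. Assumptions (not measured values):
# ~2.3 bytes per base uncompressed (sequence + quality + headers/newlines)
# and a gzip ratio of roughly 3-5x for typical Illumina libraries.
def estimate_fastq_gz_gb(n_reads, read_length, paired=True,
                         bytes_per_base=2.3, gzip_ratio=4.0):
    ends = 2 if paired else 1
    uncompressed_bytes = n_reads * read_length * ends * bytes_per_base
    return uncompressed_bytes / gzip_ratio / 1e9

# e.g. 30 M paired-end 2x150 bp reads -> a few GB per sample
print(f"{estimate_fastq_gz_gb(30e6, 150):.1f} GB")
```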
This is not something you can be absolute about unless you have a large amount of data to refer back to. The sizes will vary with library quality, sample type, compressibility, and so on.
If you must come up with a number, look at some representative files and add 10% to cover outliers.
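A minimal sketch of that approach, assuming you have a directory of existing runs (the path and the 10% margin are just placeholders to adjust):

```python
from pathlib import Path

# Average the size of some representative fastq.gz files and add 10%.
# "representative_runs/" is a hypothetical directory; point it at your own data.
files = list(Path("representative_runs").glob("*.fastq.gz"))
sizes_gb = [f.stat().st_size / 1e9 for f in files]
estimate = (sum(sizes_gb) / len(sizes_gb)) * 1.10  # +10% to cover outliers
print(f"~{estimate:.2f} GB per file, averaged over {len(files)} files")
```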
Remind the authorities that the more successful you are, the more you are going to need (storage, that is).
Oh, I absolutely agree. I don't need a hard estimate (somewhere between 1 and 3 GB is fine), but it would be nice to have some numbers to know, more or less, whether it is better to pay for cloud storage or just keep local copies.
If you don't have your own data to use, randomly pick examples from SRA and go from there. You could also use read simulators and create dummy datasets with characteristics that you control.
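For the simulator route, even a toy generator is enough to see how read length and read count translate into compressed size; just note that random sequence compresses worse than a real library, so treat the result as a pessimistic bound. A sketch (file name and parameters are arbitrary):

```python
import gzip
import os
import random

def dummy_fastq_gz(path, n_reads, read_length, seed=0):
    """Write a dummy gzipped FASTQ with the characteristics you control."""
    rng = random.Random(seed)
    with gzip.open(path, "wt") as fh:
        for i in range(n_reads):
            seq = "".join(rng.choice("ACGT") for _ in range(read_length))
            qual = "I" * read_length  # flat qualities; real quality strings compress differently
            fh.write(f"@read{i}\n{seq}\n+\n{qual}\n")

dummy_fastq_gz("dummy.fastq.gz", n_reads=100_000, read_length=150)
print(os.path.getsize("dummy.fastq.gz") / 1e6, "MB for 100k reads; scale to your target depth")
```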
Good ideas! I'll do that. Thanks!