The question is pretty self-explanatory. I need a very rough estimate of how many gigabytes we are talking about when we talk about metagenomics/metaproteomics. I can't find much information, but I guess some of you handle this kind of data daily, so... I felt like asking. Any input is welcome.
Adding on to @5heikki's comment. This can also depend on factors like budget and study objectives. If you have a small budget, you may have to use a reduced-output flow cell, which produces less data. If you want to search for low-abundance genes (e.g., antimicrobial resistance genes), you have to sequence deep, i.e., generate a lot of data, because those genes make up only a tiny fraction of the overall genetic content. People may generate as little as 0.5 GB of metagenomic data for pilot studies using a MiSeq Nano run, or terabytes of metagenomic data using a NovaSeq run, for example. It depends on a lot of factors.
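To make the depth argument concrete, here is a minimal back-of-the-envelope sketch in Python (the 0.001% abundance figure and the read counts are illustrative assumptions, not numbers from this thread): the expected number of reads hitting a rare gene family is simply total reads times its fractional abundance.

```python
# Rough expected-hits calculation for a rare gene family.
# ASSUMPTION: the gene family makes up 0.001% (1e-5) of all reads; purely illustrative.
rare_fraction = 1e-5

for total_reads in (1_000_000, 10_000_000, 100_000_000):
    expected_hits = total_reads * rare_fraction
    print(f"{total_reads:>11,} reads -> ~{expected_hits:,.0f} reads from the rare gene family")
```

At 1 million reads you would expect only ~10 hits, which is why rare targets push you toward deeper (and larger) datasets.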
OK, I got the idea and, as you said, we can't know in advance. My guess is that we will start with some metagenome/metatranscriptome sequencing to see what is in the samples we have, and then, if we find something interesting, we may decide to increase the sequencing depth.
To be honest, I wasn't expecting such huge differences (1 GB to 1 TB), but it looks like it's not easy to provide an estimate, so I guess we will see!
OK, I got the idea and, as you said, we can't know in advance.
You definitely can. If you are doing NGS, then the storage is simply (and roughly) a function of the number of reads and the read length per sample that you sequence. That involves a) the raw fastq files and b) the additional files you create, e.g. BAM files. You can simulate fastq files with different read lengths and different numbers of reads (e.g. with wgsim from Heng Li), align them, and then see what the storage for these files is when using 1, 2, 5 million and so on reads. That will give you an estimate for the NGS part. It really depends on your throughput. Obviously 10 samples need less storage than 1000.
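If you don't want to simulate actual files first, a crude size formula already gets you in the right ballpark. Below is a minimal Python sketch of that idea (the ~2 bytes per base for uncompressed FASTQ and the ~3-4x gzip ratio are rough assumptions I'm adding, not figures from this thread); for real numbers you would still generate reads with wgsim and check the resulting files on disk.

```python
# Back-of-the-envelope FASTQ size estimate from read count and read length.
# ASSUMPTIONS (rough, for planning only):
#   - each base costs ~2 bytes uncompressed (sequence char + quality char,
#     plus a little overhead for the header and '+' lines)
#   - gzip typically shrinks FASTQ by roughly 3-4x
BYTES_PER_BASE = 2.0
GZIP_RATIO = 3.5

def estimate_fastq_gb(n_reads, read_len, paired=True, compressed=True):
    """Very rough size of the FASTQ file(s) for one sample, in GB."""
    total_bases = n_reads * read_len * (2 if paired else 1)
    size_bytes = total_bases * BYTES_PER_BASE
    if compressed:
        size_bytes /= GZIP_RATIO
    return size_bytes / 1e9

if __name__ == "__main__":
    for n in (1_000_000, 2_000_000, 5_000_000, 10_000_000):
        print(f"{n:>11,} x 2 x 150 bp reads -> ~{estimate_fastq_gb(n, 150):.1f} GB gzipped")
```

Multiply by the number of samples, and as a rule of thumb budget at least as much again for BAMs and other derived files, to turn this into a planning figure.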
My suggestion, before I give concrete numbers, is to always buy the largest disks you can. About a year ago, every internal hard disk I bought was at least 6 TB. Now I wish I had bought 10-12 TB drives instead, because they were only $100-200 more and I am running out of bays to install them. My point is that there is no downside to having a larger disk than necessary at the time, while it is not so easy to fix later.
I will give you a sequencing example: 150 bp Illumina, 250-500 million paired-end reads. After light processing, assembly with two different programs, and binning, followed by re-compressing all primary files, these datasets take 100-200 GB each. This is before read mapping or any kind of serious metabolic analysis, and those create very large files. When planning, I assumed it would take 0.5 TB for each of these datasets, which I thought was on the safe side. The only way that will suffice is if I keep compressing all the files that are not immediately needed, but then I am wasting my computer's time (and mine) to save a relatively small amount of money.
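For what it's worth, the raw-data part of those numbers is easy to sanity-check with the same kind of back-of-the-envelope arithmetic as above (the ~2 bytes/base and ~3.5x gzip figures are again my rough assumptions, not measurements from this dataset):

```python
# Rough check of the dataset size quoted above.
pairs = 400e6                    # somewhere in the 250-500 million pair range
bases = pairs * 2 * 150          # ~1.2e11 bases of 150 bp paired-end data
raw_gb = bases * 2 / 1e9         # ~240 GB of uncompressed FASTQ (assumed ~2 bytes/base)
gzipped_gb = raw_gb / 3.5        # ~70 GB gzipped (assumed ~3.5x compression)
print(f"~{raw_gb:.0f} GB raw FASTQ, ~{gzipped_gb:.0f} GB gzipped, before assemblies/bins/BAMs")
```

Add assemblies, bins, and mapping output on top of the gzipped reads and 100-200 GB per dataset is about what you would expect.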
Again, hard disks are cheap, and even SSDs are getting there. I don't think you will regret installing 2 x 10 TB drives instead of 2 x 2 TB drives in the long run, even though it will cost a little more at first and it may take a year or two to realize how good it is to have the extra space.
Depends on the type of sample and sequencing depth. Can be anything from 1 GB to 1 TB or more.
OK, good to know about such a resource (wgsim). I will look into it.