1 base pair to 'x' byte conversion
1
0
5.0 years ago
kanika.151 ▴ 160

Hello all,

Does anyone know the base pair to byte conversion? I was recently asked: if each sample has 'y' million reads, do you know how much space it would occupy on our cluster?

How would you answer it?

fastq file cluster conversion base pairs byte • 4.1k views
ADD COMMENT
1

Depending on the actual sequence, files are going to compress more or less (similar sequences next to each other will compress better), so there is no way to make an exact size estimate beforehand. You could generate totally random fake fastq sequence data and see what size the file occupies. That should be close to the largest size you would need for that particular data type (number of reads, read length).
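
Something along these lines would do it (a rough sketch; the 50 bp read length, 1 million reads, and file names are just placeholders):

    N=1000000                                 # number of fake reads (placeholder)
    awk -v n="$N" 'BEGIN{
        srand()
        bases = "ACGT"
        for (i = 1; i <= n; i++) {
            seq = ""; qual = ""
            for (j = 1; j <= 50; j++) {       # 50 bp per read (placeholder)
                seq  = seq  substr(bases, int(rand()*4) + 1, 1)
                qual = qual "I"               # dummy quality character
            }
            printf "@read_%d\n%s\n+\n%s\n", i, seq, qual
        }
    }' > random.fastq
    gzip -c random.fastq > random.fastq.gz    # keep both copies for comparison
    ls -l random.fastq random.fastq.gz        # uncompressed vs. compressed bytes

Since random sequence compresses worst, the gzipped size of such a file should be close to an upper bound for real data with the same number of reads and read length.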

ADD REPLY
0

Hello All,

What do you folks think of this article? https://bitesizebio.com/8378/how-much-information-is-stored-in-the-human-genome/

Do you think the conversion from this article can be used to give an estimate?

6×10^9 base pairs/diploid genome × 1 byte/4 base pairs = 1.5×10^9 bytes or 1.5 Gigabytes, about 2 CDs' worth of space! Or small enough to fit 3 separate genomes on a standard DVD!
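
For reference, that arithmetic (2 bits per base, so 4 bases per byte) is easy to check on the command line; note that it only holds for a bit-packed binary encoding, not for a plain-text file:

    echo "6 * 10^9 / 4" | bc    # 1500000000 bytes, i.e. 1.5 GB at 2 bits per base
    echo "6 * 10^9"     | bc    # ~6 GB if stored as plain text at 1 byte per base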

ADD REPLY
0

It is not directly related. That article aims to estimate the byte size of a diploid genome; you aim to estimate the size of a text file, which is influenced by read length and the length of the header lines. The principle is the same, though, with 1 byte per character as I outlined above. Just take any random fastq file, subsample it to 1 million reads (given it has the same read length), and then multiply according to your read numbers.
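
A rough sketch of that (sample.fastq.gz and the scaling factor are placeholders; a fastq record is 4 lines, so 1 million reads = 4 million lines):

    zcat sample.fastq.gz | head -n 4000000 > subsample_1M.fastq   # first 1 million reads
    ls -l subsample_1M.fastq                                      # bytes per 1 million reads
    # expected uncompressed size ~ (bytes per 1 million reads) * (your read count in millions)

Taking the first reads rather than a random subsample is fine here, since only read count and read length matter for the size estimate.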

ADD REPLY
0

Thank you. I had taken read length into consideration as the data is paired-end. I will do as you all have suggested. Thanks again! :)

ADD REPLY
1
5.0 years ago
ATpoint 86k

I do not think this can generally be answered, as fastq is pretty much always gzip-compressed and the compressed file size depends on nucleotide composition. I guess you can (for uncompressed files) approximate it with 1 byte per character (remember that each line ends with a hidden newline, so +1 for that). Given that you have equal read lengths, that would probably be something like the following for each read (which has 4 lines):

  (number of characters per read header line  = line 1) + 1
+ (number of characters per read sequence     = line 2) + 1
+ (1 for the + in line 3)                               + 1
+ (number of characters per read quality line = line 4) + 1

Or you simply make a dummy fastq file with the same read length as your sample and a certain number of reads, get the file size with ls -l, and then multiply to match your actual number of reads.

Just to give you an idea, I checked a random fastq from a ChIP-seq experiment I had around: 17.5 million reads, 50 bp read length, read headers around 16 characters long:

$ ls -lh foo.fastq* && ls -l foo.fastq*

-rw-r--r-- 1 xx xx 2.3G Jan  8 10:11 foo.fastq
-rw-r--r-- 1 xx xx 460M Jan  7 18:18 foo.fastq.gz
-rw-r--r-- 1 xx xx 2369334530 Jan  8 10:11 foo.fastq
-rw-r--r-- 1 xx xx  482130248 Jan  7 18:18 foo.fastq.gz
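
Plugging those numbers into the per-read formula above lands in the same ballpark (assuming 16-character headers; the real headers are evidently a bit longer, which explains the gap):

    echo "(16+1 + 50+1 + 1+1 + 50+1) * 17500000" | bc    # 2117500000 bytes vs. 2369334530 above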
ADD COMMENT
0

As ATpoint said, the best you could do is put an upper limit on the space required for the uncompressed data, and then rest somewhat easy knowing that your compressed data will be smaller than that, but there's no real way to know ahead of time by how much.

ADD REPLY
