Question

How Many Copies Of Genbank?

13.0 years ago

W Langdon ▴ 90

I am trying to find out how many copies of human gene sequences there are on our computers.

This is somewhat vague, so I have instead tried to ask: how many copies of GenBank's human genome reference sequence are there?

My best guess so far is fewer than about 4400. This is based on it taking 2 hours to download from NCBI, and on copies probably lasting only a year before falling out of use or being replaced with a new copy (1 year / 2 hours = 4383).
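
For anyone who wants to check the arithmetic, here is a minimal sketch of that bound. It assumes a single download link used back to back, a 365.25-day year, and a one-year lifetime per copy; the function name `max_live_copies` is made up for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, using an average year length


def max_live_copies(download_hours: float, lifetime_years: float = 1.0) -> int:
    """Upper bound on copies alive at once.

    Assumes downloads run one after another over a single link and that
    each copy survives for `lifetime_years` before being dropped or replaced.
    """
    return int(lifetime_years * HOURS_PER_YEAR / download_hours)


print(max_live_copies(2.0))  # 4383
```

Changing `download_hours` or `lifetime_years` shows how sensitive the bound is to those two guesses.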

Bill

PS: Dear Daniel, thank you for your reply. Yes, I am indeed thinking about other people's copies. Perhaps "our" was not the best word. I was thinking about people reading BioStar, but actually I mean everyone worldwide. I have seen some estimates of disk and network reliability and was trying to estimate how often, worldwide, disk copies of human DNA sequences get lost or corrupted. Of course disks are not perfect, but I was surprised to learn that even a site as large as CERN loses data (http://dx.doi.org/10.1145/1839676.1839692). So I am wondering: if they can lose data, does this happen to us in bioinformatics too?

Sorry for my slow response, I was confused by the main index saying "0 Answers" and did not realise that comments had been posted.

genbank ncbi human hardware data • 2.4k views

13.0 years ago by W Langdon ▴ 90

What kind of sequences? There are ESTs, genomic contigs and sequenced reads, which all probably add up to billions of sequences.

13.0 years ago by Damian Kao 16k

What you've calculated there is how many times one could download a particular file in a year. I'm not really sure what you're trying to figure out.

13.0 years ago by Neilfws 49k

Are you asking about copies or versions? You talk about gene sequences and then you talk about genome sequences--which are you interested in, as they are two separate sets of sequences?

13.0 years ago by Sean Davis 27k

'our computers'? Do you mean your computers? My computers? Both our computers combined? Everyone's computers? I admit I have quite a few copies of the human genome reference lying around.

13.0 years ago by User 59 13k

Dear Dk, thank you, and sorry my posting was too vague. I meant every kind of sequence. I am thinking about data loss and corruption, and would anticipate that the reliability of computer disks and networks does not depend on the sequence type but only on its size.

13.0 years ago by W Langdon ▴ 90

Dear Neilfws, I have ignored the fact that there are other sites which mirror NCBI's human reference genome, and have assumed that every bioinformatician downloads directly from NCBI. If people were downloading continuously, the number of copies created would be at its maximum, limited only by NCBI's bandwidth. Since the link is not in use 100% of the time, the actual number of copies must be less than that. I am also assuming that no one copies from copies. Obviously this is not true, but does anyone have a guess how often copies are taken from copies? Thank you, Bill
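
To make those assumptions explicit, here is a rough sketch of the model, not a measurement. The parameters `link_utilisation` (the fraction of time the download link is busy) and `recopy_factor` (the average number of second-hand copies made per direct download) are hypothetical knobs I have introduced for illustration; nobody in this thread has supplied values for them.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, using an average year length


def estimated_copies(
    download_hours: float,
    lifetime_years: float = 1.0,
    link_utilisation: float = 1.0,  # hypothetical: 1.0 means the link is never idle
    recopy_factor: float = 0.0,     # hypothetical: 0.0 means nobody copies from copies
) -> float:
    """Estimate live copies under the single-link model described above."""
    if not 0.0 <= link_utilisation <= 1.0:
        raise ValueError("link_utilisation must be between 0 and 1")
    if recopy_factor < 0.0:
        raise ValueError("recopy_factor must be non-negative")
    # Copies downloaded directly from the source during one copy lifetime.
    direct = lifetime_years * HOURS_PER_YEAR / download_hours * link_utilisation
    # Each direct copy is assumed to spawn `recopy_factor` further copies.
    return direct * (1.0 + recopy_factor)


print(estimated_copies(2.0))                        # upper bound: link always busy
print(estimated_copies(2.0, link_utilisation=0.5))  # link idle half the time
```

With `recopy_factor` above zero the estimate can exceed the bandwidth bound, which is why the "no copies of copies" assumption matters so much.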

13.0 years ago by W Langdon ▴ 90

Dear Sean, Sorry for my tardy reply. I guess I was thinking of physical copies. My assumption was that people would be happy to use a version that was a few months out of date, rather than insisting on using only the most recent version. Perhaps I am wrong?

Apologies that my original text was confusing. I did not intend to draw a distinction between "gene sequences" and "genome sequences".

Thank you Bill

13.0 years ago by W Langdon ▴ 90

Dear Neilfws, thanks for deleting my "answer" (I could edit it but could not find out how to delete it.)

13.0 years ago by W Langdon ▴ 90