I am trying to find out how many copies of human gene sequences there are in our computers.
This is somewhat vague, so I have instead tried to ask: how many copies of GenBank's human genome reference sequence are there?
My best guess so far is about 4000 or fewer. This is based on it taking 2 hours to download the sequence from NCBI, and on copies probably lasting only a year before falling out of use or being replaced with a new copy (1 year / 2 hours ≈ 4383).
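To make the arithmetic explicit, here is a minimal sketch of that estimate. It assumes a single download link running back to back and a year of 365.25 days; the function name `max_copies` and its parameters are mine, for illustration only.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def max_copies(download_hours, lifetime_years=1.0):
    """Upper bound on copies still in use, assuming one link that is
    downloading continuously and copies that live for lifetime_years."""
    return lifetime_years * HOURS_PER_YEAR / download_hours


print(round(max_copies(2.0)))  # 4383
```

This is only an upper bound under those assumptions: a link that is idle part of the time lowers the figure, while mirrors and copies made from copies raise it.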
Bill
PS: Dear Daniel, thank you for your reply. Yes, I am indeed thinking about other people's copies. Perhaps "our" was not the best word. I was thinking about people reading BioStar, but actually I mean everyone worldwide. I have seen some estimates of disk and network reliability and was trying to estimate how often (worldwide) disk copies of human DNA sequences get lost or corrupted. Of course disks are not perfect, but I was surprised to learn that a site as large as CERN loses data: http://dx.doi.org/10.1145/1839676.1839692 So I wonder: if they can lose data, does this happen to us in bioinformatics too?
Sorry for my slow response. I was confused by the main index saying "0 Answers" and did not realise that comments had been posted.
What kind of sequences? There are ESTs, genomic contigs and sequenced reads, which probably add up to billions of sequences.
What you've calculated there is how many times one could download a particular file in a year. I'm not really sure what you're trying to figure out.
Are you asking about copies or versions? You talk about gene sequences and then you talk about genome sequences--which are you interested in, as they are two separate sets of sequences?
'our computers'? Do you mean your computers? My computers? Both our computers combined? Everyone's computers? I admit I have quite a few copies of the human genome reference lying around.
Dear Dk, thank you, and sorry that my posting was too vague. I meant every kind of sequence. I am thinking about data loss and corruption, and would anticipate that the reliability of computer disks and networks does not depend on the sequence type but only on the size of the data.
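To illustrate what "only on its size" would mean, here is a rough sketch that assumes every bit is corrupted independently with the same probability. Both the function name `p_corrupt` and the error rate used in the example are hypothetical placeholders, not measured values.

```python
import math


def p_corrupt(size_bytes, bit_error_rate):
    """Probability that at least one bit of a file is wrong,
    assuming independent, identically likely bit errors."""
    bits = size_bytes * 8
    # 1 - (1 - p)**bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# Hypothetical inputs: a 3e9-byte file and a placeholder error rate of 1e-15 per bit.
print(p_corrupt(3e9, 1e-15))
```

Under this simple model the sequence type never enters the calculation: a larger file is more likely to contain an error, whatever it holds.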
Dear Neilfws, I have ignored the fact that there are other sites which mirror NCBI's human reference genome, and have assumed that every bioinformatician downloads directly from NCBI. If people were downloading all the time, the largest possible number of copies would be created, and that number would be limited by NCBI's bandwidth. Since the link is not in use 100% of the time, the actual number of copies must be less than that. I am assuming no one copied from copies. Obviously this is not true, but does anyone have a guess as to how often copies are taken from copies? Thank you, Bill
Dear Sean, Sorry for my tardy reply. I guess I was thinking of physical copies. My assumption was that people would be happy to use a version that was a few months out of date, rather than insisting on using only the most recent version. Perhaps I am wrong?
Apologies that my original text was confusing. I did not intend to create a difference between "gene sequences" and "genome sequences".
Thank you Bill
Dear Neilfws, thanks for deleting my "answer" (I could edit it but could not find out how to delete it.)