Hi there,
I am interested in computing some statistics for a subset of the genome. I have compiled a list of 28M sites and am now checking whether each site passed the 1000 Genomes filters. To do this, I am using samtools to query the mask file (in FASTA format) at each position:
samtools faidx maskfile chr_pos
However, for 28M sites this is very slow. Does anyone have suggestions for speeding up the lookup? I am fairly good at Perl programming, but I don't think the standard search methods I am familiar with will speed things up significantly. Any suggestions would be very helpful.
Also, as step 2, I need to look up results from a table. I am currently using Perl hashes for this, but again they are too slow for such a large amount of data. I would be most grateful for any suggestions to improve the speed.
-Diviya
What are the attributes of these 28M sites? How are you matching them against the genomes? Are you trying to match based on an interval range, for example?
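Depending on the answers, one option for the first step might be to avoid calling samtools once per site, and instead read the mask file a single time, looking up all sites for each chromosome while its sequence is in memory. Below is a minimal Python sketch of that idea (the same logic should port to Perl). It rests on several assumptions: the FASTA record names match the chromosome names in your site list, positions are 1-based, and the helper names `read_fasta` and `lookup_mask` are made up for illustration. I believe a "P" in the 1000 Genomes mask marks a passed site, but please check that against the mask's README.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def lookup_mask(mask_handle, sites):
    """Return {(chrom, pos): mask_char} for an iterable of (chrom, 1-based pos).

    Sites on chromosomes missing from the mask, or beyond the end of a
    sequence, are simply left out of the result.
    """
    # Group positions by chromosome so each sequence is visited only once.
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)

    result = {}
    for chrom, seq in read_fasta(mask_handle):
        for pos in by_chrom.get(chrom, ()):
            if 1 <= pos <= len(seq):
                result[(chrom, pos)] = seq[pos - 1]  # 1-based to 0-based
    return result


if __name__ == "__main__":
    # Tiny made-up mask standing in for the real file.
    demo = io.StringIO(">chr1 toy record\nNNPP\nPLLH\n>chr2\nPPQZ\n")
    sites = [("chr1", 3), ("chr1", 6), ("chr2", 4)]
    for (chrom, pos), code in sorted(lookup_mask(demo, sites).items()):
        print(chrom, pos, code)
```

With a real mask you would pass `open("maskfile")` instead of the `io.StringIO` demo. Only one chromosome's sequence is held in memory at a time, but 28M (chrom, pos) tuples in plain Python lists will still need a fair amount of RAM, so it may be worth processing the site list one chromosome at a time. The same grouping idea could apply to your step 2 table lookup, though that depends on what the table looks like.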