I am trying to match a very long, unsorted list of chrom/pos entries against the records in a VCF file. Because the input position list is not sorted, I am using pysam.VariantFile.fetch() to look up each position.
Normally each fetch() call triggers a file operation, which seems slow. My data is clustered: nearby sites usually appear in ascending order in the input (although I cannot assume they are strictly sorted), e.g.:
chr1 100000
chr1 100001
.....
chr1 100100
chr1 10000
chr1 10001
chr1 10003
......
chr1 10010
So I am trying to fetch() a whole buffer-sized window at once; if the next position falls inside that window, the file I/O can be avoided. To look records up easily from the buffer, I store them in a dict keyed by position:
snp = vcf_input.fetch(chrom, pos - 1, pos + bufferSize)
vcf_inbuffer = {snp_record.pos: snp_record for snp_record in snp}
But I noticed that building the dict, vcf_inbuffer = {snp_record.pos: snp_record for snp_record in snp}, is very slow when bufferSize is set to a large value.
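For reference, here is a minimal sketch of the buffering idea described above, so the logic is unambiguous. The names BufferedLookup, buffer_size and fetch_count are hypothetical and only for illustration; the class takes any callable with the same (contig, start, stop) arguments as pysam.VariantFile.fetch(), and it assumes fetch() uses 0-based half-open coordinates while record.pos is 1-based. This is not a tuned implementation, just the approach I am currently using:

```python
class BufferedLookup:
    """Cache one window of records so nearby positions skip a new fetch().

    `fetch` is assumed to behave like pysam.VariantFile.fetch(contig, start, stop):
    0-based, half-open region, yielding records with a 1-based `.pos`.
    """

    def __init__(self, fetch, buffer_size=1000):
        self.fetch = fetch
        self.buffer_size = buffer_size
        self.chrom = None
        self.start = 0          # first 1-based position covered by the window
        self.end = 0            # one past the last 1-based position covered
        self.buffer = {}
        self.fetch_count = 0    # number of real fetch() calls, for diagnostics

    def get(self, chrom, pos):
        """Return the record at 1-based `pos`, or None if there is none."""
        in_window = chrom == self.chrom and self.start <= pos < self.end
        if not in_window:
            # Refill the window starting at the requested position.
            self.chrom = chrom
            self.start = pos
            self.end = pos + self.buffer_size
            # 1-based [start, end) corresponds to 0-based [start - 1, end - 1).
            # If several records share a position, only the last one is kept.
            self.buffer = {
                rec.pos: rec
                for rec in self.fetch(chrom, self.start - 1, self.end - 1)
            }
            self.fetch_count += 1
        return self.buffer.get(pos)


# Intended usage with pysam (file name is a placeholder):
#   import pysam
#   vcf_input = pysam.VariantFile("input.vcf.gz")
#   lookup = BufferedLookup(vcf_input.fetch, buffer_size=1000)
#   record = lookup.get("chr1", 100000)
```

With a window like this, every position that misses the window still costs one fetch() plus one dict build over the whole window, which is where the time goes when buffer_size is large.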
My question is:
1) Is there a faster way to do this?