I need to select the ROI of a gvcf using a bed file. This is the approach I have developed so far but I think it is quite inefficient
My bed tool is parsed like this
[('chr1', 36931694, 36931909, 'CSF3R.exon.17.line.1.chr1.36931697.36932509'), ('chr1', 36931910, 36932116, 'CSF3R.exon.17.line.1.chr1.36931697.36932509')]
and the code that does the interset is this
def roi(gvcf_data,bedregions):
'''Interset. Take ROI of the gVCF using
the bed file'''
rows = []
for region in bedregions:
for index, row in gvcf_data.iterrows():
if (region[0] == row['CHROM']) & (row['POS'] in range (region[1],region[2])):
rows.append([row['CHROM'], row['POS'],row['DP'], region[3]])
return pd.DataFrame(rows, columns=["CHROM", "POS", "DP", "Probe_name"])
This application will be deployed on the computer of one sequencer and I cannot install or do anything to finally use bedtools
Not what you want to hear but reinventing high-performance software in a non-performant language with the help of strangers on the internet is not what I'd call a clever investment of time. You should rather come up with a workaround to either find a way deploying software on that computer or outsource the computation to a machine you have rights to. Variant selection is extremely efficient via something like bedtools, tabix or bcftools. Why exactly can’t you compile existing software on that machine? I mean, imagine you tomorrow realize that you need a samtools equivalent or aligner on that computer. Not going to implement that, right?
Thank for your time.
There are many reasons I could give you to answer your question. Here are some of them,
1) First I am the trainee (but also the only bioinformatician) working in a clinical lab which means that my knowledge is limited.
2) The computer is in a sequencer that cost at the moment $128K.
3) Patient data cannot be sent to an external machine via the internet, all computers in my hospital are window. What is more, I have to have a serious discussion with the IT of our hospital because I was not able to even run a python application or similar programming language. So I cannot install a virtual environment to have for example Linux.
I have just developed a small application that after finishing the Illumina software that generates among other files, the gVCF files. And I have been called to create a report that informs the names of the probes of which one of their bases is lower than a cutoff. For this, I need to interset that gVCF file.
I don't think I'll waste too much time writing a few lines in this community to see if someone has a better idea. From these strangers (as you called them) I have learnt a lot, I have received a lot of help and I have helped
Patient data cannot be sent to an external machine via the internet
But you surely have a network backup in place, no? So maybe you can access the files from the backup instead directly on the sequencing computer? After all, the machine running your tool doesn't need to be connected to the internet, if you pack your software and its dependencies into a container and deploy it on a computer that is part of the trusted network.
As far as your Python question is concerned, you could try to use cgranges instead (see the python subfolder in die repo for an example).