Count regions that overlap with python
2
0
Entering edit mode
2.4 years ago
ManuelDB ▴ 110

I have a dataframe with the coordinates of large mutations (CNVs)

 chr start end
1   200   1000
1   400   800
1  600   1500

How can I identify zones that overlap and count occurrences?

The result for the dataframe given above would be something like this

 start end occurrences 
200    800    2
600    1000   2
600    800    3

In the HPC where I work, I have a few python libraries (e.i. pandas) and Bedtools. How could I do this?

CNV • 2.1k views
ADD COMMENT
1
Entering edit mode

That's what you looking for!

ADD REPLY
4
Entering edit mode

how would you do that with bedtools intersect ?

ADD REPLY
0
Entering edit mode

It's bad practice to ask a question without an output example. How exactly is output supposed to be formatted. Check bedtools intersect as suggested, especially the counting option -wo. By the way, package managers like conda do not require root access so you always have the option to install most software you want with that, even on HPCs.

ADD REPLY
0
Entering edit mode

I don't really know what you mean by "without an output example". I have provided the output I would like to achieve. The HPC I am using is the Genomic England Research Environment. In this HPC you cannot install anything if this is not previously provided by them. Not even using Conda.

I have checked the -wo option. But this mention that you need to bed file. I only have one. Not sure how to do what you mean.

ADD REPLY
0
Entering edit mode

Your partitioning in the example output is incorrect or at least inconsistent with the starting example dataframe. Nonetheless, BEDOPS bedops --partition is an easy way to do this correctly.

ADD REPLY
0
Entering edit mode
2.4 years ago

Here's how you can do this with BEDOPS bedops --partition and bedmap --count, if you have that toolkit available on your HPC or can install it via the Bioconda channel.

Start by creating a sorted BED file from your dataframe text file:

tail -n+2 CNVs.txt | sort-bed - > CNVs.bed 

Then partition the CNVs and map the CNVs against the partitions, count-ing the number of CNVs that fall over a partition:

bedmap --echo --count --delim '\t' <(bedops --partition CNVs.bed) CNVs.bed > answer.bed

If you just want the last three columns, i.e., to strip out the chromosome (not recommended):

cut -f2- answer.bed > answer.txt
ADD COMMENT
0
Entering edit mode

I will check this on my PC out of curiosity. I am not sure I have that tool in the HPC I am using.

ADD REPLY
0
Entering edit mode
conda install -c bioconda bedops

If you need software on the HPC, you should have an administrator or IT department who can put that software in place for you.

ADD REPLY
0
Entering edit mode
2.4 years ago
ManuelDB ▴ 110

I have developed this

 lines = [(200,1000),
         (400,800),
         (600,1500),
        ]


all_numbers = [(x[0],"O") for x in lines] + [(x[1],"C") for x in lines]

all_numbers = sorted(all_numbers, key=lambda x: x[0])


now_open = 0

for first_tuple, second_tuple in  zip(all_numbers, all_numbers[1:]):

    begin, what_side = first_tuple
    end, _ = second_tuple

    if what_side == "O":
        now_open += 1
    elif what_side == "C":
        now_open -= 1


    print(begin, end, now_open)

With Python ... don't need any python library.

ADD COMMENT

Login before adding your answer.

Traffic: 1786 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6