Match column row wise and do calculations
1
0
Entering edit mode
5.4 years ago

Hi, I have a file like this. I have to put some criteria like if chromosome matched, then check start position matched; if matched then check for end position matched. If it ends position not matched then put value 1.1 else put value 1

Chromosome Start End   
    1   990595  990685 
    1   990595  990882
    1   1459777 1460585
    1   1563779 1563827
    2   1564102 1564215
    2   1564108 1564219

I am looking for output like this

 Chromosome Start End  
 1  990595  990685  1.1
1   990595  990882  1.2
1   1459777 1460585 2
1   1563779 1563827 3
2   1564102 1564215 1
2   1564108 1564219 2
R • 1.8k views
ADD COMMENT
1
Entering edit mode

You probably need:

GenomicRanges::reduce()
GenomicRanges::findOverlaps()

Or from data.table package:

data.table::foverlaps()
ADD REPLY
0
Entering edit mode

What have you tried? Please add some context (on the biological problem you're trying to solve) so this pure R question can qualify as a bioinformatics question.

ADD REPLY
0
Entering edit mode

Ok I will modify the question for better understanding.

ADD REPLY
0
Entering edit mode

This looks like an XY problem. What are you doing this exercise for?

ADD REPLY
0
Entering edit mode

I have to find the overlap region if it has been found then add new column like serial number 1.1, 1.2 etc

ADD REPLY
1
Entering edit mode

That is a verbatim description of the problem, not the reason why this problem needs solving (See what an XY problem means). There is also another factor you have not mentioned: the numbering resets with the start of a new chromosome. This is quite the convoluted logic and I don't know why you're doing it, so I want to make sure this problem needs solving before starting to solve it.

ADD REPLY
0
Entering edit mode
5.4 years ago
Sam ★ 4.8k

In R, assume your data is stored in a data.frame call dat

dat$Group <- 1
counter<-0
prev.chr <- NA
prev.loc <- 0
sub.count <- 0.1
for(i in 1:nrow(dat)){
    if ( is.na(prev.chr) | prev.chr != dat$Chromosome[i]){
        counter <- 0
        prev.chr <- dat$Chromosome[i]
        prev.loc <- dat$Start[i]
        sub.count <- 0.1
    }
    else if(prev.loc == dat$Start[i]){
        dat$Group[i] <- dat$Group[i]+counter+sub.count
        sub.count<- sub.count+0.1
    }else{
        prev.loc <- dat$Start[i]
        counter<- counter+1
        sub.count <- 0.1
        dat$Group[i] <- dat$Group[i] + counter
    }
}

(There are definitely better ways to go about it, the above codes are just as a reference)

ADD COMMENT
1
Entering edit mode

Your code is incomplete. Also, your first comparison is NA vs a value. See what happens when you do that:

> if(NA != 1) print('Hello')
Error in if (NA != 1) print("Hello") :
  missing value where TRUE/FALSE needed

On a side note, this sort of programming is like using python or java naively, where language specific features are discarded to "make something work". I'm sure there are better ways in R to partition a dataset and order within partitions in a custom manner, using a loop seems to be really inefficient in a language that vectorizes operations and uses the apply family to great effect.

On second thought, I guess OP's problem cannot really be solved without forcing R to process a data.frame row-by-row, which is why I suspect this is an XY problem.


EDIT: Here's a possible sequence of operations to get to the result:


Step-1: group by chromosome and start, and number groups

INPUT:

Chromosome Start    End   
1          990595   990685 
1          990595   990882
1          1459777  1460585
1          1563779  1563827
2          1564102  1564215
2          1564108  1564219

OUTPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

Step-2: group by chromosome gp.1; and number groups, while counting group membership

INPUT:

Chromosome Start    End       gp.1
1          990595   990685    1
1          990595   990882    1
1          1459777  1460585   2
1          1563779  1563827   3
2          1564102  1564215   1
2          1564108  1564219   2

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

Step-3: Determine final group

If count is 1 for a group, gp is gp.1. If count > 1, gp = paste(gp.1,gp.2,sep=".")

INPUT:

Chromosome Start    End       gp.1    gp.2    gp.count
1          990595   990685    1       1       2
1          990595   990882    1       2       2
1          1459777  1460585   2       1       1
1          1563779  1563827   3       1       1
2          1564102  1564215   1       1       1
2          1564108  1564219   2       1       1

OUTPUT:

Chromosome Start    End       gp.1    gp.2    gp.count    final.gp
1          990595   990685    1       1       2           1.1
1          990595   990882    1       2       2           1.2
1          1459777  1460585   2       1       1           2
1          1563779  1563827   3       1       1           3
2          1564102  1564215   1       1       1           1
2          1564108  1564219   2       1       1           2

Each of the three steps above can be vectorized, avoiding loops entirely.

ADD REPLY
0
Entering edit mode

Opps, I thought the message didn't sent (accidentally click back), didn't realize it was sent out. Sorry about that. Will now try to complete it.

ADD REPLY
0
Entering edit mode

Thanks, a lot Sam. It's working. I got one warning message which I can ignore. You made my day!!!!!!!.

Thanks again.

ADD REPLY
0
Entering edit mode

Could you please explain to us the reason for this exercise? This was too niche to be a commonplace operation and it would be useful to know when we might need such an operation in the future.

ADD REPLY
0
Entering edit mode

Thanks, Sam for your help but unfortunately when I am trying your code I got the error message like this I stored all output value in d4.

Warning message: In dat$Group[i] <- dat$Group + counter + sub.count : number of items to replace is not a multiple of replacement length

d4 [1] 0.1 2.0 3.0 4.0 0.2 5.0 6.0 7.0

Any help is much appreciated.

ADD REPLY
0
Entering edit mode

There's probably a typo, and it needs to be dat$Group[i] <- dat$Group[i] + .... Please understand code before copy-pasting it, lest you end up corrupting your data some day.

ADD REPLY
0
Entering edit mode

Thanks Ram. I will take care.

ADD REPLY
0
Entering edit mode

Yes, there's a typo. Have now updated the code

ADD REPLY

Login before adding your answer.

Traffic: 2558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6