Question

How to do conditional duplicate removal for a list of peak files?

8.0 years ago

Jurat Shahidin ▴ 100

Hi:

I have a list of peak files stored as GRanges objects, and each one needs a different kind of duplicate removal. For the first list element, I want complete duplicate removal, so that every row appears only once. For the second list element, I need to find rows that appear more than twice (freq > 2) and keep only one copy of each. For the third list element, I need to find rows that appear more than three times (freq > 3) and keep two copies of each. I am looking for a programmatic, dynamic solution. Is there a way to do this kind of conditional duplicate removal for a list of peak files represented as data.frames?

Mini example:

For a mini example, I took the skeleton of the peaks (peak coordinates with a score value) and represented it as a data.frame:

myList <- list(
    bar= data.frame(start=c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
                    end=c(14,21,39,61,73,87,153,14,39,73,153,14,87,153),
                    score=c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
    cat = data.frame(start=c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
                     end=c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124),
                     pos=c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
    foo= data.frame(start=c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
                    end=c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139),
                    pos=c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)

I am looking for a programmatic way to do this specific duplicate removal on my data. How can I do it when the input is a list of data.frames?

This is my desired output:

expectedList <- list(
    bar= data.frame(start.pos=c(9,19,34,54,70,82,136),
                    end.pos=c(14,21,39,61,73,87,153),
                    pos.score=c(48,6,9,8,4,15,38)),
    cat= data.frame(start.pos=c(7,21,72,142,7,16,45,100,114,142,16,114),
                    end.pos=c(10,34,78,147,10,17,51,103,124,147,17,124),
                    pos.score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
    foo= data.frame(start.pos=c(12,12,44,58,58,118,102,118,118),
                    end.pos=c(36,36,49,92,92,139,109,139,139),
                    pos.score=c(48,48,12,12,12,5,11,5,5))
)

Is there any way to make this happen? How can I achieve my desired output? Thanks a lot :)

r ChIP-Seq sequencing peak • 1.6k views

updated 8.0 years ago by Alex Reynolds 36k • written 8.0 years ago by Jurat Shahidin ▴ 100

score 3 · Accepted Answer · 2016-12-29

If you export the GRanges object to a BED-formatted text file (see links below), you can use BEDOPS bedmap with the Unix utilities awk and cut to filter out elements that show up more than some number of times.

For instance, to filter out elements that show up more than twice:

$ bedmap --count --echo --exact --delim '\t' peaks.bed | awk '($1 <= 2)' | cut -f2- > answer.bed

To filter out elements that show up more than three times, change the condition in the awk statement:

$ bedmap --count --echo --exact --delim '\t' peaks.bed | awk '($1 <= 3)' | cut -f2- > answer.bed

And so on.

If you need to, you should be able to read the resulting BED file answer.bed back into R as a GRanges object.

I don't know whether exporting from GRanges creates a properly sorted BED file, so you may need to run sort-bed (also in BEDOPS) first, before using bedmap:

$ sort-bed peaks.unknown-sort-order.bed > peaks.bed

Or, if you want to do it all in one go:

$ sort-bed peaks.unknown-sort-order.bed | bedmap --count --echo --exact --delim '\t' - | awk '($1 <= 2)' | cut -f2- > answer.bed
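
The bedmap filters above drop every copy of an element whose count exceeds the cutoff. If the goal is instead to keep a fixed number of copies, as the question describes, a two-pass awk filter is one possible approach. The sketch below is only an illustration and is not part of BEDOPS: the function name cap_duplicates, the file names, and the chromosome name chr1 are all made up for the demo, and two rows count as duplicates only when the whole line matches. It does not need sorted input.

```shell
# Demo input: the start/end pairs of the "cat" peaks from the question,
# written as three-column BED. "chr1" is a hypothetical chromosome name.
printf 'chr1\t%s\t%s\n' \
  7 10  21 34  21 34  72 78  142 147  7 10  16 17  21 34 \
  45 51  72 78  100 103  114 124  142 147  16 17  72 78  114 124 \
  > cat_peaks.bed

# Hypothetical helper: cap_duplicates THRESHOLD KEEP FILE
# Rows seen at most THRESHOLD times are passed through untouched;
# rows seen more often are cut down to their first KEEP copies.
cap_duplicates() {
  awk -v threshold="$1" -v keep="$2" '
    NR == FNR { count[$0]++; next }         # pass 1: count each distinct row
    count[$0] <= threshold { print; next }  # at or under the cutoff: keep all copies
    ++seen[$0] <= keep                      # over the cutoff: keep the first few copies
  ' "$3" "$3"                               # the same file is read twice
}

# Rule for the second list element: freq > 2, keep one copy.
cap_duplicates 2 1 cat_peaks.bed > cat_answer.bed
```

With a cutoff of 2 and one copy kept, the 16 demo rows reduce to 12, in the same order as the cat table in the desired output. Calling cap_duplicates 3 2 would apply the rule described for the third list element, and cap_duplicates 1 1 gives complete duplicate removal.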