How to do conditional duplicate removal for a list of peak files?
8.0 years ago

Hi:

I have a list of peak files stored as GRanges objects, and each one needs a different kind of duplicate removal. For the first list element, I want complete duplicate removal. For the second list element, I need to find the rows that appear more than twice (freq > 2) and keep only one copy of each. For the third list element, I need to find the rows that appear more than three times (freq > 3) and keep two copies of each. I am looking for a programmatic, dynamic solution. Is there a way to do this kind of conditional duplicate removal for a list of peak files held as data.frames?

Mini example:

For the mini example, I took a skeleton of the peaks (peak coordinates with a score value) and represented it as a list of data.frames:

myList <- list(
    bar= data.frame(start=c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
                    end=c(14,21,39,61,73,87,153,14,39,73,153,14,87,153),
                    score=c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
    cat = data.frame(start=c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
                     end=c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124),
                     pos=c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
    foo= data.frame(start=c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
                    end=c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139),
                    pos=c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)

I am looking for a more programmatic way to do this specific duplicate removal. How can I do it when the input is a list of data.frames?

This is my desired output:

expectedList <- list(
    bar= data.frame(start.pos=c(9,19,34,54,70,82,136),
                    end.pos=c(14,21,39,61,73,87,153),
                    pos.score=c(48,6,9,8,4,15,38)),
    cat= data.frame(start.pos=c(7,21,72,142,7,16,45,100,114,142,16,114),
                    end.pos=c(10,34,78,147,10,17,51,103,124,147,17,124),
                    pos.score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
    foo= data.frame(start.pos=c(12,12,44,58,58,118,102,118,118),
                    end.pos=c(36,36,49,92,92,139,109,139,139),
                    pos.score=c(48,48,12,12,12,5,11,5,5))
)

Is there any way to achieve my desired output? Thanks a lot :)

r ChIP-Seq sequencing peak
8.0 years ago

If you export the GRanges object to a BED-formatted text file, you can use BEDOPS bedmap with the Unix utilities awk and cut to filter out elements that show up more than some number of times.

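For instance, to filter out elements that show up more than twice (the --delim '\t' option makes bedmap put a tab between the count and the element instead of its default | character, so that awk and cut see the count as its own field):

$ bedmap --count --echo --exact --delim '\t' peaks.bed | awk '($1 <= 2)' | cut -f2- > answer.bed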

To filter out elements that show up more than three times, change the condition in the awk statement:

$ bedmap --count --echo --exact --delim '\t' peaks.bed | awk '($1 <= 3)' | cut -f2- > answer.bed

And so on.

If you need to, you should be able to bring the resulting BED file answer.bed back into R and into a GRanges object.

I don't know if exporting from GRanges creates a properly sorted BED file, so you may need to first use sort-bed (also in BEDOPS) before using bedmap:

$ sort-bed peaks.unknown-sort-order.bed > peaks.bed
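
If you are not sure whether the exported file is already in order, you can test it before sorting. Below is a minimal sketch that uses only the standard sort utility. The function name is_bed_sorted is made up for illustration, and it assumes the order sort-bed produces is chromosome name in byte order, then start and end coordinates numerically, with tab-separated columns.

```shell
# is_bed_sorted FILE
# Hypothetical helper (not part of BEDOPS). Returns 0 if FILE is already
# ordered by column 1 (byte order), then by columns 2 and 3 numerically.
# Assumption: this matches the order that sort-bed would produce.
is_bed_sorted() {
    LC_ALL=C sort -c -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

# Usage (file name taken from the example above):
#   is_bed_sorted peaks.unknown-sort-order.bed || echo "run sort-bed first"
```

If the check reports "unsorted" for a file that sort-bed would have accepted, the only cost is an unnecessary sort, so it is safe to treat a failed check as "sort it anyway".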

Or, if you want to do it all in one go:

$ sort-bed peaks.unknown-sort-order.bed | bedmap --count --echo --exact --delim '\t' - | awk '($1 <= 2)' | cut -f2- > answer.bed
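
Note that the pipelines above drop every copy of an element whose count exceeds the threshold, whereas the question asks to keep one or two copies of such elements. Here is a minimal awk-only sketch of that variant. It is not a BEDOPS feature: keep_copies and its three arguments are hypothetical names chosen for illustration, rows are compared as whole lines, and the input must be a regular file because it is read twice.

```shell
# keep_copies THRESHOLD KEEP FILE
# Hypothetical helper (not part of BEDOPS). Rows that occur more than
# THRESHOLD times are trimmed to their first KEEP copies; rows at or
# below the threshold pass through unchanged. Input order is preserved.
keep_copies() {
    awk -v thr="$1" -v keep="$2" '
        NR == FNR        { count[$0]++; next }  # pass 1: count identical rows
        count[$0] <= thr { print; next }        # at or below threshold: keep every copy
        (++seen[$0] <= keep)                    # above threshold: print only the first KEEP copies
    ' "$3" "$3"
}

# Usage, one call per list element from the question
# (file names are placeholders for tab-separated exports):
#   keep_copies 1 1 bar.bed > bar.filtered.bed   # complete duplicate removal
#   keep_copies 2 1 cat.bed > cat.filtered.bed   # freq > 2: keep one copy
#   keep_copies 3 2 foo.bed > foo.filtered.bed   # freq > 3: keep two copies
```

Run on the rows of the question's bar and cat elements written out as tab-separated text, this gives the rows of the expected output in the same order. For foo it gives the same rows, but in input order rather than the order shown in the question. Setting KEEP to 0 reproduces the drop-everything behaviour of the bedmap pipelines.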
Dear Alex :

Thanks for your detailed answer, I'll apply your solution.

Best regards :

Jurat
