How to edit large BED files to keep only peaks on a particular chromosome?
7.9 years ago

Hi,

I am using data sets from the ENCODE consortium for my package development, but the actual peak files are too large to ship with the package. The package tarball produced by R CMD build must be less than 4 MB on disk, so I need a much smaller peak file as example data. In the ENCODE data sets, each peak file contains around 100,000 peaks. How can I edit a large BED file to keep only the peaks on a particular chromosome? Are there any handy tools for editing peak files? Thanks in advance :)

Best regards :

Jurat

R ChIP-Seq genome peak encode • 2.7k views

You could provide data for one chromosome. Choose the one important for your application.
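A minimal sketch of that suggestion, assuming a Unix-like shell with `awk` available. The file name `demo.bed` and the choice of `chr21` are placeholders for illustration, not anything from the ENCODE data. It counts peaks per chromosome to help pick one, then keeps only the lines whose first column matches the chosen chromosome:

```shell
# Build a tiny demo BED file; replace demo.bed with your real peak file.
# (demo.bed and chr21 are illustrative assumptions.)
printf '%s\t%s\t%s\t%s\n' \
  chr1 100 200 peak1 \
  chr1 500 650 peak2 \
  chr21 300 400 peak3 \
  chr2 50 120 peak4 \
  chr21 900 1000 peak5 > demo.bed

# Count peaks per chromosome to help choose a reasonably small one.
cut -f1 demo.bed | sort | uniq -c

# Keep only lines whose first column is exactly the chosen chromosome.
chrom=chr21
awk -F'\t' -v chrom="$chrom" '$1 == chrom' demo.bed > "demo.${chrom}.bed"

cat "demo.${chrom}.bed"
```

The exact string comparison (`$1 == chrom`) avoids accidentally matching `chr21_random`-style names, which a plain `grep chr21` would also pick up.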


@Goutham Atla: Thanks. The peaks were already called in a robust way and stored in BED files, so I don't think I need to pick an important chromosome; I think taking a sample could be an option. Should I take a sample from each chromosome? How can I do that? Could you elaborate on your answer, please? I'm sorry if my question is too simple.


When you say "sample from each chromosome", do you mean from a BAM file?


@Goutham Atla: I mean the BED file; all peaks are stored in BED format. Thanks


I think it would be better to pick just one chromosome rather than sampling peaks from the whole genome. If you sample from the whole genome, you artificially increase the distance between peaks, which may or may not be a concern.
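To illustrate the spacing effect, here is a small sketch on synthetic data (not ENCODE peaks). It assumes `awk` is available, and keeping every 10th peak is used as a deterministic stand-in for a 10% random sample. The helper name `mean_gap` is made up for this example:

```shell
# Synthetic chr21 with one peak every 1000 bp (illustrative data only).
awk 'BEGIN { OFS = "\t"; for (i = 1; i <= 100; i++) print "chr21", i * 1000, i * 1000 + 200 }' > gap_full.bed

# Keep every 10th peak as a stand-in for a 10% random sample.
awk 'NR % 10 == 0' gap_full.bed > gap_thinned.bed

# Mean distance between consecutive peak starts (input must be sorted by start).
mean_gap() {
  awk 'NR > 1 { sum += $2 - prev; n++ } { prev = $2 } END { if (n) print sum / n }' "$1"
}

full_gap=$(mean_gap gap_full.bed)
thinned_gap=$(mean_gap gap_thinned.bed)
echo "full: ${full_gap} bp, thinned: ${thinned_gap} bp"
```

Thinning to a tenth of the peaks makes the average gap roughly ten times larger, whereas keeping one whole chromosome leaves the local spacing untouched.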

By the way, a ChIP-Seq file of 100,000 peaks is quite extreme; most should be on the order of a few thousand peaks (say 1,000 to 30,000). Are you sure you are looking at ChIP-Seq for transcription factors rather than FAIRE-Seq or nucleosomes?


@dariober : Yes, I am sure that I am looking at ChIP-Seq for TFBS. Thanks

7.9 years ago

If you have GNU Parallel installed, you can use it with BEDOPS bedextract to very quickly split a BED file by chromosome (bedextract expects sorted input, for example from BEDOPS sort-bed):

$ bedextract --list-chr input.bed | parallel "bedextract {} input.bed > input.{}.bed"

You can then use my sample utility or GNU shuf to uniformly sample without replacement:

$ sample -k ${SAMPLE_SIZE} input.chrN.bed > input.chrN.sample.bed

Or:

$ shuf --head-count=${SAMPLE_SIZE} input.chrN.bed > input.chrN.sample.bed
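As a rough end-to-end sketch of the `shuf` route, assuming GNU coreutils (`shuf`, `sort`) and `awk` are available: the demo file and the small sample size are made up for illustration; for real data you would use something like the ~1000 features mentioned in this thread. Because `shuf` emits lines in random order, the sample is re-sorted by chromosome and start afterwards:

```shell
# Demo input: 50 fake peaks on chr21 (stand-in for input.chrN.bed).
awk 'BEGIN { OFS = "\t"; for (i = 1; i <= 50; i++) print "chr21", i * 1000, i * 1000 + 200, "peak" i }' > demo.chr21.bed

SAMPLE_SIZE=10   # illustrative; use a larger value for real data

# Sample without replacement, then restore coordinate order.
shuf --head-count="${SAMPLE_SIZE}" demo.chr21.bed \
  | sort -k1,1 -k2,2n > demo.chr21.sample.bed

# Check the result: number of peaks and size on disk in bytes.
wc -l < demo.chr21.sample.bed
wc -c < demo.chr21.sample.bed
```

Checking the byte count of the sampled file gives a quick idea of whether the example data will fit within a package size limit.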

Dear Alex :

Thanks for the kind instructions. How can I easily use BEDOPS tools on Windows? I intend to take a sample (around 1000 features) from each BED file and store these samples as BED files for further use. Could you show me how to use BEDOPS tools to get this example data quickly? Thank you very much :)

Best regards :

Jurat


@Alex Reynolds: I don't have GNU tools and am not familiar with using BEDOPS. Is there a list of commands that I could try directly on a Windows machine? It is a bit urgent for me to generate small example data, and BEDOPS surely has a lot of features to learn. Is there any quick solution available? Thanks again for your kind help.

Best regards :

Jurat


If you want to run Unix tools on Windows, you might try running Cygwin or setting up VirtualBox with Linux.
