Bin chromosome every 1kb and get average value
3
2
Entering edit mode
6.6 years ago
dp0b ▴ 80

Hi,

I have a large file of each chromosome for sequence data and I want to get the average score in 1kb bins? Is there an easy way to do to this?

chr1 123 127 1890
chr1 208 210 90
chr1 215 220 50
chr1 225 230 100
chr1 250 300 800

I have tried bedtools, it will create the windows but then how do I get the average of the fourth value in that window? It also seems to be quite slow.

bedtools makewindows -w 1000 -g chromosome.txt > windows.txt

Any help would be much appreciated.

genome sequence chromosome binning mean • 6.3k views
ADD COMMENT
0
Entering edit mode

Have a look at wiggletools.

ADD REPLY
4
Entering edit mode
6.6 years ago

for fun, using awk+sqlite3, window size=100:

 rm -f tmp.db && awk 'BEGIN {printf("create table T(C TEXT,S INT,E INT,V INT); BEGIN TRANSACTION;\n");}{printf("INSERT INTO T(C,S,E,V) VALUES(\"%s\",%s,%d,%s);\n",$1,$2,$3,$4);} END {printf("COMMIT; SELECT C,(S/100)*100 as G,((S/100)+1)*100,AVG(V) FROM T GROUP BY C,G;");}' input.bed | sqlite3 -separator $'\t' tmp.db && rm -f tmp.db

chr1    100 200 1890.0
chr1    200 300 260.0
ADD COMMENT
2
Entering edit mode

You are a madman, Pierre!

ADD REPLY
2
Entering edit mode

Now this is a funny definition of fun.

ADD REPLY
0
Entering edit mode

Thanks so much but getting the following error when I try it. Any ideas?

Error: near line 1: table T already exists
Error: near line 1502: cannot commit - no transaction is active
ADD REPLY
0
Entering edit mode

ah sorry, typo in my command I forgot to change the name of my db when I've copied+paste. it should be tmp.db instead of jeter.db . I'll fix;

ADD REPLY
0
Entering edit mode

Thank you! Last question now I swear, piping that screen to an output file, the usual > output.txt after the code remains empty also tried |tee. Dont want to edit the script as its way above my capabilities!

ADD REPLY
1
Entering edit mode

Where are you adding the redirect? It will need to be after tmp.db but before && rm -f...

ADD REPLY
0
Entering edit mode

Got it!!Thank you!!!

ADD REPLY
0
Entering edit mode

Seriously love this code,edit it for max,min,average values.Different distance. Works so fast too. Honestly thank you!

ADD REPLY
4
Entering edit mode
6.6 years ago
ATpoint 85k

Use bedtools map:

bedtools map -a windows.bed -b seqdata.bed -c 4 -o mean > bins_averaged.txt

-c 4 tells bedtools to perform the averaging operation (-o mean) on the 4th column of -b

ADD COMMENT
1
Entering edit mode

Even though Pierre has created a strange little masterpiece up there, I think this is going to be the best answer for most people :)

ADD REPLY
1
Entering edit mode

Works very well indeed. If I may add, for others, that one can get 'smoothed' (overlapping) windows by doing the following:

bedtools makewindows \
  -w 1000 -s 500 \
  -g /ReferenceMaterial/1000Genomes/hg19/human_g1k_v37.fasta.fai > human_g1k_v37_windows.bed ;

bedtools map \
  -a human_g1k_v37_windows.bed \
  -b GSM3444988_K562_DMSO.bed \
  -c 4 -o mean > GSM3444988_K562_DMSO_Smoothed.bed ;
ADD REPLY
4
Entering edit mode
6.6 years ago

You can use Kent utilities fetchChromSizes to get the chromosome bounds of your genome of interest, e.g. hg38:

$ fetchChromSizes hg38 | awk '($0!~"_"){ print $1"\t0\t"$2; }' | sort-bed - > hg38.bed

You can then use BEDOPS bedops --chop and bedmap --mean to generate 1000-nt windows and score the mean signal of signal.bedgraph elements that overlap those windows, resp.:

$ bedmap --echo --mean --delim '\t' <(bedops --chop 1000 hg38.bed) <(awk -vOFS="\t" '{print $1, $2, $3, ".", $4; }' signal.bedgraph | sort-bed -) > answer.bed

There are several score operators, other than --mean, such as --median, --kth, --stdev, --wmean, etc. See bedmap --help for a full listing.

By using process substitutions, this should run pretty quickly. This can also be parallelized with bedops --chrom <chr> and bedmap --chrom <chr> if you are working with very large (whole-genome scale) datasets.

ADD COMMENT

Login before adding your answer.

Traffic: 1545 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6