Sliding window using gene position file

1

Entering edit mode

10.3 years ago

burnsro ▴ 20

I have a file of genes with RPKM values and their position along a chromosome

Gene rpkm  chr start  stop
AT1G01010     4.18954 Chr1  3631  5899
AT1G01020    10.22902 Chr1  5928  8737
AT1G01030     1.99064 Chr1 11649 13714
AT1G01040    34.81453 Chr1 23146 31227
AT1G01050    56.88482 Chr1 31170 33153
AT1G01060     1.24325 Chr1 33379 37871

I would like to do a sliding window using the start position, a non overlapping window size of 20Kb along the chromosome, and calculate the median expression for that window.

I haven't been able to find a question similar to mine, most people seem to have different structured files and use bed/sam/vcf tools

I looked at this question (Binning Of Chromosome Data) but here the answer is only for the number of genes within that window, and not the expression.

RNA-seq • 6.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.3 years ago by burnsro ▴ 20

3

Entering edit mode

10.3 years ago

Alex Reynolds 36k

BEDOPS bedmap may be useful.

Take a look at bedmap --median, which calculates median signal from elements that map over reference windows.

The documentation here explains in more detail: http://bedops.readthedocs.org/en/latest/content/reference/statistics/bedmap.html

Here's your original data (say it is called expression.txt):

Gene rpkm  chr start  stop
AT1G01010     4.18954 Chr1  3631  5899
AT1G01020    10.22902 Chr1  5928  8737
AT1G01030     1.99064 Chr1 11649 13714
AT1G01040    34.81453 Chr1 23146 31227
AT1G01050    56.88482 Chr1 31170 33153
AT1G01060     1.24325 Chr1 33379 37871

You could turn it into a sorted BED file like so:

$ tail +2 expression.txt \
    | awk '{print tolower($3)"\t"$4"\t"$5"\t"$1"\t"$2}' - \
    | sort-bed - \
    > expression.bed

The file expression.bed is your signal or map file in a bedmap operation. It contains the score signal in the fifth column, per UCSC specification.

Your sliding windows are then put into a second BED file, which acts as the reference file in a bedmap operation.

Given the bounds of chromosomes, you can easily make sliding windows across each chromosome with bedops.

Say you are working with genome build hg19, which has the following extents:

	chr1 0 249250621
	chr10 0 135534747
	chr11 0 135006516
	chr12 0 133851895
	chr13 0 115169878
	chr14 0 107349540
	chr15 0 102531392
	chr16 0 90354753
	chr17 0 81195210
	chr18 0 78077248
	chr19 0 59128983
	chr2 0 243199373
	chr20 0 63025520
	chr21 0 48129895
	chr22 0 51304566
	chr3 0 198022430
	chr4 0 191154276
	chr5 0 180915260
	chr6 0 171115067
	chr7 0 159138663
	chr8 0 146364022
	chr9 0 141213431
	chrM 0 16571
	chrX 0 155270560
	chrY 0 59373566

view raw hg19.extents.bed hosted with ❤ by GitHub

You can generate non-overlapping, 20 kbase sliding windows with the --chop option against the extents file, like so:

$ bedops --chop 20000 hg19.extents.bed > sliding_windows.bed

The file sliding_windows.bed contains all the sliding windows going across each chromosome in hg19, in sorted BED format.

Now that you have the reference and map files, you can do the mapping operation with bedmap:

$ bedmap --echo --median sliding_windows.bed expression.bed > answer.bed

The file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator).

Since it makes no meaningful statistical sense to calculate a median over a sample size of 0, consider adding the --skip-unmapped operator to leave out windows over which there is no expression signal:

$ bedmap --echo --median --skip-unmapped sliding_windows.bed expression.bed > answer.bed

The revised file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator), only where there is any overlapping signal across the window (via the --skip-unmapped operator).

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.3 years ago by Alex Reynolds 36k

3

Entering edit mode

10.3 years ago

dariober 15k

How do you want to deal with genes that span more than one window? If a gene overlaps a window just by one base I guess you shouldn't assign to that window the full gene expression, right?

The procedure below assigns gene expression to windows weighing expression by % overlap (so if 50% of a gene overlaps a window, then 50% of that gene expression is assigned to that window).

Convert expression file to bed (first line, header, skipped):

awk -v OFS="\t" 'NR>1 {print $3, $4, $5, $1, $2}' expr.txt > expr.bed

Generate windows across the genome (chrom.txt is tab separated file with chromosome name and chrom size):

bedtools makewindows -g chrom.txt -w 20000 > winds.bed

This is a bit the tricky: Assign genes to windows and weigh expression:

bedtools intersect -a winds.bed -b expr.bed -wao \
| awk -v OFS="\t" '{geneSize=$6-$5; if(geneSize > 0){
                                        weightExpr=$8 * $9/geneSize}
                                    else{
                                        weightExpr=0};
                    print $0, weightExpr}' > weightdExprBywindow.bed

Summarize expression in windows:

bedtools groupby -i weightdExprBywindow.bed -g 1,2,3 -c 10 -o median

bedtools is from here: http://bedtools.readthedocs.org/en/latest/

Did I get it right?

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.3 years ago by dariober 15k

0

Entering edit mode

Looks to work well, but for the end file it doesn't seem to be able to calculate the median of the rpkm for each window (I got a median of 0 for each window) after following your steps. I think it has something to do with the intersect line, as the result from that command looks like this:

chr1    0       20000   .       -1      -1      .       -1      0       0
chr1    20000   40000   .       -1      -1      .       -1      0       0
chr1    40000   60000   .       -1      -1      .       -1      0       0

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.3 years ago by burnsro ▴ 20

0

Entering edit mode

If a window does not intersect any gene, as in the lines in your comment, then the median is zero as it should be. Make sure chromosome names are the same in your input files. Here you have chr1, in your original post you have Chr1.

ADD REPLY • link 10.3 years ago by dariober 15k

Login before adding your answer.