Sliding window using gene position file
2
1
Entering edit mode
9.5 years ago
burnsro ▴ 20

I have a file of genes with RPKM values and their position along a chromosome

Gene rpkm  chr start  stop
AT1G01010     4.18954 Chr1  3631  5899
AT1G01020    10.22902 Chr1  5928  8737
AT1G01030     1.99064 Chr1 11649 13714
AT1G01040    34.81453 Chr1 23146 31227
AT1G01050    56.88482 Chr1 31170 33153
AT1G01060     1.24325 Chr1 33379 37871

I would like to do a sliding window using the start position, a non overlapping window size of 20Kb along the chromosome, and calculate the median expression for that window.

I haven't been able to find a question similar to mine, most people seem to have different structured files and use bed/sam/vcf tools

I looked at this question (Binning Of Chromosome Data) but here the answer is only for the number of genes within that window, and not the expression.

RNA-seq • 5.8k views
ADD COMMENT
3
Entering edit mode
9.5 years ago

BEDOPS bedmap may be useful.

Take a look at bedmap --median, which calculates median signal from elements that map over reference windows.

The documentation here explains in more detail: http://bedops.readthedocs.org/en/latest/content/reference/statistics/bedmap.html

Here's your original data (say it is called expression.txt):

Gene rpkm  chr start  stop
AT1G01010     4.18954 Chr1  3631  5899
AT1G01020    10.22902 Chr1  5928  8737
AT1G01030     1.99064 Chr1 11649 13714
AT1G01040    34.81453 Chr1 23146 31227
AT1G01050    56.88482 Chr1 31170 33153
AT1G01060     1.24325 Chr1 33379 37871

You could turn it into a sorted BED file like so:

$ tail +2 expression.txt \
    | awk '{print tolower($3)"\t"$4"\t"$5"\t"$1"\t"$2}' - \
    | sort-bed - \
    > expression.bed

The file expression.bed is your signal or map file in a bedmap operation. It contains the score signal in the fifth column, per UCSC specification.

Your sliding windows are then put into a second BED file, which acts as the reference file in a bedmap operation.

Given the bounds of chromosomes, you can easily make sliding windows across each chromosome with bedops.

Say you are working with genome build hg19, which has the following extents:

You can generate non-overlapping, 20 kbase sliding windows with the --chop option against the extents file, like so:

$ bedops --chop 20000 hg19.extents.bed > sliding_windows.bed

The file sliding_windows.bed contains all the sliding windows going across each chromosome in hg19, in sorted BED format.

Now that you have the reference and map files, you can do the mapping operation with bedmap:

$ bedmap --echo --median sliding_windows.bed expression.bed > answer.bed

The file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator).

Since it makes no meaningful statistical sense to calculate a median over a sample size of 0, consider adding the --skip-unmapped operator to leave out windows over which there is no expression signal:

$ bedmap --echo --median --skip-unmapped sliding_windows.bed expression.bed > answer.bed

The revised file answer.bed contains each sliding window region (via the --echo operator), and the median expression signal over that window (via the --median operator), only where there is any overlapping signal across the window (via the --skip-unmapped operator).

ADD COMMENT
3
Entering edit mode
9.5 years ago

How do you want to deal with genes that span more than one window? If a gene overlaps a window just by one base I guess you shouldn't assign to that window the full gene expression, right?

The procedure below assigns gene expression to windows weighing expression by % overlap (so if 50% of a gene overlaps a window, then 50% of that gene expression is assigned to that window).

Convert expression file to bed (first line, header, skipped):

awk -v OFS="\t" 'NR>1 {print $3, $4, $5, $1, $2}' expr.txt > expr.bed

Generate windows across the genome (chrom.txt is tab separated file with chromosome name and chrom size):

bedtools makewindows -g chrom.txt -w 20000 > winds.bed

This is a bit the tricky: Assign genes to windows and weigh expression:

bedtools intersect -a winds.bed -b expr.bed -wao \
| awk -v OFS="\t" '{geneSize=$6-$5; if(geneSize > 0){
                                        weightExpr=$8 * $9/geneSize}
                                    else{
                                        weightExpr=0};
                    print $0, weightExpr}' > weightdExprBywindow.bed

Summarize expression in windows:

bedtools groupby -i weightdExprBywindow.bed -g 1,2,3 -c 10 -o median

bedtools is from here: http://bedtools.readthedocs.org/en/latest/

Did I get it right?

ADD COMMENT
0
Entering edit mode

Looks to work well, but for the end file it doesn't seem to be able to calculate the median of the rpkm for each window (I got a median of 0 for each window) after following your steps. I think it has something to do with the intersect line, as the result from that command looks like this:

chr1    0       20000   .       -1      -1      .       -1      0       0
chr1    20000   40000   .       -1      -1      .       -1      0       0
chr1    40000   60000   .       -1      -1      .       -1      0       0
ADD REPLY
0
Entering edit mode

If a window does not intersect any gene, as in the lines in your comment, then the median is zero as it should be. Make sure chromosome names are the same in your input files. Here you have chr1, in your original post you have Chr1.

ADD REPLY

Login before adding your answer.

Traffic: 948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6