I have two tab-delimited files: one in which each row shows the coverage of a scaffold across many samples, and one with position information (scaffold, chromosome and cM location). Please note that file 1 is not necessarily in the same scaffold order as file 2; it just happens to be in this example.
file 1:
scaffold_id sample1 sample2
Scaffold86180 10 5
Scaffold863688 20 0
Scaffold868772 23 6
Scaffold934477 1 7
Scaffold992750 23 67
Scaffold2030550 15 16
Scaffold2282709 156 17
Scaffold383332 178 1895
Scaffold3943711 10 185
Scaffold6328630 189 1975
Scaffold6682550 15 88
Scaffold844703 13 98
Scaffold876147 12 1
Scaffold1237644 12 5
Scaffold1243015 16 8
Scaffold1251638 18 36
Scaffold1422442 195 33
Scaffold1440480 14 3
file 2 (position information):
Scaffold86180 1A 0
Scaffold863688 1A 0
Scaffold868772 1A 0
Scaffold934477 1A 0
Scaffold992750 1A 0
Scaffold2030550 1A 1.7075
Scaffold2282709 1A 1.7075
Scaffold383332 1A 1.7075
Scaffold3943711 1A 1.7075
Scaffold6328630 1A 1.7075
Scaffold6682550 1A 1.7075
Scaffold844703 1A 1.7075
Scaffold876147 1A 1.7075
Scaffold1237644 1A 3.415
Scaffold1243015 1A 3.415
Scaffold1251638 1A 3.415
Scaffold1422442 1A 3.415
Scaffold1440480 1A 3.415
What I would like to do, maybe in R (or Perl/Python/Linux tools if easier), is operate on each sample column and sum the coverage by chromosome and cM position, producing something like file 3 below (note these are not real results, and the file is 2 GB, so it will not work in Excel). Any ideas on how to do this simply?
file 3:
Chrom cM samp1 samp2
1A 0 10 34
1A 1.7075 23 45
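One possible way to approach this, sketched in Python using only the standard library: load file 2 into a dictionary keyed by scaffold, then stream file 1 line by line and add each sample's coverage to the running total for that scaffold's (chromosome, cM) pair. The function name and demo data are illustrative only. The sketch assumes whitespace-separated columns, integer coverage values, and that the cM value is always written the same way for a given position (it is used as text for the key); scaffolds missing from file 2 are skipped.

```python
from collections import defaultdict


def sum_coverage_by_position(coverage_lines, position_lines):
    """Sum per-sample coverage for every (chromosome, cM) pair.

    Both arguments are iterables of text lines (open file handles work).
    """
    # scaffold -> (chromosome, cM as written in file 2)
    positions = {}
    for line in position_lines:
        fields = line.split()
        if len(fields) >= 3:
            positions[fields[0]] = (fields[1], fields[2])

    rows = iter(coverage_lines)
    samples = next(rows).split()[1:]  # header: scaffold_id sample1 sample2 ...
    totals = defaultdict(lambda: [0] * len(samples))
    for line in rows:
        fields = line.split()
        key = positions.get(fields[0]) if fields else None
        if key is None:
            continue  # blank line, or scaffold with no position in file 2
        for i, value in enumerate(fields[1:]):
            totals[key][i] += int(value)  # assumption: integer coverage
    return samples, dict(totals)


# Small demo using a few rows from the example, deliberately out of order.
demo_coverage = """scaffold_id sample1 sample2
Scaffold2030550 15 16
Scaffold86180 10 5
Scaffold863688 20 0
""".splitlines()
demo_positions = """Scaffold86180 1A 0
Scaffold863688 1A 0
Scaffold2030550 1A 1.7075
""".splitlines()

samples, totals = sum_coverage_by_position(demo_coverage, demo_positions)
print("Chrom", "cM", *samples, sep="\t")
for (chrom, cm), sums in sorted(
    totals.items(), key=lambda kv: (kv[0][0], float(kv[0][1]))
):
    print(chrom, cm, *sums, sep="\t")
```

For the real data, the two arguments could be open file handles, so that only file 2 is held in memory while the 2 GB coverage file is read one line at a time.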
Hi Ibrahim, thanks for the script. I'm just looking through it to understand it. The user specifies an arbitrary 2 cM bin; is it an easy fix to have a variable bin size according to the data, as in the example below? Because the scaffolds vary in size, the bin sizes vary too. I think just changing the line below that uses $cMlimit so that it takes the cM values from the data should do it, but I'm not sure yet how to implement the logic.
Dear rob234king,
I did what you wanted. Now, instead of the -stepSize argument, you use a -stepFile argument. This should be a HEADERLESS file with bin start values separated by newline characters (it may look as if there are double newlines here, but use a single newline between values), for example:
So here your bins will be 0-2 and 2-3. On the command line, type (just an example):
Of course, all three files are assumed to be in the same directory as the script. I paste the commented script below. Compare both scripts to see how the flow is modified.
I hope this helps,
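As a rough illustration of the bin-file idea (this is not the Perl script referred to above, only a minimal Python sketch under stated assumptions): the boundary values are read one per line, sorted, and each cM position is placed in its bin with a binary search. The sketch assumes that bins are half-open intervals [start, end), that the last value in the file closes the final bin, and that positions outside all bins are dropped. All function names and the demo values are hypothetical.

```python
from bisect import bisect_right


def read_bin_edges(lines):
    """Parse one boundary value per line, ignoring blank lines."""
    return sorted(float(line) for line in lines if line.strip())


def find_bin(cm, edges):
    """Return the (start, end) bin containing cm, or None if outside."""
    i = bisect_right(edges, cm) - 1
    if i < 0 or i >= len(edges) - 1:
        return None  # below the first edge, or at/after the last edge
    return edges[i], edges[i + 1]


def sum_by_bin(totals_by_position, edges):
    """Collapse {(chrom, cM): [sums]} into {(chrom, start, end): [sums]}."""
    binned = {}
    for (chrom, cm), sums in totals_by_position.items():
        span = find_bin(float(cm), edges)
        if span is None:
            continue  # assumption: positions outside every bin are ignored
        current = binned.setdefault((chrom,) + span, [0] * len(sums))
        for i, value in enumerate(sums):
            current[i] += value
    return binned


# Demo: boundaries 0, 2, 3 give the bins 0-2 and 2-3.
edges = read_bin_edges(["0", "2", "3"])
per_position = {
    ("1A", "0"): [30, 5],
    ("1A", "1.7075"): [15, 16],
    ("1A", "3.415"): [12, 5],  # beyond the last edge, so it is dropped
}
binned = sum_by_bin(per_position, edges)
for (chrom, start, end), sums in sorted(binned.items()):
    print(chrom, start, end, *sums, sep="\t")
```

Using a sorted list with bisect keeps each lookup logarithmic in the number of bins, which matters little for a handful of bins but helps if the boundary file is long.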