I am using VarScan to analyze somatic copy number alteration. In the recommended workflow found here:
http://varscan.sourceforge.net/copy-number-calling.html
, I am stuck at the fifth step, i.e. using mergeSegment.pl. It seems that this tool needs to be provided a reference file against the parameter --ref-arm-sizes, but I can find it nowhere. I was wondering if I can obtain this file and some more detailed instructions on using this script.
About this parameter:
--ref-arm-sizes Two column file of reference name and size in bp for calling by chromosome arm
Though it does not say this is a necessary parameter, if I run the script without the parameter, it will display something like this:
YutekiMacBook-Pro:challenge_proj yfu$ perl mergeSegments.pl 1767_chr22.copynumber
Use of uninitialized value $input in <HANDLE> at mergeSegments.pl line 446.
readline() on unopened filehandle at mergeSegments.pl line 446.
Can't use an undefined value as a symbol reference at mergeSegments.pl line 456.
If I specify this parameter, the output is like this:
YutekiMacBook-Pro:challenge_proj yfu$ perl mergeSegments.pl --ref-arm-sizes hs19.len 1767_chr22.copynumber
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 1.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 2.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 3.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 4.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 5.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 6.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 7.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line 8.
Use of uninitialized value $arm_name in concatenation (.) or string at mergeSegments.pl line 453, <GEN0> line
It seems that I need a reference file that has a specific format.
Although I haven't used this tool, to get best help please post your input command and the error you are getting. This will help give people some idea if it's a local perl module that's not installed on your system, or some other issue. Does the documentation (ie read me file) tell you about this --ref-arm-sizes parameter?
Thanks for your reply.
I have edited the post to make it more specific.
Ok, thanks for the output. What's in your hs19.len? It sounds like that is supposed to be your reference file?
Thank you for your patience. I just gave --ref-arm-sizes an arbitrary reference file (and the format must be wrong), since the authors did not mention what reference file they used (no matter in the paper or the supplemental materials or the source code).
The hs19.len looks like this:
It just contains the length of each chromosome. But judging from the name of the parameter - "" it seems that the reference file should contain the length of each chromosome arm. Unfortunately, I do not have that reference file and I can find it nowhere online. I emailed the author, she said that their team usually answer questions on biostar and that is why I ask the question here.
Your information about that hs19.len file states that it should be 2 columns. Get rid of the 1st column. The question is, what format their script expects the chromosomes in -- try a couple of variations of 1p or chr1p and 1q or chr1q, etc... and use the length of each chromosome arm. That may work.
Where can I get the length of each chromosome arm?
I do this via ucsc genome browser. Select the table browser. Use mammal/human/assembly hg19 (I'm assuming all this time we have been discussing H sapiens data, not some other organism with 46,XY karyotype). Then for group use "mapping and sequencing tracks" and for track choose "gap". Then choose region as "genome" and then set a filter => enter in "type" as "centromere telomere". For output format you can choose "hyperlinks to genome browser" which does what it says. Or you can ask for "selected fields from primary and related tables". Then hit "get output". All this gives you the location coordinate for each telomere and centromere for each chr. This gives you the length of each chr arm. (There are other ways of doing this...) Good luck!