Hi,
I would like to use the plotFreq function that is included in the copynumber package( available via BioConductor in R). I have sequencing data that I converted to a format which I think is the right one for the plotFreq function. The package itself contains a test-dataset called "lymphoma.res" and it looks like this:
> lymphoma.res[1:10,]
sampleID chrom arm start.pos end.pos n.probes mean
1 X01.B1 1 p 1082138 65355304 71 -0.0432
2 X01.B1 1 p 66680940 119515493 57 0.0462
3 X01.B2 1 p 1082138 50340843 56 -0.0442
4 X01.B2 1 p 50638118 107342958 58 0.0378
5 X01.B2 1 p 108332010 119515493 14 -0.0271
6 X01.B3 1 p 1082138 66680940 72 -0.0628
7 X01.B3 1 p 67244579 70705534 5 -0.4119
8 X01.B3 1 p 71950650 119515493 51 -0.0355
9 X03.B1 1 p 1082138 50340843 56 -0.0376
10 X03.B1 1 p 50638118 119515493 72 0.0178
This works perfectly when I run the plotFreq function. However, this plotFreq doens't run when I use my own data (a). However, I think they look quite similar:
> a[1:10,]
sampleID chrom arm start.pos end.pos n.probes mean
1 GC041544AR001 1 p 887078 1920182 10 3.658
2 GC043725AR013 1 p 1282158 2340523 10 4.884
3 GC041544AR001 1 p 2694253 7633559 10 5.635
4 GC041544AR001 1 p 7633560 10932033 10 12.726
5 GC041544AR001 1 p 11507192 14281220 10 24.825
6 GC041544AR001 1 p 15339635 16417326 10 14.902
7 GC041544AR001 1 p 17084389 18335685 10 8.939
8 GC041544AR001 1 p 18335686 21220367 10 25.313
9 GC041544AR001 1 p 21220368 25194985 10 16.479
10 GC041544AR001 1 p 25194986 26836784 10 14.579
When I compare both formats:
> mode(lymphoma.res)
[1] "list"
> mode(a)
[1] "list"
But when I try to run the plotFreq function with my data, I get the following error:
> plotFreq(segments=lymphoma.res, thres.gain=0.2, thres.loss=-0.1)
Hit <Return> to see next plot:
> plotFreq(segments=a, thres.gain=0.2, thres.loss=-0.1)
Error in `[<-.data.frame`(`*tmp*`, z$chr == chrom[j], 2 + i, value = numeric(0)) :
replacement has length zero
I only have limited knowledge of R, so I might be missing some basic things here. But does anyone have a suggestion I could try?
Thanks a lot
Are you sure that you have numeric, character/text, and factor columns where they should be?
Can you post the output of the
str()
function on both of your objects? - there, you may see a differenceThanks. Great suggestion. Indeed, some columns were not correct (for example factors instead of characters and integer instead of numeric). With the help of 'Google' I managed to convert my input files such that it contains the same columns as the test-dataset. Test-dataset (lymphoma.res):
Which corresponds to my own dataset amut:
However, when I try running the plotFreq, the same error remains as before:
Do you perhaps have some other excellent suggestions? Thanks!
Yes, these issues appear sometimes, and they can be very annoying.
What I believe is the next logical step is to look at the code for the
plotFreq
function, and to see where (and why) it throws that particular error. To view the code for any function in R, just type the name at the command line (e.g. typeplotFreq
).The code doesn't seem too difficult, but still too difficult for me. It seems I still don't understand why it is throwing me this error...
Should I assume that there is still something missing/wrong in my input file as it says that there is an error in the
data.frame
?It is impossible to solve without having a look at your whole dataframe, I fear. If you can, try uploading it somewhere and sharing the link here. I think it could be an issue with your chromosome names: check if you have some observations for each chromosome, or if some are missing (sort(unique(a$chr)) == sort(unique(lymphoma.res$chr)) should tell you if you have the same chromosome IDs in the 2 datasets, if not you can look at the elements which differ).
Selenocysteine's ideas will help. You should follow up on those too.
I believe that the error lies in the code:
getFreqData
andpullOutContent
are two inner functions of the package, which will also have their own code.In order to know which one is being called, run
is.data.frame(amut)
We actually know that amut is a dataframe from the str() output they posted above. I think it might also be something here:
But the only way to know is to have access to the actual table I guess.
I checked the chromosome names and these should be ok: the output was the same when using sort(unique(amut$chr)) and sort(unique(lymphoma.res$chr)). Indeed amut is a data.frame, as well as lymphoma.res.
You can find the link to my test file here: https://github.com/Lien-NGSdummie/R-copynumber/issues/1
I have isolated the error to the
interpolate.pcf()
function call, which is called fromgetFreqData(data)
in your above code.Here is the code for the
getFreqData()
function: https://rdrr.io/github/mathieu-lemire/copynumber38/src/R/getFreqData.rI am now trying to understand why it throws the error.
You'd be surprised. The name-spacing for variable types in R needs to improve. I've seen data-frames that fail the
is.data.frame()
function call, even when I produce them withdata.frame()
Yes indeed :)
Anyway I think the issue here is that lien's dataframe does not have data for each sample for every chromosome (while lymphoma.res does). For this reason, some combinations of sampleID and chr give an empty result, and this gets the function stuck. I think this should be fixable with an if-else inside interpolate.pcf()?
(see: ggplot(ares) + geom_point(aes(x=sampleID, y=as.factor(chrom))) vs ggplot(lymphoma.res) + geom_point(aes(x=sampleID, y=as.factor(chrom))) )
Could be! - thanks for your contribution.
Just for the OP: I have been able to successfully plot each chromosome with the following code:
However, when you try to run it all together, the error is thrown in interpolate.pcf() in the loop:
It stops at i=1 and j=15, i.e., chromosome 15. I don't know why.
It is exactly for the reason I stated above. The function does:
The first sampleID (i = 1) is "GC041544.AR001". The first chromosome for which you don't have observations for GC041544.AR001 is chromosome 15 (j = 15). When you do it chr by chr it works because if you subset by taking a single chromosome you will have data for all the sampleID from that chromosome. :)
Good spot. interpolate.pcf() should have been able to deal with such a simple error like that, though. Like you said, a simple if / ifelse
I am not sure, I just tried adding the if and it is still throwing an error at the end:
:/
Ok this can be fixed by changing the interpolate.pcf() function and adding this if clause:
}
This is the resulting plot. Plot
Does this look as it should be to you lien?
I will try to add the bugfix on their github. [edit] this is the edited version: https://github.com/selenocysteine/copynumber
Thanks Kevin for spotting the exact line which was throwing the error!!
Hi I cannot see the resulting plot. But I assume it will look ok. Could you help me out a little bit more on how to use your code? How can I integrate it within the copynumber package in R? Sorry for this probably basic question.
If you are using linux, it is pretty straightforward. There are multiple ways, I don't know all of them because I usually don't do this :D
Anyway, from the terminal:
1) create a directory anywhere on your computer and clone the git repos:
(or download the ZIP from the website https://github.com/selenocysteine/copynumber and unzip it somewhere if you don't have git)
2) cd into the parent directory of the unzipped folder of the repository (e.g. /home/lien/Desktop/ if the folder is /home/lien/Desktop/copynumber)
3) type and enter:
R CMD build copynumber
it will create a tar.gz file for the packages.
4) type and enter:
R CMD INSTALL "name_of_the_tar.gz_file"
[if it gives you an error, you have to run it with SUDO because your user is likely not to have access to the R packages directory]
5) from R:
should run the "updated" version
If you are using windows, try to look at "installing downloaded R packages" I guess?
Don't worry it is not such a basic question :D
You can also over-ride the current package function with your own function, as follows:
For this, you need to also load in the pullOutContent function though.
Good teamwork Selenocysteine! ;)
Irish bioinformatic collaboration at its best :D
Thanks a lot Kevin Blighe and Selenocysteine for all your work! It works perfectly now!
Click on the "Plot" word above to see the plot. If your local firewall rules block free image hosting sites then you may not be able to see the plot.