I am close to getting the velocyto pipeline working, but I am running into some issues that I think are related to sparse data. I am not sure what the solution would be, and would appreciate any tips or guidance on whether things look appropriate.
I am following along with this tutorial: https://github.com/BUStools/getting_started/blob/master/velocity_tutorial.ipynb
After this step:
vlm.score_cv_vs_mean(2000, plot=True, max_expr_avg=50, winsorize=True, winsor_perc=(1,99.8), svr_gamma=0.01, min_expr_cells=50)
vlm.filter_genes(by_cv_vs_mean=True)
I get the following graph, which doesn't look as smooth as the one in the example. What might this mean?
Then, after these steps (note that I had to use a seemingly quite high value of 60 for min_perc_U; otherwise I would get complaints like "min_perc_U=0.5 corresponds to total Unspliced of 1 molecule of less. Please choose higher value or filter our these cell"):
vlm.score_detection_levels(min_expr_counts=0, min_cells_express=0,
                           min_expr_counts_U=25, min_cells_express_U=20)
vlm.score_cluster_expression(min_avg_U=0.007, min_avg_S=0.06)
vlm.filter_genes(by_detection_levels=True)
vlm.normalize_by_total(plot=True, min_perc_U=60)
I get the following graph:
The step that is ultimately giving me trouble is after that:
vlm.adjust_totS_totU(normalize_total=True, fit_with_low_U=False, svr_C=1, svr_gamma=1e-04)
I get the message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I am guessing this means I have some genes that are not expressed highly enough at either the unspliced or spliced level. Does this just mean I need to do more filtering? Or is something about my dataset perhaps not optimal for velocity analysis?
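A quick way to test this guess is to look at the per-cell totals directly. The following is only a minimal sketch on small synthetic NumPy arrays, not velocyto's own code: it assumes the count matrices are laid out genes x cells (as vlm.S and vlm.U are), and the helper name summarize_cell_totals is made up for illustration. Cells whose unspliced total is zero divide by zero when normalized by their total, which is one plausible (unconfirmed) source of NaN/inf values.

```python
import numpy as np

def summarize_cell_totals(S, U, perc=0.5):
    """Hypothetical helper (not part of velocyto).

    S, U: genes x cells count arrays.
    Returns how many cells have no spliced / unspliced molecules and the
    per-cell unspliced total found at the given percentile.
    """
    tot_S = S.sum(axis=0)  # total spliced molecules per cell
    tot_U = U.sum(axis=0)  # total unspliced molecules per cell
    return {
        "cells_zero_S": int((tot_S == 0).sum()),
        "cells_zero_U": int((tot_U == 0).sum()),
        "U_total_at_perc": float(np.percentile(tot_U, perc)),
    }

# Synthetic stand-in data: 4 genes x 5 cells (not real measurements)
S = np.array([[3, 0, 1, 2, 0],
              [1, 0, 0, 4, 2],
              [0, 0, 2, 1, 1],
              [2, 0, 0, 0, 3]])
U = np.array([[1, 0, 0, 0, 0],
              [0, 0, 0, 2, 0],
              [0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0]])

summary = summarize_cell_totals(S, U)
print(summary)

# Dividing each cell by its own total gives NaN wherever the total is zero
tot_U = U.sum(axis=0)
with np.errstate(divide="ignore", invalid="ignore"):
    U_norm = U / tot_U
print(np.isnan(U_norm).any())
```

On real data, the same totals could be computed from vlm.S and vlm.U just before the failing step. If many cells have an unspliced total of zero, that would also be consistent with the min_perc_U complaint above.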
It seems the issue might have been that I did not pre-filter the sadata and uadata expression matrices to keep only cells expressing at least some minimum number of genes...
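A rough sketch of that kind of pre-filter is below, again on synthetic NumPy arrays. The assumptions here are mine: the matrices are cells x genes (the AnnData convention) with cells in the same order in both, the helper name cells_passing_min_genes is hypothetical, and the threshold of 2 genes is an arbitrary illustrative value rather than a recommendation. The mask is computed once and applied to both matrices so they stay aligned.

```python
import numpy as np

def cells_passing_min_genes(spliced, min_genes=2):
    """Hypothetical helper: boolean mask of cells detecting >= min_genes genes.

    spliced: cells x genes count array.
    """
    genes_detected = (spliced > 0).sum(axis=1)  # genes with nonzero counts, per cell
    return genes_detected >= min_genes

# Synthetic stand-in data: 4 cells x 4 genes (not real measurements)
spliced = np.array([[3, 1, 0, 2],
                    [0, 0, 0, 0],
                    [1, 0, 2, 0],
                    [0, 0, 1, 0]])
unspliced = np.array([[1, 0, 0, 1],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0],
                      [0, 1, 0, 0]])

keep = cells_passing_min_genes(spliced, min_genes=2)

# Apply the same mask to both matrices so cell order still matches
spliced_f = spliced[keep]
unspliced_f = unspliced[keep]
print(keep)
print(spliced_f.shape, unspliced_f.shape)
```

Note that a cell can pass a spliced-based gene filter and still have no unspliced molecules (the third cell in this toy data), so checking unspliced totals separately may still be worthwhile.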