Many papers run PCA on only the most variable genes (the top 500, for example), and the plots look good: cells from different conditions separate from each other. But is that reasonable? All the genes in my expression data together describe the characteristics of my samples, so selecting only the top 500 high-variance genes for PCA seems artificial. Could it be done on purpose, just to make the samples separate?
I think ideally you would use some kind of statistical method to find the genes that vary significantly across all your samples, and then run the PCA on only those genes. There isn't much point in including genes that don't vary significantly, since they contribute little information to the principal components.
I am not sure there are any good, easy methods for doing that right now, which is probably why you see so many papers simply plotting the top X most variable genes.
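For what it's worth, here is a minimal sketch of the "top X most variable genes" approach in Python with NumPy, so you can compare it against a PCA on all genes yourself. The function name `top_variance_pca`, the toy data, and the assumption that the matrix is samples x genes and already log-transformed are all mine, not taken from any particular paper. Note that the ranking uses only per-gene variance and never sees the condition labels.

```python
import numpy as np

def top_variance_pca(expr, n_top=500, n_components=2):
    """PCA on the n_top most variable genes.

    expr: samples x genes matrix (assumed already normalized/log-transformed).
    Returns sample scores, fraction of variance explained, and selected gene indices.
    """
    # Per-gene variance across samples; condition labels are never used here
    gene_var = expr.var(axis=0, ddof=1)
    n_top = min(n_top, expr.shape[1])
    top_idx = np.argsort(gene_var)[::-1][:n_top]
    sub = expr[:, top_idx]

    # Center each gene, then compute the PCA via SVD
    centered = sub - sub.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components], top_idx

# Hypothetical toy data: 2 conditions x 6 samples, 2000 genes,
# where only the first 50 genes differ between conditions.
rng = np.random.default_rng(0)
n_per_group, n_genes = 6, 2000
expr = rng.normal(0.0, 1.0, size=(2 * n_per_group, n_genes))
expr[n_per_group:, :50] += 4.0

scores_top, var_top, top_idx = top_variance_pca(expr, n_top=500)
scores_all, var_all, _ = top_variance_pca(expr, n_top=n_genes)

print("PC1 scores (top 500 genes):", np.round(scores_top[:, 0], 1))
print("Variance explained by PC1, top 500 vs all genes:", var_top[0], var_all[0])
```

Running both versions on your own data and comparing the plots is probably the easiest way to see how much the gene selection changes the picture.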