I have been using a random forest regression model to predict a phenotype from SNP data and then measuring the correlation (R) between the measured and predicted phenotype. The input data are SNP genotypes at sites across the genome for multiple individuals. I then measure the importance of each SNP using permutation importance; some SNPs receive a negative value. If I remove these negative-importance SNPs and refit the model on the reduced SNP set, I find that the R value improves. I can go through several iterations of this until reaching a maximal R value.
For example, I used 50,000 SNPs to correlate measured and predicted phenotype, which gives an R² of 0.7. I then measure the permutation importance of each of the 50,000 SNPs. Removing the SNPs with negative permutation importance leaves 37,000 SNPs, which used as random forest input give an R² of 0.75. This process continues until reaching a maximal R² of 0.8 with 25,000 SNPs.
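For concreteness, here is a minimal sketch of the loop I am running (scikit-learn; the genotype matrix `X`, phenotype `y`, and all hyperparameters below are placeholders rather than my real data and settings):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real genotypes/phenotype:
# rows = individuals, columns = SNPs coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 2000)).astype(float)
y = X[:, :20] @ rng.normal(size=20) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kept = np.arange(X.shape[1])              # start with every SNP
for it in range(10):                      # iterate until R stops improving
    rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X_train[:, kept], y_train)

    # Correlation between measured and predicted phenotype.
    r, _ = pearsonr(y_test, rf.predict(X_test[:, kept]))
    print(f"iteration {it}: {len(kept)} SNPs, R = {r:.3f}")

    # Permutation importance of each remaining SNP.
    imp = permutation_importance(rf, X_test[:, kept], y_test,
                                 n_repeats=10, random_state=0, n_jobs=-1)

    # Drop SNPs whose mean permutation importance is negative.
    keep_mask = imp.importances_mean > 0
    if keep_mask.all() or not keep_mask.any():
        break
    kept = kept[keep_mask]
```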
Is there any issue with this approach?
In any kind of data analysis, removing whatever hurts the fit (outlying observations, or here the SNPs with negative importance) will tend to give a better correlation, so an improved R² after this kind of filtering is not, by itself, evidence of a better model.
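One way to see how much of that improvement is selection bias rather than real signal is to rerun the same prune-and-refit loop after destroying any true relationship, for example on pure noise or on a permuted phenotype: any rise in R across rounds then has to come from the selection itself. A minimal sketch of that check, assuming scikit-learn and the same loop structure as above:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Pure noise: the "SNPs" carry no information about the "phenotype".
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 500))
y = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

kept = np.arange(X.shape[1])
for it in range(6):
    rf = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=1)
    rf.fit(X_tr[:, kept], y_tr)
    r, _ = pearsonr(y_te, rf.predict(X_te[:, kept]))
    print(f"round {it}: {len(kept)} features, R = {r:.3f}")

    # Importance is computed on the same held-out data used to report R,
    # so each round keeps the features that happened to look useful there.
    imp = permutation_importance(rf, X_te[:, kept], y_te,
                                 n_repeats=10, random_state=1, n_jobs=-1)
    keep_mask = imp.importances_mean > 0
    if keep_mask.all() or not keep_mask.any():
        break
    kept = kept[keep_mask]
```

If R climbs on data like this, the same mechanism can be inflating the R² on the real SNPs, and the filtered model should be validated on data that played no part in the SNP selection.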