Do any statistical/ML software tools explicitly incorporate reusable holdout, where one uses thresholding, noise, or bootstrapping in holdout validation to prevent garden-of-forking-paths or overfitting issues?
I feel like the paper describing the method made a splash when it came out in 2015, but I haven't seen much in the way of implementations, at least in the R ecosystem: https://doi.org/10.48550/arXiv.1411.2664
Seems like something the #tidymodels team might think about? @topepo @juliasilge #rstats
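For anyone unfamiliar, here's a rough sketch of the Thresholdout idea from the paper as I understand it (the paper's actual noise scales, constants, and query-budget accounting are more careful than this):

```r
# Sketch of the Thresholdout idea: answer each query from the training set
# unless it disagrees with the holdout set by more than a noisy threshold.
# The constants below are illustrative, not the paper's recommended values.
rlaplace <- function(n, scale) scale * (rexp(n) - rexp(n))

thresholdout <- function(train_stat, holdout_stat, threshold = 0.04, sigma = 0.01) {
  if (abs(train_stat - holdout_stat) > threshold + rlaplace(1, 2 * sigma)) {
    # Disagreement: report a noised holdout value (the full algorithm also
    # decrements a query budget here).
    holdout_stat + rlaplace(1, sigma)
  } else {
    # Agreement: report the training value, so nothing new leaks about the holdout.
    train_stat
  }
}

# e.g., training accuracy 0.93 vs holdout accuracy 0.88
thresholdout(0.93, 0.88)
```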
Our basic resampling tools can indirectly (and sometimes directly) address this. Imagine someone writing an analysis, resampling it, and getting better inference. I know that’s a pretty generic answer (also, I don't know much about differential privacy).
On the prediction side, there are qualitative choices that we can make that are in the same vein as p-hacking. Feature selection is the main one that comes to mind. We resample the crap out of that since the qualitative decisions (x1 in or out) create a huge amount of variability in the results. caret also does a nested resampling strategy when there is also tuning.
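As a quick sketch of the selection part (simulated data, arbitrary subset sizes; rfFuncs needs the randomForest package), rfe() repeats the entire feature selection process inside each resample so those in/out decisions contribute to the measured variability:

```r
library(caret)

# Simulated two-class data from caret's built-in generator
set.seed(1)
dat <- twoClassSim(200)

# The whole selection procedure is rerun inside every resample
rfe_res <- rfe(
  x = dat[, setdiff(names(dat), "Class")],
  y = dat$Class,
  sizes = c(2, 4, 8),
  rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 5)
)
rfe_res
```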
If there is something more specific, it would be interesting to talk about.
@topepo @juliasilge Just curious, I don't know much about it either! The one time I've made use of something like this in a project pipeline, I bootstrapped the holdout data every time I reported performance (setting a seed for bootstrapping based on the hash of the model coefficients for reproducibility). This introduced noise into the performance metric but prevented over-training on the holdout set.
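Roughly what I did, as a sketch with made-up names (a model that has coef() and a holdout frame with a y column), using digest() to turn the coefficients into a seed:

```r
library(digest)

# Bootstrap the holdout rows each time performance is reported, seeding the
# bootstrap from a hash of the fitted coefficients so the same model always
# gets the same noisy estimate.
report_holdout_rmse <- function(fit, holdout) {
  seed <- strtoi(substr(digest(coef(fit)), 1, 7), base = 16L)
  set.seed(seed)
  idx <- sample(nrow(holdout), replace = TRUE)
  boot <- holdout[idx, ]
  sqrt(mean((boot$y - predict(fit, newdata = boot))^2))
}

# e.g. with a toy linear model and a held-out data frame `holdout`
# fit <- lm(y ~ ., data = train)
# report_holdout_rmse(fit, holdout)
```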
@topepo @juliasilge Some of the other approaches work on the change in performance between evaluations (e.g., only reporting an improvement in performance if it exceeds some threshold over the last value). So there would need to be some memory in the workflow.
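The memory part could be as simple as a closure that remembers the last reported value; a rough sketch with an arbitrary threshold:

```r
# Report a new value only if it improves on the last reported value by more
# than the threshold; otherwise keep repeating the old value.
make_thresholded_reporter <- function(threshold = 0.01) {
  last <- NULL
  function(new_value) {
    if (is.null(last) || new_value > last + threshold) {
      last <<- new_value
    }
    last
  }
}

report <- make_thresholded_reporter(threshold = 0.01)
report(0.80)  # 0.80 (first value is always reported)
report(0.805) # still 0.80; improvement is below the threshold
report(0.83)  # 0.83
```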
I thought of you because I just read that neat postprocessing post, and this seemed like yet another consideration in data splitting.