Do any statistical/ML software tools explicitly incorporate reusable holdout, where one uses thresholding, noise, or bootstrapping in holdout validation to prevent garden-of-forking-paths or overfitting issues?
I feel like the paper describing the method made a splash when it came out in 2015, but I haven't seen much in the way of implementations, at least in the R ecosystem: https://doi.org/10.48550/arXiv.1411.2664
Seems like something the #tidymodels team might think about? @topepo @juliasilge #rstats
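For anyone unfamiliar, here's a rough sketch of the Thresholdout idea from the paper as I understand it (the paper's actual noise scales, constants, and query-budget accounting are more careful than this):

```r
# Sketch of the Thresholdout idea: answer each query from the training set
# unless it disagrees with the holdout set by more than a noisy threshold.
# The constants below are illustrative, not the paper's recommended values.
rlaplace <- function(n, scale) scale * (rexp(n) - rexp(n))

thresholdout <- function(train_stat, holdout_stat, threshold = 0.04, sigma = 0.01) {
  if (abs(train_stat - holdout_stat) > threshold + rlaplace(1, 2 * sigma)) {
    # Disagreement: report a noised holdout value (the full algorithm also
    # decrements a query budget here).
    holdout_stat + rlaplace(1, sigma)
  } else {
    # Agreement: report the training value, so nothing new leaks about the holdout.
    train_stat
  }
}

# e.g., training accuracy 0.93 vs holdout accuracy 0.88
thresholdout(0.93, 0.88)
```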
Our basic resampling tools can indirectly (and sometimes directly) address this. Imagine someone writing an analysis, resampling it, and getting better inference. I know that’s a pretty generic answer (also, I don't know much about differential privacy).
On the prediction side, there are qualitative choices that we can make that are in the same vein as p-hacking. Feature selection is the main one that comes to mind. We resample the crap out of that since the qualitative decisions (x1 in or out) create a huge amount of variability in the results. caret also does a nested resampling strategy when there is also tuning.
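As a quick sketch of the selection part (simulated data, arbitrary subset sizes; rfFuncs needs the randomForest package), rfe() repeats the entire feature selection process inside each resample so those in/out decisions contribute to the measured variability:

```r
library(caret)

# Simulated two-class data from caret's built-in generator
set.seed(1)
dat <- twoClassSim(200)

# The whole selection procedure is rerun inside every resample
rfe_res <- rfe(
  x = dat[, setdiff(names(dat), "Class")],
  y = dat$Class,
  sizes = c(2, 4, 8),
  rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 5)
)
rfe_res
```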
If there is something more specific, it would be interesting to talk about.
@topepo @juliasilge Just curious, I don't know much about it either! The one time I've made use of something like this in a project pipeline, I bootstrapped the holdout data every time I reported performance (setting a seed for bootstrapping based on the hash of the model coefficients for reproducibility). This introduced noise into the performance metric but prevented over-training on the holdout set.
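Roughly what I did, as a sketch with made-up names (a model that has coef() and a holdout frame with a y column), using digest() to turn the coefficients into a seed:

```r
library(digest)

# Bootstrap the holdout rows each time performance is reported, seeding the
# bootstrap from a hash of the fitted coefficients so the same model always
# gets the same noisy estimate.
report_holdout_rmse <- function(fit, holdout) {
  seed <- strtoi(substr(digest(coef(fit)), 1, 7), base = 16L)
  set.seed(seed)
  idx <- sample(nrow(holdout), replace = TRUE)
  boot <- holdout[idx, ]
  sqrt(mean((boot$y - predict(fit, newdata = boot))^2))
}

# e.g. with a toy linear model and a held-out data frame `holdout`
# fit <- lm(y ~ ., data = train)
# report_holdout_rmse(fit, holdout)
```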
@topepo @juliasilge Some of the other approaches work on the change in performance between evaluations (e.g., only reporting an improvement in performance if it exceeds some threshold over the last value). So there would need to be some memory in the workflow.
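The memory part could be as simple as a closure that remembers the last reported value; a rough sketch with an arbitrary threshold:

```r
# Report a new value only if it improves on the last reported value by more
# than the threshold; otherwise keep repeating the old value.
make_thresholded_reporter <- function(threshold = 0.01) {
  last <- NULL
  function(new_value) {
    if (is.null(last) || new_value > last + threshold) {
      last <<- new_value
    }
    last
  }
}

report <- make_thresholded_reporter(threshold = 0.01)
report(0.80)  # 0.80 (first value is always reported)
report(0.805) # still 0.80; improvement is below the threshold
report(0.83)  # 0.83
```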
I thought of you because I just read that neat postprocessing post, and this seemed like yet another consideration in data splitting.