We have developed a new Visual Studio Code extension that detects data leakage (mainly preprocessing, overlap, and multi-test leakage) in Jupyter Notebook files. Data leakage occurs when information from the test set influences model training in data science code, leading to overly optimistic performance estimates. Beyond detection, we implemented two Quick Fix correction mechanisms: a conventional, rule-based approach that rewrites the leaking code directly, and an LLM-driven approach that guides ML developers toward best practices for building ML pipelines.
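As a minimal sketch of the preprocessing-leakage pattern mentioned above (this example is illustrative and not taken from the extension itself), fitting a scaler on the full dataset lets test-set statistics leak into training, while fitting on the training split alone avoids this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset for illustration.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Leaky: the scaler sees the test rows before the split,
# so test statistics influence the training features.
X_leaky = StandardScaler().fit_transform(X)
X_train_l, X_test_l, _, _ = train_test_split(X_leaky, y, random_state=0)

# Correct: split first, fit the scaler on the training split only,
# then apply the fitted scaler to the test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

In the correct version, the test split is standardized with statistics computed from the training data alone, which is the kind of fix a Quick Fix for preprocessing leakage would steer developers toward.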
According to the paper "Data Leakage in Notebooks: Static Detection and Better Processes", many model designers fail to properly separate their test data from their training and evaluation data. Our extension for the VS Code IDE identifies instances of data leakage in ML code and suggests how to remove them.