Data Leakage Detector VS Code Extension

We have developed a Visual Studio Code extension that detects data leakage in Jupyter Notebook files, mainly preprocessing, overlap, and multi-test leakage. Data leakage occurs when information from the test data influences how a model is trained, which leads to inaccurate, typically inflated, performance estimates. Beyond detection, the extension offers two correction mechanisms named Quick Fix: a conventional approach that manually fixes the leakage and an LLM-driven approach that guides ML developers toward best practices for building ML pipelines.
According to the paper "Data Leakage in Notebooks: Static Detection and Better Processes", many model designers do not effectively separate their test data from their training and evaluation data. This extension for the VS Code IDE identifies instances of data leakage in ML code and suggests how to remove them.
Data Leakage Research Paper Link
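To make two of the leakage types named above concrete, here is a minimal, self-contained sketch. It is an illustration only and not how the extension detects leakage; the toy data and the helper names `standardize` and `overlapping_rows` are hypothetical and do not come from the extension or the paper. The first part contrasts a leaky preprocessing step (scaling statistics computed on all rows) with a leak-free one (statistics computed on the training rows only). The second part is a simple check for overlap between the two splits. Multi-test leakage is not covered here.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


# Toy feature column (hypothetical); the last two values act as held-out test data.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train, test = data[:4], data[4:]

# Leaky: statistics are computed on ALL rows, so test values shape the training features.
leaky_mu, leaky_sigma = mean(data), pstdev(data)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Leak-free: statistics come from the training rows only and are reused on the test rows.
train_mu, train_sigma = mean(train), pstdev(train)
clean_train = standardize(train, train_mu, train_sigma)
clean_test = standardize(test, train_mu, train_sigma)


def overlapping_rows(train_rows, test_rows):
    """Return rows that appear in both splits (a simple overlap-leakage check)."""
    return sorted(set(train_rows) & set(test_rows))
```

In the leaky version the two large test values pull the mean far away from the training data, so the model would see training features that already encode information about the test set. Calling `overlapping_rows(train, test)` on the toy split returns an empty list because the splits share no values.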

Get Started

Learn how to install and use the Leakage Detector VS Code extension with its dependencies.

Extension Preview

Full Demonstration

Contributors

Arnav Marchareddy
Jeffrey Busold
Michael Socas
Owen Truong
Ryan Lee
Terrence Zhang