We have developed a new Visual Studio Code extension that detects data leakage (mainly preprocessing, overlap, and multi-test leakage) in Jupyter Notebook files. Data leakage occurs when information from the test set influences model training in data science code, leading to overly optimistic performance estimates. Beyond detection, we implemented two Quick Fix correction mechanisms: a conventional, rule-based approach that rewrites the leaking code directly, and an LLM-driven approach that guides ML developers toward best practices for building ML pipelines.
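As a minimal sketch of the preprocessing-leakage pattern mentioned above (this example is illustrative and not taken from the extension itself), fitting a scaler on the full dataset lets test-set statistics leak into training, while fitting on the training split alone avoids this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset for illustration.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Leaky: the scaler sees the test rows before the split,
# so test statistics influence the training features.
X_leaky = StandardScaler().fit_transform(X)
X_train_l, X_test_l, _, _ = train_test_split(X_leaky, y, random_state=0)

# Correct: split first, fit the scaler on the training split only,
# then apply the fitted scaler to the test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

In the correct version, the test split is standardized with statistics computed from the training data alone, which is the kind of fix a Quick Fix for preprocessing leakage would steer developers toward.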
According to the paper "Data Leakage in Notebooks: Static Detection and Better Processes", many model designers fail to properly separate their test data from their training and evaluation data. Our extension for the VS Code IDE identifies instances of data leakage in ML code and suggests how to remove them.