Data leakage in machine learning occurs when a model uses information during training that would not be available at prediction time, such as the test data. This produces overly optimistic, invalid predictive models. Data leakage often stems from poor practices in machine learning code, ranging from obvious mistakes, such as incorporating test data into the training set, to subtler errors, such as inadvertently revealing the test data distribution by preprocessing the data before the train/test split. Learn more about the different data leakage types in the links below.
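To make the preprocessing case concrete, here is a minimal sketch, assuming scikit-learn and synthetic data (the variable names and the use of `StandardScaler` are illustrative, not taken from the example files). Fitting the scaler on the full dataset lets statistics from the test rows leak into training; the safe version fits the scaler on the training rows only.

```python
# A minimal sketch of preprocessing leakage, using synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# LEAKY: the scaler is fit on all rows, so test-set statistics
# (mean, variance) influence the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# SAFE: split first, fit the scaler on training data only,
# then apply the same (already-fitted) transform to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```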
Here are some example Jupyter Notebook files, along with their original Python equivalents, that demonstrate data leakage in machine learning models. There are more Python files than Jupyter Notebook files because not all of the original Python files were converted to Jupyter Notebook format.
Let's examine the data leakage in the example files. For instance, the Jupyter Notebook file nb_362989.ipynb contains two unique preprocessing leakages and one unique multi-test leakage. The original Python file, nb_362989.py, contains the same leakages.
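A multi-test leakage arises when the same test set is evaluated repeatedly, for example while tuning a hyperparameter, so it effectively becomes part of model selection. The following is a minimal sketch under the same synthetic-data assumptions as above; it is illustrative and does not reproduce the code in nb_362989.py.

```python
# A minimal sketch of multi-test leakage: the test set is scored once
# per candidate hyperparameter AND used to pick the winner, so the
# reported "test" score no longer measures generalization.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c).fit(X_tr, y_tr)
    score = model.score(X_te, y_te)  # LEAKY: repeated test-set use
    if score > best_score:
        best_c, best_score = c, score
print(best_c, best_score)

# SAFE alternative: select the hyperparameter on a separate validation
# split (or via cross-validation on the training set), and evaluate on
# the test set exactly once at the end.
```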
Download the example files below to see how data leakage can occur.