Get Started

Learn how to install and use the Leakage Detector VS Code extension with its dependencies.

About Data Leakage

Data leakage in machine learning occurs when a model uses information during training that would not be available at prediction time, such as test data. The result is a model whose measured performance is overly optimistic and therefore invalid. Data leakage often stems from poor practices in machine learning code. These range from obvious mistakes, like incorporating test data into the training set, to more subtle errors, such as inadvertently revealing the test data distribution through preprocessing performed before training. Learn more about the different data leakage types in the links below.
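To make the preprocessing case concrete, here is a minimal sketch using only the Python standard library. The toy data and the `standardize` helper are illustrative assumptions, not code from the extension or the example files. It contrasts scaling statistics computed over all data with statistics computed on the training split alone.

```python
import statistics


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical toy feature column, already split into training and test parts.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the mean and standard deviation are computed over train + test,
# so the test values influence the features the model is trained on.
leaky_mean = statistics.mean(train + test)
leaky_std = statistics.pstdev(train + test)
leaky_train = standardize(train, leaky_mean, leaky_std)

# Leak-free: the statistics come from the training split only and are
# then reused, unchanged, to transform the test split.
clean_mean = statistics.mean(train)
clean_std = statistics.pstdev(train)
clean_train = standardize(train, clean_mean, clean_std)
clean_test = standardize(test, clean_mean, clean_std)
```

The two versions differ only in which rows feed the statistics, which is why this kind of leakage is easy to overlook in real code.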

Example Data Leakage Files

Below are example Jupyter Notebook files, along with the original Python files they were converted from, that demonstrate data leakage in machine learning models. There are more Python files than Jupyter Notebook files because not every original Python file was converted.
For example, nb_362989.ipynb is a Jupyter Notebook file that contains 2 unique preprocessing leakages and 1 unique multi-test leakage. The original Python file, nb_362989.py, contains the same leakages.
Download the example files below to see how data leakage can occur.
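As a rough illustration of the multi-test leakage mentioned above, the sketch below assumes the common meaning of the term: the same test set is evaluated repeatedly and ends up guiding model selection. The data, the candidate thresholds, and the `accuracy` helper are hypothetical and are not taken from nb_362989.

```python
def accuracy(threshold, rows):
    """Fraction of (feature, label) rows where `feature >= threshold` matches the label."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)


# Hypothetical held-out splits of (feature, label) pairs.
validation = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test = [(0.1, 0), (0.5, 0), (0.7, 1), (0.3, 0)]
candidates = [0.25, 0.45, 0.65]  # candidate decision thresholds

# Leaky: every candidate is scored on the test set, and the best test score
# decides which one is kept, so the reported score is no longer unbiased.
leaky_choice = max(candidates, key=lambda t: accuracy(t, test))
leaky_reported = accuracy(leaky_choice, test)

# Leak-free: the validation set selects the candidate, and the test set
# is evaluated exactly once for the final report.
clean_choice = max(candidates, key=lambda t: accuracy(t, validation))
final_test_accuracy = accuracy(clean_choice, test)
```

In this toy setup the leaky procedure reports a higher test accuracy than the leak-free one, which mirrors the overly optimistic results described earlier.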

Source Code

Go to Plugin Code

Website Code

Go to Website Code

Weekly Slides

Fall 2023 Resource Page

Go to Resource Page