Data leakage in machine learning occurs when a model uses information during training that would not be available at prediction time, such as the test data. This produces overly optimistic, invalid predictive models. Data leakage often stems from poor practices in machine learning code, ranging from obvious mistakes, such as incorporating test data into the training set, to subtler errors, such as inadvertently revealing the test data distribution by preprocessing the data before the train/test split. Learn more about the different data leakage types in the links below.
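To make the preprocessing case concrete, here is a minimal sketch, assuming scikit-learn and synthetic data (the variable names and the use of `StandardScaler` are illustrative, not taken from the example files). Fitting the scaler on the full dataset lets statistics from the test rows leak into training; the safe version fits the scaler on the training rows only.

```python
# A minimal sketch of preprocessing leakage, using synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# LEAKY: the scaler is fit on all rows, so test-set statistics
# (mean, variance) influence the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# SAFE: split first, fit the scaler on training data only,
# then apply the same (already-fitted) transform to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print(model.score(scaler.transform(X_te), y_te))
```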
Here are some example Jupyter Notebook files, along with their original Python equivalents, that demonstrate data leakage in machine learning models. There are more Python files than Jupyter Notebook files because not all of the original Python files were converted to Jupyter Notebook format.
Let's examine the data leakage in the example files. For instance, the Jupyter Notebook file nb_362989.ipynb contains two unique preprocessing leakages and one unique multi-test leakage. The original Python file, nb_362989.py, contains the same leakages.
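A multi-test leakage arises when the same test set is evaluated repeatedly, for example while tuning a hyperparameter, so it effectively becomes part of model selection. The following is a minimal sketch under the same synthetic-data assumptions as above; it is illustrative and does not reproduce the code in nb_362989.py.

```python
# A minimal sketch of multi-test leakage: the test set is scored once
# per candidate hyperparameter AND used to pick the winner, so the
# reported "test" score no longer measures generalization.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

best_c, best_score = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c).fit(X_tr, y_tr)
    score = model.score(X_te, y_te)  # LEAKY: repeated test-set use
    if score > best_score:
        best_c, best_score = c, score
print(best_c, best_score)

# SAFE alternative: select the hyperparameter on a separate validation
# split (or via cross-validation on the training set), and evaluate on
# the test set exactly once at the end.
```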
Download the example files below to see how data leakage can occur.