What is Multi-test Leakage?

Multi-test leakage occurs when the same test data is used in more than one evaluation, so information from the test set leaks into modeling decisions and the tests are no longer independent. Once a test set has influenced model selection or tuning, any later score on that set overstates how well the model will perform on truly unseen data.

Causes of Multi-test Leakage:

The potential for multi-test leakage arises when the same test set is evaluated more than once, most commonly when test-set scores are used to compare models or tune hyperparameters and the same set is then used again for the final evaluation. Each reuse feeds information about the test data back into the modeling process.

Solutions for Multi-test Leakage:

Fix multi-test leakage by enforcing strict separation between the data used for model development and the data used for final evaluation. Implementing protocols that prevent cross-experiment data sharing and using independent datasets for each test can safeguard against this type of leakage. Some techniques to prevent multi-test leakage include:

- Scoring the held-out test set exactly once, after all model choices are final.
- Using a separate validation set, or cross-validation within the training data, for model selection and hyperparameter tuning (sketched below).
- Loading a fresh, independent test set whenever an additional evaluation is needed.
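As a minimal sketch of the cross-validation technique, assuming a scikit-learn workflow (the dataset, model, and parameter grid here are illustrative, not taken from the extension):

```python
# A protocol that avoids multi-test leakage: model selection happens via
# cross-validation inside the training split, and the held-out test set
# is scored exactly once at the end.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning uses only the training folds.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
).fit(X_train, y_train)

# The single, final use of the test set.
print("test accuracy:", search.score(X_test, y_test))
```

Because GridSearchCV only ever sees X_train, the score at the end reflects genuinely unseen data.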

Example of Multi-test Leakage Code

Pretend that the examples shown below are Jupyter Notebook files. In the notebook code, X_test is used more than once (on lines 14 and 18), hence the multi-test leakage.
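Since the notebook itself appears as screenshots, here is a hypothetical reconstruction of the leaky pattern; the dataset, SelectPercentile selector, and LogisticRegression model are assumptions chosen to match the behavior described on this page, not the exact code in the images:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Feature selection fit on the full dataset (a separate pre-processing
# leakage issue that resurfaces in the Copilot result below).
selector = SelectPercentile(percentile=20)
X_selected = selector.fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.25, random_state=0)

# First use of X_test: test scores steer hyperparameter selection.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_C, best_score = C, score

# Second use of X_test: the "final" evaluation is no longer unbiased.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("final test accuracy:", final_model.score(X_test, y_test))
```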

How Quick Fix Would Be Performed

Our VS Code extension can perform a Quick Fix for multi-test leakage, either manually or through the GitHub Copilot AI-based Quick Fix. The variable associated with the data leakage is highlighted in red, and hovering over it opens a pop-up that reads "Data Leakage: MultiTestLeakage." The pop-up displays three options at the bottom; select the one labeled "Quick Fix" to open the Quick Fix menu. From there, select one of the light bulb icons to perform the manual Quick Fix, or select "Fix using Copilot" to perform the Copilot AI-based Quick Fix. Fixing with Copilot requires the GitHub Copilot VS Code extension, as discussed in the installation guide. These Quick Fix options attempt to resolve the data leakage.

Manual Quick Fix Result

Once the manual Quick Fix was performed, three lines were appended to the Jupyter Notebook, as shown below, introducing a new test-data evaluation step to address the multi-test leakage. The appended lines load a separate test dataset (X_X_test_new, y_X_test_new), transform it with the already-fitted feature selection model, and evaluate the final model's performance on this new data. Because the evaluation now runs on data that played no part in training or model selection, it provides a more accurate assessment of the model's performance.
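Continuing the reconstruction above, the appended cell might look roughly like the following; load_new_test_data is a hypothetical placeholder for whatever fresh data source the project has, not a function generated by the Quick Fix:

```python
# Hypothetical sketch of the appended fix: load genuinely new test data,
# apply the already-fitted selector, and score the final model once.
# load_new_test_data() is a placeholder for a real source of unseen data.
X_X_test_new, y_X_test_new = load_new_test_data()
X_X_test_new = selector.transform(X_X_test_new)  # reuse the fitted selector
print("new test accuracy:", final_model.score(X_X_test_new, y_X_test_new))
```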

GitHub Copilot Quick Fix Result

Once the Copilot Quick Fix was performed, the code was modified to introduce a validation set (X_val, y_val) for model scoring, keeping it separate from the final test set to reduce multi-test leakage. However, because the SelectPercentile feature selector was fit on the entire dataset before splitting, scoring on X_val inadvertently introduced pre-processing data leakage, surfacing the "vectorizer fit on train and test data together" issue.
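Continuing the same reconstruction, the Copilot-modified cell could look roughly like this; note that X_val is still derived from X_selected, which the selector produced after being fit on the full dataset, which is exactly the residual pre-processing leakage described above:

```python
# Sketch of the Copilot-style fix: carve out a validation set for model
# selection so X_test is scored only once. X_selected still comes from a
# selector fit on the whole dataset, so pre-processing leakage remains.
X_train, X_temp, y_train, y_temp = train_test_split(
    X_selected, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation set steers selection
    if score > best_score:
        best_C, best_score = C, score

final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("final test accuracy:", final_model.score(X_test, y_test))  # single use
```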

Other Data Leakage Types: