What is Overlap Leakage?

Overlap leakage is the unintentional sharing or overlap of information between the training and test datasets of a machine learning model.

Causes of Overlap Leakage:

This can occur when the same or highly similar data points are present in both the training and test sets. Training the model on data that shares information with the test set leads to overly optimistic performance estimates, and the resulting model may not generalize well to new, unseen data.

Solutions for Overlap Leakage:

Fix overlap leakage by evaluating on an independent test set, ensuring that no data points from the training set appear in the test set. This preserves the integrity of model evaluation by keeping the training and testing data independent. Techniques to prevent overlap leakage include splitting the data before any resampling or transformation, deduplicating records across splits (as sketched below), and keeping related samples, such as records from the same subject, in the same split.
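For instance, one simple way to check for exact-duplicate overlap between two splits is to merge them on all columns with pandas. This is a minimal sketch, assuming small illustrative DataFrames with made-up column names:

```python
import pandas as pd

# Toy splits; column names "f1"/"f2" are assumptions for illustration.
train = pd.DataFrame({"f1": [1, 2, 3], "f2": [4, 5, 6]})
test = pd.DataFrame({"f1": [3, 7], "f2": [6, 8]})

# Rows of `test` that also appear in `train` are potential overlap leakage.
overlap = test.merge(train, on=list(test.columns), how="inner")
print(len(overlap))  # 1 duplicated row in this toy example

# Keep only test rows that do NOT appear in the training set.
flagged = test.merge(train, on=list(test.columns), how="left", indicator=True)
test_clean = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
```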

Example of Overlap Leakage Code

Treat the examples shown below as cells from Jupyter Notebook files. In the code below, line 9 calls fit_resample(X, y) before train_test_split, which applies SMOTE oversampling to the full dataset before it is split.
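The following is a minimal sketch of such a leaky notebook cell; the dataset path, column names, and classifier are assumptions for illustration:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("data.csv")                     # hypothetical dataset
X = data.drop(columns=["target"])                  # hypothetical feature/target names
y = data["target"]
X_res, y_res = SMOTE().fit_resample(X, y)          # line 9: oversampling BEFORE the split
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_test, y_test))                 # overly optimistic estimate
```

Because SMOTE runs first, synthetic points interpolated from the same original samples can land on both sides of the split, so the test set is no longer independent of the training set.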

How Quick Fix Would Be Performed

Our VS Code extension can fix overlap leakage through either a manual Quick Fix or the GitHub Copilot AI-based Quick Fix. The variable associated with the data leakage is highlighted in red; hovering over it shows a pop-up that reads "Data Leakage: OverlapLeakage." The pop-up displays three options at the bottom; select "Quick Fix" to open the Quick Fix menu. From there, select one of the light bulb icons to perform the manual Quick Fix, or select "Fix using Copilot" to perform the Copilot AI-based Quick Fix. Fixing with Copilot requires the GitHub Copilot VS Code extension, as discussed in the installation guide. Both Quick Fix options attempt to resolve the data leakage.

Manual Quick Fix Result

The manual quick fix introduces a new dataset, X_X_train_new, separate from the original data, along with a corresponding transformation step. These changes ensure that the transformation and the model evaluation are performed on entirely new data, preventing overlap between training and test data. The added evaluation and transformation steps help maintain independence between datasets. Note, however, that reusing the transformed training data for evaluation can inadvertently reintroduce overlap if it is not handled separately.
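As a hedged illustration only, the idea behind the manual fix is to score the model on genuinely independent data. The exact code the extension generates may differ; the file and column names below are assumptions, and the sketch continues from the leaky cell above (where `pd` and `model` are already defined):

```python
# Hedged sketch; names are assumptions, not the extension's exact output.
X_X_train_new = pd.read_csv("new_data.csv")       # hypothetical independent dataset
y_new = X_X_train_new.pop("target")               # hypothetical target column
print(model.score(X_X_train_new, y_new))          # evaluation untouched by SMOTE
```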

GitHub Copilot Quick Fix Result

The Copilot quick fix addresses the overlap leakage by changing the point at which SMOTE oversampling is applied. In the original code, oversampling ran before the data split, which could lead to overlap between the training and test datasets. Applying SMOTE only to the training data, after the initial train-test split, prevents any overlap or bias that could arise from synthetic data influencing both the training and testing phases. Consequently, model evaluation remains valid and unbiased, as the test data stays independent and reflects real-world scenarios.
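In terms of the running example, the corrected ordering looks roughly like the sketch below; variable names are assumed, and Copilot's actual output may differ:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("data.csv")                     # hypothetical dataset
X = data.drop(columns=["target"])
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42           # split FIRST, on the original data
)
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)  # oversample training split only

model = RandomForestClassifier(random_state=42).fit(X_train_res, y_train_res)
print(model.score(X_test, y_test))                 # X_test contains no synthetic points
```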

Other Data Leakage Types: