What is Preprocessing Leakage?

Preprocessing leakage occurs when information from the test set influences the preprocessing steps applied to the training set, typically because a data-driven step such as feature selection, scaling, or imputation is fit on the full dataset before the train/test split.

Causes of Preprocessing Leakage:

Preprocessing leakage is caused by fitting data-driven preprocessing steps, such as feature selection, on data that includes the test set. Because the model is then evaluated on data it has indirectly already seen, this leads to overly optimistic performance estimates during model development and can result in poor generalization to new, unseen data.

Solutions for Preprocessing Leakage:

Fix preprocessing leakage by moving feature selection after the train/test split and ensuring that no data-driven preprocessing step is influenced by test data. This lets you accurately evaluate your model's performance on data it has never seen during training. Techniques to prevent preprocessing leakage include splitting the data before fitting any preprocessing, fitting transformers on the training set only and then applying them to the test set, and wrapping data-driven steps in a pipeline so they are re-fit inside each cross-validation fold (see the sketch below).
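One common safeguard, not specific to this extension, is to wrap data-driven preprocessing in a scikit-learn Pipeline so it is re-fit on the training portion of every split. The sketch below is illustrative only: the synthetic dataset, the choice of k, and the SelectKBest + GradientBoostingClassifier pairing are assumptions, not taken from the extension's own example.

```python
# Minimal sketch: a Pipeline keeps SelectKBest fitted on training folds only,
# so cross-validation scores are not inflated by preprocessing leakage.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each cross-validation fold re-fits SelectKBest on its training portion,
# then applies the already-fitted selector to the held-out portion.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("gbc", GradientBoostingClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```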

Example of Preprocessing Leakage Code

Treat the examples shown below as cells in a Jupyter Notebook file. The code below contains preprocessing leakage because there is no train/test split before the feature selection.
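The following is a minimal sketch of such a leaky cell; the synthetic dataset and the value of k are hypothetical, and the extension's own example may differ, but the structure (SelectKBest fit on the full dataset before the split) is what triggers the diagnostic.

```python
# Sketch of a leaky notebook cell: SelectKBest is fit on the FULL dataset
# before the train/test split, so its feature scores already reflect test rows.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leakage: feature selection sees the test data because the split has not happened yet.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0
)

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
print(accuracy_score(y_test, y_pred))  # likely an optimistic estimate
```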

How the Quick Fix Is Performed

Our VS Code extension can resolve preprocessing leakage through either a manual Quick Fix or the GitHub Copilot AI-based Quick Fix. The variable associated with the data leakage is highlighted in red; hovering over it shows a pop-up that reads "Data Leakage: PreProcessingLeakage." The pop-up displays three options at the bottom; select the one labeled "Quick Fix" to open the Quick Fix menu. From there, select one of the light-bulb options to apply the manual Quick Fix, or select "Fix using Copilot" to apply the Copilot AI-based Quick Fix. Fixing with Copilot requires the GitHub Copilot VS Code extension, which is discussed in the installation guide. Both Quick Fix options attempt to resolve the data leakage.

Manual Quick Fix Result

In the original code, SelectKBest was applied to the entire dataset, which caused the leakage. The manual Quick Fix splits the data before the feature selection is fit, so X_test is no longer influenced by the training process. Predictions can still be made with y_pred = gbc.predict(X_test), and the evaluation now reflects performance on data the model has not seen during training.

GitHub Copilot Quick Fix Result

The Copilot Quick Fix applies SelectKBest only after the data is split, fitting the feature selection on the training set (X_train) alone. X_test therefore stays independent of the feature-selection process, which preserves the integrity of the test data and resolves the preprocessing leakage.
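Both Quick Fix results converge on the same structure: split first, then fit SelectKBest on the training data only. The sketch below shows that structure under the same hypothetical dataset as the leaky example above; the actual Quick Fix output may differ in detail.

```python
# Sketch of the corrected cell: the split happens first, SelectKBest is fit on
# X_train only, and the already-fitted selector merely transforms X_test.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Feature selection is fit on the training set only.
selector = SelectKBest(score_func=f_classif, k=10)
X_train = selector.fit_transform(X_train, y_train)
X_test = selector.transform(X_test)  # transformed using training-derived scores only

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)  # X_test was never used to fit any preprocessing
print(accuracy_score(y_test, y_pred))
```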

Other Data Leakage Types: