The challenge of fraud detection in the banking and e-commerce industries represents a quintessential problem for classical machine learning. At its core, fraud detection is a binary classification problem: a transaction is either fraudulent or it is not. With 1,097,231 instances and 201 attributes (190 continuous and 11 categorical), this dataset epitomizes a large-scale, high-dimensional problem that is ripe for machine learning exploration.
For the technically inclined, think of this as a gigantic matrix where each row is a transaction and each column is a piece of information about that transaction – be it the device type used, product features, or the location of the transaction. But what makes this dataset particularly challenging, and yet intriguing, is the class imbalance. A mere 3.50% of these transactions are fraudulent, leaving non-fraudulent transactions overwhelmingly predominant at 96.50%. This skews a naively trained model towards predicting most transactions as non-fraudulent: a model that simply labels every transaction "not fraud" is already right 96.50% of the time, which is exactly why raw accuracy is a misleading metric here.
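To make the imbalance concrete, here is a runnable sketch of the class-balance check, using a synthetic stand-in for the real transaction table (the isFraud column name and the simulated labels are illustrative assumptions, not the actual data):

```python
# A runnable sketch of the class-imbalance check, using a synthetic
# stand-in for the real transaction table (names/values are illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_097_231
df = pd.DataFrame({"isFraud": rng.random(n) < 0.035})  # ~3.5% positives

# Relative class frequencies: roughly 0.965 non-fraud vs 0.035 fraud.
print(df["isFraud"].value_counts(normalize=True))
```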
The fraud detection landscape, especially with an imbalanced dataset such as this one, presents many challenges. Yet some solutions stand out for their simplicity and effectiveness. In our endeavor, we found merit in grounding ourselves in the basics: standard scaling for the continuous variables, to ensure they are on the same scale, and undersampling to address the class imbalance.
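A minimal sketch of this preprocessing, using scikit-learn's StandardScaler and imbalanced-learn's RandomUnderSampler on synthetic data (the feature counts are placeholders, and the 50/50 target ratio is simply the sampler's default):

```python
# A minimal sketch of the preprocessing described above: standard scaling
# for continuous features plus random undersampling of the majority class.
# The synthetic data and dimensions are illustrative assumptions.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.965], random_state=42
)

# Put continuous features on a common scale (in a real pipeline, fit the
# scaler on the training split only, to avoid leakage into the test set).
X_scaled = StandardScaler().fit_transform(X)

# Drop majority-class rows until the two classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_scaled, y)
print(Counter(y), "->", Counter(y_res))
```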
While many algorithms are at our disposal for such problems, our focus is on a select few that have repeatedly demonstrated robustness. Decision trees, with their innate ability to capture non-linear patterns, are a natural choice. But to strengthen this baseline, boosted trees come into the frame: by fitting each new tree to the errors of the ensemble so far, boosting iteratively reduces bias and often outperforms standalone models.
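The sketch below contrasts the two on synthetic data; the hyperparameter values are placeholders, not our tuned settings:

```python
# A sketch comparing a single decision tree with a boosted ensemble.
# Data and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A single axis-aligned tree capturing non-linear patterns.
tree = DecisionTreeClassifier(max_depth=8, random_state=42).fit(X_train, y_train)

# Boosting: each round fits a shallow tree to the ensemble's current errors.
boost = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
).fit(X_train, y_train)

for name, model in [("tree", tree), ("boosting", boost)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")
```

Note the use of ROC AUC rather than accuracy, for the reason discussed above: under heavy class imbalance, accuracy rewards the trivial "never fraud" model.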
Yet the real curiosity lies in our proprietary oblique tree algorithm, which we believe holds the potential to redefine the standards in fraud detection. Unlike a standard decision tree, which splits on one feature at a time, an oblique tree splits on linear combinations of features, letting it carve decision boundaries at an angle to the axes. In this exploration, we will juxtapose traditional decision trees and boosted trees against our oblique tree, gauging their strengths and identifying areas for improvement.
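We cannot reproduce the proprietary algorithm here, but the toy predicate below illustrates the general idea that distinguishes oblique splits from axis-aligned ones (all weights, thresholds, and feature values are made up for illustration):

```python
# Not our algorithm: a toy illustration of the general oblique-tree idea,
# i.e., splitting on a weighted combination of features instead of one.
import numpy as np

x = np.array([0.7, 1.2, -0.3])          # one transaction's feature vector
w, t = np.array([0.5, -0.2, 1.0]), 0.1  # illustrative weights and threshold

axis_aligned_split = x[0] > t      # classic decision tree: one feature
oblique_split = np.dot(w, x) > t   # oblique tree: linear combination
print(axis_aligned_split, oblique_split)
```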
In essence, this isn’t just about detecting fraud; it’s about refining the tools at our disposal and seeking the most effective blend of technique and insight.
After selecting the models, we needed to find the best hyperparameters for each one, which is where grid search came in. Grid search is a tuning technique that searches exhaustively through a manually specified subset of a learning algorithm's hyperparameter space, evaluating every combination and keeping the one that performs best. In our case, we performed an extensive grid search for each model to find the hyperparameters that give the best performance.
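A minimal sketch of what such a search looks like with scikit-learn's GridSearchCV; the grid below is deliberately small and illustrative, not the full grid we actually searched:

```python
# A minimal grid-search sketch with GridSearchCV. The parameter grid and
# synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

# Exhaustively evaluate every combination with 5-fold cross-validation,
# scoring by ROC AUC (safer than accuracy under class imbalance).
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```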
Finally, we trained our models. This step is crucial, as it involves feeding the models our preprocessed data, enabling them to learn and make predictions. During this process, we carefully monitored training to ensure good convergence, meaning that the model's learning had stabilized and it was ready to make predictions. We also took precautions against overfitting, a common problem where the model learns the training data too well and fails to generalize to unseen data.
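One common way to get both convergence monitoring and overfitting protection in a single mechanism is validation-based early stopping. The sketch below uses scikit-learn's built-in support for this in gradient boosting; the specific thresholds are illustrative assumptions:

```python
# A sketch of training with convergence monitoring via early stopping,
# using GradientBoostingClassifier's built-in validation-based stopping.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=42)

# Hold out 10% of the training data internally; stop adding trees once the
# validation score has not improved for 10 consecutive rounds.
model = GradientBoostingClassifier(
    n_estimators=1_000,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)
model.fit(X, y)

# n_estimators_ reports how many boosting rounds were actually fitted.
print(f"stopped after {model.n_estimators_} of 1000 rounds")
```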
Interpretability is another crucial aspect of fraud detection models. While complex machine learning models might offer slightly higher accuracy, their “black box” nature makes it challenging for stakeholders to understand the basis of their decisions. A model’s predictions need to be explainable to auditors, regulators, and sometimes even to the customers themselves.
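As one simple illustration of why trees score well on this front, scikit-learn's export_text prints a fitted tree as human-readable if/else rules, so any individual prediction can be traced to explicit feature thresholds (the data and feature names below are synthetic placeholders):

```python
# A sketch of one basic interpretability tool: printing a fitted tree's
# decision rules. Data and feature names are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=5_000, n_features=5, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Human-readable if/else rules, one line per split.
print(export_text(tree, feature_names=[f"feat_{i}" for i in range(5)]))
```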