Data-driven problem solver with extensive experience in Banking and Technology (Payments), focusing on Strategic Planning, Product Management, and Data Analytics. MBA from London Business School.
Data Science | FinTech | Payment Processing | Banking
Applying machine learning models to an e-commerce transactions dataset, which contains a wide range of features from device type to product attributes, to detect fraudulent transactions and improve the efficacy of alerts, reducing fraud losses as well as the hassle of false positives.
Photo by Paul Felberbauer on Unsplash
The data comes from real-world e-commerce transactions, source: IEEE-CIS Fraud Detection
- `TransactionDT`: timedelta from a given reference datetime (not an actual timestamp)
- `TransactionAMT`: transaction payment amount in USD
- `ProductCD`: product code, the product for each transaction
- `card1`-`card6`: payment card information, such as card type, card category, issuing bank, country, etc.
- `addr`: address
- `dist`: distance
- `C1`-`C14`: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- `D1`-`D15`: timedelta, such as days between previous transactions, etc.
- `M1`-`M9`: match, such as names on card and address, etc.

Figure: Percentage of Fraudulent Transactions | Transaction Amount Distribution across Two Classes
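A quick check of the class imbalance motivates the first figure above. The sketch below uses a tiny synthetic frame standing in for the Kaggle `train_transaction.csv` (column names follow the dataset; the values are illustrative, not real data):

```python
import pandas as pd

# Toy stand-in for train_transaction.csv (isFraud is the binary target,
# 1 = fraudulent transaction; amounts are made up for illustration).
train = pd.DataFrame({
    "TransactionAmt": [57.95, 29.00, 107.95, 59.00, 50.00, 226.00],
    "isFraud":        [0,     0,     1,      0,     0,     0],
})

# Share of fraudulent transactions (the real dataset is ~3.5% positive).
fraud_rate = train["isFraud"].mean()
print(f"Fraudulent transactions: {fraud_rate:.2%}")

# Transaction amount per class (median shown for brevity).
per_class_median = train.groupby("isFraud")["TransactionAmt"].median()
print(per_class_median)
```

On the real data, the same two lines of `groupby` output are what the amount-distribution comparison across the two classes is built from.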
Figure: Transaction Distribution across Card Types
Figure: Transaction Distribution across Email Domains
The counting columns record masked counts (such as how many addresses are found to be associated with the payment card). Since this counting data is heavily right-skewed, we look further into the higher quantile values. Interestingly, for most of the columns the fraud class has much higher values; only columns C4 and C9 show the opposite pattern.
Figure: Quantiles for Counting Variables
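The quantile comparison above can be sketched as follows. Because the counting columns are heavily right-skewed, high quantiles per class are more informative than means; the toy values below are assumptions standing in for the real C columns:

```python
import pandas as pd

# Toy stand-in for the masked counting columns (values are illustrative).
df = pd.DataFrame({
    "isFraud": [0, 0, 0, 0, 1, 1],
    "C1":      [1, 1, 2, 3, 40, 90],   # fraud tends to have higher counts
    "C4":      [5, 8, 9, 12, 0, 1],    # C4 shows the opposite pattern
})

# 90th and 99th percentile of each counting column within each class.
high_q = df.groupby("isFraud")[["C1", "C4"]].quantile([0.90, 0.99])
print(high_q)
```

Running the same `groupby(...).quantile([...])` over C1-C14 on the full dataset produces the quantile table summarized in the figure.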
card6 (card type), card2 (card number), P_emaildomain (purchaser email domain), and C5 & C7 (counting information) are identified by both models. V264, an engineered feature, ranks top in both models; however, its number of splits is not among the top, which indicates the significant impact of this feature on the target. It would be worth looking further at the split nodes of this feature.

Figure: Feature Importance (XGBoost) | Feature Importance (LGBM)