Generalized Additive Models in Fraud Detection and Pattern Recognition
Data Science Capstone Project
Literature Review
Pingping Zhou
Fraud detection has become increasingly reliant on advanced statistical and machine learning approaches due to the complexity and evolving nature of fraudulent behaviors. Among these approaches, Generalized Additive Models (GAMs) provide a balance between predictive flexibility and interpretability, making them especially suitable for domains where explainability is critical, such as finance, auditing, and cybersecurity.
Tragouda et al. (2024) highlight GAMs’ ability to balance interpretability and predictive performance in fraud detection tasks such as bank cheque fraud. Although challenges such as imbalanced data and shifting fraud patterns held precision to 5.6% while recall remained at 77.8%, GAMs provided clear explanations for regulators. Combining GAMs with other models can enhance accuracy while keeping results interpretable for legal and ethical oversight.
Miller (2025) investigates GAMs for identifying fraudulent financial statements, often hidden in complex accounting data. GAMs, combined with models like random forests, detect irregular revenue patterns and generate interpretable visualizations for auditors. Although effective, GAMs may miss sophisticated frauds involving multiple interacting factors. They provide a strong balance of accuracy and clarity for early detection of financial fraud.
Hanagandi et al. (2023) explore regularized generalized linear models, including Ridge, Lasso, and ElasticNet, for detecting credit card fraud in highly imbalanced datasets (0.17% fraud cases). These models, similar to GAMs, capture complex transaction patterns while remaining interpretable. Ridge regression achieved high accuracy (up to 98.2%). The study highlights that careful data preparation is crucial for effective real-time fraud detection in banking environments.
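As a minimal sketch of the kind of model Hanagandi et al. evaluate, the following fits an elastic-net-regularized logistic regression with the glmnet package in R; the simulated data and the class-weighting choice are illustrative assumptions, not details from the study.

```r
library(glmnet)

# Simulated stand-in for a transaction feature matrix X and a rare
# 0/1 fraud label y (real data would come from the banking system).
set.seed(1)
n <- 10000; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(-6 + X[, 1] + 0.5 * X[, 2]))

# alpha = 0 gives ridge, alpha = 1 gives lasso; 0.5 is an elastic net.
# Up-weighting the rare fraud class partially offsets the imbalance.
w <- ifelse(y == 1, sum(y == 0) / max(sum(y == 1), 1), 1)
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5, weights = w)

# Fraud probabilities at the cross-validated penalty strength.
p_hat <- predict(cv_fit, newx = X, s = "lambda.min", type = "response")
```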
Brossart et al. (2015) discuss the application of GAMs to Medicare claims data for identifying fraudulent billing and overcharging. GAMs effectively detect unusual patterns and provide clear visualizations, which enhance auditor trust. While highly interpretable, they can be less adaptive to emerging fraud patterns compared to more complex models, but their transparency makes them valuable for healthcare fraud investigations.
Chang et al. (2022) introduced Graph Neural Additive Networks (GNANs) as an extension of Generalized Additive Models (GAMs) for graph-structured data, enabling fraud detection in domains such as financial transaction networks and social platforms. GNANs combine graph neural networks with additive modeling, capturing complex relational patterns while maintaining interpretability through simple visualizations. Their approach achieved strong predictive performance, reaching 84.5% ROC-AUC in detecting banned or suspicious users, while also providing clear, auditable explanations that satisfy regulatory compliance.
In telecom fraud detection, Zhang et al. (2025) introduced a graph-based framework that used Generalized Additive Models (GAMs) as a baseline for comparison. The study highlighted that GAMs are effective for modeling nonlinear patterns in sequential data, such as call frequency and duration, which are important for detecting fraudulent behavior. However, graph neural networks (GNNs) outperformed GAMs at capturing complex network interactions, suggesting that GAMs are better suited to simpler fraud detection tasks where interpretability is prioritized over modeling intricate relationships.
Grace Allen
Hastie & Tibshirani (1986) introduced GAMs as an extension of generalized linear models by replacing the linear predictor with an additive combination of smooth functions of covariates. This approach provides flexibility to capture nonlinear effects while remaining interpretable, making GAMs a balance between fully parametric and nonparametric methods. Their work established the foundation for modern GAM applications, including the backfitting algorithm, smoothing methods, and practical ways to select model complexity.
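Concretely, for a response variable Y with link function g and covariates x_1, …, x_p, the model takes the form

\[ g\big(\mathbb{E}[Y \mid x_1, \dots, x_p]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_j), \]

where each smooth function f_j is estimated from the data (in the original formulation, by backfitting with scatterplot smoothers) rather than being constrained to a linear term \(\beta_j x_j\).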
Guisan et al. (2002) reviewed the application of generalized linear and additive models in ecology, with a focus on species distributions. They noted that while generalized linear models are easier to interpret, they are limited in their ability to capture curved patterns in ecological data. GAMs, on the other hand, better represent nonlinear species–environment relationships but carry risks of overfitting, reduced interpretability, and misleading predictions when applied outside observed data ranges. The authors emphasized careful variable selection, model validation, and consideration of data quality.
White et al. (2020) provided a tutorial that used GAMs to evaluate alcohol consumption as a health exposure variable. By applying GAMs in R and SAS, they demonstrated how nonlinear associations can be identified where linear assumptions fail. Their results showed that GAMs are powerful for visualizing complex patterns but have limitations when it comes to making formal inference on nonlinear effects. This makes them especially useful as exploratory tools, though the subjectivity involved in smoothing parameter selection was also noted as a limitation.
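As a minimal sketch in the spirit of the R workflow White et al. describe (the dataset name health and the variables alcohol, age, and sbp are hypothetical):

```r
library(mgcv)

# Model systolic blood pressure as a smooth function of weekly
# alcohol consumption, adjusting for a smooth effect of age.
m <- gam(sbp ~ s(alcohol) + s(age), data = health, method = "REML")

summary(m)                        # approximate tests for the smooth terms
plot(m, pages = 1, shade = TRUE)  # fitted curves with confidence bands
```

The plot of s(alcohol) is exactly the exploratory visualization the tutorial emphasizes: a curve that can reveal J-shaped or threshold associations that a linear term would miss.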
Miller (2025) presented GAMs from a Bayesian perspective, framing smoothing penalties and variable selection as Bayesian priors. This approach highlights how uncertainty can be measured using credible intervals rather than traditional confidence intervals, which may be more informative in sparse data situations. Miller also discussed shrinkage methods that help simplify GAMs by reducing unnecessary terms. However, most GAMs still rely on empirical Bayes approaches rather than fully Bayesian models, and smoothing remains partly subjective.
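Parts of this perspective are already built into mgcv: the intervals around its smooths are Bayesian credible intervals, and shrinkage bases implement the kind of term removal Miller discusses. A hedged sketch, with model and data hypothetical:

```r
library(mgcv)

# bs = "ts" is a shrinkage basis: the whole smooth can be penalized
# toward zero, acting like a variable-selection prior.
m <- gam(y ~ s(x1, bs = "ts") + s(x2, bs = "ts"),
         data = d, method = "REML")

# Standard errors from predict() come from the Bayesian posterior
# covariance of the coefficients, yielding credible intervals.
pr    <- predict(m, se.fit = TRUE)
upper <- pr$fit + 2 * pr$se.fit
lower <- pr$fit - 2 * pr$se.fit
```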
Wood (2025) reviewed advances in GAM methodology, including computational improvements that allow GAMs to be applied to larger datasets. He emphasized automated smoothing parameter selection, scalable algorithms, and extensions to models beyond the mean, such as location–scale modeling and quantile regression. His work also highlighted the use of tensor-product splines and isotropic smooths for modeling interactions across variables with different scales, making GAMs applicable to more complex research questions.
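Several of these advances are exposed directly in mgcv; a brief hedged sketch (data and variables hypothetical):

```r
library(mgcv)

# bam() uses scalable fitting for large datasets; te() builds a
# tensor-product smooth for an interaction between covariates on
# different scales (e.g., time of day and transaction amount).
m1 <- bam(y ~ te(time, amount) + s(x1),
          data = big_d, family = binomial, discrete = TRUE)

# Location-scale model: separate smooths for the mean and the
# standard deviation of a Gaussian response.
m2 <- gam(list(z ~ s(x1), ~ s(x1)), family = gaulss(), data = d)
```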
Detmer (2025) tested GAMs as tools for detecting thresholds in ecological systems, such as abrupt changes in species response to environmental variables. Through simulation studies and a case application to Pacific hake distribution, the study found that GAMs perform well under conditions with long, high-quality datasets but are less reliable with short or noisy time series. The findings stress the importance of data quality and model validation, as GAMs may detect false thresholds if conditions are not suitable.
Sonya Melton
Zhu et al. (2023) present a Generative Adversarial Network (GAN) framework designed to enhance fraud detection through synthetic data generation. Their approach mitigates imbalanced class distributions and strengthens model robustness by training competing generator and discriminator networks. The system demonstrates improved detection rates in simulated financial environments but acknowledges ongoing challenges in stability and ethical use of generated data.
Agarwal et al. (2021) extend interpretability further through Neural Additive Models (NAMs), which merge the transparency of GAMs with the representational power of neural networks. By assigning a small neural sub-network to each predictor, NAMs model feature-specific nonlinearities while preserving visibility into how each input contributes to predictions. The authors demonstrate the method’s strong performance in financial, healthcare, and risk assessment domains.
Complementing these efforts, GAMformer (2023) introduces a transformer-based approach for fitting GAMs through in-context learning. This innovation eliminates the need for time-intensive iterative optimization, enabling faster estimation of the smooth shape functions. While it performs competitively against Explainable Boosting Machines and other interpretable frameworks, it faces scalability limits as data complexity grows.
Functional Generalized Additive Models (FGAM, 2015) offer another expansion, extending additive modeling to functional predictors such as temporal or spatial signals. Employing penalized tensor-product B-splines, FGAM captures nonlinear effects across continuous domains, with empirical success in brain imaging studies. Although computationally demanding, this framework demonstrates that interpretability can coexist with flexibility even in high-dimensional functional settings.
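In the notation common in the functional data literature, a scalar response Y with a functional predictor X(t) observed over a domain T is modeled as

\[ g\big(\mathbb{E}[Y \mid X]\big) = \theta_0 + \int_{T} F\big(X(t), t\big)\, dt, \]

where the bivariate surface F is represented with penalized tensor-product B-splines; the classical functional linear model is recovered as the special case \(F(x, t) = \beta(t)\,x\).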
Dynamic Generalized Additive Models (DGAM, 2021) incorporate latent temporal components, enabling GAMs to perform robust forecasting in dynamic environments. Through the mvgam R package, DGAMs facilitate multi-series forecasting with improved uncertainty quantification, outperforming traditional GAMs in ecological and environmental applications.
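A generic form of this model class, written here as an illustrative assumption rather than the exact mvgam specification, augments the additive predictor with a latent dynamic term:

\[ g\big(\mathbb{E}[y_t]\big) = \beta_0 + \sum_j f_j(x_{j,t}) + z_t, \qquad z_t = \rho\, z_{t-1} + \varepsilon_t, \]

where z_t is a latent process (here an AR(1)) that absorbs temporal dependence left unexplained by the smooth covariate effects and propagates that uncertainty into the forecasts.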
Lastly, the gam.hp package (2020) enhances interpretability by offering a principled measure of variable importance within GAMs. By decomposing shared and individual variance components, it provides clearer insights into each predictor’s role, as illustrated in air quality analyses. This contribution supports more transparent communication of model results, a vital step in responsible data-driven decision-making.
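A minimal usage sketch, assuming the package’s gam.hp() function accepts a fitted mgcv model as described in its documentation (the air quality data and variables are hypothetical):

```r
library(mgcv)
library(gam.hp)

# Fit a GAM for a pollutant concentration, then partition the
# explained deviance into individual and shared contributions.
m <- gam(no2 ~ s(temp) + s(wind) + s(humidity), data = air)
gam.hp(m)
```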
Kesi Allen
The evolution of Generalized Additive Models (GAMs) reflects the ongoing pursuit of balance between flexibility, interpretability, and statistical rigor in predictive modeling. Introduced by Hastie & Tibshirani (1990) as a natural extension of Generalized Linear Models (GLMs), GAMs replaced the linear predictor with an additive structure of smooth functions, allowing the response variable to vary non-linearly with each covariate. This innovation made it possible to model intricate data relationships that linear methods could not capture, while maintaining the interpretability essential to applied fields such as finance, medicine, and social science.
Early theoretical frameworks treated GAMs primarily as a statistical tool; however, subsequent developments transformed them into a practical modeling approach supported by robust computation. Simon N. Wood’s foundational text Generalized Additive Models: An Introduction with R marked a turning point by introducing the mgcv package, which automates smoothness selection, penalization, and model inference (Wood, 2017). The package’s underlying methodology uses penalized likelihood estimation, where the degree of smoothness is optimized to prevent overfitting—a crucial safeguard in high-variance or rare-event settings such as fraud detection. Wood’s later work (2025) further refined these techniques, addressing implementation issues related to convergence, concurvity, and computational efficiency (Wood, 2025).
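In this framework, each smooth term is expanded in a spline basis with coefficient vector \(\beta\), and the fit maximizes a penalized log-likelihood of the form

\[ \ell_p(\beta) = \ell(\beta) - \tfrac{1}{2} \sum_j \lambda_j\, \beta^{\mathsf{T}} S_j \beta, \]

where S_j is a known penalty matrix measuring the wiggliness of the j-th smooth and the smoothing parameters \(\lambda_j\) are selected automatically (for example by REML or GCV); driving \(\lambda_j\) upward shrinks the smooth toward a simpler shape, which is the overfitting safeguard described above.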
Complementing these advances, Zlaoui (2018), “A (very) quick introduction to GAMs” provided a concise yet accessible explanation of GAM principles, illustrating how smooth functions replace fixed coefficients to capture non-linear effects. Zlaoui’s applied examples using the mgcv syntax (gam(y ~ s(x1), …)) highlight the model’s intuitive structure—each predictor contributes a smooth, interpretable curve to the overall prediction. Meanwhile, the Carnegie Mellon University lecture notes by HalDa (2012) bridge the conceptual gap between GLMs and GAMs, emphasizing the role of the link function, variance function, and iteratively reweighted least squares (IRLS) estimation in forming the statistical backbone of additive modeling. These materials collectively establish the theoretical and computational foundations of modern GAM practice.
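To make the IRLS backbone concrete, the following is a hedged sketch of the classical algorithm for logistic regression, the GLM case those notes build on; in a GAM, the weighted least-squares step is replaced by a weighted penalized additive fit.

```r
# Iteratively reweighted least squares for logistic regression.
# X: n x p design matrix (including an intercept column); y: 0/1 response.
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 50) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- X %*% beta                # linear predictor
    mu  <- plogis(eta)               # inverse link: fitted probabilities
    w   <- as.vector(mu * (1 - mu))  # weights from the variance function
    z   <- eta + (y - mu) / w        # working response
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  as.vector(beta)
}
```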
Beyond classical GAMs, recent research has extended the additive modeling paradigm to machine learning through Explainable Boosting Machines (EBMs) and Generalized Additive Models with pairwise interactions (GA²Ms). Developed under Microsoft’s InterpretML framework, the EBM algorithm merges the interpretability of GAMs with the predictive strength of gradient boosting (Lou et al., 2012). EBMs learn feature-specific “shape functions”, which resemble GAM smooth terms, but allow for faster, iterative refinement using boosting ensembles. Optional pairwise interaction terms capture limited dependencies between features (e.g., merchant × geography) without introducing the opacity of deep models. This design aligns with regulatory and operational needs in domains like fraud analytics, where models must justify their decisions in human-readable form.
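The GA²M structure that underlies EBMs can be written as the additive form augmented with selected pairwise terms,

\[ g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_j f_j(x_j) + \sum_{(j,k)} f_{jk}(x_j, x_k), \]

where each f_j and f_{jk} is learned through many rounds of shallow boosting, one feature (or feature pair) at a time, and can then be plotted directly as a shape function.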
In the context of fraud detection, these developments are especially significant. Fraud data typically exhibit non-linear, threshold-based, and imbalanced characteristics, where rare fraudulent events are overshadowed by a majority of legitimate transactions. Traditional models like logistic regression may fail to identify the subtle curvature or saturation effects that define risk behavior. Conversely, black-box systems such as random forests or neural networks, while accurate, lack the transparency necessary for audit and compliance purposes. GAMs and EBMs provide a middle ground: they model fraud risk through transparent, data-driven curves that can be visualized and validated against expert expectations.
Recent literature in applied fraud analytics underscores the importance of interpretability and adaptive modeling. Dal Pozzolo et al. (2014) emphasize that credit card fraud detection systems must balance recall (catching fraud) with precision (avoiding false positives), particularly under concept drift—the natural evolution of fraud tactics over time. GAMs are well-suited to these challenges because their structure allows for regular recalibration, feature-wise monitoring, and penalized smoothness adjustments. Combined with techniques such as time-based cross-validation, class weighting, and hybrid ensemble integration, additive models can remain both interpretable and competitive in performance.
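A minimal sketch of one such recipe pairs a time-ordered split with class weighting in an mgcv fit; the data frame txns, its variables, and the cutoff date are illustrative assumptions.

```r
library(mgcv)

# Train on earlier transactions and validate on later ones, so that
# evaluation respects concept drift rather than leaking future data.
train <- subset(txns, date <  as.Date("2024-07-01"))
valid <- subset(txns, date >= as.Date("2024-07-01"))

# Up-weight the rare fraud class so the likelihood does not ignore it.
w <- ifelse(train$fraud == 1,
            sum(train$fraud == 0) / sum(train$fraud == 1), 1)

m <- gam(fraud ~ s(amount) + s(hour, bs = "cc") + s(account_age),
         data = train, family = binomial, weights = w, method = "REML")

p_valid <- predict(m, newdata = valid, type = "response")
```

Here bs = "cc" gives a cyclic smooth for hour of day, one example of the feature-wise structure that makes recalibration and monitoring straightforward.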
The growing intersection of GAM research and explainable AI demonstrates a paradigm shift in how predictive modeling is approached. Instead of viewing interpretability as a constraint, modern frameworks like mgcv and InterpretML leverage it as a design principle. By explicitly modeling feature contributions and enabling localized adjustments through smoothness penalties, GAMs and EBMs embody a “glass-box” philosophy—making them uniquely valuable for high-stakes analytical systems where accountability and clarity are non-negotiable.
This literature foundation supports the design of the current project, which applies Generalized Additive Models and Explainable Boosting Machines to real-world fraud detection data. By synthesizing statistical theory, modern software ecosystems, and applied best practices, this work seeks to demonstrate that additive modeling can deliver not only strong predictive accuracy, but also human-understandable insights—a combination increasingly demanded in today’s data-driven, regulated environments.