Threat Domain Partitioning and Sorted Rejection Labeling: Benchmarking for Adversarial Environments
Abstract
Fraud detection differs from most classification domains in ways that make conventional machine learning metrics impractical: adversaries adapt, labels are expensive, classes are heavily imbalanced, and novel attack classes appear without warning. Under these conditions, many standard metrics are not merely costly to compute but actively misleading. We present a practical framework that delivers cheaper and more consistent benchmarks for such models. The framework introduces two procedures: (1) threat domain partitioning, which exhaustively divides the space of possible attacks into manageable categories, and (2) sorted rejection labeling, which measures model performance efficiently by concentrating labeling effort on the highest-risk cases. Together these replace conventional classification metrics with alternatives that map directly to business objectives such as user conversion and fraud exposure risk. Applying the framework to real-world fraud detection systems yields significant reductions in labeling cost and far more consistent benchmarks while maintaining rigorous product standards, enabling deployment cycles that keep pace with adversarial adaptation.
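To make the second procedure concrete, the sketch below gives one plausible reading of sorted rejection labeling: score cases with the model, rank them by risk, and spend a fixed labeling budget only on the top of the ranking, where errors matter most for fraud exposure. This is a minimal illustration, not the paper's implementation; the names `risk_scores`, `request_label`, and `label_budget` are hypothetical.

```python
from typing import Callable, Sequence

def sorted_rejection_labeling(
    risk_scores: Sequence[float],
    request_label: Callable[[int], bool],  # hypothetical labeling oracle (e.g. a human review queue)
    label_budget: int,
) -> float:
    """Estimate precision among the highest-risk (rejected) cases.

    Ranks case indices by descending model risk score and spends the
    labeling budget only on the top of that ranking.
    """
    # Rank cases from most to least suspicious according to the model.
    ranked = sorted(range(len(risk_scores)), key=lambda i: -risk_scores[i])
    # Request labels only for the highest-risk cases, up to the budget.
    labels = [request_label(i) for i in ranked[:label_budget]]
    # Fraction of rejected cases that were truly fraudulent.
    return sum(labels) / max(len(labels), 1)

# Toy usage: model scores plus a ground-truth list standing in for human review.
scores = [0.97, 0.12, 0.88, 0.45, 0.91, 0.05]
truth = [True, False, True, False, False, False]
print(sorted_rejection_labeling(scores, lambda i: truth[i], label_budget=3))  # -> 0.666...
```

The design point is that labeling cost scales with the budget rather than with the dataset size, which is what makes the benchmark cheap to recompute on each deployment cycle.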