Find exceptionally bad performance groups in your loan portfolio

Popular machine learning algorithms often fail to identify exceptionally bad segments in a loan portfolio when those segments are small relative to the population. This post explains a way to identify every bad-performing group in data using BigData technology, with the popular German Credit Data as an example.

How happy would you be knowing ALL segments whose default rate is more than twice the average? The advent of BigData and cloud technologies has made such credit anomaly (or fraud) detection possible. For instance, our algorithm found all segments with at least 40 customers and a 50% default rate in the German credit data. This post suggests a new way of identifying highly bad-performing segments using BigData technology and compares it with existing methods. It will show that the new method not only complements existing tools but also delivers insights that existing methods cannot provide.

  • Status of existing checking account: Less than zero balance
  • Installment rate in percent of disposable income: 4
  • Job: Skilled employee / official
  • Number of existing credits at this bank: 1
  • Other installment plans: None
  • Other debtors and guarantors: None
  • Foreign worker: Yes

In the German credit data, a 40-loan group with the above characteristics had the highest default rate, 75%, more than twice the average (30%).

  • Duration in month: 36 or more
  • Savings account and bonds: Less than 100 DM

This group of 103 loans had a default rate of 61.17%. The duration was probably too long for borrowers with little money in their bank accounts.

  • Job: skilled employee / official
  • Other debtors and guarantors: None
  • Savings account and bonds: Less than 100 DM
  • Status of existing checking account: Less than 0 DM

Out of 112 loans sharing these characteristics, 71 defaulted (a 63.4% default rate). They could be high earners (judging from their job) with little financial discipline (judging from their savings status).

A new method

How it works

In a nutshell, the algorithm generates a list of all possible segments and evaluates whether each satisfies both conditions below:

  • Size: a segment should have enough entities (e.g. loans, transactions, customers, etc.); otherwise, a result won't be meaningful. For instance, a 100% default rate among 3 loans can easily be due to randomness.
  • Event rate: a segment should have enough events; otherwise, a result won't be useful. For example, a 30% default rate in a 100-loan segment of a portfolio whose overall default rate is 30% brings no value to the business.
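The two conditions above amount to a simple acceptance test per segment. A minimal Python sketch, using the thresholds from the German credit run later in this post (40 loans, 50% default rate):

```python
# Minimal sketch of the two acceptance conditions. Each segment is
# represented as a list of 0/1 default flags (1 = defaulted loan).
MIN_SIZE = 40    # size condition: enough loans to rule out randomness
MIN_RATE = 0.50  # event-rate condition: well above the 30% portfolio average

def is_interesting(flags):
    """Return True if the segment passes both the size and rate conditions."""
    if len(flags) < MIN_SIZE:
        return False
    return sum(flags) / len(flags) >= MIN_RATE

# A 40-loan segment with 30 defaults passes (75% >= 50%), while a 3-loan
# segment with a 100% default rate is rejected for being too small.
print(is_interesting([1] * 30 + [0] * 10))  # True
print(is_interesting([1, 1, 1]))            # False
```

The hard part is not this test but applying it to every candidate segment, which is what the rest of this post is about.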

Why now?

Though it sounds easy, generating a list of all candidates and evaluating each is computationally very expensive (in Computer Science terms, the time complexity is exponential). The recent advent of the two technologies below made our algorithm possible.

BigData Technology

The algorithm involves a very large number of tasks, which would take forever on a single computer. The MapReduce framework distributes the tasks across many computers and orchestrates the distribution so that they complete much faster without loss of data.
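To make the map/reduce decomposition concrete, here is a toy, single-process analogue (the real job runs distributed on a cluster): the map step emits one key per attribute combination a loan belongs to, and the reduce step sums defaults and counts per key. Attribute names here are illustrative, not the actual German credit fields.

```python
from collections import defaultdict
from itertools import combinations

def map_row(row, attrs, max_depth=2):
    # Emit (segment_key, (default_flag, 1)) for every combination of up to
    # max_depth attribute-value pairs that this row belongs to.
    for depth in range(1, max_depth + 1):
        for combo in combinations(attrs, depth):
            key = tuple((a, row[a]) for a in combo)
            yield key, (row["default"], 1)

def reduce_counts(rows, attrs, max_depth=2):
    totals = defaultdict(lambda: [0, 0])  # segment_key -> [defaults, size]
    for row in rows:
        for key, (d, n) in map_row(row, attrs, max_depth):
            totals[key][0] += d
            totals[key][1] += n
    return totals

rows = [
    {"job": "skilled", "foreign": "yes", "default": 1},
    {"job": "skilled", "foreign": "yes", "default": 0},
    {"job": "unskilled", "foreign": "no", "default": 1},
]
totals = reduce_counts(rows, ["job", "foreign"])
print(totals[(("job", "skilled"),)])  # [1, 2]: 1 default out of 2 loans
```

Because each emitted key is independent, the per-key summation is exactly the kind of work MapReduce parallelizes across machines.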

Cloud Technology

We only need the computers while the algorithm is running, and cloud technology makes that possible: we can rent hundreds of powerful servers temporarily to run the algorithm fast.

How is it different from decision tree?

Though it sounds similar to the decision tree algorithm, it is fundamentally different. A decision tree considers sample size when making splits, and the top splits are usually those that affect the larger groups in a dataset. This effectively prevents a decision tree from considering or detecting small anomalous segments.

Pros & cons of the new method

This approach essentially automates the investigation work of manual investigators and the descriptive analytics tasks of data scientists.

Pros

  • Insight

    • When patterns are not obvious (which is usually the case when trying to detect anomalies), data scientists often resort to more complicated algorithms. This tendency can improve model performance, but the resulting models often become so complicated that they run almost like a black box. In contrast, our algorithm clearly shows the characteristics and magnitude of anomalous cases.
  • Early and complete detection

    • Because it searches all segments exhaustively, it can detect small fraud segments as well, which greatly helps catch fraud early.
  • Tailored

    • The algorithm can work with any dataset (credit, fraud, marketing, you name it) and deliver customized results.

Cons

  • Overfitting

    • Because it finds all combinations of values that satisfy the two conditions (absolute volume and relative frequency of outlier events in a group), some of the results are due to overfitting and not worth attention. This problem can be mitigated by first looking at the results with the highest lifts (i.e., those showing the strongest relationship with anomalous cases). Moreover, more stringent conditions can be provided when running the algorithm to make the result set smaller and more meaningful.
  • Time complexity

    • The algorithm's time complexity is exponential, so as the amount of data grows, it will require a significant amount of time. This problem can be handled by renting more powerful servers and/or sampling the data down to only the meaningful columns. In addition, in an initial run, a user can set very strict conditions to shorten the computation time.
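The lift-based mitigation above can be sketched in a few lines. The group labels and counts below are illustrative; the 30% portfolio average matches the German credit data.

```python
# Sketch: rank discovered segments by lift, i.e. the segment's default
# rate divided by the portfolio-wide default rate, so that low-lift,
# likely-overfit results sink to the bottom of the list.
def lift(defaults, size, portfolio_rate):
    return (defaults / size) / portfolio_rate

portfolio_rate = 0.30      # German credit data average
segments = [               # (label, defaults, size) -- illustrative values
    ("group_a", 30, 40),   # 75.0% default rate
    ("group_b", 31, 42),   # 73.8% default rate
    ("group_c", 52, 103),  # 50.5% default rate
]
ranked = sorted(segments, key=lambda s: lift(s[1], s[2], portfolio_rate),
                reverse=True)
print(ranked[0][0])  # group_a, with lift 0.75 / 0.30 = 2.5
```

Reviewing the highest-lift segments first concentrates attention on the findings least likely to be statistical noise.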

Examples & use cases

German credit data

  • Data source: UCI Machine Learning Repository
  • Number of rows: 1,000
  • Frequency of event of interest: 300 (30.0%)
  • Conditions provided:

    • Size: at least 40 loans
    • Rate: at least 50% default rate
  • Number of segments found: 1,649

We applied the algorithm to the famous German credit data. It found 1,649 segments with at least 40 loans and at least a 50% default rate, and our tests confirmed that the solutions are accurate (each satisfies both the size and rate requirements) and complete (no other segment satisfies them). Below are the two groups with the highest default rates.

  • Group 1:

    • Group size: 40
    • Default rate: 75.0% (30 defaults)
    • Descriptions

      • Status of existing checking account: Less than zero balance
      • Installment rate in percent of disposable income: 4
      • Job: Skilled employee / official
      • Number of existing credits at this bank: 1
      • Other installment plans: None
      • Other debtors and guarantors: None
      • Foreign worker: Yes
  • Group 2:

    • Group size: 42
    • Default rate: 73.8% (31 defaults)
    • Descriptions

      • Status of existing checking account: Less than zero balance
      • Installment rate in percent of disposable income: 4
      • Credit history: Existing credits paid back duly until now
      • Telephone: None
      • Other installment plans: None
      • Other debtors and guarantors: None
      • Foreign worker: Yes
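Each reported group can be re-checked against the raw loans in a few lines: filter by the group's attribute values and confirm both conditions hold. A hedged sketch, with illustrative field names rather than the actual German credit columns:

```python
# Hypothetical verification helper: given raw loans as dicts, keep those
# matching every attribute-value pair in `conditions`, then check the
# size and default-rate requirements that defined the search.
def verify_group(loans, conditions, min_size=40, min_rate=0.50):
    members = [ln for ln in loans
               if all(ln.get(k) == v for k, v in conditions.items())]
    size = len(members)
    rate = sum(ln["default"] for ln in members) / size if size else 0.0
    return size >= min_size and rate >= min_rate, size, rate

# Tiny illustrative check (not the real 1,000-row dataset):
loans = ([{"checking": "<0", "job": "skilled", "default": 1}] * 30
         + [{"checking": "<0", "job": "skilled", "default": 0}] * 10)
ok, size, rate = verify_group(loans, {"checking": "<0", "job": "skilled"})
print(ok, size, rate)  # True 40 0.75
```

Running a check like this over every reported segment is how the accuracy half of the claim above can be confirmed.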

Bank marketing campaigns

Find more in this blog post: Who responds to marketing campaigns?


Using the algorithm in conjunction with existing methods will deliver earlier, more complete, and more robust risk management to your organization. If you have any questions or want to apply the algorithm to a dataset in your organization, please contact us by clicking the "Contact" button in the menu above.