Factors in Machine Learning Models for Fraud Prevention

Machine learning has become a very popular tool for preventing fraud in online businesses, but generating meaningful variables is difficult because fraudsters also learn and evolve. This post describes how to create a comprehensive list of variables that perform consistently in fighting online fraud.


The ubiquity of e-commerce and online financial services has brought widespread online fraud. Fraudsters typically exploit the fact that merchants and financial service providers cannot see their customers face-to-face, and they illegally use stolen credit card information or IDs while posing as the legitimate owners.

Machine learning has become a very popular tool for fighting this kind of fraud. Online businesses have accumulated so much data that it is difficult to notice unusual trends or events without the assistance of machines. Most online transactions occur in real time, and with good infrastructure, well-trained models can score them in real time as well.

However, machine learning is not a silver bullet and has weaknesses. The two most obvious ones are as follows:

  1. Limited quantity, variety, and quality of data (Garbage In, Garbage Out), and
  2. Limited accessibility of information: the outcome of a transaction may not be immediately available. For example, if someone uses your credit card to buy a product online, it can take two days for the charge to show in your credit card statement, and additional days or weeks for your credit card company to notify the online merchant. Typically, the entire feedback loop takes from a week to months.

Furthermore, fraudsters also learn and evolve. Even if your machine learning model recognizes an IP address that was previously identified as one frequently used for fraudulent activities, the fraudsters can easily change their IP addresses. That effectively incapacitates your model until the new fraudulent activity is detected and reported back to the system.

This post describes variables that help machine learning models perform consistently. Specifically, we will talk about methods that do not require the outcomes of transactions, so that a machine learning model can learn more quickly with a shorter feedback loop.

Method 1. Cross-check

From one variable, you can extract information that you can check against other variables. For instance, if you already have geographic information from shipping and billing addresses, you can extract the same type of information from other variables such as IP addresses or mobile GPS, if enabled. The information can mismatch for genuine customers too, but these weak signals can be combined by a machine learning algorithm to produce strong signals or scores.
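As a rough illustration of the cross-check idea, here is a minimal Python sketch. It assumes each location source has already been resolved to a country code by some upstream step; the function name, feature names, and inputs are hypothetical and only meant to show the shape of the variables.

```python
from typing import Optional


def cross_check_features(
    billing_country: Optional[str],
    shipping_country: Optional[str],
    ip_country: Optional[str],
) -> dict:
    """Return weak mismatch flags between independently sourced locations.

    Each flag is 1 if both values are known and differ, else 0.
    Missing values are counted separately so that an unknown location
    is not silently treated as a match.
    """
    def mismatch(a: Optional[str], b: Optional[str]) -> int:
        if a is None or b is None:
            return 0
        return int(a.strip().upper() != b.strip().upper())

    values = [billing_country, shipping_country, ip_country]
    return {
        "billing_shipping_mismatch": mismatch(billing_country, shipping_country),
        "billing_ip_mismatch": mismatch(billing_country, ip_country),
        "shipping_ip_mismatch": mismatch(shipping_country, ip_country),
        "location_fields_missing": sum(v is None for v in values),
    }


# Billing and shipping agree (case-insensitively); the IP address points elsewhere.
features = cross_check_features("US", "us", "RO")
```

Each flag is left as a separate variable rather than collapsed into one rule, so the model can weigh the mismatches itself.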

Method 2. Link

Some items typically belong to only one person or one family at a time. For example, phone numbers, devices, email addresses, social network service profiles, and bank accounts typically belong to one person. When a single item is linked to many accounts, that is a signal worth capturing. Some corner cases do exist, such as public PCs in a library. Again, the idea here is that we do not expect a single weak variable to catch all fraud. Rather, it functions as a piece of the puzzle that allows a machine learning algorithm to distinguish good signals from bad ones.
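One way to turn the link idea into a variable is to count how many distinct accounts have been seen with each item. The sketch below assumes a hypothetical event log of (account_id, item_type, item_value) tuples; the names and format are illustrative only.

```python
from collections import defaultdict


def accounts_per_item(events):
    """Map each (item_type, item_value) to the number of distinct accounts seen with it.

    `events` is an iterable of (account_id, item_type, item_value) tuples,
    a hypothetical log format used only for this sketch.
    """
    linked = defaultdict(set)
    for account_id, item_type, item_value in events:
        linked[(item_type, item_value)].add(account_id)
    return {item: len(accounts) for item, accounts in linked.items()}


events = [
    ("acct_1", "device", "dev_A"),
    ("acct_2", "device", "dev_A"),
    ("acct_3", "device", "dev_A"),
    ("acct_1", "phone", "555-0100"),
    ("acct_1", "device", "dev_A"),  # repeat use by the same account is not double-counted
]
link_counts = accounts_per_item(events)
```

Here the device is shared by three accounts while the phone number belongs to one. Feeding the raw count to the model, instead of a hard cutoff, leaves room for corner cases such as shared public PCs.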

Method 3. Stability

The link method looks at how many accounts are associated with a particular item. The stability method looks at how frequently an account changes its items. People rarely use more than five devices in a short period of time, so if a transaction comes from an account that has just used its sixth device in the last 7 days, you probably want to flag it. Another good example is a person who frequently changes the income stated in credit card or online loan applications to get favorable terms. This approach extends to other items as well, such as phone numbers or addresses. At a more sophisticated level, you can run stability checks between pairs of items (e.g. between devices and geographic locations).
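The device example can be sketched as a count of distinct items per account within a trailing window. The log format and function name below are assumptions made for illustration; the threshold of five devices in 7 days comes from the example in the text, not from any measured baseline.

```python
from datetime import datetime, timedelta


def distinct_items_in_window(usages, account_id, now, window=timedelta(days=7)):
    """Count distinct item values an account used within `window` before `now`.

    `usages` is an iterable of (account_id, item_value, timestamp) tuples,
    a hypothetical log format used only for this sketch.
    """
    start = now - window
    return len({
        item
        for acct, item, ts in usages
        if acct == account_id and start <= ts <= now
    })


now = datetime(2024, 1, 10, 12, 0)
# Six different devices, one per day, all inside the last 7 days.
usages = [("acct_1", f"dev_{i}", now - timedelta(days=i)) for i in range(6)]
usages.append(("acct_1", "dev_old", now - timedelta(days=30)))  # outside the window
usages.append(("acct_2", "dev_0", now))                         # a different account

device_count_7d = distinct_items_in_window(usages, "acct_1", now)
unstable_devices = int(device_count_7d > 5)  # threshold taken from the example above
```

The same function works for phone numbers or addresses by changing what is logged as the item value.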

Method 4. Velocity

Methods 4 and 5 are about distinguishing “typical” from “unusual” activities. If a typical person makes two or three orders on your website per week, then you should be alerted if a person makes more than twenty. You can also do this for the total monetary amount of transactions rather than their number. Similarly, you can apply the same logic to an IP address, a region, a device, or anything else that shouldn’t deviate too much from its regular metrics in terms of volume, number, or monetary amount.
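A velocity variable can be sketched as a count and a sum over a trailing window, compared against a baseline. The transaction format, the function name, and the baseline of three orders per week are assumptions for illustration; in practice the baseline would come from your own data.

```python
from datetime import datetime, timedelta


def velocity_features(transactions, key, now, window=timedelta(days=7)):
    """Return the count and total amount of transactions for `key` in the trailing window.

    `transactions` is an iterable of (key, amount, timestamp) tuples, where the key
    can be an account, an IP address, a device, or a region (hypothetical format).
    """
    start = now - window
    amounts = [amt for k, amt, ts in transactions if k == key and start <= ts <= now]
    return {"count": len(amounts), "total_amount": sum(amounts)}


now = datetime(2024, 1, 10)
# Twenty-one orders within the last day, plus one that is too old to count.
transactions = [("acct_9", 25.0, now - timedelta(hours=h)) for h in range(21)]
transactions.append(("acct_9", 25.0, now - timedelta(days=10)))

v = velocity_features(transactions, "acct_9", now)
typical_weekly_orders = 3  # assumed baseline, echoing the "two or three orders" example
velocity_ratio = v["count"] / typical_weekly_orders
```

Passing the ratio to the model, rather than a yes/no alert, lets it learn how far from typical is actually risky.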

Method 5. Ratio

This one is similar to Method 4 in that it flags unusual activities, but it looks at them from a ratio perspective. For example, what is the typical fraud decline rate for a region? What about the coupon usage rate for a particular browser or device type? You can create flag variables for items with unusually high or low ratios and let the machine learning algorithm decide how useful they are.
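The sketch below shows one possible way to build such flags: compute each group's rate and compare it with the overall rate. The multipliers (2x and 0.5x) and the minimum group size are arbitrary assumptions chosen for the example, and the record format is hypothetical.

```python
from collections import Counter


def ratio_flags(records, high=2.0, low=0.5, min_count=20):
    """Flag groups whose event rate is unusually high (1) or low (-1) versus the overall rate.

    `records` is an iterable of (group, event) pairs where event is True/False,
    e.g. (region, was_declined) or (browser, used_coupon).
    """
    totals, hits = Counter(), Counter()
    for group, event in records:
        totals[group] += 1
        hits[group] += int(event)
    if not totals:
        return {}

    overall = sum(hits.values()) / sum(totals.values())
    flags = {}
    for group, n in totals.items():
        if n < min_count:
            flags[group] = 0   # too few observations to trust the ratio
            continue
        rate = hits[group] / n
        if rate > high * overall:
            flags[group] = 1   # unusually high
        elif rate < low * overall:
            flags[group] = -1  # unusually low
        else:
            flags[group] = 0
    return flags


def make(group, n, n_events):
    """Build n synthetic records for a group, the first n_events of which are events."""
    return [(group, i < n_events) for i in range(n)]


records = (
    make("region_A", 100, 1)        # 1% decline rate
    + make("region_B", 100, 40)     # 40% decline rate
    + make("region_C", 1000, 100)   # 10% decline rate
    + make("region_D", 5, 5)        # 100%, but only 5 observations
)
flags = ratio_flags(records)
```

Note that region_D is left unflagged despite its extreme rate, because a ratio over five observations is too noisy to mean much.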


So far, we have discussed ways to generate variables to feed into machine learning models aimed at preventing fraud. As you may have noticed, these are weak signals: one variable by itself is not necessarily enough to conclude that a transaction is fraudulent, even though it may be enough to raise suspicion. If you already have strong indicators, such as a mismatch between the name on a credit card and the name on an account profile, your business should already have rules or operational processes built around them.

In contrast, a variable created with the above methods does not necessarily require a rule or operation of its own, and that is precisely where machine learning can help: it combines these variables into a powerful scorecard whose effectiveness your data scientists can evaluate. Furthermore, with the methodology above, you can generate variables without having to wait for the outcome of a transaction.

In conclusion, as you improve the machine learning models in your business, you will incorporate many new and diverse data sources. The framework above can guide you in creating a comprehensive list of relevant and creative variables efficiently.