Expert opinion

Machine Learning – What is Inside "The Black Box"?

Wednesday 12 July 2017 | 01:03 PM CET

Roberto Valerio, Risk Ident: Machine learning is sometimes referred to as a revolutionary approach in technology

This may sound quite bold, but if you think about self-driving cars, speech recognition or advanced web search, does it really seem like an exaggerated claim?

In fact, machine learning is a very important part of modern software across industries, and it also provides great assistance when it comes to fraud prevention. This is because it has some huge advantages compared to rule-based systems, which are very common in today’s fraud prevention industry. Well-integrated machine learning provides the most accurate predictions, works in real-time and – most importantly – can easily adapt to new fraud patterns.

Understanding machine learning

Some fraud managers and specialists are wary of machine learning because they find it unintuitive and opaque, often calling it a “black box” that cloaks everything happening inside.

So how does this magical “black box” really work? At Risk Ident, we use mainly “supervised machine learning” for our fraud prevention software stack. This means that labelled data is used for training a model. A specific algorithm analyses the existing data and provides a function (also called a classifier) that can be used in order to make an informed decision on previously unknown transactions.
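As a rough illustration of the idea, the sketch below trains on labelled examples and returns a classifier function for unseen transactions. The nearest-centroid method and the two numeric features are assumptions chosen for brevity, not the algorithm used by Risk Ident.

```python
def train(examples):
    """Learn from (feature_vector, label) pairs and return a classifier function.

    Assumption: a simple nearest-centroid model stands in for the real algorithm.
    """
    sums, counts = {}, {}
    for features, label in examples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    # One average feature vector (centroid) per label.
    centroids = {label: [s / counts[label] for s in sums[label]] for label in sums}

    def classify(features):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        # Pick the label whose centroid is closest to the new transaction.
        return min(centroids, key=lambda label: squared_distance(centroids[label]))

    return classify


# Hypothetical labelled data: [basket value in hundreds, number of failed logins]
labelled = [
    ([1.0, 1.0], "no fraud"),
    ([2.0, 1.0], "no fraud"),
    ([9.0, 8.0], "fraud"),
    ([10.0, 9.0], "fraud"),
]
classifier = train(labelled)
```

Calling `classifier([9.0, 9.0])` then yields a decision for a transaction the model has never seen.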

Extracting relevant features

To make the data applicable for the algorithm of the selected model, the first step is to extract relevant information out of the data – a process we call “feature engineering”. For example, it is possible to extract the age of a customer from their date of birth, or the distance between two locations from the addresses.
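The two examples above could be sketched as follows. The function names are hypothetical, and the distance calculation assumes the addresses have already been converted to latitude and longitude by a geocoding step that is not shown here.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def age_in_years(date_of_birth: date, today: date) -> int:
    """Return the customer's age in full years on the given day."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance between two coordinates in kilometres."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))
```

A large distance between billing and shipping address, for instance, could then be passed to the model as a single numeric feature.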

From an email address alone you can extract more than 10 different “features”, including the domain used, whether the customer’s name is part of the address, and how many vowels, consonants and digits appear in the local part (the part before the @ sign).
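A minimal sketch of a few such email features is shown below. The function name and the returned keys are illustrative assumptions, and the vowel check is deliberately simplified (any letter outside "aeiou" counts as a consonant).

```python
def email_features(email: str, customer_name: str) -> dict:
    """Extract a handful of simple features from an email address."""
    local_part, _, domain = email.lower().rpartition("@")
    letters = [c for c in local_part if c.isalpha()]
    vowels = sum(c in "aeiou" for c in letters)
    name_parts = customer_name.lower().split()
    return {
        "domain": domain,
        # True if the first or last name appears in the local part.
        "name_in_local_part": any(part in local_part for part in name_parts),
        "vowel_count": vowels,
        "consonant_count": len(letters) - vowels,
        "digit_count": sum(c.isdigit() for c in local_part),
    }


features = email_features("jane.doe42@example.com", "Jane Doe")
```

A local part made up of random-looking consonants and digits, with no relation to the customer name, is the kind of signal such features are meant to expose.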

With that in mind, many combinations of data fields can yield useful information before they are fed into the algorithm. Broad domain knowledge is indispensable here. At Risk Ident, we have developed and refined this knowledge over years of specializing in the ecommerce, banking, and telecoms industries.

However, excellent human expertise is also needed on the company’s side. Machine learning cannot replace fraud managers, but it is the best tool to expand existing knowledge to a larger number of transactions that cannot be evaluated manually.

Choosing the best algorithm

Once feature engineering is done properly and all relevant features are in place, the right model needs to be chosen. The available algorithms differ in the types of data they can handle, in the complexity of their training process, and in the types of patterns they can recognize within the data.

Logistic Regression models are fast, easy to train, and easy to interpret, because they return the probability of fraud as their output. Yet they have two disadvantages: they are sensitive to outliers (anomalies) in the data and struggle with some types of non-linear patterns.
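To show why the output is easy to read, here is a sketch of how an already trained logistic regression model turns features into a probability. The weights and bias would come from training; the values used in any call are hypothetical.

```python
from math import exp


def fraud_probability(features, weights, bias):
    """Weighted sum of the features, squashed into the range 0..1 by the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))
```

Each weight states directly how strongly its feature pushes the score towards fraud, which is what makes the model interpretable. Because the model is a single weighted sum, it can only separate the data along a straight line (or plane), which is the limitation with non-linear patterns mentioned above.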

Neural Network algorithms (especially deep learning models) have shown excellent results in classifying image and video data. They are able to detect complex patterns and connections in the data, although the amount of time and resources for training them is high and the results are not easily interpretable.

Decision Trees have the big advantage that they are easy for humans to read and understand, even without detailed knowledge of statistics. This makes it possible to make a prediction from the model without a computer. They can also return scores, such as a probability score, but be aware that if they are too detailed they can “overfit” the data, leading to poor results for future, unknown transactions.
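A trained Decision Tree can be written down as a handful of nested rules. The tree below is purely hypothetical: the feature names, thresholds and scores are invented for illustration and do not come from a real model.

```python
def tree_fraud_score(order: dict) -> float:
    """A small, hand-written decision tree that returns a fraud score per leaf."""
    if order["account_age_days"] < 7:
        if order["basket_value"] > 500:
            return 0.80  # new account, expensive basket
        return 0.30      # new account, ordinary basket
    if order["shipping_differs_from_billing"]:
        return 0.20      # established account, unusual delivery address
    return 0.02          # established account, nothing unusual
```

Each path from the top to a return statement is a rule a fraud manager can follow by hand. A tree with hundreds of such branches would fit the training data very closely, which is the overfitting risk described above.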

To avoid this, it is possible to apply Random Forests to the data. Here you train multiple Decision Trees on random subsets of the data and combine their results, which usually produces more accurate predictions. The downside is that the results are harder to interpret and the time and resources needed for training are higher.
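The combination step can be sketched in a few lines. This is a simplified stand-in, not a full Random Forest: each "tree" is a one-level decision stump, labels are assumed to be 0 (no fraud) or 1 (fraud), and the random selection of features used by real implementations is left out.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above_label in (0, 1):
                errors = sum(
                    (above_label if r[f] > t else 1 - above_label) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above_label)
    _, f, t, above_label = best
    return lambda row: above_label if row[f] > t else 1 - above_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a random sample of the data, drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict(forest, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

The loss of interpretability is visible here: the final answer is a vote across many small models, so there is no single rule path to read off.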

The Naïve Bayes model is fast, easy to train, and can often outperform more sophisticated models; it also returns probabilities as a score. Even if some data points are missing, it can still provide a result. But it also has an important drawback: it assumes that all features are independent of each other (which is often not the case in the real world). When features are correlated, the same evidence is effectively counted more than once, which distorts the result.
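The effect of dependent features can be seen in a small sketch. It assumes each feature has already been reduced to a likelihood ratio, P(feature | fraud) / P(feature | no fraud); the function name and the numbers are illustrative only.

```python
def naive_bayes_fraud_probability(prior_fraud, likelihood_ratios):
    """Multiply the prior odds by one likelihood ratio per observed feature.

    A missing feature is simply left out of the list, which is why the model
    can still produce a result with incomplete data.
    """
    odds = prior_fraud / (1.0 - prior_fraud)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# One piece of evidence that is three times more likely under fraud.
single = naive_bayes_fraud_probability(0.5, [3.0])
# The same evidence supplied twice through two perfectly correlated features.
doubled = naive_bayes_fraud_probability(0.5, [3.0, 3.0])
```

Here `single` is 0.75 while `doubled` rises to 0.9, even though no new information was added. That is the multiplication of a feature's weight described above.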

Facing the challenges

One of the biggest challenges when setting up fraud prevention based on machine learning is actually collecting the data. As already mentioned, supervised machine learning learns from existing data which is labelled “fraud” or “no fraud”. But how is fraud even defined? Where do these labels come from and are they correct?

The quality of the model predictions can only be established when labels are correct. For a valuable process of feature engineering it is important to provide as much data as possible, but for most companies this is a big challenge.

While machine learning provides the best results in predicting fraud cases, there are still some cases that require a fraud expert to make a decision and thereby train the model. To make an intelligent decision, it is important to have an intuitive graphical user interface (GUI) and good visualization tools, so that the fraud manager is able to have a deeper and clearer look at the data. The GUI is – if you want to put it that way – your window inside the “black box” of fraud prevention.

About Roberto Valerio:

Roberto Valerio is the CEO of Risk Ident, leading the day-to-day management of the company. He is responsible for driving the development of the business to serve merchants in need of a modern, intelligent approach to online fraud prevention.


About Risk Ident:

Risk Ident is a leading software company that offers highly efficient anti-fraud solutions to companies within the ecommerce, telecoms and financial sectors. We are global experts with long-term experience in data science and machine learning. With Risk Ident, your fraud prevention gets stronger day by day.