“Torture the data, and it will confess to anything.” - Ronald Coase

Introduction 

How to handle large data volumes in Process Mining?

Nowadays, all data analytics activities face the same challenge – handling big data. Several trends in recent decades have only made this problem worse.

For one thing, the amount of data gathered is immense. By a widely cited estimate, more than 90% of all the data in our history was created in the last two years alone. It is mind-boggling to imagine how much data this actually is!

The way we handle big data has changed as well. A decade ago, a data analyst could spend hours configuring a data mining algorithm or writing queries for a reporting system. They would press a button to execute the query and wait minutes – sometimes hours – for the answer to their question.

That way of working no longer holds. Like many other data analytics techniques, process mining is moving towards a more business-oriented audience.

This transition really changes the game.  

Business users want an easy-to-use tool that gives them relevant insights fast. They expect a user experience similar to what they know from their smartphones. So instead of waiting minutes, they want results in seconds.

How does data affect speed? 

The ProcessGold tool enables you to make “governed self-service” apps for Process Mining.

What exactly does this mean?  

ProcessGold gives business users a contained information space that is very easy to use. Users get the Process Mining insights they need to optimize their business processes. However, such an app must be fast enough to keep these users engaged.

Our software developers at ProcessGold love performance. They are continually making step-by-step improvements to the overall speed of the ProcessGold Platform.

But does it only depend on their efforts and extra hours spent in the office?  


Indeed, there are many other factors that determine the speed of a Process Mining tool. The following is a very simple rule of thumb that applies to all data analytics tools: “The more data you put in an app, the slower it gets.”

Performance scales with the number of records used in the app. It’s the number of records in your largest dataset that has the biggest impact on performance.

In Process Mining, that usually is the event log itself. 

Remember, it’s a very simple trade-off. The more data you put in, the slower it gets.

How to improve performance? 

One solution is to reduce the number of data records that are loaded. For example, you could limit the time period from ten years to only one year. However, that’s not always desirable.
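As a rough illustration of limiting the time period, the sketch below keeps only the events from the last year before a reference date. It is plain Python, not ProcessGold functionality; the event structure, the field names and the `limit_to_recent` helper are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical event records; the field names are illustrative only.
events = [
    {"case_id": "C1", "timestamp": datetime(2010, 5, 1)},
    {"case_id": "C2", "timestamp": datetime(2019, 3, 1)},
    {"case_id": "C3", "timestamp": datetime(2019, 11, 15)},
]


def limit_to_recent(events, reference, days=365):
    """Keep only events that fall within `days` before `reference`."""
    cutoff = reference - timedelta(days=days)
    return [event for event in events if event["timestamp"] >= cutoff]


# Load one year of data instead of the full history.
recent = limit_to_recent(events, reference=datetime(2020, 1, 1))
```

The older event is dropped, which is exactly why this approach is not always desirable: anything outside the window can no longer be analyzed.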

And what if you want to load a drastically higher number of data records: say 10 times, 100 times, or maybe 1000 times as much?

Sounds impossible?  

The innovative solution of ProcessGold to this problem is called “Sharding”. 

What is sharding? 

With ProcessGold’s sharding, you divide the original dataset into multiple shards. The smaller each shard is, the faster each shard will be. When a user logs in, the corresponding data shard will be loaded.

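A typical unit for sharding would be “Company code” or “Department”. For example, if you have 50 company codes of roughly equal size, each shard will contain one company code and hold about one fiftieth of the records. Following the rule of thumb above, each shard will then be roughly 50 times faster than the original dataset.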
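The idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how the ProcessGold Platform implements it; the event log layout, the `split_into_shards` and `load_shard_for` helpers and the user-to-shard mapping are hypothetical.

```python
from collections import defaultdict

# Hypothetical event log: one dict per event, field names are illustrative.
event_log = [
    {"case_id": "C1", "company_code": "1000", "activity": "Create order"},
    {"case_id": "C1", "company_code": "1000", "activity": "Approve order"},
    {"case_id": "C2", "company_code": "2000", "activity": "Create order"},
    {"case_id": "C3", "company_code": "2000", "activity": "Create order"},
    {"case_id": "C3", "company_code": "2000", "activity": "Pay invoice"},
    {"case_id": "C4", "company_code": "3000", "activity": "Create order"},
]


def split_into_shards(events, shard_key):
    """Group events into shards, one shard per distinct value of `shard_key`."""
    shards = defaultdict(list)
    for event in events:
        shards[event[shard_key]].append(event)
    return dict(shards)


shards = split_into_shards(event_log, "company_code")

# Assumed mapping of users to shards; on login only that shard is loaded.
user_shard = {"alice": "1000", "bob": "2000"}


def load_shard_for(user):
    """Return only the events of the shard the user belongs to."""
    return shards[user_shard[user]]
```

Every shard has the same structure as the original event log, only with fewer records, so the same app logic can run on any of them.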



User management can be isolated per shard, such that users can be managed separately. Using the ProcessGold User Sync functionality, information about who belongs to which shard can be loaded automatically without extra configuration for each new user. 

Development is easy, because you only have to develop, maintain and deploy one single app. It can be used for all shards, because the data structure of each shard is the same.

Now, you might be wondering: what if I want to compare all my company codes? Is that still possible with sharding? 

Benchmark shards 

While sharding vastly improves performance per shard, you lose the ability to compare across shards. To get that overview back, ProcessGold has “Benchmark shards” that combine the data of multiple shards into one benchmark.

To make sure the benchmark shard performs better than the original dataset, we must somehow reduce the data per shard. There are multiple ways to do that.

1. Pre-aggregation 

We can pre-aggregate values per shard, or per any other attribute in the dataset. This rules out some of the detailed analyses, but you are still able to compare differences across shards.
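A minimal sketch of pre-aggregation, assuming a hypothetical list of case records with a throughput time in days. The field names and the `pre_aggregate` helper are made up for this example and are not part of the ProcessGold Platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-case records; throughput time is given in days.
cases = [
    {"company_code": "1000", "case_id": "C1", "throughput_days": 4.0},
    {"company_code": "1000", "case_id": "C2", "throughput_days": 6.0},
    {"company_code": "2000", "case_id": "C3", "throughput_days": 10.0},
]


def pre_aggregate(cases):
    """Reduce the case records to one summary row per company code."""
    durations = defaultdict(list)
    for case in cases:
        durations[case["company_code"]].append(case["throughput_days"])
    return {
        code: {"case_count": len(values), "avg_throughput_days": mean(values)}
        for code, values in durations.items()
    }


# The benchmark holds one row per shard instead of one row per case.
benchmark = pre_aggregate(cases)
```

The individual cases are gone from the result, so drilling down is no longer possible, but the averages per company code can still be compared side by side.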

2. Lower level-of-detail 

With Process Mining, a typical benchmark shard removes levels of detail in the events. We can filter out all fine-grained events and keep only the high-level events. This enables you to compare processes on a coarse level.
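In its simplest form, lowering the level of detail is a filter on the activity name. The sketch below assumes a hand-picked set of high-level activities; which events count as high-level is a modelling decision, and the names used here are purely illustrative.

```python
# Assumed set of activities that are considered high-level.
HIGH_LEVEL_ACTIVITIES = {"Create order", "Pay invoice"}

# Hypothetical events from several shards.
events = [
    {"case_id": "C1", "company_code": "1000", "activity": "Create order"},
    {"case_id": "C1", "company_code": "1000", "activity": "Change price"},
    {"case_id": "C1", "company_code": "1000", "activity": "Pay invoice"},
    {"case_id": "C2", "company_code": "2000", "activity": "Create order"},
    {"case_id": "C2", "company_code": "2000", "activity": "Change quantity"},
]


def keep_high_level(events, high_level=HIGH_LEVEL_ACTIVITIES):
    """Drop fine-grained events and keep only the high-level ones."""
    return [event for event in events if event["activity"] in high_level]


# The benchmark shard contains all company codes, but fewer events per case.
benchmark_events = keep_high_level(events)
```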

3. Tagging 

ProcessGold’s unique ability to tag interesting situations works like a charm in combination with benchmark shards. You can even remove all event data and keep the tags of their respective cases. This makes it easy to compare tags over multiple shards. 
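To make the tag-only variant concrete, the sketch below assumes that each case has already been reduced to its company code and a list of tags. The tag names, the record layout and the `tag_counts_per_shard` helper are hypothetical and only illustrate the comparison.

```python
from collections import Counter

# Hypothetical case records with the event data removed and only tags kept.
case_tags = [
    {"company_code": "1000", "case_id": "C1", "tags": ["Late payment"]},
    {"company_code": "1000", "case_id": "C2", "tags": []},
    {"company_code": "2000", "case_id": "C3", "tags": ["Late payment", "Rework"]},
    {"company_code": "2000", "case_id": "C4", "tags": ["Rework"]},
]


def tag_counts_per_shard(rows):
    """Count how often each tag occurs within each company code."""
    counts = {}
    for row in rows:
        counts.setdefault(row["company_code"], Counter()).update(row["tags"])
    return counts


# One Counter per company code, ready to be compared across shards.
counts = tag_counts_per_shard(case_tags)
```

Because a `Counter` returns zero for a missing key, a tag that never occurs in a shard can be compared directly with shards where it does occur.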


Combining shards 

The combination of a benchmark shard and many normal shards gives you the best of both worlds: a high-level overview to compare shards, and the possibility to zoom in on a specific shard to see all the fine-grained details available.

ProcessGold gives your business users a great user experience, by switching seamlessly from benchmark shard to specific shard and back. High-level management can see the overall picture, while you can still zoom in to all the details. And the cherry on the cake – all of this can be done with great speed! 

Martijn Wijffelaars, Head of Product @ProcessGold