By failing to prepare, you are preparing to fail – Benjamin Franklin


Preparing your dataset and connecting it to a Process Mining tool are arguably the most important steps in your Process Mining initiative. If these steps are done incorrectly, you risk obtaining unreliable insights and derailing the investment and hard work put into your project.

Thankfully, there are a number of good practices and features within the ProcessGold tool that are designed to make preparing and connecting your dataset simple, accurate and efficient.

This article will discuss the steps and practices that can be used to prepare your dataset for the ProcessGold Process Mining platform from your ERP system.

Be aware, though, that this article is not intended as an ultimate guide and won’t solve all dataset-related problems faced by a process mining expert. It does, however, provide some tips and tricks to optimize your dataset.

Identify your attributes

As you might already know, three different attributes are necessary for Process Mining:

  1. Case ID: A unique identifier necessary to track a case throughout a process, e.g. a quality notification or purchase order number.
  2. Activities: The various states that a case can go through, e.g. Outgoing payment created or Goods receipt.
  3. Event: An instance of an activity with a timestamp.

When preparing the data, we recommend involving a person with knowledge of the specific process (i.e. a domain expert). This person can provide valuable input about which attributes could serve as case IDs, activities or other relevant attributes.
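
To make the three attributes concrete, here is a minimal Python sketch of an event log. The `Event` class, the sample purchase order numbers and the `trace` helper are hypothetical and only serve as illustration; they are not part of the ProcessGold platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    case_id: str         # e.g. a purchase order number
    activity: str        # e.g. "Goods receipt"
    timestamp: datetime  # when this instance of the activity occurred


# Hypothetical sample records, for illustration only.
event_log = [
    Event("PO-1001", "Purchase order created", datetime(2019, 3, 1, 9, 0)),
    Event("PO-1001", "Goods receipt", datetime(2019, 3, 4, 14, 30)),
    Event("PO-1001", "Outgoing payment created", datetime(2019, 3, 10, 11, 15)),
    Event("PO-1002", "Purchase order created", datetime(2019, 3, 2, 10, 0)),
]


def trace(log, case_id):
    """Return the activities of one case, ordered by timestamp."""
    events = sorted(
        (e for e in log if e.case_id == case_id),
        key=lambda e: e.timestamp,
    )
    return [e.activity for e in events]
```

Calling `trace(event_log, "PO-1001")` returns the path that this purchase order took through the process, which is exactly what the case ID makes possible.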

Set goals with your process in mind

Countless systems record every step in a process, and within ERP systems there are many different processes. Think of purchase-to-pay or order-to-cash. Every step is recorded in a database, which can then be mined (read more about data mining). Nonetheless, you still have to think beforehand about what the overall goal is and which processes are involved. For instance, the goal might be to streamline the purchase order process to realize cost savings.

Depending on the processes and goals, you can then decide which attribute to use as case ID. Often, processes in ERP systems can be looked at from various perspectives. If the perspective is clear, you probably have an idea of how the ‘case’ moves through the process and which activities it passes through. With that knowledge, you know which tables are needed to obtain complete information.

Connecting to the ProcessGold tool

Most of the time, half the work is knowing the structure of the database and its tables. Fortunately, databases can be directly connected to the ProcessGold tool via ODBC (Open Database Connectivity). Having an IT specialist present during this phase is highly recommended.

A helpful advantage of ProcessGold is that ETL (extract, transform, load) is built into the tool, allowing you to directly query databases that support ODBC. This makes it simple to get data into the ProcessGold platform, gives you greater flexibility in data selection, and saves the costs associated with using an intermediate tool.

Furthermore, ODBC prevents a lot of manual labor. It allows you to refresh the data via scripts, so that you don’t need to export data from the database to files and then import those files into ProcessGold. The connection only needs to be set up once, whereas the manual process is error-prone and has to be repeated for every refresh. Also, getting the data from the database directly means that you don’t have to worry about the parse settings.

Apart from that, it’s worth considering what information needs to be exported. It may be unnecessary to export all columns for the insights you need. Fewer columns will result in a lower disk space requirement for the server. In addition, it’s good practice to avoid open text fields, keeping the potential for human error to a minimum.
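
As an illustration of selecting only the columns you need, the Python sketch below names them explicitly in the query instead of using `SELECT *`. An in-memory SQLite database stands in for the ERP database so that the example is runnable; a real setup would connect through ODBC. The table name `purchase_order_events` and all column names are assumptions made up for this example.

```python
import sqlite3

# In-memory SQLite stands in for the ERP database here; a real setup would
# connect through ODBC instead. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_order_events ("
    " po_number TEXT, activity TEXT, event_time TEXT,"
    " free_text_note TEXT, internal_flag INTEGER)"
)
conn.executemany(
    "INSERT INTO purchase_order_events VALUES (?, ?, ?, ?, ?)",
    [
        ("PO-1001", "Purchase order created", "2019-03-01 09:00:00", "urgent!!", 0),
        ("PO-1001", "Goods receipt", "2019-03-04 14:30:00", None, 1),
    ],
)

# Name the needed columns explicitly instead of SELECT *, leaving out the
# open text field and the flag that the analysis does not use.
NEEDED_COLUMNS = ("po_number", "activity", "event_time")
query = (
    f"SELECT {', '.join(NEEDED_COLUMNS)} "
    "FROM purchase_order_events ORDER BY event_time"
)
rows = conn.execute(query).fetchall()
```

Each row now carries only the case ID, the activity and the timestamp; further attributes can be added to `NEEDED_COLUMNS` later if the analysis calls for them.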

Verify your data from ODBC

Once the ODBC connection is set up, we advise you to verify the data that was loaded. In other words, check whether the correct databases were queried and the necessary fields were retrieved.

Creating a process graph is the easiest way to check the data, as it requires all three attributes and visualizes the process.
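
To show why all three attributes are required, here is a rough Python sketch of one common way to derive the edges of a process graph: counting how often one activity directly follows another within the same case. It is not a description of how the ProcessGold tool builds its graph, and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical events as (case_id, activity, timestamp) tuples.
# ISO-formatted timestamps sort correctly as plain strings.
events = [
    ("PO-1", "Purchase order created", "2019-03-01T09:00"),
    ("PO-2", "Purchase order created", "2019-03-02T10:00"),
    ("PO-1", "Goods receipt", "2019-03-04T14:30"),
    ("PO-2", "Goods receipt", "2019-03-05T08:00"),
    ("PO-1", "Outgoing payment created", "2019-03-10T11:15"),
]


def directly_follows(events):
    """Count how often activity A is immediately followed by B within a case."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))

    edges = Counter()
    for case_events in by_case.values():
        # Order each case by timestamp, then pair up consecutive activities.
        ordered = [activity for _, activity in sorted(case_events)]
        edges.update(zip(ordered, ordered[1:]))
    return edges


graph = directly_follows(events)
```

The case ID groups the events, the timestamp orders them, and the activity names become the nodes. If any of the three is missing or wrong, the resulting graph will not resemble the real process.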

The process graph should look somewhat similar to the process being analyzed. Unexpected events or attributes should be investigated further, for example when an attribute has only NULL values.
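
A check for attributes that contain only NULL values can also be scripted outside the tool. The sketch below assumes the loaded rows are available as a list of Python dictionaries, with `None` representing NULL; the function name and the sample rows are hypothetical.

```python
def null_only_attributes(records):
    """Return the attribute names whose value is None in every record."""
    if not records:
        return []
    return sorted(
        name
        for name in records[0]
        if all(record.get(name) is None for record in records)
    )


# Hypothetical rows as they might come back from a query.
rows = [
    {"case_id": "PO-1", "activity": "Goods receipt", "vendor": None, "plant": "NL01"},
    {"case_id": "PO-2", "activity": "Goods receipt", "vendor": None, "plant": None},
]
suspicious = null_only_attributes(rows)
```

Here only `vendor` is reported, because `plant` has a value in at least one row. An attribute flagged this way may point to a wrong table, a wrong join or a field that is simply not used in the ERP system.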

The ProcessGold tool includes a scan for application issues. This scan will also check your attributes for error values. Most of the time, error values are an indication that something is wrong with the connection.

How do you check whether data is optimized?

With ERP systems, we often see millions of records logged over several years, but analysts may not need all this information to perform their analysis. To improve performance, it’s wise to consider what information is really necessary. If only current year data is needed, it’s possible to filter out older data in the query, assuming ODBC is used.
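
Filtering in the query could look like the Python sketch below. Again, an in-memory SQLite database stands in for the ERP database behind the ODBC connection, and the `events` table, its columns and the `events_since` helper are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite as a stand-in for the ERP database; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (case_id TEXT, activity TEXT, event_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("PO-1", "Purchase order created", "2018-11-20 09:00:00"),
        ("PO-2", "Purchase order created", "2019-01-15 10:00:00"),
        ("PO-2", "Goods receipt", "2019-02-01 08:30:00"),
    ],
)


def events_since(connection, start):
    """Fetch only the events recorded on or after `start` (ISO date string)."""
    cursor = connection.execute(
        "SELECT case_id, activity, event_time FROM events "
        "WHERE event_time >= ? ORDER BY event_time",
        (start,),  # parameterized, so the date is never pasted into the SQL text
    )
    return cursor.fetchall()


recent = events_since(conn, "2019-01-01")
```

Note that filtering on the event timestamp can cut off the early events of cases that started before the cut-off date. Whether that is acceptable depends on the analysis; filtering on the start date of the case is an alternative.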

If you can’t or don’t want to limit the amount of data, and the application becomes slower as a result, sharding may be a good solution. A good split could be a different application for each process within the ERP system.

Another tip to improve performance is to look at the attributes needed. Often, the process owner will say everything is important when in practice they are only interested in a few attributes. Think logically about what is really important and check whether the table fields actually contain data.


In conclusion, there are some key considerations when preparing your dataset for Process Mining.

  • Always try to make an ODBC connection instead of using files.
  • Try to have both an IT specialist and a domain expert present when verifying the data.
  • Start with the three essential attributes to verify the data and add attributes to enrich the data afterwards.
  • Use common sense: Does what you see match expectations, and do you need it all?

At ProcessGold, our expert consultants support the implementation of your Process Mining solution from the start, ensuring your organization reaches its full potential with our platform.

Happy mining!

Thijs Ledeboer, Consultant @ProcessGold
Rik Schreurs, Software Engineer @ProcessGold