Reduce time loss and increase software reliability through automated data generation
Having a test data set is a given when working on a project, but it is often neglected and not exploited to its full potential.
To get started, we will look at the fundamental purpose of a test data set, and then see why it is important to automate its generation, both to avoid wasted time and to bring the dataset to its full potential!
Why would you use a test data set?
It is obvious that a test data set is useful for... testing! But for what kind of test? A sample of fake data is not enough for everything: some tests evaluate specific scenarios and require an adapted dataset.
We won't cover those tests here; using data generation tools to set up such scenarios would deserve a separate article.
The first use case for a test dataset is that developers can easily navigate the application without having to create their own fake data to access certain functionality. The dataset should contain enough data to cover the basic use cases of the application.
Even if developers do not need realistic data to run the application, it is worth building a realistic dataset, especially for applications with a strong business context. Developers don't necessarily know the application's business domain, and a realistic dataset greatly helps them understand the application's context smoothly.
And to kill two birds with one stone: if the data is realistic, you can also use the dataset for demonstrations to leads.
If there is no centralized dataset, each developer ends up making their own somehow, which is a clear waste of time. Or even worse, they will use production data, which can lead to serious privacy issues.
Using production data certainly seems like a very bad idea, and it's always better to build your own dataset, but some databases can be very complex. Creating your own data by hand can take a lot of time (even more if you want the data to be realistic), and under deadline pressure, this kind of scenario happens very often...
This is the main reason to automate the generation of this test data: to avoid wasting time and to avoid falling back on the shortcut of using production data.
Manage dataset maintenance
If you are already convinced of the usefulness of such a dataset, and you already have one for your project (which is not the production dataset 👀), you don't need an automatic data generation tool, right? Drum roll... Of course you do!
A dataset must evolve with the project: the same complexity of generating fake data comes back every time the database changes.
Imagine you just move a column to another table, add a single column or table, or change a one-to-many into a many-to-many relation. There are so many cases where the test dataset has to be edited.
It becomes very easy to abandon the realistic side of the data, or just work with a half-full dataset and have the developers add their own data.
Having a tool that directly understands our data structure and can generate realistic data from it solves this kind of problem.
Test at scale
Automating the generation of the test dataset also makes it easier to test our applications at scale.
Testing at scale is a common need when developing new features: it is hard to predict how a feature will behave when faced with real volumes of data. Of course, we cannot make precise estimates, since the machines we test on will certainly not have the same performance as those in production.
Testing with a lot of data still allows us to have an idea of the fluidity and behavior of our algorithms when faced with a larger volume of data.
Without data generation automation tools, developers tend to write scripts themselves to generate data, which necessarily takes time, and when time is short, this kind of test falls by the wayside.
Which is unfortunate, because identifying slowness issues as soon as a feature is designed can avoid damage to the project's image and useless calls to support.
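Such a hand-rolled scale-test script might look like the minimal sketch below, which fills an in-memory SQLite table with 100,000 generated rows and times the insertion. The `orders` table and its columns are made up for the example; this is the kind of throwaway code developers end up rewriting when no generation tool is available.

```python
import random
import sqlite3
import string
import time

# Throwaway scale-test setup: an in-memory database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, ref TEXT, amount REAL)")

def random_ref():
    # An 8-character alphanumeric order reference.
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=8))

# Generate 100,000 fake rows, then time the bulk insert.
rows = [(random_ref(), round(random.uniform(1, 5000), 2)) for _ in range(100_000)]

start = time.perf_counter()
conn.executemany("INSERT INTO orders (ref, amount) VALUES (?, ?)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"Inserted {count} rows in {time.perf_counter() - start:.2f}s")
```

Even this naive version already lets you point the application at a table with realistic volume and watch how queries and screens behave.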
Even if it is obvious to check that the UIs appear as expected, we rarely push them to their limits, for example by entering very long strings, using unusual special characters, or putting far more rows in a table than expected.
This kind of test is often neglected because, again, it is tedious to generate a lot of data of this kind. However, UIs are rarely used exactly as the developers expected.
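To make this concrete, here is a small, purely illustrative list of the kind of adversarial values you would feed a UI: oversized strings, unusual Unicode, markup that must be escaped, and far more rows than a screen normally shows.

```python
# Adversarial values for stress-testing UIs (illustrative, not exhaustive).
EDGE_CASE_STRINGS = [
    "x" * 10_000,                      # a very long value with no line breaks
    "Ünïcødé 💥 \u202e reversed",      # accents, emoji, and a right-to-left mark
    "<script>alert('hi')</script>",    # markup that must be escaped, not rendered
    "",                                # empty value
    "   ",                             # whitespace-only value
]

# Many more rows than a typical table view is designed for.
big_table = [f"row {i}" for i in range(50_000)]
```

Feeding these values into forms and lists quickly reveals truncation, layout overflow, and escaping bugs that realistic data never triggers.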
This also shows that we may need several datasets. Indeed, fuzzing data is not realistic at all, and we don't want to mix them.
Ok, that's all well and good, but now, how do I automate the generation of my fake dataset?
Wait, wait, I was getting there. There are several data generation tools, which can be split into several types:
Fake data code libraries
The advantage of code libraries for generating data is that they are often well integrated into the project, but used alone they can be quite laborious: you have to write the generation functions yourself for each of your models. However, some frameworks can automatically derive the data generation functions from the database models.
The main disadvantage of this method is that generating data requires running code, and any change to the generators or to the configuration, such as the number of rows, requires a code change.
Data anonymization
Data anonymization is the process of taking sensitive data and obfuscating it so that the initial information can no longer be recovered, with tools like Tonic that obfuscate the data while preserving its meaning. Once set up, and if you already have a good amount of production data, it's really easy to generate a realistic dataset.
But this technique has already shown flaws (see this article: "Why 'Anonymous' Data Sometimes Isn't"), and it requires already having production data, which doesn't help when developing new features or starting a brand-new project, as we discussed earlier.
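To give a feel for the idea, here is a deliberately naive obfuscation sketch: the local part of an email is replaced by a stable hash, so the value stays join-friendly and email-shaped while no longer identifying the person. Real anonymization tools go much further to keep the data realistic, and, as the article above shows, naive approaches like this one can still leak information.

```python
import hashlib

def anonymize_email(email: str) -> str:
    # Replace the local part with a short, stable SHA-256 digest.
    # Stable: the same input always maps to the same output, so
    # relationships between rows are preserved.
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:10]
    return f"user_{digest}@{domain}"

masked = anonymize_email("jane.doe@acme.com")
```

Keeping the domain intact preserves some realism, but note it can also be identifying on its own; that trade-off is exactly what dedicated tools are designed to manage.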
Fake data generator linked to the dev database
This kind of tool, like Wolebase, represents the perfection and excellence of data generation (yes, look at the URL above, don't expect me to be sober 😎).
Joking aside, the advantage of Wolebase is that it is a desktop application (also available as a CLI) that makes dataset generation easy by introspecting the database schema and proposing generators adapted to the database types.
By making your own selection from a list of generators, you can easily generate realistic data sets or even unexpected data to test the interfaces and robustness of your algorithms.
As it is directly connected to your dev database, it generates data directly into it without having to export/import datasets.
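The general idea of schema introspection can be sketched as follows. This is not Wolebase's implementation, just a toy illustration of the principle against SQLite: read each column's declared type from the schema, pick a naive generator per type, and insert rows directly into the database. The `customer` table and the generator mapping are invented for the example.

```python
import random
import sqlite3
import string

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")

# Map declared SQL types to naive generators -- a real tool offers
# far richer, user-selectable generators per column.
GENERATORS = {
    "TEXT": lambda: "".join(random.choices(string.ascii_letters, k=12)),
    "REAL": lambda: round(random.uniform(0, 1000), 2),
    "INTEGER": lambda: random.randint(1, 10_000),
}

# Introspect the schema: PRAGMA table_info yields
# (cid, name, type, notnull, dflt_value, pk) per column.
columns = [
    (name, col_type)
    for _, name, col_type, *_ in conn.execute("PRAGMA table_info(customer)")
    if name != "id"  # skip the auto-assigned primary key
]

# Generate rows straight into the dev database -- no export/import step.
for _ in range(100):
    conn.execute(
        f"INSERT INTO customer ({', '.join(n for n, _ in columns)}) "
        f"VALUES ({', '.join('?' * len(columns))})",
        [GENERATORS[col_type]() for _, col_type in columns],
    )
```

Because the generators are looked up from the live schema, adding a column to `customer` is picked up automatically on the next run, which is the maintenance benefit described earlier.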
It breaks my heart to say this, but a disadvantage of Wolebase compared to anonymization solutions is that you have to adapt the generators yourself to get realistic data. We would like to improve our system to suggest more realistic generators based on field and table names, but that is not the case at this time.
IT can't work without data, so developers need it to develop and test what they build. Developing with poor-quality data is like practicing cooking with rotten ingredients: it's impossible to know whether your dish will taste good with real food. But fortunately, you can use a fake data generator (like Wolebase 👀) to make delicious meals!