When building a data warehouse, surprisingly little attention is usually paid to cleaning the information that goes into it. The assumption seems to be that the more data is stored, the better. This is bad practice and the surest way to turn your data warehouse into a garbage dump. Data must be cleaned; that is one of the core principles of building a data warehouse. After all, information is heterogeneous and is collected from many different sources, and it is precisely this multitude of collection points that makes the cleaning process especially relevant.

By and large, mistakes are always made, and it is impossible to get rid of them completely. It may be worth turning to data warehousing consulting services, and sometimes it makes more sense to live with certain errors than to spend money and time eliminating them. In the general case, however, you should strive to reduce the number of errors to an acceptable level. The methods used for analysis are already fraught with inaccuracies, so why make the situation worse? In addition, the psychological aspect of the problem must be taken into account: few people will trust results built on data known to be dirty.

Types of errors

We will not consider errors such as type mismatches or differences in input formats and encodings, that is, cases where information comes from different sources that have adopted different conventions for recording the same fact. A typical example is the designation of a person's gender: in one source it is M/F, in another 1/0, in a third True/False. Errors of this kind are dealt with by specifying conversion rules and type casting, and such problems are, by and large, already solved today. We are interested in higher-order problems, those that cannot be solved by such elementary means.
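As a minimal sketch (in Python), such conversion rules can be expressed as a single mapping to a canonical encoding. Note that which code means which gender in a given source (for example, whether 1 stands for male) is an assumption that has to be confirmed per source:

```python
# Map every known source convention for "gender" onto one canonical
# encoding. The 1/0 and True/False correspondences are assumptions.
CANONICAL_GENDER = {
    "M": "M", "F": "F",
    "1": "M", "0": "F",          # assumed: 1 = male, 0 = female
    "True": "M", "False": "F",   # assumed: True = male
}

def cast_gender(raw) -> str:
    # Cast to str first so 1, "1" and True all hit the same rule.
    return CANONICAL_GENDER.get(str(raw), "UNKNOWN")

print(cast_gender(1), cast_gender("F"), cast_gender(True))  # -> M F M
```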

Such errors come in many varieties. Some of them are characteristic only of a specific subject area or task, but let us consider those that do not depend on the task:

  • inconsistency of information;
  • data gaps;
  • abnormal values;
  • noise;
  • data entry errors.

There are proven methods for solving each of these problems. Of course, errors can be corrected manually, but with large amounts of data this becomes impractical. We will therefore consider ways to solve these problems automatically, with minimal human participation.

Inconsistency of information

First, you need to decide what exactly counts as a contradiction. Once contradictions are defined and found, there are several options for action.

If several conflicting records are found, delete them all. The method is simple and therefore easy to implement, and sometimes it is enough. It is important not to overdo it, though, or we may throw the baby out with the bathwater.

Correct the inconsistent data. You can estimate the probability of each of the conflicting events and keep the most likely one. This is the most competent and correct way of dealing with contradictions; a minimal sketch follows.
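The text does not fix a concrete estimator, so in this hedged sketch the probability of each conflicting value is approximated by its observed frequency, and the most frequent value wins. The customer_id and gender columns are assumptions:

```python
import pandas as pd

# Hypothetical extract: customer 101 arrives from several sources
# with conflicting gender values.
records = pd.DataFrame({
    "customer_id": [101, 101, 101, 102, 102],
    "gender":      ["F", "F", "M", "M", "M"],
})

# Keep the most frequent value per key: the empirical frequency acts
# as a stand-in for "the probability of each conflicting event".
resolved = (
    records.groupby("customer_id")["gender"]
           .agg(lambda s: s.mode().iloc[0])
           .reset_index()
)
print(resolved)  # 101 -> F, 102 -> M
```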

Data gaps

A very serious problem, and a scourge of most data warehouses. Most forecasting methods are built on the assumption that data arrives in an even, continuous stream. In practice, this is extremely rare. As a result, forecasting, one of the most in-demand applications of data warehouses, ends up implemented poorly or with significant limitations. One common remedy is sketched below.
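The article names no specific technique, so purely as an assumed illustration: a time series can be reindexed to a regular frequency, which makes the gaps explicit, and then filled by linear interpolation. Whether interpolation is appropriate depends on the nature of the data:

```python
import pandas as pd

# Hypothetical daily sales series with two missing days.
sales = pd.Series(
    [120.0, 135.0, 128.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
)

# Reindexing to a daily frequency exposes the gaps as NaN;
# linear interpolation then fills them from the neighbouring points.
regular = sales.asfreq("D").interpolate(method="linear")
print(regular)
```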

Abnormal values

Quite often, events occur that fall far outside the general picture, and it is best to correct such values. The reason is that forecasting tools know nothing about the nature of the underlying processes, so any anomaly will be treated as a perfectly normal value and the picture of the future will be badly distorted: some accidental failure or windfall will be mistaken for a pattern.
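One possible way to do this automatically (an assumption, not a method the text prescribes) is to flag values whose robust z-score, based on the median absolute deviation, is extreme, and replace them with the median:

```python
import numpy as np

# Hypothetical daily revenue with one anomalous spike.
revenue = np.array([100.0, 104.0, 98.0, 101.0, 990.0, 99.0, 103.0])

# Flag points more than 3 median absolute deviations from the median
# (a robust z-score), then replace them with the median.
median = np.median(revenue)
mad = np.median(np.abs(revenue - median))
robust_z = 0.6745 * (revenue - median) / mad
cleaned = np.where(np.abs(robust_z) > 3.0, median, revenue)
print(cleaned)  # the 990.0 spike is replaced by the median
```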

Noise

We encounter noise in almost every analysis. Noise carries no useful information; it only prevents us from seeing the picture clearly.
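Again as an assumption rather than a method the text prescribes, a common way to suppress noise is smoothing, for example with a rolling mean; the window size is a tuning choice:

```python
import pandas as pd

# Hypothetical noisy measurement series.
raw = pd.Series([10.1, 12.3, 9.8, 11.0, 10.6, 12.9, 9.5, 11.2])

# A centred three-point rolling mean damps high-frequency noise
# while preserving the slower trend.
smoothed = raw.rolling(window=3, center=True).mean()
print(smoothed)
```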

Data entry errors

In general, this is a topic for a separate conversation, since the number of kinds of such errors is too large: typos, deliberate data corruption, format mismatches, and that is without counting the errors typical of a particular data entry application. Proven methods exist for most of them. Some are obvious; for example, format checks can be performed before data enters the storage. Others are more sophisticated; typos, for example, can be corrected with the help of various kinds of thesauri. Both ideas are sketched below.
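A hedged sketch of both techniques: a format check that rejects malformed dates before loading, and typo correction by fuzzy matching against a thesaurus of known values. The city list, the date format, and the 0.8 cutoff are all assumptions:

```python
import difflib
import re

CITY_THESAURUS = ["London", "Lisbon", "Ljubljana"]  # assumed reference list

def clean_entry(date_str: str, city: str) -> tuple[str, str]:
    # Format check before loading: reject dates not in YYYY-MM-DD.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_str):
        raise ValueError(f"bad date format: {date_str!r}")
    # Typo correction against a thesaurus of known values.
    match = difflib.get_close_matches(city, CITY_THESAURUS, n=1, cutoff=0.8)
    return date_str, match[0] if match else city

print(clean_entry("2023-04-01", "Lodnon"))  # -> ('2023-04-01', 'London')
```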

Summary

Errors are a very big problem. In fact, they can negate all the effort put into creating a data warehouse. Moreover, this is not a one-time operation but constant, ongoing work: a place is clean not where no one litters, but where someone cleans up. Specialists of the DataArt company know how to deal with the problem. The ideal option is to create a gateway through which all data passes on its way into the storage.
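Purely as an illustration of the gateway idea, and in no way DataArt's actual implementation, such a gateway can be a thin chain of per-record checks applied before anything is written to the warehouse; every name and rule here is hypothetical:

```python
GENDER_MAP = {"M": "M", "F": "F", "1": "M", "0": "F"}  # assumed codes

def normalize(record: dict) -> dict:
    # Unify source encodings (see the gender example above).
    record["gender"] = GENDER_MAP.get(str(record.get("gender")), "UNKNOWN")
    return record

def validate(record: dict) -> dict:
    # Reject records that fail basic sanity checks.
    if record.get("amount") is None or record["amount"] < 0:
        raise ValueError(f"rejected record: {record}")
    return record

def gateway(record: dict) -> dict:
    # Every inbound record passes through the same chain of checks.
    return validate(normalize(record))

print(gateway({"gender": "1", "amount": 42.5}))
```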

 
