Good quality data: practical tips for Data Analysts

Garbage in, garbage out

Besides searching, visualising and analysing data, data analysts also have to clean the data they work with and keep it clean. The reliability of an analysis depends on the reliability of the data used, which is known as the 'garbage in, garbage out' principle: the output of an analysis cannot be good if the quality of the data is not (Kilkenny & Robinson, 2018).

The difference between good-quality and poor-quality data comes down to a number of factors (Teslow, 2016), of which this article discusses the following:

  • Consistency (and completeness)
  • Accuracy (and precision)
  • Topicality

Consistency (and completeness)
Since many analysts use data from different sources, there is a high probability that the same data is formatted slightly differently depending on the source. This can lead to data duplication, with the same data appearing multiple times in a dataset. Such duplication distorts analyses and causes issues to be overlooked (Rahm & Do, 2000). Consistency means that the data is recorded in a uniform way and does not contradict itself, and duplicated data undermines that consistency. Consistency is also related to completeness, which means that no important data is missing. When data is duplicated, the information about one entity is spread over several records, so each record on its own is incomplete. For example, if a person appears twice in a database and phone and address details are added to one record while bank details are added to the other, both records of this person are incomplete because of the duplication.

Practical tip 1: Standardise data fields and formats before merging data from different sources. This minimises compatibility issues and ensures the consistency of the resulting dataset.
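As an illustration of tip 1, the following minimal Python sketch brings names and dates from two sources into one format before they are merged. The field names (name, date_of_birth), the list of date formats and the sample values are assumptions made for this example and are not taken from any particular system. Note that day-first formats are assumed here; which formats a source really uses has to be agreed on beforehand.

```python
from datetime import datetime

# Assumed date formats used by the different sources (hypothetical).
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")


def standardise_name(name: str) -> str:
    """Collapse repeated whitespace and apply consistent capitalisation."""
    return " ".join(name.split()).title()


def standardise_date(value: str) -> str:
    """Return the date as an ISO 8601 string (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognised date format: {value!r}")


def standardise_record(record: dict) -> dict:
    """Standardise the fields that are later used to compare records."""
    return {
        "name": standardise_name(record["name"]),
        "date_of_birth": standardise_date(record["date_of_birth"]),
    }


# Made-up sample records describing the same person in two sources.
source_a = {"name": "  jan  JANSEN ", "date_of_birth": "03-02-1985"}
source_b = {"name": "Jan Jansen", "date_of_birth": "1985-02-03"}

print(standardise_record(source_a) == standardise_record(source_b))
```

After standardisation the two records compare as equal, so a later merge step can recognise them as the same person instead of storing them twice.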

Practical tip 2: In i2 iBase, setting the correct field, or combination of fields, as discriminator fields is essential for preventing data duplication when importing data or manually creating new records in the database. These fields ensure that duplicates are detected. They do not, however, prevent a duplicate from being stored anyway if the person entering the data chooses to do so.

Practical tip 3: Duplicates can still occur in a database despite discriminator fields, so it is important to check for them regularly. In i2 iBase this can be done with the Duplicate Records Checker, which searches for duplicate records within a database based on the contents of specified fields.
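Outside i2 iBase, the same idea of comparing records on the contents of specified fields can be sketched in a few lines of Python. This is not how the Duplicate Records Checker is implemented; the functions find_duplicates and merge_group, the field names and the sample data are all hypothetical. The example mirrors the person from the section above, whose phone details and bank details ended up in two separate records.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Group records that share the same values in the given key fields."""
    groups = defaultdict(list)
    for record in records:
        # Compare case-insensitively and ignore surrounding whitespace.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


def merge_group(group):
    """Combine duplicates, keeping the first non-empty value per field."""
    merged = {}
    for record in group:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


# Made-up sample data: the first two records describe the same person.
records = [
    {"name": "Jan Jansen", "date_of_birth": "1985-02-03", "phone": "0612345678"},
    {"name": "jan jansen", "date_of_birth": "1985-02-03", "iban": "NL00BANK0123456789"},
    {"name": "Piet de Vries", "date_of_birth": "1990-07-15", "phone": "0687654321"},
]

duplicates = find_duplicates(records, ["name", "date_of_birth"])
merged = merge_group(duplicates[0])
print(merged)
```

The merged record contains both the phone number and the bank details, which restores the completeness that was lost through duplication. In practice a person should review each group before merging, because matching key fields do not guarantee that two records describe the same entity.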

Accuracy (and precision)
Accurate data means that the data does not contain errors and corresponds to reality. Accuracy goes hand in hand with precision, which means that the data is exact and does not contain unnecessary deviations. Verifying and validating both the sources and the data before entering the data is therefore essential. When data has potentially reduced reliability, it is important that this is communicated in a consistent manner.

Regularly checking the quality of the data, including its consistency, can also contribute to accuracy and precision. Again, having clear internal agreements on how data is stored and formatted is extremely important.

Practical tip 1: In i2 iBase, reliability fields can be added to records in which both source and data reliability can be indicated. In databases that do not contain these fields, source fields can be used where reference is made to the source of the information, and then it can be mentioned in the comments that there may be doubts about its reliability.

Practical tip 2: In i2 Analyst's Notebook, degrees of assurance can be added to both entities and links between them. These gradations become analysis attributes that can be included when, for example, searching or sorting the data.

Topicality
Data should not only be up to date in order to give the most realistic and complete picture possible; it should also be in line with data retention periods in order to comply with the GDPR. According to Article 5(1)(e) of the GDPR, personal data must be kept no longer than is necessary for the purposes for which it is processed. This means that data analysts need to understand how long certain data may be kept, both to comply with legislation and to meet the needs of their analyses. Similarly, under the Police Data Act (Wpg), police data cannot be used and retained indefinitely: retention periods have been set for specific circumstances. Other agencies also have to deal with data retention periods. Violating these periods can have consequences for, among other things, the legal validity of the data.

Practical tip: i2 has developed a tool specifically for this purpose called the i2 iBase Weeder. This tool tracks data retention periods and ensures that data is deleted when the retention period is reached, so that these deadlines are met without having to monitor them manually.
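The principle behind such a retention check can be sketched in Python as follows. This is not the i2 iBase Weeder and does not reflect how that tool works internally. The record categories and retention periods below are invented for the example; real periods have to come from the applicable legislation and internal policy.

```python
from datetime import date, timedelta

# Hypothetical retention periods per record category (NOT legal values).
RETENTION = {
    "case_note": timedelta(days=365),
    "contact_detail": timedelta(days=5 * 365),
}


def is_expired(record: dict, today: date) -> bool:
    """Return True when the record is older than its retention period."""
    return today - record["created"] > RETENTION[record["category"]]


def split_by_retention(records, today: date):
    """Separate records to keep from records that are due for deletion."""
    keep, expired = [], []
    for record in records:
        (expired if is_expired(record, today) else keep).append(record)
    return keep, expired


# Made-up sample records.
records = [
    {"id": 1, "category": "case_note", "created": date(2020, 1, 1)},
    {"id": 2, "category": "contact_detail", "created": date(2022, 6, 1)},
]

keep, expired = split_by_retention(records, today=date(2023, 1, 1))
print([r["id"] for r in keep], [r["id"] for r in expired])
```

Passing the reference date in explicitly, rather than reading the system clock inside the function, makes the check reproducible and easy to test.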

References

General Data Protection Regulation [GDPR] (2016), Article 5(1)(e) and Articles 13-14.

Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R., & Huang, J. (2020). Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From? In Conference on Fairness, Accountability, and Transparency (FAT* '20), January 27–30, 2020, Barcelona, Spain (18 pages). ACM, New York, NY, USA. [Online]. Available at: https://stuartgeiger.com/papers/gigo-fat2020.pdf [Accessed 27 August 2023]. DOI: https://doi.org/10.1145/3351095.3372862

Kilkenny, M. F., & Robinson, K. M. (2018). Data quality: "Garbage in – garbage out." Health Information Management Journal, 47(3), 103-105. DOI: 10.1177/1833358318774357

Pressman, R. (2014). Software Engineering: A Practitioner's Approach. McGraw-Hill Education.

Rahm, E., & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4), 3-13.

Teslow, M. (2016). Health data concepts and information governance. In M. Abdelhak & M. A. Hanken (Eds.), Health Information: Management of a Strategic Resource (5th ed., pp. 88–144). St Louis, Missouri: Elsevier Saunders.

Police Data Act [Wet politiegegevens, Wpg] (2018).
