- What does the term ‘data profiling’ mean?
Data profiling is the process of examining data from the available sources to confirm that it is suitable for analysis and to highlight any quality issues. This can involve applying statistical tools, extracting metadata, checking that values fall within expected ranges, checking consistency between datasets and identifying probable key values and relationships.
A data profiling tool may be used to profile the data and, in some cases, may apply business rules provided by the user to expose problems such as:
- Illegal values, e.g. a field may only have legal values of Y or N but some rows contain X due to sloppy validation in the source application or OCR errors, a NOT NULL field contains NULL values for some rows, or a field with a UNIQUE constraint contains duplicate values.
- Out-of-range errors, e.g. a numeric field can only have values between 0 and 100 but some rows contain negative values.
- Unmatched references, e.g. a product code field in the orders data must match a product code in the products data.
Depending on the target system, these errors may cause the data to fail to load or may cause the analysis to give incorrect results.
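As a rough sketch of how such rule-based checks might look in practice, the example below uses pandas; the file names and column names (orders.csv, products.csv, gift_wrap_flag, discount_percent and so on) are assumptions made purely for illustration, not a reference to any particular tool or system.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
products = pd.read_csv("products.csv")

# Illegal values: a flag field that should only ever contain Y or N
bad_flags = orders[~orders["gift_wrap_flag"].isin(["Y", "N"])]

# NOT NULL violations: customer_id should always be populated
missing_customers = orders[orders["customer_id"].isna()]

# UNIQUE violations: order_id should never repeat
duplicate_ids = orders[orders["order_id"].duplicated(keep=False)]

# Out-of-range errors: discount_percent should lie between 0 and 100
out_of_range = orders[~orders["discount_percent"].between(0, 100)]

# Unmatched references: every product_code on an order should exist in products
unmatched = orders[~orders["product_code"].isin(products["product_code"])]

for name, frame in [("illegal flag values", bad_flags),
                    ("missing customer ids", missing_customers),
                    ("duplicate order ids", duplicate_ids),
                    ("out-of-range discounts", out_of_range),
                    ("unmatched product codes", unmatched)]:
    print(f"{name}: {len(frame)} row(s)")
```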
- Data usually goes through a cleansing process in preparation for data analysis. The purpose of this process is to remove/correct inaccurate or corrupt records to improve data accuracy and quality.
When the cleansing process finds an error or issue that causes data values not to be accepted, the following decisions must be made:
- Do we accept or reject the data?
- Do we correct the error?
- Do we create a default value?
Please explain in more detail
A data analyst may choose to load the erroneous data if the impact is small and the rest of the data in the affected rows is still useful and unlikely to be affected by the error. Alternatively, they may reject all rows containing erroneous data if an error in one field means that other fields in the same row are also likely to be erroneous, and/or if the impact of the erroneous data on the final analysis is likely to be greater than the impact of simply excluding those rows. The decision to accept or reject the data may depend on an in-depth knowledge of the systems and datasets involved. It should be documented, and the source of the error investigated to minimize future recurrence.
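As a simple, hypothetical illustration of the accept/reject decision, the sketch below splits a dataset into accepted and rejected rows and writes the rejected rows out with a reason, so the decision is documented and the source of the errors can be investigated. The file name, column name and rule are assumptions for the example.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")

# Assumed rule: a discount outside 0-100 makes the whole row suspect.
error_mask = ~orders["discount_percent"].between(0, 100)

accepted = orders[~error_mask]
rejected = orders[error_mask].assign(reject_reason="discount_percent out of range")

# Keep the rejected rows so the decision is documented and traceable.
rejected.to_csv("rejected_orders.csv", index=False)
print(f"accepted {len(accepted)} rows, rejected {len(rejected)} rows")
```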
Correcting erroneous data primarily depends on whether the correct value can be determined to an acceptable degree of accuracy and certainty, and on whether the risk of an inaccurate value is preferable to using a known erroneous value or blanking the field. There are a number of potential ways to determine the correct value, depending on the dataset and the field. If other fields in the same row can be used to identify the value, that may give a high level of confidence as to what the value should be: for example, if someone has selected the title Mrs, Miss or Ms then there is a very high probability that the Sex field should be Female; if they selected Mr then the Sex field should probably be Male; if they selected Mx then the Sex field should be Prefer Not To Say; and any other title may be best classified with Sex as Unknown. Similarly, it may be possible to infer the correct value from other records relating to the same entity or transaction. How the correct value (within an acceptable error margin) is determined will rely on an understanding of the data and of the potential impact of getting it wrong. ‘Corrected’ data should not be written back to the source systems, but, particularly where the data concerns living individuals, the owners of the source systems should be notified if practicable, because the General Data Protection Regulation requires that data about an individual be kept current and accurate.
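The title-based inference described above can be sketched as a simple mapping. The file name and the Title and Sex column names are assumptions made for this example.

```python
import pandas as pd

people = pd.read_csv("people.csv")

# Assumed mapping from title to an inferred Sex value, per the reasoning above.
title_to_sex = {"Mrs": "Female", "Miss": "Female", "Ms": "Female",
                "Mr": "Male", "Mx": "Prefer Not To Say"}

valid_values = ["Female", "Male", "Prefer Not To Say", "Unknown"]

# Only correct rows whose Sex value is missing or not one of the legal values;
# any unrecognised title falls back to Unknown.
needs_fix = people["Sex"].isna() | ~people["Sex"].isin(valid_values)
people.loc[needs_fix, "Sex"] = (
    people.loc[needs_fix, "Title"].map(title_to_sex).fillna("Unknown")
)
```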
Sometimes it might be more useful to replace erroneous data with a default value. This suits cases where the row as a whole is needed, there is no practicable way to infer the correct value, and it is important to have a value in that field but less of a problem if that value is not the original (e.g. if the field is to be counted or averaged, many systems will ignore rows where that field is null). The value used could be a fixed value, or it may be more useful to use a calculated or random value. Examples of calculated values for numeric fields include the median, mode, maximum or minimum of the field across the rows where it is not erroneous. For non-numeric fields with a small number of possible values (e.g. Sex) one might randomly assign one of the possible values or assign them in a repeating sequence.
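A minimal sketch of the default-value approach, assuming a numeric Age column imputed with the median of the valid values and a Sex column filled with a random choice from the possible values (all names are assumptions for the example):

```python
import numpy as np
import pandas as pd

people = pd.read_csv("people.csv")
rng = np.random.default_rng(seed=42)

# Numeric field: use the median of the non-erroneous values as the default.
people["Age"] = people["Age"].fillna(people["Age"].median())

# Categorical field with few possible values: assign one at random.
missing = people["Sex"].isna()
people.loc[missing, "Sex"] = rng.choice(
    ["Female", "Male", "Prefer Not To Say"], size=int(missing.sum())
)
```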
Obviously, before settling on a method for handling erroneous values you would need to assess, based on your understanding of the data, the potential impact each approach would have on the results. If the analysis is to be repeated, the assessment should typically include identifying the source of the errors and recommending how to avoid recurrence.
- A Data Quality (DQ) Dimension is a recognised term used by data management professionals to describe a feature of data that can be measured or assessed against defined standards in order to determine the quality of data.
There are six dimensions of data quality:
- Completeness
- Uniqueness
- Timeliness
- Validity
- Accuracy
- Consistency
Please explain in more detail
- Completeness refers to all the required data being provided. Data can still be complete if optional data is missing but all mandatory fields must be present in every record.
- Uniqueness of data requires that there be no duplication of data, i.e. records that occur more than once. This can be an issue where two datasets that have some data in common are merged, e.g. quarterly sales reports are merged to create an annual sales report but, due to an error in the extraction routine, the last day of one quarter is repeated as the first day of the next quarter. It can also be a problem where the same record at different points in time appears in different extracts that are then merged. For example, in an order processing system the order record may be created with a status of new, move to placed, then to confirmed and so on to closed. A regular data extract that extracts all records, or all records updated since the last extract, may contain instances of this record at different statuses in different extracts. If these datasets were merged, a simple deduplication routine (such as you might apply in the first example) may not work, as the records are not identical in their entirety. To ensure uniqueness in this situation some sort of reliable unique key and timestamp would be needed so the deduplication routine can return only the latest version of each record based on the unique key (a sketch of such a routine follows this list).
- Timeliness requires that the data be available/supplied when needed. For example, if a report is needed for a meeting and will take a week to build, then the data must be supplied at least a week before the meeting. It can also cover the requirement that the data relates to the correct period being reported on, e.g. if the report is about March then you do not want data about February to be submitted instead.
- Validity means that the data is in the correct format and units (or, where that is not possible, can be converted to them). A common issue is that different parts of the business report the same data in different formats or units due to tradition or national conventions; for example, dates in most parts of the world are recorded as DD/MM/YYYY but in the US are typically recorded as MM/DD/YY. Many systems avoid this internally by storing dates as a number of days since a known point in time (often 1st January 1900 or 1st January 1970), but data transferred between systems may still suffer from date format differences. Another common issue in data warehousing, where historical data may be stored, is that the unit of measure changes over time. This might be a change from imperial to metric measures (e.g. a product that used to be sold in packages of 1lb is now sold in packages of 500g; 1lb = 453.6g), or a change in the source systems meaning that data which used to be reported as single items is now reported as combined items or vice versa (e.g. a stock control system that used to record baked beans by the number of cans now records the number of 48-can cases). Where issues of validity occur, the data analyst must identify and execute a strategy to resolve them and ensure that all data is in the same format and units (see the sketch after this list).
- Accuracy looks at how well the data reflects the real world. Does it correctly model the real-world objects or events that it should? This can often be confirmed by taking a representative sample of the data and measuring it again or, if that is not possible, by using triangulation methods (use two existing observations to predict a third and confirm the prediction is correct) or by comparing with past datasets to ensure that any changes are within expected ranges. It can also cover ensuring that text fields are spelled correctly, that postal/zip codes match the associated address &c.
- Consistency covers ensuring that data from different sources describes the same underlying situation (if one dataset implies a fact then the other datasets should imply the same fact). For example, if a person’s HR record shows that they have been dismissed, then payroll should not contain a payment record for them and their login account on the network should be disabled; similarly, if a project is marked as closed in the project management system then there should be no entries in the timesheet system logged against that project.
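The uniqueness, validity and consistency checks referred to above might be sketched along the following lines. All file names, column names and date formats here are assumptions made for illustration only.

```python
import pandas as pd

# Uniqueness: merge two quarterly extracts and keep only the latest version of
# each order, assuming an order_id key and a last_updated timestamp exist.
extracts = pd.concat([pd.read_csv("orders_q1.csv"), pd.read_csv("orders_q2.csv")])
latest = (extracts
          .sort_values("last_updated")
          .drop_duplicates(subset="order_id", keep="last"))

# Validity: normalise US-style MM/DD/YY dates to a single internal date type.
latest["order_date"] = pd.to_datetime(latest["order_date"], format="%m/%d/%y")

# Consistency: dismissed employees should not appear in the payroll run.
hr = pd.read_csv("hr_records.csv")        # assumed columns: employee_id, status
payroll = pd.read_csv("payroll_run.csv")  # assumed columns: employee_id, amount
dismissed = hr.loc[hr["status"] == "Dismissed", "employee_id"]
inconsistent = payroll[payroll["employee_id"].isin(dismissed)]
print(f"payments to dismissed employees: {len(inconsistent)}")
```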
Suggested websites
https://datacadamia.com/data/quality/profiling#description
https://datacadamia.com/data/quality/data_correction