I’m doing a Level 4 qualification in data analysis and the training provider recently sent me a set of exercises to help prepare for an exam I’ll have to take at some point. The idea is to research the questions on the web and then fill in the answers. This is the first batch.
- What is Raw data? Provide at least one example. Collected data that has not been processed in any way; data that is substantively as it came from the source. An example of this might be the output from sensors in a building or area, or the results of a survey.
- What is Processed data? Provide at least one example. Data that has been changed through processing from the source. This might include merging with other datasets, summation or categorization. An example of this might be taking the output from a number of sensors over a period of time, which includes the ID of the sensor, the time of the reading and the value returned. That dataset might be merged with a geographical dataset that maps the sensor ID to the location of that sensor. The dataset might then be summarized to the average value for a certain period of time (e.g. the sensor produces a reading every minute but you only need the average for each half hour) and then categorized (e.g. “Much too high”, “Somewhat too high”, “OK”, “Somewhat too low” and “Much too low”) based on the value, with the category added as an extra column.
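The merge/summarize/categorize pipeline described above can be sketched in Python. Everything here is hypothetical (the sensor IDs, the location map and the category thresholds are invented for illustration); it is a minimal sketch of the idea, not a real system.

```python
from statistics import mean
from collections import defaultdict

# Hypothetical raw readings: (sensor_id, minute_of_day, value)
readings = [
    ("S1", 0, 19.5), ("S1", 10, 20.1), ("S1", 25, 20.4),
    ("S1", 30, 24.0), ("S1", 45, 26.2),
]

# Merge step: a separate 'geographical' dataset mapping sensor ID to location
locations = {"S1": "Main Hall"}

# Summarize step: group the minute-by-minute values into half-hour windows
windows = defaultdict(list)
for sensor_id, minute, value in readings:
    windows[(sensor_id, minute // 30)].append(value)

def categorise(value):
    """Categorise a reading (thresholds are purely illustrative)."""
    if value > 25:
        return "Much too high"
    if value > 22:
        return "Somewhat too high"
    if value >= 18:
        return "OK"
    if value >= 15:
        return "Somewhat too low"
    return "Much too low"

# Processed output: location, window start minute, average, category column
processed = [
    (locations[sid], window * 30, round(mean(vals), 1), categorise(mean(vals)))
    for (sid, window), vals in sorted(windows.items())
]
```

The result is one row per sensor per half hour, with the category added as the extra column described above.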
- Provide examples of CSV, XML, RTF and TXT Files? [Not 100% sure what this question is asking for]
CSV stands for Comma-Separated Values (or, more loosely, Character-Separated Values). These are structured files where each line typically represents one record made up of alphanumeric values separated by a specific character (often a comma, but it can be a tab, pipe, slash or other character). CSV data is typically easily mapped to a database table or spreadsheet as field identity is order based: the first field is everything between the start of the line and the first separator, the second is everything between the first separator and the second, &c. CSV is the default data export format for many commercial systems as it is non-proprietary and widely supported.
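As a small illustration of the order-based field identity, Python’s standard `csv` module can read such a file; the data here is invented and the file is simulated in memory with `io.StringIO`.

```python
import csv
import io

# Hypothetical CSV export: a header line, then one record per line,
# fields identified purely by position
raw = "account,name,balance\n1001,Smith,250.00\n1002,Jones,99.50\n"

# csv.DictReader maps each positional field to its header name; a pipe-
# or tab-separated file just needs delimiter="|" or delimiter="\t"
rows = list(csv.DictReader(io.StringIO(raw)))
```

Note that everything comes back as a string; deciding that `balance` is a number is a later processing step, which ties back to the raw vs processed distinction above.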
XML stands for eXtensible Markup Language and is an implementation of Standard Generalised Markup Language (SGML). These are structured files that use tags enclosed in less-than (<) and greater-than (>) signs to denote fields. Each field can have a value (between the tags) and attributes (key-value pairs in the opening tag). The structure of an XML file is defined by its schema, usually stored as a separate XML file, which can be used to validate the structure of the file. XML can be hierarchical, so a record can represent something that in a relational database might require multiple tables, e.g. invoices, where the invoice details and the line item details would be held in separate tables joined on invoice number.
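A tiny sketch of that invoice example, using Python’s standard `xml.etree.ElementTree`; the document and all its tag and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchical invoice: line items nested inside the invoice,
# which a relational model would split across two tables joined on number
doc = """
<invoice number="INV-42">
  <customer>Acme Ltd</customer>
  <line sku="A1" qty="2">Widget</line>
  <line sku="B7" qty="1">Grommet</line>
</invoice>
"""
root = ET.fromstring(doc)
number = root.get("number")           # attribute: key-value in the opening tag
customer = root.findtext("customer")  # value: the text between the tags
lines = [(l.get("sku"), int(l.get("qty"))) for l in root.findall("line")]
```

One `<invoice>` element carries both the header details and the repeating line items, which is the hierarchy the answer above describes.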
RTF stands for Rich Text Format. These are typically unstructured files, although it may be possible to infer structure from the visual appearance/format, e.g. the block of text in the top right corner is an address, text in a certain font is a heading and alphanumerics organized visually into rows and columns divided by straight lines are a table. They may also include image data.
TXT stands for TeXT or Plain Text. These can be structured or unstructured. CSV is one example of a structured text file; another common example is fixed width, where each field has a specific width (so the first field might be the 1st to 10th characters, the second the 11th to 15th, &c) and if the data in a record for that field is shorter than that width it is padded with spaces. Unstructured text files can be any text data, such as that extracted from a letter/email, a blog post, a web page, a PDF document, a book, a magazine article &c. As described for RTF, it may be possible to infer structure from the visual formatting (although many of the cues such as font, justification, graphical lines &c will be missing).
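Fixed-width parsing is just string slicing by column position. A minimal sketch, with an invented layout (name in characters 1–10, city in 11–15, space-padded as described above):

```python
# Hypothetical fixed-width record and layout: each field is identified by
# its (start, end) character positions, not by a separator
record = "SMITH     LEEDS"
fields = {"name": (0, 10), "city": (10, 15)}

# Slice out each field and strip the space padding
parsed = {k: record[a:b].rstrip() for k, (a, b) in fields.items()}
```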
- What does data transformation mean? Data transformation is any processing that changes the data or its arrangement. This may include summarization, removal of unneeded fields, filtering, normalization, denormalization, cleansing, categorization or the creation of calculated fields. A common form of transformation is to take data from a transactional system that is in 3rd Normal Form across many tables (optimized for many small transactions and for data integrity, by storing each data item only once and using key references) and denormalise it to a star schema with most of the data in a single table (where the same data item may appear in multiple records) and small dimension tables linked to it by key references. For example, a transactional order processing system may have a table for customer details (account number, name, salutation, date the account was created &c, maybe flags for whether the account is live or inactivated, whether they are in credit management, their credit limit &c); an addresses table linked to customer details by the customer account number (the account number and the address, maybe also a flag for whether it is a live or old address and a notes field for any special delivery conditions, such as needing to call ahead or the location lacking a loading dock or goods lift); and an orders table that contains the details of each order and links to the customer details table via the account number and to the addresses table via an address reference. One customer may have one or more addresses, and each address may have zero or more orders delivered to it, but each order will have one and only one customer and one and only one delivery address.
Whilst good for transactions (creating and looking up single orders), the join cost (processing time) for running reports on this data would be very high, so for the reporting solution these tables would be merged to produce a single record (row) for each order with all of its details in one place. The join cost is paid upfront in the transformation process so it doesn’t need to be paid every time a report is run.
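The denormalisation step above can be sketched with plain Python dictionaries standing in for the three tables; all the table contents and key names here are invented for illustration.

```python
# Hypothetical 3NF source tables, keyed as described above
customers = {"C1": {"name": "Acme Ltd", "credit_limit": 5000}}
addresses = {"A1": {"customer": "C1", "address": "1 High St", "live": True}}
orders = [
    {"order_no": 9001, "customer": "C1", "address": "A1", "total": 120.0},
]

# Denormalise: pay the join cost once, producing one wide row per order
# with the customer and delivery details repeated into each record
flat = [
    {**o,
     "customer_name": customers[o["customer"]]["name"],
     "delivery_address": addresses[o["address"]]["address"]}
    for o in orders
]
```

Each element of `flat` is self-contained, so a report can scan it without touching the customer or address tables again.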
- What are the stages of the ETL process/ What is the ETL acronym short for? Extract, Transform and Load. Extract from the source system. Transform to better suit the intended purpose. Load into the reporting system.
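The three stages can be shown as a minimal sketch: three functions chained together, with invented source rows and a list standing in for the reporting system.

```python
def extract():
    # Extract: hypothetical rows pulled from an operational source system
    return [{"qty": 3, "unit_price": 2.5}, {"qty": 1, "unit_price": 10.0}]

def transform(rows):
    # Transform: add a calculated field to better suit the intended purpose
    return [{**r, "line_total": r["qty"] * r["unit_price"]} for r in rows]

def load(rows, warehouse):
    # Load: write the transformed rows into the reporting system
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```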
- How do the types of data below differ regarding quality, availability and privacy?
- Open/public data? Typically anonymized, and may only deal with summaries rather than the raw data, so there are no privacy concerns unless there’s a problem with the anonymization process. Often published on the web (e.g. https://data.birmingham.gov.uk/), but older datasets may be on physical media in archives; typically free to access with no checks. Quality is highly variable, running the full spectrum from very high to so poor that the data is effectively useless. As this data cannot be sold for profit, the producers often will not expend much effort on documentation, quality checking &c. If the data has to be published by law/regulation then the producer may deliberately obfuscate it to make it hard to use; common strategies (especially where it is published regularly) are to change the column order, add new columns and remove others, change units (e.g. one month reporting a currency in pounds, the next in pence and the one after in euros), or rename columns.
- Proprietary data? Typically this data is only available for purchase and will often include personally identifying information, so there may be privacy issues and the need to show compliance with the GDPR. Common examples of proprietary data are marketing contact lists and survey responses. As the datasets are usually the property of an organization there may be licence restrictions on their use. Availability tends to be good; datasets are generally sold to anyone prepared to pay, unless the data is from a commissioned survey, in which case it may only be available to the commissioning body. Quality is highly variable, especially for marketing contact lists, which are frequently out of date. Of particular note is a list still in circulation that was stolen from Experian by a disgruntled employee in 2004, which contained a lot of ‘dirty’ (incorrect) data and was sold to several aggregators of such lists. A number of customers of these aggregators then sold on lists based on that list, complete with the ‘dirty’ data.
- Operational data? Operational data is any data that an organization collects as a consequence of its day-to-day operations. Common examples are sales/orders data, HR information about its employees, production data and customer feedback. It can also include research into competitors and their market. Generally the data is only available within the organization that produced it, although some may be provided to external bodies such as regulators or used in producing public reports. Operational data may be commercially confidential or include personally identifying information, so there are often privacy concerns and the need to show compliance with the GDPR and other regulations. Quality will typically be high: because errors can affect the running of the organization they will usually become visible quickly, and the potential impact on the business means that senior management will usually be willing to invest in data cleansing and in efforts to keep the error rate in collection low.
- Administrative data? Administrative data is collected by government and similar bodies through their day-to-day activities. Some of the data may be published (possibly after anonymization) and so becomes open data, whilst other data may be subject to restrictions, so availability can be variable, as can privacy concerns. Quality of administrative data tends to be proportionate to the cost of poor quality to the body or team who collect it. Examples of administrative data include the electoral roll, taxation returns, service performance monitoring statistics and financial transaction records.
- Research data? Research data can be the result of an academic research project, market research surveys, clinical trials, product research or any programme of targeted research. Availability will depend on the source; typically academic research data will be freely available to assist with peer review and verification. Elsevier (an academic publisher and data analytics company) provides, through its Mendeley product, a portal for researchers to share raw data with other researchers. Under the Declaration of Helsinki (first adopted in 1964 and revised several times since), data relating to clinical trials must be made publicly available with only the personally identifying information of participants redacted. The reach of the Declaration, and the strictness with which it is enforced, was largely driven by revelations of data being hidden by executives at Chemie-Grünenthal about the teratogenic side effects of Thalidomide. Other research data may only be available to the organization that carried out or commissioned the research. Academic research is typically subject to ethics panels whose remit includes ensuring the privacy of participants; privacy of clinical trial participants is required under the Declaration of Helsinki and is often further enforced by national laws on medical privacy. Market research data privacy is typically only covered by local laws on data protection and voluntary codes of practice. Quality of academic research data tends to depend on the researcher but, particularly in the natural sciences, the need for repeatability of results tends to surface any issues quickly. Clinical trial data quality has to be high, as the penalty for poor quality can be severe, both financially and reputationally. Market and product research data quality can be highly variable depending on the body carrying out the research; street survey data is particularly suspect, as stories abound of researchers sitting in cafes and pubs filling out surveys themselves to get their numbers up.
One also needs to be aware that research is often undertaken to answer a specific question, or to achieve a particular answer, so trying to use that data for another purpose may lead to erroneous results due to unstated assumptions in the research. Political surveying often produces very low quality data, as many research companies will target the demographic groups that will give the answer their customer wants so that the customer will come back again. This has led to observations on social media that “Political surveys are not done to canvass public opinion but to shape it”.
- Enterprise data? Enterprise data is the data shared across a large organization, across its multiple business units and locations. It can include all of the previously discussed types of data. Data will be gathered from operational systems (sales, manufacturing, customer support &c), the ERP (Enterprise Resource Planning) systems (Finance, HR, CRM &c), marketing surveys, published third party surveys and other sources, which can include scraping and analysis of social media feeds, news articles and customer emails. Availability of the data is typically restricted to the organization itself, although in many large organizations that lack a strong corporate culture of openness and sharing, departments may resist sharing ‘their’ data. In the mid-1990s a Swiss-based pharmaceutical company ran into issues as its Eastern Europe and Asia sales team and its Western Europe, Middle East and North Africa sales team refused to share data and treated each other (and the other sales teams) as competitors. It took the intervention of the board and several senior management dismissals to resolve the issue, and this may have been a major factor in the company being acquired by Novo Nordisk a few years later. Since the data typically remains within the organization, privacy concerns are reduced, but (in light of the GDPR and similar regulations) care must be taken to ensure that the processing is compatible with the stated purpose of collection and that the storage is located lawfully. Quality is subject to the constraints previously discussed for each class of data but can be drastically improved by creating ‘gold’ repositories for common data elements, such as customer addresses, with robust processes for keeping them up to date and key mappings to operational systems.
This is a particular issue for local councils and other bodies where there are multiple possible touch points for citizens: whilst the organization may see different departments with their own systems, the customer sees only one organization. Fuzzy matching and machine learning can be very valuable here in identifying possibly linked records in different systems.
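A very simple form of fuzzy matching can be shown with Python’s standard `difflib`; real record-linkage tools are far more sophisticated, and the addresses and the 0.85 threshold here are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical address records for the same citizen from two
# departmental systems, entered slightly differently
housing = "12 Acacia Avenue, Birmingham"
council_tax = "12 Acacia Ave, Birmingham"

# SequenceMatcher.ratio() gives a similarity score between 0 and 1
ratio = SequenceMatcher(None, housing.lower(), council_tax.lower()).ratio()

# Flag pairs above an (illustrative) threshold as possible matches
# for a human to review
possible_match = ratio > 0.85
```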
- Describe Quantitative data Quantitative data is typically numerical and represents, as the name suggests, the quantity of something. It can be used in calculations and often represented visually as a graph or chart. When using quantitative data in calculations care must be taken to ensure that only appropriate calculations are done and that the data is in fact quantitative in that context, not everything that looks like a number is a quantity or indeed a number (e.g. a product code).
- Describe Discrete Quantitative data Discrete quantitative data is data which is constrained to specific values. For example, when buying sugar at the supermarket, the number of bags of sugar is discrete because you can only record whole bags; you don’t buy a fractional bag of sugar.
- Describe Continuous Quantitative data Continuous quantitative data is data which can have any value, typically within a range, limited only by your ability to measure it. Returning to the sugar example, if you are buying loose sugar by the kilogram you can have any fractional amount of a kilo down to the level of precision of your scales, you are not limited to the units of the weight of one bag of sugar.
Another example of the difference might be tailor-made vs off-the-peg clothing. People come in a variety of sizes, so someone could have a, say, chest measurement anywhere in a wide range, including fractions of an inch (clothing in the UK is still mostly measured in inches where a standard measure is used, rather than ‘dress sizing’ or arbitrary ‘T-shirt sizes’), and a tailor-made shirt or blouse can be made to fit that exact measurement. Off-the-peg shirts are typically only produced in set sizes, so someone with a chest measurement between 30 and 32 inches (continuous quantitative data) will buy a 32 inch chest shirt (discrete quantitative data) as there are no shirts between 30 and 32 inches.
- Describe Qualitative data Qualitative data is typically non-numeric and refers to a characteristic or non-quantity aspect of the subject. Examples might include sentiment analysis of text, colour, size (in particular ‘t-shirt sizing’), language, nationality, ethnicity, complexity &c.
- Describe Binomial Qualitative data [I’ve not come across this term before and really struggled to find clear online references; this is my best understanding from what I did find] Binomial qualitative data (also called binary or dichotomous data) is data that can take only one of exactly two possible values. Examples include vital status (alive/dead), an exam result recorded as pass/fail, a coin toss (heads/tails) or a yes/no survey answer. It can be seen as a special case of nominal data with just two categories. Care must be taken when forcing richer data into a binomial form, as information is lost; recording someone simply as ‘parent/not parent’ tells us nothing about whether the child(ren) are biological offspring, adopted, from a past relationship of their partner or some other connection.
- Describe Nominal Qualitative data Nominal data denotes a classification with no inherent rank or order. Examples include colour, flavour, ethnicity, gender, location etc.
- Describe Ordinal Qualitative data Ordinal data denotes a ranking. Examples include T-shirt sizing, rank/position (first, second, third &c), qualification grading (A is better than B is better than C &c), pricing band (Luxury, Standard, Basic and Value), etc. It may be based on quantitative data (e.g. the team that wins the most games is first in the league, the team that wins the next most will be second &c) or not (e.g. subjective judgement of flavor and presentation in a cooking competition).
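The key property of ordinal data, that the labels carry an order the alphabet does not, can be sketched in Python by mapping the labels to their rank before sorting; the pricing bands reuse the example above and the products are invented.

```python
# Hypothetical ordinal scale: the position in this list carries the
# meaning, not the label itself (alphabetical order would be wrong)
BANDS = ["Value", "Basic", "Standard", "Luxury"]
rank = {band: i for i, band in enumerate(BANDS)}

products = [("Hamper", "Luxury"), ("Beans", "Value"), ("Tea", "Standard")]

# Sort by the band's rank, not by the band's name
by_band = sorted(products, key=lambda p: rank[p[1]])
```

Sorting by the label text instead would put “Luxury” before “Value”, which is exactly the mistake the rank mapping avoids.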
Most common issue I’ve found is that some of the terminology is non-standard.