Exercise 1
- What is the purpose of data integration?
Data integration reduces the complexity of data by unifying multiple datasets to create a single view of the data.
- How does data integration:
- improve the speed of analysing data?
Integration can reduce the total volume of data to be processed by excluding data not needed for the analysis (dropping unneeded fields and filtering out irrelevant records). It may also include the pre-calculation of common summations so they do not need to be recalculated every time, and denormalising the data to reduce the cost of joins by taking that hit once in the ETL (Extract, Transform, Load) process rather than every time a query is run. Since the data for analysis may be spread across a number of line-of-business applications and ERP (Enterprise Resource Planning) platforms, integrating the required data into a single source (e.g. a data warehouse) avoids the network cost of database links or API calls. The integrated dataset can also be heavily indexed to support the queries being run, in ways that would unacceptably slow the transactional line-of-business applications.
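As a minimal sketch of the kind of pre-filtering, denormalisation and pre-aggregation an ETL step might do, the following uses Python with pandas. The file names and column names (orders.csv, products.csv, region, value &c) are illustrative assumptions, not taken from any real system.

```python
import pandas as pd

# Hypothetical source extracts; in practice these would come from the
# line-of-business and ERP databases during the ETL run.
orders = pd.read_csv("orders.csv")        # order_id, product_id, region, order_date, quantity, value
products = pd.read_csv("products.csv")    # product_id, product_name, category

# Transform: keep only the fields and rows the analysis needs.
orders = orders[["product_id", "region", "order_date", "quantity", "value"]]
orders = orders[orders["order_date"] >= "2023-01-01"]

# Denormalise: pay the join cost once here rather than in every query.
denormalised = orders.merge(products, on="product_id", how="left")

# Pre-calculate common summations so queries can read a small summary table.
sales_summary = (
    denormalised
    .groupby(["region", "category"], as_index=False)
    .agg(total_value=("value", "sum"), total_quantity=("quantity", "sum"))
)

# Load: write the analysis-ready tables to the warehouse staging area.
denormalised.to_parquet("warehouse/denormalised_sales.parquet")
sales_summary.to_parquet("warehouse/sales_summary.parquet")
```

The point of the design is that the filtering, join and summation costs are paid once per ETL run rather than once per query.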
- improve timelier business decisions?
Analytic queries tend to be compute heavy and place a heavy load on the system, which would have a negative impact on the transactional source systems; they also tend to touch most if not all of the data in a system. Traditionally these would be run overnight or over the weekend, so a decision maker wishing to investigate an anomaly in a report or run a number of what-if analyses might have to wait until the following day or even week to get their answer, on data which could by then be out of date. By integrating the data into a single dataset, the decision maker can drill down to focus on the causes of anomalies or run their what-ifs immediately, on data that may be from close of business the previous day. Immediate analysis of close-to-current data allows them to make confident decisions quickly.
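As a simple illustration, a drill-down against the integrated data can be a quick in-memory operation rather than another overnight batch run. This sketch reuses the hypothetical sales_summary table from the sketch above; the path and column names remain assumptions.

```python
import pandas as pd

# Load the pre-built summary from the warehouse (hypothetical path/columns).
sales_summary = pd.read_parquet("warehouse/sales_summary.parquet")

# Spot an anomaly at the regional level...
by_region = sales_summary.groupby("region", as_index=False)["total_value"].sum()
print(by_region.sort_values("total_value"))

# ...then drill down into the worst-performing region by category immediately,
# instead of waiting for the next batch run.
worst_region = by_region.loc[by_region["total_value"].idxmin(), "region"]
detail = sales_summary[sales_summary["region"] == worst_region]
print(detail.sort_values("total_value"))
```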
- Describe the rules and policies that must be in place to ensure information security in the workplace, i.e. the CIA or AIC triad model
CIA/AIC stands for Confidentiality, Integrity and Availability. To maintain confidentiality, users must be authenticated (logging in with appropriate credentials; typically a username plus one or more of ‘something you know’ (e.g. a password or passphrase), ‘something you have’ (e.g. a physical passkey or one-time code generator, or access to a trusted system that supports Single Sign On) or ‘something you are’ (e.g. a fingerprint, retina scan, iris pattern, blood antigens &c)). Permissions to access data must also be graded so users can only access the data they need to do their job; for example, someone who only needs to see reports about their own department should only have permissions to see the data about their department. Some enterprise-level database management systems, such as Oracle, implement row-level access control to provide this, so in that example the data relating to all departments could be in the same table but the rows a person does not have permission to see are hidden from them at the database level. All users can then be given the same report but only see the data they have permissions for. It is good practice to assign permissions to a role and then grant the role to the user, rather than granting permissions to a user directly. Roles should be named for and associated with job roles so everyone doing the same job has the same permissions, with any extra permissions granted via another role named for its purpose; if someone changes job, their old role(s) can be removed and new roles granted. This reduces the risk of someone being allocated incorrect permissions or keeping permissions from an old job that they should no longer have.
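A minimal sketch of the role-based, row-level idea described above follows. The role names, departments and data layout are illustrative assumptions, not any product's actual access-control API.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    username: str
    roles: set[str] = field(default_factory=set)

# Each role maps to the departments whose rows it may see (assumed values).
ROLE_DEPARTMENTS = {
    "sales_reporting": {"Sales"},
    "hr_reporting": {"HR"},
    "board_reporting": {"Sales", "HR", "Finance"},
}

def visible_rows(user: User, rows: list[dict]) -> list[dict]:
    """Return only the rows the user's roles entitle them to see."""
    allowed = set()
    for role in user.roles:
        allowed |= ROLE_DEPARTMENTS.get(role, set())
    return [row for row in rows if row["department"] in allowed]

if __name__ == "__main__":
    report_rows = [
        {"department": "Sales", "headcount": 40},
        {"department": "HR", "headcount": 8},
        {"department": "Finance", "headcount": 12},
    ]
    alice = User("alice", roles={"sales_reporting"})
    # Alice runs the same report as everyone else but only sees the Sales row.
    print(visible_rows(alice, report_rows))
```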
Integrity relates to keeping data accurate and up to date. As far as possible, data should be stored only once and then referenced via key lookups; this is typically the case for a single database supporting a single application, but can be difficult where multiple applications, each with their own database, all need to store the same information (e.g. multiple systems that all need to store the names and addresses of citizens). A reconciliation process, possibly involving a Robotic Process Automation (RPA) system, may be needed to ensure that changes to one system are reflected in the other systems storing the same data and that all systems stay in step. This may require human intervention for exceptions, such as where two systems show different updates to the same data on the same day or where a data item is considered especially sensitive. Where a data item is subject to business rules or format restrictions, these should be encoded into validation processes within the application stack; ideally this should be as close to the data as possible (e.g. via database triggers) so that if multiple applications access the same data (or batch updates are made via a third-party tool) the rules are still applied. If a data item has an authoritative source (e.g. in the UK, addresses can be looked up in the Postal Address File, PAF) then, if possible, that source should be used to look up the data. Additionally, update permissions should only be granted to those users who are required and authorized to carry out updates, to minimize the risk of human error, typos &c, and auditing and transaction logging processes may be put in place to ensure that when a change is made it is evident who made the change, when, and what the change was (for unstructured data this might be a version control system which stores the delta (changes) between versions and who made the change that created that delta).
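A minimal sketch of keeping validation close to the data and recording an audit trail is shown below. The field name, the simplified postcode format rule and the audit record layout are assumptions for illustration; in a real system the same checks might live in database triggers or constraints and the log in a protected table.

```python
import re
from datetime import datetime, timezone

# Illustrative format rule: a simplified UK-style postcode pattern.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

audit_log: list[dict] = []   # stand-in for a protected audit table or log file

def update_postcode(record: dict, new_postcode: str, changed_by: str) -> None:
    """Validate the new value, apply it, and record who changed what and when."""
    new_postcode = new_postcode.strip().upper()
    if not POSTCODE_RE.match(new_postcode):
        raise ValueError(f"Invalid postcode format: {new_postcode!r}")
    audit_log.append({
        "record_id": record["id"],
        "field": "postcode",
        "old_value": record.get("postcode"),
        "new_value": new_postcode,
        "changed_by": changed_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    record["postcode"] = new_postcode

if __name__ == "__main__":
    citizen = {"id": 101, "postcode": "LS1 4AP"}
    update_postcode(citizen, "ls2 9jt", changed_by="jsmith")
    print(citizen, audit_log)
```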
Availability relates to ensuring that the data is available when required by the business. This includes a range of factors, such as ensuring regular backups are taken and can be restored in a timely fashion in the event of system failure, while the backup process does not interfere with access to the data; more recently this has focused on the use of hot backups and hot or cold failover spares, but in the past it meant scheduling backups for specific times of day when access wasn’t required. For critical business systems, hot failover spares may be used (one active copy and one passive copy of the application data, with updates from the active being applied to the passive copy, often by transaction log shipping, so that if the active system fails the passive copy can be spun up and take over the processing; in some cases the log shipping may be paused each day long enough for the passive copy to be cold backed up, giving an extra layer of protection) or synchronized clustering (two live versions of the application, each with their own copy of the database, run and are accessed at the same time, with a backend process keeping the databases in step; if one fails, users are moved off it onto the other). Similarly, redundant network connections, power supplies, storage &c may be called for, with procedures to ensure that failures are resolved in a timely manner. Another aspect is ensuring that the hardware on which applications run is sufficiently powerful to run the application under peak load at an acceptable speed for users to carry out their job; similarly, all layers of the stack and the links between them must have the capacity to provide a responsive service under peak load, as users will avoid a slow application. In the past this was mostly handled by sizing infrastructure for the peak load of each application; with the development of cloud technologies, ‘elastic provisioning’ has become common, where the base specification of a virtual server is scaled for the normal load but additional resource is reserved that can be added dynamically for times of peak load. In particular, a data warehouse may be allocated extra resources during the ETL process, which tends to be very compute and I/O heavy due to calculating summations and denormalising data, and which is then dialled back when it only needs to service queries against the denormalised data and summations.
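A very simplified sketch of the active/passive failover idea follows. The health-check URL, thresholds and promotion step are purely illustrative assumptions; real high-availability tooling is considerably more involved.

```python
import time
import urllib.request

ACTIVE_URL = "http://active-db.example.internal/health"    # assumed health endpoint
MAX_FAILURES = 3        # consecutive failed checks before failing over
CHECK_INTERVAL = 10     # seconds between checks

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def promote_passive_copy() -> None:
    # Placeholder: in reality this would stop log shipping, bring the passive
    # copy online and repoint the application connection string or DNS entry.
    print("Active copy unreachable: promoting passive copy to active.")

def monitor() -> None:
    failures = 0
    while True:
        if is_healthy(ACTIVE_URL):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                promote_passive_copy()
                break
        time.sleep(CHECK_INTERVAL)
```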
- A benefit of integrating or centralising data can be that it makes the data more accessible. Explain and discuss
When data is centralized it is all available in one place to analysts and users, rather than them having to go around multiple sources and people to negotiate access; if data is not centralized, analysts may not even be aware of all the datasets that might be relevant to their query. Centralization also provides the opportunity to create a data custodian/librarian function who can advise and steer analysts and users towards datasets that are relevant to their query, and maintain master data management (MDM) to ensure that metadata and documentation of data sources are accurate and up to date, helping analysts and users to be confident that the datasets are correct for their needs. Integration of data, coupled with MDM, allows the creation of merged datasets so that related data likely to be needed for the same reports is available in one dataset, avoiding the need to locate and merge the source datasets each time. Obviously this requires knowing which merged sets are likely to be needed; a data custodian/librarian could monitor which datasets are requested and use this to plan which integrations need to be produced in the future.
- Why do we integrate data from multiple sources? What is the benefit? Provide examples of different data sources that we could integrate?
Often related data, or data about the same real-world objects, people, places &c, is stored in different systems, possibly by different organisations. We integrate the data to support analysis of and reporting on the relationships between them, in particular where there may be a causal relationship. Examples might be data on advertising spend in a region from the marketing system being integrated with sales data for that region from the sales system, to determine how that advertising has impacted sales; or data a local council holds in its housing system on antisocial behaviour being integrated with police data on crimes, standard demographic data, OFSTED report scores and schools' data on referrals of pupils, to identify predictors of future antisocial behaviour and crimes.
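A minimal sketch of the advertising-versus-sales example, assuming two hypothetical extracts keyed on region and month (the file names and columns are illustrative):

```python
import pandas as pd

# Hypothetical extracts from the marketing and sales systems.
ad_spend = pd.read_csv("marketing_spend.csv")   # region, month, spend
sales = pd.read_csv("sales.csv")                # region, month, revenue

# Integrate the two sources on their shared keys.
combined = ad_spend.merge(sales, on=["region", "month"], how="inner")

# A first, crude look at the relationship between spend and revenue by region.
print(combined.groupby("region")[["spend", "revenue"]].corr())
```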
- Data is important to a business because it can improve decision making and productivity. Discuss and explain 2 additional reasons that data is important to a business.
Businesses can use data for a number of different purposes other than reporting for decision and productivity support. Key ones are transaction processing and in-time monitoring.
Most transactions a business performs will involve data in some way. If a customer places an order, past data about that customer may be retrieved, such as their address, their credit limit &c. The stock levels of the items they order will also need to be checked to ensure that there is stock to fulfil the order, and that check may trigger a replenishment request if the stock drops below a minimum figure. The marketing system may also prompt them with complementary products that people who bought the items they are ordering have often bought as well, or, if the product they want is out of stock, with a suitable substitute.
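A minimal sketch of the stock-check and replenishment logic described above (the data structure, threshold and reorder step are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class StockItem:
    product_id: str
    on_hand: int
    reorder_level: int   # minimum figure that triggers replenishment

def place_order(item: StockItem, quantity: int) -> str:
    """Check stock, decrement it, and raise a replenishment request if needed."""
    if quantity > item.on_hand:
        return f"Rejected: only {item.on_hand} of {item.product_id} in stock."
    item.on_hand -= quantity
    if item.on_hand < item.reorder_level:
        # Placeholder for a call into the purchasing/replenishment system.
        print(f"Replenishment request raised for {item.product_id}")
    return f"Accepted: {quantity} x {item.product_id} ({item.on_hand} left)."

if __name__ == "__main__":
    widget = StockItem(product_id="WIDGET-01", on_hand=12, reorder_level=10)
    print(place_order(widget, 5))   # accepted, and drops stock below the reorder level
```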
In-time monitoring is about the now and the very recent past (typically seconds to days ago). In IT this might be monitoring the physical status of servers, checking temperature and power consumption to identify any devices that are running hot, or monitoring alert/error logs and escalating important messages, or messages that have reached a threshold over a time period, to a human operator for investigation. The goal is to proactively identify future faults so preventative maintenance can be carried out (a server that is running hot is more likely to fail in the near future; a disk that is throwing more than a certain number of errors in an hour may need to be replaced). In a call centre this might be the number of agents currently logged in, on calls, in wrap up, or available, the number of callers in the queue, the average time in queue for the past hour and the past 4 hours, the average call duration, the longest current call and similar metrics. These will be used by the floor supervisors to manage load, redistribute staff from back office to call handling or vice versa, check on an agent whose call is well over the expected duration to see if they are having problems, and for similar minute-to-minute controls. Transport and logistics companies are increasingly using GIS tracking of vehicles to support real-time management of services. Examples include bus companies tracking buses to provide arrival-time data to schedule boards at bus stops, companies such as Uber and Ola routing requests for journeys to the nearest driver, and Google monitoring the locations of Google Maps users in real time to determine whether they are travelling, by what means and at what speed, so as to provide congestion information to other users and direct them around bottlenecks.
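A minimal sketch of the threshold-based escalation described for server monitoring (the metric names, thresholds and alert mechanism are illustrative assumptions):

```python
# Illustrative thresholds; real values would come from the monitoring policy.
MAX_TEMPERATURE_C = 75
MAX_DISK_ERRORS_PER_HOUR = 5

def check_server(name: str, temperature_c: float, disk_errors_last_hour: int) -> list[str]:
    """Return any alerts that should be escalated to a human operator."""
    alerts = []
    if temperature_c > MAX_TEMPERATURE_C:
        alerts.append(f"{name}: running hot at {temperature_c}°C")
    if disk_errors_last_hour > MAX_DISK_ERRORS_PER_HOUR:
        alerts.append(f"{name}: {disk_errors_last_hour} disk errors in the last hour")
    return alerts

if __name__ == "__main__":
    readings = [
        ("srv-01", 68.0, 1),
        ("srv-02", 81.5, 0),   # running hot: candidate for preventative maintenance
        ("srv-03", 70.0, 9),   # failing disk: escalate before it dies
    ]
    for server, temp, errors in readings:
        for alert in check_server(server, temp, errors):
            print("ESCALATE:", alert)
```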
- One of the challenges of a data integration project can be combining data from incompatible sources. However, another challenge is getting the analysis of the requirement correct. This can be a complex task of identifying and documenting both the functional and non-functional requirements. The functional requirements gathering process focuses on product functionality, capability, features, and usability whereas the non-functional requirements gathering process encompasses all other factors not connected to a product’s functionality such as performance, security, or technical specifications.
- Provide 3 questions that should be answered when gathering functional requirements
What are you looking to achieve?
What is the business process that the product will be used in, and where are the touch points?
What does your current solution not do that you need it to do?
- Provide 3 questions that should be answered when gathering non-functional requirements
How quickly do you need the solution to respond to transactions/produce the report? What’s your optimal target and what’s your worst acceptable case? (The business may want immediate response but not be prepared to pay for it)
What, if any, legal or regulatory restrictions does the solution need to comply with?
In the event of a catastrophic system failure what downtime can you tolerate? What is your optimal target and what’s your worst acceptable case? (Again, the business may want zero downtime and no transaction loss, but they may not be prepared to pay for it).
Suggested websites