1. DATA MINING 1. 1 INTRODUCTION TO DATA MINING The past two decades has seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the size and number of databases are increasing even faster. The increase in use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data.
Figure 1 from the Red Brick Company illustrates the data explosion. Figure 1: The Growing Base of Data Data storage became easier as the availability of large amounts of computing power at low cost i. e. , the cost of processing power and storage is falling, made data cheap. There was also the introduction of new machine learning methods for knowledge representation based on logic programming etc. in addition to traditional statistical analysis of data.
The new methods tend to be computationally intensive hence a demand for more processing power. Having concentrated so much attention on the accumulation of data the problem was what to do with this valuable resource? It was recognized that information is at the heart of business operations and that decision-makers could make use of the data stored to gain valuable insight into the business. Database Management systems gave access to the data stored but this was only a small part of what could be gained from the data. Traditional on-line transaction processing systems, Oltp, are good at putting data into databases quickly, safely and efficiently but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. This is where Data Mining or Knowledge Discovery in Databases (KDD) has obvious benefits for any enterprise.
... things. Before examining the development of data warehousing and how databases are emerging in business, let's first review what has ... During the past decade, the sharply increasing popularity of the personal computer on business desktops has introduced many new options and ... process is much like a product warehouse. Instead of storing goods, the commodity is information. However, just like product ...
The term Data Mining has been stretched beyond its limits to apply to any form of data analysis. One of the definition of Data Mining, or Knowledge Discovery in Databases is: Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency net works, analyzing changes, and detecting anomalies. – William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Mathe us The analogy with the mining process is described as: Data mining refers to “using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation.
The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful” Clementine User Guide, a data mining toolkit Basically data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer, which is responsible for finding the patterns by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places as the data mining software extracts patterns not previously discernable or so obvious that no one has noticed them before. 1. 2 PROCESSES IN DATA MINING Data mining analysis tends to work from the data up and the best techniques are those developed with an orientation towards large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data, uses a methodology to develop an optimal representation of the structure of the data during which time knowledge is acquired.
... 2.1Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised ... see the relationship between the data if there are zero error in the data then the information you get is skewed and ... items to buy based on the buying patterns in prior transactions. c.Supervised-Identifying a network data packet as dangerous (virus, hacker ...
Once knowledge has been acquired this can be extended to larger sets of data working on the assumption that the larger data set has a structure similar to the sample data. Again this is analogous to a mining operation where large amounts of low-grade materials are sifted through in order to find something of value. Figure 2 summarizes the some of the stages / processes identified in data mining and knowledge discovery by Usama Fayyad & Evangel os Simoudis, two of leading exponents of this area. Figure 2: Stages / processes identified in data mining The phases depicted start with the raw data and finish with the extracted knowledge, which was acquired as a result of the following stages: o SELECTION – selecting or segmenting the data according to some criteria e. g. all those people who own a car, in these way subsets of the data can be determined.
o PREPROCESSING – this is the data cleansing stage where certain information is removed, which is deemed unnecessary and may slow down queries for example unnecessary to note the sex of a patient when studying pregnancy. Also the data is reconfigured to ensure a consistent format as there is a possibility of inconsistent formats because the data is drawn from several sources e. g. sex may recorded as f or m and also as 1 or 0.
o TRANSFORMATION – the data is not merely transferred across but transformed in that overlays may add such as the demographic overlays commonly used in market research. The data is made useable and navigable. o DATA MINING – this stage is concerned with the extraction of patterns from the data. A pattern can be defined as given a set of facts (data) F, a language L, and some measure of certainty C a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c such that S is simpler in some sense than the enumeration of all the facts in Fs. o INTERPRETATION AND EVALUATION – the patterns identified by the system are interpreted into knowledge, which can then be used to support human decision-making e. g.
... Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data. This is then reflected into ... the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the ... associated with cloud computing because the analysis of large data sets in real-time requires a framework like MapReduce to ...
prediction and classification tasks, summarising the contents of a database or explaining observed phenomena. 1. 3 DATA MINING MODELS IBM has identified two types of model or modes of operation, which may be used to unearth information of interest to the user. 1.
Verification Model The verification model takes a hypothesis from the user and tests the validity of it against the data. The emphasis is with the user who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis. In a marketing division for example with a limited budget for a mailing campaign to launch a new product it is important to identify the section of the population most likely to buy the new product. The user formulates a hypothesis to identify potential customers and the characteristics they share.
Historical data about customer purchase and demographic information can then be queried to reveal comparable purchases and the characteristics shared by those purchasers, which in turn can be used to target a mailing campaign. ‘Drilling down’s o that the hypothesis reduces the ‘set’ returned each time until the required limit is reached can refine the whole operation. The problem with this model is the fact that no new information is created in the retrieval process but rather the queries will always return records to verify or negate the hypothesis. The search process here is iterative in that the output is reviewed, a new set of questions or hypothesis formulated to refine the search and the whole process repeated. The user is discovering the facts about the data using a variety of techniques such as queries, multidimensional analysis and visualization to guide the exploration of the data being inspected. 2.
Discovery Model The discovery model differs in its emphasis in that it is the system automatically discovering important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalisations about the data without intervention or guidance from the user. The discovery or data mining tools aim to reveal a large number of facts about the data in as short a time as possible. An example of such a model is a bank database, which is mined to discover the many groups of customers to target for a mailing campaign. The data is searched with no hypothesis in mind other than for the system to group the customers according to the common characteristics found 1.
... prices of homes. Furthermore, the team also analyzed more data sets to include more variables like bedrooms and bathrooms. Six bedrooms ... allow quicker feedback than a postal survey but any potential customers are usually hesitant on giving anything but short answers. ... groups are a good way of getting detailed information about customer preferences and tastes. Test marketing involves picking a small ...
4 DATA MINING FUNCTIONS 1. Classification Data mine tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple and these are known as predicted attributes whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class. Once classes are defined the system should infer rules that govern the classification therefore the system should be able to find the description of each class. The descriptions should only refer to the predicting attributes of the training set so that the positive examples should satisfy the description and none of the negative.
A rule said to be correct if its description covers all the positive examples and none of the negative examples of a class. A rule is generally presented as, if the left hand side (LHS) then the right hand side (RHS), so that in all instances where LHS is true then RHS is also true, are very probable. The categories of rules are: o exact rule – permits no exceptions so each object of LHS must be an element of RHS o strong rule – allows some exceptions, but the exceptions have a given limit o probabilistic rule – relates the conditional probability P (RHS|LHS) to the probability P (RHS) Other types of rules are classification rules where LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS. 2.
Associations Given a collection of items and a set of records, each of which contain some number of items from the given collection, an association function is an operation against this set of records which return affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as “72% of all the records that contain items A, B and C also contain items D and E.” The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule. A typical application, identified by IBM that can be built using an association function is Market Basket Analysis. This is where a retailer run an association operator over the point of sales transaction log, which contains among other information, transaction identifiers and product identifiers.
... will allow that employee to have access to the company network and databases. It is important to have the option available for ... . Excel can be used to handle the data and output, but many of the database products allow for scanning of barcodes to ... to get into that state of mind. Scenario 1 Tracking data about booth components, equipment, shippers, and shipments are extremely important ...
The set of products identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as “20% of the times that a specific brand toaster is sold, customers also buy a set of kitchen gloves and matching cover sets.” 3. Sequential/Temporal patterns Sequential / temporal pattern functions analyse a collection of records over a period of time for example to identify trends. Where the identity of a customer who made a purchase is known an analysis can be made of the collection of related records of the same structure (i. e.
consisting of a number of items drawn from a given collection of items).
The records are related by the identity of the customer who did the repeated purchases. Such a situation is typical of a direct mail application where for example a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyse such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover for example the set of purchases that frequently precedes the purchase of a microwave oven. Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns.
Use of these functions on for example a set of insurance claims can lead to the identification of frequently occurring sequences of medical procedures applied to patients which can help identify good medical practices as well as to potentially detect some medical insurance fraud. 4. Clustering/Segmentation Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity.
... information system standard. Silberschatz (2006) in his book “Database System Concept” the goal of database system is to simplify and facilitate access to data. ... the entire system in relationship to external entities. Data Flow – refersto apath taken by data within a device, network, or ... to-use software program eliminates the need for handwritten set of books. This program looks like only a ...
Objects are often decomposed into an exhaustive and / or mutually exclusive set of clusters. Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised then the system has to discover its own classes i. e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then it has to find descriptions that describe each of these subsets.
There are a number of approaches for forming clusters. One approach is to form rules, which dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition. 1. 5 DATA MINING TECHNIQUES 1. Cluster Analysis In an unsupervised learning environment the system has to discover its own classes and one way in which it does this is to cluster the data in the database as shown in the following diagram.
The first step is to discover subsets of related objects and then find descriptions e. g. , D 1, D 2, D 3 etc. which describe each of these subsets. Figure 3: Discovering clusters and descriptions in a database Clustering and segmentation basically partition the database so that each partition or group is similar according to some criteria or metric. Clustering according to similarity is a concept, which appears in many disciplines.
If a measure of similarity is available there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions i. e. groups or subsets as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according to similarity for example to segment a client / customer base. Clustering according to optimization of set functions is used in data analysis e. g. when setting insurance tariffs the customers can be segmented according to a number of parameters and the optimal tariff segmentation achieved.
Clustering / segmentation in databases are the processes of separating a data set into components that reflect a consistent pattern of behavior. Once the patterns have been established they can then be used to “deconstruct” data into more understandable subsets and also they provide sub-groups of a population for further analysis or action, which is important when dealing with very large databases. For example a database could be used for profile generation for target marketing where previous response to mailing campaigns can be used to generate a profile of people who responded and this can be used to predict response and filter mailing lists to achieve the best response. 2.
Induction A database is a store of information but more important is the information, which can be inferred from it. There are two main inference techniques available i. e. deduction and induction. o Deduction is a technique to infer information that is a logical consequence of the information in the database e. g.
the join operator applied to two relational tables where the first concerns employees and departments and the second departments and managers infers a relation between employee and managers. o Induction has been described earlier as the technique to infer information that is generalised from the database as in the example mentioned above to infer that each employee has a manager. This is higher level information or knowledge in that it is a general statement about objects in the database. The database is searched for patterns or regularities.
Induction has been used in the following ways within data mining. ? Decision Trees Decision trees are simple knowledge representation and they classify examples to a finite number of classes, the nodes are labelled with attribute names, the edges are labelled with possible values for this attribute and the leaves labelled with different classes. Objects are classified by following a path down the tree, by taking the edges, corresponding to the values of the attributes in an object. o Rule Induction A data mine system has to infer a model from the database that is it may define classes such that the database contains one or more attributes that denote the class of a tuple i. e.
the predicted attributes while the remaining attributes are the predicting attributes. Class can then be defined by condition on the attributes. When the classes are defined the system should be able to infer the rules that govern classification; in other words the system should find the description of each class. Production rules have been widely used to represent knowledge in expert systems and they have the advantage of being easily interpreted by human experts because of their modularity i. e. a single rule can be understood in isolation and doesn’t need reference to other rules.
3. Neural networks Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an “expert” in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer “what if” questions.
Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including: o sales forecasting o industrial process control o customer research o data validation o risk management o target marketing etc. Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i. e. the network learns from experience just as people do.
This distinguishes neural networks from traditional computing programs that simply follow instructions in a fixed sequential order. The structure of a neural network looks something like the following: Figure 4: Structure of a neural network The bottom layer represents the input layer, in this case with 5 input labels X 1 through X 5. In the middle is something called the hidden layer, with a variable number of nodes. It is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z 1 and Z 2 representing output values we are trying to determine from the inputs. For example, predict sales (output) based on past sales, price and season (input).
Each node in the hidden layer is fully connected to the inputs which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. The problems of using neural networks are they have been used successfully for classification but suffer somewhat in that the resulting network is viewed as a black box and no explanation of the results is given. This lack of explanation inhibits confidence, acceptance and application of results. He also notes as a problem the fact that neural networks suffered from long learning times which become worse as the volume of data grows.
4. On-line Analytical processing A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client / server architecture gives organisations the opportunity to deploy specialised servers, which are optimised for handling specific data management problems. Until recently, organisations have tried to target relational database management systems (DBMS) for the complete spectrum of database applications.
It is however apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an object-oriented DBMS (OO DBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of on-line analytical processing (OLAP).
OLAP was a term coined by E F Code (1993) and was defined by him as: “The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data” Analysis in that the system can provide analysis functions in an institutive manner and that the functions should supply business logic and statistical analysis relevant to the users application; Shared from the point of view of supporting multiple users concurrently; Multidimensional as a main requirement so that the system supplies a multidimensional conceptual view of the data including support for multiple hierarchies; Information is the data and the derived information required by the user application. Dimensional databases are not without problem as they are not suited to storing all types of data such as lists for example customer addresses and purchase orders etc.
Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantages of a dimensional system are the freedom they offer in that the user is free to explore the data and receive the type of report they want without being restricted to a set format. OLAP servers have the means for storing multidimensional data in a compressed form. This is accomplished by dynamically selecting physical storage arrangements and compression techniques that maximize space utilization. Dense data (i. e.
, data exists for a high percentage of dimension cells) are stored separately from sparse data (i. e. , a significant percentage of cells are empty).
For example, a given sales channel may only sell a few products, so the cells that relate sales channels to products will be mostly empty and therefore sparse. By optimizing space utilization, OLAP servers can minimize physical storage requirements, thus making it possible to analyze exceptionally large amounts of data.
It also makes it possible to load more data into computer memory which helps to significantly improve performance by minimizing physical disk I/O. In conclusion OLAP servers logically organize data in multiple dimensions, which allows users to quickly and easily analyze complex data relationships. The database itself is physically organized in such a way that related data could be rapidly retrieved across multiple dimensions. OLAP servers are very efficient when storing and processing multidimensional data. RDBMS have been developed and optimized to handle OLTP applications.
Relational database designs concentrate on reliability and transaction processing speed, instead of decision support need. The different types of server can therefore benefit a broad range of data management applications. 5. Data Visualization Data visualisation makes it possible for the analyst to gain a deeper, more intuitive understanding of the data and as such can work well along side data mining. Data mining allows the analyst to focus on certain patterns and trends and explore in-depth using visualisation.
On its own data visualisation can be overwhelmed by the volume of data in a database but in conjunction with data mining can help with exploration. 1. 6 DATA MINING PROBLEMS/ISSUES Data mining systems rely on databases to supply the raw data for input and these raises problems in that databases tend be dynamic, incomplete, noisy, and large. Other problems arise as a result of the adequacy and relevance of the information stored. 1. Limited Information A database is often designed for purposes different from data mining and sometimes the properties or attributes that would simplify the learning task are not present nor can they be requested from the real world.
Inconclusive data causes problems because if some attributes essential to knowledge about the application domain are not present in the data it may be impossible to discover significant knowledge about a given domain. 2. Noise and missing values Databases are usually contaminated by errors so it cannot be assumed that the data they contain is entirely correct. Attributes, which rely on subjective or measurement judgements, can give rise to errors such that some examples may even be mis-classified. Error in either the values of attributes or class information are known as noise. Obviously where possible it is desirable to eliminate noise from the classification information as this affects the overall accuracy of the generated rules.
Noisy data in the sense of being imprecise is characteristic of all data collection and typically fit a regular statistical distribution such as Gaussian while wrong values are data entry errors. Statistical methods can treat problems of noisy data, and separate different types of noise. 3. Uncertainty Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.
4. Size, updates, and irrelevant fields Databases tend to be large and dynamic in that their contents are ever changing as information is added modified or removed. The problem with this from the data mining perspective is how to ensure that the rules are up-to-date and consistent with the most current information. Also the learning system has to be time-sensitive as some data values vary over time and the discovery system is affected by the ‘timeliness’ of the data. 1. 7 POTENTIAL APPLICATIONS Data mining has many and varied fields of application some of which are listed below.
1. Retail/Marketing o Identify buying patterns from customers o Find associations among customer demographic characteristics o Predict response to mailing campaigns o Market basket analysis 2. Banking o Detect patterns of fraudulent credit card use o Identify ‘loyal’ customers o Predict customers likely to change their credit card affiliation o Determine credit card spending by customer groups o Find hidden correlation between different financial indicators o Identify stock trading rules from historical market data 3. Insurance and Health Care o Claims analysis – i. e. which medical procedures are claimed together o Predict which customers will buy new policies o Identify behaviour patterns of risky customers o Identify fraudulent behaviour 4.
Transportation o Determine the distribution schedules among outlets o Analyse loading patterns 5. Medicine o Characterise patient behaviour to predict office visits o Identify successful medical therapies for different illnesses 2. DATA WAREHOUSING 2. 1 INTRODUCTION TO DATA WAREHOUSING Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of transaction processing systems. It can be loosely defined as any centralised data repository, which can be queried for business benefit.
Data warehousing is a new powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements it is possible to incorporate additional or expert information. It is, The logical link between what the managers see in their decision support EIS applications and the company’s operational activities John McIntyre of SAS Institute Inc In other words the data warehouse provides data that is already transformed and summarized, therefore making it an appropriate environment for more efficient DSS and EIS applications. 2.
2 CHARACTERISTICS OF A DATA WAREHOUSE According to Bill In mon, author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data-warehousing concept, there are generally four characteristics that describe a data warehouse: o SUBJECT-ORIENTED: Data are organised according to subject instead of application e. g. an insurance company using a data warehouse would organise their data by customer, premium, and claim, instead of by different products (auto, life, etc. ).
o INTEGRATED: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application, gender might be coded as “m” and “f” in another by 0 and 1.
When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention e. g. gender data is transformed to “m” and “f.” o TIME-VARIANT: The data warehouse contains a place for storing data that are five to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
o NON-VOLATILE: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed. 2. 3 DATA WAREHOUSING AND OLTP SYSTEMS A database which is built for on line transaction processing, OLTP, is generally regarded as unsuitable for data warehousing as they have been designed with a different set of needs in mind i. e. maximising transaction capacity and typically having hundreds of tables in order not to lock out users etc. Data warehouses are interested in query processing as opposed to transaction processing.
OLTP systems cannot be repositories of facts and historical data for business analysis. They cannot quickly answer ad hoc queries and rapid retrieval is almost impossible. The data is inconsistent and changing, duplicate entries exist, entries can be missing and there is an absence of historical data, which is necessary to analyze trends. Basically OLTP offers large amounts of raw data, which is not easily understood. The data warehouse offers the potential to retrieve and analyze information quickly and easily. Data warehouses do have similarities with OLTP as shown in the table below.
The data warehouse serves a different purpose from that of OLTP systems by allowing business analysis queries to be answered as opposed to “simple aggregations” such as ‘what is the current account balance for this customer?’ Typical data warehouse queries include such things as ‘which product line sells best in middle-America and how does this correlate to demographic data?’ 2. 4 PROCESSES IN DATA WAREHOUSING The first phase in data warehousing is to “insulate” your current operational information, i. e. to preserve the security and integrity of mission-critical OLTP applications, while giving you access to the broadest possible base of data. The resulting database or data warehouse may consume hundreds of gigabytes – or even terabytes – of disk space, what is required then are efficient techniques for storing and retrieving massive amounts of information.
Increasingly, large organisations have found that only parallel processing systems offer sufficient bandwidth. The data warehouse thus retrieves data from a variety of heterogeneous operational databases. The data is then transformed and delivered to the data warehouse / store based on a selected model (or mapping definition).
The data transformation and movement processes are executed whenever an update to the warehouse data is required so there should some form of automation to manage and execute these functions. The information that describes the model and definition of the source data elements is called “metadata.” The metadata is the means by which the end-user finds and understands the data in the warehouse and is an important part of the warehouse.
The metadata should at the very least contain o the structure of the data; o the algorithm used for summarisation; o and the mapping from the operational environment to the data warehouse. Data cleansing is an important aspect of creating an efficient data warehouse in that it is the removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times. The cleansing stage has to be as dynamic as possible to accommodate all types of queries even those, which may require low-level information. Data should be extracted from production sources at regular intervals and pooled centrally but the cleansing process has to remove duplication and reconcile differences between various styles of data collection. Once the data has been cleaned it is then transferred to the data warehouse, which typically is a large database on a high performance box. SMP, Symmetric Multi-Processing or MPP, Massively Parallel Processing.
Number-crunching power is another important aspect of data warehousing because of the complexity involved in processing ad hoc queries and because of the vast quantities of data that the organization want to use in the warehouse. Another approach to data warehousing is Parsaye’s Sandwich Paradigm put forward by Dr. Kamran Parsaye, CEO of Information Discovery, Hermosa Beach, CA. This paradigm or philosophy encourages acceptance of the probability that the first iteration of a data-warehousing effort will require considerable revision. The Sandwich Paradigm advocates the following approach: o pre-mine the data to determine what formats and data are needed to support a data-mining application; o build a prototype mini-data warehouse i. e.
the meat of the sandwich, with most of the features envisaged for the end product; o revise the strategies as necessary; o build the final warehouse. 2. 5 THE DATA WAREHOUSE MODEL Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse. Once the data is loaded it is accessible via desktop query and analysis tools by the decision-makers. The data warehouse model is illustrated in the following diagram. Figure 5: A data warehouse model The data within the actual warehouse itself has a distinct structure with the emphasis on different levels of summarization as shown in the figure below.
Figure 6: The structure of data inside the data warehouse The current detail data is central in importance as it: o reflects the most recent happenings, which are usually the most interesting; o it is voluminous as it is stored at the lowest level of granularity; o It is always (almost) stored on disk storage, which is fast to access but expensive and complex to manage. Older detail data is stored on some form of mass storage; it is infrequently accessed and stored at a level detail consistent with current detailed data. Lightly summarized data is data distilled from the low level of detail found at the current detailed level and generally is stored on disk storage. When building the data warehouse have to consider what unit of time are summarization done over and also the contents or what attributes the summarized data will contain. Highly summarized data is compact and easily accessible and can even be found outside the warehouse.
Metadata is the final component of the data warehouse and is really of a different dimension in that it is not the same as data drawn from the operational environment but is used as: o a directory to help the DSS analyst locate the contents of the data warehouse, o a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment, o a guide to the algorithms used for summarisation between the current detailed data and the lightly summarised data and the lightly summarised data and the highly summarised data, etc. 2. 6 IMPLEMENTATION PHASES The phases for implementation for a data warehouse implementation are: o Planning o Gathering Data requirements and modelling o Physical database design and development o Data sourcing, integration, and mapping o Populating the data warehouse o Automating the data management process o Creating the reports o Data validation and testing o Implementation Planning: Similar to software development life cycle projects, planning for a data warehousing project, is concerned with: 1. Defining the project scope 2. Creating the project plan 3. Defining the necessary technical resources, both internal and external 4.
Defining the business participants and responsibilities 5. Defining the tasks and deliverables 6. Defining time lines 7. Defining the final project deliverables. Besides these regular project planning activities, for a data warehousing project, the following technical considerations have to be included in the planning process: 1.
Capacity planning 2. Data integration strategies 3. Procedures for end-user access to archived data 4. Data refresh / update strategies 5.
Operations and job scheduling strategies 6. Metadata management strategies 7. Database connectivity 8. Database gateways 9. Front-end data access tools Gathering Data requirements and modelling: In a data warehousing, gathering the data requirements and data modelling plays a pivot role. The inherent problems of a data warehouse make collecting the requirements by the analysts a nightmare due to broad scope of the project, and number of areas and people.
Adding to the confusion is the misunderstanding between decision support systems and operational processing, which causes an analyst to oscillate between them while gathering requirements. Also, it is often difficult to communicate to the end-users of the difference between data requirements for these two systems. Even for a subject-specific data mart, iterative process (about 3-4 times) would be beneficial to collect all the user requirements. Based on the information collected, the data modeling can be defined. The objective of this phase is to provide either a logical data model or a dimensional business model. The Logical data model covers the scope of the development model including relationships, cardinality, attributes, definitions, and candidate keys.
In a dimensional model diagrams the facts, dimensions, hierarchies, relationships, and candidate keys for the scope of the development project. Physical database design and development: This phase covers database design and de normalisation. This phase covers: 1. Designing the database, including fact tables, relationship tables, and description (lookup) tables. 2. De normalizing the data 3.
Identifying keys 4. Creating-indexing strategies 5. Creating appropriate database objects 6. Developing aggregation strategies 7.
Developing partitioning strategies 8. Refine capacity planning estimates. Data Sourcing, Integration and Mapping: This phase is time consuming and encompasses locating the source of the data in the operational systems, doing analysis to understand what types of data integration may be required, writing integration specifications, and mapping the source data to target data warehouse database design. This phase covers 1.
Defining the possible source systems 2. Determining file layouts 3. Reviewing hierarchy changes 4. Reviewing the update cycle 5.
Performing data analysis to determine the best source of data 6. Reviewing Data priorities and requirements 7. Performing data analysis to integrate the data 8. Developing written data conversion specifications for each field and refining the integration strategy 9. Reviewing security requirements 10. Reviewing Capacity planning 11.
Mapping source to target data Populating the Data warehouse: This phase includes 1. Developing programs or using tools to extract and move the data 2. Developing load strategies 3. Developing the procedures to load the data into the warehouse 4. Developing programs or using data conversion tools to integrate data 5. Developing update / refresh strategies 6.
Testing extract, integration, and load programs and procedures 7. Review Capacity planning 1. 3. 6 Data Management Process: This phase is concerned with automating the extraction, integration, and loading the data to the data warehouse. This phase includes 1. Automating and scheduling the data extraction process 2.
Automating and scheduling the data conversion process 3. Automating and scheduling the data load process 4. Creating back up and recovery procedures 5. Conducting a full test of all the automated processes Creating the Reports Creating the Reports: Before developing the reports, a sample set of reports can be developed as soon as the test subset of data is loaded. For this phase knowledge of data access tools to pre build several reports is essential. Structured navigation paths to access predefined reports or data directly must also be developed.
This phase is concerned with 1. Creating the sample set of predefined reports 2. Testing reports 3. Documenting applications 4. Developing navigation paths Data Validation and Testing: This phase includes the standard data validation processes throughout the data extract, integration, and load development phases. The basic checks on record counts need to be performed on the data warehouse.
This phase along with the previous phase (creating the reports) are the catalysts for iterative changes within the data warehouse development life cycle as users work with the front-end access tools to interact with the data. Development will move back to the analysis phase, working back to the database modifications. The new data modifications will be located, extracted, mapped, integrated and loaded into the data warehouse. This phase includes 1. Validating data and the sample set of reports 2. Validating data using the standard processes 3.
Iteratively changing the data Implementation: Before rolling out for implementation, training the end-users on the scope of the data in the data warehouse and front-end access tools, and their usage plays a critical part. This phase includes 1. Deploying the Data warehouse applications 2. Creating end-user support structures 3. Creating procedures for adding new reports and expanding the scope of the data warehouse. 4.
Back-up procedures for the data warehouse applications, and data related issues. 5. Setting up procedures for metadata management 6. Creating change management procedures. And then the application is given to the end-users for acceptance testing and production implementation. 2.
7 CRITERIA FOR A DATA WAREHOUSE The criteria for data warehouse Rdbms are as follows: o Load Performance – Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business. o Load Processing – Many steps must be taken to load new or updated data into the data warehouse including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. These steps must be executed as a single, seamless unit of work. o Data Quality Management – The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite “dirty” sources and massive database size. o Query Performance – Fact-based management and ad-hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS; large, complex queries for key business operations must complete in seconds not days.
o Terabyte Scalability – Data warehouse sizes are growing at astonishing rates. Today these range from a few to hundreds of gigabytes, and terabyte-sized data warehouses are a near-term reality. The RDBMS must not have any architectural limitations. It must support modular and parallel management. It must support continued availability in the event of a point failure, and must provide a fundamentally different mechanism for recovery. It must support near-line mass storage devices such as optical disk and Hierarchical Storage Management devices.
Lastly, query performance must not be dependent on the size of the database, but rather on the complexity of the query. o Mass User Scalability – Access to warehouse data must no longer be limited to the elite few. The RDBMS server must support hundreds, even thousands, of concurrent users while maintaining acceptable query performance. o Networked Data Warehouse – Data warehouses rarely exist in isolation.
Multiple data warehouse systems cooperate in a larger network of data warehouses. The server must include tools that co-ordinate the movement of subsets of data between warehouses. Users must be able to look at and work with multiple warehouses from a single client workstation. Warehouse managers have to manage and administer a network of warehouses from a single physical location.
o Warehouse Administration – The very large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource limits, charge back accounting to allocate costs back to users, and query prioritisation to address the needs of different user classes and activities. o Integrated Dimensional Analysis – The power of multidimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools. o Advanced Query Functionality – End users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarised data.
2. 8 BUSINESS USE OF A DATA WAREHOUSE No discussion of the data warehousing systems is complete without review of the type of activity supported by a data warehouse. Some of the activity against today’s data warehouses is predefined and not much different from traditional analysis activity. Other processes such as multi-dimensional analysis and information visualisation were not available with traditional analysis tools and methods.
Tools to be used against the data warehouse One of the objectives of the data warehouse is to make it as flexible and as open as possible. It is not desirable to set a steep entry price in terms of software and training for using the data warehouse. The data warehouse should be accessible by as many end-user tools and platforms as possible. Yet, it is not possible to make every feature of the data warehouse available from every end user tool.
Low-end tools such as simple query capability built into most spreadsheets may be adequate for a user that only needs to quickly reference the data warehouse. Other users may require the use of the most powerful multi-dimensional analysis tools. Standard reports and queries Many users of the data warehouse need to access a set of standard reports and queries. It is desirable to periodically automatically produce a set of standard reports that are required by many different users. When these users need a particular report, they can just view the report that has already been run by the data warehouse system rather than running it themselves. This facility can be particularly useful for reports that take a long time to run.
In addition to standard reports and queries, sometimes it is useful to share some of the advanced work done by other users. A user may produce advanced analysis that can be parameterized or otherwise adapted by other users in different parts of the same organisation or even in organisations. Queries against summary tables As introduced earlier, the summary views in the data warehouse can be the object of a large majority of analysis in a data warehouse. Simple filtering and summation from the summary views accounts for most of the analysis activity against many data warehouses.
These summary views contain predefined standard business analysis. Data mining in the detail data Even though data mining in the detail data may account for a very small percentage of the data warehouse activity, the most useful data analysis might be done on the detail data. The reports and queries off the summary tables are adequate to answer many “what” questions in the business. The drill down into the detail data provides answers to “why” and “how” questions.
Interface with other data warehouses The data warehouse system is likely to be interfaced with other applications that use it as the source of operational system data. A data warehouse may feed data to other data warehouses or smaller data warehouses called data marts. The operational system interfaces with the data warehouse often become increasingly stable and powerful. As the data warehouse becomes a reliable source of data that has been consistently moved from the operational systems, many downstream applications find that a single interface with the data warehouse is much easier and more functional than multiple interfaces with the operational applications.
The data warehouse can be a better single and consistent source for many kinds of data than the operational systems. It is however, important to remember that the much of the operational state information is not carried over to the data warehouse. Thus, data warehouse cannot be source of all operation system interfaces. 2.
9 EXPANDING THE DATE WAREHOUSE Data warehouses linked to the World Wide Web are often called as “Web-Enabled.” Web-enabled data warehouses may provide information access to internal users such as employees, as well as external users such as customers, business partners, and suppliers. Internal access to data warehouse Access to data warehouse information inside a company is often called Intranet decision support. A typical Intranet data warehouse application is company employees using a web browser to view benefits information, including static documents such as employee manuals, and unique accounts such as 401 K investments. External access to the data warehouse External access to a data warehouse can greatly facilitate business-to-business and business-to-customer interactions. For examples, companies can make product information, design specification, white papers, and competitive data, order status information, and trouble reports directly available to partners and prospects. Not only do extranets provide greater customer services, but also at the same time customer support costs can be sharply reduced.
The advantages of web enabling of data warehouses are: 1. Use of less expensive thin client hardware 2. Use of low cost standard web browsers 3. Lower training costs 4. Lower communication costs 5.
Reduced application software licensing 6. Reduced responsibility for compatibility 7. Diminished need to extend corporate networks 8. Simplified system administration 9. The advantages of web-enabling of data warehouses are: 10. Use of less expensive thin client hardware 11.
Use of low cost standard web browsers 12. Lower training costs 13. Lower communication costs 14. Reduced application software licensing 15. Reduced responsibility for compatibility 16. Diminished need to extend corporate networks 17.
Simplified system administration 3. SUMMARY This paper introduced the fundamental concepts of data mining & data warehousing. It is important to note that data warehousing is a science that continues to evolve. Many of the design and development concepts introduced here greatly influence the quality of the analysis that is possible with data in the data warehouse. If invalid or corrupt data is allowed to get into the data warehouse, the analysis done with this data is likely to be invalid.
After the rapid acceptance of data warehousing systems during past few years, there will continue to be many more enhancements and adjustments to the data warehousing system model. Further evolution of the hardware and software technology will also continue to greatly influence the capabilities that are built into data warehouses. Data warehousing systems have become a key component of information technology architecture. A flexible enterprise data warehouse strategy can yield significant benefits for a long period. 4. REFERENCE 1 web > 2 “An Introduction to Data Warehousing” white paper by Vivek R.
Gupta, Senior Consultant, System Services corporation, Chicago, Illinois, August 1997, web > 3 “Data Warehousing- An Approach Document” by GoldStone Technologies 4 “The Data Warehousing Lifecycle” by Dwight Seeley.