Data Warehousing Concepts
Venkat Jandhyala
2010
1
1.1 INTRODUCTION
Data Warehousing Concepts
Based on the way the data is used, databases can be classified in two ways: the one that is used for transactions i.e. Online Transaction Processing (OLTP) and the one that is used for analysis Online Analytical Processing (OLAP).
As the businesses these days contain huge amounts of data and the users are connected to these databases across the globe and round the clock the necessity for maintaining a separate database for the sake of analysis is very much clear. 1.2 OLTP DATABASES OLTP Databases are what we generally refer as ‘Databases’. These are the databases that contain day-to-day transactions. Typically, an OLTP database has hundreds of users connected to it and performing transactions round the clock. Most of the time, these transactions insert data in to the database. Examples are ATM Machines, Online Shopping, Online Application Filing, Online Railway Reservations … The ratio of the number of records being inserted are more than the number of records being updated or deleted. Hence these databases are optimized for insertions. These databases are normalized to reduce the redundancy of the data and increase performance while inserting the data. 3 rd Normalization Form is most commonly found in all types of businesses. Figure 1.1 shows the architecture of OLTP Databases.
Figure 1.1
Local Network or Internet OLTP Database (Normalized)
The Essay on Database Information Databases Data
What is a Database? A database is a software program arranged to collect, hold and process information. There are many software packages that help you handle information. However, what makes databases different is that once you enter the information into it, the database will operate the information in ways that allow you to analyze the data. It is designed in such a way as to make it easy to ...
Users OLTP Database Architecture
Normalization is a refinement process for Online Transaction Processing (OLTP) data models. OLTP systems support the day-to-day operations of the financial institution. This is where trades
Venkat Jandhyala
are booked, executed and settled, where new product and customers are entered into the computer systems. The focus is on transaction management, entering changing and deleting records online in consistent manor. OLTP systems are not designed for analysis, reporting and decision support. Dimensional modeling, a completely different approach should be used to design Decision Support Systems (DSS).
1.3 OLAP SYSTEMS / DECISION SUPPORT SYSTEMS An OLTP (relational) database and an OLAP (multi-dimensional) database both contain information about your business. An OLTP database can be used for many different purposes. It is generally optimized so that you can quickly insert and update records. An OLAP database is generally used to analyze data. It is optimized so that you can quickly retrieve data. An OLAP database is generally created from the information you have put in an OLTP database. OLAP systems are often referred to as Decision Support Systems. Decision Support Systems or DSS, (sometimes also called Business Intelligence or BI) is about synthesizing useful knowledge from large data sets. It’s about integration, summarization and abstraction as well as ratios, trends and allocations. It’s about comparing data-based generalizations with model-based assumptions and reconciling them when they’re different. It’s about good, data-facilitated creative thinking and the monitoring of those creative ideas that were implemented. It’s about using all types of data wisely and understanding how derived data was calculated. It’s about continuously learning, and modifying goals and working assumptions based on data-driven models and experience. In short, business intelligence should function like a virtuous cycle of decision making improvement. OLAP systems store data in multidimensional databases. You then access these databases to perform financial and statistical analyses on different combinations of the data. Vendors offer a variety of OLAP products that you can group into three categories: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP (HOLAP).
The Term Paper on Object oriented Database Management Systems
The construction of Object-Oriented Database Management Systems started in the middle 80's, at a prototype building level, and at the beginning of the 90's the first commercial systems appeared. The interest for the development of such systems stems from the need to cover the modeling deficiencies of their predecessors, that is the relational database management systems. They were intended to be ...
Relational OLAP (ROLAP) ROLAP products (e.g., Informix’s Meta Cube ROLAP Option for the Informix Dynamic Server, Micro Strategy’s DSS Agent) adapt traditional relational databases to support OLAP. Summaries and aggregated data are stored in the database itself. The ROLAP approach begins with the premise that data does not need to be stored multi-dimensionally to be viewed multidimensionally. A scalable, parallel, relational database provides the storage and high-speed access to this underlying data. A middle analysis tier provides a multidimensional conceptual view of the data and an extended analytical functionality which are not available in the underlying relational server. ROLAP depends on a specialized schema design and its technology is limited by its nonintegrated, disparate tier architecture. The problem is that the data is physically separated from analytical processing. The two important features of ROLAP are: Data warehouse and relational database are inseparable Any change in the dimensional structure requires a physical re-organization of the database, which is too time consuming. Certain applications are too fluid for this and the on-the-fly dimensional view of a ROLAP tool is the only appropriate choice. Multidimensional OLAP (MOLAP) The traditional ER Model tends to be too complex and difficult to navigate, as the most important data warehouse requirement is to have fewer queries accessing large amounts of records. MOLAP servers support multidimensional views of data through array-based data warehouse servers. They map multidimensional views of a data cube to array structures. The advantage structures of using a data cube is that it allows fast indexing to pre-compute summarized data. As with a
Venkat Jandhyala
multidimensional data store storage utilization is low, and MOLAP is recommended in such cases. MOLAP vs ROLAP The following arguments can be given in favour of MOLAP: Relational tables are unnatural for multidimensional data Multidimensional arrays provide efficiency in storage and operations There is a mismatch between multidimensional operations and SQL For ROLAP to achieve efficiency, it has to perform outside current relational systems, which is the same as what MOLAP does. The following arguments can be given in favour of ROLAP: ROLAP integrates naturally with existing technology and standards MOLAP does not support ad hoc queries effectively because it is optimized for multidimensional operations Since data has to be downloaded into the MOLAP systems, updating is difficult The efficiency of ROLAP can be achieved by using such techniques as encoding and compression ROLAP can readily take advantage of parallel relational technology The claim that MOLAP performs better than ROLAP is intuitively believable. HOLAP HOLAP products (e.g., Microsoft SQL Server OLAP Services) combine MOLAP and ROLAP. With HOLAP products, a relational database stores most of the data.
The Business plan on Data Warehouse Information Database Analysis
1. DATA MINING 1. 1 INTRODUCTION TO DATA MINING The past two decades has seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months and the size and number of databases are increasing even faster. The ...
A separate multidimensional database stores the densest data, which is typically a small proportion of the data. 1.4 DATA WAREHOUSES Data Warehouse in its most simplest and generic definition is a simple database with huge amounts of data in it. However, a data warehouse is a multi-dimensional database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. Data Warehousing is the process of making your operational data available to your business managers and decision support applications. Data warehousing doesn’t just make data available; proper warehousing focuses on efficient information access. Of course, this efficiency doesn’t happen magically. First you have to understand the business user needs from the data and the decision support applications, and then you must evaluate your current operational data and determine how to transform that data into what the business user requests. The tools that you choose for your warehousing solution will take data from your operational systems (extract it), convert your operational data into business information using your defined business rules (transform it), and create a data warehouse (load it).
The Business plan on Data Warehousing Warehouse Business Information
Data Warehouses MGT 327 April 13 th, 2004 In the past decade, we have witnessed a computer revolution that was unimaginable. Ten to fifteen years ago, this world never would have imagined what computers would have done for business. Furthermore, the Internet and the ability to conduct electronic commerce have changed the way we are as consumers. One of the upcoming concepts of the computer ...
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented Integrated
Venkat Jandhyala
Nonvolatile Time Variant
Subject Oriented Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter, sales in this case makes the data warehouse subject oriented. Integrated Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated. Nonvolatile Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred and whatever once happened never changes. Time Variant In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
A data warehouse’s focus on change over time is what is meant by the term time variant. Data warehousing technology comprises a set of new concepts and tools which support the knowledge worker (executive, manager, and analyst) with information material for decision making. The fundamental reason for building a data warehouse is to improve the quality of information in the organization. The key issue is the provision of access to a company-wide view of data whenever it resides. Data coming from internal and external sources, existing in a variety of forms from traditional structural data to unstructured data like text files or multimedia is cleaned and integrated into a single repository. A data warehouse (DWH) is the consistent store of this data which is made available to end users in a way they can understand and use in a business context. Figure 1.2 shows the Data Warehouse Architecture As the figure 1.2 depicts, Data Warehouses get their data from multiple OLTP sources. These sources need not be maintained using same database management systems. They may be having different structures as well. To understand this better let us consider an example of a garment manufacturing company having its retail outlets across the globe.
The Essay on Building a Data Warehouse
Starbucks is a company that is specialized in offering a range of products including coffee, handcrafted beverages, merchandise, and fresh food. As an enterprise, they require a proper data management to enable them serve their customers efficiently. Data on sales, customer views, customer information, market analytics, products, and production needs a proper storage and retrieval system hence the ...
It may also be having different web based applications and/or portals through customers can place their orders online. In a scenario like this, the data need not be maintained in same fashion at all these places. Number of orders placed online may be more as compared to any given outlet since outlets are geographically restricted. Hence, it is wise to go with software like Oracle to maintain the information about online orders. Based on the average number of transactions that occur per outlet per day some outlets may be maintaining their data in SQL Server, some in Access, some in Universal Database DB2 and so on… When business analysts of such a company need a sales report of the whole company they need to integrate all the data from these various sources. Executing such an ad hoc query on all these databases transactional databases obviously does not yield in a faster response. So there arises a need to maintain a separate database that can be used only for querying purposes. This database needs to be periodically updated from the transactional
Venkat Jandhyala
databases. It is this query-purpose database that is referred to as a Data Warehouse. Data from the transactional databases is not directly brought in to the data warehouses. Instead they initially pass through such process as Data Cleansing.
Figure 1.2
SQL Server
Multi-dimensional Cubes
Oracle
Meta Data
Staging Area DB2 UDB
Data Warehouse
Business Intelligence
Access Data Marts
Flat Files
OLTP Sources
End Users (Analysts)
Data Warehouse Architecture
1.5 DATA WAREHOUSE LIFE CYCLE Data Cleansing Data Cleansing is the process of cleansing or validating the data brought from multiple sources. The sources data may be invalid for reasons more than one. The data might also have become invalid because of improper manual feeding done at the OLTP level. Organizational policies change with the time and hence their business logic. Data from the OLTP source becomes invalid if it no longer meets the new business logic and policies.
The Research paper on Data Warehousing Warehouse Operational Information
... 1. Introduction 2. What is a data warehouse 3. Past, Present and Future 4. Data Warehouses and Business Organisations 5. Conclusion 6.Bibliography 1. 0 Introduction ... was to design a SQL Server Database. The next step was to decide what sort of query tool was to be used. ... new analysis are available by product, customer, premium level or region. Drilling-down into the detail helps identify good and bad ...
Venkat Jandhyala
Extracting, Transforming and Loading (ETL) Extracting, Transforming and Loading (ETL) is the process of reading (extracting) data from heterogeneous sources and transform them so that the discrete data from different sources gets integrated and then loading in to the target Data Warehouse. During this process, data from multiple tables may be merged in to one and/or data from single table may be routed in to multiple tables and/or can be sorted, grouped, filtered and so on… These operations are done at a special dedicated area wherein all the data from all the sources are first dumped off, well known as Staging Area. Software like Informatica Power Center and Oracle Warehouse Builder are used for these operations.
Figure 1.3
From OLTP Sources
1 3 2
Ranking Splitting Aggregation
To Warehouse
Filtering
Data Cleansing
Calculations Merging
Transformations
An overview of ETL
Meta Data Metadata is to the data warehouse what the card catalogue is to the traditional library. It serves to identify the contents and location of data in the warehouse. Metadata is a bridge between the data warehouse and the decision support application. It answers questions as “What does this field mean in business terms?”, “Which business process does this set of queries support?”, “When did the job to update the customer data in our data mart last run?” A metadata repository should contain A description of the structure of the data warehouse. This includes the schema, view, dimensions, hierarchies and derived data definitions, data marts location and contents, etc… Operational metadata such as data linkages, currency of data and monitoring information (warehouse usage statistics, error reports and its trails) The summarization processes which include dimension definition, data on granularity, partitions, summary measure, aggregation, summarization, etc… Details of data sources which include source databases and their contents, gateway descriptions, data extractions, clearing, transformation rules and defaults.
Venkat Jandhyala
Data related to system performance, which include indices and profiles that improve data access and retrieval performances. Business metadata, which includes business terms and definitions, data ownership information and changing policies.
Data Marts Data Marts contain the summarized data of the ware houses and are referred as High Performance Query Structures. They consist of Materialized Views and Special Indexes. In some businesses these data marts may be maintained within the ware houses whereas, in some other scenarios they may be maintained apart from the data warehouses. Multi-dimensional Cubes A cube is a structure that stores your business data in a multi-dimensional format that makes it easy to analyze. Designed to be departmental, and optimized for performance, a multidimensional OLAP cube consists of aggregated, summarized, and pre-calculated data. Usually each cube contains data that focuses on a specific aspect of the business, such as sales data, financial data, or data for tracking inventory. Each cube is usually designed to address a specific business question. When you create a report, you connect to a cube, and use the data from that cube in your report.
Figure 1.4
Sales Data Warehouse
Accounts
Finance
Data Marts Figure 1.5
Products Regions Time
Multi-dimensional Cubes
Business Intelligence
Venkat Jandhyala
Business Intelligence comprises of console based and/or window based and/or web based applications that we use for querying our data warehouses. These applications provide security for the data being accessed, and are more user friendly for the non-technical personnel to operate with. These also allow the end users to do such operations as drilling, knowledge discovery and many others. 1.6 DATA MINING/KNOWLEDGE DISCOVERY IN DATABASES Data mining (DM) or knowledge discovery in databases (KDD), as it is also known, is the nontrivial extraction of implicit, previously unknown and potentially useful information from the data. Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. With the widespread use of databases and the explosive growth in their sizes, organizations are faced with the problem of information overload. The problem of effectively utilizing these massive volumes of data is becoming a major problem for all enterprises. Traditionally, we have been using data for querying a reliable database repository via some well circumscribed application or canned report-generating utility. While this mode of interaction is satisfactory for a large class of applications, there exist many other applications which demand exploratory data analyses. Data mining techniques support automatic exploration of data. Data mining attempts to source out patterns and trends in the data and infers rules from these patterns. With these rules users will be able to support, review and examine decisions in some related business or scientific area. This opens up the possibility of a new way of interacting with databases and data warehouses. Consider for example, a banking application where the manager wants to know whether there is a specific pattern followed by defaulters. It is hard to formulate a SQL query for such information. It is generally accepted that if we know the query precisely, we can turn to query language to formulate the query. But if we have some vague idea and we do not know the precisely query, then we can resort to data mining techniques.
Venkat Jandhyala
Figure 1.6
Databases
Flat Files
Cleansing and Integration
Data Warehouse
Selection & Transformation Data Mining
Pattern Recognition
Data Mining Architecture Knowledge
Drill up & Drill Down Drilling is the term used to navigate through the warehouse data through a given dimension. We can say we drill the sales data by region. In that case, viewing region wise sales report and then sub region wise sales report for a given region and then moving forward to country wise sales and then to state wise sales data is what is referred to as Drilling Down the data. Drilling up is navigating back. Slicing & Dicing Slicing & dicing the data means analyzing the same data in different fashions and groups. Let us consider a sales report, which measures my sales by the amount of items sold region wise-time wise-product category wise. Changing the order and way we view the data within these given dimensions is what is known as slicing & dicing.
Venkat Jandhyala
Figure 1.7
Region #1 Sub Region #1 Country #1 State #1 City #1 : : : Region #2 Sub Region #2
Drilling up the Data
Drilling Down the Data
Drill Operations
Figure 1.8 Slicing & Dicing Region #1 … … … … Region #2 … … … Region #1 … Region #2 Region #3 Region #4 Quarter #1 Region #1 … Region #2 … Region #3 … Product #4 Product #2 … Product #1 … … … … … … … … Quarter … … … …
Product #1 Product #2 Product #3 Product #4
Venkat Jandhyala