DATA INTEGRATION
————————————————-
Data integration involves combining data residing in different sources and providing users with a unified view of these data. The process is significant in a variety of situations, both commercial (for example, when two similar companies need to merge their databases) and scientific (for example, combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as the volume of data and the need to share existing data grow. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. In management circles, data integration is frequently referred to as "Enterprise Information Integration" (EII).
————————————————-
History
Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from the source databases, transforms it and then loads it into the data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.
Issues with combining heterogeneous data sources under a single query interface have existed for some time. The rapid adoption of databases after the 1960s naturally led to the need to share or to merge existing repositories. This merging can take place at several levels in the database architecture.
One popular solution is data warehousing (see figure 1).
The warehouse system extracts, transforms, and loads data from heterogeneous sources into a single, common, queriable schema so that data from different sources become compatible with one another. This tightly coupled architecture means the data are already physically reconciled in a single repository at query time, so queries usually resolve quickly. The drawback is data freshness: the information in the warehouse is not always up to date. When an original data source is updated, the warehouse still holds the older data until the ETL process is re-executed to synchronize it. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services such as travel or classified-advertisement web applications.
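A minimal sketch of the extract-transform-load pattern described above, assuming two hypothetical source record sets and an in-memory SQLite database standing in for the warehouse (all table and field names are illustrative, not taken from any real system):

import sqlite3

# Hypothetical rows already extracted from two heterogeneous sources.
crm_rows = [{"customer": "Acme", "revenue_usd": "1200.50"}]
erp_rows = [{"client_name": "Acme", "sales_eur": 900.0}]

def transform(rows, name_key, amount_key, rate_to_usd=1.0):
    """Normalize both sources into one common schema: (name, revenue_usd)."""
    return [(r[name_key], float(r[amount_key]) * rate_to_usd) for r in rows]

# Load into a single queriable warehouse schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE revenue (name TEXT, revenue_usd REAL)")
warehouse.executemany(
    "INSERT INTO revenue VALUES (?, ?)",
    transform(crm_rows, "customer", "revenue_usd")
    + transform(erp_rows, "client_name", "sales_eur", rate_to_usd=1.1),
)
print(warehouse.execute(
    "SELECT name, SUM(revenue_usd) FROM revenue GROUP BY name").fetchall())

Because the warehouse holds a physical copy of the data, a script like this must be re-run whenever a source changes, which is exactly the freshness problem noted above.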
As of 2009 the trend in data integration has favored loosening the coupling between data sources and providing a unified query interface over a mediated schema (see figure 2), which allows information to be retrieved in real time directly from the original databases. This approach relies on mappings between the mediated schema and the schemas of the original sources, and it transforms a query over the mediated schema into specialized queries that match the schemas of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the "Global As View" (GAV) approach), or as a mapping from entities in the original sources to entities in the mediated schema (the "Local As View" (LAV) approach).
The latter approach requires more sophisticated inference to resolve a query on the mediated schema, but it makes it easier to add new data sources to a (stable) mediated schema.
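As a rough illustration of the difference, the sketch below writes the two kinds of mapping down as plain Python data; it is not a real data-integration engine, and every relation and source name in it is hypothetical:

# GAV: each mediated-schema relation is defined directly as a query
# over the concrete sources, so the mediator knows exactly how to answer it.
gav_mapping = {
    "City(name, population)": "SELECT city, pop FROM census.cities",
}

# LAV: each source is described as a view over the mediated schema; the
# system must infer, at query time, how to combine these descriptions.
lav_mapping = {
    "census.cities(city, pop)": "City(name=city, population=pop)",
    "weather.readings(city, temp)": "Weather(city=city, temperature=temp)",
}

# Adding a new source under LAV only requires one new description, leaving
# the mediated schema untouched; under GAV every affected mediated relation
# would have to be redefined to mention the new source.
lav_mapping["crime.stats(city, incidents)"] = "Crime(city=city, count=incidents)"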
As of 2010 some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the integration architecture, but how to resolve semantic conflicts between heterogeneous data sources. For example, if two companies merge their databases, certain concepts and definitions in their respective schemas, such as "earnings", inevitably have different meanings. In one database it may mean profit in dollars (a floating-point number), while in the other it might represent the number of sales (an integer).
A common strategy for resolving such problems involves the use of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts. This approach is known as ontology-based data integration. In other cases, such as combining research results from different bioinformatics repositories, similarities computed from the different data sources must be benchmarked on a single criterion, such as positive predictive value. This makes the data sources directly comparable, so they can be integrated even when the natures of the underlying experiments are distinct.
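A minimal sketch of the ontology-based idea, reusing the "earnings" example above; the ontology concept names and conversion rules here are invented purely for illustration:

# Hypothetical mapping from each source's local term to a shared ontology
# concept, plus a converter that normalizes the value's type/units.
ontology_map = {
    "db_a.earnings": ("ontology:ProfitUSD", lambda v: float(v)),  # profit in dollars
    "db_b.earnings": ("ontology:SalesCount", lambda v: int(v)),   # number of sales
}

def resolve(source_field, value):
    """Translate a source field into its ontology concept and a normalized value."""
    concept, convert = ontology_map[source_field]
    return concept, convert(value)

print(resolve("db_a.earnings", "1520.75"))  # ('ontology:ProfitUSD', 1520.75)
print(resolve("db_b.earnings", 42))         # ('ontology:SalesCount', 42)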
As of 2011 it was observed that current data modeling methods impart data isolation into every data architecture, in the form of islands of disparate data and information silos, each of which represents a disparate system. This isolation is an unintended artifact of the data modeling methodology: disparate data models, when instantiated as databases, form disparate databases. Enhanced data modeling methods have been developed to eliminate this artifact and to promote the development of integrated data models. One such method recasts data models by augmenting them with structural metadata in the form of standardized data entities. Once multiple data models have been recast, they share one or more commonality relationships, which are peer-to-peer entity relationships that relate the standardized data entities now common to those models; any models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and properly populated from a common set of master data, those databases are integrated.
————————————————-
Example
Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, demographics, etc.).
Traditionally, the information must be stored in a single database with a single schema. But any single enterprise would find information of this breadth difficult and expensive to collect. Even if the resources to gather the data existed, doing so would largely duplicate data already held in existing crime databases, weather websites, and census records.
A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in “virtual data integration”. This means application-developers construct a virtual schema — the mediated schema — to best model the kinds of answers their users want. Next, they design “wrappers” or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see figure 2).
When an application-user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user’s query.
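A minimal wrapper/mediator sketch of the flow just described, with hard-coded stand-ins for the crime database and the weather website (all field names and values are made up):

def crime_wrapper(city):
    # Pretend this queries the crime database; reshape its local result
    # into the terms of the mediated schema.
    raw = {"location": city, "incidents_per_1000": 12.3}
    return {"city": raw["location"], "crime_rate": raw["incidents_per_1000"]}

def weather_wrapper(city):
    # Pretend this calls the weather website's API and reshapes the result.
    raw = {"place": city, "temp_f": 71}
    return {"city": raw["place"], "temperature_f": raw["temp_f"]}

def query_mediated_schema(city):
    """Fan the user's query out to each wrapper and merge the answers."""
    answer = {"city": city}
    for wrapper in (crime_wrapper, weather_wrapper):
        answer.update(wrapper(city))
    return answer

print(query_mediated_schema("Springfield"))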
This solution offers the convenience of adding new sources simply by constructing an adapter or an application software blade for them. It contrasts with ETL systems or a single-database solution, which require manual integration of each entire new dataset into the system. Virtual ETL solutions leverage the virtual mediated schema to implement data harmonization, whereby data is copied from the designated "master" source to the defined targets, field by field. Advanced data virtualization also builds on object-oriented modeling to construct a virtual mediated schema or virtual metadata repository, using a hub-and-spoke architecture.
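The field-by-field harmonization mentioned above could look roughly like the following sketch, in which the master record and the per-target field maps are hypothetical:

# Values flow from the designated "master" record into each target record
# according to an explicit field map, one field at a time.
master = {"cust_id": 17, "cust_name": "Acme Corp", "country": "US"}

field_maps = {
    "billing_db": {"cust_name": "account_name", "country": "billing_country"},
    "shipping_db": {"cust_name": "recipient", "country": "ship_country"},
}

def harmonize(master_record, field_map):
    return {target_field: master_record[source_field]
            for source_field, target_field in field_map.items()}

targets = {name: harmonize(master, fmap) for name, fmap in field_maps.items()}
print(targets)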
Each data source is disparate and as such is not designed to support reliable joins between data sources. Therefore, data virtualization as well as data federation depends upon accidental data commonality to support combining data and information from disparate data sets. Because of this lack of data value commonality across data sources, the return set may be inaccurate, incomplete, and impossible to validate.
One solution is to recast disparate databases to integrate these databases without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases. The recast databases provide designed data access paths with data value commonality across databases.
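A small illustration of the recast idea, assuming both data sets have been restructured around a shared, standardized geo_location entity (the schema and numbers are invented); the join path here is designed rather than accidental, and referential integrity can be enforced on it:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE geo_location (geo_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
    CREATE TABLE crime_stats  (geo_id INTEGER REFERENCES geo_location, incidents INTEGER);
    CREATE TABLE census_data  (geo_id INTEGER REFERENCES geo_location, population INTEGER);
    INSERT INTO geo_location VALUES (1, 'Springfield', 'IL');
    INSERT INTO crime_stats  VALUES (1, 240);
    INSERT INTO census_data  VALUES (1, 116000);
""")
rows = db.execute("""
    SELECT g.city, c.incidents * 1000.0 / d.population AS incidents_per_1000
    FROM geo_location g
    JOIN crime_stats c USING (geo_id)
    JOIN census_data d USING (geo_id)
""").fetchall()
print(rows)  # [('Springfield', 2.068...)]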
————————————————-
Data Integration in the Life Sciences
Large-scale questions in science, such as global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and environmental data because metadata standards are not agreed upon and these fields produce many different data types. National Science Foundation initiatives such as DataNet are intended to make data integration easier for scientists by providing cyberinfrastructure and setting standards. The two funded DataNet initiatives are DataONE and the Data Conservancy.
Data analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names in different business, science, and social science domains.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA).
EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.
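A brief sketch of the contrast on synthetic data, assuming NumPy and SciPy are available: the EDA step summarizes the data without a fixed hypothesis, while the CDA step tests a specific one.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)  # synthetic measurements
group_b = rng.normal(loc=10.6, scale=2.0, size=200)

# EDA: look for features of the data without a predetermined hypothesis.
print("means:", group_a.mean(), group_b.mean())
print("quartiles of A:", np.percentile(group_a, [25, 50, 75]))

# CDA: confirm or falsify a specific hypothesis (here: the means are equal).
result = stats.ttest_ind(group_a, group_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")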
Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.
Types of data
Data can be of several types, as illustrated in the sketch below:
* Quantitative data – the data is a number
* Often this is a continuous decimal number to a specified number of significant digits
* Sometimes it is a whole counting number
* Categorical data – the data is one of several categories
* Qualitative data – the data is a pass/fail result or the presence or absence of a characteristic
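A small pandas sketch showing one illustrative column of each type (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({
    "reaction_time_s": [0.42, 0.37, 0.51],        # quantitative, continuous
    "error_count": [0, 2, 1],                     # quantitative, whole counting number
    "condition": pd.Categorical(["A", "B", "A"]), # categorical
    "passed": [True, False, True],                # qualitative pass/fail
})
print(df.dtypes)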
Free software for data analysis
* ROOT – C++ data analysis framework developed at CERN
* PAW – FORTRAN/C data analysis framework developed at CERN
* JHepWork – Java (multi-platform) data analysis framework developed at ANL
* KNIME – the Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
* Data Applied – an online data mining and data visualization solution.
* R – a programming language and software environment for statistical computing and graphics.
* DevInfo – a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
* Zeptoscope Basic[16] – Interactive Java-based plotter developed at Nanomix.
* Business Data Analytics Software – free desktop edition for organizations.
* GeNIe – discovery of causal relationships from data, learning and inference with Bayesian networks, industrial quality software developed at the Decision Systems Laboratory, University of Pittsburgh.
* ANTz – C-based real-time 3D data visualization with hierarchical object trees that combine multiple topologies across millions of nodes.
Nuclear and particle physics
In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. It is then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad-hoc. The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory.
The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4, in order to predict the response of the detector to a given theoretical event, thus producing simulated events which are then compared to experimental data.
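As a toy illustration of that comparison step, the sketch below bins synthetic "measured" and "simulated" values into histograms and computes a simple chi-square-style figure of merit; it uses plain NumPy rather than ROOT, PAW, or Geant4, and all numbers are invented:

import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the reduced experimental data and the simulated detector response.
measured = rng.normal(loc=91.2, scale=2.5, size=5000)
simulated = rng.normal(loc=91.0, scale=2.6, size=5000)

bins = np.linspace(80, 100, 41)
data_hist, _ = np.histogram(measured, bins=bins)
sim_hist, _ = np.histogram(simulated, bins=bins)

# Bin-by-bin comparison, skipping empty bins.
mask = (data_hist + sim_hist) > 0
chi2 = np.sum((data_hist[mask] - sim_hist[mask]) ** 2 / (data_hist[mask] + sim_hist[mask]))
print("chi2 per bin:", chi2 / mask.sum())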
DATA MINING:
Introduction
The Microsoft SQL Server 2005 Data Mining Platform introduces significant capabilities to address data mining in both traditional and new ways. In traditional terms, data mining can predict future results based on input, or attempt to find relationships among data or cluster data in previously unrecognized yet similar groups.
Microsoft data mining tools are different from traditional data mining applications in significant ways. First, they support the entire development lifecycle of data in the organization, which Microsoft refers to as Integrate, Analyze, and Report. This ability frees the data mining results from the hands of a select few analysts and opens those results up to the entire organization. Second, SQL Server 2005 Data Mining is a platform for developing intelligent applications, not a stand-alone application. You can build custom applications that are intelligent, because the data mining models are easily accessible to the outside world. Further, the model is extensible so that third parties can add custom algorithms to support particular mining needs. Finally, Microsoft data mining algorithms can be run in real time, allowing for the real-time validation of data against a set of mined data.
DATA INTEGRATION
The integration phase covers the capturing of data from disparate sources, the transformation of that data, and its loading into one or many destinations. Traditional data mining tools play almost no role in the integration phase, as it is this phase that captures data and prepares it to be mined. While this may sound a bit like a chicken-and-egg problem, the Microsoft approach to this phase is rather straightforward: capture the data, consolidate it, mine it, and then apply the results of the mining to the current and all future data. Furthermore, the data mining algorithms help companies spot outliers that already exist in the data, or outliers that may be introduced during a traditional extraction, transformation, and loading (ETL) process.
Data mining tools are integrated with SQL Server Integration Services. This means that during the data movement and transformation stage, data can be analyzed and modified based on the predictive output of the data mining models. For example, documents or text fields can be analyzed on the fly and placed in appropriate buckets based on keywords within the documents.
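The keyword-bucketing idea can be sketched in plain Python as follows; this is not SSIS itself, and the bucket names and keyword lists are hypothetical:

# During the transform step, route each text field into a bucket
# based on the keywords it contains.
BUCKETS = {
    "billing": {"invoice", "payment", "refund"},
    "support": {"error", "crash", "bug"},
}

def bucket_document(text):
    words = set(text.lower().split())
    for bucket, keywords in BUCKETS.items():
        if words & keywords:
            return bucket
    return "uncategorized"

incoming = ["Refund requested for invoice 1042", "App crash on startup"]
print([(doc, bucket_document(doc)) for doc in incoming])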
Data Analysis
Typical data mining tools generate results after a data warehouse is built and these results are analyzed independently of the analysis done on the data warehouse. Forecasts are generated or relationships are identified, but the result of the data mining models is generally independent of the data used in the data warehouse.
Microsoft tools are integrated with the entire process. Just as data mining is available in SQL Server Integration Services, the benefits of data mining are visible in Analysis Services and SQL Server as well. Whether a company chooses to use relational or OLAP data, mining benefits can be evident during the analysis phase. Thanks to the Unified Dimensional Model (UDM), analysis can be performed against either relational or OLAP data in a transparent manner, and data mining provides a boost to this analysis.
When analyzing certain data elements, such as how products are related or how to group customers based on buying or Web site surfing patterns, various data mining models can determine how to cluster those customers or products into groups that make sense for analysis. When you feed these groups back into the analytic process, the data mining engine allows analysts and users to slice and drill based on these clusters.
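As a generic illustration of that clustering step (using scikit-learn rather than the Microsoft tooling, with invented customer features), the sketch below groups customers and produces cluster labels that could be fed back into the analysis as a new dimension to slice and drill on:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical customer features: [orders per month, average basket size].
customers = np.vstack([
    rng.normal([2, 20], [0.5, 5], size=(50, 2)),    # occasional small buyers
    rng.normal([10, 80], [2.0, 15], size=(50, 2)),  # frequent large buyers
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = model.labels_

for cluster_id in np.unique(labels):
    members = customers[labels == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} customers, "
          f"mean orders/month = {members[:, 0].mean():.1f}")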
Reporting
Once the modeling is complete and an accurate model has been created, the emphasis on data mining changes from analysis to results, and more importantly putting these results to work by getting them into the hands of the right people at the right time. Thanks to the integration between data mining and reporting in SQL Server 2005, providing predictive results to anyone in the organization can be done in a simple, flexible, and scalable manner.
Report consumers could even see intelligent reports displaying the top ten reasons customers buy or don't buy a product, and target their efforts appropriately. Microsoft allows the intelligence and power of data mining to be easily exposed through reporting, delivering meaningful data to users in an easy-to-digest format.
Conclusion
The Microsoft approach to data mining is revolutionary. Rather than creating a stand-alone tool to generate groups or predict future results, Microsoft has created a platform that spans the entire process of dealing with data, something they call Integrate, Analyze, and Report.
This means that the output of a data mining model can immediately be applied back to the data gathering, transformation, and analysis processes. Anomalous data can be detected in existing data sets, and new data entry can be validated in real time, based on the existing data. This can free developers from having to create complicated decision trees in application code in an attempt to validate complex input of multiple data values.
Microsoft has also built a secure platform in which the mining model and its output are stored in a central location. No longer are models stored on a variety of separate machines where they are harder to control. Additionally, having a centralized model ensures that the same model is used by all analysts and users.
DATA REPORTING:
Microsoft SQL Server 2008 Reporting Services (SSRS) provides a full range of ready-to-use tools and services to help you create, deploy, and manage reports for your organization, as well as programming features that enable you to extend and customize your reporting functionality.
With Reporting Services, you can create interactive, tabular, graphical, or free-form reports from relational, multidimensional, or XML-based data sources. You can publish reports, schedule report processing, or access reports on-demand. Reporting Services also enables you to create ad hoc reports based on predefined models, and to interactively explore data within the model. You can select from a variety of viewing formats, export reports to other applications, and subscribe to published reports. The reports that you create can be viewed over a Web-based connection or as part of a Microsoft Windows application or SharePoint site. Reporting Services provides the key to your business data.
BUSINESS INTELLIGENCE REPORTING
Business intelligence reporting describes the practice of analyzing large amounts of data in the form of human-readable reports. For most users, the term refers to the software tools that make this form of reporting possible. In this latter sense, a business intelligence reporting tool harvests information, organizes it into a logical order, summarizes it, and presents the chosen information in a viewable and understandable format. Business intelligence reporting uses archived or previously stored data, usually in a data warehouse.
Every organization uses business intelligence reporting in some fashion, which makes it one of the first applications deployed in a business intelligence implementation. With business intelligence reporting, a company gains easy access to key information, can quickly manipulate that information into any desired format, and can ultimately deliver it to any executives, employees, or consumers who need it.
Open Source Software vs. Commercial Software
There are two classes of software when it comes to business intelligence reporting: open source software and commercial software. Open source software (OSS) is sometimes referred to as "free" software because its source code is publicly available. By definition, open source software is distributed in its source (uncompiled) form; users are free to use it as is, modify it, and redistribute it, provided they comply with the terms of its license (some licenses require that modifications be published back to the community). In theory, OSS makes the product more accessible to the masses and more understandable to all users. Commercial software, on the other hand, is exactly what it sounds like: proprietary software designed to be marketed and sold to the public.
EXAMPLE OF BUSINESS INTELLIGENCE REPORTING OSS
One such open source program is Agata Report, a cross-platform database business intelligence reporting tool that includes the ability to create graphs, along with a query tool that lets the user pull information from various applications and reformat it into various usable formats. Another open source business intelligence reporting program is the BIRT (Business Intelligence and Reporting Tools) Project, which provides business intelligence reporting capabilities for both rich-client (desktop) and web applications.
The JasperReports business intelligence reporting tool is another open source program; it can deliver the retrieved and formatted information directly to the screen or a printer, or generate it as PDF, HTML, and a variety of other file formats. This type of business intelligence reporting tool can be used in either Java or Internet-based applications.
EXAMPLE OF BUSINESS INTELLIGENCE REPORTING COMMERCIAL SOFTWARE
An example of commercial business intelligence reporting software is Crystal Reports, a tool that can design and create final reports from a large number of data sources. When Crystal Reports is installed on an operating system, users can highlight selected data and then organize the information into the desired format. The final report can be viewed on screen, printed, or exported to a variety of file types (PDF, Excel, etc.).
Selecting Business Intelligence Reporting Software
The range of available alternatives for business intelligence reporting software is broad and easily identified. The key aspect of selecting business intelligence reporting software is to ensure that you do not over-buy: the software should provide the feature set you need without including features that are overly expensive and that you will not use.