paper no: Custom3
last update: 20/05/08
DATA WAREHOUSING AND MINING
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. E xtracting data from legacy systems and other resources; cleaning, scrubbing and preparing data for decision support; maintaining data in appropriate data stores; accessing and analyzing data using a variety of end user tools; and mining data for significant relationships are the main process involved in Data Warehousing. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems.
The Data Warehousing has the following qualities:
The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together.
The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
Data in the database is never over-written or deleted - once committed, the data is static, read-only, but retained for future reporting.
The database contains most of the organization's operational application and that this data is made consistent.
Data mining consists of five major elements:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data by application software.
Present the data in a useful format, such as a graph or table
NEED FOR DATA WAREHOUSING:
- The processing load of reporting reduced the response time of the operational systems.
- The database designs of operational systems were not optimized for information analysis and reporting.
- Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system
- Development of reports in operational systems often required writing specific computer programs which was slow and expensive
As a result separate computer database began to be build that were specifically designed to support mathematical information and analysis purpose.
TYPES OF DATA WAREHOUSE:
Offline Data Warehouse:
Data Warehouse in this stage of evolution are updated on a regular time cycle(monthly or weakly) from the operating system and the data is stored in a integrated reporting oriented data structure.
Real time Data Warehouse:
Data Warehouse at this stage are updated on a transaction are event basis every time an O.S performs the transaction.(e.g. A delivery or a booking)
Integrated Data Warehouse:
Data Warehouse at this stage are used to generate activity or transaction that are passed back into the O.S for use in the daily activity of an organization.
RELATIONSHIP BETWEEN DATA WAREHOUSING AND MINING:
A data warehouse assembles the data from the heterogeneous database. It does not attempt to extract information from the data into the warehouse. Data warehouse formats and organize the data and support management functions.
Data Mining in the other hand attempts to extract useful information from the database.
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures. Extraction converts the data into a format for transformation processing .
An intrinsic part of the extraction is the parsing of extracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data is rejected entirely.
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data to be loaded to the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformations types to meet the business and technical needs of the end target may be required:
- Selecting only certain columns to load (or selecting null columns not to load)
- Translating coded values ( e.g. , if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this is called automated data cleansing; no manual cleansing occurs during ETL
- Encoding free-form values ( e.g. , mapping "Male" and "1" and "Mr" into M)
- Deriving a new calculated value ( e.g. , sale_amount = qty * unit_price)
- Joining together data from multiple sources ( e.g. , lookup, merge, etc. )
- Summarizing multiple rows of data ( e.g. , total sales for each store, and for each region)
- Generating surrogate key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns ( e.g. , putting a comma-separated list specified as a string in one column as individual values in different columns)
- Applying any form of simple or complex data validation; if failed, a full, partial or no rejection of the data, and thus no, partial or all the data is handed over to the next step, depending on the rule design and exception handling. Most of the above transformations itself might result in an exception, e.g. when a code-translation parses an unknown code in the extracted data.
The load phase loads the data into the end target, usually being the data warehouse (DW). Depending on the requirements of the organization, this process ranges widely. Some data warehouses might weekly overwrite existing information with cumulative, updated data, while other DW (or even other parts of the same DW) might add new data in a historized form, e.g. hourly. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs.
As the load phase interacts with a database, the constraints defined in the database schema as well as in triggers activated upon data load apply (e.g. uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.
It is a database of data gathered from operational data and other sources that is designed to serve a particular group/department of an organization.
Data Mart is typically a Sub-set of Data Warehouse.
Data Warehouse is a database of data gathered from operational data and other sources that is designed to serve the whole organization.
For Further more download pdf...