Rupali Gill, Jaiteg Singh
Keywords: data inconsistency, identification of errors, organization growth, ETL, data quality
Published: December 2014
Publisher: The author(s) 2014. This article is published with open access at www.chitkara.edu.in/publications
In today’s scenario, extraction–transformation–loading (ETL) tools have become important pieces of software responsible for integrating heterogeneous information from several sources. Carrying out the ETL process is potentially complex, hard and time consuming. Organisations nowadays are concerned about vast quantities of data. Data quality is concerned with technical issues in the data warehouse environment, and research in the last few decades has laid increasing stress on data quality issues in the data warehouse ETL process. Data quality can be ensured by cleaning the data prior to loading it into the warehouse. Since the data is collected from various sources, it arrives in various formats; standardising these formats and cleaning the data are therefore prerequisites of a clean data warehouse environment. Data quality attributes such as accuracy, correctness, consistency and timeliness are required for a knowledge discovery process. The purpose of the present state-of-the-art research work is to address data quality issues at all the stages of data warehousing, namely 1) data sources, 2) data integration, 3) data staging, and 4) data warehouse modelling and schematic design, and to formulate a descriptive classification of the causes of these issues. The discovered knowledge is used to repair the data deficiencies. This work proposes a framework for quality of extraction, transformation and loading of data into a warehouse.
Business today forces enterprises to run different but coexisting information systems. Correct information is the most imperative resource for business success in many enterprises. Decision support and business intelligence systems are used to mine the data in order to obtain knowledge that supports the decision-making process affecting the future of a given organization. A data warehouse is a class of relational database that is designed for analytical processing rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. The challenge in data warehouse environments is to integrate, rearrange and consolidate large volumes of data from many systems in order to provide a unified information base for business intelligence.
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. In ETL, data is extracted from different data sources and then propagated to the data staging area (DSA), where it is transformed and cleansed before being loaded into the data warehouse. Source, staging area, and target environments may involve many different data structure formats, such as flat files, XML data sets, relational tables, non-relational sources, web log sources, legacy systems, and spreadsheets. Several quality issues arise when an ETL environment is operated in production. Cleansing data of errors in structure and content is important for data warehousing and integration. Current solutions for data cleaning involve many iterations of data “auditing” to find errors and long-running transformations to fix them. Various approaches are used to clean data in manufacturing industries, schools, colleges, universities, organizations and elsewhere. Users need to endure long waits and often must write complex transformation scripts.
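The extract–stage–transform–load flow described above can be sketched in a few lines. This is a minimal illustrative example, not the framework proposed in this paper: the CSV source, field names, and cleansing rules (trimming whitespace, rejecting rows with missing amounts, casting types) are all assumptions chosen for demonstration.

```python
# Minimal ETL sketch: extract rows from a CSV-like source, cleanse them
# in a staging list, then load only the valid rows into the target store.
import csv
import io

# Hypothetical raw source with typical quality problems:
# stray whitespace and a missing value.
SOURCE = """id,name,amount
1, Alice ,100
2,Bob,
3,Carol,250
"""

def extract(raw):
    """Extraction: read records from the raw source."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transformation/cleansing in the staging area: trim whitespace,
    reject rows with a missing amount, and cast fields to proper types."""
    staged = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record: reject rather than load
        staged.append({
            "id": int(row["id"]),
            "name": row["name"].strip(),
            "amount": float(row["amount"]),
        })
    return staged

def load(staged, warehouse):
    """Loading: append the cleansed rows to the target table."""
    warehouse.extend(staged)

warehouse = []
load(transform(extract(SOURCE)), warehouse)
print(warehouse)
# → [{'id': 1, 'name': 'Alice', 'amount': 100.0},
#    {'id': 3, 'name': 'Carol', 'amount': 250.0}]
```

In a real pipeline each stage would read from and write to external systems (files, queues, database tables), but the separation of concerns is the same: quality rules live in the transformation step, between source and warehouse.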
ISSN: 2321-3906 (Print), 2321-7146 (Online)
Data quality has become a major concern for most organizations that maintain data warehouses. Every organization needs quality data to improve the services it renders to its customers. In view of this, a thorough review of approaches and papers in this regard is presented and their limitations are stated. This is intended to guide future development and research directions in the area of data cleansing. The papers reviewed in this report examine critical aspects of data quality and the various types of data that can be cleansed.
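Two cleansing operations recur throughout the literature reviewed here: standardising heterogeneous value formats and removing duplicate records. The sketch below illustrates both on dates; the record layout, the set of recognised date formats, and the first-occurrence-wins deduplication rule are assumptions for this example, not prescriptions from any reviewed paper.

```python
# Sketch of two common cleansing steps: format standardisation
# (normalising dates to ISO 8601) and duplicate removal (by id).
from datetime import datetime

# Hypothetical records collected from sources using different date formats.
RECORDS = [
    {"id": 1, "joined": "2014-12-01"},
    {"id": 2, "joined": "01/12/2014"},
    {"id": 1, "joined": "2014-12-01"},  # duplicate of the first record
]

# Known source formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def standardise_date(value):
    """Parse a date using any known source format; emit ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

def cleanse(records):
    """Keep the first occurrence of each id, with dates standardised."""
    seen, clean = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue  # drop duplicate record
        seen.add(rec["id"])
        clean.append({**rec, "joined": standardise_date(rec["joined"])})
    return clean

print(cleanse(RECORDS))
# → [{'id': 1, 'joined': '2014-12-01'}, {'id': 2, 'joined': '2014-12-01'}]
```

Production cleansing frameworks layer many more rules on top of this (referential checks, fuzzy matching of near-duplicates, domain constraints), but the structure is the same: each rule either repairs a value or rejects a record before it reaches the warehouse.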