Archive for the ‘ETL’ Category

Analytical Processing of Data

November 14, 2013

A short presentation on Analytical Processing of Data: a very high-level overview.

Data Munging in the Big Data world

October 25, 2012
NASA recently announced a competition on the data munging problem posed by large US government data sets. It is called the NITRD Big Data Challenge series, at http://community.topcoder.com/coeci/nitrd/
The first of the challenges was primarily about “How to create a homogeneous big-data dataset from the large siloed data sets available with multiple government departments, such that meaningful societal decisions can be derived from the knowledge generated by big data analytics”.
So “Data Munging”, a term coined almost a decade ago, has come back as a key skill in today’s “Data Science” discipline.
What is data munging?
In the simplest possible terms, it is converting data generated on heterogeneous platforms and in heterogeneous formats into a common, processable format for further munging or analytics.
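As a rough illustration of "heterogeneous formats into a common format", here is a minimal sketch. The two sample inputs, the field names and the target schema (`name`, `joined`) are all hypothetical and chosen only for this example; real munging usually also involves the human-assisted steps discussed in the next answer.

```python
import csv
import io
import json

# Hypothetical sample inputs: the same kind of record arriving in two formats.
CSV_TEXT = "Name,Joined\nAda Lovelace,2012-10-25\n"
JSON_TEXT = '[{"full_name": "Alan Turing", "join_date": "25/10/2012"}]'


def from_csv(text):
    """Map CSV rows onto the common schema (name, joined as an ISO date)."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["Name"], "joined": row["Joined"]} for row in reader]


def from_json(text):
    """Map JSON objects onto the common schema, converting DD/MM/YYYY dates."""
    records = []
    for item in json.loads(text):
        day, month, year = item["join_date"].split("/")
        records.append(
            {"name": item["full_name"], "joined": f"{year}-{month}-{day}"}
        )
    return records


def munge():
    """Combine both sources into one homogeneous list of records."""
    return from_csv(CSV_TEXT) + from_json(JSON_TEXT)


if __name__ == "__main__":
    for record in munge():
        print(record)
```

Each source gets its own small adapter, so a new format only needs one more function that emits the same schema.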
How is it different from ETL/data integration?
Data integration and ETL are fully automated and programmed, whereas munging is semi-automatic: it relies on human-assisted machine learning algorithms.
Why is it important now?
With the rise of the massively parallel processing paradigm based on MapReduce and other so-called “big data” technologies, a key question now is how the vast amounts of existing data can be made available for this kind of processing, so that knowledge can be derived by means of analytic and machine learning algorithms.
Start-ups building platforms and tools for data munging are now emerging in the market. In my opinion, this is going to be a key skill in the future big-data-based “Data Science” discipline.
So, if you have good skills in data and in assisted machine learning algorithms for data manipulation, go for it!

Data consolidation (ETL) and data federation (EII)

June 16, 2011

Operational IT systems focus on supporting business operations: they enable the capture, validation, storage and presentation of transactional data during the normal running of operations. They contain the latest view of the organization’s operational state.

Traditionally, data from the various operational systems is extracted, transformed and loaded into a central warehouse for historical trending and analytic purposes. This ETL process needs a separate IT infrastructure to hold the data, and it introduces a time lag before information in the OLTP systems becomes available in the central data warehouse.
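The extract, transform and load steps can be sketched in a few lines. This is only an illustration using Python's built-in `sqlite3` module with in-memory databases; the `orders` and `customer_totals` tables and their columns are hypothetical, not taken from any particular system.

```python
import sqlite3


def build_operational_db():
    """Create a stand-in OLTP database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "acme", 1250), (2, "acme", 750), (3, "globex", 4000)],
    )
    conn.commit()
    return conn


def extract(source):
    """Extract: read the raw transactional rows from the operational system."""
    return source.execute("SELECT customer, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transform: aggregate per customer and convert cents to currency units."""
    totals = {}
    for customer, cents in rows:
        totals[customer] = totals.get(customer, 0) + cents
    return sorted((customer, total / 100) for customer, total in totals.items())


def load(warehouse, rows):
    """Load: write the aggregated rows into the separate warehouse store."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT, total REAL)"
    )
    warehouse.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)
    warehouse.commit()


def run_etl():
    """Run one batch and return what the warehouse now holds."""
    source = build_operational_db()
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Note that the warehouse is a second, separate database: that copy is the extra infrastructure, and the gap between batch runs is the time lag described above.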

When the cost and resources required to consolidate data in the traditional way are not justified, for example after a series of acquisitions, a different data integration mechanism is needed. An alternative way of looking at the problem is to provide a semantic layer through which data across heterogeneous sources can be accessed for analytical purposes. This approach is called “Data Federation”, “Data Virtualization” or EII (Enterprise Information Integration).
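To make the idea of a semantic layer concrete, here is a minimal sketch in which nothing is copied into a warehouse; each source is read at query time and mapped onto one logical schema. The class name `FederatedCustomers`, the two sources (a SQLite "CRM" table and a CSV "billing" export) and all field names are hypothetical, and commercial federation tools do far more (query push-down, optimization, security).

```python
import csv
import io
import itertools
import sqlite3

# Hypothetical second source: a CSV export from a billing system.
BILLING_CSV = "customer,country_code\nInitech,US\n"


def make_crm_db():
    """Create a stand-in CRM database with its own naming conventions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (client_name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO clients VALUES (?, ?)", [("Acme", "US"), ("Globex", "DE")]
    )
    return conn


class FederatedCustomers:
    """A virtual 'customers' view over two sources, resolved at query time."""

    def __init__(self, crm_conn, billing_csv):
        self._crm = crm_conn
        self._billing_csv = billing_csv

    def _from_crm(self):
        query = "SELECT client_name, country FROM clients ORDER BY client_name"
        for name, country in self._crm.execute(query):
            yield {"name": name, "country": country, "source": "crm"}

    def _from_billing(self):
        for row in csv.DictReader(io.StringIO(self._billing_csv)):
            yield {
                "name": row["customer"],
                "country": row["country_code"],
                "source": "billing",
            }

    def query(self, country):
        """Return records from every source that match the given country."""
        records = itertools.chain(self._from_crm(), self._from_billing())
        return [r for r in records if r["country"] == country]


if __name__ == "__main__":
    view = FederatedCustomers(make_crm_db(), BILLING_CSV)
    for record in view.query("US"):
        print(record)
```

Because every call to `query` goes back to the sources, results are always current, but each query is only as fast and as available as the slowest source behind it.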

The key advantages of EII are quick delivery and lower cost. The key disadvantages are the performance of the solution and its dependence on the source systems.

In my view, a good use case for data virtualization is consolidating different enterprise data warehouses after mergers and acquisitions.

Traditional ETL and data warehouse technology vendors are coming up with data federation tools. Informatica Data Services uses a consolidated data integration philosophy, whereas Business Objects Data Federator uses virtual tables in the BO universes to provide the same functionality. Composite Integration Server is an independent technology provider in this area.

Key considerations in selecting data federation and associated technologies are:
1. native access to the heterogeneous source systems
2. capabilities of access method optimization
3. caching capabilities of the federation platform
4. metadata discovery capabilities from various sources
5. ease of development
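To illustrate why item 3 matters, the sketch below wraps a source query in a simple time-to-live cache so that repeated requests do not hit the source system every time. `CachedSource` and its parameters are hypothetical names for this example only; real federation platforms have their own caching mechanisms and policies.

```python
import time


class CachedSource:
    """Wrap a fetch function and reuse its result until a TTL expires."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the source system
        self._ttl = ttl_seconds      # how long a cached result stays valid
        self._clock = clock          # injectable clock, handy for testing
        self._value = None
        self._expires_at = None      # None means nothing has been cached yet

    def get(self):
        """Return the cached result, refreshing it from the source if stale."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


if __name__ == "__main__":
    source = CachedSource(lambda: ["row fetched from source"], ttl_seconds=60)
    print(source.get())  # first call queries the source
    print(source.get())  # second call is served from the cache
```

The trade-off is freshness: a longer TTL reduces load on the source systems but lets the federated view drift further from their current state.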

A successful enterprise in the modern world needs a carefully chosen hybrid of data consolidation and data federation.