Archive for the ‘bigdata’ Category

Data streams, lakes, oceans and “Finding Nemo” in them….

April 4, 2014

This weekend, I complete three years of my second innings at TCS. For most of those three years I have been working with large insurance providers, trying to figure out ways to add value to their operations and strategy with technology.

This period has been one of re-imagination. Companies and individuals (consumers and employees) are slowly reimagining themselves in the wake of converging digital forces: cloud computing, analytics and big data, social networking, and mobile computing.

The focus of my career in Information Technology has always been on "Information", not technology. I am a firm believer in "information"-led transformation rather than "technology"-led transformation. The basis for information is data, together with the ability to process and interpret that data, making it applicable and relevant to the operational or strategic issues being addressed by corporate business leaders.

Technologists are busy claiming that their own technology is best suited for current data processing needs. Storage vendors are finding business in providing storage in the cloud. Mobility providers are betting big on wearable devices, making computing more and more pervasive. The big industrial manufacturers are busy fusing sensors into everything and connecting them to the internet, following the trend set by the human social networking sites. A new breed of scientists, calling themselves data scientists, is inventing algorithms to quickly derive insights from the data being collected. Each group pushes itself to the front while leaning on the others to market itself.

Amid this rush, a distinctive trend has emerged in business houses: the CTO projecting technology as a business growth driver and taking a dominant role. Data flows get plumbed across the IT landscape, across various technologies, causing a lot of hurried and fast-changing plumbing issues.

In my view, data flow should be natural, just like streams of water. Information should flow naturally across the landscape, and technology should be used to keep the flow gentle, avoiding floods and tsunamis. Creating data pools in cloud storage, and connecting those pools into a knowledge ecosystem that grows the right insights for the business context, remains the big challenge today.

Information architecture in the big data and analytics arena is like dealing with big rivers: building the right reservoirs and connecting them to get the best benefit across the landscape. And a CIO is still needed, and still responsible for this, in the corporation.

If data becomes an ocean and finding insights becomes an effort like "Finding Nemo", the overall objective may be lost. Let us cautiously avoid the data ocean and keep (big) data in its pools and lakes as usable information, while reimagining data in the current world of re-imagination. This applies to corporate business houses as well as individuals.

Hoping innovative reimagination in the digital world helps improve life in the ecosystems of the real world…

Context analytics for better decisions – Analytics 3.0

November 19, 2013

In today’s #BigData world, #analytics has taken on complexity beyond pure statistics, pattern recognition using clustering and segmentation, or predictive analytics using logistic regression methods.

One of the great challenges for big data’s unstructured analytics is ‘context’. In traditional data processing, we have removed the context and recorded just the content. All that we try to do with sentiment analysis is based on extracting words, phrases and entities, trying to combine them into ‘concepts’, scoring them by matching known patterns of pre-discovered knowledge, and assigning a sentiment to the content.

The success rate of this method is fairly low. (This is my own personal observation!) One thought for improving the quality of this data is to add the context back to the content. The technology enabler for this is again a ‘Big Data’ solution. That is, we start with a big data problem and find the solution in the big data space. Interesting, isn’t it?

Take the content data at rest, analyze it, enrich it with context information such as spatial and temporal attributes, and derive knowledge from it. Visualize the data by putting similar concepts together and merging identical concepts into a single entity.
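A toy sketch of what "adding context back to the content" could look like: tag each piece of text with where and when it was recorded, then merge records mentioning the same concept into one entity. The concept vocabulary and keyword matching below are purely illustrative assumptions; a real system would use proper entity extraction.

```python
from collections import defaultdict

CONCEPTS = {"flood", "storm", "claim"}   # illustrative vocabulary, not a real taxonomy

def extract_concepts(text):
    """Trivial keyword-match stand-in for entity/concept extraction."""
    return {w.strip(".,").lower() for w in text.split()} & CONCEPTS

records = [
    {"text": "Storm damage claim filed", "place": "Mumbai",  "time": "2013-11-01"},
    {"text": "Flood warning issued",     "place": "Chennai", "time": "2013-11-02"},
    {"text": "Another storm claim",      "place": "Mumbai",  "time": "2013-11-03"},
]

# Merge identical concepts into a single entity, keeping each
# mention's spatial and temporal context attached.
merged = defaultdict(list)
for rec in records:
    for concept in sorted(extract_concepts(rec["text"])):
        merged[concept].append((rec["place"], rec["time"]))
# merged["storm"] -> [("Mumbai", "2013-11-01"), ("Mumbai", "2013-11-03")]
```

The point of the sketch is only the shape of the data: content plus context travels together, so later analysis can use the spatial and temporal dimensions that traditional processing would have dropped.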

Big Blue is doing this, having realized the fact. A few months back they published a ‘Redpaper’ that can be found here.

Finally, putting the discovered learning into action in real time delivers the needed business impact and takes us to the world of Analytics 3.0. (Refer to

Exciting world of opportunities….

Analytical Processing of Data

November 14, 2013

A short presentation on Analytical Processing of Data; a very high-level overview…

Anticipatory Computing and Functional Programming – some rambling…

August 23, 2013


After an early-morning discussion on Anticipatory Computing on TCS’s enterprise social network, Knome, I thought of making this blog post linking the “functional orientation” of complex systems with consciousness.

In the computing world, it is a widely accepted fact that data can exist without any prescribed associated process. Once data is stored on a medium (generally called memory), it can be put through any abstract process to derive conclusions. (This trend is generally called big-data analytics, leading to predictive and prescriptive analytics.)


If I say that a function can exist without any prescribed data, with multiple possible outcomes, it is not so easily accepted. The only thing people can think of here is a completely chaotic random number generator. A completely data-independent, pure function that returns a function based on its own “anticipation” is what I call consciousness.

This is one of my areas of interest in computability and information theory. A complex system’s behavior is not driven entirely by the data presented to it. Trying to model a complex system purely from the past data it has emitted is not going to work; one should consider the anticipatory bias of the system as well while modeling.

Functional programming comes a step closer to this paradigm. It tries to define functions without intermediate state-preserving variables. In mathematical terms, a function maps elements of its domain to its range. Abstracting this into an anticipation model, we get consciousness (or free will) as a function with three possible return functions:

1. Will do
2. Will NOT do
3. Will do differently
(I have derived this from an ancient Sanskrit statement regarding free will: kartum, akartum, anyathA vA kartum saktaH.)

The third option above (it is beyond binary, 0 or 1) leads to recursion: the function re-evaluates the alternatives, and again at (t+Δt) the system has all three options. When the anticipatory system responds, “data” starts emitting from it. The environment in which this micro anticipatory system operates is itself a macro anticipatory system.
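The three-way function and its recursion at (t+Δt) can be sketched as a toy model. The "bias" function below stands in for the system's anticipation and is a pure assumption of mine; the post's point is precisely that a real system's anticipatory bias is not derivable from its past data.

```python
# Toy model of the three-way free-will function described above.
DO, DONT, DIFFERENTLY = "will do", "will not do", "will do differently"

def free_will(t, dt, bias, max_depth=10):
    """Return the action chosen at time t.

    If the choice is "will do differently", the alternatives are
    re-evaluated at t + dt (the recursion described in the post).
    max_depth is a practical guard; the idealized model recurses freely.
    """
    choice = bias(t)
    if choice == DIFFERENTLY and max_depth > 0:
        return free_will(t + dt, dt, bias, max_depth - 1)
    return choice

# A deterministic example bias: act on even ticks, reconsider on odd ones.
bias = lambda t: DO if t % 2 == 0 else DIFFERENTLY
print(free_will(1, 1, bias))   # reconsiders at t=2, then prints "will do"
```

The interesting structural point is that `free_will` returns a decision derived from no input data at all, only from its own evaluation at successive instants.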

The ongoing hype around big data is about establishing patterns in the data emitted from various micro-systems and deriving the function of a macro free will. It is easier for a micro free will to dynamically model that function itself; this is what we call “intuition”, and it is beyond the limits of computability.

Enough of techno-philosophical rambling for this Friday! Have a nice weekend.

Crisscrossing thoughts around #Cloud and #BigData

August 2, 2013

While “Big Data Analytics” runs on cloud-based infrastructure with thousands of (virtual) servers, cloud infrastructure management has itself become a big data problem!

All key availability and performance metrics need to be collected and processed regularly, both to keep the cloud infrastructure running within the agreed performance service levels and to identify trends in demand for cloud services. Hence there is an absolute need for predictive analytics on the collected metrics data.

As data centers gradually turn into private clouds with heavy virtualization, it becomes increasingly important to manage the underlying grid of resources efficiently by allocating the best possible resources to the highest-priority jobs. An integrated infrastructure monitoring and analytics framework, running on the grid itself and dynamically optimizing resource allocation to fit workload characteristics, could make the data center more efficient and green.
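A minimal sketch of the predictive piece: fit a least-squares trend line to a series of utilization samples and forecast the next interval, so a scheduler could react before a service level is breached. The metric, samples, and threshold below are illustrative assumptions, not from any real monitoring tool.

```python
def forecast_next(samples):
    """Least-squares linear fit over equally spaced samples;
    returns the predicted value for the next interval."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) \
            / sum((x - mean_x) ** 2 for x in xs)
    # Evaluate the fitted line at x = n (one step past the last sample).
    return mean_y + slope * (n - mean_x)

cpu_util = [52, 55, 59, 64, 68]           # percent, one sample per interval
predicted = forecast_next(cpu_util)
if predicted > 70:                        # illustrative service-level threshold
    print("pre-provision capacity: predicted %.1f%%" % predicted)
```

Real frameworks would of course use richer models (seasonality, multiple metrics), but even this trivial trend fit captures the idea of acting on where the metric is heading rather than where it is.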

Taking the same approach to business services across organizational boundaries, there could be an automated marketplace where available computing resources are traded by public cloud providers, and consumers “buy” the computing resources they need and get their processing executed, possibly by combining multiple providers’ resources into an extended hybrid cloud in a highly dynamic configuration.

Data and processing would have to be encapsulated as micro- or nano-scale objects, taking computing out of the current storage-processor architecture into a more connected, neuron-like architecture with billions of nodes: a really BIG big data.


If all the computing needed on this tiny globe could be unified into a single harmonic process, the amount of data that needs moving would be minimal, and a “single cloud” would serve the purpose.

Conclusion: Cloud management using bigdata, and big data running on cloud infrastructure complement each other to improve the future of computing!

Question: If I have a $1 today, where should I invest for better future? In big data? Or in Cloud startup??

Have a fabulous Friday!

Data Philosophers and data quality

May 24, 2013

After data scientists and data artists, another need is emerging, for “data philosophers”, and that made me think about what data philosophers would do.

So, the data scientists focus on the underlying technology to gather, validate and process the ‘big’ data, and the artists use the processed ‘big’ data to paint and visualize the insights.

In this whole process, given the wide variety and velocity of the data (two ‘V’s of big data!), are we missing out on rigor in data quality?

Considering the 36 attributes of data quality in Kristo Ivanov’s 1972 paper – and evaluating today’s big data insights against them, I somehow feel there is a ‘big’ gap in the quality of ‘big data’.

I see some parallels between big data processing and orbit determination. As long as the key laws governing planetary motion are unknown, no amount of observational data will let us explain the ‘retrograde motion’ of the planets. In the same way, if we do not have a clear understanding of the underlying principles of the data streams, we will not be able to explain them. That is where we need the philosophers!

Now, I think I am becoming a “Data Philosopher” already!

Data Artist – A new professional skillset?

May 17, 2013

In the past few days, I have seen at least two blogs talking about the “Data Artist”.

The trend seems to be toward business-centric data visualization of so-called “big data”.

A data artist is one who can use data as paint and create art that represents massive flows of data, visualizing the patterns so that business users are delivered a lot of “information” in a single glance.

It is slightly different from the “Data Scientist” profession. Data scientists focus on the technical process of collecting, preparing and analyzing the data for patterns, whereas data artists specialize in visualizing the discoveries in an artistic manner!

“Scientific Artists” and “Artistic Scientists” with Data! Are we complicating the matter too much??

“White noise” and “Big Data”

March 15, 2013

Those who are familiar with physics and communications will have heard the term “White Noise”. In simple terms, it is the noise produced by combining all the different frequencies together.
So, what is the relationship between the white noise and big data?
At present, there is a lot of “noise” about big data, at both positive and negative frequencies. Some feel it is data in high volume, some unstructured data; some relate it to analytics, some to real-time processing, some to machine learning, some to very large databases, some to in-memory computing, some others to regression, still others to pattern recognition, and so on…
People have started defining “big data” with four Vs (Volume, Velocity, Variety, and Variability) and have gone on to add several more. I have somewhere seen a list of 21 Vs defining big data.
So, in simple terms, big data is mostly machine-generated unstructured data, arriving in quick succession and in high volumes (one scientific example is the Large Hadron Collider, which generates huge amounts of data from each of its experiments), that needs to be handled where traditional computing models fail.
Most of this high-volume data is also “white noise”, combining signals of all frequencies produced simultaneously on social feeds like Twitter (the 4th goal by Spain in the Euro 2012 match resulted in 15K tweets per second!), which only proves that many people were watching and excited about the event; such information adds minimal “business value”.
How to derive “Value” then?
The real business value of big data can only be realized when the right data sources are identified, the right data is channeled through the processing engine, and the right technique is applied to separate the right signal from the white noise. That, in my honest opinion, is precisely the job of a “Data Scientist”.
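As a toy illustration of "separating the signal from the white noise": a simple moving average suppresses zero-mean noise and leaves the underlying trend. The series below is synthetic, and a real data scientist would reach for proper filters and spectral methods; this only shows the principle.

```python
import random

random.seed(7)
signal = [t * 0.5 for t in range(100)]             # slow linear trend (the "signal")
noisy  = [s + random.gauss(0, 5) for s in signal]  # trend buried in white noise

def moving_average(xs, window):
    """Average each run of `window` consecutive samples."""
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

smooth = moving_average(noisy, 10)
# The smoothed series tracks the trend far more closely than the raw one:
# averaging 10 samples cuts the noise standard deviation by about sqrt(10).
```

The same intuition scales up: most of the 15K tweets per second are mutually cancelling noise, and value lies in whatever slow-moving component survives the averaging.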
I have not found a really good general use case for big data in the insurance industry yet (other than stray cases of vehicle telematics in the auto sector and some weather/flood/tsunami hazard-modeling cases in corporate specialty lines).
But I stay tuned to the white noise anyway, looking for clues to real use cases in insurance and, more broadly, in financial services… (The “machine trading” algorithms in that field are already well developed!)
Comments? Views?

Few thoughts on Data Preparation for #Analytics

February 1, 2013
It is Friday and it is time for a blog post.
A typical analytics project spends 70% (to 80%) of its time preparing the data. Achieving the right data quality and the right format of data is a primary success factor for an analytics project.
What makes this task so knowledge-intensive, and why is a multifaceted skill set required to carry it out?
I will give a quick, simple example of how “functional knowledge”, beyond technical knowledge, matters in data preparation. There is a functional distinction between missing data and non-existing data.
For example, consider a customer data set. If the customer is married and the age of the spouse is not available, this is missing data. If the customer is single, the age of the spouse is non-existing. In the data mart, these two scenarios need to be represented differently so that the analytic model behaves properly.
How missing data is handled (data imputation techniques) while preparing the data set affects the results of the analytical models.
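The spouse-age example can be sketched in a few lines. This is a minimal sketch in plain Python, not tied to any particular data-mart tool: the point is to keep "missing" and "non-existing" distinct instead of collapsing both into a single null. The field names and status labels are my own illustrative choices.

```python
MISSING, NOT_APPLICABLE = "missing", "not_applicable"

def spouse_age_feature(customer):
    """Encode spouse age as (value, status) so the model can tell
    an unknown value apart from a value that cannot exist."""
    if customer["marital_status"] == "single":
        return (None, NOT_APPLICABLE)   # non-existing: there is no spouse
    age = customer.get("spouse_age")
    if age is None:
        return (None, MISSING)          # exists, but was not captured
    return (age, "present")

customers = [
    {"marital_status": "married", "spouse_age": 34},
    {"marital_status": "married"},      # spouse age not captured -> missing
    {"marital_status": "single"},       # no spouse exists -> non-existing
]
features = [spouse_age_feature(c) for c in customers]
# features -> [(34, 'present'), (None, 'missing'), (None, 'not_applicable')]
```

With the status carried alongside the value, an imputation step can fill only the genuinely missing ages and leave the non-applicable ones alone, which is exactly the functional distinction the model needs.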
Dr. Gerhard Svolba of SAS has written extensively on data preparation and data quality (for analytics), and this presentation gives more details on the subject.
I have made an earlier blog post dealing with these challenges in the “big data” world –

Small & Big Data processing philosophies

January 3, 2013
In this first post of 2013, I would like to cover some fundamental philosophical aspects of “data” & “processing”.
As the buzz around “Big Data” grows, I have classified the original structured, relational data as “small data”, even though I have seen some very large databases holding 100+ terabytes of data with an I/O volume of 150+ terabytes per day.
Present-day data processing predominantly uses the Von Neumann architecture of computing, in which “data” and its “processing” are distinct, separated into “memory” and “processor” connected by a “bus”. Any data that needs to be processed is moved into the processor over the bus; the required arithmetic or logical operation happens on it, producing the “result” of the operation. The result is then moved to memory/storage for further reference. The list of operations to be performed (the processing logic, or program) is stored in memory as well, and each next instruction must likewise be moved from memory into the processor over the bus.
So, in essence, both the data and the operations to be performed live in memory, which cannot process data, while the facility that can process data is always dependent on the memory. That is the Von Neumann architecture.
Traditionally, the “data” has been moved to the place where the processing logic is deployed, because the amount of data was small while the processing needed was relatively large, involving complex logic. RDBMS engines like Oracle read blocks of storage into the SGA buffer cache of the running database instance for processing; transactions modified small amounts of data at any given time.
Over time, “analytical processing” required bringing huge amounts of data from storage to the processing node, which created a bottleneck on the network pipe. Add to that the large growth in semi-structured and unstructured data that started flowing, and a different philosophy of data processing was needed.
Enter HDFS and the map-reduce framework of Hadoop, which took the processing to the data. Around the same time came Oracle Exadata, which took database processing down to the storage layer with a feature called “query offloading”.
In the new paradigm, snippets of processing logic are shipped to a cluster of connected nodes where the data, distributed by a hashing algorithm, resides; the results of processing are then reduced to produce the final result sets. It is now economical to take the processing to the data, since the amount of data is relatively large and the required processing consists of fairly simple tasks: matching, aggregating, indexing, and so on.
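The map-reduce idea above can be sketched in-process with a classic word count. This is only an illustration of the two phases; real Hadoop ships the map function to the nodes that physically hold each chunk, which is the whole point of taking processing to the data.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs from one node's local chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: aggregate the emitted counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "node" holds one chunk; in a real cluster the mapping
# runs wherever the chunk is stored, and only the small
# (word, count) pairs travel over the network.
chunks = ["big data small data", "data moves to processing",
          "processing moves to data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
# counts["data"] -> 4, counts["processing"] -> 2
```

Note how the per-chunk work is exactly the "fairly simple" matching and aggregating described above, while the heavy lifting is the framework's job of placement and shuffling.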
So we now have a choice: take small amounts of data to complex processing in structured RDBMS engines, with the shared-everything architecture of the traditional model, or take the processing to the data in shared-nothing big data architectures. It depends purely on the type of “data processing” problem at hand; traditional RDBMS technologies will not be replaced by the new big data architectures, nor can the new big data problems be solved by traditional RDBMS technologies. They go hand in hand, complementing each other and adding value to the business when properly implemented.
Non-Von Neumann architectures still deserve better attention from technologists; they probably hold the key to the way the human brain seamlessly processes information, whether structured or unstructured streams of voice, video, etc., with ease.
Any non-Von-Neumann architecture enthusiasts over here?