Archive for the ‘Analytics’ Category

My continued association with Computer Science

August 8, 2015

It has been 27 years of my association with “Computer Science” today!

Recently I heard a student remark while selecting an undergraduate course: “What is there in computer science? I can learn Java on my own.”

My clarification is as follows: “Computer Science” is not just a programming language or the skill of writing a program. It demands a deep understanding of operating systems, memory management, compilers, data structures and algorithms, data storage, compression, encryption and security, parallel processing, analytics and on and on….

Along with that understanding, the ability to apply it, implementing algorithms with the available computing resources to solve problems, is what makes up the study of “Computer Science”.

Of late, I have started learning the statistical language “R” and have been trying out my ability to apply machine learning on some #kaggle challenges. My first submission to a competition: (I currently stand at 1110th position)

Having moved my regular technology blogging to LinkedIn, my last year’s post: 

Earlier on this blog: 

All my last year posts can be found on LinkedIn:

So, like any other subject, “Computer Science” has a lot of depth and breadth, if one wants to explore it!!

Data and Analytics – some thoughts on consulting

May 31, 2014

On this technical blog, I have not been very regular nowadays, primarily due to other writing engagements on artha SAstra and on Medium.

Yesterday I was asked to address a group of analytics enthusiasts in an interactive session arranged by NMIMS at Bangalore on the theme “Visualize to Strategise”. Over around three hours, several topics on Analytics, and specifically on Visual Analytics, were discussed.

In this post I thought of writing on two aspects of Analytics that I have seen in the past few months, to give a little food for thought to those who are consulting on Analytics.

Let “data” speak.
A few weeks back one of my customers had a complaint about a database. The customer said they had allocated a large amount of storage to the database, and within one month all the space was consumed. According to the customer’s IT department, at most 30K business transactions were performed, by a user group of 50, on the application supported by this database. So they concluded there was something wrong with the database, and escalated it to me to look into.

I suspected some interfacing schema could be storing CLOB/BLOB type data with a missing cleanup job, and asked my DBA to give me a tablespace growth trend. The growth was in the transaction schema, spread across multiple transaction tables in that schema. With this observation I ruled out abnormal allocation to a single object.

We then ran a simple analysis of the transaction data, looking at the creating user of each transaction, to verify whether someone had run a migration script that could have loaded a huge amount of data into the transaction tables, or whether some other human error was involved.

To our surprise, we found 1,100 active users who had created 600,000+ transactions in the database, spread across different times and mostly following a regular working-day, working-hour pattern. No nightly batch or migration user had created the data. We went ahead with a more detailed analysis, which mapped the users across the geography of the country of operation.

We created a simple drill-down visualization of the data and submitted it to the business and IT groups at the customer, concluding that the data was indeed valid, created by their own users, and that there was no problem with the system.
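A minimal sketch of the kind of check we ran, in Python rather than the actual tooling, and with hypothetical column names (`created_by`, `created_at`) standing in for the real transaction table: count the distinct creating users and profile activity by hour of day. A migration script would show one service account and an odd-hours spike; humans show many users in working hours.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of transaction records (in reality, pulled from the DB).
transactions = [
    {"created_by": "user_%03d" % (i % 7),
     "created_at": datetime(2014, 4, 1 + i % 28, 9 + i % 9)}
    for i in range(100)
]

# Distinct creating users: a migration would typically show one service account.
users = Counter(t["created_by"] for t in transactions)
print("distinct users:", len(users))

# Hour-of-day profile: human activity clusters in working hours.
hours = Counter(t["created_at"].hour for t in transactions)
print("activity by hour:", sorted(hours.items()))
```

On the real data, the same two summaries were enough to show 1,100 distinct users with a working-hours pattern.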

So, the data spoke for itself. The customer’s business team told the IT team that they had rolled the system out across the country during the previous month, and that all those users were updating transactions on it. The IT team was not aware of this fact; they still thought the system was running in pilot mode, with one location and 50 users.

Let the data speak. Let it show itself to those who need it for decision making. Democratize the data.

The second point, which came up clearly yesterday, was:

“If you torture your data long enough, it will confess anything”
Do not try to prove a known hypothesis with the help of data; that is not the purpose of analytics. With data and statistics you can infer almost anything, and any bias towards a specific result will defeat the purpose of analytics.

So, in the process of decision making and strategising, let the data, with its modern visualization ability, be an unbiased representative that shows the recorded history of the business, with all its deficiencies, all its recording errors and all possible quality problems…..

I hope I have made my two points on consulting in Analytics clear….

Data streams, lakes, oceans and “Finding Nemo” in them….

April 4, 2014

This weekend, I complete 3 years of my second innings at TCS. For most of those three years I have been working with large insurance providers, trying to figure out ways to add value to their operations and strategy with technology.

This has also been a period of re-imagination. Companies and individuals (consumers / employees) are slowly moving towards reimagining themselves in the wake of converging digital forces: cloud computing, analytics & big data, social networking and mobile computing.

The focus of my career in Information Technology has always been “Information”, not technology. I am a firm believer in “Information”-led transformation rather than “technology”-led transformation. The basis for information is data, along with the ability to process and interpret that data, making it applicable and relevant to the operational or strategic issues being addressed by corporate business leaders.

Technologists are busy claiming that their own technology is best suited to current data processing needs. Storage vendors are finding business in providing storage in the cloud. Mobility providers are betting big on wearable devices, making computing more and more pervasive. The big industrial manufacturers are busy embedding sensors everywhere and connecting them over the internet, following the trend set by the human social networking sites. A new breed of scientists, calling themselves data scientists, are inventing algorithms to quickly derive insights from the data being collected. Each one pushes itself to the front, taking the support of the others to market itself.

In this rush, there is a distinct trend in business houses: the CTO projecting technology as a business growth driver and taking a dominant role is common. Data flows get plumbed across the IT landscape, across various technologies, causing a lot of hurried and fast-changing plumbing issues.

In my view the data flow should be natural, just like streams of water. Information should flow naturally across the landscape, and technology should be used to keep the flow gentle, avoiding floods and tsunamis. Creating data pools in cloud storage, and connecting the pools into a knowledge ecosystem that grows the right insights for the business context, remains the big challenge today.

The information architecture in the big data and analytics arena is just like dealing with big rivers: building the right reservoirs and connecting them to get the best benefit from the landscape. And a CIO is still needed, and responsible for this, in the corporation.

If data becomes an ocean, and finding insights becomes an effort like “Finding Nemo”, the overall objective may be lost. Let us cautiously avoid the data ocean and keep (big) data in its pools and lakes, as usable information, while reimagining data in the current world of re-imagination. This applies to corporate business houses and individuals alike.

Hoping innovative reimagination in the digital world helps improve life in the ecosystems of the real world….

Social Analytics for Online Communities

January 10, 2014

This is the first post of 2014. Happy new year to one and all………..

A recent discussion on knome (TCS’ internal social platform) about managing online communities, controlling spam and making the best of an enterprise social platform at the scale of ~200K members led me to study the application of Social Analytics to these objectives.

Researching on the internet, I came across a paper titled “Scalable Social Analytics for Online Communities” by Marcel Karnstedt, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway.

This post is to summarize the contents of the paper and some of my thoughts around it.

The success of a social platform depends on the strength of the analytics that understand and drive the dynamics of the network built on the platform.

To achieve these goals we need a set of tools that can perform multidimensional analysis along four dimensions: structural, behavioural, content/semantic and cross-community.

Structural Analysis: Analyse the communities, memberships and sub-communities based on strong relations between members, and identify influencers/leaders and followers.

Behavioural Analysis: Analyse the interactions to identify the helpful experts (or sub-groups) who provide information and the newbies seeking information who benefit from those interactions. Both micro-level (individual) and macro-level analysis are needed.

Content / Semantic analysis: Use text mining to detect, track and quantitatively measure current interests, and shifts in topic and sentiment, within the community.

Cross-community dynamics: Understand how community structures and sub-structures influence each other, to detect redundant and complementary communities and merge or link them together.

The analyses from all four dimensions need to be combined in a scalable, real-time model to achieve the best understanding, control and utility of socially generated data (rather, knowledge!).
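The structural dimension is the easiest to sketch. A small pure-Python illustration, with made-up member names: from an edge list of interactions, connected components (found via BFS) approximate the separate sub-communities, and degree acts as a simple proxy for who the influencers are.

```python
from collections import defaultdict, deque

# Made-up interaction edges between community members.
edges = [("anu", "bala"), ("bala", "chitra"), ("anu", "chitra"),
         ("anu", "dev"), ("esha", "farid")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Degree centrality: a simple proxy for influence within the community.
degree = {n: len(nbrs) for n, nbrs in adj.items()}

def components(adj):
    """Connected components via BFS: each approximates a sub-community."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

print(sorted(degree.items(), key=lambda kv: -kv[1]))
print(components(adj))
```

In a real platform the same ideas would run over millions of interaction records, which is where the scalability challenge of the paper comes in.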

New solutions for new problems! Have a nice weekend………..

Context analytics for better decisions – Analytics 3.0

November 19, 2013

In today’s #BigData world, #analytics has taken on additional complexity beyond pure statistics or pattern recognition using clustering and segmentation, or predictive analytics using logistic regression methods.

One of the great challenges for big data’s unstructured analytics is ‘context’. In traditional data processing, we removed the context and recorded just the content. All we try to do with sentiment analysis is derive the words, phrases & entities, combine them into ‘concepts’, score them by matching known patterns of pre-discovered knowledge, and assign a sentiment to the content.

The success rate of this method is fairly low. (This is my own personal observation!) One way to improve the quality of this data is to add the context back to the content. The technology that enables this is again a ‘Big Data’ solution; that is, we start with a big data problem and find the solution in the big data space. Interesting, isn’t it?

Take the content data at rest, analyze it, enrich it with context such as spatial and temporal information, and derive knowledge from it. Visualize the data by putting similar concepts together and by merging identical concepts into a single entity.
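A toy sketch of that enrich-and-merge step, with made-up snippets and a deliberately naive trigger-word mapping standing in for real concept extraction: each piece of content keeps its temporal and spatial context, and mentions of the same concept are merged into one entity.

```python
# Made-up content records, each carrying temporal and spatial context.
raw = [
    {"text": "great service at the branch", "ts": "2013-11-01", "city": "Bangalore"},
    {"text": "branch service was slow",     "ts": "2013-11-02", "city": "Bangalore"},
    {"text": "mobile app crashed",          "ts": "2013-11-02", "city": "Mumbai"},
]

# Naive concept extraction: map trigger words to a canonical concept.
concept_map = {"service": "branch-service", "branch": "branch-service",
               "app": "mobile-app"}

merged = {}
for rec in raw:
    concepts = {concept_map[w] for w in rec["text"].split() if w in concept_map}
    for c in concepts:
        # Merge identical concepts into a single entity, keeping each context.
        merged.setdefault(c, []).append((rec["ts"], rec["city"]))

print(merged)
```

The payoff is that a sentiment score can now be attached to “branch-service in Bangalore over time” rather than to a contextless bag of words.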

Big Blue is doing this, having realized the fact. A few months back they published a ‘red paper’ that can be found here.

Finally, putting the discovered learning into action in real time gives all the needed business impact and takes it to the world of Analytics 3.0.

Exciting world of opportunities….

Analytical Processing of Data

November 14, 2013

A short presentation on Analytical Processing of Data; a very high-level overview…..

Bayesian Belief Networks for inference and learning

September 12, 2013

I attended a day-long seminar at IIM Bangalore on 10th September on the subject of Bayesian Belief Networks. Dr. Lionel Jouffe presented 4 case studies during the one-day technical session.

The session introduced Bayesian Belief Networks (BBNs) and the multiple modes of building the network: (a) mining past data, (b) pure expert knowledge capture, or (c) a combination of both methods. Once the conditional probabilities for each node exist and the associations between the nodes are built, both assisted and unassisted learning can be used.
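To make the expert-knowledge mode concrete, here is a tiny hand-crafted BBN in plain Python: the classic rain/sprinkler/wet-grass network (not one of the seminar case studies), with made-up conditional probabilities, queried by brute-force enumeration over the hidden variable.

```python
from itertools import product

# Expert-specified (illustrative) probabilities.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}
# P(wet | rain, sprinkler): the conditional probability table for the child node.
P_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """Joint probability factorized along the network structure."""
    p = P_rain[rain] * P_sprinkler[sprinkler]
    pw = P_wet[(rain, sprinkler)]
    return p * (pw if wet else 1 - pw)

# Infer P(rain | wet=True) by enumerating the hidden variable (sprinkler).
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print("P(rain | grass wet) = %.3f" % (num / den))
```

No data was mined here; the whole model is the expert’s conditional probability tables, which is exactly the “no data” mode. Real tools replace the enumeration with efficient inference algorithms.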

The first case study involved knowledge discovery in the stock market: publicly available stock market data was loaded, a BBN was built, and an automatic clustering algorithm over the discretized continuous variables was run to find similar tickers.

The second case study was on segmentation using BBNs. The input contained the market shares of 58 stores selling juices: local, national and premium brands, 11 brands across the three groups, sold in one US state. Using this data a BBN was built and automatic segmentation performed, yielding 5 segments with a good statistical description of each.

The third case study involved a marketing-mix analysis to describe and predict the efficiency of multiple channel campaigns (TV, radio, online) on product sales.

The fourth case study covered a vehicle safety scenario, using publicly available accident data to discover the two key factors that can reduce the fatality of injuries, based on parameters of the vehicle, the driver, etc.

The conclusion is that any analytic problem can be converted into a BBN and solved. I see a few advantages in this approach:
1. A BBN can be built in a no-data scenario, completely hand-crafted from expert knowledge. It can also be built in a big-data scenario, deriving the conditional probabilities by mining the data.
2. One strong theoretical framework solves the problems, making it easier to learn; there is no need to learn multiple theories.
As a technique, it has some promising features. The whitepapers presented are useful for understanding the technique in different scenarios. Views? Comments?

Models of Innovation diffusion in social networks

July 12, 2013


Having covered trust modeling and centrality in social networks, this post is the third and last in the series on social network analysis.

Innovation diffusion, influence propagation or ‘viral marketing’ is one of the most researched subjects of the contemporary era.

Some theory:

Compartmental models of epidemic spread, with susceptible (S), infected (I) and recovered (R) ‘SIR’ states, are used to study influence propagation in electronic social networks as well. These began as descriptive models of node behavior: when exposed to a new innovation or piece of information, each node has an initial probability of adopting it, and as each node adopts the innovation it exerts a specific amount of influence on the nodes connected to it.

Two basic models are primarily used to study spread in a social network. An initial set of ‘active’ nodes at time t0 exerts influence on the connected nodes, and at t1 some of the connected nodes become ‘active’ with a probability p(i). In the ‘Linear Threshold’ model, each node i has a threshold θi and becomes active when the combined influence from its neighbors exceeds this threshold; at each step, all nodes active up to the previous step remain active and influence their neighbors with a weight. In the ‘Independent Cascade’ model, each newly active node is given only one chance to activate each of its neighbors.
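The Independent Cascade model is short enough to simulate directly. A sketch with a made-up five-node graph and a single uniform propagation probability (real studies use per-edge probabilities): each newly activated node gets exactly one chance to activate each inactive neighbor.

```python
import random

def independent_cascade(adj, seeds, p, rng):
    """Simulate one Independent Cascade run; return the set of active nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        for nbr in adj.get(node, ()):
            # One chance only: this node never retries an inactive neighbor.
            if nbr not in active and rng.random() < p:
                active.add(nbr)
                frontier.append(nbr)
    return active

# Toy directed graph and a single seed node.
adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: [4], 4: []}
rng = random.Random(42)
spread = independent_cascade(adj, seeds={0}, p=0.5, rng=rng)
print("activated:", sorted(spread))
```

Averaging the size of `spread` over many runs estimates the expected influence of a seed set, which is the quantity the maximization problem below tries to optimize.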

Based on these two diffusion models, the maximization problem is to determine the best set of initial ‘active’ nodes in a network, maximizing the eventual influence spread for a ‘viral marketing’ campaign.

It is an NP-hard problem, and this paper discusses an interesting approximation algorithm with general cascade, threshold and triggering models.

Have a good weekend reading!

On Centrality and Power in social networks

June 21, 2013
After last week’s post on ‘Trust’, let us quickly review another important measure of (social) network structure.

Centrality is a structural measure of a network that indicates the relative importance of a node in the graph / network.
The simplest way to measure centrality is to count the number of connections a node has. This is called ‘degree centrality’.

Another way of measuring centrality is to see how far a node is from all other nodes of the graph. This measure is called ‘closeness centrality’, as it is based on the path lengths between pairs of nodes.

‘Betweenness centrality’ measures the number of times a node acts as a bridge on the shortest path between two other nodes. That shows how important each node is in connecting the whole network.

Going one step further, we have a measure called ‘eigenvector centrality’, which considers the influence of a node in the network by weighing the power of the nodes it is connected to. To explain it simply: my being connected to 500 people on LinkedIn is different from Barack Obama being connected to 500 of his friends there, because his 500 connections are (probably) more influential than mine. Google’s PageRank is a variant of eigenvector centrality.
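A small pure-Python sketch on a made-up graph shows the difference: eigenvector centrality (computed here by power iteration, which converges to the leading eigenvector for connected, non-bipartite graphs) can rank two nodes of equal degree differently, because it weighs who their neighbors are.

```python
# Toy undirected graph: node "a" is a hub; "b" and "d" both have degree 2,
# but "b" sits in the central triangle while "d" hangs off the periphery.
adj = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}

degree = {n: len(nbrs) for n, nbrs in adj.items()}

# Power iteration: repeatedly apply the adjacency matrix and normalize.
x = {n: 1.0 for n in adj}
for _ in range(100):
    new = {n: sum(x[m] for m in adj[n]) for n in adj}
    norm = max(new.values())
    x = {n: v / norm for n, v in new.items()}

print("degree:     ", degree)
print("eigenvector:", {n: round(v, 3) for n, v in x.items()})
```

Note that `b` and `d` have the same degree, yet `b` gets the higher eigenvector score: its neighbors are more central. That is exactly the LinkedIn-connections intuition above.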

When an external factor is considered for each node, and eigenvector centrality is extended with an external parameter α, the measure is called ‘alpha centrality’.

When we extend the alpha centrality measure from one node to cover multiple radii, first degree, second degree and so on, with factors β(i), and measure centrality as a function of influence at varying degrees, it is called ‘beta centrality’.

The key problem with centrality computation is the amount of computing power needed to arrive at the beta centrality measure of a social network with millions of nodes. I recently came across this paper, which proposes an alternative approximation algorithm that is computationally efficient and estimates a fairly accurate centrality measure. This alter-based non-recursive method works well on non-bipartite networks and suits social networks well.

The title of this post mentions “power”, yet the content has said nothing about it. Generally, centrality is considered an indicator of power or influence. But in some situations power is not directly proportional to centrality. Think about it.

Trust modeling in social media

June 14, 2013


After last week’s “tie strength” post, this week let me give some fundamentals on the importance of modeling TRUST in social media.

What is Trust?
It is difficult to define. But questions like “Would you loan a moderate amount to this person?” or “Would you seek their reference or recommendation on a key decision?” help in understanding the term TRUST.

There are two components to TRUST. First, some people are more trusting than others: some establish trust quickly, whereas others take a long time. This component is not easy to model. The second component is the credibility of the trusted person.

Measuring Trust:
In social media, the second component can be measured by analyzing the sentiment of the blog posts that reference a person. This is called “network-based trust inference”.
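A toy sketch of that idea, not the paper’s actual model: each inbound link carries a sentiment polarity in [-1, 1] (made-up names and values below), and a blogger’s credibility score is aggregated from the polarity of the links pointing at them.

```python
from collections import defaultdict

# Made-up links: (source, target, sentiment polarity of the reference).
links = [
    ("alice", "bob",   0.8),   # alice references bob positively
    ("carol", "bob",   0.6),
    ("alice", "dave", -0.4),   # a negative reference lowers trust
    ("bob",   "dave",  0.2),
]

inbound = defaultdict(list)
for src, dst, polarity in links:
    inbound[dst].append(polarity)

# Trust score: mean inbound polarity, mapped from [-1, 1] to [0, 1].
trust = {who: (sum(ps) / len(ps) + 1) / 2 for who, ps in inbound.items()}
print(trust)
```

A real model would also propagate trust along the links (trusting the opinions of trusted people more), rather than treating every referencing blogger equally.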

This paper describes a model for measuring trust using link polarity.

Have a good weekend reading!