Archive for the ‘Database’ Category

Data and Analytics – some thoughts on consulting

May 31, 2014

I have not been very regular on this technical blog nowadays, primarily due to other writing engagements on artha SAstra and on Medium.

Yesterday I was asked to address a group of analytics enthusiasts in an interactive session arranged by NMIMS at Bangalore on the theme "Visualize to Strategise". Over roughly three hours we discussed several topics on analytics, and specifically on visual analytics.

In this post I thought of writing about two aspects of analytics I have seen in the past few months, to give a little food for thought to those who are consulting on analytics.

Let “data” speak.
A few weeks back one of my customers had a complaint about a database. The customer said they had allocated a large amount of storage to the database, and within a month all the space was consumed. As per the customer's IT department, at most 30K business transactions were performed by a user group of 50 on the application supported by this database. So they concluded there was something wrong with the database, and it was escalated to me to look into.

I suspected an interfacing schema that could be storing CLOB/BLOB data with a missing cleanup job, and asked my DBA to give me a tablespace growth trend. The growth was on the transaction schema and spread across multiple transaction tables in that schema, so I ruled out abnormal allocation on a single object.

We then ran some simple analytics on the transaction data to see which users created those transactions, to verify whether someone had run a migration script that pulled a huge amount of data into the transaction tables, or whether some other human error was involved.
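A minimal sketch of the kind of grouping query such a check comes down to (the schema, table and column names are illustrative stand-ins for the customer's actual objects, and it assumes the cx_Oracle DB-API driver for connectivity):

    # Illustrative only: object names are stand-ins; assumes the cx_Oracle driver.
    import cx_Oracle

    SQL = """
        SELECT created_by,
               TRUNC(created_date, 'HH24') AS created_hour,
               COUNT(*)                    AS txn_count
          FROM app_owner.transactions
         GROUP BY created_by, TRUNC(created_date, 'HH24')
         ORDER BY txn_count DESC"""

    def transaction_profile(dsn="appuser/secret@dbhost/APPDB"):  # placeholder DSN
        conn = cx_Oracle.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(SQL)
            return cur.fetchall()  # (user, hour, count) rows feeding the drill-down
        finally:
            conn.close()

    if __name__ == "__main__":
        rows = transaction_profile()
        print(len({user for user, _, _ in rows}), "distinct creating users")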

To our surprise, we saw 1,100 active users who had created 600,000+ transactions in the database, spread across different times and mostly following a regular working-day, working-hour pattern. No nightly batch or migration user had created the data. We went ahead with more detailed analytics, which mapped the users across the geography of the country of operation.

We created a simple drill-down visualization of the data and submitted it to the business and IT groups at the customer, with the conclusion that the data was indeed valid, created by their users, and that there was no problem with the system.

So the data spoke for itself: the customer's business team told the IT team that they had rolled the system out across the country over the last month and all those users were updating transactions on it. The IT team was not aware of this fact and still thought the system was running in pilot mode with one location and 50 users.

Let the data speak. Let it show itself to those who need it for decision making. Democratize the data.

The second point, which came up clearly yesterday, was:

"If you torture your data long enough, it will confess to anything."
Do not try to prove a known hypothesis with the help of data; that is not the purpose of analytics. With data and statistics you can seemingly infer anything, and any bias towards a specific result will defeat the purpose of analytics.

So, in the process of decision making and strategising, let the data, with modern visualization capabilities, be an unbiased representative that shows the recorded history of the business with all its deficiencies, all its recording errors and all its possible quality problems.....

I hope I have made my two points on consulting in analytics clear....


F1 Database from Google: A scalable distributed SQL database

August 30, 2013

 

This world is a sphere; we keep going round and round. After the great hype around highly distributed NoSQL databases, Google has now presented a paper at the 39th VLDB conference on how they implemented a SQL-based, highly scalable database to support their "AdWords" business.


The news item: http://www.theregister.co.uk/2013/08/30/google_f1_deepdive/
and the paper: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf


The key changes I liked:
1. Hierarchically clustered physical schema model: I have always thought a hierarchical model is better suited to real-life applications than a pure relational model, and this implementation proves it.

2. Protocol Buffers: columns allowing structured types. This saves a lot of ORM-style conversions when moving data from storage to memory and vice versa.
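A rough Python illustration of the idea, using plain nested structures rather than F1's actual Protocol Buffer schemas (the CampaignRow/AdGroupInfo names and fields below are invented for the example):

    # Hypothetical sketch: one column value is itself a structured record, similar
    # in spirit to F1 storing Protocol Buffer messages inside table columns.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AdGroupInfo:        # a structured type living inside a single column
        name: str
        max_cpc_micros: int

    @dataclass
    class CampaignRow:        # one logical row of the parent table
        customer_id: int      # clustering key: children are co-located with the parent
        campaign_id: int
        ad_groups: List[AdGroupInfo] = field(default_factory=list)

    row = CampaignRow(
        customer_id=1001,
        campaign_id=7,
        ad_groups=[AdGroupInfo("brand-terms", 250_000),
                   AdGroupInfo("generic-terms", 120_000)],
    )

    # The application works with the nested structure directly; there is no
    # separate ORM step that joins a flattened AD_GROUP table back together.
    print([g.name for g in row.ad_groups])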


A quote from the paper’s conclusion:

In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the Ad-Words product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.

So, ACID is needed and SQL is essential for running businesses! Have a nice weekend!!

All (that I know) about Exadata

April 25, 2013

Having been associated with Exadata in one way or another since its release (see: http://technofunctionalconsulting.blogspot.in/2008/09/exadata-database-processing-moves-into.html), I have tried to consolidate the key points related to Exadata for a session with a technical audience.

Related Posts: 

http://technofunctionalconsulting.blogspot.in/2009/10/exadata-v2-worlds-first-oltp-database.html

http://technofunctionalconsulting.blogspot.in/2010/02/hybrid-columnar-compression-hcc.html

http://technofunctionalconsulting.blogspot.in/2012/06/exadata-performance-features.html

Accelerating Analytics using “Blink” aka “BLU acceleration”

April 5, 2013
This Friday marks the completion of two years of my second innings with TCS's Technology Excellence Group, and it is time for a technical blog post.
This week IBM announced the new "BLU acceleration" enabled DB2 10.5, which claims a 10 to 20 times performance improvement out of the box. (Ref: http://ibmdatamag.com/2013/04/super-analytics-super-easy/ )
This post aims to give a brief summary of the Blink project, which has brought this acceleration to analytic queries.
The Blink technology has primarily two components that achieve this acceleration of analytic processing:
1. Compression at load time
2. Query processing
Compression & Storage:
At load time each column is compressed using "frequency partitioning": an order-preserving, fixed-length dictionary encoding. Each partition of the column has a dictionary of its own, allowing it to use shorter column codes. Because the encoding preserves order, comparison operators and predicates can be applied directly to the encoded values without uncompressing them.
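A toy Python sketch of order-preserving dictionary encoding (not IBM's implementation, only the property that predicates can be evaluated directly on the codes):

    # Toy order-preserving dictionary encoding: codes are assigned in sorted order,
    # so comparison predicates can run directly on the integer codes.
    def build_dictionary(values):
        return {v: code for code, v in enumerate(sorted(set(values)))}

    def encode(values, dictionary):
        return [dictionary[v] for v in values]

    cities = ["Pune", "Bangalore", "Chennai", "Bangalore", "Delhi"]
    dictionary = build_dictionary(cities)   # e.g. Bangalore -> 0, Chennai -> 1, ...
    codes = encode(cities, dictionary)

    # Evaluate the predicate  city < 'Delhi'  on the encoded values, no decoding.
    threshold = dictionary["Delhi"]
    matches = [i for i, c in enumerate(codes) if c < threshold]
    print(matches)   # row positions whose city sorts before 'Delhi': [1, 2, 3]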
Rows are packed by grouping the bit-aligned column codes into byte-aligned banks of 8, 16, 32 or 64 bits for efficient ALU operations. These banks are combined to form blocks that are then loaded into memory (or storage). This bank-major storage exploits the SIMD (Single Instruction, Multiple Data) capability of IBM's modern POWER processor chips.
Query Processing:
In Blink there are no indexes, no materialized views and no run-time query optimizer, so it is simple. But each query must be compiled to handle the different encoded column lengths of each horizontal partition of the data.
Each SQL statement is split into a series of single-table queries (STQs), which perform scans with filtering. All joins are hash joins. These scans happen in an outside-in fashion on a typical snowflake schema, creating intermediate hybrid STQs.
Blink executes these STQs over multiple blocks, assigning them to threads that each run on a processor core. As most modern ALUs can operate on 128-bit registers, the operations are bit operations exploiting SIMD, which makes the processing fast.
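A very small sketch of that outside-in flow, assuming a toy star schema held in Python lists (nothing here reflects Blink's actual data structures):

    # Outside-in evaluation: filter the dimension first (one STQ), build a hash
    # table on its keys, then scan the fact table once, probing the hash table.
    dim_item = [          # (item_id, item_type)
        (1, "x"), (2, "y"), (3, "x"),
    ]
    fact_sales = [        # (item_id, amount)
        (1, 100.0), (2, 40.0), (3, 75.0), (1, 25.0),
    ]

    # STQ on the dimension: scan + filter, producing a hash table of qualifying keys.
    wanted = {item_id for item_id, item_type in dim_item if item_type == "x"}

    # STQ on the fact table: scan + hash-join probe + aggregation in a single pass.
    total = sum(amount for item_id, amount in fact_sales if item_id in wanted)
    print(total)   # 200.0 -> total sales amount of type 'x' items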
For more technical details of the Blink project, refer to http://sites.computer.org/debull/A12mar/blink.pdf
I hope this brings analytics a boost and some competition to Oracle's Exa- appliances. Views, comments?

Small & Big Data processing philosophies

January 3, 2013
In this first post of 2013, I would like to cover some fundamental philosophical aspects of “data” & “processing”.
With the buzz around "Big Data" running high, I classify the traditional structured, relational data as "small data", even though some very large databases I have seen hold 100+ terabytes of data with an I/O volume of 150+ terabytes per day.
Present-day data processing predominantly uses the von Neumann architecture of computing, in which "data" and its "processing" are distinct and separated into "memory" and "processor" connected by a "bus". Any data that needs to be processed is moved into the processor over the bus, the required arithmetic or logical operation is performed on it to produce the "result", and the result is moved back to memory/storage for further reference. The list of operations to be performed (the processing logic, or program) is stored in memory as well, and each next instruction must also be moved from memory into the processor over the bus.
So in essence, both the data and the operations to be performed on it live in memory, which cannot process data, while the facility that can process data always depends on memory; that is the von Neumann architecture.
Traditionally, the data has been moved to where the processing logic is deployed, because the amount of data is small while the processing needed is relatively complex. RDBMS engines like Oracle read blocks from storage into the SGA buffer cache of the running database instance for processing, and transactions modify small amounts of data at any given time.
Over time, "analytical processing" required bringing huge amounts of data from storage into the processing node, which created a bottleneck on the network pipe. Add to that the large growth in semi-structured and unstructured data that started flowing in, which needed a different philosophy of data processing.
Enter HDFS and the map-reduce framework of Hadoop, which took the processing to the data. Around the same time came Oracle Exadata, which took database processing to the storage layer with a feature called "query offloading".
In the new paradigm, snippets of processing logic are taken to a cluster of connected nodes where the data, mapped with a hashing algorithm, resides; the results of that processing are then reduced to produce the final result sets. It has become economical to take the processing to the data, because the amount of data is relatively large and the required processing consists of fairly simple tasks of matching, aggregating, indexing, etc.
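A minimal single-machine sketch of the map/shuffle/reduce idea, standing in for what a Hadoop cluster does across many nodes (the sales records are made up):

    # Toy map/shuffle/reduce: mappers emit (key, value) pairs, the pairs are
    # grouped by key, and a reducer aggregates each group.
    from collections import defaultdict

    records = [("store-1", 120.0), ("store-2", 80.0), ("store-1", 45.0)]

    def mapper(record):
        store, amount = record
        yield store, amount            # emit (key, value)

    def reducer(key, values):
        return key, sum(values)        # aggregate one key's group

    groups = defaultdict(list)         # "shuffle": done by the framework in Hadoop
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)

    print([reducer(k, vs) for k, vs in groups.items()])
    # [('store-1', 165.0), ('store-2', 80.0)]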
So we now have the choice of taking small amounts of data to complex processing in structured RDBMS engines with the traditional shared-everything architecture, as well as taking the processing to the data in shared-nothing big data architectures. The choice depends purely on the type of data processing problem at hand: traditional RDBMS technologies will not be replaced by the new big data architectures, nor can new big data problems be solved by traditional RDBMS technologies. They go hand in hand, complementing each other and adding value to the business when properly implemented.
Non-von Neumann architectures still need more attention from technologists; they probably hold the key to the way the human brain seamlessly processes information, whether structured or unstructured streams of voice, video, etc.
Any non-Von-Neumann architecture enthusiasts over here?

Oracle database 12c

October 3, 2012

I have seen the Oracle database go from 6 and 7 through 8i, 9i, 10g and 11g, and now it is going to be called 12c (c as in Cloud). OOW12 revealed the new architecture. (There was no suffix for 6 and 7; i stood for Internet, g for Grid.)

The new release comes with a fundamental architectural change to the data dictionary that introduces the concept of a "Pluggable Database" (PDB) running on top of the standard Oracle system "Container Database" (CDB).

In the current architecture, obj$ in the Oracle data dictionary contains the information for all objects. Going forward it is split between the CDB and the PDBs to ease "multi-tenant" private cloud databases.
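As a hedged sketch of what provisioning could look like in the multi-tenant model (the DSN, names, paths and password are placeholders, and the exact clauses should be verified against the 12c documentation once it is generally available), creating and opening a new PDB from the CDB root might look like this:

    # Hypothetical sketch: provisioning a PDB inside a CDB from Python via cx_Oracle.
    # Every identifier, path and credential here is a placeholder.
    import cx_Oracle

    DDL = [
        """CREATE PLUGGABLE DATABASE sales_pdb
             ADMIN USER pdb_admin IDENTIFIED BY change_me
             FILE_NAME_CONVERT = ('/u01/oradata/pdbseed/', '/u01/oradata/sales_pdb/')""",
        "ALTER PLUGGABLE DATABASE sales_pdb OPEN",
    ]

    def create_pdb(dsn="sys/secret@cdbhost/CDB1"):
        # PDB creation is done while connected to the CDB root with SYSDBA privileges.
        conn = cx_Oracle.connect(dsn, mode=cx_Oracle.SYSDBA)
        try:
            cur = conn.cursor()
            for statement in DDL:
                cur.execute(statement)   # each PDB gets its own slice of the dictionary
        finally:
            conn.close()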

This fundamental separation of the data dictionary provides:
a. Better security for multiple "databases" on the same instance.
b. Separation of the CDB from the PDBs, which allows easier upgrades.
c. All PDBs share the instance, so overall management of the consolidated database should be much simpler.

Excited about the cloud-enabled, multi-tenant, pluggable database from Oracle!

So, let us wait and see when the stable 12.2 comes out, to roll the new database into production....

Data Replication

April 12, 2012

The need for data replication is evergreen. The use case may vary from "high availability" to "operational reporting (BI)" to "real-time analytics"; whatever the case, the need to replicate data exists in information systems and solutions.

In this post I try to summarize the evolution of data replication technology from an Oracle Database standpoint.

  1. Looking back at Oracle-based data replication technologies, the first is the "database link". Starting around Oracle 5 or 6, one could create a database link and pull or push data directly into a remote database. This is the earliest method of replication, where the application has to push or pull the data from the remote database and apply its own logic to identify what has changed and what to do with those changes (a small sketch of this style follows the list)...
  2. The next improvement, around Oracle 8, was trigger-based replication: whenever a transaction changes the data in a table, a trigger fires a routine that handles the replication without changing the application. So the database started providing a way to replicate data using triggers...
  3. The next improvement came around 9.2 with Streams and log-based replication: the redo logs are mined and committed transactions are moved to the target systems (the DBMS_CDC_PUBLISH and DBMS_CDC_SUBSCRIBE packages were introduced).
  4. Oracle Advanced Queuing enhanced Streams with a robust publish-and-subscribe replication model based on enqueue/dequeue communication. I was involved in a very large project that set up custom change data capture to migrate data from a legacy system into SAP ERP, involving large tables of 100 million+ records...
  5. Later, the log-mining technology was used for physical and logical standby databases and evolved into the Active Data Guard technology...
  6. With GoldenGate, Oracle's heterogeneous log-based data replication offering became complete, with the capability to extract, replicate and synchronize data with bi-directional movement.
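A minimal sketch of the do-it-yourself style in option 1: the application keeps its own high-water mark, pulls the changed rows and upserts them into the target copy. The table and column names are invented, sqlite3 merely stands in for the two databases, and a real project would pick one of the later options instead.

    # Hand-rolled pull replication: the application tracks what changed itself.
    import sqlite3

    def replicate_orders(source, target, last_sync):
        """Pull rows changed since last_sync and upsert them into the target copy."""
        rows = source.execute(
            "SELECT order_id, status, updated_at FROM orders WHERE updated_at > ?",
            (last_sync,),
        ).fetchall()
        target.executemany(
            "INSERT INTO orders_copy (order_id, status, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status, "
            "updated_at = excluded.updated_at",
            rows,
        )
        target.commit()
        return max((r[2] for r in rows), default=last_sync)   # new high-water mark

    # Tiny in-memory demo standing in for two separate databases.
    src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
    tgt.execute("CREATE TABLE orders_copy (order_id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
    src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, "NEW", 100), (2, "SHIPPED", 150)])
    print(replicate_orders(src, tgt, last_sync=0))   # 150; both rows copied across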

Depending on the need, one should choose the right technology to achieve the required data replication....

in memory computing

February 14, 2012

Approximately two years back I made a post on Enterprise Data Fabric technology. The aim of a data grid, or "in-memory data store", is to remove the movement of data in and out of slower disk storage for processing; instead, the data is kept in "pooled main memory" during processing.

To get past the physical limit on the amount of main memory in a single machine, data grid technologies create a pooled memory cluster, with data distributed over multiple nodes connected by a high-bandwidth network.

With SAP bringing HANA, an in-memory database that can store data in both a traditional row store and a column store, and Oracle bringing the Exalytics appliance, in-memory computing is getting more attention.

So the claim is that in-memory technology will boost performance by orders of magnitude. The truth is that it can only remove the time taken to move data from disk into main memory. If a query processes the data using the wrong access method, then even with all the data in a memory store the processing will still take just as long to provide the answer!
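A small Python illustration of that point, with made-up data: both structures live entirely in memory, yet the access method still dominates the elapsed time.

    # Both data sets are fully in memory; only the access method differs.
    import random, time

    N = 500_000
    rows = [(i, f"customer-{i}") for i in range(N)]   # the "table" as a list of rows
    by_id = dict(rows)                                # the same data, hash-indexed

    lookups = [random.randrange(N) for _ in range(50)]

    start = time.perf_counter()
    _ = [name for key in lookups for cid, name in rows if cid == key]   # 50 full scans
    scan_time = time.perf_counter() - start

    start = time.perf_counter()
    _ = [by_id[key] for key in lookups]                                 # 50 hash lookups
    index_time = time.perf_counter() - start

    print(f"full scans: {scan_time:.2f}s, hash lookups: {index_time:.6f}s")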

In-memory computing needs applications to be redesigned to use the technology for better information processing. OLTP workloads will surely see improved performance due to memory caching, but data consistency then needs to be managed by the application, moving to an event-based architecture.

OLAP and analytical workloads will also see better performance from memory-based column stores, provided the underlying data structures are designed to suit the processing requirements.

Overall, in-memory computing is promising at the moment, but without the right design to exploit the new technology, old systems will not get a performance boost just by moving the data store into main memory.

Let us wait and see how the technology shapes up in the future.....

Storing Rows and Columns

December 4, 2011

A fundamental requirement of a database is to store and retrieve data. In a Relational Database Management System (RDBMS), the data is organized into tables containing rows and columns. Traditionally, data is stored in blocks of rows. For example, a "sales transaction row" may have 30 data items representing 30 columns. Assuming a record occupies 256 bytes, a block of 8KB can hold 32 such records; assuming a million such transactions a day, they need about 31,250 blocks per day. All this works well as long as we need the data as ROWS: if we want to access one row or a group of rows at a time to process the data, this organization poses no issues.

Now consider a query for the total value of type-x items sold in the past seven days. With row storage, it has to retrieve 7 million records of 30 columns each just to total the type-x items, when all it needs are two columns: item type and amount. This type of analytical requirement leads us to store the data by columns: we group each column's values together and store them in blocks, which makes retrieving just the needed columns from the overall table much faster for analysis.
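A toy Python sketch of the two layouts (the sales data is made up); the columnar aggregate touches only the two lists it needs:

    # The same three "sales" records in a row layout and in a column layout.
    row_store = [
        {"item_type": "x", "amount": 100.0, "store": "S1"},   # plus 27 more columns in reality
        {"item_type": "y", "amount": 40.0,  "store": "S2"},
        {"item_type": "x", "amount": 75.0,  "store": "S1"},
    ]

    column_store = {
        "item_type": ["x", "y", "x"],
        "amount":    [100.0, 40.0, 75.0],
        "store":     ["S1", "S2", "S1"],
    }

    # Row layout: every whole record is touched to answer the question.
    total_rows = sum(r["amount"] for r in row_store if r["item_type"] == "x")

    # Column layout: only the two relevant columns are read.
    types, amounts = column_store["item_type"], column_store["amount"]
    total_cols = sum(a for t, a in zip(types, amounts) if t == "x")

    print(total_rows, total_cols)   # 175.0 175.0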

But column storage has its limitations when it comes to writes and updates.

With social data, where a high volume of writes is needed (messages, status updates, likes, comments, etc.), highly distributed NoSQL-based column stores are emerging into the mainstream. Apache Cassandra, initially developed by Facebook, is this new breed of NoSQL column store.

So we now have a variety of databases / data stores available: standard RDBMS engines with SQL support for OLTP applications, column-based engines for OLAP processing, NoSQL key-value stores for in-memory processing, highly clustered Hadoop-style map/reduce frameworks for big data processing, and NoSQL column stores for high-volume social write and read efficiency.


Making the right choice of data store for the problem at hand is becoming tough with so many solution options. But that is the job of an architect; is it not?

ACID and BASE of data

October 9, 2011

I am completing 18 years of working in the field of Information Technology.

All these years, an enterprise data store has generally provided four qualities to transactions: Atomicity, Consistency, Isolation and Durability (ACID). Oracle has emerged as a leader in providing enterprise-class ACID transactional capabilities to applications.

Recently, at Open World 2011, Oracle announced a NoSQL database, which is typically characterized by the BASE acronym: Basically Available, Soft state, Eventually consistent.

Of late I see a lot of debate on SQL vs NoSQL, ACID vs BASE and shared-everything vs shared-nothing data store architectures; with Oracle getting onto the NoSQL bandwagon, this debate has just picked up additional momentum.

Oracle has posted this paper, which nicely explains their NoSQL database: http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf

In my opinion, the choice between SQL and NoSQL is straightforward to make:

The big question: are we storing data or BIG DATA? (Read my old post on transactional data vs machine-generated big data: http://technofunctionalconsulting.blogspot.com/2011/02/analytics.html)

With the new trends in 'BIG DATA', almost all the data becomes key-value pairs with read and insert-only operations and minimal or no updates to the records; NoSQL/BASE is best suited to handle this type of data. Traditional transactional databases of an OLTP nature still need ACID-compliant transactions.
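A minimal sketch of that access pattern (purely illustrative Python, not the API of Oracle's NoSQL database): a key-value store that only allows inserts and reads, never in-place updates.

    # Append-only key-value store: insert and read, no updates to existing records.
    class AppendOnlyKV:
        def __init__(self):
            self._data = {}

        def insert(self, key, value):
            if key in self._data:              # updates are not part of this model
                raise KeyError(f"{key!r} already written")
            self._data[key] = value

        def get(self, key, default=None):
            return self._data.get(key, default)

    events = AppendOnlyKV()
    events.insert("click:2011-10-09T10:15:01:u42", {"page": "/home"})
    events.insert("click:2011-10-09T10:15:03:u42", {"page": "/pricing"})
    print(events.get("click:2011-10-09T10:15:01:u42"))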

So, when designing big data solutions, an architect should surely look at the NoSQL dataBASE. Is it not?

Publishing this post on 09/10/11 (dd/mm/yy) and this is my 85th post to this blog.