Archive for the ‘SQL’ Category

F1 Database from Google: A scalable distributed SQL database

August 30, 2013


This world is a sphere. We keep going round and round. After a great hype around the NoSQL highly distributed databases, now Google presented a paper on how they have implemented a SQL based highly scalable database for supporting their “AdWords” business in 39th VLDB conference.

The news item:
and the paper:

The key changes I liked:
1. Heirarchically clustered physical schema model: I always thought heighrarchical model is more suited for real life application than a pure relational model. This implementations is proving it.

2. Protocol Buffers: Columns allowing structured types. It saves a lot of ORM style conversions when moving data from a storage to in-memory and vice versa.

A quote from the paper’s conclusion:

In recent years, conventional wisdom in the engineering community has been that if you need a highly scalable, high-throughput data store, the only viable option is to use a NoSQL key/value store, and to work around the lack of ACID transactional guarantees and the lack of conveniences like secondary indexes, SQL, and so on. When we sought a replacement for Google’s MySQL data store for the Ad-Words product, that option was simply not feasible: the complexity of dealing with a non-ACID data store in every part of our business logic would be too great, and there was simply no way our business could function without SQL queries.

So, ACID is needed and SQL is essential for running the businesses! Have a nice weekend!!

Storing Rows and Columns

December 4, 2011

A fundamental requirement of a database is to store and retrieve the data. In Relational Database Management Systems (RDBMS) the data is organized into a table that contain the rows and columns. Traditionally the data is stored into blocks of rows. For example a “sales transaction row” may have 30 data items representing 30 columns. Assuming a record occupies 256 bytes, a block of 8KB can hold 32 such records. Again assuming a million such transactions that need to be stored in 32150 blocks per day. All this works well as long as we need the data as ROWS! We want to access one row or a group of rows at a time to process that data, this organization has no issues.

Let us consider if we want to get a summary of total value of type x items that are sold in past seven days. This query need to retrieve 7million records that contain 30 columns each to just process the count of items of types x. All that we need is two columns item type and amount to process this. This type of analytical requirement lead us to store the data in columns. We group the columns together and store them in blocks. It improves the speed of retrieving the columns from the overall table quickly for the purpose of analyzing the data.

But the column storage has its limitations when it comes to the write and update

With a high volume of social data, where there is high volume of write is needed (like messages and status updates, likes and comments etc.,) , highly distributed, NOSQL based column stores are emerging into mainstream. Apache Cassandra is the new breed of NOSQL column store that was initially developed by Facebook.

So, we have a variety of data base / data stores available now, a standard RDBMS engine with SQL support for OLTP applications, A column based engies for OLAP processing and noSQL based key value pair stores for in-memory processing, highly clustered Hadoop style big data with map/reduce framework for big data processing and noSQL based column stores for high volume social write and read efficiencies. 

Making right choice of data store for the problem in had is becoming tough with many solution options. But that is the job of an architect; Is it not?

>SQL performance tuning

April 30, 2011

>Having seen several performance problems within IT systems, I have a methodology for performance tuning. When it comes to a SQL query tuning, it should be slightly different.

9 out of 10 cases of performance problems on relational database systems relate to a bad SQL programming. Even with the latest “optimizers” within the commercial database management core execution engines, it is the skill of the developer to make use of the facilities effectively to get the best out of the RDBMS.

I have recently came across a typical problem with a “junction table” design.
A set of tables represent USER and all are connected by a USERID
Another set of tables represent ACCOUNT and all are connected by ACCOUNTID

The software product implements the one to many relationship using a junction table called USER_ACCOUNTS which contains (USERID, ACCOUNTID) with the composite primary key USERID, ACCOUNTID.

Now there are 30K users 120K accounts and 120K USER_ACCOUNTS and a query that need to get data from USER tables involving some outer joins on itself and ACCOUNT tables which joins multiple tables to get various attributes; all these tables linked in a join using the junction table. That query runs for 18 hours.

When the query is split into two inline views with all the required filtering in each side of data access on USER and ACCOUNT individually and then joined using the junction table it completes in 43 seconds.

So, FILTER and then JOIN is better than JOIN and then FILTER in terms of resource consumption. Hence the performance tuning is all about doing the sequence of actions in the right order to minimize the consumption of resources to perform the job!