Archive for the ‘Performance Tuning’ Category

Accelerating Analytics using “Blink” aka “BLU acceleration”

April 5, 2013
This Friday marks the completion of two years of my second innings with TCS's Technology Excellence Group, and it is time for a technical blog post.
This week, IBM announced DB2 10.5 with "BLU acceleration", which claims a 10- to 20-times performance improvement out of the box. (Ref: http://ibmdatamag.com/2013/04/super-analytics-super-easy/ )
This post gives a brief summary of the Blink project, the research work behind this acceleration of analytic queries.
Blink has two primary components that deliver this acceleration of analytic processing:
1. Compression at load time
2. Query processing
Compression & Storage:
At load time, each column is compressed using "frequency partitioning", an order-preserving, fixed-length dictionary encoding. Each partition of a column has its own dictionary, which allows shorter column codes. Because the encoding preserves order, comparison operators and predicates can be applied directly to the encoded values without decompressing them.
Rows are packed by grouping the bit-aligned column codes into byte-aligned banks of 8, 16, 32 or 64 bits for efficient ALU operations. These banks are combined into blocks that are then loaded into memory (or storage). The bank-major layout exploits the SIMD (Single Instruction, Multiple Data) capability of modern IBM POWER processors.
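In DB2 10.5, this technology surfaces through column-organized tables. A minimal sketch of what using it might look like (table, column and file names are purely illustrative):

-- Column-organized table: BLU builds the frequency-based column dictionaries
-- automatically while the data is loaded.
CREATE TABLE sales_fact (
  sale_date  DATE,
  store_id   INTEGER,
  product_id INTEGER,
  amount     DECIMAL(12,2)
) ORGANIZE BY COLUMN;

-- Bulk load; the ANALYZE phase of LOAD builds the compression dictionaries.
LOAD FROM sales.del OF DEL REPLACE INTO sales_fact;

Setting the registry variable DB2_WORKLOAD=ANALYTICS before creating the database is the documented way to make such analytics-friendly settings the default.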
Query Processing:
Blink has no indexes, no materialized views and no run-time query optimizer, which keeps it simple. However, each query must be compiled to handle the different encoded column lengths of each horizontal partition of the data.
Each SQL statement is split into a series of single-table queries (STQs), which perform scans with filtering. All joins are hash joins. On a typical snowflake schema, these scans proceed outside-in, creating intermediate hybrid STQs.
Blink executes these STQs block by block across threads, each running on a processor core. Since most modern ALUs operate on 128-bit registers, the operations are bit-level operations that exploit SIMD, which makes the processing fast.
For more technical details of the Blink project, refer to http://sites.computer.org/debull/A12mar/blink.pdf
I hope this gives analytics a boost and brings some competition to Oracle's Exa- appliances. Views, comments?

De-normalizing with join materialized views: fast refresh on commit

March 22, 2013

Background:
Two weeks back, I wrote a post on the result_cache feature of the Oracle 11g database, used to solve a specific performance scenario in an MDM implementation. Working on the same set of performance issues, we encountered another situation: a normalized structure that forces queries to use OUTER JOINs to achieve the required aggregation.

Problem:
The structure contains a set of tables for PERSON and another set of tables for ORGANIZATION, where a CUSTOMER can be either a PERSON or an ORGANIZATION.
The requirement is to get a consolidated view of all persons and organizations together with certain attributes. We need a UNION ALL query joining a total of 8 tables, which produces roughly 10 million records. A result set of that size cannot be held in the result cache in memory.

Solution:
Inevitably, we need to persist the result of the UNION ALL query in a materialized view. But the customer needs real-time data and cannot afford any latency, so we need a view that is updated whenever the underlying tables change. That is where REFRESH FAST ON COMMIT comes into the picture.
To enable fast refresh, a MATERIALIZED VIEW LOG must be created on every underlying table; we selected ROWID-based logs. All 8 underlying tables need their MV logs in place (a sketch of the log DDL follows the view definition below) before the MV can be created as follows:

CREATE MATERIALIZED VIEW MV_PN_ORG
  REFRESH FAST WITH ROWID ON COMMIT
AS
SELECT <REQUIRED COLUMNS>,
  p.rowid   AS prowid,
  xp.rowid  AS xprowid,
  xpn.rowid AS xpnrowid,
  pn.rowid  AS pnrowid
FROM person p,
  xperson xp,
  xpersonname xpn,
  personname pn
WHERE p.pid = xp.xpid
AND xpn.xpid = pn.pnid
AND pn.cid = p.cid
AND xpn.preferred_ind = 'Y'
UNION ALL
SELECT <REQUIRED COLUMNS>,
  o.rowid    AS orowid,
  xo.rowid   AS xorowid,
  xon.rowid  AS xonrowid,
  orgn.rowid AS orgnrowid
FROM org o,
  xorg xo,
  xorgname xon,
  orgname orgn
WHERE o.cid = xo.xoid
AND xon.xon_id = orgn.onid
AND orgn.cid = o.cid
AND xon.preferred_ind = 'Y';
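For completeness, a minimal sketch of the materialized view logs that must already exist on the base tables (shown for the four person-side tables; the four org-side tables follow the same pattern):

CREATE MATERIALIZED VIEW LOG ON person      WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON xperson     WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON xpersonname WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON personname  WITH ROWID;

Note that a fast-refreshable UNION ALL materialized view also requires a constant "UNION ALL marker" column in the SELECT list of each branch; that column is assumed to be part of <REQUIRED COLUMNS> above.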

This MV now holds de-normalized data that higher-level queries can use to look up the required data without costly joins. We can also create indexes on the MV to speed up lookups.

Any experiences? (both good and bad are welcome for discussion)

SQL result_cache in Oracle 11g

March 8, 2013

Background:

For the problem mentioned in my earlier blog post – http://technofunctionalconsulting.blogspot.in/2013/02/data-agility-in-master-data-management.html – there are times when the SQL queries are expensive and need a lot of processing to generate a result set. These queries are executed from multiple sessions, so it would help if the prepared result could be kept in memory.

SQL Result Cache:

This feature, available in Oracle Database 11g, is controlled by the initialization parameter result_cache_mode. The possible values are FORCE (caches all results; not recommended) and MANUAL. With MANUAL, results are cached selectively for those SQL statements where the /*+ RESULT_CACHE */ hint is added just after the SELECT keyword.

RESULT_CACHE_MAX_SIZE and RESULT_CACHE_MAX_RESULT are the other parameters that influence how the result cache behaves: they define, respectively, the total memory the cache may use and the percentage of that memory any single result set may occupy.
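A minimal sketch of how this might be used (the parameter values, table and column names are illustrative, not recommendations):

ALTER SYSTEM SET result_cache_mode = MANUAL;
ALTER SYSTEM SET result_cache_max_size = 256M;    -- illustrative total cache size
ALTER SYSTEM SET result_cache_max_result = 10;    -- % of the cache one result may use

SELECT /*+ RESULT_CACHE */                        -- cache this query's result across sessions
       agent_id, customer_name, total_premium     -- hypothetical columns
FROM   agent_customer_summary                     -- hypothetical consolidation view
WHERE  agent_id = :agent_id;

Subsequent executions of the same statement, from any session, are served from the cached result until a dependent table changes.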

More Information:

Please use the following links on oracle.com to get a better understanding of this feature.

http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html

http://www.oracle.com/technetwork/articles/datawarehouse/vallath-resultcache-rac-284280.html

http://www.oracle.com/webfolder/technetwork/tutorials/obe/db/11g/r1/prod/perform/rescache/res_cache.htm

Data agility in Master Data Management

February 22, 2013
Recently, I came across a few master data management implementations and the database performance problems related to them.
What is master data management (MDM)? It is a set of processes to centralize the management of "master data", i.e. the reference data for a given business; for an insurance company this typically consists of Customers, Agents and Products. An MDM instance usually focuses on one data domain: when the focus is the Customer domain, the implementation is called Customer Data Integration; when the domain is Products, it is called Product Information Management.
In the insurance industry, even though "customer centricity" has been the latest trend, it has traditionally been an agent-driven business. So one key subject area for MDM is "Agent Data": the agents who actually sell "Products" to "Customers". More generally we can call this "Channel Master Data". That gives us three subject areas to deal with in a multi-domain MDM implementation.
The processes include:
1. Data acquisition (Input)
2. Data standardization, duplicate removal, handling missing attributes (Quality)
3. Data mining, analytics and reporting (Output)
Industry tools ship standard data and process models for customers and products to achieve the required MDM functionality. We can adapt a customer model to model the agent domain, or keep both in a "party" domain.
The problem with multi-domain MDM:
Here comes the problem. In the agent domain, the customer requirement reads "get me the agent and his top n customers acquired" or "get me the agent and his top n products sold".
In a renowned MDM product, a SQL query needs to join 20+ tables (including some outer joins) to meet this requirement, and even with all the tables cached in memory the processing for each agent id takes 20+ seconds on an Oracle 11.2 database running on decent production-sized hardware.
a) Dynamically querying for such data and building it into the target structure adds load and latency to the screen; the SLA is not met.
b) Pre-populating the data in a materialized view with a periodic refresh does not keep the data current in real time.
Since a traditional design cannot satisfy both requirements (returning the data in 2 seconds and keeping it current), we need a hybrid approach.

Any ideas or experience in this area are welcome. Has anyone implemented an MDM solution on a NoSQL database yet?

Multi-Tenancy and Resource Management

August 8, 2012

Today my association with computer software turns 24 years; I have spent most of the past year on Oracle Exadata, an appliance (a hardware plus software bundle sold in fixed configurations: 1/4 rack, 1/2 rack or a full rack).

In the pre-appliance world, the underlying deployment architecture of servers, network and storage would be built to the application's requirements, with quality attributes such as "portability", "scalability" and so on.

An application would be sized for its required CPU, memory, I/O and storage capacity, along with its local fail-over and disaster recovery requirements, and the underlying infrastructure would be built from physical or virtual server and storage components. The number of nodes in the cluster and the size of each node would be carefully planned.

With Exadata, however, a full rack comes pre-configured by Oracle with 8 compute nodes and 14 storage cells; each compute node has 24 CPU cores and 96 GB of memory.

Now this Exadata appliance needs to be shared by multiple applications, and this is where the complexity of multi-tenancy starts. How do we ensure Quality of Service? The main levers are:

1. Server pools and Instance Caging
2. Service design
3. Database Resource Manager
4. I/O Resource Manager

I think it is always good to have a database per application. Hosting multiple applications on a single database instance can be tricky with respect to CPU allocation; instance caging limits each database instance to a share of the node's CPUs, as sketched below.
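A minimal sketch of instance caging for one such database (the values are illustrative; a resource manager plan must be active for the CPU cap to be enforced):

ALTER SYSTEM SET cpu_count = 8;                           -- cap this instance at 8 cores
ALTER SYSTEM SET resource_manager_plan = 'DEFAULT_PLAN';  -- any active plan enables the cap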

The next most challenging task is allocating memory across multiple applications; this is still done manually. Automatic SGA and PGA management and memory tuning within an instance keep improving, but a memory target for each database still has to be allocated by hand, as in the sketch below.
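For example, a manual split of the 96 GB on one compute node might look like the following (the sizes are purely illustrative and leave headroom for the OS and the other instances on the node):

ALTER SYSTEM SET sga_target           = 24G SCOPE = SPFILE;  -- written to the spfile, effective at next restart
ALTER SYSTEM SET pga_aggregate_target =  8G SCOPE = SPFILE;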

Within a database, DBRM classifies the workload into "resource consumer groups" using attributes such as user name, service, client program name, client user name, module and action, and assigns each consumer group to a resource manager plan. Each user session should be mapped to the right consumer group dynamically, based on several of these attributes, rather than always putting USER1 into medium_consumer_group; if USER1 is performing an important action, that session should be given a higher priority. A sketch of such a mapping follows.
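A minimal sketch of this kind of mapping with DBMS_RESOURCE_MANAGER (the consumer group and module names are hypothetical and assumed to exist already):

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
  -- Default mapping: sessions of USER1 go to the medium-priority group.
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
    attribute      => DBMS_RESOURCE_MANAGER.ORACLE_USER,
    value          => 'USER1',
    consumer_group => 'MEDIUM_CONSUMER_GROUP');
  -- Override: when the critical module is running, promote the session.
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
    attribute      => DBMS_RESOURCE_MANAGER.MODULE_NAME,
    value          => 'MONTH_END_CLOSE',
    consumer_group => 'HIGH_CONSUMER_GROUP');
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
END;
/

The relative precedence of these attributes (module over user name, and so on) can be tuned with DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING_PRI.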

Finally, controlling I/O operations on the cells, both across databases and across multiple workloads within a database, is also very important for maintaining the right Quality of Service (QoS). An I/O Resource Manager (IORM) database plan, and a category plan for prioritizing workloads within a database, should be configured on the cells.

In my opinion, performance management in the appliance world has become more complicated because of the complexity of QoS and resource management. We now have to develop applications that make the best use of the platform features of appliances like Exadata.

Questions to think about:
Does this run counter to "portability"? How easy is it to port applications from one appliance to another?

Exadata performance features

June 19, 2012

Recently I reviewed an Exadata implementation (an approximately 66 TB data warehouse with multiple marts running as different services on a single database on a full-rack Exadata V2) for performance improvements. This post summarizes the key points application developers, designers and DBAs should be aware of when deploying applications onto Oracle Exadata V2.

1. "Smart Scan" is a cell offloading feature: the selection and projection of a SQL statement are offloaded to the storage cells instead of being performed on the compute node after reading all the required blocks into the buffer cache. It works with full table scans (FTS) and full index scans when direct path reads are used. It can dramatically improve the performance of an FTS, but that does not mean all processing should happen via FTS and all indexes should be removed! For single-row lookups, or when a very small number of records is read from a large table, an index-based lookup is still much faster than an FTS, even with Smart Scan.

Smart Scan is a run-time decision rather than an optimizer-time decision. It depends on the number of sessions requesting the data, the number of dirty blocks and the size of the table (_small_table_threshold: by default Oracle treats about 2% of the buffer cache as the small-table threshold, which is not always appropriate; this parameter can be tweaked at the session level if needed).

In the explain plan one can see the STORAGE keyword in operations such as "TABLE ACCESS STORAGE FULL", and the relevant statistic in the V$ views is "cell physical IO bytes eligible for predicate offload".

To force direct path reads for serial operations at the session level, "_serial_direct_read" can be set to TRUE. A sketch of how offload can be checked is shown below.
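A minimal sketch of checking offload for a session (the table name is hypothetical, and the underscore parameter is undocumented, so it should be used for testing only):

ALTER SESSION SET "_serial_direct_read" = TRUE;   -- force direct path reads, as noted above

SELECT /*+ FULL(s) */ COUNT(*)
FROM   sales_history s                            -- hypothetical large table
WHERE  amount_sold > 1000;

-- How much I/O was eligible for offload, and how much came back via Smart Scan.
SELECT n.name, s.value
FROM   v$mystat s JOIN v$statname n ON n.statistic# = s.statistic#
WHERE  n.name IN ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan');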

2. "Storage Index" is a feature whereby each cell builds a dynamic negative index of the data that is definitely not present on the cell, by keeping min and max value ranges for each column in each storage region. This in-memory structure is built dynamically as queries are offloaded to the storage. It gives a performance improvement similar to "partition pruning" on partitioned tables. To take best advantage of it, load data into the table ordered by the columns most often used in WHERE-clause predicates. ETL processes should load the data serially using the APPEND hint so that SELECT statements get the best benefit from the storage indexes, as sketched below.
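A minimal sketch of such an ordered, direct-path load (table and column names are hypothetical):

-- Direct-path insert, ordered by the most common predicate column, so each
-- storage region holds a narrow min/max range and can be skipped by the storage index.
INSERT /*+ APPEND */ INTO sales_fact
SELECT *
FROM   stg_sales
ORDER BY sale_date;
COMMIT;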

3. In a data-warehouse-type environment, where most queries access all the rows but only a subset of the columns of a table, Hybrid Columnar Compression (HCC) improves performance. With the COMPRESS FOR QUERY HIGH mode of HCC, queries that use only a few columns of the table read only the required column data and perform better.
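A minimal sketch of applying this compression mode (the table names are hypothetical; existing rows are compressed only when reloaded or moved via a direct-path operation):

-- Future direct-path loads into this table are stored in query-high HCC format.
ALTER TABLE sales_fact COMPRESS FOR QUERY HIGH;

-- Or specify it at creation time; CTAS is a direct-path operation, so the data is compressed.
CREATE TABLE sales_fact_hcc COMPRESS FOR QUERY HIGH
AS SELECT * FROM sales_fact;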

It is important to consider these features while designing an application; building the application to take advantage of them greatly reduces resource consumption on the platform while delivering the best throughput.

But, 

It is still important to have the correct indexing and partitioning strategies in place, even with these features, to make sure performance stays predictable. Simply letting tables grow large with all their history, without the right partitioning strategy, will let performance degrade over time even with Smart Scan, storage indexes and HCC!
  

Identifying Bottlenecks

April 23, 2012

Recently, I was asked to review the performance of an Informatica ETL workflow provided by Oracle as part of an OBIEE upgrade: an OLAP schema upgrade with several transformations that takes Siebel Analytics 7.7 to OBIA 7.9.6.2.

The problem:
An innocent-looking "insert" load, which reads a source table, applies a set of lookup transformations, generates a sequence number as a unique key and inserts into a target table in bulk mode, achieves a throughput of only about 65 records per second on a 4-core, 8 GB RAM server. The source table has over 20 million records.

Other workflows run at a throughput of more than 2,000 records per second, so I started investigating this performance problem.

1. Looked at the AWR reports from the source and target databases as well as the Informatica repository database. There was no indication of any bottleneck on the database side.

2. Looked at the Informatica PowerCenter session log, which showed the writer thread 99% busy.

3. Took a subset of 100K records using the source qualifier and started running multiple scenarios to identify the bottleneck:

  • First, removed the target database by converting the target to a local flat file, so the workflow was just reading the source, applying all the transformations and lookups, and writing to the local flat file. The throughput did not improve.
  • Next, removed the large table lookups (one lookup was on a table of more than 20 million records), then removed all the lookups in the workflow mapping. The throughput was still only 65 records per second.
  • As the third step, removed the sequence number generation, so the workflow was just reading from the source table and writing to a flat file after a few compute transformations. The throughput reached 2,400 records per second.
  • Having identified the sequence generator as the probable bottleneck, re-ran the workflow with all the lookup transformations enabled and only the sequence generator disabled to confirm it. The throughput was 2,400 records per second.

4. On inspection, the sequence generator was reusable and set to cache only one value. This caused a round trip from the Informatica PowerCenter server to the Informatica repository for each sequence number generated, and at a maximum of about 65 round trips per second this was the bottleneck in the workflow.

After setting an appropriate cache size on the sequence generator, we finally achieved a throughput of 2,400 records per second and completed the load in less than 2.5 hours, an improvement of around 37 times!


Spending a few hours identifying the bottleneck and removing it was well worth the effort.