
Yann Neuhaus


Backup an SQL Server database from On-Premise to Azure

Sun, 2014-07-20 23:46

SQL Server database backup & restore from On-Premise to Azure is a feature introduced with SQL Server 2012 SP1 CU2. In the past, it could be used with these three tools:

  • Transact-SQL (T-SQL)
  • PowerShell
  • SQL Server Management Objects (SMO)

With SQL Server 2014, backup & restore can also be enabled via SQL Server Management Studio (SSMS).
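The T-SQL path can be sketched in two statements: a credential holding the storage account name and access key, then a backup straight to a blob URL. The storage account, container, and credential names below are placeholders:

```sql
-- Hypothetical names: replace the storage account, container and key with yours.
-- The credential stores the Azure storage account name and its access key.
CREATE CREDENTIAL AzureBackupCredential
WITH IDENTITY = 'mystorageaccount',
     SECRET = '<storage-account-access-key>';

-- Back up the database directly to a blob in Azure storage.
BACKUP DATABASE AdventureWorks2012
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks2012.bak'
WITH CREDENTIAL = 'AzureBackupCredential', COMPRESSION, STATS = 10;
```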

Oracle EM agent 12c thread leak on RAC

Thu, 2014-07-17 22:24

In a previous post about the nproc limit, I wrote that I had to investigate the nproc limit together with the number of threads, because my Oracle 12c EM agent had thousands of threads. This post is a short feedback about this issue and the way I found the root cause. It concerns the Enterprise Manager agent 12c on Grid Infrastructure >=



The issue was:


ps -o nlwp,pid,lwp,args -u oracle | sort -n
   1  8444  8444 oracleOPRODP3 (LOCAL=NO)
   1  9397  9397 oracleOPRODP3 (LOCAL=NO)
   1  9542  9542 oracleOPRODP3 (LOCAL=NO)
   1  9803  9803 /u00/app/oracle/product/agent12c/core/ /u00/app/oracle/product/agent12c/core/ agent /u00/app/oracle/product/agent12c/agent_inst/sysman/log/emagent.nohup
  19 11966 11966 /u00/app/11.2.0/grid/bin/oraagent.bin
1114  9963  9963 /u00/app/oracle/product/agent12c/core/ ... emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain


By default, ps shows only one entry per process, but each process can have several threads - implemented on Linux as light-weight processes (LWP). Here, the NLWP column shows that I have 1114 threads for my EM 12c agent - and it was increasing every day, until it reached the limit and the node failed ('Resource temporarily unavailable').
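The same per-process count can be read directly from /proc on Linux. A quick sketch, using the current shell's own PID as an example:

```shell
# Each thread of a process appears as a directory under /proc/<pid>/task,
# so counting those entries gives the NLWP value that ps reports.
pid=$$
nlwp=$(ls /proc/$pid/task | wc -l)
echo "process $pid has $nlwp thread(s)"
```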

The first thing to do was to find out what those threads are. The ps entries do not give much information, but I discovered jstack, which every Java developer probably knows. You know how verbose (lengthy) Java stack traces are: jstack was able to show me thousands of them with a single command:


$ jstack 9963
2014-06-03 13:29:04
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.14-b01 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f3368002000 nid=0x4c9b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"CRSeOns" prio=10 tid=0x00007f32c80b6800 nid=0x3863 in Object.wait() [0x00007f31fe11f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at oracle.eons.impl.NotificationQueue.internalDequeue(
	- locked  (a java.lang.Object)
	at oracle.eons.impl.NotificationQueue.dequeue(
	at oracle.eons.proxy.impl.client.base.SubscriberImpl.receive(
	at oracle.eons.proxy.impl.client.base.SubscriberImpl.receive(
	at oracle.eons.proxy.impl.client.ProxySubscriber.receive(
	at oracle.sysman.db.receivelet.eons.EonsMetric.beginSubscription(
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
	at java.util.concurrent.ThreadPoolExecutor$
	at oracle.sysman.gcagent.util.system.GCAThread$

I don't paste all of them here. We have the 'main' thread, a few GC and 'Gang worker' threads which are present in all JVMs, and a few Enterprise Manager threads. What was interesting is that I had thousands of "CRSeOns" threads, and their number seemed to be increasing.
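To see which thread names dominate in such a dump, you can aggregate it by thread name. A sketch, with a few stand-in lines replacing the real jstack output:

```shell
# Stand-in for a real dump: each thread header line starts with the quoted name.
cat > jstack.out <<'EOF'
"Attach Listener" daemon prio=10 tid=0x00007f3368002000
"CRSeOns" prio=10 tid=0x00007f32c80b6800
"CRSeOns" prio=10 tid=0x00007f32c80b7000
EOF

# Extract the quoted thread name from each header line and count per name.
awk -F'"' '/^"/{print $2}' jstack.out | sort | uniq -c | sort -rn
```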

Some guesses: I'm on RAC, I have an 'ons' resource, and the EM agent tries to subscribe to it. A Google search returned nothing, and that's the reason why I put this in a blog post now. Then I searched MOS, and bingo, there is a note: Doc ID 1486626.1. It has nothing to do with my issue, but it contains an interesting comment:

In cluster version and higher, the ora.eons resource functionality has been moved to EVM. Because of this the ora.eons resource no longer exists or is controlled by crsctl.

It also explains how to disable EM agent subscription:

emctl setproperty agent -name disableEonsRcvlet -value true

I'm in and I have thousands of threads related to a functionality that doesn't exist anymore. And that led to failures in my 4-node cluster.

The solution was simple: disable it.

For a long time I have seen a lot of memory leaks or CPU usage leaks related to the enterprise manager agent. With this new issue, I discovered a thread leak and I also faced a SR leak when trying to get support for the 'Resource temporarily unavailable' error, going back and forth between OS, Database, Cluster and EM support teams...

Perfmon does not start automatically

Mon, 2014-07-14 19:28

I have recently used perfmon (performance monitor) at a customer site. I created a Data Collector Set to monitor CPU, Memory, Disk, and Network during one day. Then, I ran the monitor and I received a "beautiful" error message…

Partial Join Evaluation in Oracle 12c

Sun, 2014-07-13 21:21

Do you think that it's better to write semi-join SQL statements with IN(), EXISTS(), or to do a JOIN? Usually, the optimizer will evaluate the cost and do the transformation for you. And in this area, one more transformation has been introduced in 12c which is the Partial Join Evaluation (PJE).

First, let's have a look at the 11g behaviour. For that example, I use the SCOTT schema, but I hire a lot more employees in department 40:


SQL> alter table EMP modify empno number(10);
Table altered.
SQL> insert into EMP(empno,deptno) select rownum+10000,40 from EMP,(select * from dual connect by level


Why department 40? I'll explain it below, but I let you think about it before. In the default SCOTT schema, there is a department 40 in the DEPT table, but it has no employees in EMP. And the new transformation is not useful in that case.


11g behaviour

Now, I'm running the following query to check all the departments that have at least one employee:

I can write it with IN:


SQL_ID  6y71msam9w32r, child number 0
select distinct deptno,dname from dept 
 where deptno in ( select deptno from emp)

Plan hash value: 1754319153

| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |

Predicate Information (identified by operation id):

   1 - access("DEPTNO"="DEPTNO")


or with EXISTS:


SQL_ID  cbpa3zjtzfzrn, child number 0
select distinct deptno,dname from dept 
 where exists ( select 1 from emp where emp.deptno=dept.deptno)

Plan hash value: 1754319153

| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |
Predicate Information (identified by operation id):

   1 - access("DEPTNO"="DEPTNO")


Both are good. We didn't have to read the whole EMP table. I have 15000 rows in my table, I do a full scan on it, but look at the A-Rows: only 388 rows were actually read.

The HASH JOIN first reads the DEPT table in order to build the hash table. So it already knows that we cannot have more than 4 distinct departments.

Then we do the join to EMP just to check which of those departments have an employee. But we can stop as soon as we find the 4 departments. This is the reason why we have read only 388 rows here. And this is exactly what a Semi Join is: we don't need all the matching rows, we return at most one row per matching pair.

Ok. What if we write the join ourselves?


SQL_ID  2xjj9jybqja87, child number 1
select distinct deptno,dname from dept join emp using(deptno)

Plan hash value: 2962452962

| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |     129 |
|   1 |  HASH UNIQUE        |      |      1 |  15068 |      4 |     129 |
|*  2 |   HASH JOIN         |      |      1 |  15068 |  14014 |     129 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |  14014 |     122 |

Predicate Information (identified by operation id):

   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")


Bad luck. We have to read all the rows. More rows and more buffers.


12c behaviour

Let's do the same in 12c:


SQL_ID  2xjj9jybqja87, child number 0
select distinct deptno,dname from dept join emp using(deptno)

Plan hash value: 1629510749

| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |      14 |
|   1 |  HASH UNIQUE        |      |      1 |      4 |      4 |      14 |
|*  2 |   HASH JOIN SEMI    |      |      1 |      4 |      4 |      14 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       7 |

Predicate Information (identified by operation id):

   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")


Same plan shape, but fewer rows have been read. If we check the outline hints, we see the new feature:


      PARTIAL_JOIN(@"SEL$58A6D7F6" "EMP"@"SEL$1")


And here is what we see in the optimizer trace:


PJE: Checking validity of partial join eval on query block SEL$58A6D7F6 (#1)
PJE: Passed validity of partial join eval by query block SEL$58A6D7F6 (#1)
PJE: Partial join eval conversion for query block SEL$58A6D7F6 (#1).
PJE: Table marked for partial join eval: EMP[EMP]#1


The hints that control the feature are PARTIAL_JOIN and NO_PARTIAL_JOIN, and it is enabled by the _optimizer_partial_join_eval parameter, which appeared in 12c.
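For comparison, the transformation can be switched off for a single statement with the hint. A sketch:

```sql
-- Disable Partial Join Evaluation for the EMP row source only,
-- falling back to the 11g behaviour for this query:
select /*+ NO_PARTIAL_JOIN(emp) */ distinct deptno, dname
from dept join emp using(deptno);
```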

But of course, the optimization is useful only when all the values are found near the beginning of the table. This is why I added at least one employee to department 40. If some rows in DEPT have no matching row in EMP, then Oracle cannot know the result before reaching the end of the table.

Master Data Services installation for SQL Server 2012

Sun, 2014-07-13 20:32

This posting is a tutorial for installing Master Data Services on Windows Server 2012. Microsoft SQL Server Master Data Services (MDS) is a Master Data Management product from Microsoft, code-named Bulldog. It is the rebranding of the Stratature MDM product, titled +EDM and acquired in June 2007 by Microsoft. Initially, it was integrated in Microsoft SQL Server 2008 R2 as an additional installer. But since SQL Server 2012, Master Data Services is integrated as a feature within the SQL Server installer.



Master Data Services is part of the Enterprise Information Management (EIM) technologies provided by Microsoft for managing information in an enterprise.

EIM technologies include:

  • Integration Services
  • Master Data Services
  • Data Quality Services



Master Data Services covers these main components:

  • MDS Configuration Manager tool: used to configure Master Data Services
  • MDS Data Manager Web Application: used essentially to perform administrative tasks
  • MDS Web Service: used to extend or develop custom solutions
  • MDS Add-in for Excel: used to manage data, create new entities or attributes …


SQL Server Editions & Versions

Master Data Services can only be installed with the following SQL Server versions & editions:

  • SQL Server 2008 R2: Datacenter or Enterprise editions
  • SQL Server 2012 or SQL Server 2014: Enterprise or Business Intelligence editions


Master Data Services prerequisites in SQL Server 2012

First, Master Data Services relies on a web application named Master Data Manager Web Application, used for example to perform administrative tasks. This web application is hosted by Internet Information Services (IIS), which is therefore a necessary prerequisite.

Furthermore, to be able to display the content from the web application, you need Internet Explorer 7 or later (Internet Explorer 6 is not supported) with Silverlight 5.

Moreover, if you plan to use Excel with Master Data Services, you also need to install the Visual Studio 2010 Tools for Office Runtime, plus the Master Data Services Add-in for Microsoft Excel.

Finally, often forgotten, but PowerShell 2.0 is required for Master Data Services.

Let’s summarize the requirements for Master Data Services:

  • Internet Information Services (IIS)
  • Internet Explorer 7 or later
  • Silverlight 5
  • PowerShell 2.0
  • Visual Studio 2010 Tools for Office Runtime and the Master Data Services Add-in for Microsoft Excel (only if you plan to use Excel with Master Data Services).


Configuration at the Windows Server level

In the Server Manager, you have to activate the Web Server (IIS) Server Roles to be able to host the Master Data Web Application, as well as the .Net 3.5 feature.

For the Server Roles, you have to select:

  • Web Server (IIS)

For the Server Features, you have to select:

- .NET Framework 3.5 Features
  - .NET Framework 3.5
  - HTTP Activation
- .NET Framework 4.5 features
  - .NET Framework 4.5
  - ASP.NET 4.5
  - WCF Services
    - HTTP Activation
    - TCP Port Sharing




For the IIS features selection, you have to select:

- Web Server
  - Common HTTP Features
    - Default Document
    - Directory Browsing
    - HTTP Errors
    - Static Content
  - Health and Diagnostics
    - HTTP Logging
    - Request Monitor
  - Performance
    - Static Content Compression
  - Security
    - Request Filtering
    - Windows Authentication
  - Application Development
    - .NET Extensibility
    - .NET Extensibility 4.5
    - ASP.NET 3.5
    - ASP.NET 4.5
    - ISAPI Extensions
    - ISAPI Filters
  - Management Tools
    - IIS Management Console





Installation of SQL Server 2012

Master Data Services stores its data in a SQL Server database, so you need a SQL Server Engine installed.

Of course, the SQL Server Engine can be installed on a different Windows Server. In that case, the Windows Server with Master Data Services installed is used as a front-end server.

Then, in order to personalize the roles of your Master Data Services, you also need to install the Management Tools.

At the features installation step, you have to select:

  • Database Engine Services
  • Management Tools
  • Master Data Services



At this point, Master Data Services should be installed with all the needed prerequisites.



However, Master Data Services cannot be used without configuring it. Three main steps need to be performed through the MDS Configuration Manager:

  • First, you have to create a MDS database
  • Then, you have to create a MDS web application hosted in IIS
  • Finally, you have to link the MDS database with the MDS web application

SQL Server 2014: Are DENY 'SELECT ALL USER SECURABLES' permissions sufficient for DBAs?

Wed, 2014-07-02 20:09

SQL Server 2014 improves the segregation of duties by implementing new server permissions. The most important is the SELECT ALL USER SECURABLES permission, which helps to restrict database administrators from viewing data in all databases.

My article is a complement to David Barbarin's article 'SQL Server 2014: SELECT ALL USERS SECURABLES & DB admins'.
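A sketch of the syntax, with a hypothetical DBA login:

```sql
USE master;
-- JuniorDBA is a hypothetical login: it keeps its server administration
-- rights, but is denied reading data through any user securable.
DENY SELECT ALL USER SECURABLES TO JuniorDBA;
```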

Java Mission Control 5.2 (7u40) deserves your attention

Mon, 2014-06-30 20:38

Recently, some new versions of Java were made available. Most people think Java updates are boring and only security-oriented. But one of the latest updates (7u40) includes a feature which deserves attention: Java Mission Control 5.2.


Hotspot incoming

If you know the Oracle JRockit JVM a little bit, you might have heard about JMC, which was called JRockit Mission Control in the past. It is a tool suite embedded with the HotSpot JDK since the 7u40 release, and it allows you to monitor and profile your JVM.

Previously, it was only available for JRockit JVMs, and it has now been ported to HotSpot. JMC is way more accurate and complete than JConsole or other embedded tools, and it affects JVM performance by no more than 1%.




Mission Completed

JMC gathers low level information thanks to its Flight Recorder tool which listens and waits for internal events. It can then monitor, manage, profile, and eliminate memory leaks from the JVM.

The new version of JMC has a new browser with subnodes for the available server-side services and their states. Supported by Eclipse 3.8.2/4.2.2 and later, it allows deeper management of MBeans, especially setting values directly in the attribute tree. It also converges with JRockit event management: all information provided by JRockit is now available in HotSpot as well.




To enable JMC, you will have to add the following arguments to the JVM:
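The arguments were shown in a screenshot that is missing here. For reference, on the Oracle HotSpot JDK (7u40 and later), the recorder behind JMC is unlocked with the commercial-features flags (YourApp.jar is a placeholder):

```
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -jar YourApp.jar
```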





Understanding JVM Java memory leaks

Thu, 2014-06-26 21:38

Lots of people think that Java is free of memory management as the JVM uses a Garbage Collector. The goal of this collector is to free objects that are no longer used in the program without the developer being forced to declare that the object can be collected. Everything is automatic.

It’s really helpful and it avoids wasting time in managing memory. But, as it is an automatic process, it can produce some issues. They are better known as memory leaks.


What is a memory leak?

A leak appears when an object is no longer used by the program but is still referenced somewhere, so it remains reachable and the garbage collector cannot delete it. The memory space used by this object is never released, and the total memory used by the program grows. This degrades performance over time, and the JVM may run out of memory.
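A minimal sketch of such a leak (a hypothetical class, not from any real application): each "request" registers an object in a static collection, but nothing ever removes the entry, so the garbage collector can never reclaim it.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // Objects stay reachable from this static list forever,
    // so the garbage collector can never free them.
    private static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        // Each "request" allocates 1 MB and caches it,
        // but no code path ever removes the entry: memory only grows.
        CACHE.add(new byte[1024 * 1024]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            handleRequest();
        }
        System.out.println("cached buffers: " + CACHE.size());
    }
}
```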


Identifying memory leaks

Memory leaks are difficult to identify; they require a good knowledge of the application. Usually, they show up as an Out Of Memory Error exception (also called OOM). But note that not all Out Of Memory Errors imply memory leaks, and not all memory leaks end in an Out Of Memory Error.

An Out Of Memory Error is a first sign, but you must distinguish between a "normal" OOM and a memory leak.

For example, if the program loads data from external files and one day a file is bigger than expected, it could result in an OOM. But this one is "normal": the application was simply not designed to handle such a big file.

On the other hand, if the program processes data of the same size, or data sets that are similar to each other, and you get an OOM, it may be a memory leak. In addition, a memory leak generally eats free space gradually; if your memory filled up suddenly, it might be a "normal" OOM, and you should look at the stack trace to find out the origin.

Another symptom of a memory leak is an allocation issue. When too much space in the heap is taken and never freed, allocating new objects may become complicated, especially if the objects are huge. When an object is first created, it goes to a part of the heap called Eden; if the object survives several garbage collections, it is promoted to the tenured space. Usually, there are only old, long-lived objects in the tenured space. But when Eden is filled up by leaks, new objects go directly to the tenured space, and you can see an abnormal accumulation of objects there. So if you have an OOM in the tenured space, it might be an allocation issue (or simply a heap configuration that is too small).

The heap usage can be checked with tools such as VisualVM, JConsole, Java Mission Control (only for a 7u40+ JRE), and so on. I will describe the use of these tools in future blog posts.


How to prevent memory leaks

As said before, you cannot really prevent memory leaks, as this is related to the design of the application. If you are not aware of the type of application you are running, or of the internal design it uses, you cannot prevent leaks.

Many applications, either desktop or web-based, use many threads to run. Threads can store references to objects in ThreadLocal variables, so that a method can use those objects later in the thread without receiving them as arguments. But such an object references its class, which references its class loader. This way, when a web app is redeployed (e.g. in Tomcat), a new class loader is created to load the new application, but the ThreadLocal references may not be deallocated, because they still hold on to things in the permgen. As you may know, the permgen is a part of the JVM memory which is usually not collected by the GC. And this produces memory leaks.

What you have to remember is to reduce the use of such constructs and to be careful about the implementation and design of the application you are using. Of course, memory leaks can have other causes, but the principle is the same: allocated objects are still reachable although no longer used, so they are not released by the garbage collector.

Oracle Parallel Query: Did you use MapReduce for years without knowing it?

Thu, 2014-06-26 06:42

I read this morning that MapReduce is dead. The first time I heard about MapReduce was when a software architect proposed to stop writing SQL on Oracle Database and to replace it with MapReduce processing. Because the project had to deal with a huge amount of data in a short time, and they had enough budget to buy as many cores as they needed, they wanted the scalability of parallel distributed processing.

The architect explained how you can code filters and aggregations in Map & Reduce functions and then distribute the work over hundreds of CPU cores. Of course, it's very interesting, but it was not actually new. I was doing this for years on Oracle with Parallel Query. And not only filters and aggregations, but joins as well - and without having to rewrite the SQL statements.

I don't know if MapReduce is dead, but for 20 years we have been able to just flip a switch (ALTER TABLE ... PARALLEL ...) and get scalability with parallel processing - provided that we understand how it works.
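The switch itself is one statement; a sketch with the DEPT/EMP example used in this post (degree 4 is an arbitrary choice):

```sql
-- Allow parallel full scans on EMP with degree 4:
alter table EMP parallel 4;

-- The optimizer can now distribute the scan, the join and the group by
-- over parallel processes, without any change to the statement itself:
select dname, job, count(*)
from dept join emp using(deptno)
group by dname, job;
```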

Reading a parallel query execution plan is not easy. In this post, I'll just show the basics. If you need to go further, you should have a look at some Randolf Geist presentations and read his Understanding Parallel Execution article. My goal is not to go very deep, but only to show that it is not that complex.

I'll explain how Parallel query works by showing an execution plan for a simple join between DEPT and EMP tables where I want to read EMP in parallel - and distribute the join operation as well.

For the fun of it, and maybe because it's easier to read the first time, I've done the execution plan on an Oracle 7.3.3 database (1997):




Let's start from the end. I want to read the EMP table with several processes (4 processes, because I've set the parallel degree to 4 on table EMP). The table is not partitioned. It is a heap table where rows are scattered in the segment without any specific clustering. So each process will work on an arbitrary range of blocks, and this is why you see an internal query filtering on ROWID between :1 and :2. My session process, known as the 'coordinator' and represented in green below, has divided the range of rowids (it's a full table scan, which reads all blocks from the start to the high water mark) and has mandated 4 'producer' processes to do the full scan on their part. Those producers are represented in dark blue below.

But then there is a join to do. The coordinator could collect all the rows from the 'producer' processes and do the join itself, but that is expensive and not scalable. We want the join to be distributed as well. Each producer process could read the DEPT table and do the join, which is fine only if it is a small table. In any case, we don't want the DEPT table to be read in parallel, because we have not set a parallel degree on it. So the DEPT table will be read by only one process: my session process, which does all the non-parallel (aka serial) work in addition to its 'coordinator' role.

Then we have a new set of 4 processes that will do the Hash Join. They need some rows from DEPT and some rows from EMP. They are the 'consumer' processes that consume rows from the 'producers', and they are represented in pink below. And they don't receive those rows randomly. Because it is a join, each 'consumer' process must get the pairs of rows that match on the join column. In the plan above, you see an internal query on internal 'table queue' names. The parallel full scan on EMP distributes its rows: it's a PARALLEL_TO_PARALLEL distribution, the parallel producers sending their rows to the parallel consumers. The serial full scan on DEPT distributes its rows as well: it's a PARALLEL_FROM_SERIAL distribution, the parallel consumers receiving their rows from the serial coordinator process. The key for both distributions is given by a hash function on the join column DEPTNO, so that rows are distributed over the 4 consumer processes while the same DEPTNO always goes to the same process.

We also have a group by operation that will be done in parallel. But the processes that do the join on DEPTNO cannot do the group by, which is on other columns (DNAME, JOB). So we have to distribute the rows again, this time with a distribution key on the DNAME and JOB columns. The join consumer processes are thus also producers for the group by operation. And we have a new set of consumer processes that will do the group by, in light blue below. That distribution is a PARALLEL_TO_PARALLEL, as it redistributes rows from 4 producers arranged by (DEPTNO) to 4 consumers arranged by (DNAME, JOB).

At the end only one process receives the result and sends it to the client. It's the coordinator which is 'serial'. So it's a PARALLEL_TO_SERIAL distribution.

Now let's leave my Oracle 7.3.3 PLAN_TABLE and upgrade to 12c, which can show more detailed and more colorful execution plans. See here how to get it.

I've added some color boxes to show the four parallel distributions that I've detailed above:

  • :TQ10001 Parallel full scan of EMP distributing its rows to the consumer processes doing the join.
  • :TQ10000 Serial full scan of DEPT distributing its rows to the same processes, with the same hash function on the join column.
  • :TQ10002 The join consumer receiving both, and then becoming the producer to send rows to the consumer processes doing the group by
  • :TQ10003 Those consumer processes doing the group by and sending the rows to the coordinator for the final result.



So what is different here?

First, we are in 12c, and the optimizer may choose to broadcast all the rows from DEPT instead of using the hash distribution: it's the new HYBRID HASH distribution. That decision is made at runtime when there are very few rows, and this is why they are counted by the STATISTICS COLLECTOR.

We don't see the predicate on rowid ranges, but the BLOCK ITERATOR is there to show that each process reads its range of blocks.

And an important point is illustrated here.

Intra-operation parallelism can have a high degree (here I've set it to 4, meaning that each parallel operation can be distributed over 4 processes). But inter-operation parallelism is limited to one set of producers sending rows to one set of consumers. We cannot have two consumer operations active at the same time. This is why :TQ10001 and :TQ10003 have the same color: it's the same set of processes that acts as the EMP producer and, when finished, is reused as the group by consumer.

And there are additional limitations when the coordinator is also involved in a serial operation. For those reasons, in a parallel query plan, some non-blocking operations (those that can send rows upward on the fly as they receive rows from below) have to buffer the rows before continuing. Here you see the BUFFER SORT (which buffers but doesn't sort - the name is misleading), which keeps all the rows from DEPT in memory (or in tempfiles when the table is big).

Besides the plan, SQL Monitoring shows the activity from ASH and the time spent in each parallel process:




My parallel degree was 4, so I had 9 processes working on my query: 1 coordinator and two sets of 4 processes. The coordinator started by distributing the work plan to the other processes, then had to read DEPT and distribute its rows, and once that was completed, it started to receive the result and send it to the client. The blue set of processes started to read EMP and distribute its rows and, when completed, was able to process the group by. The red set of processes has done the join. The goal is to spread the DB time over all the processes running in parallel, so that the response time is close to the longest one instead of the total. Here, it's the coordinator which has taken 18 milliseconds. The query duration was 15 milliseconds:




This is the point of parallel processing: we can do a 32 ms workload in only 15 ms, because several CPUs were running at the same time. Of course, we need enough resources (CPU, I/O and temp space). It's not new. We don't have to define complex MapReduce functions. Just use plain old SQL and set a parallel degree. You can use all the cores in your server. You can use all the servers in your cluster. If you're I/O bound on the parallel full scans, you can even use your Exadata storage cells to offload some work. And in the near future, CPU processing will be even more efficient, thanks to in-memory columnar storage.

Linux: how to monitor the nofile limit

Wed, 2014-06-18 01:47

In a previous post, I explained how to measure the number of processes that are counted when a fork() or clone() call checks the nproc limit. There is another limit in /etc/security/limits.conf - or in /etc/security/limits.d - that is displayed by 'ulimit -n'. It's the number of open files - 'nofile' - and here again, we need to know what kind of files are counted.



'nofile' is another limit that may not be easy to monitor, because if you just count the 'lsof' output, you will include a lot of lines which are not file descriptors. So how can we count the number of file descriptors in a process?



'lsof' is a utility that shows all the open files. Let's take an example:

I get the pid of my pmon process:

[oracle@VM211 ulimit]$ ps -edf | grep pmon
oracle   10586     1  0 19:21 ?        00:00:02 ora_pmon_DEMO
oracle   15494 15290  0 22:12 pts/1    00:00:00 grep pmon


And I list the open files for that process

[oracle@VM211 ulimit]$ lsof -p 10586
ora_pmon_ 10586 oracle  cwd  DIR  252,0      4096 /app/oracle/product/12.1/dbs
ora_pmon_ 10586 oracle  rtd  DIR  252,0      4096 /
ora_pmon_ 10586 oracle  txt  REG  252,0 322308753 /app/oracle/product/12.1/bin/oracle
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150175744_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_0
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_1
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_2
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_3
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_4
ora_pmon_ 10586 oracle  mem  REG   0,17   4194304 /dev/shm/ora_DEMO_150208513_5
ora_pmon_ 10586 oracle  mem  REG  252,0   1135194 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0   6776936 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle  mem  REG  252,0     14597 /app/oracle/product/12.1/lib/
ora_pmon_ 10586 oracle    0r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    1w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    2w CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    3r CHR    1,3       0t0 /dev/null
ora_pmon_ 10586 oracle    4r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb
ora_pmon_ 10586 oracle    5r DIR    0,3         0 /proc/10586/fd
ora_pmon_ 10586 oracle    6u REG  252,0      1544 /app/oracle/product/12.1/dbs/hc_DEMO.dat
ora_pmon_ 10586 oracle    7u REG  252,0        24 /app/oracle/product/12.1/dbs/lkDEMO_SITE1
ora_pmon_ 10586 oracle    8r REG  252,0   1233408 /app/oracle/product/12.1/rdbms/mesg/oraus.msb

I've removed hundreds of lines with FD=mem and size=4M. I'm in AMM with memory_target=800M and SGA is implemented in /dev/shm granules. With lsof, we see all of them. And with a large memory_target we can have thousands of them (even if granule becomes 16M when memory_target is larger than 1GB). But don't worry, they don't count in the 'nofile' limit. Only 'real' file descriptors are counted - those with a numeric FD.

So, if you want to know the processes that are near the limit, you can use the following:

[oracle@VM211 ulimit]$ lsof | awk '$4 ~ /[0-9]+[rwu -].*/{p[$1"\t"$2"\t"$3]=p[$1"\t"$2"\t"$3]+1}END{for (i in p) print p[i],i}' | sort -n | tail
15 ora_dmon_    10634   oracle
16 ora_dbw0_    10608   oracle
16 ora_mmon_    10626   oracle
16 ora_rsm0_    10722   oracle
16 tnslsnr      9785    oracle
17 automount    1482    root
17 dbus-daem    1363    dbus
20 rpc.mount    1525    root
21 ora_lgwr_    10610   oracle
89 master       1811    root


The idea is to filter the output of lsof with awk, keeping only the numeric file descriptors, and aggregate them per process. Then we sort and show the highest counts. Here, the Postfix master process has 89 files open, followed by the log writer.

You can get the same information from the /proc filesystem, where the file handles of each process are listed in /proc/<pid>/fd:

for p in /proc/[0-9]* ; do echo $(ls $p/fd | wc -l) $(cat $p/cmdline) ; done | sort -n | tail
15 ora_dmon_DEMO
16 ora_dbw0_DEMO
16 ora_mmon_DEMO
16 ora_rsm0_DEMO
16 /app/oracle/product/12.1/bin/tnslsnrLISTENER-inherit
17 automount--pid-file/var/run/
17 dbus-daemon--system
20 rpc.mountd
21 ora_lgwr_DEMO
89 /usr/libexec/postfix/master


Same result, much quicker, and with more information about the process. This is the way I prefer, but remember that if you want to see all processes, you must be logged in as root.
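If you prefer to script this counting, here is a minimal Python sketch of the same /proc-based approach (the function name `count_fds` is mine, for illustration, not from the original post):

```python
import os

def count_fds(pid):
    """Count the open file descriptors of a process by listing /proc/<pid>/fd.

    Reading another user's fd directory requires root; a normal user can
    only inspect its own processes.
    """
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    # Report the top consumers among the processes we are allowed to inspect,
    # like the shell one-liner above.
    counts = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                counts.append((count_fds(int(entry)), int(entry)))
            except OSError:
                pass  # process exited or fd directory not readable
    for n, pid in sorted(counts)[-10:]:
        print(n, pid)
```

Like the shell loop, this counts only 'real' numeric descriptors, so the result is directly comparable to the nofile limit.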


The proof

As I did for nproc, I have written a small C program that opens the files passed as arguments and keeps them open for a few seconds, so that I am sure I am counting the right things.

I encourage you to do the same on a test system and let me know if your result differs.
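The C source itself is not reproduced on this page, but the behaviour is easy to sketch in Python (the function name and the exact output wording below are my approximation, not the original program's):

```python
import resource
import sys
import time

def open_files(paths):
    """Open each path read-only until the nofile limit is hit.

    Prints the soft/hard RLIMIT_NOFILE at each step and stops at the first
    failure (errno 24, EMFILE, once the soft limit is reached). Returns the
    list of successfully opened file objects.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    handles = []
    for i, path in enumerate(paths, 1):
        try:
            handles.append(open(path))
        except OSError as e:
            print("open() number %d failed with errno=%d" % (i, e.errno))
            break
        print("open file %d of %d getrlimit nofile: soft=%d hard=%d %s"
              % (i, len(paths), soft, hard, path))
    return handles

if __name__ == "__main__":
    handles = open_files(sys.argv[1:])
    time.sleep(5)  # keep the descriptors open long enough to check with lsof
```

Run under 'ulimit -n 10', opening files beyond the limit fails with errno 24, matching the runs shown below.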

First, I set my nofile limit to only 10

ulimit -n 10


Then, let's open 7 files. In addition to stdin, stdout and stderr, we will have 10 file handles:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp &
open file 1 of 7 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 7 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 7 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 7 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 7 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 7 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 7 getrlimit nofile: soft=10 hard=10 myfile7.tmp


I was able to open those 7 files. Then I check lsof:

[oracle@VM211 ulimit]$ lsof | grep openfiles
openfiles 21853    oracle  cwd       DIR  0,24    380928    9320 /tmp/ulimit
openfiles 21853    oracle  rtd       DIR 252,0      4096       2 /
openfiles 21853    oracle  txt       REG  0,24      7630    9494 /tmp/ulimit/openfiles
openfiles 21853    oracle  mem       REG 252,0    156928 1579400 /lib64/
openfiles 21853    oracle  mem       REG 252,0   1926800 1579401 /lib64/
openfiles 21853    oracle    0u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    1u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    2u      CHR 136,1       0t0       4 /dev/pts/1
openfiles 21853    oracle    3r      REG  0,24         0    9487 /tmp/myfile1.tmp
openfiles 21853    oracle    4r      REG  0,24         0    9488 /tmp/myfile2.tmp
openfiles 21853    oracle    5r      REG  0,24         0    9489 /tmp/myfile3.tmp
openfiles 21853    oracle    6r      REG  0,24         0    9490 /tmp/myfile4.tmp
openfiles 21853    oracle    7r      REG  0,24         0    9491 /tmp/myfile5.tmp
openfiles 21853    oracle    8r      REG  0,24         0    9492 /tmp/myfile6.tmp
openfiles 21853    oracle    9r      REG  0,24         0    9493 /tmp/myfile7.tmp


We see our 10 file handles, which proves that only numeric FDs are counted against the nofile limit of 10. You see stdin, stdout and stderr as FD 0, 1, 2, and then my 7 files opened read-only.

Let's try to open one more file:

[oracle@VM211 ulimit]$ ./openfiles myfile1.tmp myfile2.tmp myfile3.tmp myfile4.tmp myfile5.tmp myfile6.tmp myfile7.tmp myfile8.tmp
open file 1 of 8 getrlimit nofile: soft=10 hard=10 myfile1.tmp
open file 2 of 8 getrlimit nofile: soft=10 hard=10 myfile2.tmp
open file 3 of 8 getrlimit nofile: soft=10 hard=10 myfile3.tmp
open file 4 of 8 getrlimit nofile: soft=10 hard=10 myfile4.tmp
open file 5 of 8 getrlimit nofile: soft=10 hard=10 myfile5.tmp
open file 6 of 8 getrlimit nofile: soft=10 hard=10 myfile6.tmp
open file 7 of 8 getrlimit nofile: soft=10 hard=10 myfile7.tmp
open file 8 of 8 getrlimit nofile: soft=10 hard=10 myfile8.tmp
fopen() number 8 failed with errno=24


Here the limit is reached and the open() call fails with errno 24 (EMFILE) because we have reached nofile=10.



When counting the processes for the nproc limit, we have seen that threads must be counted as processes. For the nofile limit, we don't need to detail the threads, because all threads of a process share the same file descriptor table.
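This can be verified with a small experiment (my own illustration, not from the original article): a descriptor opened by one thread is immediately charged to the whole process, since every thread shares the one descriptor table:

```python
import os
import threading

def fd_count():
    """Open file descriptor count of the current process, via /proc."""
    return len(os.listdir("/proc/self/fd"))

def demo():
    """Open a file from a worker thread; the main thread sees the new FD."""
    before = fd_count()
    opened = []
    worker = threading.Thread(target=lambda: opened.append(open("/dev/null", "rb")))
    worker.start()
    worker.join()
    after = fd_count()
    # The descriptor opened by the worker counts against the same
    # per-process table, hence against the same nofile limit.
    grew_by_one = (after == before + 1)
    opened[0].close()
    return grew_by_one

if __name__ == "__main__":
    print("descriptor opened by a thread counts for the whole process:", demo())
```

This is why 'ps h -L' matters for nproc but plain per-process counting is enough for nofile.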


Recommended values

Currently this is what is set on Oracle linux 6 for 11gR2 (in /etc/security/limits.conf):

oracle   soft   nofile    1024
oracle   hard   nofile    65536


For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nofile 1024
oracle hard nofile 65536


Do you think it's a bit low? Just for information, here is what is set in the ODA X4-2:

oracle soft nofile 131072


In any case, it is a good idea to check whether you are reaching the limit, and the lsof and /proc scripts above should help with that.

SQL Server: How to find the default data path?

Sun, 2014-06-15 18:16

I have read a lot of SQL Server blog posts and articles in order to find the default data path. This post covers different SQL Server versions (SQL Server 2005, 2008, 2012 and 2014) and provides a generic script with different methods.

SQL Server 2014: sys.dm_exec_query_profiles, a new promising feature

Fri, 2014-06-13 03:52

Among the hidden features provided by SQL Server 2014 there is a very interesting dynamic management view named sys.dm_exec_query_profiles. This feature will help the database administrators and experienced developers to troubleshoot long running queries in real-time. I'm pretty sure you had to deal with the following questions: When will this query finish? What percentage of total workload is performed by this request? Which steps are the longest? Before SQL Server 2014 it was impossible to answer the questions above!

But first, I have to admit that this new DMV raised a lot of questions during my tests. I will try to share my thoughts and findings with you in this post.

My test consisted of running this long query:


select
       YEAR(TransactionDate) AS year_tran,
       MONTH(TransactionDate) AS month_tran,
       FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,
       LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb
from AdventureWorks2012.dbo.bigTransactionHistory as a
       join AdventureWorks2012.dbo.bigProduct as p
             on a.ProductID = p.ProductID


On my computer this query takes 5'47'' to run. Parallelism is enabled and my SQL Server instance can use up to 8 processors. The first time, I was disappointed by the output of the sys.dm_exec_query_profiles DMV during my query execution:


select * from sys.dm_exec_query_profiles




… No output! OK, so what are we supposed to do to see something with sys.dm_exec_query_profiles? Keep good habits and go back to the SQL Server BOL, which says:

To serialize the requests of sys.dm_exec_query_profiles and return the final results to SHOWPLAN XML, use SET STATISTICS PROFILE ON; or SET STATISTICS XML ON;

In other words, to have a chance to see something with this DMV, we must use session options like STATISTICS PROFILE or STATISTICS XML, or force SQL Server to display the execution plan after query execution in SQL Server Management Studio. My first thought was: why do we have to use additional options to produce output for this new DMV? These constraints severely limit the scenarios where we can use it. After some reflection and discussions with other French MVPs, this seems normal: tracking execution plan information is expensive, and we could potentially bring a production server to its knees by enabling this feature for the entire workload. But using additional session options can be impractical in production environments, because it requires executing the concerned query yourself, which is not always possible. Fortunately, Microsoft provides the query_post_execution_showplan event, which can be used in a SQL Server trace or an extended event session. However, the implementation design of this event has a significant impact on the performance of a SQL Server instance. Indeed, even with a short-circuit predicate, this event is triggered each time a SQL statement is executed, because the query duration is not known ahead of time (please refer to this Microsoft Connect item). To summarize: in an OLTP production environment, this event should only be used for short periods for troubleshooting purposes. In an OLAP environment the story is different, because we don't have to deal with a lot of short queries, but only with long-running queries issued by cube processing or ETL processes, for example.

After discussing the pros and cons of this new feature, let's start with my previous T-SQL statement and the session option SET STATISTICS PROFILE ON:


SET STATISTICS PROFILE ON;

select
       YEAR(TransactionDate) AS year_tran,
       MONTH(TransactionDate) AS month_tran,
       FIRST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS first_product_nb,
       LAST_VALUE(p.ProductNumber) OVER (PARTITION BY YEAR(TransactionDate), MONTH(TransactionDate) ORDER BY TransactionDate) AS last_product_nb
from AdventureWorks2012.dbo.bigTransactionHistory as a
       join AdventureWorks2012.dbo.bigProduct as p
             on a.ProductID = p.ProductID


In other session:


select * from sys.dm_exec_query_profiles




This DMV provides a lot of useful information, which is more granular than SET STATISTICS IO because the counters returned are per operator and per thread (node_id / physical_operator_name and thread_id columns). In my case, the SQL Server instance can use up to 8 processors, and we can notice that some operators (node_id) have several threads running in parallel (thread_id). Furthermore, two other columns are interesting: estimate_row_count and row_count. The former is the number of rows an operator is estimated to address, and the latter is the number of rows addressed by the same operator so far (remember: these counters are per operator and per thread). We can compare these two columns to estimate a completion percentage per operator (or per operator and per thread). Finally, we can categorize the information provided by this DMV in two parts: information related to the query execution engine (row_count, estimate_row_count) and information related to the storage engine (logical_read_count, write_page_count, lob_logical_read_count, etc.).

As you can see, this DMV provides a lot of information, and you will certainly customize the final output yourself depending on what you want to focus on.

Here is an example of a custom script that uses sys.dm_exec_query_profiles, sys.objects, sys.indexes, sys.dm_exec_requests, sys.dm_exec_sessions and sys.dm_os_waiting_tasks to cover both the real-time execution information and the related waits occurring at the same time.

USE AdventureWorks2012;
GO

SELECT
       qp.node_id,
       qp.session_id,
       o.name AS [object_name],
       i.name AS index_name,
       qp.physical_operator_name + QUOTENAME(CAST(COUNT(qp.thread_id) AS VARCHAR(4))) AS physical_operator_name,
       SUM(qp.estimate_row_count) AS estimate_row_count,
       SUM(qp.row_count) AS row_count,
       CASE
             WHEN SUM(qp.row_count) * 1. / SUM(qp.estimate_row_count + .00001) <= 1
                  THEN CAST(CAST(SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %'
             ELSE '100 % + (' + CAST(CAST((SUM(qp.row_count) * 100. / SUM(qp.estimate_row_count + .00001)) - 100 AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %)'
       END AS [completed_%],
       -- Query execution engine
       CAST(CAST(SUM(qp.elapsed_time_ms) * 100. / (SUM(SUM(qp.elapsed_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_elapsed_time_%],
       CAST(CAST(SUM(qp.cpu_time_ms) * 100. / (SUM(SUM(qp.cpu_time_ms)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_cpu_%],
       -- Storage engine
       CAST(CAST(SUM(qp.logical_read_count) * 100. / (SUM(SUM(qp.logical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_logical_read_%],
       CAST(CAST(SUM(qp.physical_read_count) * 100. / (SUM(SUM(qp.physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_physical_read_%],
       CAST(CAST(SUM(qp.lob_logical_read_count) * 100. / (SUM(SUM(qp.lob_logical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_logical_read_%],
       CAST(CAST(SUM(qp.lob_physical_read_count) * 100. / (SUM(SUM(qp.lob_physical_read_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [lob_physical_read_%],
       CAST(CAST(SUM(qp.write_page_count) * 100. / (SUM(SUM(qp.write_page_count)) OVER() + .00001) AS DECIMAL(5,2)) AS VARCHAR(6)) + ' %' AS [total_write_%],
       SUM(qp.logical_read_count) AS total_logical_read_count,
       SUM(qp.lob_logical_read_count) AS total_lob_logical_read_count,
       SUM(qp.lob_physical_read_count) AS total_lob_physical_read_count,
       SUM(qp.lob_read_ahead_count) AS total_lob_read_ahead_count,
       SUM(qp.physical_read_count) AS total_physical_read_count,
       SUM(qp.read_ahead_count) AS total_read_ahead_count,
       SUM(qp.write_page_count) AS total_write_page_count,
       -- Both
       SUM(qp.cpu_time_ms) AS total_cpu_time_ms,
       SUM(qp.elapsed_time_ms) AS total_elapsed_time_ms,
       -- Wait info
       COALESCE(wait.wait_info, '') AS wait_info
FROM sys.dm_exec_query_profiles AS qp (NOLOCK)
       LEFT JOIN sys.objects AS o (NOLOCK)
             ON qp.object_id = o.object_id
       LEFT JOIN sys.indexes AS i (NOLOCK)
             ON qp.index_id = i.index_id
                    AND qp.object_id = i.object_id
OUTER APPLY (
       SELECT STUFF(
                    (SELECT
                                  ',' + ws.wait_type + ' ' + QUOTENAME(CAST(SUM(COALESCE(ws.wait_duration_ms, 0)) AS VARCHAR(20)) + ' ms')
                           FROM sys.dm_exec_requests AS r (NOLOCK)
                                  JOIN sys.dm_exec_sessions AS s (NOLOCK)
                                        ON r.session_id = s.session_id
                                  JOIN sys.dm_os_waiting_tasks AS ws (NOLOCK)
                                        ON ws.session_id = s.session_id
                           WHERE s.session_id = qp.session_id
                           GROUP BY ws.wait_type
                           FOR XML PATH (''), TYPE).value('.', 'varchar(max)'), 1, 1, ''
             )
) AS wait (wait_info)
GROUP BY qp.node_id, qp.session_id, qp.physical_operator_name, o.name, i.name, COALESCE(wait.wait_info, '')
ORDER BY node_id
OPTION (RECOMPILE);


The first section of the above script concerns the general information:




1: The above script groups the information per operator. The number in brackets is the number of threads used by the concerned operator.

2: Completion is computed as the ratio between the rows addressed by the corresponding operator so far and the total estimated rows. As we can expect, the estimated rows and the actual rows addressed by an operator can differ. The script highlights this gap by showing the difference as a percentage in brackets when row_count is greater than estimate_row_count. This information can be useful to “estimate” the overall completion of the query.

The total_elapsed_time_% and total_cpu_% columns are the relative percentages of the elapsed time and CPU time consumed by each operator over the whole query execution.

3: For some physical operators like Clustered Index Scan, Table Scan, etc., the script shows the related table and index objects.


The second section concerns the storage engine statistic for the query execution:




The above information can be useful to highlight the top operators for a given resource or operation type. For example, in my case I can notice that the clustered index scan of the bigTransactionHistory table is the main consumer of logical reads. Furthermore, we can observe that the sort operator is the root cause of spilling. Finally, as expected, the hash match operator consumes by design a lot of CPU and is the top CPU operator in this case.


The next section shows the absolute values per operator per type of resources and operation:




Finally, the last section shows, in real time, the wait types encountered during the query execution, with the duration aggregated over all related threads in brackets:




In my case we retrieve the wait type related to parallel query execution (CXPACKET) and certainly one of the root causes of the CXPACKET wait type (IO_COMPLETION), but here we don't have sufficient information to verify it. We can just notice that the query execution generates CXPACKET and IO_COMPLETION wait types in real time.

At this point, we may ask why there is no remaining-time information. In fact, computing an accurate remaining time is very difficult because a lot of factors must be correlated. We could take a naive approach and compute the remaining time from the row_count, estimate_row_count and elapsed time columns for each operator, but you will be disappointed when testing this: according to my own tests, the result is never as accurate as you would expect. Please feel free to share your results or thoughts about computing an accurate remaining time!

The sys.dm_exec_query_profiles DMV can also be correlated with external tools like XPERF or Perfmon. I will prepare a future blog post (as soon as I have some time) to share my experience with you.

Happy query profiling!

Linux: how to monitor the nproc limit

Tue, 2014-06-10 01:14

You probably know about the 'nproc' limit in Linux, which is set in /etc/security/limits.conf and checked with 'ulimit -u'. But do you know how to monitor it and be alerted when you are close to the limit?

Nproc and ps

Nproc is defined at OS level to limit the number of processes per user. Oracle documentation recommends the following:

oracle soft nproc 2047
oracle hard nproc 16384

But that is often too low, especially when you have the Enterprise Manager agent or other java programs running.

Do you want to check that you are far from the limit? Then you can use 'ps'. But beware: by default, 'ps' does not show all processes.
In Linux, when doing multithreading, each thread is implemented as a light-weight process (LWP), and you must use the '-L' option to see all of them.

Let's take an example. I have a system where 'ps -u oracle' returns 243 lines. But including LWPs reveals a lot more processes, and the count is much nearer the limit:

$ ps h -Led -o user | sort | uniq -c | sort -n
      1 dbus
      1 ntp
      1 rpc
      1 rpcuser
      2 avahi
      2 haldaemon
      2 postfix
    166 grid
    400 root
   1370 oracle

So the 'oracle' user has 1370 processes. That's high. And this is the actual number against which the nproc limit is checked.

'ps -Lf' can show the detail. And even without '-L', we can display the NLWP column, which is the number of threads per process:

ps -o nlwp,pid,lwp,args -u oracle | sort -n
   1  8444  8444 oracleOPRODP3 (LOCAL=NO)
   1  9397  9397 oracleOPRODP3 (LOCAL=NO)
   1  9542  9542 oracleOPRODP3 (LOCAL=NO)
   1  9803  9803 /u00/app/oracle/product/agent12c/core/ /u00/app/oracle/product/agent12c/core/ agent /u00/app/oracle/product/agent12c/agent_inst/sysman/log/emagent.nohup
  19 11966 11966 /u00/app/11.2.0/grid/bin/oraagent.bin
1114  9963  9963 /u00/app/oracle/product/agent12c/core/ ... emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain

The Oracle 12c EM agent has started 1114 threads and the grid infrastructure 'oraagent.bin' has 19 threads. In addition to that I've a lot of other monothreaded processes. This is how we reach 1370 which is the exact value to compare to the nproc limit.

So what are good values to set? About the high number of threads for the EM agent 12c, there are a few bugs. And I suspect that 1000 threads is too much, especially since checking them with 'jstack' shows that they are "CRSeOns" threads that should not be used in and higher. But that's another problem, which I'm currently investigating. When you reach the nproc limit, the user is not able to create new processes: clone() calls return EAGAIN, and that is reported by Oracle as:

ORA-27300: OS system dependent operation:fork failed with status: 11 
ORA-27301: OS failure message: Resource temporarily unavailable 

And that is clearly bad when it concerns an +ASM instance or archiver processes.

The goal of the nproc limit is only to prevent 'fork bombs', where a process forks forever and exhausts all resources. So there is no problem with increasing this limit. However, if you set it high for some users ('oracle' and 'grid' usually), it is a good idea to monitor the number of processes with the 'ps h -L' command above, because having too many processes is suspect: increasing the limit just hides a process leak and defers the failure.

In 'ps h -L', the argument 'h' removes the header line and '-L' shows all processes including LWPs. You can then count them with 'wc -l'.

The proof

In order to be sure that 'ps h -L' gives the exact number, I have tested it. In case you want to check this on your system, here is how to do it. And please report any difference.

First, set your limit to 1024 processes. This is a limit for my user, and the limit is set for my shell and all its child processes:

[oracle@VM211 ocm]$ ulimit -u 1024

Now you can check it:

[oracle@VM211 ocm]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15919
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Then you can run a small C program that calls fork() in a loop until it fails with EAGAIN:

[oracle@VM211 ocm]$ ./testnproc
parent says fork number 871 sucessful
 child says fork number 872 pid 1518
parent says fork number 872 sucessful
 child says fork number 873 pid 1519
parent says fork number 873 sucessful
 child says fork number 874 pid 1520
parent says fork number 874 sucessful
parent says fork number 875 failed (nproc: soft=1024 hard=1024) with errno=11
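The C source of testnproc is not reproduced on this page. For illustration only, here is a Python sketch of the same experiment (function name, parameters and message wording are my approximation, not the original program's):

```python
import os
import resource
import time

def fork_until_eagain(max_forks=100000, child_sleep=5):
    """Fork in a loop until fork() fails with EAGAIN (errno 11).

    Each child just sleeps for a while and exits, so the processes stay
    alive long enough to be counted. Returns the number of the failed
    fork, or None if max_forks was reached without hitting the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    children = []
    failed_at = None
    for n in range(1, max_forks + 1):
        try:
            pid = os.fork()
        except OSError as e:
            print("parent says fork number %d failed (nproc: soft=%d hard=%d)"
                  " with errno=%d" % (n, soft, hard, e.errno))
            failed_at = n
            break
        if pid == 0:            # child: occupy one process slot, then exit
            time.sleep(child_sleep)
            os._exit(0)
        children.append(pid)
    for pid in children:        # reap the children before returning
        os.waitpid(pid, 0)
    return failed_at
```

Calling fork_until_eagain() in a shell where 'ulimit -u 1024' has been set should print a failure line similar to the one above (errno 11 is EAGAIN).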

And finally, because the processes sleep for a while, you can check how many processes you have. I do that from another user account for the simple reason that I need to create 2 more processes ('ps' and 'wc') for that:

[root@VM211 ocm]# ps h -Lu oracle | wc -l
Recommended values for Oracle

Currently this is what is set on Oracle linux 6 for 11gR2 by the preinstall package (in /etc/security/limits.conf):

oracle   soft   nproc    16384
oracle   hard   nproc    16384

For 12c, these are set in /etc/security/limits.d/oracle-rdbms-server-12cR1-preinstall.conf which overrides /etc/security/limits.conf:

oracle soft nproc 16384
oracle hard nproc 16384

And just for your information, here is what is set in the ODA X4-2:

oracle soft nproc 131072

So what should you set? You probably don't want it too low, or you will experience 'Resource temporarily unavailable' errors. But you don't want 100,000 processes on your server either. So my recommendation is to set the limit high, but monitor the number of processes and alert when it reaches something that is not sensible. Then you prevent the system from going down in case of a process leak, but you can still detect it and ask for a patch.

SQL Server 2014: Availability groups failover impossible with Failover Cluster Manager

Fri, 2014-06-06 01:27

A couple of weeks ago, I worked for a customer that wanted to implement SQL Server 2012 (and not SQL Server 2014) AlwaysOn with availability groups. During our tests, we performed a bunch of failover tests, and the customer tried to fail over one of the installed availability groups by using the Failover Cluster Manager (FCM). Of course, I told him this is not best practice, because the Failover Cluster Manager is not aware of the synchronization state of the availability group. But with SQL Server 2014 the story has changed, because I noticed a different behavior. I would like to share this information with you in this post.

But let me first demonstrate the SQL Server 2012 behavior with the following example:

I have an availability group named SQL12 configured with 2 synchronous replicas and automatic failover. However, as you can see, the synchronization state of my availability database is not healthy as shown below:




Now, if I try to failover my availability group using the failover cluster manager ...






.... the next available node is SQL143 ...






... and we can notice the failover did not occur as expected because the SQL12 resource is still on the SQL141 node as shown in the following picture:




Having a look at the cluster error log does not help us in this case because we have a classic 1069 error number without helpful information:





Generating detailed cluster logs could help us, but I prefer to look directly at the SQL Server side for the moment. The AlwaysOn_health extended event session is a good place to check for existing records associated with the problem.




We have indeed some information about the failover attempt. First the SQL141SQL12 replica state changed from PRIMARY_NORMAL to RESOLVING_NORMAL due to the manual failover issued by the FCM.




Then we can see an error message explaining that the dummy database is changing its role from PRIMARY to RESOLVING because of a problem with role synchronization. This error is issued by the forced failover of the availability group resource that I used.




Finally, we notice that the failover process did not complete successfully and the dummy database fails back to the SQL141SQL12 replica (availability_replica_id 1B9007CA-EE3F-486D-A974-838EFED0203D, associated with the SQL141SQL12 replica in my case)




On the other side, the SQL143SQL12 secondary replica also features a lot of useful information:






To summarize, this test demonstrates clearly that the FCM is not aware of the availability databases synchronization state inside an availability group. Using FCM may result in unintended outcomes, including unexpected downtime!

Now, it's time to test the same scenario with SQL Server 2014 and a configured availability group. During the failover attempt, I get the following error message:






If we take a look at the possible owners of the corresponding availability group resource, we can see that the current cluster node that hosts the availability group is the only possible owner - unlike SQL Server 2012.




As a reminder, possible and preferred owners are reset during availability group creation and failover, based on the primary replica and its secondaries. For fun, I decided to compare the processes issued by the two versions of SQL Server (2012 and 2014) after triggering an availability group failover and generating the associated cluster logs:


Get-ClusterLog -UseLocalTime -Span 5

Result with SQL Server 2012

000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 1)
000005d4.00000778::2014/05/26-22:10:55.088 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.088 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,1)
000005d4.00000778::2014/05/26-22:10:55.103 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.'
000005d4.00000778::2014/05/26-22:10:55.103 WARN [DM] Aborting group transaction 29:29:613+1
000005d4.00000778::2014/05/26-22:10:55.103 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed'
000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] sql12 cannot be hosted on node 3
000005d4.00000778::2014/05/26-22:10:55.103 WARN [RCM] Possible owners:
000005d4.00000778::2014/05/26-22:10:55.103 WARN     2
000005d4.00000778::2014/05/26-22:10:55.103 WARN     1
000005d4.00000778::2014/05/26-22:10:55.103 WARN
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (sql12, 2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:215, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(sql12,2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Removing node 2 from resource 'sql12'.
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (sql12, 2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:216, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::AddPossibleOwner(sql12,2)
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] Adding node 2 to resource 'sql12'.
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.00000778::2014/05/26-22:10:55.103 INFO [GUM] Node 1: executing request locally, gumId:217, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1
000005d4.00000778::2014/05/26-22:10:55.103 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(sql12,
000005d4.00000778::2014/05/26-22:10:55.103 INFO     1
000005d4.00000778::2014/05/26-22:10:55.103 INFO     2
000005d4.00000778::2014/05/26-22:10:55.103 INFO


Result with SQL Server 2014

000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (AdventureWorksGrp, 1)
000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000005d4.00000bb0::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::AddPossibleOwner(AdventureWorksGrp,1)
000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.'
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [DM] Aborting group transaction 29:29:627+1
000005d4.00000bb0::2014/05/26-22:14:54.578 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed'
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] AdventureWorksGrp cannot be hosted on node 3
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN [RCM] Possible owners:
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     2
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN     1
000005d4.00000bb0::2014/05/26-22:14:54.578 WARN
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmApi::RemovePossibleOwner: (AdventureWorksGrp, 2)
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:230, my action: /rcm/gum/RemovePossibleOwner, # of updates: 1
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::RemovePossibleOwner(AdventureWorksGrp,2)
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [RCM] Removing node 2 from resource 'AdventureWorksGrp'.
000005d4.00000c34::2014/05/26-22:14:54.578 INFO [GEM] Node 1: Sending 1 messages as a batched GEM message
000005d4.000011f4::2014/05/26-22:14:54.578 INFO [GUM] Node 1: executing request locally, gumId:231, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1
000005d4.000011f4::2014/05/26-22:14:54.578 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(AdventureWorksGrp,
000005d4.000011f4::2014/05/26-22:14:54.578 INFO     1
000005d4.000011f4::2014/05/26-22:14:54.578 INFO


As I said earlier, the possible and preferred owner properties are managed automatically by SQL Server AlwaysOn. We can see here that this is done by the cluster resource control manager (RCM) through the following functions:

rcm::RcmApi::AddPossibleOwner(), rcm::RcmApi::RemovePossibleOwner() and rcm::RcmApi::SetGroupPreferredOwners().

You can notice that two nodes are added as possible owners with SQL Server 2012. However, with SQL Server 2014, only one node is added as possible owner of the concerned availability group resource. An interesting change, isn't it?

How to list all Oracle system schemas

Thu, 2014-06-05 04:51

Do you want to know which users come with the Oracle Database and which ones you have created? That is not so easy, especially prior to Oracle 12c. You know SYS and SYSTEM, but there are many others - especially when you have installed all the options.

And it is important to know them all, e.g. when you import, when you apply a password policy, etc.

Let's see the options we have.


12c shows Oracle maintained users

Oracle 12c has introduced the ORACLE_MAINTAINED column in the DBA_USERS dictionary view.

It is YES for the users created by Oracle scripts (create database, catalog.sql, catproc.sql). So it's the right place to check when you are in 12c.


12c Common users

Oracle 12c has another column to distinguish user commonality. This is for CDB: the Oracle dictionary is stored in the root container, and each PDB has links to it. And what is nice is that even in a non-CDB, the COMMON column is at 'YES' for Oracle maintained users and at 'NO' for created users (except if you create C## common users in a CDB).
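Assuming a 12c data dictionary, both columns can be checked together with a simple query like this sketch:

```sql
-- List application (non-Oracle-maintained) accounts on 12c,
-- with their commonality (YES for C## common users in a CDB)
select username, created, common
from dba_users
where oracle_maintained = 'N'
order by created;
```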


Default passwords

The table sys.default_pwd$ stores the default password hash values so that we can check which users can log in with a trivial password. Even if it includes all the Oracle maintained users, it also has some well-known users which are not provided by database creation at all.
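Rather than querying sys.default_pwd$ directly, the DBA_USERS_WITH_DEFPWD view (available since 11g) exposes the same check; a quick sketch:

```sql
-- Open accounts still using a well-known default password (11g and later)
select u.username, u.account_status
from dba_users_with_defpwd d
join dba_users u on u.username = d.username
where u.account_status = 'OPEN'
order by u.username;
```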


Sysaux occupants

Another idea, which I took from Martin Bach, is to crosscheck with v$sysaux_occupants. Some Oracle provided users are missing, but the nice thing is that there is a description for the schemas that are listed.


Not exported by DataPump

DataPump maintains its own list of users for which the tables are not exported, and you can query sys.ku_noexp_tab for that. The old exp has its own list as well, stored in sys.exu8usr. But neither of them is exhaustive.

Another exclusion list is the one used by logical standby: system.logstdby$skip_support, which may be used as well to crosscheck (only the records with action in (0,-1) are relevant for schema exclusion).

Let's check all that on a 12c non-CDB database. Here is the query that cross-checks all that information with outer joins:

select created,username,oracle_maintained,common,no_exp,no_expdp,no_sby,default_password,sysaux,occupant_desc
from dba_users 
left outer join 
 (select distinct name username,'Y' no_expdp from sys.ku_noexp_tab where obj_type='SCHEMA') using(username)
left outer join (select distinct name username,'Y' no_exp from sys.exu8usr) using(username)
left outer join (select distinct name username,'Y' no_sby from system.logstdby$skip_support where action in (0,-1)) using(username)
left outer join (select distinct user_name username,'Y' default_password from sys.default_pwd$) using(username)
left outer join (select schema_name username,'Y' sysaux,decode(count(*),1,min(occupant_desc)) occupant_desc from v$sysaux_occupants group by schema_name) using(username)
order by created,username;

and the result, ordered by creation date so that you can check which users were created by the database creation:

28-FEB-14 08:25:02 AUDSYS Y YES Y Y Y Y Y
28-FEB-14 08:25:02 SYS Y YES Y Y Y Y Y
28-FEB-14 08:25:02 SYSBACKUP Y YES   Y Y Y
28-FEB-14 08:25:02 SYSDG Y YES   Y Y Y
28-FEB-14 08:25:02 SYSKM Y YES   Y Y Y
28-FEB-14 08:25:02 SYSTEM Y YES Y   Y Y Y
28-FEB-14 08:25:05 OUTLN Y YES Y   Y Y
28-FEB-14 08:29:33 GSMADMIN_INTERNAL Y YES   Y Y Y
28-FEB-14 08:29:33 GSMUSER Y YES   Y Y Y
28-FEB-14 08:29:43 DIP Y YES   Y Y Y
28-FEB-14 08:30:39 XS$NULL Y YES   Y Y Y
28-FEB-14 08:31:12 ORACLE_OCM Y YES   Y Y Y
28-FEB-14 08:39:55 DBSNMP Y YES   Y Y Y Y
28-FEB-14 08:39:56 APPQOSSYS Y YES   Y Y Y
28-FEB-14 08:40:03 ANONYMOUS Y YES   Y Y Y
28-FEB-14 08:40:03 XDB Y YES   Y Y Y Y
28-FEB-14 08:44:29 GSMCATUSER Y YES   Y Y Y
28-FEB-14 08:53:38 WMSYS Y YES   Y Y Y Y
28-FEB-14 08:55:31 OJVMSYS Y YES   Y Y Y
28-FEB-14 08:59:58 CTXSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 MDSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDDATA Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDPLUGINS Y YES   Y Y Y Y
28-FEB-14 09:00:44 ORDSYS Y YES   Y Y Y Y
28-FEB-14 09:00:44 SI_INFORMTN_SCHEMA Y YES   Y Y Y Y
28-FEB-14 09:09:48 OLAPSYS Y YES   Y Y Y Y
28-FEB-14 09:10:17 MDDATA Y YES Y   Y Y
28-FEB-14 09:15:09 SPATIAL_WFS_ADMIN_USR Y YES Y   Y Y
28-FEB-14 09:15:13 SPATIAL_CSW_ADMIN_USR Y YES Y   Y Y
28-FEB-14 09:20:00 LBACSYS Y YES   Y Y Y
28-FEB-14 09:20:16 APEX_040200 Y YES       Y
28-FEB-14 09:20:16 APEX_PUBLIC_USER Y YES Y     Y
28-FEB-14 09:20:16 FLOWS_FILES Y YES       Y
28-FEB-14 09:45:17 DVF Y YES   Y Y Y
28-FEB-14 09:45:17 DVSYS Y YES   Y Y Y
21-MAR-14 18:15:19 FRANCK N NO Y
09-APR-14 20:33:34 RMAN N NO Y     Y
11-APR-14 09:04:55 DEMO N NO Y     Y
23-APR-14 20:53:09 HR N NO Y   Y Y
23-APR-14 20:53:09 IX N NO Y   Y Y
23-APR-14 20:53:09 OE N NO Y   Y Y
23-APR-14 20:53:09 PM N NO Y   Y Y
23-APR-14 20:53:09 SH N NO Y   Y Y
25-APR-14 21:26:33 SQLTXADMIN N NO Y
25-APR-14 21:26:33 SQLTXPLAIN N NO Y
27-MAY-14 21:48:24 SCOTT N NO Y   Y Y


So you see that the 12c columns are highly reliable. They include all users that have been created during the database creation, but do not include the users that were created afterwards.

When not in 12c, the exclusion list coming from DataPump is quite good, but not exhaustive.

Sometimes the most useful features are not those that change the future of database, but rather those little things that make life easier.

In DBA_USERS, we have two very useful new columns: ORACLE_MAINTAINED and LAST_LOGIN. That is a great improvement.

How to avoid Failover Cluster with PowerShell scripting

Wed, 2014-06-04 20:15

One of my colleagues asked me some days ago if I could create a PowerShell script to get rid of Windows Server Failover Cluster. The reason is that it has, in some specific cases, the bad habit of failing over a resource or a service even when we don't want it. To be honest, I use the Failover Cluster feature of Windows Server a lot without any problems as a SQL Server DBA, but for this specific case, I did not find a successful configuration.

Here, we are using Failover Cluster to provide a Virtual IP address for a database using Dbvisit Standby. The database is open on only one node and the switchover or failover operation is performed by hand, so we want to prevent failover of the Virtual IP resource to avoid redirecting client connections to a database that is not open.

At one of our clients, we have a Windows Server 2008 R2 cluster with two nodes and a FileShare. A Client Access Point has been provided as a service for client connections, and this service must not fail over automatically in any case. For some reason, this service fails over and I cannot avoid it, even with special configuration settings.

As a solution to this problem, I propose to create a PowerShell script which will create a new IP address attached to the IPv4 network interface. This script will read a configuration file to collect information about the IP address, like the network address mask, the network interface name, the IP address, the nodes where to start and stop the IP address, etc.

Here is the format of our configuration file:

# Configuration file for VIP management of databases
# Configuration format :
# SID            : Oracle SID for which the VIP is configured - mandatory
# VIP_ADR     : Address of the VIP - mandatory
# VIP_MASK   : VIP Network address Mask - mandatory
# VIP_IF         : Network interface to attach the VIP (i.e Local Area Connection) - mandatory
# VIP_LSNR    : VIP Listener - optional
# VIP_Node1   : Name of Node 1 - mandatory
# VIP_Node2   : Name of Node 2 - mandatory
# Note :
# The usage of parameter delimiters allows some parameters to be empty (not
# defined)
# How to call the script :
# .\CreateIPAddress.ps1 'Start' 'DBITEST' [-F]
# Param 1: "Start" or "Stop" - mandatory
# Param 2: VIP Address name - mandatory
# Param 3: -F to force creation of the VIP Address locally even if it is not possible to check if the VIP Address exists on the second node - optional
# Example :
# DBITEST: Area Connection::test1:test2

In this blog, I will share how the script works step by step and what are the components/commands I had to learn about to create an IP on a network interface.

First of all, this PowerShell script uses the cmdlet Invoke-Command, which relies on Windows Remote Management (WinRM). If WinRM is not configured, you can enable the WS-Management protocol on the local computer and set up the default configuration for remote management with the following command: winrm quickconfig.
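As an illustration (the node name here is hypothetical, not taken from the actual script), a remote call through WinRM looks like this:

```powershell
# Hypothetical sketch: list the IP-enabled network adapters of the second node
$node2 = "test2"
Invoke-Command -ComputerName $node2 -ScriptBlock {
    Get-WmiObject -Class Win32_NetworkAdapterConfiguration -Filter "IPEnabled=TRUE"
}
```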

You also need to change the execution policy in your PowerShell session with Set-ExecutionPolicy Unrestricted to allow script execution. You should also open your PowerShell session with the Run As Administrator option.


The first thing I do is to clear my screen:


Now, I have to check if I receive the parameter I need for my script:


I have initialized my two mandatory parameters and perhaps a third one which is optional. It is time to read where my configuration file is located, test this location and initialize the variables.


I am now able to read my configuration file sequentially to find my VIP Address. I will first test the first character of each line to skip comment lines and, when I find a non-comment line, check whether its first parameter is equal to the VIP Address entered on the command line.
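A minimal sketch of that parsing logic (variable names are illustrative, not the actual script):

```powershell
# Read the configuration file and find the line describing our VIP
# Fields are ':' delimited: SID:VIP_ADR:VIP_MASK:VIP_IF:VIP_LSNR:VIP_Node1:VIP_Node2
foreach ($line in Get-Content $configFile) {
    # Skip comment lines starting with '#'
    if ($line.StartsWith("#")) { continue }
    $fields = $line -split ":"
    if ($fields[0] -eq $vipName) {
        $vipAddr = $fields[1]
        $vipMask = $fields[2]
        $vipIf   = $fields[3]
        break
    }
}
```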


I have found my VIP Address in my configuration file; I will now split my configuration line to collect all the information I need and connect to my second node to get the different IP addresses attached to its network adapters.

If the second node doesn't respond, we stop the script to avoid creating a local IP Address which already exists on the second node. But if the optional parameter "-F" has been entered, we force the creation.


If we run the command "Get-WmiObject -Class Win32_NetworkAdapterConfiguration -Filter IPEnabled=TRUE -ComputerName $node2 -erroraction stop", we get a list of each network adapter with its associated IP addresses.


We now have to search the different lines for our VIP Address. If we find it, we have to delete this IP Address; if it doesn't exist, we print a message.


So far, we have checked remotely whether our IP Address exists on the remote server and, if it does, we have deleted it.

Now, depending on our parameter, we will stop or start this IP Address locally.
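The start/stop itself can be sketched with netsh (a hypothetical fragment using the variables parsed from the configuration file; the actual script may differ):

```powershell
if ($action -eq "Start") {
    # Attach the VIP to the local network interface
    netsh interface ipv4 add address $vipIf $vipAddr $vipMask
} else {
    # Remove the VIP from the local network interface
    netsh interface ipv4 delete address $vipIf $vipAddr
}
```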


Our VIP address is now started or stopped locally, depending on the given parameters. Finally, if the VIP Address is not found in the configuration file, an error is displayed and the script ends.


At present, we have a local VIP Address to which the clients can connect. If our node goes down for whatever reason, when it comes back our VIP address will still be on the same node. If we want to switch from node 1 to node 2, we just have to run our script on node 2 with the start parameter. This will stop the IP Address on node 1 and start the IP Address on node 2.

I hope this script and the different functions I have shown here were able to help you.

Configure a Kerberos Client and Alfresco as a Kerberized application

Wed, 2014-05-28 03:57

In the previous blog, I described how to install and manage a Kerberos Server, but that's useless if there are no clients and if no applications have been kerberized! That's why in this blog I will explain in the first part how to install a Kerberos client on Linux. The second part will be dedicated to the configuration of a browser to use Kerberos tickets, and the last part will explain how to configure a popular application to use the newly created MIT Kerberos KDC. This application is Alfresco (a leader in Open Source solutions for Enterprise Content Management).

For this blog, let's define the following properties/variables:

  • = the DNS Domain
  • = the FQDN of the KDC
  • mpatou@EXAMPLE.COM = the principal of a test user
  • = the FQDN of the Alfresco host server

All configurations below have been tested on our infrastructure.

1. Install MIT Kerberos Client

In this part, I will only present how to install a Linux client. I think the Mac release has been available as part of Mac OS X since version 10.3 (the current release is Mavericks: 10.9), so there is nothing to do to install a Kerberos client, but this installation isn't configured. The Windows installation is quite easy if you use the installer, but this isn't a perfect solution. I think the best solution for Windows would be to use the Windows implementation of Kerberos to configure the connection to a Linux KDC, but this is a little bit more complex and I personally have trouble configuring that...

So, the steps to install a Linux client are almost the same as those to install a Kerberos Server. Indeed, there is no specific source code for the client, so the basic installation is the same, but the configuration is not.

Obviously, the first thing to do is to download the current release of the MIT Kerberos distribution for the target operating system. This could be done at the following URL: The current Linux release is krb5-1.12.1-signed.tar:

# wget
    --2014-04-01 14:00:28--
    Connecting to||:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 11950080 (11M) [application/x-tar]
    Saving to: “krb5-1.12.1-signed.tar”
    100%[===============================================>] 11,950,080  1.52M/s   in 7.3s

    2014-04-01 14:00:38 (1.56 MB/s) - “krb5-1.12.1-signed.tar” saved [11950080/11950080]

# tar  -xvf krb5-1.12.1-signed.tar

As you can see, this file is signed and you could (should) verify the integrity and identity of the software. This can be done, for example, using GNU Privacy Guard:

# gpg  --verify  krb5-1.12.1.tar.gz.asc

After that, just extract the MIT Kerberos source code:

# tar  -zxf  krb5-1.12.1.tar.gz
# cd  krb5-1.12.1/src/
# ./configure
# yum  install  *yacc*
# make
# make  install

At this step, Kerberos should be installed properly and the binaries, libraries and the documentation should be under /usr/local. The default location is sufficient for a client installation:

# krb5-config  --all
    Version:     Kerberos 5 release 1.12.1
    Vendor:      Massachusetts Institute of Technology
    Prefix:      /usr/local
    Exec_prefix: /usr/local

To set up a Linux client, there is only one step remaining: telling Kerberos where to find a KDC. This is done through a configuration file named krb5.conf. In the following configuration, I've included two lines (forwardable and proxiable) that are important for Alfresco SSO with Kerberos to work properly. These two lines aren't mandatory for other kerberized applications:

# vi  /etc/krb5.conf
    [libdefaults]
        default_realm = EXAMPLE.COM
        forwardable = true                            {Line only important for Alfresco}
        proxiable = true                              {Line only important for Alfresco}

    [realms]
        EXAMPLE.COM = {
            kdc =
            admin_server =
            default_domain =
        }

    [domain_realm]
         = EXAMPLE.COM
         = EXAMPLE.COM

That should be enough to obtain a ticket for the test user ("kinit mpatou" OR "kinit mpatou@EXAMPLE.COM") and delete this ticket:

# klist
    klist: Credentials cache file '/tmp/krb5cc_0' not found
# kinit  mpatou
    Password for mpatou@EXAMPLE.COM:
# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: mpatou@EXAMPLE.COM

    Valid starting               Expires               Service principal
    05/20/2014 10:54:48     05/20/2014 20:54:48     krbtgt/EXAMPLE.COM@EXAMPLE.COM
    renew until 05/21/2014 10:54:47
# kdestroy
# klist
    klist: Credentials cache file '/tmp/krb5cc_0' not found

2. Configure the Browser

Now the client should be able to acquire the first ticket (TGT), but that's not enough! The next step is to configure the browser to use tickets. Indeed, if the client tries to access a kerberized application, the browser has to present a ticket for that application to log in automatically.

a. Mozilla Firefox (Window/Linux/Mac)

Open a new window/tab in Mozilla Firefox:

  • Enter in the URL field "about:config"
  • All properties of Firefox should be displayed
  • Search for "network.negotiate" and set: network.negotiate-auth.delegation-uris = .EXAMPLE.COM
  • Set: network.negotiate-auth.trusted-uris = .EXAMPLE.COM

b. Google Chrome

i. Mac

Modify the Google Chrome application with:

open 'Google' --args --auth-schemes="basic,digest,ntlm,negotiate" --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM"

ii. Linux

Modify Google Chrome with:

google-chrome --enable-plugins --args --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM" --auth-schemes="basic,digest,ntlm,negotiate"

iii. Windows

Modify the Google Chrome shortcut with:

chrome.exe --auth-schemes="basic,digest,ntlm,negotiate" --auth-server-whitelist="*EXAMPLE.COM" --auth-negotiate-delegate-whitelist="*EXAMPLE.COM"

c. Internet Explorer

Open a new window/tab in Internet Explorer and:

  • Open the "tools" (ALT+X)
  • Click on "internet options" (o)
  • Choose the "Advanced" tab
  • In the security section, select the "Enable Integrated Windows Authentication*"
  • Choose the "Security" tab
  • Select "Local intranet"
  • Click on "Sites"
  • Click on "Advanced"
  • Add "http://*" or/and "https://*"
  • Click on "Close"
  • Click on "OK"
  • Click on "OK"
  • Restart the computer

If everything works fine, then the client should be able to access kerberized applications. If it's not the case, a good start to debug the Kerberos installation is to use a network analyzer like Wireshark.

3. Configure Alfresco to use Kerberos SSO

Actual test configuration:

  • OS: Oracle Enterprise Linux 6
  • Alfresco: Alfresco Community Enterprise 4.2.c installed on /opt/alfresco
  • Application Server: Tomcat 7

So let's begin the configuration of a kerberized application with Alfresco. The first thing to know about Alfresco is that it uses two main web clients: Alfresco Explorer and Alfresco Share. The first one is the core of Alfresco and the second one is the new interface (mainly oriented towards collaboration) that uses a proxy to ask Explorer to do the job. The configuration of Alfresco Explorer is quite easy, but getting Alfresco Share to work is not, as Share adds a new layer above Explorer.

If nothing is specified, all modifications below should be done on the Alfresco host server.

For Kerberos to work properly, the maxHttpHeaderSize must be increased:

# vi  /opt/alfresco/alfresco-4.2.c/tomcat/conf/server.xml
   <Connector port="8080" URIEncoding="UTF-8" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" maxHttpHeaderSize="32768" />

 Then the authentication chain must allow Kerberos tickets to be used to authenticate a user:

# vi  /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/

After that, the Kerberos subsystem must be configured:

# cd  /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/alfresco/extension/subsystems
# vi  Authentication/kerberos/kerberos1/

Once this is done, the KDC must be configured to recognize Alfresco as a kerberized application. For that purpose, enter the KDC kadmin interface. Caution: this requires the installation of a Kerberos Server as described in the previous blog (the important part is that kadmin must be available from a remote computer) and the installation of a Kerberos client on the Alfresco host server. If your kadmin doesn't work from a remote location, you will have to use the kadmin.local interface on the KDC host server.

# /usr/local/bin/kadmin
    addprinc  cifs/<alfresco-host>@EXAMPLE.COM
    addprinc  HTTP/<alfresco-host>@EXAMPLE.COM
    ktadd  -k  /etc/krb5cifs.keytab  cifs/<alfresco-host>@EXAMPLE.COM
    ktadd  -k  /etc/krb5http.keytab  HTTP/<alfresco-host>@EXAMPLE.COM

(<alfresco-host> stands for the FQDN of the Alfresco host server.)


This will create two files named 'krb5cifs.keytab' and 'krb5http.keytab' on the host where kadmin was run. That means that if you used kadmin from the Alfresco host, then there is nothing more to do, but if you used kadmin or kadmin.local from the KDC host, then those two files must be moved to the Alfresco host:
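Before moving on, it is worth checking that the keytabs actually contain the expected principals (a quick check, assuming the files were created in /etc as above):

```
# klist  -k  /etc/krb5cifs.keytab
# klist  -k  /etc/krb5http.keytab
```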

# cd  /etc
# scp  krb5cifs.keytab  root@<alfresco-host>:/etc
enter password:
# scp  krb5http.keytab  root@<alfresco-host>:/etc
enter password:

Create or update the config file for the Java security. Be careful: this is the Java used by Alfresco:

# vi  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/java.login.config:
    Alfresco {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

    AlfrescoCIFS {
        com.sun.security.auth.module.Krb5LoginModule required
        ...;
    };

    AlfrescoHTTP {
        com.sun.security.auth.module.Krb5LoginModule required
        ...;
    };

    ShareHTTP {
        com.sun.security.auth.module.Krb5LoginModule required
        ...;
    }; {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

    other {
        com.sun.security.auth.module.Krb5LoginModule sufficient;
    };

Configure the default Java Security to use our custom configuration by adding a line at the end of the file:

# vi  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/
    login.config.url.1=file:${java.home}/lib/security/java.login.config

Finally, update the share-config-custom.xml file to contain the Kerberos configuration (the three screenshots are combined in a single file attached to this blog):

# vi /opt/alfresco/alfresco-4.2.c/tomcat/shared/classes/alfresco/web-extension/share-config-custom.xml



That could be enough to get Kerberos SSO working for Alfresco Explorer and Alfresco Share. I think there are some additional steps to get Alfresco CIFS working too, but I'm not sure. Moreover, it's possible that you get some strange exceptions that prevent Alfresco from authenticating your Kerberos ticket. That probably comes from the Java Cryptography Extension (JCE) missing on your Alfresco server. For that purpose, download the JCE corresponding to the Java version used by Alfresco (JCE6 or JCE7) and deploy it (I assume below that the JCE7 is under /opt):

# unzip  /opt/
# cd  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security
# cp  local_policy.jar  local_policy.jar.orig
# cp  US_export_policy.jar  US_export_policy.jar.orig
# cd  /opt/UnlimitedJCEPolicy
# cp  local_policy.jar  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/
# cp  US_export_policy.jar  /opt/alfresco/alfresco-4.2.c/java/jre/lib/security/

I hope I was clear enough in my explanations and I hope I was able to share (a little bit?) my passion for open source solutions!

Good luck with Kerberos and Alfresco!

Oracle Exadata - poor optimization for FIRST_ROWS

Wed, 2014-05-28 03:44

In a previous blog, I discussed the difference between rownum and row_number(), in particular the way they implicitly add a first_rows(n) hint for the optimizer. That reminded me that I forgot to blog about an issue I encountered which concerns both approaches. It was on an Exadata: a nice full table scan with SmartScan was taking a long time, and forcing an index access - with a very bad index - was better. The query had a very selective predicate (and that's why SmartScan should be very good here), but one that could not be used by the index access.

In order to explain this, I'll do the same query but without any predicates. It's on a simple TEST table with 1 million rows.

select /*+ gather_plan_statistics */ * from (
  select * from TEST order by n 
) where rownum <= 10


And here is the execution plan with execution statistics

| Id  | Operation                               | Name | E-Rows | A-Rows | Buffers | Reads  |
|   0 | SELECT STATEMENT                        |      |        |     10 |    2137 |   2135 |
|*  1 |  COUNT STOPKEY                          |      |        |     10 |    2137 |   2135 |
|   2 |   VIEW                                  |      |   1000K|     10 |    2137 |   2135 |
|*  3 |    SORT ORDER BY STOPKEY                |      |   1000K|     10 |    2137 |   2135 |
|*  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS| TEST |   1000K|   1000K|    2137 |   2135 |


It looks good. SmartScan is used (TABLE ACCESS STORAGE). We read 1 million rows, then sort them and keep only the top 10. But it takes longer than expected. Let's have a look at some SmartScan statistics:


NAME                                                                   VALUE
---------------------------------------------------------------- -----------
cell IO uncompressed bytes                                        17,596,416
cell blocks processed by cache layer                                   3,329
cell blocks processed by data layer                                    2,148
cell blocks processed by txn layer                                     3,329
cell num fast response sessions                                            1
cell num fast response sessions continuing to smart scan                   1
cell physical IO bytes eligible for predicate offload             17,498,112
cell physical IO interconnect bytes                               31,273,680
cell physical IO interconnect bytes returned by smart scan        15,848,144
cell scans                                                                 2
physical read bytes                                               17,506,304
physical read total bytes                                         32,923,648


Something is wrong here. We have 17MB eligible for SmartScan, which matches the 2135 block reads we see in the execution plan (I have the default 8k blocks).
But I've exchanged 31MB through the interconnect. What are those additional 15MB? I'm doing only SmartScan here, according to the execution plan.

Well, easy to check. The full table scan must read all blocks up to the high water mark.


select blocks,blocks*8192 BYTES from user_tables where table_name='TEST';

    BLOCKS      BYTES
---------- ----------
      4013   32882245


So my table has 4013 blocks and is 32MB. Not only is my execution plan above not optimal, it is wrong: it is just impossible to full scan my table with only 2137 block reads. I have to read 4013 blocks, which is more than 30MB.
So in the statistics above, the right values are 'physical read total bytes' and 'cell physical IO interconnect bytes'. But it seems that only 17MB of data has been read through SmartScan.
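To see where the missing reads hide, a quick back-of-the-envelope check (block counts taken from the plans above; pure arithmetic, not Oracle output):

```python
# Sanity-check the byte counts from the block counts in the plans.
BLOCK_SIZE = 8192          # default 8k block size, as stated above
fast_response_reads = 2135 # 'Reads' column of the first plan
table_blocks = 4013        # blocks below the high water mark (user_tables)

def mb(blocks):
    """Convert a number of 8k blocks to megabytes."""
    return blocks * BLOCK_SIZE / 1024 / 1024

print(f"first plan reads : {mb(fast_response_reads):.1f} MB")  # ~16.7 MB eligible for offload
print(f"full table size  : {mb(table_blocks):.1f} MB")         # ~31.4 MB, 'physical read total bytes'
print(f"missing from plan: {mb(table_blocks - fast_response_reads):.1f} MB")
```

The gap of roughly 15MB is exactly what the first execution plan fails to account for.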

There we have to know a little bit more about an Exadata optimization that is not very well documented. The clue was 'FIRST ROWS' in the execution plan, just after 'STORAGE'. If it's a special feature, we can check how to enable/disable it with an underscore parameter. Names and descriptions are in sys.x$ksppi:


SQL> select KSPPINM,KSPPDESC from sys.x$ksppi where KSPPDESC like '%first%rows%';

_kcfis_fast_response_enabled
Enable smart scan optimization for fast response (first rows)

_sort_elimination_cost_ratio
cost ratio for sort eimination under first_rows mode

_optimizer_rownum_pred_based_fkr
enable the use of first K rows due to rownum predicate

_optimizer_fkr_index_cost_bias
Optimizer index bias over FTS/IFFS under first K rows mode


From its description, _kcfis_fast_response_enabled is related to SmartScan and to 'first rows' (which we have in the execution plan and which comes from the rownum predicate).
So let's disable it:


alter session set "_kcfis_fast_response_enabled"=false;


and execute again, with execution plan and statistics.


| Id  | Operation                               | Name | E-Rows | A-Rows | Buffers | Reads  |
|   0 | SELECT STATEMENT                        |      |        |     10 |    4019 |   4017 |
|*  1 |  COUNT STOPKEY                          |      |        |     10 |    4019 |   4017 |
|   2 |   VIEW                                  |      |   1000K|     10 |    4019 |   4017 |
|*  3 |    SORT ORDER BY STOPKEY                |      |   1000K|     10 |    4019 |   4017 |
|*  4 |     TABLE ACCESS STORAGE FULL FIRST ROWS| TEST |   1000K|   1000K|    4019 |   4017 |


The first observation is that the execution is much faster. And here we see the 4000 blocks of my table that we have to read.


NAME                                                                                VALUE
---------------------------------------------------------------- ------------------------
cell IO uncompressed bytes                                                     33,120,256
cell blocks processed by cache layer                                                5,913
cell blocks processed by data layer                                                 4,043
cell blocks processed by txn layer                                                  5,913
cell physical IO bytes eligible for predicate offload                          32,915,456
cell physical IO interconnect bytes                                            29,833,624
cell physical IO interconnect bytes returned by smart scan                     29,825,432
cell scans                                                                              2
physical read bytes                                                            32,923,648
physical read total bytes                                                      32,923,648


All statistics are right now: 32MB eligible for SmartScan, 32MB returned by SmartScan.

So what happened?

When we use first_rows(10), either explicitly or coming from the rownum predicate, Oracle knows that we need only 10 rows. And Exadata has an optimization to avoid SmartScan for only a few rows, because SmartScan has an overhead to start. This is the 'Enable smart scan optimization for fast response (first rows)' feature, and you can see it in the 'cell num fast response sessions' statistic above. No SmartScan occurs when the result is expected to come quickly. However, if it takes longer than expected, it can switch to SmartScan later ('cell num fast response sessions continuing to smart scan'). But there are two problems with that.
First, the statistics are wrong. The statistics from the first phase (the non-SmartScan one) are just lost. We see it in the execution plan, where about 2000 reads are missing. And we see it also in 'physical read bytes', where about 15MB are missing.
Second, and worse, this optimization is not good at all in our case. No matter that we need only 10 rows: we have to read 1 million rows, because we have to sort them before filtering them. And 1 million rows is not a good candidate for the 'fast response (first rows)' optimization. Here, about 15MB have been read without SmartScan, and when I checked the wait events, they were all single block reads. The fastest machine has chosen the slowest way to get data :(

This is just one example of first rows issues. There are many others. In general, the first_rows_n optimization is good when it avoids sorting all the rows, by using an index which is maintained ordered. And by the way, if you see 'first rows' without an order by, then it's suspicious: who wants the top-n of a random row set?
In order to have better control over first rows optimizations, I prefer to avoid rownum and use the row_number() function or the 12c FETCH FIRST ROWS syntax. See the previous blog about that.
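For reference, the 12c row-limiting syntax mentioned above looks like this (a sketch on the same TEST table):

```sql
-- 12c row-limiting clause: same top-10, without a rownum filter
select *
from TEST
order by n
fetch first 10 rows only;
```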

Tempdb enhancements with SQL Server 2014

Sun, 2014-05-25 11:40

SQL Server 2014 is definitely designed for performance, and I will try to demonstrate it in this blog post. I like to talk about hidden performance features because they generally do not require any changes to applications, unlike in-memory tables for instance (aka hekaton tables).

Since SQL Server 2005, some improvements have been made for tempdb. Tempdb caching is one of them and reduces page allocation contention. Basically, to create a table SQL Server must first build the system catalog entries related to system pages. Then, SQL Server has to allocate an IAM page, find a mixed extent in an SGAM page to store data, and mark it in the PFS page (as a reminder, by default a mixed extent is chosen unless uniform extents are forced by using trace flag 1118). Finally, the allocation must be recorded in the system pages. When a table is dropped, SQL Server has to revert everything it has done to create the table. All of this implies locks and latches during the allocation process, the same ones used for creating and dropping a temporary table. However, in tempdb tables are created and dropped very quickly, and this can generate page allocation contention, especially on the PFS, SGAM and GAM system pages (the famous PAGELATCH_UP wait type against the concerned pages). The bottom line is that SQL Server can cache some of the metadata and page allocations of temporary objects for easier and faster reuse, with less contention.
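The allocation dance described above can be sketched as a toy model (an illustrative Python simplification with made-up structures, not SQL Server internals): an SGAM-like bitmap flags extents that still have free mixed pages, and a PFS-like array records which pages are allocated.

```python
# Toy model of the mixed-extent allocation lookup (hypothetical names,
# not SQL Server internals): "sgam" says which extents still have free
# mixed pages, "pfs" tracks per-page allocation inside each extent.

EXTENT_SIZE = 8  # 8 pages per extent, as in SQL Server

sgam = [True, False, True]                     # extent has free mixed pages?
pfs = [[False] * EXTENT_SIZE for _ in sgam]    # page allocated?

def allocate_mixed_page():
    # Scan the SGAM-like bitmap for an extent with a free mixed page.
    for ext, has_free in enumerate(sgam):
        if not has_free:
            continue
        for page, used in enumerate(pfs[ext]):
            if not used:
                pfs[ext][page] = True          # mark the page in "PFS"
                if all(pfs[ext]):              # extent now full:
                    sgam[ext] = False          # clear its "SGAM" bit
                return ext, page
    raise RuntimeError("no free mixed extent")

first = allocate_mixed_page()
print(first)   # (0, 0): first free page of the first free extent
```

Every CREATE/DROP of a temp table walks structures like these under latches, which is exactly the contention that tempdb caching tries to avoid.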


In addition, for a temporary table to be cached it must first be used inside a stored procedure, and some situations prevent this caching, such as:


  • Using named constraints
  • Using DDL after the temporary table creation
  • Creating a table in a different scope
  • Using a stored procedure with recompile option

By executing the following T-SQL script with SQLQueryStress we can easily show that temporary tables are not reused by SQL Server.

use [AdventureWorks2012]
go

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


I used 8 concurrent threads with 100 iterations during this test.




At the same time, I enabled the following perfmon counters:

  • Average latch wait time (ms)
  • Latch waits / sec
  • Temp tables creation rate / sec
  • Cache objects in Use

[min / avg / max values shown in the original results screenshot]
Now if I rewrite the same ad-hoc T-SQL statement as a stored procedure and then perform the same test, we can notice some speed improvement:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;




[Perfmon results screenshot: min / avg / max values for Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec and Cache objects in Use]

As expected, this improvement is due to the tempdb caching mechanism. We can notice here that SQL Server reuses cached objects ("Cache objects in Use" counter > 0), which are in fact the temporary table of the stored procedure. Using cached objects drastically decreases the temporary table creation rate (Temp tables creation rate / sec is equal to 0 here).

The cached objects themselves are visible by querying sys.tables in the tempdb context. For example, during the first test we can easily observe that SQL Server does not completely deallocate a temporary table used in a stored procedure. The relationship can be made through the object id column, which holds a negative value. While SQL Server is using a temporary table its name is #test, and when SQL Server is no longer using it (without deallocating the associated pages) the name is an 8-character hexadecimal string that maps to the object id value. #AF42A2AE is the hexadecimal representation of the #test temporary table with object id -1354587474.
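The mapping between the hexadecimal name and the negative object id is just the 32-bit two's-complement representation, which can be checked quickly (a Python check of the numbers quoted above):

```python
# The cached object name is the unsigned 32-bit hex form of the
# (negative) object_id seen in sys.tables.
object_id = -1354587474

# Interpret the signed 32-bit value as unsigned, then format it in hex.
hex_name = format(object_id & 0xFFFFFFFF, 'X')
assert hex_name == 'AF42A2AE'          # the #AF42A2AE cached object

# And back: undo the two's-complement wrap-around.
unsigned = int(hex_name, 16)
signed = unsigned - 0x100000000 if unsigned >= 0x80000000 else unsigned
assert signed == object_id
```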








Furthermore, we can notice several records in the above results because I used SQLQueryStress with 8 concurrent threads, which implies concurrent executions of the stored procedure with separate cached objects in tempdb. Only 4 records are shown here (the result is truncated), but we actually retrieved 8 records.

As I said earlier, DDL statements after the creation of the temporary table inhibit the ability of SQL Server to cache the temporary object and can decrease the overall performance of the stored procedure (we can ask here what counts as a DDL statement, because DROP TABLE #table is apparently not considered as such: it does not impact the tempdb caching mechanism). In my sample, SQL Server proposes to create the following index on the ProductID column to improve the query:


create nonclustered index idx_test_transaction_product_id on #test (
    ProductID
)

Go ahead, we trust SQL Server and we will add the creation of the index after the creation of the temporary table into the stored procedure:

use [AdventureWorks2012]
go

create procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int,
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

--create index for ProductID predicate
create nonclustered index idx_test_transaction_product_id on #test (
    ProductID
)

insert #test
select top 10000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


However, the result is not as good as we would expect ...




If we take a look at the perfmon counter values:

[Perfmon results screenshot: min / avg / max values for Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Temp tables for destruction and Cache objects in Use]

For this test I added a new perfmon counter, temp tables for destruction, which clearly indicates that the temporary tables will be destroyed by SQL Server because they cannot be cached in this case: the index creation DDL prevents the tempdb caching mechanism.

Here comes a new SQL Server 2014 feature that introduces a new syntax for declaring nonclustered indexes directly in the CREATE TABLE statement, which can be a good workaround for the preceding problem.


alter procedure [dbo].[sp_test_tempdb]
as

create table #test (
    TransactionID bigint,
    ProductID int index idx_test_transaction_product_id, --< index created "on the fly"
    TransactionDate datetime,
    Quantity int,
    ActualCost money
)

insert #test
select top 1000000 *
from AdventureWorks2012.dbo.bigTransactionHistory

select
    ProductID,
    sum(Quantity * ActualCost) as total_cost
from #test
where ProductID = '16004'
group by ProductID

drop table #test;


After running the test we can notice that the temp tables creation rate and temp tables for destruction counter values are again equal to zero. SQL Server reused the temporary table during the test, as shown by the "Cache objects in Use" counter.


[Perfmon results screenshot: min / avg / max values for Average latch wait time (ms), Latch waits / sec, Temp tables creation rate / sec, Temp tables for destruction and Cache objects in Use]

However, even though this new tip lets us keep the tempdb caching mechanism with SQL Server 2014, the above result is contrasted by the total execution time, as shown in the following picture:




The global execution time is larger than in the first test with the stored procedure and the temporary table without any nonclustered index (02:44 vs 00:21 in my case). This is because inserting data into a table with a nonclustered index can take more time than into a table without any indexes. In a real production environment, however, we will probably encounter situations where the cost of inserting data into an indexed table is small compared to the gain made by the subsequent reads. If you have some examples, please feel free to share them with us.

Another interesting feature, present for several versions now, is the concept of eager writes, which prevents flooding the buffer pool with newly created pages from bulk activities that need to be written to disk. Eager writing helps reduce the pressure on the well-known lazy writer and checkpoint background processes, and increases IO performance by gathering pages before writing them to disk. Basically, SQL Server tracks these pages in a circular list in memory. When the list is full, old entries are removed, being written to disk only if they are still dirty.
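The circular-list behaviour can be sketched as follows (a toy Python model under my own assumptions, not the actual SQL Server implementation): tracked pages are evicted oldest first, and only the ones still dirty get written to disk.

```python
from collections import deque

class EagerWriteList:
    """Toy model of the eager-write list: newly created bulk pages are
    tracked; when the list is full, the oldest entry is evicted and
    written to disk only if it is still dirty."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()      # (page_id, dirty) pairs, oldest first
        self.disk_writes = []

    def track(self, page_id, dirty=True):
        if len(self.entries) == self.capacity:
            old_page, old_dirty = self.entries.popleft()
            if old_dirty:           # flush only pages that are still dirty
                self.disk_writes.append(old_page)
        self.entries.append((page_id, dirty))

    def clean(self, page_id):
        # e.g. another process (lazy writer) already flushed the page
        self.entries = deque((p, d and p != page_id)
                             for p, d in self.entries)

lst = EagerWriteList(capacity=4)
for p in range(4):
    lst.track(p)
lst.clean(1)            # page 1 was flushed elsewhere in the meantime
lst.track(4)            # evicts page 0: still dirty, written to disk
lst.track(5)            # evicts page 1: already clean, no write
print(lst.disk_writes)  # [0]
```

Only page 0 causes a disk write: page 1 was evicted too, but it was no longer dirty.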

Let me show you with the following T-SQL script on the SQL Server 2012 instance. I used trace flag 3917 to show the eager write activity (thanks to Bob Dorr for this tip).


use AdventureWorks2012;
go

-- create procedure sp_test_tempdb_2
-- bulk activity by using select into #table
CREATE procedure sp_test_tempdb_2
as

select
    bth.*,
    p.Name AS ProductName,
    p.Color
into #test
from AdventureWorks2012.dbo.bigTransactionHistory as bth
    join AdventureWorks2012.dbo.bigProduct as p
        on bth.ProductID = p.ProductID
where p.Color in ('White')
    and p.Size = 'M'
option (maxdop 1);

select
    TransactionDate,
    ProductID,
    ProductName,
    Quantity
    --Quantity * ActualCost AS total_individual_sale
from (
    select
        ROW_NUMBER() OVER (PARTITION BY TransactionDate ORDER BY Quantity DESC) AS num,
        *
    from #test
) transaction_production_sales_top_ten
where num <= 10
option (maxdop 1);

drop table #test
go

-- use trace flag 3917 to show eager write activity (be careful, the output may be verbose)
dbcc traceon(3917);
dbcc traceon(3605);
go

-- cycle the errorlog for an easier read afterwards
exec sp_cycle_errorlog;
go

-- execute the stored procedure dbo.sp_test_tempdb_2
exec dbo.sp_test_tempdb_2;
go

-- read the error log file
exec xp_readerrorlog;

Below a sample of the SQL Server error log:




We can notice that SQL Server wrote up to 32 contiguous dirty pages to disk in my test.

Even if this process is optimized to write pages to disk efficiently, we still have IO activity. SQL Server 2014 enhances this process by relaxing the need to flush these pages to disk as quickly as in older versions. SQL Server recognizes the bulk activity, and the concerned pages are loaded, queried and released without any disk flush activity.

The same test performed on the SQL Server 2014 environment gives the following result:




The eager write process was not triggered this time. So let's compare with a simulated workload, using ostress this time. Ostress is a stress tool provided by the RML Utilities. I used ostress with 4 threads and 1000 iterations each. SQLQueryStress generated a bunch of ASYNC_NETWORK_IO waits during my tests, which could distort the final result.

So, I used the following script for both environments (SQL Server 2012 and SQL Server 2014):

"C:\Program Files\Microsoft Corporation\RMLUtils\ostress.exe" -Slocalhost -dAdventureWorks2012 -Q"exec dbo.sp_test_tempdb_2" -n4 -r1000 -N -q


SQL Server 2012




… the corresponding tempdb file IO stats:


SELECT
    d.name AS database_name,
    f.name AS [file_name],
    f.physical_name,
    f.type_desc,
    vf.num_of_reads,
    vf.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vf
INNER JOIN sys.databases AS d ON d.database_id = vf.database_id
INNER JOIN sys.master_files AS f ON f.file_id = vf.file_id
    AND f.database_id = vf.database_id
WHERE f.database_id = db_id('tempdb')




… and the corresponding wait types:


[Wait statistics screenshot: total wait time, wait count and average wait time per wait type]

SQL Server 2014










[Wait statistics screenshot: total wait time, wait count and average wait time per wait type]

Oracle 12c extended datatypes better than CLOB?

Fri, 2014-05-23 08:15

12c has introduced character strings that can go above 4000 bytes. In previous versions, only in PL/SQL were we allowed VARCHAR2 up to 32k. In SQL, the VARCHAR2 datatype was limited to 4000 bytes and CHAR to 2000. That became a bit small, especially when a lot of applications needed to store unicode characters.

From 12c we can have SQL datatypes of up to 32k bytes for VARCHAR2, CHAR and RAW. It's not allowed by default: we need to set max_string_size=extended and recompile the views with utl32k. A nice improvement. But is it a good idea to use this new feature when we already have CLOB for large character strings? The New Features documentation is clear about that: extended datatypes have been introduced for compatibility with other databases - not to replace existing features.

I will not go into the details of how they are stored; information about that is available elsewhere. See for example @ludodba's recent blog post Where are Extended Data Types stored?. Extended datatypes are stored as chained rows if you just extend an existing table, or as a LOB if you define them on a new table. Chained rows are clearly not a good option, so, given that you (re)create the tables, their storage is similar to CLOB.

But there is something that I don't like about LOBs: they are fetched row by row. When you select a row you get only the handle, and you get the CLOB content later, when you access it through the handle. Did you ever try to datapump a table with LOBs through a network_link? A huge number of roundtrips and very bad performance. It's one of the rare cases where doing expdp/impdp with a dumpfile is better. For very large objects you will do several roundtrips anyway, so this is not an issue there. But for character strings of just a few kilobytes, storing them as LOBs introduces an ineffective overhead.

Let's compare the fetch behaviour with those new extended datatypes. For my demo, I'll use a table with a clob column "C" and an extended varchar2 column "E", and insert same data into both columns.


SQL> create table TEST ( C clob , E varchar2(9000) );
Table created.
SQL> insert into TEST select lpad(rownum,9000,'x'),lpad(rownum,9000,'x') from dual connect by level <= 10;
10 rows created.


Here is the autotrace when reading the CLOB from 10 rows:


SQL> set autotrace trace stat
SQL> select C from TEST;

 10 rows selected.

           2  recursive calls
           0  db block gets
          27  consistent gets
          20  physical reads
           0  redo size
       93936  bytes sent via SQL*Net to client
        2722  bytes received via SQL*Net from client
          22  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed


For only 10 rows I've made 22 roundtrips. This is the problem with LOBs: too many roundtrips. Well, there is another problem that I'll not show here: you can fetch the LOB a long time later, even when the cursor is closed. It does a consistent read, so you have to set your undo_retention accordingly.

Now here is the same data from the extended varchar2 column:


SQL> select E from TEST;

 10 rows selected.

           1  recursive calls
           0  db block gets
          56  consistent gets
           0  physical reads
           0  redo size
       90501  bytes sent via SQL*Net to client
         492  bytes received via SQL*Net from client
           2  SQL*Net roundtrips to/from client
           0  sorts (memory)
           0  sorts (disk)
          10  rows processed


Here I got the same volume (10 times 9000 characters), but this time I did only 2 roundtrips.
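A rough accounting of the two autotrace runs (my interpretation of the observed numbers, not an official formula) is that the extended column pays a fixed two roundtrips for the whole fetch, while each CLOB locator adds its own LOB-read roundtrips on top:

```python
rows = 10

# Extended VARCHAR2: values come back inline with the fetch calls
# (2 roundtrips observed: the fetch batches plus the final empty fetch).
inline_roundtrips = 2

# CLOB: the fetch returns only a locator; reading each value through
# the locator then costs extra roundtrips. Assuming two extra
# roundtrips per locator reproduces the observed autotrace figure.
per_lob_roundtrips = 2
lob_roundtrips = inline_roundtrips + rows * per_lob_roundtrips

assert lob_roundtrips == 22
```

With this model the CLOB overhead grows linearly with the number of rows, which is exactly why it hurts for many small values and hardly matters for a few very large ones.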

Let's go further and trace with sql_trace. LOB calls are instrumented since 11g so we can see them from the trace file:


PARSING IN CURSOR #139894737850360 len=18 dep=0 uid=103 oct=3 lid=103 tim=8952647276 hv=844696927 ad='77e1a518' sqlid='132sh6wt5k3az'
select C from TEST
PARSE #139894737850360:c=0,e=82,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895260
EXEC #139894737850360:c=0,e=69,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=897416
FETCH #139894737850360:c=0,e=97,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=848200
LOBREAD: c=999,e=10893,p=2,cr=1,cu=0,tim=8952659696
FETCH #139894737850360:c=0,e=43,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=860372
LOBREAD: c=1000,e=1614,p=2,cr=1,cu=0,tim=8952662447
FETCH #139894737850360:c=0,e=47,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=863495
LOBREAD: c=2000,e=657,p=2,cr=1,cu=0,tim=8952664615
FETCH #139894737850360:c=0,e=40,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=865575
LOBREAD: c=0,e=706,p=2,cr=1,cu=0,tim=8952666808
FETCH #139894737850360:c=0,e=37,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=867552
LOBREAD: c=1000,e=949,p=2,cr=1,cu=0,tim=8952669193
FETCH #139894737850360:c=0,e=92,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=869825
LOBREAD: c=0,e=844,p=2,cr=1,cu=0,tim=8952671276
FETCH #139894737850360:c=0,e=68,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=872168
LOBREAD: c=1000,e=756,p=2,cr=1,cu=0,tim=8952673521
FETCH #139894737850360:c=0,e=44,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=874712
LOBREAD: c=999,e=914,p=2,cr=1,cu=0,tim=8952676180
FETCH #139894737850360:c=0,e=64,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=877352
LOBREAD: c=0,e=931,p=2,cr=1,cu=0,tim=8952678875
FETCH #139894737850360:c=0,e=52,p=0,cr=1,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=879774
LOBREAD: c=1000,e=795,p=2,cr=1,cu=0,tim=8952681136
FETCH #139894737850360:c=1000,e=7,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=891850
STAT #139894737850360 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=16 pr=0 pw=0 time=60 us cost=5 size=20980 card=10)'
CLOSE #139894737850360:c=0,e=31,dep=0,type=0,tim=8952684289


And the sql_trace with the same data from the extended datatype.


PARSING IN CURSOR #139895028091224 len=18 dep=0 uid=103 oct=3 lid=103 tim=8954178349 hv=1829009117 ad='7b48ba08' sqlid='4kq232tqh8xqx'
select E from TEST
PARSE #139895028091224:c=0,e=64,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895345
EXEC #139895028091224:c=0,e=36,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1357081020,tim=895435
FETCH #139895028091224:c=1000,e=896,p=0,cr=8,cu=0,mis=0,r=1,dep=0,og=1,plh=1357081020,tim=899458
FETCH #139895028091224:c=4000,e=3660,p=0,cr=48,cu=0,mis=0,r=9,dep=0,og=1,plh=1357081020,tim=89535
STAT #139895028091224 id=1 cnt=10 pid=0 pos=1 obj=100085 op='TABLE ACCESS FULL TEST (cr=56 pr=0 pw=0 time=670 us cost=5 size=20010 card=10)'
CLOSE #139895028091224:c=0,e=13,dep=0,type=0,tim=8954214996


So there is one big advantage over CLOBs: the column values are returned without additional roundtrips.

That means that if you have character strings that may be between 4k and 32k, extended datatypes can be a good option. It's a new feature, however, and designed for another goal (easy migration from other databases). So it's something to test carefully, and the tests must cover all your infrastructure components (backups, exports, replication, etc).