The ill effects of NOT using direct I/O

From: Gaja Krishna Vaidyanatha <gajav_at_yahoo.com>
Date: Sun, 13 Feb 2011 21:56:37 -0800 (PST)
Message-ID: <952412.92497.qm_at_web83602.mail.sp1.yahoo.com>



Hi Yong,

Nice write-up. I have started a new thread in an effort to focus on the topic of discussion. Yes, indeed, on Linux with ext3 filesystems, setting FILESYSTEMIO_OPTIONS to SETALL and not doing anything special at the filesystem mount is enough to get direct I/O working. But as I mentioned in a prior note (in the previous thread), the same was not true for jfs or vxfs the last time I checked (a few months ago). You also bring a very relevant point to the discussion: the distinction between the page cache and the OS buffer cache. For the purposes of our discussion, let us just say that the "filesystem buffer cache" includes both the page cache and the OS buffer cache.
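
For illustration, here is a minimal sketch (my own, not anything from this thread) of how one might probe whether a given filesystem honors direct I/O, assuming a Linux system where open(2) with O_DIRECT fails with EINVAL on filesystems that do not support it:

  /* o_direct_probe.c - check whether the filesystem holding a file
   * accepts O_DIRECT, the flag Oracle uses when FILESYSTEMIO_OPTIONS
   * is SETALL or DIRECTIO. Build: gcc -o o_direct_probe o_direct_probe.c
   */
  #define _GNU_SOURCE             /* exposes O_DIRECT */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s <file-on-target-filesystem>\n", argv[0]);
          return 1;
      }
      int fd = open(argv[1], O_RDONLY | O_DIRECT);
      if (fd < 0) {
          /* EINVAL typically means "no O_DIRECT support here" */
          fprintf(stderr, "open(%s, O_DIRECT) failed: %s\n",
                  argv[1], strerror(errno));
          return 1;
      }
      printf("%s: O_DIRECT open succeeded\n", argv[1]);
      close(fd);
      return 0;
  }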

Also, the scenario you have depicted in your last paragraph is interesting. My question to you - have you observed what you are suggesting in your write-up on a production system? The reason I bring this up is that, in my experience, NOT having direct I/O (with or without HugePages) has had crippling effects on the entire system, in this case on RH Linux. Similar behavior has also been observed and documented on other operating systems.

The following is pretty much a standard sequence of what has been observed on RH Linux without direct I/O on very busy Oracle systems (100+ active sessions at any given time):

  1. As the load on the system increases, more sessions run more SQL and eventually perform more I/O, thus increasing total I/O consumption on the system.
  2. This causes the OS to allocate more memory to the filesystem buffer cache, in an effort to buffer future I/O requests (which may or may not occur). The classic "double buffering" problem of data in the Oracle database buffer cache and the filesystem buffer cache has now begun.
  3. An increase in the memory footprint of the filesystem buffer cache causes an increase in the activity of the paging daemon - the OS tries to take preemptive action against a potential memory bottleneck.
  4. As the I/O pressure increases, the paging daemon starts to work even more aggressively (significantly increasing the number of pageins and pageouts). Eventually it goes beyond raising its paging rates and begins swapping out portions of the SGA (even if it is locked) as the demand for memory grows. A system-wide Memory Bottleneck has now become a reality.
  5. This further increases the CPU consumption on the system (covered below in #6) and also starts to create an artificial I/O Bottleneck (db file sequential read and db file scattered read times increase dramatically, as Oracle database blocks need to be brought in from the OS's swap partitions). OUCH!!!
  6. This increase in paging daemon activity, combined with the increase in the filesystem buffer cache footprint, increases the overall CPU utilization of the OS (verified with continuous "sar -u" or vmstat output; see the /proc/vmstat sketch after this list). "sar -u" breaks down CPU consumption into 4 buckets - %usr, %sys, %wio, %idle - and %sys will consistently and continuously increase.
  7. This starts to create an artificial CPU Bottleneck, as %sys in many cases consumes a very significant percentage of total CPU on the system (I have observed %sys consuming 40%-50% of total CPU).
  8. Imagine what havoc starts to unfold when the OS consumes 50% of the system's CPU resources - the CPU Bottleneck becomes a reality. Nobody plans for the OS to consume 50% of configured CPU resources during system sizing and capacity planning efforts.
  9. Pretty soon, the CPU Bottleneck causes Oracle sessions to start timing out.
  10. Depending on the application vendor and their connection pool mechanics, the application then starts to retry connections to the database, which starts a connection storm against the database. In some cases, the database is completely hung up with various session/login-related latch issues.
  11. Sooner rather than later, the database comes to a grinding halt (with or without a crash), and the OS may eventually implode with a kernel panic. Even if this does not occur, you pretty much have an unusable Oracle database at this stage of the game.
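
To watch steps 3 through 6 unfold in real time, the paging counters in /proc/vmstat tell the story. The following is a minimal sketch (my own illustration, not a tool from this thread), assuming a 2.6-era Linux kernel where /proc/vmstat exposes pgpgin/pgpgout (paging) and pswpin/pswpout (swapping) as cumulative counters:

  /* pagewatch.c - print per-second deltas of the paging/swapping
   * counters discussed in steps 3, 4 and 6 above. A sustained rise in
   * pswpin/pswpout alongside a growing %sys in "sar -u" is the pattern
   * described. Build: gcc -o pagewatch pagewatch.c
   */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Read one named counter out of /proc/vmstat; -1 if not found. */
  static long long read_counter(const char *name)
  {
      FILE *f = fopen("/proc/vmstat", "r");
      if (!f)
          return -1;
      char key[64];
      long long val, found = -1;
      while (fscanf(f, "%63s %lld", key, &val) == 2) {
          if (strcmp(key, name) == 0) {
              found = val;
              break;
          }
      }
      fclose(f);
      return found;
  }

  int main(void)
  {
      const char *counters[] = { "pgpgin", "pgpgout", "pswpin", "pswpout" };
      long long prev[4], cur[4];
      int i;

      for (i = 0; i < 4; i++)
          prev[i] = read_counter(counters[i]);
      printf("%10s %10s %10s %10s (per second)\n",
             counters[0], counters[1], counters[2], counters[3]);
      for (;;) {
          sleep(1);
          for (i = 0; i < 4; i++) {
              cur[i] = read_counter(counters[i]);
              printf("%10lld ", cur[i] - prev[i]);
              prev[i] = cur[i];
          }
          printf("\n");
      }
  }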

The above is pretty much a "poster child storyline" for very busy Oracle systems with hundreds of concurrent sessions that do not use direct I/O. Again, this is my experience, and I'm curious whether you or others on this list have observed anything to the contrary. If so, please provide the relevant details, as I would love to learn about alternative system behavioral patterns without direct I/O. It is something that interests me quite a bit. Thanks for taking the time to read this long note :)

Cheers,

Gaja
Gaja Krishna Vaidyanatha,
Founder/Principal, DBPerfMan LLC
http://www.dbperfman.com
Phone - 001-(650)-743-6060
Co-author: Oracle Insights: Tales of the Oak Table - http://www.apress.com/book/bookDisplay.html?bID=314
Co-author: Oracle Performance Tuning 101 - http://www.amazon.com/gp/reader/0072131454/ref=sib_dp_pt/102-6130796-4625766



From: Yong Huang <yong321_at_yahoo.com>
To: Freek.DHooge_at_uptime.be
Cc: oracle-l_at_freelists.org
Sent: Sun, February 13, 2011 10:41:30 AM
Subject: RE: How much RAM is to much

> But I always thought that setting the filesystemio_options parameter
> to directIO or setall caused the processes to open the files with
> the O_DIRECT flag.

I can confirm what Freek said, on my Oracle 10.2.0.4 on Red Hat 5 (kernel 2.6.18-194.17.1.el5, architecture x86_64), if these details matter. The filesystem is ext3, mounted with no options (i.e. everything default). When the process opens a datafile for the first time, I can see in strace that the flags passed to the open() call are O_RDWR|O_SYNC|O_DIRECT.
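
To reproduce that open() call outside Oracle, here is a minimal sketch (my own illustration, not from the original post) using the same flag combination. Note that O_DIRECT also requires the buffer, offset and length to be suitably aligned; I assume 4096-byte alignment here, which is safe on ext3:

  /* direct_read.c - open a file with the flags seen in strace
   * (O_RDWR|O_SYNC|O_DIRECT) and read one block, bypassing the page
   * cache. Build: gcc -o direct_read direct_read.c
   */
  #define _GNU_SOURCE             /* exposes O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define BLOCK 4096              /* assumed alignment/IO size */

  int main(int argc, char *argv[])
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
          return 1;
      }
      int fd = open(argv[1], O_RDWR | O_SYNC | O_DIRECT);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      void *buf;
      if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {  /* aligned buffer */
          close(fd);
          return 1;
      }
      ssize_t n = read(fd, buf, BLOCK);   /* goes straight to disk */
      if (n < 0)
          perror("read");
      else
          printf("read %zd bytes directly from %s\n", n, argv[1]);
      free(buf);
      close(fd);
      return 0;
  }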

By the way, ever since the page cache was introduced to the file system, we should make a distinction between the buffer cache and the page cache. What's interesting here is the page cache, because the buffer cache caches file system metadata. Well, in Linux, I believe the buffer cache physically lives inside the page cache, so you may call those buffers "buffer pages". A confusing name indeed. In short, as soon as you have the word "buffer" in it, people who know the Linux kernel (not me!) will think you're referring to the cache for metadata, not for file content - and file content may well be what you meant.
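
The two caches are visible side by side in /proc/meminfo: "Buffers" is the (metadata) buffer cache and "Cached" is the page cache holding file content. A minimal sketch (my own illustration) that prints just those two lines:

  /* cachesizes.c - show the buffer cache vs. page cache split from
   * /proc/meminfo. Build: gcc -o cachesizes cachesizes.c
   */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      FILE *f = fopen("/proc/meminfo", "r");
      if (!f) {
          perror("/proc/meminfo");
          return 1;
      }
      char line[128];
      while (fgets(line, sizeof line, f))
          if (strncmp(line, "Buffers:", 8) == 0 ||
              strncmp(line, "Cached:", 7) == 0)
              fputs(line, stdout);
      fclose(f);
      return 0;
  }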

Last point. While Oracle manages data in memory through a better mechanism, because it knows the data better, leaving the caching decision to the file system is not that dreadful. In some cases, it may even be a big help. So instead of assigning most memory to the Oracle buffer cache and using direct I/O, someone could run a comprehensive test to see whether locking a large portion of RAM for the SGA (through HugePages etc.) and NOT using direct I/O is actually better in some cases. If a table slightly bigger than 2% of the buffer cache is frequently full-scanned, and you didn't do anything special about it (a KEEP pool, for example), I think it's better to let the file system cache its content. With a 4 GB buffer cache, for instance, 2% is about 82 MB, so a frequently full-scanned 100 MB table fits this description.

Yong Huang       

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Feb 13 2011 - 23:56:37 CST
