Huge pages patch and bleeding edge rant

From: Mladen Gogala <gogala.mladen_at_gmail.com>
Date: Sat, 28 Apr 2012 22:53:00 +0000 (UTC)
Message-ID: <pan.2012.04.28.22.53.00_at_gmail.com>



Linux kernel 2.6.38 introduced a novelty called "transparent huge pages". The goal was to blur the difference between the huge 2M pages and normal 4K pages. The patch is described here: http://lwn.net/Articles/359158

Of course, current "production grade" kernels are version 2.6.32, in both the RH and OL distributions, which means that they do not contain this patch.

[root@rac1 hugepages]# uname -r
2.6.32-300.3.1.el6uek.i686
[root@rac1 hugepages]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.2 (Santiago)
[root@rac1 hugepages]# cat /etc/oracle-release
Oracle Linux Server release 6.2
[root@rac1 hugepages]#

However, bleeding edge Linux distributions such as Fedora 16 use a much newer kernel version, which does have this patch. Oracle RDBMS will use large pages as usual:

SQL*Plus: Release 11.2.0.3.0 Production on Sat Apr 28 18:07:56 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 2137886720 bytes
Fixed Size                  2230072 bytes
Variable Size             469764296 bytes
Database Buffers         1660944384 bytes
Redo Buffers                4947968 bytes
Database mounted.
Database opened.
SQL> show parameter use_large

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
use_large_pages                      string      ONLY

[root@medo transparent_hugepage]# grep -i huge /proc/meminfo
AnonHugePages: 284672 kB
HugePages_Total:    4096
HugePages_Free:     3072
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
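
The arithmetic behind these counters can be sketched in a few lines of Python. This is only an illustration: the sample text is copied from the output above, and the function names `parse_meminfo` and `hugepage_usage` are made up for this example. On a live system the text would be read from /proc/meminfo instead.

```python
# Sample copied from the /proc/meminfo output above; on a live system,
# read the text from /proc/meminfo instead.
SAMPLE = """\
AnonHugePages: 284672 kB
HugePages_Total:    4096
HugePages_Free:     3072
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""


def parse_meminfo(text):
    """Return a dict mapping each meminfo field name to its integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def hugepage_usage(text):
    """Return (pages_in_use, bytes_in_use) for the preallocated huge page pool."""
    info = parse_meminfo(text)
    used = info["HugePages_Total"] - info["HugePages_Free"]
    # Hugepagesize is reported in kB, so convert to bytes.
    return used, used * info["Hugepagesize"] * 1024


if __name__ == "__main__":
    pages, size = hugepage_usage(SAMPLE)
    print(f"{pages} huge pages in use, {size / 2**30:.1f} GiB")
```

With the sample above, 4096 total minus 3072 free gives 1024 pages of 2048 kB each, which is 2 GiB.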

So, a 2GB SGA and 1024 large pages consumed (4096 total minus 3072 free, at 2MB each). That's expected. However, I expected this patch to help with VirtualBox. It doesn't do that. After some digging, I figured out that a program must be written in a special way to take advantage of this feature. VirtualBox, apparently, still cannot do that:

[root@medo 2996]# ps -fp 2996
UID        PID  PPID  C STIME TTY          TIME CMD
mgogala   2996  2979 26 18:08 ?        00:04:34 /usr/lib/virtualbox/VirtualBox -
[root@medo 2996]# ps -F -p 2996
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
mgogala   2996  2979 26 568521 691156 1 18:08 ?        00:04:34 /usr/lib/virtual
[root@medo 2996]# grep -i huge /proc/meminfo
AnonHugePages: 286720 kB
HugePages_Total:    4096
HugePages_Free:     4096
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
[root@medo 2996]#
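
A note on reading these numbers: the HugePages_* counters describe the preallocated huge page pool, while memory backed by transparent huge pages is reported in the AnonHugePages field. On kernels with THP support, the same field also appears per mapping in /proc/<pid>/smaps, so a per-process total can be obtained by summing it. The following is a minimal sketch under that assumption; the function names `anon_huge_kb` and `thp_kb_for_pid` are invented for the example.

```python
import re


def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages values (in kB) over all mappings in smaps text."""
    matches = re.findall(r"^AnonHugePages:\s+(\d+) kB", smaps_text, re.M)
    return sum(int(kb) for kb in matches)


def thp_kb_for_pid(pid):
    """Return the THP-backed memory of a process in kB, read from /proc/<pid>/smaps."""
    with open(f"/proc/{pid}/smaps") as f:
        return anon_huge_kb(f.read())


if __name__ == "__main__":
    try:
        # "self" refers to the current process; substitute a PID such as 2996.
        print(thp_kb_for_pid("self"), "kB backed by transparent huge pages")
    except OSError as exc:
        print("could not read smaps:", exc)
```

Reading the smaps file of another user's process normally requires root.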

The output of the second ps command shows that the VirtualBox process, PID=2996, is using a hefty 690MB of memory (RSS column, expressed in KB). However, not a single huge page was consumed. If you read the THP description at the link given earlier, you will also see that the patch makes huge pages swappable, which is not what I want. THP can be disabled like this:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Of course, in order to do that, the kernel must be version 2.6.38 or newer:
[root@medo transparent_hugepage]# uname -r
3.3.2-6.fc16.x86_64
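
The "special way" of writing a program most likely refers to the madvise(2) call with the MADV_HUGEPAGE hint, which is how a process opts a memory region into THP when the mode is set to "madvise". Below is a minimal Python sketch of both pieces: picking the active mode out of the bracketed text, and requesting huge pages for an anonymous mapping. It assumes Python 3.8 or later on Linux, and the helper names `active_thp_mode` and `map_with_hugepage_hint` are hypothetical.

```python
import mmap

# Path shown in the post; present only on kernels built with THP support.
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed (active) mode from text like 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def map_with_hugepage_hint(size):
    """Create an anonymous mapping and ask the kernel to back it with huge pages."""
    region = mmap.mmap(-1, size)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # missing on some platforms
    if advice is not None:
        try:
            region.madvise(advice)
        except OSError:
            pass  # the kernel lacks THP support or refused the hint
    return region


if __name__ == "__main__":
    try:
        with open(THP_ENABLED) as f:
            print("THP mode:", active_thp_mode(f.read()))
    except OSError:
        print("THP is not available on this kernel")
    buf = map_with_hugepage_hint(8 * 1024 * 1024)
    buf.close()
```

The hint is only a request: the kernel remains free to back the region with ordinary 4K pages.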

When VirtualBox is patched to utilize transparent huge pages, turning this feature on will make sense. A database instance can use huge pages even without transparent huge page support. Also, for serious work I recommend XFS:

[root@medo vm]# mount | grep xfs
/dev/sdb1 on /misc type xfs (rw,relatime,attr2,noquota)
/dev/sdb2 on /data type xfs (rw,relatime,attr2,noquota)
/dev/mapper/vg_medo-lv_home on /home type xfs (rw,relatime,attr2,noquota)

XFS has a defragmenter, supports direct I/O and async I/O, and my experiences with it so far are great. Btrfs will really have to be something special in order to beat it. It is possible to set "real time IO priority" for database files on an XFS file system (xfs_io command), which will make any I/O request against the database files execute before any other I/O requests pending against that device. Turning that flag on for redo logs and the system tablespace makes sense.

-- 
http://mgogala.byethost5.com
Received on Sat Apr 28 2012 - 17:53:00 CDT
