analyzing, visualizing, understanding and rating I/O latency

From: kyle Hailey <kylelf_at_gmail.com>
Date: Mon, 30 Jul 2012 11:47:34 -0700
Message-ID: <CADsdiQgGZZ0LdO5qNER21tfksHkYeT-Jm6XpvaTkewCMurADog_at_mail.gmail.com>



Two questions I'm interested in answering or getting opinions on are:
  1. What is considered great, good, OK, and bad I/O, and why? What do you use: latency values, throughput values, spread (i.e. something like standard deviation), percentile latency?
  2. Is average latency good enough, and if not, why not? How would one use a latency histogram or 95th-99.99th percentile latency to judge an I/O subsystem?

For question 1, it seems latency should always be associated with a required throughput (MB/s, IOPS).
For question 2, it seems average is good enough. I'm curious why and how someone would use the histograms and/or the percentile latencies. For me, the histograms are useful for identifying where the I/O is coming from. If there is significant I/O in the sub-100us range, it's probably host-side caching. If there is too much under 1ms, then it's probably SAN caching. The tests below use fio with raw devices, so there should normally be little to no host caching, and the I/O is random over short runs, so there should be little SAN caching; thus the histograms for me are mainly a sanity check that the I/Os are coming from where they are expected.
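As an illustration of that sanity check, here is a minimal R sketch (not part of the fio_scripts repo; the latency sample is made up) that buckets single-block read latencies by their likely source and computes the average, spread and percentile values from the questions above:

# Made-up latency sample in microseconds: mostly spindle-speed reads plus
# a handful of suspiciously fast ones that would point at host caching.
lat_us <- c(rexp(10000, rate = 1/7000), rexp(50, rate = 1/80))

# Bucket each I/O by the latency ranges discussed above.
buckets <- cut(lat_us,
               breaks = c(0, 100, 1000, 10000, Inf),
               labels = c("< 100us (host cache?)",
                          "100us - 1ms (SAN cache?)",
                          "1ms - 10ms (spindle)",
                          "> 10ms (queueing/contention)"))
print(round(100 * prop.table(table(buckets)), 2))   # % of I/Os per bucket

# Average, spread and tail percentiles, converted to milliseconds.
stats <- c(mean  = mean(lat_us),
           stdev = sd(lat_us),
           quantile(lat_us, c(0.95, 0.99, 0.9999))) / 1000
print(round(stats, 2))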

I've been testing out the I/O benchmark tool "fio" and have found it more flexible than other popular I/O benchmark tools such as IOzone, Bonnie++, and Orion; fio also has a more active user community.

To make it easy to run fio tests, I've written a wrapper script that runs through a series of tests.
To view the output, I've written a wrapper script that extracts and formats the results of multiple tests.
To try to understand the data, I've written some graphing routines in R.
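As a rough idea of the kind of plot such graphing routines can produce, here is a hedged sketch (not the actual code from the repo; the data frame is hypothetical, standing in for values parsed out of fio output):

# Hypothetical per-test results: average and tail latency as the number of
# concurrent fio jobs ("users") increases for a fixed small random-read size.
results <- data.frame(users  = c(1, 2, 4, 8, 16),
                      avg_ms = c(6.1, 6.4, 7.0, 8.3, 11.2),
                      p95_ms = c(9.8, 10.5, 12.1, 15.7, 24.0),
                      p99_ms = c(14.2, 15.9, 19.4, 28.3, 51.6))

# Plot average vs. percentile latency as concurrency rises.
matplot(results$users, results[, c("avg_ms", "p95_ms", "p99_ms")],
        type = "b", pch = 1:3, lty = 1, col = 1:3, log = "x",
        xlab = "concurrent users", ylab = "latency (ms)",
        main = "random read latency vs. concurrency")
legend("topleft", legend = c("average", "95th pct", "99th pct"),
       pch = 1:3, lty = 1, col = 1:3)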

The output of the graph routines is visible here:

https://sites.google.com/site/oraclemonitor/i-o-graphics#TOC-Percentile-Latency

The scripts to run the tests, extract the data and graph the data in R are available here:

https://github.com/khailey/fio_scripts/blob/master/README.md

Looking at tons of database reports, I typically see random I/O around 6-8ms on solid
gear, occasionally faster in the 3-4ms range if someone has serious caching on the SAN and access patterns that can take advantage of the cache, and occasionally slower than 8ms when the I/O subsystem is overtaxed.

The above latency values fit with some numbers I just grabbed from a Google search:

speed     rot_lat   seek     total
10K RPM   3ms       4.3ms    7.3ms
15K RPM   2ms       3.8ms    5.8ms
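The rotational latency column is just half a rotation on average (60,000 ms per minute / RPM / 2), and the seek column is the quoted average seek time, so the totals can be reproduced in a couple of lines of R:

# Average rotational latency = half a rotation = 60000 ms/min / rpm / 2.
rpm     <- c(10000, 15000)
rot_ms  <- 60000 / rpm / 2        # 3.0ms and 2.0ms
seek_ms <- c(4.3, 3.8)            # average seek times from the table above
data.frame(rpm, rot_ms, seek_ms, total_ms = rot_ms + seek_ms)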


For rating random I/O, it seems easy to say something like:

< 5ms  awesome
< 7ms  good
< 9ms  pretty good
> 9ms  starting to have contention or slower gear

First, I'm sure these numbers are debatable, but more importantly they don't take throughput into account.
The latency of a single user should be the base latency, and then there should be a second value: the throughput that the I/O subsystem can sustain while staying within some close factor of that base latency.
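For example, here is a hedged sketch of that two-number rating on hypothetical results (the 1.5x "close factor" is just an assumption for illustration): the base latency at a single user, plus the highest IOPS sustained while average latency stayed within 1.5x of that base.

# Hypothetical fio results at increasing concurrency.
results <- data.frame(users  = c(1, 2, 4, 8, 16, 32),
                      iops   = c(160, 310, 590, 1050, 1600, 1900),
                      avg_ms = c(6.1, 6.3, 6.7, 7.5, 9.9, 16.4))

base_ms <- results$avg_ms[results$users == 1]     # base latency: single user
ok      <- results$avg_ms <= 1.5 * base_ms        # still "close" to the base
rating  <- c(base_latency_ms = base_ms,
             sustained_iops  = max(results$iops[ok]))
print(rating)   # here: base 6.1ms, ~1050 IOPS before latency degrades past 1.5x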

The above also doesn't take into account wide distributions of latency and outliers. For outliers, how important is it that the 99.99th percentile is far from the average? How concerning is it that the max is multi-second when the average is good?
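One hedged way to put a number on that question (made-up sample: a lognormal bulk around 7ms plus a few multi-second stragglers) is to ask how much of the total I/O wait time is spent in the slowest 0.01% of I/Os:

# Made-up sample: the bulk of reads around 7ms, plus a few multi-second outliers.
lat_ms <- c(rlnorm(100000, meanlog = log(7), sdlog = 0.4),
            runif(10, 1000, 3000))

p9999      <- quantile(lat_ms, 0.9999)
tail_share <- sum(lat_ms[lat_ms > p9999]) / sum(lat_ms)
cat(sprintf("mean %.1f ms, 99.99th pct %.0f ms, I/Os above p99.99 = %.1f%% of total wait\n",
            mean(lat_ms), p9999, tail_share * 100))

In a sample like this the average barely moves, yet the handful of multi-second I/Os still account for a visible slice of the total wait; whether that matters is exactly the judgment call I'm asking about.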

  • Kyle
--
http://www.freelists.org/webpage/oracle-l
Received on Mon Jul 30 2012 - 13:47:34 CDT
