iostat

This is a standard Unix utility for analysing IO workloads and performance troubleshooting at the individual drive level. The first iteration provides a cumulative summary over node uptime; the following iterations report current (real-time) data.

Common usage

   # iostat -zxcn 2 2
        c report the percentage of time the system spent in user/sys/wait/idle states
        n show device names (CTD or WWN) instead of instance numbers
        x extended disk statistics; needed for the latency statistics
        z suppress lines that are all zeros

                    extended device statistics
                Lun queue depth --> +++         +++++ <-- target service time (NFS client's perception)
    r/s    w/s   kr/s   kw/s   wait actv wsvc_t asvc_t  %w  %b device
    0.0   15.0    0.0    33.0  0.0  0.0    2.0     0.6   1   1 c2t0d0
    0.0   13.0    0.0    27.5  0.0  0.0    1.1     0.4   0   0 c2t1d0
  119.0   55.0 15233.9 3817.0  0.0  2.2    0.0    12.8   0  87 c0t5000C5003416C207d0
  119.0   56.0 15233.9 3817.0  0.0  3.3    0.0    18.9   0  88 c0t5000C50034169BA3d0
    0.0  484.1    0.0 32916.2  0.0  1.0    0.0     2.1   1  28 c0tSUN24G1047M04YLQ1047M04YLQd0
    0.0  487.1    0.0 33120.2  0.0  1.0    0.0     2.1   0  29 c0tSUN24G1047M04YLR1047M04YLRd0
          Host queue depth --> +++  ^^^         ^^^^^^

          With :
          - Host queue depth : Number of transactions waiting to be submitted to a device.
          - Lun queue depth  : Number of transactions submitted to and being processed by the device.

The ZFSSA can handle a deep queue of I/O, so you should not consider 100% busy disks as overloaded. You probably want to look at target service time (asvc_t) as the NFS client’s perception of the NFS server’s response time, and the active queue (actv) as the number of pending I/O requests. actv corresponds to the number of I/Os concurrently outstanding against the same device.

  • r/s : number of read IOPS
  • w/s : number of write IOPS
  • kr/s : read data transfer rate (KB/s)
  • kw/s : write data transfer rate (KB/s)
  • wait : average number of transactions waiting for service or Host queue depth.
  • actv : average number of transactions actively being serviced (removed from the queue but not yet completed) or Lun queue depth.
  • wsvc_t : average service time in wait queue, in milliseconds
  • asvc_t : target service time (NFS client’s perception) : about 0.8 ms for a logzilla, around 20 ms for SAS drives. Should only be considered on a disk showing more than 5% activity
  • %w : Percentage of time that the queue is not empty
  • %b : Percentage of time that the disk is processing transactions
  • device : the disk device name

%b at 100% does not mean the LUN is really busy. It’s calculated by the driver. It’s the percentage of time that there is at least one request pending from the driver to the hardware. Modern hardware can handle multiple requests simultaneously with tagged commands. Think of a store with 16 checkout lanes. You count the store as “busy” any time at least one person is in the store. You can be 100% busy, but still have some capacity because some requests are being fulfilled simultaneously in the checkout lanes.
In the overloaded condition, you should see a deep queue to the storage (high actv) and a long response time (asvc_t) from the client.
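
As a rough illustration of the checkout-lane analogy, here is a minimal Python sketch; the 16-slot queue depth is just the analogy's assumption and the actv value is taken from the sample output above:

    # Hypothetical illustration: %b reports 100% as soon as one request is outstanding,
    # but a device accepting tagged commands still has headroom.
    max_tags = 16          # assumed device queue depth ("checkout lanes")
    avg_actv = 2.2         # average active transactions, e.g. the actv column above

    effective_util = avg_actv / max_tags
    print(f"%b may read 100%, effective utilization ~ {effective_util:.0%}")   # ~14%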

Interpretation of iostat data

Typical response times for various types of devices

Device               IOPS    Write Bandwidth   Read Bandwidth   Read Latency
DRAM                 >300K   20,000 MB/s       20,000 MB/s      120 us
18G logzilla         3000    120 MB/s          n/a              n/a
73G logzilla (gen3)  7000    200 MB/s          n/a              n/a
73G logzilla (gen4)  11000   350 MB/s          n/a              n/a
500G readzilla       2500    < 30 MB/s         40 MB/s          1300 us
1.6TB readzilla      21000   120 MB/s          200 MB/s         900 us
3TB 7K HDD           120     175 MB/s          175 MB/s         8600 us
4TB 7K HDD           120     180 MB/s          180 MB/s         8600 us
300G 15K HDD         250     170 MB/s          170 MB/s         4300 us
300G 10K HDD         160     165 MB/s          170 MB/s         6250 us
900G 10K HDD         160     170 MB/s          170 MB/s         6250 us

What does an oversubscribed device look like ?

Classic signs of an overloaded disk : asvc_t very high and actv/wait numbers greater than small integers. This is the best evidence that the disk cannot keep up.

Overall latency

The overall latency is the sum of time waiting in the queue, plus the time it takes for the disks to respond to your requests. If the queue depth is 5 and the disk service time is 10 milliseconds, then the overall latency from the client point of view is 5 * 10ms = 50 milliseconds.
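
A minimal sketch of this rule of thumb, assuming the simple serial-queue model described above:

    # Client-perceived latency when requests queue up behind each other:
    # overall latency ~= queue depth * per-I/O disk service time.
    queue_depth = 5          # outstanding requests ahead of ours
    svc_time_ms = 10.0       # per-I/O disk service time (asvc_t)

    print(f"overall latency ~ {queue_depth * svc_time_ms:.0f} ms")   # 5 * 10 ms = 50 ms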

How is iostat used

Quick scan of the data

At least two iterations, looking only for extremely high values. Don’t read every number, just scan for anomalies or something that “sticks out”. The iostat data in a bundle may not have captured the slowdown, since it is only collected over 15 seconds during bundle collection.

  1. First iteration is useful to find the cumulative level of activity since last boot.
  2. Review the r/s and w/s columns
    • Estimate the approximate read-to-write mix; within about 20% is fine
    • Estimate typical iops
  3. Review asvc_t (average disk service time) – is it excessive for the class of drive ? Then determine whether the drives can “keep up” with the workload.
  4. Review the wait and actv columns – these will show if the drive is “backing up”

Dead ringer for a disk bottleneck

  1. asvc_t is several multiples higher than typical. Note, one iteration may be much higher than the others.
  2. wait and actv queues greater than small integers.

Determine the number of drives and timeframes they are oversubscribed

  1. If many drives are oversubscribed for much of the time, then we do not have sufficient disk spindles (Typically, but not always). If they seem to be oversubscribed, make sure this is not an effect of another problem (like readzillas using too much memory, resulting in ZFS memory pressure and having to hit the spindles instead of caching data).
  2. More common, intermittent performance problems where drives keep up most of the time, but start queueing up during the busy workloads.
  3. A “slow drive” or uneven workload (like a resilver in one raidZ2 vdev) will slow the overall system down.

Notes

Some engineers like to filter iostat data looking for specifics, like disk service times > 100 ms. It may be better to get a global picture first – look at the overall data, then filter for specifics. Check whether fields have extra digits, or only skim the first significant figures (this typically applies to the wait, actv, wsvc_t and asvc_t columns). For r/s and w/s, just do a quick estimation of the busiest lines.
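
A minimal Python sketch of such a filter, assuming the extended-statistics layout shown earlier has been captured to a hypothetical file iostat.out (column positions may need adjusting for other captures):

    # Flag iostat -xn lines whose asvc_t exceeds a threshold (here the "100 ms" example).
    THRESHOLD_MS = 100.0

    with open("iostat.out") as f:                      # assumed capture file
        for line in f:
            cols = line.split()
            if len(cols) == 11 and cols[0].replace(".", "").isdigit():
                asvc_t = float(cols[7])                # 8th column is asvc_t
                if asvc_t > THRESHOLD_MS:
                    print("SLOW:", line.rstrip())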

Real iostat data

case 1

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.7   54.9   15.1 3163.1  0.3  0.4    6.1    7.7   4   9 c1t0d0
    2.0   54.8   29.2 3163.1  0.4  0.4    6.4    6.9   5   9 c1t1d0
    0.0    1.2    0.2   22.1  0.0  0.0    0.0    0.1   0   0 c2t5000A72030064181d0
    0.0    0.1    0.3    2.6  0.0  0.0    0.0    2.1   0   0 c2t5000C500438F21A3d0
   17.4   31.4  212.7  551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438ED9AFd0
    0.0    0.1    0.3    2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C50043B6B3A3d0
   17.5   31.7  212.6  552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438EC99Fd0
   17.5   31.3  213.6  551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B68F6Bd0
   17.5   31.7  213.0  552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8C303d0
   17.4   31.5  212.8  550.5  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B692CFd0
   17.5   32.0  212.8  552.8  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8F4DFd0
    0.0    0.1    0.3    2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C500475428BFd0
   58.8   65.7 1037.7 7223.9  0.2  0.0    1.9    0.4   4   5 c1t2d0
   58.9   65.7 1040.7 7220.9  0.2  0.0    1.9    0.4   4   5 c1t3d0

– comments

  • the busy data drives are doing roughly 50 iops, ~3:2 write biased.
  • roughly 200KB/s reads and 500KB/s writes
  • service times look good and ios are not queueing
  • the system drives are doing more IO than the data drives (~55 iops vs ~50 iops) – might be analytics or logging.
  • this customer has a known obsession with analytics

– questions

  • which drives are readzillas ?
  • which are the logzillas, that are active on this node ?

case 2

                   extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.0    0.0     4.0  0.0  0.0    0.1    0.3   0   0 c1t0d0
    0.0    1.0    0.0     4.0  0.0  0.0    0.0    0.3   0   0 c1t1d0
    4.0  306.8  165.5 18869.9  0.0  0.8    0.0    2.7   0  17 c2t5000C500438ED9AFd0
   26.2  374.2 1604.6 18833.7  0.0  3.5    0.0    8.7   0  73 c2t5000C500438EC99Fd0
   22.1  297.8 1349.6 18938.4  0.0  1.5    0.0    4.6   0  35 c2t5000C50043B68F6Bd0
   17.1  385.3 1045.3 18833.8  0.0  2.6    0.0    6.5   0  53 c2t5000C50043B8C303d0
   17.1  267.6  803.8 18298.1  0.0  0.9    0.0    3.0   0  21 c2t5000C50043B692CFd0
   15.1  286.7  849.6 18741.3  0.0  1.2    0.0    4.1   0  39 c2t5000C50043B8F4DFd0
   39.2  287.7 1808.3 19927.5  0.0  1.3    0.0    3.8   0  35 c2t5000C500475416DFd0
   28.2  309.9 1350.1 18873.7  0.0  1.4    0.0    4.0   0  43 c2t5000C50043B6CEEBd0
   13.1  310.9  261.6 18875.2  0.0  1.0    0.0    3.1   0  20 c2t5000C500438ED92Fd0
   10.1  376.3  619.2 21063.4  0.0  1.7    0.0    4.4   0  27 c2t5000C50047541A73d0
   18.1  296.8  924.6 18990.5  0.0  1.2    0.0    3.7   0  33 c2t5000C50043B67B1Fd0
  673.1  210.3 43312.0 24662.3 2.1  0.5    2.4    0.6  43  52 c1t2d0
  599.7  145.9 38466.5 16476.9 2.5  0.5    3.4    0.7  42  50 c1t3d0

– comments

  • the disks are busy, doing roughly 300-400 iops each.
  • asvc_t times are mostly < 5ms (one at 8.7 ms), this is ok.
  • no sign of trouble, the drives are busy and keeping up.

case 3

poor CIFS performance.

                  extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0  143.0  511.9 16351.5 0.6  0.1    3.8    0.5   7   7 c1t2d0
    1.0    0.0  128.0    0.0  0.0  0.0    0.0    0.7   0   0 c1t3d0
    2.0  135.0  256.0 15873.5 0.5  0.1    3.9    0.5   6   6 c1t4d0
   27.0  218.0   62.0 1060.3  0.0  0.9    0.0    3.8   0  20 c3t5000CCA02A272A68d0
   41.0  142.0  524.4  774.9  0.0  0.6    0.0    3.4   0  15 c3t5000CCA02A26BB70d0
   40.0  191.0  160.5 1259.3  0.0  0.9    0.0    3.9   0  24 c3t5000CCA02A256000d0
   72.0  102.0  129.5  509.9  0.0  0.5    0.0    3.0   0  17 c3t5000CCA02A1D6468d0
   38.0  184.0  145.0 1259.8  0.0  0.9    0.0    3.8   0  23 c3t5000CCA02A2002F8d0
   19.0  242.0   79.0 1060.9  0.0  0.8    0.0    3.0   0  20 c3t5000CCA02A264B1Cd0
   78.0   76.0  124.5  511.9  0.0  0.5    0.0    3.3   0  16 c3t5000CCA02A1A6A64d0
   49.0  139.0  540.9  778.4  0.0  0.7    0.0    3.6   0  17 c3t5000CCA02A265FE0d0
   39.0  192.0  154.0 1259.8  0.0  1.0    0.0    4.5   0  25 c3t5000CCA02A266100d0
   23.0  250.0   76.0 1056.9  0.0  1.3    0.0    4.8   0  27 c3t5000CCA02A24C950d0
   43.0  123.0  606.9  774.4  0.0  0.7    0.0    4.2   0  19 c3t5000CCA02A26C588d0
   36.0  176.0  141.0 1265.3  0.0  0.8    0.0    3.6   0  21 c3t5000CCA02A28397Cd0
   37.0  188.0  151.5 1262.4  0.0  0.8    0.0    3.7   0  21 c3t5000CCA02A2655F4d0
   41.0  173.0  635.9  773.4  0.0  0.8    0.0    3.7   0  18 c3t5000CCA02A26A730d0
   42.0  129.0  701.9  775.9  0.0  0.8    0.0    4.7   0  18 c3t5000CCA02A1FC718d0
    0.0    4.0    0.0  204.0  0.0  0.0    0.0    0.2   0   0 c3t5000A7203006A41Cd0
   67.0   99.0  127.0  511.9  0.0  0.8    0.0    4.9   0  20 c3t5000CCA02A1FC7ACd0

– questions

  • what’s the size of each IO read ?
  • what’s the size of each IO write ?
  • What could be the zpool layout : mirror or raidz2 ?

case 4

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   30.9   21.5  545.5  0.0  0.1    1.4    2.0   1   2 c1t0d0
    0.4   30.8   18.7  545.5  0.0  0.1    1.4    2.1   1   2 c1t1d0
  288.7  259.9 17159.0 23806.6 1.7 0.4    3.1    0.8  29  43 c1t2d0
  288.6  259.9 17159.2 23808.2 1.7 0.4    3.1    0.8  29  43 c1t3d0
   63.3   28.7  364.2  283.5  0.0  1.5    0.0   15.9   0  62 c3t5000C50040908B8Bd0
    7.1    3.1   35.9   28.4  0.0  0.2    0.0   15.0   0   7 c3t5000C50035055FEBd0
   59.3   28.9  284.8  284.0  0.0  1.4    0.0   15.4   0  61 c3t5000C50034F61CEFd0
   59.6   29.0  286.6  284.2  0.0  1.3    0.0   15.2   0  60 c3t5000C50034EF953Fd0
   59.2   29.0  284.3  283.8  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EF2FDFd0
    0.0  431.5    0.0 16373.9 0.0  0.7    0.0    1.5   1  18 c3t5000A7203003D9EDd0
   59.6   29.1  285.8  283.9  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EC779Fd0
   59.6   28.9  286.3  283.6  0.0  1.4    0.0   15.8   0  62 c3t5000C50034F67A13d0
   15.2   11.6  111.1   84.2  0.0  0.3    0.0   11.9   0  16 c3t5000C50034F60A8Fd0
   14.5   15.6   65.4   87.4  0.0  0.4    0.0   13.4   0  18 c3t5000C50034EF52DFd0
   13.8    6.4   68.6  111.4  0.0  7.1    0.0  352.0   0  99 c3t5000C50034F5105Bd0
   15.6   16.6   79.2   97.3  0.0  0.5    0.0   14.1   0  19 c3t5000C50034EFAC7Fd0

– comments

  • These drives are only doing around 90 IOPs, yet they are about 60% busy.

– questions

  • identify readzillas and logzillas
  • Do the latency numbers look reasonable
  • anything abnormal stands out ?

– notes

  • one drive has asvc_t latencies that are almost 10x slower than the others
  • This drive is not doing very much IO, yet latency is horrible and multiple IOs are active.
  • observe the excessive actv counts for this drive
   # iostat -xncz 5 5 |grep c3t5000C50034F5105Bd0
   r/s    w/s   kr/s   kw/s   wait actv wsvc_t asvc_t  %w  %b device
   59.4   28.6  286.7  284.0  0.0  1.5    0.0   17.2   0  63 c3t5000C50034F5105Bd0
   21.0    9.2   93.4  304.5  0.0  5.2    0.0  171.6   0 100 c3t5000C50034F5105Bd0
   26.1    0.0  108.7    0.0  0.0  4.7    0.0  180.4   0  99 c3t5000C50034F5105Bd0
   57.9   15.3  309.8  410.4  0.0 10.0    0.0  136.7   0 100 c3t5000C50034F5105Bd0
   46.3    6.8   90.1  204.9  0.0  6.0    0.0  113.9   0 100 c3t5000C50034F5105Bd0

– solution

  • the slow drive was removed from the array and performance immediately improved.

zfs_vdev_max_pending

What’s wrong here ?

   SolarisHost # iostat -xnz 1
     r/s    w/s   kr/s   kw/s  wait actv wsvc_t asvc_t  %w  %b device
   764.5   69.3 97628.6 6848.2  0.0 10.0    0.0   12.0   0 100 c0t600144F0C6133EE90000525FE2120002d0
   632.0    7.1 80783.5  434.2  0.0 10.0    0.0   15.6   0 100 c0t600144F0C6133EE90000525FE2120002d0

On Solaris, zfs_vdev_max_pending controls how many I/O requests can be pending per vdev. For example, when you have 100 disks visible from your OS with zfs:zfs_vdev_max_pending set to 2, you can have at most 200 requests outstanding. When those 100 disks are hidden behind a ZFSSA exposing a single LUN, you will have at most 2 pending requests !
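
A trivial sketch of this arithmetic, with the values assumed from the example above:

    # Maximum outstanding host-side I/Os = visible vdevs/LUNs * zfs_vdev_max_pending.
    zfs_vdev_max_pending = 2

    print(100 * zfs_vdev_max_pending)   # 100 individual disks -> 200 outstanding requests
    print(1 * zfs_vdev_max_pending)     # the same disks behind a single LUN -> only 2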

Under Solaris 11, the default value is 10. This means the actv column is capped at 10 transactions : no more than 10 transactions can be outstanding at the same time. We can grow this to 35 if required, as it was under Solaris 10.

    SolarisHost # echo zfs_vdev_max_pending/W0t35 | mdb -kw
    SolarisHost # echo "set zfs:zfs_vdev_max_pending = 35" >> /etc/system
    SolarisHost # iostat -xnz 1
                        extended device statistics
      r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    772.9   28.4 98510.2 2805.7 0.0 18.9    0.0   23.6   0 100 c0t600144F0C6133EE90000525FE2120002d0
    858.8    8.3 109596.5 331.8 0.0 17.8    0.0   20.5   0 100 c0t600144F0C6133EE90000525FE2120002d0

Observe that asvc_t has increased. Do not increase this too much as this is a trade-off between throughput and latency.
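
A rough Little's law check (IOPS ≈ actv / asvc_t) on the first line of each sample above shows why this is a throughput vs latency trade-off; the numbers below are taken from that iostat output:

    # Little's law sanity check: throughput ~= outstanding I/Os / per-I/O latency.
    for label, actv, asvc_t_ms in [("max_pending=10", 10.0, 12.0),
                                   ("max_pending=35", 18.9, 23.6)]:
        iops = actv / (asvc_t_ms / 1000.0)
        print(f"{label}: ~{iops:.0f} IOPS at {asvc_t_ms} ms per I/O")   # ~833 vs ~801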

Note : on the ZFS appliance, zfs_vdev_max_pending is still at 10. Do not change it !

   zs3-2-tvp540-a# echo zfs_vdev_max_pending/D | mdb -k
   zfs_vdev_max_pending:
   zfs_vdev_max_pending:           10

Analytics

Useful set of analytics

The full name is provided as well as the CLI name. Some of them cannot easily be created from the BUI.

CPU: percent utilization broken down by CPU identifier                : cpu.utilization[cpu]
Disk: ZFS DMU operations per second broken down by type of operation  : zpool.dmuops[op]
Disk: ZFS DMU operations broken down by DMU object type               : zpool.dmuops[dmutype]
Network: TCP bytes per second broken down by local service            : tcp.bytes[local service]
Network: TCP packets per second broken down by local service          : tcp.packets[local service]
Disk: Disks broken down by percent utilization                        : io.disks[utilization]
Disk: I/O operations per second broken down by type of operation      : io.ops[op]
Disk: I/O operations per second broken down by latency                : io.ops[latency=100000][disk] : 100ms
Disk: I/O bytes per second broken down by disk                        : io.bytes[disk]
Memory: kernel memory broken down by kmem cache                       : kmem.total[cache]
Memory: kernel memory in use broken down by kmem cache                : kmem.inuse[cache]
Memory: kernel memory lost to fragmentation broken down by kmem cache : kmem.fragmented[cache]
Memory: dynamic memory usage broken down by application name          : mem.heap[application]
NFSv3 operations per second taking at least 100000 microseconds broken down by type of operation :
                                                                        nfs3.ops[latency=100000][op]

Some nice analytics introduced in 2013.1

dataset-030 active   37.8K      0 repl.bytes[peer]
dataset-031 active   37.8K      0 repl.ops[dataset]
dataset-032 active   37.8K      0 repl.ops[direction]
dataset-033 active   71.1K      0 cpu.spins[cpu]
dataset-034 active   36.9K      0 cpu.spins[type]
dataset-035 active   33.9K      0 shadow.ops[latency]
dataset-036 active   37.8K      0 shadow.ops[share]
...

Examples

cli> analytics datasets create repl.bytes[peer]

Memory usage via mdb

::memstat

Useful tool but may return incorrect information. Double check with freemem/D …

Page Summary           Pages             MB  %Tot
------------     ---------  --------  ----
Kernel             2998817     11714   72% : Kernel + ARC cache for ZFS metadata (included in crash dump)
ZFS File Data      1058644      4135   25% : ARC cache for ZFS data (not included in crash dump)
Anon                123711       483    3% : process's heap space, its stack, and copy-on-write pages.
Exec and libs          263         1    0%
Page cache            1438         5    0% : cache used by other FS than zfs (ufs, tmpfs, qfs)
Free (cachelist)      2287         8    0% : freed but still containing valid vnodes references (structure kept)
Free (freelist)       6898        26    0% : freed and not containing any vnodes references (structure not kept)

Total              4192058     16375
Physical           4192056     16375

Things to consider

  • Kernel memory embeds ZFS metadata (DDT, spacemap, L2 ARC headers …)
  • A high level of ‘kernel memory’ is not always a problem, especially when the system is a replication target or a resilver is running.
  • A small ‘ZFS File Data’ value means the ARC data miss rate can be high, a possible performance issue if the IO workload is read-heavy.
  • A high ‘Anon’ value means some userland process may be huge in size (true for ndmp, which is a 64-bit process).
  • A high ‘Page cache’ value means someone may have written big files into /export (ftp can do this !). /export uses swap, which points to tmpfs (DRAM).

Try to collect ::memstat every minute to check how memory usage changes. Big memory movements are a sign of memory re-allocation.
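
One possible way to automate this, as a hedged sketch (it assumes mdb -k can be run on the live system, as in the examples above):

    # Capture ::memstat once a minute with a timestamp, so memory movements can be compared.
    import subprocess, time

    while True:                      # Ctrl-C to stop
        out = subprocess.run("echo ::memstat | mdb -k", shell=True,
                             capture_output=True, text=True).stdout
        print(time.strftime("%Y-%m-%d %H:%M:%S"))
        print(out)
        time.sleep(60)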

::arc -m

The dcmd output differs slightly between 2011.1 and 2013.1 :

  • 2011.1
p                         =    364385 MB  : target size of MRU
c                         =    388888 MB  : constantly changing number of what the ARC thinks it should be
c_min                     =     16383 MB  : min target cache size, minimum size of the ARC
c_max                     =    523246 MB  : max target cache size, maximum size of the ARC : 98% of Memory
size                      =    381169 MB  : ARC current size
hdr_size                  = 5061417280    : Header size for ARC, in bytes
data_size                 = 380546095104  : ARC size for data (payload), in bytes
l2_size                   = 1691643692544 : l2 cache size currently used
l2_hdr_size               = 4942620096    : memory used for l2_headers in bytes, here 4.7GB
arc_meta_used             =     54568 MB  : ARC metadata currently used
arc_meta_limit            =    523246 MB  : ARC metadata max size (by default arc_meta_limit = arc_c_max)
arc_meta_max              =    342784 MB  : max size the ARC metadata has reached
  • 2013.1
p                         =      2529 MB  : target size of MRU
c                         =    261101 MB  : constantly changing number of what the ARC thinks it should be
c_min                     =        64 MB  : min target cache size, minimum size of the ARC
c_max                     =    261101 MB  : max target cache size, maximum size of the ARC : 98% of Memory
size                      =     44213 MB  : ARC current size
data_size                 =     15015 MB  : ARC size for data (payload)
l2_hdr_size               =        96 MB  : memory used for l2_headers
l2_size                   = 1535512576    : l2 cache size currently used
meta_used                 =     29198 MB  : ARC metadata currently used
meta_max                  =     29198 MB  : max size the ARC metadata has reached
meta_limit                =         0 MB  : ARC metadata max size (0 = no limit)

Example :

::arc -m
l2_size                   = 1018129928704
l2_hdr_size               = 2339951616

# zpool iostat 1  // to get SSD usable size

                   capacity     operations    bandwidth
  pool          alloc   free   read  write   read  write

  cache             -      -      -      -      -      -
    c1t2d0       477G     8M     58      7   892K   773K
    c1t3d0       477G     8M     59      7   887K   772K

Max L2-headers size = SSD size * l2_hdr_size / l2_size = (477*2)*1024^3 * 2339951616 / 1018129928704 = 2354246416 = 2.354 GB
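
The same projection as a small Python sketch, with the inputs taken from the ::arc -m and zpool iostat output above (header overhead is assumed to scale linearly with the amount of L2ARC data indexed):

    # Estimated DRAM consumed by L2ARC headers once the cache devices are full:
    #   max_l2_hdr ~= total SSD cache size * l2_hdr_size / l2_size
    ssd_bytes   = (477 * 2) * 1024**3      # two 477G cache devices
    l2_hdr_size = 2339951616               # bytes, from ::arc -m
    l2_size     = 1018129928704            # bytes, from ::arc -m

    max_l2_hdr = ssd_bytes * l2_hdr_size / l2_size
    print(f"{max_l2_hdr / 1e9:.3f} GB")    # ~2.354 GB just for L2ARC headers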

Possible outcome

When the L2 header size is too high, performance issues will happen. If DRAM cannot be increased, there are 2 possible workarounds :

  1. remove one of the cache devices. A maintenance window is needed as traffic may stop while the cache device is removed by the ‘zpool remove cachedevice’ command.
  2. identify any shares/LUNs where the majority of IOs are writes (analytics can help). From the BUI, select the share and set “Cache device usage → none”.

::kmastat

To get a better picture of ::kmastat output, we need to introduce vmem.

vmem

Vmem stands for virtual memory and is essential for an operating system because it makes several things possible :

  • It helps to isolate tasks from each other by encapsulating them in their private address spaces.
  • Virtual memory can give tasks the feeling of more memory available than is actually possible.
  • By using virtual memory, there might be multiple copies of the same program, linked to the same addresses, running in the system.

Solaris’ kernel memory allocator (also known as slab allocator) allows the kernel to allocate memory dynamically. The slab allocator

  • relies on two lower−level system services to create slabs: a virtual address allocator to provide kernel virtual addresses, and VM routines to back those addresses with physical pages and establish virtual−to−physical translations.
  • allows the clients to create a cache of very frequently used small objects/structures, so that objects of the same type come from a single page or contiguous pages.
  • reduces the risk of internal fragmentation.

It provides interfaces such as vmem_create(), kmem_cache_create(), kmem_cache_alloc() and kmem_cache_free() :

  • zio_arena = vmem_create(“zfs_file_data”,…) : memory allocated in physical or virtual memory: PM or VM (see args)
  • znode_cache = kmem_cache_create(“zfs_znode_cache”,…) : memory allocated in physical memory : PM
  • dbuf_cache = kmem_cache_create(“dmu_buf_impl_t”,…) : memory allocated in physical memory : PM
  • zp = kmem_cache_alloc(znode_cache,…);
  • kmem_cache_free(znode_cache, zp);

The memory returned from the kmem calls is allocated in physical memory.

The slab allocator sources its virtual address ranges from the vmem backend allocator. The vmem resource-map allocator provides the vmem_create() interface so that clients can register a VA range from which subsequent allocations should happen. As written before, VM routines establish the virtual−to−physical translations of pages.

               vmem_create(const char *name, void *base, size_t size, size_t quantum, vmem_alloc_t *afunc,
               vmem_free_t *ffunc, vmem_t *source, size_t qcache_max, int vmflag)

               name       : name of the virtual memory segment
               base       : base address of the segment in the 'kernel address space (KAS)'
               size       : how big the segment should be
               quantum    : unit of allocation (if equals to PAGESIZE -> Virtual Memory)
               afunc      : allocator function for the segment
               ffunc      : free function for the segment
               source     : parent vmem arena. if NULL, then it defaults to _kmem_default_
               qcache_max : the maximum allocation size up to which the slab allocator is
                            allowed to cache such objects
               vmflag     : VM_SLEEP = can block for memory - success guaranteed, VM_NOSLEEP = cannot
                            block for memory - may fail

ZFS data buffers

Typical buffer names : zfs_file_data, zfs_file_data_buf, zio_data_buf. ZFS data is never captured in a kernel dump. This is because we do not use the heap arena but create a new arena in the KAS (Kernel Address Space) : the zfs_file_data arena.

   # startup.c
     segkmem_zio_init(segzio_base, mmu_ptob(segziosize)); /* create zio area covering new segment */

   # seg_kmem.c
     void segkmem_zio_init(void *zio_mem_base, size_t zio_mem_size) {
        zio_arena = vmem_create("zfs_file_data", zio_mem_base, zio_mem_size, PAGESIZE, NULL,
        NULL, NULL, 256 * 1024, VM_SLEEP);
        // quantum = PAGESIZE : VM                                           ^^^^^^^^
        // Here, we allocate 'zio_mem_size' bytes for zfs_file_data and round to quantum multiples (ie PAGESIZE)

        zio_alloc_arena = vmem_create("zfs_file_data_buf", NULL, 0, PAGESIZE, segkmem_zio_alloc,
        segkmem_zio_free, zio_arena, 0, VM_SLEEP);
        // quantum = PAGESIZE but refers to segkmem_zio_alloc : invokes vmem_alloc() to get a virtual address and
        // then backs it with physical pages : PM

   # zio.c
     vmem_t * data_alloc_arena = zio_alloc_arena;
              ^^^^^^^^^^^^^^^^
     loop on 'c'
       (void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
       zio_data_buf_cache = kmem_cache_create(name, size, align, NULL, NULL, NULL, NULL, data_alloc_arena, cflags
                             | KMC_NOTOUCH); // PM
     endloop

Any kmem_cache_alloc() calls involving any of the above zio_data_buf_ caches will result in returning memory in the range *[segzio_base, segzio_base+segziosize)* .
Let’s have a look at ::kmastat

   ::kmastat -m
   cache                        buf    buf    buf    memory     alloc alloc
   name                        size in use  total    in use   succeed  fail   = object caches
   ------------------------- ------ ------ ------ ---------- --------- -----
   kmem_magazine_31             256    441   4095          1M   2429326     0 | <== per−CPU caches added to the slab 
   kmem_magazine_47             384   3640   7590          2M    838299     0 |     algorithm to avoid spinlock 
   kmem_magazine_63             512    126   2009          1M   1249436     0 |     contention
   ...
   zio_data_buf_69632         69632      4     17          1M    288907     0 |
   zio_data_buf_90112         90112      1     14          1M    908801     0 |
   zio_data_buf_102400       102400      0     11          1M    185426     0 |
   zio_data_buf_114688       114688      0     10          1M    167585     0 |<==============+
   zio_data_buf_118784       118784      0     11          1M    127970     0 |               |
   zio_data_buf_122880       122880      0     10          1M    124515     0 |               |
   zio_data_buf_126976       126976      0     12          1M     85861     0 |               |
   zio_data_buf_131072       131072  32185  32879       4109M  56897448     0 |               |
   ------------------------- ------ ------ ------ ---------- --------- -----                  |
   Total [zfs_file_data_buf]                            4135M 168478350     0  <--------------|
                                                                                              |
   vmem                         memory     memory    memory     alloc alloc    vmem arena     |
   name                         in use      total    import   succeed  fail    <= vmem arena  |
   ------------------------- ---------- ----------- ---------- --------- -----                |
   zfs_file_data                  4142M    8388608M        0M   5809290     0  < 8TB of VM <==|==== arena name : VM
       zfs_file_data_buf          4135M       4135M     4135M  10342009     0  < 4GB of PM ---+ <=== subset of 
                                                                                                zfs_file_data 
                                                                                                arena : PM

ZFS metadata buffers

We rely on kmem to create caches such that all the ZFS metadata is part of the ‘kernel heap’ and hence is captured as part of the crash dump. Actually, we pass NULL to kmem_cache_create() for the vmem_t type argument. This means that zio_buf_ will be backed by the kmem_default VMEM arena, like most of the other kernel caches.

   - zio.c
     loop on 'c'
        (void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
        zio_buf_cache[c] = kmem_cache_create(name, size, align, NULL, NULL, NULL, NULL, NULL, cflags); // last but one arg is NULL => kmem_default

ZFS caches for metadata :

  • zio.c: zio_buf_; = kmem_cache_create(“zio_buf_%lu”,…);
  • dbuf.c: dbuf_cache = kmem_cache_create(“dmu_buf_impl_t”,…);
  • dnode.c: dnode_cache = kmem_cache_create(“dnode_t”,…);
  • sa.c: sa_cache = kmem_cache_create(“sa_cache”, …);
  • zfs_znode.c: znode_cache = kmem_cache_create(“zfs_znode_cache”, …);
  • zio.c: zio_cache = kmem_cache_create(“zio_cache”, …);

As these buffers are all allocated thanks to kmem_cache_create() calls, they correspond to Physical Memory : PM.
Let’s have a look at ::kmastat

   ::kmastat -m
   cache                        buf    buf    buf    memory     alloc alloc
   name                        size in use  total    in use   succeed  fail   = object caches
   ------------------------- ------ ------ ------ ---------- --------- -----
   zio_buf_512                  512 2409586 4916584     2400M 552832304     0 ^
   zio_buf_2048                2048  71804  72078        140M  26528668     0 |
   zio_buf_4096                4096   2633   2703         10M  74524377     0 |
   zio_buf_12288              12288   1282   1353         15M  36569461     0 |
   zio_buf_16384              16384 108731 108854       1700M 254805641     0 |
   zio_buf_32768              32768     36     92          2M  22355627     0 |
   zio_buf_40960              40960     40     94          3M  27766298     0 |
   zio_buf_45056              45056      1     32          1M  23590138     0 |
   zio_buf_49152              49152      0     25          1M  12426311     0 |
   zio_buf_53248              53248     12     35          1M   5843836     0 |
   zio_buf_57344              57344     57    111          6M   2841986     0 |
   zio_buf_61440              61440      1     38          2M   6549109     0 |
   zio_buf_126976            126976      0     13          1M    658728     0 |
   zio_buf_131072            131072   5059   6799        849M 107583446     0 |
   dmu_buf_impl_t               192 2616165 5710460     1115M 860908891     0 |
   dnode_t                      744 2423090 2454573     1743M 412891701     0 |
   sa_cache                      56   8983 151088          8M 438341951     0 |<==================+
   zfs_znode_cache              248   8983  15136          3M 439890993     0 |                   |
   zio_cache                    816  20102  28750         22M 1922767786    0 v                   |
                                                                                                  |
                                                                                                  |
   ------------------------- ------ ------ ------ ---------- --------- -----                      |
   Total [kmem_default]                                 9611M 2194965189    0 <-------------------|
                                                                                                  |
   vmem                         memory     memory    memory     alloc    alloc                    |
   name                         in use      total    import    succeed   fail                     |
   ------------------------- ---------- ----------- ---------- --------- -----                    |
   heap                          16815M    8388608M         0M  36378781     0                    |
       kmem_va                   15017M      15017M     15017M   2756890     0 <  15  GB of VM    |
           kmem_default           9611M       9611M      9611M  21280467     0 <  9.6 GB of PM ---+

To make sense of these data, focus first on the kmem_default line, which accounts for all the kmem allocations made by the different caches on the system, then look at individual caches to see how much each consumed.

Quantum caches

The intent of quantum caches is to prevent fragmentation in the vmem arena. The zfs_file_data_ caches are quantum caches and are already accounted for in the zio_data_buf_ caches, so do not count them twice. In the code, the quantum caches are created in vmem_create() itself, depending on whether qcache_max is passed or not. qcache_max is the maximum allocation size up to which such objects are cached.

   zio_arena = vmem_create("zfs_file_data", zio_mem_base, zio_mem_size, PAGESIZE, NULL, NULL, NULL, 32 * 1024, VM_SLEEP);
                                                                                                    ^^^^^^^^^
                                                                                                    qcache_max

Let’s see the kmastat output :

::kmastat -m
   cache                        buf    buf    buf    memory     alloc alloc
   name                        size in use  total    in use   succeed  fail   <= object caches
   ------------------------- ------ ------ ------ ---------- --------- -----
   zfs_file_data_4096          4096    163    448          1M   2837290     0
   zfs_file_data_8192          8192     68    208          1M    977461     0
   zfs_file_data_12288        12288     58    160          2M   1589976     0 >> see, the 4KB pattern
   zfs_file_data_16384        16384     12     48          0M    467541     0    (PAGESIZE=4KB on x64)
   zfs_file_data_20480        20480     65    108          2M   1546868     0
   zfs_file_data_24576        24576     13     40          1M    494350     0
   zfs_file_data_28672        28672     31     68          2M    912287     0
   zfs_file_data_32768        32768     16     32          1M    340006     0

How to determine the type of each object cache : PM or VM ?

SCAT “kma stat” can be of great help :

       CAT(vmcore.4/11X)> kma stat
                                      buf      buf        buf       memory
       cache name                        size    avail      total       in use
       ============================== ======= ======== ========== ============
       kmem_magazine_1                     16    57562      61746      1007616 PK
       kmem_magazine_3                     32    45903      49000      1605632 PK
       ...
       kmem_va_4096                      4096   230223    1466784   6007947264 QU
       kmem_va_8192                      8192   284936     557168   4564320256 QU
       kmem_va_12288                    12288      700       2180     28573696 QU
       kmem_va_16384                    16384   143227     252336   4134273024 QU
       ...
       kmem_alloc_8                         8  1188947    1393310     11345920 PKU
       kmem_alloc_16                       16    56834     136293      2224128 PKU
       kmem_alloc_24                       24   155669     252170      6184960 PKU
       kmem_alloc_32                       32   136623     170750      5595136 PKU
       …
       zfs_file_data_135168            135168        0          0            0 QU
       zfs_file_data_139264            139264        0          0            0 QU
       …
       zio_cache                          872    27906      27945     25436160 PKU
       …
       zio_buf_131072                  131072        0       6715    880148480 PK
       zio_data_buf_131072             131072     1046     114558  15015346176 PZ
       …
       dnode_t                            752   914325    1511025   1237831680 PKU
       dmu_buf_impl_t                     208  1403215    2140958    461545472 PKU
       ------------------------------ ------- -------- ---------- ------------
       Total [hat_memload]                                             1978368 P
       Total [kmem_default]                                        10078785536 P
       Total [id32]                                                       4096 P
       Total [zfs_file_data]                                          13107200
       Total [zfs_file_data_buf]                                    4336205824 P
       ------------------------------ ------- -------- ---------- ------------
       arena                                memory       memory       memory
       name                                 in use        total       import
       ------------------------------ ------------ ------------- ------------
       heap                            17632604160 8796093022208            0
          kmem_va                      15747022848   15747022848  15747022848
             kmem_default              10078785536   10078785536  10078785536  P
       zfs_file_data                    4344115200 8796093022208            0
          zfs_file_data_buf             4336205824    4336205824   4336205824  P

       P - physical memory
       K - kvps[KV_KVP] vnode (kernel memory)
       Z - kvps[KV_ZVP] vnode (ZFS kernel memory)
       Q - quantum cache (KMC_QCACHE)
       U - enabling KMF_AUDIT will allow "kma users" to work for this cache
       F - cache has failed allocations (will be in red color)

For a deep understanding, see :

Paging and freemem rules

The pageout daemon is visible from prstat. It looks for pages from userland processes that can be pushed back to disk (swap). When the free page list (freelist) falls below a certain level, the page daemon is invoked. The daemon scans memory to identify pages that can be pushed out to disk and reused. The system strategy for page replacement is not to let the system run out of memory.

       ::memstat
       Page Summary                Pages                MB  %Tot
       ------------     ----------------  ----------------  ----
       Kernel                    3843924             15015    6%
       ZFS File Data              150299               587    0%
       Anon                        97798               382    0%
       Exec and libs                3451                13    0%
       Page cache                   8696                33    0%
       Free (cachelist)             7493                29    0%
       Free (freelist)          62992480            246064   94%

       Total                    67104141            262125
       Physical                 67104140            262125

       ::physmem/D
       physmem:
       physmem:        67104141  << in pages (4KB on x64 , 8KB on sparc)

To prevent from running out of memory, the OS uses the following tunable parameters as thresholds :

Parameter        Meaning                          Value                     Comments
lotsfree         Lots of memory                   MAX(looppages/64, 512KB)  if freemem < lotsfree, the OS runs the page daemon
desfree          Desired free memory              lotsfree/2                if freemem < desfree, the OS starts soft swapping
minfree          Minimum acceptable memory level  desfree/2                 if freemem < minfree, the system biases allocations toward those necessary to successfully complete pageout operations or to swap processes completely out of memory
throttlefree     Same as minfree, can be tuned    minfree                   if freemem < throttlefree, the OS cuts off user and non-urgent kernel requests
pageout_reserve  Page daemon and swapper only     throttlefree/2            if freemem < pageout_reserve, the OS fails all requests except those from the page daemon and the swapper
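
As a rough illustration, the cascade of defaults can be computed from physmem; a sketch assuming 4 KB pages on x64 and the physmem value from the example above:

    # Default paging thresholds, in pages, derived as in the table above.
    physmem  = 67104141                                      # pages, from ::physmem
    pagesize = 4096

    lotsfree        = max(physmem // 64, (512 * 1024) // pagesize)
    desfree         = lotsfree // 2
    minfree         = desfree // 2
    throttlefree    = minfree
    pageout_reserve = throttlefree // 2

    for name, val in [("lotsfree", lotsfree), ("desfree", desfree), ("minfree", minfree),
                      ("throttlefree", throttlefree), ("pageout_reserve", pageout_reserve)]:
        print(f"{name:16s} {val:>10d} pages ({val * pagesize / 2**20:.0f} MB)")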

Beyond the pageout daemon, ZFS uses its own strategy to reduce the ARC size and avoid memory starvation :
In the arc.c code, arc_reclaim_thread() checks whether freemem < lotsfree+desfree. If TRUE, it calls kmem_cache_reap_now() to walk the different kernel memory buffers, free up unused space and put it back on the cachelist (the structure is kept so we can allocate faster). This is a single-threaded task and may lead to performance issues, as threads trying to allocate memory may have to wait, e.g. :

::stacks -m zfs
      THREAD           STATE    SOBJ                COUNT
      ffffff86b38ab440 SLEEP    CV                     36
                 swtch+0x150
                 cv_wait+0x61
                 vmem_xalloc+0x63f
                 vmem_alloc+0x161
                 segkmem_xalloc+0x90
                 segkmem_alloc_vn+0xcd
                 segkmem_zio_alloc+0x24
                 vmem_xalloc+0x550
                 vmem_alloc+0x161
                 kmem_slab_create+0x81
                 kmem_slab_alloc+0x5b
                 kmem_cache_alloc+0x1fa
                 zio_data_buf_alloc+0x2c
                 arc_get_data_buf+0x18b
                 arc_buf_alloc+0xa2
                 arc_read_nolock+0x12f
                 arc_read+0x79
                 dsl_read+0x33
                 dbuf_read_impl+0x17e
                 dbuf_read+0xfd
                 dmu_tx_check_ioerr+0x6b
                 dmu_tx_count_write+0x175
                 dmu_tx_hold_write+0x5b
                 zfs_write+0x655

For ARC memory allocation changes, the sequence is as follows :

       arc_reclaim_thread() { // The arc_reclaim_thread() wakes up every second and will attempt to reduce  
                              // the size of the ARC to the target size (arc_c).
          if (arc_reclaim_needed() == TRUE) {  // freemem < lotsfree+desfree ?
             arc_kmem_reap_now() {
                kmem_cache_reap_now(zio_buf_cache[i]);
                kmem_cache_reap_now(zio_data_buf_cache[i]);
                kmem_cache_reap_now(buf_cache);
                kmem_cache_reap_now(hdr_cache);
             }
          }
          if (arc_size > arc_c) {
             arc_adjust(); //  resize the ARC lists (MRU/MFU/Ghost lists).
          }
       }

For details, see :

The kmem_cache reaper activity can be observed via the kmem.fragmented[cache] analytic.

DTrace

cpu usage

Display the busy threads which are not in the idle loop (curthread->t_pri != -1)

   # dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1 && cpu==58 / { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }'
   zfs`zio_compress_data+0x41
   zfs`zio_write_bp_init+0x26d
   zfs`zio_execute+0x8d
   genunix`taskq_thread+0x22e
   unix`thread_start+0x8
   25898
  • arg0 !=0 : to check if the thread is running in the kernel
  • curthread->t_pri != -1 : to ensure that the CPU is not executing the kernel idle loop

Determine why threads are being switched off-CPU by looking at user stacks and kernel stacks when the sched:::off-cpu probe fires

Displays threads which are switched off-CPU.

   # dtrace -qn 'sched:::off-cpu /execname != "sched"/ { @[execname, stack()] = count(); } END { trunc(@,5); }'
   zpool-pool-raid
      unix`swtch+0x146
      genunix`cv_wait+0x60
      genunix`taskq_thread_wait+0x86
      genunix`taskq_thread+0x2a4
      unix`thread_start+0x8
   51328

Looks like we wait for some tasks to be done in one of the taskq worker threads. We should dig further with the taskq.d trace script (see the DTrace book by Brendan Gregg).

   # dtrace -n 'sched:::off-cpu /execname != "sched"/ { self->start = timestamp; }
                sched:::on-cpu /self->start/ { this->delta = (timestamp - self->start ) >> 20 ;
                @["off-cpu (ms):", stack()] = quantize(this->delta); self->start = 0; }'

   off-cpu (ms):
      genunix`cv_wait+0x60
      zfs`zio_wait+0x5c
      zfs`dmu_buf_hold_array_by_dnode+0x238
      zfs`dmu_buf_hold_array+0x6f
      zfs`dmu_read_uio+0x4b
      zfs`zfs_read+0x267
      genunix`fop_read+0xa9
      genunix`vn_rdwr+0x193
      unix`mmapobj_map_interpret+0x20f
      unix`mmapobj+0x7a
      genunix`mmapobjsys+0x1d5
      unix`sys_syscall+0x17a

             value  ------------- Distribution ------------- count
                -1 |                                         0
                 0 |@@@@@@@@@@@@@@@                          3
                 1 |                                         0
                 2 |@@@@@@@@@@@@@@@                          3
                 4 |@@@@@@@@@@                               2
                 8 |                                         0

Flamegraph

Brilliant tool from Brendan Gregg, available here : http://www.brendangregg.com/flamegraphs.html
Steps :

  1. download
  2. collect cpu usage with dtrace
    • # dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1 / { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }' > sample.txt
  3. create an svg file
    • # stackcollapse.pl sample.txt | flamegraph.pl > example.svg

This produces an interactive (mouse-over) SVG showing the time spent in each function. The X axis shows the proportion of time (in %) spent in each function during the capture. The Y axis shows the call stacks and their depth.

spa_sync activity

spa_sync activity is the time needed to flush data from DRAM to stable storage (i.e. disks).

Examples

  • bad spa_sync time
   # dtrace -n 'spa_sync:entry/args[0]->spa_name=="pool-b232" /{self->t=timestamp} spa_sync:return/self->t/{printf("spa_sync=%d msec", (timestamp-self->t)>>20); self->t=0}'
      CPU     ID                    FUNCTION:NAME
       63  44795                  spa_sync:return spa_sync=5802 msec
       36  44795                  spa_sync:return spa_sync=4617 msec
       63  44795                  spa_sync:return spa_sync=4875 msec
       22  44795                  spa_sync:return spa_sync=6088 msec
       31  44795                  spa_sync:return spa_sync=1108 msec
  • good spa_sync time
   # dtrace -n 'spa_sync:entry/args[0]->spa_name=="pool-b232" /{self->t=timestamp} spa_sync:return/self->t/{printf("spa_sync=%d msec", (timestamp-self->t)>>20); self->t=0}'
      CPU     ID                    FUNCTION:NAME
       55  44795                  spa_sync:return spa_sync=608 msec
       42  44795                  spa_sync:return spa_sync=930 msec
       41  44795                  spa_sync:return spa_sync=664 msec
       31  44795                  spa_sync:return spa_sync=624 msec
       27  44795                  spa_sync:return spa_sync=948 msec
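
Note that both one-liners convert the nanosecond delta to “msec” with a right shift of 20 bits, i.e. dividing by 2^20 ≈ 10^6; a quick check of the approximation:

    # ">> 20" divides by 1048576, not 1000000, so the printed "msec" values are ~5% low.
    delta_ns = 5_802_000_000           # e.g. a 5.802 s spa_sync delta
    print(delta_ns >> 20)              # 5533 -- what ">> 20" yields for this delta
    print(delta_ns // 1_000_000)       # 5802 exact milliseconds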

arcreap.d

From Brendan Gregg : measures how long we spend in arc_kmem_reap_now and arc_adjust.

   # cat -n arcreap.d
   #!/usr/sbin/dtrace -s

   fbt::arc_kmem_reap_now:entry,
   fbt::arc_adjust:entry
   {
      self->start[probefunc] = timestamp;
   }

   fbt::arc_shrink:entry
   {
      trace("called");
   }

   fbt::arc_kmem_reap_now:return,
   fbt::arc_adjust:return
   /self->start[probefunc]/
   {
      printf("%Y %d ms", walltimestamp,(timestamp - self->start[probefunc]) / 1000000);
      self->start[probefunc] = 0;
   }

   # ./arcreap.d
   dtrace: script './arcreap.d' matched 5 probes
   CPU     ID                    FUNCTION:NAME
        0  64929                 arc_shrink:entry   called
        0  62414                arc_adjust:return 2012 Jan  9 23:10:01 18 ms
        9  62420         arc_kmem_reap_now:return 2012 Jan  9 23:10:03 1511 ms
        0  62414                arc_adjust:return 2012 Jan  9 23:10:24 0 ms
        6  62414                arc_adjust:return 2012 Jan  9 23:10:49 0 ms

snoop and tcpdump

Snoop can be used to get the latency and throughput. Since 2011.1.9 and 2013.1, tcpdump is available as well.

   # snoop -q -d aggr2001001 -s 200 -o  port 2049 or port 111 or port 4045
   port 2049 (nfs)
   port 111  (rpcbind)
   port 4045 (nfs-locking)

There is a workflow as well : snoop-all.akwf (see attachments section)

In some cases, *tcpdump* can be used. It seems to capture more details than snoop and to be a less memory-intensive consumer.

   # tcpdump -c 200000 -i aggr1 -w /var/ak/dropbox/cpdump.out

tcptrace

This command is available from cores3 : /sw/es-tools/bin/tcptrace

  • overall latency
   # tcptrace -lr  port_216_snoop | egrep "RTT max"
   dq->dr:                                dr->dq:
   RTT max:               120.0 ms        RTT max:               121.7 ms
   RTT max:               118.4 ms        RTT max:               120.0 ms
   RTT max:               120.0 ms        RTT max:               121.0 ms
  • throughput
   # tcptrace -lr  port_216_snoop | grep throughput
   dq->dr:                                dr->dq:
   throughput:              336 Bps       throughput:              273 Bps
   throughput:               36 Bps       throughput:               33 Bps
   throughput:              198 Bps       throughput:             1492 Bps

filebench

Customers love to use dd or tar as a performance tool. In many cases this does not represent the real workload, hence we need a tool to simulate a genuine workload : filebench
See http://sourceforge.net/apps/mediawiki/filebench/index.php?title=Filebench
It is available for Linux and Solaris.

   # ./go_filebench

   statfile1                1051ops/s   0.0mb/s      0.5ms/op      385us/op-cpu
   deletefile1              1051ops/s   0.0mb/s      4.1ms/op      880us/op-cpu
   closefile3               1051ops/s   0.0mb/s      0.0ms/op       12us/op-cpu
   readfile1                1051ops/s 142.1mb/s      1.3ms/op      819us/op-cpu
   openfile2                1051ops/s   0.0mb/s      0.5ms/op      410us/op-cpu
   closefile2               1051ops/s   0.0mb/s      0.0ms/op       12us/op-cpu
   appendfilerand1          1051ops/s   8.3mb/s      2.4ms/op      446us/op-cpu
   openfile1                1052ops/s   0.0mb/s      0.6ms/op      420us/op-cpu
   closefile1               1052ops/s   0.0mb/s      0.0ms/op       13us/op-cpu
   wrtfile1                 1052ops/s 136.1mb/s     30.5ms/op     3910us/op-cpu
   createfile1              1052ops/s   0.0mb/s      4.6ms/op     1493us/op-cpu

   2492: 109.065:
   IO Summary:      694029 ops, 11566.6 ops/s, (1051/2103 r/w) 286.5mb/s,   3379us cpu/op,  14.8ms latency
                                                               ^^^^^^^^^ MB/s

kernel dump

This should be a last resort. Remember this : when developing a picture of a racing cyclist, it’s never easy to determine how fast he was going ! This can be helpful if the system is hung or if the issue is too short-lived to run dtrace scripts.
How to force an NMI :

   *** ILOM 2.x ***
   -> cd /SP/diag
   /SP/diag
   -> set generate_host_nmi=true
   Set 'generate_host_nmi' to 'true'

   *** ILOM 3.x ***
   -> cd /HOST/
   -> set generate_host_nmi=true

   The console session should report something similar to the following:
   -> start /SP/console
   panic[cpu2]/thread=ffffff001eccbc60: NMI received
   ffffff001eccbac0 pcplusmp:apic_nmi_intr+7c ()
   ffffff001eccbaf0 unix:av_dispatch_nmivect+30 ()
   ffffff001eccbb00 unix:nmiint+154 ()
   ffffff001eccbbf0 unix:mach_cpu_idle+b ()
   ffffff001eccbc20 unix:cpu_idle+c2 ()
   ffffff001eccbc40 unix:idle+114 ()
   ffffff001eccbc50 unix:thread_start+8 ()
   syncing file systems... done
   dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
   100% done: 356267 pages dumped, compression ratio 3.84, dump succeeded

   The real dump file is created after reboot. It may take time, so just wait before collecting a supportfile bundle.

Useful commands to run :

::memstat

::memstat
   Page Summary                Pages                MB  %Tot
   ------------     ----------------  ----------------  ----
   Kernel                    2417967              9445   38%
   ZFS File Data             3686849             14401   59%
   Anon                       169828               663    3%
   Exec and libs                   6                 0    0%
   Page cache                    138                 0    0%
   Free (cachelist)             9806                38    0%
   Free (freelist)              2534                 9    0%

    Total                     6287128             24559
   Physical                  6287127             24559

::kmastat

::kmastat -g ! grep zio_cache
   cache                        buf    buf    buf    memory     alloc alloc
   name                        size in use  total    in use   succeed  fail
   ------------------------- ------ ------ ------ ---------- --------- -----
   zio_cache                    816      2 137450       107M 2960373347    0   2 buffers used only among 137450 buf allocated
                                                                               memory in use : 137450 x 816 = 107MB

Evidence of memory fragmentation is when ‘memory in use’ is high (physical memory from slab allocation) but ‘buf in use’ * ‘buf size’ is low.
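
A small sketch of that check, using the zio_cache line above as input:

    # Fragmentation check: slab memory held vs bytes actually referenced by live buffers.
    buf_size   = 816
    buf_in_use = 2
    buf_total  = 137450

    slab_bytes = buf_total * buf_size           # ~107 MB held by the cache
    live_bytes = buf_in_use * buf_size          # ~1.6 KB actually in use
    print(f"slab: {slab_bytes / 2**20:.0f} MB, live: {live_bytes} bytes, "
          f"utilization: {live_bytes / slab_bytes:.4%}")   # tiny -> badly fragmented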

::stacks

::stacks
   ffffff0030e56c40 SLEEP    MUTEX                 187
                    swtch+0x146
                    turnstile_block+0x6ff
                    mutex_vector_enter+0x25e
                    zfs_zget+0x46
                    zfs_vget+0x1f5
                    fsop_vget+0x66
                    nfs3_fhtovp+0x47
                    rfs3_write+0x4a
                    common_dispatch+0x649
                    rfs_dispatch+0x2d
                    svc_process_request+0x184
                    stp_dispatch+0xb9
                    taskq_d_thread+0xc1
                    thread_start+8

   ::stacks -m zfs
   ::stacks -m nfs
   ::stacks -m smbsrv
   ::stacks -m iscsit
   ::stacks -m pmcs
   ::stacks -m sata
   ::stacks -m mpt
   ::stacks -c vmem_xalloc

   ::stacks -c mutex_enter
   THREAD           STATE    SOBJ                COUNT
   ffffff00318fbc40 RUN      NONE                  1
                    swtch+0x146
                    preempt+0xca
                    kpreempt+0x89
                    sys_rtt_common+0x13b
                    _sys_rtt_ints_disabled+8
                    mutex_enter+0x10
                    segkmem_page_create+0x92
                    segkmem_xalloc+0xc6
                    segkmem_alloc_vn+0xcb
                    segkmem_alloc+0x24
                    vmem_xalloc+0x502
                    vmem_alloc+0x172
                    kmem_slab_create+0x81
                    kmem_slab_alloc+0x60
                    kmem_cache_alloc+0x267
                    zio_buf_alloc+0x29
                    zil_alloc_lwb+0x4a
                    zil_get_current_lwb+0x281
                    zil_get_commit_list+0x185
                    zil_get_train+0x94
                    zil_commit+0xc0
                    zfs_fsync+0xcc
                    fop_fsync+0x5f
                    rfs3_commit+0x173
                    common_dispatch+0x539
                    rfs_dispatch+0x2d
                    svc_process_request+0x184
                    stp_dispatch+0xb9
                    taskq_d_thread+0xc1
                    thread_start+8

       ::stacks -c spa_sync
       THREAD           STATE    SOBJ                COUNT
       ffffff0030d2ac40 SLEEP    MUTEX                   1
                    swtch+0x146
                    turnstile_block+0x6ff
                    mutex_vector_enter+0x25e
                    zil_sync+0x5f
                    dmu_objset_sync+0x241
                    dsl_dataset_sync+0x63
                    dsl_pool_sync+0x18d
                    spa_sync+0x395
                    txg_sync_thread+0x244
                    thread_start+8

       ::stacks -c kmem_cache_reap_now
       THREAD           STATE    SOBJ                COUNT
       ffffff002e173c40 SLEEP    CV                      1
                    swtch+0x146
                    cv_wait+0x60
                    taskq_wait+0x4b
                    kmem_cache_reap_now+0x4e
                    arc_buf_cache_reap_now+0x38
                    arc_reaper_thread+0xb0
                    thread_start+8

::findstack

ffffff002e173c40::findstack -v
   stack pointer for thread ffffff002e173c40: ffffff002e173ac0
   [ ffffff002e173ac0 _resume_from_idle+0xf5() ]
     ffffff002e173af0 swtch+0x146()
     ffffff002e173b20 cv_wait+0x60(fffff6000df96b5a, fffff6000df96b48)
     ffffff002e173b60 taskq_wait+0x4b(fffff6000df96b28)
     ffffff002e173b80 kmem_cache_reap_now+0x4e(fffff60005d4a008)
     ffffff002e173bc0 arc_buf_cache_reap_now+0x38()
     ffffff002e173c20 arc_reaper_thread+0xb0()
     ffffff002e173c30 thread_start+8()

::cpuinfo

::cpuinfo
     ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
     0 fffffffffbc3ef20  1f   11    0 100   no    no t-1297 fffff6002767fae0 akd
     1 fffff6000dd0eac0  1f   13    0 100   no    no t-1298 fffff6002dfb9840 akd  ticks : 12s. CPU1 has been 
     2 fffff6000dd0c080  1f   12    0 100   no    no t-1296 fffff6002dc3d7c0 akd          running a thread for 12s
     3 fffff6000dd0a000  1f   10    0 100   no    no t-1296 fffff6002de034e0 akd
     4 fffff6000dd06b00  1f    8    0 100   no    no t-1296 fffff6002dc3c440 akd
     5 fffffffffbc49110  1b   28    0 100   no    no t-1296 fffff60026a03160 akd
     6 fffff6000dd02000  1f   13    0 100   no    no t-1298 fffff60029292b20 akd
     7 fffff6000dfbcac0  1f    9    0 100   no    no t-1296 fffff6002df3c000 akd

::taskq

fffff6000df96b28::taskq -T
   ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
   fffff6000df96b28 kmem_taskq                         1/   1  570   935    -
       THREAD           STATE    SOBJ                COUNT
       ffffff002ed96c40 RUN      NONE                  1
                        swtch+0x146
                        preempt+0xca
                        kpreempt+0x89
                        sys_rtt_common+0x13b
                        _sys_rtt_ints_disabled+8
                        kmem_partial_slab_cmp+0x42
                        avl_find+0x56
                        avl_add+0x2c
                        avl_update_gt+0x56
                        kmem_slab_free+0x1b9
                        kmem_magazine_destroy+0xdf
                        kmem_depot_ws_reap+0x77
                        kmem_cache_reap+0x76
                        taskq_thread+0x22e
                        thread_start+8

::spa

::spa
     ADDR                 STATE NAME
     fffff60039b60500    ACTIVE pool-0
     fffff60009817580    ACTIVE system

   fffff60039b60500::print spa_t spa_uberblock.ub_timestamp
     spa_uberblock.ub_timestamp = 0x52976059  : in sec since epoch

   time/J
     time:
     time:           5297619e  : in sec since epoch

   5297619e-52976059=D
     325 sec  : This is the last time a TXG has been done : quite long !

   0x52976059=Y
                2013 Nov 28 15:25:13
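
The same check can be done outside mdb, for example with a small Python sketch (mdb's =Y formatted the timestamp above; whether it prints UTC or local time depends on the host settings):

    # Convert the uberblock timestamp and compare it with the current time read from the dump.
    from datetime import datetime, timezone

    ub_timestamp = 0x52976059            # spa_uberblock.ub_timestamp
    now          = 0x5297619e            # value read with time/J

    print(datetime.fromtimestamp(ub_timestamp, tz=timezone.utc))   # 2013-11-28 15:25:13+00:00
    print(now - ub_timestamp, "seconds since the last synced TXG") # 325 s : quite long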

::zio_state

::zio_state
    ADDRESS                                  TYPE  STAGE            WAITER
    fffff6001b0b4cc8                         NULL  OPEN             -
    ffff60037f0e340                          NULL  OPEN             -
    fffff60039911cc0                         NULL  OPEN             -
    fffff60037f25338                         NULL  CHECKSUM_VERIFY  ffffff007bb30c40

    ffffff007bb30c40::findstack -v
    stack pointer for thread ffffff007bb30c40: ffffff007bb2ffb0
    [ ffffff007bb2ffb0 _resume_from_idle+0xf4() ]
      ffffff007bb2ffe0 swtch+0x150()
      ffffff007bb30010 cv_wait+0x61(fffff60037f25650, fffff60037f25648)
      ffffff007bb30050 zio_wait+0x5d(fffff60037f25338)  : arg0 is zio_t
      ffffff007bb300b0 dbuf_read+0x1e5(fffff6006a5539f8, 0, a)
      ffffff007bb30130 dmu_buf_hold+0xac(fffff6001b7fa0c0, 18a, 66715000, 0, ffffff007bb30148, 1)
      ffffff007bb301a0 zap_get_leaf_byblk+0x5c(fffff60037bf44c0, 66715, 0, 1, ffffff007bb30360)
      ffffff007bb30210 zap_deref_leaf+0x78(fffff60037bf44c0, 10b0600000000000, 0, 1, ffffff007bb30360)
      ffffff007bb302a0 fzap_cursor_retrieve+0xa7(fffff60037bf44c0, ffffff007bb30350, ffffff007bb30390)
      ffffff007bb30330 zap_cursor_retrieve+0x188(ffffff007bb30350, ffffff007bb30390)
      ffffff007bb30640 ddt_zap_walk+0x5d(fffff6001b7fa0c0, 18a, ffffff007bb306e0, fffff60051b8c100)
      ffffff007bb30680 ddt_object_walk+0x4e(fffff60052bc3000, 0, 1, fffff60051b8c100, ffffff007bb306e0)
      ffffff007bb306d0 ddt_walk+0x78(fffff600513cc080, fffff60051b8c0e8, ffffff007bb306e0)
      ffffff007bb308a0 dsl_scan_ddt+0x2de(fffff60051b8c040, fffff6004256b1e8)
      ffffff007bb30a60 dsl_scan_visit+0x54(fffff60051b8c040, fffff6004256b1e8)
      ffffff007bb30ac0 dsl_scan_sync+0x20f(fffff6003a230a00, fffff6004256b1e8)
      ffffff007bb30b80 spa_sync+0x3f7(fffff600513cc080, 8824b0)
      ffffff007bb30c20 txg_sync_thread+0x247(fffff6003a230a00)
      ffffff007bb30c30 thread_start+8()

::zio

Displays a recursive stack of IOs

fffff60037f25338::zio -rc
     ADDRESS                                  TYPE  STAGE            WAITER
     fffff60037f25338                         NULL  CHECKSUM_VERIFY  ffffff007bb30c40
      fffff60041603998                        READ  VDEV_IO_START    -
       fffff600399b9340                       READ  VDEV_IO_START    -
        fffff6003e24fcc8                      READ  VDEV_IO_START    -
         fffff60039999cc8                     READ  VDEV_IO_START    -   : last one


   fffff60039999cc8::print zio_t io_vd
     io_vd = 0xfffff600513cad40

   fffff60039999cc8::print zio_t io_vd | ::print vdev_t vdev_path
     vdev_path = 0xfffff60050f0bc38 "/dev/dsk/c3t5000CCA396E3D6FAd0s0"

::nm

Displays the size of a variable and can provide the argument types of any function in the kernel

::nm ! grep arc_meta_limit
   0xffffffffc00043b8|0x0000000000000008|OBJT |LOCL |0x0  |8       |arc_meta_limit

   kmem_cache_reap_now::nm -f ctype
   C Type
   void (*)(kmem_cache_t *)