iostat
iostat is a standard Unix utility for analysing I/O workloads and troubleshooting performance at the individual drive level. The first iteration reports a cumulative summary since boot; subsequent iterations report current (real-time) data.
Common usage
# iostat -zxcn 2 2
    c : report the percentage of time the system spent in user/sys/wait/idle
    n : show device names (CTD or WWN), not instance numbers
    x : extended disk statistics (needed for the latency statistics)
    z : suppress lines that are all zeros

                    extended device statistics
    r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   15.0     0.0     33.0  0.0  0.0    2.0    0.6   1   1 c2t0d0
    0.0   13.0     0.0     27.5  0.0  0.0    1.1    0.4   0   0 c2t1d0
  119.0   55.0 15233.9   3817.0  0.0  2.2    0.0   12.8   0  87 c0t5000C5003416C207d0
  119.0   56.0 15233.9   3817.0  0.0  3.3    0.0   18.9   0  88 c0t5000C50034169BA3d0
    0.0  484.1     0.0  32916.2  0.0  1.0    0.0    2.1   1  28 c0tSUN24G1047M04YLQ1047M04YLQd0
    0.0  487.1     0.0  33120.2  0.0  1.0    0.0    2.1   0  29 c0tSUN24G1047M04YLR1047M04YLRd0

In this output, the wait column is the host queue depth, actv is the LUN queue depth and asvc_t is the target service time (the NFS client's perception). With :
- Host queue depth : number of transactions waiting to be submitted to a device.
- Lun queue depth : number of transactions submitted to and being processed by the device.
The ZFSSA can handle a deep I/O queue, so 100% busy disks should not automatically be considered overloaded. Look instead at the target service time (asvc_t), which is the NFS client's perception of the NFS server's response time, and at the active queue (actv), the number of pending I/O requests; actv represents the number of concurrent I/Os issued to the same device.
- r/s : number of read IOPS
- w/s : number of write IOPS
- kr/s : read data transfer rate
- kw/s : write data transfer rate
- wait : average number of transactions waiting for service, i.e. the host queue depth
- actv : average number of transactions actively being serviced (removed from the queue but not yet completed), i.e. the LUN queue depth
- wsvc_t : average service time in the wait queue, in milliseconds
- asvc_t : target service time (NFS client's perception) : around 0.8 ms for a logzilla, around 20 ms for SAS drives. Should only be considered on a disk showing more than 5% activity
- %w : percentage of time that the queue is not empty
- %b : percentage of time that the disk is processing transactions
- device : the disk device
%b at 100% does not mean the LUN is really busy. It’s calculated by the driver. It’s the percentage of time that there is at least one request pending from the driver to the hardware. Modern hardware can handle multiple requests simultaneously with tagged commands. Think of a store with 16 checkout lanes. You count the store as “busy” any time at least one person is in the store. You can be 100% busy, but still have some capacity because some requests are being fulfilled simultaneously in the checkout lanes.
In the overloaded condition, you should see a deep queue to the storage (high actv) and a long response time (asvc_t) from the client.
Interpretation of iostat data
Typical response times for various types of devices
What does an oversubscribed device look like ?
Classic signs of an overloaded disk : asvc_t very high, and actv/wait numbers greater than small integers. This is the best evidence that the disk cannot keep up.
Overall latency
The overall latency is the sum of the time spent waiting in the queue plus the time it takes the disks to respond to your requests. If the queue depth is 5 and the disk service time is 10 milliseconds, then the overall latency from the client's point of view is 5 x 10 ms = 50 milliseconds.
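A rough way to spot this directly from iostat output is to multiply the active queue by the service time for each device. The awk one-liner below is only an illustration of that arithmetic (not a standard tool); it assumes the 11-column iostat -xnz output shown above:

# iostat -xnz 5 2 | awk 'NF == 11 && $1 ~ /^[0-9]/ {
      est = $6 * $8                    # actv x asvc_t : rough client-perceived latency
      if (est > 0)
          printf "%-30s actv=%5.1f asvc_t=%6.1f ms  est=%7.1f ms\n", $11, $6, $8, est
  }'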
How is iostat used
Quick scan of the data
Take at least two iterations and look only for extremely high values. Don't read every number; just scan for anomalies or anything that "sticks out". Note that the iostat data in a support bundle may not have captured the slowdown, as it is only collected over 15 seconds during bundle collection.
- First iteration is useful to find the cumulative level of activity since the last boot.
- Review the r/s and w/s columns :
  - estimate the approximate read-to-write mix (within about 20% is fine)
  - estimate typical IOPS
- Review asvc_t (average disk service time) – is it excessive for the class of drive? Then determine whether the drives can "keep up" with the workload.
- Review the wait and actv columns – these will show whether the drive is "backing up".
Dead ringer for a disk bottleneck
- asvc_t is several multiples higher than typical. Note that one iteration may be much higher than the others.
- wait and actv queues greater than a small integer.
Determine the number of drives and timeframes they are oversubscribed
- If many drives are oversubscribed for much of the time, then we do not have sufficient disk spindles (typically, but not always). If they seem to be oversubscribed, make sure this is not a side effect of another problem (such as readzillas using too much memory, creating ZFS memory pressure and forcing reads to hit the spindles instead of being served from cache).
- More commonly, intermittent performance problems occur where drives keep up most of the time but start queueing up during busy workloads.
- A "slow drive" or an uneven workload (like a resilver in one raidz2 vdev) will slow the overall system down.
Notes
Some engineers like to filter iostat data looking for specifics, such as disk service times above 100 ms. It is often better to get a global picture first – look at the overall data, then filter for specifics. Check whether fields have extra digits, or only skim the first significant figures (this typically applies to the wait, actv, wsvc_t and asvc_t columns). For r/s and w/s, just make a quick estimate of the busiest lines.
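When you do want to filter, a quick sketch for catching any interval where a disk shows a service time above 100 ms could look like this (it assumes the standard 11-column iostat -xnz output; the 100 ms threshold is arbitrary):

# iostat -xnz 5 12 | awk 'NF == 11 && $8 > 100'     # print only lines where asvc_t exceeds 100 ms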
Real iostat data
case 1
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.7   54.9    15.1  3163.1  0.3  0.4    6.1    7.7   4   9 c1t0d0
    2.0   54.8    29.2  3163.1  0.4  0.4    6.4    6.9   5   9 c1t1d0
    0.0    1.2     0.2    22.1  0.0  0.0    0.0    0.1   0   0 c2t5000A72030064181d0
    0.0    0.1     0.3     2.6  0.0  0.0    0.0    2.1   0   0 c2t5000C500438F21A3d0
   17.4   31.4   212.7   551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438ED9AFd0
    0.0    0.1     0.3     2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C50043B6B3A3d0
   17.5   31.7   212.6   552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438EC99Fd0
   17.5   31.3   213.6   551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B68F6Bd0
   17.5   31.7   213.0   552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8C303d0
   17.4   31.5   212.8   550.5  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B692CFd0
   17.5   32.0   212.8   552.8  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8F4DFd0
    0.0    0.1     0.3     2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C500475428BFd0
   58.8   65.7  1037.7  7223.9  0.2  0.0    1.9    0.4   4   5 c1t2d0
   58.9   65.7  1040.7  7220.9  0.2  0.0    1.9    0.4   4   5 c1t3d0
– comments
- the busy data drives are doing roughly 50 IOPS, ~3:2 write biased.
- roughly 200 KB/s reads and 500 KB/s writes
- service times look good and I/Os are not queueing
- the system drives are doing more I/O than the data drives (~55 IOPS vs 50 IOPS) – might be analytics or logging.
- this customer has a known obsession with analytics
– questions
- which drives are the readzillas?
- which are the logzillas that are active on this node?
case 2
                    extended device statistics
    r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.0     0.0      4.0  0.0  0.0    0.1    0.3   0   0 c1t0d0
    0.0    1.0     0.0      4.0  0.0  0.0    0.0    0.3   0   0 c1t1d0
    4.0  306.8   165.5  18869.9  0.0  0.8    0.0    2.7   0  17 c2t5000C500438ED9AFd0
   26.2  374.2  1604.6  18833.7  0.0  3.5    0.0    8.7   0  73 c2t5000C500438EC99Fd0
   22.1  297.8  1349.6  18938.4  0.0  1.5    0.0    4.6   0  35 c2t5000C50043B68F6Bd0
   17.1  385.3  1045.3  18833.8  0.0  2.6    0.0    6.5   0  53 c2t5000C50043B8C303d0
   17.1  267.6   803.8  18298.1  0.0  0.9    0.0    3.0   0  21 c2t5000C50043B692CFd0
   15.1  286.7   849.6  18741.3  0.0  1.2    0.0    4.1   0  39 c2t5000C50043B8F4DFd0
   39.2  287.7  1808.3  19927.5  0.0  1.3    0.0    3.8   0  35 c2t5000C500475416DFd0
   28.2  309.9  1350.1  18873.7  0.0  1.4    0.0    4.0   0  43 c2t5000C50043B6CEEBd0
   13.1  310.9   261.6  18875.2  0.0  1.0    0.0    3.1   0  20 c2t5000C500438ED92Fd0
   10.1  376.3   619.2  21063.4  0.0  1.7    0.0    4.4   0  27 c2t5000C50047541A73d0
   18.1  296.8   924.6  18990.5  0.0  1.2    0.0    3.7   0  33 c2t5000C50043B67B1Fd0
  673.1  210.3 43312.0  24662.3  2.1  0.5    2.4    0.6  43  52 c1t2d0
  599.7  145.9 38466.5  16476.9  2.5  0.5    3.4    0.7  42  50 c1t3d0
– comments
- disks are busy, roughly 400 IOPS.
- asvc_t times are all < 5 ms, this is OK.
- no sign of trouble : the drives are busy and keeping up.
case 3
poor CIFS performance.
                    extended device statistics
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0  143.0  511.9  16351.5  0.6  0.1    3.8    0.5   7   7 c1t2d0
    1.0    0.0  128.0      0.0  0.0  0.0    0.0    0.7   0   0 c1t3d0
    2.0  135.0  256.0  15873.5  0.5  0.1    3.9    0.5   6   6 c1t4d0
   27.0  218.0   62.0   1060.3  0.0  0.9    0.0    3.8   0  20 c3t5000CCA02A272A68d0
   41.0  142.0  524.4    774.9  0.0  0.6    0.0    3.4   0  15 c3t5000CCA02A26BB70d0
   40.0  191.0  160.5   1259.3  0.0  0.9    0.0    3.9   0  24 c3t5000CCA02A256000d0
   72.0  102.0  129.5    509.9  0.0  0.5    0.0    3.0   0  17 c3t5000CCA02A1D6468d0
   38.0  184.0  145.0   1259.8  0.0  0.9    0.0    3.8   0  23 c3t5000CCA02A2002F8d0
   19.0  242.0   79.0   1060.9  0.0  0.8    0.0    3.0   0  20 c3t5000CCA02A264B1Cd0
   78.0   76.0  124.5    511.9  0.0  0.5    0.0    3.3   0  16 c3t5000CCA02A1A6A64d0
   49.0  139.0  540.9    778.4  0.0  0.7    0.0    3.6   0  17 c3t5000CCA02A265FE0d0
   39.0  192.0  154.0   1259.8  0.0  1.0    0.0    4.5   0  25 c3t5000CCA02A266100d0
   23.0  250.0   76.0   1056.9  0.0  1.3    0.0    4.8   0  27 c3t5000CCA02A24C950d0
   43.0  123.0  606.9    774.4  0.0  0.7    0.0    4.2   0  19 c3t5000CCA02A26C588d0
   36.0  176.0  141.0   1265.3  0.0  0.8    0.0    3.6   0  21 c3t5000CCA02A28397Cd0
   37.0  188.0  151.5   1262.4  0.0  0.8    0.0    3.7   0  21 c3t5000CCA02A2655F4d0
   41.0  173.0  635.9    773.4  0.0  0.8    0.0    3.7   0  18 c3t5000CCA02A26A730d0
   42.0  129.0  701.9    775.9  0.0  0.8    0.0    4.7   0  18 c3t5000CCA02A1FC718d0
    0.0    4.0    0.0    204.0  0.0  0.0    0.0    0.2   0   0 c3t5000A7203006A41Cd0
   67.0   99.0  127.0    511.9  0.0  0.8    0.0    4.9   0  20 c3t5000CCA02A1FC7ACd0
– questions
- what's the size of each read I/O?
- what's the size of each write I/O?
- what could be the zpool layout : mirror or raidz2?
case 4
                    extended device statistics
    r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   30.9    21.5    545.5  0.0  0.1    1.4    2.0   1   2 c1t0d0
    0.4   30.8    18.7    545.5  0.0  0.1    1.4    2.1   1   2 c1t1d0
  288.7  259.9 17159.0  23806.6  1.7  0.4    3.1    0.8  29  43 c1t2d0
  288.6  259.9 17159.2  23808.2  1.7  0.4    3.1    0.8  29  43 c1t3d0
   63.3   28.7   364.2    283.5  0.0  1.5    0.0   15.9   0  62 c3t5000C50040908B8Bd0
    7.1    3.1    35.9     28.4  0.0  0.2    0.0   15.0   0   7 c3t5000C50035055FEBd0
   59.3   28.9   284.8    284.0  0.0  1.4    0.0   15.4   0  61 c3t5000C50034F61CEFd0
   59.6   29.0   286.6    284.2  0.0  1.3    0.0   15.2   0  60 c3t5000C50034EF953Fd0
   59.2   29.0   284.3    283.8  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EF2FDFd0
    0.0  431.5     0.0  16373.9  0.0  0.7    0.0    1.5   1  18 c3t5000A7203003D9EDd0
   59.6   29.1   285.8    283.9  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EC779Fd0
   59.6   28.9   286.3    283.6  0.0  1.4    0.0   15.8   0  62 c3t5000C50034F67A13d0
   15.2   11.6   111.1     84.2  0.0  0.3    0.0   11.9   0  16 c3t5000C50034F60A8Fd0
   14.5   15.6    65.4     87.4  0.0  0.4    0.0   13.4   0  18 c3t5000C50034EF52DFd0
   13.8    6.4    68.6    111.4  0.0  7.1    0.0  352.0   0  99 c3t5000C50034F5105Bd0
   15.6   16.6    79.2     97.3  0.0  0.5    0.0   14.1   0  19 c3t5000C50034EFAC7Fd0
– comments
- these drives are only doing around 90 IOPS, yet they are about 60% busy.
– questions
- identify the readzillas and logzillas
- do the latency numbers look reasonable?
- does anything abnormal stand out?
– notes
- one drive has asvc_t latencies that are almost 10x slower than the others
- this drive is not doing very much I/O, yet latency is horrible and multiple I/Os are active.
- observe the excessive actv counts for this drive
# iostat -xncz 5 5 | grep c3t5000C50034F5105Bd0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   59.4   28.6  286.7  284.0  0.0  1.5    0.0   17.2   0  63 c3t5000C50034F5105Bd0
   21.0    9.2   93.4  304.5  0.0  5.2    0.0  171.6   0 100 c3t5000C50034F5105Bd0
   26.1    0.0  108.7    0.0  0.0  4.7    0.0  180.4   0  99 c3t5000C50034F5105Bd0
   57.9   15.3  309.8  410.4  0.0 10.0    0.0  136.7   0 100 c3t5000C50034F5105Bd0
   46.3    6.8   90.1  204.9  0.0  6.0    0.0  113.9   0 100 c3t5000C50034F5105Bd0
– solution
- the slow drive was removed from the array and performance immediately improved.
zfs_vdev_max_pending
What’s wrong here ?
SolarisHost # iostat -xnz 1
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  764.5   69.3 97628.6 6848.2  0.0 10.0    0.0   12.0   0 100 c0t600144F0C6133EE90000525FE2120002d0
  632.0    7.1 80783.5  434.2  0.0 10.0    0.0   15.6   0 100 c0t600144F0C6133EE90000525FE2120002d0
On Solaris, zfs_vdev_max_pending controls how many I/O requests can be pending per vdev. For example, when you have 100 disks visible from your OS with zfs:zfs_vdev_max_pending set to 2, you can have at most 200 outstanding requests. When those 100 disks are hidden behind a ZFSSA presenting a single LUN, you will have at most 2 pending requests!
Under Solaris 11, the default value is 10. This means the actv column is capped at 10 transactions : no more than 10 transactions can be in flight at the same time. We can grow this to 35 if required, as it was under Solaris 10.
SolarisHost # echo zfs_vdev_max_pending/W0t35 | mdb -kw
SolarisHost # echo "set zfs:zfs_vdev_max_pending = 35" >> /etc/system
SolarisHost # iostat -xnz 1
                    extended device statistics
    r/s    w/s     kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  772.9   28.4  98510.2 2805.7  0.0 18.9    0.0   23.6   0 100 c0t600144F0C6133EE90000525FE2120002d0
  858.8    8.3 109596.5  331.8  0.0 17.8    0.0   20.5   0 100 c0t600144F0C6133EE90000525FE2120002d0
Observe that asvc_t has increased. Do not increase this too much as this is a trade-off between throughput and latency.
Note : on the ZFS appliance, zfs_vdev_max_pending is still at 10. Do not change it!
zs3-2-tvp540-a# echo zfs_vdev_max_pending/D | mdb -k
zfs_vdev_max_pending:
zfs_vdev_max_pending:           10
Analytics
Useful set of analytics
The full name is provided as well as the CLI name. Some of these cannot easily be created from the BUI.
CPU: percent utilization broken down by CPU identifier : cpu.utilization[cpu]
Disk: ZFS DMU operations per second broken down by type of operation : zpool.dmuops[op]
Disk: ZFS DMU operations broken down by DMU object type : zpool.dmuops[dmutype]
Network: TCP bytes per second broken down by local service : tcp.bytes[local service]
Network: TCP packets per second broken down by local service : tcp.packets[local service]
Disk: Disks broken down by percent utilization : io.disks[utilization]
Disk: I/O operations per second broken down by type of operation : io.ops[op]
Disk: I/O operations per second broken down by latency : io.ops[latency=100000][disk] : 100ms
Disk: I/O bytes per second broken down by disk : io.bytes[disk]
Memory: kernel memory broken down by kmem cache : kmem.total[cache]
Memory: kernel memory in use broken down by kmem cache : kmem.inuse[cache]
Memory: kernel memory lost to fragmentation broken down by kmem cache : kmem.fragmented[cache]
Memory: dynamic memory usage broken down by application name : mem.heap[application]
NFSv3 operations per second taking at least 100000 microseconds broken down by type of operation : nfs3.ops[latency=100000][op]
Some nice analytics introduced in 2013.1
dataset-030  active  37.8K  0  repl.bytes[peer]
dataset-031  active  37.8K  0  repl.ops[dataset]
dataset-032  active  37.8K  0  repl.ops[direction]
dataset-033  active  71.1K  0  cpu.spins[cpu]
dataset-034  active  36.9K  0  cpu.spins[type]
dataset-035  active  33.9K  0  shadow.ops[latency]
dataset-036  active  37.8K  0  shadow.ops[share]
...
Examples
cli> analytics datasets create repl.bytes[peer]
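The latency-threshold statistics listed earlier can be created with the same verb; for example, applying the command shown above to one of the NFSv3 statistics from the list (quoting may be needed depending on the shell context):

cli> analytics datasets create nfs3.ops[latency=100000][op]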
Memory usage via mdb
::memstat
A useful tool, but it may return incorrect information. Double-check with freemem/D …
Page Summary                Pages        MB   %Tot
------------            ---------  --------   ----
Kernel                     2998817     11714    72%   : kernel + ARC cache for ZFS metadata (included in crash dump)
ZFS File Data              1058644      4135    25%   : ARC cache for ZFS data (not included in crash dump)
Anon                        123711       483     3%   : process heap space, stacks, and copy-on-write pages
Exec and libs                  263         1     0%
Page cache                    1438         5     0%   : cache used by filesystems other than ZFS (ufs, tmpfs, qfs)
Free (cachelist)              2287         8     0%   : freed but still containing valid vnode references (structure kept)
Free (freelist)               6898        26     0%   : freed and not containing any vnode references (structure not kept)
Total                      4192058     16375
Physical                   4192056     16375
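As noted above, the counters can be cross-checked directly with mdb; a minimal sketch (values are in pages, so multiply by the page size to convert to bytes):

# echo freemem/D | mdb -k       # free pages currently on the freelist
# echo physmem/D | mdb -k       # total physical pages
# pagesize                      # page size in bytes (4096 on x64, 8192 on SPARC)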
Things to consider
- Kernel memory embeds ZFS metadata (DDT, spacemap, L2ARC headers, …).
- A high level of 'kernel memory' is not always a problem, especially when the system is a replication target or a resilver is running.
- Small 'ZFS File Data' values mean the ARC data miss rate can be high, a possible performance issue if the I/O workload is read-dominated.
- A high value of 'Anon' memory means some userland process may be huge in size (true for ndmp, which is a 64-bit process).
- A high value of 'Page cache' means someone may have written some big files into /export (ftp can do this!). /export uses swap, which points to tmpfs (DRAM).
Try to collect ::memstat every minute to check how memory usage changes. Big memory movements are a sign of memory re-allocation.
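A minimal collection loop might look like this (the output path and the 60-second interval are just examples):

# while true; do date; echo ::memstat | mdb -k; sleep 60; done >> /var/tmp/memstat.log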
::arc -m
The dcmd output is slightly different between 2011.1 and 2013.1 :
- 2011.1

p              = 364385 MB      : target size of the MRU
c              = 388888 MB      : constantly changing number of what the ARC thinks it should be
c_min          = 16383 MB       : min target cache size, minimum size of the ARC
c_max          = 523246 MB      : max target cache size, maximum size of the ARC : 98% of memory
size           = 381169 MB      : ARC current size
hdr_size       = 5061417280     : header size for the ARC, in bytes
data_size      = 380546095104   : ARC size for data (payload), in bytes
l2_size        = 1691643692544  : L2 cache size currently used
l2_hdr_size    = 4942620096     : memory used for L2 headers, in bytes (here 4.7 GB)
arc_meta_used  = 54568 MB       : ARC metadata currently used
arc_meta_limit = 523246 MB      : ARC metadata max size (by default arc_meta_limit = arc_c_max)
arc_meta_max   = 342784 MB      : max size the ARC metadata has reached
- 2013.1

p           = 2529 MB       : target size of the MRU
c           = 261101 MB     : constantly changing number of what the ARC thinks it should be
c_min       = 64 MB         : min target cache size, minimum size of the ARC
c_max       = 261101 MB     : max target cache size, maximum size of the ARC : 98% of memory
size        = 44213 MB      : ARC current size
data_size   = 15015 MB      : ARC size for data (payload)
l2_hdr_size = 96 MB         : memory used for L2 headers
l2_size     = 1535512576    : L2 cache size currently used
meta_used   = 29198 MB      : ARC metadata currently used
meta_max    = 29198 MB      : max size the ARC metadata has reached
meta_limit  = 0 MB          : ARC metadata max size (0 = no limit)
Example :
> ::arc -m
...
l2_size        = 1018129928704
l2_hdr_size    = 2339951616

# zpool iostat 1        // to get the usable SSD size
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
cache           -      -      -      -      -      -
  c1t2d0     477G     8M     58      7   892K   773K
  c1t3d0     477G     8M     59      7   887K   772K
Max L2-headers size = SSD size * l2_hdr_size / l2_size = (477*2)*1024^3 * 2339951616 / 1018129928704 = 2354246416 = 2.354 GB
Possible outcome
When the L2 header size is too high, performance issues will occur. If DRAM cannot be increased, there are two possible workarounds :
- remove one of the cache devices. A maintenance window is needed, as traffic may stop while the cache device is removed by the 'zpool remove cachedevice' command.
- narrow down any shares/LUNs where the majority of I/Os are writes (analytics can help). From the BUI, select the share and set "Cache device usage → none".
::kmastat
To get a better picture of ::kmastat output, we need to introduce vmem.
vmem
Vmem stands for virtual memory and is essential for an operating system because it makes several things possible :
- It helps to isolate tasks from each other by encapsulating them in their private address spaces.
- Virtual memory can give tasks the impression that more memory is available than is physically present.
- By using virtual memory, there can be multiple copies of the same program, linked to the same addresses, running in the system.
Solaris' kernel memory allocator (also known as the slab allocator) allows the kernel to allocate memory dynamically. The slab allocator :
- relies on two lower-level system services to create slabs: a virtual address allocator to provide kernel virtual addresses, and VM routines to back those addresses with physical pages and establish virtual-to-physical translations.
- allows clients to create a cache of very frequently used small objects/structures, so that objects of the same type come from a single page or contiguous pages.
- reduces the risk of internal fragmentation.
It provides interfaces such as vmem_create(), kmem_cache_create(), kmem_cache_alloc() and kmem_cache_free() :
- zio_arena = vmem_create("zfs_file_data", …) : memory allocated in physical or virtual memory: PM or VM (see args)
- znode_cache = kmem_cache_create("zfs_znode_cache", …) : memory allocated in physical memory : PM
- dbuf_cache = kmem_cache_create("dmu_buf_impl_t", …) : memory allocated in physical memory : PM
- zp = kmem_cache_alloc(znode_cache, …);
- kmem_cache_free(znode_cache, zp);
The memory returned from the kmem calls is allocated in physical memory.
The slab allocator sources its virtual address ranges from the vmem backend allocator. The vmem resource map allocator provides the vmem_create() interface, allowing clients to register a VA range with the allocator from which subsequent allocations will be made. As written before, VM routines establish the virtual-to-physical translations of pages.
vmem_create(const char *name, void *base, size_t size, size_t quantum,
            vmem_alloc_t *afunc, vmem_free_t *ffunc, vmem_t *source,
            size_t qcache_max, int vmflag)

name       : name of the virtual memory segment
base       : base address of the segment in the kernel address space (KAS)
size       : how big the segment should be
quantum    : unit of allocation (if equal to PAGESIZE -> virtual memory)
afunc      : allocator function for the segment
ffunc      : free function for the segment
source     : parent vmem arena; if NULL, it defaults to kmem_default
qcache_max : the maximum sized allocation up to which the slab allocator is allowed to cache such objects
vmflag     : VM_SLEEP = can block for memory - success guaranteed; VM_NOSLEEP = cannot block for memory - may fail
ZFS data buffers
Typical buffer names : zfs_file_data, zfs_file_data_buf, zio_data_buf. ZFS data is never captured in a kernel dump. This is because we do not use the heap arena but create a new arena in the KAS (Kernel Address Space) : the zfs_file_data arena.
# startup.c
segkmem_zio_init(segzio_base, mmu_ptob(segziosize));    /* create zio area covering new segment */

# seg_kmem.c
void
segkmem_zio_init(void *zio_mem_base, size_t zio_mem_size)
{
        zio_arena = vmem_create("zfs_file_data", zio_mem_base, zio_mem_size,
            PAGESIZE, NULL, NULL, NULL, 256 * 1024, VM_SLEEP);
            // quantum = PAGESIZE : VM
            // Here, we allocate 'zio_mem_size' bytes for zfs_file_data and round to quantum multiples (i.e. PAGESIZE)

        zio_alloc_arena = vmem_create("zfs_file_data_buf", NULL, 0, PAGESIZE,
            segkmem_zio_alloc, segkmem_zio_free, zio_arena, 0, VM_SLEEP);
            // quantum = PAGESIZE but refers to segkmem_zio_alloc : invokes vmem_alloc() to get a virtual address
            // and then backs it with physical pages : PM
}

# zio.c
vmem_t *data_alloc_arena = zio_alloc_arena;
loop on 'c'
        (void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
        zio_data_buf_cache[c] = kmem_cache_create(name, size, align, NULL, NULL, NULL,
            NULL, data_alloc_arena, cflags | KMC_NOTOUCH);          // PM
endloop
Any kmem_cache_alloc() call involving one of the above zio_data_buf caches therefore returns memory backed by the zfs_file_data arena, which is not captured in a kernel dump.
Let's have a look at ::kmastat :
> ::kmastat -m
cache                       buf    buf    buf    memory     alloc alloc
name                       size in use  total    in use   succeed  fail     = object caches
------------------------- ------ ------ ------ ---------- --------- -----
kmem_magazine_31              256    441   4095         1M   2429326     0  <== per-CPU caches added to the slab
kmem_magazine_47              384   3640   7590         2M    838299     0      algorithm to avoid spinlock
kmem_magazine_63              512    126   2009         1M   1249436     0      contention
...
zio_data_buf_69632          69632      4     17         1M    288907     0
zio_data_buf_90112          90112      1     14         1M    908801     0
zio_data_buf_102400        102400      0     11         1M    185426     0
zio_data_buf_114688        114688      0     10         1M    167585     0
zio_data_buf_118784        118784      0     11         1M    127970     0
zio_data_buf_122880        122880      0     10         1M    124515     0
zio_data_buf_126976        126976      0     12         1M     85861     0
zio_data_buf_131072        131072  32185  32879      4109M  56897448     0
------------------------- ------ ------ ------ ---------- --------- -----
Total [zfs_file_data_buf]                            4135M 168478350     0

vmem                         memory     memory     memory     alloc alloc    <= vmem arenas
name                         in use      total     import   succeed  fail
------------------------- ---------- ----------- ---------- --------- -----
zfs_file_data                  4142M   8388608M         0M   5809290     0   < 8 TB of VM : the arena itself
zfs_file_data_buf              4135M      4135M      4135M  10342009     0   < 4 GB of PM : subset of the zfs_file_data
                                                                               arena, matches Total [zfs_file_data_buf]
ZFS metadata buffers
We rely on kmem to create caches such that all the ZFS metadata is part of the 'kernel heap' and hence is captured as part of the crash dump. In practice, we pass NULL to kmem_cache_create() for the vmem_t argument. This means that the zio_buf caches are allocated from the default kmem arena (kmem_default) :
# zio.c
loop on 'c'
        (void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
        zio_buf_cache[c] = kmem_cache_create(name, size, align, NULL, NULL, NULL,
            NULL, NULL, cflags);        // last-but-one arg is NULL => kmem_default
endloop
ZFS caches for metadata :
- zio.c: zio_buf_cache[c] = kmem_cache_create("zio_buf_%lu", …);
- dbuf.c: dbuf_cache = kmem_cache_create("dmu_buf_impl_t", …);
- dnode.c: dnode_cache = kmem_cache_create("dnode_t", …);
- sa.c: sa_cache = kmem_cache_create("sa_cache", …);
- zfs_znode.c: znode_cache = kmem_cache_create("zfs_znode_cache", …);
- zio.c: zio_cache = kmem_cache_create("zio_cache", …);
- …
As these buffers are all allocated through kmem_cache_create() calls, they correspond to physical memory (PM).
Let's have a look at ::kmastat :
> ::kmastat -m
cache                       buf     buf     buf    memory      alloc alloc
name                       size  in use   total    in use    succeed  fail     = object caches
------------------------- ------ ------- ------- ---------- ---------- -----
zio_buf_512                   512 2409586 4916584      2400M  552832304     0
zio_buf_2048                 2048   71804   72078       140M   26528668     0
zio_buf_4096                 4096    2633    2703        10M   74524377     0
zio_buf_12288               12288    1282    1353        15M   36569461     0
zio_buf_16384               16384  108731  108854      1700M  254805641     0
zio_buf_32768               32768      36      92         2M   22355627     0
zio_buf_40960               40960      40      94         3M   27766298     0
zio_buf_45056               45056       1      32         1M   23590138     0
zio_buf_49152               49152       0      25         1M   12426311     0
zio_buf_53248               53248      12      35         1M    5843836     0
zio_buf_57344               57344      57     111         6M    2841986     0
zio_buf_61440               61440       1      38         2M    6549109     0
zio_buf_126976             126976       0      13         1M     658728     0
zio_buf_131072             131072    5059    6799       849M  107583446     0
dmu_buf_impl_t                192 2616165 5710460      1115M  860908891     0
dnode_t                       744 2423090 2454573      1743M  412891701     0
sa_cache                       56    8983  151088         8M  438341951     0
zfs_znode_cache               248    8983   15136         3M  439890993     0
zio_cache                     816   20102   28750        22M 1922767786     0
------------------------- ------ ------- ------- ---------- ---------- -----
Total [kmem_default]                                   9611M 2194965189     0

vmem                         memory      memory     memory      alloc alloc
name                         in use       total     import    succeed  fail
------------------------- ---------- ------------ ---------- ---------- -----
heap                           16815M    8388608M         0M   36378781     0
kmem_va                        15017M      15017M     15017M    2756890     0   < 15 GB of VM
kmem_default                    9611M       9611M      9611M   21280467     0   < 9.6 GB of PM, matches Total [kmem_default]
To understand this data, focus first on the kmem_default line, which accounts for all the kmem allocations made by the different caches on the system, then look at individual caches to see how much each one consumed.
Quantum caches
The intent of quantum caches is to prevent fragmentation in the vmem arena. The zfs_file_data caches shown below are the quantum caches of the zfs_file_data arena; the maximum object size they handle is bounded by the qcache_max argument passed to vmem_create() :
zio_arena = vmem_create("zfs_file_data", zio_mem_base, zio_mem_size, PAGESIZE, NULL, NULL, NULL, 32 * 1024, VM_SLEEP); ^^^^^^^^^ qcache_max
Let’s see the kmastat output :
> ::kmastat -m
cache                       buf    buf    buf    memory    alloc alloc
name                       size in use  total    in use  succeed  fail    <= object caches
------------------------- ------ ------ ------ ---------- --------- -----
zfs_file_data_4096           4096    163    448         1M   2837290     0
zfs_file_data_8192           8192     68    208         1M    977461     0
zfs_file_data_12288         12288     58    160         2M   1589976     0   >> see the 4 KB pattern
zfs_file_data_16384         16384     12     48         0M    467541     0      (PAGESIZE = 4 KB on x64)
zfs_file_data_20480         20480     65    108         2M   1546868     0
zfs_file_data_24576         24576     13     40         1M    494350     0
zfs_file_data_28672         28672     31     68         2M    912287     0
zfs_file_data_32768         32768     16     32         1M    340006     0
How to determine the type of each object cache : PM or VM ?
SCAT “kma stat” can be of great help :
CAT(vmcore.4/11X)> kma stat
                                   buf      buf        buf       memory
cache name                        size    avail      total       in use
============================== ======= ======== ========== ============
kmem_magazine_1                     16    57562      61746      1007616  PK
kmem_magazine_3                     32    45903      49000      1605632  PK
...
kmem_va_4096                      4096   230223    1466784   6007947264  QU
kmem_va_8192                      8192   284936     557168   4564320256  QU
kmem_va_12288                    12288      700       2180     28573696  QU
kmem_va_16384                    16384   143227     252336   4134273024  QU
...
kmem_alloc_8                         8  1188947    1393310     11345920  PKU
kmem_alloc_16                       16    56834     136293      2224128  PKU
kmem_alloc_24                       24   155669     252170      6184960  PKU
kmem_alloc_32                       32   136623     170750      5595136  PKU
...
zfs_file_data_135168            135168        0          0            0  QU
zfs_file_data_139264            139264        0          0            0  QU
...
zio_cache                          872    27906      27945     25436160  PKU
...
zio_buf_131072                  131072        0       6715    880148480  PK
zio_data_buf_131072             131072     1046     114558  15015346176  PZ
...
dnode_t                            752   914325    1511025   1237831680  PKU
dmu_buf_impl_t                     208  1403215    2140958    461545472  PKU
------------------------------ ------- -------- ---------- ------------
Total [hat_memload]                                             1978368  P
Total [kmem_default]                                        10078785536  P
Total [id32]                                                       4096  P
Total [zfs_file_data]                                          13107200
Total [zfs_file_data_buf]                                    4336205824  P
------------------------------ ------- -------- ---------- ------------

arena                                memory        memory       memory
name                                 in use         total       import
------------------------------ ------------ ------------- ------------
heap                            17632604160 8796093022208            0
kmem_va                         15747022848   15747022848  15747022848
kmem_default                    10078785536   10078785536  10078785536  P
zfs_file_data                    4344115200 8796093022208            0
zfs_file_data_buf                4336205824    4336205824   4336205824  P

P - physical memory
K - kvps[KV_KVP] vnode (kernel memory)
Z - kvps[KV_ZVP] vnode (ZFS kernel memory)
Q - quantum cache (KMC_QCACHE)
U - enabling KMF_AUDIT will allow "kma users" to work for this cache
F - cache has failed allocations (will be in red color)
For a deep understanding, see :
- Magazines and Vmem, by Jeff Bonwick : http://www.parrot.org/sites/www.parrot.org/files/vmem.pdf
- Slab Allocator, from HelenOS : http://www.helenos.org/doc/design/html.chunked/mm.html#slab
Paging and freemem rules
The pageout daemon is visible from prstat. It looks for pages from userland processes that can be pushed back to disk (swap). When the free page list (freelist) falls below a certain level, the page daemon is invoked. The daemon scans memory to identify pages that can be pushed out to disk and reused. The system's page-replacement strategy is to never let the system run out of memory.
> ::memstat
Page Summary                 Pages                MB   %Tot
------------      ----------------  ----------------   ----
Kernel                      3843924             15015     6%
ZFS File Data                150299               587     0%
Anon                          97798               382     0%
Exec and libs                  3451                13     0%
Page cache                     8696                33     0%
Free (cachelist)               7493                29     0%
Free (freelist)            62992480            246064    94%
Total                      67104141            262125
Physical                   67104140            262125

> physmem/D
physmem:
physmem:        67104141        << in pages (4 KB on x64, 8 KB on SPARC)
To prevent the system from running out of memory, the OS uses the following tunable parameters as thresholds :
Parameter | Meaning | Value | Comments
---|---|---|---
lotsfree | Lots of free memory | MAX(physmem/64, 512KB) | if freemem < lotsfree, the OS runs the page daemon
desfree | Desired free memory | lotsfree/2 | if freemem < desfree, the OS starts soft swapping
minfree | Minimum acceptable memory level | desfree/2 | if freemem < minfree, the system biases allocations toward those necessary to successfully complete pageout operations or to swap processes completely out of memory
throttlefree | Same as minfree, can be tuned | minfree | if freemem < throttlefree, the OS cuts off user and non-urgent kernel requests
pageout_reserve | Page daemon and swapper only | throttlefree/2 | if freemem < pageout_reserve, the OS fails all requests except those from the page daemon and the swapper
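These thresholds and the current freemem value can be read on a live system with mdb; a quick sketch (all values are in pages):

# echo lotsfree/D | mdb -k
# echo desfree/D | mdb -k
# echo minfree/D | mdb -k
# echo throttlefree/D | mdb -k
# echo pageout_reserve/D | mdb -k
# echo freemem/D | mdb -k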
Beyond the pageout daemon, ZFS uses its own strategy to reduce the ARC size and avoid memory starvation :
In the arc.c code, arc_reclaim_thread() checks whether freemem < lotsfree + desfree. If TRUE, it calls kmem_cache_reap_now() to look into the different kernel memory buffers, free up unused space and put it back on the cachelist (the structure is kept so that future allocations are faster). This is a single-threaded task and may lead to performance issues, as threads trying to allocate memory may have to wait, e.g. :
> ::stacks -m zfs
THREAD           STATE    SOBJ                COUNT
ffffff86b38ab440 SLEEP    CV                     36
                 swtch+0x150
                 cv_wait+0x61
                 vmem_xalloc+0x63f
                 vmem_alloc+0x161
                 segkmem_xalloc+0x90
                 segkmem_alloc_vn+0xcd
                 segkmem_zio_alloc+0x24
                 vmem_xalloc+0x550
                 vmem_alloc+0x161
                 kmem_slab_create+0x81
                 kmem_slab_alloc+0x5b
                 kmem_cache_alloc+0x1fa
                 zio_data_buf_alloc+0x2c
                 arc_get_data_buf+0x18b
                 arc_buf_alloc+0xa2
                 arc_read_nolock+0x12f
                 arc_read+0x79
                 dsl_read+0x33
                 dbuf_read_impl+0x17e
                 dbuf_read+0xfd
                 dmu_tx_check_ioerr+0x6b
                 dmu_tx_count_write+0x175
                 dmu_tx_hold_write+0x5b
                 zfs_write+0x655
For ARC memory allocation changes, the sequence is as follows :
arc_reclaim_thread()
{
        // The arc_reclaim_thread() wakes up every second and will attempt to reduce
        // the size of the ARC to the target size (arc_c).
        if (arc_reclaim_needed() == TRUE) {     // freemem < lotsfree + desfree ?
                arc_kmem_reap_now() {
                        kmem_cache_reap_now(zio_buf_cache[i]);
                        kmem_cache_reap_now(zio_data_buf_cache[i]);
                        kmem_cache_reap_now(buf_cache);
                        kmem_cache_reap_now(hdr_cache);
                }
        }
        if (arc_size > arc_c) {
                arc_adjust();   // resize the ARC lists (MRU/MFU/ghost lists).
        }
}
The kmem_cache reaper activity can be seen thanks to kmem.fragmented[cache] analytic.
Dtrace
cpu usage
Display the busy threads which are not in the idle loop (curthread->t_pri != -1)
# dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1 && cpu == 58/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }'
...
              zfs`zio_compress_data+0x41
              zfs`zio_write_bp_init+0x26d
              zfs`zio_execute+0x8d
              genunix`taskq_thread+0x22e
              unix`thread_start+0x8
            25898
- arg0 != 0 : to check that the thread is running in the kernel
- curthread->t_pri != -1 : to ensure that the CPU is not executing the kernel idle loop
Determine why threads are being switched off-CPU by looking at the user and kernel stacks when the sched:::off-cpu probe fires. The following displays the threads which are switched off-CPU :
# dtrace -qn 'sched:::off-cpu /execname != "sched"/ { @[execname, stack()] = count(); } END { trunc(@,5); }'
...
  zpool-pool-raid
              unix`swtch+0x146
              genunix`cv_wait+0x60
              genunix`taskq_thread_wait+0x86
              genunix`taskq_thread+0x2a4
              unix`thread_start+0x8
            51328
It looks like we are waiting for some tasks to complete in one of the taskq worker threads. We should dig further with the taskq.d trace script (see the DTrace book by Brendan Gregg).
# dtrace -n 'sched:::off-cpu /execname != "sched"/ { self->start = timestamp; } sched:::on-cpu /self->start/ { this->delta = (timestamp - self->start) >> 20; @["off-cpu (ms):", stack()] = quantize(this->delta); self->start = 0; }'
...
  off-cpu (ms):
              genunix`cv_wait+0x60
              zfs`zio_wait+0x5c
              zfs`dmu_buf_hold_array_by_dnode+0x238
              zfs`dmu_buf_hold_array+0x6f
              zfs`dmu_read_uio+0x4b
              zfs`zfs_read+0x267
              genunix`fop_read+0xa9
              genunix`vn_rdwr+0x193
              unix`mmapobj_map_interpret+0x20f
              unix`mmapobj+0x7a
              genunix`mmapobjsys+0x1d5
              unix`sys_syscall+0x17a

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@                          3
               1 |                                         0
               2 |@@@@@@@@@@@@@@@                          3
               4 |@@@@@@@@@@                               2
               8 |                                         0
Flamegraph
Brilliant tool from Brendan Gregg, available here : http://www.brendangregg.com/flamegraphs.html
Steps :
- download the flamegraph scripts
- collect CPU usage with dtrace :

  # dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }' > sample.txt

- create an svg file :

  # stackcollapse.pl sample.txt | flamegraph.pl > example.svg

- This produces a mouse-over picture giving the time spent in each function. The X axis shows the amount of time (in %) spent in each function during the capture; the Y axis shows the call stacks and their depth.
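Put end to end, the whole sequence might look like the sketch below (assuming the scripts were fetched from Brendan Gregg's FlameGraph repository; the output file names are arbitrary):

# git clone https://github.com/brendangregg/FlameGraph
# dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }' -o sample.txt
# FlameGraph/stackcollapse.pl sample.txt | FlameGraph/flamegraph.pl > example.svg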
spa_sync activity
spa_sync activity is the time taken to flush data from DRAM to stable storage (i.e. the disks).
Examples
- bad spa_sync time

  # dtrace -n 'spa_sync:entry /args[0]->spa_name == "pool-b232"/ {self->t = timestamp} spa_sync:return /self->t/ {printf("spa_sync=%d msec", (timestamp - self->t) >> 20); self->t = 0}'
  CPU     ID                    FUNCTION:NAME
   63  44795                 spa_sync:return spa_sync=5802 msec
   36  44795                 spa_sync:return spa_sync=4617 msec
   63  44795                 spa_sync:return spa_sync=4875 msec
   22  44795                 spa_sync:return spa_sync=6088 msec
   31  44795                 spa_sync:return spa_sync=1108 msec
- good spa_sync time

  # dtrace -n 'spa_sync:entry /args[0]->spa_name == "pool-b232"/ {self->t = timestamp} spa_sync:return /self->t/ {printf("spa_sync=%d msec", (timestamp - self->t) >> 20); self->t = 0}'
  CPU     ID                    FUNCTION:NAME
   55  44795                 spa_sync:return spa_sync=608 msec
   42  44795                 spa_sync:return spa_sync=930 msec
   41  44795                 spa_sync:return spa_sync=664 msec
   31  44795                 spa_sync:return spa_sync=624 msec
   27  44795                 spa_sync:return spa_sync=948 msec
arcreap.d
From Brendan Gregg : measures how long we spend in arc_kmem_reap_now and arc_adjust.
# cat arcreap.d
#!/usr/sbin/dtrace -s

fbt::arc_kmem_reap_now:entry,
fbt::arc_adjust:entry
{
        self->start[probefunc] = timestamp;
}

fbt::arc_shrink:entry
{
        trace("called");
}

fbt::arc_kmem_reap_now:return,
fbt::arc_adjust:return
/self->start[probefunc]/
{
        printf("%Y %d ms", walltimestamp, (timestamp - self->start[probefunc]) / 1000000);
        self->start[probefunc] = 0;
}

# ./arcreap.d
dtrace: script './arcreap.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  64929                  arc_shrink:entry   called
  0  62414                  arc_adjust:return 2012 Jan  9 23:10:01 18 ms
  9  62420           arc_kmem_reap_now:return 2012 Jan  9 23:10:03 1511 ms
  0  62414                  arc_adjust:return 2012 Jan  9 23:10:24 0 ms
  6  62414                  arc_adjust:return 2012 Jan  9 23:10:49 0 ms
snoop and tcpdump
Snoop can be used to get the latency and throughput. Since 2011.1.9 and 2013.1, tcpdump is available as well.
# snoop -q -d aggr2001001 -s 200 -o <capture_file> port 2049 or port 111 or port 4045
    port 2049 (nfs)
    port 111  (rpcbind)
    port 4045 (nfs-locking)
There is a workflow as well : snoop-all.akwf (see the attachments section).
In some cases, tcpdump can be used. It seems to capture more detail than snoop and is a less memory-intensive consumer.
# tcpdump -c 200000 -i aggr1 -w /var/ak/dropbox/cpdump.out
tcptrace
This command is available from cores3 : /sw/es-tools/bin/tcptrace
- overall latency

  # tcptrace -lr port_216_snoop | egrep "RTT max"
       dq->dr:                          dr->dq:
       RTT max:         120.0 ms        RTT max:         121.7 ms
       RTT max:         118.4 ms        RTT max:         120.0 ms
       RTT max:         120.0 ms        RTT max:         121.0 ms
- throughput

  # tcptrace -lr port_216_snoop | grep throughput
       dq->dr:                          dr->dq:
       throughput:        336 Bps       throughput:        273 Bps
       throughput:         36 Bps       throughput:         33 Bps
       throughput:        198 Bps       throughput:       1492 Bps
filebench
Customers love to use dd or tar as a performance tool. In many cases this does not represent the real workload, hence we need a tool that simulates a genuine workload : filebench.
See http://sourceforge.net/apps/mediawiki/filebench/index.php?title=Filebench
It is available for Linux and Solaris.
# ./go_filebench
...
statfile1            1051ops/s   0.0mb/s      0.5ms/op      385us/op-cpu
deletefile1          1051ops/s   0.0mb/s      4.1ms/op      880us/op-cpu
closefile3           1051ops/s   0.0mb/s      0.0ms/op       12us/op-cpu
readfile1            1051ops/s 142.1mb/s      1.3ms/op      819us/op-cpu
openfile2            1051ops/s   0.0mb/s      0.5ms/op      410us/op-cpu
closefile2           1051ops/s   0.0mb/s      0.0ms/op       12us/op-cpu
appendfilerand1      1051ops/s   8.3mb/s      2.4ms/op      446us/op-cpu
openfile1            1052ops/s   0.0mb/s      0.6ms/op      420us/op-cpu
closefile1           1052ops/s   0.0mb/s      0.0ms/op       13us/op-cpu
wrtfile1             1052ops/s 136.1mb/s     30.5ms/op     3910us/op-cpu
createfile1          1052ops/s   0.0mb/s      4.6ms/op     1493us/op-cpu

2492: 109.065: IO Summary: 694029 ops, 11566.6 ops/s, (1051/2103 r/w) 286.5mb/s, 3379us cpu/op, 14.8ms latency
                                                       ^^^^^^^^^ MB/s
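To reproduce a similar run, an interactive session might look like the sketch below (this assumes go_filebench presents the usual filebench> prompt; the workload name, target directory and run length are illustrative and the syntax can vary between filebench versions):

# ./go_filebench
filebench> load fileserver
filebench> set $dir=/export/fbtest
filebench> run 60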
kernel dump
This should be a last resort. Remember this : when developing a picture of a racing cyclist, it is never easy to determine how fast he was going! A kernel dump can be helpful if the system is hung, or if the issue is too short-lived to run dtrace scripts.
How to force an NMI :
*** ILOM 2.x ***
-> cd /SP/diag
/SP/diag -> set generate_host_nmi=true
Set 'generate_host_nmi' to 'true'

*** ILOM 3.x ***
-> cd /HOST/
-> set generate_host_nmi=true

The console session should report something similar to the following:

-> start /SP/console
panic[cpu2]/thread=ffffff001eccbc60: NMI received

ffffff001eccbac0 pcplusmp:apic_nmi_intr+7c ()
ffffff001eccbaf0 unix:av_dispatch_nmivect+30 ()
ffffff001eccbb00 unix:nmiint+154 ()
ffffff001eccbbf0 unix:mach_cpu_idle+b ()
ffffff001eccbc20 unix:cpu_idle+c2 ()
ffffff001eccbc40 unix:idle+114 ()
ffffff001eccbc50 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
100% done: 356267 pages dumped, compression ratio 3.84, dump succeeded

The real dump file is created after the reboot. This may take time, so wait before collecting a supportfile bundle.
Useful commands to run :
::memstat
> ::memstat
Page Summary                 Pages                MB   %Tot
------------      ----------------  ----------------   ----
Kernel                      2417967              9445    38%
ZFS File Data               3686849             14401    59%
Anon                         169828               663     3%
Exec and libs                     6                 0     0%
Page cache                      138                 0     0%
Free (cachelist)               9806                38     0%
Free (freelist)                2534                 9     0%
Total                       6287128             24559
Physical                    6287127             24559
::kmastat
> ::kmastat -g ! grep zio_cache
cache                       buf    buf    buf    memory     alloc alloc
name                       size in use  total    in use   succeed  fail
------------------------- ------ ------ ------ ---------- --------- -----
zio_cache                     816      2 137450       107M 2960373347    0

Only 2 buffers are in use out of 137450 allocated; memory in use : 137450 x 816 = 107 MB.
Evidence of memory fragmentation is when 'memory in use' is high (physical memory held by the slab allocator) but 'buf in use' x 'buf size' is low.
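A rough way to flag such caches from within mdb is to compare those two numbers for every cache; the one-liner below is only an illustration (it assumes the ::kmastat -m column layout shown earlier, and the 100 MB / 25% thresholds are arbitrary):

> ::kmastat -m ! awk '$2 ~ /^[0-9]+$/ && $5 ~ /M$/ { used = $2 * $3 / 1048576; alloc = $5 + 0; if (alloc >= 100 && used < alloc / 4) printf "%-28s %6.0fM live data in %6dM of slab memory\n", $1, used, alloc }'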
::stacks
> ::stacks
ffffff0030e56c40 SLEEP    MUTEX                 187
                 swtch+0x146
                 turnstile_block+0x6ff
                 mutex_vector_enter+0x25e
                 zfs_zget+0x46
                 zfs_vget+0x1f5
                 fsop_vget+0x66
                 nfs3_fhtovp+0x47
                 rfs3_write+0x4a
                 common_dispatch+0x649
                 rfs_dispatch+0x2d
                 svc_process_request+0x184
                 stp_dispatch+0xb9
                 taskq_d_thread+0xc1
                 thread_start+8

> ::stacks -m zfs
> ::stacks -m nfs
> ::stacks -m smbsrv
> ::stacks -m iscsit
> ::stacks -m pmcs
> ::stacks -m sata
> ::stacks -m mpt
> ::stacks -c vmem_xalloc

> ::stacks -c mutex_enter
THREAD           STATE    SOBJ                COUNT
ffffff00318fbc40 RUN      NONE                    1
                 swtch+0x146
                 preempt+0xca
                 kpreempt+0x89
                 sys_rtt_common+0x13b
                 _sys_rtt_ints_disabled+8
                 mutex_enter+0x10
                 segkmem_page_create+0x92
                 segkmem_xalloc+0xc6
                 segkmem_alloc_vn+0xcb
                 segkmem_alloc+0x24
                 vmem_xalloc+0x502
                 vmem_alloc+0x172
                 kmem_slab_create+0x81
                 kmem_slab_alloc+0x60
                 kmem_cache_alloc+0x267
                 zio_buf_alloc+0x29
                 zil_alloc_lwb+0x4a
                 zil_get_current_lwb+0x281
                 zil_get_commit_list+0x185
                 zil_get_train+0x94
                 zil_commit+0xc0
                 zfs_fsync+0xcc
                 fop_fsync+0x5f
                 rfs3_commit+0x173
                 common_dispatch+0x539
                 rfs_dispatch+0x2d
                 svc_process_request+0x184
                 stp_dispatch+0xb9
                 taskq_d_thread+0xc1
                 thread_start+8

> ::stacks -c spa_sync
THREAD           STATE    SOBJ                COUNT
ffffff0030d2ac40 SLEEP    MUTEX                   1
                 swtch+0x146
                 turnstile_block+0x6ff
                 mutex_vector_enter+0x25e
                 zil_sync+0x5f
                 dmu_objset_sync+0x241
                 dsl_dataset_sync+0x63
                 dsl_pool_sync+0x18d
                 spa_sync+0x395
                 txg_sync_thread+0x244
                 thread_start+8

> ::stacks -c kmem_cache_reap_now
THREAD           STATE    SOBJ                COUNT
ffffff002e173c40 SLEEP    CV                      1
                 swtch+0x146
                 cv_wait+0x60
                 taskq_wait+0x4b
                 kmem_cache_reap_now+0x4e
                 arc_buf_cache_reap_now+0x38
                 arc_reaper_thread+0xb0
                 thread_start+8
::findstack
> ffffff002e173c40::findstack -v
stack pointer for thread ffffff002e173c40: ffffff002e173ac0
[ ffffff002e173ac0 _resume_from_idle+0xf5() ]
  ffffff002e173af0 swtch+0x146()
  ffffff002e173b20 cv_wait+0x60(fffff6000df96b5a, fffff6000df96b48)
  ffffff002e173b60 taskq_wait+0x4b(fffff6000df96b28)
  ffffff002e173b80 kmem_cache_reap_now+0x4e(fffff60005d4a008)
  ffffff002e173bc0 arc_buf_cache_reap_now+0x38()
  ffffff002e173c20 arc_reaper_thread+0xb0()
  ffffff002e173c30 thread_start+8()
::cpuinfo
> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc3ef20  1f   11    0 100   no    no t-1297 fffff6002767fae0 akd
  1 fffff6000dd0eac0  1f   13    0 100   no    no t-1298 fffff6002dfb9840 akd   <== ticks : 12s. CPU 1 has been
  2 fffff6000dd0c080  1f   12    0 100   no    no t-1296 fffff6002dc3d7c0 akd       running the same thread for 12s
  3 fffff6000dd0a000  1f   10    0 100   no    no t-1296 fffff6002de034e0 akd
  4 fffff6000dd06b00  1f    8    0 100   no    no t-1296 fffff6002dc3c440 akd
  5 fffffffffbc49110  1b   28    0 100   no    no t-1296 fffff60026a03160 akd
  6 fffff6000dd02000  1f   13    0 100   no    no t-1298 fffff60029292b20 akd
  7 fffff6000dfbcac0  1f    9    0 100   no    no t-1296 fffff6002df3c000 akd
::taskq
> fffff6000df96b28::taskq -T
            ADDR NAME                             ACT/THDS Q'ED  MAXQ INST
fffff6000df96b28 kmem_taskq                          1/  1  570   935    -

THREAD           STATE    SOBJ                COUNT
ffffff002ed96c40 RUN      NONE                    1
                 swtch+0x146
                 preempt+0xca
                 kpreempt+0x89
                 sys_rtt_common+0x13b
                 _sys_rtt_ints_disabled+8
                 kmem_partial_slab_cmp+0x42
                 avl_find+0x56
                 avl_add+0x2c
                 avl_update_gt+0x56
                 kmem_slab_free+0x1b9
                 kmem_magazine_destroy+0xdf
                 kmem_depot_ws_reap+0x77
                 kmem_cache_reap+0x76
                 taskq_thread+0x22e
                 thread_start+8
::spa
> ::spa
ADDR                 STATE NAME
fffff60039b60500    ACTIVE pool-0
fffff60009817580    ACTIVE system

> fffff60039b60500::print spa_t spa_uberblock.ub_timestamp
spa_uberblock.ub_timestamp = 0x52976059         : in seconds since the epoch

> time/J
time:
time:           5297619e                        : in seconds since the epoch

> 5297619e-52976059=D
                325                             : 325 seconds since the last TXG completed : quite long!

> 0x52976059=Y
                2013 Nov 28 15:25:13
::zio_state
> ::zio_state
ADDRESS                 TYPE  STAGE            WAITER
fffff6001b0b4cc8        NULL  OPEN             -
ffff60037f0e340         NULL  OPEN             -
fffff60039911cc0        NULL  OPEN             -
fffff60037f25338        NULL  CHECKSUM_VERIFY  ffffff007bb30c40

> ffffff007bb30c40::findstack -v
stack pointer for thread ffffff007bb30c40: ffffff007bb2ffb0
[ ffffff007bb2ffb0 _resume_from_idle+0xf4() ]
  ffffff007bb2ffe0 swtch+0x150()
  ffffff007bb30010 cv_wait+0x61(fffff60037f25650, fffff60037f25648)
  ffffff007bb30050 zio_wait+0x5d(fffff60037f25338)          : arg0 is the zio_t
  ffffff007bb300b0 dbuf_read+0x1e5(fffff6006a5539f8, 0, a)
  ffffff007bb30130 dmu_buf_hold+0xac(fffff6001b7fa0c0, 18a, 66715000, 0, ffffff007bb30148, 1)
  ffffff007bb301a0 zap_get_leaf_byblk+0x5c(fffff60037bf44c0, 66715, 0, 1, ffffff007bb30360)
  ffffff007bb30210 zap_deref_leaf+0x78(fffff60037bf44c0, 10b0600000000000, 0, 1, ffffff007bb30360)
  ffffff007bb302a0 fzap_cursor_retrieve+0xa7(fffff60037bf44c0, ffffff007bb30350, ffffff007bb30390)
  ffffff007bb30330 zap_cursor_retrieve+0x188(ffffff007bb30350, ffffff007bb30390)
  ffffff007bb30640 ddt_zap_walk+0x5d(fffff6001b7fa0c0, 18a, ffffff007bb306e0, fffff60051b8c100)
  ffffff007bb30680 ddt_object_walk+0x4e(fffff60052bc3000, 0, 1, fffff60051b8c100, ffffff007bb306e0)
  ffffff007bb306d0 ddt_walk+0x78(fffff600513cc080, fffff60051b8c0e8, ffffff007bb306e0)
  ffffff007bb308a0 dsl_scan_ddt+0x2de(fffff60051b8c040, fffff6004256b1e8)
  ffffff007bb30a60 dsl_scan_visit+0x54(fffff60051b8c040, fffff6004256b1e8)
  ffffff007bb30ac0 dsl_scan_sync+0x20f(fffff6003a230a00, fffff6004256b1e8)
  ffffff007bb30b80 spa_sync+0x3f7(fffff600513cc080, 8824b0)
  ffffff007bb30c20 txg_sync_thread+0x247(fffff6003a230a00)
  ffffff007bb30c30 thread_start+8()
::zio
Displays a recursive stack of IOs
> fffff60037f25338::zio -rc
ADDRESS                 TYPE  STAGE            WAITER
fffff60037f25338        NULL  CHECKSUM_VERIFY  ffffff007bb30c40
 fffff60041603998       READ  VDEV_IO_START    -
  fffff600399b9340      READ  VDEV_IO_START    -
   fffff6003e24fcc8     READ  VDEV_IO_START    -
    fffff60039999cc8    READ  VDEV_IO_START    -           : the last one

> fffff60039999cc8::print zio_t io_vd
io_vd = 0xfffff600513cad40

> fffff60039999cc8::print zio_t io_vd | ::print vdev_t vdev_path
vdev_path = 0xfffff60050f0bc38 "/dev/dsk/c3t5000CCA396E3D6FAd0s0"
::nm
Displays the size of a variable and can provide the argument types of any function in the kernel
> ::nm ! grep arc_meta_limit
0xffffffffc00043b8|0x0000000000000008|OBJT |LOCL |0x0  |8      |arc_meta_limit

> kmem_cache_reap_now::nm -f ctype
C Type
void (*)(kmem_cache_t *)