{"id":340,"date":"2023-09-30T07:27:25","date_gmt":"2023-09-30T07:27:25","guid":{"rendered":"https:\/\/fredpayet.fr\/?page_id=340"},"modified":"2023-09-30T07:27:25","modified_gmt":"2023-09-30T07:27:25","slug":"tools-and-examples","status":"publish","type":"page","link":"https:\/\/fredpayet.fr\/index.php\/tools-and-examples\/","title":{"rendered":"Tools and Examples"},"content":{"rendered":"<h1 class=\"sectionedit1\">iostat<\/h1>\n<p>This is a standard Unix utility for analysing IO workloads and performance troubleshooting, at the individual drive level. The first iteration provides a cumulative summary over node uptime, following iterations are current data &#8211; real time.<\/p>\n<h2>Common usage<\/h2>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # iostat -zxcn 2 2\r\n\t\tc percent time system in user\/sys\/wait\/ idle times\r\n\t\tn show device names(CTD or WWN) not instance numbers\r\n\t\tx extended disk statistics. Need this for latency statistics\r\n\t\tz suppress lines that are all zeros\r\n\r\n                    extended device statistics\r\n                Lun queue depth --&gt; +++         +++++ &lt;-- target service time (NFS client's perception)\r\n    r\/s    w\/s   kr\/s   kw\/s   wait actv wsvc_t asvc_t  %w  %b device\r\n    0.0   15.0    0.0    33.0  0.0  0.0    2.0     0.6   1   1 c2t0d0\r\n    0.0   13.0    0.0    27.5  0.0  0.0    1.1     0.4   0   0 c2t1d0\r\n  119.0   55.0 15233.9 3817.0  0.0  2.2    0.0    12.8   0  87 c0t5000C5003416C207d0\r\n  119.0   56.0 15233.9 3817.0  0.0  3.3    0.0    18.9   0  88 c0t5000C50034169BA3d0\r\n    0.0  484.1    0.0 32916.2  0.0  1.0    0.0     2.1   1  28 c0tSUN24G1047M04YLQ1047M04YLQd0\r\n    0.0  487.1    0.0 33120.2  0.0  1.0    0.0     2.1   0  29 c0tSUN24G1047M04YLR1047M04YLRd0\r\n          Host queue depth --&gt; +++  ^^^         ^^^^^^\r\n\r\n          With :\r\n          - Host queue depth : Number of transactions waiting to be submitted to a device.\r\n          - Lun queue depth  : Number of transactions submitted to and being processed by the device.\r\n<\/pre>\n<p>The ZFSSA can handle a deep queue of I\/O, so you should not consider 100% busy disks as overloaded. You probably want to look at target service time (<strong>asvc_t<\/strong>) as the NFS client&#8217;s perception of the NFS server&#8217;s response time, and the active queue (<strong>actv<\/strong>) as the number of pending I\/O requests. 
<strong>actv<\/strong> represents the number of threads doing IOs onto the same device.<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\"><strong>r\/s<\/strong> : number of read IOPS<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>w\/s<\/strong> : number of write IOPS<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>Kr\/s<\/strong> : read data transfer rate<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>Kw\/s<\/strong> : write data transfer rate<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>wait<\/strong> : average number of transactions waiting for service or <strong>Host queue depth<\/strong>.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>actv<\/strong> : average number of transactions actively being serviced (removed from the queue but not yet completed) or <strong>Lun queue depth<\/strong>.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>wsvc_t<\/strong> : average service time in the wait queue, in milliseconds<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>asvc_t<\/strong> : target service time (NFS client&#8217;s perception) : 0.8 ms for logzilla, around 20ms for SAS drives. Should only be considered on a disk showing more than 5% activity<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>%w<\/strong> : Percentage of time that the queue is not empty<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>%b<\/strong> : Percentage of time that the disk is processing transactions<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\"><strong>device<\/strong> : the disk device name<\/div>\n<\/li>\n<\/ul>\n<p>%b at 100% does not mean the LUN is really busy. It&#8217;s calculated by the driver. It&#8217;s the percentage of time that there is at least <strong><em class=\"u\">one<\/em><\/strong> request pending from the driver to the hardware. Modern hardware can handle multiple requests simultaneously with tagged commands. Think of a store with 16 checkout lanes. You count the store as \u201cbusy\u201d any time at least one person is in the store. 
You can be 100% busy, but still have some capacity because some requests are being fulfilled simultaneously in the checkout lanes.<br \/>\nIn the overloaded condition, you should see a deep queue to the storage (<strong>high actv<\/strong>) and a long response time (<strong>asvc_t<\/strong>) from the client.<\/p>\n<h2>Interpretation of iostat data<\/h2>\n<p><strong>Typical response times for various types of devices<\/strong><\/p>\n<div class=\"level5\">\n<div class=\"table sectionedit3\">\n<table class=\"inline\">\n<tbody>\n<tr class=\"row0\">\n<th class=\"col0 leftalign\">Device<\/th>\n<th class=\"col1 leftalign\">IOPs<\/th>\n<th class=\"col2 leftalign\">Write Bandwidth<\/th>\n<th class=\"col3 leftalign\">Read Bandwidth<\/th>\n<th class=\"col4\">Read Latency<\/th>\n<\/tr>\n<tr class=\"row1\">\n<td class=\"col0 leftalign\">DRAM<\/td>\n<td class=\"col1 leftalign\">>300K<\/td>\n<td class=\"col2 leftalign\">20.000 <acronym title=\"Megabyte\">MB<\/acronym>\/s<\/td>\n<td class=\"col3 leftalign\">20.000MB\/s<\/td>\n<td class=\"col4 leftalign\">120us<\/td>\n<\/tr>\n<tr class=\"row2\">\n<td class=\"col0 leftalign\">18G logzilla<\/td>\n<td class=\"col1 leftalign\">3000<\/td>\n<td class=\"col2 leftalign\">120 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">n\/a<\/td>\n<td class=\"col4 leftalign\">n\/a<\/td>\n<\/tr>\n<tr class=\"row3\">\n<td class=\"col0\">73G logzilla (gen3)<\/td>\n<td class=\"col1 leftalign\">7000<\/td>\n<td class=\"col2 leftalign\">200 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">n\/a<\/td>\n<td class=\"col4 leftalign\">n\/a<\/td>\n<\/tr>\n<tr class=\"row4\">\n<td class=\"col0\">73G logzilla (gen4)<\/td>\n<td class=\"col1 leftalign\">11000<\/td>\n<td class=\"col2 leftalign\">350 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">n\/a<\/td>\n<td class=\"col4 leftalign\">n\/a<\/td>\n<\/tr>\n<tr class=\"row5\">\n<td class=\"col0 leftalign\">500G readzilla<\/td>\n<td class=\"col1 leftalign\">2500<\/td>\n<td class=\"col2 leftalign\">< 30 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">40 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">1300 us<\/td>\n<\/tr>\n<tr class=\"row6\">\n<td class=\"col0 leftalign\">1.6TB readzilla<\/td>\n<td class=\"col1 leftalign\">21000<\/td>\n<td class=\"col2 leftalign\">120 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">200 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">900 us<\/td>\n<\/tr>\n<tr class=\"row7\">\n<td class=\"col0 leftalign\">3TB 7K HDD<\/td>\n<td class=\"col1 leftalign\">120<\/td>\n<td class=\"col2 leftalign\">175 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">175 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">8600 us<\/td>\n<\/tr>\n<tr class=\"row8\">\n<td class=\"col0 leftalign\">4TB 7K HDD<\/td>\n<td class=\"col1 leftalign\">120<\/td>\n<td class=\"col2 leftalign\">180 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">180 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">8600 us<\/td>\n<\/tr>\n<tr class=\"row9\">\n<td class=\"col0 leftalign\">300G 15K HDD<\/td>\n<td class=\"col1 leftalign\">250<\/td>\n<td class=\"col2 leftalign\">170 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">170 <acronym 
title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">4300 us<\/td>\n<\/tr>\n<tr class=\"row10\">\n<td class=\"col0 leftalign\">300G 10K HDD<\/td>\n<td class=\"col1 leftalign\">160<\/td>\n<td class=\"col2 leftalign\">165 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">170 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">6250 us<\/td>\n<\/tr>\n<tr class=\"row11\">\n<td class=\"col0 leftalign\">900G 10K HDD<\/td>\n<td class=\"col1 leftalign\">160<\/td>\n<td class=\"col2 leftalign\">170 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col3 leftalign\">170 <acronym title=\"Megabyte\">MB<\/acronym>\/sec<\/td>\n<td class=\"col4 leftalign\">6250 us<a id=\"what_does_an_oversubscribed_device_look_like\" name=\"what_does_an_oversubscribed_device_look_like\"><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p><strong>What does an oversubscribed device look like ?<\/strong><\/p>\n<\/div>\n<div class=\"level5\">\n<p>Classic signs of overloaded disk : <strong>asvc_t<\/strong> time very high and <strong>actv<\/strong> &#8211; <strong>wait<\/strong> numbers greater than small integers. Best evidence the disk cannot keep up.<\/p>\n<\/div>\n<p><strong>Overall latency<\/strong><\/p>\n<div class=\"level5\">\n<p>The overall latency is the sum of time waiting in the queue, plus the time it takes for the disks to respond to your requests. If the queue depth is 5 and the disk service time is 10 milliseconds, then the overall latency from the client point of view is 5 * 10ms = 50 milliseconds.<\/p>\n<\/div>\n<h2>How is iostat used<\/h2>\n<p><strong>Quick scan of the data<\/strong><\/p>\n<p>At least two iterations, only looking for extreme high values. Don&#8217;t read every number, just scan for anomalies or something that \u201csticks out\u201d. The iostat data in some bundle may not have captured the slowdown, it is collected over 15 seconds during bundle collection.<\/p>\n<ol>\n<li class=\"level1\">\n<div class=\"li\">First iteration is useful to find the cumulative level of activity since last boot.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Review the r\/s and w\/s columns<\/div>\n<ul>\n<li class=\"level2\">\n<div class=\"li\">Estimate approximate read to write mix, within about 20% is fine<\/div>\n<\/li>\n<li class=\"level2\">\n<div class=\"li\">Estimate typical iops<\/div>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Review asvc_t(avg disk service time) &#8211; is it excessive for the class of drive, then determine if the drives can \u201ckeep up\u201d with the workload.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Review wait and actv columns &#8211; these will show if the drive is \u201cbacking up\u201d<\/div>\n<\/li>\n<\/ol>\n<p><strong>Dead ringer for a disk bottleneck<\/strong><\/p>\n<ol>\n<li class=\"level1\">\n<div class=\"li\">asvc_t is several multiples higher than typical. Note, one iteration may be much higher than the others.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">wait and queues greater than small integer.<\/div>\n<\/li>\n<\/ol>\n<p><strong>Determine the number of drives and timeframes they are oversubscribed<\/strong><\/p>\n<ol>\n<li class=\"level1\">\n<div class=\"li\">If many drives are oversubscribed for much of the time, then we do not have sufficient disk spindles (Typically, but not always). 
If they seem to be oversubscribed, make sure this is not an effect of another problem (like readzillas using too much memory, resulting in ZFS memory pressure and having to hit the spindles instead of caching data).<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">More common, intermittent performance problems where drives keep up most of the time, but start queueing up during the busy workloads.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A \u201cslow drive\u201d or uneven workload (like a resilver in one raidZ2 vdev) will slow the overall system down.<\/div>\n<\/li>\n<\/ol>\n<p><strong>Notes<\/strong><\/p>\n<div class=\"level5\">\n<p>Some engineers like to filter iostat data looking for specifics, like disk service times > 100ms. It may be better to get a global picture, to look at the overall data first, and then filter for specifics. Check if fields have extra digits, or only skim the first significant figures (applies to wait actv, wsvc_t and asvc_t columns typically). For r\/s and w\/s, just do a quick estimation of the busiest lines.<\/p>\n<h2>Real iostat data<\/h2>\n<\/div>\n<h3><strong>case 1<\/strong><\/h3>\n<pre>                    extended device statistics\r\n    r\/s    w\/s   kr\/s   kw\/s wait actv wsvc_t asvc_t  %w  %b device\r\n    0.7   54.9   15.1 3163.1  0.3  0.4    6.1    7.7   4   9 c1t0d0\r\n    2.0   54.8   29.2 3163.1  0.4  0.4    6.4    6.9   5   9 c1t1d0\r\n    0.0    1.2    0.2   22.1  0.0  0.0    0.0    0.1   0   0 c2t5000A72030064181d0\r\n    0.0    0.1    0.3    2.6  0.0  0.0    0.0    2.1   0   0 c2t5000C500438F21A3d0\r\n   17.4   31.4  212.7  551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438ED9AFd0\r\n    0.0    0.1    0.3    2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C50043B6B3A3d0\r\n   17.5   31.7  212.6  552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C500438EC99Fd0\r\n   17.5   31.3  213.6  551.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B68F6Bd0\r\n   17.5   31.7  213.0  552.4  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8C303d0\r\n   17.4   31.5  212.8  550.5  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B692CFd0\r\n   17.5   32.0  212.8  552.8  0.0  0.1    0.0    2.9   0   4 c2t5000C50043B8F4DFd0\r\n    0.0    0.1    0.3    2.3  0.0  0.0    0.0    2.2   0   0 c2t5000C500475428BFd0\r\n   58.8   65.7 1037.7 7223.9  0.2  0.0    1.9    0.4   4   5 c1t2d0\r\n   58.9   65.7 1040.7 7220.9  0.2  0.0    1.9    0.4   4   5 c1t3d0<\/pre>\n<p><strong>&#8211; comments<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">the busy data drives are doing roughly 50 iops, ~3:2 write biased.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">roughly 200KB\/s reads and 500KB\/s writes<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">service times look good and ios are not queueing<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">the system drives are doing more IO than the data drives ~55iops vs 50iops &#8211; might be analytics or logging.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">this customer has a known obsession with analytics<\/div>\n<\/li>\n<\/ul>\n<p><strong>&#8211; questions<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">which drives are readzillas ?<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">which are the logzillas, that are active on this node ?<\/div>\n<\/li>\n<\/ul>\n<h3><strong>case 2<\/strong><\/h3>\n<div class=\"level5\">\n<pre class=\"code\">                   extended device statistics\r\n    r\/s    w\/s   kr\/s    kw\/s wait actv wsvc_t asvc_t  %w  
%b device\r\n    0.0    1.0    0.0     4.0  0.0  0.0    0.1    0.3   0   0 c1t0d0\r\n    0.0    1.0    0.0     4.0  0.0  0.0    0.0    0.3   0   0 c1t1d0\r\n    4.0  306.8  165.5 18869.9  0.0  0.8    0.0    2.7   0  17 c2t5000C500438ED9AFd0\r\n   26.2  374.2 1604.6 18833.7  0.0  3.5    0.0    8.7   0  73 c2t5000C500438EC99Fd0\r\n   22.1  297.8 1349.6 18938.4  0.0  1.5    0.0    4.6   0  35 c2t5000C50043B68F6Bd0\r\n   17.1  385.3 1045.3 18833.8  0.0  2.6    0.0    6.5   0  53 c2t5000C50043B8C303d0\r\n   17.1  267.6  803.8 18298.1  0.0  0.9    0.0    3.0   0  21 c2t5000C50043B692CFd0\r\n   15.1  286.7  849.6 18741.3  0.0  1.2    0.0    4.1   0  39 c2t5000C50043B8F4DFd0\r\n   39.2  287.7 1808.3 19927.5  0.0  1.3    0.0    3.8   0  35 c2t5000C500475416DFd0\r\n   28.2  309.9 1350.1 18873.7  0.0  1.4    0.0    4.0   0  43 c2t5000C50043B6CEEBd0\r\n   13.1  310.9  261.6 18875.2  0.0  1.0    0.0    3.1   0  20 c2t5000C500438ED92Fd0\r\n   10.1  376.3  619.2 21063.4  0.0  1.7    0.0    4.4   0  27 c2t5000C50047541A73d0\r\n   18.1  296.8  924.6 18990.5  0.0  1.2    0.0    3.7   0  33 c2t5000C50043B67B1Fd0\r\n  673.1  210.3 43312.0 24662.3 2.1  0.5    2.4    0.6  43  52 c1t2d0\r\n  599.7  145.9 38466.5 16476.9 2.5  0.5    3.4    0.7  42  50 c1t3d0<\/pre>\n<p><strong>&#8211; comments<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">disks are busy roughly 400 iops.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">asvc_t times are all < 5ms, this is ok.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">no sign of trouble, the drives are busy and keeping up.<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h3><strong>case 3<\/strong><\/h3>\n<div class=\"level5\">\n<p>poor CIFS performance.<\/p>\n<pre class=\"code\">                  extended device statistics\r\n    r\/s    w\/s   kr\/s   kw\/s wait actv wsvc_t asvc_t  %w  %b device\r\n    4.0  143.0  511.9 16351.5 0.6  0.1    3.8    0.5   7   7 c1t2d0\r\n    1.0    0.0  128.0    0.0  0.0  0.0    0.0    0.7   0   0 c1t3d0\r\n    2.0  135.0  256.0 15873.5 0.5  0.1    3.9    0.5   6   6 c1t4d0\r\n   27.0  218.0   62.0 1060.3  0.0  0.9    0.0    3.8   0  20 c3t5000CCA02A272A68d0\r\n   41.0  142.0  524.4  774.9  0.0  0.6    0.0    3.4   0  15 c3t5000CCA02A26BB70d0\r\n   40.0  191.0  160.5 1259.3  0.0  0.9    0.0    3.9   0  24 c3t5000CCA02A256000d0\r\n   72.0  102.0  129.5  509.9  0.0  0.5    0.0    3.0   0  17 c3t5000CCA02A1D6468d0\r\n   38.0  184.0  145.0 1259.8  0.0  0.9    0.0    3.8   0  23 c3t5000CCA02A2002F8d0\r\n   19.0  242.0   79.0 1060.9  0.0  0.8    0.0    3.0   0  20 c3t5000CCA02A264B1Cd0\r\n   78.0   76.0  124.5  511.9  0.0  0.5    0.0    3.3   0  16 c3t5000CCA02A1A6A64d0\r\n   49.0  139.0  540.9  778.4  0.0  0.7    0.0    3.6   0  17 c3t5000CCA02A265FE0d0\r\n   39.0  192.0  154.0 1259.8  0.0  1.0    0.0    4.5   0  25 c3t5000CCA02A266100d0\r\n   23.0  250.0   76.0 1056.9  0.0  1.3    0.0    4.8   0  27 c3t5000CCA02A24C950d0\r\n   43.0  123.0  606.9  774.4  0.0  0.7    0.0    4.2   0  19 c3t5000CCA02A26C588d0\r\n   36.0  176.0  141.0 1265.3  0.0  0.8    0.0    3.6   0  21 c3t5000CCA02A28397Cd0\r\n   37.0  188.0  151.5 1262.4  0.0  0.8    0.0    3.7   0  21 c3t5000CCA02A2655F4d0\r\n   41.0  173.0  635.9  773.4  0.0  0.8    0.0    3.7   0  18 c3t5000CCA02A26A730d0\r\n   42.0  129.0  701.9  775.9  0.0  0.8    0.0    4.7   0  18 c3t5000CCA02A1FC718d0\r\n    0.0    4.0    0.0  204.0  0.0  0.0    0.0    0.2   0   0 c3t5000A7203006A41Cd0\r\n   67.0   99.0  127.0  511.9  0.0  0.8    0.0    4.9   0  20 
c3t5000CCA02A1FC7ACd0<\/pre>\n<p><strong>&#8211; questions<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">what&#8217;s the size of each IO read ?<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">what&#8217;s the size of each IO write ?<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">What could be the zpool layout : mirror or raidz2 ?<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h3><strong>case 4<\/strong><\/h3>\n<div class=\"level5\">\n<pre class=\"code\">                    extended device statistics\r\n    r\/s    w\/s   kr\/s   kw\/s wait actv wsvc_t asvc_t  %w  %b device\r\n    0.4   30.9   21.5  545.5  0.0  0.1    1.4    2.0   1   2 c1t0d0\r\n    0.4   30.8   18.7  545.5  0.0  0.1    1.4    2.1   1   2 c1t1d0\r\n  288.7  259.9 17159.0 23806.6 1.7 0.4    3.1    0.8  29  43 c1t2d0\r\n  288.6  259.9 17159.2 23808.2 1.7 0.4    3.1    0.8  29  43 c1t3d0\r\n   63.3   28.7  364.2  283.5  0.0  1.5    0.0   15.9   0  62 c3t5000C50040908B8Bd0\r\n    7.1    3.1   35.9   28.4  0.0  0.2    0.0   15.0   0   7 c3t5000C50035055FEBd0\r\n   59.3   28.9  284.8  284.0  0.0  1.4    0.0   15.4   0  61 c3t5000C50034F61CEFd0\r\n   59.6   29.0  286.6  284.2  0.0  1.3    0.0   15.2   0  60 c3t5000C50034EF953Fd0\r\n   59.2   29.0  284.3  283.8  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EF2FDFd0\r\n    0.0  431.5    0.0 16373.9 0.0  0.7    0.0    1.5   1  18 c3t5000A7203003D9EDd0\r\n   59.6   29.1  285.8  283.9  0.0  1.3    0.0   14.9   0  60 c3t5000C50034EC779Fd0\r\n   59.6   28.9  286.3  283.6  0.0  1.4    0.0   15.8   0  62 c3t5000C50034F67A13d0\r\n   15.2   11.6  111.1   84.2  0.0  0.3    0.0   11.9   0  16 c3t5000C50034F60A8Fd0\r\n   14.5   15.6   65.4   87.4  0.0  0.4    0.0   13.4   0  18 c3t5000C50034EF52DFd0\r\n   13.8    6.4   68.6  111.4  0.0  7.1    0.0  352.0   0  99 c3t5000C50034F5105Bd0\r\n   15.6   16.6   79.2   97.3  0.0  0.5    0.0   14.1   0  19 c3t5000C50034EFAC7Fd0<\/pre>\n<p><strong>&#8211; comments<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">These drives are only doing around 90 IOPs, yet they are about 60% busy.<\/div>\n<\/li>\n<\/ul>\n<p><strong>&#8211; questions<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">identify readzillas and logzillas<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Do the latency numbers look reasonable<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">anything abnormal stands out ?<\/div>\n<\/li>\n<\/ul>\n<p><strong>&#8211; notes<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">one drive has <strong>asvc_t<\/strong> latencies that are almost 10x slower then the others<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">This drive is not doing very much IO, yet latency is horrible and multiple IOs are active.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">observe the excessive actv counts for this drive<\/div>\n<\/li>\n<\/ul>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # iostat -xncz 5 5 |grep c3t5000C50034F5105Bd0\r\n   r\/s    w\/s   kr\/s   kw\/s   wait actv wsvc_t asvc_t  %w  %b device\r\n   59.4   28.6  286.7  284.0  0.0  1.5    0.0   17.2   0  63 c3t5000C50034F5105Bd0\r\n   21.0    9.2   93.4  304.5  0.0  5.2    0.0  171.6   0 100 c3t5000C50034F5105Bd0\r\n   26.1    0.0  108.7    0.0  0.0  4.7    0.0  180.4   0  99 c3t5000C50034F5105Bd0\r\n   57.9   15.3  309.8  410.4  0.0 10.0    0.0  136.7   0 100 c3t5000C50034F5105Bd0\r\n   46.3    6.8   90.1  204.9  0.0  6.0    0.0  113.9   0 100 
c3t5000C50034F5105Bd0\r\n<\/pre>\n<p><strong>&#8211; solution<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">the slow drive was removed from the array and performance immediately improved.<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h3>zfs_vdev_max_pending<\/h3>\n<div class=\"level5\">\n<p>What&#8217;s wrong here ?<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   SolarisHost # iostat -xnz 1\r\n     r\/s    w\/s   kr\/s   kw\/s  wait actv wsvc_t asvc_t  %w  %b device\r\n   764.5   69.3 97628.6 6848.2  0.0 10.0    0.0   12.0   0 100 c0t600144F0C6133EE90000525FE2120002d0\r\n   632.0    7.1 80783.5  434.2  0.0 10.0    0.0   15.6   0 100 c0t600144F0C6133EE90000525FE2120002d0\r\n<\/pre>\n<p>On Solaris, <strong>zfs_vdev_max_pending<\/strong> controls how many I\/O requests can be pending per <strong>vdev<\/strong>. For example, when you have 100 disks visible from your <acronym title=\"Operating System\">OS<\/acronym> with a zfs:zfs_vdev_max_pending of 2, you have 200 requests outstanding at maximum. When you have 100 disks hidden behind a ZFSSA just showing a single LUN, you will have 2 pending requests at maximum !<\/p>\n<p>Under Solaris 11, the default value is 10. This means that the actv column is maxed out at 10 transactions : no more than 10 transactions can be performed at the same time. We can grow this to 35 if required, as it was under Solaris 10.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    SolarisHost # echo zfs_vdev_max_pending\/W0t35 | mdb -kw\r\n    SolarisHost # echo &quot;set zfs:zfs_vdev_max_pending = 35&quot; &gt;&gt; \/etc\/system\r\n    SolarisHost # iostat -xnz 1\r\n                        extended device statistics\r\n      r\/s    w\/s   kr\/s   kw\/s wait actv wsvc_t asvc_t  %w  %b device\r\n    772.9   28.4 98510.2 2805.7 0.0 18.9    0.0   23.6   0 100 c0t600144F0C6133EE90000525FE2120002d0\r\n    858.8    8.3 109596.5 331.8 0.0 17.8    0.0   20.5   0 100 c0t600144F0C6133EE90000525FE2120002d0\r\n<\/pre>\n<p>Observe that <strong>asvc_t<\/strong> has increased. Do not increase this too much as this is a <em class=\"u\">trade-off<\/em> between <strong>throughput<\/strong> and <strong>latency<\/strong>.<\/p>\n<p>Note : on the <strong>zfs appliance<\/strong>, zfs_vdev_max_pending is still at 10. Do not change it !<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   zs3-2-tvp540-a# echo zfs_vdev_max_pending\/D | mdb -k\r\n   zfs_vdev_max_pending:\r\n   zfs_vdev_max_pending:           10\r\n<\/pre>\n<\/div>\n<h1 class=\"sectionedit4\">Analytics<\/h1>\n<h2 class=\"sectionedit4\">Useful set of analytics<\/h2>\n<div class=\"level4\">\n<p>The full name is provided as well as the cli name. 
Some of them cannot be created easily from the BUI.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nCPU: percent utilization broken down by CPU identifier                : cpu.utilization&#x5B;cpu]\r\nDisk: ZFS DMU operations per second broken down by type of operation  : zpool.dmuops&#x5B;op]\r\nDisk: ZFS DMU operations broken down by DMU object type               : zpool.dmuops&#x5B;dmutype]\r\nNetwork: TCP bytes per second broken down by local service            : tcp.bytes&#x5B;local service]\r\nNetwork: TCP packets per second broken down by local service          : tcp.packets&#x5B;local service]\r\nDisk: Disks broken down by percent utilization                        : io.disks&#x5B;utilization]\r\nDisk: I\/O operations per second broken down by type of operation      : io.ops&#x5B;op]\r\nDisk: I\/O operations per second broken down by latency                : io.ops&#x5B;latency=100000]&#x5B;disk] : 100ms\r\nDisk: I\/O bytes per second broken down by disk                        : io.bytes&#x5B;disk]\r\nMemory: kernel memory broken down by kmem cache                       : kmem.total&#x5B;cache]\r\nMemory: kernel memory in use broken down by kmem cache                : kmem.inuse&#x5B;cache]\r\nMemory: kernel memory lost to fragmentation broken down by kmem cache : kmem.fragmented&#x5B;cache]\r\nMemory: dynamic memory usage broken down by application name          : mem.heap&#x5B;application]\r\nNFSv3 operations per second taking at least 100000 microseconds broken down by type of operation :\r\n                                                                        nfs3.ops&#x5B;latency=100000]&#x5B;op]\r\n<\/pre>\n<p>Some nice analytics introduced in 2013.1<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ndataset-030 active   37.8K      0 repl.bytes&#x5B;peer]\r\ndataset-031 active   37.8K      0 repl.ops&#x5B;dataset]\r\ndataset-032 active   37.8K      0 repl.ops&#x5B;direction]\r\ndataset-033 active   71.1K      0 cpu.spins&#x5B;cpu]\r\ndataset-034 active   36.9K      0 cpu.spins&#x5B;type]\r\ndataset-035 active   33.9K      0 shadow.ops&#x5B;latency]\r\ndataset-036 active   37.8K      0 shadow.ops&#x5B;share]\r\n...\r\n<\/pre>\n<\/div>\n<h2>Examples<\/h2>\n<div class=\"level4\">\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ncli&gt; analytics datasets create repl.bytes&#x5B;peer]\r\n<\/pre>\n<\/div>\n<h1>Memory usage via mdb<\/h1>\n<h2>::memstat<\/h2>\n<div class=\"level4\">\n<p>Useful tool but may return incorrect information. 
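It can be run against the live kernel with mdb, for example (a crash dump can be inspected the same way by opening the unix.N \/ vmcore.N pair instead of using -k) :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # echo ::memstat | mdb -k\r\n<\/pre>\n<p>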
Double check with freemem\/D \u2026<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nPage Summary           Pages             MB  %Tot\r\n------------     ---------  --------  ----\r\nKernel             2998817     11714   72% : Kernel + ARC cache for ZFS metadata (included in crash dump)\r\nZFS File Data      1058644      4135   25% : ARC cache for ZFS data (not included in crash dump)\r\nAnon                123711       483    3% : process's heap space, its stack, and copy-on-write pages.\r\nExec and libs          263         1    0%\r\nPage cache            1438         5    0% : cache used by filesystems other than zfs (ufs, tmpfs, qfs)\r\nFree (cachelist)      2287         8    0% : freed but still containing valid vnode references (structure kept)\r\nFree (freelist)       6898        26    0% : freed and not containing any vnode references (structure not kept)\r\n\r\nTotal              4192058     16375\r\nPhysical           4192056     16375\r\n<\/pre>\n<\/div>\n<p><strong>Things to consider<\/strong><\/p>\n<div class=\"level5\">\n<ul>\n<li class=\"level1\">\n<div class=\"li\">Kernel memory embeds ZFS metadata (DDT, spacemap, L2 ARC headers \u2026)<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A high level of &#8216;Kernel&#8217; memory is not always a problem, especially when the system is a replication target or a resilvering is running.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A small &#8216;ZFS File Data&#8217; value means the ARC data miss rate can be high, a possible performance issue if the IO workload is read-heavy.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A high &#8216;Anon&#8217; value means some userland process may be huge in size (true for ndmp, which is a 64-bit process).<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A high &#8216;Page cache&#8217; value means someone may have written some big files into \/export (ftp can do this !). \/export uses swap, which points to tmpfs (DRAM).<\/div>\n<\/li>\n<\/ul>\n<p>Try to collect ::memstat every minute to check how memory usage changes. 
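<\/p>\n<p>A minimal collection loop could look like this (just a sketch &#8211; adjust the interval and the \/var\/tmp\/memstat.log output file to taste) :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # while true; do date; echo ::memstat | mdb -k; echo freemem\/D | mdb -k; sleep 60; done &gt;&gt; \/var\/tmp\/memstat.log\r\n<\/pre>\n<p>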
Big memory movements is a sign of memory re-allocation.<\/p>\n<\/div>\n<h2>::arc -m<\/h2>\n<div class=\"level4\">\n<p>The dcmd output is slightly different from 2011.1 and 2013.1 :<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">2011.1<\/div>\n<\/li>\n<\/ul>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\np                         =    364385 MB  : target size of MRU\r\nc                         =    388888 MB  : constantly changing number of what the ARC thinks it should be\r\nc_min                     =     16383 MB  : min target cache size, minimum size of the ARC\r\nc_max                     =    523246 MB  : max target cache size, maximum size of the ARC : 98% of Memory\r\nsize                      =    381169 MB  : ARC current size\r\nhdr_size                  = 5061417280    : Header size for ARC, in bytes\r\ndata_size                 = 380546095104  : ARC size for data (payload), in bytes\r\nl2_size                   = 1691643692544 : l2 cache size currently used\r\nl2_hdr_size               = 4942620096    : memory used for l2_headers in bytes, here 4.7GB\r\narc_meta_used             =     54568 MB  : ARC metadata currently used\r\narc_meta_limit            =    523246 MB  : ARC metadata max size (by default arc_meta_limit = arc_c_max)\r\narc_meta_max              =    342784 MB  : max size the ARC metadata has reached\r\n<\/pre>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">2013.1<\/div>\n<\/li>\n<\/ul>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\np                         =      2529 MB  : target size of MRU\r\nc                         =    261101 MB  : constantly changing number of what the ARC thinks it should be\r\nc_min                     =        64 MB  : min target cache size, minimum size of the ARC\r\nc_max                     =    261101 MB  : max target cache size, maximum size of the ARC : 98% of Memory\r\nsize                      =     44213 MB  : ARC current size\r\ndata_size                 =     15015 MB  : ARC size for data (payload)\r\nl2_hdr_size               =        96 MB  : memory used for l2_headers\r\nl2_size                   = 1535512576    : l2 cache size currently used\r\nmeta_used                 =     29198 MB  : ARC metadata currently used\r\nmeta_max                  =     29198 MB  : max size the ARC metadata has reached\r\nmeta_limit                =         0 MB  : ARC metadata max size (0 = no limit)\r\n<\/pre>\n<\/div>\n<p><strong>Example :<\/strong><\/p>\n<div class=\"level5\">\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n::arc -m\r\nl2_size                   = 1018129928704\r\nl2_hdr_size               = 2339951616\r\n\r\n# zpool iostat 1  \/\/ to get SSD usable size\r\n\r\n                   capacity     operations    bandwidth\r\n  pool          alloc   free   read  write   read  write\r\n\r\n  cache             -      -      -      -      -      -\r\n    c1t2d0       477G     8M     58      7   892K   773K\r\n    c1t3d0       477G     8M     59      7   887K   772K\r\n<\/pre>\n<p>Max L2-headers size = SSD size * l2_hdr_size \/ l2_size = (477*2)*1024^3 * 2339951616 \/ 1018129928704 = 2354246416 = 2.354 <acronym title=\"Gigabyte\">GB<\/acronym><\/p>\n<\/div>\n<p><strong>Possible outcome<\/strong><\/p>\n<div class=\"level5\">\n<p>When L2 header size is too high, some performance issues will happen. If DRAM cannot be increased, there are 2 possible outcomes :<\/p>\n<ol>\n<li class=\"level1\">\n<div class=\"li\">remove 1 of the cache devices. 
A maintenance window is needed as the traffic may stop while the cache device is removed by the &#8216;zpool remove cachedevice&#8217; command.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">narrow down any share\/LUNs whose the majority of IOs are writes (analytics can help). From the BUI, select the share and set \u201cCache device usage \u2192 none\u201d.<\/div>\n<\/li>\n<\/ol>\n<\/div>\n<h2>::kmastat<\/h2>\n<div class=\"level4\">\n<p>To get a better picture of ::kmastat output, we need to introduce vmem.<\/p>\n<\/div>\n<p><strong>vmem<\/strong><\/p>\n<div class=\"level5\">\n<p>Vmem stands for virtual memory and is essential for an operating system because it makes several things possible :<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">It helps to isolate tasks from each other by encapsulating them in their private address spaces.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Virtual memory can give tasks the feeling of more memory available than is actually possible.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">By using virtual memory, there might be multiple copies of the same program, linked to the same addresses, running in the system<\/div>\n<\/li>\n<\/ul>\n<p>Solaris&#8217; kernel memory allocator (also known as <strong>slab allocator<\/strong>) allows the kernel to allocate memory dynamically. The <strong>slab allocator<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">relies on two lower\u2212level system services to create slabs: a virtual address allocator to provide kernel virtual addresses, and VM routines to back those addresses with physical pages and establish virtual\u2212to\u2212physical translations.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">allows the clients to create a cache of very frequently used small objects\/structures, so that objects of the same type come from a single page or contiguous pages.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">reduces the risk of internal fragmentation.<\/div>\n<\/li>\n<\/ul>\n<p>It provides interfaces such as <strong>vmem_create(), kmem_cache_create(), kmem_cache_alloc() and kmem_cache_free()<\/strong> :<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">zio_arena = vmem_create(\u201czfs_file_data\u201d,\u2026) : memory allocated in physical or virtual memory: PM or VM (see args)<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">znode_cache = kmem_cache_create(\u201czfs_znode_cache\u201d,\u2026) : memory allocated in physical memory : PM<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">dbuf_cache = kmem_cache_create(\u201cdmu_buf_impl_t\u201d,\u2026) : memory allocated in physical memory : PM<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">zp = kmem_cache_alloc(znode_cache,\u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">kmem_cache_free(znode_cache, zp);<\/div>\n<\/li>\n<\/ul>\n<p><strong><em class=\"u\">The memory returned from the kmem calls is allocated in physical memory.<\/em><\/strong><\/p>\n<p>Slab allocator sources its <em class=\"u\">virtual address<\/em> range from the <strong>vmem<\/strong> backend allocator. Vmem resource map allocator provides the <strong>vmem_create()<\/strong> interface to allow clients to register with the allocator some VA range from where the subsequent allocations should happen. 
As written before, VM routines establish virtual\u2212to\u2212physical translations of pages.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n               vmem_create(const char *name, void *base, size_t size, size_t quantum, vmem_alloc_t *afunc,\r\n               vmem_free_t *ffunc, vmem_t *source, size_t qcache_max, int vmflag)\r\n\r\n               name       : name of the virtual memory segment\r\n               base       : base address of the segment in the 'kernel address space (KAS)';\r\n               size       : how big the segment should be\r\n               quantum    : unit of allocation (if equal to PAGESIZE -&gt; Virtual Memory)\r\n               afunc      : allocator function for the segment\r\n               ffunc      : free function for the segment\r\n               source     : parent vmem arena. if NULL, then it defaults to _kmem_default_\r\n               qcache_max : the maximum allocation size, up to which the slab allocator is allowed to\r\n                            cache such objects\r\n               vmflag     : VM_SLEEP = can block for memory - success guaranteed, VM_NOSLEEP = cannot\r\n                            block for memory - may fail\r\n<\/pre>\n<\/div>\n<p><strong>ZFS data buffers<\/strong><\/p>\n<div class=\"level5\">\n<p>Typical buffer names : zfs_file_data, zfs_file_data_buf, zio_data_buf. ZFS data is never captured in a kernel dump. This is because we do not use the <strong>heap arena<\/strong> but create a new arena in the KAS (Kernel Address Space) : the <strong>zfs_file_data<\/strong> arena.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # startup.c\r\n     segkmem_zio_init(segzio_base, mmu_ptob(segziosize)); \/* create zio area covering new segment *\/\r\n\r\n   # seg_kmem.c\r\n     void segkmem_zio_init(void *zio_mem_base, size_t zio_mem_size) {\r\n        zio_arena = vmem_create(&quot;zfs_file_data&quot;, zio_mem_base, zio_mem_size, PAGESIZE, NULL,\r\n        NULL, NULL, 256 * 1024, VM_SLEEP);\r\n        \/\/ quantum = PAGESIZE : VM                                           ^^^^^^^^\r\n        \/\/ Here, we allocate 'zio_mem_size' bytes for zfs_file_data and round to quantum multiples (ie PAGESIZE)\r\n\r\n        zio_alloc_arena = vmem_create(&quot;zfs_file_data_buf&quot;, NULL, 0, PAGESIZE, segkmem_zio_alloc,\r\n        segkmem_zio_free, zio_arena, 0, VM_SLEEP);\r\n        \/\/ quantum = PAGESIZE but refers to segkmem_zio_alloc : invokes vmem_alloc() to get a virtual address and\r\n        \/\/ then backs it with physical pages : PM\r\n\r\n   # zio.c\r\n     vmem_t * data_alloc_arena = zio_alloc_arena;\r\n              ^^^^^^^^^^^^^^^^\r\n     loop on 'c'\r\n       (void) sprintf(name, &quot;zio_data_buf_%lu&quot;, (ulong_t)size);\r\n       zio_data_buf_cache&#x5B;c] = kmem_cache_create(name, size, align, NULL, NULL, NULL, NULL, data_alloc_arena, cflags\r\n                             | KMC_NOTOUCH); \/\/ PM\r\n     endloop\r\n<\/pre>\n<p>Any <strong>kmem_cache_alloc()<\/strong> calls involving any of the above <strong>zio_data_buf_<size><\/strong> caches will result in returning memory in the range [segzio_base, segzio_base+segziosize).<br \/>\nLet&#8217;s have a look at ::kmastat<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   ::kmastat -m\r\n   cache                        buf    buf    buf    memory     alloc alloc\r\n   name                        size in use  total    in use   succeed  fail   = object caches\r\n   
------------------------- ------ ------ ------ ---------- --------- -----\r\n   kmem_magazine_31             256    441   4095          1M   2429326     0 | &lt;== per\u2212CPU caches added to the slab \r\n   kmem_magazine_47             384   3640   7590          2M    838299     0 |     algorithm to avoid spinlock \r\n   kmem_magazine_63             512    126   2009          1M   1249436     0 |     contention\r\n   ...\r\n   zio_data_buf_69632         69632      4     17          1M    288907     0 |\r\n   zio_data_buf_90112         90112      1     14          1M    908801     0 |\r\n   zio_data_buf_102400       102400      0     11          1M    185426     0 |\r\n   zio_data_buf_114688       114688      0     10          1M    167585     0 |&lt;==============+\r\n   zio_data_buf_118784       118784      0     11          1M    127970     0 |               |\r\n   zio_data_buf_122880       122880      0     10          1M    124515     0 |               |\r\n   zio_data_buf_126976       126976      0     12          1M     85861     0 |               |\r\n   zio_data_buf_131072       131072  32185  32879       4109M  56897448     0 |               |\r\n   ------------------------- ------ ------ ------ ---------- --------- -----                  |\r\n   Total &#x5B;zfs_file_data_buf]                            4135M 168478350     0  &lt;--------------|\r\n                                                                                              |\r\n   vmem                         memory     memory    memory     alloc alloc    vmem arena     |\r\n   name                         in use      total    import   succeed  fail    &lt;= vmem arena  |\r\n   ------------------------- ---------- ----------- ---------- --------- -----                |\r\n   zfs_file_data                  4142M    8388608M        0M   5809290     0  &lt; 8TB of VM &lt;==|==== arena name : VM\r\n       zfs_file_data_buf          4135M       4135M     4135M  10342009     0  &lt; 4GB of PM ---+ &lt;=== subset of \r\n                                                                                                zfs_file_data \r\n                                                                                                arena : PM\r\n<\/pre>\n<\/div>\n<p><strong>ZFS metadata buffers<\/strong><\/p>\n<div class=\"level5\">\n<p>We rely on kmem to create caches such that all the zfs metadata is part of the &#8216;kernel heap&#8217; and hence is catpured as part of the crash dump. Actually, we pass <strong>NULL<\/strong> to <strong>kmem_cache_create()<\/strong> for the <strong>vmem_t<\/strong> type argument. 
This means that <strong>zio_buf_<size><\/strong> will be backed by <strong>kmem_default<\/strong> VMEM arena, like most of the other kernel caches.<\/p>\n<pre class=\"code c\">   <span class=\"sy0\">-<\/span> zio.<span class=\"me1\">c<\/span>\r\n     loop on <span class=\"st0\">'c'<\/span>\r\n        <span class=\"br0\">(<\/span><span class=\"kw4\">void<\/span><span class=\"br0\">)<\/span> sprintf<span class=\"br0\">(<\/span>name<span class=\"sy0\">,<\/span> <span class=\"st0\">\"zio_buf_%lu\"<\/span><span class=\"sy0\">,<\/span> <span class=\"br0\">(<\/span>ulong_t<span class=\"br0\">)<\/span>size<span class=\"br0\">)<\/span><span class=\"sy0\">;<\/span>\r\n        zio_buf_cache<span class=\"br0\">[<\/span>c<span class=\"br0\">]<\/span> <span class=\"sy0\">=<\/span> kmem_cache_create<span class=\"br0\">(<\/span>name<span class=\"sy0\">,<\/span> size<span class=\"sy0\">,<\/span> align<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> cflags<span class=\"br0\">)<\/span><span class=\"sy0\">;<\/span> <span class=\"co1\">\/\/ last but one arg is NULL => kmem_default<\/span><\/pre>\n<p>ZFS caches for metadata :<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">zio.c: zio_buf_<size>; = kmem_cache_create(\u201czio_buf_%lu\u201d,\u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">dbuf.c: dbuf_cache = kmem_cache_create(\u201cdmu_buf_impl_t\u201d,\u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">dnode.c: dnode_cache = kmem_cache_create(\u201cdnode_t\u201d,\u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">sa.c: sa_cache = kmem_cache_create(\u201csa_cache\u201d, \u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">zfs_znode.c: znode_cache = kmem_cache_create(\u201czfs_znode_cache\u201d, \u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">zio.c: zio_cache = kmem_cache_create(\u201czio_cache\u201d, \u2026);<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">\u2026<\/div>\n<\/li>\n<\/ul>\n<p>As these buffers are all allocated thanks to <strong>kmem_cache_create()<\/strong> calls, they correspond to Physical Memory : PM.<br \/>\nLets have a look to ::kmastat<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   ::kmastat -m\r\n   cache                        buf    buf    buf    memory     alloc alloc\r\n   name                        size in use  total    in use   succeed  fail   = object caches\r\n   ------------------------- ------ ------ ------ ---------- --------- -----\r\n   zio_buf_512                  512 2409586 4916584     2400M 552832304     0 ^\r\n   zio_buf_2048                2048  71804  72078        140M  26528668     0 |\r\n   zio_buf_4096                4096   2633   2703         10M  74524377     0 |\r\n   zio_buf_12288              12288   1282   1353         15M  36569461     0 |\r\n   zio_buf_16384              16384 108731 108854       1700M 254805641     0 |\r\n   zio_buf_32768              32768     36     92          2M  22355627     0 |\r\n   zio_buf_40960              40960     40     94          3M  27766298     0 |\r\n   zio_buf_45056              45056      1     32          1M  23590138     0 |\r\n   zio_buf_49152              49152      0     25          1M  12426311     0 |\r\n   zio_buf_53248              53248     12     35          1M   5843836     0 |\r\n   zio_buf_57344              57344     57    111          6M   
2841986     0 |\r\n   zio_buf_61440              61440      1     38          2M   6549109     0 |\r\n   zio_buf_126976            126976      0     13          1M    658728     0 |\r\n   zio_buf_131072            131072   5059   6799        849M 107583446     0 |\r\n   dmu_buf_impl_t               192 2616165 5710460     1115M 860908891     0 |\r\n   dnode_t                      744 2423090 2454573     1743M 412891701     0 |\r\n   sa_cache                      56   8983 151088          8M 438341951     0 |&lt;==================+\r\n   zfs_znode_cache              248   8983  15136          3M 439890993     0 |                   |\r\n   zio_cache                    816  20102  28750         22M 1922767786    0 v                   |\r\n                                                                                                  |\r\n                                                                                                  |\r\n   ------------------------- ------ ------ ------ ---------- --------- -----                      |\r\n   Total &#x5B;kmem_default]                                 9611M 2194965189    0 &lt;-------------------|\r\n                                                                                                  |\r\n   vmem                         memory     memory    memory     alloc    alloc                    |\r\n   name                         in use      total    import    succeed   fail                     |\r\n   ------------------------- ---------- ----------- ---------- --------- -----                    |\r\n   heap                          16815M    8388608M         0M  36378781     0                    |\r\n       kmem_va                   15017M      15017M     15017M   2756890     0 &lt;  15  GB of VM    |\r\n           kmem_default           9611M       9611M      9611M  21280467     0 &lt;  9.6 GB of PM ---+\r\n<\/pre>\n<p>Trying to understand these data, focus only the <strong>kmem_default<\/strong> line which accounts for all the kmem allocations by different caches on the system, then look at individual caches to see how much each consumed.<\/p>\n<\/div>\n<p><strong>Quantum caches<\/strong><\/p>\n<div class=\"level5\">\n<p>The intent of quantum caches is to prevent fragmentation in the vmem arena. <strong>zfs_file_data_<size><\/strong> are quantum caches and are already accounted in <strong>zio_data_buf_<size><\/strong> caches. So do not take them into account. In the code, the quantum caches are created in <strong>vmem_create()<\/strong> itself, depending on whether <strong>qcache_max<\/strong> is passed or not. 
qcache_max is the the maximum allocation size to cache such objects.<\/p>\n<pre class=\"code c\">   zio_arena <span class=\"sy0\">=<\/span> vmem_create<span class=\"br0\">(<\/span><span class=\"st0\">\"zfs_file_data\"<\/span><span class=\"sy0\">,<\/span> zio_mem_base<span class=\"sy0\">,<\/span> zio_mem_size<span class=\"sy0\">,<\/span> PAGESIZE<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> NULL<span class=\"sy0\">,<\/span> <span class=\"nu0\">32<\/span> <span class=\"sy0\">*<\/span> <span class=\"nu0\">1024<\/span><span class=\"sy0\">,<\/span> VM_SLEEP<span class=\"br0\">)<\/span><span class=\"sy0\">;<\/span>\r\n                                                                                                    <span class=\"sy0\">^^^^^^^^^<\/span>\r\n                                                                                                    qcache_max<\/pre>\n<p>Let&#8217;s see the kmastat output :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n::kmasat -m\r\n   cache                        buf    buf    buf    memory     alloc alloc\r\n   name                        size in use  total    in use   succeed  fail   &lt;= object caches\r\n   ------------------------- ------ ------ ------ ---------- --------- -----\r\n   zfs_file_data_4096          4096    163    448          1M   2837290     0\r\n   zfs_file_data_8192          8192     68    208          1M    977461     0\r\n   zfs_file_data_12288        12288     58    160          2M   1589976     0 &gt;&gt; see, the 4KB pattern\r\n   zfs_file_data_16384        16384     12     48          0M    467541     0    (PAGESIZE=4KB on x64)\r\n   zfs_file_data_20480        20480     65    108          2M   1546868     0\r\n   zfs_file_data_24576        24576     13     40          1M    494350     0\r\n   zfs_file_data_28672        28672     31     68          2M    912287     0\r\n   zfs_file_data_32768        32768     16     32          1M    340006     0\r\n<\/pre>\n<\/div>\n<p><strong>How to determine the type of each object cache : PM or VM ?<\/strong><\/p>\n<div class=\"level5\">\n<p>SCAT \u201ckma stat\u201d can be of great help :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n       CAT(vmcore.4\/11X)&gt; kma stat\r\n                                      buf      buf        buf       memory\r\n       cache name                        size    avail      total       in use\r\n       ============================== ======= ======== ========== ============\r\n       kmem_magazine_1                     16    57562      61746      1007616 PK\r\n       kmem_magazine_3                     32    45903      49000      1605632 PK\r\n       ...\r\n       kmem_va_4096                      4096   230223    1466784   6007947264 QU\r\n       kmem_va_8192                      8192   284936     557168   4564320256 QU\r\n       kmem_va_12288                    12288      700       2180     28573696 QU\r\n       kmem_va_16384                    16384   143227     252336   4134273024 QU\r\n       ...\r\n       kmem_alloc_8                         8  1188947    1393310     11345920 PKU\r\n       kmem_alloc_16                       16    56834     136293      2224128 PKU\r\n       kmem_alloc_24                       24   155669     252170      6184960 PKU\r\n       kmem_alloc_32                       32   136623     170750      5595136 PKU\r\n       \u2026\r\n       zfs_file_data_135168            135168        0          0            0 QU\r\n       
zfs_file_data_139264            139264        0          0            0 QU\r\n       \u2026\r\n       zio_cache                          872    27906      27945     25436160 PKU\r\n       \u2026\r\n       zio_buf_131072                  131072        0       6715    880148480 PK\r\n       zio_data_buf_131072             131072     1046     114558  15015346176 PZ\r\n       \u2026\r\n       dnode_t                            752   914325    1511025   1237831680 PKU\r\n       dmu_buf_impl_t                     208  1403215    2140958    461545472 PKU\r\n       ------------------------------ ------- -------- ---------- ------------\r\n       Total &#x5B;hat_memload]                                             1978368 P\r\n       Total &#x5B;kmem_default]                                        10078785536 P\r\n       Total &#x5B;id32]                                                       4096 P\r\n       Total &#x5B;zfs_file_data]                                          13107200\r\n       Total &#x5B;zfs_file_data_buf]                                    4336205824 P\r\n       ------------------------------ ------- -------- ---------- ------------\r\n       arena                                memory       memory       memory\r\n       name                                 in use        total       import\r\n       ------------------------------ ------------ ------------- ------------\r\n       heap                            17632604160 8796093022208            0\r\n          kmem_va                      15747022848   15747022848  15747022848\r\n             kmem_default              10078785536   10078785536  10078785536  P\r\n       zfs_file_data                    4344115200 8796093022208            0\r\n          zfs_file_data_buf             4336205824    4336205824   4336205824  P\r\n\r\n       P - physical memory\r\n       K - kvps&#x5B;KV_KVP] vnode (kernel memory)\r\n       Z - kvps&#x5B;KV_ZVP] vnode (ZFS kernel memory)\r\n       Q - quantum cache (KMC_QCACHE)\r\n       U - enabling KMF_AUDIT will allow &quot;kma users&quot; to work for this cache\r\n       F - cache has failed allocations (will be in red color)\r\n<\/pre>\n<p><strong>For a deep understanding, see :<\/strong><\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">Magazines and Vmem by Jeff Bonwick <a class=\"urlextern\" title=\"http:\/\/www.parrot.org\/sites\/www.parrot.org\/files\/vmem.pdf\" href=\"http:\/\/www.parrot.org\/sites\/www.parrot.org\/files\/vmem.pdf\" rel=\"nofollow\">http:\/\/www.parrot.org\/sites\/www.parrot.org\/files\/vmem.pdf<\/a><\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">Slab Allocator From HelenOS : <a class=\"urlextern\" title=\"http:\/\/www.helenos.org\/doc\/design\/html.chunked\/mm.html#slab\" href=\"http:\/\/www.helenos.org\/doc\/design\/html.chunked\/mm.html#slab\" rel=\"nofollow\">http:\/\/www.helenos.org\/doc\/design\/html.chunked\/mm.html#slab<\/a><\/div>\n<\/li>\n<\/ul>\n<\/div>\n<h2>Paging and freemen rules<\/h2>\n<div class=\"level4\">\n<p>The pageout daemon is visible from prstat. It looks for pages from userland processes than can be pushed back to disk (swap). When the free page list (freelist) falls below a certain level, the page daemon is invoked. The daemon scans memory to identify pages that can be pushed out to disk and reused. 
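Scanner activity is easy to spot from vmstat &#8211; a non-zero <strong>sr<\/strong> (scan rate) column means the page daemon is actively looking for pages to reclaim, for example :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # vmstat 5          &lt;-- watch the 'free' and 'sr' columns\r\n   # vmstat -p 5       &lt;-- paging activity broken down by page type (executable, anonymous, file system)\r\n<\/pre>\n<p>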
The system strategy for page replacement is not to let the system run out of memory<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n       ::memstat\r\n       Page Summary                Pages                MB  %Tot\r\n       ------------     ----------------  ----------------  ----\r\n       Kernel                    3843924             15015    6%\r\n       ZFS File Data              150299               587    0%\r\n       Anon                        97798               382    0%\r\n       Exec and libs                3451                13    0%\r\n       Page cache                   8696                33    0%\r\n       Free (cachelist)             7493                29    0%\r\n       Free (freelist)          62992480            246064   94%\r\n\r\n       Total                    67104141            262125\r\n       Physical                 67104140            262125\r\n\r\n       ::physmem\/D\r\n       physmem:\r\n       physmem:        67104141  &lt;&lt; in pages (4KB on x64 , 8KB on sparc)\r\n<\/pre>\n<p>To prevent from running out of memory, the <acronym title=\"Operating System\">OS<\/acronym> uses the following tunable parameters as thresholds :<\/p>\n<div class=\"table sectionedit7\">\n<table class=\"inline\">\n<tbody>\n<tr class=\"row0\">\n<th class=\"col0 leftalign\">Parameter<\/th>\n<th class=\"col1 leftalign\">Meaning<\/th>\n<th class=\"col2 leftalign\">Value<\/th>\n<th class=\"col3 leftalign\">Comments<\/th>\n<\/tr>\n<tr class=\"row1\">\n<td class=\"col0 leftalign\">lotsfree<\/td>\n<td class=\"col1 leftalign\">Lost of Memory<\/td>\n<td class=\"col2\">MAX(looppages\/64, 1\/2MB)<\/td>\n<td class=\"col3\">if freemem < lotsfree, the <acronym title=\"Operating System\">OS<\/acronym> run the page daemon<\/td>\n<\/tr>\n<tr class=\"row2\">\n<td class=\"col0 leftalign\">desfree<\/td>\n<td class=\"col1 leftalign\">Desired free memory<\/td>\n<td class=\"col2 leftalign\">lotsfree\/2<\/td>\n<td class=\"col3\">if freemem < desfree, the <acronym title=\"Operating System\">OS<\/acronym> starts the soft swapping<\/td>\n<\/tr>\n<tr class=\"row3\">\n<td class=\"col0 leftalign\">minfree<\/td>\n<td class=\"col1\">Minimum acceptable memory level<\/td>\n<td class=\"col2 leftalign\">desfree\/2<\/td>\n<td class=\"col3\">if freemem < minfree, the system biases allocations toward allocations necessary to successfully complete pageout operations or to swap processes completely out of memory<\/td>\n<\/tr>\n<tr class=\"row4\">\n<td class=\"col0 leftalign\">throttlefree<\/td>\n<td class=\"col1\">Same as minfree, could be tuned<\/td>\n<td class=\"col2 leftalign\">minfree<\/td>\n<td class=\"col3\">if freemem < throttlefree, the <acronym title=\"Operating System\">OS<\/acronym> cuts off user and non urgent kernel requests<\/td>\n<\/tr>\n<tr class=\"row5\">\n<td class=\"col0\">pageout_reserve<\/td>\n<td class=\"col1 leftalign\">Page daemon and swapper only<\/td>\n<td class=\"col2 leftalign\">throttlefree\/2<\/td>\n<td class=\"col3\">if freemem < pageout_reserve, the <acronym title=\"Operating System\">OS<\/acronym> fails all requests except for the page daemon and the swapper<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Beyond the pageout daemon, ZFS uses some strategy to reduce the ARC size and avoid memory starvation :<br \/>\nIn the arc.c code, the <em>arc_reclaim_thread()<\/em> checks if <strong>freemem < lotsfree+desfree<\/strong>. 
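The same comparison can be made by hand with mdb (all three values are in pages) :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   # echo freemem\/D | mdb -k\r\n   # echo lotsfree\/D | mdb -k\r\n   # echo desfree\/D | mdb -k\r\n<\/pre>\n<p>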
<p>Beyond the pageout daemon, ZFS uses its own strategy to reduce the ARC size and avoid memory starvation :<br \/>\nIn the arc.c code, the <em>arc_reclaim_thread()<\/em> checks if <strong>freemem < lotsfree+desfree<\/strong>. If <strong>TRUE<\/strong>, it calls <em>kmem_cache_reap_now()<\/em> to walk the various kernel memory caches and free up unused space, putting it back on the cachelist (the cache structures are kept so that later allocations are faster). This is a single-threaded task and may lead to performance issues, as threads trying to allocate memory may have to wait, for example :<\/p>\n<pre class=\"code\">::stacks -m zfs\r\n   THREAD           STATE    SOBJ                COUNT\r\n   ffffff86b38ab440 SLEEP    CV                     36\r\n                    swtch+0x150\r\n                    cv_wait+0x61\r\n                    vmem_xalloc+0x63f\r\n                    vmem_alloc+0x161\r\n                    segkmem_xalloc+0x90\r\n                    segkmem_alloc_vn+0xcd\r\n                    segkmem_zio_alloc+0x24\r\n                    vmem_xalloc+0x550\r\n                    vmem_alloc+0x161\r\n                    kmem_slab_create+0x81\r\n                    kmem_slab_alloc+0x5b\r\n                    kmem_cache_alloc+0x1fa\r\n                    zio_data_buf_alloc+0x2c\r\n                    arc_get_data_buf+0x18b\r\n                    arc_buf_alloc+0xa2\r\n                    arc_read_nolock+0x12f\r\n                    arc_read+0x79\r\n                    dsl_read+0x33\r\n                    dbuf_read_impl+0x17e\r\n                    dbuf_read+0xfd\r\n                    dmu_tx_check_ioerr+0x6b\r\n                    dmu_tx_count_write+0x175\r\n
                    dmu_tx_hold_write+0x5b\r\n                    zfs_write+0x655\r\n<\/pre>\n<p>For ARC memory allocation changes, the sequence is as follows :<\/p>\n<pre class=\"code\">\r\n       arc_reclaim_thread() { \/\/ The arc_reclaim_thread() wakes up every second and will attempt to reduce  \r\n                              \/\/ the size of the ARC to the target size (arc_c).\r\n         if (arc_reclaim_needed() == TRUE) {  \/\/ freemem < lotsfree+desfree ?\r\n             arc_kmem_reap_now() {\r\n                kmem_cache_reap_now(zio_buf_cache[i]);\r\n                kmem_cache_reap_now(zio_data_buf_cache[i]);\r\n                kmem_cache_reap_now(buf_cache);\r\n                kmem_cache_reap_now(hdr_cache);\r\n             }\r\n          }\r\n          if (arc_size > arc_c) {\r\n             arc_adjust(); \/\/  resize the ARC lists (MRU\/MFU\/Ghost lists).\r\n          }\r\n       }\r\n<\/pre>\n<p>For details, see :<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\"><a class=\"urlextern\" title=\"http:\/\/dtrace.org\/blogs\/brendan\/2012\/01\/09\/activity-of-the-zfs-arc\/\" href=\"http:\/\/dtrace.org\/blogs\/brendan\/2012\/01\/09\/activity-of-the-zfs-arc\/\" rel=\"nofollow\">http:\/\/dtrace.org\/blogs\/brendan\/2012\/01\/09\/activity-of-the-zfs-arc\/<\/a><\/div>\n<\/li>\n<\/ul>\n<p>The kmem_cache reaper activity can be observed with the kmem.fragmented[cache] analytic.<\/p>\n<h1>DTrace<\/h1>\n<\/div>\n<h2 class=\"level3\">CPU usage<\/h2>\n<p class=\"level3\"><strong>Display the busy threads which are not in the idle loop (curthread->t_pri != -1)<\/strong><\/p>\n<div class=\"level5\">\n<pre class=\"code\">   # dtrace -n 'profile-997hz \/arg0 &amp;&amp; curthread->t_pri != -1 &amp;&amp; cpu==58 \/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }'\r\n   zfs`zio_compress_data+0x41\r\n   zfs`zio_write_bp_init+0x26d\r\n   zfs`zio_execute+0x8d\r\n   genunix`taskq_thread+0x22e\r\n   unix`thread_start+0x8\r\n   25898<\/pre>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">arg0 != 0 : to check if the thread is running in the kernel<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">curthread->t_pri != -1 : to ensure that the CPU is not executing the kernel idle loop<\/div>\n<\/li>\n<\/ul>\n<\/div>\n<p><strong>Determine why threads are being switched off-CPU by looking at user stacks and kernel stacks when the sched:::off-cpu probe fires<\/strong><\/p>\n<div class=\"level5\">\n<p>Displays threads which are switched off-CPU.<\/p>\n<pre class=\"code\">   # dtrace -qn 'sched:::off-cpu \/execname != \"sched\"\/ { @[execname, stack()] = count(); } END { trunc(@,5); }'\r\n   zpool-pool-raid\r\n      unix`swtch+0x146\r\n      genunix`cv_wait+0x60\r\n      genunix`taskq_thread_wait+0x86\r\n      genunix`taskq_thread+0x2a4\r\n      unix`thread_start+0x8\r\n   51328<\/pre>\n<p>It looks like we are waiting for some tasks to complete in one of the taskq worker threads. We should dig deeper with the taskq.d trace script (see the DTrace book by Brendan Gregg).<\/p>\n
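<p>As a quick first look before running the full taskq.d script, a hypothetical one-liner like the one below simply counts dispatches per task queue name. It is only a sketch and does not measure queuing or execution latency the way taskq.d does :<\/p>\n<pre class=\"code\">   # dtrace -n 'fbt::taskq_dispatch:entry { @[stringof(args[0]->tq_name)] = count(); }'\r\n       run it for a few seconds then Ctrl-C : the busiest task queues are printed last<\/pre>\n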
<pre class=\"code\">   # dtrace -n 'sched:::off-cpu \/execname != \"sched\"\/ { self->start = timestamp; }\r\n                sched:::on-cpu \/self->start\/ { this->delta = (timestamp - self->start ) >> 20 ;\r\n                @[\"off-cpu (ms):\", stack()] = quantize(this->delta); self->start = 0; }'\r\n\r\n   off-cpu (ms):\r\n      genunix`cv_wait+0x60\r\n      zfs`zio_wait+0x5c\r\n      zfs`dmu_buf_hold_array_by_dnode+0x238\r\n      zfs`dmu_buf_hold_array+0x6f\r\n      zfs`dmu_read_uio+0x4b\r\n      zfs`zfs_read+0x267\r\n      genunix`fop_read+0xa9\r\n      genunix`vn_rdwr+0x193\r\n      unix`mmapobj_map_interpret+0x20f\r\n      unix`mmapobj+0x7a\r\n      genunix`mmapobjsys+0x1d5\r\n      unix`sys_syscall+0x17a\r\n\r\n             value  ------------- Distribution ------------- count\r\n                -1 |                                         0\r\n                 0 |@@@@@@@@@@@@@@@                          3\r\n                 1 |                                         0\r\n                 2 |@@@@@@@@@@@@@@@                          3\r\n                 4 |@@@@@@@@@@                               2\r\n                 8 |                                         0<\/pre>\n<\/div>\n<h2>Flamegraph<\/h2>\n<div class=\"level4\">\n<p>Brilliant tool from Brendan Gregg, available here : <a class=\"urlextern\" title=\"http:\/\/www.brendangregg.com\/flamegraphs.html\" href=\"http:\/\/www.brendangregg.com\/flamegraphs.html\" rel=\"nofollow\">http:\/\/www.brendangregg.com\/flamegraphs.html<\/a><br \/>\nSteps :<\/p>\n<ol>\n<li class=\"level1\">\n<div class=\"li\">download<\/div>\n<ul>\n<li class=\"level2\">\n<div class=\"li\"><a class=\"urlextern\" title=\"https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/stackcollapse.pl\" href=\"https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/stackcollapse.pl\" rel=\"nofollow\">https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/stackcollapse.pl<\/a><\/div>\n<\/li>\n<li class=\"level2\">\n<div class=\"li\"><a class=\"urlextern\" title=\"https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/flamegraph.pl\" href=\"https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/flamegraph.pl\" rel=\"nofollow\">https:\/\/github.com\/brendangregg\/FlameGraph\/blob\/master\/flamegraph.pl<\/a><\/div>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">collect cpu usage with dtrace<\/div>\n<ul>\n<li class=\"level2\">\n<div class=\"li\"># dtrace -n 'profile-997hz \/arg0 &amp;&amp; curthread->t_pri != -1 \/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }' > sample.txt<\/div>\n<\/li>\n<\/ul>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">create an svg file<\/div>\n<ul>\n<li class=\"level2\">\n<div class=\"li\"># stackcollapse.pl sample.txt | flamegraph.pl > example.svg<\/div>\n<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<p>This produces an interactive (mouse-over) picture showing the time spent in each function. The X axis shows the percentage of samples (time) attributed to each function during the capture, and the Y axis shows the call stacks and their depth. The three steps are consolidated in the sketch below.<\/p>\n
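<p>The same three steps, consolidated into one copy-and-paste block (the 60 second window and the file names are only illustrative) :<\/p>\n<pre class=\"code\">   # chmod +x stackcollapse.pl flamegraph.pl\r\n   # dtrace -n 'profile-997hz \/arg0 &amp;&amp; curthread->t_pri != -1 \/ { @[stack()] = count(); } tick-60sec { exit(0); }' > sample.txt\r\n   # .\/stackcollapse.pl sample.txt | .\/flamegraph.pl > example.svg\r\n       open example.svg in a browser and hover over the frames<\/pre>\n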
<\/div>\n<h2>spa_sync activity<\/h2>\n<div class=\"level4\">\n<p>spa_sync activity is the time needed to flush data from DRAM to stable storage (i.e. the disks).<\/p>\n<\/div>\n<p><strong>Examples<\/strong><\/p>\n<div class=\"level5\">\n<ul>\n<li class=\"level1\">\n<div class=\"li\">bad spa_sync time<\/div>\n<\/li>\n<\/ul>\n<pre class=\"code\">   # dtrace -n 'spa_sync:entry\/args[0]->spa_name==\"pool-b232\" \/{self->t=timestamp} spa_sync:return\/self->t\/{printf(\"spa_sync=%d msec\", (timestamp-self->t)>>20); self->t=0}'\r\n      CPU     ID                    FUNCTION:NAME\r\n       63  44795                  spa_sync:return spa_sync=5802 msec\r\n       36  44795                  spa_sync:return spa_sync=4617 msec\r\n       63  44795                  spa_sync:return spa_sync=4875 msec\r\n       22  44795                  spa_sync:return spa_sync=6088 msec\r\n       31  44795                  spa_sync:return spa_sync=1108 msec<\/pre>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">good spa_sync time<\/div>\n<\/li>\n<\/ul>\n<pre class=\"code\">   # dtrace -n 'spa_sync:entry\/args[0]->spa_name==\"pool-b232\" \/{self->t=timestamp} spa_sync:return\/self->t\/{printf(\"spa_sync=%d msec\", (timestamp-self->t)>>20); self->t=0}'\r\n      CPU     ID                    FUNCTION:NAME\r\n       55  44795                  spa_sync:return spa_sync=608 msec\r\n       42  44795                  spa_sync:return spa_sync=930 msec\r\n       41  44795                  spa_sync:return spa_sync=664 msec\r\n       31  44795                  spa_sync:return spa_sync=624 msec\r\n       27  44795                  spa_sync:return spa_sync=948 msec<\/pre>\n<\/div>\n<h2>arcreap.d<\/h2>\n<div class=\"level4\">\n<p>From Brendan Gregg : it measures how long we spend in arc_kmem_reap_now and arc_adjust.<\/p>\n<pre class=\"code\">   # cat arcreap.d\r\n   #!\/usr\/sbin\/dtrace -s\r\n\r\n   fbt::arc_kmem_reap_now:entry,\r\n   fbt::arc_adjust:entry\r\n   {\r\n      self->start[probefunc] = timestamp;\r\n   }\r\n\r\n   fbt::arc_shrink:entry\r\n   {\r\n      trace(\"called\");\r\n   }\r\n\r\n   fbt::arc_kmem_reap_now:return,\r\n   fbt::arc_adjust:return\r\n   \/self->start[probefunc]\/\r\n   {\r\n      printf(\"%Y %d ms\", walltimestamp,(timestamp - self->start[probefunc]) \/ 1000000);\r\n      self->start[probefunc] = 0;\r\n   }\r\n\r\n   # .\/arcreap.d\r\n   dtrace: script '.\/arcreap.d' matched 5 probes\r\n   CPU     ID                    FUNCTION:NAME\r\n        0  64929                 arc_shrink:entry   called\r\n        0  62414                arc_adjust:return 2012 Jan  9 23:10:01 18 ms\r\n        9  62420         arc_kmem_reap_now:return 2012 Jan  9 23:10:03 1511 ms\r\n        0  62414                arc_adjust:return 2012 Jan  9 23:10:24 0 ms\r\n        6  62414                arc_adjust:return 2012 Jan  9 23:10:49 0 ms<\/pre>\n<\/div>\n<h1>snoop and tcpdump<\/h1>\n<div class=\"level3\">\n<p>Snoop can be used to measure latency and throughput. Since 2011.1.9 and 2013.1, tcpdump is available as well.<\/p>\n<pre class=\"code\">\r\n   # snoop -q -d aggr2001001 -s 200 -o <filename> port 2049 or port 111 or port 4045\r\n   port 2049 (nfs)\r\n   port 111  (rpcbind)\r\n   port 4045 (nfs-locking)<\/pre>\n<p>There is a workflow as well : snoop-all.akwf (see the attachments section)<\/p>\n
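<p>The resulting capture file can be read back with snoop itself before being fed to an analysis tool such as tcptrace. A minimal sketch :<\/p>\n<pre class=\"code\">   # snoop -i <filename> | head\r\n       -i : read packets back from the capture file instead of the network\r\n   # snoop -i <filename> -V\r\n       -V : verbose summary mode, one line per protocol layer<\/pre>\n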
<p>In some cases, <strong>tcpdump<\/strong> can be used. It seems able to capture more detail than snoop while being a less memory-intensive consumer.<\/p>\n<pre class=\"code\">   # tcpdump -c 200000 -i aggr1 -w \/var\/ak\/dropbox\/cpdump.out<\/pre>\n<\/div>\n<h1 class=\"sectionedit10\">tcptrace<\/h1>\n<div class=\"level3\">\n<p>This command is available from cores3 : \/sw\/es-tools\/bin\/tcptrace<\/p>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">overall latency<\/div>\n<\/li>\n<\/ul>\n<pre class=\"code\">   # tcptrace -lr  port_216_snoop | egrep \"RTT max\"\r\n   dq->dr:                                dr->dq:\r\n   RTT max:               120.0 ms        RTT max:               121.7 ms\r\n   RTT max:               118.4 ms        RTT max:               120.0 ms\r\n   RTT max:               120.0 ms        RTT max:               121.0 ms<\/pre>\n<ul>\n<li class=\"level1\">\n<div class=\"li\">throughput<\/div>\n<\/li>\n<\/ul>\n<pre class=\"code\">\r\n   # tcptrace -lr  port_216_snoop | grep throughput\r\n   dq->dr:                                dr->dq:\r\n   throughput:              336 Bps       throughput:              273 Bps\r\n   throughput:               36 Bps       throughput:               33 Bps\r\n   throughput:              198 Bps       throughput:             1492 Bps<\/pre>\n<\/div>\n<h1 class=\"secedit editbutton_section editbutton_10\">filebench<\/h1>\n<div class=\"level3\">\n<p>Customers love to use <strong>dd<\/strong> or <strong>tar<\/strong> as a performance tool. In many cases this does not represent the real workload, hence we need a tool that simulates a genuine workload : <strong>filebench<\/strong><br \/>\nSee <a class=\"urlextern\" title=\"http:\/\/sourceforge.net\/apps\/mediawiki\/filebench\/index.php?title=Filebench\" href=\"http:\/\/sourceforge.net\/apps\/mediawiki\/filebench\/index.php?title=Filebench\" rel=\"nofollow\">http:\/\/sourceforge.net\/apps\/mediawiki\/filebench\/index.php?title=Filebench<\/a><br \/>\nIt is available for Linux and Solaris.<\/p>\n<pre class=\"code\">   # .\/go_filebench\r\n\r\n   statfile1                1051ops\/s   0.0mb\/s      0.5ms\/op      385us\/op-cpu\r\n   deletefile1              1051ops\/s   0.0mb\/s      4.1ms\/op      880us\/op-cpu\r\n   closefile3               1051ops\/s   0.0mb\/s      0.0ms\/op       12us\/op-cpu\r\n   readfile1                1051ops\/s 142.1mb\/s      1.3ms\/op      819us\/op-cpu\r\n   openfile2                1051ops\/s   0.0mb\/s      0.5ms\/op      410us\/op-cpu\r\n   closefile2               1051ops\/s   0.0mb\/s      0.0ms\/op       12us\/op-cpu\r\n   appendfilerand1          1051ops\/s   8.3mb\/s      2.4ms\/op      446us\/op-cpu\r\n   openfile1                1052ops\/s   0.0mb\/s      0.6ms\/op      420us\/op-cpu\r\n   closefile1               1052ops\/s   0.0mb\/s      0.0ms\/op       13us\/op-cpu\r\n   wrtfile1                 1052ops\/s 136.1mb\/s     30.5ms\/op     3910us\/op-cpu\r\n   createfile1              1052ops\/s   0.0mb\/s      4.6ms\/op     1493us\/op-cpu\r\n\r\n   2492: 109.065:\r\n   IO Summary:      694029 ops, 11566.6 ops\/s, (1051\/2103 r\/w) 286.5mb\/s,   3379us cpu\/op,  14.8ms latency\r\n                                                               ^^^^^^^^^ MB\/s<\/pre>\n
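<p>A typical interactive run with one of the canned personalities looks like the sketch below. The workload name, target directory and run time are illustrative, and the available personalities depend on the filebench version installed :<\/p>\n<pre class=\"code\">   # .\/go_filebench\r\n   filebench> load fileserver\r\n   filebench> set $dir=\/export\/fbtest\r\n   filebench> run 60\r\n       per-operation statistics and an IO Summary line, like the ones above, are printed at the end of the run<\/pre>\n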
<h1>kernel dump<\/h1>\n<\/div>\n<div class=\"level3\">\n<p>This should be a last resort. <strong>Remember this<\/strong> : when looking at a photo of a racing cyclist, it is never easy to tell how fast he was going ! A kernel dump is the same kind of still picture. It can nevertheless be helpful if the system is hung or if the issue is too short-lived to run dtrace scripts.<br \/>\nHow to force an NMI :<\/p>\n<pre class=\"code\">   *** ILOM 2.x ***\r\n   -> cd \/SP\/diag\r\n   \/SP\/diag\r\n   -> set generate_host_nmi=true\r\n   Set 'generate_host_nmi' to 'true'\r\n\r\n   *** ILOM 3.x ***\r\n   -> cd \/HOST\/\r\n   -> set generate_host_nmi=true\r\n\r\n   The console session should report something similar to the following:\r\n   -> start \/SP\/console\r\n   panic[cpu2]\/thread=ffffff001eccbc60: NMI received\r\n   ffffff001eccbac0 pcplusmp:apic_nmi_intr+7c ()\r\n   ffffff001eccbaf0 unix:av_dispatch_nmivect+30 ()\r\n   ffffff001eccbb00 unix:nmiint+154 ()\r\n   ffffff001eccbbf0 unix:mach_cpu_idle+b ()\r\n   ffffff001eccbc20 unix:cpu_idle+c2 ()\r\n   ffffff001eccbc40 unix:idle+114 ()\r\n   ffffff001eccbc50 unix:thread_start+8 ()\r\n   syncing file systems... done\r\n   dumping to \/dev\/zvol\/dsk\/system\/dump, offset 65536, content: kernel + curproc\r\n   100% done: 356267 pages dumped, compression ratio 3.84, dump succeeded\r\n\r\n   The real dump file is created after reboot. This may take some time, so wait before collecting a supportfile bundle.<\/pre>\n<p>Useful commands to run :<\/p>\n<\/div>\n<h2>::memstat<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::memstat\r\n   Page Summary                Pages                MB  %Tot\r\n   ------------     ----------------  ----------------  ----\r\n   Kernel                    2417967              9445   38%\r\n   ZFS File Data             3686849             14401   59%\r\n   Anon                       169828               663    3%\r\n   Exec and libs                   6                 0    0%\r\n   Page cache                    138                 0    0%\r\n   Free (cachelist)             9806                38    0%\r\n   Free (freelist)              2534                 9    0%\r\n\r\n    Total                     6287128             24559\r\n   Physical                  6287127             24559<\/pre>\n<\/div>\n<h2>::kmastat<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::kmastat -g ! 
grep zio_cache\r\n   cache                        buf    buf    buf    memory     alloc alloc\r\n   name                        size in use  total    in use   succeed  fail\r\n   ------------------------- ------ ------ ------ ---------- --------- -----\r\n   zio_cache                    816      2 137450       107M 2960373347    0   2 buffers used only among 137450 buf allocated\r\n                                                                               memory in use : 137450 x 816 = 107MB<\/pre>\n<p>A mem fragmentation evidence is when &#8216;memory in use&#8217; is high (physical memory from slab allocation) but &#8216;buf in use&#8217; * &#8216;buf size&#8217; is low.<\/p>\n<\/div>\n<h2>::stacks<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::stacks\r\n   ffffff0030e56c40 SLEEP    MUTEX                 187\r\n                    swtch+0x146\r\n                    turnstile_block+0x6ff\r\n                    mutex_vector_enter+0x25e\r\n                    zfs_zget+0x46\r\n                    zfs_vget+0x1f5\r\n                    fsop_vget+0x66\r\n                    nfs3_fhtovp+0x47\r\n                    rfs3_write+0x4a\r\n                    common_dispatch+0x649\r\n                    rfs_dispatch+0x2d\r\n                    svc_process_request+0x184\r\n                    stp_dispatch+0xb9\r\n                    taskq_d_thread+0xc1\r\n                    thread_start+8\r\n\r\n   ::stacks -m zfs\r\n   ::stacks -m nfs\r\n   ::stacks -m smbsrv\r\n   ::stacks -m iscsit\r\n   ::stacks -m pmcs\r\n   ::stacks -m sata\r\n   ::stacks -m mpt\r\n   ::stacks -c vmem_xalloc\r\n\r\n   ::stacks -c mutex_enter\r\n   THREAD           STATE    SOBJ                COUNT\r\n   ffffff00318fbc40 RUN      NONE                  1\r\n                    swtch+0x146\r\n                    preempt+0xca\r\n                    kpreempt+0x89\r\n                    sys_rtt_common+0x13b\r\n                    _sys_rtt_ints_disabled+8\r\n                    mutex_enter+0x10\r\n                    segkmem_page_create+0x92\r\n                    segkmem_xalloc+0xc6\r\n                    segkmem_alloc_vn+0xcb\r\n                    segkmem_alloc+0x24\r\n                    vmem_xalloc+0x502\r\n                    vmem_alloc+0x172\r\n                    kmem_slab_create+0x81\r\n                    kmem_slab_alloc+0x60\r\n                    kmem_cache_alloc+0x267\r\n                    zio_buf_alloc+0x29\r\n                    zil_alloc_lwb+0x4a\r\n                    zil_get_current_lwb+0x281\r\n                    zil_get_commit_list+0x185\r\n                    zil_get_train+0x94\r\n                    zil_commit+0xc0\r\n                    zfs_fsync+0xcc\r\n                    fop_fsync+0x5f\r\n                    rfs3_commit+0x173\r\n                    common_dispatch+0x539\r\n                    rfs_dispatch+0x2d\r\n                    svc_process_request+0x184\r\n                    stp_dispatch+0xb9\r\n                    taskq_d_thread+0xc1\r\n                    thread_start+8\r\n\r\n       ::stacks -c spa_sync\r\n       THREAD           STATE    SOBJ                COUNT\r\n       ffffff0030d2ac40 SLEEP    MUTEX                   1\r\n                    swtch+0x146\r\n                    turnstile_block+0x6ff\r\n                    mutex_vector_enter+0x25e\r\n                    zil_sync+0x5f\r\n                    dmu_objset_sync+0x241\r\n                    dsl_dataset_sync+0x63\r\n                    dsl_pool_sync+0x18d\r\n                    spa_sync+0x395\r\n                    
txg_sync_thread+0x244\r\n                    thread_start+8\r\n\r\n       ::stacks -c kmem_cache_reap_now\r\n       THREAD           STATE    SOBJ                COUNT\r\n       ffffff002e173c40 SLEEP    CV                      1\r\n                    swtch+0x146\r\n                    cv_wait+0x60\r\n                    taskq_wait+0x4b\r\n                    kmem_cache_reap_now+0x4e\r\n                    arc_buf_cache_reap_now+0x38\r\n                    arc_reaper_thread+0xb0\r\n                    thread_start+8<\/pre>\n<\/div>\n<h2>::findstack<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">ffffff002e173c40::findstack -v\r\n   stack pointer for thread ffffff002e173c40: ffffff002e173ac0\r\n   [ ffffff002e173ac0 _resume_from_idle+0xf5() ]\r\n     ffffff002e173af0 swtch+0x146()\r\n     ffffff002e173b20 cv_wait+0x60(fffff6000df96b5a, fffff6000df96b48)\r\n     ffffff002e173b60 taskq_wait+0x4b(fffff6000df96b28)\r\n     ffffff002e173b80 kmem_cache_reap_now+0x4e(fffff60005d4a008)\r\n     ffffff002e173bc0 arc_buf_cache_reap_now+0x38()\r\n     ffffff002e173c20 arc_reaper_thread+0xb0()\r\n     ffffff002e173c30 thread_start+8()<\/pre>\n<\/div>\n<h2>::cpuinfo<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::cpuinfo\r\n     ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC\r\n     0 fffffffffbc3ef20  1f   11    0 100   no    no t-1297 fffff6002767fae0 akd\r\n     1 fffff6000dd0eac0  1f   13    0 100   no    no t-1298 fffff6002dfb9840 akd  ticks : 12s. CPU1 has been \r\n     2 fffff6000dd0c080  1f   12    0 100   no    no t-1296 fffff6002dc3d7c0 akd          running a thread for 12s\r\n     3 fffff6000dd0a000  1f   10    0 100   no    no t-1296 fffff6002de034e0 akd\r\n     4 fffff6000dd06b00  1f    8    0 100   no    no t-1296 fffff6002dc3c440 akd\r\n     5 fffffffffbc49110  1b   28    0 100   no    no t-1296 fffff60026a03160 akd\r\n     6 fffff6000dd02000  1f   13    0 100   no    no t-1298 fffff60029292b20 akd\r\n     7 fffff6000dfbcac0  1f    9    0 100   no    no t-1296 fffff6002df3c000 akd<\/pre>\n<\/div>\n<h2>::taskq<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">fffff6000df96b28::taskq -T\r\n   ADDR             NAME                             ACT\/THDS Q'ED  MAXQ INST\r\n   fffff6000df96b28 kmem_taskq                         1\/   1  570   935    -\r\n       THREAD           STATE    SOBJ                COUNT\r\n       ffffff002ed96c40 RUN      NONE                  1\r\n                        swtch+0x146\r\n                        preempt+0xca\r\n                        kpreempt+0x89\r\n                        sys_rtt_common+0x13b\r\n                        _sys_rtt_ints_disabled+8\r\n                        kmem_partial_slab_cmp+0x42\r\n                        avl_find+0x56\r\n                        avl_add+0x2c\r\n                        avl_update_gt+0x56\r\n                        kmem_slab_free+0x1b9\r\n                        kmem_magazine_destroy+0xdf\r\n                        kmem_depot_ws_reap+0x77\r\n                        kmem_cache_reap+0x76\r\n                        taskq_thread+0x22e\r\n                        thread_start+8<\/pre>\n<\/div>\n<h2>::spa<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::spa\r\n     ADDR                 STATE NAME\r\n     fffff60039b60500    ACTIVE pool-0\r\n     fffff60009817580    ACTIVE system\r\n\r\n   fffff60039b60500::print spa_t spa_uberblock.ub_timestamp\r\n     spa_uberblock.ub_timestamp = 0x52976059  : in sec since epoch\r\n\r\n   time\/J\r\n     time:\r\n     time:           5297619e  : in 
sec since epoch\r\n\r\n   5297619e-52976059=D\r\n     325 sec  : This is the last time a TXG has been done : quite long !\r\n\r\n   0x52976059=Y\r\n                2013 Nov 28 15:25:13<\/pre>\n<\/div>\n<h2>::zio_state<\/h2>\n<div class=\"level4\">\n<pre class=\"code\">::zio_state\r\n    ADDRESS                                  TYPE  STAGE            WAITER\r\n    fffff6001b0b4cc8                         NULL  OPEN             -\r\n    ffff60037f0e340                          NULL  OPEN             -\r\n    fffff60039911cc0                         NULL  OPEN             -\r\n    fffff60037f25338                         NULL  CHECKSUM_VERIFY  ffffff007bb30c40\r\n\r\n    ffffff007bb30c40::findstack -v\r\n    stack pointer for thread ffffff007bb30c40: ffffff007bb2ffb0\r\n    [ ffffff007bb2ffb0 _resume_from_idle+0xf4() ]\r\n      ffffff007bb2ffe0 swtch+0x150()\r\n      ffffff007bb30010 cv_wait+0x61(fffff60037f25650, fffff60037f25648)\r\n      ffffff007bb30050 zio_wait+0x5d(fffff60037f25338)  : arg0 is zio_t\r\n      ffffff007bb300b0 dbuf_read+0x1e5(fffff6006a5539f8, 0, a)\r\n      ffffff007bb30130 dmu_buf_hold+0xac(fffff6001b7fa0c0, 18a, 66715000, 0, ffffff007bb30148, 1)\r\n      ffffff007bb301a0 zap_get_leaf_byblk+0x5c(fffff60037bf44c0, 66715, 0, 1, ffffff007bb30360)\r\n      ffffff007bb30210 zap_deref_leaf+0x78(fffff60037bf44c0, 10b0600000000000, 0, 1, ffffff007bb30360)\r\n      ffffff007bb302a0 fzap_cursor_retrieve+0xa7(fffff60037bf44c0, ffffff007bb30350, ffffff007bb30390)\r\n      ffffff007bb30330 zap_cursor_retrieve+0x188(ffffff007bb30350, ffffff007bb30390)\r\n      ffffff007bb30640 ddt_zap_walk+0x5d(fffff6001b7fa0c0, 18a, ffffff007bb306e0, fffff60051b8c100)\r\n      ffffff007bb30680 ddt_object_walk+0x4e(fffff60052bc3000, 0, 1, fffff60051b8c100, ffffff007bb306e0)\r\n      ffffff007bb306d0 ddt_walk+0x78(fffff600513cc080, fffff60051b8c0e8, ffffff007bb306e0)\r\n      ffffff007bb308a0 dsl_scan_ddt+0x2de(fffff60051b8c040, fffff6004256b1e8)\r\n      ffffff007bb30a60 dsl_scan_visit+0x54(fffff60051b8c040, fffff6004256b1e8)\r\n      ffffff007bb30ac0 dsl_scan_sync+0x20f(fffff6003a230a00, fffff6004256b1e8)\r\n      ffffff007bb30b80 spa_sync+0x3f7(fffff600513cc080, 8824b0)\r\n      ffffff007bb30c20 txg_sync_thread+0x247(fffff6003a230a00)\r\n      ffffff007bb30c30 thread_start+8()<\/pre>\n<\/div>\n<h2>::zio<\/h2>\n<div class=\"level4\">\n<p>Displays a recursive stack of IOs<\/p>\n<pre class=\"code\">fffff60037f25338::zio -rc\r\n     ADDRESS                                  TYPE  STAGE            WAITER\r\n     fffff60037f25338                         NULL  CHECKSUM_VERIFY  ffffff007bb30c40\r\n      fffff60041603998                        READ  VDEV_IO_START    -\r\n       fffff600399b9340                       READ  VDEV_IO_START    -\r\n        fffff6003e24fcc8                      READ  VDEV_IO_START    -\r\n         fffff60039999cc8                     READ  VDEV_IO_START    -   : last one\r\n\r\n\r\n   fffff60039999cc8::print zio_t io_vd\r\n     io_vd = 0xfffff600513cad40\r\n\r\n   fffff60039999cc8::print zio_t io_vd | ::print vdev_t vdev_path\r\n     vdev_path = 0xfffff60050f0bc38 \"\/dev\/dsk\/c3t5000CCA396E3D6FAd0s0\"<\/pre>\n<\/div>\n<h2>::nm<\/h2>\n<div class=\"level4\">\n<p>Displays the size of a variable and can provide the argument types of any function in the kernel<\/p>\n<pre class=\"code\">::nm ! 
grep arc_meta_limit\r\n   0xffffffffc00043b8|0x0000000000000008|OBJT |LOCL |0x0  |8       |arc_meta_limit\r\n\r\n   kmem_cache_reap_now::nm -f ctype\r\n   C Type\r\n   void (*)(kmem_cache_t *)<\/pre>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>iostat This is a standard Unix utility for analysing IO workloads and performance troubleshooting, at the individual drive level. The first iteration provides a cumulative summary over node uptime, following &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"template-fullwidth.php","meta":{"_themeisle_gutenberg_block_has_review":false,"footnotes":""},"_links":{"self":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/340"}],"collection":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/comments?post=340"}],"version-history":[{"count":1,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/340\/revisions"}],"predecessor-version":[{"id":341,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/340\/revisions\/341"}],"wp:attachment":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/media?parent=340"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}