{"id":342,"date":"2023-09-30T07:28:10","date_gmt":"2023-09-30T07:28:10","guid":{"rendered":"https:\/\/fredpayet.fr\/?page_id=342"},"modified":"2023-09-30T07:28:10","modified_gmt":"2023-09-30T07:28:10","slug":"zfs-fragmentation","status":"publish","type":"page","link":"https:\/\/fredpayet.fr\/index.php\/zfs-fragmentation\/","title":{"rendered":"ZFS Fragmentation"},"content":{"rendered":"<p style=\"text-align: justify;\">Due to the Copy-On-Write nature of ZFS, we can get fragmented pools at some point. Some enhancements seems to be in the pipe but still, we cannot defragment a pool. The only way to workaround that is to do a full backup of the pool and restore everything. Hopefully, this should not happen too often (years) and the high amount of hits in ARC or L2ARC for metadata masks the delay ZFS could have trying to find the correct segments in the vdevs. Here is a presentation telling\u00a0how to confirm a pool is fragmented.<\/p>\n<h2>Internals<\/h2>\n<ul>\n<li>ZFS divides each vdevs into a few hundred regions called <strong>metaslabs<\/strong>.<\/li>\n<li class=\"level1\">\n<div class=\"li\">A <strong>metaslab<\/strong> is divided into <strong>segments<\/strong> of not fixed size.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">A <strong>spacemap<\/strong> tracks all the free space in a metaslab. It&#8217;s a log of allocations and frees, in time order.<\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">2 areas of contiguous free space will be merged together in the <strong>spacemap<\/strong><\/div>\n<\/li>\n<li class=\"level1\">\n<div class=\"li\">The metaslab allocator doesn&#8217;t necessarily use the largest free segment.<\/div>\n<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">The figure below shows a visualization of the zpool after it is filled to 25% using a simple dd command (sequential writes). Each cell represents one metaslab. The pool here has 256 metaslabs (128MB each). The percentage value is the <strong><em class=\"u\">amount of free space<\/em><\/strong> in each metaslab. Here, around 32 metaslabs are full, accounting for 8GB of data :<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/fredpayet.free.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok.png\"><img decoding=\"async\" loading=\"lazy\" class=\" size-full wp-image-219 alignnone\" src=\"http:\/\/fredpayet.free.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok.png\" alt=\"fragmentation_ok\" width=\"544\" height=\"121\" srcset=\"https:\/\/fredpayet.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok.png 544w, https:\/\/fredpayet.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok-300x67.png 300w\" sizes=\"(max-width: 544px) 100vw, 544px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">After some time of random IO reads and writes onto the pool (reading 1 block and writing it again), the pool becomes fragmented :<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/fredpayet.free.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok1.png\"><img decoding=\"async\" loading=\"lazy\" class=\" size-full wp-image-220 alignnone\" src=\"http:\/\/fredpayet.free.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok1.png\" alt=\"fragmentation_ok\" width=\"544\" height=\"121\" srcset=\"https:\/\/fredpayet.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok1.png 544w, https:\/\/fredpayet.fr\/wp-content\/uploads\/2017\/01\/fragmentation_ok1-300x67.png 300w\" sizes=\"(max-width: 544px) 100vw, 544px\" \/><\/a><\/p>\n<p>The pool is not compact, much more metaslabs are used. 
<h2>zdb</h2>
<p>zdb is the ZFS debugger. It works well on a pool that is not busy; under heavy load it may just report "<strong>io error</strong>" or even crash.</p>
<pre>
   # zdb -emm MaPoule
   Metaslabs:
	vdev          0 <<< mirror-0
	metaslabs   116   offset                spacemap          free
	---------------   -------------------   ---------------   -------------
	metaslab      0   offset            0   spacemap     39   free    7.41G
	                  segments       7484   maxsize    487M   freepct   92%
	metaslab      1   offset    200000000   spacemap    150   free    5.97G
	                  segments       1280   maxsize   4.87G   freepct   74%
	metaslab      2   offset    400000000   spacemap     83   free     258M
	                  segments       1292   maxsize   21.2M   freepct    3%
	metaslab      3   offset    600000000   spacemap   2526   free     632M
	                  segments       1672   maxsize   39.9M   freepct    7%
        ...
	metaslab    115   offset   e600000000   spacemap      0   free       8G
	                  segments          1   maxsize      8G   freepct  100%
</pre>
<h3>Meaning</h3>
<ul>
<li>vdev 0 is made of 116 metaslabs.</li>
<li><strong>offset</strong>: metaslab 0 starts at offset 0 and is made of 7484 segments.</li>
<li><strong>spacemap</strong>: spacemap entries track the free segments (offset, length). Metaslab 0 is tracked by spacemap #39, which is an object number in the DMU.</li>
<li><strong>maxsize</strong>: metaslab 0 has a maximum contiguous free space of 487M.</li>
<li><strong>free</strong>: metaslab 0 has 7.41G of free space.</li>
<li><strong>freepct</strong>: metaslab 0 is 92% free.</li>
</ul>
<h3>Interpretation</h3>
<ul>
<li>At the beginning, when no block has been written to a metaslab, the number of segments is 1 and the <strong>spacemap</strong> number is 0 (no object has been allocated for the spacemap yet). It takes up no room in the pool until allocations start from it.</li>
<li><strong>Having maxsize smaller than 128KB is an indication of a potentially fragmented pool</strong>, since 128KB is the default maximum ZFS record size: a full-size block can no longer be written to a single contiguous segment of that metaslab. Note that the metaslab could also simply be full from sequential writes.</li>
<li>A very large number of segments in a metaslab is also an indication of fragmentation in that metaslab (a quick way to scan for such metaslabs is sketched just after this list).</li>
<li>Badly balanced vdevs (some full, others mostly free) will impact performance.</li>
</ul>
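<p>As a rough way to spot suspect metaslabs, the one-liner below lists segment count and maximum contiguous free size per metaslab, largest segment counts first. It is only a sketch that assumes the column layout of the zdb output shown above; against that sample, the first lines would be:</p>
<pre>
   # zdb -emm MaPoule | awk '/segments/ { print $2, "segments, maxsize", $4 }' | sort -rn | head
   7484 segments, maxsize 487M
   1672 segments, maxsize 39.9M
   1292 segments, maxsize 21.2M
   ...
</pre>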
<h2>dtrace</h2>
<h3>metaslab_alloc_dva, easy</h3>
<p>A first way to determine whether fragmentation is hurting is to measure how long <strong>metaslab_alloc_dva()</strong> takes (the >> 20 shift converts the nanosecond timestamps to approximately milliseconds):</p>
<pre>
   # dtrace -n 'metaslab_alloc_dva:entry { self->t = timestamp }
     metaslab_alloc_dva:return / self->t / { printf("%u ms %Y", (timestamp - self->t) >> 20, walltimestamp); }'

   CPU ID FUNCTION:NAME
   09 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   10 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   10 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   12 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   12 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:20
   18 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   18 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   18 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   18 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
   21 41948 metaslab_alloc_dva:return 0 ms 2014 Jan 31 13:04:19
</pre>
<h3>metaslab_alloc_dva, improved</h3>
<p>The script below prints, every 10 seconds, the maximum allocation time observed and the kernel stacks involving <strong>metaslab_alloc_dva()</strong> whenever a call took more than 1ms to complete (timestamps are in nanoseconds, so the threshold is 1,000,000 ns):</p>
<pre>
   #!/usr/sbin/dtrace -Cqs

   #define MAX(a, b) (a>b ? a : b)
   #define STACK_PRINT_NS 1000000 /* 1,000,000 ns = 1 ms */

   dtrace:::BEGIN
   {
      dva_max = 0;
   }

   metaslab_alloc_dva:entry
   {
      self->t = timestamp;
   }

   metaslab_alloc_dva:return / self->t /
   {
      self->diff = timestamp - self->t;
      dva_max = MAX(dva_max, self->diff);
   }

   metaslab_alloc_dva:return / self->t && self->diff > STACK_PRINT_NS /
   {
      /* grab the stack only when this call took more than STACK_PRINT_NS ns */
      @s[stack()] = count();
   }

   tick-10s
   {
      printf("\n");
      printf("----- %Y -----\n", walltimestamp);
      printf(" dva_max = %d\n", dva_max);
      printf("Stacks when metaslab_alloc_dva > %d ns :\n", STACK_PRINT_NS);
      printa(@s);
      trunc(@s);
      dva_max = 0;
   }
</pre>
<p>Results:</p>
<pre>
   host# ./dva_max.d
   ----- 2014 Jan 31 14:23:16 -----
   dva_max = 3827391
   Stacks when metaslab_alloc_dva > 1000000 ns :

   zfs`metaslab_alloc+0xd6
   zfs`zio_dva_allocate+0xd8
   zfs`zio_execute+0x8d
   genunix`taskq_thread+0x22e
   unix`thread_start+0x8
   545
</pre>
<p>Here dva_max is 3,827,391 ns: the slowest allocation in that 10-second window took about 3.8 ms.</p>
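<p>Rather than eyeballing individual calls, the allocation times can also be aggregated into a latency distribution. This is a small sketch in the same spirit as the scripts above, assuming the same metaslab_alloc_dva fbt probes are available on the running kernel:</p>
<pre>
   # dtrace -n '
     metaslab_alloc_dva:entry  { self->t = timestamp; }
     metaslab_alloc_dva:return / self->t /
     {
        /* power-of-two histogram of allocation time, in nanoseconds */
        @lat["metaslab_alloc_dva (ns)"] = quantize(timestamp - self->t);
        self->t = 0;
     }
     tick-10s { printa(@lat); trunc(@lat); }'
</pre>
<p>A distribution that stays in the low microsecond buckets means the allocator is finding space easily; a long tail toward milliseconds is what fragmentation looks like here.</p>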
<h3>seeksize.d</h3>
<p>Measuring the overall seek distance of the disk heads can also help determine whether the pool is fragmented. There are other causes of large head movements (dedup, for instance), but it is worth considering. The <strong>seeksize.d</strong> script is part of the <strong>DTrace Toolkit</strong> and is explained in depth in the DTrace book by <strong>Brendan Gregg</strong>. See <a href="http://dtrace.org/blogs/brendan/2010/09/23/dtrace-book-coming-soon/" rel="nofollow">http://dtrace.org/blogs/brendan/2010/09/23/dtrace-book-coming-soon/</a> (chapter 4, page 184).</p>
<pre>
   #!/usr/sbin/dtrace -s
   /*
   * seeksize.d - analyse disk head seek distance by process.
   * Written using DTrace (Solaris 10 3/05).
   */
   #pragma D option quiet

   dtrace:::BEGIN
   {
      printf("Tracing... Hit Ctrl-C to end.\n");
   }

   self int last[dev_t];

   io:genunix::start
   /self->last[args[0]->b_edev] != 0/
   {
      /* calculate seek distance from the previous I/O on this device */
      this->last = self->last[args[0]->b_edev];
      this->dist = (int)(args[0]->b_blkno - this->last) > 0 ?
                   args[0]->b_blkno - this->last : this->last - args[0]->b_blkno;

      /* store details */
      @Size[pid, curpsinfo->pr_psargs] = quantize(this->dist);
   }

   io:genunix::start
   {
      /* save the position of the disk head after this I/O */
      self->last[args[0]->b_edev] = args[0]->b_blkno + args[0]->b_bcount / 512;
   }

   dtrace:::END
   {
      printf("\n%8s %s\n", "PID", "CMD");
      printa("%8d %S\n%@d\n", @Size);
   }
</pre>
<p>Output example (distances are in disk blocks, typically 512 bytes):</p>
<pre>
   # ./seeksize.d
   ^C
     PID  CMD
     613  zpool-pool-raid

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@                                        16
               1 |                                         11
               2 |@                                        32
               4 |@                                        40
               8 |@                                        18
              16 |                                         6
              32 |                                         12
              64 |@                                        26
             128 |                                         2
             256 |@@                                       61
             512 |                                         0
            1024 |@@@@@@                                   191
            2048 |@@@@                                     111
            4096 |@                                        24
            8192 |@@@@@@@                                  234
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         11
          262144 |                                         12
          524288 |                                         13
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@@@@                                   193
         8388608 |                                         0
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |@                                        47
      1073741824 |@@@@@                                    162
      2147483648 |                                         0
      4294967296 |@                                        45
      8589934592 |                                         0
</pre>
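<p>When the pool spans several vdevs, it can be more useful to aggregate the distances per device rather than per process. The variant below is a sketch along the lines of seeksize.d, using the io provider's device name (args[1]->dev_statname) as the aggregation key:</p>
<pre>
   #!/usr/sbin/dtrace -s
   /* seeksize.d variant: seek distance per device instead of per process */
   #pragma D option quiet

   self int last[dev_t];

   io:genunix::start
   /self->last[args[0]->b_edev] != 0/
   {
      this->prev = self->last[args[0]->b_edev];
      this->dist = (int)(args[0]->b_blkno - this->prev) > 0 ?
                   args[0]->b_blkno - this->prev : this->prev - args[0]->b_blkno;
      /* one distribution per device name */
      @Size[args[1]->dev_statname] = quantize(this->dist);
   }

   io:genunix::start
   {
      self->last[args[0]->b_edev] = args[0]->b_blkno + args[0]->b_bcount / 512;
   }

   dtrace:::END
   {
      printa("%s\n%@d\n", @Size);
   }
</pre>
<p>If one vdev shows much longer seeks than the others, it is worth checking whether that vdev is also the one with badly balanced or fragmented metaslabs in the zdb output.</p>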
<p>Keep in mind that the way ZFS is designed (keeping as much data as possible in DRAM) can hide these head movements. When writing, ZFS also tries to group I/Os into contiguous segments of each metaslab and may fill empty ones.</p>
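<p>Finally, since high metadata hit rates in the ARC or L2ARC mask most of the cost of a fragmented pool, it is worth checking those hit rates before worrying too much. On Solaris/illumos the ARC counters are exposed through kstat; the commands below are a minimal sketch (counter names may vary slightly between releases):</p>
<pre>
   # Cumulative ARC demand counters since boot
   kstat -p zfs:0:arcstats | egrep 'demand_(data|metadata)_(hits|misses)'

   # L2ARC counters, if a cache device is configured
   kstat -p zfs:0:arcstats | egrep 'l2_(hits|misses)'
</pre>
<p>A high metadata hit rate means most block-pointer lookups never reach the fragmented vdevs in the first place.</p>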