{"id":344,"date":"2023-09-30T07:28:45","date_gmt":"2023-09-30T07:28:45","guid":{"rendered":"https:\/\/fredpayet.fr\/?page_id=344"},"modified":"2023-09-30T07:50:39","modified_gmt":"2023-09-30T07:50:39","slug":"zfssa-know-bugs-2017","status":"publish","type":"page","link":"https:\/\/fredpayet.fr\/index.php\/zfssa-know-bugs-2017\/","title":{"rendered":"ZFSSA know bugs &#8211; 2017"},"content":{"rendered":"<p><br \/>\nAll of the following information has to be taken &#8220;as is&#8221; and has not been updated for 3 years now.<br \/>\nThough I am quite sure all of these bugs are fixed now, it was worth listing them and share this with my fellows that are still working onto the ZFSA.<br \/>\nHope that helps.<\/p>\n<p>Credits to all the engineers that did work with me at Sun and Oracle from 1999 to 2017.<\/p>\n<h2>General bugs list<\/h2>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15020395<\/td>\n<td >performance decrease if 1st nameserver is down<\/td>\n<td > <\/td>\n<td >2011.1.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15040725<\/td>\n<td >RFE for improved DNS resolver failover performance<\/td>\n<td > <\/td>\n<td >2011.1.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15770608<\/td>\n<td >Enable DNS defer-on-fail functionality<\/td>\n<td > <\/td>\n<td >2011.1.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15505861<\/td>\n<td >Write performance degrades when ZFS filesystem is near quota<\/td>\n<td > <\/td>\n<td >2013.1.1.9<\/td>\n<td >Attempts was done for 2011.1.6 IDR but regression tests did fail<\/td>\n<\/tr>\n\r\n<tr><td >15742475<\/td>\n<td >Extremely sluggish 7420 node due to heap fragmentation limit arc_size and arc_meta_limit<\/td>\n<td > <\/td>\n<td >2011.1.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15770103<\/td>\n<td >Intermittent akd hangs &#8211; 
mutex owned by a thread stuck in zfs land<\/td>\n<td > <\/td>\n<td >2011.1.4<\/td>\n<td >Impacts rollback, snapshot deletion. Refers to CR 7155750<\/td>\n<\/tr>\n\r\n<tr><td >15789968<\/td>\n<td >Unable to delete zombie snapshot causes CPU spikes<\/td>\n<td > <\/td>\n<td >2011.1.5<\/td>\n<td >High CPU usage is an indicator<\/td>\n<\/tr>\n\r\n<tr><td >15814083<\/td>\n<td >z_fuid_lock becomes too hot when zfs_fuid_find_by_idx() is heavily used<\/td>\n<td > <\/td>\n<td >2013.1.2<\/td>\n<td >CIFS performance is bad. High CPU usage on the ZFSSA is a clear indicator<\/td>\n<\/tr>\n\r\n<tr><td >15705245<\/td>\n<td >Elapsed time for zfs delete may grow quadratically<\/td>\n<td > <\/td>\n<td >2011.1.5<\/td>\n<td >LUN or filesystem deletion may lead to a BUI hang for hours<\/td>\n<\/tr>\n\r\n<tr><td >15710347<\/td>\n<td >System hung, lots of threads in htable_steal<\/td>\n<td >limit arc_size<\/td>\n<td >2013.1.1.5<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15751065<\/td>\n<td >Appliance unavailable due to zio_arena fragmentation<\/td>\n<td >limit arc_size and arc_meta_limit<\/td>\n<td >2011.1.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15710534<\/td>\n<td >Advance rotor before blocking on space_map_load<\/td>\n<td >metaslab_unload_delay<\/td>\n<td >2011.1.6<\/td>\n<td >sawtooth IOPs pattern during sync of metadata<\/td>\n<\/tr>\n\r\n<tr><td >15790948<\/td>\n<td >soft-ring fanout is single threaded for VLAN<\/td>\n<td > <\/td>\n<td >2013.1<\/td>\n<td >Throughput limited when VLANs are used<\/td>\n<\/tr>\n\r\n<tr><td >17650441<\/td>\n<td >NFS drops to about 0 for about 30 seconds and then returns to normal<\/td>\n<td > <\/td>\n<td >IDR 2011.04.24.8.0,1-2.43.2.3<\/td>\n<td >spa_sync activity lasts more than 5 sec<\/td>\n<\/tr>\n\r\n<tr><td >15723911<\/td>\n<td >concurrency opportunity during spa_sync<\/td>\n<td >sync=always<\/td>\n<td >2013.1.1.1<\/td>\n<td >20% loss of throughput with RMAN backups<\/td>\n<\/tr>\n\r\n<tr><td >15652868<\/td>\n<td >DTrace compiler doesn't align structs properly 
<\/td>\n<td >disable some analytics<\/td>\n<td >2013.1<\/td>\n<td >ip.bytes[hostname] impacts perf<\/td>\n<\/tr>\n\r\n<tr><td >16423386<\/td>\n<td >dnode_next_offset() doesn't do exhaustive search<\/td>\n<td > <\/td>\n<td >2013.1.1.4<\/td>\n<td >perf issue when deleting clones<\/td>\n<\/tr>\n\r\n<tr><td >18753152<\/td>\n<td >zil train and logbias=throughput, 2X performance opportunity<\/td>\n<td > <\/td>\n<td >2013.1.2.8<\/td>\n<td >perf issue impacting databases running an OLTP workload with 8k recordsize and logbias=throughput<\/td>\n<\/tr>\n\r\n<tr><td >17832061<\/td>\n<td >SMB read performance is not consistent<\/td>\n<td >Avoid using different VLANs<\/td>\n<td >2013.1.2.9<\/td>\n<td >50% bandwidth loss over different VLANs<\/td>\n<\/tr>\n\r\n<tr><td >15754028<\/td>\n<td >read-modify-write in space_map_sync() may significantly slow down spa_sync() performance<\/td>\n<td >metaslab_unload_delay<\/td>\n<td >2013.1.2.9<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15784616<\/td>\n<td >need an upper bound on memory footprint of l2hdr for large l2arc<\/td>\n<td > <\/td>\n<td >2013.1.2.11<\/td>\n<td >L2ARC_HEADER_PERCENT added : default = 30% of ARC<\/td>\n<\/tr>\n\r\n<tr><td >18609976<\/td>\n<td >large deletes cause NFS ops to drop to zero<\/td>\n<td > <\/td>\n<td >2013.1.3.5<\/td>\n<td >To be monitored with the spawtf.d script. Happens when large datasets are destroyed. spa_sync can last more than 10 mins. The fix being pursued here also fixes two problems causing scan\/scrub\/resilver to overrun its time allocations. Prior to this, if a scrub is running, it consistently goes over 6s. 
With the fix, it's just over 5s as it's intended to be.<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nTypical stack as follows:\r\n               genunix`avl_find+0x56\r\n               zfs`space_map_add+0xa6\r\n               zfs`zio_free+0x1a5\r\n               zfs`dsl_free+0x20\r\n               zfs`dsl_dataset_block_kill+0x19b\r\n               zfs`free_blocks+0x76\r\n               zfs`free_children+0x2c7\r\n               zfs`free_children+0x23f\r\n               zfs`free_children+0x23f\r\n               zfs`dnode_sync_free_range+0x113\r\n               zfs`dnode_sync+0x272\r\n               zfs`dmu_objset_sync_dnodes+0x7f\r\n               zfs`dmu_objset_sync+0x1c9\r\n               zfs`dsl_dataset_sync+0x63\r\n               zfs`dsl_pool_sync+0xdb\r\n               zfs`spa_sync+0x39a\r\n               zfs`txg_sync_thread+0x244\r\n               unix`thread_start+0x8\r\n              2407\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >20693077<\/td>\n<td >snapshot deletes don't check timeout while iterating<\/td>\n<td > <\/td>\n<td > 2013.1.4.6<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nHundreds of stacks similar to this seen onproc:\r\n              zfs`dbuf_hold_impl+0x1\r\n              zfs`dnode_hold_impl+0xc3\r\n              zfs`dnode_hold+0x2b\r\n              zfs`dmu_bonus_hold+0x32\r\n              zfs`bpobj_open+0x6d\r\n              zfs`bpobj_iterate_impl+0x35d\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              
zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate_impl+0x3ff\r\n              zfs`bpobj_iterate+0x23\r\n              zfs`dsl_scan_sync+0x11b\r\n              zfs`spa_sync+0x447\r\n              zfs`txg_sync_thread+0x244\r\n              unix`thread_start+0x8\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >19772024<\/td>\n<td >ls_reclaim() blocks kmem_cache_reap()<\/td>\n<td >limit arc_size<\/td>\n<td >2013.1.3.1<\/td>\n<td >kmem_cache_reap thread stuck trying to retrieve some space from specific NFS cache<\/td>\n<\/tr>\n\r\n<tr><td >19591405<\/td>\n<td >akd running in RT class is blocking critical kernel memory management<\/td>\n<td > <\/td>\n<td >2013.1.6.0<\/td>\n<td >Some RT akd threads needing memory prevent kmem_cache_reap and pageout scanner from running. See MOS Doc ID 2043383.1 : Some systems with LOW Memory configurations might not run reliably\/stable with 2013.1.x releases<\/td>\n<\/tr>\n<\/tbody><\/table><\/div> The issue can happen on small sized systems with 24GB of DRAM. A typical solution is to add DRAM (to have a final size of 96GB). 
<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nAn admitted workaround is to set the non-cluster akd threads' scheduling class to Fair Share (FS), as follows :\r\n              zfssa # svccfg -s akd setprop akd\/rt_off = &quot;true&quot;\r\n              zfssa # svccfg -s akd listprop | grep rt_off\r\n              akd\/rt_off_cluster astring     false\r\n              akd\/rt_off astring true  &lt;&lt;&lt;&lt;\r\n              zfssa # svcadm refresh akd\r\n              zfssa # svcadm restart akd\r\nThis must be done on both cluster nodes.\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >18914201<\/td>\n<td >Hole buffers are calculating zero checksums leading to performance slowdown<\/td>\n<td > <\/td>\n<td >2013.1.3.1<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >19137061<\/td>\n<td >top-level vdev allocations may go over mg_aliquot<\/td>\n<td > <\/td>\n<td >2013.1.3.1<\/td>\n<td >The fix spreads IOs across vdevs in a more efficient way, making each vdev less busy<\/td>\n<\/tr>\n\r\n<tr><td >19949855<\/td>\n<td >optimize the number of cross-calls during hat_unload<\/td>\n<td > <\/td>\n<td >2013.1.3.3<\/td>\n<td >Reaper thread not able to free buffers fast enough. This is causing NFS clients to experience IO latency.<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nTypical stack is as follows :\r\n              unix`do_splx+0x97\r\n              unix`xc_common+0x230\r\n              unix`xc_call+0x46\r\n              unix`hat_tlb_inval+0x289\r\n              unix`hat_unload_callback+0x302\r\n              unix`hat_unload+0x41\r\n              unix`segkmem_free_vn+0x5a\r\n              unix`segkmem_free+0x27\r\n              genunix`vmem_xfree+0x10a\r\n              
genunix`vmem_free+0x27\r\n              genunix`kmem_slab_destroy+0x87\r\n              genunix`kmem_slab_free+0x2c7\r\n              genunix`kmem_magazine_destroy+0xdf\r\n              genunix`kmem_depot_ws_reap+0x77\r\n              genunix`kmem_cache_magazine_purge+0xd6\r\n              genunix`kmem_cache_magazine_resize+0x41\r\n              genunix`taskq_thread+0x22e\r\n              unix`thread_start+0x8\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >19883033<\/td>\n<td >ZS3 encountering unexplained high CPU utilization<\/td>\n<td > <\/td>\n<td >2013.1.3.3<\/td>\n<td >SMB issue mostly<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nUsing dtrace to count the number of calls from\r\nzfs_groupmember()-&gt;zfs_fuid_find_by_idx()\r\n\r\nfor the exact same workload to the same share as the same user:\r\ncalls                version\r\n12843966        ak-2013.1 3.0\r\n848928          ak-2013.1 3.3\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15324796<\/td>\n<td >zfetch needs a whole lotta love<\/td>\n<td > <\/td>\n<td >2013.1.4.4<\/td>\n<td >For sequential IO reads with a small blocksize, zfetch_block_cap can be increased. By default, zfetch_block_cap = 16. With a recordsize of 8k, ZFS will prefetch 128K only. A high number of vdevs will not help as we just do not prefetch deep enough to make them all busy. To get better disk usage, zfetch_block_cap has to be increased to 32 or even 64. 
See tuning_zfetch_block_cap.akwf attached to bug 19982476 (backport for ZFSSA).<\/td>\n<\/tr>\n\r\n<tr><td >20361021<\/td>\n<td >Latency seen on upgrade to SMB2.1 with leases. SMB2.1 server is waiting for lease break ack<\/td>\n<td > <\/td>\n<td >2013.1.4.7<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSMB2.1 is available in 2013.1.3.0, which introduces leases. This feature works in a similar way to oplocks, where a 35 sec delay can be seen. So just disabling oplocks from the BUI might not work. <\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nYou have to go back to SMB2.0 (to disable leases) as follows :\r\n              zfssa# svccfg -s smb listprop | grep maxprotocol\r\n              smb\/client_maxprotocol astring 1\r\n              smb\/server_maxprotocol astring 2\r\n              zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol\r\n              skc_server_maxprotocol = 3 (SMB_D2_1)\r\n\r\n              zfssa# svccfg -s smb setprop smb\/server_maxprotocol = astring: &quot;2.0&quot;\r\n              zfssa# svccfg -s smb listprop | grep protocol\r\n              smb\/client_maxprotocol astring 1\r\n              smb\/server_maxprotocol astring 2.0\r\n              zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol\r\n              skc_server_maxprotocol = 2 (SMB_D2_0)\r\n\r\n              zfssa# aksh\r\n              zfssa:&gt; configuration services smb\r\n              zfssa:configuration services smb&gt; disable\r\n              zfssa:configuration services smb&gt; enable\r\n<\/pre>\n<p>Caveat : Disabling oplocks will result in more traffic over the wire. 
The clients will be denied read\/write caching of the data and all the requests have to go to the server.<\/p>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >19454536<\/td>\n<td >Improve the performance of SMB2_QUERY_DIRECTORY<\/td>\n<td > <\/td>\n<td >2013.1.6.0<\/td>\n<td >Browsing CIFS shares with lots of files will make the ZFSSA CPUs busy &#8230; latencies can happen<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nzfssa# dtrace -n 'profile-997hz \/arg0 &amp;&amp; curthread-&gt;t_pri != -1\/ { @&#x5B;stack()] = count(); }\r\n       tick-10sec { trunc(@, 10); printa(@); exit(0); }'\r\n              unix`mutex_delay_default+0x7\r\n              unix`mutex_vector_enter+0x2ae\r\n              zfs`zfs_zaccess_aces_check+0x77\r\n              zfs`zfs_zaccess_common+0x85\r\n              zfs`zfs_zaccess+0x14d\r\n              zfs`zfs_fastaccesschk_execute+0x173\r\n              zfs`zfs_lookup+0xa8\r\n              genunix`vhead_lookup+0xd5\r\n              genunix`fop_lookup+0x128\r\n              smbsrv`smb_vop_lookup+0x17c\r\n              smbsrv`smb_fsop_do_lookup+0x14e\r\n              smbsrv`smb_fsop_lookup+0x27d\r\n              smbsrv`smb_odir_wildcard_fileinfo+0x6f\r\n              smbsrv`smb_odir_read_fileinfo+0x10e\r\n              smbsrv`smb2_query_directory_encode_dynamic+0xd5\r\n              smbsrv`smb2_query_directory_encode+0x7e\r\n              smbsrv`smb2_query_directory+0x84\r\n              smbsrv`smb2_dispatch_request_impl+0x237\r\n              smbsrv`smb2_dispatch_request+0x15\r\n              smbsrv`smb_session_worker+0x88\r\n             5672\r\n\r\n   Lockstat extract (32 cores) :\r\n   
-------------------------------------------------------------------------------\r\n   Count indv cuml rcnt     nsec Lock                   Hottest Caller\r\n   2959889  95%  95% 0.00   224391 0xfffff6011546f100 zfs_root+0x58\r\n         nsec ------ Time Distribution ------ count     Stack\r\n          256 |                               98 fsop_root+0x2d\r\n          512 |                               56784 smb_node_alloc+0x11e\r\n         1024 |                               82330 smb_node_lookup+0x1fe\r\n         2048 |@                              187212 smb_fsop_do_lookup+0x49b\r\n         4096 |@@@                            372976 smb_fsop_lookup+0x27d\r\n         8192 |@@@@@@@@                       876803 smb_odir_wildcard_fileinfo+0x6f\r\n        16384 |@@@                            354769 smb_odir_read_fileinfo+0x10e\r\n        32768 |@                              167806 smb2_query_directory_encode_dynamic+0xd5\r\n        65536 |@                              107972 smb2_query_directory_encode+0x7e\r\n       131072 |                               91142 smb2_query_directory+0x84\r\n       262144 |@                              107285 smb2_dispatch_request_impl+0x237\r\n       524288 |@                              150887 smb2_dispatch_request+0x15\r\n      1048576 |@                              182779 smb_session_worker+0x88\r\n      2097152 |@                              152531 taskq_d_thread+0xc1\r\n      4194304 |                               66082 thread_start+0x8\r\n      8388608 |                               2424\r\n     16777216 |                               9\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >23216069<\/td>\n<td >Missing flag in prefetch backport causes excessive prefetch activity<\/td>\n<td > 
<\/td>\n<td >2013.1.6<\/td>\n<td >Related to : 21974849 &#8211; RMAN backup over ASM leads to 1M random read foiling zfetch. With the new zfetch implementation brought in 2013.1.4.4, RMAN backups through ASM make the SAS drives saturated because of unexpected prefetching.<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n@ With the new zfetch code when\r\n@ subject to 1MB random read, we actually do issue some readaheads and given\r\n@ that the ASM Allocation Units can be interleaved, we're likely to end up not using the prefetched data.\r\n\r\nThis makes the disks very busy for no benefit.\r\nTo compare the FC\/iSCSI\/NFS activity to the SAS2 disks activity, it can be interesting to add some \r\nanalytics as follows :\r\n\r\n   fc.bytes&#x5B;op]\r\n   iscsi.bytes&#x5B;op]\r\n   nfs3.bytes&#x5B;op]\r\n   ...\r\n\r\nThis can even make the CPUs busy with the following stack :\r\n               zfs`dbuf_prefetch+0x1\r\n               zfs`dmu_zfetch+0x10d\r\n               zfs`dbuf_read+0x128\r\n               zfs`dmu_buf_hold_array_by_dnode+0x1cc\r\n               zfs`dmu_buf_hold_array+0x6f\r\n               zfs`dmu_read_uio+0x4b\r\n               zfs`zfs_read+0x260\r\n               genunix`fop_read+0xa9\r\n               nfssrv`rfs3_read+0x3ec\r\n               nfssrv`common_dispatch+0x53f\r\n               nfssrv`rfs_dispatch+0x2d\r\n               rpcmod`svc_process_request+0x184\r\n               rpcmod`stp_dispatch+0xb9\r\n               genunix`taskq_d_thread+0xc1\r\n               unix`thread_start+0x8\r\n             47770\r\n@\r\n@ The current workaround that we implemented was to tune\r\n@ zfetch_maxbytes_lb=128K &amp; zfetch_maxbytes_ub=256K. 
This keeps prefetch enabled but reduces the amount\r\n@ of prefetch data that needs to be thrown away. This is not a very comfortable\r\n@ workaround.\r\n\r\n   # echo zfetch_maxbytes_ub\/W0t262144 | mdb -kw\r\n   zfetch_maxbytes_ub:             0x2000000       =       0x40000\r\n   # echo zfetch_maxbytes_lb\/W0t131072 | mdb -kw\r\n   zfetch_maxbytes_lb:             0x400000        =       0x20000\r\n\r\nFrom dmu_zfetch.c :\r\n076 \/* Max amount of data to prefetch at a time *\/\r\n077 unsigned int zfetch_maxbytes_ub = 32 &lt;&lt; MB_SHIFT;\r\n078 unsigned int zfetch_maxbytes_lb = 4 &lt;&lt; MB_SHIFT;\r\n079 unsigned int zfetch_target_blks = 256;\r\n\r\n650 \/* Provides an upper bound on the prefetch window size for a single prefetch.\r\n651 * We try to prefetch zfetch_target_blks (256) but stay within the size bounds\r\n652 * &#x5B;zfetch_maxbytes_lb (4 MB), zfetch_maxbytes_ub (32 MB)]\r\n653 *\r\n654 * BLOCKSIZE MAX PREFETCH WINDOW SIZE\r\n655 * 128K-1MB 32MB\r\n656 * 32K-64K 256 blocks (8MB - 16MB)\r\n657 * &lt; 32K 4MB\r\n658 *\/\r\n659 static unsigned int\r\n660 dmu_zfetch_max_blocks(uint64_t blkshft)\r\n661 {\r\n662 unsigned int blks = zfetch_target_blks;\r\n663 unsigned int size = blks &lt;&lt; blkshft;\r\n664\r\n665 if (size &gt; zfetch_maxbytes_ub)\r\n666 return (zfetch_maxbytes_ub &gt;&gt; blkshft);\r\n667\r\n668 if (size &lt; zfetch_maxbytes_lb)\r\n669 return (zfetch_maxbytes_lb &gt;&gt; blkshft);\r\n670\r\n671 return (blks);\r\n672 }\r\n\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >21071219<\/td>\n<td >Comstar framework should allow a cmd to abort prior to ZFS I\/O completion<\/td>\n<td > <\/td>\n<td >2013.1.5.7 &#038; 2013.1.6<\/td>\n<td >Relates to 22080255 : ZFSSA backend \"resilvering\" causes FC front end 
port inaccessible to remote host<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nWe encountered this issue while investigating CR 2829652. Basically, on the qlt FC front end of the ZFSSA, if a SCSI command is detected as timed out by the remote host, the remote host will issue an ABTS and retry the ABTS if the first one is not processed in time. If the second ABTS is not processed in time either, a LOGO will be issued by the remote host and the I\/O path may get lost permanently as a result (see CR 21071136). For a ZFSSA using FC as the transport interface, the capability to handle ABTS in a timely manner is critical so that the host can complete its error recovery and retry IOs without the risk of losing the IO path as a result of a LOGO. The AK8 kernel and COMSTAR stack do not have a way for fct\/stmf to allow a command to abort prior to the ZFS backend IO completion (pmcs may be involved in most cases).<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\ntraces : \r\n     *stmf_trace_buf\/s\r\n     &#x5B;2014 Aug 22 13:56:23:075:142:122] !584=&gt;QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)\r\n     (ah)(fffff600e30dd480, fffff600e30dd510)(FFFFFFFFh FFFFFFFFh)(0)(B070\r\n     &#x5B;2014 Aug 22 13:56:23:075:142:122] !584=&gt;QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)\r\n     (ah)(fffff600e30dd480, fffff600e30dd\r\n     &#x5B;2014 Aug 22 13:56:23:075:142:122] !584=&gt;QEL qlt(0,0):: qlt_handle_rcvd_abts,\r\n \r\n     *qlt_state::walk softstate |::qltgettrace : fc\/qlttrace\r\n     &#x5B;2016 May 25 22:01:48:229:519:056] !46890=&gt;QEL qlt(2,0):: qlt_handle_rcvd_abts, (0)(2C0h 724h)\r\n     (4h)(fffff607a0c25700, fffff607a0c25790)(FFFFFFFFh 126B10h)(0)(603\r\n     &#x5B;2016 May 25 22:01:48:229:519:056] !46890=&gt;QEL qlt(2,0):: qlt_handle_rcvd_abts, )\r\n     &#x5B;2016 May 25 22:01:48:229:544:907] !46891=&gt;QEL qlt(2,0):: qlt_abort_cmd, cmd_type = FCT_CMD_FCP_XCHG\r\n     &#x5B;2016 May 25 22:01:48:229:547:011] !46892=&gt;QEL qlt(2,0):: 
qlt_abort_unsol_scsi_cmd, qi(0) fctcmd-\r\n     fffff604a0877c80(2C0h, FFFFh),9d000477 rex2=126B10h\r\n     &#x5B;2016 May 25 22:01:48:229:562:586] !46893=&gt;QEL qlt(2,0):: qlt_handle_ctio_completion, abort(oxid=2C0h):\r\n     hndl-9d170477, 1, fffff6007dcc6b40 (126B10h)\r\n     &#x5B;2016 May 25 22:01:48:229:566:267] !46894=&gt;QEL qlt(2,0):: qlt_handle_ctio_completion, \r\n     0(fffff604a0877c80,fffff604a0877d10)(2C0h,FFFFh),9d170477 126B10(2016, 0)\r\n     &#x5B;2016 May 25 22:01:48:784:693:584] !46895=&gt;QEL qlt(2,0):: qlt_send_abts_response, \r\n     (160h 5E1h)(0)(fffff607998c6e00 fffff607998c6e90) (FFFFFFFFh 122E80h)\r\n     &#x5B;2016 May 25 22:01:48:784:746:901] !46896=&gt;QEL qlt(2,0):: qlt_handle_abts_completion, qi(0) status =0h,\r\n     (160h 5E1h)\r\n     &#x5B;2016 May 25 22:01:48:796:029:698] !46897=&gt;QEL qlt(2,0):: qlt_send_abts_response,\r\n     (1BAh 63Eh)(0)(fffff6058ccd3c40 fffff6058ccd3cd0) (FFFFFFFFh 123FF0h) \r\n\r\nThe spagettime.d will show high gap as follows :\r\n     # cat spagettime.Pool_02.2016-05-25_14\\:14\\:10 | nawk '{ if ($16&gt;5000) print}'\r\n     2016 May 25 14:15:21 txg : 9996375; walltime:759  ms ; sync gap:14845 ms ; txg delays:0 ; txg stalled:0\r\n     2016 May 25 16:05:37 txg : 9997695; walltime:1449 ms ; sync gap:7279  ms ; txg delays:0 ; txg stalled:0\r\n     2016 May 25 22:01:49 txg : 10001958;walltime:1175 ms ; sync gap:25238 ms ; txg delays:0 ; txg stalled:0\r\n     2016 May 25 22:03:27 txg : 10001976;walltime:1065 ms ; sync gap:11427 ms ; txg delays:0 ; txg stalled:0\r\n     2016 May 25 22:12:45 txg : 10002087;walltime:1287 ms ; sync gap:5106  ms ; txg delays:0 ; txg stalled:0\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >23526362<\/td>\n<td >mptsas driver TRAN_BUSY on readzilla 
degraded shadow-migration performance<\/td>\n<td > <\/td>\n<td >2013.1.6.11<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nSignature : iostat -xnz 1 - wait queue &gt; 0 is a smoking gun + actv &gt;= 1 :\r\n\r\n                   extended device statistics\r\n     r\/s w\/s kr\/s kw\/s wait actv wsvc_t asvc_t   %w %b device\r\n     124.0 61.0 1423.0 38681.3 5.0 1.3 26.8  7.3 66 68 c0t5001E82002763F90d0 &lt;&lt; READZILLA\r\n     263.0 51.0 2233.4 36262.7 4.9 1.4 15.7  4.4 65 71 c0t5001E82002763F90d0\r\n      95.0 61.0  956.0 43606.2 6.8 1.8 43.6 11.8 89 95 c0t5001E82002763F90d0\r\n     288.0 56.0 4157.5 41983.7 5.5 1.5 16.0  4.4 73 76 c0t5001E82002763F90d0\r\n     172.0 64.0 1422.9 43494.1 5.4 1.4 22.8  6.1 69 74 c0t5001E82002763F90d0\r\n      55.0 58.0  517.1 40532.3 5.2 1.4 45.8 12.4 68 70 c0t5001E82002763F90d0\r\n\r\n     wait : average number of transactions waiting for service (queue length) or host queue depth.\r\n     actv : active queue is the number of pending I\/O requests. This represents the number of threads doing \r\n     IOs onto the same device (Cf Roch).\r\n<\/pre>\n<p>L2ARC devices (SANDISK 1.6TB) are 100% busy, but critically have only 1 IO in the active queue and 9 in the wait queue. 
Blaise suspects it&#8217;s because the L2 improvements in recent AK releases means the L2ARC is caching much better, so the bug is triggered more easily.<\/p>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >20466027<\/td>\n<td >zfs_zget is causing high cpu usage<\/td>\n<td > <\/td>\n<td >2013.1.6.12<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nSignature from a dtrace profile :\r\n\r\n                unix`mutex_delay_default+0x7\r\n                unix`mutex_vector_enter+0x2ae\r\n                zfs`zfs_zget+0x4a\r\n                zfs`zfs_vget+0x1d2\r\n                genunix`fsop_vget+0x66\r\n                nfssrv`nfs3_fhtovp+0x47\r\n                nfssrv`rfs3_read+0x4d\r\n                nfssrv`common_dispatch+0x53f\r\n                nfssrv`rfs_dispatch+0x2d\r\n                rpcmod`svc_process_request+0x184\r\n                rpcmod`stp_dispatch+0xb9\r\n                genunix`taskq_d_thread+0xc1\r\n                unix`thread_start+0x8\r\n                4652\r\n\r\n\r\n                nfssrv`rfs3_write+0xab0\r\n                nfssrv`common_dispatch+0x694\r\n                nfssrv`rfs_dispatch+0x2d\r\n                rpcmod`svc_process_request+0x184\r\n                rpcmod`stp_dispatch+0xb9\r\n                genunix`taskq_d_thread+0xc1\r\n                unix`thread_start+0x8\r\n                10390\r\n\r\n                unix`mutex_delay_default+0x7\r\n                unix`lock_set_spl_spin+0xc7\r\n                genunix`disp_lock_enter+0x2d\r\n                genunix`turnstile_lookup+0x69\r\n                unix`mutex_vector_enter+0x1b0\r\n                zfs`zfs_zget+0x4a\r\n                zfs`zfs_vget+0x1d2\r\n                genunix`fsop_vget+0x66\r\n     
           nfssrv`nfs3_fhtovp+0x47\r\n                nfssrv`rfs3_write+0x4a\r\n                nfssrv`common_dispatch+0x694\r\n                nfssrv`rfs_dispatch+0x2d\r\n                rpcmod`svc_process_request+0x184\r\n                rpcmod`stp_dispatch+0xb9\r\n                genunix`taskq_d_thread+0xc1\r\n                unix`thread_start+0x8\r\n                19267\r\n<\/pre>\n<p>The fix was to remove the mutex and replace it with a NUMA-aware reader\/writer lock and have it grabbed as a reader in the normal zget cases. The zget code will then have to deal with the races that can occur when the same znode is being looked up at the same time.<\/p>\n<h2>Kernel memory related issues<\/h2>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15942559<\/td>\n<td >ZFS data can not be easily reclaimed for other uses<\/td>\n<td > <\/td>\n<td >ak9<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >15775761<\/td>\n<td >Converting memory between kmem caches can be a limiting factor<\/td>\n<td > <\/td>\n<td >no fix<\/td>\n<td >Points to 15942559 &#8211; ZFS data can not be easily reclaimed for other uses<\/td>\n<\/tr>\n\r\n<tr><td >15796281<\/td>\n<td >kmem taskq processing can wedge the system<\/td>\n<td > <\/td>\n<td >2013.1.6.3 IDR #4.1<\/td>\n<td >The fix design was to change the kmem_taskq into a standalone workq with a dedicated service thread. Each cache could have a bitmap of possible operations that the service thread could perform, along with a \"busy\" bit. When async work needs to be done on a cache, the thread requesting the work would set the corresponding bit and enqueue the cache on the global list. 
When the kmem thread wanted to do work, it would set the \"busy\" bit on a cache, perform the operations specified by the bitmap, then clear the \"busy\" bit. If a kmem_cache_destroy() needed to come in, it could grab the list lock, wait for the \"busy\" bit on its cache to be clear (hopefully it would be clear to begin with), remove the cache from the list, and then manually do whatever processing it needed to do before destroying the cache.<\/td>\n<\/tr>\n\r\n<tr><td >19294541<\/td>\n<td >kmem cache reaping should be made more scalable<\/td>\n<td > <\/td>\n<td >2013.1.6.3 IDR #4.1<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nCaches to be processed by kmem_async are queued :\r\n    &gt; kmem_caches_processing=J\r\n    fffffffffc0b99c0\r\n    &gt; fffffffffc0b99c0::walk list |::kmem_cache\r\n    ADDR             NAME              FLAG    CFLAG  BUFSIZE  BUFTOTL\r\n    ffffc10000047030 kmem_alloc_256    120f 00200000      256   677052 &lt;&lt; processing this cache\r\n    ffffc1010e5c8030 dnode_t           120f 04000000      752    16760 &lt;&lt; waiting\r\n    ffffc1010e5d4030 zfs_znode_cache   120f 00000000      280    10842 &lt;&lt; waiting\r\n    ffffc10113b2d030 zio_buf_512       1200 01020000      512    16816 &lt;&lt; this cache may wait a long \r\n                                                                          time before getting processed\r\n<\/pre>\n<p>Synchronous processing of caches does not seem to scale. Things can go bad especially in low memory conditions.<br \/>\nThe solution was to split kmem_reap into 2 steps :<br \/>\n1. All &#8220;background processing&#8221; on a given cache (reap or update) is to be done by a single thread at a time. Multiple threads can run for different caches at the same time. This is the &#8220;scalable part&#8221;<br \/>\n2. 
The &#8220;global cleanup&#8221; work (re-scheduling timeouts, etc) can be done after all processing related to the base work (reap or update) is complete.<\/p>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >23555751<\/td>\n<td >ZFSSA &#8211; FC VM frozen for more than 10 mins<\/td>\n<td > see below <\/td>\n<td >2013.1.6.3 IDR IDR #4.1<\/td>\n<td >Points to 15796281 and 19294541<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    echo &quot;arc_shrink_shift\/W 6&quot;| mdb -kw (default is 7)\r\n    echo &quot;arc_evict_chunk_minbufs\/Z 0t256&quot;|mdb -kw\r\n<\/pre>\n<p>Adjust arc_evict_chunk_minbufs to 256 from its default value of 128 &#8211; See 26264372.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    echo &quot;arc_evict_chunk_minbufs\/Z 0t256&quot;|mdb -kw \r\n    Tuning arc_shrink_shift will have the effect of making zfs reacting to low memory at an early point.\r\n    See arc.c\/arc_reclaim_thread() and redzone calculation used by arc_reclaim_needed() :\r\n    redzone = 2 * btop(arc_size &gt;&gt; arc_shrink_shift);\r\n<\/pre>\n<p>The reaper_trace.d script may be used to see how long it takes for the arc_get_data_block() function to return when throttleing happens.<br \/>\nHere we see that it can take 37 secs to return with some throttles :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    52  43171        arc_get_data_block:return 31658262460856570 2016 Jul 29 17:15:40 \r\n        fffffe0c06f84c20 37658131850 throttle 1877 arn 1878 aen 1878 freemem 9435104 arc_size 1317113940032\r\n    53  43171        arc_get_data_block:return 31658262460053455 2016 Jul 29 17:15:40 \r\n        fffffe0c0582dc20 31811415899 throttle 1591 arn 1592 aen 
1592 freemem 9435104 arc_size 1317112963016\r\n<\/pre>\n<p>The script has to be run as follows :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    cli&gt; confirm shell\r\n    # mkdir \/var\/tmp\/tsc\r\n    # cd \/var\/tmp\/tsc\r\n    # vi reaper_trace.d &lt;&lt; copy the content of reaper_trace.d\r\n    # chmod u+x reaper_trace.d\r\n    # vi reaper_collection.sh &lt;&lt; copy the content of reaper_collection.sh\r\n    # chmod u+x reaper_collection.sh\r\n    # .\/reaper_collection.sh start \r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >22256256<\/td>\n<td >Lengthy kmem_reap() processing can block progress of arc_reap_thread<\/td>\n<td > <\/td>\n<td >2013.1.5.7<\/td>\n<td >kmem_cache_reap_async() added to offload advisory reaps to kmem-internal taskq threads.That prevents advisory reaps from blocking behind the async kmem reap activity if they happen to be working on the same kmem cache.<\/td>\n<\/tr>\n\r\n<tr><td >22344599<\/td>\n<td >Final detach at the end of a resilvering leads to high spa_sync time<\/td>\n<td >no workaround<\/td>\n<td >2013.1.6.0<\/td>\n<td >Relates to  22256256 : Lengthy kmem_reap() processing can block progress of arc_reap_thread. 
Typical signature is kernel memory reorganization at the end of a disk replacement, when the final spare detach happens.<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nThe arc_delays.d script will report some kernel memory buffers being reaped :\r\n\r\n  2015 Dec  9 22:04:40 freemem     4054 MB\r\n  2015 Dec  9 22:04:50 freemem     4546 MB; reaping in progress\r\n  2015 Dec  9 22:05:00 freemem     5313 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU; \r\n  reaping in progress\r\n  2015 Dec  9 22:05:10 freemem     5334 MB; reaping in progress\r\n  2015 Dec  9 22:05:20 freemem     4799 MB; reaping in progress\r\n  2015 Dec  9 22:05:30 freemem     4702 MB; reaping in progress\r\n  2015 Dec  9 22:05:40 freemem     5368 MB; reaping in progress\r\n  2015 Dec  9 22:05:50 freemem     6196 MB; reaping in progress\r\n  2015 Dec  9 22:06:00 freemem     6964 MB; reaping in progress\r\n  2015 Dec  9 22:06:10 freemem     7985 MB; reaping in progress\r\n  2015 Dec  9 22:06:20 freemem     9085 MB; reaping in progress\r\n  2015 Dec  9 22:06:30 freemem    10152 MB; reaping in progress\r\n  2015 Dec  9 22:06:40 freemem    11409 MB; reaping in progress\r\n  2015 Dec  9 22:06:50 freemem    12681 MB; reaping in progress\r\n\r\nLooking at the analytics, the resilvering end matches with some kernel memory reorganization.\r\nSee how bpmap_cache buffer size decreases :\r\n\r\n  lbl-1656:analytics dataset-035&gt; show\r\n  Properties:\r\n                         name = kmem.fragmented&#x5B;cache]\r\n                     grouping = Memory\r\n                  explanation = kernel memory lost to fragmentation broken down by kmem cache\r\n                       incore = 4.89M\r\n\r\n\r\n  lbl-1656:analytics dataset-035&gt; read 2015-12-09 22:04:49 1\r\n  DATE\/TIME                    MB         MB BREAKDOWN\r\n  2015-12-9 22:04:49        17393      11495 bpmap_cache\r\n                                        1610 
space_seg_cache\r\n                                         659 dmu_buf_impl_t\r\n                                         536 l2arc_buf_t\r\n                                         518 arc_buf_t\r\n                                         364 zio_cache\r\n                                         158 arc_ref_t\r\n                                         157 arc_elink_t\r\n                                         144 dnode_t\r\n\r\n  lbl-1656:analytics dataset-035&gt; read 2015-12-09 22:06:10 1\r\n  DATE\/TIME                    MB         MB BREAKDOWN\r\n  2015-12-9 22:06:10        10753       4979 bpmap_cache\r\n                                        1598 space_seg_cache\r\n                                         596 dmu_buf_impl_t\r\n                                         536 l2arc_buf_t\r\n                                         489 arc_buf_t\r\n                                         361 zio_cache\r\n                                         164 zio_data_buf_8\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >19949855<\/td>\n<td >optimize the number of cross-calls during hat_unload<\/td>\n<td > <\/td>\n<td >2013.1.3.3<\/td>\n<td ><\/td>\n<\/tr>\n\r\n<tr><td >20645565<\/td>\n<td >arc_throttle logic can delay zfs operations under low memory conditions<\/td>\n<td > <\/td>\n<td >2013.1.6.3 IDR IDR #4.1<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\na) Excessive wait time in kmem taskq :<br \/>\narc_reaper_thread dispatches the reap request to kmem_taskq. kmem_taskq is single threaded and it is responsible for multiple kmem maintenance tasks, not just reaping. 
If there are tasks pending in the taskq, it may take a while for our task to get to the front of the queue and execute.<\/p>\n<p>b) Latency in taskq_wait :<br \/>\nAfter arc_reaper_thread dispatches the reaping task to the taskq, it calls taskq_wait to wait for the task to be done. Once it returns from taskq_wait, it signals arc_reaper_thread so that it may shrink the ARC further and evict more buffers if needed. However, it seems like we wait in taskq_wait until all the pending tasks in the queue are complete, not just for the completion of the task dispatched by arc_reaper_thread. For example, if some other thread calls kmem_reap, that may dispatch lots of reaping tasks to the kmem taskq. In that case, we seem to keep waiting until these other reaping tasks are also complete.<\/p>\n<p>The fix for the following bug may help to solve this problem, but it was not backported to AK : 15796281 &#8211; kmem taskq processing can wedge the system<\/p>\n<p>c) The red zone size used for reacting to memory shortage :<br \/>\nThe memory throttle logic in arc_get_data_block puts the calling thread to wait if allocating additional memory could result in freemem falling below (desfree + lotsfree). On a 512GB system, (desfree + lotsfree) is approximately 12GB. To avoid entering the throttle logic and putting threads to wait for memory, we may consider reacting to a low memory situation at an earlier point.  We shouldn&#8217;t wait until freemem goes as low as 12GB. In the arc_reclaim_thread logic, that early point is decided based on the size of the red zone. The red zone size probably also needs some tuning so that we may react much earlier.<\/p>\n<h2>zfs_signature.sh<\/h2>\n<p>The zfs_signature.sh script was commonly used to narrow down performance issues possibly related to ZFS. 
This was written by Roch Bourbonnais and Blaise Sanouillet, 2 gurus I miss a lot.<\/p>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >18125379<\/td>\n<td >memory leak in l2arc on sync reads that fail<\/td>\n<td > <\/td>\n<td >2013.1.2.0<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nSee l2feeding.d script comments .\r\n    * The output below with write BW under the target BW, and all devices able to\r\n    * keep up, is a signature for:\r\n    *      16564195 ARC l_evict_lock contention in arc_evict_bytes()\r\n    *\r\n    * 2014 Nov 12 12:08:08\r\n    * Data eligible for L2, evicted but not fed:   352 MB\/s\r\n    * L2 feeding suspended due to low memory:        0\r\n    *\r\n    * L2 device            Waiting for data  Not keeping up    Write BW  Target BW\r\n    * fffff600afd5b940     0                 0                   7 MB\/s    40 MB\/s\r\n    * fffff600e6e2f280     0                 0                   8 MB\/s    40 MB\/s\r\n    * fffff600f5be52c0     0                 0                   8 MB\/s    40 MB\/s\r\n    * fffff600e836ea00     0                 0                   7 MB\/s    40 MB\/s\r\n\r\nlockstat gives this :\r\n\r\nAdaptive mutex spin: 4123264 events in 30.136 seconds (136823 events\/sec)\r\n-------------------------------------------------------------------------------\r\nops\/s indv cuml rcnt     nsec Lock                   Caller\r\n 1973  19%  19% 0.00  4254743 ARC_mru+0xa0           arc_evict_bytes+0x28\r\n\r\n      nsec ------ Time Distribution ------ ops\/s     Stack\r\n       128 |                               0         arc_get_data_block+0xe4\r\n       256 |                               8         arc_read+0x64c\r\n       512 |                
               20        dsl_read+0xac\r\n      1024 |                               47        dbuf_read_impl+0x22c\r\n      2048 |@                              115       dbuf_read+0x100\r\n      4096 |@@@                            215 \r\n\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >16564195<\/td>\n<td >ARC l_evict_lock contention in arc_evict_bytes<\/td>\n<td > <\/td>\n<td >2013.1.4.5<\/td>\n<td ><\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   Adaptive mutex spin: 5320606 events in 62.214 seconds (85521 events\/sec)\r\n   -------------------------------------------------------------------------------\r\n   Count indv cuml rcnt nsec Lock Hottest Caller\r\n   2611062 80% 80% 0.00 124236 ARC_mru&#x5B;272] arc_evict_bytes+0x5e\r\n\r\n \r\n      nsec ------ Time Distribution ------ count Stack\r\n      2 56 | 584 arc_get_data_block+0xc7\r\n       512 | @ 91853 arc_read+0x606\r\n      1024 | @ 78003 dsl_read+0x34\r\n      2048 | @ 150450 dbuf_prefetch+0x1e6\r\n      4096 | @@@ 300149 dmu_zfetch_fetch+0x5f\r\n      8192 | @@@@@@@ 640676 dmu_zfetch+0x10d\r\n     16384 | @@@@ 426892 dbuf_read+0x128\r\n     32768 | @@ 246084 dmu_buf_hold_array_by_dnode+0x1cc\r\n     65536 | @ 135358 dmu_buf_hold_array+0x6f\r\n    131072 | @ 117919 dmu_read_uio+0x4b\r\n    262144 | @ 124844 zfs_read+0x260\r\n    524288 | @ 127585 fop_read+0xa9\r\n   1048576 | @ 93911 rfs3_read+0x4e0\r\n   2097152 | @ 53434 common_dispatch+0x53f\r\n   4194304 | @ 20964 rfs_dispatch+0x2d\r\n   8388608 | @ 2233 svc_process_request+0x184\r\n  16777216 | @ 76 stp_dispatch+0xb9\r\n  33554432 | @ 21 taskq_d_thread+0xc1\r\n  67108864 | @ 23 thread_start+0x8\r\n 134217728 | @ 3\r\n\r\n<\/pre>\n<p>19297091 &#8211; 
arc_evictable_memory has needless locks, is part of the fix<\/p>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15648057<\/td>\n<td >lack of accounting for DDT in dedup'd frees can oversubscribe txg<\/td>\n<td > <\/td>\n<td >2010.Q1.2.0<\/td>\n<td >See spagetti.d script output. walltime \u00bb 5 seconds, sync gap close to 0, arc misses from \u201cDDT ZAP algorithm\u201d in the 100s<\/td>\n<\/tr>\n\r\n<tr><td >18281514<\/td>\n<td >spa_sync\/space_map_map delay in dbuf_read causes NFS thread to pause<\/td>\n<td > <\/td>\n<td >2013.1.2.9<\/td>\n<td >See spagetti.d script output. walltime \u00bb 5 seconds, sync gap close to 0, arc misses from \u201cSPA space map\u201d in the 100s<\/td>\n<\/tr>\n\r\n<tr><td >16423386<\/td>\n<td >dnode_next_offset() doesn't do exhaustive search<\/td>\n<td > <\/td>\n<td >2013.1.1.4<\/td>\n<td >See spagetti.d script output. Signature for dataset \/ large file destroy pathologies. walltime \u00bb 5 seconds, sync gap close to 0, dsl_dataset_destroy_sync() on-cpu time > 2 seconds<\/td>\n<\/tr>\n\r\n<tr><td >19621408<\/td>\n<td >do_user_quota_updates can consume 10% of spa_sync time in some cases<\/td>\n<td > <\/td>\n<td >no fix<\/td>\n<td >See spagetti.d script output. 
walltime >= 5 seconds, do_user_quota_updates() time > 1 sec (or even bigger).<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    -------- spa_sync exclusive time breakdown by tracked function :\r\n                                      Elapse ms  On-cpu ms\r\n    &lt;total&gt;                                5252        487\r\n    spa_sync                                  6          4\r\n    dsl_pool_sync                          2469        277\r\n    metaslab_sync                             3          1\r\n    metaslab_sync_reassess                    2          2          \r\n    metaslab_sync_done                      125        122\r\n    dmu_objset_do_userquota_updates        2193         55   &lt;&lt;&lt;&lt;\r\n    taskq_wait -&gt; vdev_sync                   2          0\r\n    taskq_wait -&gt; vdev_sync_done            109          0\r\n    vdev_config_sync                         95          1\r\n    dsl_dataset_destroy_sync                  0          0\r\n    zio_wait                                  0          0\r\n    lost                                    248         25\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15623702<\/td>\n<td >Write throttling starts too late (arc_memory_throttle)<\/td>\n<td > <\/td>\n<td >no fix<\/td>\n<td >See spagetti.d script output. pattern of 2 out of 3 consecutive spa_sync with dp_write used ~0 and ARC mem throttle > 0<\/td>\n<\/tr>\n\r\n<tr><td >15744920<\/td>\n<td >read throttle required to survive read write storm<\/td>\n<td > <\/td>\n<td > 2013.1.5<\/td>\n<td >See spagetti.d script output. 
walltime >= 5 seconds, sync gap close to 0, no arc misses, dp_write thput limit ~= min + txg stalls and delays &#8211; see spagetti.d details. Setting zfs_write_limit_min to 24 * 100MB * #HDD has been shown to alleviate the case of writes stalled for minutes<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nFor a 20-disk pool (10 vdevs), this gives : 24 * 100MB * 20 = 48000000000 bytes\r\n    zfssa# echo zfs_write_limit_min\/Z0t48000000000 | mdb -kw\r\n    zfs_write_limit_min:            0x2000000               =       0xb2d05e000\r\n\r\nThis has the effect of diminishing the max throughput for the pool, but it makes the txg \r\nstalls and delays disappear.\r\n\r\nBefore the tuning of zfs_write_limit_min to 48000000000, we may have some very fluctuating \r\n&quot;used throughput&quot; for TXGs (second col) :\r\n    min\/used\/thput limit\/free limit  :        1\/      23\/     616\/ 1179190\r\n    min\/used\/thput limit\/free limit  :        1\/     409\/     973\/ 1179174\r\n    min\/used\/thput limit\/free limit  :        1\/     696\/    1347\/ 1179146\r\n    min\/used\/thput limit\/free limit  :        1\/     708\/    1631\/ 1179117\r\n    min\/used\/thput limit\/free limit  :        1\/      85\/    1464\/ 1179099\r\n    min\/used\/thput limit\/free limit  :        1\/      23\/    1176\/ 1179099\r\n    min\/used\/thput limit\/free limit  :        1\/      24\/     983\/ 1179114\r\n    min\/used\/thput limit\/free limit  :        1\/      24\/     810\/ 1179114\r\n    min\/used\/thput limit\/free limit  :        1\/      23\/     656\/ 1179114\r\n    min\/used\/thput limit\/free limit  :        1\/      11\/     537\/ 1179115\r\n    min\/used\/thput limit\/free limit  :        1\/       1\/     409\/ 1179116\r\n\r\nAfter the tuning, it's more constant :\r\n    min\/used\/thput limit\/free limit  :     1907\/     147\/     147\/ 1179075\r\n    min\/used\/thput limit\/free limit  :     1907\/     147\/     
147\/ 1179071\r\n    min\/used\/thput limit\/free limit  :     1907\/     147\/     147\/ 1179066\r\n    min\/used\/thput limit\/free limit  :     1907\/     147\/     147\/ 1179061\r\n    min\/used\/thput limit\/free limit  :     1907\/     147\/     147\/ 1179057\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >16275386<\/td>\n<td >metaslab_group_alloc shouldn't block on space_map_load<\/td>\n<td >see below<\/td>\n<td >2013.1.6<\/td>\n<td >When destroying a dataset, entries are added to a chain list in disk for a non loaded spacemap. When this spacemap is loaded for further writes, the chain list has to be parsed to determine where free space is. This makes a lot of random IO reads and this takes time while we are in the middle of IO writes. From bug 16275386 : It turns out that this issue is more general than previously thought. During txg sync, if an allocation on vdev 1 is waiting on space_map_load, further allocations will be directed to vdev 2 until it reaches its aliquot, then vdev 3, etc. Eventually it will come back to vdev 1 and also block waiting on space_map_load. This means vdev 1 won't be kept busy and so will reduce the write throughput, even when the ZIL is not in play. 
It should be possible to have a space map loaded asynchronously in advance so that metaslab_group_alloc() can always proceed without blocking, modulo corner cases.<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nAliquot : mg_aliquot = 512K * (# leaf vdevs)\r\nNormally, mg_aliquot is used as a limit to move to the next vdev (or metaslab_group).\r\nIn metaslab_ndf_fragmented(), mg_aliquot is used to decide whether we need to move to another metaslab \r\nif (maxsize &lt; mg_aliquot).\r\n\r\nWorkaround :\r\n    # echo &quot;metaslab_unload_delay\/W 0x7FFFFFFF&quot; | mdb -kw\r\n<\/pre>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th >Bug number<\/th>\n<th >Title<\/th>\n<th >Workaround<\/th>\n<th >Fix<\/th>\n<th >Comments<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >17856080<\/td>\n<td >arc_meta_used is not necessarily all MRU<\/td>\n<td >see below<\/td>\n<td >2013.1.7<\/td>\n<td >Symptoms : IO reads can be slow, especially for backups. Disks can be busy due to lack of hits in MRU ( see bug 22818942 ). Disks busy on IO reads due to no evictable space<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nSignature : arc_delays.d\r\n    2016 Apr 12 09:21:02 freemem     8359 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;\r\n    2016 Apr 12 09:21:12 freemem     7909 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;\r\n\r\nThese messages mean that the MRU buffer is too small to keep the most recently used data.\r\nThe issue comes from the arc_adjust() function, which adjusts the size of the MRU or MFU buffers \r\nif arc_size &gt; arc_c. arc_adjust() can be used when ZFS needs to reduce its memory consumption.\r\nThe current algorithm (Apr 2016) takes arc_meta_used into account but &quot;arc_meta_used is not \r\nnecessarily all MRU&quot;. 
That means we reduce the MRU too much.\r\nThe adjust_arc_p.sh script tries to artificially increase arc_p (the target size of MRU) to 3 times the value \r\nof meta_used. It limits the size of arc_p to half of arc_c (target size of cache). If meta_used and arc_p \r\nare already too high, the script takes no action but displays a message.\r\n\r\nFor example, here arc_p is too small :\r\n\r\n   p                         =        11696 MB  \/* target size of MRU    *\/\r\n   c                         =       187039 MB  \/* target size of cache  *\/\r\n   c_min                     =           64 MB  \/* min target cache size *\/\r\n   c_max                     =       261101 MB  \/* max target cache size *\/\r\n   size                      =       187039 MB  \/* actual total arc size *\/\r\n   &#x5B;..]\r\n   meta_used                 =        21967 MB\r\n   meta_max                  =        41046 MB\r\n   meta_limit                =            0 MB\r\n\r\nThanks to the script, p will go to 60GB (that is still below 187GB\/2).\r\nBefore, with tuning_zfs_arc_p_min_shift.akwf, arc_p was set to 1\/8 of arc_size (reboot needed).\r\nApparently, this may not be enough.\r\n<\/pre>\n<h2>Bugs under the bonnet<\/h2>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15505861<\/td>\n<td >Write performance degrades when ZFS filesystem is near quota<\/td>\n<td >Avoid quota usage, free some space by deleting some datasets<\/td>\n<td >2013.1.1.9<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nTypical stack when reaching filesystem size limit :\r\n   &gt; ::stacks -c dsl_dir_tempreserve_impl\r\n     
zfs`dsl_dataset_check_quota\r\n     zfs`dsl_dir_tempreserve_impl+0xc2\r\n     zfs`dsl_dir_tempreserve_space+0x13e\r\n     zfs`dmu_tx_try_assign+0x208\r\n     zfs`dmu_tx_assign+0x2a\r\n     zfs`zfs_write+0x63c\r\n     genunix`fop_write+0xa6\r\n     genunix`write+0x309\r\n     genunix`write32+0x22\r\n     unix`_sys_sysenter_post_swapgs+0x149\r\n<\/pre>\n<p>Actually, we get in trouble when the return code is 91 : ERESTART. This can lead to long spa_sync time.<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n    dsl_dataset.c\r\n    dsl_dataset_check_quota(dsl_dataset_t *ds, uint64_t inflight)\r\n    {\r\n        int error = 0;\r\n\r\n        mutex_enter(&amp;ds-&gt;ds_lock);\r\n\r\n        \/*\r\n         * If they are requesting more space, and our current estimate\r\n         * is over quota, they get to try again unless the actual\r\n         * on-disk is over quota and there are no pending changes (which\r\n         * may free up space for us).\r\n         *\/\r\n        if (ds-&gt;ds_quota != 0 &amp;&amp;\r\n            ds-&gt;ds_phys-&gt;ds_used_bytes + inflight &gt;= ds-&gt;ds_quota) {\r\n                if (inflight &gt; 0 || ds-&gt;ds_phys-&gt;ds_used_bytes &lt; ds-&gt;ds_quota)\r\n                        error = ERESTART;\r\n                else\r\n                        error = EDQUOT;\r\n        }\r\n        mutex_exit(&amp;ds-&gt;ds_lock);\r\n\r\n        return (error);\r\n    }\r\n<\/pre>\n<p>Dtrace can be of great help to check the return value of dsl_dataset_check_quota() and dsl_dir_tempreserve_space(). 
If we get a return value different from 0, we are hitting the issue.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# dtrace -n 'dsl_dataset_check_quota:return \/arg1!=0\/ { trace(arg1) }'\r\n    CPU     ID                    FUNCTION:NAME\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n    3  45299   dsl_dataset_check_quota:return                91\r\n\r\n# dtrace -n 'dsl_dir_tempreserve_space:entry { printf(&quot;%s a:%lx f:%lx u:%lx&quot;,\r\n             stringof(args&#x5B;0]-&gt;dd_myname),arg2,arg3,arg4)}\r\n             dsl_dir_tempreserve_space:return { trace(arg1)}'\r\n    CPU     ID                    FUNCTION:NAME\r\n    0  44317  dsl_dir_tempreserve_space:entry prepcc-t1_oradata2-Clone_DAA1\r\n    0  44318  dsl_dir_tempreserve_space:return               91\r\n    1  44317  dsl_dir_tempreserve_space:entry preptux-d1_home\r\n    1  44318  dsl_dir_tempreserve_space:return               91\r\n    1  44317  dsl_dir_tempreserve_space:entry preptux-d1_home\r\n    1  44318  dsl_dir_tempreserve_space:return               91\r\n    3  44317  dsl_dir_tempreserve_space:entry preptux-d1_home\r\n    3  44318  dsl_dir_tempreserve_space:return               91\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15710347<\/td>\n<td >system hung, 
lots of threads in htable_steal<\/td>\n<td >See below<\/td>\n<td >2013.1.1.5<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    &gt; ::stacks -c htable_steal\r\n    THREAD           STATE    SOBJ                COUNT\r\n    fffff60011597760 RUN      &lt;NONE&gt;                  9\r\n                 swtch+0x150\r\n                 turnstile_block+0x760\r\n                 mutex_vector_enter+0x261\r\n                 htable_steal+0x1a4\r\n                 htable_alloc+0x248\r\n                 htable_create+0x1dc\r\n                 hati_load_common+0xa3\r\n                 hat_memload+0x81\r\n                 hat_memload_region+0x25\r\n                 segvn_faultpage+0x937\r\n                 segvn_fault+0xc13\r\n                 as_fault+0x5ee\r\n                 pagefault+0x99\r\n                 trap+0xe63\r\n<\/pre>\n<p>Workaround :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    arc_stats::print -ta arc_stats_t arcstat_c_max.value.i64\r\n    fffffffffbd43680 int64_t arcstat_c_max.value.i64 = 0x5bef18000\r\n    &gt; fffffffffbd43680\/Z0x4000000\r\n    arc_stats+0x4a0:0x5bef18000             =       0x4000000 &lt;&lt;&lt; 64MB\r\n    &gt; ::arc -m ! 
grep max\r\n    c_max                     =        64 MB\r\n    meta_max                  =       889 MB\r\n\r\n    With :\r\n    0x400000000 &lt;&lt; 16GB\r\n    0x600000000 &lt;&lt; 24GB\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15742475<\/td>\n<td >AK-2011.04.24 Extremely sluggish 7420 node due to heap fragmentation<\/td>\n<td >limit arc_size to half of the DRAM, set arc_meta_limit to 2\/5 of DRAM<\/td>\n<td >2011.1.3<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nmemstat shows that we have plenty of memory free, however a &quot;::stacks -m zfs&quot; shows that we have a number \r\nof zfs stacks waiting for allocations.\r\n\r\n    &gt;::stacks -m zfs\r\n    ffffffaac7a07440 SLEEP    CV                     36\r\n    swtch+0x145\r\n    cv_wait+0x61\r\n    vmem_xalloc+0x635\r\n    vmem_alloc+0x161\r\n    segkmem_xalloc+0x9\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15751065<\/td>\n<td >appliance unavailable due to zio_arena fragmentation<\/td>\n<td >limit arc_size to half of the DRAM, set arc_meta_limit to 2\/5 of DRAM<\/td>\n<td >2011.1.3<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nWe haven't run out of available memory.\r\n     &gt; ::memstat\r\n     Page Summary                Pages      
          MB  %Tot\r\n     ------------     ----------------  ----------------  ----\r\n     Kernel                    8480307             33126    6%\r\n     ZFS File Data           120780433            471798   90%  &lt;&lt; ZFS ARC cache for data\r\n     Anon                       231541               904    0%\r\n     Exec and libs                  73                 0    0%\r\n     Page cache                   6816                26    0%\r\n     Free (cachelist)            22290                87    0%\r\n     Free (freelist)           4691874             18327    3  %\r\n\r\n     Total                   134213334            524270\r\n     Physical                134213333            524270\r\n\r\nWe have some threads waiting for some memory allocation\r\n\r\n    &gt; ::stacks -m zfs\r\n    THREAD           STATE    SOBJ                COUNT\r\n    ffffff86b38ab440 SLEEP    CV                     23\r\n                swtch+0x150\r\n                cv_wait+0x61\r\n                vmem_xalloc+0x63f\r\n                vmem_alloc+0x161\r\n                segkmem_xalloc+0x90\r\n                segkmem_alloc_vn+0xcd\r\n                segkmem_zio_alloc+0x24\r\n                vmem_xalloc+0x550\r\n                vmem_alloc+0x161\r\n                kmem_slab_create+0x81\r\n                kmem_slab_alloc+0x5b\r\n                kmem_cache_alloc+0x1fa\r\n                zio_data_buf_alloc+0x2c\r\n                arc_get_data_buf+0x18b\r\n                arc_buf_alloc+0xa2\r\n                arc_read_nolock+0x12f\r\n                arc_read+0x79\r\n                dsl_read+0x33\r\n                dbuf_read_impl+0x17e\r\n                dbuf_read+0xfd\r\n                dmu_tx_check_ioerr+0x6b\r\n                dmu_tx_count_write+0x175\r\n                dmu_tx_hold_write+0x5b\r\n                zfs_write+0x655\r\n                fop_write+0xa4\r\n                rfs3_write+0x50e\r\n                common_dispatch+0x48b\r\n                rfs_dispatch+0x2d\r\n        
svc_getreq+0x19c\r\n                svc_run+0x171\r\n                svc_do_run+0x81\r\n                nfssys+0x760\r\n                _sys_sysenter_post_swapgs+0x149\r\n\r\nFrom ::kmastat -g, the buffers consuming memory are data buffers\r\n\r\n    &gt;::kmastat -g ! grep -v 0G\r\n    zio_data_buf_65536         65536 4694099 4694101      286G  21342088     0\r\n    zio_data_buf_131072       131072 1424500 1424500      173G   8844051     0\r\n    ...\r\n    zfs_file_data                  460G        512G         0G  10832435     0\r\n    zfs_file_data_buf              460G        460G       460G  10843627     0\r\n\r\n\r\nLooking at the thread stacks, the threads are waiting on a memory allocation of just 0x20000 bytes (128KB)\r\n    &gt;::stacks -c vmem_xalloc\r\n    stack pointer for thread ffffff86b38ab440: ffffff03d56a8700\r\n    &#x5B; ffffff03d56a8700 _resume_from_idle+0xf1() ]\r\n      ffffff03d56a8730 swtch+0x150()\r\n      ffffff03d56a8760 cv_wait+0x61(ffffff84949ac01e, ffffff84949ac020)\r\n      ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4) &lt;&lt;&lt; 0x20000\r\n      ffffff03d56a8900 vmem_alloc+0x161(ffffff84949ac000, 20000, 4)\r\n      ...\r\n     ]\r\n\r\nMost of the waiters on this condition variable are nfsd daemons\r\n\r\n    &gt; ffffff84949ac01e::wchaninfo -v\r\n    ADDR             TYPE NWAITERS   THREAD           PROC\r\n    ffffff84949ac01e cond       60:  ffffff8570a5db60 akd\r\n                                     ffffff85a1baabc0 akd\r\n                                     ffffff86bdf66880 akd\r\n                                     ffffff85a1af6420 nfsd\r\n                                     ffffff85e09da420 nfsd\r\n                                     ffffffbd7b172bc0 nfsd\r\n                                     ffffff86bc6450e0 nfsd\r\n                                     ffffff86b38ab440 nfsd\r\n                                     ffffff85a1cdf0c0 nfsd\r\n                                     
ffffffbd7b16e100 nfsd\r\n                                     ...\r\n\r\nMany ZFS threads are blocked in the same memory allocation\r\n\r\n    &gt; ::stacks -m zfs|::findstack -v ! grep vmem_xalloc | grep ffffff84949ac000 | head\r\n    ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d3cb08a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d4e498a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d37dc8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d06288a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d43708a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d35f58a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d3d768a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d38368a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n    ffffff03d382a8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)\r\n\r\nWalking the free vmem_segs of vmem_t ffffff84949ac000, we find many free segments of 0x10000 bytes (64KB)\r\nbut none of 0x20000 (128KB):\r\n\r\n    &gt; ffffff84949ac000::walk vmem_free | ::vmem_seg -s ! 
awk '{print $4}' | sort | uniq -c\r\n    108 0000000000001000\r\n     97 0000000000002000\r\n    166 0000000000003000\r\n    148 0000000000004000\r\n    ...\r\n    09 000000000000f000\r\n 837558 0000000000010000 &lt;&lt; 64KB\r\n     30 0000000000011000\r\n     34 0000000000012000\r\n     42 0000000000013000\r\n \r\n\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15710534<\/td>\n<td >SUNBT7038138 advance rotor before blocking on space_map_load<\/td>\n<td >See below<\/td>\n<td >2011.1.6<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<br \/>\nTime spent reading metadata during sync is responsible for the sawtooth IOPS pattern<br \/>\nWorkaround :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    # echo &quot;metaslab_unload_delay\/W 0x7fffffff&quot;|mdb -kw\r\n    # echo &quot;set zfs:metaslab_unload_delay = 0x7fffffff&quot; &gt;&gt; \/etc\/system\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15790948<\/td>\n<td >SUNBT7167903 pre-6998140 soft-ring fanout is single threaded for VLANs<\/td>\n<td >Do not use VLAN under 2011.1 if you need high throughput<\/td>\n<td >2013.1<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n  * 1 or 2 CPUs 99% busy\r\n  * High rate of interrupts\r\n\r\nVLAN usage :\r\n\r\n    # dladm show-link\r\n    bge0            
type: non-vlan  mtu: 1500       device: bge0\r\n    bge1            type: non-vlan  mtu: 1500       device: bge1\r\n    bge2            type: non-vlan  mtu: 1500       device: bge2\r\n    bge3            type: non-vlan  mtu: 1500       device: bge3\r\n    aggr1           type: non-vlan  mtu: 1500       aggregation: key 1\r\n    aggr111001      type: vlan 111  mtu: 1500       aggregation: key 1\r\n    aggr112001      type: vlan 112  mtu: 1500       aggregation: key 1\r\n\r\n    aggr23002   vnic      1500   up       --         aggr2 &lt;&lt;&lt; VLANID - 23 of aggr2\r\n    aggr21002   vnic      1500   up       --         aggr2\r\n    aggr22002   vnic      1500   up       --         aggr2\r\n\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15754028<\/td>\n<td >read-modify-write in space_map_sync() may significantly slow down spa_sync() performance<\/td>\n<td >see below<\/td>\n<td >2013.1.2.9<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<br \/>\nRun the spacemap_15754028.d script to determine if the observed performance issue is due to this bug. 
If it is, the script shows consistently long spa sync times (>15s) in the &#8220;Max sync&#8221; column, and high numbers (>200) in the &#8220;Sync RMW&#8221; column:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n         Pool : Max sync  Sync  Load Minsize MB    Count  Load Miss Sync RMW anon\/MRU\/MFU   : 2013 Sep 13 06:09:30\r\n       system :   503 ms     0     0     133 MB,    6017          0 0        0\/0\/0\r\n   mstorepool   56329 ms     5     0       0 MB,  355572          0 21259    0\/21\/21238\r\n<\/pre>\n<p>The typical stack trace is as follows :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    swtch+0x150()\r\n    cv_wait+0x61(ffffff8a047b2cb8, ffffff8a047b2cb0)\r\n    zio_wait+0x5d(ffffff8a047b29a0)\r\n    dbuf_read+0x1e5(ffffff8b754b8d88, 0, 9)\r\n    dbuf_will_dirty+0x5d(ffffff8b754b8d88, ffffff91ec49c338)\r\n    dmu_write+0xe1(ffffff856606d340, 446, 10fa0, 478,\r\n    space_map_sync+0x295(ffffff86174c8dd0, 1, ffffff86174c8ae0,\r\n    metaslab_sync+0x2de(ffffff86174c8ac0, a8b)\r\n    vdev_sync+0xd5(ffffff8615400640, a8b)\r\n    spa_sync+0x45b(ffffff8638592040, a8b)\r\n    txg_sync_thread+0x247(ffffff85660903c0)\r\n    thread_start+8()\r\n<\/pre>\n<p>Workaround :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    # echo &quot;metaslab_unload_delay\/W 0x7fffffff&quot;|mdb -kw\r\n    # echo &quot;set zfs:metaslab_unload_delay = 0x7fffffff&quot; &gt;&gt; \/etc\/system\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >16423386<\/td>\n<td >dnode_next_offset() doesn't do exhaustive search<\/td>\n<td >no WO<\/td>\n<td >2013.1.1.4<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature 
:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nClone deletion issue :  Note the presence of traverse_dnode() twice on the stack.\r\n\r\n               zfs`arc_read_nolock+0x7e7\r\n               zfs`arc_read+0x79\r\n               zfs`dsl_read+0x33\r\n               zfs`traverse_visitbp+0x431\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_dnode+0xa3   &lt;---- B\r\n               zfs`traverse_visitbp+0x387\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_visitbp+0x4dd\r\n               zfs`traverse_dnode+0xa3  &lt;----- A\r\n               zfs`traverse_visitbp+0x1fc\r\n               zfs`traverse_impl+0x213\r\n               zfs`traverse_dataset+0x57\r\n               zfs`dsl_dataset_destroy_sync+0x290\r\n               zfs`dsl_sync_task_group_sync+0xf3\r\n               zfs`dsl_pool_sync+0x1ec\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >17650441<\/td>\n<td >NFS drops to about 0 for about 30 seconds and then returns to normal<\/td>\n<td >no WO<\/td>\n<td >IDR 2011.04.24.8.0,1-2.43.2.3<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nFixed thanks to bug 6988530 NFS threads partially STOPPED after nfsd restart during ZFSSA fail-back.<br \/>\nSignature :<br \/>\nspa_sync activity is more than 5 sec. Looks like 15754028 but WO may not help. 
Use the synctime7 dtrace script to confirm the issue.<\/p>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15723911<\/td>\n<td >SUNBT7058001 concurrency opportunity during spa_sync<\/td>\n<td >see below<\/td>\n<td >2013.1.1.1<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nParent of 15885830 : RMAN BACKUP PERFORMANCE REDUCED 15-30% WHEN USING UNSTABLE WRITES WITH 11.2.0.3 &#8211; Fixed in 2013.<br \/>\nSignature :<br \/>\nUnder a high asynchronous IO write workload, NFSv3 response time can show significant variations (up to 7 seconds). With a synchronous workload, the max response time remains normal, around 400ms. For RMAN backups, a performance decrease of 15-30% has been observed.<\/p>\n<p>ZFS periodically writes out dirty cache buffers to disk. This is called a &#8220;sync&#8221; &#8212; we are synchronising what is in core memory with what is on disk, and it has many steps. The syncing procedure is managed by a component of ZFS called the Storage Pool Allocator, or SPA. This enhancement addresses two of the many steps that the SPA takes in order to complete the sync.<\/p>\n<p>One of the steps is to process all of the &#8220;frees&#8221;. A &#8220;free&#8221; involves releasing a previously allocated region on a disk and making it available for allocation again. As ZFS determines which blocks to free, it adds them to a list. Once the list is complete, the SPA would loop through that list and process the frees one at a time. The enhancement that was made to this step was to reduce the number of items on the list by processing the frees as they come in. 
This greatly reduces the amount of CPU required to handle each block that is being freed.<\/p>\n<p>Another step that the SPA sync procedure does is to write out the metaslab space maps. These space maps contain the allocations that are done on disks, so ZFS can know which regions are free. Prior to this enhancement, each disk was processed sequentially and its space maps were written out. With the change in place, we can now write out the space maps for each disk in parallel.<\/p>\n<p>Workaround :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    zfs set sync=always &lt;dataset in trouble&gt;\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >15652868<\/td>\n<td >SUNBT6965013 DTrace compiler doesn't align structs properly<\/td>\n<td >Do not use analytics broken down by hostname<\/td>\n<td >2013.1<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<br \/>\nWhen monitoring IP packets by hostname, we can decrease the throughput by 25%<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n\r\n    # dtrace -n '\r\n     profile-997hz \/arg0 &amp;&amp; curthread-&gt;t_pri != -1 &amp;&amp; cpu==58 \/ { @&#x5B;stack()] = count(); }\r\n     tick-60sec { trunc(@, 10); printa(@); exit(0); }' &gt; \/var\/ak\/dropbox\/cpu-busy-$date.out &amp;\r\n\r\nThe most used stacks were as follows :\r\n\r\n     ip`ill_input_short_v4+0xdb\r\n     ip`ip_input+0x23b\r\n     dls`i_dls_link_rx+0x2e7\r\n     mac`mac_rx_deliver+0x5d\r\n     mac`mac_rx_soft_ring_process+0x17a\r\n     mac`mac_rx_srs_fanout+0x823\r\n     mac`mac_rx_srs_drain+0x261\r\n     mac`mac_rx_srs_process+0x180\r\n     mac`mac_rx_classify+0x159\r\n     mac`mac_rx_flow+0x54\r\n     
mac`mac_rx_common+0x1f6\r\n     mac`mac_rx+0xac\r\n     mac`mac_rx_ring+0x4c\r\n     ixgbe`ixgbe_intr_msix+0x99\r\n     apix`apix_dispatch_pending_autovect+0x12c\r\n     apix`apix_dispatch_pending_hardint+0x33\r\n     unix`switch_sp_and_call+0x13\r\n     16175\r\n\r\nChecking what this part of the code does, we can see the following.\r\nThe 'nop' instructions are the signature of a DTrace probe.\r\n\r\n   st1f742003b# echo ill_input_short_v4+0xdb::dis | mdb -k\r\n   ill_input_short_v4+0xdb:        nop\r\n   ill_input_short_v4+0xdc:        nop\r\n   ill_input_short_v4+0xdd:        nop\r\n   ill_input_short_v4+0xde:        addq   $0x10,%rsp\r\n   ill_input_short_v4+0xe2:        movq   -0xb0(%rbp),%rcx\r\n   ill_input_short_v4+0xe9:        movq   %r12,%rdi\r\n   ill_input_short_v4+0xec:        xorq   %rsi,%rsi\r\n   ill_input_short_v4+0xef:        movq   %r13,%rdx\r\n   ill_input_short_v4+0xf2:        nop\r\n   ill_input_short_v4+0xf3:        nop\r\n   ill_input_short_v4+0xf4:        nop\r\n\r\nChecking which analytics have probes in ip, I noticed this :\r\n\r\n   dataset-022 active   6.72M   235M ip.bytes&#x5B;hostname]\r\n\r\nAfter disabling it, we have been able to sustain a higher throughput :\r\n\r\n   st1f742003b:analytics dataset-047&gt; read 10\r\n   DATE\/TIME                KB\/SEC     KB\/SEC BREAKDOWN\r\n   2013-9-10 15:06:49       349787     349755 ixgbe0\r\n                                           32 igb1\r\n   2013-9-10 15:06:50       351674     351643 ixgbe0\r\n                                           31 igb1\r\n   2013-9-10 15:06:51       650618     650580 ixgbe0\r\n                                           38 igb1\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" 
>Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >17832061<\/td>\n<td >One dropped packet can permanently throttle TCP transfer<\/td>\n<td >No easy workaround unless using the same VLANs for clients and the ZFS appliance<\/td>\n<td >2013.1.2.9<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<br \/>\nWhen clients and the ZFS appliance are on different VLANs, a 50% throughput decrease has been observed. Note that this does not impact IO writes. This is a pure network issue.<br \/>\n   * is the issue seen under 2011.1 : YES<br \/>\n   * is the issue seen under 2013.1 and same VLANs (not routed)  : NO<br \/>\n   * is the issue seen under 2013.1 and different VLANs (routed) : YES<\/p>\n<p>In addition to the congestion window, there is a variable that governs the maximum number of packets allowed to be sent in a single burst of packets. Initially this number is essentially infinite, with the number of packets in a burst allowed to keep growing indefinitely as long as there are no dropped packets indicating network congestion. However, when a packet is dropped, the number of packets may be reduced.<\/p>\n<p>The problem here is that if the destination of the connection is on the same local network, the burst is kept at infinity, but if it is not on the same local network the burst is reduced to 5. Under some circumstances it can even be reduced further to 3, but from 3 it will grow back to 5. This accounts for the difference between the VLANs. One VLAN is deemed local and one is deemed not local. On the non-local VLAN, once a packet is dropped the limit is reduced forever to 5. With bursts capped at 5 segments, only about 5 segments can be sent per round trip, which explains the large throughput drop on routed VLANs.<\/p>\n<p>Beyond 2013.1.2.9, the following tunable has to be changed :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    echo 'tcp_cwnd_normal\/W 0x14' | mdb -kw\r\n<\/pre>\n<p>In some cases, it seems we can set tcp_cwnd_normal to 0x64 to get better throughput. 
Adam's feedback for a perf issue with IO reads done by NDMP is as follows :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n    tcp_cwnd_normal=0x5  :  3MB\/s\r\n    tcp_cwnd_normal=0x14 : 30MB\/s\r\n    tcp_cwnd_normal=0x64 : 80MB\/s\r\n\r\n    Bigger values did not help much\r\n<\/pre>\n<hr>\n<p><div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >18125379<\/td>\n<td >memory leak in l2arc on sync reads that fail<\/td>\n<td >see below<\/td>\n<td >2013.1.2.0<\/td>\n<\/tr>\n<\/tbody><\/table><\/div><br \/>\nSignature :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nA typical kstate.out file will be as follows :\r\n    ::arc -m\r\n    p                         =         4 MB\r\n    c                         =        64 MB  &lt;&lt; target ARC size has reached its min\r\n    c_min                     =        64 MB\r\n    c_max                     =    130030 MB\r\n    size                      =     98748 MB  &lt;&lt; current ARC size, but no data is evictable anymore\r\n    &#x5B;...]\r\n    l2_evict_reading          =         0\r\n    l2_abort_lowmem           =     10240     &lt;&lt; memory pressure : the kmem_cache_reap_now() thread has run\r\n    l2_cksum_bad              = 504213672     &lt;&lt; issue when data read from SSD cache devices is validated before \r\n                                                 it is given to the ARC\r\n    l2_io_error               =         0\r\n    l2_hdr_size               =     29696 MB\r\n\r\n   * l2_cksum_bad will cause a memory leak on sync reads (not on async reads) due to 18125379\r\n   * getting 'l2_cksum_bad' will make a read from the original storage device\r\n<\/pre>\n<p>Workaround :<br \/>\nSSDs do 
not have to be replaced<br \/>\nThere is no workaround so far. A reboot will be needed to get around the issue, but we have analytics to measure arc.c, so we can take preemptive action and schedule a reboot before it gets messy. Do as follows :<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n   - Add the following analytic, if not present already : **Cache - ARC target size : arc.c**\r\n   - Add an ALERT from **Configuration-&gt;Alerts panel**\r\n      * Click on (+) Threshold alerts.\r\n      * Select &quot;Cache ARC target size&quot;\r\n      * Select &quot;fails below&quot;\r\n      * Enter 10000 (MB)\r\n      * Enter the correct email address\r\n   - If you receive an alert, it will be time to schedule a NAS head reboot.\r\n<\/pre>\n<p>I chose **10000MB**, but it really depends on the memory size. Obviously, when arc.c has decreased to 64MB, it&#8217;s already too late.<\/p>\n<p>Beyond that, we still do not know why &#8216;l2_cksum_bad&#8217; increased so much, and Victor Latushkin said this may not be a HW issue in the SSD.<\/p>\n<hr>\n<div class=\"table-responsive\"><table  style=\"width:100%; \"  class=\"easy-table easy-table-default \" >\n<thead>\r\n<tr><th  style=\"width:15px\" >Bug number<\/th>\n<th  style=\"width:100px\" >Title<\/th>\n<th  style=\"width:80px\" >Workaround<\/th>\n<th  style=\"width:15px\" >Fix<\/th>\n<\/tr>\n<\/thead>\n<tbody>\r\n<tr><td >18562374<\/td>\n<td >missing call to arc_free_data_block in l2arc_read()<\/td>\n<td >NA<\/td>\n<td >2013.2<\/td>\n<\/tr>\n\r\n<tr><td >18695640<\/td>\n<td >race between arc_write_done() and arc_evict_buf()<\/td>\n<td >NA<\/td>\n<td >as of Oct. 2017, this did not seem to be fixed<\/td>\n<\/tr>\n<\/tbody><\/table><\/div>\n","protected":false},"excerpt":{"rendered":"<p>All of the following information has to be taken &#8220;as is&#8221; and has not been updated for 3 years now. 
Though I am quite sure all of these bugs are &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"template-fullwidth.php","meta":{"_themeisle_gutenberg_block_has_review":false,"footnotes":""},"_links":{"self":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/344"}],"collection":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/comments?post=344"}],"version-history":[{"count":1,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/344\/revisions"}],"predecessor-version":[{"id":345,"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/pages\/344\/revisions\/345"}],"wp:attachment":[{"href":"https:\/\/fredpayet.fr\/index.php\/wp-json\/wp\/v2\/media?parent=344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}