All of the following information has to be taken “as is”; it has not been updated for three years now.
Though I am quite sure all of these bugs are fixed by now, it was worth listing them and sharing them with my fellow engineers who are still working on the ZFSSA.
Hope that helps.
Credits to all the engineers who worked with me at Sun and Oracle from 1999 to 2017.
General bugs list
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15020395 | performance decrease if 1st nameserver is down | | 2011.1.3 | |
| 15040725 | RFE for improved DNS resolver failover performance | | 2011.1.3 | |
| 15770608 | Enable DNS defer-on-fail functionality | | 2011.1.3 | |
| 15505861 | Write performance degrades when ZFS filesystem is near quota | | 2013.1.1.9 | Attempts were made for a 2011.1.6 IDR but regression tests failed |
| 15742475 | Extremely sluggish 7420 node due to heap fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
| 15770103 | Intermittent akd hangs – mutex owned by a thread stuck in zfs land | | 2011.1.4 | Impacts rollback and snapshot deletion. Refers to CR 7155750 |
| 15789968 | Unable to delete zombie snapshot causes CPU spikes | | 2011.1.5 | High CPU usage is an indicator |
| 15814083 | z_fuid_lock becomes too hot when zfs_fuid_find_by_idx() is heavily used | | 2013.1.2 | CIFS performance is bad. High CPU usage on the ZFSSA is a clear indicator |
| 15705245 | Elapsed time for zfs delete may grow quadratically | | 2011.1.5 | LUN or filesystem deletion may lead to a BUI hang for hours |
| 15710347 | System hung, lots of threads in htable_steal | limit arc_size | 2013.1.1.5 | |
| 15751065 | Appliance unavailable due to zio_arena fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
| 15710534 | Advance rotor before blocking on space_map_load | metaslab_unload_delay | 2011.1.6 | Sawtooth IOPS pattern during sync of metadata |
| 15790948 | soft-ring fanout is single threaded for VLAN | | 2013.1 | Throughput limited when VLANs are used |
| 17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | | IDR 2011.04.24.8.0,1-2.43.2.3 | spa_sync activity takes more than 5 sec |
| 15723911 | concurrency opportunity during spa_sync | sync=always | 2013.1.1.1 | 20% loss of throughput with RMAN backups |
| 15652868 | DTrace compiler doesn't align structs properly | disable some analytics | 2013.1 | ip.bytes[hostname] impacts performance |
| 16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | Performance issue when deleting clones |
| 18753152 | zil train and logbias=throughput, 2X performance opportunity | | 2013.1.2.8 | Performance issue impacting databases running OLTP workloads with an 8k recordsize and logbias=throughput |
| 17832061 | SMB read performance is not consistent | Avoid using different VLANs | 2013.1.2.9 | 50% bandwidth loss over different VLANs |
| 15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | metaslab_unload_delay | 2013.1.2.9 | |
| 15784616 | need an upper bound on memory footprint of l2hdr for large l2arc | | 2013.1.2.11 | L2ARC_HEADER_PERCENT added: default = 30% of ARC |
| 18609976 | large deletes cause NFS ops to drop to zero | | 2013.1.3.5 | To be monitored with the spawtf.d script (see the dtrace sketch after the stack below). Happens when large datasets are destroyed; spa_sync can last more than 10 minutes. The fix being pursued here also fixes two problems causing scan/scrub/resilver to overrun its time allocations. Prior to this, if a scrub is running, it consistently goes over 6s. With the fix, it is just over 5s, as intended. |
Typical stack is as follows:
genunix`avl_find+0x56
zfs`space_map_add+0xa6
zfs`zio_free+0x1a5
zfs`dsl_free+0x20
zfs`dsl_dataset_block_kill+0x19b
zfs`free_blocks+0x76
zfs`free_children+0x2c7
zfs`free_children+0x23f
zfs`free_children+0x23f
zfs`dnode_sync_free_range+0x113
zfs`dnode_sync+0x272
zfs`dmu_objset_sync_dnodes+0x7f
zfs`dmu_objset_sync+0x1c9
zfs`dsl_dataset_sync+0x63
zfs`dsl_pool_sync+0xdb
zfs`spa_sync+0x39a
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
2407
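The spawtf.d script itself is not reproduced here. As a rough substitute, a hedged dtrace sketch that measures how long each spa_sync() call takes (fbt entry/return probes on spa_sync; the aggregation label and the one-minute window are arbitrary choices):
# dtrace -n '
fbt::spa_sync:entry { self->ts = timestamp; }
fbt::spa_sync:return /self->ts/ { @["spa_sync ms"] = quantize((timestamp - self->ts) / 1000000); self->ts = 0; }
tick-60sec { printa(@); exit(0); }'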
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 20693077 | snapshot deletes don't check timeout while iterating | | 2013.1.4.6 | |
Hundreds of stacks similar to this seen onproc:
zfs`dbuf_hold_impl+0x1
zfs`dnode_hold_impl+0xc3
zfs`dnode_hold+0x2b
zfs`dmu_bonus_hold+0x32
zfs`bpobj_open+0x6d
zfs`bpobj_iterate_impl+0x35d
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate+0x23
zfs`dsl_scan_sync+0x11b
zfs`spa_sync+0x447
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19772024 | ls_reclaim() blocks kmem_cache_reap() | limit arc_size | 2013.1.3.1 | kmem_cache_reap thread stuck trying to retrieve some space from specific NFS cache |
| 19591405 | akd running in RT class is blocking critical kernel memory management | see below | 2013.1.6.0 | Some RT akd threads needing memory prevent kmem_cache_reap and the pageout scanner from running. See MOS Doc ID 2043383.1: some systems with LOW memory configurations might not run reliably/stably with 2013.1.x releases |
An admitted workaround is to set the scheduling class of the non-cluster akd threads to Fair Share (FSS), as follows :
zfssa # svccfg -s akd setprop akd/rt_off = "true"
zfssa # svccfg -s akd listprop | grep rt_off
akd/rt_off_cluster astring false
akd/rt_off astring true <<<<
zfssa # svcadm refresh akd
zfssa # svcadm restart akd
Must be done on both cluster nodes.
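To verify that the change took effect, a hedged check of the akd scheduling class with standard Solaris ps (the CLS column should no longer show RT for the non-cluster akd process):
zfssa # ps -efc | grep akd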
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 18914201 | Hole buffers are calculating zero checksums leading to performance slowdown | | 2013.1.3.1 | |
| 19137061 | top-level vdev allocations may go over mg_aliquot | | 2013.1.3.1 | The fix spreads IOs across vdevs in a more efficient way, making each vdev less busy |
| 19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | Reaper thread not able to free buffers fast enough. This causes NFS clients to experience IO latency. |
Typical stack is as follows :
unix`do_splx+0x97
unix`xc_common+0x230
unix`xc_call+0x46
unix`hat_tlb_inval+0x289
unix`hat_unload_callback+0x302
unix`hat_unload+0x41
unix`segkmem_free_vn+0x5a
unix`segkmem_free+0x27
genunix`vmem_xfree+0x10a
genunix`vmem_free+0x27
genunix`kmem_slab_destroy+0x87
genunix`kmem_slab_free+0x2c7
genunix`kmem_magazine_destroy+0xdf
genunix`kmem_depot_ws_reap+0x77
genunix`kmem_cache_magazine_purge+0xd6
genunix`kmem_cache_magazine_resize+0x41
genunix`taskq_thread+0x22e
unix`thread_start+0x8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19883033 | ZS3 encountering unexplained high CPU utilization | | 2013.1.3.3 | SMB issue mostly |
Using dtrace to count the number of calls from zfs_groupmember() to zfs_fuid_find_by_idx(), for the exact same workload against the same share as the same user:
| calls | version |
|---|---|
| 12843966 | ak-2013.1.3.0 |
| 848928 | ak-2013.1.3.3 |
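The counting itself can be done with a hedged dtrace one-liner along these lines (fbt entry probe plus the built-in caller variable; run it for a fixed interval while the CIFS workload is reproduced):
# dtrace -n '
fbt::zfs_fuid_find_by_idx:entry { @[func(caller)] = count(); }
tick-60sec { printa(@); exit(0); }'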
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15324796 | zfetch needs a whole lotta love | | 2013.1.4.4 | For sequential IO reads with a small blocksize, zfetch_block_cap can be increased (see the mdb sketch after this table). By default, zfetch_block_cap = 16; with a recordsize of 8k, ZFS will prefetch only 128K. A high number of vdevs will not help, as we just do not prefetch deep enough to keep them all busy. To get better disk usage, zfetch_block_cap has to be increased to 32 or even 64. See tuning_zfetch_block_cap.akwf attached to bug 19982476 (backport for ZFSSA). |
| 20361021 | Latency seen on upgrade to SMB2.1 with leases. SMB2.1 server is waiting for lease break ack | see below | 2013.1.4.7 | |
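For bug 15324796 in the table above, a hedged example of raising zfetch_block_cap on a live system, reusing the mdb write syntax shown elsewhere in this document (the 32-bit /W format is an assumption about the variable's type, and the change does not persist across reboots):
zfssa# echo zfetch_block_cap/W0t64 | mdb -kw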
SMB2.1 is available in 2013.1.3.0 and introduces leases. This feature works in a similar way to oplocks, where a 35-second delay can be seen, so just disabling oplocks from the BUI might not work.
You have to go back to SMB2.0 (to disable leases) as follows.
zfssa# svccfg -s smb listprop | grep maxprotocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 3 (SMB_D2_1)
zfssa# svccfg -s smb setprop smb/server_maxprotocol = astring: "2.0"
zfssa# svccfg -s smb listprop | grep protocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2.0
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 2 (SMB_D2_0)
zfssa# aksh
zfssa:> configuration services smb
zfssa:configuration services smb> disable
zfssa:configuration services smb> enable
Caveat : disabling oplocks will result in more traffic over the wire. The clients will be denied read/write caching of the data, and all requests have to go to the server.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19454536 | Improve the performance of SMB2_QUERY_DIRECTORY | | 2013.1.6.0 | Browsing CIFS shares with lots of files will make the ZFSSA CPUs busy … latencies can happen |
Signature :
zfssa# dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1/ { @[stack()] = count(); }
tick-10sec { trunc(@, 10); printa(@); exit(0); }'
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zaccess_aces_check+0x77
zfs`zfs_zaccess_common+0x85
zfs`zfs_zaccess+0x14d
zfs`zfs_fastaccesschk_execute+0x173
zfs`zfs_lookup+0xa8
genunix`vhead_lookup+0xd5
genunix`fop_lookup+0x128
smbsrv`smb_vop_lookup+0x17c
smbsrv`smb_fsop_do_lookup+0x14e
smbsrv`smb_fsop_lookup+0x27d
smbsrv`smb_odir_wildcard_fileinfo+0x6f
smbsrv`smb_odir_read_fileinfo+0x10e
smbsrv`smb2_query_directory_encode_dynamic+0xd5
smbsrv`smb2_query_directory_encode+0x7e
smbsrv`smb2_query_directory+0x84
smbsrv`smb2_dispatch_request_impl+0x237
smbsrv`smb2_dispatch_request+0x15
smbsrv`smb_session_worker+0x88
5672
Lockstat extract (32 cores) :
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Hottest Caller
2959889 95% 95% 0.00 224391 0xfffff6011546f100 zfs_root+0x58
nsec ------ Time Distribution ------ count Stack
256 | 98 fsop_root+0x2d
512 | 56784 smb_node_alloc+0x11e
1024 | 82330 smb_node_lookup+0x1fe
2048 |@ 187212 smb_fsop_do_lookup+0x49b
4096 |@@@ 372976 smb_fsop_lookup+0x27d
8192 |@@@@@@@@ 876803 smb_odir_wildcard_fileinfo+0x6f
16384 |@@@ 354769 smb_odir_read_fileinfo+0x10e
32768 |@ 167806 smb2_query_directory_encode_dynamic+0xd5
65536 |@ 107972 smb2_query_directory_encode+0x7e
131072 | 91142 smb2_query_directory+0x84
262144 |@ 107285 smb2_dispatch_request_impl+0x237
524288 |@ 150887 smb2_dispatch_request+0x15
1048576 |@ 182779 smb_session_worker+0x88
2097152 |@ 152531 taskq_d_thread+0xc1
4194304 | 66082 thread_start+0x8
8388608 | 2424
16777216 | 9
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23216069 | Missing flag in prefetch backport causes excessive prefetch activity | | 2013.1.6 | Related to 21974849 – RMAN backup over ASM leads to 1M random reads foiling zfetch. With the new zfetch implementation brought in 2013.1.4.4, RMAN backups through ASM saturate the SAS drives because of unexpected prefetching. |
With the new zfetch code, when subject to 1MB random reads, we actually do issue some readaheads, and given that the ASM Allocation Units can be interleaved, we are likely to end up not using the prefetched data.
This makes the disks very busy for no profit.
To compare the FC/iSCSI/NFS activity to the SAS2 disks activity, it can be interesting to add some
analytics as follows :
fc.bytes[op]
iscsi.bytes[op]
nfs3.bytes[op]
...
This can even make the CPUs busy with the following stack :
zfs`dbuf_prefetch+0x1
zfs`dmu_zfetch+0x10d
zfs`dbuf_read+0x128
zfs`dmu_buf_hold_array_by_dnode+0x1cc
zfs`dmu_buf_hold_array+0x6f
zfs`dmu_read_uio+0x4b
zfs`zfs_read+0x260
genunix`fop_read+0xa9
nfssrv`rfs3_read+0x3ec
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
47770
The workaround we implemented was to tune zfetch_maxbytes_lb=128K and zfetch_maxbytes_ub=256K. This keeps prefetch enabled but reduces the amount of prefetched data that needs to be thrown away. This is not a very comfortable workaround.
# echo zfetch_maxbytes_ub/W0t262144 | mdb -kw
zfetch_maxbytes_ub: 0x2000000 = 0x40000
# echo zfetch_maxbytes_lb/W0t131072 | mdb -kw
zfetch_maxbytes_lb: 0x400000 = 0x20000
From dmu_zfetch.c :
/* Max amount of data to prefetch at a time */
unsigned int zfetch_maxbytes_ub = 32 << MB_SHIFT;
unsigned int zfetch_maxbytes_lb = 4 << MB_SHIFT;
unsigned int zfetch_target_blks = 256;

/*
 * Provides an upper bound on the prefetch window size for a single prefetch.
 * We try to prefetch zfetch_target_blks (256) but stay within the size bounds
 * [zfetch_maxbytes_lb (4 MB), zfetch_maxbytes_ub (32 MB)]
 *
 * BLOCKSIZE    MAX PREFETCH WINDOW SIZE
 * 128K-1MB     32MB
 * 32K-64K      256 blocks (8MB - 16MB)
 * < 32K        4MB
 */
static unsigned int
dmu_zfetch_max_blocks(uint64_t blkshft)
{
        unsigned int blks = zfetch_target_blks;
        unsigned int size = blks << blkshft;

        if (size > zfetch_maxbytes_ub)
                return (zfetch_maxbytes_ub >> blkshft);

        if (size < zfetch_maxbytes_lb)
                return (zfetch_maxbytes_lb >> blkshft);

        return (blks);
}
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 21071219 | Comstar framework should allow a cmd to abort prior to ZFS I/O completion | | 2013.1.5.7 & 2013.1.6 | Relates to 22080255 : ZFSSA backend "resilvering" causes an FC front-end port to become inaccessible to the remote host |
We encountered this issue when we were investigating CR 2829652. Basically, on the qlt FC front end of the ZFSSA, if a SCSI command is detected as timed out by the remote host, the remote host issues an ABTS and retries the ABTS if the first one is not processed in time. If the second ABTS is not processed in time, a LOGO is issued by the remote host and the I/O path may be lost permanently (see CR 21071136) as a result. For a ZFSSA using FC as the transport interface, the capability to handle ABTS in a timely manner is critical so that the host can complete its error recovery and retry IOs without the risk of losing the IO path as a result of a LOGO. The AK8 kernel and COMSTAR stack do not have a way for fct/stmf to allow a command to abort prior to the ZFS backend IO completion (pmcs may be involved in most cases).
traces :
*stmf_trace_buf/s
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)
(ah)(fffff600e30dd480, fffff600e30dd510)(FFFFFFFFh FFFFFFFFh)(0)(B070
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)
(ah)(fffff600e30dd480, fffff600e30dd
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts,
*qlt_state::walk softstate |::qltgettrace : fc/qlttrace
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, (0)(2C0h 724h)
(4h)(fffff607a0c25700, fffff607a0c25790)(FFFFFFFFh 126B10h)(0)(603
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, )
[2016 May 25 22:01:48:229:544:907] !46891=>QEL qlt(2,0):: qlt_abort_cmd, cmd_type = FCT_CMD_FCP_XCHG
[2016 May 25 22:01:48:229:547:011] !46892=>QEL qlt(2,0):: qlt_abort_unsol_scsi_cmd, qi(0) fctcmd-
fffff604a0877c80(2C0h, FFFFh),9d000477 rex2=126B10h
[2016 May 25 22:01:48:229:562:586] !46893=>QEL qlt(2,0):: qlt_handle_ctio_completion, abort(oxid=2C0h):
hndl-9d170477, 1, fffff6007dcc6b40 (126B10h)
[2016 May 25 22:01:48:229:566:267] !46894=>QEL qlt(2,0):: qlt_handle_ctio_completion,
0(fffff604a0877c80,fffff604a0877d10)(2C0h,FFFFh),9d170477 126B10(2016, 0)
[2016 May 25 22:01:48:784:693:584] !46895=>QEL qlt(2,0):: qlt_send_abts_response,
(160h 5E1h)(0)(fffff607998c6e00 fffff607998c6e90) (FFFFFFFFh 122E80h)
[2016 May 25 22:01:48:784:746:901] !46896=>QEL qlt(2,0):: qlt_handle_abts_completion, qi(0) status =0h,
(160h 5E1h)
[2016 May 25 22:01:48:796:029:698] !46897=>QEL qlt(2,0):: qlt_send_abts_response,
(1BAh 63Eh)(0)(fffff6058ccd3c40 fffff6058ccd3cd0) (FFFFFFFFh 123FF0h)
The spagettime.d script will show a high sync gap, as follows :
# cat spagettime.Pool_02.2016-05-25_14\:14\:10 | nawk '{ if ($16>5000) print}'
2016 May 25 14:15:21 txg : 9996375; walltime:759 ms ; sync gap:14845 ms ; txg delays:0 ; txg stalled:0
2016 May 25 16:05:37 txg : 9997695; walltime:1449 ms ; sync gap:7279 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:01:49 txg : 10001958;walltime:1175 ms ; sync gap:25238 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:03:27 txg : 10001976;walltime:1065 ms ; sync gap:11427 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:12:45 txg : 10002087;walltime:1287 ms ; sync gap:5106 ms ; txg delays:0 ; txg stalled:0
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23526362 | mptsas driver TRAN_BUSY on readzilla degraded shadow-migration performance | | 2013.1.6.11 | |
Signature : iostat -xnz 1 – a wait queue > 0 is a smoking gun, together with actv >= 1 :
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
124.0 61.0 1423.0 38681.3 5.0 1.3 26.8 7.3 66 68 c0t5001E82002763F90d0 << READZILLA
263.0 51.0 2233.4 36262.7 4.9 1.4 15.7 4.4 65 71 c0t5001E82002763F90d0
95.0 61.0 956.0 43606.2 6.8 1.8 43.6 11.8 89 95 c0t5001E82002763F90d0
288.0 56.0 4157.5 41983.7 5.5 1.5 16.0 4.4 73 76 c0t5001E82002763F90d0
172.0 64.0 1422.9 43494.1 5.4 1.4 22.8 6.1 69 74 c0t5001E82002763F90d0
55.0 58.0 517.1 40532.3 5.2 1.4 45.8 12.4 68 70 c0t5001E82002763F90d0
wait : average number of transactions waiting for service (queue length), i.e. the host queue depth.
actv : number of pending I/O requests in the active queue. This represents the number of threads doing
IOs to the same device (cf. Roch).
L2ARC devices (SANDISK 1.6TB) are 100% busy, but critically have only 1 IO in the active queue and 9 in the wait queue. Blaise suspects it is because the L2 improvements in recent AK releases mean the L2ARC is caching much better, so the bug is triggered more easily.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 20466027 | zfs_zget is causing high cpu usage | | 2013.1.6.12 | |
Signature from a dtrace profile :
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_read+0x4d
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
4652
nfssrv`rfs3_write+0xab0
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
10390
unix`mutex_delay_default+0x7
unix`lock_set_spl_spin+0xc7
genunix`disp_lock_enter+0x2d
genunix`turnstile_lookup+0x69
unix`mutex_vector_enter+0x1b0
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_write+0x4a
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
19267
The fix was to remove the mutex and replace it with a NUMA-aware reader/writer lock, grabbed as a reader in the normal zget cases. The zget code then has to deal with the races that can occur when the same znode is looked up at the same time.
Kernel memory related issues
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15942559 | ZFS data can not be easily reclaimed for other uses | | ak9 | |
| 15775761 | Converting memory between kmem caches can be a limiting factor | | no fix | Points to 15942559 – ZFS data can not be easily reclaimed for other uses |
| 15796281 | kmem taskq processing can wedge the system | | 2013.1.6.3 IDR IDR #4.1 | The fix design was to change kmem_taskq into a standalone workq with a dedicated service thread. Each cache could have a bitmap of possible operations that the service thread could perform, along with a "busy" bit. When async work needs to be done on a cache, the thread requesting the work would set the corresponding bit and enqueue the cache on the global list. When the kmem thread wanted to do work, it would set the "busy" bit on a cache, perform the operations specified by the bitmap, then clear the "busy" bit. If a kmem_cache_destroy() needed to come in, it could grab the list lock, wait for the "busy" bit on its cache to be clear (hopefully it would be clear to begin with), remove the cache from the list, and then manually do whatever processing it needed to do before destroying the cache. |
| 19294541 | kmem cache reaping should be made more scalable | | 2013.1.6.3 IDR IDR #4.1 | |
Caches waiting to be processed by kmem_async are queued on a list :
> kmem_caches_processing=J
fffffffffc0b99c0
> fffffffffc0b99c0::walk list |::kmem_cache
ADDR NAME FLAG CFLAG BUFSIZE BUFTOTL
ffffc10000047030 kmem_alloc_256 120f 00200000 256 677052 << processing this cache
ffffc1010e5c8030 dnode_t 120f 04000000 752 16760 << waiting
ffffc1010e5d4030 zfs_znode_cache 120f 00000000 280 10842 << waiting
ffffc10113b2d030 zio_buf_512 1200 01020000 512 16816 << this cache may wait a long
time before getting processed
Synchronous processing of caches does not scale; things can go bad especially in low memory conditions.
The solution was to split kmem_reap into 2 steps :
1. All “background processing” on a given cache (reap or update) is done by a single thread at a time. Multiple threads can run for different buffers at the same time. This is the “scalable” part.
2. The “global cleanup” work (re-scheduling timeouts, etc.) can be done after all processing related to the base work (reap or update) is complete.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23555751 | ZFSSA – FC VM frozen for more than 10 mins | see below | 2013.1.6.3 IDR IDR #4.1 | Points to 15796281 and 19294541 |
echo "arc_shrink_shift/W 6"| mdb -kw (default is 7)
echo "arc_evict_chunk_minbufs/Z 0t256"|mdb -kw
Adjust arc_evict_chunk_minbufs to 256 from its default value of 128 – See 26264372.
echo "arc_evict_chunk_minbufs/Z 0t256"|mdb -kw
Tuning arc_shrink_shift will have the effect of making zfs reacting to low memory at an early point.
See arc.c/arc_reclaim_thread() and redzone calculation used by arc_reclaim_needed() :
redzone = 2 * btop(arc_size >> arc_shrink_shift);
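As a rough worked example (my numbers, not taken from the bug): with arc_size around 1.3 TB, as in the trace below, a shift of 7 gives redzone = 2 * btop(1317113940032 >> 7), i.e. about 5 million 4K pages or roughly 20 GB, while a shift of 6 doubles that to roughly 40 GB, so the ARC starts shrinking while considerably more memory is still free.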
The reaper_trace.d script may be used to see how long it takes for the arc_get_data_block() function to return when throttling happens.
Here we see that it can take 37 secs to return with some throttles :
52 43171 arc_get_data_block:return 31658262460856570 2016 Jul 29 17:15:40
fffffe0c06f84c20 37658131850 throttle 1877 arn 1878 aen 1878 freemem 9435104 arc_size 1317113940032
53 43171 arc_get_data_block:return 31658262460053455 2016 Jul 29 17:15:40
fffffe0c0582dc20 31811415899 throttle 1591 arn 1592 aen 1592 freemem 9435104 arc_size 1317112963016
The script has to be run as follows :
cli> confirm shell
# mkdir /var/tmp/tsc
# cd /var/tmp/tsc
# vi reaper_trace.d << copy the content of reaper_trace.d
# chmod u+x reaper_trace.d
# vi reaper_collection.sh << copy the content of reaper_collection.sh
# chmod u+x reaper_collection.sh
# ./reaper_collection.sh start
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 22256256 | Lengthy kmem_reap() processing can block progress of arc_reap_thread | | 2013.1.5.7 | kmem_cache_reap_async() added to offload advisory reaps to kmem-internal taskq threads. That prevents advisory reaps from blocking behind the async kmem reap activity if they happen to be working on the same kmem cache. |
| 22344599 | Final detach at the end of a resilvering leads to high spa_sync time | no workaround | 2013.1.6.0 | Relates to 22256256 : Lengthy kmem_reap() processing can block progress of arc_reap_thread. Typical signature is kernel memory reorganization at the end of a disk replacement, when the final spare detach happens. |
The arc_delays.d script will report some kernel memory buffers being reaped :
2015 Dec 9 22:04:40 freemem 4054 MB
2015 Dec 9 22:04:50 freemem 4546 MB; reaping in progress
2015 Dec 9 22:05:00 freemem 5313 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
reaping in progress
2015 Dec 9 22:05:10 freemem 5334 MB; reaping in progress
2015 Dec 9 22:05:20 freemem 4799 MB; reaping in progress
2015 Dec 9 22:05:30 freemem 4702 MB; reaping in progress
2015 Dec 9 22:05:40 freemem 5368 MB; reaping in progress
2015 Dec 9 22:05:50 freemem 6196 MB; reaping in progress
2015 Dec 9 22:06:00 freemem 6964 MB; reaping in progress
2015 Dec 9 22:06:10 freemem 7985 MB; reaping in progress
2015 Dec 9 22:06:20 freemem 9085 MB; reaping in progress
2015 Dec 9 22:06:30 freemem 10152 MB; reaping in progress
2015 Dec 9 22:06:40 freemem 11409 MB; reaping in progress
2015 Dec 9 22:06:50 freemem 12681 MB; reaping in progress
Looking at the analytics, the end of the resilvering coincides with some kernel memory reorganization.
See how the bpmap_cache buffer size decreases :
lbl-1656:analytics dataset-035> show
Properties:
name = kmem.fragmented[cache]
grouping = Memory
explanation = kernel memory lost to fragmentation broken down by kmem cache
incore = 4.89M
lbl-1656:analytics dataset-035> read 2015-12-09 22:04:49 1
DATE/TIME MB MB BREAKDOWN
2015-12-9 22:04:49 17393 11495 bpmap_cache
1610 space_seg_cache
659 dmu_buf_impl_t
536 l2arc_buf_t
518 arc_buf_t
364 zio_cache
158 arc_ref_t
157 arc_elink_t
144 dnode_t
lbl-1656:analytics dataset-035> read 2015-12-09 22:06:10 1
DATE/TIME MB MB BREAKDOWN
2015-12-9 22:06:10 10753 4979 bpmap_cache
1598 space_seg_cache
596 dmu_buf_impl_t
536 l2arc_buf_t
489 arc_buf_t
361 zio_cache
164 zio_data_buf_8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | |
| 20645565 | arc_throttle logic can delay zfs operations under low memory conditions | | 2013.1.6.3 IDR IDR #4.1 | |
a) Excessive wait time in kmem taskq :
arc_reaper_thread dispatches the reap request to kmem_taskq. kmem_taskq is single-threaded and is responsible for multiple kmem maintenance tasks, not just reaping. If there are tasks pending in the taskq, it may take a while for our task to get to the front of the queue and execute.
b) Latency in taskq_wait :
After arc_reaper_thread dispatches the reaping task to the taskq, it calls taskq_wait to wait for the task to be done. Once it returns from taskq_wait, it signals arc_reaper_thread so that it may shrink the ARC further and evict more buffers if needed. However, it seems we wait in taskq_wait until all the pending tasks in the queue are complete, not just for the completion of the task dispatched by arc_reaper_thread. For example, if some other thread calls kmem_reap, that may dispatch lots of reaping tasks to the kmem taskq; in that case, we keep waiting until these other reaping tasks are also complete.
It seems the fix for the following bug may help to solve this problem, but it was not backported to AK : 15796281 – kmem taskq processing can wedge the system.
c) The red zone size used for reacting to memory shortage :
The memory throttle logic in arc_get_data_block puts the calling thread to wait if the allocation of additional memory would make freemem fall below (desfree + lotsfree). On a 512GB system, (desfree + lotsfree) is approximately 12GB. To avoid entering the throttle logic and putting the threads to wait for memory, we may consider reacting to the low memory situation at an earlier point; we shouldn't wait until freemem goes as low as 12GB. In the arc_reclaim_thread logic, that early point is decided based on the size of the red zone. The red zone size probably also needs some tuning so that we react much earlier.
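To see how close a given system is to that throttle point, a hedged check of the relevant VM variables (freemem, lotsfree and desfree are standard Solaris kernel variables; the values are in pages):
# echo "freemem/E ; lotsfree/E ; desfree/E" | mdb -k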
zfs_signature.sh
The zfs_signature.sh script was commonly used to narrow down performance issues possibly related to ZFS. This was written by Roch Bourbonnais and Blaise Sanouillet, 2 gurus I miss a lot.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 18125379 | memory leak in l2arc on sync reads that fail | | 2013.1.2.0 | |
See the l2feeding.d script comments :
* The output below with write BW under the target BW, and all devices able to
* keep up, is a signature for:
* 16564195 ARC l_evict_lock contention in arc_evict_bytes()
*
* 2014 Nov 12 12:08:08
* Data eligible for L2, evicted but not fed: 352 MB/s
* L2 feeding suspended due to low memory: 0
*
* L2 device Waiting for data Not keeping up Write BW Target BW
* fffff600afd5b940 0 0 7 MB/s 40 MB/s
* fffff600e6e2f280 0 0 8 MB/s 40 MB/s
* fffff600f5be52c0 0 0 8 MB/s 40 MB/s
* fffff600e836ea00 0 0 7 MB/s 40 MB/s
lockstat gives this :
Adaptive mutex spin: 4123264 events in 30.136 seconds (136823 events/sec)
-------------------------------------------------------------------------------
ops/s indv cuml rcnt nsec Lock Caller
1973 19% 19% 0.00 4254743 ARC_mru+0xa0 arc_evict_bytes+0x28
nsec ------ Time Distribution ------ ops/s Stack
128 | 0 arc_get_data_block+0xe4
256 | 8 arc_read+0x64c
512 | 20 dsl_read+0xac
1024 | 47 dbuf_read_impl+0x22c
2048 |@ 115 dbuf_read+0x100
4096 |@@@ 215
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 16564195 | ARC l_evict_lock contention in arc_evict_bytes | | 2013.1.4.5 | |
Adaptive mutex spin: 5320606 events in 62.214 seconds (85521 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Hottest Caller
2611062 80% 80% 0.00 124236 ARC_mru[272] arc_evict_bytes+0x5e
nsec ------ Time Distribution ------ count Stack
256 | 584 arc_get_data_block+0xc7
512 | @ 91853 arc_read+0x606
1024 | @ 78003 dsl_read+0x34
2048 | @ 150450 dbuf_prefetch+0x1e6
4096 | @@@ 300149 dmu_zfetch_fetch+0x5f
8192 | @@@@@@@ 640676 dmu_zfetch+0x10d
16384 | @@@@ 426892 dbuf_read+0x128
32768 | @@ 246084 dmu_buf_hold_array_by_dnode+0x1cc
65536 | @ 135358 dmu_buf_hold_array+0x6f
131072 | @ 117919 dmu_read_uio+0x4b
262144 | @ 124844 zfs_read+0x260
524288 | @ 127585 fop_read+0xa9
1048576 | @ 93911 rfs3_read+0x4e0
2097152 | @ 53434 common_dispatch+0x53f
4194304 | @ 20964 rfs_dispatch+0x2d
8388608 | @ 2233 svc_process_request+0x184
16777216 | @ 76 stp_dispatch+0xb9
33554432 | @ 21 taskq_d_thread+0xc1
67108864 | @ 23 thread_start+0x8
134217728 | @ 3
Bug 19297091 – arc_evictable_memory has needless locks – is part of the fix.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15648057 | lack of accounting for DDT in dedup'd frees can oversubscribe txg | | 2010.Q1.2.0 | See spagetti.d script output. walltime » 5 seconds, sync gap close to 0, arc misses from “DDT ZAP algorithm” in the 100s |
| 18281514 | spa_sync/space_map_map delay in dbuf_read causes NFS thread to pause | | 2013.1.2.9 | See spagetti.d script output. walltime » 5 seconds, sync gap close to 0, arc misses from “SPA space map” in the 100s |
| 16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | See spagetti.d script output. Signature for dataset / large file destroy pathologies. walltime » 5 seconds, sync gap close to 0, dsl_dataset_destroy_sync() on-cpu time > 2 seconds |
| 19621408 | do_user_quota_updates can consume 10% of spa_sync time in some cases | | no fix | See spagetti.d script output. walltime >= 5 seconds, do_user_quota_updates() time > 1 sec (or even bigger). |
-------- spa_sync exclusive time breakdown by tracked function :
Elapse ms On-cpu ms
<total> 5252 487
spa_sync 6 4
dsl_pool_sync 2469 277
metaslab_sync 3 1
metaslab_sync_reassess 2 2
metaslab_sync_done 125 122
dmu_objset_do_userquota_updates 2193 55 <<<<
taskq_wait -> vdev_sync 2 0
taskq_wait -> vdev_sync_done 109 0
vdev_config_sync 95 1
dsl_dataset_destroy_sync 0 0
zio_wait 0 0
lost 248 25
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15623702 | Write throttling starts too late (arc_memory_throttle) | | no fix | See spagetti.d script output. Pattern of 2 out of 3 consecutive spa_syncs with dp_write used ~0 and ARC mem throttle > 0 |
| 15744920 | read throttle required to survive read write storm | | 2013.1.5 | See spagetti.d script output. walltime >= 5 seconds, sync gap close to 0, no arc misses, dp_write thput limit ~= min + txg stalls and delays – see spagetti.d details. Setting zfs_write_limit_min to 24 * 100MB * #HDD has been shown to alleviate the case of writes stalled for minutes |
For a 20-disk pool (10 vdevs), this gives : 24 * 100MB * 20 = 48000000000 bytes
zfssa# echo zfs_write_limit_min/Z0t48000000000 | mdb -kw
zfs_write_limit_min: 0x2000000 = 0xb2d05e000
This has the effect of diminishing the max throughput for the pool, but it makes the txg
stalls and delays disappear.
Before tuning zfs_write_limit_min to 48000000000, we may see a very fluctuating
"used throughput" for TXGs (second column) :
min/used/thput limit/free limit : 1/ 23/ 616/ 1179190
min/used/thput limit/free limit : 1/ 409/ 973/ 1179174
min/used/thput limit/free limit : 1/ 696/ 1347/ 1179146
min/used/thput limit/free limit : 1/ 708/ 1631/ 1179117
min/used/thput limit/free limit : 1/ 85/ 1464/ 1179099
min/used/thput limit/free limit : 1/ 23/ 1176/ 1179099
min/used/thput limit/free limit : 1/ 24/ 983/ 1179114
min/used/thput limit/free limit : 1/ 24/ 810/ 1179114
min/used/thput limit/free limit : 1/ 23/ 656/ 1179114
min/used/thput limit/free limit : 1/ 11/ 537/ 1179115
min/used/thput limit/free limit : 1/ 1/ 409/ 1179116
After the tuning, it is more constant :
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179075
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179071
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179066
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179061
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179057
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 16275386 | metaslab_group_alloc shouldn't block on space_map_load | see below | 2013.1.6 | When destroying a dataset, entries are added to a chain list on disk for a non-loaded spacemap. When this spacemap is loaded for further writes, the chain list has to be parsed to determine where free space is. This makes a lot of random IO reads, and this takes time while we are in the middle of IO writes. From bug 16275386 : It turns out that this issue is more general than previously thought. During txg sync, if an allocation on vdev 1 is waiting on space_map_load, further allocations will be directed to vdev 2 until it reaches its aliquot, then vdev 3, etc. Eventually it will come back to vdev 1 and also block waiting on space_map_load. This means vdev 1 won't be kept busy and so will reduce the write throughput, even when the ZIL is not in play. It should be possible to have a space map loaded asynchronously in advance so that metaslab_group_alloc() can always proceed without blocking, modulo corner cases. |
Aliquot : mg_aliquot = 512K * (# leaf vdevs)
Normally, mg_aliquot is used as a limit to move to the next vdev (or metaslab_group).
In metaslab_ndf_fragmented(), mg_aliquot is used to decide whether we need to move to another metaslab
if (maxsize < mg_aliquot).
Workaround :
# echo "metaslab_unload_delay/W 0x7FFFFFFF" | mdb -kw
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 17856080 | arc_meta_used is not necessarily all MRU | see below | 2013.1.7 | Symptoms : IO reads can be slow, especially for backups. Disks can be busy due to lack of hits in MRU ( see bug 22818942 ). Disks busy on IO reads due to no evictable space |
Signature : arc_delays.d
2016 Apr 12 09:21:02 freemem 8359 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
2016 Apr 12 09:21:12 freemem 7909 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
These messages mean that the MRU buffer is too small to keep the most recently used data.
The issue comes from the arc_adjust() function, which adjusts the size of the MRU or MFU buffers
if arc_size > arc_c. arc_adjust() can be used when ZFS needs to reduce its memory consumption.
The current algorithm (Apr 2016) takes arc_meta_used into account, but "arc_meta_used is not
necessarily all MRU". That means we reduce the MRU too much.
The adjust_arc_p.sh script tries to artificially increase arc_p (the target size of the MRU) to 3 times the value
of meta_used. It limits arc_p to half of arc_c (the target size of the cache). If meta_used and arc_p
are already too high, the script does not take any action but displays a message.
For example, here arc_p is too small :
p = 11696 MB /* target size of MRU */
c = 187039 MB /* target size of cache */
c_min = 64 MB /* min target cache size */
c_max = 261101 MB /* max target cache size */
size = 187039 MB /* actual total arc size */
[..]
meta_used = 21967 MB
meta_max = 41046 MB
meta_limit = 0 MB
Thanks to the script, p will go to 60GB (that is still below 187GB/2).
Before, with tuning_zfs_arc_p_min_shift.akwf, arc_p was set to 1/8 of arc_size (reboot needed).
Apparently, this may not be enough.
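The adjust_arc_p.sh script is not reproduced here, but a hedged manual equivalent follows the same mdb pattern used for arcstat_c_max later in this document: print the address of arcstat_p, then write the new value (0xF00000000, about 60 GB, matching the example above) to that address. The field name arcstat_p and its direct writability are assumptions on my part:
zfssa# echo "arc_stats::print -ta arc_stats_t arcstat_p.value.i64" | mdb -k
zfssa# echo "<address printed above>/Z 0xF00000000" | mdb -kw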
Bugs under the bonnet
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15505861 | Write performance degrades when ZFS filesystem is near quota | Avoid quota usage, Free some space by deleting some datasets | 2013.1.1.9 |
Signature :
Typical stack when reaching filesystem size limit :
> ::stacks -c dsl_dir_tempreserve_impl
zfs`dsl_dataset_check_quota
zfs`dsl_dir_tempreserve_impl+0xc2
zfs`dsl_dir_tempreserve_space+0x13e
zfs`dmu_tx_try_assign+0x208
zfs`dmu_tx_assign+0x2a
zfs`zfs_write+0x63c
genunix`fop_write+0xa6
genunix`write+0x309
genunix`write32+0x22
unix`_sys_sysenter_post_swapgs+0x149
Actually, we get in trouble when the return code is 91 : ERESTART. This can lead to long spa_sync time.
dsl_dataset.c
dsl_dataset_check_quota(dsl_dataset_t *ds, uint64_t inflight)
{
        int error = 0;

        mutex_enter(&ds->ds_lock);
        /*
         * If they are requesting more space, and our current estimate
         * is over quota, they get to try again unless the actual
         * on-disk is over quota and there are no pending changes (which
         * may free up space for us).
         */
        if (ds->ds_quota != 0 &&
            ds->ds_phys->ds_used_bytes + inflight >= ds->ds_quota) {
                if (inflight > 0 || ds->ds_phys->ds_used_bytes < ds->ds_quota)
                        error = ERESTART;
                else
                        error = EDQUOT;
        }
        mutex_exit(&ds->ds_lock);
        return (error);
}
Dtrace can be of great help to check the return values of dsl_dataset_check_quota() and dsl_dir_tempreserve_space(). If we get something different from 0, we may be hitting the issue.
# dtrace -n 'dsl_dataset_check_quota:return /arg1!=0/ { trace(arg1) }'
CPU ID FUNCTION:NAME
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
# dtrace -n 'dsl_dir_tempreserve_space:entry { printf("%s a:%lx f:%lx u:%lx",
stringof(args[0]->dd_myname),arg2,arg3,arg4)}
dsl_dir_tempreserve_space:return { trace(arg1)}'
CPU ID FUNCTION:NAME
0 44317 dsl_dir_tempreserve_space:entry prepcc-t1_oradata2-Clone_DAA1
0 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
3 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
3 44318 dsl_dir_tempreserve_space:return 91
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15710347 | system hung, lots of threads in htable_steal | See below | 2013.1.1.5 |
Signature :
> ::stacks -c htable_steal
THREAD STATE SOBJ COUNT
fffff60011597760 RUN <NONE> 9
swtch+0x150
turnstile_block+0x760
mutex_vector_enter+0x261
htable_steal+0x1a4
htable_alloc+0x248
htable_create+0x1dc
hati_load_common+0xa3
hat_memload+0x81
hat_memload_region+0x25
segvn_faultpage+0x937
segvn_fault+0xc13
as_fault+0x5ee
pagefault+0x99
trap+0xe63
Workaround :
arc_stats::print -ta arc_stats_t arcstat_c_max.value.i64
fffffffffbd43680 int64_t arcstat_c_max.value.i64 = 0x5bef18000
> fffffffffbd43680/Z0x4000000
arc_stats+0x4a0:0x5bef18000 = 0x4000000 <<< 64MB
> ::arc -m ! grep max
c_max = 64 MB
meta_max = 889 MB
In practice, use a larger value such as :
0x400000000 << 16GB
0x600000000 << 24GB
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15742475 | AK-2011.04.24 Extremely sluggish 7420 node due to heap fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
memstat shows that we have plenty of memory free, however a "::stacks -m zfs" shows that we have a number
of zfs stacks waiting for allocations.
>::stacks -m zfs
ffffffaac7a07440 SLEEP CV 36
swtch+0x145
cv_wait+0x61
vmem_xalloc+0x635
vmem_alloc+0x161
segkmem_xalloc+0x9
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15751065 | appliance unavailable due to zio_arena fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
We haven't run out of available memory.
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 8480307 33126 6%
ZFS File Data 120780433 471798 90% << ZFS ARC cache for data
Anon 231541 904 0%
Exec and libs 73 0 0%
Page cache 6816 26 0%
Free (cachelist) 22290 87 0%
Free (freelist) 4691874 18327 3 %
Total 134213334 524270
Physical 134213333 524270
We have some threads waiting for some memory allocation
> ::stacks -m zfs
THREAD STATE SOBJ COUNT
ffffff86b38ab440 SLEEP CV 23
swtch+0x150
cv_wait+0x61
vmem_xalloc+0x63f
vmem_alloc+0x161
segkmem_xalloc+0x90
segkmem_alloc_vn+0xcd
segkmem_zio_alloc+0x24
vmem_xalloc+0x550
vmem_alloc+0x161
kmem_slab_create+0x81
kmem_slab_alloc+0x5b
kmem_cache_alloc+0x1fa
zio_data_buf_alloc+0x2c
arc_get_data_buf+0x18b
arc_buf_alloc+0xa2
arc_read_nolock+0x12f
arc_read+0x79
dsl_read+0x33
dbuf_read_impl+0x17e
dbuf_read+0xfd
dmu_tx_check_ioerr+0x6b
dmu_tx_count_write+0x175
dmu_tx_hold_write+0x5b
zfs_write+0x655
fop_write+0xa4
rfs3_write+0x50e
common_dispatch+0x48b
rfs_dispatch+0x2d
svc_getreq+0x19c
svc_run+0x171
svc_do_run+0x81
nfssys+0x760
_sys_sysenter_post_swapgs+0x149
from ::kmastat -g, the buffers consuming memory are for data
>::kmastat -g ! grep -v 0G
zio_data_buf_65536 65536 4694099 4694101 286G 21342088 0
zio_data_buf_131072 131072 1424500 1424500 173G 8844051 0
...
zfs_file_data 460G 512G 0G 10832435 0
zfs_file_data_buf 460G 460G 460G 10843627 0
Looking at the thread stacks, we were waiting for a memory allocation of just 0x20000 bytes (128KB)
>::stacks -c vmem_xalloc
stack pointer for thread ffffff86b38ab440: ffffff03d56a8700
[ ffffff03d56a8700 _resume_from_idle+0xf1() ]
ffffff03d56a8730 swtch+0x150()
ffffff03d56a8760 cv_wait+0x61(ffffff84949ac01e, ffffff84949ac020)
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4) <<< 0x20000
ffffff03d56a8900 vmem_alloc+0x161(ffffff84949ac000, 20000, 4)
...
]
Most of the waiters on this condition variable are nfsd daemons
> ffffff84949ac01e::wchaninfo -v
ADDR TYPE NWAITERS THREAD PROC
ffffff84949ac01e cond 60: ffffff8570a5db60 akd
ffffff85a1baabc0 akd
ffffff86bdf66880 akd
ffffff85a1af6420 nfsd
ffffff85e09da420 nfsd
ffffffbd7b172bc0 nfsd
ffffff86bc6450e0 nfsd
ffffff86b38ab440 nfsd
ffffff85a1cdf0c0 nfsd
ffffffbd7b16e100 nfsd
...
We have many ZFS threads doing some memory allocation
> ::stacks -m zfs|::findstack -v ! grep vmem_xalloc | grep ffffff84949ac000 | head
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3cb08a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d4e498a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d37dc8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d06288a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d43708a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d35f58a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3d768a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d38368a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d382a8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
Walking the free vmem_segs of vmem_t ffffff84949ac000, we have many free segments of 0x10000 bytes (64KB)
but none of 0x20000 bytes (128KB):
> ffffff84949ac000::walk vmem_free | ::vmem_seg -s ! awk '{print $4}' | sort | uniq -c
108 0000000000001000
97 0000000000002000
166 0000000000003000
148 0000000000004000
...
09 000000000000f000
837558 0000000000010000 << 64KB
30 0000000000011000
34 0000000000012000
42 0000000000013000
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15710534 | SUNBT7038138 advance rotor before blocking on space_map_load | See below | 2011.1.6 |
Signature :
Time spent to read metadata during sync is responsible for the sawtooth pattern of IOPS
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff"|mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15790948 | SUNBT7167903 pre-6998140 soft-ring fanout is single threaded for VLANs | Do not use VLAN under 2011.1 if you need high throughput | 2013.1 |
Signature :
* 1 or 2 CPUs 99% busy
* High level of interrupts (see the mpstat example below)
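A quick, hedged way to confirm this signature from the shell is plain mpstat (standard Solaris; the 5-second interval is arbitrary). Look for one or two CPUs with very high intr and sys values while the other CPUs stay mostly idle:
zfssa# mpstat 5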
VLAN usage :
# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
aggr1 type: non-vlan mtu: 1500 aggregation: key 1
aggr111001 type: vlan 111 mtu: 1500 aggregation: key 1
aggr112001 type: vlan 112 mtu: 1500 aggregation: key 1
aggr23002 vnic 1500 up -- aggr2 <<< VLANID - 23 of aggr2
aggr21002 vnic 1500 up -- aggr2
aggr22002 vnic 1500 up -- aggr2
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | see below | 2013.1.2.9 |
Signature :
Run the spacemap_15754028.d script to determine if the observed performance issue is due to this bug. If it is, the script shows consistently long spa sync times (>15s) in the “Max sync” column, and high numbers (>200) in the “Sync RMW” column:
Pool : Max sync Sync Load Minsize MB Count Load Miss Sync RMW anon/MRU/MFU : 2013 Sep 13 06:09:30
system : 503 ms 0 0 133 MB, 6017 0 0 0/0/0
mstorepool 56329 ms 5 0 0 MB, 355572 0 21259 0/21/21238
Typical stack trace is as follows :
swtch+0x150()
cv_wait+0x61(ffffff8a047b2cb8, ffffff8a047b2cb0)
zio_wait+0x5d(ffffff8a047b29a0)
dbuf_read+0x1e5(ffffff8b754b8d88, 0, 9)
dbuf_will_dirty+0x5d(ffffff8b754b8d88, ffffff91ec49c338)
dmu_write+0xe1(ffffff856606d340, 446, 10fa0, 478,
space_map_sync+0x295(ffffff86174c8dd0, 1, ffffff86174c8ae0,
metaslab_sync+0x2de(ffffff86174c8ac0, a8b)
vdev_sync+0xd5(ffffff8615400640, a8b)
spa_sync+0x45b(ffffff8638592040, a8b)
txg_sync_thread+0x247(ffffff85660903c0)
thread_start+8()
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff"|mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 16423386 | dnode_next_offset() doesn't do exhaustive search | no WO | 2013.1.1.4 |
Signature :
Clone deletion issue : Note the presence of traverse_dnode() twice on the stack.
zfs`arc_read_nolock+0x7e7
zfs`arc_read+0x79
zfs`dsl_read+0x33
zfs`traverse_visitbp+0x431
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <---- B
zfs`traverse_visitbp+0x387
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <----- A
zfs`traverse_visitbp+0x1fc
zfs`traverse_impl+0x213
zfs`traverse_dataset+0x57
zfs`dsl_dataset_destroy_sync+0x290
zfs`dsl_sync_task_group_sync+0xf3
zfs`dsl_pool_sync+0x1ec
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | no WO | IDR 2011.04.24.8.0,1-2.43.2.3 |
Fixed thanks to bug 6988530 NFS threads partially STOPPED after nfsd restart during ZFSSA fail-back.
Signature :
spa_sync activity takes more than 5 sec. It looks like 15754028 but the workaround may not help. Use the synctime7 dtrace script to confirm the issue.
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15723911 | SUNBT7058001 concurrency opportunity during spa_sync | see below | 2013.1.1.1 |
Parent of 15885830 : RMAN BACKUP PERFORMANCE REDUCED 15-30% WHEN USING UNSTABLE WRITES WITH 11.2.0.3 – Fixed in 2013.
Signature :
When doing a heavy asynchronous IO write workload, the NFSv3 response time can show significant variations (up to 7 seconds). With a synchronous workload, the max response time remains normal, around 400ms max. For RMAN backups, a performance decrease of 15-30% has been observed.
ZFS periodically writes out dirty cache buffers to disk. This is called a “sync”: we are synchronising what is in core memory with what is on disk, and it has many steps. The syncing procedure is managed by a component of ZFS called the Storage Pool Allocator, or SPA. This enhancement addresses two of the many steps that the SPA takes in order to complete the sync.
One of the steps is to process all of the “frees”. A “free” involves releasing a previously allocated region on a disk and making it available for allocation again. As ZFS determines which blocks to free, it adds them to a list. Once the list is complete, the SPA would loop through that list and process the frees one at a time. The enhancement that was made to this step was to reduce the number of items on the list by processing the frees as they come in. This greatly reduces the amount of CPU required to handle each block that is being freed.
Another step that the SPA sync procedure does is to write out the metaslab space maps. These space maps contain the allocations that are done on disks, so ZFS can know which regions are free. Prior to this enhancement, each disk was processed sequentially and its space maps were written out. With the change in place, we can now write out the space maps for each disk in parallel.
Workaround :
zfs set sync=always <dataset in trouble>
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15652868 | SUNBT6965013 DTrace compiler doesn't align structs properly | Do not use analytics broken down by hostname | 2013.1 |
Signature :
When monitoring IP packets by hostname, we can decrease the throughput by 25%
# dtrace -n '
profile-997hz /arg0 && curthread->t_pri != -1 && cpu==58 / { @[stack()] = count(); }
tick-60sec { trunc(@, 10); printa(@); exit(0); }' > /var/ak/dropbox/cpu-busy-$date.out &
The most used stacks were as follows :
ip`ill_input_short_v4+0xdb
ip`ip_input+0x23b
dls`i_dls_link_rx+0x2e7
mac`mac_rx_deliver+0x5d
mac`mac_rx_soft_ring_process+0x17a
mac`mac_rx_srs_fanout+0x823
mac`mac_rx_srs_drain+0x261
mac`mac_rx_srs_process+0x180
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx_common+0x1f6
mac`mac_rx+0xac
mac`mac_rx_ring+0x4c
ixgbe`ixgbe_intr_msix+0x99
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hardint+0x33
unix`switch_sp_and_call+0x13
16175
Checking what this part of the code does, we can see the following.
The 'nop' instructions are the signature of a dtrace probe.
st1f742003b# echo ill_input_short_v4+0xdb::dis | mdb -k
ill_input_short_v4+0xdb: nop
ill_input_short_v4+0xdc: nop
ill_input_short_v4+0xdd: nop
ill_input_short_v4+0xde: addq $0x10,%rsp
ill_input_short_v4+0xe2: movq -0xb0(%rbp),%rcx
ill_input_short_v4+0xe9: movq %r12,%rdi
ill_input_short_v4+0xec: xorq %rsi,%rsi
ill_input_short_v4+0xef: movq %r13,%rdx
ill_input_short_v4+0xf2: nop
ill_input_short_v4+0xf3: nop
ill_input_short_v4+0xf4: nop
Checking which analytics have probes in ip, I noticed this :
dataset-022 active 6.72M 235M ip.bytes[hostname]
After disabling it, we have been able to sustain a higher throughput :
st1f742003b:analytics dataset-047> read 10
DATE/TIME KB/SEC KB/SEC BREAKDOWN
2013-9-10 15:06:49 349787 349755 ixgbe0
32 igb1
2013-9-10 15:06:50 351674 351643 ixgbe0
31 igb1
2013-9-10 15:06:51 650618 650580 ixgbe0
38 igb1
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 17832061 | One dropped packet can permanently throttle TCP transfer | No easy workaround unless using the same VLANS for clients and the ZFS appliance | 2013.1.2.9 |
Signature :
When the clients and the ZFS appliance are on different VLANs, a 50% throughput decrease has been observed. Note that this does not impact IO writes; it is a pure network issue.
* is the issue seen under 2011.1 : YES
* is the issue seen under 2013.1 and the same VLANs (not routed) : NO
* is the issue seen under 2013.1 and different VLANs (routed) : YES
In addition to the congestion window, there is a variable that governs the maximum number of packets allowed to be sent in a single burst. Initially this number is essentially infinite, with the number of packets in a burst allowed to keep growing indefinitely as long as there are no dropped packets indicating network congestion. However, when a packet is dropped, the number of packets may be reduced.
The problem we have here is that if the destination of the connection is on the same local network, the burst is kept at infinity, but if it is not on the same local network the burst is reduced to 5. Under some circumstances it can even be reduced further to 3, but from 3 it will grow back to 5. This accounts for the difference between the VLANs: one VLAN is deemed local and one is deemed not local. On the non-local VLAN, once a packet is dropped the limit is reduced forever to 5.
Beyond 2013.1.2.9 the following tunable has to be changed :
echo 'tcp_cwnd_normal/W 0x14' | mdb -kw
For some cases, it seems we can set tcp_cwnd_normal to 0x64 to get better throughput. Adam's feedback for a perf issue with IO reads done by NDMP is as follows :
tcp_cwnd_normal=0x5 : 3MB/s
tcp_cwnd_normal=0x14 : 30MB/s
tcp_cwnd_normal=0x64 : 80MB/s
Bigger values did not help much
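To read back the current setting before and after the change (hedged; this just applies mdb's 4-byte decimal format to the same tcp_cwnd_normal variable written above):
zfssa# echo 'tcp_cwnd_normal/D' | mdb -k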
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 18125379 | memory leak in l2arc on sync reads that fail | see below | 2013.1.2.0 |
Signature :
A typical kstate.out file will be as follows :
::arc -m
p = 4 MB
c = 64 MB << target ARC size has reached its min
c_min = 64 MB
c_max = 130030 MB
size = 98748 MB << current ARC size, but no data is evictable anymore
[...]
l2_evict_reading = 0
l2_abort_lowmem = 10240 << memory pressure : the kmem_cache_reap_now() thread has run
l2_cksum_bad = 504213672 << issue when data read from SSD cache devices is validated before
it is given to the ARC
l2_io_error = 0
l2_hdr_size = 29696 MB
* l2_cksum_bad will cause memory leak on sync reads (not on async reads) due to 18125379
* getting 'l2_cksum_bad' will make a read from the original storage device
Workaround :
SSDs do not have to be replaced.
There is no real workaround so far. A reboot will be needed to get around the issue, but we have analytics to measure arc.c, so we can take preemptive action and schedule a reboot before it gets messy. Proceed as follows :
- Add the following analytic, if not present already : **Cache - ARC target size : arc.c**
- Add an ALERT from **Configuration->Alerts panel**
* Click on (+) Threshold alerts.
* Select "Cache ARC target size"
* Select "fails below"
* Enter 10000 (MB)
* Fill the correct email address
- If you receive an alert, it will be time to schedule a NAS head reboot.
I chose **10000MB**, but it really depends on the memory size. Obviously, when arc.c has decreased to 64MB, it is already too late.
Beyond that, we still do not know why ‘l2_cksum_bad’ increased so much, and Victor Latushkin said this may not be a HW issue in the SSDs.
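To trend l2_cksum_bad outside of ::arc, a hedged kstat check can be scripted (the arcstats kstat lives under the zfs module; the statistic name is taken from the ::arc output above):
zfssa# kstat -p zfs:0:arcstats:l2_cksum_bad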
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 18562374 | missing call to arc_free_data_block in l2arc_read() | NA | 2013.2 |
| 18695640 | race between arc_write_done() and arc_evict_buf() | NA | as of Oct. 2017, this did not seem to be fixed |