All of the following information has to be taken “as is”; it has not been updated for three years now.
Though I am quite sure all of these bugs are fixed by now, it was worth listing them and sharing them with my fellow engineers who are still working on the ZFSSA.
Hope that helps.
Credits to all the engineers who worked with me at Sun and Oracle from 1999 to 2017.
General bugs list
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15020395 | performance decrease if 1st nameserver is down | | 2011.1.3 | |
| 15040725 | RFE for improved DNS resolver failover performance | | 2011.1.3 | |
| 15770608 | Enable DNS defer-on-fail functionality | | 2011.1.3 | |
| 15505861 | Write performance degrades when ZFS filesystem is near quota | | 2013.1.1.9 | Attempts were made for a 2011.1.6 IDR but regression tests failed |
| 15742475 | Extremely sluggish 7420 node due to heap fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
| 15770103 | Intermittent akd hangs – mutex owned by a thread stuck in zfs land | | 2011.1.4 | Impacts rollback and snapshot deletion. Refers to CR 7155750 |
| 15789968 | Unable to delete zombie snapshot causes CPU spikes | | 2011.1.5 | High CPU usage is an indicator |
| 15814083 | z_fuid_lock becomes too hot when zfs_fuid_find_by_idx() is heavily used | | 2013.1.2 | CIFS performance is bad. High CPU usage on the ZFSSA is a clear indicator |
| 15705245 | Elapsed time for zfs delete may grow quadratically | | 2011.1.5 | LUN or filesystem deletion may lead to a BUI hang for hours |
| 15710347 | System hung, lots of threads in htable_steal | limit arc_size | 2013.1.1.5 | |
| 15751065 | Appliance unavailable due to zio_arena fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
| 15710534 | Advance rotor before blocking on space_map_load | metaslab_unload_delay | 2011.1.6 | Sawtooth IOPS pattern during sync of metadata |
| 15790948 | soft-ring fanout is single threaded for VLAN | | 2013.1 | Throughput limited when VLANs are used |
| 17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | | IDR 2011.04.24.8.0,1-2.43.2.3 | spa_sync activity takes more than 5 sec |
| 15723911 | concurrency opportunity during spa_sync | sync=always | 2013.1.1.1 | 20% loss of throughput with RMAN backups |
| 15652868 | DTrace compiler doesn't align structs properly | disable some analytics | 2013.1 | ip.bytes[hostname] impacts performance |
| 16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | Performance issue when deleting clones |
| 18753152 | zil train and logbias=throughput, 2X performance opportunity | | 2013.1.2.8 | Performance issue impacting databases running OLTP workloads with an 8k recordsize and logbias=throughput |
| 17832061 | SMB read performance is not consistent | Avoid using different VLANs | 2013.1.2.9 | 50% bandwidth loss over different VLANs |
| 15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | metaslab_unload_delay | 2013.1.2.9 | |
| 15784616 | need an upper bound on memory footprint of l2hdr for large l2arc | | 2013.1.2.11 | L2ARC_HEADER_PERCENT added: default = 30% of ARC |
| 18609976 | large deletes cause NFS ops to drop to zero | | 2013.1.3.5 | To be monitored with the spawtf.d script (see the dtrace sketch after the stack below). Happens when large datasets are destroyed; spa_sync can last more than 10 minutes. The fix being pursued here also fixes two problems causing scan/scrub/resilver to overrun its time allocations. Prior to this, if a scrub is running, it consistently goes over 6s. With the fix, it is just over 5s, as intended. |
Typical stack is as follows:
genunix`avl_find+0x56
zfs`space_map_add+0xa6
zfs`zio_free+0x1a5
zfs`dsl_free+0x20
zfs`dsl_dataset_block_kill+0x19b
zfs`free_blocks+0x76
zfs`free_children+0x2c7
zfs`free_children+0x23f
zfs`free_children+0x23f
zfs`dnode_sync_free_range+0x113
zfs`dnode_sync+0x272
zfs`dmu_objset_sync_dnodes+0x7f
zfs`dmu_objset_sync+0x1c9
zfs`dsl_dataset_sync+0x63
zfs`dsl_pool_sync+0xdb
zfs`spa_sync+0x39a
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
2407
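The spawtf.d script itself is not reproduced here. As a rough substitute, a hedged dtrace sketch that measures how long each spa_sync() call takes (fbt entry/return probes on spa_sync; the aggregation label and the one-minute window are arbitrary choices):
# dtrace -n '
fbt::spa_sync:entry { self->ts = timestamp; }
fbt::spa_sync:return /self->ts/ { @["spa_sync ms"] = quantize((timestamp - self->ts) / 1000000); self->ts = 0; }
tick-60sec { printa(@); exit(0); }'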
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 20693077 | snapshot deletes don't check timeout while iterating | | 2013.1.4.6 | |
Hundreds of stacks similar to this seen onproc:
zfs`dbuf_hold_impl+0x1
zfs`dnode_hold_impl+0xc3
zfs`dnode_hold+0x2b
zfs`dmu_bonus_hold+0x32
zfs`bpobj_open+0x6d
zfs`bpobj_iterate_impl+0x35d
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate+0x23
zfs`dsl_scan_sync+0x11b
zfs`spa_sync+0x447
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19772024 | ls_reclaim() blocks kmem_cache_reap() | limit arc_size | 2013.1.3.1 | kmem_cache_reap thread stuck trying to retrieve some space from specific NFS cache |
| 19591405 | akd running in RT class is blocking critical kernel memory management | see below | 2013.1.6.0 | Some RT akd threads needing memory prevent kmem_cache_reap and the pageout scanner from running. See MOS Doc ID 2043383.1: some systems with LOW memory configurations might not run reliably/stably with 2013.1.x releases |
An admitted workaround is to set the scheduling class of the non-cluster akd threads to Fair Share (FSS), as follows :
zfssa # svccfg -s akd setprop akd/rt_off = "true"
zfssa # svccfg -s akd listprop | grep rt_off
akd/rt_off_cluster astring false
akd/rt_off astring true <<<<
zfssa # svcadm refresh akd
zfssa # svcadm restart akd
Must be done on both cluster nodes.
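To verify that the change took effect, a hedged check of the akd scheduling class with standard Solaris ps (the CLS column should no longer show RT for the non-cluster akd process):
zfssa # ps -efc | grep akd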
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 18914201 | Hole buffers are calculating zero checksums leading to performance slowdown | | 2013.1.3.1 | |
| 19137061 | top-level vdev allocations may go over mg_aliquot | | 2013.1.3.1 | The fix spreads IOs across vdevs in a more efficient way, making each vdev less busy |
| 19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | Reaper thread not able to free buffers fast enough. This causes NFS clients to experience IO latency. |
Typical stack is as follows :
unix`do_splx+0x97
unix`xc_common+0x230
unix`xc_call+0x46
unix`hat_tlb_inval+0x289
unix`hat_unload_callback+0x302
unix`hat_unload+0x41
unix`segkmem_free_vn+0x5a
unix`segkmem_free+0x27
genunix`vmem_xfree+0x10a
genunix`vmem_free+0x27
genunix`kmem_slab_destroy+0x87
genunix`kmem_slab_free+0x2c7
genunix`kmem_magazine_destroy+0xdf
genunix`kmem_depot_ws_reap+0x77
genunix`kmem_cache_magazine_purge+0xd6
genunix`kmem_cache_magazine_resize+0x41
genunix`taskq_thread+0x22e
unix`thread_start+0x8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19883033 | ZS3 encountering unexplained high CPU utilization | | 2013.1.3.3 | SMB issue mostly |
Using dtrace to count the number of calls from zfs_groupmember() to zfs_fuid_find_by_idx(), for the exact same workload against the same share as the same user:
| calls | version |
|---|---|
| 12843966 | ak-2013.1.3.0 |
| 848928 | ak-2013.1.3.3 |
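The counting itself can be done with a hedged dtrace one-liner along these lines (fbt entry probe plus the built-in caller variable; run it for a fixed interval while the CIFS workload is reproduced):
# dtrace -n '
fbt::zfs_fuid_find_by_idx:entry { @[func(caller)] = count(); }
tick-60sec { printa(@); exit(0); }'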
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15324796 | zfetch needs a whole lotta love | | 2013.1.4.4 | For sequential IO reads with a small blocksize, zfetch_block_cap can be increased (see the mdb sketch after this table). By default, zfetch_block_cap = 16; with a recordsize of 8k, ZFS will prefetch only 128K. A high number of vdevs will not help, as we just do not prefetch deep enough to keep them all busy. To get better disk usage, zfetch_block_cap has to be increased to 32 or even 64. See tuning_zfetch_block_cap.akwf attached to bug 19982476 (backport for ZFSSA). |
| 20361021 | Latency seen on upgrade to SMB2.1 with leases. SMB2.1 server is waiting for lease break ack | see below | 2013.1.4.7 | |
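For bug 15324796 in the table above, a hedged example of raising zfetch_block_cap on a live system, reusing the mdb write syntax shown elsewhere in this document (the 32-bit /W format is an assumption about the variable's type, and the change does not persist across reboots):
zfssa# echo zfetch_block_cap/W0t64 | mdb -kw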
SMB2.1 is available in 2013.1.3.0 and introduces leases. This feature works in a similar way to oplocks, where a 35-second delay can be seen, so just disabling oplocks from the BUI might not work.
You have to go back to SMB2.0 (to disable leases) as follows.
zfssa# svccfg -s smb listprop | grep maxprotocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 3 (SMB_D2_1)
zfssa# svccfg -s smb setprop smb/server_maxprotocol = astring: "2.0"
zfssa# svccfg -s smb listprop | grep protocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2.0
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 2 (SMB_D2_0)
zfssa# aksh
zfssa:> configuration services smb
zfssa:configuration services smb> disable
zfssa:configuration services smb> enable
Caveat : disabling oplocks will result in more traffic over the wire. The clients will be denied read/write caching of the data, and all requests have to go to the server.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19454536 | Improve the performance of SMB2_QUERY_DIRECTORY | | 2013.1.6.0 | Browsing CIFS shares with lots of files will make the ZFSSA CPUs busy … latencies can happen |
Signature :
zfssa# dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1/ { @[stack()] = count(); }
tick-10sec { trunc(@, 10); printa(@); exit(0); }'
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zaccess_aces_check+0x77
zfs`zfs_zaccess_common+0x85
zfs`zfs_zaccess+0x14d
zfs`zfs_fastaccesschk_execute+0x173
zfs`zfs_lookup+0xa8
genunix`vhead_lookup+0xd5
genunix`fop_lookup+0x128
smbsrv`smb_vop_lookup+0x17c
smbsrv`smb_fsop_do_lookup+0x14e
smbsrv`smb_fsop_lookup+0x27d
smbsrv`smb_odir_wildcard_fileinfo+0x6f
smbsrv`smb_odir_read_fileinfo+0x10e
smbsrv`smb2_query_directory_encode_dynamic+0xd5
smbsrv`smb2_query_directory_encode+0x7e
smbsrv`smb2_query_directory+0x84
smbsrv`smb2_dispatch_request_impl+0x237
smbsrv`smb2_dispatch_request+0x15
smbsrv`smb_session_worker+0x88
5672
Lockstat extract (32 cores) :
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Hottest Caller
2959889 95% 95% 0.00 224391 0xfffff6011546f100 zfs_root+0x58
nsec ------ Time Distribution ------ count Stack
256 | 98 fsop_root+0x2d
512 | 56784 smb_node_alloc+0x11e
1024 | 82330 smb_node_lookup+0x1fe
2048 |@ 187212 smb_fsop_do_lookup+0x49b
4096 |@@@ 372976 smb_fsop_lookup+0x27d
8192 |@@@@@@@@ 876803 smb_odir_wildcard_fileinfo+0x6f
16384 |@@@ 354769 smb_odir_read_fileinfo+0x10e
32768 |@ 167806 smb2_query_directory_encode_dynamic+0xd5
65536 |@ 107972 smb2_query_directory_encode+0x7e
131072 | 91142 smb2_query_directory+0x84
262144 |@ 107285 smb2_dispatch_request_impl+0x237
524288 |@ 150887 smb2_dispatch_request+0x15
1048576 |@ 182779 smb_session_worker+0x88
2097152 |@ 152531 taskq_d_thread+0xc1
4194304 | 66082 thread_start+0x8
8388608 | 2424
16777216 | 9
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23216069 | Missing flag in prefetch backport causes excessive prefetch activity | | 2013.1.6 | Related to 21974849 – RMAN backup over ASM leads to 1M random reads foiling zfetch. With the new zfetch implementation brought in 2013.1.4.4, RMAN backups through ASM saturate the SAS drives because of unexpected prefetching. |
With the new zfetch code, when subject to 1MB random reads, we actually do issue some readaheads, and given that the ASM Allocation Units can be interleaved, we are likely to end up not using the prefetched data.
This makes the disks very busy for no profit.
To compare the FC/iSCSI/NFS activity to the SAS2 disks activity, it can be interesting to add some
analytics as follows :
fc.bytes[op]
iscsi.bytes[op]
nfs3.bytes[op]
...
This can even make the CPUs busy with the following stack :
zfs`dbuf_prefetch+0x1
zfs`dmu_zfetch+0x10d
zfs`dbuf_read+0x128
zfs`dmu_buf_hold_array_by_dnode+0x1cc
zfs`dmu_buf_hold_array+0x6f
zfs`dmu_read_uio+0x4b
zfs`zfs_read+0x260
genunix`fop_read+0xa9
nfssrv`rfs3_read+0x3ec
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
47770
The workaround we implemented was to tune zfetch_maxbytes_lb=128K and zfetch_maxbytes_ub=256K. This keeps prefetch enabled but reduces the amount of prefetched data that needs to be thrown away. This is not a very comfortable workaround.
# echo zfetch_maxbytes_ub/W0t262144 | mdb -kw
zfetch_maxbytes_ub: 0x2000000 = 0x40000
# echo zfetch_maxbytes_lb/W0t131072 | mdb -kw
zfetch_maxbytes_lb: 0x400000 = 0x20000
From dmu_zfetch.c :
/* Max amount of data to prefetch at a time */
unsigned int zfetch_maxbytes_ub = 32 << MB_SHIFT;
unsigned int zfetch_maxbytes_lb = 4 << MB_SHIFT;
unsigned int zfetch_target_blks = 256;

/*
 * Provides an upper bound on the prefetch window size for a single prefetch.
 * We try to prefetch zfetch_target_blks (256) but stay within the size bounds
 * [zfetch_maxbytes_lb (4 MB), zfetch_maxbytes_ub (32 MB)]
 *
 * BLOCKSIZE    MAX PREFETCH WINDOW SIZE
 * 128K-1MB     32MB
 * 32K-64K      256 blocks (8MB - 16MB)
 * < 32K        4MB
 */
static unsigned int
dmu_zfetch_max_blocks(uint64_t blkshft)
{
        unsigned int blks = zfetch_target_blks;
        unsigned int size = blks << blkshft;

        if (size > zfetch_maxbytes_ub)
                return (zfetch_maxbytes_ub >> blkshft);

        if (size < zfetch_maxbytes_lb)
                return (zfetch_maxbytes_lb >> blkshft);

        return (blks);
}
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 21071219 | Comstar framework should allow a cmd to abort prior to ZFS I/O completion | | 2013.1.5.7 & 2013.1.6 | Relates to 22080255 : ZFSSA backend "resilvering" causes an FC front-end port to become inaccessible to the remote host |
We encountered this issue when we were investigating CR 2829652. Basically, on the qlt FC front end of the ZFSSA, if a SCSI command is detected as timed out by the remote host, the remote host issues an ABTS and retries the ABTS if the first one is not processed in time. If the second ABTS is not processed in time, a LOGO is issued by the remote host and the I/O path may be lost permanently (see CR 21071136) as a result. For a ZFSSA using FC as the transport interface, the capability to handle ABTS in a timely manner is critical so that the host can complete its error recovery and retry IOs without the risk of losing the IO path as a result of a LOGO. The AK8 kernel and COMSTAR stack do not have a way for fct/stmf to allow a command to abort prior to the ZFS backend IO completion (pmcs may be involved in most cases).
traces :
*stmf_trace_buf/s
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)
(ah)(fffff600e30dd480, fffff600e30dd510)(FFFFFFFFh FFFFFFFFh)(0)(B070
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh)
(ah)(fffff600e30dd480, fffff600e30dd
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts,
*qlt_state::walk softstate |::qltgettrace : fc/qlttrace
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, (0)(2C0h 724h)
(4h)(fffff607a0c25700, fffff607a0c25790)(FFFFFFFFh 126B10h)(0)(603
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, )
[2016 May 25 22:01:48:229:544:907] !46891=>QEL qlt(2,0):: qlt_abort_cmd, cmd_type = FCT_CMD_FCP_XCHG
[2016 May 25 22:01:48:229:547:011] !46892=>QEL qlt(2,0):: qlt_abort_unsol_scsi_cmd, qi(0) fctcmd-
fffff604a0877c80(2C0h, FFFFh),9d000477 rex2=126B10h
[2016 May 25 22:01:48:229:562:586] !46893=>QEL qlt(2,0):: qlt_handle_ctio_completion, abort(oxid=2C0h):
hndl-9d170477, 1, fffff6007dcc6b40 (126B10h)
[2016 May 25 22:01:48:229:566:267] !46894=>QEL qlt(2,0):: qlt_handle_ctio_completion,
0(fffff604a0877c80,fffff604a0877d10)(2C0h,FFFFh),9d170477 126B10(2016, 0)
[2016 May 25 22:01:48:784:693:584] !46895=>QEL qlt(2,0):: qlt_send_abts_response,
(160h 5E1h)(0)(fffff607998c6e00 fffff607998c6e90) (FFFFFFFFh 122E80h)
[2016 May 25 22:01:48:784:746:901] !46896=>QEL qlt(2,0):: qlt_handle_abts_completion, qi(0) status =0h,
(160h 5E1h)
[2016 May 25 22:01:48:796:029:698] !46897=>QEL qlt(2,0):: qlt_send_abts_response,
(1BAh 63Eh)(0)(fffff6058ccd3c40 fffff6058ccd3cd0) (FFFFFFFFh 123FF0h)
The spagettime.d script will show a high sync gap, as follows :
# cat spagettime.Pool_02.2016-05-25_14\:14\:10 | nawk '{ if ($16>5000) print}'
2016 May 25 14:15:21 txg : 9996375; walltime:759 ms ; sync gap:14845 ms ; txg delays:0 ; txg stalled:0
2016 May 25 16:05:37 txg : 9997695; walltime:1449 ms ; sync gap:7279 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:01:49 txg : 10001958;walltime:1175 ms ; sync gap:25238 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:03:27 txg : 10001976;walltime:1065 ms ; sync gap:11427 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:12:45 txg : 10002087;walltime:1287 ms ; sync gap:5106 ms ; txg delays:0 ; txg stalled:0
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23526362 | mptsas driver TRAN_BUSY on readzilla degraded shadow-migration performance | | 2013.1.6.11 | |
Signature : iostat -xnz 1 – a wait queue > 0 is a smoking gun, together with actv >= 1 :
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
124.0 61.0 1423.0 38681.3 5.0 1.3 26.8 7.3 66 68 c0t5001E82002763F90d0 << READZILLA
263.0 51.0 2233.4 36262.7 4.9 1.4 15.7 4.4 65 71 c0t5001E82002763F90d0
95.0 61.0 956.0 43606.2 6.8 1.8 43.6 11.8 89 95 c0t5001E82002763F90d0
288.0 56.0 4157.5 41983.7 5.5 1.5 16.0 4.4 73 76 c0t5001E82002763F90d0
172.0 64.0 1422.9 43494.1 5.4 1.4 22.8 6.1 69 74 c0t5001E82002763F90d0
55.0 58.0 517.1 40532.3 5.2 1.4 45.8 12.4 68 70 c0t5001E82002763F90d0
wait : average number of transactions waiting for service (queue length), i.e. the host queue depth.
actv : number of pending I/O requests in the active queue. This represents the number of threads doing
IOs to the same device (cf. Roch).
L2ARC devices (SANDISK 1.6TB) are 100% busy, but critically have only 1 IO in the active queue and 9 in the wait queue. Blaise suspects it is because the L2 improvements in recent AK releases mean the L2ARC is caching much better, so the bug is triggered more easily.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 20466027 | zfs_zget is causing high cpu usage | | 2013.1.6.12 | |
Signature from a dtrace profile :
unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_read+0x4d
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
4652
nfssrv`rfs3_write+0xab0
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
10390
unix`mutex_delay_default+0x7
unix`lock_set_spl_spin+0xc7
genunix`disp_lock_enter+0x2d
genunix`turnstile_lookup+0x69
unix`mutex_vector_enter+0x1b0
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_write+0x4a
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
19267
The fix was to remove the mutex and replace it with a NUMA-aware reader/writer lock, grabbed as a reader in the normal zget cases. The zget code then has to deal with the races that can occur when the same znode is looked up at the same time.
Kernel memory related issues
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15942559 | ZFS data can not be easily reclaimed for other uses | | ak9 | |
| 15775761 | Converting memory between kmem caches can be a limiting factor | | no fix | Points to 15942559 – ZFS data can not be easily reclaimed for other uses |
| 15796281 | kmem taskq processing can wedge the system | | 2013.1.6.3 IDR IDR #4.1 | The fix design was to change kmem_taskq into a standalone workq with a dedicated service thread. Each cache could have a bitmap of possible operations that the service thread could perform, along with a "busy" bit. When async work needs to be done on a cache, the thread requesting the work would set the corresponding bit and enqueue the cache on the global list. When the kmem thread wanted to do work, it would set the "busy" bit on a cache, perform the operations specified by the bitmap, then clear the "busy" bit. If a kmem_cache_destroy() needed to come in, it could grab the list lock, wait for the "busy" bit on its cache to be clear (hopefully it would be clear to begin with), remove the cache from the list, and then manually do whatever processing it needed to do before destroying the cache. |
| 19294541 | kmem cache reaping should be made more scalable | | 2013.1.6.3 IDR IDR #4.1 | |
Caches waiting to be processed by kmem_async are queued on a list :
> kmem_caches_processing=J
fffffffffc0b99c0
> fffffffffc0b99c0::walk list |::kmem_cache
ADDR NAME FLAG CFLAG BUFSIZE BUFTOTL
ffffc10000047030 kmem_alloc_256 120f 00200000 256 677052 << processing this cache
ffffc1010e5c8030 dnode_t 120f 04000000 752 16760 << waiting
ffffc1010e5d4030 zfs_znode_cache 120f 00000000 280 10842 << waiting
ffffc10113b2d030 zio_buf_512 1200 01020000 512 16816 << this cache may wait a long
time before getting processed
Synchronous processing of caches does not scale; things can go bad especially in low memory conditions.
The solution was to split kmem_reap into 2 steps :
1. All “background processing” on a given cache (reap or update) is done by a single thread at a time. Multiple threads can run for different buffers at the same time. This is the “scalable” part.
2. The “global cleanup” work (re-scheduling timeouts, etc.) can be done after all processing related to the base work (reap or update) is complete.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 23555751 | ZFSSA – FC VM frozen for more than 10 mins | see below | 2013.1.6.3 IDR IDR #4.1 | Points to 15796281 and 19294541 |
echo "arc_shrink_shift/W 6"| mdb -kw (default is 7)
echo "arc_evict_chunk_minbufs/Z 0t256"|mdb -kw
Adjust arc_evict_chunk_minbufs to 256 from its default value of 128 – See 26264372.
echo "arc_evict_chunk_minbufs/Z 0t256"|mdb -kw
Tuning arc_shrink_shift will have the effect of making zfs reacting to low memory at an early point.
See arc.c/arc_reclaim_thread() and redzone calculation used by arc_reclaim_needed() :
redzone = 2 * btop(arc_size >> arc_shrink_shift);
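As a rough worked example (my numbers, not taken from the bug): with arc_size around 1.3 TB, as in the trace below, a shift of 7 gives redzone = 2 * btop(1317113940032 >> 7), i.e. about 5 million 4K pages or roughly 20 GB, while a shift of 6 doubles that to roughly 40 GB, so the ARC starts shrinking while considerably more memory is still free.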
The reaper_trace.d script may be used to see how long it takes for the arc_get_data_block() function to return when throttling happens.
Here we see that it can take 37 secs to return with some throttles :
52 43171 arc_get_data_block:return 31658262460856570 2016 Jul 29 17:15:40
fffffe0c06f84c20 37658131850 throttle 1877 arn 1878 aen 1878 freemem 9435104 arc_size 1317113940032
53 43171 arc_get_data_block:return 31658262460053455 2016 Jul 29 17:15:40
fffffe0c0582dc20 31811415899 throttle 1591 arn 1592 aen 1592 freemem 9435104 arc_size 1317112963016
The script has to be run as follows :
cli> confirm shell
# mkdir /var/tmp/tsc
# cd /var/tmp/tsc
# vi reaper_trace.d << copy the content of reaper_trace.d
# chmod u+x reaper_trace.d
# vi reaper_collection.sh << copy the content of reaper_collection.sh
# chmod u+x reaper_collection.sh
# ./reaper_collection.sh start
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 22256256 | Lengthy kmem_reap() processing can block progress of arc_reap_thread | | 2013.1.5.7 | kmem_cache_reap_async() added to offload advisory reaps to kmem-internal taskq threads. That prevents advisory reaps from blocking behind the async kmem reap activity if they happen to be working on the same kmem cache. |
| 22344599 | Final detach at the end of a resilvering leads to high spa_sync time | no workaround | 2013.1.6.0 | Relates to 22256256 : Lengthy kmem_reap() processing can block progress of arc_reap_thread. Typical signature is kernel memory reorganization at the end of a disk replacement, when the final spare detach happens. |
The arc_delays.d script will report some kernel memory buffers being reaped :
2015 Dec 9 22:04:40 freemem 4054 MB
2015 Dec 9 22:04:50 freemem 4546 MB; reaping in progress
2015 Dec 9 22:05:00 freemem 5313 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
reaping in progress
2015 Dec 9 22:05:10 freemem 5334 MB; reaping in progress
2015 Dec 9 22:05:20 freemem 4799 MB; reaping in progress
2015 Dec 9 22:05:30 freemem 4702 MB; reaping in progress
2015 Dec 9 22:05:40 freemem 5368 MB; reaping in progress
2015 Dec 9 22:05:50 freemem 6196 MB; reaping in progress
2015 Dec 9 22:06:00 freemem 6964 MB; reaping in progress
2015 Dec 9 22:06:10 freemem 7985 MB; reaping in progress
2015 Dec 9 22:06:20 freemem 9085 MB; reaping in progress
2015 Dec 9 22:06:30 freemem 10152 MB; reaping in progress
2015 Dec 9 22:06:40 freemem 11409 MB; reaping in progress
2015 Dec 9 22:06:50 freemem 12681 MB; reaping in progress
Looking at the analytics, the end of the resilvering coincides with some kernel memory reorganization.
See how the bpmap_cache buffer size decreases :
lbl-1656:analytics dataset-035> show
Properties:
name = kmem.fragmented[cache]
grouping = Memory
explanation = kernel memory lost to fragmentation broken down by kmem cache
incore = 4.89M
lbl-1656:analytics dataset-035> read 2015-12-09 22:04:49 1
DATE/TIME MB MB BREAKDOWN
2015-12-9 22:04:49 17393 11495 bpmap_cache
1610 space_seg_cache
659 dmu_buf_impl_t
536 l2arc_buf_t
518 arc_buf_t
364 zio_cache
158 arc_ref_t
157 arc_elink_t
144 dnode_t
lbl-1656:analytics dataset-035> read 2015-12-09 22:06:10 1
DATE/TIME MB MB BREAKDOWN
2015-12-9 22:06:10 10753 4979 bpmap_cache
1598 space_seg_cache
596 dmu_buf_impl_t
536 l2arc_buf_t
489 arc_buf_t
361 zio_cache
164 zio_data_buf_8
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | |
| 20645565 | arc_throttle logic can delay zfs operations under low memory conditions | | 2013.1.6.3 IDR IDR #4.1 | |
a) Excessive wait time in kmem taskq :
arc_reaper_thread dispatches the reap request to kmem_taskq. kmem_taskq is single-threaded and is responsible for multiple kmem maintenance tasks, not just reaping. If there are tasks pending in the taskq, it may take a while for our task to get to the front of the queue and execute.
b) Latency in taskq_wait :
After arc_reaper_thread dispatches the reaping task to the taskq, it calls taskq_wait to wait for the task to be done. Once it returns from taskq_wait, it signals arc_reaper_thread so that it may shrink the ARC further and evict more buffers if needed. However, it seems we wait in taskq_wait until all the pending tasks in the queue are complete, not just for the completion of the task dispatched by arc_reaper_thread. For example, if some other thread calls kmem_reap, that may dispatch lots of reaping tasks to the kmem taskq; in that case, we keep waiting until these other reaping tasks are also complete.
It seems the fix for the following bug may help to solve this problem, but it was not backported to AK : 15796281 – kmem taskq processing can wedge the system.
c) The red zone size used for reacting to memory shortage :
The memory throttle logic in arc_get_data_block puts the calling thread to wait if the allocation of additional memory would make freemem fall below (desfree + lotsfree). On a 512GB system, (desfree + lotsfree) is approximately 12GB. To avoid entering the throttle logic and putting the threads to wait for memory, we may consider reacting to the low memory situation at an earlier point; we shouldn't wait until freemem goes as low as 12GB. In the arc_reclaim_thread logic, that early point is decided based on the size of the red zone. The red zone size probably also needs some tuning so that we react much earlier.
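To see how close a given system is to that throttle point, a hedged check of the relevant VM variables (freemem, lotsfree and desfree are standard Solaris kernel variables; the values are in pages):
# echo "freemem/E ; lotsfree/E ; desfree/E" | mdb -k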
zfs_signature.sh
The zfs_signature.sh script was commonly used to narrow down performance issues possibly related to ZFS. This was written by Roch Bourbonnais and Blaise Sanouillet, 2 gurus I miss a lot.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 18125379 | memory leak in l2arc on sync reads that fail | | 2013.1.2.0 | |
See the l2feeding.d script comments :
* The output below with write BW under the target BW, and all devices able to
* keep up, is a signature for:
* 16564195 ARC l_evict_lock contention in arc_evict_bytes()
*
* 2014 Nov 12 12:08:08
* Data eligible for L2, evicted but not fed: 352 MB/s
* L2 feeding suspended due to low memory: 0
*
* L2 device Waiting for data Not keeping up Write BW Target BW
* fffff600afd5b940 0 0 7 MB/s 40 MB/s
* fffff600e6e2f280 0 0 8 MB/s 40 MB/s
* fffff600f5be52c0 0 0 8 MB/s 40 MB/s
* fffff600e836ea00 0 0 7 MB/s 40 MB/s
lockstat gives this :
Adaptive mutex spin: 4123264 events in 30.136 seconds (136823 events/sec)
-------------------------------------------------------------------------------
ops/s indv cuml rcnt nsec Lock Caller
1973 19% 19% 0.00 4254743 ARC_mru+0xa0 arc_evict_bytes+0x28
nsec ------ Time Distribution ------ ops/s Stack
128 | 0 arc_get_data_block+0xe4
256 | 8 arc_read+0x64c
512 | 20 dsl_read+0xac
1024 | 47 dbuf_read_impl+0x22c
2048 |@ 115 dbuf_read+0x100
4096 |@@@ 215
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 16564195 | ARC l_evict_lock contention in arc_evict_bytes | | 2013.1.4.5 | |
Adaptive mutex spin: 5320606 events in 62.214 seconds (85521 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Hottest Caller
2611062 80% 80% 0.00 124236 ARC_mru[272] arc_evict_bytes+0x5e
nsec ------ Time Distribution ------ count Stack
256 | 584 arc_get_data_block+0xc7
512 | @ 91853 arc_read+0x606
1024 | @ 78003 dsl_read+0x34
2048 | @ 150450 dbuf_prefetch+0x1e6
4096 | @@@ 300149 dmu_zfetch_fetch+0x5f
8192 | @@@@@@@ 640676 dmu_zfetch+0x10d
16384 | @@@@ 426892 dbuf_read+0x128
32768 | @@ 246084 dmu_buf_hold_array_by_dnode+0x1cc
65536 | @ 135358 dmu_buf_hold_array+0x6f
131072 | @ 117919 dmu_read_uio+0x4b
262144 | @ 124844 zfs_read+0x260
524288 | @ 127585 fop_read+0xa9
1048576 | @ 93911 rfs3_read+0x4e0
2097152 | @ 53434 common_dispatch+0x53f
4194304 | @ 20964 rfs_dispatch+0x2d
8388608 | @ 2233 svc_process_request+0x184
16777216 | @ 76 stp_dispatch+0xb9
33554432 | @ 21 taskq_d_thread+0xc1
67108864 | @ 23 thread_start+0x8
134217728 | @ 3
Bug 19297091 – arc_evictable_memory has needless locks – is part of the fix.
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15648057 | lack of accounting for DDT in dedup'd frees can oversubscribe txg | | 2010.Q1.2.0 | See spagetti.d script output. walltime » 5 seconds, sync gap close to 0, arc misses from “DDT ZAP algorithm” in the 100s |
| 18281514 | spa_sync/space_map_map delay in dbuf_read causes NFS thread to pause | | 2013.1.2.9 | See spagetti.d script output. walltime » 5 seconds, sync gap close to 0, arc misses from “SPA space map” in the 100s |
| 16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | See spagetti.d script output. Signature for dataset / large file destroy pathologies. walltime » 5 seconds, sync gap close to 0, dsl_dataset_destroy_sync() on-cpu time > 2 seconds |
| 19621408 | do_user_quota_updates can consume 10% of spa_sync time in some cases | | no fix | See spagetti.d script output. walltime >= 5 seconds, do_user_quota_updates() time > 1 sec (or even bigger). |
-------- spa_sync exclusive time breakdown by tracked function :
Elapse ms On-cpu ms
<total> 5252 487
spa_sync 6 4
dsl_pool_sync 2469 277
metaslab_sync 3 1
metaslab_sync_reassess 2 2
metaslab_sync_done 125 122
dmu_objset_do_userquota_updates 2193 55 <<<<
taskq_wait -> vdev_sync 2 0
taskq_wait -> vdev_sync_done 109 0
vdev_config_sync 95 1
dsl_dataset_destroy_sync 0 0
zio_wait 0 0
lost 248 25
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 15623702 | Write throttling starts too late (arc_memory_throttle) | | no fix | See spagetti.d script output. Pattern of 2 out of 3 consecutive spa_syncs with dp_write used ~0 and ARC mem throttle > 0 |
| 15744920 | read throttle required to survive read write storm | | 2013.1.5 | See spagetti.d script output. walltime >= 5 seconds, sync gap close to 0, no arc misses, dp_write thput limit ~= min + txg stalls and delays – see spagetti.d details. Setting zfs_write_limit_min to 24 * 100MB * #HDD has been shown to alleviate the case of writes stalled for minutes |
For a 20-disk pool (10 vdevs), this gives : 24 * 100MB * 20 = 48000000000 bytes
zfssa# echo zfs_write_limit_min/Z0t48000000000 | mdb -kw
zfs_write_limit_min: 0x2000000 = 0xb2d05e000
This has the effect of diminishing the max throughput for the pool, but it makes the txg
stalls and delays disappear.
Before tuning zfs_write_limit_min to 48000000000, we may see a very fluctuating
"used throughput" for TXGs (second column) :
min/used/thput limit/free limit : 1/ 23/ 616/ 1179190
min/used/thput limit/free limit : 1/ 409/ 973/ 1179174
min/used/thput limit/free limit : 1/ 696/ 1347/ 1179146
min/used/thput limit/free limit : 1/ 708/ 1631/ 1179117
min/used/thput limit/free limit : 1/ 85/ 1464/ 1179099
min/used/thput limit/free limit : 1/ 23/ 1176/ 1179099
min/used/thput limit/free limit : 1/ 24/ 983/ 1179114
min/used/thput limit/free limit : 1/ 24/ 810/ 1179114
min/used/thput limit/free limit : 1/ 23/ 656/ 1179114
min/used/thput limit/free limit : 1/ 11/ 537/ 1179115
min/used/thput limit/free limit : 1/ 1/ 409/ 1179116
After the tuning, it is more constant :
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179075
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179071
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179066
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179061
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179057
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 16275386 | metaslab_group_alloc shouldn't block on space_map_load | see below | 2013.1.6 | When destroying a dataset, entries are added to a chain list on disk for a non-loaded spacemap. When this spacemap is loaded for further writes, the chain list has to be parsed to determine where free space is. This makes a lot of random IO reads, and this takes time while we are in the middle of IO writes. From bug 16275386 : It turns out that this issue is more general than previously thought. During txg sync, if an allocation on vdev 1 is waiting on space_map_load, further allocations will be directed to vdev 2 until it reaches its aliquot, then vdev 3, etc. Eventually it will come back to vdev 1 and also block waiting on space_map_load. This means vdev 1 won't be kept busy and so will reduce the write throughput, even when the ZIL is not in play. It should be possible to have a space map loaded asynchronously in advance so that metaslab_group_alloc() can always proceed without blocking, modulo corner cases. |
Aliquot : mg_aliquot = 512K * (# leaf vdevs)
Normally, mg_aliquot is used as a limit to move to the next vdev (or metaslab_group).
In metaslab_ndf_fragmented(), mg_aliquot is used to decide whether we need to move to another metaslab
if (maxsize < mg_aliquot).
Workaround :
# echo "metaslab_unload_delay/W 0x7FFFFFFF" | mdb -kw
| Bug number | Title | Workaround | Fix | Comments |
|---|---|---|---|---|
| 17856080 | arc_meta_used is not necessarily all MRU | see below | 2013.1.7 | Symptoms : IO reads can be slow, especially for backups. Disks can be busy due to lack of hits in MRU ( see bug 22818942 ). Disks busy on IO reads due to no evictable space |
Signature : arc_delays.d
2016 Apr 12 09:21:02 freemem 8359 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
2016 Apr 12 09:21:12 freemem 7909 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
These messages mean that the MRU buffer is too small to keep the most recently used data.
The issue comes from the arc_adjust() function, which adjusts the size of the MRU or MFU buffers
if arc_size > arc_c. arc_adjust() can be used when ZFS needs to reduce its memory consumption.
The current algorithm (Apr 2016) takes arc_meta_used into account, but "arc_meta_used is not
necessarily all MRU". That means we reduce the MRU too much.
The adjust_arc_p.sh script tries to artificially increase arc_p (the target size of the MRU) to 3 times the value
of meta_used. It limits arc_p to half of arc_c (the target size of the cache). If meta_used and arc_p
are already too high, the script does not take any action but displays a message.
For example, here arc_p is too small :
p = 11696 MB /* target size of MRU */
c = 187039 MB /* target size of cache */
c_min = 64 MB /* min target cache size */
c_max = 261101 MB /* max target cache size */
size = 187039 MB /* actual total arc size */
[..]
meta_used = 21967 MB
meta_max = 41046 MB
meta_limit = 0 MB
Thanks to the script, p will go to 60GB (that is still below 187GB/2).
Before, with tuning_zfs_arc_p_min_shift.akwf, arc_p was set to 1/8 of arc_size (reboot needed).
Apparently, this may not be enough.
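The adjust_arc_p.sh script is not reproduced here, but a hedged manual equivalent follows the same mdb pattern used for arcstat_c_max later in this document: print the address of arcstat_p, then write the new value (0xF00000000, about 60 GB, matching the example above) to that address. The field name arcstat_p and its direct writability are assumptions on my part:
zfssa# echo "arc_stats::print -ta arc_stats_t arcstat_p.value.i64" | mdb -k
zfssa# echo "<address printed above>/Z 0xF00000000" | mdb -kw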
Bugs under the bonnet
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15505861 | Write performance degrades when ZFS filesystem is near quota | Avoid quota usage, Free some space by deleting some datasets | 2013.1.1.9 |
Signature :
Typical stack when reaching filesystem size limit :
> ::stacks -c dsl_dir_tempreserve_impl
zfs`dsl_dataset_check_quota
zfs`dsl_dir_tempreserve_impl+0xc2
zfs`dsl_dir_tempreserve_space+0x13e
zfs`dmu_tx_try_assign+0x208
zfs`dmu_tx_assign+0x2a
zfs`zfs_write+0x63c
genunix`fop_write+0xa6
genunix`write+0x309
genunix`write32+0x22
unix`_sys_sysenter_post_swapgs+0x149
Actually, we get in trouble when the return code is 91 : ERESTART. This can lead to long spa_sync time.
dsl_dataset.c
dsl_dataset_check_quota(dsl_dataset_t *ds, uint64_t inflight)
{
        int error = 0;

        mutex_enter(&ds->ds_lock);
        /*
         * If they are requesting more space, and our current estimate
         * is over quota, they get to try again unless the actual
         * on-disk is over quota and there are no pending changes (which
         * may free up space for us).
         */
        if (ds->ds_quota != 0 &&
            ds->ds_phys->ds_used_bytes + inflight >= ds->ds_quota) {
                if (inflight > 0 || ds->ds_phys->ds_used_bytes < ds->ds_quota)
                        error = ERESTART;
                else
                        error = EDQUOT;
        }
        mutex_exit(&ds->ds_lock);
        return (error);
}
Dtrace can be of great help to check the return values of dsl_dataset_check_quota() and dsl_dir_tempreserve_space(). If we get something different from 0, we may be hitting the issue.
# dtrace -n 'dsl_dataset_check_quota:return /arg1!=0/ { trace(arg1) }'
CPU ID FUNCTION:NAME
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
# dtrace -n 'dsl_dir_tempreserve_space:entry { printf("%s a:%lx f:%lx u:%lx",
stringof(args[0]->dd_myname),arg2,arg3,arg4)}
dsl_dir_tempreserve_space:return { trace(arg1)}'
CPU ID FUNCTION:NAME
0 44317 dsl_dir_tempreserve_space:entry prepcc-t1_oradata2-Clone_DAA1
0 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
3 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
3 44318 dsl_dir_tempreserve_space:return 91
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15710347 | system hung, lots of threads in htable_steal | See below | 2013.1.1.5 |
Signature :
> ::stacks -c htable_steal
THREAD STATE SOBJ COUNT
fffff60011597760 RUN <NONE> 9
swtch+0x150
turnstile_block+0x760
mutex_vector_enter+0x261
htable_steal+0x1a4
htable_alloc+0x248
htable_create+0x1dc
hati_load_common+0xa3
hat_memload+0x81
hat_memload_region+0x25
segvn_faultpage+0x937
segvn_fault+0xc13
as_fault+0x5ee
pagefault+0x99
trap+0xe63
Workaround :
arc_stats::print -ta arc_stats_t arcstat_c_max.value.i64
fffffffffbd43680 int64_t arcstat_c_max.value.i64 = 0x5bef18000
> fffffffffbd43680/Z0x4000000
arc_stats+0x4a0:0x5bef18000 = 0x4000000 <<< 64MB
> ::arc -m ! grep max
c_max = 64 MB
meta_max = 889 MB
In practice, use a larger value such as :
0x400000000 << 16GB
0x600000000 << 24GB
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15742475 | AK-2011.04.24 Extremely sluggish 7420 node due to heap fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
memstat shows that we have plenty of memory free, however a "::stacks -m zfs" shows that we have a number
of zfs stacks waiting for allocations.
>::stacks -m zfs
ffffffaac7a07440 SLEEP CV 36
swtch+0x145
cv_wait+0x61
vmem_xalloc+0x635
vmem_alloc+0x161
segkmem_xalloc+0x9
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15751065 | appliance unavailable due to zio_arena fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
We haven't run out of available memory.
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 8480307 33126 6%
ZFS File Data 120780433 471798 90% << ZFS ARC cache for data
Anon 231541 904 0%
Exec and libs 73 0 0%
Page cache 6816 26 0%
Free (cachelist) 22290 87 0%
Free (freelist) 4691874 18327 3 %
Total 134213334 524270
Physical 134213333 524270
We have some threads waiting for some memory allocation
> ::stacks -m zfs
THREAD STATE SOBJ COUNT
ffffff86b38ab440 SLEEP CV 23
swtch+0x150
cv_wait+0x61
vmem_xalloc+0x63f
vmem_alloc+0x161
segkmem_xalloc+0x90
segkmem_alloc_vn+0xcd
segkmem_zio_alloc+0x24
vmem_xalloc+0x550
vmem_alloc+0x161
kmem_slab_create+0x81
kmem_slab_alloc+0x5b
kmem_cache_alloc+0x1fa
zio_data_buf_alloc+0x2c
arc_get_data_buf+0x18b
arc_buf_alloc+0xa2
arc_read_nolock+0x12f
arc_read+0x79
dsl_read+0x33
dbuf_read_impl+0x17e
dbuf_read+0xfd
dmu_tx_check_ioerr+0x6b
dmu_tx_count_write+0x175
dmu_tx_hold_write+0x5b
zfs_write+0x655
fop_write+0xa4
rfs3_write+0x50e
common_dispatch+0x48b
rfs_dispatch+0x2d
svc_getreq+0x19c
svc_run+0x171
svc_do_run+0x81
nfssys+0x760
_sys_sysenter_post_swapgs+0x149
from ::kmastat -g, the buffers consuming memory are for data
>::kmastat -g ! grep -v 0G
zio_data_buf_65536 65536 4694099 4694101 286G 21342088 0
zio_data_buf_131072 131072 1424500 1424500 173G 8844051 0
...
zfs_file_data 460G 512G 0G 10832435 0
zfs_file_data_buf 460G 460G 460G 10843627 0
Looking at the thread stacks, we were waiting for a memory allocation of just 0x20000 bytes (128KB)
>::stacks -c vmem_xalloc
stack pointer for thread ffffff86b38ab440: ffffff03d56a8700
[ ffffff03d56a8700 _resume_from_idle+0xf1() ]
ffffff03d56a8730 swtch+0x150()
ffffff03d56a8760 cv_wait+0x61(ffffff84949ac01e, ffffff84949ac020)
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4) <<< 0x20000
ffffff03d56a8900 vmem_alloc+0x161(ffffff84949ac000, 20000, 4)
...
]
Most of the waiters on this condition variable are nfsd daemons
> ffffff84949ac01e::wchaninfo -v
ADDR TYPE NWAITERS THREAD PROC
ffffff84949ac01e cond 60: ffffff8570a5db60 akd
ffffff85a1baabc0 akd
ffffff86bdf66880 akd
ffffff85a1af6420 nfsd
ffffff85e09da420 nfsd
ffffffbd7b172bc0 nfsd
ffffff86bc6450e0 nfsd
ffffff86b38ab440 nfsd
ffffff85a1cdf0c0 nfsd
ffffffbd7b16e100 nfsd
...
We have many ZFS threads doing some memory allocation
> ::stacks -m zfs|::findstack -v ! grep vmem_xalloc | grep ffffff84949ac000 | head
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3cb08a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d4e498a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d37dc8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d06288a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d43708a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d35f58a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3d768a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d38368a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d382a8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
Walking the free vmem_segs of vmem_t ffffff84949ac000, we have many free segments of 0x10000 bytes (64KB)
but none of 0x20000 bytes (128KB):
> ffffff84949ac000::walk vmem_free | ::vmem_seg -s ! awk '{print $4}' | sort | uniq -c
108 0000000000001000
97 0000000000002000
166 0000000000003000
148 0000000000004000
...
09 000000000000f000
837558 0000000000010000 << 64KB
30 0000000000011000
34 0000000000012000
42 0000000000013000
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15710534 | SUNBT7038138 advance rotor before blocking on space_map_load | See below | 2011.1.6 |
Signature :
Time spent to read metadata during sync is responsible for the sawtooth pattern of IOPS
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff"|mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15790948 | SUNBT7167903 pre-6998140 soft-ring fanout is single threaded for VLANs | Do not use VLAN under 2011.1 if you need high throughput | 2013.1 |
Signature :
* 1 or 2 CPUs 99% busy
* High level of interrupts (see the mpstat example below)
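A quick, hedged way to confirm this signature from the shell is plain mpstat (standard Solaris; the 5-second interval is arbitrary). Look for one or two CPUs with very high intr and sys values while the other CPUs stay mostly idle:
zfssa# mpstat 5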
VLAN usage :
# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
aggr1 type: non-vlan mtu: 1500 aggregation: key 1
aggr111001 type: vlan 111 mtu: 1500 aggregation: key 1
aggr112001 type: vlan 112 mtu: 1500 aggregation: key 1
aggr23002 vnic 1500 up -- aggr2 <<< VLANID - 23 of aggr2
aggr21002 vnic 1500 up -- aggr2
aggr22002 vnic 1500 up -- aggr2
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | see below | 2013.1.2.9 |
Signature :
Run the spacemap_15754028.d script to determine if the observed performance issue is due to this bug. If it is, the script shows consistently long spa sync times (>15s) in the “Max sync” column, and high numbers (>200) in the “Sync RMW” column:
Pool : Max sync Sync Load Minsize MB Count Load Miss Sync RMW anon/MRU/MFU : 2013 Sep 13 06:09:30
system : 503 ms 0 0 133 MB, 6017 0 0 0/0/0
mstorepool 56329 ms 5 0 0 MB, 355572 0 21259 0/21/21238
Typical stack trace is as follows :
swtch+0x150()
cv_wait+0x61(ffffff8a047b2cb8, ffffff8a047b2cb0)
zio_wait+0x5d(ffffff8a047b29a0)
dbuf_read+0x1e5(ffffff8b754b8d88, 0, 9)
dbuf_will_dirty+0x5d(ffffff8b754b8d88, ffffff91ec49c338)
dmu_write+0xe1(ffffff856606d340, 446, 10fa0, 478,
space_map_sync+0x295(ffffff86174c8dd0, 1, ffffff86174c8ae0,
metaslab_sync+0x2de(ffffff86174c8ac0, a8b)
vdev_sync+0xd5(ffffff8615400640, a8b)
spa_sync+0x45b(ffffff8638592040, a8b)
txg_sync_thread+0x247(ffffff85660903c0)
thread_start+8()
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff"|mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 16423386 | dnode_next_offset() doesn't do exhaustive search | no WO | 2013.1.1.4 |
Signature :
Clone deletion issue : Note the presence of traverse_dnode() twice on the stack.
zfs`arc_read_nolock+0x7e7
zfs`arc_read+0x79
zfs`dsl_read+0x33
zfs`traverse_visitbp+0x431
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <---- B
zfs`traverse_visitbp+0x387
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <----- A
zfs`traverse_visitbp+0x1fc
zfs`traverse_impl+0x213
zfs`traverse_dataset+0x57
zfs`dsl_dataset_destroy_sync+0x290
zfs`dsl_sync_task_group_sync+0xf3
zfs`dsl_pool_sync+0x1ec
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | no WO | IDR 2011.04.24.8.0,1-2.43.2.3 |
Fixed thanks to bug 6988530 NFS threads partially STOPPED after nfsd restart during ZFSSA fail-back.
Signature :
spa_sync activity takes more than 5 sec. It looks like 15754028 but the workaround may not help. Use the synctime7 dtrace script to confirm the issue.
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15723911 | SUNBT7058001 concurrency opportunity during spa_sync | see below | 2013.1.1.1 |
Parent of 15885830 : RMAN BACKUP PERFORMANCE REDUCED 15-30% WHEN USING UNSTABLE WRITES WITH 11.2.0.3 – Fixed in 2013.
Signature :
When doing a heavy asynchronous IO write workload, the NFSv3 response time can show significant variations (up to 7 seconds). With a synchronous workload, the max response time remains normal, around 400ms max. For RMAN backups, a performance decrease of 15-30% has been observed.
ZFS periodically writes out dirty cache buffers to disk. This is called a “sync”: we are synchronising what is in core memory with what is on disk, and it has many steps. The syncing procedure is managed by a component of ZFS called the Storage Pool Allocator, or SPA. This enhancement addresses two of the many steps that the SPA takes in order to complete the sync.
One of the steps is to process all of the “frees”. A “free” involves releasing a previously allocated region on a disk and making it available for allocation again. As ZFS determines which blocks to free, it adds them to a list. Once the list is complete, the SPA would loop through that list and process the frees one at a time. The enhancement that was made to this step was to reduce the number of items on the list by processing the frees as they come in. This greatly reduces the amount of CPU required to handle each block that is being freed.
Another step that the SPA sync procedure does is to write out the metaslab space maps. These space maps contain the allocations that are done on disks, so ZFS can know which regions are free. Prior to this enhancement, each disk was processed sequentially and its space maps were written out. With the change in place, we can now write out the space maps for each disk in parallel.
Workaround :
zfs set sync=always <dataset in trouble>
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 15652868 | SUNBT6965013 DTrace compiler doesn't align structs properly | Do not use analytics broken down by hostname | 2013.1 |
Signature :
When monitoring IP packets by hostname, we can decrease the throughput by 25%
# dtrace -n '
profile-997hz /arg0 && curthread->t_pri != -1 && cpu==58 / { @[stack()] = count(); }
tick-60sec { trunc(@, 10); printa(@); exit(0); }' > /var/ak/dropbox/cpu-busy-$date.out &
The most used stacks were as follows :
ip`ill_input_short_v4+0xdb
ip`ip_input+0x23b
dls`i_dls_link_rx+0x2e7
mac`mac_rx_deliver+0x5d
mac`mac_rx_soft_ring_process+0x17a
mac`mac_rx_srs_fanout+0x823
mac`mac_rx_srs_drain+0x261
mac`mac_rx_srs_process+0x180
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx_common+0x1f6
mac`mac_rx+0xac
mac`mac_rx_ring+0x4c
ixgbe`ixgbe_intr_msix+0x99
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hardint+0x33
unix`switch_sp_and_call+0x13
16175
Checking what this part of the code does, we can see the following.
The 'nop' instructions are the signature of a dtrace probe.
st1f742003b# echo ill_input_short_v4+0xdb::dis | mdb -k
ill_input_short_v4+0xdb: nop
ill_input_short_v4+0xdc: nop
ill_input_short_v4+0xdd: nop
ill_input_short_v4+0xde: addq $0x10,%rsp
ill_input_short_v4+0xe2: movq -0xb0(%rbp),%rcx
ill_input_short_v4+0xe9: movq %r12,%rdi
ill_input_short_v4+0xec: xorq %rsi,%rsi
ill_input_short_v4+0xef: movq %r13,%rdx
ill_input_short_v4+0xf2: nop
ill_input_short_v4+0xf3: nop
ill_input_short_v4+0xf4: nop
Checking which analytics have probes in ip, I noticed this :
dataset-022 active 6.72M 235M ip.bytes[hostname]
After disabling it, we have been able to sustain a higher throughput :
st1f742003b:analytics dataset-047> read 10
DATE/TIME KB/SEC KB/SEC BREAKDOWN
2013-9-10 15:06:49 349787 349755 ixgbe0
32 igb1
2013-9-10 15:06:50 351674 351643 ixgbe0
31 igb1
2013-9-10 15:06:51 650618 650580 ixgbe0
38 igb1
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 17832061 | One dropped packet can permanently throttle TCP transfer | No easy workaround unless using the same VLANS for clients and the ZFS appliance | 2013.1.2.9 |
Signature :
When the clients and the ZFS appliance are on different VLANs, a 50% throughput decrease has been observed. Note that this does not impact IO writes; it is a pure network issue.
* is the issue seen under 2011.1 : YES
* is the issue seen under 2013.1 and the same VLANs (not routed) : NO
* is the issue seen under 2013.1 and different VLANs (routed) : YES
In addition to the congestion window, there is a variable that governs the maximum number of packets allowed to be sent in a single burst. Initially this number is essentially infinite, with the number of packets in a burst allowed to keep growing indefinitely as long as there are no dropped packets indicating network congestion. However, when a packet is dropped, the number of packets may be reduced.
The problem we have here is that if the destination of the connection is on the same local network, the burst is kept at infinity, but if it is not on the same local network the burst is reduced to 5. Under some circumstances it can even be reduced further to 3, but from 3 it will grow back to 5. This accounts for the difference between the VLANs: one VLAN is deemed local and one is deemed not local. On the non-local VLAN, once a packet is dropped the limit is reduced forever to 5.
Beyond 2013.1.2.9 the following tunable has to be changed :
echo 'tcp_cwnd_normal/W 0x14' | mdb -kw
For some cases, it seems we can set tcp_cwnd_normal to 0x64 to get better throughput. Adam's feedback for a perf issue with IO reads done by NDMP is as follows :
tcp_cwnd_normal=0x5 : 3MB/s
tcp_cwnd_normal=0x14 : 30MB/s
tcp_cwnd_normal=0x64 : 80MB/s
Bigger values did not help much
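To read back the current setting before and after the change (hedged; this just applies mdb's 4-byte decimal format to the same tcp_cwnd_normal variable written above):
zfssa# echo 'tcp_cwnd_normal/D' | mdb -k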
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 18125379 | memory leak in l2arc on sync reads that fail | see below | 2013.1.2.0 |
Signature :
A typical kstate.out file will be as follows :
::arc -m
p = 4 MB
c = 64 MB << target ARC size has reached its min
c_min = 64 MB
c_max = 130030 MB
size = 98748 MB << current ARC size, but no data is evictable anymore
[...]
l2_evict_reading = 0
l2_abort_lowmem = 10240 << memory pressure : the kmem_cache_reap_now() thread has run
l2_cksum_bad = 504213672 << issue when data read from SSD cache devices is validated before
it is given to the ARC
l2_io_error = 0
l2_hdr_size = 29696 MB
* l2_cksum_bad will cause memory leak on sync reads (not on async reads) due to 18125379
* getting 'l2_cksum_bad' will make a read from the original storage device
Workaround :
SSDs do not have to be replaced.
There is no real workaround so far. A reboot will be needed to get around the issue, but we have analytics to measure arc.c, so we can take preemptive action and schedule a reboot before it gets messy. Proceed as follows :
- Add the following analytic, if not present already : **Cache - ARC target size : arc.c**
- Add an ALERT from **Configuration->Alerts panel**
* Click on (+) Threshold alerts.
* Select "Cache ARC target size"
* Select "fails below"
* Enter 10000 (MB)
* Fill the correct email address
- If you receive an alert, it will be time to schedule a NAS head reboot.
I chose **10000MB**, but it really depends on the memory size. Obviously, when arc.c has decreased to 64MB, it is already too late.
Beyond that, we still do not know why ‘l2_cksum_bad’ increased so much, and Victor Latushkin said this may not be a HW issue in the SSDs.
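To trend l2_cksum_bad outside of ::arc, a hedged kstat check can be scripted (the arcstats kstat lives under the zfs module; the statistic name is taken from the ::arc output above):
zfssa# kstat -p zfs:0:arcstats:l2_cksum_bad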
| Bug number | Title | Workaround | Fix |
|---|---|---|---|
| 18562374 | missing call to arc_free_data_block in l2arc_read() | NA | 2013.2 |
| 18695640 | race between arc_write_done() and arc_evict_buf() | NA | as of Oct. 2017, this did not seem to be fixed |