All of the following information has to be taken “as is” and has not been updated for 3 years now.
Though I am quite sure all of these bugs are fixed by now, it was worth listing them and sharing them with my fellow engineers who are still working on the ZFSSA.
Hope that helps.
Credit to all the engineers who worked with me at Sun and Oracle from 1999 to 2017.
General bugs list
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
15020395 | performance decrease if 1st nameserver is down | | 2011.1.3 | |
15040725 | RFE for improved DNS resolver failover performance | | 2011.1.3 | |
15770608 | Enable DNS defer-on-fail functionality | | 2011.1.3 | |
15505861 | Write performance degrades when ZFS filesystem is near quota | | 2013.1.1.9 | An IDR was attempted for 2011.1.6 but regression tests failed |
15742475 | Extremely sluggish 7420 node due to heap fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
15770103 | Intermittent akd hangs – mutex owned by a thread stuck in zfs land | | 2011.1.4 | Impacts rollback and snapshot deletion. Refers to CR 7155750 |
15789968 | Unable to delete zombie snapshot causes CPU spikes | | 2011.1.5 | High CPU usage is an indicator |
15814083 | z_fuid_lock becomes too hot when zfs_fuid_find_by_idx() is heavily used | | 2013.1.2 | CIFS performance is bad. High CPU usage on the ZFSSA is a clear indicator |
15705245 | Elapsed time for zfs delete may grow quadratically | | 2011.1.5 | LUN or filesystem deletion may lead to a BUI hang for hours |
15710347 | System hung, lots of threads in htable_steal | limit arc_size | 2013.1.1.5 | |
15751065 | Appliance unavailable due to zio_arena fragmentation | limit arc_size and arc_meta_limit | 2011.1.3 | |
15710534 | Advance rotor before blocking on space_map_load | metaslab_unload_delay | 2011.1.6 | sawtooth IOPs pattern during sync of metadata |
15790948 | soft-ring fanout is single threaded for VLAN | | 2013.1 | Throughput is limited when VLANs are used |
17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | | IDR 2011.04.24.8.0,1-2.43.2.3 | spa_sync activity lasts more than 5 sec |
15723911 | concurrency opportunity during spa_sync | sync=always | 2013.1.1.1 | 20% loss of throughput with RMAN backups |
15652868 | DTrace compiler doesn't align structs properly | disable some analytics | 2013.1 | The ip.bytes[hostname] statistic impacts performance |
16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | Performance issue when deleting clones |
18753152 | zil train and logbias=throughput, 2X performance opportunity | | 2013.1.2.8 | Performance issue impacting databases running OLTP workloads with an 8k recordsize and logbias=throughput |
17832061 | SMB read performance is not consistent | Avoid using different VLANs | 2013.1.2.9 | 50% bandwidth loss over different VLANs |
15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | metaslab_unload_delay | 2013.1.2.9 | |
15784616 | need an upper bound on memory footprint of l2hdr for large l2arc | | 2013.1.2.11 | L2ARC_HEADER_PERCENT added: default = 30% of ARC |
18609976 | large deletes cause NFS ops to drop to zero | | 2013.1.3.5 | To be monitored with the spawtf.d script. Happens when large datasets are destroyed: spa_sync can last more than 10 mins. The fix being pursued here also fixes two problems causing scan/scrub/resilver to overrun its time allocations. Prior to this, if a scrub is running, it consistently goes over 6s; with the fix, it is just over 5s, as intended. |
Typical stack as follows:

genunix`avl_find+0x56
zfs`space_map_add+0xa6
zfs`zio_free+0x1a5
zfs`dsl_free+0x20
zfs`dsl_dataset_block_kill+0x19b
zfs`free_blocks+0x76
zfs`free_children+0x2c7
zfs`free_children+0x23f
zfs`free_children+0x23f
zfs`dnode_sync_free_range+0x113
zfs`dnode_sync+0x272
zfs`dmu_objset_sync_dnodes+0x7f
zfs`dmu_objset_sync+0x1c9
zfs`dsl_dataset_sync+0x63
zfs`dsl_pool_sync+0xdb
zfs`spa_sync+0x39a
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
2407
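If the spawtf.d script is not at hand, spa_sync duration can also be eyeballed with a small fbt timing one-liner. This is only a sketch (it assumes the fbt provider is usable from the appliance shell, and it times every spa_sync, not just the ones triggered by large destroys):

# dtrace -n '
fbt::spa_sync:entry { self->ts = timestamp; }
fbt::spa_sync:return /self->ts/
{
        /* distribution of spa_sync wall time, in milliseconds */
        @["spa_sync time (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
}
tick-60sec { printa(@); trunc(@); }'

Buckets drifting into the tens of seconds during a large dataset destroy match the signature above.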
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
20693077 | snapshot deletes don't check timeout while iterating | | 2013.1.4.6 | |
Hundreds of stacks similar to this seen onproc:

zfs`dbuf_hold_impl+0x1
zfs`dnode_hold_impl+0xc3
zfs`dnode_hold+0x2b
zfs`dmu_bonus_hold+0x32
zfs`bpobj_open+0x6d
zfs`bpobj_iterate_impl+0x35d
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate_impl+0x3ff
zfs`bpobj_iterate+0x23
zfs`dsl_scan_sync+0x11b
zfs`spa_sync+0x447
zfs`txg_sync_thread+0x244
unix`thread_start+0x8
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
19772024 | ls_reclaim() blocks kmem_cache_reap() | limit arc_size | 2013.1.3.1 | kmem_cache_reap thread stuck trying to retrieve some space from specific NFS cache |
19591405 | akd running in RT class is blocking critical kernel memory management | | 2013.1.6.0 | Some RT akd threads needing memory prevent kmem_cache_reap and the pageout scanner from running. See MOS Doc ID 2043383.1: "Some systems with LOW Memory configurations might not run reliably/stable with 2013.1.x releases" |
An admitted workaround is to set the non-cluster akd threads' scheduling class to Fair Share (FSS), as follows:

zfssa# svccfg -s akd setprop akd/rt_off = "true"
zfssa# svccfg -s akd listprop | grep rt_off
akd/rt_off_cluster astring false
akd/rt_off astring true <<<<
zfssa# svcadm refresh akd
zfssa# svcadm restart akd

This must be done on both cluster nodes.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
18914201 | Hole buffers are calculating zero checksums leading to performance slowdown | | 2013.1.3.1 | |
19137061 | top-level vdev allocations may go over mg_aliquot | | 2013.1.3.1 | The fix spreads IOs across vdevs in a more efficient way, making each vdev less busy |
19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | The reaper thread is not able to free buffers fast enough. This causes NFS clients to experience IO latency. |
Typical stack is as follows:

unix`do_splx+0x97
unix`xc_common+0x230
unix`xc_call+0x46
unix`hat_tlb_inval+0x289
unix`hat_unload_callback+0x302
unix`hat_unload+0x41
unix`segkmem_free_vn+0x5a
unix`segkmem_free+0x27
genunix`vmem_xfree+0x10a
genunix`vmem_free+0x27
genunix`kmem_slab_destroy+0x87
genunix`kmem_slab_free+0x2c7
genunix`kmem_magazine_destroy+0xdf
genunix`kmem_depot_ws_reap+0x77
genunix`kmem_cache_magazine_purge+0xd6
genunix`kmem_cache_magazine_resize+0x41
genunix`taskq_thread+0x22e
unix`thread_start+0x8
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
19883033 | ZS3 encountering unexplained high CPU utilization | | 2013.1.3.3 | SMB issue mostly |
Using dtrace to count the number of calls from zfs_groupmember()->zfs_fuid_find_by_idx() for the exact same workload to the same share as the same user:

calls | version
---|---
12843966 | ak-2013.1.3.0
848928 | ak-2013.1.3.3
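The exact script used for that comparison is not reproduced here; a minimal sketch that simply counts zfs_fuid_find_by_idx() calls by calling stack (so the zfs_groupmember() path stands out) could look like this, assuming fbt probes are available for the function on the running build:

# dtrace -n '
fbt::zfs_fuid_find_by_idx:entry { @[stack(4)] = count(); }
tick-60sec { printa(@); exit(0); }'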
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
15324796 | zfetch needs a whole lotta love | | 2013.1.4.4 | For sequential IO reads with a small blocksize, zfetch_block_cap can be increased. By default, zfetch_block_cap = 16. With a recordsize of 8k, ZFS will prefetch only 128K. A high number of vdevs will not help, as we just do not prefetch deep enough to keep them all busy. To get better disk usage, zfetch_block_cap has to be increased to 32 or even 64. See tuning_zfetch_block_cap.akwf attached to bug 19982476 (backport for ZFSSA), and the mdb sketch after this table. |
20361021 | Latency seen on upgrade to SMB2.1 with leases. SMB2.1 server is waiting for lease break ack | | 2013.1.4.7 | |
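For the zfetch row above (bug 15324796), the tuning_zfetch_block_cap.akwf workflow is the supported way to change the tunable; for a quick experiment, the same mdb style used elsewhere in this document would look like the sketch below. This assumes zfetch_block_cap is the 32-bit tunable of these AK builds, and the change does not persist across a reboot:

zfssa# echo zfetch_block_cap/D | mdb -k        # read the current value
zfssa# echo zfetch_block_cap/W0t64 | mdb -kw   # raise the prefetch cap to 64 blocks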
SMB2.1 is available from 2013.1.3.0, which introduces leases. This feature works in a similar way to oplocks, where a 35-second delay can be seen, so just disabling oplocks from the BUI might not work.
You have to go back to SMB2.0 (to disable leases) as follows:

zfssa# svccfg -s smb listprop | grep maxprotocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 3 (SMB_D2_1)
zfssa# svccfg -s smb setprop smb/server_maxprotocol = astring: "2.0"
zfssa# svccfg -s smb listprop | grep protocol
smb/client_maxprotocol astring 1
smb/server_maxprotocol astring 2.0
zfssa# echo '::smbsrv -v' | mdb -k | grep maxprotocol
skc_server_maxprotocol = 2 (SMB_D2_0)
zfssa# aksh
zfssa:> configuration services smb
zfssa:configuration services smb> disable
zfssa:configuration services smb> enable
Caveat: disabling oplocks will result in more traffic over the wire. The clients will be denied read/write caching of the data, and all requests have to go to the server.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
19454536 | Improve the performance of SMB2_QUERY_DIRECTORY | | 2013.1.6.0 | Browsing CIFS shares with lots of files will make the ZFSSA CPUs busy, and latencies can happen |
Signature :
zfssa# dtrace -n 'profile-997hz /arg0 && curthread->t_pri != -1/ { @[stack()] = count(); } tick-10sec { trunc(@, 10); printa(@); exit(0); }'

unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zaccess_aces_check+0x77
zfs`zfs_zaccess_common+0x85
zfs`zfs_zaccess+0x14d
zfs`zfs_fastaccesschk_execute+0x173
zfs`zfs_lookup+0xa8
genunix`vhead_lookup+0xd5
genunix`fop_lookup+0x128
smbsrv`smb_vop_lookup+0x17c
smbsrv`smb_fsop_do_lookup+0x14e
smbsrv`smb_fsop_lookup+0x27d
smbsrv`smb_odir_wildcard_fileinfo+0x6f
smbsrv`smb_odir_read_fileinfo+0x10e
smbsrv`smb2_query_directory_encode_dynamic+0xd5
smbsrv`smb2_query_directory_encode+0x7e
smbsrv`smb2_query_directory+0x84
smbsrv`smb2_dispatch_request_impl+0x237
smbsrv`smb2_dispatch_request+0x15
smbsrv`smb_session_worker+0x88
5672

Lockstat extract (32 cores):

-------------------------------------------------------------------------------
Count    indv cuml rcnt     nsec Lock               Hottest Caller
2959889   95%  95% 0.00   224391 0xfffff6011546f100 zfs_root+0x58

nsec     ------ Time Distribution ------ count   Stack
256      |                                  98   fsop_root+0x2d
512      |                               56784   smb_node_alloc+0x11e
1024     |                               82330   smb_node_lookup+0x1fe
2048     |@                             187212   smb_fsop_do_lookup+0x49b
4096     |@@@                           372976   smb_fsop_lookup+0x27d
8192     |@@@@@@@@                      876803   smb_odir_wildcard_fileinfo+0x6f
16384    |@@@                           354769   smb_odir_read_fileinfo+0x10e
32768    |@                             167806   smb2_query_directory_encode_dynamic+0xd5
65536    |@                             107972   smb2_query_directory_encode+0x7e
131072   |                               91142   smb2_query_directory+0x84
262144   |@                             107285   smb2_dispatch_request_impl+0x237
524288   |@                             150887   smb2_dispatch_request+0x15
1048576  |@                             182779   smb_session_worker+0x88
2097152  |@                             152531   taskq_d_thread+0xc1
4194304  |                               66082   thread_start+0x8
8388608  |                                2424
16777216 |                                   9
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
23216069 | Missing flag in prefetch backport causes excessive prefetch activity | | 2013.1.6 | Related to 21974849 – RMAN backup over ASM leads to 1M random read foiling zfetch. With the new zfetch implementation brought in 2013.1.4.4, RMAN backups through ASM saturate the SAS drives because of unexpected prefetching. |
With the new zfetch code, when subject to 1MB random reads, we actually do issue some readaheads, and given that the ASM Allocation Units can be interleaved, we're likely to end up not using the prefetched data. This makes the disks very busy for no profit. To compare the FC/iSCSI/NFS activity to the SAS2 disk activity, it can be interesting to add some analytics such as fc.bytes[op], iscsi.bytes[op], nfs3.bytes[op], ...

This can even make the CPUs busy with the following stack:

zfs`dbuf_prefetch+0x1
zfs`dmu_zfetch+0x10d
zfs`dbuf_read+0x128
zfs`dmu_buf_hold_array_by_dnode+0x1cc
zfs`dmu_buf_hold_array+0x6f
zfs`dmu_read_uio+0x4b
zfs`zfs_read+0x260
genunix`fop_read+0xa9
nfssrv`rfs3_read+0x3ec
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
47770

The current workaround that we implemented was to tune zfetch_maxbytes_lb=128K and zfetch_maxbytes_ub=256K. This keeps prefetch enabled but reduces the amount of prefetched data that needs to be thrown away. This is not a very comfortable workaround.

# echo zfetch_maxbytes_ub/W0t262144 | mdb -kw
zfetch_maxbytes_ub: 0x2000000 = 0x40000
# echo zfetch_maxbytes_lb/W0t131072 | mdb -kw
zfetch_maxbytes_lb: 0x400000 = 0x20000

From dmu_zfetch.c:

/* Max amount of data to prefetch at a time */
unsigned int zfetch_maxbytes_ub = 32 << MB_SHIFT;
unsigned int zfetch_maxbytes_lb = 4 << MB_SHIFT;
unsigned int zfetch_target_blks = 256;
[...]
/*
 * Provides an upper bound on the prefetch window size for a single prefetch.
 * We try to prefetch zfetch_target_blks (256) but stay within the size bounds
 * [zfetch_maxbytes_lb (4 MB), zfetch_maxbytes_ub (32 MB)]
 *
 * BLOCKSIZE   MAX PREFETCH WINDOW SIZE
 * 128K-1MB    32MB
 * 32K-64K     256 blocks (8MB - 16MB)
 * < 32K       4MB
 */
static unsigned int
dmu_zfetch_max_blocks(uint64_t blkshft)
{
	unsigned int blks = zfetch_target_blks;
	unsigned int size = blks << blkshft;

	if (size > zfetch_maxbytes_ub)
		return (zfetch_maxbytes_ub >> blkshft);

	if (size < zfetch_maxbytes_lb)
		return (zfetch_maxbytes_lb >> blkshft);

	return (blks);
}
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
21071219 | Comstar framework should allow a cmd to abort prior to ZFS I/O completion | | 2013.1.5.7 & 2013.1.6 | Relates to 22080255: ZFSSA backend "resilvering" causes FC front-end port inaccessible to remote host |
We encountered this issue when investigating CR 2829652. Basically, on the qlt FC front end of the ZFSSA, if a SCSI command is detected as timed out by the remote host, the remote host will issue an ABTS, and it retries the ABTS if the first one is not processed in time. If the second ABTS is not processed in time either, a LOGO is issued by the remote host, and the I/O path may get lost permanently (see CR 21071136) as a result. For a ZFSSA using FC as the transport interface, the capability to handle ABTS in a timely fashion is critical, so that the host can complete its error recovery and retry IOs without the risk of losing the IO path as a result of a LOGO. The AK8 kernel and COMSTAR stack do not have a way for fct/stmf to allow a command to abort prior to the ZFS backend IO completion (pmcs may be involved in most cases).
traces:

*stmf_trace_buf/s
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh) (ah)(fffff600e30dd480, fffff600e30dd510)(FFFFFFFFh FFFFFFFFh)(0)(B070
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts, (0)(544h FFFFh) (ah)(fffff600e30dd480, fffff600e30dd
[2014 Aug 22 13:56:23:075:142:122] !584=>QEL qlt(0,0):: qlt_handle_rcvd_abts,

*qlt_state::walk softstate |::qltgettrace : fc/qlttrace
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, (0)(2C0h 724h) (4h)(fffff607a0c25700, fffff607a0c25790)(FFFFFFFFh 126B10h)(0)(603
[2016 May 25 22:01:48:229:519:056] !46890=>QEL qlt(2,0):: qlt_handle_rcvd_abts, )
[2016 May 25 22:01:48:229:544:907] !46891=>QEL qlt(2,0):: qlt_abort_cmd, cmd_type = FCT_CMD_FCP_XCHG
[2016 May 25 22:01:48:229:547:011] !46892=>QEL qlt(2,0):: qlt_abort_unsol_scsi_cmd, qi(0) fctcmd- fffff604a0877c80(2C0h, FFFFh),9d000477 rex2=126B10h
[2016 May 25 22:01:48:229:562:586] !46893=>QEL qlt(2,0):: qlt_handle_ctio_completion, abort(oxid=2C0h): hndl-9d170477, 1, fffff6007dcc6b40 (126B10h)
[2016 May 25 22:01:48:229:566:267] !46894=>QEL qlt(2,0):: qlt_handle_ctio_completion, 0(fffff604a0877c80,fffff604a0877d10)(2C0h,FFFFh),9d170477 126B10(2016, 0)
[2016 May 25 22:01:48:784:693:584] !46895=>QEL qlt(2,0):: qlt_send_abts_response, (160h 5E1h)(0)(fffff607998c6e00 fffff607998c6e90) (FFFFFFFFh 122E80h)
[2016 May 25 22:01:48:784:746:901] !46896=>QEL qlt(2,0):: qlt_handle_abts_completion, qi(0) status =0h, (160h 5E1h)
[2016 May 25 22:01:48:796:029:698] !46897=>QEL qlt(2,0):: qlt_send_abts_response, (1BAh 63Eh)(0)(fffff6058ccd3c40 fffff6058ccd3cd0) (FFFFFFFFh 123FF0h)

The spagettime.d script will show a high sync gap, as follows:

# cat spagettime.Pool_02.2016-05-25_14\:14\:10 | nawk '{ if ($16>5000) print}'
2016 May 25 14:15:21 txg : 9996375; walltime:759 ms ; sync gap:14845 ms ; txg delays:0 ; txg stalled:0
2016 May 25 16:05:37 txg : 9997695; walltime:1449 ms ; sync gap:7279 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:01:49 txg : 10001958; walltime:1175 ms ; sync gap:25238 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:03:27 txg : 10001976; walltime:1065 ms ; sync gap:11427 ms ; txg delays:0 ; txg stalled:0
2016 May 25 22:12:45 txg : 10002087; walltime:1287 ms ; sync gap:5106 ms ; txg delays:0 ; txg stalled:0
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
23526362 | mptsas driver TRAN_BUSY on readzilla degraded shadow-migration performance | | 2013.1.6.11 | |
Signature: iostat -xnz 1 – a wait queue > 0 is a smoking gun, together with actv >= 1:

extended device statistics
r/s   w/s  kr/s   kw/s    wait actv wsvc_t asvc_t %w %b device
124.0 61.0 1423.0 38681.3  5.0  1.3   26.8    7.3 66 68 c0t5001E82002763F90d0 << READZILLA
263.0 51.0 2233.4 36262.7  4.9  1.4   15.7    4.4 65 71 c0t5001E82002763F90d0
95.0  61.0  956.0 43606.2  6.8  1.8   43.6   11.8 89 95 c0t5001E82002763F90d0
288.0 56.0 4157.5 41983.7  5.5  1.5   16.0    4.4 73 76 c0t5001E82002763F90d0
172.0 64.0 1422.9 43494.1  5.4  1.4   22.8    6.1 69 74 c0t5001E82002763F90d0
55.0  58.0  517.1 40532.3  5.2  1.4   45.8   12.4 68 70 c0t5001E82002763F90d0

wait: average number of transactions waiting for service (queue length), or host queue depth.
actv: the active queue, i.e. the number of pending I/O requests. This represents the number of threads doing IOs to the same device (cf. Roch).
L2ARC devices (SANDISK 1.6TB) are 100% busy, but critically have only 1 IO in the active queue and 9 in the wait queue. Blaise suspects it is because the L2 improvements in recent AK releases mean the L2ARC is caching much better, so the bug is triggered more easily.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
20466027 | zfs_zget is causing high cpu usage | | 2013.1.6.12 | |
Signature from a dtrace profile:

unix`mutex_delay_default+0x7
unix`mutex_vector_enter+0x2ae
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_read+0x4d
nfssrv`common_dispatch+0x53f
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
4652

nfssrv`rfs3_write+0xab0
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
10390

unix`mutex_delay_default+0x7
unix`lock_set_spl_spin+0xc7
genunix`disp_lock_enter+0x2d
genunix`turnstile_lookup+0x69
unix`mutex_vector_enter+0x1b0
zfs`zfs_zget+0x4a
zfs`zfs_vget+0x1d2
genunix`fsop_vget+0x66
nfssrv`nfs3_fhtovp+0x47
nfssrv`rfs3_write+0x4a
nfssrv`common_dispatch+0x694
nfssrv`rfs_dispatch+0x2d
rpcmod`svc_process_request+0x184
rpcmod`stp_dispatch+0xb9
genunix`taskq_d_thread+0xc1
unix`thread_start+0x8
19267
The fix was to remove the mutex and replace it with a NUMA-aware reader/writer lock, grabbed as a reader in the normal zget cases. The zget code then has to deal with the races that can occur when the same znode is looked up concurrently.
kernel memory related issues
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
15942559 | ZFS data can not be easily reclaimed for other uses | | ak9 | |
15775761 | Converting memory between kmem caches can be a limiting factor | | no fix | Points to 15942559 – ZFS data can not be easily reclaimed for other uses |
15796281 | kmem taskq processing can wedge the system | | 2013.1.6.3 IDR #4.1 | The fix design was to change the kmem_taskq into a standalone workq with a dedicated service thread. Each cache could have a bitmap of possible operations that the service thread could perform, along with a "busy" bit. When async work needs to be done on a cache, the thread requesting the work would set the corresponding bit and enqueue the cache on the global list. When the kmem thread wanted to do work, it would set the "busy" bit on a cache, perform the operations specified by the bitmap, then clear the "busy" bit. If a kmem_cache_destroy() needed to come in, it could grab the list lock, wait for the "busy" bit on its cache to clear (hopefully it would be clear to begin with), remove the cache from the list, and then manually do whatever processing it needed to do before destroying the cache. |
19294541 | kmem cache reaping should be made more scalable | | 2013.1.6.3 IDR #4.1 | |
The list of caches getting processed by kmem_async is queued:

> kmem_caches_processing=J
fffffffffc0b99c0
> fffffffffc0b99c0::walk list |::kmem_cache
ADDR             NAME            FLAG CFLAG    BUFSIZE BUFTOTL
ffffc10000047030 kmem_alloc_256  120f 00200000     256  677052 << processing this cache
ffffc1010e5c8030 dnode_t         120f 04000000     752   16760 << waiting
ffffc1010e5d4030 zfs_znode_cache 120f 00000000     280   10842 << waiting
ffffc10113b2d030 zio_buf_512     1200 01020000     512   16816 << this cache may wait a long time before getting processed
Synchronous processing of caches does not seem to scale; things can go bad, especially in low-memory conditions.
The solution was to split kmem_reap into two steps:
1. All "background processing" on a given cache (reap or update) is done by a single thread at a time, but multiple threads can run for different buffers at the same time. This is the "scalable" part.
2. The "global cleanup" work (re-scheduling timeouts, etc.) is done after all processing related to the base work (reap or update) is complete.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
23555751 | ZFSSA – FC VM frozen for more than 10 mins | see below | 2013.1.6.3 IDR #4.1 | Points to 15796281 and 19294541 |
echo "arc_shrink_shift/W 6" | mdb -kw            (default is 7)
echo "arc_evict_chunk_minbufs/Z 0t256" | mdb -kw

This adjusts arc_evict_chunk_minbufs to 256 from its default value of 128 – see 26264372.

Tuning arc_shrink_shift has the effect of making ZFS react to low memory at an earlier point. See arc.c/arc_reclaim_thread() and the redzone calculation used by arc_reclaim_needed():

redzone = 2 * btop(arc_size >> arc_shrink_shift);
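As a rough illustration of that formula (the ARC size is assumed here purely for the arithmetic): with an arc_size of 1 TB and the default arc_shrink_shift of 7, redzone = 2 * btop(1 TB >> 7), i.e. the page equivalent of about 16 GB; with arc_shrink_shift lowered to 6 it becomes the page equivalent of about 32 GB, so arc_reclaim_needed() starts shrinking the ARC while considerably more memory is still free.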
The reaper_trace.d script may be used to see how long the arc_get_data_block() function takes to return when throttling happens.
Here we see that it can take 37 secs to return with some throttles :
52 43171 arc_get_data_block:return 31658262460856570 2016 Jul 29 17:15:40 fffffe0c06f84c20 37658131850 throttle 1877 arn 1878 aen 1878 freemem 9435104 arc_size 1317113940032
53 43171 arc_get_data_block:return 31658262460053455 2016 Jul 29 17:15:40 fffffe0c0582dc20 31811415899 throttle 1591 arn 1592 aen 1592 freemem 9435104 arc_size 1317112963016
The script has to be run as follows :
cli> confirm shell
# mkdir /var/tmp/tsc
# cd /var/tmp/tsc
# vi reaper_trace.d              << copy the content of reaper_trace.d
# chmod u+x reaper_trace.d
# vi reaper_collection.sh        << copy the content of reaper_collection.sh
# chmod u+x reaper_collection.sh
# ./reaper_collection.sh start
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
22256256 | Lengthy kmem_reap() processing can block progress of arc_reap_thread | | 2013.1.5.7 | kmem_cache_reap_async() was added to offload advisory reaps to kmem-internal taskq threads. That prevents advisory reaps from blocking behind the async kmem reap activity if they happen to be working on the same kmem cache. |
22344599 | Final detach at the end of a resilvering leads to high spa_sync time | no workaround | 2013.1.6.0 | Relates to 22256256 : Lengthy kmem_reap() processing can block progress of arc_reap_thread. Typical signature is kernel memory reorganization at the end of a disk replacement, when the final spare detach happens. |
The arc_delays.d script will report some kernel memory buffers being reaped:

2015 Dec 9 22:04:40 freemem 4054 MB
2015 Dec 9 22:04:50 freemem 4546 MB; reaping in progress
2015 Dec 9 22:05:00 freemem 5313 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU; reaping in progress
2015 Dec 9 22:05:10 freemem 5334 MB; reaping in progress
2015 Dec 9 22:05:20 freemem 4799 MB; reaping in progress
2015 Dec 9 22:05:30 freemem 4702 MB; reaping in progress
2015 Dec 9 22:05:40 freemem 5368 MB; reaping in progress
2015 Dec 9 22:05:50 freemem 6196 MB; reaping in progress
2015 Dec 9 22:06:00 freemem 6964 MB; reaping in progress
2015 Dec 9 22:06:10 freemem 7985 MB; reaping in progress
2015 Dec 9 22:06:20 freemem 9085 MB; reaping in progress
2015 Dec 9 22:06:30 freemem 10152 MB; reaping in progress
2015 Dec 9 22:06:40 freemem 11409 MB; reaping in progress
2015 Dec 9 22:06:50 freemem 12681 MB; reaping in progress

Looking at the analytics, the end of the resilvering matches some kernel memory reorganization. See how the bpmap_cache buffer size decreases:

lbl-1656:analytics dataset-035> show
Properties:
name = kmem.fragmented[cache]
grouping = Memory
explanation = kernel memory lost to fragmentation broken down by kmem cache
incore = 4.89M

lbl-1656:analytics dataset-035> read 2015-12-09 22:04:49 1
DATE/TIME              MB      MB BREAKDOWN
2015-12-9 22:04:49  17393   11495 bpmap_cache
                             1610 space_seg_cache
                              659 dmu_buf_impl_t
                              536 l2arc_buf_t
                              518 arc_buf_t
                              364 zio_cache
                              158 arc_ref_t
                              157 arc_elink_t
                              144 dnode_t

lbl-1656:analytics dataset-035> read 2015-12-09 22:06:10 1
DATE/TIME              MB      MB BREAKDOWN
2015-12-9 22:06:10  10753    4979 bpmap_cache
                             1598 space_seg_cache
                              596 dmu_buf_impl_t
                              536 l2arc_buf_t
                              489 arc_buf_t
                              361 zio_cache
                              164 zio_data_buf_8
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
19949855 | optimize the number of cross-calls during hat_unload | | 2013.1.3.3 | |
20645565 | arc_throttle logic can delay zfs operations under low memory conditions | | 2013.1.6.3 IDR #4.1 | |
a) Excessive wait time in kmem taskq :
arc_reaper_thread dispatches the reap request to kmem_taskq. kmem_taskq is single threaded and it is responsible for multiple kmem maintenance tasks, not just reaping. If there are tasks pending in the taskq, it may take a while for our task to get to the front of the queue and execute.
b) Latency in taskq_wait :
arc_reaper_thread, after dispatching the reaping task to the taskq, calls taskq_wait to wait for the task to be done. Once it returns from taskq_wait, it signals arc_reaper_thread so that it may shrink the ARC further and evict more buffers if needed. However, it seems we wait in taskq_wait until all the pending tasks in the queue are complete, not just for the completion of the task dispatched by arc_reaper_thread. For example, if some other thread calls kmem_reap, that may dispatch lots of reaping tasks to the kmem taskq. In that case, we keep waiting until these other reaping tasks are also complete.
It seems like the fix for the following bug may help to solve this problem, but it was not backported to AK: 15796281 – kmem taskq processing can wedge the system.
c) The red zone size used for reacting to memory shortage :
The memory throttle logic in arc_get_data_block puts the calling thread to wait if allocating additional memory could result in freemem falling below (desfree + lotsfree). On a 512GB system, (desfree + lotsfree) is approximately 12GB. So that we do not enter the throttle logic and put threads to wait for memory, we may consider reacting to a low-memory situation at an earlier point; we shouldn't wait until freemem goes as low as 12GB. In the arc_reclaim_thread logic, that early point is decided based on the size of the red zone. The red zone size probably also needs some tuning so that we react much earlier.
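The ~12GB figure follows from the usual Solaris defaults, which are assumed here since both values can be tuned: lotsfree = physmem/64 and desfree = lotsfree/2, so with 512 GB of RAM lotsfree ≈ 8 GB and desfree ≈ 4 GB, hence desfree + lotsfree ≈ 12 GB. The live values (expressed in pages) can be read back with mdb:

# echo lotsfree/E | mdb -k     # pages; multiply by the page size reported by pagesize(1)
# echo desfree/E | mdb -k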
zfs_signature.sh
The zfs_signature.sh script was commonly used to narrow down performance issues possibly related to ZFS. It was written by Roch Bourbonnais and Blaise Sanouillet, two gurus I miss a lot.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
18125379 | memory leak in l2arc on sync reads that fail | | 2013.1.2.0 | |
See the l2feeding.d script comments. The output below, with write BW under the target BW and all devices able to keep up, is a signature for 16564195 (ARC l_evict_lock contention in arc_evict_bytes()):

2014 Nov 12 12:08:08
Data eligible for L2, evicted but not fed: 352 MB/s
L2 feeding suspended due to low memory: 0

L2 device        Waiting for data  Not keeping up  Write BW  Target BW
fffff600afd5b940                0               0    7 MB/s    40 MB/s
fffff600e6e2f280                0               0    8 MB/s    40 MB/s
fffff600f5be52c0                0               0    8 MB/s    40 MB/s
fffff600e836ea00                0               0    7 MB/s    40 MB/s

lockstat gives this:

Adaptive mutex spin: 4123264 events in 30.136 seconds (136823 events/sec)
-------------------------------------------------------------------------------
ops/s indv cuml rcnt     nsec Lock         Caller
 1973  19%  19% 0.00  4254743 ARC_mru+0xa0 arc_evict_bytes+0x28

nsec ------ Time Distribution ------ ops/s Stack
128  |                                   0 arc_get_data_block+0xe4
256  |                                   8 arc_read+0x64c
512  |                                  20 dsl_read+0xac
1024 |                                  47 dbuf_read_impl+0x22c
2048 |@                                115 dbuf_read+0x100
4096 |@@@                              215
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
16564195 | ARC l_evict_lock contention in arc_evict_bytes | | 2013.1.4.5 | |
Adaptive mutex spin: 5320606 events in 62.214 seconds (85521 events/sec)
-------------------------------------------------------------------------------
Count   indv cuml rcnt    nsec Lock         Hottest Caller
2611062  80%  80% 0.00  124236 ARC_mru[272] arc_evict_bytes+0x5e

nsec      ------ Time Distribution ------ count  Stack
256       |                                  584 arc_get_data_block+0xc7
512       |@                               91853 arc_read+0x606
1024      |@                               78003 dsl_read+0x34
2048      |@                              150450 dbuf_prefetch+0x1e6
4096      |@@@                            300149 dmu_zfetch_fetch+0x5f
8192      |@@@@@@@                        640676 dmu_zfetch+0x10d
16384     |@@@@                           426892 dbuf_read+0x128
32768     |@@                             246084 dmu_buf_hold_array_by_dnode+0x1cc
65536     |@                              135358 dmu_buf_hold_array+0x6f
131072    |@                              117919 dmu_read_uio+0x4b
262144    |@                              124844 zfs_read+0x260
524288    |@                              127585 fop_read+0xa9
1048576   |@                               93911 rfs3_read+0x4e0
2097152   |@                               53434 common_dispatch+0x53f
4194304   |@                               20964 rfs_dispatch+0x2d
8388608   |@                                2233 svc_process_request+0x184
16777216  |@                                  76 stp_dispatch+0xb9
33554432  |@                                  21 taskq_d_thread+0xc1
67108864  |@                                  23 thread_start+0x8
134217728 |@                                   3
19297091 (arc_evictable_memory has needless locks) is part of the fix.
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
15648057 | lack of accounting for DDT in dedup'd frees can oversubscribe txg | | 2010.Q1.2.0 | See the spagetti.d script output: walltime » 5 seconds, sync gap close to 0, arc misses from "DDT ZAP algorithm" in the 100s |
18281514 | spa_sync/space_map_map delay in dbuf_read causes NFS thread to pause | | 2013.1.2.9 | See the spagetti.d script output: walltime » 5 seconds, sync gap close to 0, arc misses from "SPA space map" in the 100s |
16423386 | dnode_next_offset() doesn't do exhaustive search | | 2013.1.1.4 | See the spagetti.d script output. Signature for dataset / large file destroy pathologies: walltime » 5 seconds, sync gap close to 0, dsl_dataset_destroy_sync() on-cpu time > 2 seconds |
19621408 | do_user_quota_updates can consume 10% of spa_sync time in some cases | | no fix | See the spagetti.d script output: walltime >= 5 seconds, do_user_quota_updates() time > 1 sec (or even bigger). |
-------- spa_sync exclusive time breakdown by tracked function:

                                  Elapse ms   On-cpu ms
<total>                                5252         487
spa_sync                                  6           4
dsl_pool_sync                          2469         277
metaslab_sync                             3           1
metaslab_sync_reassess                    2           2
metaslab_sync_done                      125         122
dmu_objset_do_userquota_updates        2193          55   <<<<
taskq_wait -> vdev_sync                   2           0
taskq_wait -> vdev_sync_done            109           0
vdev_config_sync                         95           1
dsl_dataset_destroy_sync                  0           0
zio_wait                                  0           0
lost                                    248          25
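When spagetti.d is not available, the dmu_objset_do_userquota_updates() contribution can be estimated with the same kind of fbt timing sketch shown earlier for spa_sync (again assuming fbt probes exist for this function on the running AK build):

# dtrace -n '
fbt::dmu_objset_do_userquota_updates:entry { self->ts = timestamp; }
fbt::dmu_objset_do_userquota_updates:return /self->ts/
{
        /* distribution of per-call wall time, in milliseconds */
        @["userquota update time (ms)"] = quantize((timestamp - self->ts) / 1000000);
        self->ts = 0;
}'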
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
15623702 | Write throttling starts too late (arc_memory_throttle) | | no fix | See the spagetti.d script output: a pattern of 2 out of 3 consecutive spa_sync with dp_write used ~0 and ARC mem throttle > 0 |
15744920 | read throttle required to survive read write storm | | 2013.1.5 | See the spagetti.d script output: walltime >= 5 seconds, sync gap close to 0, no arc misses, dp_write thput limit ~= min, plus txg stalls and delays – see the spagetti.d details. Setting zfs_write_limit_min to 24 * 100MB * #HDD has been shown to alleviate the case of writes stalled for minutes |
For a 20-disk pool (10 vdevs), this gives: 24 * 100MB * 20 = 48000000000 bytes

zfssa# echo zfs_write_limit_min/Z0t48000000000 | mdb -kw
zfs_write_limit_min: 0x2000000 = 0xb2d05e000

This has the effect of diminishing the maximum throughput for the pool, but it makes the txg stalls and delays disappear. Before tuning zfs_write_limit_min to 48000000000, the "used throughput" for TXGs (second column) can fluctuate a lot:

min/used/thput limit/free limit : 1/ 23/ 616/ 1179190
min/used/thput limit/free limit : 1/ 409/ 973/ 1179174
min/used/thput limit/free limit : 1/ 696/ 1347/ 1179146
min/used/thput limit/free limit : 1/ 708/ 1631/ 1179117
min/used/thput limit/free limit : 1/ 85/ 1464/ 1179099
min/used/thput limit/free limit : 1/ 23/ 1176/ 1179099
min/used/thput limit/free limit : 1/ 24/ 983/ 1179114
min/used/thput limit/free limit : 1/ 24/ 810/ 1179114
min/used/thput limit/free limit : 1/ 23/ 656/ 1179114
min/used/thput limit/free limit : 1/ 11/ 537/ 1179115
min/used/thput limit/free limit : 1/ 1/ 409/ 1179116

After the tuning, it is more constant:

min/used/thput limit/free limit : 1907/ 147/ 147/ 1179075
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179071
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179066
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179061
min/used/thput limit/free limit : 1907/ 147/ 147/ 1179057
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
16275386 | metaslab_group_alloc shouldn't block on space_map_load | see below | 2013.1.6 | When destroying a dataset, entries are added to a chain list on disk for a non-loaded spacemap. When this spacemap is loaded for further writes, the chain list has to be parsed to determine where free space is. This causes a lot of random IO reads, which takes time while we are in the middle of IO writes. From bug 16275386: It turns out that this issue is more general than previously thought. During txg sync, if an allocation on vdev 1 is waiting on space_map_load, further allocations will be directed to vdev 2 until it reaches its aliquot, then vdev 3, etc. Eventually it will come back to vdev 1 and also block waiting on space_map_load. This means vdev 1 won't be kept busy and so will reduce the write throughput, even when the ZIL is not in play. It should be possible to have a space map loaded asynchronously in advance so that metaslab_group_alloc() can always proceed without blocking, modulo corner cases. |
Aliquot: mg_aliquot = 512K * (# leaf vdevs). Normally, mg_aliquot is used as a limit to move to the next vdev (or metaslab group). In metaslab_ndf_fragmented(), mg_aliquot is used to decide whether we need to move to another metaslab if (maxsize < mg_aliquot).

Workaround:

# echo "metaslab_unload_delay/W 0x7FFFFFFF" | mdb -kw
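As a worked example of the aliquot formula above (the layout is hypothetical, chosen only for the arithmetic): a metaslab group whose top-level vdev has 20 leaf children gets mg_aliquot = 512 KB × 20 = 10 MB, so under normal conditions allocations move on to the next vdev (metaslab group) after roughly 10 MB has been directed to the current one in that pass.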
Bug number | Title | Workaround | Fix | Comments |
---|---|---|---|---|
17856080 | arc_meta_used is not necessarily all MRU | see below | 2013.1.7 | Symptoms: IO reads can be slow, especially for backups. Disks can be busy due to a lack of hits in MRU (see bug 22818942), or busy on IO reads due to no evictable space |
Signature: arc_delays.d

2016 Apr 12 09:21:02 freemem 8359 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;
2016 Apr 12 09:21:12 freemem 7909 MB; no evictable data in MRU; ARC metadata consuming 80% of MRU;

These messages mean that the MRU buffer is too small to keep the most recently used data. The issue comes from the arc_adjust() function, which adjusts the size of the MRU or MFU buffers if arc_size > arc_c. arc_adjust() can be used when ZFS needs to reduce its memory consumption. The current algorithm (Apr 2016) takes arc_meta_used into account, but "arc_meta_used is not necessarily all MRU"; that means we reduce the MRU too much. The adjust_arc_p.sh script tries to artificially increase arc_p (the target size of MRU) to 3 times the value of meta_used. It limits the size of arc_p to half of arc_c (the target size of the cache). If meta_used and arc_p are already too high, the script takes no action but displays a message. For example, here arc_p is too small:

p = 11696 MB /* target size of MRU */
c = 187039 MB /* target size of cache */
c_min = 64 MB /* min target cache size */
c_max = 261101 MB /* max target cache size */
size = 187039 MB /* actual total arc size */
[..]
meta_used = 21967 MB
meta_max = 41046 MB
meta_limit = 0 MB

Thanks to the script, p will go to 60GB (which is still below 187GB/2). Before, with tuning_zfs_arc_p_min_shift.akwf, arc_p was set to 1/8 of arc_size (reboot needed); apparently, this may not be enough.
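Cross-checking the numbers above with the script's rule (3 × meta_used, capped at c/2): 3 × 21967 MB ≈ 64 GB, which is below c/2 ≈ 91 GB, so the script is allowed to raise arc_p; that is in line with the ~60 GB value quoted (meta_used moves between samples, hence the small difference).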
Bugs under the bonnet
Bug number | Title | Workaround | Fix |
---|---|---|---|
15505861 | Write performance degrades when ZFS filesystem is near quota | Avoid quota usage, Free some space by deleting some datasets | 2013.1.1.9 |
Signature :
Typical stack when reaching the filesystem size limit:

> ::stacks -c dsl_dir_tempreserve_impl
zfs`dsl_dataset_check_quota
zfs`dsl_dir_tempreserve_impl+0xc2
zfs`dsl_dir_tempreserve_space+0x13e
zfs`dmu_tx_try_assign+0x208
zfs`dmu_tx_assign+0x2a
zfs`zfs_write+0x63c
genunix`fop_write+0xa6
genunix`write+0x309
genunix`write32+0x22
unix`_sys_sysenter_post_swapgs+0x149
Actually, we get into trouble when the return code is 91 (ERESTART). This can lead to long spa_sync times.
dsl_dataset.c:

dsl_dataset_check_quota(dsl_dataset_t *ds, uint64_t inflight)
{
	int error = 0;

	mutex_enter(&ds->ds_lock);
	/*
	 * If they are requesting more space, and our current estimate
	 * is over quota, they get to try again unless the actual
	 * on-disk is over quota and there are no pending changes (which
	 * may free up space for us).
	 */
	if (ds->ds_quota != 0 &&
	    ds->ds_phys->ds_used_bytes + inflight >= ds->ds_quota) {
		if (inflight > 0 || ds->ds_phys->ds_used_bytes < ds->ds_quota)
			error = ERESTART;
		else
			error = EDQUOT;
	}
	mutex_exit(&ds->ds_lock);
	return (error);
}
Dtrace can be of great help to check the return value of dsl_dataset_check_quota() and dsl_dir_tempreserve_space(). If we get something different from 0, we may be hitting the issue.
# dtrace -n 'dsl_dataset_check_quota:return /arg1!=0/ { trace(arg1) }'
CPU ID FUNCTION:NAME
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91
3 45299 dsl_dataset_check_quota:return 91

# dtrace -n 'dsl_dir_tempreserve_space:entry { printf("%s a:%lx f:%lx u:%lx", stringof(args[0]->dd_myname),arg2,arg3,arg4)} dsl_dir_tempreserve_space:return { trace(arg1)}'
CPU ID FUNCTION:NAME
0 44317 dsl_dir_tempreserve_space:entry prepcc-t1_oradata2-Clone_DAA1
0 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
1 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
1 44318 dsl_dir_tempreserve_space:return 91
3 44317 dsl_dir_tempreserve_space:entry preptux-d1_home
3 44318 dsl_dir_tempreserve_space:return 91
Bug number | Title | Workaround | Fix |
---|---|---|---|
15710347 | system hung, lots of threads in htable_steal | See below | 2013.1.1.5 |
Signature :
> ::stacks -c htable_steal
THREAD           STATE SOBJ   COUNT
fffff60011597760 RUN   <NONE>     9
swtch+0x150
turnstile_block+0x760
mutex_vector_enter+0x261
htable_steal+0x1a4
htable_alloc+0x248
htable_create+0x1dc
hati_load_common+0xa3
hat_memload+0x81
hat_memload_region+0x25
segvn_faultpage+0x937
segvn_fault+0xc13
as_fault+0x5ee
pagefault+0x99
trap+0xe63
Workaround :
> arc_stats::print -ta arc_stats_t arcstat_c_max.value.i64
fffffffffbd43680 int64_t arcstat_c_max.value.i64 = 0x5bef18000
> fffffffffbd43680/Z0x4000000
arc_stats+0x4a0: 0x5bef18000 = 0x4000000   <<< 64MB
> ::arc -m ! grep max
c_max = 64 MB
meta_max = 889 MB

With:
0x400000000 << 16GB
0x600000000 << 24GB
Bug number | Title | Workaround | Fix |
---|---|---|---|
15742475 | AK-2011.04.24 Extremely sluggish 7420 node due to heap fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
memstat shows that we have plenty of memory free; however, a "::stacks -m zfs" shows that we have a number of zfs stacks waiting for allocations.

> ::stacks -m zfs
ffffffaac7a07440 SLEEP CV 36
swtch+0x145
cv_wait+0x61
vmem_xalloc+0x635
vmem_alloc+0x161
segkmem_xalloc+0x9
Bug number | Title | Workaround | Fix |
---|---|---|---|
15751065 | appliance unavailable due to zio_arena fragmentation | limit arc_size to half of the DRAM, set arc_meta_limit to 2/5 of DRAM | 2011.1.3 |
Signature :
We haven't run out of available memory:

> ::memstat
Page Summary           Pages          MB   %Tot
------------  --------------  ----------  -----
Kernel               8480307       33126     6%
ZFS File Data      120780433      471798    90%   << ZFS ARC cache for data
Anon                  231541         904     0%
Exec and libs             73           0     0%
Page cache              6816          26     0%
Free (cachelist)       22290          87     0%
Free (freelist)      4691874       18327     3%
Total              134213334      524270
Physical           134213333      524270

We have some threads waiting for a memory allocation:

> ::stacks -m zfs
THREAD           STATE SOBJ COUNT
ffffff86b38ab440 SLEEP CV      23
swtch+0x150
cv_wait+0x61
vmem_xalloc+0x63f
vmem_alloc+0x161
segkmem_xalloc+0x90
segkmem_alloc_vn+0xcd
segkmem_zio_alloc+0x24
vmem_xalloc+0x550
vmem_alloc+0x161
kmem_slab_create+0x81
kmem_slab_alloc+0x5b
kmem_cache_alloc+0x1fa
zio_data_buf_alloc+0x2c
arc_get_data_buf+0x18b
arc_buf_alloc+0xa2
arc_read_nolock+0x12f
arc_read+0x79
dsl_read+0x33
dbuf_read_impl+0x17e
dbuf_read+0xfd
dmu_tx_check_ioerr+0x6b
dmu_tx_count_write+0x175
dmu_tx_hold_write+0x5b
zfs_write+0x655
fop_write+0xa4
rfs3_write+0x50e
common_dispatch+0x48b
rfs_dispatch+0x2d
svc_getreq+0x19c
svc_run+0x171
svc_do_run+0x81
nfssys+0x760
_sys_sysenter_post_swapgs+0x149

From ::kmastat -g, the buffers consuming memory are for data:

> ::kmastat -g ! grep -v 0G
zio_data_buf_65536   65536 4694099 4694101 286G 21342088 0
zio_data_buf_131072 131072 1424500 1424500 173G  8844051 0
...
zfs_file_data      460G 512G   0G 10832435 0
zfs_file_data_buf  460G 460G 460G 10843627 0

Looking at the thread stacks, we were waiting on a memory allocation of just 0x20000 bytes (128KB):

> ::stacks -c vmem_xalloc
stack pointer for thread ffffff86b38ab440: ffffff03d56a8700
[ ffffff03d56a8700 _resume_from_idle+0xf1() ]
ffffff03d56a8730 swtch+0x150()
ffffff03d56a8760 cv_wait+0x61(ffffff84949ac01e, ffffff84949ac020)
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)   <<< 0x20000
ffffff03d56a8900 vmem_alloc+0x161(ffffff84949ac000, 20000, 4)
...

Most of the waiters on this condition variable are nfsd daemons:

> ffffff84949ac01e::wchaninfo -v
ADDR             TYPE NWAITERS  THREAD           PROC
ffffff84949ac01e cond       60: ffffff8570a5db60 akd
                                ffffff85a1baabc0 akd
                                ffffff86bdf66880 akd
                                ffffff85a1af6420 nfsd
                                ffffff85e09da420 nfsd
                                ffffffbd7b172bc0 nfsd
                                ffffff86bc6450e0 nfsd
                                ffffff86b38ab440 nfsd
                                ffffff85a1cdf0c0 nfsd
                                ffffffbd7b16e100 nfsd
                                ...

We have many ZFS threads doing a memory allocation:

> ::stacks -m zfs |::findstack -v ! grep vmem_xalloc | grep ffffff84949ac000 | head
ffffff03d56a88a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3cb08a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d4e498a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d37dc8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d06288a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d43708a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d35f58a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d3d768a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d38368a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff03d382a8a0 vmem_xalloc+0x63f(ffffff84949ac000, 20000, 1000, 0, 0, 0, 0, 4)

Walking the free vmem_segs of vmem_t ffffff84949ac000, we have many free segments of 0x10000 bytes (64KB) but none of 0x20000 (128KB):

> ffffff84949ac000::walk vmem_free | ::vmem_seg -s ! awk '{print $4}' | sort | uniq -c
    108 0000000000001000
     97 0000000000002000
    166 0000000000003000
    148 0000000000004000
    ...
     09 000000000000f000
 837558 0000000000010000   << 64KB
     30 0000000000011000
     34 0000000000012000
     42 0000000000013000
Bug number | Title | Workaround | Fix |
---|---|---|---|
15710534 | SUNBT7038138 advance rotor before blocking on space_map_load | See below | 2011.1.6 |
Signature :
Time spent to read metadata during sync is responsible for the sawtooth pattern of IOPS
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff" | mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
Bug number | Title | Workaround | Fix |
---|---|---|---|
15790948 | SUNBT7167903 pre-6998140 soft-ring fanout is single threaded for VLANs | Do not use VLAN under 2011.1 if you need high throughput | 2013.1 |
Signature :
* 1 or 2 CPUs 99% busy
* High level of interrupts

VLAN usage:

# dladm show-link
bge0 type: non-vlan mtu: 1500 device: bge0
bge1 type: non-vlan mtu: 1500 device: bge1
bge2 type: non-vlan mtu: 1500 device: bge2
bge3 type: non-vlan mtu: 1500 device: bge3
aggr1 type: non-vlan mtu: 1500 aggregation: key 1
aggr111001 type: vlan 111 mtu: 1500 aggregation: key 1
aggr112001 type: vlan 112 mtu: 1500 aggregation: key 1
aggr23002 vnic 1500 up -- aggr2   <<< VLAN ID 23 of aggr2
aggr21002 vnic 1500 up -- aggr2
aggr22002 vnic 1500 up -- aggr2
Bug number | Title | Workaround | Fix |
---|---|---|---|
15754028 | read-modify-write in space_map_sync() may significantly slow down spa_sync() performance | see below | 2013.1.2.9 |
Signature :
Run the spacemap_15754028.d script to determine if the observed performance issue is due to this bug. If it is, the script shows consistently long spa sync times (>15s) in the “Max sync” column, and high numbers (>200) in the “Sync RMW” column:
Pool : Max sync  Sync Load  Minsize MB, Count  Load Miss  Sync RMW  anon/MRU/MFU
2013 Sep 13 06:09:30
system :     503 ms  0  0  133 MB, 6017     0      0  0/0/0
mstorepool 56329 ms  5  0    0 MB, 355572   0  21259  0/21/21238
Typical stack trace is as follows:

swtch+0x150()
cv_wait+0x61(ffffff8a047b2cb8, ffffff8a047b2cb0)
zio_wait+0x5d(ffffff8a047b29a0)
dbuf_read+0x1e5(ffffff8b754b8d88, 0, 9)
dbuf_will_dirty+0x5d(ffffff8b754b8d88, ffffff91ec49c338)
dmu_write+0xe1(ffffff856606d340, 446, 10fa0, 478,
space_map_sync+0x295(ffffff86174c8dd0, 1, ffffff86174c8ae0,
metaslab_sync+0x2de(ffffff86174c8ac0, a8b)
vdev_sync+0xd5(ffffff8615400640, a8b)
spa_sync+0x45b(ffffff8638592040, a8b)
txg_sync_thread+0x247(ffffff85660903c0)
thread_start+8()
Workaround :
# echo "metaslab_unload_delay/W 0x7fffffff" | mdb -kw
# echo "set zfs:metaslab_unload_delay = 0x7fffffff" >> /etc/system
Bug number | Title | Workaround | Fix |
---|---|---|---|
16423386 | dnode_next_offset() doesn't do exhaustive search | no WO | 2013.1.1.4 |
Signature :
Clone deletion issue: note the presence of traverse_dnode() twice on the stack.

zfs`arc_read_nolock+0x7e7
zfs`arc_read+0x79
zfs`dsl_read+0x33
zfs`traverse_visitbp+0x431
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <---- B
zfs`traverse_visitbp+0x387
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_visitbp+0x4dd
zfs`traverse_dnode+0xa3 <----- A
zfs`traverse_visitbp+0x1fc
zfs`traverse_impl+0x213
zfs`traverse_dataset+0x57
zfs`dsl_dataset_destroy_sync+0x290
zfs`dsl_sync_task_group_sync+0xf3
zfs`dsl_pool_sync+0x1ec
Bug number | Title | Workaround | Fix |
---|---|---|---|
17650441 | NFS drops to about 0 for about 30 seconds and then returns to normal | no WO | IDR 2011.04.24.8.0,1-2.43.2.3 |
Fixed thanks to bug 6988530 NFS threads partially STOPPED after nfsd restart during ZFSSA fail-back.
Signature :
spa_sync activity lasts more than 5 seconds. It looks like 15754028, but the workaround may not help. Use the synctime7 dtrace script to confirm the issue.
Bug number | Title | Workaround | Fix |
---|---|---|---|
15723911 | SUNBT7058001 concurrency opportunity during spa_sync | see below | 2013.1.1.1 |
Parent of 15885830 : RMAN BACKUP PERFORMANCE REDUCED 15-30% WHEN USING UNSTABLE WRITES WITH 11.2.0.3 – Fixed in 2013.
Signature :
When running a heavy asynchronous IO write workload, NFSv3 response times can show significant variations (up to 7 seconds). With a synchronous workload, the maximum response time stays normal, around 400ms. For RMAN backups, a performance decrease of 15-30% has been observed.
ZFS periodically writes out dirty cache buffers to disk. This is called a “sync” — we are synchronising what is in core memory with what is on disk, and it has many steps. The syncing procedure is managed by a component of ZFS called the Storage Pool Allocator, or SPA. This enhancement addresses two of the many steps that the SPA takes in order to complete the sync.
One of the steps is to process all of the “frees”. A “free” involves releasing a previously allocated region on a disk and making it available for allocation again. As ZFS determines which blocks to free, it adds them to a list. Once the list is complete, the SPA would loop through that list and process the frees one at a time. The enhancement that was made to this step was to reduce the number of items on the list by processing the frees as they come in. This greatly reduces the amount of CPU required to handle each block that is being freed.
Another step that the SPA sync procedure does is to write out the metaslab space maps. These space maps contain the allocations that are done on disks, so ZFS can know which regions are free. Prior to this enhancement, each disk was processed sequentially and its space maps were written out. With the change in place, we can now write out the space maps for each disk in parallel.
Workaround :
zfs set sync=always <dataset in trouble>
Bug number | Title | Workaround | Fix |
---|---|---|---|
15652868 | SUNBT6965013 DTrace compiler doesn't align structs properly | Do not use analytics broken down by hostname | 2013.1 |
Signature :
When monitoring IP packets by hostname, we can decrease the throughput by 25%.
# dtrace -n ' profile-997hz /arg0 && curthread->t_pri != -1 && cpu==58 / { @[stack()] = count(); } tick-60sec { trunc(@, 10); printa(@); exit(0); }' > /var/ak/dropbox/cpu-busy-$date.out &

The most used stacks were as follows:

ip`ill_input_short_v4+0xdb
ip`ip_input+0x23b
dls`i_dls_link_rx+0x2e7
mac`mac_rx_deliver+0x5d
mac`mac_rx_soft_ring_process+0x17a
mac`mac_rx_srs_fanout+0x823
mac`mac_rx_srs_drain+0x261
mac`mac_rx_srs_process+0x180
mac`mac_rx_classify+0x159
mac`mac_rx_flow+0x54
mac`mac_rx_common+0x1f6
mac`mac_rx+0xac
mac`mac_rx_ring+0x4c
ixgbe`ixgbe_intr_msix+0x99
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hardint+0x33
unix`switch_sp_and_call+0x13
16175

Checking what this part of the code does, we can see the following; the 'nop' instructions are the signature of a dtrace probe:

st1f742003b# echo ill_input_short_v4+0xdb::dis | mdb -k
ill_input_short_v4+0xdb: nop
ill_input_short_v4+0xdc: nop
ill_input_short_v4+0xdd: nop
ill_input_short_v4+0xde: addq $0x10,%rsp
ill_input_short_v4+0xe2: movq -0xb0(%rbp),%rcx
ill_input_short_v4+0xe9: movq %r12,%rdi
ill_input_short_v4+0xec: xorq %rsi,%rsi
ill_input_short_v4+0xef: movq %r13,%rdx
ill_input_short_v4+0xf2: nop
ill_input_short_v4+0xf3: nop
ill_input_short_v4+0xf4: nop

Checking which analytics have probes in ip, I noticed this:

dataset-022 active 6.72M 235M ip.bytes[hostname]

After disabling it, we have been able to sustain a higher throughput:

st1f742003b:analytics dataset-047> read 10
DATE/TIME KB/SEC KB/SEC BREAKDOWN
2013-9-10 15:06:49 349787 349755 ixgbe0 32 igb1
2013-9-10 15:06:50 351674 351643 ixgbe0 31 igb1
2013-9-10 15:06:51 650618 650580 ixgbe0 38 igb1
Bug number | Title | Workaround | Fix |
---|---|---|---|
17832061 | One dropped packet can permanently throttle TCP transfer | No easy workaround unless using the same VLANS for clients and the ZFS appliance | 2013.1.2.9 |
Signature :
When clients and the ZFS appliance are on different VLANs, a 50% throughput decrease has been observed. Note that this does not impact IO writes; it is a pure network issue.
* is the issue seen under 2011.1: YES
* is the issue seen under 2013.1 and the same VLANs (not routed): NO
* is the issue seen under 2013.1 and different VLANs (routed): YES
In addition to the congestion window, there is a variable that governs the maximum number of packets allowed to be sent in a single burst. Initially this number is essentially infinite, with the number of packets in a burst allowed to keep growing indefinitely as long as there are no dropped packets indicating network congestion. However, when a packet is dropped, the number of packets may be reduced.
The problem we have here is that if the destination of the connection is on the same local network, the burst is kept at infinity, but if it is not on the same local network, the burst is reduced to 5. Under some circumstances it can even be reduced further to 3, but from 3 it will grow back to 5. This accounts for the difference between the VLANs: one VLAN is deemed local and one is deemed not local. On the non-local VLAN, once a packet is dropped, the limit is reduced forever to 5.
Beyond 2013.1.2.9 the following tunable has to be changed :
echo 'tcp_cwnd_normal/W 0x14' | mdb -kw
In some cases, it seems we can set tcp_cwnd_normal to 0x64 to get better throughput. Adam's feedback for a perf issue with IO reads done by NDMP is as follows:
tcp_cwnd_normal=0x5  : 3 MB/s
tcp_cwnd_normal=0x14 : 30 MB/s
tcp_cwnd_normal=0x64 : 80 MB/s

Bigger values did not help much.
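To verify which value is currently in effect (a read-back of the same 32-bit tunable the workaround writes):

zfssa# echo tcp_cwnd_normal/D | mdb -k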
Bug number | Title | Workaround | Fix |
---|---|---|---|
18125379 | memory leak in l2arc on sync reads that fail | see below | 2013.1.2.0 |
Signature :
A typical kstate.out file will look as follows:

::arc -m
p = 4 MB
c = 64 MB                  << target ARC size has reached its minimum
c_min = 64 MB
c_max = 130030 MB
size = 98748 MB            << current ARC size, but no data is evictable anymore
[...]
l2_evict_reading = 0
l2_abort_lowmem = 10240    << memory pressure: the kmem_cache_reap_now() thread has run
l2_cksum_bad = 504213672   << issue when data read from SSD cache devices is validated before it is given to the ARC
l2_io_error = 0
l2_hdr_size = 29696 MB

* l2_cksum_bad will cause a memory leak on sync reads (not on async reads) due to 18125379
* getting 'l2_cksum_bad' will make the read go to the original storage device
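The same counter is exported through the arcstats kstat, so the growth of l2_cksum_bad can be watched without mdb; a sketch with 10-second sampling (assuming the standard zfs:0:arcstats kstat naming):

zfssa# kstat -p zfs:0:arcstats:l2_cksum_bad 10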
Workaround :
The SSDs do not have to be replaced.
There is no workaround so far. A reboot will be needed to get around the issue, but we have analytics to measure arc.c, so we can take preemptive action and schedule a reboot before it gets messy. Do as follows:
- Add the following analytic, if not present already: **Cache - ARC target size : arc.c**
- Add an ALERT from the **Configuration->Alerts panel**:
  * Click on (+) Threshold alerts.
  * Select "Cache ARC target size"
  * Select "fails below"
  * Enter 10000 (MB)
  * Fill in the correct email address
- If you receive an alert, it will be time to schedule a NAS head reboot.
I chose **10000 MB**, but it really depends on the memory size. Obviously, when arc.c has decreased to 64MB, it’s already too late.
Beyond that, we still do not know why ‘l2_cksum_bad’ increased so much, and Victor Latushkin said this may not be a HW issue in the SSDs.
Bug number | Title | Workaround | Fix |
---|---|---|---|
18562374 | missing call to arc_free_data_block in l2arc_read() | NA | 2013.2 |
18695640 | race between arc_write_done() and arc_evict_buf() | NA | as of Oct. 2017, this did not seem to be fixed |