Travis Paul

Thoughts no one asked for and more.

How Many LWPs are Enough LWPs?

Ducks ain't supposed to be downtown, there's nothing for 'em there!
Photo by August Phlieger on Unsplash

Over Labor Day weekend I decided to re-install my local Triton. After getting everything back online, I noticed that any attempt to provision a zone with a delegated dataset would fail. The failed provision jobs had the error:

vmadm exited with code: 1: first of 2 errors: timed out waiting for /var/svc/provisioning to move for 9d591bec-774f-4afe-8dbc-e6c359849886

I found that the Triton Data Center private cloud documentation references this error message and I tried the troubleshooting steps without success. It also suggested that one might need to consult the vmadm logs. Doing just that revealed this hint:

[2022-09-08T18:00:09.873Z]  WARN: vmadmd/5401 on davao: (req_id=c4565909-175f-c4e6-d6f4-cf6648f61d7a, action=waitForProvisioning, vm=8d12c37b-7daf-4bba-b128-13bedd6e6bb0)
    svcs -xv output for 8d12c37b-7daf-4bba-b128-13bedd6e6bb0 at fail time:
    svc:/system/filesystem/local:default (local file system mounts)
      Zone: 8d12c37b-7daf-4bba-b128-13bedd6e6bb0
     Alias: delegate-test-7
     State: offline since September  8, 2022 at 05:55:58 PM UTC
    Reason: Start method is running.
       See: /zones/8d12c37b-7daf-4bba-b128-13bedd6e6bb0/root/var/svc/log/system-filesystem-local:default.log
    Impact: 13 dependent services are not running:

On the CN, the zone's root dataset was installed but, it's status was failed. Invoking zfs list revealed that the expected delegate dataset existed. I wanted to try and boot the zone manually and poke around a bit further:

root@cn# zoneadm -z 649dc916-1a80-42ad-93c1-c2f8f0d9fdd6 boot

root@cn# zlogin 649dc916-1a80-42ad-93c1-c2f8f0d9fdd6
root@649dc916-1a80-42ad-93c1-c2f8f0d9fdd6# svcs
STATE          STIME    FMRI
... [snip unrelated services] ...
offline*       18:57:15 svc:/system/filesystem/local:default

The svcs(1) man page kindly reminded me that:

An asterisk (*) is appended for instances in transition

It was clear that the service was never going to complete it's transition and the log file had nothing of use. Inspecting the service's script showed that the service runs zfs mount -va at one point and ps showed that a zfs command was indeed running:

root@649dc916-1a80-42ad-93c1-c2f8f0d9fdd6# ps aux
root     21281  0.1  0.1 8804 7632 ?        S 18:57:08  0:08 /lib/svc/bin/svc.configd
root     21668  0.0  0.0 2216 1192 pts/2    O 18:58:51  0:00 ps aux
root     20955  0.0  0.0    0    0 ?        S 18:57:07  0:00 zsched
root     21258  0.0  0.0 2324 1440 ?        S 18:57:08  0:00 /sbin/init
root     21317  0.0  0.0 3964 2384 pts/2    S 18:57:09  0:00 /usr/bin/login -z global -f root
root     21318  0.0  0.0 3660 2548 pts/2    S 18:57:09  0:00 -bash
netadm   21336  0.0  0.0 3816 2512 ?        S 18:57:10  0:00 /lib/inet/ipmgmtd
root     21401  0.0  0.1 5556 3384 ?        S 18:57:12  0:00 /usr/sbin/nscd
root     21512  0.0  0.0 1948 1120 ?        S 18:57:14  0:00 /usr/lib/pfexecd
root     21592  0.0  0.1 4928 2980 ?        S 18:57:15  0:00 /sbin/sh /lib/svc/method/fs-local
root     21600  0.0  0.1 10092 6060 ?        S 18:57:15  0:00 /usr/sbin/zfs mount -va
root     21610  0.0  0.1 4928 2980 ?        S 18:57:15  0:00 /sbin/sh /lib/svc/method/fs-local
root     21617  0.0  0.1 8384 4356 ?        S 18:57:16  0:00 /usr/sbin/zfs mount -va
root     21620  0.0  0.0 1764 1180 console  S 18:57:16  0:00 /etc/sulogin -
root     21621  0.0  0.0    0    0          Z           0:00  

Curious what it's doing, I consulted pstack(1):

root@649dc916-1a80-42ad-93c1-c2f8f0d9fdd6# pstack 21617
... [snip lots of output] ...
-------------------- thread# 66 / lwp# 66 --------------------
 fffffc7fef040957 lwp_park (0, 0, 0)
 fffffc7fef039deb cond_wait_queue (4afeb8, 4afe60, 0) + 5b
 fffffc7fef03a23d cond_wait_common (4afeb8, 4afe60, 0) + 1dd
 fffffc7fef03a40c __cond_wait (4afeb8, 4afe60) + 6c
 fffffc7fef03a494 cond_wait (4afeb8, 4afe60) + 24
 fffffc7feeb29d90 zfs_taskq_thread (4afe40) + 160
 fffffc7fef0405fc _thrp_setup (fffffc7fef1b0240) + 6c
 fffffc7fef040910 _lwp_start ()

The process has a lot of LWPs and they are all hanging out at the lwp_park.

While we're here, now would be a good time to define a Light Weight Process. According to the Solaris Internals book an LWP is "the kernel visible execution context for a thread". Every process has one LWP and every additional thread in that process has an LWP too.

In Triton, the number of threads (LWPs) a zone can use are configurable for each package. I had created some new packages during my re-install and set 400 LWPs for this particular package as you can see from the prctl(1) output:

root@cn# prctl -i zone -n zone.max-lwps  649dc916-1a80-42ad-93c1-c2f8f0d9fdd6
zone: 14: 649dc916-1a80-42ad-93c1-c2f8f0d9fdd6
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
        usage             398    
        privileged        400       -   deny                                 -
        system          2.15G     max   deny                                 -

Well, there's the problem! The zone is using 398 of the 400 allowed LWPs. When using delegate datasets 400 LWPs is not enough for the zfs mount invocation that's needed. After a bit of testing it seems neccesary to have at least 625 LWPs in order to create a Joyent branded zone with delegate datasets else svc:/system/filesystem/local:default will hang forever. At least on the hardware that I'm using. It is worth noting that, without a delegate dataset, 250 LWPs will do (the lowest number I could create for a package) and so I never encountered the issue until I went to use the delegate dataset feature.

Unsolved Mysteries

  • At one point while debugging the issue, one of the zfs mount instances had over 500 LWPs. Why are so many LWPs needed for one process?
  • Why are two instances of /usr/sbin/zfs mount -va running at once in the zone? Maybe those can be serialized to reduce the number of LWPs needed?