Dude, Where's My SUID Bit?
A suggestion Dan Langille shared in a EuroBSDcon talk has been resonating with me lately:
"I encourage you to keep a blog or a diary of any type, just of the everyday stuff you are doing."
I've heard this suggestion before but unfortunately not early enough in my career. I'm starting to do that now, though, a little over a decade too late. But I figured I might document some of the interesting things that I recall from previous years while still remembering the details.
First, a little background on this story. In 2017 I added a feature to a web application to schedule jobs. This job scheduling feature allowed the support team to configure data import and export tasks at regular intervals. This removed the need for a developer to create a one-off script to sync data for each customer. We found that nearly every customer wanted some kind of data integration with an existing system they were using, and we had a lot of scripts floating around to enable those integrations. It was time to rein in the sprawl. Since we had already done a dozen or so integrations, we saw a pattern and knew we could make something generic enough for most customers. Since 2017, we've added a few features but have yet to make another custom integration, so it worked out pretty well!
At the time, the application ran either on-premises on Windows or in our hosted environment, which was Linux, so I needed to support both. To achieve this, I created an abstract class with two implementations: one for Windows, which used the Task Scheduler, and one for Linux, which used cron. This design also allowed us to adopt some other task scheduling mechanism in the future if we didn't want to use cron forever, since cron somewhat binds a scheduled job to a particular host in the cluster (at least without some kind of additional coordination). Any new implementation we might need just has to implement the interface of the abstract class.
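To make the shape of that design concrete, here is a minimal sketch in Python. It is not the application's actual code: the names Job, JobScheduler, and InMemoryCronScheduler are hypothetical, and the cron backend only renders crontab text in memory rather than installing it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Job:
    """A scheduled import/export task (hypothetical shape)."""
    name: str
    command: str
    cron_expr: str  # five-field cron schedule, e.g. "0 2 * * *"


class JobScheduler(ABC):
    """Interface that every platform-specific backend implements."""

    @abstractmethod
    def add(self, job: Job) -> None:
        """Create or replace the job with this name."""

    @abstractmethod
    def remove(self, name: str) -> None:
        """Delete the job if it exists."""

    @abstractmethod
    def list_jobs(self) -> List[Job]:
        """Return all jobs known to this backend."""


class InMemoryCronScheduler(JobScheduler):
    """Stand-in for a cron backend (assumption: illustration only).

    A real backend would hand the text from render() to crontab(1);
    a Windows backend would talk to the Task Scheduler instead.
    """

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        self._jobs[job.name] = job

    def remove(self, name: str) -> None:
        self._jobs.pop(name, None)

    def list_jobs(self) -> List[Job]:
        return list(self._jobs.values())

    def render(self) -> str:
        """Render the jobs as crontab text, one comment plus one entry each."""
        lines: List[str] = []
        for job in self._jobs.values():
            lines.append(f"# job: {job.name}")
            lines.append(f"{job.cron_expr} {job.command}")
        return "\n".join(lines) + ("\n" if lines else "")
```

The calling code only ever sees JobScheduler, so swapping cron for some cluster-aware scheduler later means writing one more subclass.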
This was quite an exciting project in and of itself, and I have a lot more respect for the Windows Task Scheduler at the end of it than when I went into it. I interfaced with it via COM, which was dense at first but worked better than expected.
This new feature was rolled out to our on-premises customers first without many issues. 95% of our on-premises users at the time ran Windows, so the feature didn't run on a production Linux host for a little while. I was excited when I heard that a new customer was onboarded onto our Cloud environment, which meant the feature was finally getting used on Linux.
It wasn't long before I got word from the support team that the feature wasn't working and was failing with a permission denied error.
I had documented the new requirements for this feature and passed them along to the sysadmin managing the systems to ensure a smooth rollout. This application ran on Apache, and on SELinux-enabled systems, Apache can't invoke the crontab(1) program by default. Ahead of time, I had created a custom SELinux policy that allowed Apache to manage crontab entries via the crontab command. There was also a small DB migration and maybe one or two other commands required to get everything rolling. So the first thing I checked was that the sysadmin managing these servers had set everything up correctly. The sysadmin said they had made all the necessary updates. The migration and SELinux policy were bundled with the software's RPM, so I trusted that everything was complete, added more logging, and pushed out another version of the software.
Still, the issue persisted, and the logs didn't reveal any additional details as I'd hoped. This was weird. I double, triple, and quadruple-checked everything on my VMs, rolled back to a clean VM, even checked the QA server where the feature was working. Everything worked as expected. Why wasn't this working in production?
I finally had come to a point where I had to do some troubleshooting on this specific host, as I was out of other options. The sysadmin begrudgingly gave me access, and I got to work.
One nice thing about this web application is that it had a CLI that could run any code within the web application. I had already extended that CLI to provide an alternative interface into the task scheduling utility, so I ran that as the Apache user via sudo.
It was returning the same error.
Next, I bypassed the application entirely and ran crontab commands as Apache using sudo.
It was still failing.
This was interesting: it was failing even outside of any of the code I wrote. Some other configuration had to be causing this issue to occur on only this system.
I double-checked that the sysadmin had applied the SELinux policy and added the apache user to /etc/cron.allow, and they had done so. I checked the SELinux denial logs, and there was nothing there either.
I could tell this VM was almost entirely manually configured. There appeared to be some Ansible involved, but it didn't look extensive enough to configure very much, and nothing in those scripts looked like it might be causing an issue.
Since I had access to two systems behaving differently, I reached for one of my favorite Linux tools: strace
"strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state. The operation of strace is made possible by the kernel feature known as ptrace."
I ran the crontab commands with strace on the broken system and captured the output. I then did the same on a working system and diffed the two files. It was like looking for a needle in a haystack...
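Raw strace output from two hosts differs on almost every line because of PIDs, timestamps, and memory addresses, which is a big part of the haystack. I didn't do this at the time, but a small script along these lines would have cut the noise before diffing. It is a rough sketch: the normalization rules are my own assumptions about what is safe to mask, and the function names are made up for this post.

```python
import difflib
import re
from typing import List

# Values that change between any two runs and only add noise to a diff.
_NOISE = [
    (re.compile(r"^\d+\s+"), ""),                          # leading PID (strace -f)
    (re.compile(r"^\d{2}:\d{2}:\d{2}(\.\d+)?\s+"), ""),    # timestamps (strace -t / -tt)
    (re.compile(r"0x[0-9a-f]+"), "0xADDR"),                # pointers and mmap addresses
]


def normalize(line: str) -> str:
    """Mask run-specific values so equivalent syscalls compare equal."""
    for pattern, replacement in _NOISE:
        line = pattern.sub(replacement, line)
    return line.rstrip()


def diff_traces(working: str, broken: str) -> List[str]:
    """Return only the added/removed lines between two strace captures."""
    a = [normalize(line) for line in working.splitlines()]
    b = [normalize(line) for line in broken.splitlines()]
    diff = list(
        difflib.unified_diff(a, b, fromfile="working", tofile="broken",
                             lineterm="", n=0)
    )
    # Skip the two file-header lines, then drop the "@@" hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Lines prefixed with "-" exist only in the working capture and lines prefixed with "+" only in the broken one, so a call that one system makes and the other skips stands out immediately.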
I can't recall exactly where it jumped out at me (this was ~5 years ago), and attempting to recreate the strace output on a newer CentOS system didn't produce the results that I recall.
But in my attempt to recreate it, I did find a hint:
access("/etc/suid-debug", 0 /* F_OK */) = -1 ENOENT (No such file or directory)
This call appeared only in the output from the working system.
What is /etc/suid-debug?
"There is one problem with MALLOC_CHECK_: in SUID or SGID binaries it could possibly be exploited since diverging from the normal programs behavior it now writes something to the standard error descriptor. Therefore the use of MALLOC_CHECK_ is disabled by default for SUID and SGID binaries. It can be enabled again by the system administrator by adding a file /etc/suid-debug (the content is not important it could be empty)."
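That file is used to re-enable glibc's heap consistency checking (MALLOC_CHECK_), but it is only checked when running SUID or SGID binaries. Seeing that check only in the working output meant crontab was running SUID there and not on the broken host. Sure enough, when I checked the broken system, the SUID bit was, in fact, missing on the crontab program!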
"The Unix access rights flags setuid and setgid (short for "set user ID" and "set group ID") allow users to run an executable with the file system permissions of the executable's owner or group respectively and to change behaviour in directories. They are often used to allow users on a computer system to run programs with temporarily elevated privileges in order to perform a specific task."
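Checking for the bit is a one-liner with ls -l, but for completeness here is a small Python sketch of the same check, the kind of thing that could sit in a deployment sanity test. The function names are hypothetical; the calls are the standard library's os.stat and stat constants.

```python
import os
import stat


def describe_setuid(path: str) -> dict:
    """Report the permission details that matter for a setuid program."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),            # e.g. "-rwsr-xr-x"
        "setuid": bool(st.st_mode & stat.S_ISUID),    # set-user-ID bit
        "setgid": bool(st.st_mode & stat.S_ISGID),    # set-group-ID bit
        "owner_uid": st.st_uid,                       # 0 means root
    }


def is_setuid_root(path: str) -> bool:
    """True when the file has the SUID bit set and is owned by root."""
    info = describe_setuid(path)
    return info["setuid"] and info["owner_uid"] == 0
```

Calling is_setuid_root("/usr/bin/crontab") on the two hosts would have shown the difference right away, assuming crontab lives at that path.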
I removed the SUID bit from the working system via chmod u-s /usr/bin/crontab, and it broke in the same way as the production server. Finally, I had found the problem!
On RHEL and some other Unix systems, the crontab program needs to have the SUID flag set and be owned by root. This is necessary because the /var/spool/cron directory where the user crontab files are stored is owned by root. When a user edits their crontab file, the program will create a temporary file in the spool directory, ensure that it parses correctly, and then rename it to match the user's name.
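The write, validate, rename sequence described above is a common pattern, and a rough Python sketch shows why directory permissions matter so much. This is not crontab's actual implementation: install_table and looks_valid are hypothetical names, and the validation is deliberately much looser than real cron syntax.

```python
import os
import re
import tempfile

# Loose check (assumption): five schedule fields followed by a command.
# Real cron syntax (ranges, steps, @reboot, env lines) is richer than this.
_ENTRY = re.compile(r"^\s*(\S+\s+){5}\S.*$")


def looks_valid(text: str) -> bool:
    """Accept blank lines, comments, and lines shaped like cron entries."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not _ENTRY.match(line):
            return False
    return True


def install_table(spool_dir: str, user: str, text: str) -> str:
    """Write to a temp file in spool_dir, validate, then rename into place."""
    # Creating the temp file needs write access to spool_dir itself. Without
    # elevated privileges on a root-owned directory, this raises
    # PermissionError before anything else happens.
    fd, tmp_path = tempfile.mkstemp(prefix="tmp.", dir=spool_dir)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        if not looks_valid(text):
            raise ValueError("table failed validation; nothing installed")
        final_path = os.path.join(spool_dir, user)
        os.replace(tmp_path, final_path)  # atomic within one filesystem
        return final_path
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the rename happens only after validation, a bad edit never replaces a good table, and because every step touches the spool directory, a crontab binary without its SUID bit fails at the very first one.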
I logged into the production system and set the SUID flag on crontab, and everything worked as expected. The support team was able to configure the customer's jobs, and the data could flow. I relayed my finding to the sysadmin in charge of the system, and they couldn't explain the permission issue. Leaving no stone unturned, I even checked the RPM SPEC file and confirmed that crontab has SUID set and is owned by root when installed.
To this day, I don't know for sure what caused the missing SUID flag. Was it an attempt by a mischievous sysadmin to reduce the number of SUID binaries on the system? A fat-finger mistake, or an Ansible run gone wrong? I will likely never know.