SRE Ops Lab
---
Author: Donald Morton
Date: May 5, 2021
---
Back in January 2020, I interviewed for a Senior Site Reliability Engineer
position at a company in Austin. They had a pretty cool process for testing
their candidates' troubleshooting skills. I had a ton of fun doing it, so I'm
sharing it here in the hopes that other companies adopt the idea.
I've run into many of these situations in the past as a systems administrator.
It's 3am. People are yelling at me on the phone. I can't think straight. What in
the world did my co-workers do to this box? Why would they do these things?
The basic idea is that they spin up a temporary virtual machine with Confluence
installed on it. You are expected to ssh in, start Confluence, navigate to a
certain page within Confluence and tell them what's on it. The only problem is,
the instance is totally jacked up. It's broken in multiple ways that would
confound any normal person. They want details on how you approached your problem
solving.
I was given an IP address, a login name, and a private SSH key. They said I'd
have 60 minutes to fix everything and figure it out. The timer starts on the
first ssh connection.
### Relevant Facts Provided
- The database username
- The application install directory
- The application's home directory
- The Tomcat log directory
- The Confluence log directory
- The URL to the webapp's status page
I saved my command history for my writeup to the company. I'll include some of
those below to give an idea of my process. It's kind of neat seeing someone
poking around a system they're unfamiliar with.
## Initial SSH Connection
The first step, ssh'ing in:
```
chmod 600 Downloads/id_rsa
ssh -i Downloads/id_rsa lab@IP-ADDRESS
```
I took the private SSH key they emailed me, fixed its permissions, and used it
to SSH into the VM.
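The `chmod` matters because ssh refuses a private key that group or other users can read. A minimal sketch of checking a key before connecting, assuming GNU `stat`; the `key_mode` and `key_ok` helper names are made up for illustration:

```shell
# Hypothetical helper: print the octal permission bits of a file.
# Assumes GNU coreutils stat (-c); BSD/macOS stat uses -f instead.
key_mode() {
    stat -c '%a' "$1"
}

# Hypothetical helper: succeed only if group and other have no access
# at all (mode ends in "00", e.g. 600 or 400).
key_ok() {
    case "$(key_mode "$1")" in
        *00) return 0 ;;
        *)   return 1 ;;
    esac
}
```

Something like `key_ok Downloads/id_rsa || chmod 600 Downloads/id_rsa` would then only touch the key when it needs fixing.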
In a way, this alone is a good test. A senior should be able to ssh into a
machine with a private key. It's basic, but there are candidates out there who
would be unable to do this.
Once I was logged in, I checked disk space and memory. There was nothing out of
the ordinary there.
```
df -h
free -g
htop
top
```
I tried starting Confluence by running the startup script (given in my original
email). Of course, it failed to start.
## Problem #1: Incorrect Directory Reference in Init Script
The `/etc/init.d/confluence` script had an incorrect directory reference
(`/zopt` instead of `/opt`).
Upon opening the script in vi, I saw that one of the directories it used was
obviously wrong.
```
/etc/init.d/confluence start
ll /opt/atlassian/confluence/bin
set -o vi
vi /etc/init.d/confluence
/etc/init.d/confluence start
```
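One way to catch this kind of typo without reading the whole script is to pull out every absolute path it mentions and check which ones are missing. A rough sketch; the `missing_paths` helper is hypothetical, and its path regex is deliberately crude (it will also pick up the tail of things like `$VAR/bin/foo`):

```shell
# Hypothetical helper: list absolute-looking paths mentioned in a file
# that do not exist on this machine.
missing_paths() {
    grep -oE '/[A-Za-z0-9._/-]+' "$1" | sort -u | while read -r p; do
        if [ ! -e "$p" ]; then
            echo "$p"
        fi
    done
}
```

Run as `missing_paths /etc/init.d/confluence`, a path under `/zopt` would show up in the output because that directory doesn't exist.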
I also realized I was supposed to be using sudo to start the application. And,
just like they said in their email, there was only a limited set of commands
sudo was allowed to run.
```
sudo -l
sudo /etc/init.d/confluence start
```
## Problem #2: JAVA_HOME Referencing Incorrect Directory
`JAVA_HOME` was referencing an incorrect directory (`/zopt` instead of `/opt`).
Now, the script was complaining that it couldn't find Java. The `JAVA_HOME`
environment variable had been set incorrectly. It pointed to a path that
looked reasonable, but in fact was wrong.
It's been over a year since I completed this. I don't remember exactly why I
couldn't fix `JAVA_HOME` by editing `/etc/init.d/confluence`. But I did have to
search around and find a file that I could edit that would actually allow me to
affect the environment of the service before it started. That file ended up
being `setenv.sh`.
Now, I'm logged in as a normal user here, so I don't have access to edit every
file. Many important files are read-only to me. So I actually had to poke around
and search, not only to see where things were, but to see what I could actually
touch.
```
psg tomcat
ps -ef | grep tomcat
vi /opt/atlassian/confluence/logs/catalina.out
vi /opt/atlassian/confluence/bin/catalina.sh
echo $JAVA_HOME
sudo -l
cd /opt/atlassian/confluence/bin
vi setenv.sh
vi setjre.sh
id
sudo /etc/init.d/confluence start
```
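Two checks capture most of that poking around: does a candidate `JAVA_HOME` actually contain a `java` binary, and which files under a directory can the current user write to? A sketch, with both helper names invented for illustration:

```shell
# Hypothetical helper: succeed only if the given directory looks like a
# usable JAVA_HOME, i.e. it contains an executable bin/java.
valid_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Hypothetical helper: list regular files under a directory that the
# current user can write to. Uses the portable [ -w ] test rather than
# GNU find's -writable.
writable_files() {
    find "$1" -type f 2>/dev/null | while read -r f; do
        if [ -w "$f" ]; then
            echo "$f"
        fi
    done
}
```

For example, `valid_java_home "$JAVA_HOME" || echo "bad JAVA_HOME"` followed by `writable_files /opt/atlassian/confluence/bin` would narrow down which startup scripts are fair game.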
## Problem #3: Java Heap Size Set Too Low
I tried starting Confluence again, only to find the error you get when a Java
process reaches its max heap size.
This is like an obstacle course. It'd be trivial if you were just messing around
on your own, but a 60 minute timer really messes with your head. Any wasted time
will get you in trouble later. If you go down the wrong path, even for 5
minutes, it could cost you.
At any rate, I edited the Java `-Xms` and `-Xmx` arguments in the `setenv.sh`
file. I noticed the VM had 3 GB of RAM, so I set the heap size to 2 GB.
```
ps -ef | grep tomcat
top
vi /opt/atlassian/confluence/logs/catalina.out
free -g
free -m
vi setenv.sh
sudo /etc/init.d/confluence start
```
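The sizing is simple arithmetic: take total RAM and leave some headroom for the OS. A sketch of that calculation, assuming a 1 GB reserve and a 256 MB floor (both numbers are my assumptions, picked to match the 3 GB to 2 GB choice above); `heap_mb` and `heap_flags` are hypothetical helpers:

```shell
# Hypothetical helper: suggest a JVM heap size in MB, given total memory
# in kB (the unit /proc/meminfo reports). Reserves 1024 MB for the OS
# and never suggests less than 256 MB.
heap_mb() {
    total_mb=$(( $1 / 1024 ))
    heap=$(( total_mb - 1024 ))
    if [ "$heap" -lt 256 ]; then
        heap=256
    fi
    echo "$heap"
}

# Hypothetical helper: print matching -Xms/-Xmx flags for that size.
heap_flags() {
    size=$(heap_mb "$1")
    echo "-Xms${size}m -Xmx${size}m"
}
```

On Linux the input could come from `awk '/MemTotal/ {print $2}' /proc/meminfo`. A VM sold as 3 GB usually reports a little less than that, so the suggestion would land slightly under 2048 MB.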
## Problem #4: Netcat Process Tying Up Port 8080
I started Confluence again and checked the logs. Of course, there's another
error. Something is tying up port 8080! It's the standard error you get when
Java can't open a port. I ran netstat to check which process it was and found
that it was a netcat process. Those tricky bastards. This is absolutely
something I would do if I wanted to mess with someone. They are literally just
toying with me.
```
sudo /etc/init.d/confluence start
vi /opt/atlassian/confluence/logs/catalina.out
netstat -anlp | less
ps -ef | grep 3603
kill 3603
sudo /etc/init.d/confluence stop
cd /opt/atlassian/confluence/logs/
sudo mv catalina.out catalina.out.bak
sudo /etc/init.d/confluence start
```
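Paging through `netstat -anlp` works, but the PID can also be pulled out directly. A sketch that reads `netstat -tlnp` style output on stdin and prints the PID listening on a given port. The `pid_on_port` helper is hypothetical, and it assumes the usual Linux net-tools column layout (local address in column 4, state in column 6, `PID/program` in column 7):

```shell
# Hypothetical helper: print the PID of whatever is listening on the
# given TCP port, reading netstat -tlnp output from stdin.
pid_on_port() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, parts, "/")   # column 7 looks like "3603/nc"
            print parts[1]
        }'
}
```

Usage would be something like `netstat -tlnp 2>/dev/null | pid_on_port 8080`. Without root, netstat only shows PIDs for processes owned by the current user, and prints `-` for everything else.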
## Problem #5: Hibernate C3P0 Configuration Missing Values
Hibernate C3P0 min and max size were blank in `confluence.cfg.xml`.
I started Confluence again and checked the logs. The app appeared to start
correctly, but when I used curl to check port 8080, it returned a 404.
I tried a few different things, but eventually I searched around and found the
Confluence log:
`/var/atlassian/application-data/confluence/logs/atlassian-confluence.log`
In that log, I saw this error:
> [sf.hibernate.connection.C3P0ConnectionProvider] configure could not
> instantiate C3P0 connection pool java.lang.NumberFormatException: For input
> string ""
I have never administered Confluence before, so I had no idea what this meant.
Obviously, it was some sort of misconfiguration in the app settings. Probably a
blank value where it expects a number?
After searching around for the app's configuration file, I found it at:
`/var/atlassian/application-data/confluence/confluence.cfg.xml`
I did some Googling on C3P0 and found what I thought might be the setting that
it was complaining about: `hibernate.c3p0.min_size` and
`hibernate.c3p0.max_size` were both set to blank.
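Since the exception is about parsing an empty string as a number, one shortcut is to list every property in the config whose value is empty. A sketch, assuming each property sits on one line as `<property name="..."></property>`, which is the layout implied by the sed commands later in this post; `empty_properties` is a hypothetical helper:

```shell
# Hypothetical helper: print the names of properties with an empty
# value. Assumes <property name="X"></property> with nothing between
# the tags, all on a single line.
empty_properties() {
    grep -oE '<property name="[^"]+"></property>' "$1" |
        sed -E 's/<property name="([^"]+)">.*/\1/'
}
```

Running `empty_properties /var/atlassian/application-data/confluence/confluence.cfg.xml` would print one property name per line.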
## Problem #6: Permission Issues with Configuration File
The `lab` user doesn't have access to edit
`/var/atlassian/application-data/confluence/confluence.cfg.xml`.
I tried to set min and max values, only to find that the lab user doesn't have
permission to edit `confluence.cfg.xml`!
So, now I know the answer to my problem, but I don't have access to apply the
solution.
I wasted the most time on this problem. That 60 minute timer will get you every
time. If you go down the wrong path for any length of time, you're burning time
you'll need for something else later. Anyone can think of the wrong thing at
first, especially when under pressure.
I was sure I should be able to edit this file. Maybe if I just entered the right
sudo command? The sudoers file was set up so that starting Confluence didn't
require a password, but editing the config file did. I didn't know the lab
user's password, because I had logged in with an SSH key. I spent a lot of time
googling and trying to get sudo to let me edit the file.
I thought maybe there might be some way to reset the lab user's password without
needing its old password if I dug deeply enough. I even checked if the system
was susceptible to ShellShock (it wasn't).
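Finally, I realized I had write access to scripts that the Confluence user runs
when the service starts up. Then, I felt pretty dumb. Anything I add to those
scripts runs with that user's permissions, which basically gives me access to
run whatever I want against the config file.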
I added these lines to the end of the `setenv.sh` script:
```
sed -i 's/hibernate.c3p0.min_size"></hibernate.c3p0.min_size">5</' /var/atlassian/application-data/confluence/confluence.cfg.xml
sed -i 's/hibernate.c3p0.max_size"></hibernate.c3p0.max_size">20</' /var/atlassian/application-data/confluence/confluence.cfg.xml
```
After restarting Confluence, the file had been successfully updated and the
error didn't pop back up.
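To see what that kind of substitution does without touching a real install, here is the same idea run against a scratch file. The XML is a made-up stand-in for `confluence.cfg.xml`, `set_if_blank` is a hypothetical helper, and it assumes GNU sed (for `-i` without a suffix). The dots in the property name are escaped so they only match literal dots:

```shell
# Scratch file standing in for the real config (illustrative content).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
<property name="hibernate.c3p0.max_size"></property>
<property name="hibernate.c3p0.min_size"></property>
EOF

# Hypothetical helper: fill in a property only when its value is blank.
# $1 = file, $2 = property name, $3 = value
set_if_blank() {
    pat=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for the regex
    sed -i "s/\(\"$pat\">\)</\1$3</" "$1"
}

set_if_blank "$cfg" hibernate.c3p0.min_size 5
set_if_blank "$cfg" hibernate.c3p0.max_size 20
cat "$cfg"
```

Because the pattern only matches an empty value, leaving the lines in a startup script is harmless: once the values are set, later runs change nothing.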
## Problem #7: Database Connection Failed
After restarting Confluence and checking the logs, I saw that the database
connection was now failing.
> org.postgresql.util.PSQLException: Connection to
> REDACTED.us-east-1.rds.amazonaws.com:5432 refused. Check that the hostname
> and port are correct and that the postmaster is accepting TCP/IP connections.
I checked `confluence.cfg.xml` for the connection settings, and I didn't
immediately see anything wrong.
I tried telnetting to the database port to see if it was open, but telnet wasn't
installed.
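When telnet is missing, bash itself can open a TCP connection through its `/dev/tcp` redirection. A sketch of a port check built on that, assuming bash and the coreutils `timeout` command are available on the box; `port_open` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens
# within 3 seconds. $1 = host, $2 = port.
# Inside the bash -c string, $0 is the host and $1 is the port.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Called as `port_open DB-HOSTNAME 5432`, the exit status carries some extra information: `timeout` returns 124 when the attempt hangs (typical of a firewall silently dropping packets), while an immediate failure matches a refused connection like the one in the log.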
I pinged the DB hostname, and it responded. Perhaps someone forgot to start the
database up? I didn't have access to it to check. I thought about running nmap
against the address to check if maybe it was on a different port, but nmap
wasn't installed. That would probably require root anyway. I started to run the
PostgreSQL CLI tool to see if that was installed, to maybe troubleshoot, when I
was logged out of my shell and was unable to get back in. My 60 minutes were up!
I didn't have more than a few minutes to troubleshoot problem #7 because I had
spent so much time trying to get sudo to let me edit that stupid config file. I
was super engaged in this. I really, really wanted to try it again. I was mad
that I didn't finish, and I couldn't stop thinking about it for the rest of the
week.
The only consolation I have is the assurance from the company that I was *very*
close to the end of the test.
At any rate, this was a really neat exercise. I imagine that they'd have this
automated, so they probably spin it up for a new candidate and tear it down once
it's done. Certainly, the timer was automated, because I was disconnected after
exactly 60 minutes.
Pretty neat!