Most of the time, debugging is easy: a silly logical error or an oversight, quick to correct before moving on. But some problems are hard, because you don't know where things are going wrong. The bigger the stack involved, the higher your chances of getting stuck. In such cases, tools like strace can be of great aid, since they can give you a hint about where things are breaking down.
Recently, I faced an issue at my company where navigating beyond the dashboard page left the application stuck.
The Problem
First, a bit about our development environment: we use Vagrant to get a completely isolated environment similar to what we run in production. To provision the VM we use Puppet configuration management (although we have been poor at keeping it up to date as newer technologies were incorporated).
Now, we started facing a problem in which a PHP-FPM process would just hang and not process any other request for a long time. The page would eventually load, only to hit the same problem when navigating to another page. "Annoying" would be an understatement for how exasperating this was.
The only solution was to run
sudo service php5-fpm restart
inside the VM every time it happened (with nginx returning 502 Bad Gateway in the meantime) and refresh the page. Certainly not a worthwhile solution, but short of alternatives, we aliased the command to have a quick fix.
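For what it's worth, the quick fix looked roughly like this; the alias name and the exact vagrant ssh invocation here are illustrative, not our actual setup:

```shell
# Illustrative only: restart PHP-FPM inside the VM from the host.
# Goes in ~/.bashrc (or similar) on the host machine.
alias fpm-restart='vagrant ssh -c "sudo service php5-fpm restart"'
```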
Since the bug was not reproducible in the production environment, our first suspects were Vagrant and the local FPM configuration. I tried, without success, to tweak some buffers and other settings of FPM and nginx, but the problem persisted.
strace to the rescue
strace is a little Linux utility whose most common use case is to print the system calls made by a program. It is an amazing tool for system admins that can aid debugging in many cases, such as diagnosing a problem in a running program, which is exactly what we will do here.
PHP-FPM uses a master/worker model in which the master process delegates incoming requests to worker processes. Since the actual processing of a request can happen in any of the workers, we attach strace to all of the FPM processes at once, using the pidof command in combination with strace (source).
strace -f $(pidof php5-fpm | sed 's/\([0-9]*\)/\-p \1/g')
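The sed here simply turns the space-separated PID list printed by pidof into a series of -p flags, so strace follows every FPM process at once. For example, with made-up PIDs:

```shell
# pidof prints PIDs space-separated, e.g. "2440 2443 2447";
# the sed wraps each number in a -p flag for strace.
echo "2440 2443 2447" | sed 's/\([0-9]*\)/-p \1/g'
# -p 2440 -p 2443 -p 2447
```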
After running the command and reproducing the problem, I noticed some peculiar timeouts in poll system calls. Let's cut the noise out of the output and trace just those calls instead.
strace -f $(pidof php5-fpm | sed 's/\([0-9]*\)/\-p \1/g') -epoll
This gives the following output,
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
[pid 2443] poll([{fd=10, events=POLLOUT|POLLWRNORM}], 1, 0) = 0 (Timeout)
This could well be the issue, but what does fd 10
refer to here? Let's run lsof to find out.
lsof -p 2443
This gives us a list of all files opened by the process, including sockets (the listing below was captured on a later run, hence the PID differs from the strace output above):
php5-fpm 4373 www-data 0u IPv4 16196 0t0 TCP localhost:9000 (LISTEN)
php5-fpm 4373 www-data 1u CHR 1,3 0t0 5788 /dev/null
php5-fpm 4373 www-data 2u CHR 1,3 0t0 5788 /dev/null
php5-fpm 4373 www-data 3ur REG 252,0 0 4587530 /tmp/.ZendSem.LfeVSH (deleted)
php5-fpm 4373 www-data 4u IPv4 55738 0t0 TCP localhost:9000->localhost:50475 (ESTABLISHED)
php5-fpm 4373 www-data 5uW REG 252,0 2534 3805440 /var/lib/php5/sessions/sess_tmh0njqp2aid0jcfacsujdls75
php5-fpm 4373 www-data 6u 0000 0,9 0 5763 anon_inode
php5-fpm 4373 www-data 7u unix 0xffff88007a2e1a00 0t0 55715 socket
php5-fpm 4373 www-data 8u unix 0xffff88007a2e3740 0t0 55716 socket
php5-fpm 4373 www-data 9u IPv4 55719 0t0 TCP localhost:49941->localhost:11210 (ESTABLISHED)
php5-fpm 4373 www-data 10u IPv4 55823 0t0 TCP 10.0.2.15:49535->10.0.0.135:9200 (SYN_SENT)
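As a cross-check, procfs can answer the same question without lsof: /proc/PID/fd/N is a symlink describing what descriptor N points to. A minimal demonstration on the current shell, using fd 3 and the shell's own PID (since the worker PID above exists only inside my VM):

```shell
# Open /etc/hostname on fd 3 of this shell, then ask the kernel
# what fd 3 points to. For a socket the link reads like
# "socket:[55823]", an inode you can match against `ss -p` output.
exec 3< /etc/hostname
readlink "/proc/$$/fd/3"
exec 3<&-
```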
The last entry, with fd 10, is the one of interest. The problem seems to be related to the socket connection to 10.0.0.135:9200, which is stuck in SYN_SENT: the connection attempt is never being answered.
Diagnosis
We had started using Elasticsearch in our application a while ago and chose to run a centralized local server for the development environment; it powered our reporting. After a while, the majority of our work shifted to the other side of the app, so we lost focus on the reporting part. Eventually we forgot about our local ES server and stopped using it altogether.
Whenever the dashboard opened, an Ajax request queried the ES server using the Elasticsearch PHP library. The problem, as it turned out, was that this resulted in PHP-FPM continuously polling the now-dead ES server for a reply. Empirically, the worker process got stuck in this loop and failed to respond to any other request until it was restarted.
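In hindsight, a ten-second connectivity check would have pointed straight at the dead server. A sketch, taking the host and port from the lsof output above and assuming bash's /dev/tcp and the coreutils timeout command are available:

```shell
# Try a plain TCP connect to the ES host, capped at 2 seconds;
# a blackholed SYN (like the SYN_SENT state above) shows up as
# "unreachable" instead of hanging indefinitely.
if timeout 2 bash -c 'exec 3<>/dev/tcp/10.0.0.135/9200' 2>/dev/null; then
    echo "ES reachable"
else
    echo "ES unreachable"
fi
```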
Is it a bug? I am not sure, but I will try to dig more into it.
Lesson
Our valuable lesson from this ordeal was that the local environment needs to be an exact replica of the one used in production, and that the setup of your work environment must be completely automated with up-to-date scripts. A common mistake is to rely on vaguely documented commands, which risks missing something in the process.