Wednesday, 19 April 2017

Random S**t Happens (or sometimes it doesn't)

On Thursday last week, I migrated a wee enterprise application I wrote a number of years ago (2009?) to a its new home on vmware farm. In itself not a big job, but there were a lot of integration points with other systems. Sadly, it went about as well as I expected. After some pain, normal service was restored. Almost. One of the key pages in the application kept pausing.

As you might expect, I am rather fastidious in ensuring the performance of the applications I write. But this seemed strange. Generating the HTML usually took 30-40 ms (measured at the browser). Not earth shattering, but it does do a lot of work and well within the performance budget. But 1 in every 20 or so requests would take much longer - between 6 and 20 seconds!

Since there were no code changes, the obvious candidate for the cause was the infrastructure which had changed:
  • other VMs on the same host competing for resource
  • I/O contention (this was now on a SAN with a lot of other devices)
  • overzealous network security devices filling the network bandiwdth
  • congestion crashes on routers
But I hadn't ruled out a problem in the software stack. Mod_deflate buffers, database contention...

Checking the usual metrics (load, CPU usage, disk IO, APC stats) revealed nothing untoward. So the next setp was to inject some profiling in the code. I would have preferred to use XHProf, but the people who own this system are not keen on third party tools in their production systems.

The profiling soon revealed that the pauses were always occurring in the same region of code. This ruled out any environmental issue.

Looking through the region, there was no disk, network or database I/O. It did write some output (and the HTTP response was chunked) but that was a very long delay for a context switch or a garbage collection cycle. And why didn't occur on every request?

All the code seemed to be doing was reading stuff from PHP's memory and injecting it into the HTML page.

Going through the program in some detail (I did mentioned it was a very long time ago when I wrote it originally?) there was an inversion of control - a dependency injection - where a callback was invoked. Dumping the callback led me to an obscure library routine doing encryption. This created an initialization vector:

mcrypt_create_iv (mcrypt_enc_get_iv_size($this->td), MCRYPT_DEV_RANDOM);

This was the smoking gun.

The problem was that I had told mcrypt to read from /dev/random and /dev/random didn't have any randomness. So it blocked until it got some.

The solutions were obvious:
  • keep /dev/random topped up (using rngd)
  • use a different (weaker?) entropy source - (MCRYPT_DEV_URANDOM, reading from /dev/urandom is available)
Given that I had already peppered the code with profiling, adding a single character seemed the sensible choice. Whether urandom is weaker is debatable. Indeed, vmware (but not RedHat) recommend this as a solution.

Since the encryption in question was using triple-DES (look, it was a really long time ago, OK?) even a bad random number generator wouldn't have helped make it more secure.

In my defence:

1) mcrypt is now deprecated in current versions of PHP, the current Redhat Release (7.3) ships with a version of PHP pre-dating the deprecation - and it certainly was not deprecated at the time I wrote the code. But it wasn't mcrypt doing anything wrong here.

2) The 3DES encryption was an early CSRF protection mechanism for an application which has very restricted access, and subsequently used a more complex system with SHA1 hashes - but the original code was not removed when the new mechanism was added

3) Frankly, base64 encoding the data here would have been overkill given the level of exposure in this application

This was the first time I had come across this problem. I'm going to be involved in moving a lot of other systems into this network - many of which make more extensive (and critical) use of encryption than this one does. Now I know one more thing to look out for.