Tuesday, 18 December 2018

Using performance to manipulate behaviour

A darker side to the performance story seems to be emerging. This is the first in a series of 3 posts (there might be more later) about how web performance is being weaponized.

While I, like many of you, spend a lot of time simply trying to make my sites go faster, it seems that other people are finding ways to exploit performance as a means of manipulating user behaviour. This was particularly evident when I recently visited www.forbes.com to read an article about phone biometrics. It's not where I would go for authoritative information – I was just browsing at the time. As is common, the site asked me if I wanted to accept its cookies.

Yes, they want to protect their revenue stream, so the big green button with white text is easy to see and read, while the smaller grey button is a lot harder to read – and only professes to provide “more information”. Now, due to the specifics of the EU's GDPR, the site needs my “informed consent” for any cookies it drops – so, not surprisingly, the “more information” button takes me to a dialogue where I can also specify which cookies I will accept.

If I click on the first, big green button, I get an almost immediate acknowledgement. Accepting all three classes of cookies from the “more information” dialogue seems to take slightly longer, but I didn't measure it too closely. But what is interesting is that if I dial back the cookie setting to only “required cookies” the site tells me it has a lot of work to do in order to dial back “the full power of Forbes.com”.

So I have incurred a huge performance penalty for exercising my rights.

This did provoke a torrent of activity in the browser – over a thousand requests – which included a few 404s and several 302s sending my browser back around the internet. I've not looked at all of them, but the 200 responses all contained “no data”, and none of the sites I saw had appeared when I first loaded the page.

This appears to be a very elaborate piece of theatre.

It took around 60 seconds to reach the 100% point – while helpfully giving me the option to change my mind at any point.

Another interesting feature of the performance was that the counter slowed down as it progressed! If you've read up on progress bars, you'll know that is exactly the opposite of what you should do if you want to convey an impression of speed.

Finally, changing my browser config to send a “Do Not Track” (DNT) header had no impact at all on the behaviour – although, at the time of writing, DNT is still only a proposed HTTP standard.
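Server-side, honouring DNT only takes a couple of lines. A minimal sketch in PHP (the decision logic is mine – real consent handling is more involved):

```php
<?php
// Minimal sketch of honouring the DNT header server-side.
// Browsers that opt out send "DNT: 1", which PHP exposes as HTTP_DNT.
function shouldTrack(array $server): bool
{
    return ($server['HTTP_DNT'] ?? '0') !== '1';
}

// Only emit tracking cookies when the user has not opted out:
if (shouldTrack($_SERVER)) {
    // setcookie('analytics_id', bin2hex(random_bytes(8)), time() + 31536000);
}
```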

Usually I don't wear my tin foil hat when browsing the internet – I'm OK that websites need a way to fund the content they publish but I am very disturbed that sites seem to go to such lengths to try to manipulate their users' behaviour.

Wednesday, 19 April 2017

Random S**t Happens (or sometimes it doesn't)

On Thursday last week, I migrated a wee enterprise application I wrote a number of years ago (2009?) to its new home on a VMware farm. In itself not a big job, but there were a lot of integration points with other systems. Sadly, it went about as well as I expected. After some pain, normal service was restored. Almost. One of the key pages in the application kept pausing.

As you might expect, I am rather fastidious about ensuring the performance of the applications I write. But this seemed strange. Generating the HTML usually took 30-40 ms (measured at the browser). Not earth-shattering, but the page does a lot of work, and that is well within the performance budget. But 1 in every 20 or so requests would take much longer – between 6 and 20 seconds!

Since there were no code changes, the obvious candidate for the cause was the infrastructure which had changed:
  • other VMs on the same host competing for resource
  • I/O contention (this was now on a SAN with a lot of other devices)
  • overzealous network security devices saturating the network bandwidth
  • congestion crashes on routers
But I hadn't ruled out a problem in the software stack: mod_deflate buffers, database contention...

Checking the usual metrics (load, CPU usage, disk I/O, APC stats) revealed nothing untoward. So the next step was to inject some profiling into the code. I would have preferred to use XHProf, but the people who own this system are not keen on third-party tools in their production systems.

The profiling soon revealed that the pauses were always occurring in the same region of code. This ruled out any environmental issue.

Looking through the region, there was no disk, network or database I/O. It did write some output (and the HTTP response was chunked), but that was a very long delay for a context switch or a garbage collection cycle. And why didn't it occur on every request?

All the code seemed to be doing was reading stuff from PHP's memory and injecting it into the HTML page.

Going through the program in some detail (did I mention that it was a very long time ago when I originally wrote it?), I found an inversion of control – a dependency injection – where a callback was invoked. Dumping the callback led me to an obscure library routine doing encryption. This created an initialization vector:

mcrypt_create_iv(mcrypt_enc_get_iv_size($this->td), MCRYPT_DEV_RANDOM);

This was the smoking gun.

The problem was that I had told mcrypt to read from /dev/random and /dev/random didn't have any randomness. So it blocked until it got some.

The solutions were obvious:
  • keep /dev/random topped up (using rngd)
  • use a different (weaker?) entropy source – MCRYPT_DEV_URANDOM, which reads from /dev/urandom, is available
Given that I had already peppered the code with profiling, adding a single character seemed the sensible choice. Whether urandom is weaker is debatable. Indeed, vmware (but not RedHat) recommend this as a solution.
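For the record, the fix really was one character – and on current PHP (where mcrypt no longer exists) the equivalent, non-blocking way to create an IV is random_bytes(). A sketch (the 8-byte size is illustrative, matching the 3DES block size):

```php
<?php
// before: mcrypt_create_iv(mcrypt_enc_get_iv_size($this->td), MCRYPT_DEV_RANDOM);
// after:  mcrypt_create_iv(mcrypt_enc_get_iv_size($this->td), MCRYPT_DEV_URANDOM);
//
// Modern equivalent: random_bytes() draws from a non-blocking kernel
// CSPRNG, so it never stalls waiting for the entropy pool to fill.
$ivSize = 8; // 3DES block size in bytes (illustrative)
$iv = random_bytes($ivSize);
```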

Since the encryption in question was using triple-DES (look, it was a really long time ago, OK?) even a bad random number generator wouldn't have helped make it more secure.

In my defence:

1) mcrypt is now deprecated in current versions of PHP; the current Red Hat release (7.3) ships with a version of PHP pre-dating the deprecation – and it certainly was not deprecated at the time I wrote the code. But it wasn't mcrypt doing anything wrong here.

2) The 3DES encryption was an early CSRF protection mechanism for an application which has very restricted access, and it was subsequently superseded by a more complex system using SHA1 hashes – but the original code was not removed when the new mechanism was added.

3) Frankly, base64 encoding the data here would have been overkill given the level of exposure in this application

This was the first time I had come across this problem. I'm going to be involved in moving a lot of other systems into this network - many of which make more extensive (and critical) use of encryption than this one does. Now I know one more thing to look out for.

Friday, 17 March 2017

Image compression (again)

It's not the most exciting thing to have happened to web performance in the past year or so – but it will have an impact on your performance and scalability. A team from Google has released a new JPEG compressor called guetzli.

What's interesting about this one is that the team focused (oops) on the perceived image quality rather than the measured quality, although they did also take time out to write their own image quality measurement algorithm.

Saturday, 6 August 2016

Speeding up Dokuwiki

I'm a big fan of Dokuwiki.
  • it's simple
  • it has a great ecosystem of plugins
  • it has great performance
But some time ago I decided there was room for improvement so I wrote a very simple framework (itself implemented as a Dokuwiki plugin). I've just uploaded this at Github.

Specifically this allows for:
  • Much faster page loading using PJAX
  • Pure javascript/CSS extensions - no PHP required
  • Prevents Javascript injection by page editors
The PJAX page loading requires small changes to the template so that, when a request is flagged as coming from PJAX, everything except the page-specific content is excluded (i.e. the navigation elements are skipped and only the rendered markup is sent) and a well-formed HTML fragment is returned. There is an example template here. While the template this is based on is already rather complex, the actual changes to this, or any existing template, are only a few lines of code – see the diff in the README.
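The template-side switch can be sketched like this (the header check matches what jquery-pjax sends; the markup and function names are illustrative, not the actual template code):

```php
<?php
// Sketch of the template-side PJAX switch: when the request comes from
// PJAX, emit only the rendered page content and skip the chrome
// (head, CSS/JS, navigation).
function isPjaxRequest(array $server): bool
{
    // jquery-pjax flags its requests with an X-PJAX header.
    return isset($server['HTTP_X_PJAX']);
}

function renderPage(string $content, array $server): string
{
    if (isPjaxRequest($server)) {
        // Well-formed HTML fragment only.
        return '<div id="content">' . $content . '</div>';
    }
    // Full page: wrap the same content in the complete template.
    return '<!DOCTYPE html><html><head>...</head><body>'
         . '<div id="content">' . $content . '</div>'
         . '</body></html>';
}
```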

This saves me around 450 milliseconds per page in loading time:

The savings come from not having to parse the CSS and JavaScript in the browser. The server-side content generation time is not noticeably affected.

But even if you are not using Dokuwiki you can get the same benefits using PJAX on your CMS of choice.

A strict Content Security Policy provides great protection against XSS attacks. But the question then arises: how do you get run-time generated data routed to the right bit of code? Jokuwiki solves this by embedding JSON in data-* attributes, including the entry point for execution.
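As a sketch of that pattern (the attribute name and payload shape here are my own illustration, not necessarily Jokuwiki's exact format): the server serialises the data plus the entry-point name as JSON into a data-* attribute, and a single CSP-approved bootstrap script decodes it and dispatches – no inline JavaScript required.

```php
<?php
// Server-side data is serialised as JSON into a data-* attribute
// instead of an inline <script> block, so a strict CSP stays in force.
function dataAttribute(string $entryPoint, array $payload): string
{
    $json = json_encode(['jokuwiki' => $entryPoint, 'data' => $payload]);
    // htmlspecialchars() keeps the JSON safe inside a double-quoted attribute.
    return 'data-jw="' . htmlspecialchars($json, ENT_QUOTES) . '"';
}

echo '<div ' . dataAttribute('myWidget', ['colour' => 'red']) . '></div>';
// A small bootstrap script then reads the attribute, JSON-decodes it,
// and calls the named entry point.
```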

Wednesday, 22 June 2016

Faster and more Scalable Session handling

Chapter 18 of the book (18.13 specifically) looks at PHP session handling, which can be a major bottleneck. I suggested there were options for reducing the impact, top of which was to use a faster substrate for the session data – but no matter how fast the storage, it won't help with the fact that (by default) control over concurrency is implemented by locking the data file.

While I provided an example in the book of propagating authentication and authorization information securely via the URL, removing the need to open the session in the linked page, sometimes you need access to the full session data.

Recently I wrote a drop-in replacement for the default handler which is completely compatible with it (you can mix and match the methods in the same application) but which does not lock the session data file. It struck me that there were lots of things the session handler was doing that a custom handler might do differently. Rather than create every possible combination of storage/representation/replication/concurrency control, I adapted my handler API to allow multiple handlers to be stacked to create a custom combination.
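Purely as a sketch of the non-locking idea (class name and details are mine, not the published implementation): a file-based handler that simply skips flock(), trading the serialisation of concurrent requests for last-write-wins semantics.

```php
<?php
// Minimal sketch of a non-locking, file-based session handler.
// Unlike the default handler it reads and writes without flock(),
// so concurrent requests for the same session no longer queue up --
// acceptable when most requests only read the session.
class NonLockingFileHandler implements SessionHandlerInterface
{
    private $path;

    public function __construct($path) { $this->path = $path; }

    private function file($id) { return $this->path . '/sess_' . basename($id); }

    public function open($savePath, $name) { return true; }
    public function close() { return true; }

    public function read($id)
    {
        // No flock() here -- this is what removes the serialisation
        // between concurrent requests for the same session.
        $file = $this->file($id);
        return is_readable($file) ? (string)file_get_contents($file) : '';
    }

    public function write($id, $data)
    {
        return file_put_contents($this->file($id), $data) !== false;
    }

    public function destroy($id)
    {
        @unlink($this->file($id));
        return true;
    }

    public function gc($maxlifetime) { return true; }
}

// Registered the usual way:
// session_set_save_handler(new NonLockingFileHandler(sys_get_temp_dir()), true);
```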

The code (including an implementation of the non-blocking handler) is available on PHPClasses.

One thing I omitted to mention in the book is that when session_start() is called, it sets the Cache-Control and Expires headers to prevent caching:

Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

If you want your page to be cacheable, then there is a simple 2 step process:

  1. Check – are you really, REALLY sure you want the content to be cacheable and use sessions? If you are just implementing access control, remember that the content *may* be stored on the user's disk.
  2. Add an appropriate set of headers after session_start():
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 3600));
header('Cache-Control: max-age=3600'); 
header('Vary: Cookie'); // ...but using HTTPS would be better 

Monday, 16 March 2015

Accurate capacity planning with Apache - protecting your performance

While most operating systems support some sort of virtual memory, if the system starts paging memory out to disk, performance will take a nosedive. But performance will typically be heavily degraded even before the system runs out of memory, as the applications start stealing memory used for I/O caching. Hence setting an appropriate value for ServerLimit in Apache (or the equivalent for any multi-threaded/multi-process server) is good practice. For the remainder of this document I will be focussing specifically on Linux, but the theory and practice apply to all flavours of Unix, and to MS Windows too.

Tracking resource usage of the system as a whole is also good practice – but beyond the scope of what I'll be talking about today.

The immediate problem is determining what an appropriate limit is.

For pre-fork Apache 2.x, the number of processes is constrained by the ServerLimit setting.

For most systems the limit will be driven primarily by the amount of memory available. But trying to work out how much memory a process uses is actually surprisingly difficult. The executable code comes from memory-mapped files – these are typically read-only and shared between processes.

Running 'strace /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf' causes over 4000 files to be “loaded” on my local Linux machine. Actually, few of them are read from disk – they are shared object files already in memory, which the kernel then presents at an address accessible to the httpd process. Code is typically loaded into such shared, read-only pages. Linux has a further way of conserving memory: when it needs to copy memory which might be written to, the copy is deferred until a process actually attempts to write to it (copy-on-write).
The net result is that the actual footprint in physical memory is much, much less than the size of the address space that the process has access to.
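One way to see the real footprint is the Pss (“proportional set size”) figures in /proc/&lt;pid&gt;/smaps, which divide each shared page among the processes mapping it. A small parsing sketch (Linux-specific; the field layout assumes a reasonably recent kernel):

```php
<?php
// Summing the Pss lines from /proc/<pid>/smaps gives a fairer
// per-process footprint than RSS, because shared pages are divided
// among the processes that map them.
function totalPssKb(string $smaps): int
{
    $total = 0;
    foreach (explode("\n", $smaps) as $line) {
        if (preg_match('/^Pss:\s+(\d+)\s+kB/', $line, $m)) {
            $total += (int)$m[1];
        }
    }
    return $total;
}

// e.g. totalPssKb(file_get_contents("/proc/$pid/smaps"))
```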

Different URLs will have different footprints, and even different clients can affect the memory usage. Here is a typical distribution of memory usage per httpd process:

This is further complicated by the fact that our webserver might be doing other things – running PHP, MySQL and a mailserver being obvious cases – which may or may not be linked to the volume of HTTP traffic being processed.

In short, trying to synthetically work out how much memory you will need to support (say) 200 concurrent requests is not practical.

The most effective solution is to start with an optimistic guess for ServerLimit, and set MaxSpareServers to around 5% of this value. (Note that after the data capture exercise, you should increase MaxSpareServers to around 10% of ServerLimit + 3.) Then measure how much memory is unused. To do that you'll need to set up a simple script running periodically as a daemon or from cron, capturing the output of the 'free' command and the number of httpd processes.
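The capture script can be very simple. A sketch (assuming the older procps 'free -m' output with its “-/+ buffers/cache” line – newer versions report an “available” column instead, and the process name may differ on your distribution):

```php
<?php
// Sketch of the periodic capture: append one sample line of
// used-memory-less-buffers/cache vs number of httpd processes.
function parseFreeUsedMb(string $freeOutput): int
{
    // Older procps prints a "-/+ buffers/cache:" line giving memory
    // used excluding buffers and cache.
    foreach (explode("\n", $freeOutput) as $line) {
        if (preg_match('/buffers\/cache:\s+(\d+)/', $line, $m)) {
            return (int)$m[1];
        }
    }
    return -1; // format not recognised
}

// Run from cron, e.g.:
// $used  = parseFreeUsedMb(shell_exec('free -m'));
// $procs = (int)trim(shell_exec('pgrep -c httpd'));
// file_put_contents('/var/log/httpd-mem.log', time() . " $procs $used\n", FILE_APPEND);
```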

Here I've plotted the total memory used (less buffers and cache) against the number of httpd processes:

This system has 1Gb of memory. Without any Apache instances running, the usage would be less than the projected 290Mb – but that is outwith the bounds we expect to be operating in. From 2 httpd processes upwards, the average size and the variation in size of each httpd process are very consistent – but since the variation is consistent, the total usage envelope will expand as the number of processes increases. The dashed red line is 2 standard deviations above the average usage, and hence there is roughly a 97.5% probability that memory usage will be below the dashed line.
I want to have around 200Mb available for the VFS cache, so here my ServerLimit comes out at around 175.

Of course the story doesn't end there. How do you protect the server and manage the traffic effectively as it approaches the ServerLimit? How do you reduce the memory usage per httpd process to get more capacity? How do you turn around requests faster and therefore reduce concurrency? And how do you know how much memory to set aside for the VFS?

For help with finding the answers, the code used here, and more information on capacity and performance tuning for Linux, Apache, MySQL and PHP... buy the book!

If you would like to learn more about how Linux memory management works, then this (731 page) document is a very good guide:

Monday, 2 March 2015

Making stuff faster with curl_multi_exec() and friends

Running stuff in parallel is a great way to solve some performance problems. My post on long running processes in PHP on my other blog continues to receive a lot of traffic - but a limitation of this approach (and any method which involves forking) is that it is hard to collate the results.
In the book I recommended using the curl_multi_ functions as a way of splitting a task across multiple processing units, although I did not provide a detailed example.
I recently had cause to write a new bit of functionality which was an ideal candidate for the curl_multi_ approach. Specifically, I needed to implement a rolling data quality check, verifying that a few million email addresses had valid MX domain records. The script implementing this would spend most of its time waiting for a response from the DNS system. While it did not have to be as fast as humanly possible, the 50 hours it took to check the addresses one at a time was just a bit too long – I needed to run the checks in parallel.
While the PHP Curl extension does resolve names in order to make HTTP calls, it does not expose this as a result, and the targets were not HTTP servers, so I wrapped the getmxrr() function in a simple PHP script running at http://localhost.
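The wrapper endpoint is only a few lines. A sketch (the file name and JSON response shape are my own – the original script isn't published):

```php
<?php
// Each HTTP request checks one domain's MX records and returns a small
// JSON verdict for the caller to collate.
function mxResponse(string $domain, bool $ok, array $hosts): string
{
    return json_encode(['domain' => $domain, 'valid' => $ok, 'mx' => $hosts]);
}

// e.g. saved as checkmx.php behind http://localhost:
// $domain = $_GET['domain'] ?? '';
// $hosts  = [];
// $ok     = getmxrr($domain, $hosts); // blocks on DNS -- which is why
//                                     // the caller parallelises via curl_multi
// header('Content-Type: application/json');
// echo mxResponse($domain, $ok, $hosts);
```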
To refresh my memory on the parameters passed and the values returned, I went and had a look at the PHP documentation.
The example of how to use the function in the curl_multi_exec() page is somewhat Byzantine:

do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

OMG – 3 separate calls to curl_multi_ functions in 3 loops!

It doesn't exactly make it obvious what's going on here. It turns out that the guy who wrote it has since posted an explanation.

There are certain advantages to what the developer is trying to do here, but transparency is not one of them.

The example code in curl_multi_add_handle() is clearer, but somewhat flawed:

do {
    curl_multi_exec($mh, $running);
} while ($running > 0);

To understand what's really happening here, you need to bear in mind that the curl_multi_exec() is intended for implementing asynchronous fetching of pages - i.e. it does not block until it completes. In other words the above will run in a tight loop burning up CPU cycles while waiting for the responses to come in. Indeed it may actually delay the processing of the responses!
Now curl_multi_exec() has a lot of work to do. For each instance registered it needs to resolve the vhost name, carry out a TCP handshake, possibly an SSL negotiation, send the HTTP request, then wait for a response. Interestingly, when testing against localhost, it does nothing visible on the first invocation, while it seems to get at least as far as sending the HTTP requests on the second iteration of the loop, regardless of the number of requests. That means that the requests have been dispatched to the receiving end, and we can now use our PHP thread to do something interesting / useful while we wait for a response – for example pushing the HEAD of your HTML out to the browser so it can start fetching CSS and (deferred) Javascript (see section 18.11.1 in the book).
Of course, even if I were to confirm that the TCP handshake runs in the second loop, and find out where any SSL handshake took place, there's no guarantee that this won't change in future. We don't know exactly how many iterations it takes to dispatch a request, and timing will be important too.
But it might be why the person who wrote the example code above split the functionality across the 2 consecutive loops – to do something useful in between. However, on my local PHP install, the first iteration through the loop returns 0, and CURLM_CALL_MULTI_PERFORM is -1. So the first loop will only run once, and won't send the requests (I tested by adding a long sleep after the call).

Hence I suggest that a better pattern for using curl_multi_exec() is:

do {
    curl_multi_exec($mh, $active);
    if ($active) usleep(20000);
} while ($active > 0);

The usleep is important! This stops the process from hogging the CPU and potentially blocking other things (it could even delay processing of the response!).

We can actually use the time spent waiting for the requests to be processed to do something more useful:


for ($x = 0; $x <= 3 && $active; $x++) {
    curl_multi_exec($mh, $active);
    // wait for a bit to allow TCP handshakes to complete and so forth...
    usleep(20000);
}

// ...do something useful locally while the requests are in flight...

do {
    curl_multi_exec($mh, $active);
    if ($active) usleep(20000);
} while ($active > 0);

Here the executions of curl_multi_exec() are split into 2 loops. From experimentation, it seems to take up to 4 iterations to properly despatch all the requests – then there is a delay while the requests cross the network and are serviced – this is where we can do some work locally. The second loop then reaps the responses.

The curl_multi_select function can also be called with a timeout - this makes the function block, but allows the script to wake up early if there's any work to do...


for ($x = 0; $x <= 4 && $active; $x++) {
    curl_multi_exec($mh, $active);
    // wait for a bit to allow TCP handshakes to complete and so forth...
    curl_multi_select($mh, 0.02);
}

// ...do something useful locally while the requests are in flight...

do {
    // wait for everything to finish...
    curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh, 0.05);
    }
    // ...until all the results are in or a timeout occurs
} while ($active > 0 && (microtime(true) - $started < MAX_RUNTIME));

One further caveat is that curl_multi_exec() does not limit the number of connections to a single host – so be careful if you are trying to send a large number of requests to the same host (see also 6.12.1 in the book).
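A simple way to stay within a safe limit is to feed the multi handle in batches rather than registering every request at once. A sketch (the batch size and loop shape are illustrative):

```php
<?php
// Cap concurrency against a single host by splitting the URL list into
// fixed-size batches and running the curl_multi loop once per batch.
function batchUrls(array $urls, int $maxConcurrent): array
{
    return array_chunk($urls, $maxConcurrent);
}

// foreach (batchUrls($allUrls, 30) as $batch) {
//     $mh = curl_multi_init();
//     // ...add a handle per URL in $batch, run the curl_multi_exec()
//     // loop shown above, then reap the responses and clean up...
// }
```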

Did it work? Yes, for up to 30 concurrent requests to localhost, the throughput increased linearly.