Monday 2 March 2015

Making stuff faster with curl_multi_exec() and friends

Running stuff in parallel is a great way to solve some performance problems. My post on long running processes in PHP on my other blog continues to receive a lot of traffic - but a limitation of this approach (and any method which involves forking) is that it is hard to collate the results.
In the book I recommended using the curl_multi_ functions as a way of splitting a task across multiple processing units, although I did not provide a detailed example.
I recently had cause to write a new bit of functionality which was ideal for the curl_multi_ approach. Specifically, I needed to implement a rolling data quality check, verifying that a few million email addresses had valid MX domain records. The script implementing this would spend most of its time waiting for a response from the DNS system. While it did not have to be as fast as humanly possible, the 50 hours it took to check the addresses one at a time was just a bit too long - I needed to run the checks in parallel.
While the PHP Curl extension does resolve names in order to make HTTP calls, it does not expose the result of the lookup, and in any case my targets were not HTTP servers, so I wrapped the getmxrr() function in a simple PHP script running at http://localhost.
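Something along these lines would do the job (a minimal sketch - the script name, the parameter and the JSON output format are my own assumptions, not the code I actually used):

<?php
// mxcheck.php - sketch of a thin wrapper around getmxrr(), served from
// http://localhost/mxcheck.php?domain=example.com
// (the script name and parameter are assumptions for illustration)
$domain = isset($_GET['domain']) ? trim($_GET['domain']) : '';

header('Content-Type: application/json');

if ($domain === '') {
    echo json_encode(array('error' => 'no domain supplied'));
    exit;
}

$hosts = array();
$hasMx = getmxrr($domain, $hosts);   // true if the domain has at least one MX record

echo json_encode(array('domain' => $domain, 'has_mx' => $hasMx, 'mx' => $hosts));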
To refresh my memory on the parameters passed and the values returned I went and had a look at the PHP documentation.
The example of how to use the function on the curl_multi_exec() page is somewhat Byzantine:



do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

OMG - 3 separate calls to curl_multi_ functions in 3 loops!

It doesn't exactly make it obvious what's going on here. It turns out that the guy who wrote it has since posted an explanation.

There are certain advantages to what the developer is trying to do here, but transparency is not one of them.

The example code in curl_multi_add_handle() is clearer, but somewhat flawed:


do {
    curl_multi_exec($mh,$running);
} while($running > 0);
 

To understand what's really happening here, you need to bear in mind that curl_multi_exec() is intended for implementing asynchronous fetching of pages - i.e. it does not block until the transfers complete. In other words, the above will run in a tight loop, burning up CPU cycles while waiting for the responses to come in. Indeed it may actually delay the processing of the responses!
Now curl_multi_exec() has a lot of work to do. For each instance registered it needs to resolve the host name, carry out a TCP handshake, possibly an SSL negotiation, send the HTTP request, then wait for a response. Interestingly, when testing against localhost, it does nothing visible on the first invocation, while it seems to get at least as far as sending the HTTP requests on the second iteration of the loop, regardless of the number of requests. That means that the requests have been dispatched to the receiving end, and we can now use our PHP thread to do something interesting / useful while we wait for a response, for example pushing the HEAD of your HTML out to the browser so it can start fetching CSS and (deferred) Javascript (see section 18.11.1 in the book).
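As an aside, the "push the HEAD out early" trick mentioned above looks roughly like this (a hedged sketch - the asset paths are invented):

// Sketch: send the <head> of the page to the browser while the curl_multi
// requests are still in flight (asset paths are invented for illustration).
echo '<!DOCTYPE html><html><head>';
echo '<link rel="stylesheet" href="/css/site.css">';
echo '<script src="/js/app.js" defer></script>';
echo '</head><body>';
flush();   // push what we have so far (add ob_flush() first if output buffering is on)

// ...carry on driving curl_multi_exec() while the browser fetches the assets...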
Of course, even if I were to confirm that the TCP handshake runs in the second loop, and find out where any SSL handshake took place, there's no guarantee that this won't change in future. We don't know exactly how many iterations it takes to dispatch a request, and timing will be important too.
But this might be why the person who wrote the example code above split the functionality across 2 consecutive loops - to do something useful in between. However, on my local PHP install the first call returns 0 while CURLM_CALL_MULTI_PERFORM is -1, so the first loop only runs once and doesn't send the requests (I tested this by adding a long sleep after the call).

Hence I suggest that a better pattern for using curl_multi_exec() is:

do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);


The usleep is important! This stops the process from hogging the CPU and potentially blocking other things (it could even delay processing of the response!).
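For completeness, here is roughly how the whole thing hangs together with that pattern - setting up the handles, driving them, and reaping the results (a sketch; the URL list is hypothetical):

// Sketch: drive a batch of requests with the simple pattern above.
// $urls is a hypothetical list of URLs to fetch.
$urls = array(
    'http://localhost/mxcheck.php?domain=example.com',
    'http://localhost/mxcheck.php?domain=example.org',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // capture the response rather than echoing it
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $active);
    if ($active) usleep(20000);   // don't spin the CPU while we wait
} while ($active > 0);

$results = array();
foreach ($handles as $url => $ch) {
    $results[$url] = curl_multi_getcontent($ch);   // the response body
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);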

We can actually use the time spent waiting for the requests to be processed to do something more useful:

$active=count($requests);

for ($x=0; $x<=3 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow things like TCP handshakes to complete and so forth...
        usleep(10000);
}

do_something_useful();

do {
        curl_multi_exec($mh, $active);
        if ($active) usleep(20000);
} while ($active > 0);


Here the executions of curl_multi_exec() are split into 2 loops. From experimentation it seems it takes up to 4 iterations to properly dispatch all the requests - then there is a delay waiting for the requests to cross the network and be serviced - this is where we can do some work locally. The second loop then reaps the responses.

The curl_multi_select function can also be called with a timeout - this makes the function block, but allows the script to wake up early if there's any work to do...

$active=count($requests);
$started=microtime(true);


for ($x=0; $x<=4 && $active; $x++) {
        curl_multi_exec($mh, $active);
        // we wait for a bit to allow things like TCP handshakes to complete and so forth...
        curl_multi_select($mh, 0.02);
}

do_something_useful();

do {
        // wait for everything to finish...
        curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 0.05);
            use_some_spare_cpu_cycles_here();
        }
        // until all the results are in or a timeout occurs
} while ($active > 0 && (microtime(true) - $started < MAX_RUNTIME));

One further caveat is that curl_multi_exec() does not limit the number of connections it opens to a single host - so be careful if you are trying to send a large number of requests to the same host (see also section 6.12.1 in the book).
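If that's a concern, one simple way to cope is to feed the handles in in modest batches rather than all at once - a rough sketch (the batch size and the fetch_batch() helper are my own placeholders):

// Sketch: cap the number of simultaneous connections to one host by working
// through the URLs in batches ($batchSize is an arbitrary choice of mine).
$batchSize = 30;

foreach (array_chunk($allUrls, $batchSize) as $batch) {
    // fetch_batch() stands in for the add / exec-loop / reap code shown earlier
    fetch_batch($batch);
}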

Did it work? Yes, for up to 30 concurrent requests to localhost, the throughput increased linearly.

1 comment:

  1. Today someone pointed out the Guzzle library to me - which provides a simple API for handling concurrent / asynchronous HTTP requests. See http://docs.guzzlephp.org/
