Monday, 7 October 2019

De-duplication and Opcache

As I now have a lot of Wordpress sites to look after (argh!) I wanted to see if I could set them up to be a bit more memory-efficient. However, due to the way Wordpress resolves the location of its files, that would have required almost completely replacing the wp-settings.php file in order to set the paths in PHP. I wondered if I could use symlinks on the filesystem to achieve the same goal without hacking the code. The answer appears to be yes - here's the output from my test case:

Opcache and symlinks

This script includes the same file via different paths which use symlinks. The objective is to determine whether this creates 1 or 2 entries in opcache - and hence whether I can run multiple Wordpress sites from the same files without rewriting the code

Include from linked1 : This is /var/www/html/myvhost/include/testsymlink.php
Include from linked2 : This is /var/www/html/myvhost/include/testsymlink.php

/var/www/html/myvhost/include/testsymlink.php
Array
(
    [full_path] => /var/www/html/myvhost/include/testsymlink.php
    [hits] => 9
    [memory_consumption] => 736
    [last_used] => Mon Oct  7 10:31:53 2019
    [last_used_timestamp] => 1570444313
    [timestamp] => 1570443328
)
2 files included resolve to a single entry in opcache - yay!
Note that some caution is required when applying upgrades to the wordpress install!
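
For reference, the two include paths used in the test were just differently named symlinks pointing at the same real directory - roughly this setup (the paths here are illustrative rather than copied from the server):

<?php
// Create two symlinks that both point at the same include directory.
// Including testsymlink.php through either of them should then resolve to a
// single opcache entry for the real path.
$base = '/var/www/html/myvhost';
@symlink("$base/include", "$base/linked1");
@symlink("$base/include", "$base/linked2");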

Source code for this script

<?php
print "<h1>Opcache and symlinks</h1>";
print "<p>This script includes the same file via different paths which use symlinks. The objective is to determine whether this creates 1 or 2 entries in opcache - and hence whether I can run multiple Wordpress sites from the same files without rewriting the code</p><p>\n";

print "Include from linked1 : ";
include "linked1/testsymlink.php";
print "Include from linked2 : ";
include "linked2/testsymlink.php";

print "<pre>";
$data = opcache_get_status(true);
foreach ($data['scripts'] as $script => $sd) {
    if ("testsymlink.php" == basename($script)) {
        print $script . "\n";
        print_r($sd);
    }
}
print "</pre>";

print "2 files included resolve to a single entry in opcache - yay!<br />\n";
print "Note that some caution is required when applying upgrades to the wordpress install!<br />\n";
print "<h2>Source code for this script</h2>";
highlight_file(__FILE__);


Monday, 6 May 2019

Security Fails

Worse than merely being Security Theatre, a lot of bolt-on "security" products actually undermine your data confidentiality, integrity and availability.

Recently, while perusing my webstats, I noticed http://cp.mcafee.com/... appearing in the referers. The path part of the URL contained rather a lot of data. On opening the URL in a browser, I found it contained a lot of detail about an email, presumably sent to the user of the browser. This report contained a clickable link to my site (hence it appeared in my referers). This information also included the full email address of the email sender.

The technology in question is named "Click Protect" - but it exposes the details of a third party without their consent.

ClickProtect
The site below is rated as Unverified and is categorised by McAfee as XXXXXX/XXXXXXX.

The email was sent to you by XXXX.XXXXX@hotmail.co.uk.

Click the URL only if you understand the risk and wish to continue.

https://www.XXXXXXXX.com/...


Email:  info.security@sainsburys.co.uk


(Original content redacted with XXXXX)

A quick look around the internet and these URLs appear in a lot of different places - there are a lot of sites which publish their stats in a form searchable by Google.

I attempted to contact both McAfee and Sainsburys.co.uk (the webmail provider) to advise them they were leaking information like this but have received no response from either.

Tuesday, 18 December 2018

Using performance to manipulate behaviour

A darker side to the performance story seems to be emerging. This is the first in a series of 3 posts (there might be more later) about how web performance is being weaponized.

While I, like many of you, spend a lot of time simply trying to make my sites go faster, it seems that other people are finding ways to exploit performance as a way of manipulating user behaviour. This was particularly evident when I recently visited www.forbes.com to read an article about phone biometrics. Not where I would go for authoritative information – I was just browsing at the time. As is common, it asked me if I wanted to accept its cookies.



Yes, they want to protect their revenue stream, so the big green button with white text is easy to see and read, while the smaller grey button is a lot harder to read – and only professes to provide “more information”. Now, due to the specifics of the EU's GDPR, the site needs my “informed consent” to any cookies it drops – so not surprisingly the “more information” button takes me to a dialogue where I can also specify which cookies I will accept.


If I click on the first, big green button, I get an almost immediate acknowledgement. Accepting all three classes of cookies from the “more information” dialogue seems to take slightly longer, but I didn't measure it too closely. But what is interesting is that if I dial back the cookie setting to only “required cookies” the site tells me it has a lot of work to do in order to dial back “the full power of Forbes.com”.



So I have incurred a huge performance penalty for exercising my rights.

This did provoke a torrent of activity in the browser – over a thousand requests – which included a few 404s and several 302s sending my browser back around the internet. I've not looked at all of them, but the 200 responses all contained “no data”, and none of the sites I saw had appeared when I first loaded the page.

This appears to be a very elaborate piece of theatre.

It took around 60 seconds to reach the 100% point – while helpfully giving me the option to change my mind at any point.

Another interesting feature of the performance was that the counter slowed down as it progressed! If you've read up on progress bars, you'll know that is exactly the opposite of what you should do if you want to convey an impression of speed.

Finally, changing my browser config to send a “Do Not Track” header had no impact at all on the behaviour – although, at the time of writing, DNT is still only a proposal for HTTP.

Usually I don't wear my tin foil hat when browsing the internet – I'm OK that websites need a way to fund the content they publish but I am very disturbed that sites seem to go to such lengths to try to manipulate their users' behaviour.

Wednesday, 19 April 2017

Random S**t Happens (or sometimes it doesn't)

On Thursday last week, I migrated a wee enterprise application I wrote a number of years ago (2009?) to its new home on a VMware farm. In itself not a big job, but there were a lot of integration points with other systems. Sadly, it went about as well as I expected. After some pain, normal service was restored. Almost. One of the key pages in the application kept pausing.

As you might expect, I am rather fastidious in ensuring the performance of the applications I write. But this seemed strange. Generating the HTML usually took 30-40 ms (measured at the browser). Not earth shattering, but it does do a lot of work and well within the performance budget. But 1 in every 20 or so requests would take much longer - between 6 and 20 seconds!

Since there were no code changes, the obvious candidate for the cause was the infrastructure which had changed:
  • other VMs on the same host competing for resource
  • I/O contention (this was now on a SAN with a lot of other devices)
  • overzealous network security devices saturating the network bandwidth
  • congestion crashes on routers
But I hadn't ruled out a problem in the software stack. Mod_deflate buffers, database contention...

Checking the usual metrics (load, CPU usage, disk IO, APC stats) revealed nothing untoward. So the next step was to inject some profiling into the code. I would have preferred to use XHProf, but the people who own this system are not keen on third party tools in their production systems.
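
The probes themselves were nothing fancy - something along these lines (a sketch from memory rather than the production code; the threshold and log format are illustrative):

<?php
// Crude timing probe: record when the request started and log any named
// checkpoint that is reached suspiciously late. The normal page takes 30-40ms,
// so anything over a second is clearly one of the stalls.
function checkpoint($label)
{
    static $start = null;
    if ($start === null) {
        $start = microtime(true);
    }
    $elapsed = microtime(true) - $start;
    if ($elapsed > 1.0) {
        error_log(sprintf("SLOW %.3fs at %s", $elapsed, $label));
    }
}

// sprinkled through the suspect code path:
checkpoint('before callback');
// ... code under suspicion ...
checkpoint('after callback');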

The profiling soon revealed that the pauses were always occurring in the same region of code. This ruled out any environmental issue.

Looking through the region, there was no disk, network or database I/O. It did write some output (and the HTTP response was chunked) but that was a very long delay for a context switch or a garbage collection cycle. And why didn't it occur on every request?

All the code seemed to be doing was reading stuff from PHP's memory and injecting it into the HTML page.

Going through the program in some detail (I did mention it was a very long time ago when I wrote it originally?) there was an inversion of control - a dependency injection - where a callback was invoked. Dumping the callback led me to an obscure library routine doing encryption. This created an initialization vector:

mcrypt_create_iv (mcrypt_enc_get_iv_size($this->td), MCRYPT_DEV_RANDOM);

This was the smoking gun.

The problem was that I had told mcrypt to read from /dev/random, and /dev/random had no entropy available. So it blocked until it got some.

The solutions were obvious:
  • keep /dev/random topped up (using rngd)
  • use a different (weaker?) entropy source - MCRYPT_DEV_URANDOM, which reads from /dev/urandom
Given that I had already peppered the code with profiling, adding a single character seemed the sensible choice. Whether urandom is weaker is debatable. Indeed, VMware (but not Red Hat) recommend this as a solution.
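
For clarity, the fix amounts to this - a self-contained sketch rather than the application code (which wraps the call in a library class), and the CBC mode here is a guess since the post only says the cipher was triple-DES:

<?php
// Open a 3DES context, as the original library code did.
$td = mcrypt_module_open(MCRYPT_3DES, '', MCRYPT_MODE_CBC, '');

// Before: MCRYPT_DEV_RANDOM blocks when /dev/random has no entropy to give.
// After:  MCRYPT_DEV_URANDOM reads /dev/urandom, which never blocks.
$iv = mcrypt_create_iv(mcrypt_enc_get_iv_size($td), MCRYPT_DEV_URANDOM);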

Since the encryption in question was using triple-DES (look, it was a really long time ago, OK?) even a bad random number generator wouldn't have helped make it more secure.

In my defence:

1) While mcrypt is now deprecated in current versions of PHP, the current Red Hat release (7.3) ships with a version of PHP pre-dating the deprecation - and it certainly was not deprecated at the time I wrote the code. In any case, it wasn't mcrypt doing anything wrong here.

2) The 3DES encryption was an early CSRF protection mechanism for an application which has very restricted access; the application subsequently moved to a more complex scheme using SHA1 hashes, but the original code was not removed when the new mechanism was added.

3) Frankly, base64 encoding the data here would have been overkill given the level of exposure in this application.

This was the first time I had come across this problem. I'm going to be involved in moving a lot of other systems into this network - many of which make more extensive (and critical) use of encryption than this one does. Now I know one more thing to look out for.

Friday, 17 March 2017

Image compression (again)

It's not the most exciting thing to have happened to web performance in the past year or so - but it will have an impact on your performance and scalability. A team from Google have released a new JPEG compressor called guetzli.

What's interesting about this one is that the team were focused (oops) on the perceived image quality, rather than the measured quality, although they did also take time out to write their own image quality measurement algorithm.

Saturday, 6 August 2016

Speeding up Dokuwiki

I'm a big fan of Dokuwiki.
  • it's simple
  • has a great ecosystem of plugins
  • has great performance
But some time ago I decided there was room for improvement, so I wrote a very simple framework (itself implemented as a Dokuwiki plugin). I've just uploaded this to GitHub.

Specifically this allows for:
  • Much faster page loading using PJAX
  • Pure javascript/CSS extensions - no PHP required
  • Prevents Javascript injection by page editors
The PJAX page loading requires small changes to the template: when a request is flagged as coming from PJAX, everything except the page-specific content is excluded - the navigation elements and other page chrome are dropped and just the rendered markup is returned as a well-formed HTML fragment. There is an example template here. While the template this is based on is already rather complex, the actual changes to this, or any existing template, are only a few lines of code - see the diff in the README.
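
To give a flavour of the change (a sketch rather than the real template - the include names are made up, although tpl_content() is the standard Dokuwiki call and PJAX requests are normally flagged with an X-PJAX request header):

<?php
// When the request comes from PJAX, emit only the rendered wiki content;
// a normal request still gets the full page chrome.
$isPjax = isset($_SERVER['HTTP_X_PJAX']);

if (!$isPjax) {
    include 'tpl_header.php';   // hypothetical: <head>, navigation, sidebar
}
tpl_content();                  // Dokuwiki renders the page markup here
if (!$isPjax) {
    include 'tpl_footer.php';   // hypothetical: footer and closing tags
}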

This saves me around 450 milliseconds per page in loading time.

The savings come from not having to parse the CSS and Javascript on the browser. The serverside content generation time is not noticeably affected.

But even if you are not using Dokuwiki you can get the same benefits using PJAX on your CMS of choice.

A strict Content Security Policy provides great protection against XSS attacks. But the question then arises of how to get run-time generated data routed to the right bit of code. Jokuwiki solves this by embedding JSON in data-* attributes, including the entry point for execution.
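
As an illustration of the pattern (the attribute name and JSON keys here are my own for the example, not necessarily the ones jokuwiki uses): the server emits the widget's parameters, including the name of its entry point, as JSON in a data-* attribute, and a single external, CSP-approved script reads and dispatches them - no inline script required.

<?php
// Server side: render a placeholder element carrying the run-time parameters.
$params = ['widget' => 'slideshow', 'start' => 0, 'interval' => 5000];
printf(
    '<div class="jw-widget" data-jw="%s"></div>',
    htmlspecialchars(json_encode($params), ENT_QUOTES)
);
// Client side (in the external script): select .jw-widget elements,
// JSON-decode their data-jw attribute and call the named widget's entry point.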

Wednesday, 22 June 2016

Faster and more Scalable Session handling

Chapter 18 (18.13 specifically) looks at PHP session handling, which can be a major bottleneck. I suggested there were options for reducing the impact, top of which was to use a faster substrate for the session data, but no matter how fast the storage, it won't help with the fact that (by default) control over concurrency is implemented by locking the data file.

While I provided an example in the book of propagating authentication and authorization information securely via the URL, removing the need to open the session in the linked page, sometimes you need access to the full session data.

Recently I wrote a drop-in replacement for the default session handler which is completely compatible with it (you can mix and match the methods in the same application) but which does not lock the session data file. It struck me that there were lots of things the session handler was doing that a custom handler might do differently. Rather than create every possible combination of storage / representation / replication / concurrency control, I adapted my handler API to allow multiple handlers to be stacked to create a custom combination.
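
To give a flavour of the non-blocking part, here is a minimal sketch (for illustration only - not the code published on PHPClasses): a files-style handler that simply never takes the lock the default handler would, accepting that the last write wins when requests race.

<?php
// Minimal non-locking, file-based session handler (sketch only).
// The default 'files' handler flock()s the session file for the whole request;
// this one just reads and writes without holding any lock.
class NonBlockingFileHandler implements SessionHandlerInterface
{
    private $path;

    public function open($savePath, $name) { $this->path = $savePath; return true; }
    public function close() { return true; }

    public function read($id)
    {
        $file = "$this->path/sess_$id";
        return is_readable($file) ? (string) file_get_contents($file) : '';
    }

    public function write($id, $data)
    {
        // No lock is held between read() and write(), so concurrent requests
        // race and the last writer wins.
        return file_put_contents("$this->path/sess_$id", $data) !== false;
    }

    public function destroy($id) { @unlink("$this->path/sess_$id"); return true; }
    public function gc($maxlifetime) { return true; }  // expiry pruning omitted
}

session_set_save_handler(new NonBlockingFileHandler(), true);
session_start();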

The code (including an implementation of the non-blocking handler) is available on PHPClasses.

One thing I omitted to mention in the book is that when session_start() is called it sets the Cache-control and Expires headers to prevent caching:

Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0



If you want your page to be cacheable, then there is a simple 2 step process:

  1. Check - are you really, REALLY sure you want the content to be cacheable and use sessions? If you are just implementing access control, then the content *may* be stored on the user's disk.
  2. Add an appropriate set of headers after session_start();
header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + 3600));
header('Cache-Control: max-age=3600'); 
header('Vary: Cookie'); // ...but using HTTPS would be better