Thinking Asynchronously in PHP

PHP is an excellent language for developing on the web. It’s portable, versatile, and has matured a lot over the past few years. Web applications are becoming more complex each day: they process massive amounts of data, talk to other services, and are expected to do so quickly. Eventually they reach a point where the processing is either too server-intensive, or is doing so much work during the request that the site becomes less responsive to users. If you take a step back and think about PHP applications asynchronously, you can build faster, more scalable applications.

Web Application Flow

A typical web application’s job is to turn a request into a response. For the application to feel responsive, it should generate that response as quickly as possible – I usually find anything under half a second acceptable. If we build the application asynchronously, we delegate any complex logic, service calls, or other expensive tasks to separate processes. These tasks can usually be broken down into two categories…

Preparing Data For This Page

If a page requires us to perform complex queries, pull data from web services, and so on, that will likely add considerable delays to page generation. The usual response is caching, but people still tend to use it in a synchronous fashion. Let’s take a Twitter feed as an example (showing the latest 10 tweets).

You’ll see code like this quite often:

public function getTweets($username) {
    // Return the cached copy if we have one.
    if ($tweets = $this->_cache->get("tweets_$username")) {
        return $tweets;
    }

    // Cache miss: fetch from Twitter and cache the result
    // (using the default cache expiration).
    $tweets = $this->_twitter->get($username);
    $this->_cache->set("tweets_$username", $tweets);

    return $tweets;
}

It checks to see if our cache entry exists, and returns it if it does. If not, it calls the routine that fetches the tweets from Twitter, and caches the result (using our default cache expiration) – let’s say 30 minutes. The problem with this approach is that every 30 minutes, the tweets need to be downloaded again, making the user wait while the page loads. The other major problem is that when Twitter is down (or we are unable to connect for whatever reason), the page is broken, because the cache has expired.

A much better approach is to never even attempt to do this work in the main application flow. Instead, let’s have a cron job run every 30 minutes that downloads this information and caches it with no expiration. This way, even if Twitter is down, the cached copy can always be expected to work. You could still keep the above code as a safeguard, but with this approach it should never have to be used as a fallback. You’d only ever want to update the cache when you successfully retrieve the information.
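A sketch of that cron script might look like this – here, fetchFromTwitter() and $cache are stand-ins for whatever Twitter client and cache object you actually use:

// refresh_tweets.php – run from cron every 30 minutes.
// Only overwrite the cache when the fetch succeeds, so a Twitter
// outage leaves the last good copy in place.
$tweets = fetchFromTwitter($username);

if ($tweets !== false) {
    // An expiration of 0 means "never expire" in most cache APIs
    // (e.g. Memcached::set()); adjust for your cache of choice.
    $cache->set("tweets_$username", $tweets, 0);
}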

This model works with almost any data you can cache. You’ll need to decide if you’d rather only perform these tasks on demand (when a user visits a page), or if you want to do a little extra work to ensure the data is always available. You’ll have to weigh the benefits against the traffic of the site and the number of unique cacheable items you are working with.

Processing Needed Later

The other category usually comes from requests that actually need some sort of action performed. One concept we’re going to look at, which really helps scale your applications, is a work queue (or job queue). With a queue in place, we can defer execution of expensive actions and quickly complete our request to response cycle. Just because something is delayed doesn’t mean our data has to fall out of date: by delaying a task for one second, or even zero seconds, it can run almost immediately, just in another process.

To make this work, we need a few different pieces in place (a rough sketch follows the list):

Job Queue
Storage of tasks to be executed. Example: print spooler
Job Worker
A process that reads tasks from the queue, and executes them. Workers can be spread across multiple physical machines, if necessary.
Client
A client would be our application, which adds items to the job queue. Similar to workers, we could have multiple applications using the same queue.
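To make the examples below concrete, here is a hypothetical JobQueue interface – the name and methods are my own invention, not from any particular library – that the client code will assume:

interface JobQueue
{
    // Add a task to the queue. $payload is whatever serializable data
    // the worker needs; $delay is how many seconds to wait before the
    // task becomes available (0 means "as soon as a worker is free").
    public function push($taskName, array $payload, $delay = 0);

    // Fetch the next available task, or return null if the queue is empty.
    public function pop();
}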

Let’s look at some examples…

Sending Mail

Using PHP to send mail is a relatively simple task. If your server can send mail directly, then it is probably very quick, too. A lot of servers use some sort of mail queue internally to ensure mail gets sent at a steady pace and doesn’t overload the server. However, if you are sending mail via a remote mail server (for example, through Gmail), it will take a lot longer, especially if you are sending lots of mail.

Instead of making the user wait a few extra seconds, we can toss the mail into a queue (which should be very quick) and not worry about it. Then, one of our mail queue workers can process it as soon as one is available. Depending on your setup, this could be nearly instant, or it could be quite delayed. It’s up to you to decide how important the timing of your mail delivery is.
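For example, a controller action might hand the message to the hypothetical JobQueue from above instead of talking to the mail server itself – the task name and payload shape here are purely illustrative:

public function sendWelcomeMail($user) {
    // Queuing is a quick local write; the slow SMTP conversation
    // happens later, in a worker process.
    $this->_queue->push('send_mail', array(
        'to'      => $user->email,
        'subject' => 'Welcome!',
        'body'    => $this->_view->render('emails/welcome.phtml'),
    ));
}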

Rebuilding Stats, Caches, etc.

Another common scenario is performing clean-up processing after making changes to your data. Delaying execution until another process can pick it up is usually harmless to the user, and gets them to their next task that much faster.
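The same pattern applies here – after saving the change, queue the clean-up instead of doing it inline (again using the hypothetical queue from above):

public function saveArticle($article) {
    $this->_articles->save($article);

    // The user gets their response right away; a worker rebuilds the
    // expensive aggregate counts shortly afterwards.
    $this->_queue->push('rebuild_stats', array('articleId' => $article->id));
}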

Different Types of Queues

There are two common approaches you can take.

Running as a Daemon

If you need to minimize the time between queuing and execution – or, more importantly, between scheduled execution and actual execution – you should look at something like Gearman or beanstalkd. These applications run as daemons and can process queued tasks immediately as they become ready. However, since they require extra software to be running, they may not be available or possible in some hosting environments, and they take a little more work to set up. I’d argue that, for timing accuracy alone, a daemon will be faster than the alternatives, since it doesn’t have to continually poll for new data.
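As a rough sketch, queuing and processing a background task with the pecl/gearman extension looks something like this (the 'send_mail' task name and payload come from the earlier example, not from Gearman itself):

// Client: fire-and-forget a background job.
$client = new GearmanClient();
$client->addServer(); // defaults to 127.0.0.1:4730
$client->doBackground('send_mail', json_encode(array('to' => 'user@example.com')));

// Worker: a long-running process that executes jobs as they arrive.
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('send_mail', function (GearmanJob $job) {
    $mail = json_decode($job->workload(), true);
    // ... actually send the mail here ...
});
while ($worker->work());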

Running Scheduled Tasks

If you are limited in what you can run, a simple option is to set up a scheduled task for your worker. It can be a simple PHP script that connects to your queue (which you could store in the database) and processes items as they become available. If you want things to run as quickly as possible, you can make your cron job run more frequently; however, this may also burden your system, depending on how frequently you poll and how intensive your scripts are.
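A database-backed worker can be as small as the sketch below, run from cron every minute or so. The jobs table schema (id, task, payload, status, created_at) is an assumption for illustration:

// worker.php – run from cron; processes whatever is waiting, then exits.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

while ($job = $db->query(
    "SELECT * FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1"
)->fetch(PDO::FETCH_ASSOC)) {
    // Mark the job as running. (A production implementation should claim
    // the row atomically so overlapping workers can't grab the same job.)
    $db->prepare("UPDATE jobs SET status = 'running' WHERE id = ?")
       ->execute(array($job['id']));

    // ... execute the task based on $job['task'] and $job['payload'] ...

    $db->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")
       ->execute(array($job['id']));
}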

I’m going to follow this post up with some examples of implementing this with a database-based queue running as a scheduled task, and also with either Gearman or beanstalkd.

Reflection on Separation of Concerns

Working with larger projects a lot, I sometimes forget some of the earlier decisions I made to get here. Biting the bullet with Zend Framework 3 years ago made some of those decisions subconscious (or made them seem obvious), but after a little reflection, I am thankful. This post really has nothing to do with ZF, but it does relate to the idea of frameworks a lot. I’ve decided I need to start looking more at what other developers are doing – outside of the few dozen that I actively follow on Twitter. I need to go back, re-evaluate some of those decisions, and hopefully share some knowledge along the way.

After the last #LeanCoffeeKL event, I was brought into a discussion with a fellow developer who could not get mail set up locally. Another developer chipped in with some useful advice, while I instantly thought, “who cares? don’t send mail locally”. After a few minutes of dwelling on how obvious that seemed, I remembered setting up my own local dev server ages ago, and being pissed off that I couldn’t get mail working properly. All of the blog posts would explain how to set it up through your ISP. Smarter ones would show you how to send it through SMTP (or through Gmail, etc.).

A big mindset change for me was when I discovered that I could code for different environments. Having distinct development, staging and production environments made configuration a lot easier. It also made dealing with external tools a lot easier. So, how did that help me with that pesky mail problem?

Mail is a service. My application code does not care how mail works; it just cares that it can call the mail service. If I’m testing things, I don’t want to actually send mail – that can have some very scary unintended consequences. I do, however, want to make sure that the generated emails are correct. By abstracting what the mail service does, I can create a basic interface for that service that my application calls. My configuration (per environment) then determines which mail implementation to use.
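A minimal sketch of that abstraction might look like this – the interface and class names are my own, not from any framework:

interface MailTransport
{
    public function send($to, $subject, $body);
}

// Development/staging: capture mail instead of sending it.
class DatabaseMailTransport implements MailTransport
{
    private $_db;

    public function __construct(PDO $db) {
        $this->_db = $db;
    }

    public function send($to, $subject, $body) {
        $this->_db->prepare(
            "INSERT INTO mail_queue (recipient, subject, body) VALUES (?, ?, ?)"
        )->execute(array($to, $subject, $body));
    }
}

// Production: hand the message off to a real mailer.
class PhpMailTransport implements MailTransport
{
    public function send($to, $subject, $body) {
        mail($to, $subject, $body); // or an SMTP library of your choice
    }
}

The application only ever depends on MailTransport; the per-environment configuration decides which implementation gets injected.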

For example, on development, I capture email to the database. On staging and production, I capture email to the email queue (in the database). There is no cron job set up on staging to actually send the mail, however – I can run one manually when I need to.

A lot of these problems come up all the time, but unless you know the right questions to ask, it’s pretty hard to navigate forward. I’m hoping to start exploring these topics a lot more here over the next few weeks.