Caching: The Usual Suspect in Website Update Mysteries

It’s an ordinary workday at DutyTaker. We’re supposed to update the customer’s website. Our senior developer repeats every step of the same update procedure that we’ve relied on probably hundreds of times before. Everything looks good until the customer sends a message: “Did you update the website yet? We still see the old content there.” Now, what could’ve gone wrong when everything about the update went seemingly right? Our number one suspect is caching: the reported old content must have come from a cache. We’re about to respond to the customer and ask them to clear their cache when our developer cuts in: “Wait! It’s not that simple! Let me explain.”

Keeping It Close

The idea of caching is simple: whatever you need often, you keep it handy. If we frequently visit the same website, we want to keep the content near our web browser, in the cache, so that the page loads faster. The nearest cache is on the same computer or device where your browser is running, but by no means is it the only one. Website content can be cached anywhere between your local web browser and the remote web server, or even further if the web server is not the origin of the content. The goal is always the same, though. We want to shorten the journey the content makes from its origin to the browser.

Checking the Browser

When we visit a website, our browser sends a request for every resource that the page requires, including images, fonts, scripts, and stylesheets. The first destination is the local cache folder, the same folder that gets emptied when we choose “Clear browsing data” in the browser. If we find the requested resource in the cache folder and it hasn’t expired, we have a quick “cache hit” and the request never leaves the browser. If we get a “cache miss” instead, the browser sends the request onward. If the request succeeds, the response includes the resource, which we store in the local cache for next time.
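
In code, that decision looks roughly like the sketch below, where fetch_from_network is a made-up stand-in for the real network request:

```python
import time

cache = {}  # url -> (expires_at, body)

def fetch_from_network(url):
    # Hypothetical stand-in for a real HTTP request; returns the body and a max-age.
    return b"response body", 3600

def get(url):
    """Return a resource, preferring a fresh local copy over the network."""
    entry = cache.get(url)
    if entry is not None and entry[0] > time.time():
        return entry[1]                       # cache hit: the request never leaves us
    body, max_age = fetch_from_network(url)   # cache miss: forward the request
    cache[url] = (time.time() + max_age, body)
    return body
```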

Using a Proxy

If we connect to the Internet via a proxy server, all the requests for web resources coming from our browser are routed through that proxy. Proxy servers typically provide security and privacy to users who want to stay anonymous online. Sitting between the user and the Internet, they can also boost website performance by caching content. A major difference from the private browser cache is that the proxy server cache is shared among all the users who connect to the same proxy. And if the proxy is managed by an internet service provider, we can’t clear its cache ourselves.
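
For illustration, this is one way to push a request through a proxy with the Python requests library and see whether the answer had been sitting in a shared cache; the proxy address is a placeholder:

```python
import requests

# Placeholder proxy address; substitute the one your network actually uses.
proxies = {"http": "http://proxy.example.net:3128",
           "https": "http://proxy.example.net:3128"}

response = requests.get("https://example.com/", proxies=proxies, timeout=10)

# A non-zero Age header suggests the response came from a shared cache, not the origin.
print("Age:", response.headers.get("Age", "not present"))
```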

On the Edge

At this point, our browser has returned a cache miss for the request. So has the proxy server if it was involved. We are on the edge of our local network, sending our request off to a remote web server. Diving a bit deeper, we need to connect to the network socket of the remote server that can transport the resource. First, we need to resolve a mismatch. Network sockets are identified by an IP address and a port number, but the resource we want is identified by a URL with a domain name. Luckily, our browser knows how to call a DNS server to get an IP address in exchange for a domain name, a procedure also known as DNS resolution. Having the IP address, we then simply connect to the socket and start the data transfer.
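
The lookup half of that exchange can be reproduced with Python’s standard library; this only resolves the name, it doesn’t open the connection:

```python
import socket

# Trade a domain name for the IP addresses our resolver hands back for it.
for family, _, _, _, sockaddr in socket.getaddrinfo("example.com", 443,
                                                    proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr[0])
```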

Calling DNS resolution simple is actually misleading. While we do get an IP address as a result, it may well be the IP address of a nearby edge server performing edge caching for a content delivery network, a CDN. If our friends on the other side of the globe send an identical request, they will get a different IP address, one that belongs to a server near the edge of their local network. Edge servers usually cache static website content, such as images, videos, PDF documents, stylesheets, and script files. Because they shorten the geographical distance the data has to travel, transfers are faster. The edge caches are managed by CDNs, many of which offer a user-friendly admin console for purging content.
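
One way to spot an edge cache is to print the cache-related response headers. The header names below are common examples rather than a standard, so a given CDN may use different ones:

```python
import requests

response = requests.get("https://example.com/logo.png", timeout=10)

# Standard and vendor-specific headers that often reveal an edge cache.
for name in ("Age", "Cache-Control", "X-Cache", "CF-Cache-Status", "X-Served-By"):
    if name in response.headers:
        print(f"{name}: {response.headers[name]}")
```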

Approaching the Host

Requests for dynamic content and requests that update resources on the server bypass the edge servers. They are routed hop by hop toward the remote server until they reach the local network of the remote host. On a well-secured server, the requests pass through a firewall before landing on the porch: the reverse proxy server. Besides providing additional protection, a reverse proxy does web acceleration by caching responses to valid requests. Returning a cached response is faster than having the web server recompute a fresh response every time the same request arrives, and it also reduces the load on the server. The cache on the reverse proxy can be cleared by the server administrator of the remote website. Cloud hosting services usually provide an admin console both for enabling a reverse proxy and for clearing its cache.
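
Conceptually, the acceleration looks something like this sketch, where handle_request is a stand-in for the real web server and only GET responses are cached:

```python
import time

response_cache = {}  # (method, path) -> (expires_at, response)

def handle_request(method, path):
    # Placeholder for the real, possibly expensive, web server work.
    return f"fresh response for {method} {path}"

def accelerated(method, path, ttl=60):
    """Serve repeated GETs from the cache; let updating requests pass straight through."""
    if method != "GET":
        return handle_request(method, path)   # dynamic or updating requests bypass the cache
    key = (method, path)
    entry = response_cache.get(key)
    if entry is not None and entry[0] > time.time():
        return entry[1]                       # cached response, nothing recomputed
    response = handle_request(method, path)
    response_cache[key] = (time.time() + ttl, response)
    return response
```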

At Source

The remote server is the final destination of the request sent by our browser. This is where the source code of the web application resides. The web server software and the application source code handle the request and send a response back to the browser client. The source code, written by human developers, is human-readable. It’s verbose for clarity, which is not optimal for a web server to interpret. For this reason, we compile the source code into a format that is faster for the computer to process. PHP frameworks like Symfony compile PHP code into a compact set of classes, loaders, and metadata as part of their startup and cache warm-up routines. That speeds up request handling because the source code is no longer used in its human-readable form. If a developer updates the PHP code on the server but doesn’t clear the cache, the server keeps handling requests with the outdated compiled code, as if no update had ever been made.
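
The trap is easy to reproduce with a toy compile cache in Python: the version below is invalidated by the file’s modification time, and removing that check would keep the stale compile in use after an update. In Symfony itself the usual remedy is to clear the framework cache, for example with the cache:clear console command.

```python
import os

compiled = {}  # path -> (mtime, code object)

def run_script(path):
    """Compile a script once and reuse the result until the source file changes."""
    mtime = os.path.getmtime(path)
    entry = compiled.get(path)
    if entry is None or entry[0] != mtime:    # remove the mtime check and updates are ignored
        with open(path) as source:
            entry = (mtime, compile(source.read(), path, "exec"))
        compiled[path] = entry
    exec(entry[1], {"__name__": "__main__"})
```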

Web servers also host source code that is interpreted by the browser: CSS style definitions and JavaScript program code. Following best practices, both are compiled and packaged into a compact, browser-friendly format before the server sends them to the browser. If we do the packaging on the server, any change made to the JavaScript source files or CSS style definitions requires a rebuild. Sometimes we also rename the files to make sure the browsers don’t think they can find the latest code in their cache.
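
The renaming is usually done by stamping a hash of the file’s content into the file name, so a changed bundle gets a URL that no cache has ever seen; a minimal version of the idea:

```python
import hashlib
import pathlib

def fingerprinted_name(path):
    """Return a name like app.3f2a9c1b.js, derived from the file's current content."""
    p = pathlib.Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()[:8]
    return f"{p.stem}.{digest}{p.suffix}"

# A changed bundle produces a different name, so browsers cannot reuse the old cached copy.
```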

Finding Data

In a multitier architecture, the application data lives at least one server downstream from the application code. It’s far away, relatively speaking. Every time the application needs the data, it sends a request to a database or some other downstream server, which responds with the data. The wait can be a hundred milliseconds, an eternity for a modern computer. We shorten the distance by having our application cache query results, keyed by the queries that produced them. The biggest benefit comes from caching static data, such as geographical names, coordinates, and translations. If we find the data in the cache, we avoid the round trip to the database and respond to the request faster.

Updating data in the database doesn’t automatically invalidate the cached copies of the data, which usually have a long lifetime. Setting the time-to-live parameter (TTL) to six months is not uncommon. If we don’t want to wait for the cached data to expire, we’ll need to ask the server administrator to go to the server and clear the database query result cache manually.
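
A minimal sketch of such a query cache, with a deliberately long TTL and a manual invalidation step that nothing calls automatically; run_query is a made-up placeholder for the real database round trip:

```python
import time

SIX_MONTHS = 180 * 24 * 3600
query_cache = {}  # sql text -> (expires_at, rows)

def run_query(sql):
    # Hypothetical helper that performs the slow round trip to the database.
    return [("Helsinki", 60.17, 24.94)]

def cached_query(sql, ttl=SIX_MONTHS):
    entry = query_cache.get(sql)
    if entry is not None and entry[0] > time.time():
        return entry[1]                    # no round trip to the database
    rows = run_query(sql)
    query_cache[sql] = (time.time() + ttl, rows)
    return rows

def invalidate(sql):
    """Updating the database does not call this for us; someone has to do it explicitly."""
    query_cache.pop(sql, None)
```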

Policing Options

Instead of only investigating the issues caused by caching, we can proactively set policies that control our number one suspect. Both the requests sent by our browser and the responses sent by the server may carry a Cache-Control header in which we specify directives for caching content. For example, our application can tell the browser that cached responses are fine as long as they’re not older than 24 hours, or the remote server can inform all proxy servers that the response is private and must not be stored in a shared cache. Notice that this header does not affect how the request is routed, so the request may still end up on a server that doesn’t respect any caching policy.
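
As an illustration, a server-side handler could attach those directives along these lines; the snippet uses Python’s built-in http.server purely to keep the example self-contained:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path.startswith("/account"):
            # Personal content: only the user's own browser cache may keep it.
            self.send_header("Cache-Control", "private")
        else:
            # Public content: any cache may reuse it for up to 24 hours.
            self.send_header("Cache-Control", "public, max-age=86400")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```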

Investigation

Now that we understand the scope of the issue, it’s time to investigate and find the evidence. The main scenes are the web browser and the web server. Browsers keep track of the outgoing requests and incoming responses, including all the headers. If the Age header is present in a response, it tells how long, in seconds, the response has been sitting in a cache. A response header line Age: 0 means we can rule out the proxy server as the cause of a caching issue. Edge servers usually add custom headers to their responses indicating that the content comes from a cache, but this evidence is weaker because the headers are vendor-specific and not part of the HTTP standard. Edge servers can also override any headers that set caching policies, making them a poor witness and a potential suspect.

Web servers log all incoming requests. If a request wasn’t logged, it never arrived at the server, freeing the server of the suspicion of serving outdated content. If the server did log the request with a success status, we’ll need to check the status of the application-level caching to find more clues. A broken feature on the website that is still broken after a fix was deployed to the server is circumstantial evidence of an application cache being out-of-sync.
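
If we do have access to the logs, a quick scan tells us whether the suspicious request ever reached the server; the log path and the requested path below are placeholders:

```python
import sys

LOG_PATH = "/var/log/nginx/access.log"   # placeholder path; adjust to your server
NEEDLE = "/products/updated-page"        # the resource we expect to have been requested

with open(LOG_PATH) as log:
    hits = [line for line in log if NEEDLE in line]

if hits:
    print(f"{len(hits)} request(s) reached the server; most recent entry:")
    print(hits[-1].rstrip())
else:
    print("No matching request in the log; the response came from somewhere upstream.",
          file=sys.stderr)
```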

What complicates the investigation is that we don’t always have access to the browser, the server, or any proxy in between. Trying to reproduce the issues with our own browser doesn’t guarantee that we find any evidence one way or the other, so we need to look somewhere else. No access to the server typically implies that our updates are deployed to the server more or less automatically. So if a deployment failure can explain the issues that we’re investigating, it’ll be our next suspect.

