Instrumenting HTTP requests in node

By Trevor Livingston

HTTP client libraries are a dime a dozen in user-land, but you might need more from your client of choice.

Almost every web application is going to need to interact with a service or other HTTP-based server at some point (I'm focusing on HTTP for this post). At scale, this could result in hundreds of millions of calls to an HTTP client library.

While there are plenty of user-land HTTP clients to choose from, is the library you picked going to be free of memory leaks? Will it be able to expose metrics for timings and more? Will you be able to customize it for your environment’s service discovery, secrets or other operational concerns?

Too many OSS projects do not address the reality of operationalizing production software at scale.

With over five million downloads per week, node-fetch (just picking an example) is a go-to choice for many developers. It’s simple and fast and based on the browser fetch API.

But one of the problems node-fetch and many other libraries introduce is hiding the underlying http.ClientRequest object. This makes it very hard to do much more than simple response and error handling.

fetch('https://api.github.com/users/github')
    .then(res => res.json())
    .then(json => console.log(json));

fetch returns a Promise, but there is no way from there to access the http.ClientRequest any longer. This takes away access to a very powerful and useful low level API.

When picking an HTTP client, look for clients that allow you to get at the http.ClientRequest object, because this is what we’re going to be using to make our HTTP interactions better. One client I know of that supports promises while still exposing the http.ClientRequest is wreck.

In addition, look for clients that do not assume non-200 HTTP status codes are errors. If you have a response, there was no error. It might not be the response you were looking for but errors should represent issues making the request.
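As a rough sketch of both points, the helper below wraps node's core http.request so that the caller still gets the http.ClientRequest, and the promise resolves for any status code. The name send and the shape of what it returns are made up for this post; they are not taken from any particular library.

```javascript
const http = require('http');

// Hypothetical helper: returns the raw ClientRequest alongside a promise.
function send(options) {
    const req = http.request(options);
    const response = new Promise((resolve, reject) => {
        // Reject only when the request itself fails (DNS, connect, reset...).
        req.on('error', reject);
        // Any response resolves, whatever its status code.
        req.once('response', (res) => {
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('error', reject);
            res.on('end', () => resolve({
                statusCode: res.statusCode,
                body: Buffer.concat(chunks).toString()
            }));
        });
    });
    req.end();
    return { req, response };
}
```

A caller can attach listeners to req for instrumentation and await response for the result; a 404 or 500 arrives as a resolved value that the caller decides how to treat.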

Let’s start digging into what goes into an HTTP request to understand why we might get a little more out of our client libraries.

Conceptual diagram of connection establishment

While every library, including node’s libraries, is going to provide some sort of timeout functionality on the request/response, there is a lot more happening than just sitting around and waiting for a response from a server. There are many underlying issues that may contribute to a slow request and that have nothing to do with the server you are interacting with. Do we really want to treat socket connection the same as server response?

Imagine the following scenario:

Say the worst-case response time for some service is around 500ms, so you’ve set your timeout to 750ms. During a particularly busy peak, a network issue occurs, resulting in connection failures. What happens?

Timeouts encompass not only receiving a response from the service, but getting a socket and establishing a connection. You are waiting 750ms for a connection that is never going to occur anyway.

Diagram of long wait time with no connect

On average, production connection times should take under 10ms. In the situation I’ve just described, you could be failing fast at 10ms (depending on your environment) and falling back to a recovery path rather than eating up system resources and making users wait.

Diagram of shorter wait time on no connect

It’s important to remember that an http.ClientRequest object is an event emitter. Beyond the response event, it also emits socket, which is the key to getting at the connection setup.

const http = require('http');

const req = http.request(options);
req.on('socket', (socket) => {
    // Yay, we have a socket
});

socket is also an event emitter. The particular event we are looking out for is connect (although we could also listen to secureConnect to wait for the entire handshake in a TLS connection to complete).

req.on('socket', (socket) => {
    socket.on('connect', () => {
        // We are connected
    });
});
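If the same code has to serve both plain and TLS requests, one way to pick the right event is to check the socket itself. The sketch below relies on TLS sockets exposing an encrypted property; the helper name onConnected is an assumption for this post.

```javascript
// Hypothetical helper: call `callback` once the socket is ready to use.
function onConnected(req, callback) {
    req.once('socket', (socket) => {
        // TLS sockets set `encrypted`; for those, wait for the full handshake.
        const event = socket.encrypted ? 'secureConnect' : 'connect';
        // Note: a reused keep-alive socket is already connected and will
        // not emit either event again.
        socket.once(event, () => callback(socket));
    });
}
```

Calling onConnected(req, () => clearTimeout(connectTimer)) right after creating the request would then work the same way for http and https.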

Putting it together we can do a (simplified for this post) connect timeout very easily:

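const req = http.request(options);
// Abort the request if no connection is made in time.
// (Newer node releases deprecate req.abort() in favor of req.destroy().)
const connectTimer = setTimeout(() => {
    req.abort();
}, connectTimeout);
req.on('socket', (socket) => {
    socket.on('connect', () => {
        clearTimeout(connectTimer);
    });
});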

This sets a timer that aborts the request unless the socket connects first, in which case the timer is cleared.
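One thing the simplified version glosses over is keep-alive: when an agent hands the request a socket that is already connected, connect never fires, so the timer would end up killing a healthy request. A slightly more defensive sketch, assuming a hypothetical helper named armConnectTimeout and using req.destroy() rather than the deprecated req.abort(), might look like this:

```javascript
// Hypothetical helper: destroy `req` if no connection is made within `ms`.
function armConnectTimeout(req, ms) {
    const timer = setTimeout(() => {
        req.destroy(new Error('connect timeout'));
    }, ms);
    const clear = () => clearTimeout(timer);

    req.once('socket', (socket) => {
        if (socket.connecting) {
            socket.once('connect', clear);
        } else {
            // Reused keep-alive socket: already connected, so 'connect'
            // will not fire again.
            clear();
        }
    });
    // Do not leave the timer running once the request has failed or finished.
    req.once('error', clear);
    req.once('close', clear);
}
```

The check on socket.connecting is the important part; the extra error and close listeners just make sure the timer never outlives the request.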

I usually want observability into socket, connect, and response times to tell me more about the performance and health of my system. I also want to be able to retry connections, and mix in features like circuit breaker state.

Let’s use the socket, connect and response events to gather some additional metrics we may want to log.

For example, a long period between requesting a socket and receiving a socket event may indicate resource issues, GC pressure, and more. Collecting these metrics can tell you a lot about how your application is performing behind the scenes.

const req = http.request(options);
const connectTimer = setTimeout(() => {
    logger.error('timeout', 'connect timeout');
    req.abort();
}, connectTimeout);
let ts;
req.on('socket', (socket) => {
    // Time spent waiting for a socket to be assigned to the request.
    const now = Date.now();
    logger.info('socket', { time: now - ts });
    ts = now;
    socket.on('connect', () => {
        // Time spent establishing the connection on that socket.
        const now = Date.now();
        clearTimeout(connectTimer);
        logger.info('connect', { time: now - ts });
    });
});
// Start the clock; the socket event fires asynchronously after this.
ts = Date.now();
req.end();

The output of that might look something like:

INFO: [socket] {"time":3}
INFO: [connect] {"time":10}
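The snippet above logs the socket and connect phases; the response phase can be captured the same way. Here is a minimal sketch that gathers all three into one object instead of logging them. The helper name timedRequest and the shape of the timings object are assumptions for illustration.

```javascript
const http = require('http');

// Hypothetical helper: resolves with the response and phase timings in ms.
function timedRequest(options) {
    return new Promise((resolve, reject) => {
        const timings = {};
        let mark = Date.now();
        // Record the time since the previous phase, then reset the mark.
        const lap = (name) => {
            const now = Date.now();
            timings[name] = now - mark;
            mark = now;
        };

        const req = http.request(options);
        req.once('socket', (socket) => {
            lap('socket');
            if (socket.connecting) {
                socket.once('connect', () => lap('connect'));
            } else {
                timings.connect = 0; // reused socket, nothing to wait for
            }
        });
        req.once('response', (response) => {
            lap('response'); // time until the response headers arrived
            resolve({ response, timings });
        });
        req.on('error', reject);
        req.end();
    });
}
```

The resolved timings object can be handed to whatever logger or metrics client you already use.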

Aggregating this data in a tool like Splunk will allow you to look at connection latency across your system.

For example:

Graph of application p99 connect times

In the graph above, an application has continuously long p99 connection times. In addition, this same application experiences a huge spike in the time it takes to get a socket to connect with. This is a good indication that the connection issues are culminating in resource starvation. Not timing out on these long connect times may be exacerbating the situation.

Graph of p99 socket times

If this application had set a shorter connection timeout (some value a little above its average p99), it would have quickly led to an error spike and called attention to the problem instead of letting users stare at a spinner wondering when the page will load.
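To make "a little above p99" concrete, here is one way such a value could be derived from collected connect times. This is only a sketch: it uses a simple nearest-rank percentile, and the 25% headroom is an arbitrary assumption rather than a recommendation.

```javascript
// Nearest-rank percentile of an array of numbers; `percent` is 0-100.
function percentile(samples, percent) {
    if (samples.length === 0) {
        throw new RangeError('no samples');
    }
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((percent * sorted.length) / 100));
    return sorted[rank - 1];
}

// Hypothetical policy: p99 of observed connect times plus some headroom.
function suggestConnectTimeout(connectTimesMs, headroom = 1.25) {
    return Math.ceil(percentile(connectTimesMs, 99) * headroom);
}
```

In practice the samples would come from the connect timings already being logged, and the result would be reviewed by a human before it becomes a production setting.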

Additional logging around the connection provides very useful insight into connection issues. How can we use these events to help us design a more resilient client?

Take once again the earlier example of a more aggressive connection timeout. In a fail-fast scenario, we may want to retry, since the first failure may be a one-off issue that isn’t worth giving up over.

The first thing we’d do is provide a better error for connection timeouts:

const connectTimer = setTimeout(() => {
    const error = new Error('connect timeout');
    error.code = 'ETIMEDOUT';
    // Tear down the request; the error event will carry our error.
    req.destroy(error);
}, connectTimeout);

Previously we called req.abort(), but this will generate its own error and we won’t be able to differentiate between a connection timeout and another error. To make working with the request easier, let’s wrap it in a Promise as well:

const request = function (options) {
    return new Promise((resolve, reject) => {
        const connectTimeout = options.connectTimeout;
        const req = http.request(options);

        const connectTimer = setTimeout(() => {
            const error = new Error('connect timeout');
            error.code = 'ETIMEDOUT';
            // Tear down the request; the error event will carry our error.
            req.destroy(error);
        }, connectTimeout);

        req.once('socket', (socket) => {
            socket.once('connect', () => {
                clearTimeout(connectTimer);
            });
        });

        req.once('error', (error) => {
            clearTimeout(connectTimer);
            reject(error);
        });

        req.once('response', (response) => {
            resolve(response);
        });

        req.end();
    });
};

A small disclaimer: in the interest of brevity, I am not cleaning up listeners properly, not to mention handling many other things.

With that, making it retryable is trivial:

const retryableRequest = function (options) {
    const maxRetries = options.maxRetries;
    options.retryCount = 0;

    const retry = async function () {
        try {
            return await request(options);
        }
        catch (error) {
            // Only connect timeouts are retried; everything else is rethrown.
            if (error.code === 'ETIMEDOUT') {
                if (options.retryCount < maxRetries) {
                    options.retryCount++;
                    return await retry();
                }
            }
            throw error;
        }
    };

    return retry();
};

We can now invoke our request once, while internally it will retry on connection failures:

retryableRequest({ method, host, connectTimeout, maxRetries })
    .then((response) => {
        console.log(response.statusCode);
    });
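One side effect of retryableRequest is that it writes retryCount onto the caller's options object. If that is undesirable, the retry loop can be separated from the request entirely. The sketch below is one possible shape for that; withRetries and its option names are hypothetical.

```javascript
// Hypothetical helper: run `fn` again while it fails with a retryable error.
async function withRetries(fn, { maxRetries = 2, retryable = ['ETIMEDOUT'] } = {}) {
    for (let attempt = 0; ; attempt++) {
        try {
            return await fn(attempt);
        } catch (error) {
            const canRetry = retryable.includes(error.code) && attempt < maxRetries;
            if (!canRetry) {
                // Out of retries, or not an error we know how to retry.
                throw error;
            }
        }
    }
}
```

With the request function from earlier in this post, usage would be along the lines of withRetries(() => request(options), { maxRetries: 3 }). Keeping the counter local also makes it easier to add other policies later, such as a delay between attempts.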

There is a lot of power in using out-of-the-box node core libraries. That isn’t to say that using open source libraries is the wrong choice, but that when building software we should be careful about the tradeoffs we’re making in the name of simplicity. Often, we do not need to make these tradeoffs at all, and digging into some of these libraries will reveal that.

My advice to new developers is to not use frameworks or user-land libraries until they’ve developed a better understanding of node core APIs. Sometimes these APIs may feel complex, but they are powerful and well worth understanding.