Paper Review: µTune

µTune (OSDI 2018) presents a novel, adaptive threading model that minimizes RPC latency regardless of current load in high-performance microservices. Traditional monolithic web applications have latency service-level objectives (SLOs) in the range of tens to hundreds of milliseconds. Because responding to a user's request in a microservice-based application may depend on many sub-requests fanning out to dozens (or hundreds) of microservices, the latency budget for each internal request must be much shorter than that of the overall request, often on the order of single-digit milliseconds or less. At these much tighter latency SLOs, networking and OS-level overheads (such as context switching and network delays) become a key factor in application performance.

µTune addresses this problem by dynamically adapting the threading model based on the current load (measured in queries per second, or QPS). µTune works across three axes: synchronous vs. asynchronous services, polling vs. blocking for incoming requests, and in-line vs. dispatch-based handler execution. Synchronous services store the current state of the request handler on the OS stack and present a simple programming model, while asynchronous services must store all request state somewhere off-stack. Polling network listeners loop repeatedly, asking the OS whether data is available on a given set of sockets, while blocking listeners are de-scheduled until data arrives on one of the open sockets (such as through the epoll system call, which is traditionally associated with non-blocking I/O). Finally, in-line handlers are invoked immediately on the thread that received data off the wire, while dispatch-based handlers send an event with a reference to the request data to a thread pool for execution. At low QPS, services that poll with in-line handlers see almost a factor-of-two reduction in p99 latency over a blocking, dispatch-based approach, but this is reversed under high load (asynchronous is always better than synchronous).
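To make the in-line vs. dispatch distinction concrete, here is a minimal Go sketch of the two handler-execution strategies. This is a hypothetical illustration, not µTune's actual C++/gRPC implementation; the `request`, `handle`, and serve functions are invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

type request struct{ id int }

// handle is a stand-in for the application-level request handler.
func handle(r request) string { return fmt.Sprintf("handled %d", r.id) }

// inlineServe runs the handler directly on the goroutine that received
// the request: no hand-off, so no queuing or wakeup latency.
func inlineServe(in <-chan request, out chan<- string) {
	for r := range in {
		out <- handle(r)
	}
}

// dispatchServe hands each request to a worker pool instead. The extra
// hop adds latency at low load, but under high load it keeps a slow
// handler from blocking the thread that services the network.
func dispatchServe(in <-chan request, out chan<- string, workers int) {
	var wg sync.WaitGroup
	work := make(chan request)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range work {
				out <- handle(r)
			}
		}()
	}
	for r := range in {
		work <- r
	}
	close(work)
	wg.Wait()
}

func main() {
	in := make(chan request, 4)
	out := make(chan string, 4)
	for i := 0; i < 4; i++ {
		in <- request{id: i}
	}
	close(in)
	dispatchServe(in, out, 2)
	close(out)
	for s := range out {
		fmt.Println(s)
	}
}
```

The tradeoff the paper measures falls out of this structure: `inlineServe` has the shortest possible path from socket to handler, while `dispatchServe` pays a hand-off but scales out under load.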

Previous efforts have focused on evaluating individual tradeoffs, such as synchronous vs. asynchronous or thread pools vs. thread-per-request. Knot (2003), for example, showed that an approach similar to how goroutines work in Go (user-space threading with dynamic stack growth) could match or exceed explicitly asynchronous performance. In general, I feel that the underlying concurrency primitives in existing systems usually aren't abstracted enough to provide the same sort of adaptive behavior as µTune (although maybe async/await in .NET comes close).

I really like this work, but I think there are some limitations when it comes to applying the results to real-world systems. First, while I appreciate the evaluation of the full design space, I think that in practice polling is a non-starter for all but the most performance-critical systems. In particular, there is almost always more interesting work a machine could be doing than repeatedly asking the OS/hypervisor "are there any new packets?", especially in the bin-packing Kubernetes future we live in. So, if you don't want your vanilla microservices polling, the design space reduces to "sync vs. async; in-line vs. dispatch". Synchronous and asynchronous services are not substitutable, so in practice an application is written as one or the other. This leaves a single axis a runtime could adapt over: invoking handlers immediately vs. dispatching them to a thread pool.
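That one remaining axis could plausibly be made adaptive inside an RPC runtime. The following Go sketch is my own simplified illustration of the idea (µTune itself also switches polling/blocking and scales thread-pool sizes); the `adaptiveExecutor` type, its threshold, and the one-second QPS window are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// adaptiveExecutor switches between in-line and dispatched handler
// execution based on a recent QPS estimate: the single axis left once
// polling is ruled out and sync/async is fixed by the application.
type adaptiveExecutor struct {
	count     atomic.Int64 // requests seen in the current window
	qps       atomic.Int64 // QPS measured over the last window
	threshold int64        // above this, dispatch; below, run in-line
	work      chan func()
}

func newAdaptiveExecutor(threshold int64, workers int) *adaptiveExecutor {
	e := &adaptiveExecutor{threshold: threshold, work: make(chan func(), 128)}
	for i := 0; i < workers; i++ {
		go func() {
			for f := range e.work {
				f()
			}
		}()
	}
	// Once per second, snapshot the request count as the QPS estimate.
	go func() {
		for range time.Tick(time.Second) {
			e.qps.Store(e.count.Swap(0))
		}
	}()
	return e
}

// execute runs the handler in-line at low load (no hand-off latency)
// and dispatches it to the pool at high load, where a blocked network
// thread would otherwise hurt tail latency.
func (e *adaptiveExecutor) execute(handler func()) {
	e.count.Add(1)
	if e.qps.Load() < e.threshold {
		handler() // low load: run on the receiving thread
	} else {
		e.work <- handler // high load: hand off to the pool
	}
}

func main() {
	e := newAdaptiveExecutor(1000, 4)
	done := make(chan struct{})
	e.execute(func() { close(done) })
	<-done
	fmt.Println("handler ran")
}
```

Even this toy version shows why the idea is attractive: the switch is invisible to the handler, so an application written against a plain "run this callback" interface gets load-appropriate behavior for free.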

Finally, the work seems to implicitly assume that the work a service does is homogeneous: that requests have the same shape and require similar amounts of work. Not all systems have this property, and it is especially untrue of load balancers, like the Envoy proxy mentioned in related work. A system that could dynamically adapt not just to QPS but also to varying response times for different requests seems like a natural extension.