A request to osrm-routed can be assigned to a thread which
is currently busy processing another request, even when there
are other threads/cores available. This unnecessarily delays
the response, and can make requests appear to hang when
awaiting CPU intensive requests to finish.
The issue looks like a bug in Boost.Asio multithreaded
networking stack.
osrm-routed server implementation is
heavily influenced by the HTTP server 3 example in the
Boost.Asio docs. By upgrading to Boost 1.70 and updating the
server connections to match the example provided in the 1.70
release, the problem is resolved.
The diff of the changes to the Boost.Asio stack are
vast, so it's difficult to identify the exact cause. However
the implementation change is to push the strand of execution
into the socket (and timer) objects, which suggests it could
fix the type of threading issue we are observing.
osrm-routed does not immediately clean up a keep-alive connection
when the client closes it. Instead it waits for five seconds
of inactivity before removing.
Given a setup with low file limits and clients opening and
closing a lot of keep-alive connections, it's possible for
osrm-routed to run out of file descriptors whilst it waits for
the clean-up to trigger.
Furthermore, this causes the connection acceptor loop to exit.
Even after the old connections are cleaned up, new ones
will not be created. Any new requests will block until the
server is restarted.
This commit improves the situation by:
- Immediately closing connections on error. This includes EOF errors
indicating that the client has closed the connection. This releases
resources early (including the open file) and doesn't wait for the
timer.
- Log when the acceptor loop exits. Whilst this means the behaviour
can still occur for reasons other than too many open files,
we will at least have visibility of the cause and can investigate further.