This article describes how to build resilience and fault tolerance into an application against unreliable (or potentially blocking) external DNS lookups. The approach moves the impact of slow external DNS lookups from the worker threads in the request path to an asynchronous timer thread, which improves application performance.
Say you have an application that needs to make multiple outbound calls to serve each incoming request. If these external calls depend on one or two enabler services, degradation of those services affects the whole system. DNS is one such enabler service: calls to external services can't be made without resolving their hostnames. If the latency of the system is on the order of a few hundred milliseconds, your DNS has to be really fast.
Typical DNS lookup:
    $ dig inmobi.com
    ; <<>> DiG 9.8.1-P1 <<>> inmobi.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1823
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;inmobi.com. IN A
    ;; ANSWER SECTION:
    inmobi.com. 112 IN A 126.96.36.199
    ;; Query time: 1 msec
    ;; SERVER: XX.XX.XX.XX#53(XX.xx.xx.xx)
    ;; WHEN: Mon Jun 1 07:04:01 2015
    ;; MSG SIZE rcvd: 44
Normally, a DNS resolution takes 1-3 ms. When the DNS service degrades, lookup time can shoot up to 500-800 ms or more. For every new name it looks up, the DNS server caches the result for the record's TTL, given in seconds (112 in the answer section above).
There is a second level of caching within the JVM, which maintains a hash map of hostname to IP address. Entries in this map expire after 30 seconds (the default for Java 7 and Java 8 when no security manager is installed) and then have to be looked up again. This is generally reasonable behaviour, but depending on the use case it can become suboptimal, for instance when an application relies heavily on low-latency lookups of multiple DNS domains. The figure below illustrates the issue.
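The 30-second lifetime is governed by the `networkaddress.cache.ttl` security property. The following is a minimal sketch of reading and setting it; the class and method names are hypothetical, and it assumes the property is set early, before the JVM performs its first lookup, since the value is typically read only once.

```java
import java.security.Security;

// Hypothetical helper; only the property name comes from the JDK.
final class DnsCacheTtl {
    static final String TTL_PROPERTY = "networkaddress.cache.ttl";

    // Returns the explicitly configured TTL, or null when the JVM uses its built-in default.
    static String configuredTtl() {
        return Security.getProperty(TTL_PROPERTY);
    }

    // Sets the TTL (in seconds) for successful lookups.
    static void setTtlSeconds(int seconds) {
        Security.setProperty(TTL_PROPERTY, Integer.toString(seconds));
    }

    public static void main(String[] args) {
        System.out.println("configured TTL: " + configuredTtl());
        setTtlSeconds(30);
        System.out.println("TTL now: " + configuredTtl());
    }
}
```

Tuning this value only changes how often the stall can occur; it does not remove the blocking lookup from the request path.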
Every 30 seconds, the JVM clears the cache and all the values have to be repopulated. At the 30th second, the JVM contacts the DNS server to refresh them. If each lookup takes 500 ms and there are 20 different hostnames to resolve for outbound calls, it takes the next 10 seconds (20 × 500 ms) to fetch the new values, because the lookups happen one at a time.
From the 31st to the 40th second, only one thread (the one holding the lock on the hash map) is active at any point, and the rest are put to sleep. On a 24-core box with 2*n, i.e. 48, worker threads, one remains active and the remaining 47 wait on the monitor.
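The effect can be reproduced with a toy model. The sketch below is not the JDK's code: a single monitor stands in for the lock on the address cache, `Thread.sleep` stands in for a slow DNS lookup, and the class name and parameters are made up for illustration. Because every worker has to pass through the same lock, the elapsed time grows roughly as workers × lookup time regardless of how many cores are available.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: one monitor stands in for the lock on the JVM's address cache.
final class SerializedLookupDemo {
    private final Object cacheLock = new Object();
    private final long lookupMillis;

    SerializedLookupDemo(long lookupMillis) {
        this.lookupMillis = lookupMillis;
    }

    // Simulates a slow DNS lookup performed while holding the shared lock.
    private void slowLookup() throws InterruptedException {
        synchronized (cacheLock) {
            Thread.sleep(lookupMillis);
        }
    }

    // Starts `workers` threads that each need one lookup; returns elapsed wall-clock millis.
    long run(int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch done = new CountDownLatch(workers);
        long start = System.nanoTime();
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    slowLookup();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // all workers queue up behind the single lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        pool.shutdown();
        return elapsed;
    }

    public static void main(String[] args) {
        long elapsed = new SerializedLookupDemo(50).run(8);
        System.out.println("8 workers, 50 ms lookups: " + elapsed + " ms elapsed");
    }
}
```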
If the SLA of your service is less than half a second and requests take longer than that to respond, they time out at the calling service.
Let's say Q is the incoming QPS. Then:
- For 10 seconds, every request results in a timeout.
- At the 41st second, you have Q requests (current) + 10Q requests (backlog).
- It takes another t (around 2-3) seconds for the service to recover.
- Everything starts working properly from the 43rd second.
Thus, for 13 whole seconds out of every 40, incoming requests to our service time out.
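The arithmetic behind these numbers can be written out as a small calculation. This is only a back-of-the-envelope sketch: the helper names are hypothetical, the recovery time is taken as the upper estimate of 3 seconds, and the 40-second cycle is the figure used above.

```java
// Hypothetical helper that reproduces the article's back-of-the-envelope numbers.
final class StallEstimate {
    // Seconds for which the refresh blocks workers when lookups run one at a time.
    static double stallSeconds(int hostnames, double lookupSeconds) {
        return hostnames * lookupSeconds;
    }

    // Fraction of each cycle during which requests time out (stall plus recovery).
    static double degradedFraction(double stallSeconds, double recoverySeconds, double cycleSeconds) {
        return (stallSeconds + recoverySeconds) / cycleSeconds;
    }

    public static void main(String[] args) {
        double stall = stallSeconds(20, 0.5);                 // 20 lookups at 500 ms each
        double fraction = degradedFraction(stall, 3.0, 40.0); // 13 s out of a 40 s cycle
        System.out.println("stall: " + stall + " s, degraded fraction: " + fraction);
    }
}
```

With these inputs, roughly a third of all incoming requests are affected.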
To avoid this, move the DNS lookups out of the request path:
- Create a new hash map of hostname to IP address in the application, rather than relying on the lookup table inside the JDK (InetAddress).
- Use a separate timer thread that iterates over all the keys of the hash map and does a DNS lookup for each. This thread refreshes the map once every 30 seconds.
- If the DNS server fails to respond in time, the map update is delayed. In that case, it is the timer thread that takes the hit rather than a worker thread.
- The values are overwritten, not cleared, at the 30th second. So in the worst case, the worker threads get an old value but are never blocked.
- When creating the HTTP request object, set the host in the URL to the IP address instead of the hostname.
Since no DNS lookup happens on the worker thread, requests are never stalled in the queue.
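The approach above could be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation: `AsyncDnsCache`, its `Resolver` interface, and all method names are hypothetical, the default resolver simply takes the first address returned by `InetAddress.getByName`, and hostnames are assumed to be registered at startup rather than on the request path.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical application-level DNS cache; names and structure are illustrative.
final class AsyncDnsCache {
    // Pluggable resolver so the lookup strategy can be swapped or stubbed.
    interface Resolver {
        String resolve(String hostname) throws UnknownHostException;
    }

    private final Set<String> hostnames = ConcurrentHashMap.newKeySet();
    private final Map<String, String> hostToIp = new ConcurrentHashMap<>();
    private final Resolver resolver;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "dns-refresh");
        t.setDaemon(true);
        return t;
    });

    AsyncDnsCache(Resolver resolver) {
        this.resolver = resolver;
    }

    // Assumption: the first address returned by the JDK is good enough.
    static AsyncDnsCache usingJdkResolver() {
        return new AsyncDnsCache(host -> InetAddress.getByName(host).getHostAddress());
    }

    // Registers a hostname and resolves it once up front (call at startup, not per request).
    void register(String hostname) {
        hostnames.add(hostname);
        refreshOne(hostname);
    }

    // Re-resolves every registered hostname; intended to run on the timer thread.
    void refreshAll() {
        for (String hostname : hostnames) {
            refreshOne(hostname);
        }
    }

    private void refreshOne(String hostname) {
        try {
            hostToIp.put(hostname, resolver.resolve(hostname)); // overwrite, never clear
        } catch (UnknownHostException | RuntimeException e) {
            // Keep the previous value; workers carry on with a possibly stale address.
        }
    }

    // Schedules the periodic refresh; a slow DNS server delays only this thread.
    void start(long periodSeconds) {
        timer.scheduleWithFixedDelay(this::refreshAll, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }

    // Non-blocking read for worker threads; returns null if the host was never resolved.
    String lookup(String hostname) {
        return hostToIp.get(hostname);
    }

    // Returns a copy of the URI whose host is the cached IP, or the original if none is cached.
    URI withCachedIp(URI original) {
        String ip = lookup(original.getHost());
        if (ip == null) {
            return original;
        }
        try {
            return new URI(original.getScheme(), original.getUserInfo(), ip, original.getPort(),
                    original.getPath(), original.getQuery(), original.getFragment());
        } catch (URISyntaxException e) {
            return original;
        }
    }
}
```

A service would typically call `register` for each downstream hostname at startup, then `start(30)`, and pass request URLs through `withCachedIp` before building the HTTP request. When the URL carries an IP address, servers that rely on the `Host` header may need it set explicitly to the original hostname.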
- JVisualVM – Great tool for peeking inside the JVM on a running system. We find it extremely helpful when trying to discover thread anomalies in production.
- Java InetAddress class – Read the Java source code to get the entire picture; no documentation can match it. See especially the method checkLookupTable, where it puts the thread to sleep.