Today, while I was having lunch, I started receiving email notifications about Niume’s servers not responding. At first it was just one server, so I removed it from the load balancer, restarted the httpd process and put it back in.
But then the other servers started behaving the same way, even though the load was well under capacity. Among those email notifications, there was an email from Rackspace support.
The title was “Excessive Issue – Please Read”.
And the content was the following:
“The server(s) at IP address – *.*.*.* have been Rate Limit on our caching DNS servers due to a high volume of inbound recursive DNS queries. We cannot lift the Rate Limit until issue has been addressed.
Your current effective rate limit is 3 queries/second.
In this situation, we believe the queries are the result of a misconfiguration on your server or within an application.
You may need to work with your Fanatical Support team to identify the origin of the queries and get the Rate Limit removed.
Alternately, you may have an application that simply makes a large number of DNS queries.
In this case, you will need to investigate running your own caching DNS server or software.
For more information on Rackspace’s DNS rate-limit policies, please visit the following URL:
We are available 24/7 if you have any questions or concerns over the actions taken to protect our infrastructure.
– Rackspace Hosting”
Before even thinking about the problem, I fired up the live chat and started lashing out at the operator about how surreal this was. Then, after I calmed down a bit, I realised that the issue was probably caused by MI6, the data gathering application that logs every single request made to Niume in MongoDB, then analyses it asynchronously and adds it to the not-so-big-data database.
Without thinking much, I had been asking MI6 to resolve the hostname of every IP on the fly, which means one reverse DNS query per logged request. This is a purely complementary and unnecessary thing, so I quickly disabled the functionality and synchronised the code on all of the servers. Then, to make sure that the queries were not coming from somewhere else as well, I pointed resolv.conf at Google’s public DNS servers instead of Rackspace’s. After that, everything gradually recovered.
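If the hostname had actually been worth keeping, one way to stay under a limit like 3 queries/second would be to cache reverse lookups so that each IP is resolved at most once per time window. The following is only a minimal sketch in Python, not MI6's actual code (this post does not say what MI6 is written in). The names `CachedReverseDns` and `system_reverse_lookup` and the one-hour TTL are assumptions made up for illustration.

```python
import socket
import time
from typing import Callable, Dict, Optional, Tuple


def system_reverse_lookup(ip: str) -> Optional[str]:
    """Ask the system resolver for the hostname of `ip`; None if it has none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are both OSError subclasses
        return None


class CachedReverseDns:
    """Hypothetical helper: remembers reverse DNS answers for `ttl_seconds`."""

    def __init__(
        self,
        lookup: Callable[[str], Optional[str]] = system_reverse_lookup,
        ttl_seconds: float = 3600.0,  # assumed value, tune to taste
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        # ip -> (expiry time, hostname or None); grows without bound in this sketch
        self._cache: Dict[str, Tuple[float, Optional[str]]] = {}

    def hostname(self, ip: str) -> Optional[str]:
        now = self._clock()
        entry = self._cache.get(ip)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no DNS query is sent
        name = self._lookup(ip)
        # Failed lookups (None) are cached too, so a bad IP is not retried each request.
        self._cache[ip] = (now + self._ttl, name)
        return name
```

Calling `CachedReverseDns().hostname(ip)` from the asynchronous analysis step, rather than while handling the request, would keep repeat visitors from generating repeat queries. A local caching resolver on each server would achieve much the same without touching application code.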
At that point, I apologised to the operator for my manners and explained to him that it was not acceptable to rate limit first and notify afterwards, especially for a customer that pays a significant amount of money for infrastructure every month.
After he gave me a bland response along the lines of “it is an automatic process blah blah blah”, I wished him a good day and disconnected. It was not his fault after all.
I also searched the entire codebase to check whether reverse DNS lookups were performed in other parts of the application(s).
I also disabled reverse DNS lookups for SSH by setting the following in sshd_config:
UseDNS no
You would be surprised how many automated attacks per second hit the SSH service, even with fail2ban banning every IP for a day after 3 failed attempts.
Anyway, at the end of the day I just closed the ticket, rated it with 0 and left the following message:
“Well, thank you for notifying me in time before disabling almost ALL of my frontend servers and causing me downtime. This is totally unacceptable. I have no problem of being rate limited but at least you could allow some time (even half an hour would do) for me to detect WHY I was over the limit.
These things make me consider whether I will be moving to AWS in the end.”
Now, as it happened, we had zero downtime, because I was sitting in front of my computer and could remove the bad servers from the load balancer in time.
If I had been on a plane, or just commuting to work, or if the load had been greater, we would have had proper downtime. And it would have been for that very stupid reason. It is just not acceptable.
I am very happy with Rackspace for certain things, but today’s incident, along with a recent failed live migration of a MongoDB server in London due to a new data centre (which left me spending 4 hours setting up and configuring a new server from scratch, since I couldn’t use an image), really wound me up. Other things have happened at times as well, like performance degradation, but those fall in the “things that occasionally happen to everyone” category.