So I'm trying to implement a load-balanced split/split DNS infrastructure to replace the current infrastructure.
I've got the environment more or less in place at this point (not in production) and I'm trying to slowly roll out my caching-only resolvers to select user groups, but I'm running into some nagging problems.
First, I was having issues reliably resolving many sites that use Akamai for hosting/DNS. For example, if you do an nslookup for www.bing.com you will receive CNAMEs like search.ms.com.edgesuite.net. These name servers use EDNS, and our firewall was not playing nice with the packets. I made the registry change suggested in the article below, which disables EDNS, and that resolved those problems:
http://support.microsoft.com/kb/832223
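For anyone who wants to check whether a firewall treats EDNS queries differently, here is a minimal Python sketch that builds the same A query with and without an EDNS0 OPT record and sends each to a resolver. The helper names (build_query, send_query) and the server address are placeholders I made up for illustration, not part of any existing tool, and this only approximates what the Windows DNS server sends.

```python
import socket
import struct

def build_query(name, qtype=1, edns_payload=None, txid=0x1234):
    """Build a DNS query; add an EDNS0 OPT record if edns_payload is set."""
    arcount = 1 if edns_payload is not None else 0
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, arcount)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    packet = header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    if edns_payload is not None:
        # OPT pseudo-record: root name, TYPE=41, CLASS=advertised UDP size,
        # TTL=0 (extended RCODE/flags), RDLENGTH=0
        packet += b"\x00" + struct.pack("!HHIH", 41, edns_payload, 0, 0)
    return packet

def send_query(server, packet, timeout=3.0):
    """Send over UDP; return (response_length, rcode, truncated) or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            response, _ = sock.recvfrom(65535)
        except socket.timeout:
            return None
    if len(response) < 12:
        return None
    flags = struct.unpack("!H", response[2:4])[0]
    return len(response), flags & 0x000F, bool(flags & 0x0200)

# Example usage (192.0.2.53 is a documentation address; substitute your resolver):
# print(send_query("192.0.2.53", build_query("www.bing.com")))
# print(send_query("192.0.2.53", build_query("www.bing.com", edns_payload=4096)))
```

If the plain query gets an answer and the EDNS one times out, or large answers never arrive, that would point at the firewall rather than the DNS servers.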
Now we're having another problem. Once in a while, a group of sites (all seemingly owned by AOL, engadget.com in particular) becomes unresolvable. I can run nslookup -d2 and it definitively shows a SERVFAIL. If I look at the cache on an affected server, I see NS records but typically no other cached records. If I clear the cache, the sites are immediately resolvable. Also, the problem typically resolves itself after an undetermined amount of time (15-ish minutes). If it had happened one time I'd have forgotten about it, but the problem recurs on a roughly weekly basis.
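Since the failure is intermittent, a small watcher that logs the response code over time might help pin down when it starts and how long it lasts. This is only a rough Python sketch: the function names, the resolver address, and the polling interval are all assumptions of mine, and it only distinguishes the standard RCODE values from a timeout.

```python
import socket
import struct
import time

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def encode_query(name, txid=0x4242):
    """Encode a plain (non-EDNS) A query with recursion desired."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def rcode_name(response):
    """Return the RCODE of a DNS response as text (low 4 bits of the flags)."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, "RCODE%d" % code)

def probe(server, name, timeout=3.0):
    """Query one name against one server; return the RCODE text or TIMEOUT."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(encode_query(name), (server, 53))
            response, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_name(response)

def watch(server, names, interval=60, rounds=10):
    """Print one timestamped line per name per round."""
    for _ in range(rounds):
        for name in names:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, server, name, probe(server, name))
        time.sleep(interval)

# Example usage (192.0.2.53 is a placeholder for the caching server):
# watch("192.0.2.53", ["engadget.com", "www.bing.com"], interval=60, rounds=1440)
```

Logging a known-good name alongside the failing ones should make it easier to tell a server-wide problem from one limited to a particular set of zones.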
I've also made the change described in the article below because our caching servers are using root hints, even though it doesn't exactly describe our problem:
http://support.microsoft.com/kb/968372/en-us
Additionally, if I add a forwarder to the server, it immediately starts resolving properly again. So I'm wondering whether the bug from that article is still affecting me and what options I have to alleviate it.