Channel: Network Infrastructure Servers forum

Broadcast ARP requests preventing SNMP Traps being sent reliably (Windows Server 2008 R2)

Scenario

The standard Windows SNMP Service is configured with 4 SNMP targets which are all on the far side of a network gateway - which happens to be a firewall.

The standard HP SNMP extension agents are installed, plus our own SNMP extension agent and, on at least one server, the Microsoft FTP and WWW SNMP extension agents.


Problem

SNMP trap messages on UDP port 162 are not always being emitted from the server. We use Wireshark on the server itself to see what is being emitted. When the issue occurs, typically only the last SNMP target in the list has a trap sent. (This tells me that the UDP packet queue is being dumped.)


Investigation

To exercise HP-sourced SNMP traps, we can repeatedly pull and re-insert the power to one of the resilient PSUs in the server. Each pull or re-insertion causes 2 events to be created, and as each event emits 4 SNMP trap messages (one per target), a total of 8 SNMP trap messages are emitted in a burst.

To exercise our own SNMP traps, we can click a button on our application and this will cause a single event for which 4 SNMP traps are emitted in a burst.

The fault is almost always reproducible, but not entirely predictable (as yet). And today, it has decided not to fail, even after rebooting - perhaps this is due to inconsistent testing.

If the PSU's power is pulled and re-inserted with a few seconds (varying, say, from 3 to 30 seconds) between the actions, and this cycle is repeated 10 times, we tend to see that in perhaps one or two instances only the last trap in the expected burst of 8 traps is emitted. We'll call this the "orphaned trap".

When it fails, Wireshark shows us that just before the "orphaned trap" is emitted, a Broadcast ARP is emitted.
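To make the "orphaned trap" easier to spot in a long capture, the timestamps of the UDP port 162 packets can be grouped into bursts and any burst shorter than expected flagged. This is only a rough sketch: the timestamps below are made up for illustration, and the 1-second gap used to separate bursts is an assumption, not something measured.

```python
from typing import List


def group_bursts(timestamps: List[float], max_gap: float = 1.0) -> List[List[float]]:
    """Group packet timestamps into bursts; a gap larger than max_gap starts a new burst."""
    bursts: List[List[float]] = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


def short_bursts(timestamps: List[float], expected: int, max_gap: float = 1.0) -> List[List[float]]:
    """Return the bursts that contain fewer packets than expected."""
    return [b for b in group_bursts(timestamps, max_gap) if len(b) < expected]


# Hypothetical capture times in seconds (not real data from the server).
capture = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07,  # full burst of 8
           12.50]                                            # lone "orphaned trap"

EXPECTED_PER_BURST = 2 * 4  # 2 events x 4 SNMP targets, as described above
print(short_bursts(capture, EXPECTED_PER_BURST))
```

The timestamps could be exported from Wireshark with a display filter on UDP port 162; how they get into the list is left open here.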

Reading up on this, it seems that when the ARP cache is flushed, a Broadcast ARP will be emitted. This action will also flush the queued UDP packets leaving only the last UDP packet to be sent after the Broadcast ARP has been satisfied.
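The behaviour described above can be modelled as a small per-neighbour queue that holds packets while the gateway's MAC address is unresolved. RFC 1122 only asks a host to keep at least one (the latest) packet per unresolved destination, so a depth of 1 would explain seeing only the last trap. The sketch below is a toy model, not Windows code: the class name and the depth of 1 are assumptions chosen to match the capture.

```python
from collections import deque


class PendingQueue:
    """Toy model of the queue used while a neighbour's link-layer address is unresolved."""

    def __init__(self, depth: int = 1):
        # depth=1 is an assumption that matches the observed behaviour,
        # not a documented Windows value.
        self.pending = deque(maxlen=depth)
        self.resolved = False
        self.sent = []

    def send(self, packet: str) -> None:
        if self.resolved:
            self.sent.append(packet)
        else:
            # A full deque(maxlen=...) silently discards its oldest entry.
            self.pending.append(packet)

    def arp_reply(self) -> None:
        """The gateway answered the Broadcast ARP: flush whatever survived."""
        self.resolved = True
        self.sent.extend(self.pending)
        self.pending.clear()


q = PendingQueue(depth=1)
for n in range(1, 9):          # burst of 8 traps while the ARP is outstanding
    q.send(f"trap-{n}")
q.arp_reply()
print(q.sent)                  # ['trap-8']
```

With this model even a 1.3 millisecond ARP round trip is enough to lose traps, provided the whole burst is handed to the stack before the reply arrives.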

In our case, all the SNMP targets are through a gateway, so the only ARP'ed address is the address of the gateway.

When we're not pulling and re-inserting the power from the PSU, it appears that only Unicast ARP requests are sent to the gateway - these can happen regularly about every 30 seconds.

By observation, the Unicast ARP requests don't upset the emission of SNMP traps.

So it seems simple: why should I expect anything different? ARP cache flushes are dumping the UDP queue and, tough, transmission of UDP packets is inherently not guaranteed.

But...

When HP traps are being 'unreliable', I can squirt out as many of our application traps as I like, and so far none of the 50 or more of our traps that have been sent have been dumped. The HP traps behave the same way when our extension agent is removed from the list of SNMP extension agents.

 

Current Theory

As there are typically no Broadcast ARPs, and as it appears to be a single family of SNMP traps that is affected (e.g. either HP's or ours), it seems vaguely possible that the SNMP Service itself is intentionally causing an ARP cache flush and inadvertently causing the dumping of the queued UDP packets. It appears that the SNMP Service perhaps latches onto one of the extension agent's traps and will tend to periodically cause a dump of its queued traps. (Not convinced of this.)

However, I guess there could be another process that flushes the ARP cache, and it's the first UDP packet sent by anyone afterwards that causes the Broadcast ARP to be emitted. But then why are the traps from one extension agent being dumped in preference to another extension agent's, unless it were the SNMP Service's arbitrary decision? Or perhaps it's down to the inconsistent testing that we're performing. (Not convinced of this.)

Perhaps it's just the ARP layer itself that is deciding that sometimes it will issue a Broadcast ARP instead of a Unicast ARP. For example: UDP trap sent at 18:34:01, then Broadcast ARP sent at 18:34:21 - only 20 seconds later.
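For what it's worth, the 20-second gap can be compared with the neighbour reachability timer from RFC 4861, which defaults to a base of 30,000 ms multiplied by a random factor between 0.5 and 1.5. The sketch below just does that arithmetic. It assumes Windows 2008 R2 applies these RFC defaults to IPv4 ARP entries, which I have not confirmed, and expiry of this timer would normally make an entry stale rather than trigger a broadcast on its own.

```python
from datetime import datetime

# RFC 4861 protocol constants (assumed to apply here, see note above).
BASE_REACHABLE_MS = 30_000
MIN_RANDOM_FACTOR = 0.5
MAX_RANDOM_FACTOR = 1.5


def reachable_window_s(base_ms: int = BASE_REACHABLE_MS):
    """Return the (min, max) reachable time in seconds for a given base value."""
    return (base_ms * MIN_RANDOM_FACTOR / 1000, base_ms * MAX_RANDOM_FACTOR / 1000)


def gap_seconds(first: str, second: str, fmt: str = "%H:%M:%S") -> float:
    """Seconds between two same-day timestamps as shown in the capture."""
    return (datetime.strptime(second, fmt) - datetime.strptime(first, fmt)).total_seconds()


low, high = reachable_window_s()
gap = gap_seconds("18:34:01", "18:34:21")
print(low, high, gap, low <= gap <= high)  # 15.0 45.0 20.0 True
```

So a Broadcast ARP 20 seconds after the last trap is at least within the 15 to 45 second window in which an idle entry could have aged out of the reachable state.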


Why it's a problem

HP SNMP traps were generally pretty reliable even though they were over UDP. Now, possibly as a result of the re-architecting of ARP in Windows 2008 (for RFC 4861), they can be significantly unreliable, so potentially critical hardware faults reported by SNMP are being missed.


Help!

I really don't think this is anything to do with our applications - I'm not exactly sure who the culprit is, but I do know that it's very unhelpful behaviour indeed. Perhaps it's exacerbated by the firewall and the general paucity of traffic.

Fundamentally I can't see why it should suddenly decide to use a Broadcast ARP when a Unicast ARP has been working okay.

Is this because the SNMP Service sometimes decides to use SendARP() immediately before sending traps, without allowing sufficient time for a response to the ARP? (The response from the gateway to a Broadcast ARP can take as little as 1.3 milliseconds.)


Workaround

Now we could define a permanent static ARP entry using NETSH, which stops the ARPs being emitted, and I believe this would stop the UDP packet loss, but this is distinctly a "horrible solution".
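For reference, the static entry would be created with something like the command built below. The interface name, gateway IP and MAC address are placeholders, not our real values, and the exact argument form is from memory, so it is worth checking against "netsh interface ipv4 add neighbors /?" before use. The sketch only builds and prints the command; it does not run it.

```python
import subprocess
from typing import List


def build_static_neighbor_cmd(interface: str, gateway_ip: str, gateway_mac: str,
                              persistent: bool = True) -> List[str]:
    """Build the netsh command for a static ARP (neighbour) entry for the gateway."""
    store = "persistent" if persistent else "active"
    return ["netsh", "interface", "ipv4", "add", "neighbors",
            interface, gateway_ip, gateway_mac, f"store={store}"]


def apply(cmd: List[str]) -> None:
    """Run the command on the server (needs an elevated prompt). Not called here."""
    subprocess.run(cmd, check=True)


# Placeholder values for illustration only.
cmd = build_static_neighbor_cmd("Local Area Connection", "192.0.2.1", "00-11-22-33-44-55")
print(cmd)
```

The drawback remains the one already mentioned: the entry has to be updated by hand if the firewall's MAC address ever changes, for example after a hardware swap or failover.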

Pretty please...

Can anyone explain / confirm? Can anyone suggest a 'nice' way around this?

Thanks all.