Archive for October, 2012

Hope that this helps someone else with the same problem.

Problem overview:

SCOM 2012 running on 2 management servers with a backend SQL 2008 R2 cluster. Environment is healthy and working fine overall.

We were seeing a lot of heartbeat flaps (server loses heartbeat and then regains it within 1-10 minutes). Some of the problem servers are in the same data center as the management servers and some are in overseas data centers.

On the agent systems, when the problem occurs, error 20070 appears as follows:

The OpsMgr Connector connected to servername.domain.com, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

This occurred on agents managed by either management server. At times the agents failed over to the other server successfully; at other times an event ID 21050 followed immediately, indicating that a connection to the other management server could not be made. There were no corresponding event 20000 entries on the SCOM management servers, nor were there any pending agents in the console.
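The pattern above (a 20070 that is sometimes followed right away by a 21050) can be checked against an exported event list. The following is only a rough sketch: the function name, the input shape (time-sorted pairs of timestamp and event ID) and the 60-second window used for "immediately following" are all assumptions for illustration, not anything SCOM provides.

```python
from datetime import datetime, timedelta

CONNECTION_CLOSED = 20070  # connection closed right after authentication
FAILOVER_FAILED = 21050    # connection to the other management server failed


def classify_flaps(events, window=timedelta(seconds=60)):
    """Pair each 20070 event with a 21050 that follows within `window`.

    `events` is a list of (timestamp, event_id) tuples sorted by time.
    Returns a list of (timestamp, failover_failed) tuples, one per 20070.
    The window length is an assumed stand-in for "immediately following".
    """
    results = []
    for i, (ts, event_id) in enumerate(events):
        if event_id != CONNECTION_CLOSED:
            continue
        failover_failed = False
        for later_ts, later_id in events[i + 1:]:
            if later_ts - ts > window:
                break  # anything later is outside the assumed window
            if later_id == FAILOVER_FAILED:
                failover_failed = True
                break
        results.append((ts, failover_failed))
    return results
```

A False result only means that no 21050 was logged inside the window. It does not prove that the agent failed over cleanly.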

The issues did not come up in batches or in any other discernible pattern. The management servers were reachable via PING and RPC during the ‘outage’.

I tried installing all updates on the management servers, restarting services, rebooting the servers, flushing the cache on the clients and reinstalling the agent on clients. None of those helped.

It turned out that the organization had a previous management group hosted on a management server with the same name as the current management server. Clients were reaching out to the management server with information for a management group that no longer existed, which generated some confusing errors.

The solution is to flush the cache on the management servers and then remove the phantom management group entries from agents that reported the issue.

Flushing the server cache can be done using this process:

  1. Open the Monitoring workspace
  2. Expand Operations Manager and then expand Management Server
  3. Select the Management Servers State view
  4. In the Management Servers State pane, click a management server
  5. In the Tasks pane, click Flush Health Service State and Cache

IMPORTANT: this task will never report success, because it also flushes the record that the server is running the task. The task will time out and fail, which is expected.

Then, cleaning up the agents was done with this process:

  1. Stop the System Center Management service
  2. Remove the registry keys for the old management group (OLDMP here) from HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups and from HKLM\System\CurrentControlSet\Services\HealthService\Parameters\Management Groups
  3. Rename the C:\Program Files\System Center Operations Manager\Agent\Health Service State folder
  4. Start the System Center Management service
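For more than a handful of agents, the steps above can be written out as a list of commands first and reviewed before anything is run. The sketch below only builds and prints that list; it changes nothing. It makes several assumptions: that the old group appears as a subkey named after it under each registry path, that the service's short name is HealthService (taken from the registry path above), that ".old" is an acceptable rename suffix, and that the agent is installed in the default folder. The function name is made up for this example.

```python
AGENT_KEY = (
    r"HKLM\Software\Microsoft\Microsoft Operations Manager"
    r"\3.0\Agent Management Groups"
)
SERVICE_KEY = (
    r"HKLM\System\CurrentControlSet\Services\HealthService"
    r"\Parameters\Management Groups"
)
STATE_DIR = (
    r"C:\Program Files\System Center Operations Manager"
    r"\Agent\Health Service State"
)


def build_cleanup_plan(old_group, service="HealthService", suffix=".old"):
    """Return the cleanup steps as Windows command strings (dry run only).

    `old_group` is the name of the phantom management group. The subkey
    layout, service name and rename suffix are assumptions to verify on a
    real agent before running any of these commands.
    """
    return [
        f"net stop {service}",
        f'reg delete "{AGENT_KEY}\\{old_group}" /f',
        f'reg delete "{SERVICE_KEY}\\{old_group}" /f',
        f'ren "{STATE_DIR}" "Health Service State{suffix}"',
        f"net start {service}",
    ]


if __name__ == "__main__":
    # Print the plan for review; nothing is executed.
    for step in build_cleanup_plan("OLDMP"):
        print(step)
```

Printing the plan first makes it easy to compare the key paths with what is actually present in the registry on one affected agent.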

After this process, heartbeat alerts stopped firing for servers that are up and have the System Center Management service running.