Is ComputerName really an option?

I was creating a custom event data source module today, using the Microsoft.Windows.BaseEventProvider, and ran into a problem.

After importing the pack into the management group, I saw these events on all targeted agents.

Log Name:      Operations Manager
Source:        HealthService
Date:          5/15/2014 4:38:34 PM
Event ID:      4511
Task Category: Health Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      sql01.scomskills.com
Description:
Initialization of a module of type "CrimsonDataSource" (CLSID "{B98BD20C-3CC8-4AFE-9F68-5702C74D73DB}") failed with error code The parameter is incorrect. causing the rule "Scomskills.CustomModules.Rule.FileSizeScriptException" running for instance "SQL01.scomskills.com" with id:"{6898E94D-1E65-E33B-F8DF-7BF9A124CF6F}" in management group "2012-SP1".

The cause turned out to be that Microsoft.Windows.BaseEventProvider requires ComputerName in its configuration, even though the element is marked as optional (minOccurs="0").

Digging in a little further, the event details above call out the CrimsonDataSource CLSID {B98BD20C-3CC8-4AFE-9F68-5702C74D73DB}.

Taking a look at the module type definition, we can see this is exactly the class ID referenced:

<DataSourceModuleType ID="Microsoft.Windows.BaseEventProvider" Accessibility="Public">
  <Configuration>
    <IncludeSchemaTypes>
      <SchemaType>Microsoft.Windows.ComputerNameSchema</SchemaType>
    </IncludeSchemaTypes>
    <xsd:element name="ComputerName" type="ComputerNameType" minOccurs="0" maxOccurs="1" />
    <xsd:element name="LogName" type="xsd:string" />
    <xsd:element name="AllowProxying" type="xsd:boolean" minOccurs="0" maxOccurs="1" />
  </Configuration>
  <ModuleImplementation>
    <Native>
      <ClassID>B98BD20C-3CC8-4AFE-9F68-5702C74D73DB</ClassID>
    </Native>
  </ModuleImplementation>
  <OutputType>Microsoft.Windows.EventData</OutputType>
</DataSourceModuleType>

 

Another interesting part is that Microsoft.Windows.ComputerNameSchema is referenced as an included schema. From what I can tell, when an element takes its type from an included schema type, that element can no longer be treated as optional; in this case, ComputerName.

<SchemaType ID="Microsoft.Windows.ComputerNameSchema" Accessibility="Public">
  <xsd:simpleType name="ComputerNameType">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="0" />
      <xsd:maxLength value="260" />
    </xsd:restriction>
  </xsd:simpleType>
</SchemaType>

 

If the included schema type had a minOccurs flag, ComputerName would in fact be optional in the data source. But this is not the case in this example.

I suppose the point of this entire post is:

Beware of "optional" configuration, because it may actually be required and result in a broken data source if it’s missing.

In these cases, the minimum requirement is to pass in an empty element, as demonstrated below.

<DataSource ID="DS1" TypeID="Windows!Microsoft.Windows.BaseEventProvider">
  <ComputerName />
  <LogName>Operations Manager</LogName>
</DataSource>
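
To check an existing pack for this problem, something like the following Python sketch could flag BaseEventProvider data sources that omit ComputerName. The SAMPLE fragment, the DS2 module and the function name are hypothetical, made up for illustration; it also assumes the pack XML uses no default namespace on these elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: DS1 follows the fix above, DS2 omits ComputerName.
SAMPLE = """
<MemberModules>
  <DataSource ID="DS1" TypeID="Windows!Microsoft.Windows.BaseEventProvider">
    <ComputerName />
    <LogName>Operations Manager</LogName>
  </DataSource>
  <DataSource ID="DS2" TypeID="Windows!Microsoft.Windows.BaseEventProvider">
    <LogName>Application</LogName>
  </DataSource>
</MemberModules>
"""

def missing_computer_name(xml_text):
    """Return IDs of BaseEventProvider data sources with no ComputerName child."""
    root = ET.fromstring(xml_text)
    missing = []
    for ds in root.iter("DataSource"):
        type_id = ds.get("TypeID", "")
        # Ignore the reference alias before "!" because it varies between packs.
        if type_id.split("!")[-1] != "Microsoft.Windows.BaseEventProvider":
            continue
        if ds.find("ComputerName") is None:
            missing.append(ds.get("ID"))
    return missing

print(missing_computer_name(SAMPLE))  # ['DS2']
```

An empty `<ComputerName />` element counts as present here, which matches the minimal configuration shown above.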

 

 

🙁

Best Practices–Logging Script Events

Scripts are a part of monitoring, and they can fail for any number of reasons. When a monitoring script fails, it is essential to capture the failure. In this post, I will briefly go over two types of events (exception and debug) that every management pack developer should consider for script-based modules.

Exception Event

These types of events (or exceptions) are usually generated when a resource is not accessible for some reason: the script cannot authenticate to the resource, cannot connect to it, the resource does not exist, etc. In this case, the script cannot continue as expected.

Capturing script event context and generating a meaningful alert when there is an exception will enable an operator to more quickly understand a problem without having to put on a developer hat. Not capturing exceptions may result in a situation where a monitor may not be working and nobody ever knows about it, at least until a catastrophic failure occurs and everyone is asking why the monitoring tool did not catch it.

Debug Event

Debug events can optionally be handled within a script, and can provide useful information to a monitoring administrator while tracing a script-based workflow. These events can be logged anywhere within the script that makes sense to you. Taken in sequence, they should read like a short story, so anyone can follow what’s happening in the script when viewing the events in the Operations Manager log.

Debug events typically will not generate an alert, and writing debug events should be an optional setting and disabled by default.

Writing Events

Events are written to the Operations Manager log with the LogScriptEvent method of the MOM.ScriptAPI object.

Example:

' Create the MOM scripting object and log an error event (severity 1 = error).
Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("YourScriptName.vbs", 101, 1, "Something bad happened!")

Here is an example of a PowerShell function that writes events and includes some debugging logic:

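# Assumes $momapi (the MOM.ScriptAPI object) and $debugFlag are set earlier in the script.
function WriteToEventLog ($id, $level, $param0, $message) {
    # Level 4 events are treated as debug output; write them only when debugging is enabled.
    if ($debugFlag -or $level -ne 4) {
        $momapi.LogScriptEvent($param0, $id, $level, $message)
    }
}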

In the above example, Param0 is a common placeholder for the script name, but it can be anything that makes sense to you or an operator.

Taking this a step further, consider also implementing a try-catch where you think there is potential for an exception in the script, like a problem connecting to a resource. This is an excellent way to provide additional context in the event log, and optionally (ideally) bubble up into an alert in the console.

This example uses the Write-EventLog cmdlet.

Example:

try {
    # ...do something...
} catch [System.Exception] {
    # Log the exception text; the event source must already be registered on this computer.
    $message = $_.Exception.Message
    Write-EventLog -LogName Application -Source YourSource -EventId 101 -EntryType Error -Message $message
}

Grey Agents With Reason (gray agents)

A few years ago I wrote some TSQL to return all grey agents with the reason code. This worked fine in SCOM 2007, but it doesn’t work in 2012 environments for some reason. I basically just modified the WHERE clause, removing a bunch of SELECT statements – I’m not sure why I added those additional SELECT statements, but there must have been a reason.

I am reposting the TSQL here, updated for SCOM 2012. I also fixed a small bug in the outage days column: my initial query did not use UTC time in the DATEDIFF calculation, which could produce a negative value for newly grey agents and was off by n hours depending on your local time zone.

/*
Gray agents with reason
Jonathan Almquist (jonathan@scomskills.com)
Updated 02-24-2014
*/
USE OperationsManagerDW
SELECT
    ME.Path,
    HSO.StartDateTime AS OutageStartDateTime,
    DATEDIFF (DD, HSO.StartDateTime, GETUTCDATE()) AS OutageDays,
    HSO.ReasonCode,
    DS.Name AS ReasonString
FROM  vManagedEntity AS ME INNER JOIN
    vHealthServiceOutage AS HSO ON HSO.ManagedEntityRowId = ME.ManagedEntityRowId INNER JOIN
    vStringResource AS SR ON HSO.ReasonCode = 
    REPLACE(LEFT(SR.StringResourceSystemName, LEN(SR.StringResourceSystemName)
    - CHARINDEX('.', REVERSE(SR.StringResourceSystemName))), 
    'System.Availability.StateData.Reasons.', '') INNER JOIN
    vDisplayString AS DS ON DS.ElementGuid = SR.StringResourceGuid
WHERE (HSO.EndDateTime IS NULL)
    AND (SR.StringResourceSystemName LIKE 'System.Availability.StateData.Reasons.[0-9]%')
    AND DS.LanguageCode = 'ENU'
ORDER BY OutageStartDateTime
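
To see why the UTC fix matters, here is a small Python sketch that mimics DATEDIFF(DD, ...) by counting date boundaries. The timestamps and the UTC-8 offset are made-up values for illustration, and it assumes the outage start time is stored in UTC, as the fix above implies.

```python
from datetime import datetime, timedelta

def datediff_days(start, end):
    """Mimic T-SQL DATEDIFF(DD, start, end): number of date boundaries crossed."""
    return (end.date() - start.date()).days

# Hypothetical values: outage began one hour ago, stored as UTC.
outage_start_utc = datetime(2014, 2, 24, 2, 0)
now_utc = datetime(2014, 2, 24, 3, 0)
now_local = now_utc - timedelta(hours=8)  # hypothetical server clock at UTC-8

print(datediff_days(outage_start_utc, now_local))  # -1 (local time: the old bug)
print(datediff_days(outage_start_utc, now_utc))    # 0  (UTC: the corrected result)
```

Comparing a UTC start time against a local "now" pushes the end date back across midnight, which is where the negative outage days came from.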
 
 
 
🙂

 

Agent Management–List Primary and Failover Configuration

Something I don’t like about using the SDK (PowerShell) to manage agents is how long the Get-* cmdlets take to return information; large-scale queries take too long! The SDK is typically pretty slow in this regard, and that’s a shame because I find myself writing TSQL to accomplish tasks that the SDK should be able to handle promptly.

Recently I wrote some TSQL that returns all agents with their associated primary and failover management servers. This is very informative when the question is "where does this agent fail over to?", and it’s a speedy way to feed some sort of automation process to expedite agent assignment.

Here you go!

SELECT rgv.SourceObjectPath AS [Agent], rgv.TargetObjectPath AS [ManagementServer], 
       CASE
              WHEN rtv.DisplayName = 'Health Service Communication' THEN 'Primary'
              ELSE 'Failover'
       END AS [Type]
FROM ManagedTypeView mt INNER JOIN
       ManagedEntityGenericView AS meg ON meg.MonitoringClassId = mt.Id INNER JOIN
       RelationshipGenericView rgv ON rgv.SourceObjectId = meg.Id INNER JOIN
       RelationshipTypeView rtv ON rtv.Id = rgv.RelationshipId
WHERE mt.Name = 'Microsoft.SystemCenter.Agent' AND
       rtv.Name like 'Microsoft.SystemCenter.HealthService%Communication' AND
       rgv.IsDeleted = 0
ORDER BY rgv.SourceObjectPath ASC, rtv.DisplayName ASC

 

Getting the same information through the SDK would take several minutes in small to medium-sized environments, and maybe upwards of 15-30 minutes in larger environments. This little bit of TSQL returns in 1-2 seconds. Eat that!
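
As a rough idea of how the query output could feed an automation step, this Python sketch groups the result rows per agent and lists agents that have no failover server. The host names and rows are hypothetical sample data, not real query output; only the three column meanings (Agent, ManagementServer, Type) come from the query.

```python
from collections import defaultdict

# Hypothetical result rows: (Agent, ManagementServer, Type)
rows = [
    ("agent01.example.com", "ms01.example.com", "Primary"),
    ("agent01.example.com", "ms02.example.com", "Failover"),
    ("agent01.example.com", "ms03.example.com", "Failover"),
    ("agent02.example.com", "ms02.example.com", "Primary"),
]

def summarize(rows):
    """Group rows into {agent: {"primary": server, "failover": [servers]}}."""
    agents = defaultdict(lambda: {"primary": None, "failover": []})
    for agent, server, kind in rows:
        if kind == "Primary":
            agents[agent]["primary"] = server
        else:
            agents[agent]["failover"].append(server)
    return dict(agents)

summary = summarize(rows)

# Agents that would have nowhere to go if their primary server is down.
no_failover = [agent for agent, cfg in summary.items() if not cfg["failover"]]
print(no_failover)  # ['agent02.example.com']
```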

 

🙂

New MatchCount Configuration in SCOM 2012

I’ve been meaning to write about this for a while, because I was thrilled when I found this new configuration element in the expression filter module when SCOM 2012 hit the press.

For reference, here are the differences in the expression filters:

SCOM 2007: http://msdn.microsoft.com/en-us/library/ee692962.aspx

SCOM 2012: http://msdn.microsoft.com/en-us/library/jj129836.aspx

Previously, the System.ExpressionFilter did not include suppression – today it does!

What this means is, we can now count the number of matches in a condition detection, and it will only pass data to the next module once that count reaches the configured MatchCount value.

It doesn’t sound like a big deal really – but it is. I’ve had cases where I needed to count condition passes, and the only way to do it before was to include a consolidation module. This was not fun and it turned out to be a lot more work than was necessary – and it was confusing to the customer when they looked at the code.

What I do not like so much is the fact that Microsoft doesn’t expose this new configuration in their base monitoring at this time. For example, it’s not possible to override the match count for a service monitor that you created using the service monitoring template – or even interval for that matter. To me, it doesn’t make sense to introduce a new configuration element without providing a way to override it – especially a valuable configuration such as this.

The default monitoring for Windows services (at this time) is to sample every 30 seconds with a match count of 2. This equates to a state change within 60 seconds of service downtime.
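
The arithmetic can be sketched in a few lines of Python. This is only a model of the behavior described here, not the module's implementation: it assumes matches must be consecutive, that data passes on the sample where the count reaches MatchCount, and that the first sample lands one full interval after the service stops (the worst case). The function name is made up for illustration.

```python
def first_pass_seconds(samples, interval_seconds, match_count):
    """Return the elapsed seconds when the filter first passes data, or None.

    Assumption: matches must be consecutive, and data passes on the sample
    where the running count reaches match_count.
    """
    consecutive = 0
    for i, matched in enumerate(samples, start=1):
        consecutive = consecutive + 1 if matched else 0
        if consecutive >= match_count:
            return i * interval_seconds
    return None

# True means the "service not running" condition matched on that sample.
print(first_pass_seconds([True, True], 30, 2))               # 60
print(first_pass_seconds([True, False, True, True], 30, 2))  # 120
print(first_pass_seconds([True, False, True], 30, 2))        # None
```

With a 30 second interval and a match count of 2, a service that stays down changes state at the second sample, which is the 60 seconds mentioned above; a single blip that recovers by the next sample never changes state.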

What I am providing here is a Windows service monitoring VSAE fragment that will allow you to override both the interval as well as the match count. I’ve also included an additional state value to account for service not found conditions. I added this condition because sometimes a pack needs to take into account upgrade scenarios where a service name changes – you don’t want an alert on a service that had been renamed due to an upgrade!

By the way, MatchCount has nothing to do with service monitoring – it’s a part of the expression filter, and can be used anywhere. This is just a working example of how you can use it in a custom service monitor type.

Here you go!

 

<ManagementPackFragment SchemaVersion="2.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <TypeDefinitions>
    <MonitorTypes>
      <UnitMonitorType ID="Example.CustomeModuleLibrary.MonitorType.CheckServiceState" Accessibility="Public">
        <MonitorTypeStates>
          <MonitorTypeState ID="MTS_Running" />
          <MonitorTypeState ID="MTS_NotRunning" />
        </MonitorTypeStates>
        <Configuration>
          <xsd:element name="ComputerName" type="xsd:string" />
          <xsd:element name="ServiceName" type="xsd:string" />
          <xsd:element name="IntervalSeconds" type="xsd:integer" />
          <xsd:element name="MatchCount" type="xsd:integer" />
        </Configuration>
        <OverrideableParameters>
          <OverrideableParameter ID="IntervalSeconds" Selector="$Config/IntervalSeconds$" ParameterType="int" />
          <OverrideableParameter ID="MatchCount" Selector="$Config/MatchCount$" ParameterType="int" />
        </OverrideableParameters>
        <MonitorImplementation>
          <MemberModules>
            <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.Win32ServiceInformationProvider">
              <ComputerName>$Config/ComputerName$</ComputerName>
              <ServiceName>$Config/ServiceName$</ServiceName>
              <Frequency>$Config/IntervalSeconds$</Frequency>
            </DataSource>
            <ProbeAction ID="Probe" TypeID="Windows!Microsoft.Windows.Win32ServiceInformationProbe">
              <ComputerName>$Config/ComputerName$</ComputerName>
              <ServiceName>$Config/ServiceName$</ServiceName>
            </ProbeAction>
            <ConditionDetection ID="CD_ServiceRunning" TypeID="System!System.ExpressionFilter">
              <Expression>
                <RegExExpression>
                  <ValueExpression>
                    <XPathQuery Type="Integer">Property[@Name='State']</XPathQuery>
                  </ValueExpression>
                  <Operator>MatchesRegularExpression</Operator>
                  <Pattern>^(4|8)$</Pattern> 
                </RegExExpression>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="CD_ServiceNotRunning" TypeID="System!System.ExpressionFilter">
              <Expression>
                <RegExExpression>
                  <ValueExpression>
                    <XPathQuery Type="Integer">Property[@Name='State']</XPathQuery>
                  </ValueExpression>
                  <Operator>DoesNotMatchRegularExpression</Operator>
                  <Pattern>^(4|8)$</Pattern>
                </RegExExpression>
              </Expression>
              <SuppressionSettings>
                <MatchCount>$Config/MatchCount$</MatchCount>
              </SuppressionSettings>
            </ConditionDetection>
          </MemberModules>
          <RegularDetections>
            <RegularDetection MonitorTypeStateID="MTS_Running">
              <Node ID="CD_ServiceRunning">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="MTS_NotRunning">
              <Node ID="CD_ServiceNotRunning">
                <Node ID="DS" />
              </Node>
            </RegularDetection>
          </RegularDetections>
          <OnDemandDetections>
            <OnDemandDetection MonitorTypeStateID="MTS_Running">
              <Node ID="CD_ServiceRunning">
                <Node ID="Probe" />
              </Node>
            </OnDemandDetection>
            <OnDemandDetection MonitorTypeStateID="MTS_NotRunning">
              <Node ID="CD_ServiceNotRunning">
                <Node ID="Probe" />
              </Node>
            </OnDemandDetection>
          </OnDemandDetections>
        </MonitorImplementation>
      </UnitMonitorType>
    </MonitorTypes>
  </TypeDefinitions>
  <LanguagePacks>
    <LanguagePack ID="ENU" IsDefault="true">
      <DisplayStrings>
        <DisplayString ElementID="Example.CustomeModuleLibrary.MonitorType.CheckServiceState" SubElementID="IntervalSeconds">
          <Name>Interval (seconds)</Name>
          <Description>Check service state interval.</Description>
        </DisplayString>
        <DisplayString ElementID="Example.CustomeModuleLibrary.MonitorType.CheckServiceState" SubElementID="MatchCount">
          <Name>Match Count</Name>
          <Description>Number of intervals service is not running before changing monitor state.</Description>
        </DisplayString>
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPackFragment>

Now you can implement new unit monitors that use this monitor type, and extend to your operators the ability to override interval and match count. You might want to replace "Example" with your company name before implementing in your library.

 

🙂