# The following are some example alarms which use mwa process data to alarm # on. Caution should be used when writing process alarms into alarmdef # as improperly written loops could increase overhead or trigger too often. # You can use utility's "analyze" command to see how your custom alarmdef # syntax works when run against historical data. # Remember that scopeux does not usually log all processes on the system # every interval. Thus, the process loop executed by alarmgen will only # see processes that exceed scope's interesting process thresholds, which # are by default based on cpu and disk activity (see the parm file). ####################################################################### # # This first alarm should detect runaway shell processes and send an # email message to alert the sysadmin of them. The alarmpid variable # is used to control repeated email messages if a process continues # to hog the cpu interval after interval. Note that on multiprocessor # systems, it is possible to have more than one process to have a cpu # util of over 95% at the same time, which might cause this alarm to # fire too often. alarmpid = alarmpid PROCESS LOOP { if ((PROC_PROC_NAME == "sh") or (PROC_PROC_NAME == "-sh") or (PROC_PROC_NAME == "ksh") or (PROC_PROC_NAME == "-ksh")) and (PROC_CPU_TOTAL_TIME_CUM > 1000) and (PROC_CPU_TOTAL_UTIL > 95) and (PROC_PROC_ID != alarmpid) then { exec "echo 'Possible runaway shell process detected by mwa, pid ", PROC_PROC_ID, ", cpu util = ", PROC_CPU_TOTAL_UTIL, "\\nfor username ", PROC_USER_NAME|8, " on ", DATE, " at ", TIME, "' | mailx -s 'shell cpu process alert' root" alarmpid = PROC_PROC_ID } } ####################################################################### # # The following alarms catch cpu hog processes in a specific application. # It will avoid unwanted alarm repeats on multiprocessor systems if two # processes happen to be looping at the same time because it compares both # the first-highest and second-highest cpu-consuming processes. # # Initialize variables for cpu hog alarms # cpuhog1_util = 0 cpuhog1_pid = cpuhog1_pid cpuhog2_util = 0 cpuhog2_pid = cpuhog2_pid # This alarm will work well as long as there are fewer than 3 processes # in application number 2 that have over 60% cpu utilization in an # interval. If one of the same 2 processes continue to use over 60% # cpu util, then one of the cpuhog?_util variables will be set that we # then alarm on. If a different process hogs the cpu, then the util # variables will be initialized to zero and the alarm will not go off. # # Note: Application names are not available at the process-data level, but # the application number is shown by the name in utility "scan" command # output. # PROCESS LOOP { if PROC_CPU_TOTAL_UTIL > 60 and PROC_APP_ID == 2 then { if PROC_PROC_ID == cpuhog1_pid then cpuhog1_util = PROC_CPU_TOTAL_UTIL else if PROC_PROC_ID == cpuhog2_pid then cpuhog2_util = PROC_CPU_TOTAL_UTIL else { if cpuhog1_pid == 0 then { cpuhog1_name = PROC_PROC_NAME cpuhog1_pid = PROC_PROC_ID } else if cpuhog2_pid == 0 then { cpuhog2_name = PROC_PROC_NAME cpuhog2_pid = PROC_PROC_ID } } } } # # If the top two cpu hogs did did not consume more than 60% # of the cpu in the last interval, then reset the pid variables. # if cpuhog1_util == 0 then cpuhog1_pid = 0 if cpuhog2_util == 0 then cpuhog2_pid = 0 # # Will alarm if a cpuhog process in application 2 was detected which # continued to use the cpu for over 3 minutes. The alarm will repeat # every 5 minutes. # alarm cpuhog1_util > 60 for 3 minutes start red alert "CPU hog alert: myapp Process ", cpuhog1_pid|5|0, " ", cpuhog1_name|10, " using ", cpuhog1_util, "% CPU" repeat every 5 minutes red alert "CPU hog alert: myapp Process ", cpuhog1_pid|5|0, " ", cpuhog1_name|10, " using ", cpuhog1_util, "% CPU" # # Will alarm if a cpuhog process in application 2 was detected which # continued to use the cpu for over 3 minutes. The alarm will repeat # every 5 minutes. # alarm cpuhog2_util > 60 for 3 minutes start red alert "CPU hog alert: myapp Process ", cpuhog2_pid|5|0, " ", cpuhog2_name|10, " using ", cpuhog2_util, "% CPU" repeat every 5 minutes red alert "CPU hog alert: myapp Process ", cpuhog2_pid|5|0, " ", cpuhog2_name|10, " using ", cpuhog2_util, "% CPU" ####################################################################### # # The following alarms catch possible memory hog processes. Since the # parm file does not have a interesting process threshold for memory, # there is no way to insure that processes using excessive memory will # be logged by scopeux unless you either specify a cpu threshold near # to (or equal to) zero. Usually, however, processes using excessive # virtual memory will also consume a significant amount of cpu (when # allocating memory and paging) so normally this alarm would work well. # Note that virtual set size (VSS), the total working set for a process # which includes regions which may be shared by other processes, is # quite different from the resident set size (RSS), which is the amount # of physical memory that the process currently is accessing. A memory # hog can allocate much more memory space than will actually be in # physical RAM, so we use VSS not RSS for our alarm. # Initialize variables for memory hog alarms: memhog_pid = 0 seconds_since_last_alarm = seconds_since_last_alarm + 60 # Change the following VSS threshold to a value that makes sense for # the workload on your system. This default example is set to 50000 # kilobytes (~50 megabytes) Virtual Set Size, meaning that no alarm # will be generated unless a process has allocated over 50mb of # virtual memory. The actual threshold increases if the alarm gets # triggered and resets back to this original value if the alarm doesn't. starting_vss_threshold = 50000 # Bump the current threshold up as memory hogs are found to avoid # repeated alarms. The following initializes the current threshold: current_vss_threshold = current_vss_threshold if current_vss_threshold == 0 then current_vss_threshold = starting_vss_threshold PROCESS LOOP { # Set memhog_pid to the pid of the process with the highest virtual # set size above the current threshold, as long as it has been # running for at least ten minutes (600 seconds). if (PROC_MEM_VIRT > current_vss_threshold) and (PROC_RUN_TIME > 600) then { memhog_pid = PROC_PROC_ID memhog_name = PROC_PROC_NAME memhog_vss = PROC_MEM_VIRT } } # Only start this alarm at the most once every 10 minutes by checking # how many seconds its been since the last time we alarmed: alarm (memhog_pid != 0) and (seconds_since_last_alarm > 600) for 1 minute start { # Send mail to administrator. Convert this to a red alert like in # the cpu example above if you want to use normal mwa alarming # methods. exec "echo \"Possible memory hog process detected by mwa", "\\nname = ", memhog_name, "\\npid = ", memhog_pid|5|0, "\\nVSS = ", memhog_vss|5|0, "KB", "\\ndetected on ", DATE, " at ", TIME, "\" | mailx -s \"memory hog process alert from `hostname` \" ", "root@adminsystem" # Reset the last alarm timer so we can check the time between alarms: seconds_since_last_alarm = 0 # Reset the current VSS threshold to 1mb more than hog's VSS: current_vss_threshold = memhog_vss + 1000 } # Reset VSSthreshold back to original value if there has been no alarm # for 10 hours (10*60*60 = 36000 seconds): if (seconds_since_last_alarm >= 36000) and (current_vss_threshold > starting_vss_threshold) then current_vss_threshold = starting_vss_threshold ####################################################################### # # The following example was developed with the help of Albert E. Whale # (aewhale@access.hky.com). The contribution is gratefully acknowledged. # # This example will send email *and* a special message to ITO via opcmsg # when a single process is using a significant amount of cpu, and has # accumulated over 10 minutes of cpu time in total. # We avoid processes with pids < 100 assuming they're system processes # and they know what they're doing. # hogpid = hogpid PROCESS LOOP { if (PROC_CPU_TOTAL_UTIL > 60) and (PROC_CPU_TOTAL_TIME_CUM > 600) and (PROC_PROC_ID > 100) and (PROC_PROC_ID != hogpid) then { # First exec sends mail (be sure to change destination address): exec "echo \"Possible runaway process detected by mwa", "\\nname=", PROC_PROC_NAME, "\\npid=", PROC_PROC_ID, "\\noriginal parent pid=", PROC_PARENT_PROC_ID, "\\ncpu util=", PROC_CPU_TOTAL_UTIL, "\\ncumulative cpu seconds=", PROC_CPU_TOTAL_TIME_CUM, "\\nruntime=", PROC_RUN_TIME, "\\nusername=", PROC_USER_NAME, "\\ndetected on ", DATE, " at ", TIME, "\\n `ps -ef | grep ", PROC_PROC_ID, " | grep -v grep`", "\" | mailx -s \"runanway process alert from `hostname`\" ", "root@mgrhost" # Second exec invokes opcmsg with non-default msg group. You can # see in the output of 'agsysdb -l' whether ITO messages are # automatically being generated. cpu_minutes = PROC_CPU_TOTAL_TIME_CUM / 60 runtime_minutes = PROC_RUN_TIME / 60 exec "/opt/OV/bin/OpC/opcmsg app=MWA obj=CPU msg_g=DNS sev=MAJOR ", "msg_t=\"", "Process ", PROC_PROC_NAME|18, "User ", PROC_USER_NAME|18, "PID ", PROC_PROC_ID|5, " ", "CPU Utilization", PROC_CPU_TOTAL_UTIL|6|1, " ", "Minutes of CPU time", cpu_minutes|5|0, " ", "Minutes of Run time", runtime_minutes|5|0, " ", "ps -ef output:", "`ps -ef | grep ", PROC_PROC_ID, " | grep -v grep`", "\"" hogpid=PROC_PROC_ID } }