Average resource utilization for a business service

Have you ever needed CPU/memory consumption figures for a business service? We have!



We use Nagios to check server resource usage, and the historical data is kept in RRD files. Getting the values for a single host and drawing nice graphs is easy, but getting them for a whole business service is trickier.

Each business service runs on several servers, so we first need to determine which servers belong to the business service. Then we must extract the RRD data for the period of interest (in our case, the last month). Nagios check plugins are mostly written by community members, so the checks do not follow a common standard: one check reports memory usage in GB, another reports the percentage of free memory. All of the data therefore has to be aligned to "percent utilization".
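As a minimal sketch of that alignment step, the helpers below convert the two non-percent forms the script has to deal with (a free/total pair, and a CPU idle percentage) into percent utilization. The function names are hypothetical and not part of any Nagios plugin, and integer input values are assumed.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.

# Percent used from a free amount and a total amount in the same unit (e.g. MB).
# Prints nothing and returns 1 when the total is missing or zero.
pct_used_from_free() {
    local free=$1 total=$2
    if [ -z "$total" ] || [ "$total" -eq 0 ]; then
        return 1
    fi
    echo $(( 100 * (total - free) / total ))
}

# Percent utilization from an idle percentage.
pct_used_from_idle() {
    local idle=$1
    echo $(( 100 - idle ))
}

pct_used_from_free 2048 8192   # 8192 MB total, 2048 MB free: prints 75
pct_used_from_idle 85          # CPU 85% idle: prints 15
```

The arithmetic is the same as in the script further down; only the guard against a zero or missing total is an addition.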

I have written a script to partly automate this process. It is not a plug-and-play script; rather, it is a guide for your journey.

The script extracts the 2-hour averages for the last month and produces a tab-separated values file.
You can feed this file to Excel and calculate the average across the hosts for each point in time. You can then graph this data and see the resource usage of the business service.
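If you would rather stay on the command line, the averaging step can also be sketched with awk. This assumes the file layout the script produces (a header row, the timestamp in the first column, one host per remaining column); the function name `average_rows` is hypothetical.

```shell
#!/bin/bash
# Sketch: print "timestamp <TAB> average" for each data row of the TSV file.
# Column 1 is assumed to be the timestamp; every other column is one host.
# Non-numeric cells (for example "nan") are left out of the average.
average_rows() {
    awk -F'\t' '
        NR == 1 { next }                       # skip the header row
        {
            sum = 0; n = 0
            for (i = 2; i <= NF; i++) {
                v = $i
                gsub(/[ \t]/, "", v)           # trim padding around the value
                if (v ~ /^-?[0-9]+(\.[0-9]+)?$/) { sum += v; n++ }
            }
            ts = $1
            sub(/[ \t]+$/, "", ts)             # trim padding after the timestamp
            if (n > 0) printf "%s\t%.1f\n", ts, sum / n
        }
    ' "$1"
}
```

Calling `average_rows` with the name of the generated file prints one averaged value per timestamp, which can be graphed directly.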

PS: I am here to help, too.


The script follows.

[root@nagios ~]# cat rrd2.sh
#This script reports cpu/memory utilization percent for a host group. Let's say you have 10 hosts under a business service
#and you want to see cpu/memory utilization for this business service. You have to find the average values for these 10 hosts; then you can graph the utilization in Excel.
#This script extracts rrd average values for multiple hosts. The metric (rrd values) can be cpu or memory.
#It finds the necessary rrd files according to metric and OS type and converts all values to utilization percent.
#It produces tab-separated average values per host.
#Written by M. selcuk karaca @ 29/10/2012, Ankara/Turkey

#where the rrd files are stored on the Nagios server
RRD_PATH=/usr/local/nagios/var/rrd
#first command line parameter: a file listing the servers to process, one name per line
SERVERS_FILE=$1
#second command line parameter: the metric, which can be memory or cpu
METRIC=$2
#OS holds the server OS type. NA means Not Applicable
OS=NA
#problematic servers. if we cannot find an rrd file, the server name is appended to this variable
PROBLEMS=""
#how many problematic servers were found
PROBLEM_COUNT=0
#directory where the per-server result files are stored
WORK_DIR=/tmp/rrd
#server counter
SCOUNTER=0

#create the work directory if it does not exist yet
mkdir -p $WORK_DIR

#######################MAIN LOOP STARTS######################

#we will do rrd file processing for each server
while read SERVER
do

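#if a server name includes a dash or a dot, these characters are converted to hex codes in the rrd directory path, so we first build that form of the server name
#for example the server name archive-1 becomes archive%2D1 in the directory path
ORIG_NAME=$SERVER
#the g flag replaces every dash and dot, not only the first one
SERVER=`echo $SERVER| sed 's/-/%2D/g'`
SERVER=`echo $SERVER| sed 's/\./%2E/g'`

#reset the OS for each server, otherwise a server without rrd files would inherit the OS of the previous one and never be reported as problematic
OS=NA

let SCOUNTER=SCOUNTER+1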


#name of the server we are processing
echo $SERVER

#the rrd file location depends on the metric and on the OS-specific check, so we need to determine the correct rrd file.
#this is because each check script was written by a different person in the nagios community; the check plugins are not standardized.
case $METRIC in
memory)

        #windows hosts have 2 checks. physical_memory__ includes the utilization percent, but some hosts do not have it; for those hosts we use the physical_memory check (nsclient++)

        if [ -e $RRD_PATH/${SERVER}/check*Mem/physical_memory/value.rrd ] ;
        then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem/physical_memory/value.rrd
        OS=WINDOWS
        fi

        if [ -e $RRD_PATH/${SERVER}/check*Mem/physical_memory__/value.rrd ] ;

        then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem/physical_memory__/value.rrd
        OS=WINDOWS
        fi

        #Linux memory check, we will use memory with buffers

        if [ -e $RRD_PATH/${SERVER}/check*Mem/MemUsedWithBuf/value.rrd ] ;
        then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem/MemUsedWithBuf/value.rrd
        OS=LINUX
        fi

       #AIX MEM Utilization Percent

       if [ -e $RRD_PATH/${SERVER}/check%20Mem*/Used/value.rrd ] ; then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem*/Used/value.rrd ; OS=AIX; fi

       #UX MEM Free MB . other param is Total. we will calculate percent utilization.

       if [ -e $RRD_PATH/${SERVER}/check%20Mem/Free/value.rrd ] ; then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem/Free/value.rrd
       OS=UX
       RRDT_FILE=$RRD_PATH/${SERVER}/check*Mem/Total/value.rrd
       fi

       #Solaris MEM free MB. other param is TOTALSPACE. we will calculate percent utilization.

       if [ -e $RRD_PATH/${SERVER}/check%20Mem/TOTALFREE/value.rrd ] ; then RRD_FILE=$RRD_PATH/${SERVER}/check*Mem/TOTALFREE/value.rrd
       OS=SOLARIS
       RRDT_FILE=$RRD_PATH/${SERVER}/check*Mem/TOTALSPACE/value.rrd
       fi
       ;;

cpu)

      #Windows CPU Utilization percent
      if [ -e $RRD_PATH/${SERVER}/check*CPU*/5m/value.rrd ] ; then RRD_FILE=$RRD_PATH/${SERVER}/check*CPU*/5m/value.rrd ; OS=WINDOWS; fi

      #Unix/Linux CPU idle percent, later converted to Utilization Percent

      if [ -e $RRD_PATH/${SERVER}/check*CPU*stats/CpuIdle/value.rrd ] ; then RRD_FILE=$RRD_PATH/${SERVER}/check*CPU*stats/CpuIdle/value.rrd ; OS=UNIX; fi
      ;;
esac

#if we could not determine the server OS, we add the server to the problematic servers variable, increase the problem count and continue with the next server.

if [ $OS = NA ]; then
   PROBLEMS=${PROBLEMS}","$SERVER
   let PROBLEM_COUNT=PROBLEM_COUNT+1
   continue
fi


#this is Solaris/UX. for percent calc we need total file

#####SOLARIS/UX part Total file processing starts..
if  [ $OS = UX -o $OS = SOLARIS ]
then
if [ -e /tmp/${SERVER}_total ]; then rm  /tmp/${SERVER}_total  ; fi
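#write the total values to rrdt.out. the period ends 2 hours ago and starts one month before that
#month is spelled out because a bare m can also mean minutes in rrdtool time specifications
rrdtool fetch $RRDT_FILE AVERAGE -s end-1month -e now-2h > /tmp/rrdt.out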
COUNTER=0

#reproduce rrdt.out in correct date format..

while read i
do
let COUNTER=$COUNTER+1

#first 2 lines are headers so skip them

if [ $COUNTER -le 2 ] ; then
   continue
fi
DATE=`echo $i | tr -d ":" | tr -s " " |cut -f 1 -d " "`
DATA=`echo $i | tr -d ":" | tr -s " " |cut -f 2 -d " "`
DATA=`printf '%.0f' $DATA`
DATE=`date -d @$DATE "+%Y-%m-%d %T"`
echo "$DATE $DATA" >> /tmp/${SERVER}_total

done < /tmp/rrdt.out

fi
#####SOLARIS/UX part Total file processing finished..


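#extract the values from the rrd file. the period ends 2 hours ago and starts one month before that
#month is spelled out because a bare m can also mean minutes in rrdtool time specifications
rrdtool fetch $RRD_FILE AVERAGE -s end-1month -e now-2h > /tmp/rrd.out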
COUNTER=0

##############rrd file processing starts..#################

#reproduce rrd values with correct date format
while read i
do
let COUNTER=$COUNTER+1

#first 2 lines are headers so skip them

if [ $COUNTER -le 2 ] ; then
   continue
fi
DATE=`echo $i | tr -d ":" | tr -s " " |cut -f 1 -d " "`
DATA=`echo $i | tr -d ":" | tr -s " " |cut -f 2 -d " "`
DATA=`printf '%.0f' $DATA`
DATE=`date -d @$DATE "+%Y-%m-%d %T"`

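#the following are some special cases for the DATA calculation...
case $DATA in
*nan*)
#rrdtool prints nan where no sample was stored. we leave it untouched, because the arithmetic below would turn it into a misleading number
;;
*)
#cpu IDLE must be converted to utilization percent for UNIX type servers
if [ $METRIC = cpu -a $OS = UNIX ] ; then let 'DATA=100-DATA' ; fi
#this is Solaris/UX. calculate percent utilization from the free and total values
if  [ $METRIC = memory ] && [ $OS = UX -o $OS = SOLARIS ]
then
TOT=`grep "$DATE"  /tmp/${SERVER}_total | cut -f 3 -d " " `
#without a usable total there is nothing to divide by
case $TOT in
""|0|*nan*) DATA=nan ;;
*) let 'DATA=100*(TOT-DATA)/TOT' ;;
esac
fi
;;
esac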

#write to the result file. for the first server we output DATE and DATA; for the other servers we output just DATA.

if [ $SCOUNTER -eq 1 ] ; then
#COUNTER 3 means this is the first data line for this server, so we write the header first
 if [ $COUNTER -eq 3 ] ; then echo -e "DATE \t $ORIG_NAME" >> $WORK_DIR/${SCOUNTER}.out ; fi
 echo -e "$DATE \t $DATA" >> $WORK_DIR/${SCOUNTER}.out
#for the second and later servers we output just DATA
else
 #header part..
 if [ $COUNTER -eq 3 ] ; then echo "$ORIG_NAME" >> $WORK_DIR/${SCOUNTER}.out ; fi
 echo "$DATA" >> $WORK_DIR/${SCOUNTER}.out
fi

done < /tmp/rrd.out

##############rrd file processing ends..#################

done < $SERVERS_FILE

#######################MAIN LOOP ENDS######################


#pasting all out files to 1 file

paste `ls -tr $WORK_DIR/*.out` > ${SERVERS_FILE}_${METRIC}.csv
echo ${SERVERS_FILE}_${METRIC}.csv is ready..


#cleanup

rm  $WORK_DIR/*.out

if [ $PROBLEM_COUNT -gt 0 ] ; then

   echo $PROBLEM_COUNT servers not processed. Problematic servers are $PROBLEMS
fi
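
The two `while read` loops in the script start several processes (`tr`, `cut`, `date`) for every sample, which gets slow over a month of data. As a hedged alternative, the parsing can be sketched as a single awk pass. It assumes the `rrdtool fetch` output layout the script already relies on (a header line, a blank line, then `timestamp: value` lines); the function name `parse_fetch` is hypothetical. Unlike the script, it drops `nan` rows and keeps the timestamp as epoch seconds.

```shell
#!/bin/bash
# Sketch: read "rrdtool fetch" output on stdin and print
# "epoch <TAB> rounded value" for every line that carries a sample.
parse_fetch() {
    awk '
        $1 ~ /^[0-9]+:$/ {                # data lines look like "1351468800: 1.23e+01"
            ts = $1
            sub(/:$/, "", ts)             # drop the trailing colon
            if ($2 ~ /nan/) next          # no sample stored for this interval
            printf "%s\t%.0f\n", ts, $2   # round like the printf in the script
        }
    '
}
```

It would be used as `rrdtool fetch $RRD_FILE AVERAGE -s end-1month -e now-2h | parse_fetch`. Dropping `nan` rows changes the row count per server, so this form suits per-server inspection better than the column-wise `paste` step.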


