Posted here with permission from Catherine Carroll.


Hi All:


Below are the scripts I promised before going on vacation a couple of weeks ago. The attached UNIX script was created to measure the reliability of a network segment between two endpoints. Here is a brief summary of the script:

The script pings a remote system every 10 seconds for 28 minutes. At the end of those 28 minutes it collects the ping output and writes a summary line to a log file. The log file output looks like this:


05/31/02 (09:30am-10:00am) - 0% packet loss
05/31/02 (10:00am-10:30am) - 0% packet loss
05/31/02 (10:30am-11:00am) - 0% packet loss

The script measures both network latency and network packet loss. The packet loss data is kept in one log file, and the latency data is stored in a different log file. Here is an example of the network latency log file output:

05/31/02 (09:30am-10:00am) - 79 ms Average Latency
05/31/02 (10:00am-10:30am) - 54 ms Average Latency
05/31/02 (10:30am-11:00am) - 63 ms Average Latency
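
The part of the script that runs ping and writes these lines is not shown in this post, so the following is only a rough sketch of how a ping summary could be turned into the two log lines. It assumes the summary contains a "N% packet loss" line and a "min/avg/max ... = a/b/c" line; the function names ping_loss, ping_avg and log_lines are made up for illustration, and the ping command itself is left out because its options differ between systems. At one ping every 10 seconds, a 28-minute run works out to 168 packets.

```shell
#!/bin/sh
# Sketch only: turn a ping summary into the two log lines shown above.
# Assumes the summary contains a "N% packet loss" line and a
# "min/avg/max... = a/b/c" line; check your own ping's output first.

# Print the packet-loss percentage found on stdin, e.g. "0%".
ping_loss() {
    sed -n 's/.* \([0-9.]*%\) packet loss.*/\1/p'
}

# Print the average round-trip time found on stdin, rounded to whole milliseconds.
ping_avg() {
    awk -F= '/min\/avg\/max/ { split($2, t, "/"); printf "%.0f\n", t[2] }'
}

# $1 = date stamp, $2 = time-slot label; the ping summary is read from stdin.
log_lines() {
    summary=$(cat)
    loss=$(printf '%s\n' "$summary" | ping_loss)
    avg=$(printf '%s\n' "$summary" | ping_avg)
    echo "$1 $2 - $loss packet loss"
    echo "$1 $2 - $avg ms Average Latency"
}
```

In a real run the first line would presumably be appended to the loss log and the second to the latency log.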

The script needs to be started from cron, and the number of cron entries depends on which hours of the day you want to capture network statistics. Here are some example cron entries (replace the XX's with the IP address of your remote system):


00 8 * * * /export/home/scuser/bin/netmonitor one XX.XX.XX.XX
30 8 * * * /export/home/scuser/bin/netmonitor two XX.XX.XX.XX


The first argument in the command selects the time stamp that gets recorded in the log file. The second argument is the IP address of the remote system, which is also used to name the log files.
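
If you want an entry for every slot the script knows about, typing them by hand gets tedious. Here is a small sketch that prints the crontab lines for the eight half-hour slots from 8:00am to 12:00pm. The function name cron_entries is hypothetical, and the path is simply the one from the example above.

```shell
#!/bin/sh
# Sketch only: print crontab lines for the eight half-hour slots
# (one .. eight) that the netmonitor script maps to time stamps.
# $1 = IP address of the remote system
cron_entries() {
    hour=8
    minute=00
    for slot in one two three four five six seven eight
    do
        echo "$minute $hour * * * /export/home/scuser/bin/netmonitor $slot $1"
        # Advance by half an hour for the next slot.
        if [ "$minute" = 00 ]; then
            minute=30
        else
            minute=00
            hour=$((hour + 1))
        fi
    done
}
```

Running "cron_entries XX.XX.XX.XX" prints eight lines that can be pasted into the crontab; the first two match the example entries above.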

This can be useful if you see many people getting dropped from Peregrine at a certain time in the day, because you can then go look at the network log files to see if there was any network packet loss, or network latency increases recorded during the same time period.

If you choose to monitor from the server to a desktop on the LAN, you should try to make sure that the desktop does not get turned off, so that you do not report incorrect data.

Here is a copy of the script. Good luck!

Code:
#!/bin/ksh -
#
# Usage: netmonitor <slot> <remote-ip>
# The log files are named after the remote IP address ($2).
#
LATLOG=/export/home/scuser/bin/$2.latency.log
LOSSLOG=/export/home/scuser/bin/$2.loss.log
run=$1

# Map the slot name ($1) to the time stamp recorded in the log files.
case $run in
        one)    tyme='(08:00am-08:30am)' ;;
        two)    tyme='(08:30am-09:00am)' ;;
        three)  tyme='(09:00am-09:30am)' ;;
        four)   tyme='(09:30am-10:00am)' ;;
        five)   tyme='(10:00am-10:30am)' ;;
        six)    tyme='(10:30am-11:00am)' ;;
        seven)  tyme='(11:00am-11:30am)' ;;
        eight)  tyme='(11:30am-12:00pm)' ;;
esac

******************* End of Netmonitor Script **************************
So that I could simply run a job to get instant information about who is being dropped and the % packet loss, our Unix guru also wrote the following scripts for me. Showdrops gives me a list of all users who were dropped during the day on which it is run. It reads the current sc.log file. Each night this file is copied to a backup and deleted, so we have a new sc.log file each day. The results of the script are below:

scuser@srvhap01[/export/home/scuser/bin]showdrops

The following report is a detailed listing of all Peregrine reset errors that were detected on 06/04/02:

TIME (PST) USER IP
========== ==== ==========

07:32:56 gbalaf7 xx.xx.xx.xx

Script called Showdrops:

Code:
#!/bin/ksh -
#
# Program written to scan the Peregrine sc.log
# file looking for reset errors, and then to
# print a report with the findings.
#

SCLOG=/scprod/app/logs/sc.log
NUMS=/var/tmp/$$.nums
DATE=`date +%m/%d/%y`
echo "\n\nThe following report is a detailed listing of all Peregrine "
echo "reset errors that were detected on $DATE:\n"
echo "TIME (PST)        USER                IP"
echo "==========        ====            ==========\n"
grep "Connection reset by peer" $SCLOG | awk ' { print $1 }' > $NUMS
for i in `cat $NUMS`
do
        TIME=`grep " ${i} " $SCLOG | grep "Connection reset by peer" | awk ' { print $3 }'`
        USER=`grep " ${i} " $SCLOG | grep "has logged out" | awk ' { print $5 } '`

        if [ -n "$USER" ];
        then
                IP=`grep " ${i} " $SCLOG | grep "connected to" | awk ' { print $NF }' | awk -F: ' { print $1 }'`

                if [ -n "$IP" ];
                then
                        HOSTIP=$IP
                else
                        HOSTIP="N/A"
                fi
                WC=`echo $USER | wc -m`
                if [ "$WC" -lt 9 ];then
                        echo "$TIME        $USER                $HOSTIP"
                else
                        echo "$TIME        $USER        $HOSTIP"
                fi
        fi
done
echo " "
rm $NUMS

************* End of Showdrops Script ***************

Then, so that I could see the last 15 lines of the netmonitor loss log (written by the first script given above), our Unix guru wrote this script. The results look like this:

scuser@srvhap01[/export/home/scuser/bin]shownet

The following report shows the last 15 lines of the /export/home/scuser/bin/xx.xx.xx.xx.loss.log file.

06/03/02 (12:30pm-1:00pm) - 0% packet loss
06/03/02 (1:00pm-1:30pm) - 0% packet loss
06/03/02 (1:30pm-2:00pm) - 0% packet loss
06/03/02 (2:00pm-2:30pm) - 0% packet loss
06/03/02 (2:30pm-3:00pm) - 0% packet loss
06/03/02 (3:00pm-3:30pm) - 0% packet loss
06/03/02 (3:30pm-4:00pm) - 0% packet loss
06/03/02 (4:00pm-4:30pm) - 0% packet loss
06/03/02 (4:30pm-5:00pm) - 5% packet loss
06/03/02 (5:00pm-5:30pm) - 0% packet loss
06/03/02 (5:30pm-6:00pm) - 0% packet loss
06/03/02 (6:00pm-6:30pm) - 0% packet loss
06/03/02 (6:30pm-7:00pm) - 0% packet loss
06/03/02 (7:00pm-7:30pm) - 0% packet loss
06/03/02 (7:30pm-8:00pm) - 0% packet loss


It was this report that helped me prove it was a network problem. When we had 3 or 50 users dropped at the same time and I went back and checked this log file, I could see that we also had some % packet loss during the half hour we had the users dropped.
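
When the log gets long, it can be quicker to pull out only the half hours that recorded some loss and compare those against the drop times. This is a minimal sketch, assuming the loss log format shown above; the function name lossy_periods is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the half-hour periods that recorded some packet loss.
# Reads a loss log in the format shown above on stdin.
lossy_periods() {
    grep 'packet loss' | grep -v ' - 0% packet loss'
}
```

For example, "lossy_periods < /export/home/scuser/bin/xx.xx.xx.xx.loss.log" would list only the periods worth checking against the showdrops report.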

Script for Shownet:

Code:
#!/bin/ksh -
#
# Program written to tail the network loss report.
#
#

echo "\n\nThe following report shows the last 15 lines of the"
echo "/export/home/scuser/bin/xx.xx.xx.xx.loss.log file.\n"
tail -15 /export/home/scuser/bin/xx.xx.xx.xx.loss.log
echo ""

This does require two servers: one to ping from, and the production server. You should be pinging over the same path your users take. This is how we proved we had a network problem. The network folks cannot say there are no network problems if your ping report shows some percent packet loss at the same time as your users get dropped. These scripts monitor and report on the reset error. They do not report on the heartbeat error. We had users dropped only occasionally from the heartbeat error, and it never correlated with packet loss.


I hope this helps,

Catherine

Catherine Carroll wrote:

Hi:

I have tried to stay out of this conversation because it is very hard to determine network errors in your own system and probably impossible in someone else's system. We were getting the reset error and 10 to 40 users would be dropped. IBM manages our network. We have users in Washington, California and Illinois. Our servers are in California. After months of work IBM found problems with the network. Routers were changed or upgraded. We started having trouble in November of 2001. AT&T and Lucent Technologies own the lines. It took IBM, Lucent and AT&T all working together to find all the problems, and there were many problems. We had sniffers set up all over and switched our route to the production server. I'm sure much money was spent, but in the end, it was problems with the hardware and software making up our network. We still have users get dropped from time to time, but I don't think there is ever going to be a perfect network, just one that does not drop users daily.

Since no one would take responsibility for the problem, and all I had to go on was the errors in the sc.log file and Peregrine saying they meant we had a network problem, I had a Unix guy write a script that is still doing continuous pings from the development server here in Chicago to the production server in Irvine, California. Every 30 minutes the script writes a line showing the percentage of packets lost in that 30 minutes. I was able to prove it was a network issue because when users were dropped, I could go to the file and show IBM there was a 3% or 30% packet loss during the 30 minute period when we had users dropped. Another thing we did was write a script to read the sc.log file on command and list the user name and time of drop for all reset errors. When the time of drop is the same for several users, this is a network issue.
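
Spotting several users dropped at the same time can also be scripted. The following is only a sketch, assuming report rows of the form "HH:MM:SS user ip" like the ones the drop report prints; grouping by minute is my own assumption, and the function name clustered_drops is hypothetical.

```shell
#!/bin/sh
# Sketch only: read drop report rows ("HH:MM:SS user ip") on stdin and
# print each minute in which more than one user was dropped, with the count.
# Header and separator lines are skipped because they do not start with a time.
clustered_drops() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
             minute = substr($1, 1, 5)   # keep HH:MM only
             count[minute]++
         }
         END {
             for (m in count)
                 if (count[m] > 1)
                     print m, count[m]
         }' | sort
}
```

Piping the drop report into clustered_drops leaves only the minutes where a group of users went down together, which are the ones to compare against the packet loss log.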

If anyone would like me to send you the scripts, let me know. If a user logs off of Service Center by clicking on the little X in the upper right corner, the same reset error is created in the log file. If you really start using these log files and scripts, you will have to educate your users to only log off Service Center using the log off buttons.

Hope this helps. If there is anything else I can do, please don't hesitate to ask.


Catherine Carroll

Peregrine Administrator/Developer
Washington Mutual
565 Lakeview Parkway
Suite 250
Vernon Hills, IL 60061