GPL, Linux, Virtualization, HTTP, Perl, Python: July 2010

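Here is the Perl script; it reads tcpdump -x -n output on standard input and prints each packet's timestamp followed by its full hex dump on one line: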

#! /usr/bin/perl -w
sub concat{
#To reassemble fragmented packets and handle lost fragments (stub, not implemented yet).
print "entering concat";
}
$last_time="00:00:00.00000";
$fragment="FFFF";
while (<>)
{
#Sample TCPDUMP with -x -n output would be:
#
#04:09:29.989335 IP 10.0.0.32.55238 > 209.85.227.113.80: Flags [.], ack 2233, win 81, options [nop,nop,TS val 256065 ecr 48692], length 0
# 0x0000: 4500 0034 d413 4000 4006 a7c9 0a00 0020
# 0x0010: d155 e371 d7c6 0050 f5f8 0b98 f76a 480d
# 0x0020: 8010 0051 f7eb 0000 0101 080a 0003 e841
# 0x0030: 0000 be34
#
if(/^([[:digit:]]{2}):([[:digit:]]{2}):([[:digit:]]{2})\.([[:digit:]]+)\s(.*)/){
#new fragment
#*****Here we could check address/lengths in the hash for consistency, but I guess the regex builds the check in.
print "$last_time:$fragment\n";
$fragments{$last_time}="$fragment";
$last_time="$1:$2:$3.$4";
$fragment="";
}
elsif(/^\s+0x[[:xdigit:]]{4}:\s+((?:[[:xdigit:]]{2,4}\s*)+)$/)
{
#new piece of the fragment: accept 1 to 8 hex words, so the short last line of a packet is kept too
(my $words=$1)=~s/\s+//g;
$fragment.=$words;
}

}

$fragments{$last_time}="$fragment";
print "$last_time:$fragment\n";
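To check that a reassembled line is sane, it can help to decode a few header fields from it. The following is a minimal Python sketch, not part of the original script: the function name decode_ipv4_tcp is hypothetical, and it assumes the hex string starts at the IPv4 header (tcpdump -x omits the link-level header) and carries TCP. The sample string is the packet from the comment in the script above.

```python
import socket
import struct


def decode_ipv4_tcp(hex_dump: str) -> dict:
    """Decode a few IPv4/TCP header fields from a one-line hex dump.

    Assumption: the dump starts at the IPv4 header and the payload is TCP.
    """
    raw = bytes.fromhex(hex_dump)
    header_len = (raw[0] & 0x0F) * 4          # IHL is counted in 32-bit words
    total_length = struct.unpack("!H", raw[2:4])[0]
    src = socket.inet_ntoa(raw[12:16])
    dst = socket.inet_ntoa(raw[16:20])
    # The TCP header starts right after the IP header: source port, then destination port.
    src_port, dst_port = struct.unpack("!HH", raw[header_len:header_len + 4])
    return {
        "total_length": total_length,
        "protocol": raw[9],
        "src": src,
        "dst": dst,
        "src_port": src_port,
        "dst_port": dst_port,
    }


# Hex words of the sample packet, joined the way the Perl script prints them.
SAMPLE = (
    "45000034d41340004006a7c90a000020"
    "d155e371d7c60050f5f80b98f76a480d"
    "80100051f7eb00000101080a0003e841"
    "0000be34"
)

info = decode_ipv4_tcp(SAMPLE)
print(f'{info["src"]}.{info["src_port"]} > {info["dst"]}.{info["dst_port"]}')
# prints: 10.0.0.32.55238 > 209.85.227.113.80
```

The decoded addresses and ports match the tcpdump summary line of the sample, and the total length (52 bytes) matches the number of bytes in the dump, which is a quick way to spot a dropped hex line.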

After spending three hours fine-tuning the timings on a large Veritas cluster so that it would fail over in one particularly tricky failure case, I wrote myself this reminder of all the parameters and their meaning.

Configuring the failure behavior

To define the number of times a resource attempts to recover before giving up.
RestartLimit is the number of times VCS attempts to restart the failed resource on the same host. When it is exhausted, the resource faults. If the resource is critical, the service group fails over to the best available node.

To define the number of times a resource attempts to Online before giving up.
OnlineRetryLimit is the number of times VCS attempts to Online the resource initially. When it is exhausted, the resource faults and the service group fails over.

To define how long Veritas waits between monitoring attempts.
MonitorInterval is the time, in seconds, between two consecutive status checks of a resource. Combine it with ToleranceLimit to define the overall VCS retry policy for a specific resource.
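As a rough worked example of how MonitorInterval and ToleranceLimit combine, here is a small Python sketch. It rests on an assumption that is my reading rather than something these notes confirm: VCS tolerates ToleranceLimit offline results and declares the fault on the next one. The function name is hypothetical, and the time the monitor script itself takes to run is ignored.

```python
def worst_case_detection_delay(monitor_interval: int, tolerance_limit: int) -> int:
    """Upper bound, in seconds, between a failure and the resource being declared faulted.

    Assumption: the failure happens just after a monitor check, and the fault is
    declared on the (tolerance_limit + 1)-th consecutive offline result.
    """
    return monitor_interval * (tolerance_limit + 1)


print(worst_case_detection_delay(60, 0))  # 60
print(worst_case_detection_delay(60, 2))  # 180
```

Under that assumption, raising ToleranceLimit from 0 to 2 with a 60-second MonitorInterval triples the worst-case delay before a fail-over can even start.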

To define after how many failed monitoring checks on a specific resource VCS must consider it faulted.
ToleranceLimit is the number of times the monitor may report the resource offline before VCS declares it faulted.

How long should VCS allow the monitoring script to run before killing it and declaring monitor time-out?
MonitorTimeout is the interval in seconds to wait for the monitoring script to return a result and exit.

To define what happens if the monitoring agent takes too long to return a status (think of an overloaded service with an application-level test).
FaultOnMonitorTimeouts is the number of times the monitoring agent must time out before VCS considers the monitored resource faulted. Relying on it is poor design, though: it is better to set MonitorTimeout appropriately and make sure the monitor script completes within that interval.
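To make the interplay between a per-check timeout and a timeout counter concrete, here is a toy Python model. It is not how the VCS agent framework is implemented; the function names are hypothetical, the exit-code convention (0 means online) is a simplification of my own, and I assume that only consecutive timeouts count and that a limit of 0 disables faulting on timeouts.

```python
import subprocess


def run_monitor(command, monitor_timeout):
    """Run one monitor check and return 'online', 'offline' or 'timeout'.

    Simplification: exit code 0 means online, anything else offline.
    The child process is killed if it exceeds monitor_timeout seconds.
    """
    try:
        result = subprocess.run(command, timeout=monitor_timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "online" if result.returncode == 0 else "offline"


def faulted_after_timeouts(results, fault_on_monitor_timeouts):
    """Return True once enough consecutive timeouts have been seen.

    Assumption: a limit of 0 means timeouts never fault the resource.
    """
    consecutive = 0
    for outcome in results:
        consecutive = consecutive + 1 if outcome == "timeout" else 0
        if 0 < fault_on_monitor_timeouts <= consecutive:
            return True
    return False


history = ["online", "timeout", "timeout", "online", "timeout"]
print(faulted_after_timeouts(history, 2))  # True
print(faulted_after_timeouts(history, 3))  # False
```

The model shows why a monitor script that regularly runs close to its timeout is risky: a couple of slow checks in a row on a loaded machine are enough to fault a healthy resource.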

How long should VCS allow a startup script to run before declaring online time-out?

OnlineTimeout is the interval in seconds to wait for the startup script to return a result and exit.

During resource startup, how many times do we check to see if the startup is successful?
OnlineWaitLimit is the number of times the monitoring agent checks whether a resource started by VCS during normal startup is indeed ONLINE before considering the startup attempt unsuccessful. Between monitor attempts it presumably waits for MonitorInterval (I have not verified this).
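For a rough idea of how long a single startup attempt can take before VCS gives up, the arithmetic can be sketched in Python. The function name is hypothetical, and the formula assumes one MonitorInterval between checks, which is exactly the point the note above leaves unverified.

```python
def max_online_wait(online_timeout: int, online_wait_limit: int, monitor_interval: int) -> int:
    """Upper bound, in seconds, on one startup attempt.

    Assumption: the startup script may run for the full OnlineTimeout, and each
    of the OnlineWaitLimit checks is separated by one MonitorInterval.
    """
    return online_timeout + online_wait_limit * monitor_interval


print(max_online_wait(300, 2, 60))  # 420
```

If OnlineRetryLimit is above zero, each retry can in principle take that long again, so the total time before a fail-over grows accordingly.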


Wednesday, July 21, 2010

[Network Traces] How to print the full hexdump of packets on one line using Perl and tcpdump.

Thursday, July 15, 2010

[Cluster, Veritas] Configuring timings for startup and fail-over
