Martin,
I don't know about your connection problem, but I know of a possiblity for
the original problem where cflowd dies after a few days. Did the problem
start after switching to
2.1-b1? What does the syslog say when it crashes?
I haven't looked closely at the new (2.1-b1) CflowdPacketQueue.cc but the
problem could
be related to the SEM_UNDO flag on a semaphore. If the process that gives
the semaphore
(e.g. cflowdmux) is not the same process that takes the semaphore (e.g.
cflowd), the semaem count will not get decremented properly, since each
process keeps its own count. When the system semaem limit is reached after
several hours, the process goes down.
If it turns out we think this is the problem, you could just take the
SEM_UNDO flags
out of CflowdPacketQueue.cc (replace them with 0) and see if your problem
goes away.
Steve Nunes
CoreExpress Inc.
E-mail: steve.nunes@coreexpress.net
-----Original Message-----
From: Martin_Nieuwelaar@infonet-europe.com
[mailto:Martin_Nieuwelaar@infonet-europe.com]
Sent: Monday, November 20, 2000 5:54 AM
To: cflowd@caida.org
Subject: Problem restarting cflowd after it dies...
Hi People,
I'm having some ongoing problems with cflowd. Everything was running
smoothly apparently until I started collecting netmatrix statistics. Since
then I have had the cflowd process die somewhat mysteriously after a few
days of continuous running. To avoid looking into the real problem I
decided to take the quick and dirty way out by detecting when cflowd dies,
and then restarting it. From cron I check each hour to see how many cflowd
processes are running, and if it's less than three, I do a "cfd restart"
according to the following script:
#!/bin/sh
# Version 1.02
#
# 1.00 - Original release
#
# 1.01 - Made the timing and signalling a little more graceful
#
# 1.02 - Included restart option
#
PATH=/usr/local/arts/sbin:${PATH} ; export PATH
ulimit -c 0
case "$1" in
start)
cflowdmux
sleep 1
cflowd
sleep 1
cfdcollect /usr/local/arts/etc/cfdcollect.conf
;;
stop)
pid=`ps aux | grep cfdcollect | grep -v grep | awk '{ print
$2 }'`
echo "Stopping cfdcollect with PID $pid..."
# This is a little tricky. There is no way to tell
cfdcollect to shut
# down gracefully. The problem is we do not want to
accidentally kill
# it when it is part way through writing to disk, thereby
creating a
# corrupt data file. We can trigger it to reload its config
file, and
# subsequently connect to cflowd. If we wait a little while
after this
# until we are sure it has had time to write to disk, then
we should
# then be safe to kill it.
kill -SIGHUP $pid
echo "Pausing... Please be patient..."
sleep 30
kill -SIGKILL $pid
sleep 1
pid=`ps aux | grep cflowd | grep -v grep | grep -v cflowdmux
| awk '{ print $2 }'`
echo "Stopping cflowd with PID $pid..."
kill -SIGHUP $pid
sleep 1
pid=`ps aux | grep cflowdmux | grep -v grep | awk '{ print
$2 }'`
echo "Stopping cflowdmux with PID $pid..."
kill -SIGINT $pid
;;
restart)
$0 stop
sleep 1
$0 start
;;
*)
echo "Usage: cfd {start|stop}"
exit 1
esac
exit 0
I have been using the cfd script to manually start and stop cflowd for a
while now and it seems to work quite well. However the other day, two of my
collector machines (there are six in total, that I have also recently
upgraded to 2.1-b1 from 2.1-a9) seem to have had problems with the restart
of cflowd (after detecting a missing cflowd process from cron). I tried
"cfd restart" myself, and found the same strange log entries. Essentially
from what I can see, cflowdmux starts OK. cflowd appears to start OK.
cfdcollect starts, but fails when it tries to connect to cflowd. I thought
I'd experiment a little, and on a working collector machine I telnetted to
port 2056 and was greeted with a stream of gibberish. So far so good. I
tried the same thing on one of the faulty machines, and got a "connection
refused" message. Aha, so I'm thinking that cflowd failed to attach to port
2056 for some reason. Hmm, maybe that port is still in use I thought, so I
tried changing the port used by cflowd/cfdcollect from 2056 to 2057.
Surprisingly (to me) this didn't seem to offer any improvement. See log
below.
(I'm running RedHat 6.0 for Intel by the way)
Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] cflowdmux (version
cflowd-2-1-b1) started.
Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created 2101248 byte packet
queue shmem segment {CflowdPacketQueue.cc:247}
Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] attached to 2101248 byte
packet queue at 0x40179000
Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created semaphore: id 1
Nov 20 11:15:23 oneofsix cflowd[29746]: [I] cflowd (version cflowd-2-1-b1)
started.
Nov 20 11:15:23 oneofsix cflowd[29746]: [I] got semaphore: id 1
Nov 20 11:15:23 oneofsix cflowd[29746]: [I] attached to 2101248 byte packet
queue at 0x40179000
Nov 20 11:15:24 oneofsix cfdcollect[29749]: [I] cfdcollect (version
cflowd-2-1-b1) started with 1 cflowd instances.
Nov 20 11:15:25 oneofsix cfdcollect[29749]: [E] connect(4,0x80d9e54,16)
(host localhost port 2057) failed: Connection refused {CflowdServer.cc:112}
Nov 20 11:15:25 oneofsix cfdcollect[29749]: [I] sleeping for 299 seconds.
At this point I can also run "netstat -an" and get the following...
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:6013 0.0.0.0:* LISTEN
tcp 0 0 <snip!>:22 <snip!>:1015 ESTABLISHED
tcp 0 0 0.0.0.0:6012 0.0.0.0:* LISTEN
tcp 0 0 <snip!>:22 <snip!>:1016 ESTABLISHED
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
udp 0 0 <snip!>:123 0.0.0.0:*
udp 0 0 127.0.0.1:123 0.0.0.0:*
udp 0 0 0.0.0.0:123 0.0.0.0:*
raw 0 0 0.0.0.0:1 0.0.0.0:* 7
raw 0 0 0.0.0.0:6 0.0.0.0:* 7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path
unix 0 [ ACC ] STREAM LISTENING 1618433 /dev/log
unix 1 [ ] STREAM CONNECTED 1271 @00000096
unix 1 [ ] STREAM CONNECTED 696 @0000007d
unix 1 [ ] STREAM CONNECTED 1745458 @000008d9
unix 0 [ ] STREAM CONNECTED 114 @00000011
unix 1 [ ] STREAM CONNECTED 1740952 @000008ba
unix 1 [ ] STREAM CONNECTED 1740762 @000008b8
unix 1 [ ] STREAM CONNECTED 1745465 @000008db
unix 1 [ ] STREAM CONNECTED 1745461 @000008da
unix 1 [ ] STREAM CONNECTED 1061740 @00000630
unix 1 [ ] STREAM CONNECTED 1745466 /dev/log
unix 1 [ ] STREAM CONNECTED 1745462 /dev/log
unix 1 [ ] STREAM CONNECTED 1745459 /dev/log
unix 1 [ ] STREAM CONNECTED 1740953 /dev/log
unix 1 [ ] STREAM CONNECTED 1740763 /dev/log
unix 1 [ ] STREAM CONNECTED 1061741 /dev/log
unix 1 [ ] STREAM CONNECTED 1272 /dev/log
unix 1 [ ] STREAM CONNECTED 697 /dev/log
I know that when I reboot everything will work fine once again, but I'd like
to get to the bottom of why this is happening. It must surely be some
resource that isn't being released, or that isn't available (for some other
reason) for the new instantiation of cflowd.
Any help much appreciated!
-Martin
-- "Buying a car because it's reliable is like marrying someone because they are punctual" - Jeremy Clarkson-- cflowd mailing list cflowd@caida.org
-- cflowd mailing list cflowd@caida.org
This archive was generated by hypermail 2b29 : Mon Nov 20 2000 - 07:55:41 PST