Problem restarting cflowd after it dies...

Date: Mon Nov 20 2000 - 03:54:25 PST

    Hi People,

    I'm having some ongoing problems with cflowd. Everything was apparently
    running smoothly until I started collecting netmatrix statistics. Since
    then the cflowd process has been dying somewhat mysteriously after a few
    days of continuous running. To avoid looking into the real problem I
    decided to take the quick and dirty way out by detecting when cflowd dies
    and then restarting it. From cron I check each hour to see how many cflowd
    processes are running, and if there are fewer than three, I do a
    "cfd restart" using the following script:


    #!/bin/sh
    #
    # cfd - start, stop or restart cflowdmux, cflowd and cfdcollect
    #
    # Version 1.02
    # 1.00 - Original release
    # 1.01 - Made the timing and signalling a little more graceful
    # 1.02 - Included restart option

    PATH=/usr/local/arts/sbin:${PATH} ; export PATH

    # Do not leave core files behind.
    ulimit -c 0

    case "$1" in
        start)
            # NOTE: the cflowdmux and cflowd start lines are missing from
            # the archived copy of this script. They are assumed here, with
            # no arguments, from the start order shown in the log below.
            cflowdmux
            sleep 1
            cflowd
            sleep 1
            cfdcollect /usr/local/arts/etc/cfdcollect.conf
            ;;

        stop)
            pid=`ps aux | grep cfdcollect | grep -v grep | awk '{ print $2 }'`
            echo "Stopping cfdcollect with PID $pid..."

            # This is a little tricky. There is no way to tell cfdcollect to
            # shut down gracefully. The problem is we do not want to
            # accidentally kill it when it is part way through writing to
            # disk, thereby creating a corrupt data file. We can trigger it
            # to reload its config file, and subsequently connect to cflowd.
            # If we wait a little while after this, until we are sure it has
            # had time to write to disk, then we should be safe to kill it.

            kill -HUP $pid
            echo "Pausing... Please be patient..."
            sleep 30
            kill -KILL $pid
            sleep 1

            # Match cflowd itself, but not cflowdmux.
            pid=`ps aux | grep cflowd | grep -v grep | grep -v cflowdmux |
                 awk '{ print $2 }'`
            echo "Stopping cflowd with PID $pid..."
            kill -HUP $pid
            sleep 1

            pid=`ps aux | grep cflowdmux | grep -v grep | awk '{ print $2 }'`
            echo "Stopping cflowdmux with PID $pid..."
            kill -INT $pid
            ;;

        restart)
            $0 stop
            sleep 1
            $0 start
            ;;

        *)
            echo "Usage: cfd {start|stop|restart}"
            exit 1
            ;;
    esac

    exit 0
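
The hourly cron check described before the script is not shown in the post. A minimal sketch of what it might look like follows. It assumes the three daemons appear in `ps` under the command names cflowdmux, cflowd and cfdcollect, and that the cfd script is on the PATH; the function names, the `--check` flag and the `EXPECTED` variable are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical watchdog for the cflowd daemons, intended to run from cron.
# Assumption: the daemons show up in ps as cflowdmux, cflowd and cfdcollect.

EXPECTED=3

# Read command names (one per line) on stdin and print how many of them
# are one of the three daemons.
count_daemons() {
    awk '$1 == "cflowdmux" || $1 == "cflowd" || $1 == "cfdcollect" { n++ }
         END { print n + 0 }'
}

# Succeed (exit status 0) when fewer than EXPECTED daemons are running.
needs_restart() {
    [ "$1" -lt "$EXPECTED" ]
}

# Only act when explicitly asked to, so the functions can be reused.
if [ "${1:-}" = "--check" ]; then
    running=`ps -e -o comm= | count_daemons`
    if needs_restart "$running"; then
        echo "only $running of $EXPECTED cflowd processes running, restarting"
        cfd restart
    fi
fi
```

Saved under a name of your choosing and called with `--check` from an hourly crontab entry, this would approximate the behaviour described. Matching on the exact command name avoids the `grep -v grep` step, although it still cannot tell a healthy daemon from a hung one.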

    I have been using the cfd script to manually start and stop cflowd for a
    while now and it seems to work quite well. However, the other day two of my
    collector machines (there are six in total, all recently upgraded from
    2.1-a9 to 2.1-b1) had problems with the restart of cflowd (after cron
    detected a missing cflowd process). I tried "cfd restart" myself and found
    the same strange log entries. From what I can see, cflowdmux starts OK and
    cflowd appears to start OK. cfdcollect starts, but fails when it tries to
    connect to cflowd. I thought I'd experiment a little, so on a working
    collector machine I telnetted to port 2056 and was greeted with a stream of
    gibberish. So far so good. I tried the same thing on one of the faulty
    machines and got a "connection refused" message. Aha, so I'm thinking that
    cflowd failed to bind to port 2056 for some reason. Thinking the port might
    still be in use, I tried changing the port used by cflowd/cfdcollect from
    2056 to 2057. Surprisingly (to me) this didn't offer any improvement. See
    the log below.

    (I'm running RedHat 6.0 for Intel by the way)

    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] cflowdmux (version cflowd-2-1-b1) started.
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created 2101248 byte packet queue shmem segment {}
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] attached to 2101248 byte packet queue at 0x40179000
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created semaphore: id 1
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] cflowd (version cflowd-2-1-b1)
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] got semaphore: id 1
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] attached to 2101248 byte packet queue at 0x40179000
    Nov 20 11:15:24 oneofsix cfdcollect[29749]: [I] cfdcollect (version cflowd-2-1-b1) started with 1 cflowd instances.
    Nov 20 11:15:25 oneofsix cfdcollect[29749]: [E] connect(4,0x80d9e54,16) (host localhost port 2057) failed: Connection refused {}
    Nov 20 11:15:25 oneofsix cfdcollect[29749]: [I] sleeping for 299 seconds.

    At this point I can also run "netstat -an" and get the following...

    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address Foreign Address State
    tcp 0 0* LISTEN
    tcp 0 0 <snip!>:22 <snip!>:1015 ESTABLISHED
    tcp 0 0* LISTEN
    tcp 0 0 <snip!>:22 <snip!>:1016 ESTABLISHED
    tcp 0 0* LISTEN
    udp 0 0 <snip!>:123*
    udp 0 0*
    udp 0 0*
    raw 0 0* 7
    raw 0 0* 7

    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags Type State I-Node Path
    unix 0 [ ACC ] STREAM LISTENING 1618433 /dev/log
    unix 1 [ ] STREAM CONNECTED 1271 @00000096
    unix 1 [ ] STREAM CONNECTED 696 @0000007d
    unix 1 [ ] STREAM CONNECTED 1745458 @000008d9
    unix 0 [ ] STREAM CONNECTED 114 @00000011
    unix 1 [ ] STREAM CONNECTED 1740952 @000008ba
    unix 1 [ ] STREAM CONNECTED 1740762 @000008b8
    unix 1 [ ] STREAM CONNECTED 1745465 @000008db
    unix 1 [ ] STREAM CONNECTED 1745461 @000008da
    unix 1 [ ] STREAM CONNECTED 1061740 @00000630
    unix 1 [ ] STREAM CONNECTED 1745466 /dev/log
    unix 1 [ ] STREAM CONNECTED 1745462 /dev/log
    unix 1 [ ] STREAM CONNECTED 1745459 /dev/log
    unix 1 [ ] STREAM CONNECTED 1740953 /dev/log
    unix 1 [ ] STREAM CONNECTED 1740763 /dev/log
    unix 1 [ ] STREAM CONNECTED 1061741 /dev/log
    unix 1 [ ] STREAM CONNECTED 1272 /dev/log
    unix 1 [ ] STREAM CONNECTED 697 /dev/log
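
Scanning a netstat listing like the one above by eye is error-prone, so a small filter can answer the question "is anything listening on the cflowd port?" directly. The sketch below assumes the Linux `netstat -an` column layout (protocol first, local address fourth, state last, with `:` separating address and port); the function name `port_listening` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` output on stdin and succeed
# (exit status 0) if a TCP socket is in LISTEN state on port $1.
# Assumption: Linux netstat layout, local address in column 4 as addr:port.
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            # The port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Example use:
#   netstat -an | port_listening 2056 && echo "something is listening on 2056"
```

This only reports whether some process holds the port in LISTEN state; it says nothing about which process that is or why a bind might have failed.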

    I know that when I reboot everything will work fine once again, but I'd like
    to get to the bottom of why this is happening. It must surely be some
    resource that isn't being released, or that isn't available (for some other
    reason) for the new instantiation of cflowd.

    Any help much appreciated!


    "Buying a car because it's reliable is like marrying
    someone because they are punctual" - Jeremy Clarkson

    -- cflowd mailing list

    This archive was generated by hypermail 2b29 : Mon Nov 20 2000 - 04:05:33 PST