RE: Problem restarting cflowd after it dies...

From: Martin_Nieuwelaar@infonet-europe.com
Date: Tue Nov 21 2000 - 03:29:13 PST

  • Next message: Luong Tang: "RE: cflowdmux not collecting data, with no error messages"

    Hi Steve,
     
    The problem I am experiencing is not new to 2.1-b1. It has been happening
    for a while now. I think I started enabling netmatrix stats back when I was
    running 2.1-a3 (from memory). Sadly, when my cflowd dies it doesn't write
    anything in the log. Indeed, I was hoping for some sort of [E] (error)
    entry in the syslog, but the last thing logged always seems to be a regular
    [I] (information) entry.
     
    One thing I have noticed since upgrading to 2.1-b1 is a lot of semop errors
    along the lines of:
     
    Nov 15 15:05:32 oneofsix cflowd[31680]: [E] semop(1,0xbfe67aa4,1) failed to
    release buffer lock: Resource temporarily unavailable
    {CflowdPacketQueue.cc:438}
    Nov 15 15:06:08 oneofsix last message repeated 3 times
    Nov 15 15:06:30 oneofsix last message repeated 2 times
    Nov 15 15:07:56 oneofsix cfdcollect[31683]: [I] awakened by alarm.
    Nov 15 15:07:57 oneofsix cfdcollect[31683]: [I] connected to localhost:2056
    Nov 15 15:08:00 oneofsix cfdcollect[31683]: [I] localhost has data for 7
    routers.
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 40 of 21245 flows from
    (null) engine 1075279856 agg_method 0 (-3.13308e-39% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 30 of 241039 flows from
    (null) engine 1075279856 agg_method 0 (-4.67391e+26% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 260 of 35763 flows from
    (null) engine 1075279856 agg_method 4 (1.72512e-262% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 8 of 1803 flows from
    (null) engine 1075279856 agg_method 2 (2.91823e+171% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 2460 of 225403 flows from
    (null) engine 1075279856 agg_method 3 (1.17711e+133% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 1140 of 237052 flows from
    (null) engine 1075279856 agg_method 9 (-1911.05% loss)
    Nov 15 15:08:00 oneofsix cflowd[31680]: [I] missed 30 of 11435 flows from
    (null) engine 1075279856 agg_method 10 (-4.12301e+102% loss)

    (As an added point of interest, my missed-flows percentage calculation
    appears faulty.)
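The percentages in the log above are clearly garbage (values like -3.13308e-39% should be small positive numbers). The arithmetic they should reflect is trivial and easy to sanity-check from the shell, assuming "missed 40 of 21245" means 40 flows lost out of 21245 (an assumption, since the counters' exact meaning isn't shown here):

```shell
# Sanity-check the expected loss figure for the first log entry,
# assuming 40 flows missed out of a total of 21245:
awk 'BEGIN { printf "%.4f%% loss\n", 100 * 40 / 21245 }'
# prints: 0.1883% loss
```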
     

    -- 
    "Buying a car because it's reliable is like marrying 
    someone because they are punctual" - Jeremy Clarkson 
    

    -----Original Message-----
    From: Nunes, Steve [mailto:Steve.Nunes@coreexpress.net]
    Sent: Monday, 20 November 2000 16:26
    To: 'cflowd@caida.org'
    Subject: RE: Problem restarting cflowd after it dies...

    Martin,

    I don't know about your connection problem, but I know of a possibility for
    the original problem where cflowd dies after a few days. Did the problem
    start after switching to 2.1-b1? What does the syslog say when it crashes?

    I haven't looked closely at the new (2.1-b1) CflowdPacketQueue.cc but the problem could be related to the SEM_UNDO flag on a semaphore. If the process that gives the semaphore (e.g. cflowdmux) is not the same process that takes the semaphore (e.g. cflowd), the semaem count will not get decremented properly, since each process keeps its own count. When the system semaem limit is reached after several hours, the process goes down.

    If it turns out we think this is the problem, you could just take the SEM_UNDO flags out of CflowdPacketQueue.cc (replace them with 0) and see if your problem goes away.
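Steve's textual replacement can be done mechanically. A sketch of the experiment (the helper name is made up, and the blind substitution should be reviewed with diff before rebuilding):

```shell
# Crude sketch of the suggested experiment: emit a copy of
# CflowdPacketQueue.cc with every SEM_UNDO flag replaced by 0.
# This is a blind textual replace -- review the diff before
# rebuilding cflowd.
strip_sem_undo() {
    sed 's/SEM_UNDO/0/g' "$1"
}

# Usage (run in the directory holding CflowdPacketQueue.cc):
#   strip_sem_undo CflowdPacketQueue.cc > CflowdPacketQueue.cc.new
#   diff -u CflowdPacketQueue.cc CflowdPacketQueue.cc.new
```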

    Steve Nunes
    CoreExpress Inc.
    E-mail: steve.nunes@coreexpress.net

    -----Original Message-----
    From: Martin_Nieuwelaar@infonet-europe.com [mailto:Martin_Nieuwelaar@infonet-europe.com]
    Sent: Monday, November 20, 2000 5:54 AM
    To: cflowd@caida.org
    Subject: Problem restarting cflowd after it dies...

    Hi People,

    I'm having some ongoing problems with cflowd. Everything was running smoothly apparently until I started collecting netmatrix statistics. Since then I have had the cflowd process die somewhat mysteriously after a few days of continuous running. To avoid looking into the real problem I decided to take the quick and dirty way out by detecting when cflowd dies, and then restarting it. From cron I check each hour to see how many cflowd processes are running, and if it's less than three, I do a "cfd restart" according to the following script:

    #!/bin/sh

    # Version 1.02
    #
    # 1.00 - Original release
    # 1.01 - Made the timing and signalling a little more graceful
    # 1.02 - Included restart option

    PATH=/usr/local/arts/sbin:${PATH} ; export PATH

    ulimit -c 0

    case "$1" in
    start)
        cflowdmux
        sleep 1
        cflowd
        sleep 1
        cfdcollect /usr/local/arts/etc/cfdcollect.conf
        ;;
    stop)
        pid=`ps aux | grep cfdcollect | grep -v grep | awk '{ print $2 }'`
        echo "Stopping cfdcollect with PID $pid..."

        # This is a little tricky. There is no way to tell cfdcollect to
        # shut down gracefully. The problem is we do not want to
        # accidentally kill it when it is part way through writing to
        # disk, thereby creating a corrupt data file. We can trigger it
        # to reload its config file, and subsequently connect to cflowd.
        # If we wait a little while after this until we are sure it has
        # had time to write to disk, then we should be safe to kill it.
        kill -SIGHUP $pid
        echo "Pausing... Please be patient..."
        sleep 30
        kill -SIGKILL $pid
        sleep 1

        pid=`ps aux | grep cflowd | grep -v grep | grep -v cflowdmux | awk '{ print $2 }'`
        echo "Stopping cflowd with PID $pid..."
        kill -SIGHUP $pid
        sleep 1

        pid=`ps aux | grep cflowdmux | grep -v grep | awk '{ print $2 }'`
        echo "Stopping cflowdmux with PID $pid..."
        kill -SIGINT $pid
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        echo "Usage: cfd {start|stop|restart}"
        exit 1
    esac

    exit 0
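The hourly cron check that drives "cfd restart" could be sketched roughly like this (a guess at its shape, not the actual crontab; the threshold of three corresponds to one cflowdmux, one cflowd and one cfdcollect):

```shell
# Sketch of the hourly watchdog described above: count how many of
# the three collector processes (cflowdmux, cflowd, cfdcollect)
# show up in ps, and restart the lot if any have died.

# Reads `ps aux` output on stdin and prints the number of collector
# processes (grep -v grep drops the grep/egrep processes themselves).
count_collectors() {
    grep -v grep | egrep -c 'cflowdmux|cflowd|cfdcollect'
}

# From cron, run a small script that does (assumes cfd is on PATH):
#   [ "`ps aux | count_collectors`" -lt 3 ] && cfd restart
```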

    I have been using the cfd script to manually start and stop cflowd for a
    while now and it seems to work quite well. However, the other day two of my
    collector machines (there are six in total, which I have also recently
    upgraded from 2.1-a9 to 2.1-b1) seem to have had problems with the restart
    of cflowd (after cron detected a missing cflowd process). I tried "cfd
    restart" myself and found the same strange log entries. Essentially, from
    what I can see, cflowdmux starts OK, cflowd appears to start OK, and
    cfdcollect starts but fails when it tries to connect to cflowd.

    I thought I'd experiment a little, and on a working collector machine I
    telnetted to port 2056 and was greeted with a stream of gibberish. So far
    so good. I tried the same thing on one of the faulty machines and got a
    "connection refused" message. Aha, so I'm thinking that cflowd failed to
    attach to port 2056 for some reason. Hmm, maybe that port is still in use,
    I thought, so I tried changing the port used by cflowd/cfdcollect from
    2056 to 2057. Surprisingly (to me) this didn't seem to offer any
    improvement. See log below.

    (I'm running RedHat 6.0 for Intel by the way)

    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] cflowdmux (version cflowd-2-1-b1) started.
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created 2101248 byte packet queue shmem segment {CflowdPacketQueue.cc:247}
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] attached to 2101248 byte packet queue at 0x40179000
    Nov 20 11:15:22 oneofsix cflowdmux[29743]: [I] created semaphore: id 1
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] cflowd (version cflowd-2-1-b1) started.
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] got semaphore: id 1
    Nov 20 11:15:23 oneofsix cflowd[29746]: [I] attached to 2101248 byte packet queue at 0x40179000
    Nov 20 11:15:24 oneofsix cfdcollect[29749]: [I] cfdcollect (version cflowd-2-1-b1) started with 1 cflowd instances.
    Nov 20 11:15:25 oneofsix cfdcollect[29749]: [E] connect(4,0x80d9e54,16) (host localhost port 2057) failed: Connection refused {CflowdServer.cc:112}
    Nov 20 11:15:25 oneofsix cfdcollect[29749]: [I] sleeping for 299 seconds.

    At this point I can also run "netstat -an" and get the following...

    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State
    tcp        0      0 0.0.0.0:6013            0.0.0.0:*               LISTEN
    tcp        0      0 <snip!>:22              <snip!>:1015            ESTABLISHED
    tcp        0      0 0.0.0.0:6012            0.0.0.0:*               LISTEN
    tcp        0      0 <snip!>:22              <snip!>:1016            ESTABLISHED
    tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
    udp        0      0 <snip!>:123             0.0.0.0:*
    udp        0      0 127.0.0.1:123           0.0.0.0:*
    udp        0      0 0.0.0.0:123             0.0.0.0:*
    raw        0      0 0.0.0.0:1               0.0.0.0:*               7
    raw        0      0 0.0.0.0:6               0.0.0.0:*               7

    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags    Type    State      I-Node  Path
    unix  0     [ ACC ]   STREAM  LISTENING  1618433 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1271    @00000096
    unix  1     [ ]       STREAM  CONNECTED  696     @0000007d
    unix  1     [ ]       STREAM  CONNECTED  1745458 @000008d9
    unix  0     [ ]       STREAM  CONNECTED  114     @00000011
    unix  1     [ ]       STREAM  CONNECTED  1740952 @000008ba
    unix  1     [ ]       STREAM  CONNECTED  1740762 @000008b8
    unix  1     [ ]       STREAM  CONNECTED  1745465 @000008db
    unix  1     [ ]       STREAM  CONNECTED  1745461 @000008da
    unix  1     [ ]       STREAM  CONNECTED  1061740 @00000630
    unix  1     [ ]       STREAM  CONNECTED  1745466 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1745462 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1745459 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1740953 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1740763 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1061741 /dev/log
    unix  1     [ ]       STREAM  CONNECTED  1272    /dev/log
    unix  1     [ ]       STREAM  CONNECTED  697     /dev/log

    I know that when I reboot everything will work fine once again, but I'd
    like to get to the bottom of why this is happening. It must surely be some
    resource that isn't being released, or that isn't available (for some
    other reason) to the new instantiation of cflowd.
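One candidate for such a leaked resource, offered as a guess: the System V shared memory segment and semaphore that cflowdmux creates (visible in the startup log above) persist after their creating process dies, until a reboot or an explicit removal. The log shows the queue being re-created successfully on restart, so this may not be the culprit here, but it is worth ruling out before rebooting:

```shell
# SysV IPC objects outlive the processes that created them, so a
# crashed cflowd/cflowdmux can leave its packet-queue segment and
# semaphore behind. List them and look for cflowd's leftovers
# (e.g. a 2101248-byte segment owned by the cflowd user):
ipcs -m    # shared memory segments
ipcs -s    # semaphore sets

# Then remove any stale ones by id -- double-check owner and size
# first. Old util-linux syntax is "ipcrm shm <id>" / "ipcrm sem <id>";
# newer systems use "ipcrm -m <id>" / "ipcrm -s <id>".
```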

    Any help much appreciated!

    -Martin

    -- "Buying a car because it's reliable is like marrying someone because they are punctual" - Jeremy Clarkson

    -- 
    cflowd mailing list
    cflowd@caida.org




    This archive was generated by hypermail 2b29 : Tue Nov 21 2000 - 03:48:04 PST