Thanks to the many people who gave detailed
answers as to the probable state of these
processes.  I hope I have listed all of you 
at the end of this summary.
The answers came in two flavors: addressing the
hung processes, and addressing the TCP sockets
that they are keeping open.  The consensus was
that the child should close all the sockets it is
not using.  I will look into that.
(The children are perl scripts, but I fully expect
to be able to close the un-needed fd's.)
The hung process issue is less clear (to me).  
Several people clarified that I understood the
situation backwards.  Sometimes, a child can't
exit because the parent isn't calling wait().
If the parent exited, the process would be
inherited by init (PID=1), which does wait.
Since these processes were children of init
already, I assume it is the child itself that
is hung. 
I will try to add some of the suggestions to 
my wish-list for this application, though some
may not be possible, as we don't have the source
for all the modules.
Thanks again
Seth
----
Jon H. LaBadie sent a suggestion on how the application can prevent
this problem:

>>>Don't know if you have the source (I don't)
>>>but if so, open these files with a "close on
>>>exec" flag as part of the opening.  Then they
>>>will be closed when the child "exec's".
>>>jl

----
Bruce Zimmer wrote:

>>Most often the reason for this is that the
>>orphaned child process is suspended waiting on
>>I/O and will not check its signal stack until
>>the I/O completes.  One of the problems that you
>>might be having could be that the I/O a child is
>>waiting on might be a semaphore that the (former)
>>parent was holding, or the children are waiting
>>on a message queue for a message of some sort
>>from the parent.  In that case the child will
>>wait forever, and the only solution would be to
>>re-boot.

-----
Kevin Sheehan agreed that the processes I am troubled by are
themselves hung, and described how to learn which thread they are
hung in:

I would suggest finding out where the threads of the child processes
are in the kernel.  If you do:
echo "$<threadlist" | adb -k > /tmp/list
(remember that adb is not very bright when lining up headers and fields...)
you will get a list of all the threads in the system and where they are in the kernel. You can use the process args and address of the proc structure (ADDR field) given by "ps -el" to figure out which belong to the children.
I suspect you will find that the children have *not* exited, but are stuck someplace where they aren't receiving signals. You can then discuss that further with Sun Support...
Example:
Here is a part of the output in /tmp/list:
============== thread_id f5adf440
0xf5af1448: process args= /usr/sbin/aspppd -d 1
iwscn_dip+0x17b0:   lwp        proc       wchan
                    f5af3c38   f5af1448   f5d5ff0e
iwscn_dip+0x17e4:   sp         pc
                    fbf8ba68   cv_wait_sig_swap+0x180
?(?) + 0
cv_wait_sig_swap(0xfffd,0xf5d5fef8,0xf5adf440,0xf5adf4d0,0x0,0x1)
poll(0x2) + 8ac
============== thread_id f5adf2e0
0xf5d31b10: process args= /usr/sbin/rpcbind
iwscn_dip+0x1650:   lwp        proc       wchan
                    f5af3e40   f5d31b10   f5d5fe1e
iwscn_dip+0x1684:   sp         pc
                    fbf98a68   cv_wait_sig_swap+0x180
?(?) + 0
cv_wait_sig_swap(0xfffd,0xf5d5fe08,0xf5adf2e0,0xf5adf370,0x0,0x1)
Here is the output of "ps -el" looking for asppp. You will note that the ADDR/proc addresses match (f5af1448), which, together with the process args, makes it pretty easy to identify which thread belongs to which process.
You will also note that asppp is doing a flavor of cv_wait() that allows the reception of signals and is happy to die with kill -9.
bash# ps -el | grep asppp
 8 S 0 82 1 0 41 20 f5af1448 359 f5d5ff0e ? 0:00 aspppd
l & h,
kev

------
Michael Maciolek elaborated on the man entries for wait() and exit(),
regarding how a parent can be set up to respond to the SIGCHLD signal
even when it is not waiting:
These passages describe how a terminating child process is handled,
and it's the basis of Sun Tech Support's answer to you.  If you can
modify the application (if it's your own, or if you have the source),
you should add a signal handler which listens for a SIGCHLD signal and
executes a wait() on receiving one.  This will allow a zombie child to
finally convey its exit status to its parent, so it can then terminate
and be removed from the process table.

=================================
Michael Kriss offered two suggestions:
Have the child process close any fd's it does not need/use
Have the server fork a reaper process. The reaper process forks the child process (rather than the server). The reaper just traps/reaps all child processes that die:
Current               New

    server                server
      |                     |
     fork                  fork
    /    \                /    \
   P      C              P      R
   |      |              |     / \
  exit  daemon          exit  P   C
                              |   |
                            reap  daemon
michael
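
For the first suggestion (and Jon LaBadie's "close on exec" idea), a minimal C sketch follows. It assumes the server source can be changed. The helper names set_cloexec() and close_unneeded() are made up for illustration and do not come from any of the replies.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Mark fd close-on-exec, so the kernel closes it automatically when
 * the process calls one of the exec() functions.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Hypothetical helper for a child that does not exec: explicitly
 * close the inherited descriptors it will never use, such as the
 * server's listening socket. */
void close_unneeded(const int *fds, size_t count)
{
    for (size_t i = 0; i < count; i++)
        close(fds[i]);
}
```

The server would call set_cloexec() on its listening socket right after creating it. A child that never calls exec would instead call close_unneeded() straight after fork().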
------
As far as killing associated processes, vpopa suggested:

I would consider using ptree combined with awk and kill to find and
kill the child processes in question.

------
Mark sent a script:
This should do what you want. Be careful, there are grave accents AND single quotes. The grave accents are around the expression we evaluate for the for loop and the single quotes are used in the awk portion. Replace PID with the pid of the parent process you want to kill.
for i in `ps -ef | grep PID | egrep -v grep | awk '{print $2}' | sort -nr`
do
    kill $i    # you can use kill -9 here if you think you have to
done
Good Luck
Mark

-=-=-=-=-=-=--=-=
Harvey M Wamboldt gave several suggestions:

(1) Fork child processes before opening file descriptors.
(2) Set up a signal handler so the parent process will "wait" on
    terminating child processes,
(3) or do a double fork with the first fork terminating immediately.
    This will force child processes to be inherited by init(1m),
    which will "wait" on terminating processes.
(4) Have child close unneeded file descriptors.
Some of this is discussed in the UNIX FAQ.  I'd also recommend Stevens' "Advanced Programming in the UNIX Environment" and Stevens' "UNIX Network Programming".
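
The double fork in suggestion (3) could be sketched in C roughly as below. The function name double_fork() and its fork()-like return convention are assumptions made for this example, not something taken from the replies or from the application.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Double fork.  Returns 0 in the grandchild, 1 in the original
 * process, and -1 in the original process on error.  The intermediate
 * child exits at once, so the grandchild is orphaned and is inherited
 * (and eventually reaped) by init. */
int double_fork(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Intermediate child: fork the real worker, then exit. */
        pid_t worker = fork();
        if (worker == -1)
            _exit(1);
        if (worker == 0)
            return 0;          /* grandchild: caller does the work */
        _exit(0);
    }

    /* Original process: reap the short-lived intermediate child so
     * that it does not linger as a defunct entry. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;             /* second fork failed in the child */
    return 1;
}
```

The caller tests for a return value of 0, does the child's work there, and finishes with _exit(). The original process never has to wait() for the worker itself.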
Rgds,
-H-

-=-=-=-=-=
Hello Seth,

You probably already have an answer to this, but my guess as to how
to prevent the orphaned processes is to have the parent process
register to ignore the child termination signal via:
signal(SIGCLD,SIG_IGN);
I fought this problem for a few days last week (many "defunct" processes that wouldn't go away until the parent exited) until I read pp. 81-82 of Stevens' "UNIX Network Programming".
regards,
Bill Hunter
The U. of Alabama

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Seth's Original Message:

>>Last week, we had to shut down and restart our
>>application, but some of the children were slow
>>to die, so we used fuser -k on the log dir.
>>The processes did not die, and cannot be killed.
>>Because these orphans have an open file
>>descriptor, my application could not be restarted
>>(more on that below).
>>
>>Sun Tech support explained that a parent had to
>>receive the exit status from the child. If the
>>parent died, the child can't exit. My first
>>question is whether there will ever be a change
>>to this? My second question focusses on what to
>>do about it...prevent it? Has anyone written
>>a script to recursively kill all the children
>>and grandchildren of a proc, from the bottom-up?
>>And does it work?
>>
>>It is a serious problem for us, because the
>>child inherits its parent's open File
>>Descriptors, even though the child does not use
>>them. The child keeps these fd's open after it
>>is hung. If the file descriptor is a TCP socket,
>>and it is a well-known local service, it creates
>>a problem. The application cannot be restarted
>>on its normal port. In test, it may be possible
>>to switch to a different port, but in production,
>>it is really not a good idea.
The whole list of responders follows:
Michael Kriss <kriss@fnal.gov>
"Ram Kumar" <ramk1@excite.com>
Harvey Wamboldt <harvey@iotek.ns.ca>
Danny Johnson <djohnson@nbserv1.rsc.raytheon.com>
Val <vpopa@sms.mc.xerox.com>
"Studebaker, Mark R" <mark_studebaker@reyrey.com>
Jochen Bern <bern@TI.Uni-Trier.DE>
"Bruce R. Zimmer" <bzimmer@all-phase.com>
Michael Maciolek <mikem@leftbank.com>
Kevin Sheehan <kevin@joltin.com>
Bill Hunter <bill@bama.ua.edu>
"Vilain, Sam" <sam.vilain@nz.unisys.com>
Jon LaBadie <jon@jgcomp.com>
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:13:19 CDT