How to run a distributed test case which reboots one of the systems
The information in this article is not presented as a complete solution
but might be helpful
to someone who is attempting to solve a similar type of problem.
We are using TETware 3.2 to run distributed tests
on UNIX systems using the C API,
where system 000 is the master and systems 001 and 002
are two other systems participating in the distributed test.
We have a requirement where we need to shut down one of the systems
that is running the test.
Will the processes on the master system hang, or report errors
or any such messages, because one of the systems is shut down?
Can the other systems and the master continue to run the test?
Is it possible for the system to re-join the test after
it has been rebooted?
Assuming I don't include the sysid of the shut-down system in a call to
tet_remsync(),
will there be problems with the automatic sync calls that are performed
by the API?
First some background . . .
tcc maintains a connection with tccd
on each system for the lifetime of the test run.
The test case part on each system has a connection to the tetsyncd and
tetxresd servers on the master system.
The precise behaviour that you will observe depends on what TCP/IP does
when the machine at the other end shuts down.
If the machine that is shutting down closes the connections in an
orderly way (as would happen in a normal shutdown), then the connected
peers will get notification of the close in the normal way (EOF on read,
an error on write). A TETware process that sees a connection close will
regard this as an error condition and will take appropriate action.
In the case of tetsyncd, subsequent attempts by the other test case
parts to perform sync operations (automatic or user-defined) will fail
because, when the connection closes, tetsyncd
marks the system's sync state as DEAD.
By contrast, if the connections are not closed in an orderly way (as can
sometimes happen when a machine crashes), the connection will simply
hang for some period of time.
Synchronisation requests will time out, but other connections will wait
indefinitely for something to happen to the connection.
Now, to answer your questions . . .
The other systems will not be able to continue to run the test.
Test cases on the other systems will fail with an error condition at the
next automatic sync point.
It is not possible for the system to re-join the test after it has been rebooted.
Since TCP is used for the inter-process connections (which is stateful),
there is no way to restore the connection after a reboot.
The automatic sync calls will fail after one of the systems is shut down.
There is no mechanism for deleting a participating system from an
autosync event part-way through a test case's execution.
So, if you want to reboot (say) system 2, you should not include
system 2 in the system list that you pass to tet_remsync().
Perhaps you could try the following.
Instead of calling tet_remexec() from the main test case process,
call it from a child process on system 1.
When you do this, the API in the child process will set up its
own connection to system 2.
This will prevent the API in your test case from retaining state
information about system 2.
Be sure to do nothing in the parent process which would cause
the API to connect to system 2 before you call tet_remexec()
from the child.
(Basically this means not calling tet_remexec()
with a sysid argument of 2 from the parent.)
To create a child process, simply call tet_fork() with a NULL
parentproc argument and then call
tet_remexec(2, . . .) from the child.
(You should specify a zero validresults
argument and a suitably short timeout - say 30 seconds.)
By the time that tet_remexec()
returns, the remote process will have started.
So you can then immediately call tet_exit()
from the child process on system 1.
(The child process should exit with zero status if tet_remexec()
succeeded and non-zero if it did not.)
This will log off all the connected servers (in particular: the tccd
on system 2) and exit.
At this point the call to tet_fork()
will return in the parent on system 1.
The return value of tet_fork()
will indicate whether or not the call to tet_remexec()
was successful in the child.
Now you must make sure that the remote process on system 2 waits
for the child process on system 1 to exit.
Note that when the child process on system 1 exits, it will log
off the tccd on system 2 first.
When tccd sees the logoff it will send a SIGHUP
signal to the un-waited-for process that was started by tet_remexec().
So you should be sure to ignore SIGHUP
in this process.
In the remote process you will need to wait until the child process on
system 1 exits (thus closing the connection to the tccd on system 2)
before calling tet_logoff()
to close the connections back to the tetsyncd and tetxresd
servers on the master system.
Finally you can call reboot()
to reboot system 2.
You will need to call tet_remsync()
at various times so as to
ensure that all this happens in the correct order.
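Putting the steps above together, the test case part on system 1 might look something like the sketch below. This is illustrative C-style pseudocode rather than a drop-in test case: the syncpoint numbers, the helper name reboot_helper and the timeout values are invented, and the exact signatures of tet_fork(), tet_remexec(), tet_remsync() and tet_exit() should be checked against Chapter 8 of the TETware Programmers Guide before use.

```c
/* Sketch of the test case part on system 1 (illustrative only;
 * requires the TETware API library and headers to build). */
#include <unistd.h>
#include "tet_api.h"

#define SYNC_REMEXEC_DONE  101L   /* "syncpoint N" in the diagram below */
#define SYNC_CHILD_GONE    102L   /* "syncpoint N+1" */

static int sys2[] = { 2 };        /* the system that will be rebooted */
static char *helper_argv[] = { "reboot_helper", (char *) 0 };

static void child(void)
{
    /* launch the remote process from a child so that the API in the
     * parent never acquires state information about system 2 */
    if (tet_remexec(2, "reboot_helper", helper_argv) < 0) {
        tet_infoline("tet_remexec() failed");
        tet_exit(1);              /* non-zero status: tells the parent we failed */
    }

    /* let the remote process get as far as ignoring SIGHUP */
    tet_remsync(SYNC_REMEXEC_DONE, sys2, 1, 30, TET_SV_YES,
        (struct tet_synmsg *) 0);

    tet_exit(0);  /* logs off all servers, including the tccd on system 2 */
}

void tp1(void)
{
    /* NULL parentproc, zero validresults, 30-second timeout;
     * a non-zero child exit status means tet_remexec() failed */
    if (tet_fork(child, TET_NULLFP, 0, 30) != 0)
        return;                   /* API has already reported the problem */

    /* the child has exited and logged off tccd on system 2;
     * release the remote process from its blocked sync call */
    tet_remsync(SYNC_CHILD_GONE, sys2, 1, 30, TET_SV_YES,
        (struct tet_synmsg *) 0);

    sleep(30);                    /* give system 2 time to call reboot() */

    /* ... ping/sleep loop here: wait for system 2 to come back,
     * then a second tet_remexec()/tet_remwait() to resume work there */
}
```

The sync at syncpoint N exists so that the child does not log off tccd (which triggers the SIGHUP) before the remote process has had a chance to ignore that signal.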
The order of events will look something like this.
Events that are synchronised with each other are connected by <---> arrows.
System 1                                 System 2

Create a child process using
tet_fork() with a NULL parentproc
and zero validresults
(parent blocks in tet_fork() call,
waiting for child to exit)

In child process
Call tet_remexec() <-------------------> tccd forks and execs the
to launch a remote process on            remote process
system 2 that will reboot the system

tet_remexec() returns <----------------> Remote process controller calls
(if tet_remexec() returns -1, don't      tet_main()
sync but print a diagnostic and call
tet_exit() with a non-zero argument)     In remote process
                                         Call signal(SIGHUP, SIG_IGN)

Sync with system 2 to syncpoint N <----> Sync with system 1 to syncpoint N+1
(sync call returns)                      (sync call blocks)

Call tet_exit(0) <---------------------> (tccd sends SIGHUP to remote
(child logs off tccd on system 2         process which is ignored - process
and exits)                               stays blocked in sync call)

(call to tet_fork() returns in parent -
if child exit status is non-zero this
means that tet_remexec() has failed so
give up; the API has already reported
UNRESOLVED in this case)

Parent process continues
Sync with system 2 to syncpoint N+1 <--> (sync call returns)

Sleep a bit - wait for system 2          Call tet_logoff()
to call reboot()                         (no more API calls are allowed
                                         after this)
                                         Call reboot()
                                         (remote process and tccd get
                                         killed as system 2 goes down)

Enter the ping/sleep loop -
wait for system 2 to come back

ping loop ends <-----------------------> System restarts -
                                         a new instance of tccd becomes
Sleep a bit - wait for system 2          available once the system
to come up multi-user                    comes up multi-user

Call tet_remexec() to launch a           etc ...
different remote process on system 2;
this time, call tet_remwait() to wait
for the remote process to terminate
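For completeness, the remote process that runs on system 2 (launched by tet_remexec() and built with the TETware remote/child-process libraries) might be sketched as follows. Again this is illustrative only: the syncpoint number must match the one used on system 1, the tet_remsync() signature should be verified against the Programmers Guide, and the mechanism for rebooting the machine is entirely system-specific.

```c
/* Sketch of the remote "reboot helper" process on system 2
 * (illustrative only; requires the TETware libraries to build). */
#include <signal.h>
#include <stdlib.h>
#include "tet_api.h"

#define SYNC_CHILD_GONE 102L      /* "syncpoint N+1" in the diagram */

static int sys1[] = { 1 };

int tet_main(int argc, char **argv)
{
    /* tccd will send SIGHUP when the child on system 1 logs off;
     * this process must survive that signal */
    signal(SIGHUP, SIG_IGN);

    /* syncing to N+1 satisfies the child's sync at point N, then blocks
     * here until the parent on system 1 syncs to N+1 - i.e. until the
     * child on system 1 has exited and logged off tccd */
    tet_remsync(SYNC_CHILD_GONE, sys1, 1, 60, TET_SV_YES,
        (struct tet_synmsg *) 0);

    /* close the connections back to the servers on the master system;
     * no more API calls are allowed after this */
    tet_logoff();

    /* finally take the system down; the interface to reboot() is
     * system-specific, so invoking the shutdown command is often the
     * more portable choice */
    system("/sbin/shutdown -r now");
    return 0;                     /* not reached */
}
```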
This suggestion was offered speculatively and had not been tried out at
the time of writing.
But a subsequent message from the recipient indicated that a strategy
based on this suggestion had in fact been successful.
See also: the descriptions of tet_fork(), tet_remexec(), tet_remwait(),
tet_remsync(), tet_logoff() and tet_exit()
in Chapter 8 of the TETware Programmers Guide, and the section entitled
"Distributed TETware architecture" in the TETware User Guide.