Information
16. How to run a distributed test case which reboots one of the systems
The information in this article is not presented as a complete solution
but might be helpful
to someone who is attempting to solve a similar type of problem.
Question
We are using TETware 3.2 for running distributed tests
on UNIX systems using the
:remote:
directive.
For example:
:remote,000,001,002:
where 000 is the master and 001, 002
are two other systems participating in the distributed test.
We have a requirement where we need to shut down one of the systems
that is running the test.
-
Will
tcc
on the master system hang or report
ER_TIMEDOUT
or a similar message because one of the systems has been shut down?
Can the other systems and master continue to run the test?
-
Is it possible for the system to re-join the test if
it is rebooted again?
-
Assuming I don't include the sysid in a call to
tet_remsync()
after the
system is shut down,
will there be problems with the automatic sync calls that are performed
by the API?
Answer
First some background . . .
tcc
maintains a connection with
tccd
on each system for the lifetime of
the scenario.
The test case on each system has a connection to
tetsyncd
and
tetxresd
on the master system.
The precise behaviour that you will observe depends on what TCP/IP does
when the machine at the other end shuts down.
If the machine that is shutting down closes the connections in an
orderly way (as would happen in a normal shutdown), then the connected
peers will get notification of the close in the normal way (EOF on read,
SIGPIPE
on write).
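The orderly-close notifications described above can be observed with a few lines of C.
This is only a minimal sketch: it uses a local AF_UNIX socketpair() as a stand-in for a TCP connection, and the function name probe_closed_peer() is made up for illustration and is not part of TETware.
SIGPIPE is ignored for the duration of the write so that the failure shows up as an EPIPE error instead of terminating the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a TETware function): close one end of a
 * connected socket pair, then read from and write to the other end.
 * Stores the read() result in *nread and the errno from the failed
 * write() in *write_errno (0 if the write unexpectedly succeeded).
 * Returns 0 if the probe ran, -1 if the socket pair could not be created.
 */
int probe_closed_peer(long *nread, int *write_errno)
{
    int sv[2];
    char buf[1];
    void (*old_handler)(int);

    assert(nread != NULL && write_errno != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    close(sv[1]);               /* the "peer" closes in an orderly way */

    /* A read on the surviving end reports end-of-file as 0 bytes. */
    *nread = (long)read(sv[0], buf, sizeof buf);

    /* With SIGPIPE ignored, the write fails with an error instead. */
    old_handler = signal(SIGPIPE, SIG_IGN);
    errno = 0;
    *write_errno = (write(sv[0], "x", 1) < 0) ? errno : 0;
    if (old_handler != SIG_ERR)
        signal(SIGPIPE, old_handler);   /* restore previous disposition */

    close(sv[0]);
    return 0;
}
```

A caller would expect *nread to be 0 (EOF) and *write_errno to be EPIPE; a server that treats a closed connection as an error condition is reacting to exactly these two indications.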
Each process
(tcc,
tetsyncd,
tetxresd)
that sees a connection close will
regard this as an error condition and will take appropriate action.
In the case of
tetsyncd,
subsequent attempts by the other test case
parts to perform sync operations (automatic or user-defined) will fail
because when the connection closes,
tetsyncd
marks the system's sync state as DEAD.
By contrast, if the connections are not closed in an orderly way (as can
sometimes happen when a machine crashes), the connection will simply
hang for some period of time.
Synchronisation requests will time out, but processes using the other
connections will wait indefinitely for something to happen.
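The difference between a bounded wait and an indefinite one can be illustrated with poll().
The sketch below is not how TETware implements its timeouts; wait_readable() and demo_silent_peer() are hypothetical names, and a local socket pair whose other end stays open but silent stands in for a connection to a crashed machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * Returns 1 if data (or EOF) is available, 0 on timeout, -1 on error.
 * If poll() is interrupted by a signal the wait is simply restarted.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    do {
        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/*
 * Demonstration: the peer is open but silent, so the first wait times
 * out; after the peer sends one byte the second wait succeeds.
 * The two results are stored in *before and *after.
 */
int demo_silent_peer(int timeout_ms, int *before, int *after)
{
    int sv[2];

    assert(before != NULL && after != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    *before = wait_readable(sv[0], timeout_ms);  /* nothing has arrived */
    if (write(sv[1], "x", 1) != 1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    *after = wait_readable(sv[0], timeout_ms);   /* one byte is pending */

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

A plain blocking read() on the same descriptor would have no such upper bound, which corresponds to the "wait indefinitely" case.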
Now, to answer your questions . . .
-
The other systems will not be able to continue to run the test.
Test cases on the other systems will fail with an error condition at the
next automatic sync point.
-
It is not possible for the system to re-join the test after it has
rebooted.
Since the inter-process connections use TCP, which is stateful,
there is no way to restore a connection after a reboot.
-
The automatic sync calls will fail after one of the systems is
rebooted.
There is no mechanism for deleting a participating system from an
autosync event part-way through a test case's execution.
So, if you want to reboot (say) system 2, you should not include
system 2 in the system list that you pass to the
:remote:
directive.
Perhaps you could try the following:
-
Instead, you can call
tet_remexec()
from a child process on system 1 to start the process on system 2
that will perform the reboot.
When you do this, the API in the child process will set up its
own connection to system 2.
This will prevent the API in your test case from retaining state
information about system 2.
Be sure to do nothing in the parent process which would cause
the API to connect to system 2 before you call
tet_remexec()
from the child.
(Basically this means not calling
tet_remexec()
or
tet_remtime()
with a sysid argument of 2 from the parent.)
-
To create a child process, simply call
tet_fork()
with a NULL
parentproc
argument and then call
tet_remexec(2, . . .)
from the
childproc
function.
(You should specify a zero
validresults
argument and a suitably short timeout - say 30 seconds.)
-
By the time that
tet_remexec()
returns, the remote process will have started.
So you can then immediately call
tet_exit()
from the child process on system 1.
(The child process should exit with zero status if
tet_remexec()
succeeded and non-zero if
tet_remexec()
failed.)
This will log off all the connected servers (in particular: the
tccd
on system 2) and exit.
At this point the call to
tet_fork()
will return in the parent on system 1.
The return value of
tet_fork()
will indicate whether or not the call to
tet_remexec()
was successful in the child.
-
Now you must make sure that the remote process on system 2 waits
for the child process on system 1 to exit.
Note that when the child process on system 1 exits, it will log
off
tccd
on system 2 first.
When
tccd
sees the logoff it will send a
SIGHUP
signal to the un-waited-for process that was started by
tet_remexec().
So you should be sure to ignore
SIGHUP
in this process.
You will need to wait until the child process on system 1
exits (thus closing the connection to
tccd).
Then call
tet_logoff()
to close the connections back to the
tetsyncd
and
tetxresd
servers on the master system.
Finally you can call
reboot()
to reboot system 2.
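The effect of ignoring SIGHUP, and of reporting success through the exit status, can be tried out with plain POSIX calls before involving TETware at all.
In this sketch fork(), kill() and waitpid() stand in for what tet_fork(), tccd and the API would do; the function run_child() is hypothetical and no TETware calls are used.
Two pipes make sure the signal is sent only after the child has set its signal disposition.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start a child, send it SIGHUP once it is ready, then let it finish.
 * If ignore_hup is non-zero the child ignores SIGHUP first.
 * Returns the child's exit status, or -1 on failure or if the child was
 * killed by a signal (in which case *termsig holds the signal number).
 */
int run_child(int ignore_hup, int *termsig)
{
    int ready[2], go[2], status;
    char c = 'r';
    pid_t pid;

    assert(termsig != NULL);
    *termsig = 0;

    if (pipe(ready) < 0)
        return -1;
    if (pipe(go) < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        close(go[0]);
        close(go[1]);
        return -1;
    }

    if (pid == 0) {                         /* child */
        close(ready[0]);
        close(go[1]);
        signal(SIGHUP, ignore_hup ? SIG_IGN : SIG_DFL);
        if (write(ready[1], &c, 1) != 1)    /* tell the parent we are ready */
            _exit(2);
        while (read(go[0], &c, 1) > 0)      /* block until the parent says go */
            ;
        _exit(0);
    }

    /* parent */
    close(ready[1]);
    close(go[0]);
    if (read(ready[0], &c, 1) == 1)         /* wait until the child is ready */
        kill(pid, SIGHUP);                  /* plays the part of tccd's SIGHUP */
    close(ready[0]);
    close(go[1]);                           /* EOF lets a surviving child exit */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status)) {
        *termsig = WTERMSIG(status);
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With ignore_hup set, the child survives the hangup and exits with status 0; without it, the child is terminated by SIGHUP before it can do anything further, which is what would happen to the remote process on system 2 if it did not ignore the signal.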
You will need to call
tet_remsync()
at various times so as to
ensure that all this happens in the correct order.
The order of events will look something like this.
Events that are synchronised are connected by
<----->.
System 1                                    System 2
---------------------------------           ---------------------------------
Create a child process using
tet_fork() with a NULL parentproc
and zero validresults
(parent blocks in tet_fork() call,
waiting for child to exit)

In child process
----------------
Call tet_remexec() <----------------------> tccd forks and execs the
to launch a remote process on               remote process
system 2 that will reboot the system

tet_remexec() returns <-------------------> Remote process controller calls
(if tet_remexec returns -1, don't sync      tet_main()
but print diagnostic and call
tet_exit(1))

                                            In remote process
                                            -----------------
                                            Call signal(SIGHUP, SIG_IGN)

Sync with system 2 to syncpoint N <-------> Sync with system 1 to syncpoint N+1
(sync call returns)                         (sync call blocks)

Call tet_exit(0) <------------------------> (tccd sends SIGHUP to remote
(child logs off tccd on system 2            process which is ignored - process
and exits)                                  stays blocked in sync call)

(call to tet_fork() returns in parent -
if child exit status is non-zero this means
that tet_remexec() has failed so give up;
the API has already reported UNRESOLVED
in this case)

Parent process continues
------------------------
Sync with system 2 to syncpoint N+1 <-----> (sync call returns)
Sleep a bit - wait for system 2             call tet_logoff()
to call reboot()                            (no more API calls are allowed after
                                            this point!)
                                            call reboot()
                                            (remote process and tccd get killed
                                            as system 2 goes down)
                                            ====================================
Enter the ping/sleep loop -
wait for system 2 to come back
up again

ping loop ends <--------------------------> System restarts -
                                            a new instance of tccd becomes
Sleep a bit - wait for system 2             available once the system
to come up multi-user                       comes up multi-user

Call tet_remexec() to launch a              etc ...
different remote process on system 2;
this time, call tet_remwait() to wait
for the remote process to terminate
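The ping/sleep loop in the parent process could take many forms.
One possibility, sketched here under stated assumptions, is to probe a TCP port on system 2 that is known to accept connections once the system is up, sleeping between attempts.
The names try_connect() and wait_for_host() are hypothetical, the choice of address and port is left to the caller, and a TCP connect is used here in place of an ICMP ping.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try once to open a TCP connection; return 0 on success, -1 otherwise. */
static int try_connect(const char *ipv4, unsigned short port)
{
    struct sockaddr_in sa;
    int fd, rc;

    assert(ipv4 != NULL);

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &sa.sin_addr) != 1)
        return -1;              /* not a valid dotted-decimal address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    rc = connect(fd, (struct sockaddr *)&sa, sizeof sa);
    close(fd);
    return rc == 0 ? 0 : -1;
}

/*
 * Probe the given address and port up to max_tries times, sleeping
 * delay_sec seconds between attempts.
 * Returns the number of attempts used on success, or -1 if every
 * attempt failed.
 */
int wait_for_host(const char *ipv4, unsigned short port,
                  int max_tries, unsigned delay_sec)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (try_connect(ipv4, port) == 0)
            return i + 1;
        if (i + 1 < max_tries)
            sleep(delay_sec);
    }
    return -1;
}
```

Note that a blocking connect() to a machine that is completely down can itself take a long time to fail, so the total waiting time may exceed max_tries multiplied by delay_sec. As in the sequence above, it is still worth sleeping a little after the loop ends so that the system has time to come up multi-user.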
Footnote
This suggestion was offered speculatively and had not been tried out at
the time of writing.
But a subsequent message from the recipient indicated that a strategy
based on this suggestion had in fact been successful.
See also
-
The descriptions of
tet_fork(),
tet_remsync(),
tet_remexec(),
tet_exit()
and
tet_logoff()
in Chapter 8 of the TETware Programmers Guide.
-
"Distributed TETware architecture" and
"TETware programs"
in the TETware User Guide.