
TETware Knowledgebase



16. How to run a distributed test case which reboots one of the systems

The information in this article is not presented as a complete solution but might be helpful to someone who is attempting to solve a similar type of problem.

 

Question

We are using TETware 3.2 for running distributed tests on UNIX systems using the :remote: directive. For example: :remote,000,001,002: where 000 is the master and 001, 002 are two other systems participating in the distributed test.

We have a requirement to shut down one of the systems that is running the test.

  1. Will tcc on the master system hang, or report ER_TIMEDOUT or a similar message, because one of the systems has shut down?

    Can the other systems and the master continue to run the test?



  2. Is it possible for the system to re-join the test if it is rebooted again?


  3. Assuming I don't include the sysid in a call to tet_remsync() after the system has shut down, will there be problems with the automatic sync calls that are performed by the API?


Answer

First some background . . .
tcc maintains a connection with tccd on each system for the lifetime of the scenario. The test case on each system has a connection to tetsyncd and tetxresd on the master system.

The precise behaviour that you will observe depends on what TCP/IP does when the machine at the other end shuts down. If the machine that is shutting down closes the connections in an orderly way (as would happen in a normal shutdown), then the connected peers will get notification of the close in the normal way (EOF on read, SIGPIPE on write). Each process (tcc, tetsyncd, tetxresd) that sees a connection close will regard this as an error condition and will take appropriate action. In the case of tetsyncd, subsequent attempts by the other test case parts to perform sync operations (automatic or user-defined) will fail because when the connection closes, tetsyncd marks the system's sync state as DEAD.

By contrast, if the connections are not closed in an orderly way (as can sometimes happen when a machine crashes), the connection will simply hang for some period of time. Synchronisation requests will time out, but other connections will wait indefinitely for something to happen to the connection.
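The orderly-close case can be observed with a few lines of plain POSIX C. What follows is only a minimal sketch and is not taken from TETware: it uses a local socketpair() as a stand-in for a TCP connection, and the names close_probe and probe_orderly_close are hypothetical. Note that on a real TCP connection the first write after the peer has closed may appear to succeed, with the error showing up only on a later write.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* What the surviving end of a connection sees after an orderly close. */
struct close_probe {
    ssize_t read_result;   /* 0 means EOF */
    int     write_errno;   /* EPIPE when the peer has gone away */
};

/*
 * Create a connected pair of stream sockets, close one end, and record
 * what the other end sees on read and on write.
 * Returns 0 on success, -1 if the socket pair could not be created.
 */
int probe_orderly_close(struct close_probe *out)
{
    int sv[2];
    char buf[1];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Ignore SIGPIPE so that write() fails with EPIPE instead of
       terminating the process (which is the default action). */
    signal(SIGPIPE, SIG_IGN);

    close(sv[1]);                       /* the "remote" end shuts down */

    out->read_result = read(sv[0], buf, sizeof buf);    /* EOF expected */

    errno = 0;
    if (write(sv[0], "x", 1) == -1)
        out->write_errno = errno;       /* EPIPE expected */
    else
        out->write_errno = 0;

    close(sv[0]);
    return 0;
}
```

A server that treats either of these results as "peer gone" behaves in the way described above for an orderly shutdown. Neither result is delivered when the peer simply vanishes, which is why that case hangs instead.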

Now, to answer your questions . . .

  1. The other systems will not be able to continue to run the test. Test cases on the other systems will fail with an error condition at the next automatic sync point.


  2. It is not possible for the system to re-join the test after it has rebooted. The inter-process connections use TCP, which is stateful, so there is no way to restore a connection after a reboot.


  3. The automatic sync calls will fail after one of the systems is rebooted. There is no mechanism for deleting a participating system from an autosync event part-way through a test case's execution.


So, if you want to reboot (say) system 2, you should not include system 2 in the system list that you pass to the :remote: directive.


Perhaps you could try the following:

  1. Call tet_remexec() from a child process on system 1 instead. When you do this, the API in the child process will set up its own connection to system 2. This will prevent the API in your test case from retaining state information about system 2. Be sure to do nothing in the parent process which would cause the API to connect to system 2 before you call tet_remexec() from the child. (Basically this means not calling tet_remexec() or tet_remtime() with a sysid argument of 2 from the parent.)


  2. To create a child process, simply call tet_fork() with a NULL parentproc argument and then call tet_remexec(2, . . .) from the childproc function. (You should specify a zero validresults argument and a suitably short timeout - say 30 seconds.)


  3. By the time that tet_remexec() returns, the remote process will have started. So you can then immediately call tet_exit() from the child process on system 1. (The child process should exit with zero status if tet_remexec() succeeded and non-zero if tet_remexec() failed.) This will log off all the connected servers (in particular: the tccd on system 2) and exit. At this point the call to tet_fork() will return in the parent on system 1. The return value of tet_fork() will indicate whether or not the call to tet_remexec() was successful in the child.


  4. Now you must make sure that the remote process on system 2 waits for the child process on system 1 to exit. Note that when the child process on system 1 exits, it will log off tccd on system 2 first. When tccd sees the logoff it will send a SIGHUP signal to the un-waited-for process that was started by tet_remexec(). So you should be sure to ignore SIGHUP in this process. You will need to wait until the child process on system 1 exits (thus closing the connection to tccd). Then call tet_logoff() to close the connections back to the tetsyncd and tetxresd servers on the master system. Finally you can call reboot() to reboot system 2.




You will need to call tet_remsync() at various times to ensure that all this happens in the correct order.


The order of events will look something like this. Events that are synchronised are connected by <----->.


System 1				System 2
---------------------------------	---------------------------------
Create a child process using
tet_fork() with a NULL parentproc
and zero validresults

(parent blocks in tet_fork() call,
waiting for child to exit)

In child process
----------------

Call tet_remexec() <----------------->	tccd forks and execs the
to launch a remote process on		remote process
system 2 that will reboot the system

tet_remexec() returns	<------------>	Remote process controller calls
(if tet_remexec returns -1, don't sync	tet_main()
but print diagnostic and call
tet_exit(1))
					In remote process
					-----------------
					Call signal(SIGHUP, SIG_IGN)

Sync with system 2 to syncpoint N <--->	Sync with system 1 to syncpoint N+1
(sync call returns)			(sync call blocks)

Call tet_exit(0) <------------------->	(tccd sends SIGHUP to remote
(child logs off tccd on system 2	process which is ignored - process
and exits)				stays blocked in sync call)

(call to tet_fork() returns in parent -
if child exit status is non-zero this means
that tet_remexec() has failed so give up;
the API has already reported UNRESOLVED
in this case)


Parent process continues
------------------------

Sync with system 2 to syncpoint N+1 <->	(sync call returns)

Sleep a bit - wait for system 2		call tet_logoff()
to call reboot()			(no more API calls are allowed after
					this point!)

					call reboot()
					(remote process and tccd get killed
					as system 2 goes down)
					====================================

Enter the ping/sleep loop -
wait for system 2 to come back
up again

ping loop ends	<-------------------->	System restarts -
					a new instance of tccd becomes
Sleep a bit - wait for system 2		available once the system
to come up multi-user			comes up multi-user

Call tet_remexec() to launch a		etc ...
different remote process on system 2;
this time, call tet_remwait() to wait
for the remote process to terminate
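
The ping/sleep loop on system 1 could be approximated as follows. This is a sketch under stated assumptions and not part of TETware: it polls a TCP port in place of using ping (which normally needs raw-socket privileges), and the names tcp_port_is_up and wait_for_host, together with all of their parameters, are hypothetical. Which port to probe is left to the caller. Be aware that a blocking connect() to a host that is down can itself take a long time to fail.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Try once to connect to host:port. Returns 1 if connected, 0 if not. */
static int tcp_port_is_up(const char *host, unsigned short port)
{
    struct addrinfo hints, *res, *ai;
    char service[8];
    int up = 0;

    snprintf(service, sizeof service, "%u", (unsigned)port);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return 0;

    for (ai = res; ai != NULL && !up; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            up = 1;
        close(fd);
    }
    freeaddrinfo(res);
    return up;
}

/*
 * Poll host:port until it accepts a connection.
 * Returns 0 once the port is reachable, or -1 if it is still
 * unreachable after max_attempts tries.
 */
int wait_for_host(const char *host, unsigned short port,
                  int max_attempts, unsigned interval_secs)
{
    int i;

    for (i = 0; i < max_attempts; i++) {
        if (tcp_port_is_up(host, port))
            return 0;
        if (i + 1 < max_attempts)
            sleep(interval_secs);       /* the "sleep" half of the loop */
    }
    return -1;
}
```

A reachable port only shows that the network stack is up again. As the timeline notes, system 1 should still sleep for a while afterwards to give system 2 time to come up multi-user before calling tet_remexec() again.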

Footnote

This suggestion was offered speculatively and had not been tried out at the time of writing. But a subsequent message from the recipient indicated that a strategy based on this suggestion had in fact been successful.

 

See also

  • The descriptions of tet_fork(), tet_remsync(), tet_remexec(), tet_exit() and tet_logoff() in Chapter 8 of the TETware Programmers Guide.
  • "Distributed TETware architecture" and "TETware programs" in the TETware User Guide.

 



Copyright © The Open Group 1995-2012, All Rights Reserved