This is a follow-up to the TDD sample iteration post, where I was trying to manage system daemons through a Python web service. For some reason I thought my proof of concept showed that restarting a service worked properly, but it definitely didn't. I hadn't realized that restarting a daemon from the web service causes the daemon to hold on to the TCP port used by the web service, which makes the web service hang. It seems to be the same problem this guy ran into.
So, despite having several unit tests for this case, some of them even mocking objects, the lack of a real acceptance test misled me into believing that the code was working fine.
From this I've learned that, apart from unit tests, it is important to write some tests that call the function in a real scenario, where execution changes the state of the system or database. Running such tests in a controlled environment, where changes can be reverted, completes the confidence you have in your software. The last step would be to write acceptance tests that interact with the GUI, so that once the QA guys know how to use the application, they can write tests that click on buttons, fill in forms, and so on. This speeds up regression testing and makes the life of QA engineers much more exciting.
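A test that really changes state can still be repeatable if the change is confined to a scratch area and undone in tearDown. The following is a minimal sketch of that idea using only the standard unittest, tempfile and shutil modules (it should run on Python 2.7 and 3). The class and test names are hypothetical, and a temporary directory stands in for the real system state.

```python
import os
import shutil
import subprocess
import tempfile
import unittest


class SandboxedStateChangeTest(unittest.TestCase):
    """Hypothetical example: run a real command, check its effect, then revert it."""

    def setUp(self):
        # A throwaway directory plays the role of the controlled environment.
        self.sandbox = tempfile.mkdtemp()

    def tearDown(self):
        # Revert every change the test made.
        shutil.rmtree(self.sandbox)

    def test_mkdir_really_creates_directory(self):
        target = os.path.join(self.sandbox, "new_dir")
        returncode = subprocess.call(["mkdir", target])
        self.assertEqual(returncode, 0)
        # Check the state of the system, not just the return value.
        self.assertTrue(os.path.isdir(target))
```

It can be run with `python -m unittest` like any other test module. The important part is that the assertion inspects the system itself (here, whether the directory exists), which is exactly what the mocked unit tests could not do.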
Now, regarding the problem I was having: it turns out that subprocess.Popen forks a child process that inherits its parent's open file descriptors (in Python 2, close_fds defaults to False), including the socket the web service listens on. So restarting the cron daemon (which does not need any TCP port) from the web service leaves the new cron daemon holding the web service's socket. Thanks to my friend Rene, who gave me this script (and thanks to its author, of course), I've managed to fix the problem.
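The inheritance can be observed directly. The sketch below, written for Python 3, opens a listening TCP socket, starts a child through subprocess.call (a thin wrapper around Popen) and asks the child whether that descriptor number is still an open socket. Since Python 3.4, descriptors are non-inheritable by default (PEP 446), so the sketch calls set_inheritable(True) to mimic the Python 2 behaviour. The helper name child_sees_socket is made up for the example.

```python
import socket
import subprocess
import sys

# Code run by the child: exit 0 if the given descriptor is an open socket, 1 otherwise.
PROBE = """
import os, stat, sys
fd = int(sys.argv[1])
try:
    ok = stat.S_ISSOCK(os.fstat(fd).st_mode)
except OSError:
    ok = False
sys.exit(0 if ok else 1)
"""


def child_sees_socket(sock, close_fds):
    """Return True if a child started with the given close_fds still has `sock` open."""
    returncode = subprocess.call(
        [sys.executable, "-c", PROBE, str(sock.fileno())],
        close_fds=close_fds,
    )
    return returncode == 0


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # any free port
listener.listen(1)
listener.set_inheritable(True)  # behave like Python 2, where descriptors were inherited

print(child_sees_socket(listener, close_fds=False))  # True: the child holds the socket too
print(child_sees_socket(listener, close_fds=True))   # False: the descriptor was closed
listener.close()
```

A daemon started from such a child keeps the inherited socket for as long as it runs, which is why the web service's port stays busy after the restart.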
class Service(models.Model):
    # ... some code here

    def _changeServiceStatus(self, command):
        return Utils.runExternalCmdOnSeparateProcess(self.command, command)
Utils module:

import commands  # Python 2 only
import os
import random
import re
import resource
import time
# Logger is the project's own logging helper; its import is not shown here.

MAXFD = 1024  # fallback when the hard RLIMIT_NOFILE limit is unlimited

def runExternalCmdOnSeparateProcess(command, args, timeOut=20):
    """
    Use this to manage daemons only. The command runs in a grandchild
    process (double fork) that has closed every descriptor inherited
    from the web service, so a daemon started by it cannot hold on to
    the web service's socket. Typical command:
        /etc/init.d/apache2 restart
    """
    tmpOutputFileName = '/tmp/' + str(random.random())  # output file
    if not timeOut:
        timeOut = 30
    pid1 = os.fork()
    if not pid1:  # this is the new process (process 1)
        os.setsid()  # start a new session, detached from the parent
        pid2 = os.fork()  # second fork
        if pid2:  # process 1: exit right away so process 2 is orphaned
            os._exit(0)
        else:  # process 2
            try:
                closeAllResources()
                execInfo = runCommand(command + " " + str(args))  # do the job
                setOutputInfo(tmpOutputFileName, execInfo)  # save the output of the job
            finally:
                os._exit(0)  # never return into the caller's code
    else:  # the process that started the call (process 0)
        try:
            os.waitpid(pid1, 0)  # reap process 1 so it does not linger as a zombie
        except OSError:
            pass  # already reaped, e.g. when SIGCHLD is ignored
        i = 0
        while not os.path.exists(tmpOutputFileName):
            time.sleep(1)  # wait until process 2 creates the output file
            i += 1
            if i > timeOut:
                return ['5', 'Operation time out']
        isFileClosed = "0"
        i = 0
        time.sleep(1)
        while isFileClosed == "0":  # wait until process 2 closes the output file
            # lsof reports success ("0") while some process still has the file open
            execInfLsof = runCommand("lsof" + " " + tmpOutputFileName)
            isFileClosed = execInfLsof[0]
            if isFileClosed != "0":
                return getOutputInfo(tmpOutputFileName)
            else:
                time.sleep(1)
                i += 1
                if i > timeOut:
                    return ['5', 'Operation time out']
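On Python 3.3 and later, subprocess can do much of this work itself. close_fds=True keeps the web service's socket out of the child, and start_new_session=True calls setsid() in the child, much like os.setsid() in the script above. The sketch below is an alternative technique, not the script from this post. run_detached is a hypothetical name, and it reuses the '5' timeout code. Output goes to a temporary file rather than a pipe, so a daemon that keeps its inherited stdout open cannot block the wait.

```python
import subprocess
import tempfile


def run_detached(argv, timeout=20):
    """Run argv in a new session without inherited descriptors; return [code, output]."""
    with tempfile.TemporaryFile() as out:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=out,               # a file, not a pipe: nothing to drain
            stderr=subprocess.STDOUT,
            close_fds=True,           # do not leak the web service's socket
            start_new_session=True,   # setsid() in the child
        )
        try:
            proc.wait(timeout=timeout)  # waits for argv itself, not for daemons it spawns
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            return ['5', 'Operation time out']
        out.seek(0)
        return [str(proc.returncode), out.read().decode('utf-8', 'replace')]
```

For example, `run_detached(['/etc/init.d/cron', 'restart'])` would restart cron. Unlike commands.getstatusoutput, returncode here is the plain exit code.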
def closeAllResources():
    """
    Close all the resources held by the current process and point
    stdin, stdout and stderr at /dev/null
    """
    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
    if maxfd == resource.RLIM_INFINITY:
        maxfd = MAXFD
    for fd in range(0, maxfd):
        try:
            os.close(fd)
        except OSError:  # fd wasn't open to begin with (ignored)
            pass
    os.open(os.devnull, os.O_RDWR)  # standard input (0)
    os.dup2(0, 1)  # standard output (1)
    os.dup2(0, 2)  # standard error (2)
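The close loop can also be written with os.closerange (available since Python 2.6), which silently skips descriptors that are not open. Below is a Python 3 sketch with the hypothetical names close_all_descriptors and pipe_closed_in_child. It deliberately uses the soft limit instead of the hard one, assuming no descriptor above the soft limit is open, because hard limits in some containers are very large. The fallback of 1024 is likewise an assumed value. Since it closes everything, the demo only calls it in a forked child, which reports through its exit status whether a pipe created before the fork survived.

```python
import os
import resource


def close_all_descriptors(fallback_max=1024):
    """Close every descriptor, then point stdin, stdout and stderr at /dev/null."""
    soft_limit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    if soft_limit == resource.RLIM_INFINITY:
        soft_limit = fallback_max
    os.closerange(0, soft_limit)  # skips descriptors that are not open
    devnull = os.open(os.devnull, os.O_RDWR)  # lowest free descriptor, i.e. 0
    os.dup2(devnull, 1)
    os.dup2(devnull, 2)


def pipe_closed_in_child():
    """Fork, close everything in the child, and report whether a pipe fd was closed."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        close_all_descriptors()
        try:
            os.fstat(write_fd)
            os._exit(1)  # still open: closing failed
        except OSError:
            os._exit(0)  # closed, as expected
    os.close(read_fd)
    os.close(write_fd)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```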
def runCommand(cmd):
    """
    Run cmd through the shell and return [status, cleaned output].
    commands.getstatusoutput returns the raw wait() status, e.g. 256
    when the command exits with code 1.
    """
    retCode = 0
    errMessage = ''
    try:
        retInfo = commands.getstatusoutput(cmd)
        retCode = retInfo[0]
        errMessage = retInfo[1]
        # some commands fail but still return 0, check that
        if retCode == 0 and isActualOutcomeFailure(errMessage):
            retCode = 3
    except Exception as e:
        Logger.getInstance().logException(e)
        retCode = 4
        errMessage = str(e)
    return [str(retCode), preprocessReturnMessage(str(errMessage))]
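The status returned by commands.getstatusoutput (and by os.system) is the raw value from wait(), not the exit code, so a command that exits with 1 is reported as 256. Here is a small sketch of how such a status can be decoded with os.WIFEXITED and os.WEXITSTATUS. It is written in Python 3, where the commands module no longer exists but os.system still returns a raw status on POSIX. decode_wait_status is a hypothetical name, and reporting a signal as a negative number mirrors the subprocess returncode convention.

```python
import os


def decode_wait_status(status):
    """Turn a raw wait() status into an exit code, or -N if killed by signal N."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    return status  # stopped or continued: leave it unchanged


raw = os.system("exit 1")  # POSIX: the encoded status
print(raw, decode_wait_status(raw))  # 256 1
```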
def isActualOutcomeFailure(msg):
    """
    Some commands seem to return 0 although the operation wasn't successful. Try to detect that.
    To-do: Add as many errors as possible to the errors list below.
    """
    errors = ['Permission denied', 'Permiso denegado']
    for err in errors:
        if re.search(err, msg):
            return True
    return False
def preprocessErrorMessage(msg):
    """
    soaplib accepts neither non-alphanumeric chars nor long messages
    """
    regex = re.compile(r'\W')
    cleaned = regex.sub("_", msg)
    maxLength = 100
    cleaned = cleaned[:maxLength]
    return cleaned

def preprocessReturnMessage(msg):
    return preprocessErrorMessage(msg)
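To see what the cleaning does to a typical error message, here is the same transformation as a standalone Python 3 snippet. clean_message is a hypothetical name for a copy of preprocessErrorMessage. Every character that is not a letter, digit or underscore becomes an underscore, and the result is cut to 100 characters. The re.ASCII flag is an assumption meant to approximate how \W behaves on Python 2 byte strings.

```python
import re

NON_WORD = re.compile(r'\W', re.ASCII)  # ASCII-only word characters


def clean_message(msg, max_length=100):
    """Replace non-word characters with '_' and truncate, as soaplib needs."""
    return NON_WORD.sub("_", msg)[:max_length]


print(clean_message("mkdir: cannot create directory"))
# mkdir__cannot_create_directory
```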
def setOutputInfo(tmpFileName, execInfo):
    try:
        outputFile = open(tmpFileName, 'w')
        outputFile.write(execInfo[0] + " " + execInfo[1])
        outputFile.close()
    except Exception as e:
        Logger.getInstance().logException(e)
def getOutputInfo(tmpFileName):
    try:
        outFile = open(tmpFileName, 'r')
        try:
            content = outFile.read()
        finally:
            outFile.close()
        # the file holds "<code> <message>"; codes can have several digits
        retCode, _, errMessage = content.partition(" ")
        if retCode == '0':
            errMessage = 'Success'
        retInfo = [retCode, preprocessReturnMessage(errMessage)]
    except IOError as e:
        retInfo = ['6', preprocessReturnMessage('IOError, output file not created: ' + str(e))]
        Logger.getInstance().logException(e)
    except Exception as e:
        retInfo = ['7', preprocessReturnMessage(str(e))]
        Logger.getInstance().logException(e)
    finally:
        if os.path.exists(tmpFileName):
            os.remove(tmpFileName)  # delete the output file
    return retInfo
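Polling lsof works, but there is another common way for the waiting process to know that the output file is complete. The worker writes to a temporary name and renames the file into place when done. On POSIX, rename within one filesystem is atomic, so once the final name exists the file is already fully written. The waiting side then only needs os.path.exists. Below is a Python 3 sketch of that alternative, with hypothetical helper names and the same "<code> <message>" format that setOutputInfo writes.

```python
import os


def write_result(path, code, message):
    """Write '<code> <message>' under a temporary name, then rename it into place."""
    partial = path + ".part"
    with open(partial, "w") as f:
        f.write("%s %s" % (code, message))
        f.flush()
        os.fsync(f.fileno())
    os.rename(partial, path)  # atomic on POSIX when both names are on one filesystem


def read_result(path):
    """Parse and delete a result file; codes may have several digits."""
    with open(path) as f:
        code, _, message = f.read().partition(" ")
    os.remove(path)
    return [code, message]
```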
Unit tests: Try the function but don't change the state of the system.
def testRunCommandTimeOutOnSeparateProcess(self):
    retInfo = Utils.runExternalCmdOnSeparateProcess("sleep", "30", timeOut=3)
    self.assertEqual(retInfo[0], '5')
    self.assertEqual(retInfo[1], 'Operation time out')

def testRunSuccesfullCommandOnSeparateProcess(self):
    retInfo = Utils.runExternalCmdOnSeparateProcess("ls", "~")
    self.assertEqual(retInfo[0], '0')

def testRunFailureCommandOnSeparateProcess(self):
    """
    Always run these tests as a regular user, not root
    """
    retInfo = Utils.runExternalCmdOnSeparateProcess("mkdir", "/root/tmp", timeOut=3)
    # mkdir fails; getstatusoutput reports a raw wait() status, so check for any non-zero code
    self.assertNotEqual(retInfo[0], '0')
A kind of acceptance test: it does change the system, since it restarts the cron daemon.
def tRestartCron():
    clientManagementService = make_service_client('localhost:17777/management', ServersManagementService(), https=True)
    retInfo = clientManagementService.srv_restartService('cron')
    print retInfo
    # the web service wouldn't return anything if cron blocked its port, so a return means success
    # also check that the cron process has been restarted and is not holding any TCP port
    # run this in a controlled environment, on test machines
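The last check in that test can be automated on Linux. Each socket descriptor in /proc/<pid>/fd is a link of the form socket:[inode], and /proc/net/tcp and /proc/net/tcp6 list the inode of every TCP socket in their tenth column. A process therefore holds a TCP socket if those two sets intersect. Below is a Linux-only Python 3 sketch with hypothetical helper names. Reading another user's /proc/<pid>/fd (cron usually runs as root) needs root privileges. The process name "cron" is also an assumption, since some distributions call it crond.

```python
import os


def tcp_socket_inodes():
    """Inodes of all TCP sockets (IPv4 and IPv6) in this network namespace."""
    inodes = set()
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(table) as f:
                next(f)  # skip the header line
                for line in f:
                    inodes.add(line.split()[9])  # 10th column is the inode
        except OSError:  # e.g. no IPv6 support
            pass
    return inodes


def pids_by_name(name):
    """PIDs whose /proc/<pid>/comm equals name."""
    pids = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open("/proc/%s/comm" % entry) as f:
                    if f.read().strip() == name:
                        pids.append(int(entry))
            except OSError:  # the process may have exited meanwhile
                pass
    return pids


def holds_tcp_socket(pid):
    """True if any descriptor of pid points at a TCP socket."""
    tcp = tcp_socket_inodes()
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue
        if target.startswith("socket:[") and target[8:-1] in tcp:
            return True
    return False
```

In the acceptance test this could become `assert not any(holds_tcp_socket(pid) for pid in pids_by_name("cron"))`, run after the restart call returns.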
There were also small modifications to the unit tests I had before, in post 1.