QA-273: Feature/python test runner #401

dothebart · 2022-07-05T14:52:36Z

replace the fish and ps test launching facilities and the report generators with one python implementation

the environment variable TESTSUITE_TIMEOUT defines a deadline for the tests: how many seconds they are allowed to run.
tests run in worker threads.
the main thread keeps control and launches more worker threads once machine bandwidth permits, but no more often than every 5s.
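
A minimal sketch of the launch loop described above, not the code from this PR. `deadline_from_env`, `machine_has_capacity` and `run_all` are hypothetical names, the default of 3600s is an assumption, and the capacity check is a plain worker count standing in for the real load-based throttle.

```python
import os
import threading
import time
from datetime import datetime, timedelta

LAUNCH_INTERVAL = 5  # seconds; minimum distance between two launches


def deadline_from_env(default_seconds=3600):
    """Turn TESTSUITE_TIMEOUT (seconds) into an absolute deadline.

    The default value is an assumption made for this sketch.
    """
    seconds = int(os.environ.get("TESTSUITE_TIMEOUT", default_seconds))
    return datetime.now() + timedelta(seconds=seconds)


def machine_has_capacity(active_workers, max_workers):
    """Hypothetical stand-in for the real machine load check."""
    return active_workers < max_workers


def run_all(jobs, worker_fn, max_workers=4, launch_interval=LAUNCH_INTERVAL):
    """Run worker_fn(job) in one thread per job, throttled as described."""
    pending = list(jobs)
    running, finished = [], []
    last_launch = None
    while pending or running:
        # collect the workers that are done
        for thread in [t for t in running if not t.is_alive()]:
            thread.join()
            running.remove(thread)
            finished.append(thread.name)
        now = time.monotonic()
        interval_over = last_launch is None or now - last_launch >= launch_interval
        if pending and interval_over and machine_has_capacity(len(running), max_workers):
            job = pending.pop(0)
            thread = threading.Thread(target=worker_fn, args=(job,), name=str(job))
            thread.start()
            running.append(thread)
            last_launch = now
        time.sleep(0.05)  # the main thread stays in control, it never blocks on a worker
    return finished
```

The main thread only polls and never joins a worker that is still alive, so it can keep enforcing its own deadline.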
tests themselves have their timeouts; testing.js will abort if they are reached.
workers have a progressive timeout: if a worker doesn't hear back from testing.js for 999999999s, it will hard kill and abort.
if workers get no output from testing.js, they check whether the deadline is reached.
if the deadline is reached, SIGINT [*nix] / SIGBREAK [windows] is sent to testing.js to trigger its deadline feature.
the reached deadline will be recorded in testfailures.txt and in the logfile of this test.
with the deadline engaged, testing.js can send no further requests nor spawn processes => eventually testing will abort.
force shutdown of instances will reset the deadline, SIGABRT the arangods, and try to do core dump analysis.
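
A sketch of how the deadline signal could be delivered from python, assuming the child was started via `subprocess`. `deadline_signal`, `launch` and `announce_deadline` are hypothetical helpers, not names from this PR. On Windows the sketch uses `signal.CTRL_BREAK_EVENT`, which requires the child to run in its own process group.

```python
import signal
import subprocess
import sys


def deadline_signal():
    """Pick the signal to send for the current platform."""
    if sys.platform == "win32":
        return signal.CTRL_BREAK_EVENT
    return signal.SIGINT


def launch(argv, **popen_kwargs):
    """Start the child; on Windows it gets its own process group."""
    flags = subprocess.CREATE_NEW_PROCESS_GROUP if sys.platform == "win32" else 0
    return subprocess.Popen(argv, creationflags=flags, **popen_kwargs)


def announce_deadline(process):
    """Tell the child that the deadline is reached; it decides how to wind down."""
    process.send_signal(deadline_signal())
```

The signal is only a request. The child still has to shut down on its own, which is why the workers keep a hard kill as a fallback.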
workers continue reading pipes from testing.js, but once no more characters are coming, waitpid() checks with a 1s timeout whether testing.js is done.
if the worker has counted 180 such waitpid() timeouts, it will give up: it will hard kill testing.js and all other child processes.
this should unblock the worker's STDOUT/STDERR threads, and they should exit.
the waitpid() on testing.js should then return, the I/O threads should be joined, and the results should be passed up to the main thread.
so the workers still have a sluggish interpretation of the deadline, giving them the chance to collect as much knowledge as possible.
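
The worker side can be sketched roughly as follows. This is a simplification under stated assumptions: it counts idle 1s waits from the start rather than only after the deadline, and it kills only the direct child, not all other child processes. `supervise` and `_drain` are hypothetical names.

```python
import subprocess
import threading

MAX_IDLE_WAITS = 180  # number of 1s wait timeouts tolerated without output


def _drain(pipe, sink):
    """I/O thread: read lines until the pipe is closed."""
    for line in iter(pipe.readline, b""):
        sink.append(line)
    pipe.close()


def supervise(argv, wait_timeout=1.0, max_idle_waits=MAX_IDLE_WAITS):
    """Run argv and collect its output; hard kill it after too many idle waits."""
    process = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    lines = []
    reader = threading.Thread(target=_drain, args=(process.stdout, lines))
    reader.start()
    idle_waits = 0
    killed = False
    while True:
        seen = len(lines)
        try:
            process.wait(timeout=wait_timeout)
            break
        except subprocess.TimeoutExpired:
            if len(lines) > seen:
                idle_waits = 0  # incoming lines take priority over counting
                continue
            idle_waits += 1
            if idle_waits >= max_idle_waits:
                process.kill()  # closes the pipe, which unblocks the reader
                killed = True
                process.wait()
                break
    reader.join()
    return process.returncode, lines, killed
```

Killing the child closes its end of the pipe, so the reader thread sees EOF and can be joined before the results are handed back.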
meanwhile the main thread has a fixed deadline: 5 minutes after the TESTSUITE_TIMEOUT is reached.
if not all workers have indicated their exit before this final deadline:
the main thread will start killing any subprocesses of itself which it finds.
after this, it waits another 20s to see whether the workers may have been unblocked by the killing.
if not, it shouts "Geronimoooo", takes the big shotgun, and force-terminates the python process it is running in. This kills all threads as well and terminates the process.
if all workers have indicated their exit in time, their threads will be joined.
reports will be generated.
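
The escalation in the main thread could look roughly like this. It is a sketch that assumes `finish_run` is called at the moment TESTSUITE_TIMEOUT is reached. `kill_children` is a hypothetical callback that kills every subprocess it can find, and `self_destruct` is injectable only so the last step can be exercised without ending the interpreter.

```python
import os
import time

GRACE_AFTER_TIMEOUT = 5 * 60  # seconds granted after TESTSUITE_TIMEOUT
GRACE_AFTER_KILL = 20         # seconds to wait after killing the children


def wait_for_workers(workers, seconds):
    """Return True if all worker threads finished within `seconds`."""
    end = time.monotonic() + seconds
    for worker in workers:
        worker.join(max(0.0, end - time.monotonic()))
    return not any(worker.is_alive() for worker in workers)


def finish_run(workers, kill_children, final_grace=GRACE_AFTER_TIMEOUT,
               kill_grace=GRACE_AFTER_KILL, self_destruct=os._exit):
    """Join the workers; escalate to killing children, then the whole process."""
    if wait_for_workers(workers, final_grace):
        return "joined"
    kill_children()  # should unblock workers stuck on dead pipes
    if wait_for_workers(workers, kill_grace):
        return "joined after kill"
    print("Geronimoooo")
    self_destruct(1)  # os._exit skips all cleanup and ends every thread
    return "terminated"
```

`os._exit` is used for the last step because a regular `sys.exit` would still wait for non-daemon threads that are blocked.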

dothebart · 2022-07-13T12:06:55Z

helper.linux.fish

    set s $status
    set s (math $s + (getSanStatus))
  else
-    runInContainer --cap-add SYS_NICE (findBuildImage) $SCRIPTSDIR/runTests.fish $argv


For the moment this is required so that we can spawn threads in containers on later Ubuntu versions.

README.md

jsteemann

Non-Python files LGTM.
I am not in a good position to review the Python code in this PR though.

Co-authored-by: Jan <[email protected]>

…r into feature/python_test_runner

@mpoeter

* QA-273: Feature/python test runner (#401) Co-authored-by: Jan <[email protected]> Co-authored-by: Vadim <[email protected]> * Feature/python test runner (#417) * start implementing a python launch controller * make it work for the first time * try launching outside of oskar. * no more pipes needed * adjust report directory * fix paths, thread naming. * fallback if no env is configured * lint * more work on cluster etc * silence, proper error message for missing variable * convert params * lint * fix slot * fix arangosh.conf, launching of subsequent testruns * try to launch it from fish * implement 7zip * add modules to the docker container * more printing * fix handling * Add pip3 * Fix typo * Typo 2 * handle INNERWORKDIR * fix missing line break * export settings * fix typo * on windows skip !windows tests * lint, refactor, simplify * install 7z * export core directory * work on fish integration * similarize for new python job scheduler * work on reprot generating * try to implement timeout * also upload 7z and txt * also upload 7z and txt * fix deadline * fix workspace handling * fix temporary directory handling * make sure out temp directory exists * RTFM fail * don't put it to the workspace * implement gtest invoking * cleanup * sort, lint * prefer INNERWORKDIR * implement writing test.log * implement html report * bring back function deletet to early * install the windows boomerang handler on top level * fix include * fix reference * print before killing shit * work on timeout * finish deadline handling, rename script * fix exit code handling * lint * thanks @mpoeter for ps aid * make the thread identifier the test plus a growing number * implement central final deadline, which will kick in after 2 minutes * remove debug output * use /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin to locate python * wintendo next try * wintendo next try * wintendo go home * fix calculation of hard time limit * make sure nobody changes the exit code to good 
* add monkey patches * cleanup deadline * ignore exceptions if no process is there * deadline handling: prioritize incomming lines over timeout counting * fix directory handling * work on result presentation * cleanup * let the file remain open for further info * fix environment variable handling * documentation * fix port handling * work on deadline * fix hard deadline handling * make it 20s * need more time * list processes so we may guess whats actually going on * kill all, then waitpid all * make threads provide half a slot. * be sure to catch * resume just in case, then kill * resume just in case, then kill * ignore resume errors * increase volume * lint * lint * catch more * add multipliers * more load, print load avg * fix sorting by prio - biggest values first * cleanup crash report for size * if test indicates its been crashing create report as well. * more threat to the machine. * timeout * fix typo * delete tzdata subdir first * use load and sockets for throttle control * install required python libs * only see for load [0, 1] * increase container version * anounce deadline at start * don't print to logfile * give better feedback if arangosh fails to launch in first place, thangs @maierlars for bringing up the topic * Update helper.linux.fish * tschuess ruby * re-sync to be stock RTA * fix container numbers, adjust #3 * sync to rta * resync * this is not needed anymore * add --fix-missing * fresh python? * revert to tar.gz * chaos tests in nightlies demand for longer timeouts, since tests run longer. 
* Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * remove more old stuff * ignore encoding errors * increase timeout to hard self kill * switch to one environment variable name * env * limit the amount of coredumps * ignore access denied to open sockets * if we need to wait for the system to cool down on start... * make sure we don't come back good if nothing launched at all * them tiny boxes need more time * need more time * add deadline status to testfailurs.txt * need more time * beautify testfailures.txt * give machine estimate reasons at the start of the run * case may matter * one more environment variable Co-authored-by: Vadim <[email protected]> Co-authored-by: Jan <[email protected]> * Feature/python test runner (#418) * start implementing a python launch controller * make it work for the first time * try launching outside of oskar. * no more pipes needed * adjust report directory * fix paths, thread naming. 
* fallback if no env is configured * lint * more work on cluster etc * silence, proper error message for missing variable * convert params * lint * fix slot * fix arangosh.conf, launching of subsequent testruns * try to launch it from fish * implement 7zip * add modules to the docker container * more printing * fix handling * Add pip3 * Fix typo * Typo 2 * handle INNERWORKDIR * fix missing line break * export settings * fix typo * on windows skip !windows tests * lint, refactor, simplify * install 7z * export core directory * work on fish integration * similarize for new python job scheduler * work on reprot generating * try to implement timeout * also upload 7z and txt * also upload 7z and txt * fix deadline * fix workspace handling * fix temporary directory handling * make sure out temp directory exists * RTFM fail * don't put it to the workspace * implement gtest invoking * cleanup * sort, lint * prefer INNERWORKDIR * implement writing test.log * implement html report * bring back function deletet to early * install the windows boomerang handler on top level * fix include * fix reference * print before killing shit * work on timeout * finish deadline handling, rename script * fix exit code handling * lint * thanks @mpoeter for ps aid * make the thread identifier the test plus a growing number * implement central final deadline, which will kick in after 2 minutes * remove debug output * use /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin to locate python * wintendo next try * wintendo next try * wintendo go home * fix calculation of hard time limit * make sure nobody changes the exit code to good * add monkey patches * cleanup deadline * ignore exceptions if no process is there * deadline handling: prioritize incomming lines over timeout counting * fix directory handling * work on result presentation * cleanup * let the file remain open for further info * fix environment variable handling * documentation * fix port handling * work on deadline * 
fix hard deadline handling * make it 20s * need more time * list processes so we may guess whats actually going on * kill all, then waitpid all * make threads provide half a slot. * be sure to catch * resume just in case, then kill * resume just in case, then kill * ignore resume errors * increase volume * lint * lint * catch more * add multipliers * more load, print load avg * fix sorting by prio - biggest values first * cleanup crash report for size * if test indicates its been crashing create report as well. * more threat to the machine. * timeout * fix typo * delete tzdata subdir first * use load and sockets for throttle control * install required python libs * only see for load [0, 1] * increase container version * anounce deadline at start * don't print to logfile * give better feedback if arangosh fails to launch in first place, thangs @maierlars for bringing up the topic * Update helper.linux.fish * tschuess ruby * re-sync to be stock RTA * fix container numbers, adjust #3 * sync to rta * resync * this is not needed anymore * add --fix-missing * fresh python? * revert to tar.gz * chaos tests in nightlies demand for longer timeouts, since tests run longer. * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * remove more old stuff * ignore encoding errors * increase timeout to hard self kill * switch to one environment variable name * env * limit the amount of coredumps * ignore access denied to open sockets * if we need to wait for the system to cool down on start... 
* make sure we don't come back good if nothing launched at all * them tiny boxes need more time * need more time * add deadline status to testfailurs.txt * need more time * beautify testfailures.txt * give machine estimate reasons at the start of the run * case may matter * one more environment variable * anounce test directory * switch sequence, print first * one more var exported Co-authored-by: Vadim <[email protected]> Co-authored-by: Jan <[email protected]> * Fixed 7z and Signing Added fixes for signing and 7z * Feature/python test runner (#419) * start implementing a python launch controller * make it work for the first time * try launching outside of oskar. * no more pipes needed * adjust report directory * fix paths, thread naming. * fallback if no env is configured * lint * more work on cluster etc * silence, proper error message for missing variable * convert params * lint * fix slot * fix arangosh.conf, launching of subsequent testruns * try to launch it from fish * implement 7zip * add modules to the docker container * more printing * fix handling * Add pip3 * Fix typo * Typo 2 * handle INNERWORKDIR * fix missing line break * export settings * fix typo * on windows skip !windows tests * lint, refactor, simplify * install 7z * export core directory * work on fish integration * similarize for new python job scheduler * work on reprot generating * try to implement timeout * also upload 7z and txt * also upload 7z and txt * fix deadline * fix workspace handling * fix temporary directory handling * make sure out temp directory exists * RTFM fail * don't put it to the workspace * implement gtest invoking * cleanup * sort, lint * prefer INNERWORKDIR * implement writing test.log * implement html report * bring back function deletet to early * install the windows boomerang handler on top level * fix include * fix reference * print before killing shit * work on timeout * finish deadline handling, rename script * fix exit code handling * lint * thanks @mpoeter for 
ps aid * make the thread identifier the test plus a growing number * implement central final deadline, which will kick in after 2 minutes * remove debug output * use /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin to locate python * wintendo next try * wintendo next try * wintendo go home * fix calculation of hard time limit * make sure nobody changes the exit code to good * add monkey patches * cleanup deadline * ignore exceptions if no process is there * deadline handling: prioritize incomming lines over timeout counting * fix directory handling * work on result presentation * cleanup * let the file remain open for further info * fix environment variable handling * documentation * fix port handling * work on deadline * fix hard deadline handling * make it 20s * need more time * list processes so we may guess whats actually going on * kill all, then waitpid all * make threads provide half a slot. * be sure to catch * resume just in case, then kill * resume just in case, then kill * ignore resume errors * increase volume * lint * lint * catch more * add multipliers * more load, print load avg * fix sorting by prio - biggest values first * cleanup crash report for size * if test indicates its been crashing create report as well. * more threat to the machine. * timeout * fix typo * delete tzdata subdir first * use load and sockets for throttle control * install required python libs * only see for load [0, 1] * increase container version * anounce deadline at start * don't print to logfile * give better feedback if arangosh fails to launch in first place, thangs @maierlars for bringing up the topic * Update helper.linux.fish * tschuess ruby * re-sync to be stock RTA * fix container numbers, adjust #3 * sync to rta * resync * this is not needed anymore * add --fix-missing * fresh python? * revert to tar.gz * chaos tests in nightlies demand for longer timeouts, since tests run longer. 
* Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * Update README.md Co-authored-by: Jan <[email protected]> * remove more old stuff * ignore encoding errors * increase timeout to hard self kill * switch to one environment variable name * env * limit the amount of coredumps * ignore access denied to open sockets * if we need to wait for the system to cool down on start... * make sure we don't come back good if nothing launched at all * them tiny boxes need more time * need more time * add deadline status to testfailurs.txt * need more time * beautify testfailures.txt * give machine estimate reasons at the start of the run * case may matter * one more environment variable * anounce test directory * switch sequence, print first * one more var exported * add disk i/o to the output * better work with M1 performance cores * print other sequence; enable more load[1] * more threads doesn't cut it * print platform * precise M1 detection * two places on mac to collect cores * properly append * fix default directory * use iso-ish datetime format for filenames Co-authored-by: Vadim <[email protected]> Co-authored-by: Jan <[email protected]> Co-authored-by: Markus Pfeiffer <[email protected]> Co-authored-by: Jan <[email protected]> Co-authored-by: Vadim <[email protected]> Co-authored-by: Sven Luschgy <[email protected]> Co-authored-by: Markus Pfeiffer <[email protected]>

dothebart marked this pull request as draft July 5, 2022 14:53

dothebart marked this pull request as ready for review July 13, 2022 11:58

dothebart requested review from fceller and KVS85 July 13, 2022 11:58

dothebart commented Jul 13, 2022

dothebart changed the title ~~Feature/python test runner~~ QA-273: Feature/python test runner Jul 18, 2022

dothebart and others added 24 commits August 2, 2022 17:43

start implementing a python launch controller

a971830

make it work for the first time

0226c4e

try launching outside of oskar.

1c311e5

no more pipes needed

c9b3760

adjust report directory

f7abec9

fix paths, thread naming.

f797024

fallback if no env is configured

9ffa927

lint

cb3d867

more work on cluster etc

02f086e

silence, proper error message for missing variable

fdccda0

convert params

6ac8ab6

lint

30be55e

fix slot

14c5b9a

fix arangosh.conf, launching of subsequent testruns

e7f4cd0

try to launch it from fish

cb69dd3

implement 7zip

eae8168

add modules to the docker container

b5d651e

more printing

3c7825a

fix handling

8383c4c

Add pip3

2eb57a8

Fix typo

295c7e0

Typo 2

70fa195

handle INNERWORKDIR

7d22e6f

fix missing line break

e4fa44f

chaos tests in nightlies demand for longer timeouts, since tests run longer.

2f70af6