Tango silently dies under some workloads
When used in CS 360, Tango periodically dies silently. This appears to happen (only?) under heavy load, such as when a mass regrade is requested. It does not happen for mass regrades of CS 122 assignments.
Observations:
- in a mass regrade (say, of assignment 3 in CS 360), some jobs remain pending forever (until the Tango image is killed and restarted)
- the Tango machine does not appear to be running any autograder containers when it hangs (so presumably the problem is in Tango itself, not necessarily in the autograders)
- although Autolab reports, say, 15 jobs as complete (via the jobs tab), the last few show stale output in Autolab (from a prior submission, not from the recently 'graded' one)
- looking inside the Tango container (e.g., in /opt/TangoService/Tango/courselabs/test-CS360-S20-assignment3/output), I see output from all of the jobs that Autolab lists as completed, including jobs that Autolab shows stale output from. Below is an example of such output that never reached Autolab.
- the Tango job manager log shows errors
Example output:
Autograder [Wed Mar 11 10:24:40 2020]: Received job CS360-S20_assignment3_30_yuhao.liu@wsu.edu:12
Autograder [Wed Mar 11 10:24:55 2020]: Success: Autodriver returned normally
Autograder [Wed Mar 11 10:24:55 2020]: Here is the output from the autograder:
---
Autodriver: Job exited with status 0
tar xvf autograde.tar
assignment3/
assignment3/assignment3.html
assignment3/mysolution.c
assignment3/Makefile
assignment3/wsuvtest.json
assignment3/driver.sh
assignment3/assignment3.pdf
assignment3/README
assignment3/tests.c
assignment3/testDir.tar.gz
assignment3/assignment3.c
cp assignment3.c assignment3/
(cd assignment3; ./driver.sh)
...
rm -rf *~
Running...
tar -xzf testDir.tar.gz
cc -o tests tests.c assignment3.c mysolution.c
make[1]: cc: Command not found
Makefile:4: recipe for target 'run' failed
make[1]: *** [run] Error 127
Failure: testsuite failed with nonzero exit status of 2
{"scores": {"Correctness": 0}}
Autograder [Wed Mar 11 10:24:56 2020]: Internal Error: TangoMachine instance has no attribute 'use_ssh_master'
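The final line of the output above is the only explicit error in it, so one way to see how many jobs are affected is to scan the output directory for similar lines. The following is a minimal sketch, not part of Tango: the helper names `find_internal_errors` and `scan_output_dir` are hypothetical, and the regular expression assumes that every internal error is logged in the same `Autograder [timestamp]: Internal Error: message` form as the line above.

```python
import re
from pathlib import Path

# Assumed line format, taken from the example output in this report:
#   Autograder [Wed Mar 11 10:24:56 2020]: Internal Error: <message>
INTERNAL_ERROR = re.compile(
    r"^Autograder \[(?P<when>[^\]]+)\]: Internal Error: (?P<msg>.*)$"
)


def find_internal_errors(text):
    """Return (timestamp, message) pairs for each Internal Error line in text."""
    hits = []
    for line in text.splitlines():
        match = INTERNAL_ERROR.match(line.strip())
        if match:
            hits.append((match.group("when"), match.group("msg")))
    return hits


def scan_output_dir(output_dir):
    """Map each file name in output_dir to the Internal Error lines it contains."""
    results = {}
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():
            # errors="replace" keeps the scan going if student output is not valid UTF-8
            errors = find_internal_errors(path.read_text(errors="replace"))
            if errors:
                results[path.name] = errors
    return results
```

Run inside the Tango container, `scan_output_dir("/opt/TangoService/Tango/courselabs/test-CS360-S20-assignment3/output")` would list the output files that ended in an internal error, which could then be compared against the jobs that show stale output in Autolab.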