I have a Python application which manages the execution, checkpointing and restoring of a bunch of binaries through a gRPC interface.
I am running into a strange problem: the dump succeeds, but restoring a binary fails when I invoke it through the pycriu API against the CRIU service.
Consider the following example:
import os
import psutil
import pycriu
p = psutil.Popen(["/path/to/execuable.bin"], env={...}, cwd="/path/to/cwd", start_new_session=True)
criu = pycriu.criu()
criu.use_sk("/var/run/criu-service.socket")
criu.opts.pid = p.pid
criu.opts.tcp_established = True
criu.opts.shell_job = True
criu.opts.images_dir_fd = os.open("/path/to/checkpoint/dir", os.O_DIRECTORY)
criu.dump() # success
p.wait() # reap the zombie process, python won't do that automatically
criu.restore() # fail
If I check /path/to/checkpoint/dir/criu.log, I see:
Warn (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
4959: Error (criu/tty.c:992): tty: Don't have tty to inherit session from, aborting
4959: Error (criu/files.c:1213): Unable to open fd=0 id=0x6
Error (criu/cr-restore.c:2536): Restoring FAILED.
Error (criu/cr-restore.c:1498): 4959 killed by signal 9: Killed
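Since the first error mentions a tty, one thing that can be checked is where the child's stdin, stdout and stderr point before the dump. This is a minimal sketch, assuming Linux with /proc mounted. `stdio_targets` and `uses_terminal` are hypothetical helper names, not part of psutil or pycriu:

```python
import os


def stdio_targets(pid):
    """Return what fds 0, 1 and 2 of `pid` point at, read from /proc (Linux only)."""
    targets = {}
    for fd in (0, 1, 2):
        try:
            targets[fd] = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            # fd is closed, the process is gone, or we lack permission
            targets[fd] = None
    return targets


def uses_terminal(targets):
    """True if any of the given stdio targets looks like a pty or tty device."""
    return any(
        target is not None and target.startswith(("/dev/pts/", "/dev/tty"))
        for target in targets.values()
    )
```

Calling `stdio_targets(p.pid)` right after `psutil.Popen(...)` shows whether the process inherited the terminal of the Python interpreter. Whether that explains the restore failure under the service is an assumption on my part, not something the log confirms.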
This is how I start the service:
setsid criu service -o /path/to/criu-service.log --address /var/run/criu-service.socket &
But if I checkpoint and restore via the CLI, it works:
Terminal 1:
# python
>>> import psutil
>>> p = psutil.Popen(["/path/to/execuable.bin"], env={...}, cwd="/path/to/cwd", start_new_session=True)
>>> p.wait() # wait until criu kills the process and reap it
Terminal 2:
# criu dump -t <pid> --tcp-established --shell-job -D /path/to/checkpoint/dir
Error (criu/util.c:641): exited, status=3
Warn (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn (compel/arch/x86/src/lib/infect.c:352): Will restore 5518 with interrupted system call
Warn (compel/arch/x86/src/lib/infect.c:352): Will restore 5564 with interrupted system call
Warn (compel/arch/x86/src/lib/infect.c:352): Will restore 5565 with interrupted system call
#
# criu restore -d --tcp-established --shell-job -D /path/to/checkpoint/dir
Error (criu/util.c:641): exited, status=3
Warn (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
<STDOUT OUTPUT FROM THE RESTORED PROCESS>
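For comparison, the two CLI invocations above could also be driven from Python with `subprocess` instead of the service. This is only a sketch of that idea: `criu_cli_args` and `run_criu` are hypothetical helpers, the flags are exactly the ones used in Terminal 2, and I have not verified that this behaves the same as running `criu` from an interactive shell:

```python
import subprocess


def criu_cli_args(action, images_dir, pid=None):
    """Build the argv for the `criu dump` / `criu restore` commands shown above."""
    if action not in ("dump", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    args = ["criu", action]
    if action == "dump":
        if pid is None:
            raise ValueError("dump needs the pid of the process tree root")
        args += ["-t", str(pid)]
    else:
        args.append("-d")  # detach after restore, as in the CLI example
    args += ["--tcp-established", "--shell-job", "-D", images_dir]
    return args


def run_criu(action, images_dir, pid=None):
    """Run criu (needs sufficient privileges) and return the CompletedProcess."""
    return subprocess.run(
        criu_cli_args(action, images_dir, pid),
        capture_output=True,
        text=True,
    )
```

`run_criu("dump", "/path/to/checkpoint/dir", pid=p.pid)` followed by `run_criu("restore", "/path/to/checkpoint/dir")` would mirror the terminal session. The return code and stderr are available on the returned object.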
CRIU info:
# criu check --all
Error (criu/util.c:641): exited, status=3
Warn (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn (criu/cr-check.c:1231): clone3() with set_tid not supported
Error (criu/cr-check.c:1273): Time namespaces are not supported
Error (criu/cr-check.c:1283): IFLA_NEW_IFINDEX isn't supported
Warn (criu/cr-check.c:1305): Pidfd store requires pidfd_getfd syscall which is not supported
Warn (criu/cr-check.c:804): ptrace(PTRACE_GET_RSEQ_CONFIGURATION) isn't supported. C/R of processes which are using rseq() won't work.
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
#
# criu --version
Version: 3.17.1
GitID: v3.17.1