
I have a Python application which manages the execution, checkpointing and restoring of a bunch of binaries through a gRPC interface.

I am experiencing a strange problem: restoring a binary fails when invoked through the pycriu API.

When using the CRIU service, the restore fails. Consider the following example:

import os
import psutil
import pycriu

p = psutil.Popen(["/path/to/executable.bin"], env={...}, cwd="/path/to/cwd", start_new_session=True)

criu = pycriu.criu()
criu.use_sk("/var/run/criu-service.socket")
criu.opts.pid = p.pid
criu.opts.tcp_established = True
criu.opts.shell_job = True
criu.opts.images_dir_fd = os.open("/path/to/checkpoint/dir", os.O_DIRECTORY)

criu.dump() # success
p.wait() # reap the zombie process, python won't do that automatically
criu.restore() # fail

If I check /path/to/checkpoint/dir/criu.log, I see:

Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
  4959: Error (criu/tty.c:992): tty: Don't have tty to inherit session from, aborting
  4959: Error (criu/files.c:1213): Unable to open fd=0 id=0x6
Error (criu/cr-restore.c:2536): Restoring FAILED.
Error (criu/cr-restore.c:1498): 4959 killed by signal 9: Killed
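
Since the first error in the log is about a tty, it may help to see what the process's stdin, stdout and stderr actually point to before dumping. The following is only a minimal sketch, assuming Linux with /proc mounted; `stdio_targets` and `uses_terminal` are hypothetical helper names, not part of pycriu or psutil:

```python
import os


def stdio_targets(pid):
    """Return {fd: link target} for fds 0-2 of `pid`, read from /proc (Linux only)."""
    targets = {}
    for fd in (0, 1, 2):
        try:
            targets[fd] = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            # fd is closed, the process is gone, or we lack permission
            targets[fd] = None
    return targets


def uses_terminal(targets):
    """True if any of the given link targets looks like a terminal device."""
    return any(
        t is not None and t.startswith(("/dev/pts/", "/dev/tty"))
        for t in targets.values()
    )
```

For the example above this would be called as `stdio_targets(p.pid)` right after starting the process.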

This is how I start the service:

 setsid criu service -o /path/to/criu-service.log --address /var/run/criu-service.socket &

But if I checkpoint and restore via the CLI, it works:

Terminal 1:

# python
>>> import psutil
>>> p = psutil.Popen(["/path/to/executable.bin"], env={...}, cwd="/path/to/cwd", start_new_session=True)
>>> p.wait() # wait until criu kills the process and reap it

Terminal 2:

# criu dump -t <pid> --tcp-established --shell-job -D /path/to/checkpoint/dir
Error (criu/util.c:641): exited, status=3
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 5518 with interrupted system call
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 5564 with interrupted system call
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 5565 with interrupted system call
#
# criu restore -d --tcp-established --shell-job -D /path/to/checkpoint/dir
Error (criu/util.c:641): exited, status=3
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
<STDOUT OUTPUT FROM THE RESTORED PROCESS>

CRIU info:

# criu check --all
Error (criu/util.c:641): exited, status=3
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn  (criu/cr-check.c:1231): clone3() with set_tid not supported
Error (criu/cr-check.c:1273): Time namespaces are not supported
Error (criu/cr-check.c:1283): IFLA_NEW_IFINDEX isn't supported
Warn  (criu/cr-check.c:1305): Pidfd store requires pidfd_getfd syscall which is not supported
Warn  (criu/cr-check.c:804): ptrace(PTRACE_GET_RSEQ_CONFIGURATION) isn't supported. C/R of processes which are using rseq() won't work.
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
#
# criu --version
Version: 3.17.1
GitID: v3.17.1