
https://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Theoretically, you could achieve [memfd_create()] behavior without introducing new syscalls, like this:

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

(Note, to more portably guarantee a tmpfs here, we can use "/dev/shm" instead of "/tmp").
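For comparison, here is a minimal sketch of the two ways of getting an unnamed tmpfs-backed file. The function names `anon_file_tmpfs` and `anon_file_memfd` and the label "example" are made up for illustration; I'm assuming Linux 3.17+ and a glibc that ships the memfd_create() wrapper (2.27 or later).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Unnamed file on a mounted tmpfs such as "/dev/shm".
 * It lives on that mount, so the mount's size= limit applies.
 * Returns a file descriptor, or -1 with errno set. */
int anon_file_tmpfs(const char *dir)
{
    return open(dir, O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);
}

/* Anonymous file on the kernel-internal tmpfs instance.
 * "example" is only a debugging label (hypothetical name);
 * it shows up in /proc/<pid>/fd as "memfd:example".
 * Returns a file descriptor, or -1 with errno set. */
int anon_file_memfd(void)
{
    return memfd_create("example", MFD_CLOEXEC | MFD_ALLOW_SEALING);
}
```

Both descriptors can then be sized with ftruncate() and mapped with mmap(); only the memfd one accepts seals.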

Therefore, the most important question is why the hell do we need a third way?

[...]

  • The backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.

^ Am I right in thinking the first part of this sentence cannot be relied on?

The memfd_create() code is literally implemented as an "unlinked file living in [a] tmpfs which must be kernel internal". Tracing the code, I understand that it differs in not implementing LSM checks, and that memfds are created to support "seals", as the blog post goes on to explain. However, I'm extremely sceptical that memfds are accounted any differently from a tmpfile in principle.

Specifically, when the OOM-killer comes knocking, I don't think it will account for memory held by memfds. This could total up to 50% of RAM, which is the default value of the size= option for tmpfs. The kernel doesn't set a different value for the internal tmpfs, so it would use that default.

So I think we can generally expect that processes which hold a large memfd, but have no other significant memory allocations, will not be OOM-killed. Is that correct?
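One way to check this empirically would be something like the sketch below. It is only a sketch: `read_own_oom_score`, `memfd_fill_unmapped` and the label "oom-demo" are names I made up, and it assumes /proc is mounted and that memfd_create() is available (Linux 3.17+, glibc 2.27+).

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read the kernel's current badness score for this process
 * from /proc/self/oom_score. Returns -1 on error. */
long read_own_oom_score(void)
{
    long score = -1;
    FILE *f = fopen("/proc/self/oom_score", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &score) != 1)
        score = -1;
    fclose(f);
    return score;
}

/* Fill a memfd with `bytes` of data using write(2) only; the file is
 * never mmap()ed here. "oom-demo" is a hypothetical debugging label.
 * Returns the open fd, or -1 on error. */
int memfd_fill_unmapped(size_t bytes)
{
    static char buf[1 << 16];
    int fd = memfd_create("oom-demo", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof buf);
    while (bytes > 0) {
        size_t chunk = bytes < sizeof buf ? bytes : sizeof buf;
        ssize_t written = write(fd, buf, chunk);
        if (written <= 0) {
            close(fd);
            return -1;
        }
        bytes -= (size_t)written;
    }
    return fd;
}
```

Comparing read_own_oom_score() before and after memfd_fill_unmapped() with a size that is a noticeable fraction of RAM, and then again after mmap()ing the fd and touching every page, should show which of the two states the OOM-killer actually sees. I have not included expected numbers because they depend on the kernel version and the machine.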

Bart
sourcejedi
  • As far as OOM scores go, it seems to come down to [the kernel oom_badness function](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/oom_kill.c?h=v4.19-rc3#n233). So I suspect that if the memfd doesn't show up in /proc/{pid}/maps then it's not counted. So the general answer is that they could be killed, but they won't have a large score because of the memfd_create use. The memory for the fd can be shared across processes, as multiple processes can inherit, or be sent, the same fd. – danblack Sep 16 '18 at 01:49

1 Answer


Building on @danblack's comment:

The decision is based on oom_kill_process() (cleaned up a bit):

for_each_thread(p, t) {
        list_for_each_entry(child, &t->children, sibling) {
                unsigned int child_points;

                child_points = oom_badness(child,
                        oc->memcg, oc->nodemask, oc->totalpages);
                if (child_points > victim_points) {
                        put_task_struct(victim);
                        victim = child;
                        victim_points = child_points;
                        get_task_struct(victim);
                }
        }
}

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L974)

Which depends on oom_badness() to find the best candidate:

child_points = oom_badness(child,
        oc->memcg, oc->nodemask, oc->totalpages);

oom_badness() does:

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
        mm_pgtables_bytes(p->mm) / PAGE_SIZE;

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L233)
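As a worked example of that expression only, here is a small userspace model. `badness_base_points` is a made-up name, the inputs are hypothetical numbers, and the oom_score_adj adjustment that the kernel applies afterwards is deliberately left out.

```c
#include <assert.h>

/* Models just the quoted sum:
 *   resident pages + swap entries + page-table size converted to pages.
 * Not kernel code; the later oom_score_adj adjustment is not modelled. */
unsigned long badness_base_points(unsigned long rss_pages,
                                  unsigned long swap_entries,
                                  unsigned long pgtable_bytes,
                                  unsigned long page_size)
{
    assert(page_size != 0);
    return rss_pages + swap_entries + pgtable_bytes / page_size;
}
```

For instance, a process with 1000 resident pages, 50 swap entries and 32 KiB of page tables on a 4 KiB-page system gets 1000 + 50 + 8 = 1058 points from this part of the calculation. Pages that are not counted in the mm's RSS or swap counters do not enter the sum at all.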

Where:

static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
        return get_mm_counter(mm, MM_FILEPAGES) +
                get_mm_counter(mm, MM_ANONPAGES) +
                get_mm_counter(mm, MM_SHMEMPAGES);
}

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L966)
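The same three counters are visible from userspace as the RssAnon, RssFile and RssShmem fields of /proc/<pid>/status (on kernels 4.5 and later, as far as I know). A minimal sketch for reading them, where `struct rss_breakdown` and `read_rss_breakdown` are names invented for this example:

```c
#include <stdio.h>

struct rss_breakdown {
    long anon_kb;   /* RssAnon:  corresponds to MM_ANONPAGES  */
    long file_kb;   /* RssFile:  corresponds to MM_FILEPAGES  */
    long shmem_kb;  /* RssShmem: corresponds to MM_SHMEMPAGES */
};

/* Fill `out` from /proc/self/status.
 * Returns 0 on success, -1 if the file cannot be read or
 * any of the three fields is missing. */
int read_rss_breakdown(struct rss_breakdown *out)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    out->anon_kb = out->file_kb = out->shmem_kb = -1;
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* Each sscanf only stores a value when its prefix matches. */
        sscanf(line, "RssAnon: %ld", &out->anon_kb);
        sscanf(line, "RssFile: %ld", &out->file_kb);
        sscanf(line, "RssShmem: %ld", &out->shmem_kb);
    }
    fclose(f);
    if (out->anon_kb < 0 || out->file_kb < 0 || out->shmem_kb < 0)
        return -1;
    return 0;
}
```

Watching which of the three fields moves while a process writes to, and later maps, a memfd is a quick way to see which counter its pages land in.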

So it looks like it counts anonymous pages, which is what memfd_create() uses.

V13
  • Does this also count pages that are not mapped, but referenced by fds opened by the process? – Ferdi265 May 04 '20 at 17:33
  • I'm not sure what those would be. Is that cache? – V13 May 08 '20 at 15:03
  • For example, you could call memfd_create and write 4GB of data into it. I don't think that will be counted by the above metric, but it does represent memory that will be freed when the process dies. – Ferdi265 May 09 '20 at 17:08
  • According to memfd_create(2): "Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS flag". So that should count against MM_ANONPAGES. – V13 May 10 '20 at 18:08