14

I am currently running a statistical modelling script that performs a phylogenetic ANOVA. The script runs fine when I analyse the full dataset, but when I take a subset it starts analysing and then quickly terminates with a segmentation fault. I cannot figure out from googling whether this could be a problem on my side (e.g. the sample dataset being too small for the analysis), a bug in the script, or something to do with my Linux system. I read that it has to do with writing data to memory, but then why is everything fine with a larger dataset? I tried to find more information using Google, but that only made things more complicated.

Thanks for clarifying in advance!

TUnix
  • Having worked in genetics, and with statistical software written by statisticians and scripts written by myself, I suspect the error occurs due to the software expecting all rows of your dataset, while you are only providing a subset of rows. I.e. at some point, it has loaded the `m` rows of subsetted data, but you somehow mistakenly told it to expect all `n >> m` rows, and now it is trying to read row `m+1`. – MrGumble Apr 23 '20 at 06:51
  • What language is that "script" written in? If it's a perl/python/not-C-or-C++ then the fault is neither on your side (final user's), nor on the script author's side, but on the perl/python implementation's side (or in whatever shared library is loaded by it). There may be some bugs which are unfixable because of historical misdesign -- and for which, the cure (which was sometimes mistakenly applied) is worse than the disease; while a script author could strive to work around them, she still couldn't be faulted for ignoring them and coding to the spec ;-) –  Apr 23 '20 at 09:27

3 Answers

36

(tl;dr: It's almost certainly a bug in your program or a library it uses.)

A segmentation fault indicates that a memory access was not legal. That is, based on the issued request, the CPU issues a page fault because the page requested either isn't resident or has permissions that are incongruous with the request.

After that, the kernel checks to see whether it simply doesn't know anything about this page, whether it's just not in memory yet and it should put it there, or whether it needs to perform some special handling (for example, copy-on-write pages are read-only, and this valid page fault may indicate we should copy it and update the permissions). See Wikipedia for minor vs. major (e.g. demand paging) vs. invalid page faults.

Getting a segmentation fault indicates the invalid case: the page is not in memory, and the kernel has no remedial action to take, because the process doesn't logically have that page of its virtual address space mapped. As such, this almost certainly indicates a bug in either the program or one of its underlying libraries -- for example, attempting to read or write memory which is not valid for the process. If the address had happened to be valid, the bug could have caused stack corruption or scribbled over other data, but reading or writing an unmapped page is caught by hardware.

The reason why it works with your larger dataset and not your smaller dataset is entirely specific to that program: it's probably a bug in that program's logic, which is only tripped for the smaller dataset for some reason (for example, your dataset may have a field representing the total number of entries, and if it's not updated, your program may blindly read into unallocated memory if it doesn't do other sanity checks).

It's several orders of magnitude less likely than simply being a software bug, but a segmentation fault may also be an indicator of hardware issues, like faulty memory, a faulty CPU, or your hardware tripping over errata (as an example, see here).

Getting segfaults due to failing hardware often results in sometimes-works behaviour, although a bad bit in physical RAM might get mapped the same way in repeated runs of a program if you don't run anything else in between. You can mostly rule out this possibility by booting memtest86+ to check for failing RAM, and using software like Prime95 to stress-test your CPU (including the FP math FMA execution units).


You can run the program in a debugger like gdb and get the backtrace at the time of the segmentation fault, which will likely indicate the culprit:

% gdb --args ./foo --bar --baz
(gdb) r   # run the program
[...wait for segfault...]
(gdb) bt  # get the backtrace for the current thread
Chris Down
  • Thanks Chris for your quick and clear reply! I understand the issue now. I am going to run the program with gdb; let's see what it does. – TUnix Apr 22 '20 at 15:18
  • It would also help if you said a bit about the program: which language, using standard or home-grown libraries, etc. gdb is normally suitable to debug programs compiled with debugging options; if the bug is in a library you're using, it may be harder to find the actual culprit. – Hans-Martin Mosner Apr 22 '20 at 16:30
  • _Almost_ certainly, yes, but it can also indicate a physical memory error. This was much more common in the heyday of aggressive overclocking. – chrylis -cautiouslyoptimistic- Apr 23 '20 at 02:39
  • Double checks are for humans, computers just check... – Tero Lahtinen Apr 23 '20 at 06:39
  • @Chris: You accidentally called both kinds of page faults *invalid*. I think just a typo not a misconception, so I'll edit. A *valid* page fault is page-fault on a page that the process has logically mapped even though the HW page tables don't have it "wired". (e.g. OS is doing virtual-memory tricks like CoW, lazy allocation (minor), or swapping (major)). An *invalid* page fault is one where the process doesn't have that virtual address mapped. https://en.wikipedia.org/wiki/Page_fault#Types – Peter Cordes Apr 23 '20 at 18:47
  • On Linux, you might also want to try running under [valgrind](https://valgrind.org/), which will often give you some information about *why* a memory error occurred. Although in this case it sounds like it's probably just a bad access, there might be other, uncaught problems leading up to the issue. The advantage of gdb is ability to poke around in the internals; the advantage of valgrind is watching what your program is doing and reporting certain types of misbehavior. (For example, if the program accessed freed memory first, gdb may not notice, but valgrind will.) – Matthew Apr 23 '20 at 18:54
  • @PeterCordes Thanks! In retrospect, yes, that was not well phrased. – Chris Down Apr 23 '20 at 21:46
5

A segmentation fault occurs when memory locations are accessed that aren't allowed to be accessed. Often, this is due to dereferencing a null pointer or accessing memory out of bounds of allocated memory.

If the full dataset works but a subset does not:

  • check whether the program gracefully handles a dataset that does not contain a feature (maybe you allocate an array based on the features present in the dataset, but then assume a length based on the known list of features from the full dataset?)
  • is any group empty, and does that cause an issue? More generally, any kind of off-by-one error that would manifest if an array were empty?
kutschkem
3

It can be caused by either. Most often, it's a software bug, as described by Chris, but some hardware issues (especially bad memory and bad power supply) can lead to segfaults as well. A bad value is read from memory, which leads to executing a corrupted instruction, reading through a corrupted pointer, using a corrupted page table, etc., all of which lead to a segfault.

The difference, though, is that hardware-based segfaults are very much random, caused by one-in-several-million bit flipping events (if the system is more unstable than that, it doesn't even get to the point of booting up). Segfaults caused by software bugs, on the other hand, can be completely repeatable.

hobbs
  • Minor nitpick: hardware induced segfaults can be entirely perfectly reproducible, provided they're induced by media errors in persistent storage. Such cases are relatively rare, but they can happen (I've dealt with them on a couple of occasions before). – Austin Hemmelgarn Apr 23 '20 at 17:13
  • @AustinHemmelgarn Good point. I've had one or two of those over the years that were solved by reinstalling a package. :) – hobbs Apr 23 '20 at 17:22
  • And of course segmentation faults caused by software bugs can be random as well, e.g. if they're caused by memory getting corrupted as a result of incorrect multithreading code. – user32929 Apr 23 '20 at 17:29
  • seg fault = *accessing memory that doesn't belong to you*, and by definition that is not hardware. And segfaults caused by software bugs *being completely repeatable* is also not true; for example, rolling the dice by not initializing memory to zero and using whatever random value happens to be there: sometimes it will work, other times it will segfault. – ron Apr 23 '20 at 18:21
  • @ron the first half of your comment is 100% wrong; the second half is a result of you not reading carefully what I actually said. – hobbs Apr 23 '20 at 21:58
  • @ron: the mechanism for bit flips -> segfault is when a *pointer* gets corrupted and then you dereference it. – Peter Cordes Apr 23 '20 at 22:01
  • *thus a segmentation fault is not my code says the arrogant programmer, it's a hardware fault not my problem* If people want to *believe* in the 1e-9 chance of random bit flips and cosmic rays causing segfault... i'm out. – ron Apr 24 '20 at 11:13