What kind of database do `updatedb` and `locate` use?

Question

The locate program of findutils scans one or more databases of filenames and displays any matches. This can be used as a very fast find command if the file was present during the last file name database update.

There are many kinds of databases nowadays:

- relational databases (with a query language, e.g. SQL)
- NoSQL databases
  - document-oriented databases (e.g. MongoDB)
  - key-value databases (e.g. Redis)
  - column-oriented databases (e.g. Cassandra)
  - graph databases

So what kind of database does updatedb update and locate use?

Thanks.

Regardless of whether locate actually uses Berkeley DB, it's worth investigating: it's a very old, simple, effective disk-based key-value store. — pjc50, Jul 21 '17 at 13:09
@pjc50 I'd love to. Where are the files for the database? How shall I view their contents? — Tim, Jul 21 '17 at 13:10
For locate? https://serverfault.com/questions/454127/where-is-the-updatedb-database-locatedhttps://serverfault.com/questions/454127/where-is-the-updatedb-database-located — pjc50, Jul 21 '17 at 13:11
"Page Not Found" , the link should be https://serverfault.com/questions/454127/where-is-the-updatedb-database-located — Tim, Jul 21 '17 at 13:12
So what do the "keys" and "values" represent in the database? If I understand Stephen Kitt's comment https://unix.stackexchange.com/questions/379725/what-kind-of-database-do-updatedb-and-locate-use?noredirect=1#comment675528_379729 correctly, the database isn't key-value. — Tim, Jul 21 '17 at 13:15

score 30 · Accepted Answer · answered Jul 20 '17 at 12:15

Implementations of locate/updatedb typically use specific databases tailored to their requirements, rather than a generic database engine. You’ll find those specific databases documented by each implementation; for example:

- GNU findutils’ is documented in locatedb(5), and is pretty much just a list of files (with a specific compression algorithm);
- mlocate’s is documented in mlocate.db(5), and can also be considered a list of directories and files (with metadata).
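
To give a feel for what "a list of files with a specific compression algorithm" can mean, here is a minimal sketch of front compression (incremental encoding), the general idea behind the `frcode` utility mentioned further down the thread. It is an illustration under assumptions, not the on-disk format: as I understand locatedb(5), the real database stores the change in shared-prefix length as bytes followed by NUL-terminated suffixes, whereas this sketch keeps absolute prefix lengths in Python tuples. The function names `front_encode` and `front_decode` are made up for this example.

```python
import os


def front_encode(sorted_paths):
    """Turn a sorted list of paths into (shared_prefix_length, suffix) pairs.

    Each entry only stores the part that differs from the previous path.
    """
    encoded = []
    previous = ""
    for path in sorted_paths:
        # Number of leading characters this path shares with the previous one.
        shared = len(os.path.commonprefix([previous, path]))
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


def front_decode(encoded):
    """Rebuild the full paths from (shared_prefix_length, suffix) pairs."""
    paths = []
    previous = ""
    for shared, suffix in encoded:
        path = previous[:shared] + suffix
        paths.append(path)
        previous = path
    return paths


# Hypothetical sample paths, sorted so that neighbours share long prefixes.
paths = sorted(["/usr/bin/locate", "/usr/bin/updatedb", "/usr/lib/locate/frcode"])
encoded = front_encode(paths)
# encoded == [(0, "/usr/bin/locate"), (9, "updatedb"), (5, "lib/locate/frcode")]
decoded = front_decode(encoded)
```

Because every entry depends on the one before it, a search has to decode the list from the start, which fits the description of these files as flat lists rather than indexed databases.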

Stephen Kitt

Thanks. Where and how can i learn the principles of designing and implementing specific databases tailored to specific requirements? I'd appreciate any references for reading. – Tim Jul 20 '17 at 12:27
Designing databases boils down to designing data structures, so learn about those, and then about size-versus-speed design trade-offs... I don’t know of a specific resource that would be good, perhaps something like _Programming Pearls_ would be a nice introduction to the way of thinking about these topics (and not over-thinking them too). – Stephen Kitt Jul 20 '17 at 12:33
Thanks. I have learned something about data structures, and the next question would be finding references and ways to go from data structures to databases. – Tim Jul 20 '17 at 12:38
Databases as used by `locate` are just data structures stored on disk, so going from the data structures to the corresponding databases is relatively straightforward. Moving to databases as your question presents them is another thing entirely; there are books and courses dedicated to those topics. Designing and developing a database management system such as MongoDB or PostgreSQL is one of the harder problems in computer science and software engineering today, especially when you throw in the distributed side of things. – Stephen Kitt Jul 20 '17 at 12:43
I've done a fair bit with locatedb & mlocate.db over the years. I originally had perl code to generate a locatedb for my `dlocate` program in debian. I ended up discovering that just grepping a text file was many times faster than searching a locatedb, and given the size of disks these days the file size savings were insignificant. So I switched to just grep. I also have a local cron job that dumps mlocate.db to plain text after the mlocate cron job runs, which I search with a local `qlocate` shell script....much faster than running `mlocate` and also has some useful extra options. – cas Jul 20 '17 at 12:48
@cas: (1) I am surprised to know. Why "'grepping a text file was many times faster than searching a locatedb" and your "local qlocate shell script....much faster than running mlocate"? (2) Could you share your "perl code to generate a locatedb for my dlocate program in debian", script for "a local cron job that dumps mlocate.db to plain text after the mlocate cron job runs", and "local qlocate shell script"? – Tim Jul 20 '17 at 13:19
1. Because locatedb's compression algorithm isn't very fast. Even with late 90s era (i.e. relatively slow by today's standards, extremely slow compared to SSDs) disks, grepping a large file was MUCH faster than running locate. Still is. Test it yourself by dumping your locatedb or mlocate.db to text and timing grep vs locate/mlocate. 2. The perl script for generating a locatedb did little more than pipe /var/lib/dpkg/info/*.list files into locate's `frcode` utility. There are several old versions of dlocate & `update-dlocatedb` in debian's archives from before I switched to grep. – cas Jul 20 '17 at 13:44
The post-mlocate cron job just does `cd /var/lib/mlocate && /usr/bin/locate / > mlocate.txt.new && mv mlocate.txt.new mlocate.txt`. My qlocate script is just a wrapper around `grep .... /var/lib/mlocate/mlocate.txt`, with a few extra options I find useful (like `-d` for searching package filenames in my local debian mirror). I probably should add it to my misc scripts repo on GitHub one of these days. – cas Jul 20 '17 at 13:47
@StephenKitt: Thanks. Which kinds of databases (listed in my post and the links) do the databases in the two implementations belong to? – Tim Jul 21 '17 at 04:48
None, they’re just flat-file databases, arguably with a tree structure. They don’t correspond to any of the types you listed. – Stephen Kitt Jul 21 '17 at 05:24
@Tim basic database text: https://www.goodreads.com/book/show/161300.Fundamentals_of_Database_Systems – jmullee Jul 21 '17 at 15:29
I think the main point of a lot of this is that databases aren't a unique type of entity. They're just data organized in a searchable/modifiable fashion with some rules about what manipulations are allowed and concepts about how that data ought to be stored (normalization, etc.). The part that gets hairy is the implementation of systems to access them, which need to ensure high performance and data integrity. – Joe Jul 22 '17 at 02:28

jmullee · Answer 2 · 2017-07-20T20:54:03.240

15

Seems to be a flat file of C structs, written/read using the GNU libc obstack macros.

See sources

https://github.com/msekletar/mlocate/blob/master/src/updatedb.c#L720

https://github.com/msekletar/mlocate/blob/master/src/locate.c#L413

You could get something similar with

find / -xdev -type f -not -path \*\.git\/\* | gzip -9 > /tmp/files.gz
zgrep file_i_want /tmp/files.gz
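
The same idea can be sketched in Python, which makes the "database" visibly nothing more than a compressed list of paths. This is a rough approximation under assumptions, not an equivalent of the commands above: it does not implement `-xdev`, it skips directories whose names end in `.git` rather than matching on the whole path, it does plain substring matching where `zgrep` uses regular expressions, and it assumes no path contains a newline. The names `build_index` and `search_index` are made up for this example.

```python
import gzip
import os
import tempfile


def build_index(root, index_path):
    """Walk root and write one regular-file path per line into a gzip file."""
    with gzip.open(index_path, "wt", encoding="utf-8",
                   errors="surrogateescape") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune directories ending in ".git" so os.walk never enters them.
            dirnames[:] = [d for d in dirnames if not d.endswith(".git")]
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Keep regular files only; symlinks are skipped.
                if os.path.isfile(path) and not os.path.islink(path):
                    out.write(path + "\n")


def search_index(index_path, needle):
    """Return every indexed path that contains needle as a substring."""
    with gzip.open(index_path, "rt", encoding="utf-8",
                   errors="surrogateescape") as index:
        return [line.rstrip("\n") for line in index if needle in line]


# Small demonstration on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tree, tempfile.TemporaryDirectory() as scratch:
    os.makedirs(os.path.join(tree, "project", ".git"))
    for rel in ("project/notes.txt", "project/.git/config"):
        with open(os.path.join(tree, rel), "w") as handle:
            handle.write("x")
    index_path = os.path.join(scratch, "files.gz")
    build_index(tree, index_path)
    matches = search_index(index_path, "notes.txt")
    git_matches = search_index(index_path, ".git")
```

The index is written outside the tree being walked so that it does not list itself.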

edited Jul 20 '17 at 20:54

answered Jul 20 '17 at 20:39

jmullee

Thanks. What are the two commands at the end doing? – Tim Jul 20 '17 at 20:58
@Tim The first command searches the filesystem (`find`) starting from the root directory (`/`), without descending into directories on other filesystems (`-xdev`), for regular files (`-type f`) whose paths do not match `*.git/*` (`-not -path \*\.git\/\*`). It compresses the output (`| gzip -9`) and saves it to the file `/tmp/files.gz` (`> /tmp/files.gz`). The next line uses `zgrep` to search for `file_i_want` inside the compressed file `/tmp/files.gz`. – piotrekkr Jul 21 '17 at 06:53
The `find` example is very inspiring. I ran it without `gzip`-ing the output and used vscode to check the contents. I also didn't expect `find` to traverse the `/` filesystem so quickly! – Rick Jun 04 '22 at 16:00

Romeo Ninov · Answer 3 · 2017-07-20T12:25:14.440

As far as I know, the database behind it is Berkeley DB, which is a daemonless key/value database. Follow the link for more info. Extract from Wikipedia:

Berkeley DB (BDB) is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is written in C with API bindings for C++, C#, Java, Perl, PHP, Python, Ruby, Smalltalk, Tcl, and many other programming languages. BDB stores arbitrary key/data pairs as byte arrays, and supports multiple data items for a single key. Berkeley DB is not a relational database.

The database location on RHEL/CentOS is /var/lib/mlocate/mlocate.db (I am not sure about other distributions). The command locate --statistics reports the database location and some statistics about it (example):

Database /var/lib/mlocate/mlocate.db:
        16,375 directories
        242,457 files
        11,280,301 bytes in file names
        4,526,116 bytes used to store database

For the mlocate format, here is the start of the man page:

A mlocate database starts with a file header: 8 bytes for a magic number ("\0mlocate" like a C literal), 4 bytes for the configuration block size in big endian, 1 byte for file format version (0), 1 byte for the “require visibility” flag (0 or 1), 2 bytes padding, and a NUL-terminated path name of the root of the database.

The header is followed by a configuration block, included to ensure databases are not reused if some configuration changes could affect their contents. The size of the configuration block in bytes is stored in the file header. The configuration block is a sequence of variable assignments, ordered by variable name. Each variable assignment consists of a NUL-terminated variable name and an ordered list of NUL-terminated values. The value list is terminated by one more NUL character. The ordering used is defined by the strcmp () function.
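
The layout quoted above can be read with a few lines of Python. This is a minimal sketch that follows only the two quoted paragraphs, so it stops after the configuration block and says nothing about the directory entries that come later in the file. The function names and the sample bytes (including the variable names and values in them) are made up for illustration; the sample is built in memory rather than read from a real /var/lib/mlocate/mlocate.db.

```python
import struct

MAGIC = b"\0mlocate"  # 8-byte magic number from the quoted man page text


def parse_header(data):
    """Parse the fixed header fields and the root path; return (header, offset)."""
    # ">" = big endian, 8s = magic, I = config block size, B = version,
    # B = "require visibility" flag, 2x = two padding bytes (16 bytes total).
    magic, conf_size, version, visibility = struct.unpack_from(">8sIBB2x", data, 0)
    if magic != MAGIC:
        raise ValueError("not an mlocate database")
    root_end = data.index(b"\0", 16)  # root path is NUL-terminated
    header = {
        "conf_size": conf_size,
        "version": version,
        "require_visibility": visibility,
        "root": data[16:root_end].decode("utf-8", "surrogateescape"),
    }
    return header, root_end + 1


def parse_config_block(block):
    """Parse 'name NUL value NUL ... NUL' assignments into a dict of lists."""
    config = {}
    pos = 0
    while pos < len(block):
        end = block.index(b"\0", pos)
        name = block[pos:end].decode("utf-8", "surrogateescape")
        pos = end + 1
        values = []
        # An extra NUL in place of a value ends the value list.
        while block[pos:pos + 1] != b"\0":
            end = block.index(b"\0", pos)
            values.append(block[pos:end].decode("utf-8", "surrogateescape"))
            pos = end + 1
        pos += 1  # skip the list terminator
        config[name] = values
    return config


# Hypothetical sample data, built to match the quoted layout.
sample_conf = (b"prune_bind_mounts\0" b"1\0" b"\0"
               b"prunepaths\0" b"/proc\0" b"/tmp\0" b"\0")
sample = MAGIC + struct.pack(">IBB2x", len(sample_conf), 0, 0) + b"/\0" + sample_conf

header, offset = parse_header(sample)
config = parse_config_block(sample[offset:offset + header["conf_size"]])
```

Nothing here involves keys, indexes, or a database library: the file is read front to back as a sequence of fixed-size fields and NUL-terminated strings.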

Do you have any source backing your BerkeleyDB claim? The second part of your answer contradicts it. — Mat, Jul 21 '17 at 07:23
