10

I have a server which exports a directory containing ~7 million files (mostly images) from its local disk to network clients via NFS.

I need to add a second server for the sake of HA, and to keep it in sync with the first one with as little delta between the two as possible.

Research suggests using lsyncd or other inotify-based solutions for this, but with this many files, creating the inotify watches takes an eternity. The same goes for rsync.

Other possible solutions seem to be DRBD, or cluster file systems such as Ceph or GlusterFS, but I have no experience with those and don't know which one would be more appropriate, cope well with that many files, and still provide decent performance.

Note that the activity is mostly read with little write occurring.

agc
user60039
    DRBD works fine and is simple to set up and understand in a 2-machine cluster setup; however, it won't scale in the near future. There are also other possible approaches to the subject. http://highscalability.com/blog/2012/6/20/ask-highscalability-how-do-i-organize-millions-of-images.html – Rui F Ribeiro Jun 06 '16 at 15:46
  • Did you try to run `rsync` in daemon mode? This will speed up the initial generation of the file list when running the `rsync` command, but will be RAM intensive depending on the number of files. – Thomas Jun 06 '16 at 16:13
  • how much delay can you tolerate? if you can tolerate a few minutes (or more), using `btrfs` or `zfs` may be an option, combined with a cron job to create snapshots and `zfs send` or `btrfs send` them to the backup server. much faster and a much lighter workload (for both the sender and receiver machines) than rsync because the snapshot+send doesn't need to compare file timestamps or checksums. – cas Jun 07 '16 at 03:26
  • BTW, with [ceph](https://en.wikipedia.org/wiki/Ceph_(software)) you also get the option of using it as an object store (e.g. like amazon's s3 or openstacks's swift) instead of a filesystem. In fact, ceph's fs is actually layered on top of its object-store. – cas Jun 07 '16 at 03:29
  • @Thomas: `rsync -a` using daemon (on source) takes 200 minutes to complete, which is more than what is acceptable. @cas: `btrfs send` might be worth a shot; I'll look into it. I can't move to an object store as I'm not the developer of the app that uses the files. – user60039 Jun 07 '16 at 08:20
  • I believe you don't keep all 7 million files in the same dir but rather distributed among many (sub)dirs. Whichever way you choose, you may want to create separate sync settings for each of those (sub)dirs to reduce scan time. – Putnik Jun 08 '16 at 08:14
  • The pain point is going to be having a process walk through every directory syncing content. Anything that walks the directory tree is going to take time. I'd go to a lower level and "blindly" sync the few writes that occur. Options here would be 1. DRBD. 2. Ceph (though we got hurt in a recent project where count(nodes) < 5). 3. Export disks from 2 different nodes via iSCSI and mdraid-1 them. Personally, I'd go with option 3. Probably the easiest to set up :) – Lmwangi Jul 24 '16 at 18:24
  • I don't know the particulars of your requirements, but `syncthing` could possibly be one solution. Their [public stats](https://data.syncthing.net/) show that the largest installations manage 7M files per node. There might be issues with this in terms of memory. My own largest `syncthing` node manages 130,000 files (65 GB of data) and uses 65 MB of memory. – Kusalananda Aug 03 '16 at 16:42
  • Thanks for all the answers, which were all very interesting. Unfortunately I'm unable to validate one of them, as the client who had this requirement decided that he didn't care about redundancy anymore, after I spent time setting up a glusterfs cluster for him (grrr). – user60039 Nov 01 '16 at 00:19
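The snapshot-and-send approach suggested in the comments can be sketched roughly as follows. This is a dry run that only prints the commands a cron job would execute; the dataset name, standby host, and snapshot naming scheme are all assumptions, not anything from the thread:

```shell
# Sketch of the ZFS snapshot+send idea, assuming a dataset "tank/images"
# and a standby host "nfs2.example.com" -- both hypothetical names.
SRC_DATASET="tank/images"
DST_HOST="nfs2.example.com"

# Timestamped snapshot names let an incremental send reference the previous one.
new_snap="${SRC_DATASET}@sync-$(date +%Y%m%d-%H%M%S)"
prev_snap="${SRC_DATASET}@sync-last"   # placeholder: the previously sent snapshot

# Dry run: print the commands instead of executing them.
echo "zfs snapshot ${new_snap}"
echo "zfs send -i ${prev_snap} ${new_snap} | ssh ${DST_HOST} zfs recv -F ${SRC_DATASET}"
```

Because an incremental `zfs send` only transfers blocks changed since the previous snapshot, no directory walk or per-file comparison happens on either side, which is why it scales to millions of files where rsync struggles.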

2 Answers

1

I'm inclined to suggest replication that is data-agnostic, like DRBD. The large number of files is going to cause anything running at a higher level than "block storage" to spend an inordinate amount of time walking the tree, as you've found using rsync or creating inotify watches.

The short version of my personal story backing that up: I've not used Ceph, but I'm pretty sure this isn't in its prime market target, based on its similarity to Gluster. I have, however, been trying to implement this kind of solution with Gluster for the past several years. It's been up and running most of that time, through several major version updates, but I've had no end of problems. If your goal is more redundancy than performance, Gluster may not be a good solution. Particularly if your usage pattern has a lot of stat() calls, Gluster doesn't do really well with replication. This is because stat calls to replicated volumes go to all of the replicated nodes (actually "bricks", but you're probably just going to have one brick per host). If you have a 2-way replica, for example, each stat() from a client waits for a response from both bricks to make sure it's using current data.

Then you also have the FUSE overhead and lack of caching if you're using the native Gluster filesystem for redundancy (rather than using Gluster as the backend with NFS as the protocol and an automounter for redundancy, which still sucks for the stat() reason). Gluster does really well with large files where you can spread data across multiple servers, though; the data striping and distribution work well, as that's really what it's for. And the newer RAID10-type replication performs better than the older straight replicated volumes. But, based on what I'm guessing is your usage model, I'd advise against it.
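For reference, the 2-way replica setup described above would be created with commands roughly like these. Hostnames and brick paths are hypothetical, and this sketch only prints the commands rather than running them:

```shell
# Hypothetical hosts and brick paths; one brick per host, replica count 2.
BRICK1="server1:/bricks/images"
BRICK2="server2:/bricks/images"

# Dry run: print the gluster commands instead of executing them.
echo "gluster volume create images replica 2 ${BRICK1} ${BRICK2}"
echo "gluster volume start images"
# A native FUSE client mount; every stat() on this volume consults both bricks.
echo "mount -t glusterfs server1:/images /mnt/images"
```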

Bear in mind that you'll probably have to find a way to hold master elections between the machines, or implement distributed locking. Shared block device solutions require a filesystem that is multi-master aware (like GFS), or require that only one node mount the filesystem read-write. Filesystems in general dislike having data changed at the block device level underneath them. That means your clients will need to be able to tell which node is the master and direct write requests there, which may turn out to be a big nuisance. If GFS and all of its supporting infrastructure is an option, DRBD in multi-master mode (they call it "dual primary") could work well. See https://www.drbd.org/en/doc/users-guide-83/s-dual-primary-mode for more information on that.
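As a sketch of what the linked docs describe, a DRBD 8.3 resource configured for dual-primary mode looks roughly like this. The resource name, hostnames, devices, and addresses are hypothetical placeholders:

```
# /etc/drbd.d/r0.res -- hypothetical dual-primary resource (DRBD 8.3 syntax)
resource r0 {
  net {
    allow-two-primaries;      # permit both nodes to be Primary at once
  }
  startup {
    become-primary-on both;   # promote both nodes on startup
  }
  on alpha {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Remember that `/dev/drbd0` then has to carry a cluster-aware filesystem such as GFS2 or OCFS2; mounting an ordinary ext4/XFS read-write on both primaries will corrupt it.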

Regardless of the direction you go, you're apt to find that doing this in real time is still a fairly big pain without just giving a SAN company a truckload of money.

dannysauer
  • I'm in the initial stages of migrating from rsync commands in cron to using a distributed filesystem. If Gluster runs stat() on all bricks, I may need to reconsider it as a solution. – Jesusaur Aug 25 '16 at 21:40
  • 1
    That is only the case in a replicated filesystem; it runs `stat()` on all bricks which have replicas of the specific block you're looking at. For example, if you have a 2x2 striped replica, the `stat` would run on the two bricks with the replicated block, but not on the other two. In my application with lots of small files (on the order of a million files under 4K of data each), neither NFS nor FUSE provided good performance on replicated volumes. And georeplication to ~20 machines was very unreliable in several configs. – dannysauer Aug 25 '16 at 21:47
  • 1
    Your mileage may vary, but I moved from Gluster everywhere (which I was using exclusively for replication, not for all of the other really cool things Gluster actually does well) to rsync on native filesystems. :) I'm looking at moving to lsyncd (https://github.com/axkibe/lsyncd) now instead of cron, so I can get near real-time without the Gluster overhead. – dannysauer Aug 25 '16 at 21:52
0

I shifted from rsync to Ceph with the help of a Proxmox VE setup.

Now I am managing 14 TB in one cluster with live replication, and have been for nearly 4 years.