Conversation

@Foxboron
Contributor

@Foxboron Foxboron commented Apr 28, 2025

WIP draft PR for the feature. So far it works, but it's a bit slow on my machine due to the ID remapping, which should probably be investigated.

  • Investigate the ID remapping
  • I'm unsure about the different migration options
  • Only support vers=4.2, should be fine?
  • Error out on missing source
  • Any specific options we should include?
  • Info array needs some QA. Not sure I understand all of it.
  • NFS doesn't support xattrs; do we need to handle this in places other than migrate?

Fixes: #1311

@Foxboron
Contributor Author

Foxboron commented May 8, 2025

@stgraber if you have any opinions or pointers on the checkboxes, feel free to look them over and I can investigate a bit.

I suspect some research needs to be done to figure out how we should interact with the uid/gid and squashing behavior of NFS mounts.

@bensmrs
Contributor

bensmrs commented Jul 21, 2025

Hi! Is this PR stalled? Do you need help?

@Foxboron
Contributor Author

@bensmrs I think I need a bit of help to make sure that the Info struct is correct and that I'm not missing any details from the migration steps. I was hoping @stgraber had time to point me in the right direction on this part.

@stgraber
Member

Ah, I just left a few comments in the Info function now.

@bensmrs
Contributor

bensmrs commented Jul 21, 2025

I was hoping @stgraber had time to point me in the right direction on this part.

Well now you’re served :þ

I can help review, fix stuff, and even write tests, so don’t hesitate to ping me. I’d actually be happy to see this working pretty soon.

@Foxboron
Contributor Author

@bensmrs thanks!

I probably won't touch this in July, and there's a hacker camp plus work stuff happening in August on my end, so I might not have a lot of energy to pick this up after hours.

I'd appreciate some pointers on the ID remapping Incus does and how that should interact with the ID squashing NFS does. I think there should be some guidance and testing there to make sure it behaves as expected. The remapping also takes a bunch of time, which might not be needed if NFS reassigns the UID/GID anyway.

If you have time to write tests and stuff I'd be happy to give you access to my fork.

@stgraber
Member

I don't believe that VFS idmap works on top of NFS at this point, so we'd be dealing with traditional shifting where Incus rewrites all the uid/gid as needed.

What you want to ensure is that NFS is mounted the same way on all machines and that no uid/gid squashing is performed, then that should work fine.

@Foxboron
Contributor Author

Should we call the driver nfs4 to make sure we don't end up in a weird situation down the line where we don't want to support a v5 within the same code as the v4 one? Is that realistic?

@stgraber
Member

I think we should stick to nfs, as nfs4 may give the impression that we don't support NFSv3, whereas the feature set we really need works perfectly fine on NFSv3. If anything, some of the bits in NFSv4 may be problematic (built-in uid/gid mapping and such).

Exactly what kind of server version and configuration we can handle is probably something that's best addressed through the documentation for the driver.

@symgryph

NFS would be SO nice. Just an interested party!

@Foxboron Foxboron force-pushed the morten/nfs branch 2 times, most recently from b617326 to 49f1676 on October 19, 2025 15:00
@Foxboron Foxboron marked this pull request as ready for review October 19, 2025 15:00
@Foxboron
Contributor Author

@bensmrs

I made some changes so we can pass IPv6 source= paths. It's not great, but I couldn't come up with a more reliable way to split paths like ::1:/somepath.

What sort of testing do we want on this? Locally, creating pools, containers, and VMs works, as does attaching additional volumes.
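
For reference, the splitting ends up looking roughly like this (an illustrative sketch, not the exact code in the branch; a remote path that itself contains ":/" would still confuse it, which is part of why it's "not great"):

package main

import (
	"fmt"
	"strings"
)

// splitNFSSource splits an NFS "source" value of the form
// "[<remote host>:]<remote path>" into host and path. IPv6 hosts
// contain colons themselves, so instead of cutting on the first ":"
// this cuts on the last ":" that is directly followed by a "/".
func splitNFSSource(source string) (host, path string) {
	idx := strings.LastIndex(source, ":/")
	if idx == -1 {
		// No host component; the whole value is a path.
		return "", source
	}

	return source[:idx], source[idx+1:]
}

func main() {
	fmt.Println(splitNFSSource("10.1.0.227:/media")) // "10.1.0.227" "/media"
	fmt.Println(splitNFSSource("::1:/somepath"))     // "::1" "/somepath"
	fmt.Println(splitNFSSource("/local/path"))       // "" "/local/path"
}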

@bensmrs
Contributor

bensmrs commented Oct 19, 2025

What sort of testing do we want on this? Locally, creating pools, containers, and VMs works, as does attaching additional volumes.

We’d basically extend the available filesystem drivers in the test matrix to test everything we’re already testing:

backend:
- dir
- btrfs
- lvm
- zfs
- ceph
- linstor
- random

The test initialization routines let you define how your driver should be initialized in test/backends, although that alone isn’t enough. I can have a quick look at it as soon as I’m done with my other PR.

@Foxboron Foxboron force-pushed the morten/nfs branch 2 times, most recently from 0080a77 to 7070ad5 on October 19, 2025 20:32
@github-actions github-actions bot added the Documentation (Documentation needs updating) label on Oct 19, 2025
@bensmrs
Contributor

bensmrs commented Oct 20, 2025

I started looking at the tests and am a bit confused by the following error:

> incus storage create incustest-KAg nfs source=10.1.0.227:/media
Error: NFS driver requires "nfs.host" to be specified or included in "source": [<remote host>:]<remote path>

Did I do something wrong here?
IMO, if you have an nfs.host option, I don’t think you should look for a host in source. Just make source be the path on nfs.host, or on localhost if no nfs.host is provided.

@Foxboron
Contributor Author

Did I do something wrong here?

Uhh, so it turns out that the Makefile has moved from building things into build/ to running go install. So there is a slight chance I missed testing an iteration while working on this.

IMO, if you have an nfs.host option, I don’t think you should look for a host in source. Just make source be the path on nfs.host, or on localhost if no nfs.host is provided.

I was thinking of mimicking the behavior of truenas.host and source from the truenas driver, where you can specify nfs.host globally and just pass a path to source. But I suspect I haven't nailed the entire behavior there yet. I'm a little unsure what the most ergonomic behavior for the options is that also harmonizes with the existing drivers.
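
To make the intended precedence concrete, roughly this (an illustrative sketch only; the config keys and the error message are the ones from this branch, the helper itself is made up):

package main

import (
	"fmt"
	"strings"
)

// resolveNFSSource sketches the intended precedence: if "nfs.host" is
// set, "source" is just the remote path on that host; otherwise
// "source" may carry its own host prefix ("[<remote host>:]<remote path>").
func resolveNFSSource(config map[string]string) (host string, path string, err error) {
	source := config["source"]
	if source == "" {
		// Error out on a missing source (illustrative message).
		return "", "", fmt.Errorf("NFS driver requires a source")
	}

	if h := config["nfs.host"]; h != "" {
		// Host set globally, so source is only the remote path.
		return h, source, nil
	}

	// Otherwise split a "host:/path" source on the last ":/" so that
	// IPv6 hosts such as "::1" keep working.
	if idx := strings.LastIndex(source, ":/"); idx != -1 {
		return source[:idx], source[idx+1:], nil
	}

	return "", "", fmt.Errorf(`NFS driver requires "nfs.host" to be specified or included in "source"`)
}

func main() {
	fmt.Println(resolveNFSSource(map[string]string{"nfs.host": "10.1.0.227", "source": "/media"}))
	fmt.Println(resolveNFSSource(map[string]string{"source": "::1:/srv/incus"}))
}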

@bensmrs
Contributor

bensmrs commented Oct 20, 2025

Uhh, so it turns out that the Makefile has moved from building things into build/ to running go install. So there is a slight chance I missed testing an iteration while working on this.

No problem, I ended up using nfs.host for driver initialization.

I was thinking of mimicking the behavior of truenas.host and source from the truenas driver, where you can specify nfs.host globally and just pass a path to source. But I suspect I haven't nailed the entire behavior there yet. I'm a little unsure what the most ergonomic behavior for the options is that also harmonizes with the existing drivers.

Ok I can’t beat this argument :)

I’m getting some permission denied errors in my tests; I’m investigating.

@bensmrs
Contributor

bensmrs commented Oct 20, 2025

So, maybe I’m missing something when initializing the pool, but I can’t get past permission errors. Container creation and launch work well, but that’s not enough; see https://github.com/bensmrs/incus/actions/runs/18648669885/job/53161380407?pr=4

Am I missing something in my setup? See

echo "/media 10.0.0.0/8(rw,sync,no_subtree_check,no_root_squash,no_all_squash)" | sudo tee /etc/exports
sudo exportfs -a

@Foxboron
Contributor Author

Foxboron commented Oct 20, 2025

Hmmm, I've only tried incus exec container bash and manually creating files on NFS volumes on both ends, and that worked. Are we sure that /media doesn't have any weird permissions on Ubuntu? Does it exist, etc.?

@bensmrs
Contributor

bensmrs commented Oct 20, 2025

Container launch works, so regular operations don’t seem to cause any problems.
I’ll debug a bit more later; I was just wondering if your setup had anything specific other than the squash options.

@bensmrs
Contributor

bensmrs commented Oct 20, 2025

Well, I’m currently out of ideas. user_is_instance_user fails with a permission error. We can’t list /root from within the container: it appears owned by 0:0, whoami fails because user 0 is not found, and from the host the mount appears owned by 1000000:1000000. I don’t know where to go from there; maybe I’m missing something obvious happening in the OpenFGA tests.

Suggested-by: Morten Linderud <morten@linderud.pw>
Signed-off-by: Benjamin Somers <benjamin.somers@imt-atlantique.fr>
Signed-off-by: Benjamin Somers <benjamin.somers@imt-atlantique.fr>
Signed-off-by: Benjamin Somers <benjamin.somers@imt-atlantique.fr>
Signed-off-by: Benjamin Somers <benjamin.somers@imt-atlantique.fr>
Signed-off-by: Benjamin Somers <benjamin.somers@imt-atlantique.fr>
@jack9603301

This draft seems very cool, but why not just handle the NFS mounting yourself at the operating system level and then run Incus in dir mode?

@stgraber
Member

This draft seems very cool, but why not just handle the NFS mounting yourself at the operating system level and then run Incus in dir mode?

If you do that, then Incus doesn't know that the same data can be accessed from all systems in a cluster.

@Foxboron
Contributor Author

This draft seems very cool, but why not just handle the NFS mounting yourself at the operating system level and then run Incus in dir mode?

I would really like to avoid having to manage every cluster node's fstab for each new NFS mount I want.

@jack9603301

I would really like to avoid having to manage every cluster node's fstab for each new NFS mount I want.

Seems like a good reason

@jack9603301

If you do that, then Incus doesn't know that the same data can be accessed from all systems in a cluster.

Sorry, I don't understand what you mean

@bensmrs
Contributor

bensmrs commented Oct 22, 2025

If you do that, then Incus doesn't know that the same data can be accessed from all systems in a cluster.

Sorry, I don't understand what you mean

The dir backend is not suited for cluster operation because it only considers the local filesystems available to the host. Moving data from node to node therefore means copying it from one node to the other. There are two problems with using NFS this way: i) copying the data is pointless, as it is already available to every node, and, more critically, ii) moving data, e.g. instances, would essentially make the receiving node write that data to the same place it is currently stored, without any kind of FS locking mechanism.

@jack9603301

The dir backend is not suited for cluster operation because it only considers the local filesystems available to the host. Moving data from node to node therefore means copying it from one node to the other. There are two problems with using NFS this way: i) copying the data is pointless, as it is already available to every node, and, more critically, ii) moving data, e.g. instances, would essentially make the receiving node write that data to the same place it is currently stored, without any kind of FS locking mechanism.

I don't think it's a problem for regular operations and maintenance, because the dir backend ultimately relies on the Linux filesystem tree, which gives more deployment options. For example, operations staff can choose a remote NFS mount (with a single point of failure), a Ceph or GlusterFS cluster filesystem, or a DRBD disk synchronization setup, and deploy the Incus container data on top. The data will always be synchronized automatically by the underlying system. This is Linux: each part only does what it does best.

@Foxboron
Contributor Author

I don't think it's a problem for regular operations and maintenance, because the dir backend ultimately relies on the Linux filesystem tree, which gives more deployment options. For example, operations staff can choose a remote NFS mount (with a single point of failure), a Ceph or GlusterFS cluster filesystem, or a DRBD disk synchronization setup, and deploy the Incus container data on top. The data will always be synchronized automatically by the underlying system. This is Linux: each part only does what it does best.

I don't think arguing against the merits of an NFS storage driver in the PR adding said storage driver is helpful or welcome.

@jack9603301

I don't think arguing against the merits of an NFS storage driver in the PR adding said storage driver is helpful or welcome.

Sorry, I don't understand what you mean

@bensmrs
Contributor

bensmrs commented Oct 23, 2025

Please keep this debate out of this PR. The forum is there for this kind of thing, and, to a lesser extent, the issue section, for discussing early design-stage decisions. Your use case, as legitimate as it may seem to you, unfortunately doesn’t cover a tenth of the use cases of Incus users that we have to account for.

@jack9603301

Please keep this debate out of this PR. The forum is there for this kind of thing, and, to a lesser extent, the issue section, for discussing early design-stage decisions. Your use case, as legitimate as it may seem to you, unfortunately doesn’t cover a tenth of the use cases of Incus users that we have to account for.

OK, but in fact this is a very typical production deployment problem. People may not use the NFS backend; due to the complexity of the production system, they use the dir backend directly.

@bensmrs
Contributor

bensmrs commented Oct 23, 2025

The more I think about it, the more I feel we need to intercept syscalls. It shouldn’t be incredibly hard, and I think it would look pretty similar to the pre-existing setxattr interception. However, special care would need to be taken to filter out non-NFS-mounted paths. WDYT @stgraber? Should we go that way?

(The other idea could be to mount the FS at another location and perform bind-mounting to the actual rootfs location.)

@stgraber
Member

The more I think about it, the more I feel we need to intercept syscalls. It shouldn’t be incredibly hard, and I think it would look pretty similar to the pre-existing setxattr interception. However, special care would need to be taken to filter out non-NFS-mounted paths. WDYT @stgraber? Should we go that way?

(The other idea could be to mount the FS at another location and perform bind-mounting to the actual rootfs location.)

Can you provide a summary of the issue as it currently stands? There's been a lot of back and forth over the past few days :)

In general, if we're talking about intercepting chown, that'd be a terrible idea because it's called extremely frequently (so could cause a DoS) and has various variants that may be very hard to handle safely (including calls from within namespaces or chroots).

@Foxboron
Contributor Author

Can you provide a summary of the issue as it currently stands? There's been a lot of back and forth over the past few days :)

On an NFS root, chown does not work.

Minimal reproducer, from what I can tell:
lxc-usernsexec -m u:0:1000000:1000000000 -m g:0:1000000:1000000000 -- /bin/chown 55:55 testfile

It's unclear to me why this is. I have ruled out busybox bugs as this happens with an Arch userland as well as the busybox test images.

@stgraber
Member

stgraber commented Oct 23, 2025

@brauner can you look into this with your VFS maintainer hat on?

To reproduce this on pengar, you can do:

root@pengar:~# mkdir -p /mnt/nfs
root@pengar:~# mount -t nfs truenas01.shf.lab.linuxcontainers.org:/mnt/test/incus-nfs /mnt/nfs
Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.
root@pengar:~# ls -lh /mnt/nfs
total 512
drwxr-xr-x 2 100000 100000 3 Oct 23 21:01 rootfs
root@pengar:~# ls -lh /mnt/nfs/rootfs/
total 512
-rw-r--r-- 1 100000 100000 0 Oct 23 21:01 foo
root@pengar:~# lxc-usernsexec -m b:0:100000:65536 -- chown 1234:1234 /mnt/nfs/rootfs/foo
chown: changing ownership of '/mnt/nfs/rootfs/foo': Operation not permitted
root@pengar:~# lxc-usernsexec -m b:0:100000:65536 -- ls -l /mnt/nfs/rootfs/foo
-rw-r--r-- 1 root root 0 Oct 23 21:01 /mnt/nfs/rootfs/foo
root@pengar:~# chown 101234:101234 /mnt/nfs/rootfs/foo 
root@pengar:~# lxc-usernsexec -m b:0:100000:65536 -- ls -l /mnt/nfs/rootfs/foo
-rw-r--r-- 1 1234 1234 0 Oct 23 21:01 /mnt/nfs/rootfs/foo

So basically, doing the chown from within a userns going through the uid/gid map fails, but doing the exact same chown as real root on the host and then accessing the result from within the namespace is fine.

Pengar is on 6.16.4; I've also reproduced it on 6.16.11. Not sure what the other two have been running, but probably similarly recent kernels. This is NFSv3, so there shouldn't be too much NFS black magic in theory.

@stgraber
Member

Okay, so @brauner and I got to the bottom of this one and it's unfortunately not good.

There is no fundamental kernel issue here; it's more of a fundamental design issue with how NFS works...

Since kernel idmap shift isn't supported on top of NFS, we instead rely only on the user namespace. This means that when root in the container (uid=0 inside, 100000 outside) runs chown /foo to change its current ownership (uid/gid 0/0) to a new ownership (uid/gid 1234/4567), the request actually gets translated to uid 100000 wanting to chown a file owned by 100000/100000 to 101234/104567.

That's the request (SETATTR) that gets sent to the server, and the server rejects it because the credential (uid=100000/gid=100000) isn't privileged over the target uid/gid (101234/104567).

There is no provision in the NFS protocol to pass through the uidmap/gidmap and inform the server of what uid/gid range the requestor is privileged over, so there is no way to properly handle such a request.
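
To spell out the arithmetic (illustrative only, assuming the usual base-100000 map):

package main

import "fmt"

// Illustrative only: what a chown issued by the container's root looks
// like once the user namespace map (container 0 -> host 100000) has
// been applied and the request goes out on the NFS wire.
const base = 100000

func toHost(id int) int { return base + id }

func main() {
	callerUID := toHost(0)    // container root becomes uid 100000
	targetUID := toHost(1234) // requested new owner becomes 101234
	targetGID := toHost(4567) // requested new group becomes 104567

	// The NFS server only sees the mapped values and has no idea that
	// uid 100000 is "root" over the 100000+ range, so the SETATTR is
	// rejected as an unprivileged chown.
	fmt.Printf("SETATTR from uid %d: chown to %d:%d -> EPERM\n",
		callerUID, targetUID, targetGID)
}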

@stgraber
Member

So we have a few things to consider here:

  1. At the kernel level, if we were to implement VFS idmap on top of NFSv3 (v4 would be a massive headache), it would solve this situation, but ONLY for containers where the container's uid=0 is mapped to the host's uid=0; any other situation would still fail the server-side check.

  2. At the userspace level, there is an NFS FUSE client (https://github.com/sahlberg/fuse-nfs), but it's FUSE2-only as far as I can tell and as a result doesn't have @mihalicyn's support for VFS idmap. Getting it to support VFS idmap would provide a (slower) way forward.

  3. We limit ourselves to what works today, that is 1) shared custom volumes (with a clear note about chown not working from within an unpriv container) and 2) VM storage (possibly).

Option 3) seems safe to do now; it's basically just putting some extra checks on this branch to provide the limited subset. Option 2) would be nice to have, as that NFS client codebase isn't too hard to build, so if we can apply something like (mihalicyn/fuse-overlayfs@89a1af3) to it, it would provide a slow but otherwise functional option for unpriv containers. Option 1) would need a motivated kernel dev to work on it; as @brauner was pointing out, the fact that nobody has really talked about this issue before doesn't bode too well as far as general interest for something like this goes.

@bensmrs
Contributor

bensmrs commented Oct 28, 2025

And couldn’t an idmapped bind mount be a solution? The host mounts the NFS share somewhere, then bind-mounts it to /var/lib/incus/containers/<instance>/ with the proper shift.

Or maybe it’s bindfs-specific, in which case we’re back to using FUSE…

@stgraber
Member

And couldn’t an idmapped bind mount be a solution? The host mounts the NFS share somewhere, then bind-mounts it to /var/lib/incus/containers/<instance>/ with the proper shift.

Or maybe it’s bindfs-specific, in which case we’re back to using FUSE…

That's 1). Idmapped bind mounts require per-filesystem kernel code to handle them, and NFS does not have support for that. FUSE in general does, but it requires each filesystem to add some logic, which is why I also brought up 2) as a quicker option to get something done.

@bensmrs
Contributor

bensmrs commented Oct 28, 2025

Ok, I had in mind bindfs’ uid-offset and gid-offset options and just thought that maybe this translation mechanism existed for bind mounts (the bind of bindfs made my brain shortcut to my previous question).

I’ll have a quick look at fuse-nfs.

@stgraber stgraber marked this pull request as draft October 29, 2025 01:16
@bensmrs
Contributor

bensmrs commented Oct 29, 2025

https://github.com/bensmrs/fuse-nfs should be ok-ish for FUSE3; I’ll try to implement VFS idmap tomorrow.

@bensmrs
Contributor

bensmrs commented Oct 29, 2025

My patches should now work (I also integrated a long-standing PR that I feel is desirable for us), but testing will probably be painful. Are there quick ways for me to test the patched libfuse, or do I have to compile @mihalicyn’s tree? (In which case, big meh on my side, as I have little time for it.)
Or you could test on your side, @stgraber, if you already have everything ready…

@stgraber
Member

Cool!

If we get this working, we'd probably use a scheme where /var/lib/storage-pools/my-nfs is the NFS kernel mount and /var/lib/storage-pools/my-nfs/.fuse is the same share mounted over FUSE. Then whenever we're dealing with an unprivileged container, we'd use the .fuse path, and everything else goes through the kernel client.
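
Roughly like this (an entirely hypothetical helper, just to illustrate the scheme; the real base path and naming would follow whatever the driver already uses):

package main

import (
	"fmt"
	"path/filepath"
)

// poolMountPath sketches the path selection described above: the pool
// directory is the kernel NFS mount, and ".fuse" underneath it is the
// same share mounted through the FUSE client with VFS idmap support.
// Hypothetical names and layout, for illustration only.
func poolMountPath(poolName string, unprivilegedContainer bool) string {
	base := filepath.Join("/var/lib/incus/storage-pools", poolName)
	if unprivilegedContainer {
		// Unprivileged containers need the idmap-capable FUSE mount.
		return filepath.Join(base, ".fuse")
	}

	// Everything else (privileged containers, VMs, custom volumes)
	// goes through the kernel client.
	return base
}

func main() {
	fmt.Println(poolMountPath("my-nfs", true))  // .../my-nfs/.fuse
	fmt.Println(poolMountPath("my-nfs", false)) // .../my-nfs
}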
