Port forwarding does not work using different gateway

Question

Let me try to explain my home network setup:

          ┌────────────────────┐
          │      Internet      │
          │ Public IP: 1.2.3.4 │
          └──────────┬─────────┘
                     │
  ┌──────────────────┴─────────────────┐
  │             ISP Modem              │
  │  Forward everything to AP Router   │
  │            192.168.1.1             │
  └──────────────────┬─────────────────┘
                     │
   ┌─────────────────┴───────────────┐
   │             AP Router           │
   │         DHCP happens here       │
   │ Forward 1122 to 192.168.10.2:22 ├─────────────┐
   │           192.168.10.1          │             │
   └─────────────────┬───────────────┘             │
                     │                             │
                     │                             │
                     │                     ┌───────┴───────┐
                     │                     │ NUC (Ubuntu)  │
                     │                     │ PiHole + VPN  │
                     │                     │ 192.168.10.50 │
                     │                     └───────────────┘
                     │                             ▲
                     │                             │
┌────────────────────┴──────────────────┐          │
│            Desktop (Ubuntu)           │          │ Default routing
│              192.168.10.2             │          │
│    Default gateway: 192.168.10.50     ├──────────┘
│          DNS: 192.168.10.50           │
└───────────────────────────────────────┘

If the desktop uses 192.168.10.1 as the default gateway, port forwarding works: for example, I can SSH to the desktop through 1.2.3.4:1122. But I want the desktop to use 192.168.10.50 as the default gateway, and in that case no port forwarding works at all.

After a little research, it seems this can be done with iptables or policy-based routing, but I know nothing about either. What's the simplest way to do it?

A.B · Accepted Answer · 2021-08-18T18:47:50.867

TL;DR (1st method only)

On Desktop:

```
ip route add 192.168.10.0/24 dev eth0 table 1000
ip route add default via 192.168.10.1 dev eth0 table 1000
ip rule add iif lo ipproto tcp sport 22 lookup 1000
```

The problem

The problem here happens on the Desktop.

With a different layout, where the NUC reliably intercepts all flows, easier methods would have been available. This would have required the NUC to have two network devices, because running two IP LANs over the same Ethernet LAN doesn't prevent issues, for example with DHCP. Using the NUC as a stateful bridge would have been another solution, also requiring two NICs.

With the current layout, where the NUC can't intercept all traffic between the AP and the Desktop, the solution has to be applied on the Desktop.

Linux supports policy routing, where a selector picks a different routing table, and thus a different outcome, for a packet. Any problem involving multiple routes to apparently the same destination requires policy routing, usually with a selector that distinguishes packets by source (since the routing table itself already distinguishes them by destination).

One has to separate somehow the packets coming directly from the AP from the packets coming from the NUC, so they can have a different outcome (ie: different routes) when it's about SSH connections to the Desktop.

What ip rule doesn't appear to offer is a selector that distinguishes two packets arriving through two routes when those routes differ only in the gateway that was used. Linux's policy rules don't appear to cover this case: as long as packets come from the same interface, they are treated the same.

I'll assume that:

- Desktop's network interface is called eth0.
- Desktop isn't routing (eg: libvirt, LXC, Docker). Routing requires more configuration and a choice about what should be done (should a VM receive SSH coming from the NUC or from the AP?). The answers below would need some minor adjustments to properly create exceptions for the routing case, or containers/VMs will just follow the default route (ie: through the NUC).

Here are two methods.

Policy routing matching layer 4 protocol (TCP port 22)

Since Linux 4.17 one can use a selector to match here on TCP port 22 with policy routing. Then it's easy to use a different route for it. Instead of handling the origin of the packet differently, handle this specific port differently:

```
ip route add 192.168.10.0/24 dev eth0 table 1000
ip route add default via 192.168.10.1 dev eth0 table 1000
ip rule add iif lo ipproto tcp sport 22 lookup 1000
```

Here iif lo isn't really about the lo interface; it is the specific syntax meaning "from the local system". The LAN route must also be duplicated, otherwise, for example, replies to an SSH connection from the NUC itself would be sent through the AP, which would emit ICMP redirects to signal the misconfiguration. In this specific case no rule is needed to specify an alternate route for received packets, since they arrive on the same interface. Had it been another interface with SRPF enabled (rp_filter=1), ip rule add iif eth0 ipproto tcp dport 22 lookup 1000 would also have been needed, with eth0 replaced by the actual other interface in both the rule and the default route.
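The same three-command pattern extends to other TCP ports forwarded by the AP. Here is a minimal sketch, assuming the interface eth0, table 1000 and gateway 192.168.10.1 used above. The helper name policy_route_cmds and the extra port 443 are illustrative and not part of the method itself. It only prints the commands, so they can be reviewed before being run as root:

```shell
#!/bin/sh
# Print the policy-routing commands for replies sent from the given local TCP ports.
# Hypothetical helper: nothing is applied, the output is meant to be reviewed first.
policy_route_cmds() {
    dev=$1; gw=$2; lan=$3; table=$4
    shift 4
    # Routes of the alternate table: the LAN itself, then the default via the AP.
    echo "ip route add $lan dev $dev table $table"
    echo "ip route add default via $gw dev $dev table $table"
    # One rule per local source port whose replies must leave through the AP.
    for port in "$@"; do
        echo "ip rule add iif lo ipproto tcp sport $port lookup $table"
    done
}

# Example: SSH plus an assumed extra forwarded port 443.
policy_route_cmds eth0 192.168.10.1 192.168.10.0/24 1000 22 443
```

Piping the output to sh as root applies it. Afterwards, ip rule show and ip route show table 1000 should list the new entries.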

This is a very simple method that achieves the goal in only 3 commands.

This could be tweaked for receiving SSH from some specific LAN or address blocks coming from the NUC in case the VPN allows incoming traffic, but this wouldn't allow in any case receiving an SSH connection from the same single public IP source which used the two destinations/routes simultaneously.

Using the AP's MAC address and marks for policy routing

Instead of the previous method, there's an indirect way to identify an incoming packet as coming from the AP gateway rather than from the NUC: its Ethernet source MAC address.

The MAC address can't be used directly by policy routing, but it's possible to tag such an incoming packet with a firewall mark. A mark can be used by policy routing, and there are ways to get this mark set on reply packets.

I'll split the incoming part and the reply part. As this doesn't depend on the specific kind of incoming traffic, no change is required to handle additional ports forwarded from the AP to the Desktop later.

I'll assume below that:

AP's MAC address (as seen on the desktop with ip neigh show 192.168.10.1 after pinging it) has value 02:00:00:ac:ce:55. Replace this value below.
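Reading the MAC address can be scripted. Here is a sketch, assuming the usual ip neigh output layout where the address follows the lladdr keyword. The helper name neigh_mac is hypothetical:

```shell
#!/bin/sh
# Extract the link-layer (MAC) address from `ip neigh show` output read on stdin.
# Prints the word following the first "lladdr" keyword; prints nothing if the
# keyword is absent (e.g. a FAILED or INCOMPLETE entry).
neigh_mac() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Sample line in the format printed for a resolved neighbour.
echo '192.168.10.1 dev eth0 lladdr 02:00:00:ac:ce:55 REACHABLE' | neigh_mac
```

On the Desktop this would be used as AP_MAC=$(ip neigh show 192.168.10.1 | neigh_mac) after pinging the AP once, with $AP_MAC then substituted for the example value in the rules that follow.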

Incoming and common settings

One should take a look at how Netfilter, iptables and routing interact on this schematic:

An iptables rule in raw/PREROUTING will mark the packet. Policy routing then completes the job, in a similar way to the previous method.

```
iptables -t raw -A PREROUTING -i eth0 -m mac --mac-source 02:00:00:ac:ce:55 -j MARK --set-mark 1

ip route add default via 192.168.10.1 table 1000
ip rule add fwmark 1 lookup 1000
```

Reply

There are two methods to handle reply:

Simple and automatic, TCP-only

Can only be used with TCP, not with other protocols such as UDP.

As the goal is TCP port 22, this is good enough for OP's case. Simply complete the Incoming part with:
```
sysctl -w net.ipv4.tcp_fwmark_accept=1
sysctl -w net.ipv4.fwmark_reflect=1
```
Explanations:
- tcp_fwmark_accept
  
  Each TCP socket created when accepting a new connection will inherit the first packet's mark, as if the SO_MARK socket option had been used for this connection only. Specifically here, all reply traffic will be routed back through the same gateway the incoming traffic arrived from, using the routing table 1000 when the mark is set.
- fwmark_reflect
  
  In a similar way reply packets handled directly by the kernel (like ICMP echo reply or TCP RST and some cases of TCP FIN) inherit the incoming packet's mark. For example that's the case if there is no TCP socket listening (ie: the SSH server is stopped on Desktop). Without this mark an SSH connection attempt through the AP would time out instead of getting a Connection Refused because the TCP RST would be routed through the NUC (and be ignored by the remote client).

or instead...

Generic handling by transferring the mark between packet and conntrack entry and back to reply packet

A mark can be memorized as connmark in a conntrack entry to have it affect all further packets of the flow including reply packets by copying it back in mangle/OUTPUT from conntrack to mark. Complete the Incoming part with:
```
iptables -t mangle -A PREROUTING -m mark --mark 1 -j CONNMARK --set-mark 1
iptables -t mangle -I OUTPUT -m connmark --mark 1 -j MARK --set-mark 1
```
This will handle all cases (including TCP RST and UDP). So the AP could be configured to forward any arbitrary incoming TCP or UDP traffic to the Desktop. Additional documentation in this blog.
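While experimenting, it helps to be able to undo this second method. Here is a sketch of the reverse commands, assuming exactly the rules shown above were added (same MAC address, mark 1, table 1000, generic conntrack variant). The name undo_mark_routing is hypothetical, and the commands are only printed:

```shell
#!/bin/sh
# Print the commands that remove the mark-based setup (generic conntrack variant).
# Each -D must match the corresponding -A/-I rule exactly for the deletion to work.
undo_mark_routing() {
    mac=$1
    echo "iptables -t mangle -D OUTPUT -m connmark --mark 1 -j MARK --set-mark 1"
    echo "iptables -t mangle -D PREROUTING -m mark --mark 1 -j CONNMARK --set-mark 1"
    echo "iptables -t raw -D PREROUTING -i eth0 -m mac --mac-source $mac -j MARK --set-mark 1"
    echo "ip rule del fwmark 1 lookup 1000"
    echo "ip route flush table 1000"
}

# Example with the placeholder MAC address used in this answer.
undo_mark_routing 02:00:00:ac:ce:55
```

If the TCP-only variant was used instead, the two mangle lines do not apply, and the two sysctl values would be set back to 0.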

Miscellaneous

Caveats

- When an address is removed (and then probably added back) or an interface is brought down (then up), all associated routes that were manually added are deleted and won't reappear. So at least the manual ip route commands should be integrated with the tool configuring the Desktop's network, so that they are added each time the network connection is made.
- Each tool has a different way to do advanced network configuration, which might be incomplete. For example Ubuntu's Netplan doesn't document in its routing-policy settings whether it's possible to use iif lo or ipproto tcp sport 22. Tools that allow custom scripts to make up for unavailable features should be preferred (for example ifupdown or NetworkManager can do this).
- Nitpicking: consider the extremely convoluted case, using the last method, where a single remote (public) IP address connects to the same Desktop service twice through the two routes (seen remotely as two distinct public IP addresses), assuming the VPN allows incoming traffic, and uses the same source port for both destinations. The Desktop will then see the same flow twice and be confused (two UDP flows would be merged and a 2nd TCP connection would fail). When routing, this can usually be handled (with conntrack zones and/or by letting conntrack automatically alter a source port), but it might not be possible to handle it for the host case here.
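As one illustration of the custom-script approach, NetworkManager runs executables placed in /etc/NetworkManager/dispatcher.d/ with the interface name and the action as the first two arguments. Here is a minimal sketch for the first method, assuming the interface is eth0. The file name, the on_event function and the DRY_RUN switch are illustrative choices, not something this answer prescribes:

```shell
#!/bin/sh
# Sketch of a dispatcher script, e.g. /etc/NetworkManager/dispatcher.d/50-ssh-via-ap
# (hypothetical file name). Re-adds the table 1000 entries whenever eth0 comes up.
# DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 to apply.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

on_event() {
    iface=$1; action=$2
    # Ignore every event except "eth0 up".
    [ "$iface" = eth0 ] && [ "$action" = up ] || return 0
    # "replace" keeps the route commands safe to repeat.
    run ip route replace 192.168.10.0/24 dev eth0 table 1000
    run ip route replace default via 192.168.10.1 dev eth0 table 1000
    # Drop a stale copy of the rule, if any, before adding it again.
    run ip rule del iif lo ipproto tcp sport 22 lookup 1000 2>/dev/null || true
    run ip rule add iif lo ipproto tcp sport 22 lookup 1000
}

# NetworkManager passes the interface as $1 and the action as $2.
on_event "$1" "$2"
```

The script needs to be executable and owned by root. Running it by hand as ./50-ssh-via-ap eth0 up prints the four commands, which is a cheap way to check it before switching DRY_RUN to 0.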

Bonus

If the Desktop is actually a router, here's how the last method, using a mark and CONNMARK, should be altered. Routes to containers must be duplicated to table 1000. This should work, but has not been tested with Docker (which can bring additional challenges).

Assuming here that:

- Desktop is routing NAT-ed containers in LAN 172.17.0.0/16 through an interface called br0 (Docker would use docker0 for the default network) with local IP address 172.17.0.1/16
- Desktop DNATs some ports toward these containers

Changes:

rules and routes

Routes to container(s) must be copied from the main routing table to table 1000. If the container/virtualization tool dynamically adds new interfaces and routes, the new routes must manually (or with some scripted mechanism triggered from some API from the tool) be added in table 1000 too.
```
ip route add 172.17.0.0/16 dev br0 table 1000
```
Without this, incoming connections arriving through the AP and marked (in the next steps) would be routed back to the AP.

Keep the previous rule about the MAC address in the raw table.

Delete the previous rules in the mangle table:
```
iptables -t mangle -F
```
Put these rules instead:
```
iptables -t mangle -A PREROUTING -m mark ! --mark 0 -j CONNMARK --save-mark
iptables -t mangle -A PREROUTING -m connmark ! --mark 0 -j CONNMARK --restore-mark
iptables -t mangle -A OUTPUT -m connmark ! --mark 0 -j CONNMARK --restore-mark
```
(some optimizations could be done at the cost of more lines for this single-mark case)

The first PREROUTING rule makes sure the conntrack mark is not overwritten by a packet mark of value 0. The 2nd PREROUTING rule sets the mark on routed traffic from containers (whose individual packets are initially not marked) that is part of a flow initially established through the AP.

Both options work, and I prefer the second one that handles incoming connections in general. This is magical, thank you so much. — Arfian Adam, Aug 18 '21 at 00:46
`ip rule add iif eth0 ipproto tcp dport 22 lookup 1000` is unnecessary / irrelevant to the problem. You only need such rule if you need the host to forward traffics from `eth0` to ssh server on another host with certain route table. — Tom Yan, Aug 18 '21 at 02:42
@TomYan You are correct. This would be needed only if the traffic was received on an other interface and strict reverse path forwarding was enabled (rp_filter=1) to validate the route. I removed it but left a pointer for other use cases. — A.B, Aug 18 '21 at 07:36
@A.B for some reason I still can't get HTTPS port (443) to work. Suppose `domain.com` already points to `1.2.3.4` and port `443` is already forwarded, accessing `https://domain.com/` doesn't work with `192.168.10.50` gateway. Switching back to `192.168.10.1` works. Edit: the web server is on Docker. — Arfian Adam, Aug 18 '21 at 09:32
@ArfianAdam as I wrote your question never mentioned routing and I explicitly waived the routing case including Docker as example. You'd need a different question. Doing *anything* with Docker is difficult because Docker changes settings. Especially Docker activates `br_netfilter`, I already posted a few Q/A on UL and on SF about the Docker causing additional challenges: https://unix.stackexchange.com/a/572086/251756 https://serverfault.com/questions/963759/docker-breaks-libvirt-bridge-network/964491#964491 — A.B, Aug 18 '21 at 09:51
@A.B will having another physical NIC make it simpler in my case? — Arfian Adam, Aug 18 '21 at 11:35
The easiest would be to have another NIC on the NUC and turn the NUC into a router, have settings applied on the NUC rather than on the Desktop. But then the NUC handling a VPN, settings wouldn't be the easiest either. But it would be the-thing-to-do. Adding a NIC on the Desktop won't help much having a simple solution on it. Or you can dedicate the NIC for Docker using macvlan. — A.B, Aug 18 '21 at 11:45
Well even without 2nd NIC you can use macvlan. But again this was not in the question. — A.B, Aug 18 '21 at 11:48
@ArfianAdam We can't tell what you should do when you just say "docker" since docker has different modes of networking. I'm not familiar with it but there are at least three cases AFAIK. When you use the (NAT'd) `bridge` mode (probably the default), the rule should be on the host and the `iif` should not be `lo` but the docker bridge. (You'll also need the DNAT a.k.a. `publish` / port-mapping configured properly.) There's also the `host` mode in which the `iif lo` rule would probably work. With `macvlan` mode the container will be exposed as if it's a standalone host in your LAN. — Tom Yan, Aug 18 '21 at 15:32
Say for example you use `macvlan` mode but want the container to also use `192.168.10.50` as its gateway except for the replies from its SSH or web server, you'll need an `iif lo` rule but the rule (and the extra routes and route table) should be configured on the container itself instead of the docker host. — Tom Yan, Aug 18 '21 at 15:36
I added a Bonus section for the last method when the system is routing. Should work, but with Docker nothing is certain... — A.B, Aug 18 '21 at 18:23