MakerFLOSS_Mikrotik/docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md
sjat 199edf85ad fix(vlans): robust bridge-IP removal; record cutover + gotchas
RouterOS 'find ... address=<prefix>' never matches an ip/address value, so the
legacy-bridge-IP removal is now a :foreach get-and-compare. Refresh the committed
export.rsc to the post-cutover config (flat VLAN 30 + isolated mgmt VLAN 99 on
ether8, vlan-filtering on). Spec updated with execution notes (NM autoconnect flap,
the find-address quirk, and the commit-confirmed detached-flip technique used).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 12:38:04 +02:00

# CRS310 — flat data path + isolated management VLAN — Design
**Date:** 2026-06-09
**Status:** Approved (brainstorming complete)
**Author:** sjat + Claude
**Supersedes** the placeholder topology in `host_vars/crs310-maker.yml` (the
`10.0.99.x` / SFP+-trunk example). Builds on
`2026-06-07-mikrotik-crs310-ansible-design.md`.
## Purpose
Bring the makerspace CRS310 into service as a **flat L2 switch** on the existing
`10.2.30.0/24` network, with its **management plane isolated on a dedicated VLAN**
reached through one physical port. No SFP+ yet — the 10G uplink is deferred until the
connectors arrive; **`ether1` is the (copper) uplink** for now.
## Context (as found on 2026-06-09)
- Switch on factory **defconf**: one flat `bridge` with all ports, mgmt IP
`192.168.88.1/24` sitting directly on `bridge`, `vlan-filtering=no`.
- Upstream LAN is **flat**: DHCP/gateway at `10.2.30.1`, untagged. Verified by leasing
`10.2.30.227` to mamba *through* the switch's flat bridge.
- mamba is the management station (patched into the switch, reached from fisi over a
`kuku` jump + port-forward tunnel to `192.168.88.1`).
## Topology
VLAN-aware bridge (`bridge`), `vlan-filtering=yes` enabled **last**. All ports are
untagged access ports — **no trunks**.
| Port | Mode | PVID | VLAN | Notes |
|---|---|---|---|---|
| `ether1` | access | 30 | DATA | copper uplink to `10.2.30.0/24` |
| `ether2`–`ether7` | access | 30 | DATA | device access ports |
| `sfp-sfpplus1`, `sfp-sfpplus2` | access | 30 | DATA | unused until connectors arrive |
| `ether8` | access | 99 | MGMT | dedicated management port (mamba lives here) |
- **DATA VLAN 30** — internal-only id; ingress/egress on `ether1` is untagged, so the
upstream router sees a plain flat network. The switch CPU (`bridge`) is **not** a
member of VLAN 30 → no switch L3 presence on the user network.
- **MGMT VLAN 99** — `vlan-mgmt` interface on the bridge, IP **`192.168.88.1/24`**, the
bridge/CPU is the only tagged member, `ether8` the only untagged member.
**No default gateway** — management is intentionally isolated.
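
The table and the two VLAN bullets translate roughly into the RouterOS v7 commands below. This is an untested sketch, not the role's actual output: it assumes the factory port names (including `sfp-sfpplus1` and `sfp-sfpplus2`) and that every port is already a member of `bridge`, as on defconf. `vlan-filtering` is deliberately left untouched here.

```routeros
# Sketch under stated assumptions; vlan-filtering stays off until the cutover flip.

# DATA access ports: PVID 30
:foreach p in={"ether1";"ether2";"ether3";"ether4";"ether5";"ether6";"ether7";"sfp-sfpplus1";"sfp-sfpplus2"} do={
    /interface/bridge/port/set [find where interface=$p] pvid=30
}

# MGMT access port: PVID 99
/interface/bridge/port/set [find where interface="ether8"] pvid=99

# VLAN 30: untagged on every data port; the bridge (CPU) is not a member
/interface/bridge/vlan/add bridge=bridge vlan-ids=30 untagged=ether1,ether2,ether3,ether4,ether5,ether6,ether7,sfp-sfpplus1,sfp-sfpplus2

# VLAN 99: bridge (CPU) is the only tagged member, ether8 the only untagged member
/interface/bridge/vlan/add bridge=bridge vlan-ids=99 tagged=bridge untagged=ether8

# L3 interface for the management plane; its IP is added during the flip
/interface/vlan/add name=vlan-mgmt interface=bridge vlan-id=99
```

Leaving `bridge` out of the VLAN 30 entry is what keeps the switch CPU off the user network once filtering is enabled.
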
## Management & internet
- Reachable only from `ether8` (plug the management laptop / mamba there, addressed
`192.168.88.2/24`). The switch does **no routing or DHCP**; `10.2.30.1` keeps both.
- The control plane has **no internet** by design → **NTP/DNS disabled**. On an isolated
segment they would only log errors, so the clock will not sync, and updates are done
manually while the switch is temporarily patched to the data network.
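
On the device, the isolated-management settings above would look something like the following. This is a hedged sketch assuming RouterOS v7 paths. The role applies the equivalent through Ansible, and the exact task contents are not shown in this spec.

```routeros
# Assumption: RouterOS v7 menu paths; untested sketch.

# No NTP client on an isolated management plane
/system/ntp/client/set enabled=no

# No upstream DNS servers, and do not answer DNS for others
/ip/dns/set servers="" allow-remote-requests=no

# Expect no output: the management plane has no default route
/ip/route/print where dst-address="0.0.0.0/0"
```
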
## Required changes to the IaC
1. `host_vars/crs310-maker.yml`: replace the placeholder topology with the table above;
`switch_mgmt_address: 192.168.88.1/24`, `switch_mgmt_vlan_id: 99`, **no gateway**;
drop the `10.0.99.x` DNS/NTP/gateway placeholders.
2. Role `vlans.yml`: make the **default-route** task conditional on a gateway being set
(skip when isolated); **remove the legacy defconf IP** off the bare `bridge` so it
doesn't collide with the `vlan-mgmt` IP (`192.168.88.1` must live only on
`vlan-mgmt`).
3. Role `identity.yml`: gate NTP (and DNS) behind a flag / empty-server check so an
isolated mgmt plane doesn't configure unreachable servers. Add
`switch_ntp_enabled: false` for this host.
The existing `vlans.yml` membership Jinja already produces the correct sets for an
all-access topology (DATA untagged = data ports, CPU tagged only on MGMT).
## Cutover runbook (lockout-safe; operator on-site at `ether8`)
1. **Restore mgmt path** (done): mamba `enp0s31f6` is addressed `192.168.88.2/24` (profile
`crs310-bench`); fisi→mamba→switch tunnel up; Ansible reaches `192.168.88.1`.
2. **Move the cable: switch port 5 → port 8.** (Bridge is still flat, so mamba stays
reachable on either port.) Re-confirm reachability.
3. Apply config in order: bridge VLAN table → port PVIDs → create `vlan-mgmt` iface.
Verify the VLAN/PVID state with `vlan-filtering` still **off**. Then the **flip**, as
one ordered sequence (the address can't be on both interfaces at once): remove
`192.168.88.1` from `bridge`, add it to `vlan-mgmt`, set `vlan-filtering=yes`. mamba
(`ether8`, untagged VLAN 99, `.2`) ↔ switch (`.1`) is the canary; the SSH/tunnel may
blip during the flip but must come back. Pre-verifying PVID/membership before the
flip is what prevents a hard lockout.
4. Verify: `/interface/bridge/vlan/print` membership correct, mgmt still reachable, a
device on `ether1`-fed ports still gets `10.2.30.x`.
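
Steps 3 and 4 can be sketched in RouterOS v7 terms as below. This is an illustration under assumptions, not the exact commands that were run: it assumes the VLAN table, PVIDs and `vlan-mgmt` interface are already staged, and that the flip is submitted as one braced block so it keeps running after the management session drops between the remove and the add.

```routeros
# Pre-check with vlan-filtering still off (step 3)
/interface/bridge/port/print where pvid=99
/interface/bridge/vlan/print
/interface/vlan/print where name="vlan-mgmt"

# The flip: one ordered unit, filtering enabled last
{
    /ip/address/remove [find where interface="bridge"]
    /ip/address/add address=192.168.88.1/24 interface=vlan-mgmt
    /interface/bridge/set bridge vlan-filtering=yes
}

# Post-check (step 4), run once the canary connection is back
/interface/bridge/vlan/print
/ip/address/print where interface="vlan-mgmt"
```

The removal matches on `interface` rather than on the address value, which sidesteps any dependence on how the address string is compared.
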
## Risks
- **Lockout** on enabling `vlan-filtering` if `ether8`/VLAN 99/mgmt-IP aren't aligned.
Mitigated by ordering (filtering last), the live canary connection, and the operator
being on-site to re-cable. WinBox-MAC recovery is unavailable (broken under Wine);
worst case is a no-defaults reset, which we avoid.
- **Removing the legacy bridge IP** is the delicate step. It is done before filtering is
enabled, with the same address re-added on `vlan-mgmt` immediately afterwards and the
connection watched throughout.
## Execution notes (applied 2026-06-09)
Cutover completed; switch is VLAN-filtered with isolated mgmt reachable on `ether8`.
`play_switch.yml` runs idempotently over the new mgmt path. Two gotchas surfaced, plus
one technique worth keeping:
- **NetworkManager autoconnect flap:** moving mamba's cable bounced the link; NM
re-selected the DHCP profile and dropped the static mgmt IP. Fixed by making
`crs310-bench` (192.168.88.2) sticky (`autoconnect yes`, priority 10) and turning
`autoconnect off` on `Wired connection 1`.
- **RouterOS `find ... address=<prefix>` never matches** an `/ip/address` value (returns
0 even on an exact string). The first flip therefore failed to remove the defconf IP
off `bridge`, duplicating `192.168.88.1` onto `vlan-mgmt` and breaking ARP. Fix: remove
by `[find interface=bridge]`, or match via `:foreach`+`/ip/address/get $a address`.
- **The flip was run as a detached, self-reverting on-device job** (`:execute { … :delay
240s; :if ($mgmtok=false) do={ revert } }`) — a commit-confirmed pattern. The first
(failed) attempt auto-healed at the timer; the corrected attempt was confirmed by
setting `:global mgmtok true` within the window. Recommended for any future
`vlan-filtering`/mgmt-IP change made over the network.
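
A minimal sketch of that commit-confirmed flip, combining the `:foreach` get-and-compare removal with the self-reverting timer. Only the `mgmtok` variable and the 240 s window come from the notes above. The revert body is an assumption about what "revert" did, and the sketch is untested. It also assumes the global set from the operator's session is visible to the detached job, as it was in the run described.

```routeros
# Arm the dead-man switch before starting the detached job
:global mgmtok false

:execute script={
    :global mgmtok

    # Forward change: remove the defconf IP by reading each entry and comparing,
    # rather than relying on find address=...
    :foreach a in=[/ip/address/find] do={
        :if (([:tostr [/ip/address/get $a address]] = "192.168.88.1/24") && ([/ip/address/get $a interface] = "bridge")) do={
            /ip/address/remove $a
        }
    }
    /ip/address/add address=192.168.88.1/24 interface=vlan-mgmt
    /interface/bridge/set bridge vlan-filtering=yes

    # Confirmation window
    :delay 240s

    # Assumed revert: back to the flat defconf management path
    :if ($mgmtok = false) do={
        /interface/bridge/set bridge vlan-filtering=no
        /ip/address/remove [find where interface="vlan-mgmt"]
        /ip/address/add address=192.168.88.1/24 interface=bridge
    }
}
```

To keep the change, the operator reconnects over the new management path within the window and runs `:global mgmtok true`. If nothing is confirmed, the job reverts on its own.
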
## Out of scope
Real inter-VLAN segmentation, the SFP+ 10G uplink/trunk, and any upstream router VLAN
work — revisited when the connectors and a real VLAN plan are ready.