docs: capture topology + operational learnings in CLAUDE.md/README

Bring the everyday guides up to the live state (flat data VLAN 30 + isolated mgmt
VLAN 99 on ether8, DHCP + web UI experiment) and record the gotchas that cost time:
the bench tunnel (paramiko ignores ProxyJump), mamba NM-profile stickiness on cable
flap, the RouterOS find-by-address quirk, and the commit-confirmed detached-flip
pattern for lockout-prone changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-09 13:04:35 +02:00
parent 18de750507
commit 2796616d05
2 changed files with 42 additions and 14 deletions


@@ -30,12 +30,35 @@ ansible-playbook play_switch.yml --tags vlans # one domain
ansible-vault view group_vars/mikrotik.vault.yml # read a secret
```
## Access (on-site / bench)
The switch is reachable only via the makerspace laptop `mamba`. Ansible's `network_cli`
uses paramiko, which **ignores ProxyJump**, so port-forward instead of double-hopping:
```bash
ssh -J kuku -p 7576 sjat@10.8.0.4 -L 2222:192.168.88.1:22 -N # tunnel to the switch
ansible-playbook play_switch.yml -e ansible_host=127.0.0.1 -e ansible_port=2222 -e ansible_user=sjat
ssh-keygen -R '[127.0.0.1]:2222' # if the tunnel host key changed
```
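Before pointing Ansible at the forward, it can help to confirm that something is actually listening on the local end. A minimal sketch, assuming bash is available on the control machine; `port_open` and `wait_for_port` are hypothetical helper names, not part of this repo:

```bash
# Hypothetical helper: succeed if a TCP connection to host:port is accepted.
# Relies on bash's built-in /dev/tcp, so no extra tools are needed.
port_open() {
  local host=$1 port=$2
  # Run in a subshell so the probe's file descriptor is closed when it exits.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Hypothetical helper: retry port_open up to N times (default 10), one second apart.
wait_for_port() {
  local host=$1 port=$2 tries=${3:-10}
  local i
  for ((i = 1; i <= tries; i++)); do
    port_open "$host" "$port" && return 0
    ((i < tries)) && sleep 1
  done
  return 1
}
```

With the tunnel from the block above running in another terminal, `wait_for_port 127.0.0.1 2222 && ansible-playbook play_switch.yml -e ansible_host=127.0.0.1 -e ansible_port=2222 -e ansible_user=sjat` would only start the play once the forward answers.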
- `mamba` is the mgmt station on **switch port 8** (MGMT VLAN); it must be on port 8 to
reach `192.168.88.1`. From a data port it gets `10.2.30.x` and **cannot** reach mgmt.
- NM profiles on `mamba` `enp0s31f6`: `crs310-bench` (static `.2`) and `Wired connection 1`
(DHCP). Moving the cable flaps the link and NM re-selects a profile, so make the intended
profile sticky (`autoconnect yes` + higher priority) and turn autoconnect off on the other,
or NM reverts to it.
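The profile pinning mentioned in the last bullet might look like this with `nmcli`. This is a sketch, not the exact commands used on `mamba`: `pin_nm_profile` is a hypothetical helper, and the priority of 10 is an arbitrary value above NetworkManager's default of 0.

```bash
# Hypothetical helper: make one NetworkManager profile win auto-selection and
# switch auto-activation off for the other, then bring the preferred one up.
# The default priority of 10 is an assumption; any value above 0 should do.
pin_nm_profile() {
  local keep=$1 drop=$2 priority=${3:-10}
  nmcli connection modify "$keep" connection.autoconnect yes \
    connection.autoconnect-priority "$priority" &&
  nmcli connection modify "$drop" connection.autoconnect no &&
  nmcli connection up "$keep"
}
```

For bench work on port 8 that would be `pin_nm_profile crs310-bench "Wired connection 1"`; swapping the two arguments prefers the DHCP profile again. Depending on the local polkit setup, the `nmcli connection modify` calls may need elevated privileges.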
## Rules
- **Idempotency:** RouterOS tasks use `community.routeros.command` with `:if [find]`
guards. Run every device-touching play **twice**; the second run must report no changes.
- **Lockout safety:** keep an independent recovery channel (serial/WinBox-MAC) when
touching mgmt/services/VLANs; enable `vlan-filtering` **last**. For lockout-prone
changes over the network (vlan-filtering, moving the mgmt IP), run them as a detached
self-reverting job — `:execute { …; :delay 240s; :if ($mgmtok=false) do={ revert } }`,
then `:global mgmtok true` once verified. (Auto-healed a hard lockout during the cutover.)
- **RouterOS `find ... address=<prefix>` never matches** an ip/address or dhcp-network
value (returns 0 even on an exact string) — match by `[find interface=X]` or
`:foreach`+`/ip/address/get $a address`. This bit the mgmt-IP move (the IP ended up duplicated).
- **All real values go in `host_vars`;** the role holds only mechanism + placeholders.
- **Secrets** go to the `makerfloss` vault, never plaintext. Encrypt with
`ansible-vault encrypt --encrypt-vault-id makerfloss <file>`.
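The "run twice, second run reports no changes" rule can be checked mechanically by reading the `PLAY RECAP` of the second run. A minimal sketch, assuming the default `ansible-playbook` recap format (`ok=… changed=… unreachable=… failed=…`); `recap_is_clean` is a hypothetical helper, not something this repo ships:

```bash
# Hypothetical helper: read ansible-playbook output on stdin and succeed only if
# a PLAY RECAP is present and every host line in it shows changed=0,
# unreachable=0 and failed=0.
recap_is_clean() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /ok=[0-9]+/ {
      hosts++
      if ($0 !~ /changed=0([^0-9]|$)/ ||
          $0 !~ /unreachable=0([^0-9]|$)/ ||
          $0 !~ /failed=0([^0-9]|$)/) dirty++
    }
    END { exit (hosts > 0 && dirty == 0) ? 0 : 1 }
  '
}

# Usage sketch: the first run applies changes, the second must be a no-op.
#   ansible-playbook play_switch.yml --tags vlans
#   ansible-playbook play_switch.yml --tags vlans | tee /dev/stderr | recap_is_clean \
#     || echo "second run was not idempotent" >&2
```

Output with no recap at all is treated as a failure, so a play that aborts early does not pass as idempotent.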
@@ -43,12 +66,13 @@ ansible-vault view group_vars/mikrotik.vault.yml # read a secret
## Status / next
Bootstrap is done (user `sjat` + key + identity `crs310-maker`, RouterOS 7.19.6 pinned;
default `admin` now disabled). All per-domain task files are **implemented**:
`identity`, `users`, `backup`, `firmware` (opt-in) and `play_bootstrap` / `play_backup`
are idempotency-verified against the device. `vlans` is implemented and Jinja-validated
but its **device run is deferred** — the `host_vars` topology is still a placeholder.
Live on the device (2026-06-09): flat L2 switch on `10.2.30.0/24` — **DATA VLAN 30**
(`ether1` copper uplink + `ether2-7` + SFP+), **isolated MGMT VLAN 99 on `ether8`**
(mgmt `192.168.88.1/24`, no gateway/NTP/DNS), `vlan-filtering` on. The mgmt port also
serves DHCP (`192.168.88.10-.254`) + the web UI as a makerspace experiment (flags
`switch_web_enabled`, `switch_mgmt_dhcp_enabled`). Default `admin` disabled; login as
`sjat` (key, or vaulted password). All task files + `play_bootstrap`/`play_backup` are
idempotency-verified. Design + cutover runbook:
`docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md`.
Next, on-site with a recovery channel: drop the real VLAN/port map into `host_vars`,
reconcile the legacy defconf IP (`192.168.88.1/24` lives directly on `bridge`), then run
`--tags vlans` and confirm mgmt reachability before/after `vlan-filtering=yes`.
Next: SFP+ 10G uplink and real VLAN segmentation once connectors + a VLAN plan are ready.


@@ -13,12 +13,16 @@ rebuilt from this repo instead of by hand in WinBox.
| Repo scaffolding, role skeleton, vault | ✅ done |
| On-site device prep + **bootstrap** (named user + SSH key + identity) | ✅ done (2026-06-08) |
| `identity` / `users` / `backup` / `firmware` + `play_bootstrap` / `play_backup` | ✅ implemented; idempotency-verified against the device (firmware is opt-in, lint/syntax only) |
| `vlans` (VLAN-aware bridge, ports, mgmt iface) | ✅ implemented + Jinja-validated; **device run deferred** — needs the real VLAN/port plan and an on-site recovery channel before `vlan-filtering` is enabled |
| `vlans` (VLAN-aware bridge, ports, mgmt iface) | ✅ **applied & live** — flat data VLAN + isolated mgmt VLAN, `vlan-filtering` on |
The switch is reachable today by key auth as user `sjat`. All task files now carry their
real RouterOS logic. The `vlans` topology in `host_vars` is still a **placeholder**:
replace it with the real makerspace VLAN ids + per-port map before running `--tags vlans`
on the live device, and do so on-site with a serial/WinBox-MAC recovery channel open.
**Live topology (2026-06-09):** a flat L2 switch on the makerspace `10.2.30.0/24`
**DATA VLAN 30** (`ether1` copper uplink + `ether2-7` + SFP+) bridged through, and an
**isolated MGMT VLAN 99 on `ether8`** (switch admin at `192.168.88.1`, no gateway/NTP/DNS).
The mgmt port also serves DHCP + the web UI as an experiment (plug into `ether8`, get a
lease, admin at `http://192.168.88.1`; login still required, default `admin` disabled).
SFP+ 10G uplink and real VLAN segmentation are future work. See
`docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md` for the design + the
lockout-safe cutover runbook.
## Layout