MakerFLOSS_Troubleshooting/runbooks/switch-crs310.md
sjat 9ff12700ae Initial troubleshooting workspace: access, network map, runbooks
Scaffold for troubleshooting MakerFLOSS hosts at the makerspace.
Reference + thin runbooks model — authoritative data stays in the
source repos (AnsibleBaobabV4, MakerFLOSS_Mikrotik, MakerFLOSS).

- access.md: reach paths for mamba-on-LAN and fisi-tunneling-in
  (netbird on-demand, VPS bastion, ProxyJump via kuku->mamba),
  with the isolation rule.
- network-map.md: subnet pointers + open question on makerspace
  addressing (10.2.30/172.17.3/10.0.0).
- runbooks/switch-crs310.md: CRS310 connectivity + lockout recovery.
- incidents/: dated log scaffold.
- CLAUDE.md: operating rules for this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 13:24:26 +02:00


# Runbook — CRS310 switch (`crs310-maker`)
The "new switch" at the makerspace. MikroTik **CRS310-8G+2S+IN**, RouterOS
7.19.6. Managed by Ansible from `~/Projects/MakerFLOSS_Mikrotik`.
**Authoritative sources:**
- Live topology + cutover runbook:
`MakerFLOSS_Mikrotik/docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md`
- Running config snapshot: `MakerFLOSS_Mikrotik/backups/crs310-maker/export.rsc`
- Device vars: `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml`
- Field guide: `MakerFLOSS_Mikrotik/docs/makerspace-switch-fieldguide.md`
## Topology recap
- **Transparent L2.** The switch is *not* a router — no inter-VLAN routing, no
presence on the data network.
- **Data VLAN 30** (`10.2.30.0/24`, gw `.1`): `ether1` = uplink, `ether2` to
`ether7` = access (untagged), SFP+ reserved (deferred). Users plug in here.
- **Mgmt VLAN 99** (`192.168.88.0/24`): switch at `192.168.88.1`, reachable
**only via `ether8`** (untagged). DHCP pool `.10` to `.254`, web UI on. CPU is
the only tagged member. No default route, no DNS, no NTP: isolated by design.
- `vlan-filtering=yes` went live **2026-06-09**.
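
The port layout above can be captured as a small lookup, handy when someone on-site reports which socket they used. This is a minimal sketch: `port_role` is a hypothetical helper (not part of the repo), it assumes the access ports are `ether2` through `ether7`, and the SFP+ interface names `sfp-sfpplus1`/`sfp-sfpplus2` are assumed RouterOS defaults that should be checked against `export.rsc`.

```bash
# Map a CRS310 port name to its role per the topology recap.
# Hypothetical helper; SFP+ names are assumed defaults, verify in export.rsc.
port_role() {
  case "$1" in
    ether1)          echo "uplink (data VLAN 30)" ;;
    ether[2-7])      echo "access (data VLAN 30, untagged)" ;;
    ether8)          echo "mgmt (VLAN 99, untagged)" ;;
    sfp-sfpplus[12]) echo "reserved (deferred)" ;;
    *)               echo "unknown"; return 1 ;;
  esac
}

# Usage: port_role ether5
```

A non-zero exit status marks a name that is not on this switch, which makes the helper usable in `if` tests.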
## Reach
- **Mgmt (reconfig, SSH, web UI):** you need a host on **`ether8`**. On-site:
plug mamba into `ether8`, SSH `192.168.88.1` (or `http://192.168.88.1`).
Remote: there is no standing tunnel to the mgmt VLAN — forward through a host
that *is* on `ether8`. See [access.md](../access.md) §A / §B.
- **Data path test:** plug into an access port (`ether2` to `ether7`), expect a
`10.2.30.0/24` lease.
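
For the remote case, one way to forward through a host on `ether8` is OpenSSH's `-J` (ProxyJump) option. The sketch below only prints the command so it can be reviewed before use. The hop names `kuku,mamba`, the `admin` login, and the helper name `mgmt_ssh_cmd` are assumptions for illustration; the actual reach path and its isolation rule are defined in access.md.

```bash
# Print (do not run) an ssh command that reaches the switch via jump hosts.
# Hop list and the default "admin" user are assumptions; confirm in access.md.
mgmt_ssh_cmd() {
  jumps="$1"
  user="${2:-admin}"
  printf 'ssh -J %s %s@192.168.88.1\n' "$jumps" "$user"
}

# Usage: mgmt_ssh_cmd kuku,mamba
```

The last hop in the list must be the host that is physically plugged into `ether8`, since nothing else can reach `192.168.88.1`.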
## Diagnose
| Symptom | Check |
|---------|-------|
| No link / no DHCP on an access port | Confirm you're on an access port (`ether2` to `ether7`, data), not `ether8` (mgmt). Verify uplink `ether1` is up to `10.2.30.1`. |
| Can't reach switch mgmt | Are you on `ether8`? Mgmt is reachable **nowhere else**. Confirm a `192.168.88.x` lease. |
| Suspected config drift | Diff live vs repo: run a backup play, compare `backups/crs310-maker/export.rsc` to git. |
| Lockout after a change | See recovery below. |

Connectivity test via Ansible (from `MakerFLOSS_Mikrotik`, mamba on `ether8`):
```bash
ansible -m community.routeros.command -a "commands='/system/resource/print'" crs310-maker
```
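
The first two rows of the table come down to "which subnet did my lease land in?". A minimal sketch of that check follows. `lease_vlan` is a hypothetical helper; because both subnets are /24, matching on the first three octets is enough. The interface name in the usage comment is an example only.

```bash
# Classify an IPv4 address by the /24 it belongs to.
# Hypothetical helper; prefix matching works only because both nets are /24.
lease_vlan() {
  case "$1" in
    10.2.30.*)    echo "data (VLAN 30)" ;;
    192.168.88.*) echo "mgmt (VLAN 99)" ;;
    *)            echo "unexpected"; return 1 ;;
  esac
}

# Usage (eth0 is an example interface name):
# lease_vlan "$(ip -4 -o addr show dev eth0 | awk '{print $4}' | cut -d/ -f1 | head -n1)"
```

An "unexpected" result on an access port usually means the client is on the wrong socket or never got a lease at all.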
## Fix
Changes land in **`MakerFLOSS_Mikrotik` on `main`** (per the repo's own
workflow). Device-touching rules — **do not skip**:
- Run any device-touching play **twice**; the second run must report no changes
(idempotency).
- **Enable `vlan-filtering` last**, after bridge/PVID/mgmt-VLAN are proven.
- Network-affecting changes (mgmt IP/VLAN) should run as a **self-reverting
detached job** (240s timeout) so a bad flip auto-rolls-back.
- Keep a **MAC-level (WinBox by MAC address or MAC-telnet) or serial** recovery
channel open when touching network settings.
```bash
# from ~/Projects/MakerFLOSS_Mikrotik, mamba on ether8
yamllint . && ansible-lint && ansible-playbook play_switch.yml --syntax-check
ansible-playbook play_switch.yml # full day-2
ansible-playbook play_switch.yml --tags vlans # one domain
ansible-playbook play_backup.yml # snapshot config into the repo
```
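
The run-twice rule can be checked mechanically by reading the `PLAY RECAP` of the second run. The sketch below is one way to do that, not the repo's own tooling: `recap_is_clean` is a hypothetical helper that succeeds only when a recap line was seen and every recap line reports `changed=0`, `unreachable=0` and `failed=0`.

```bash
# Read ansible-playbook output on stdin. Exit 0 only if at least one recap
# line was seen and all of them show changed=0, unreachable=0, failed=0.
# Hypothetical helper, not part of MakerFLOSS_Mikrotik.
recap_is_clean() {
  awk '
    /changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/)     bad = 1
      if ($0 !~ /unreachable=0([^0-9]|$)/) bad = 1
      if ($0 !~ /failed=0([^0-9]|$)/)      bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Usage (second run of the play):
# ansible-playbook play_switch.yml | recap_is_clean && echo "idempotent"
```

Requiring that a recap line was actually seen keeps an empty or truncated log from passing as clean.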
## Recovery (lockout)
Documented gotchas from the 2026-06-09 cutover (see the spec):
- **mamba NetworkManager flap** on the bench: pin the `crs310-bench` profile
with `autoconnect yes` and a static `192.168.88.2/24`.
- RouterOS `find ... address=` does **not** match IP prefixes — use
`find interface=` instead (caused a bridge-IP removal bug).
- If locked out over the network, recover via **WinBox (by MAC address) or
MAC-telnet** on `ether8`, or via the serial console; the detached-job timeout
should also self-revert.
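
The self-revert idea is a dead-man's switch: arm a timer before the risky change, and cancel it only once access is confirmed. The shell sketch below illustrates the logic only. The repo's actual job is not shown in this runbook, and on the real device the revert has to run on the switch itself, because the controlling host may be the thing that lost access. `arm_revert`, `REVERT_PID` and the confirm-file convention are all hypothetical.

```bash
# Dead-man's switch: after <timeout> seconds, run the revert command unless
# <confirm_file> exists by then. Sets REVERT_PID to the background job's PID.
# Hypothetical sketch of the pattern, not the repo's implementation.
arm_revert() {
  timeout="$1"
  confirm_file="$2"
  shift 2
  (
    sleep "$timeout"
    [ -e "$confirm_file" ] || "$@"
  ) &
  REVERT_PID=$!
}

# Usage (240 s window, revert command is a placeholder):
# arm_revert 240 /tmp/change-ok sh -c 'echo "reverting"'
# ...apply the change, verify access, then: touch /tmp/change-ok
```

Confirming by creating a file rather than killing the timer means a hung session cannot cancel the revert by accident.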
## Verify
- `ansible-playbook play_switch.yml` second run → **no changes**.
- Access-port client gets a `10.2.30.0/24` lease and reaches the gateway.
- `ether8` client gets `192.168.88.x` and can SSH `192.168.88.1`.
- `export.rsc` committed and matches intent.
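
When comparing a fresh export against the committed one, note that RouterOS exports begin with `#` comment lines, including a timestamp that changes on every run, so a plain diff reports drift even when nothing changed. A minimal sketch that ignores those lines follows. `rsc_same` is a hypothetical helper, and whether `play_backup.yml` already strips the header is not known from this runbook.

```bash
# Compare two RouterOS exports while ignoring '#' comment lines.
# Returns 0 if equal, 1 if they differ, 2 if a file is unreadable.
# Hypothetical helper, not part of the repo.
rsc_same() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  a=$(grep -v '^#' "$1" || true)
  b=$(grep -v '^#' "$2" || true)
  [ "$a" = "$b" ]
}

# Usage (from the repo root, after play_backup.yml; /tmp path is an example):
# git show HEAD:backups/crs310-maker/export.rsc > /tmp/committed.rsc
# rsc_same /tmp/committed.rsc backups/crs310-maker/export.rsc && echo "no drift"
```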