# Runbook — CRS310 switch (`crs310-maker`)
The "new switch" at the makerspace. MikroTik **CRS310-8G+2S+IN**, RouterOS
7.19.6. Managed by Ansible from `~/Projects/MakerFLOSS_Mikrotik`.

**Authoritative sources:**
- Live topology + cutover runbook:
`MakerFLOSS_Mikrotik/docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md`
- Running config snapshot: `MakerFLOSS_Mikrotik/backups/crs310-maker/export.rsc`
- Device vars: `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml`
- Field guide: `MakerFLOSS_Mikrotik/docs/makerspace-switch-fieldguide.md`
## Topology recap
- **Transparent L2.** The switch is *not* a router — no inter-VLAN routing, no
presence on the data network.
- **Data VLAN 30** (`10.2.30.0/24`, gw `.1`): `ether1` = uplink, `ether2` through
`ether7` = access (untagged), SFP+ reserved (deferred). Users plug in here.
- **Mgmt VLAN 99** (`192.168.88.0/24`): switch at `192.168.88.1`, reachable
**only via `ether8`** (untagged). DHCP serves `.10` to `.254`; the web UI is
enabled. The CPU port is the only tagged member. No default route, no DNS, no
NTP: isolated by design.
- `vlan-filtering=yes` went live **2026-06-09**.
## Reach
- **Mgmt (reconfig, SSH, web UI):** you need a host on **`ether8`**. On-site:
plug mamba into `ether8`, SSH `192.168.88.1` (or `http://192.168.88.1`).
Remote: there is no standing tunnel to the mgmt VLAN — forward through a host
that *is* on `ether8`. See [access.md](../access.md) §A / §B.
- **Data path test:** plug into any of `ether2` through `ether7` and expect a
`10.2.30.0/24` lease.
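
When a client does get a lease, a quick way to tell which side of the switch it landed on is to test the address against the two subnets from the topology recap. The helpers below are a minimal sketch in portable shell; the function names (`ip_to_int`, `in_subnet`, `which_vlan`) are illustrative and not part of the repo.

```bash
# Illustrative helpers (not from the repo): classify an IPv4 lease by subnet.

# ip_to_int ADDR: print a dotted-quad IPv4 address as a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1            # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_subnet ADDR CIDR: succeed if ADDR falls inside CIDR (e.g. 10.2.30.0/24).
in_subnet() {
  local net="${2%/*}" bits="${2#*/}" mask
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# which_vlan ADDR: name the VLAN an address belongs to, per this runbook.
which_vlan() {
  if in_subnet "$1" 10.2.30.0/24; then
    echo "data (VLAN 30)"
  elif in_subnet "$1" 192.168.88.0/24; then
    echo "mgmt (VLAN 99)"
  else
    echo "unknown"
    return 1
  fi
}
```

For example, `which_vlan 10.2.30.57` prints `data (VLAN 30)`, while an address outside both subnets prints `unknown` and returns a non-zero status.
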
## Diagnose
| Symptom | Check |
|---------|-------|
| No link / no DHCP on an access port | Confirm you're on an access port (`ether2` through `ether7`, data), not `ether8` (mgmt). Verify uplink `ether1` is up and that the gateway `10.2.30.1` is reachable through it. |
| Can't reach switch mgmt | Are you on `ether8`? Mgmt is reachable **nowhere else**. Confirm a `192.168.88.x` lease. |
| Suspected config drift | Diff live vs repo: run `ansible-playbook play_backup.yml`, then compare the refreshed `backups/crs310-maker/export.rsc` against the committed copy in git. |
| Lockout after a change | See recovery below. |

Connectivity test via Ansible (from `MakerFLOSS_Mikrotik`, mamba on `ether8`):
```bash
ansible -m community.routeros.command -a "commands='/system/resource/print'" crs310-maker
```
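
One caveat for the config-drift check: RouterOS exports normally begin with `#` comment lines that include a timestamp, so two exports of an identical configuration can still differ textually. Assuming the backup play stores the export as the device produced it (an assumption; the play may already normalize it), a comparison that ignores comment lines could look like the sketch below. The function name `export_drift` is hypothetical.

```bash
# export_drift OLD NEW: diff two RouterOS exports, ignoring "#" comment lines.
# Exit status follows diff: 0 = no drift, 1 = drift found, 2 = trouble.
export_drift() {
  local old new status
  old=$(mktemp) && new=$(mktemp) || return 2
  grep -v '^#' "$1" >"$old"      # drop the timestamped header comments
  grep -v '^#' "$2" >"$new"
  diff -u "$old" "$new"
  status=$?
  rm -f "$old" "$new"
  return "$status"
}

# Example (paths from this runbook; /tmp/committed.rsc is an arbitrary name):
#   git show HEAD:backups/crs310-maker/export.rsc > /tmp/committed.rsc
#   ansible-playbook play_backup.yml
#   export_drift /tmp/committed.rsc backups/crs310-maker/export.rsc
```
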
## Fix
Changes land in **`MakerFLOSS_Mikrotik` on `main`** (per the repo's own
workflow). Device-touching rules — **do not skip**:
- Run any device-touching play **twice**; the second run must report no changes
(idempotency).
- **Enable `vlan-filtering` last**, after bridge/PVID/mgmt-VLAN are proven.
- Network-affecting changes (mgmt IP/VLAN) should run as a **self-reverting
detached job** (240s timeout) so a bad flip auto-rolls-back.
- Keep a **WinBox MAC-telnet or serial** recovery channel open when touching
network settings.
```bash
# from ~/Projects/MakerFLOSS_Mikrotik, mamba on ether8
yamllint . && ansible-lint && ansible-playbook play_switch.yml --syntax-check
ansible-playbook play_switch.yml # full day-2
ansible-playbook play_switch.yml --tags vlans # one domain
ansible-playbook play_backup.yml # snapshot config into the repo
```
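
The run-twice rule can be checked mechanically by reading the `PLAY RECAP` that `ansible-playbook` prints at the end of a run. This is a minimal sketch, assuming the default human-readable output callback; `recap_is_clean` is a hypothetical helper, not something the repo provides.

```bash
# recap_is_clean: read ansible-playbook output on stdin and succeed only if
# every host line in the PLAY RECAP reports changed=0, unreachable=0, failed=0.
# Fails if no recap is found at all.
recap_is_clean() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /changed=/ {
      seen = 1
      if ($0 !~ /changed=0( |$)/ || $0 !~ /unreachable=0( |$)/ || $0 !~ /failed=0( |$)/)
        dirty = 1
    }
    END { if (seen && !dirty) exit 0; exit 1 }
  '
}

# Example: the first run applies, the second run must be a no-op.
#   ansible-playbook play_switch.yml &&
#     ansible-playbook play_switch.yml | recap_is_clean && echo "idempotent"
```
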
## Recovery (lockout)
Documented gotchas from the 2026-06-09 cutover (see the spec):
- **mamba NetworkManager flap** on the bench: set the `crs310-bench` connection
profile to `autoconnect yes` with a static `192.168.88.2/24` address.
- RouterOS `find ... address=` does **not** match IP prefixes — use
`find interface=` instead (caused a bridge-IP removal bug).
- If locked out over the network, recover via **WinBox MAC-telnet** on `ether8`
or serial console; the detached-job timeout should also self-revert.
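
The NetworkManager item above translates to a single `nmcli` call. The sketch below wraps it in a function with a dry-run hook; the function name `pin_bench_profile` and the `NMCLI` override variable are illustrative, and it assumes a connection profile named `crs310-bench` already exists on mamba.

```bash
# pin_bench_profile [PROFILE] [ADDRESS]: make the bench profile reconnect
# automatically and use a fixed address instead of DHCP. No gateway or DNS is
# set, since the mgmt VLAN is isolated.
pin_bench_profile() {
  local profile="${1:-crs310-bench}" addr="${2:-192.168.88.2/24}"
  # NMCLI is a hypothetical override so the command can be previewed
  # (NMCLI=echo) before it is run for real.
  ${NMCLI:-nmcli} connection modify "$profile" \
    connection.autoconnect yes \
    ipv4.addresses "$addr" \
    ipv4.method manual
}
```

Running `NMCLI=echo pin_bench_profile` prints the arguments that would be passed to `nmcli` without touching NetworkManager; after a real run, `nmcli connection up crs310-bench` applies the new settings.
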
## Verify
- `ansible-playbook play_switch.yml` second run → **no changes**.
- Access-port client gets a `10.2.30.0/24` lease and reaches the gateway.
- `ether8` client gets `192.168.88.x` and can SSH `192.168.88.1`.
- `export.rsc` committed and matches intent.