Initial troubleshooting workspace: access, network map, runbooks

Scaffold for troubleshooting MakerFLOSS hosts at the makerspace.
Reference + thin runbooks model — authoritative data stays in the
source repos (AnsibleBaobabV4, MakerFLOSS_Mikrotik, MakerFLOSS).

- access.md: reach paths for mamba-on-LAN and fisi-tunneling-in
  (netbird on-demand, VPS bastion, ProxyJump via kuku->mamba),
  with the isolation rule.
- network-map.md: subnet pointers + open question on makerspace
  addressing (10.2.30/172.17.3/10.0.0).
- runbooks/switch-crs310.md: CRS310 connectivity + lockout recovery.
- incidents/: dated log scaffold.
- CLAUDE.md: operating rules for this repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-09 13:24:26 +02:00
commit 9ff12700ae
8 changed files with 443 additions and 0 deletions

13
.gitignore vendored Normal file

@@ -0,0 +1,13 @@
# Never commit secrets or local scratch
.env
*.key
*.pem
secrets/
scratch/
*.retry
# Editor / OS
.DS_Store
*.swp
.idea/
.vscode/

44
CLAUDE.md Normal file

@@ -0,0 +1,44 @@
# CLAUDE.md — MakerFLOSS_Troubleshooting
Operating guide for working in this repo. This is a **troubleshooting workspace**
for MakerFLOSS hosts at the Orange Makerspace.
## What this repo is
Reference + thin runbooks. It does **not** hold authoritative IPs/topology/secrets
— those live in the source repos. Keep it that way; link, don't copy.
## Source repos (authoritative — most fixes land here)
- `~/Projects/AnsibleBaobabV4` — canonical infra-as-code: makerfloss VPS,
`makerfloss1`, `mf04`, `wg1` WireGuard plane, Netbird control plane, all
containers. Git remote = baobab Forgejo. Has Ansible vault (`prod`).
- `~/Projects/MakerFLOSS_Mikrotik` — the CRS310 switch. Ansible vault
(`makerfloss`). Strict lockout-safety rules — read its CLAUDE.md before
touching the device.
- `~/Projects/MakerFLOSS` — docs/slides site (docs.makerfloss.eu).
## Rules (decided 2026-06-09)
1. **Fixes go to the relevant source repo's `main`.** Apply directly there, then
run. For live switch/infra, follow that repo's idempotency + lockout-safety
rules (run device plays twice; enable VLAN-filtering last; detached
self-reverting jobs for mgmt changes).
2. **Access path for Claude (on fisi): Netbird, on-demand only.** Bring the
overlay up for the task, `netbird down` immediately after. Prefer the
VPS-bastion path when it suffices (no tunnel on fisi at all). **Isolation is
a hard requirement** — nothing from the makerspace should be able to reach
fisi/the homelab. See [access.md](access.md) §C.
3. **Reference, don't duplicate.** When you need a fact, link to the source-repo
file. If you cache a value here, note it can drift.
4. **Log real work** in [incidents/](incidents/) — symptom, root cause, the
source-repo commit, verification.
5. **Never commit secrets.** Vault keys live under `~/.ansible/vault-keys/`.
## Start-of-session checklist
1. [access.md](access.md) — pick a reach path for where you are.
2. [network-map.md](network-map.md) — confirm host/subnet (note the open
question about makerspace addressing).
3. [runbooks/](runbooks/) — find or write the runbook.
4. Verify with evidence before claiming a fix works.

48
README.md Normal file

@@ -0,0 +1,48 @@
# MakerFLOSS Troubleshooting
A working repo for troubleshooting and fixing hosts at the **Orange Makerspace**
that are part of the MakerFLOSS project.
This repo is **reference + thin runbooks**: it does *not* duplicate authoritative
data (IPs, topology, secrets). Those live in the source repos below. Here we keep
access procedures, runbooks, and an incident log, with pointers back to source.
## Source repos (authoritative)
| Repo | Path | What it owns |
|------|------|--------------|
| **AnsibleBaobabV4** | `~/Projects/AnsibleBaobabV4` | **Canonical infra-as-code.** The makerfloss VPS, `makerfloss1`, `mf04`, the makerfloss WireGuard plane (`wg1`), the Netbird control plane, and all containerised services. This is where most fixes land. |
| **MakerFLOSS_Mikrotik** | `~/Projects/MakerFLOSS_Mikrotik` | The **CRS310 switch** (`crs310-maker`) — Ansible-managed RouterOS config. The "new switch" at the makerspace. |
| **MakerFLOSS** | `~/Projects/MakerFLOSS` | Documentation site (`docs.makerfloss.eu`) and slides. Docs-only; the human-readable hardware/service catalog. |
> Note: `AnsibleBaobabV4` is a separate (homelab) project that also happens to
> manage the MakerFLOSS infrastructure — early MakerFLOSS work started there and
> stayed. Its git remote is the **baobab** Forgejo, not the MakerFLOSS one.
## Where fixes go
Fixes land in the **relevant source repo's `main`** branch (per decision
2026-06-09). Switch/live-infra changes still follow that repo's own
lockout-safety and idempotency rules (e.g. run device-touching plays twice;
enable VLAN-filtering last). This repo only holds runbooks and the incident log.
## Layout
```
.
├── access.md # HOW to reach makerspace hosts (read this first)
├── network-map.md # thin network overview + pointers + open questions
├── runbooks/ # task-focused troubleshooting guides
│   ├── README.md
│   └── switch-crs310.md
└── incidents/ # dated log of issues worked + outcomes
    └── README.md
```
## Quick start for a troubleshooting session
1. Read [`access.md`](access.md) — pick a reach path for where you are
(makerspace with mamba, or tunneling in from `fisi`).
2. Check [`network-map.md`](network-map.md) for the host/subnet you're after.
3. Find or create a runbook in [`runbooks/`](runbooks/).
4. Apply fixes in the source repo; log what happened in [`incidents/`](incidents/).

157
access.md Normal file

@@ -0,0 +1,157 @@
# Access — reaching MakerFLOSS / makerspace hosts
How to get a shell on the hosts we troubleshoot, from each place you might be
working. **Read the section that matches your situation.**
There are two actors:
- **You (sjat)** — at the makerspace with **mamba** (laptop), or elsewhere.
- **Claude** — lives on **fisi** (homelab) and must *tunnel in* to help at the
makerspace.
---
## The hosts and where they live
| Host | Address | SSH | Where it sits | Notes |
|------|---------|-----|---------------|-------|
| **makerfloss VPS** | `88.99.32.236` / `makerfloss.eu` | `:7576` | Public (Hetzner) | Public entrypoint + bastion. Forgejo on `:7577`. WG `wg1` hub `10.13.0.1`. Netbird control plane `nb.makerfloss.eu`. |
| **makerfloss1** | `172.17.3.51` | `:7576` | Makerspace LAN | Local Docker/dev host. `wg1` peer `10.13.0.2` (full-tunnel). |
| **mf04** | `10.0.0.184` | `:7576` | Makerspace LAN | Testmachine. `wg1` peer `10.13.0.3` (split-tunnel). Netbird peer. |
| **crs310-maker** (switch) | mgmt `192.168.88.1` | `:22` | Makerspace, port `ether8` only | Mgmt VLAN 99, isolated. Data VLAN 30 = `10.2.30.0/24`. See [runbooks/switch-crs310.md](runbooks/switch-crs310.md). |
| **fisi** (home) | `10.20.10.17` | `:7576` | Homelab | Where Claude runs. Behind OPNsense; baobab WG hub is **kuku** (`10.8.0.1`, UDP 51194). |
| **mamba** (laptop) | LAN-dependent | `:7576` | Roams | Baobab WG peer `10.8.0.4`; makerfloss `wg1` roaming peer `10.13.0.5`; Netbird peer. |
> The makerspace addressing is **not fully pinned** — see the open question in
> [network-map.md](network-map.md). Confirm the host's current IP before relying
> on the table.
---
## A. You're at the makerspace with mamba (on the LAN)
Easiest case — mamba is directly on the makerspace network.
- **The switch (mgmt):** plug mamba into **`ether8`** (mgmt VLAN 99). You get a
DHCP lease in `192.168.88.10-254`; switch is `192.168.88.1` (SSH, and web UI
`http://192.168.88.1`). A pinned NetworkManager profile `crs310-bench` /
static `192.168.88.2` is documented in the Mikrotik repo for this.
- **The switch (data ports `ether2-7`):** you land on data VLAN 30
(`10.2.30.0/24`, gateway `.1`) — normal user network, *no* switch mgmt here.
- **makerfloss1 / mf04:** SSH directly to their LAN IPs (`172.17.3.51`,
`10.0.0.184`) on `:7576`.
For Ansible against the switch from mamba, run from the `MakerFLOSS_Mikrotik`
repo with mamba on `ether8`.
---
## B. Claude tunneling in from fisi (no laptop at the makerspace)
fisi has **no direct route** to the makerspace LAN. Pick one path below. B1 is
the primary path, but prefer B2 whenever it suffices: the **isolation
requirement** (keep makerspace and homelab apart, see §C) favours having no
tunnel on fisi at all.
### B1. Netbird overlay — *on-demand only* (primary)
fisi is to be enrolled as a peer on the self-hosted Netbird control plane
(`nb.makerfloss.eu`), but the tunnel is brought **up only when needed and torn
down after**. This reaches other Netbird peers (mf04, mamba) on the
`100.92.0.0/16` overlay.
One-time enrollment (needs a setup key from the `nb.makerfloss.eu` dashboard):
```bash
# install netbird (Debian) — see netbirdio docs / baobab.netbird_client role
netbird up --setup-key <KEY> --management-url https://nb.makerfloss.eu
netbird status # confirm peers Connected
```
Per-session use:
```bash
netbird up # bring the overlay up for this session
# ... do the work, ssh to peer's 100.x address ...
netbird down # TEAR DOWN when done — do not leave it up
```
> **Why on-demand:** sjat's explicit constraint — nothing from the makerspace
> should be able to bleed into fisi/the homelab. Netbird stays **down by
> default** on fisi; it is not a standing tunnel.
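The up/work/down cycle above is easy to break by forgetting the teardown. A minimal sketch of a wrapper that always runs `netbird down` afterwards, assuming a hypothetical function name `with_netbird` and a hypothetical `NETBIRD_BIN` override (neither exists in the source repos):

```bash
# Hypothetical helper, not part of any source repo.
# NETBIRD_BIN is an assumed override so the binary can be swapped for a dry run.
NETBIRD_BIN="${NETBIRD_BIN:-netbird}"

with_netbird() {
  # Subshell: the EXIT trap fires when the subshell ends, whether the wrapped
  # command succeeded or failed, so the overlay is torn down either way.
  (
    trap '"$NETBIRD_BIN" down' EXIT
    "$NETBIRD_BIN" up || exit 1
    "$@"
  )
}

# Example (peer address left as a placeholder; take it from `netbird status`):
#   with_netbird ssh -p 7576 sjat@<peer 100.x address>
```

The wrapper returns the wrapped command's exit status, so it can be chained in scripts. It does not replace a manual `netbird status` check at the end of a session.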
### B2. Via the makerfloss VPS bastion (no tunnel on fisi)
The cleanest isolation-preserving path: fisi only ever talks to the **public
VPS**, which hops onward over its own `wg1` tunnel. fisi itself joins no
makerspace network.
```bash
# Land on the VPS:
ssh -p 7576 sjat@makerfloss.eu
# From the VPS, reach wg1 peers on the makerfloss plane:
ssh -p 7576 sjat@10.13.0.2 # makerfloss1
ssh -p 7576 sjat@10.13.0.3 # mf04
```
If/when the per-machine SSH DNAT from the VPN design is in place (VPS ports
`:7578`/`:7579` → testmachine `:7576`), a testmachine is reachable in a single
hop via the forwarded VPS port; confirm these DNAT rules exist before relying on
them. Independently of DNAT, ProxyJump collapses the two-step login above into
one command:
```bash
ssh -J sjat@makerfloss.eu:7576 -p 7576 sjat@10.13.0.3 # fisi → VPS → mf04
```
This path does **not** require mamba to be present.
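The bastion hop can be wrapped so the peer is picked by name. This is only a sketch: `wg1_addr` and `via_vps` are hypothetical names, and the addresses are cached from the host table above, so they can drift from the `AnsibleBaobabV4` host_vars.

```bash
# Hypothetical helper. The wg1 addresses are cached from the host table and can
# drift; the AnsibleBaobabV4 host_vars remain authoritative.
wg1_addr() {
  case "$1" in
    makerfloss1) echo 10.13.0.2 ;;
    mf04)        echo 10.13.0.3 ;;
    *) echo "unknown wg1 peer: $1" >&2; return 1 ;;
  esac
}

via_vps() {
  # fisi -> VPS (bastion) -> wg1 peer, all on the project-wide SSH port 7576.
  local addr
  addr="$(wg1_addr "$1")" || return 1
  shift
  ssh -J sjat@makerfloss.eu:7576 -p 7576 "sjat@${addr}" "$@"
}

# Example: via_vps mf04 uptime
```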
### B3. ProxyJump via kuku → mamba (only when mamba is at the makerspace)
This is the chain the `AnsibleBaobabV4` inventory already uses for `mf04`:
```
fisi → kuku (baobab WG hub) → mamba (WG peer 10.8.0.4, on the makerspace LAN) → mf04
```
```bash
ssh -o ProxyJump="kuku,sjat@10.8.0.4:7576" -p 7576 sjat@10.0.0.184 # mf04
```
In `AnsibleBaobabV4/host_vars/mf04.yml`:
`ansible_ssh_common_args: '-o ProxyJump="kuku,sjat@10.8.0.4:7576"'`.
**Constraint:** only works while mamba is physically at the makerspace, online,
and roaming on the baobab WG plane. Breaks the moment mamba leaves.
### Choosing between B1/B2/B3
| Need | Use |
|------|-----|
| Reach mf04/mamba over the overlay, mamba may be absent | **B1** (netbird, on-demand) |
| Reach makerfloss1 / mf04 / VPS-side services, max isolation, no tunnel on fisi | **B2** (VPS bastion) |
| mamba is at the makerspace and you want the existing Ansible path | **B3** (ProxyJump) |
| The **switch** mgmt plane (`192.168.88.1`) | None of these directly — mgmt VLAN 99 is reachable only from `ether8`. Need a host *on* `ether8` (mamba, or a testmachine patched in) to forward through. See switch runbook. |
---
## C. Isolation rule (important)
> **Keep the makerspace and the homelab apart.** fisi must not hold a standing
> tunnel into the makerspace. Bring Netbird **up only for the task, down
> immediately after**. Prefer the VPS-bastion path (B2) when it suffices, since
> it puts no makerspace network on fisi at all. The concern is preventing any
> compromise at the makerspace from reaching fisi/the homelab.
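One way to back the rule with evidence is to check that the Netbird interface is gone from fisi at the end of a session. The sketch below assumes the overlay interface is `wt0` (the name given in network-map.md), that it is removed on `netbird down`, and that the host is Linux with sysfs; `overlay_is_down` and `NET_SYSFS` are hypothetical names.

```bash
# Hypothetical check. NET_SYSFS is an assumed override of the Linux sysfs path,
# mainly so the function can be exercised without a real interface.
NET_SYSFS="${NET_SYSFS:-/sys/class/net}"

overlay_is_down() {
  # Succeeds only if no wt0 interface exists on this host.
  [ ! -e "$NET_SYSFS/wt0" ]
}

# Example: refuse to end a session while the overlay is still up.
#   overlay_is_down || { echo "wt0 still present: run 'netbird down'" >&2; exit 1; }
```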
---
## D. Secrets & credentials (where they live — not stored here)
- **Ansible vault (baobab):** `~/.ansible/vault-keys/prod.txt`; secrets in
`AnsibleBaobabV4/group_vars/all/90-secrets.vault.yml`
(`ansible-vault edit ... --vault-id prod@`). WG/Netbird/service secrets are
`vault_*` keys there.
- **Ansible vault (switch):** identity `makerfloss`,
`~/.ansible/vault-keys/makerfloss.txt`; switch admin password in
`MakerFLOSS_Mikrotik/group_vars/mikrotik.vault.yml`.
- **SSH:** operator key `~/.ssh/id_ed25519`; project-wide SSH port is **7576**
(switch SSH is on `:22`, Forgejo git on `:7577`).
Never commit secrets to this repo.
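A staged-file check can catch a secret that slipped past `.gitignore` (for example via `git add -f`). The sketch below is an assumption, not an existing hook: `find_secret_paths` is a hypothetical name, and the patterns mirror the secret entries in this repo's `.gitignore` (`.env`, `*.key`, `*.pem`, `secrets/`).

```bash
# Hypothetical pre-commit style check: reads file paths on stdin and prints the
# ones matching the secret patterns from .gitignore.
# Exit status 0 means at least one suspicious path was found.
find_secret_paths() {
  grep -E '(^|/)\.env$|\.(key|pem)$|(^|/)secrets/'
}

# Example: fail if anything staged looks like a secret.
#   if git diff --cached --name-only | find_secret_paths; then exit 1; fi
```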

33
incidents/README.md Normal file

@@ -0,0 +1,33 @@
# Incidents
Dated log of issues we troubleshoot and how they resolved. One file per
incident: `YYYY-MM-DD-short-slug.md`. Keep them short and factual — symptom,
what we found, what we changed (and in which source repo), outcome.
## Suggested template
```markdown
# YYYY-MM-DD — <short title>
- **Host(s):**
- **Reported by / observed:**
- **Reach path used:** (access.md §...)
## Symptom
## Investigation
## Root cause
## Fix
- Source repo + commit:
- Live action taken:
## Verification
## Follow-ups
```
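To keep file names consistent with the `YYYY-MM-DD-short-slug.md` convention, a small helper can derive the name from a date and a title. `incident_filename` is a hypothetical name and the slug rules (lowercase, runs of non-alphanumerics collapsed to `-`) are an assumption about what "short-slug" should mean.

```bash
# Hypothetical helper: builds an incident file name in the
# YYYY-MM-DD-short-slug.md form from a date and a free-text title.
incident_filename() {
  local day="$1" title="$2" slug
  # Lowercase, collapse runs of non-alphanumerics to "-", trim "-" at both ends.
  slug="$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')"
  printf '%s-%s.md\n' "$day" "$slug"
}

# Example: incident_filename "$(date +%F)" "CRS310 mgmt lockout"
```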
## Log
_(none yet)_

43
network-map.md Normal file

@@ -0,0 +1,43 @@
# Network map (thin)
Pointers, not the source of truth. Authoritative data is in the source repos —
links below. Confirm live values before acting.
## Subnets seen across the repos
| Subnet | Role | Source of truth |
|--------|------|-----------------|
| `10.2.30.0/24` | **CRS310 data VLAN 30** (the new switch). Uplink `ether1` → gateway `10.2.30.1`; access ports `ether2-7`. | `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml`, `docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md` |
| `192.168.88.0/24` | **CRS310 mgmt VLAN 99**: isolated, switch at `192.168.88.1`, reachable only from `ether8`. DHCP `.10-.254`. | same |
| `172.17.3.0/24` | OrangeMakers LAN — `makerfloss1` at `.51`. | `AnsibleBaobabV4/host_vars/makerfloss1.yml` |
| `10.0.0.0/24` | Makerspace LAN — `mf04` at `.184`. | `AnsibleBaobabV4/host_vars/mf04.yml` |
| `10.13.0.0/24` | **makerfloss WireGuard plane (`wg1`)**. Hub `10.13.0.1` (VPS), `makerfloss1` `.2`, `mf04` `.3`, `sjat-roaming` `.5`. UDP `:51820`. | `AnsibleBaobabV4/host_vars/makerfloss.yml`, `specs/2026-05-12-makerfloss-wireguard-design.md` |
| `100.92.0.0/16` | **Netbird overlay** (`wt0`), control plane `nb.makerfloss.eu`. | `specs/2026-05-27-makerspace-vpn-design.md` |
| `10.8.0.0/24` | baobab (home) WireGuard plane. Hub **kuku** `10.8.0.1` (UDP `:51194`); mamba `10.8.0.4`. | `AnsibleBaobabV4` |
| `10.20.10.0/24` | homelab LAN — **fisi** `.17`, kuku `.118`, papa `.11`. | `AnsibleBaobabV4` |
## ⚠️ Open question — makerspace addressing
The makerspace shows **three different subnets** for hosts that are all
physically there: `10.2.30.0/24` (new CRS310 data VLAN), `172.17.3.0/24`
(`makerfloss1`), and `10.0.0.0/24` (`mf04`). It's not yet documented here how
they relate — i.e. whether the new switch sits in front of the existing
OrangeMakers `172.17.3.x` / `10.0.0.x` network, replaces part of it, or is a
parallel segment.
**To resolve (sjat to confirm on-site):**
- What IP does an ethernet client on the new switch's data ports (`ether2-7`)
actually get, and from which DHCP server / gateway?
- Are `makerfloss1` (`172.17.3.51`) and `mf04` (`10.0.0.184`) reachable from a
data-port client on the new switch, or are they on a different segment?
- Is the `10.2.30.1` uplink gateway the OrangeMakers router, or something new?
Update this section (and the host IPs in [access.md](access.md)) once confirmed.
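When collecting on-site evidence for the questions above, it helps to label whatever address a client receives against the known subnets. A rough sketch, where `which_subnet` is a hypothetical name and the roles are cached from the table in this file (so they can drift):

```bash
# Hypothetical helper. Every listed subnet is a /24 except the Netbird /16, so
# prefix matching on the dotted address is enough here.
which_subnet() {
  case "$1" in
    10.2.30.*)    echo "CRS310 data VLAN 30" ;;
    192.168.88.*) echo "CRS310 mgmt VLAN 99" ;;
    172.17.3.*)   echo "OrangeMakers LAN (makerfloss1)" ;;
    10.0.0.*)     echo "Makerspace LAN (mf04)" ;;
    10.13.0.*)    echo "makerfloss wg1 plane" ;;
    100.92.*)     echo "Netbird overlay" ;;
    *)            echo "unknown"; return 1 ;;
  esac
}

# Example: label every IPv4 address currently configured on this host.
#   ip -4 -o addr show | awk '{split($4, a, "/"); print a[1]}' |
#     while read -r ip; do echo "$ip: $(which_subnet "$ip")"; done
```

Pairing this with `ip route show default` on the same client records the gateway, which addresses the second half of the first question.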
## Public services (makerfloss VPS, `88.99.32.236`)
All TLS-terminated at the VPS via Traefik, certs via Gandi DNS-01:
`docs.makerfloss.eu`, `slides.makerfloss.eu`, `forgejo.makerfloss.eu` (git SSH
`:7577`), `mail.makerfloss.eu` (Poste.io), `discourse.makerfloss.eu`,
`snipeit.makerfloss.eu`, `nb.makerfloss.eu` (Netbird).
Source: `AnsibleBaobabV4/host_vars/makerfloss.yml`.

22
runbooks/README.md Normal file

@@ -0,0 +1,22 @@
# Runbooks
Task-focused troubleshooting guides. Thin — they point at the source repos for
authoritative config and commands rather than duplicating them.
## Index
| Runbook | Covers |
|---------|--------|
| [switch-crs310.md](switch-crs310.md) | The MikroTik CRS310 switch — connectivity, VLANs, mgmt-plane lockout recovery, Ansible reconfig. |
## Adding a runbook
Keep it lean. Good structure:
1. **Symptom** — what you observe.
2. **Reach** — which [access.md](../access.md) path applies.
3. **Diagnose** — concrete checks (commands, what good/bad looks like).
4. **Fix** — where the change lands (source repo + file), and the safety rules
for applying it to live infra.
5. **Verify** — how you confirm it's actually fixed (evidence, not assertion).
6. **Links** — source-repo specs/runbooks.

83
runbooks/switch-crs310.md Normal file

@@ -0,0 +1,83 @@
# Runbook — CRS310 switch (`crs310-maker`)
The "new switch" at the makerspace. MikroTik **CRS310-8G+2S+IN**, RouterOS
7.19.6. Managed by Ansible from `~/Projects/MakerFLOSS_Mikrotik`.
**Authoritative sources:**
- Live topology + cutover runbook:
`MakerFLOSS_Mikrotik/docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md`
- Running config snapshot: `MakerFLOSS_Mikrotik/backups/crs310-maker/export.rsc`
- Device vars: `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml`
- Field guide: `MakerFLOSS_Mikrotik/docs/makerspace-switch-fieldguide.md`
## Topology recap
- **Transparent L2.** The switch is *not* a router — no inter-VLAN routing, no
presence on the data network.
- **Data VLAN 30** (`10.2.30.0/24`, gw `.1`): `ether1` = uplink, `ether2-7` =
access (untagged), SFP+ reserved (deferred). Users plug in here.
- **Mgmt VLAN 99** (`192.168.88.0/24`): switch at `192.168.88.1`, reachable
**only via `ether8`** (untagged). DHCP `.10-.254`, web UI on. CPU is the only
tagged member. No default route, no DNS, no NTP — isolated by design.
- `vlan-filtering=yes` went live **2026-06-09**.
## Reach
- **Mgmt (reconfig, SSH, web UI):** you need a host on **`ether8`**. On-site:
plug mamba into `ether8`, SSH `192.168.88.1` (or `http://192.168.88.1`).
Remote: there is no standing tunnel to the mgmt VLAN — forward through a host
that *is* on `ether8`. See [access.md](../access.md) §A / §B.
- **Data path test:** plug into `ether2-7`, expect a `10.2.30.0/24` lease.
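For the remote mgmt case, forwarding through a host on `ether8` could look like the sketch below. It only prints the commands (a dry run) rather than connecting. The function names, the `SWITCH_USER` default of `admin`, and local port 8080 are assumptions; the jump host must be reachable from where you run this and must also hold a `192.168.88.x` address on `ether8`.

```bash
# Hypothetical dry-run helpers: they print ssh commands, they do not run them.
SWITCH_ADDR=192.168.88.1

switch_ssh_cmd() {
  # $1: user@host of the machine on ether8, $2: its SSH port (default 7576).
  printf 'ssh -J %s:%s -p 22 %s@%s\n' \
    "$1" "${2:-7576}" "${SWITCH_USER:-admin}" "$SWITCH_ADDR"
}

switch_webui_cmd() {
  # Forwards local port 8080 to the switch web UI through the same host.
  printf 'ssh -N -L 8080:%s:80 -p %s %s\n' "$SWITCH_ADDR" "${2:-7576}" "$1"
}

# Example: switch_ssh_cmd sjat@<host-on-ether8>
```

Keeping these as printed commands leaves the decision to connect with the operator, which fits the lockout-safety rules in the Mikrotik repo.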
## Diagnose
| Symptom | Check |
|---------|-------|
| No link / no DHCP on an access port | Confirm you're on `ether2-7` (data), not `ether8` (mgmt). Verify uplink `ether1` is up to `10.2.30.1`. |
| Can't reach switch mgmt | Are you on `ether8`? Mgmt is reachable **nowhere else**. Confirm a `192.168.88.x` lease. |
| Suspected config drift | Diff live vs repo: run a backup play, compare `backups/crs310-maker/export.rsc` to git. |
| Lockout after a change | See recovery below. |
Connectivity test via Ansible (from `MakerFLOSS_Mikrotik`, mamba on `ether8`):
```bash
ansible -m community.routeros.command -a "commands='/system/resource/print'" crs310-maker
```
## Fix
Changes land in **`MakerFLOSS_Mikrotik` on `main`** (per the repo's own
workflow). Device-touching rules — **do not skip**:
- Run any device-touching play **twice**; the second run must report no changes
(idempotency).
- **Enable `vlan-filtering` last**, after bridge/PVID/mgmt-VLAN are proven.
- Network-affecting changes (mgmt IP/VLAN) should run as a **self-reverting
detached job** (240s timeout) so a bad flip auto-rolls-back.
- Keep a **WinBox MAC-telnet or serial** recovery channel open when touching
network settings.
```bash
# from ~/Projects/MakerFLOSS_Mikrotik, mamba on ether8
yamllint . && ansible-lint && ansible-playbook play_switch.yml --syntax-check
ansible-playbook play_switch.yml # full day-2
ansible-playbook play_switch.yml --tags vlans # one domain
ansible-playbook play_backup.yml # snapshot config into the repo
```
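The "run twice, second run reports no changes" rule can be checked mechanically from the play recap instead of by eye. A minimal sketch, assuming the standard `PLAY RECAP` line format (`ok=… changed=… unreachable=… failed=…`); `recap_is_idempotent` and `run2.log` are hypothetical names.

```bash
# Hypothetical helper: reads ansible-playbook output on stdin and succeeds only
# if at least one recap line was seen and every recap line reports
# changed=0, unreachable=0 and failed=0.
recap_is_idempotent() {
  awk '
    /ok=[0-9]+[ \t]+changed=[0-9]+/ {
      seen = 1
      if ($0 !~ /changed=0([^0-9]|$)/ ||
          $0 !~ /unreachable=0([^0-9]|$)/ ||
          $0 !~ /failed=0([^0-9]|$)/) bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Example:
#   ansible-playbook play_switch.yml                  # first run may change things
#   ansible-playbook play_switch.yml | tee run2.log   # second run
#   recap_is_idempotent < run2.log && echo "idempotent"
```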
## Recovery (lockout)
Documented gotchas from the 2026-06-09 cutover (see the spec):
- **mamba NetworkManager flap** on the bench — pin the `crs310-bench` profile
`autoconnect yes`, static `192.168.88.2/24`.
- RouterOS `find ... address=` does **not** match IP prefixes — use
`find interface=` instead (caused a bridge-IP removal bug).
- If locked out over the network, recover via **WinBox MAC-telnet** on `ether8`
or serial console; the detached-job timeout should also self-revert.
## Verify
- `ansible-playbook play_switch.yml` second run → **no changes**.
- Access-port client gets a `10.2.30.0/24` lease and reaches the gateway.
- `ether8` client gets `192.168.88.x` and can SSH `192.168.88.1`.
- `export.rsc` committed and matches intent.