commit 9ff12700ae651fa2730ad7808532f4012b4fd9d3 Author: sjat Date: Tue Jun 9 13:24:26 2026 +0200 Initial troubleshooting workspace: access, network map, runbooks Scaffold for troubleshooting MakerFLOSS hosts at the makerspace. Reference + thin runbooks model — authoritative data stays in the source repos (AnsibleBaobabV4, MakerFLOSS_Mikrotik, MakerFLOSS). - access.md: reach paths for mamba-on-LAN and fisi-tunneling-in (netbird on-demand, VPS bastion, ProxyJump via kuku->mamba), with the isolation rule. - network-map.md: subnet pointers + open question on makerspace addressing (10.2.30/172.17.3/10.0.0). - runbooks/switch-crs310.md: CRS310 connectivity + lockout recovery. - incidents/: dated log scaffold. - CLAUDE.md: operating rules for this repo. Co-Authored-By: Claude Opus 4.8 (1M context) diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..548bfda --- /dev/null +++ b/.gitignore @@ -0,0 +1,13 @@ +# Never commit secrets or local scratch +.env +*.key +*.pem +secrets/ +scratch/ +*.retry + +# Editor / OS +.DS_Store +*.swp +.idea/ +.vscode/ diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..b940729 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,44 @@ +# CLAUDE.md — MakerFLOSS_Troubleshooting + +Operating guide for working in this repo. This is a **troubleshooting workspace** +for MakerFLOSS hosts at the Orange Makerspace. + +## What this repo is + +Reference + thin runbooks. It does **not** hold authoritative IPs/topology/secrets +— those live in the source repos. Keep it that way; link, don't copy. + +## Source repos (authoritative — most fixes land here) + +- `~/Projects/AnsibleBaobabV4` — canonical infra-as-code: makerfloss VPS, + `makerfloss1`, `mf04`, `wg1` WireGuard plane, Netbird control plane, all + containers. Git remote = baobab Forgejo. Has Ansible vault (`prod`). +- `~/Projects/MakerFLOSS_Mikrotik` — the CRS310 switch. Ansible vault + (`makerfloss`). 
Strict lockout-safety rules — read its CLAUDE.md before + touching the device. +- `~/Projects/MakerFLOSS` — docs/slides site (docs.makerfloss.eu). + +## Rules (decided 2026-06-09) + +1. **Fixes go to the relevant source repo's `main`.** Apply directly there, then + run. For live switch/infra, follow that repo's idempotency + lockout-safety + rules (run device plays twice; enable VLAN-filtering last; detached + self-reverting jobs for mgmt changes). +2. **Access path for Claude (on fisi): Netbird, on-demand only.** Bring the + overlay up for the task, `netbird down` immediately after. Prefer the + VPS-bastion path when it suffices (no tunnel on fisi at all). **Isolation is + a hard requirement** — nothing from the makerspace should be able to reach + fisi/the homelab. See [access.md](access.md) §C. +3. **Reference, don't duplicate.** When you need a fact, link to the source-repo + file. If you cache a value here, note it can drift. +4. **Log real work** in [incidents/](incidents/) — symptom, root cause, the + source-repo commit, verification. +5. **Never commit secrets.** Vault keys live under `~/.ansible/vault-keys/`. + +## Start-of-session checklist + +1. [access.md](access.md) — pick a reach path for where you are. +2. [network-map.md](network-map.md) — confirm host/subnet (note the open + question about makerspace addressing). +3. [runbooks/](runbooks/) — find or write the runbook. +4. Verify with evidence before claiming a fix works. diff --git a/README.md b/README.md new file mode 100644 index 0000000..dbae933 --- /dev/null +++ b/README.md @@ -0,0 +1,48 @@ +# MakerFLOSS Troubleshooting + +A working repo for troubleshooting and fixing hosts at the **Orange Makerspace** +that are part of the MakerFLOSS project. + +This repo is **reference + thin runbooks**: it does *not* duplicate authoritative +data (IPs, topology, secrets). Those live in the source repos below. Here we keep +access procedures, runbooks, and an incident log, with pointers back to source. 
+ +## Source repos (authoritative) + +| Repo | Path | What it owns | +|------|------|--------------| +| **AnsibleBaobabV4** | `~/Projects/AnsibleBaobabV4` | **Canonical infra-as-code.** The makerfloss VPS, `makerfloss1`, `mf04`, the makerfloss WireGuard plane (`wg1`), the Netbird control plane, and all containerised services. This is where most fixes land. | +| **MakerFLOSS_Mikrotik** | `~/Projects/MakerFLOSS_Mikrotik` | The **CRS310 switch** (`crs310-maker`) — Ansible-managed RouterOS config. The "new switch" at the makerspace. | +| **MakerFLOSS** | `~/Projects/MakerFLOSS` | Documentation site (`docs.makerfloss.eu`) and slides. Docs-only; the human-readable hardware/service catalog. | + +> Note: `AnsibleBaobabV4` is a separate (homelab) project that also happens to +> manage the MakerFLOSS infrastructure — early MakerFLOSS work started there and +> stayed. Its git remote is the **baobab** Forgejo, not the MakerFLOSS one. + +## Where fixes go + +Fixes land in the **relevant source repo's `main`** branch (per decision +2026-06-09). Switch/live-infra changes still follow that repo's own +lockout-safety and idempotency rules (e.g. run device-touching plays twice; +enable VLAN-filtering last). This repo only holds runbooks and the incident log. + +## Layout + +``` +. +├── access.md # HOW to reach makerspace hosts (read this first) +├── network-map.md # thin network overview + pointers + open questions +├── runbooks/ # task-focused troubleshooting guides +│ ├── README.md +│ └── switch-crs310.md +└── incidents/ # dated log of issues worked + outcomes + └── README.md +``` + +## Quick start for a troubleshooting session + +1. Read [`access.md`](access.md) — pick a reach path for where you are + (makerspace with mamba, or tunneling in from `fisi`). +2. Check [`network-map.md`](network-map.md) for the host/subnet you're after. +3. Find or create a runbook in [`runbooks/`](runbooks/). +4. Apply fixes in the source repo; log what happened in [`incidents/`](incidents/). 
diff --git a/access.md b/access.md new file mode 100644 index 0000000..40a100f --- /dev/null +++ b/access.md @@ -0,0 +1,157 @@ +# Access — reaching MakerFLOSS / makerspace hosts + +How to get a shell on the hosts we troubleshoot, from each place you might be +working. **Read the section that matches your situation.** + +There are two actors: +- **You (sjat)** — at the makerspace with **mamba** (laptop), or elsewhere. +- **Claude** — lives on **fisi** (homelab) and must *tunnel in* to help at the + makerspace. + +--- + +## The hosts and where they live + +| Host | Address | SSH | Where it sits | Notes | +|------|---------|-----|---------------|-------| +| **makerfloss VPS** | `88.99.32.236` / `makerfloss.eu` | `:7576` | Public (Hetzner) | Public entrypoint + bastion. Forgejo on `:7577`. WG `wg1` hub `10.13.0.1`. Netbird control plane `nb.makerfloss.eu`. | +| **makerfloss1** | `172.17.3.51` | `:7576` | Makerspace LAN | Local Docker/dev host. `wg1` peer `10.13.0.2` (full-tunnel). | +| **mf04** | `10.0.0.184` | `:7576` | Makerspace LAN | Testmachine. `wg1` peer `10.13.0.3` (split-tunnel). Netbird peer. | +| **crs310-maker** (switch) | mgmt `192.168.88.1` | `:22` | Makerspace, port `ether8` only | Mgmt VLAN 99, isolated. Data VLAN 30 = `10.2.30.0/24`. See [runbooks/switch-crs310.md](runbooks/switch-crs310.md). | +| **fisi** (home) | `10.20.10.17` | `:7576` | Homelab | Where Claude runs. Behind OPNsense; baobab WG hub is **kuku** (`10.8.0.1`, UDP 51194). | +| **mamba** (laptop) | LAN-dependent | `:7576` | Roams | Baobab WG peer `10.8.0.4`; makerfloss `wg1` roaming peer `10.13.0.5`; Netbird peer. | + +> The makerspace addressing is **not fully pinned** — see the open question in +> [network-map.md](network-map.md). Confirm the host's current IP before relying +> on the table. + +--- + +## A. You're at the makerspace with mamba (on the LAN) + +Easiest case — mamba is directly on the makerspace network. 
+ +- **The switch (mgmt):** plug mamba into **`ether8`** (mgmt VLAN 99). You get a + DHCP lease in `192.168.88.10–254`; switch is `192.168.88.1` (SSH, and web UI + `http://192.168.88.1`). A pinned NetworkManager profile `crs310-bench` / + static `192.168.88.2` is documented in the Mikrotik repo for this. +- **The switch (data ports `ether2–7`):** you land on data VLAN 30 + (`10.2.30.0/24`, gateway `.1`) — normal user network, *no* switch mgmt here. +- **makerfloss1 / mf04:** SSH directly to their LAN IPs (`172.17.3.51`, + `10.0.0.184`) on `:7576`. + +For Ansible against the switch from mamba, run from the `MakerFLOSS_Mikrotik` +repo with mamba on `ether8`. + +--- + +## B. Claude tunneling in from fisi (no laptop at the makerspace) + +fisi has **no direct route** to the makerspace LAN. Pick one path. Ordered by +preference given the **isolation requirement** (keep makerspace and homelab +apart — see the rule box below). + +### B1. Netbird overlay — *on-demand only* (primary) + +fisi is to be enrolled as a peer on the self-hosted Netbird control plane +(`nb.makerfloss.eu`), but the tunnel is brought **up only when needed and torn +down after**. This reaches other Netbird peers (mf04, mamba) on the +`100.92.0.0/16` overlay. + +One-time enrollment (needs a setup key from the `nb.makerfloss.eu` dashboard): + +```bash +# install netbird (Debian) — see netbirdio docs / baobab.netbird_client role +netbird up --setup-key --management-url https://nb.makerfloss.eu +netbird status # confirm peers Connected +``` + +Per-session use: + +```bash +netbird up # bring the overlay up for this session +# ... do the work, ssh to peer's 100.x address ... +netbird down # TEAR DOWN when done — do not leave it up +``` + +> **Why on-demand:** sjat's explicit constraint — nothing from the makerspace +> should be able to bleed into fisi/the homelab. Netbird stays **down by +> default** on fisi; it is not a standing tunnel. + +### B2. 
Via the makerfloss VPS bastion (no tunnel on fisi) + +The cleanest isolation-preserving path: fisi only ever talks to the **public +VPS**, which hops onward over its own `wg1` tunnel. fisi itself joins no +makerspace network. + +```bash +# Land on the VPS: +ssh -p 7576 sjat@makerfloss.eu + +# From the VPS, reach wg1 peers on the makerfloss plane: +ssh -p 7576 sjat@10.13.0.2 # makerfloss1 +ssh -p 7576 sjat@10.13.0.3 # mf04 +``` + +Or, if/when the per-machine SSH DNAT from the VPN design is in place, jump +straight through the VPS with a single hop (ports `:7578`/`:7579` → testmachine +`:7576`). Confirm these DNAT rules exist before relying on them. + +```bash +ssh -J sjat@makerfloss.eu:7576 -p 7576 sjat@10.13.0.3 # fisi → VPS → mf04 +``` + +This path does **not** require mamba to be present. + +### B3. ProxyJump via kuku → mamba (only when mamba is at the makerspace) + +This is the chain the `AnsibleBaobabV4` inventory already uses for `mf04`: + +``` +fisi → kuku (baobab WG hub) → mamba (WG peer 10.8.0.4, on the makerspace LAN) → mf04 +``` + +```bash +ssh -o ProxyJump="kuku,sjat@10.8.0.4:7576" -p 7576 sjat@10.0.0.184 # mf04 +``` + +In `AnsibleBaobabV4/host_vars/mf04.yml`: +`ansible_ssh_common_args: '-o ProxyJump="kuku,sjat@10.8.0.4:7576"'`. + +**Constraint:** only works while mamba is physically at the makerspace, online, +and roaming on the baobab WG plane. Breaks the moment mamba leaves. + +### Choosing between B1/B2/B3 + +| Need | Use | +|------|-----| +| Reach mf04/mamba over the overlay, mamba may be absent | **B1** (netbird, on-demand) | +| Reach makerfloss1 / mf04 / VPS-side services, max isolation, no tunnel on fisi | **B2** (VPS bastion) | +| mamba is at the makerspace and you want the existing Ansible path | **B3** (ProxyJump) | +| The **switch** mgmt plane (`192.168.88.1`) | None of these directly — mgmt VLAN 99 is reachable only from `ether8`. Need a host *on* `ether8` (mamba, or a testmachine patched in) to forward through. See switch runbook. 
| + +--- + +## C. Isolation rule (important) + +> **Keep the makerspace and the homelab apart.** fisi must not hold a standing +> tunnel into the makerspace. Bring Netbird **up only for the task, down +> immediately after**. Prefer the VPS-bastion path (B2) when it suffices, since +> it puts no makerspace network on fisi at all. The concern is preventing any +> compromise at the makerspace from reaching fisi/the homelab. + +--- + +## D. Secrets & credentials (where they live — not stored here) + +- **Ansible vault (baobab):** `~/.ansible/vault-keys/prod.txt`; secrets in + `AnsibleBaobabV4/group_vars/all/90-secrets.vault.yml` + (`ansible-vault edit ... --vault-id prod@`). WG/Netbird/service secrets are + `vault_*` keys there. +- **Ansible vault (switch):** identity `makerfloss`, + `~/.ansible/vault-keys/makerfloss.txt`; switch admin password in + `MakerFLOSS_Mikrotik/group_vars/mikrotik.vault.yml`. +- **SSH:** operator key `~/.ssh/id_ed25519`; project-wide SSH port is **7576** + (switch SSH is on `:22`, Forgejo git on `:7577`). + +Never commit secrets to this repo. diff --git a/incidents/README.md b/incidents/README.md new file mode 100644 index 0000000..d36c1c1 --- /dev/null +++ b/incidents/README.md @@ -0,0 +1,33 @@ +# Incidents + +Dated log of issues we troubleshoot and how they resolved. One file per +incident: `YYYY-MM-DD-short-slug.md`. Keep them short and factual — symptom, +what we found, what we changed (and in which source repo), outcome. + +## Suggested template + +```markdown +# YYYY-MM-DD — + +- **Host(s):** +- **Reported by / observed:** +- **Reach path used:** (access.md §...) + +## Symptom + +## Investigation + +## Root cause + +## Fix +- Source repo + commit: +- Live action taken: + +## Verification + +## Follow-ups +``` + +## Log + +_(none yet)_ diff --git a/network-map.md b/network-map.md new file mode 100644 index 0000000..6ad07c8 --- /dev/null +++ b/network-map.md @@ -0,0 +1,43 @@ +# Network map (thin) + +Pointers, not the source of truth. 
Authoritative data is in the source repos — +links below. Confirm live values before acting. + +## Subnets seen across the repos + +| Subnet | Role | Source of truth | +|--------|------|-----------------| +| `10.2.30.0/24` | **CRS310 data VLAN 30** (the new switch). Uplink `ether1` → gateway `10.2.30.1`; access ports `ether2–7`. | `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml`, `docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md` | +| `192.168.88.0/24` | **CRS310 mgmt VLAN 99** — isolated, switch at `192.168.88.1`, reachable only from `ether8`. DHCP `.10–.254`. | same | +| `172.17.3.0/24` | OrangeMakers LAN — `makerfloss1` at `.51`. | `AnsibleBaobabV4/host_vars/makerfloss1.yml` | +| `10.0.0.0/24` | Makerspace LAN — `mf04` at `.184`. | `AnsibleBaobabV4/host_vars/mf04.yml` | +| `10.13.0.0/24` | **makerfloss WireGuard plane (`wg1`)**. Hub `10.13.0.1` (VPS), `makerfloss1` `.2`, `mf04` `.3`, `sjat-roaming` `.5`. UDP `:51820`. | `AnsibleBaobabV4/host_vars/makerfloss.yml`, `specs/2026-05-12-makerfloss-wireguard-design.md` | +| `100.92.0.0/16` | **Netbird overlay** (`wt0`), control plane `nb.makerfloss.eu`. | `specs/2026-05-27-makerspace-vpn-design.md` | +| `10.8.0.0/24` | baobab (home) WireGuard plane. Hub **kuku** `10.8.0.1` (UDP `:51194`); mamba `10.8.0.4`. | `AnsibleBaobabV4` | +| `10.20.10.0/24` | homelab LAN — **fisi** `.17`, kuku `.118`, papa `.11`. | `AnsibleBaobabV4` | + +## ⚠️ Open question — makerspace addressing + +The makerspace shows **three different subnets** for hosts that are all +physically there: `10.2.30.0/24` (new CRS310 data VLAN), `172.17.3.0/24` +(`makerfloss1`), and `10.0.0.0/24` (`mf04`). It's not yet documented here how +they relate — i.e. whether the new switch sits in front of the existing +OrangeMakers `172.17.3.x` / `10.0.0.x` network, replaces part of it, or is a +parallel segment. 
+ +**To resolve (sjat to confirm on-site):** +- What IP does an ethernet client on the new switch's data ports (`ether2–7`) + actually get, and from which DHCP server / gateway? +- Are `makerfloss1` (`172.17.3.51`) and `mf04` (`10.0.0.184`) reachable from a + data-port client on the new switch, or are they on a different segment? +- Is the `10.2.30.1` uplink gateway the OrangeMakers router, or something new? + +Update this section (and the host IPs in [access.md](access.md)) once confirmed. + +## Public services (makerfloss VPS, `88.99.32.236`) + +All TLS-terminated at the VPS via Traefik, certs via Gandi DNS-01: +`docs.makerfloss.eu`, `slides.makerfloss.eu`, `forgejo.makerfloss.eu` (git SSH +`:7577`), `mail.makerfloss.eu` (Poste.io), `discourse.makerfloss.eu`, +`snipeit.makerfloss.eu`, `nb.makerfloss.eu` (Netbird). +Source: `AnsibleBaobabV4/host_vars/makerfloss.yml`. diff --git a/runbooks/README.md b/runbooks/README.md new file mode 100644 index 0000000..da532da --- /dev/null +++ b/runbooks/README.md @@ -0,0 +1,22 @@ +# Runbooks + +Task-focused troubleshooting guides. Thin — they point at the source repos for +authoritative config and commands rather than duplicating them. + +## Index + +| Runbook | Covers | +|---------|--------| +| [switch-crs310.md](switch-crs310.md) | The MikroTik CRS310 switch — connectivity, VLANs, mgmt-plane lockout recovery, Ansible reconfig. | + +## Adding a runbook + +Keep it lean. Good structure: + +1. **Symptom** — what you observe. +2. **Reach** — which [access.md](../access.md) path applies. +3. **Diagnose** — concrete checks (commands, what good/bad looks like). +4. **Fix** — where the change lands (source repo + file), and the safety rules + for applying it to live infra. +5. **Verify** — how you confirm it's actually fixed (evidence, not assertion). +6. **Links** — source-repo specs/runbooks. 
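The six-part structure above can be stamped out as an empty skeleton so that no section is forgotten. This is a minimal sketch, not something that exists in this repo: the function name `new_runbook` and the example file name in the usage comment are hypothetical, and the headings simply mirror the list above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print a runbook skeleton
# containing the six sections listed under "Adding a runbook".
# Usage: new_runbook "Some title" > runbooks/some-slug.md
new_runbook() {
  local title="${1:?usage: new_runbook TITLE}"
  local section
  printf '# %s\n' "$title"
  # One level-2 heading per section, each preceded by a blank line.
  for section in Symptom Reach Diagnose Fix Verify Links; do
    printf '\n## %s\n' "$section"
  done
}
```

The skeleton only provides headings; the content of each section still has to point back at the relevant source repo rather than copy from it.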
diff --git a/runbooks/switch-crs310.md b/runbooks/switch-crs310.md new file mode 100644 index 0000000..3dc6e6d --- /dev/null +++ b/runbooks/switch-crs310.md @@ -0,0 +1,83 @@ +# Runbook — CRS310 switch (`crs310-maker`) + +The "new switch" at the makerspace. MikroTik **CRS310-8G+2S+IN**, RouterOS +7.19.6. Managed by Ansible from `~/Projects/MakerFLOSS_Mikrotik`. + +**Authoritative sources:** +- Live topology + cutover runbook: + `MakerFLOSS_Mikrotik/docs/superpowers/specs/2026-06-09-crs310-flat-mgmtvlan-design.md` +- Running config snapshot: `MakerFLOSS_Mikrotik/backups/crs310-maker/export.rsc` +- Device vars: `MakerFLOSS_Mikrotik/host_vars/crs310-maker.yml` +- Field guide: `MakerFLOSS_Mikrotik/docs/makerspace-switch-fieldguide.md` + +## Topology recap + +- **Transparent L2.** The switch is *not* a router — no inter-VLAN routing, no + presence on the data network. +- **Data VLAN 30** (`10.2.30.0/24`, gw `.1`): `ether1` = uplink, `ether2–7` = + access (untagged), SFP+ reserved (deferred). Users plug in here. +- **Mgmt VLAN 99** (`192.168.88.0/24`): switch at `192.168.88.1`, reachable + **only via `ether8`** (untagged). DHCP `.10–.254`, web UI on. CPU is the only + tagged member. No default route, no DNS, no NTP — isolated by design. +- `vlan-filtering=yes` went live **2026-06-09**. + +## Reach + +- **Mgmt (reconfig, SSH, web UI):** you need a host on **`ether8`**. On-site: + plug mamba into `ether8`, SSH `192.168.88.1` (or `http://192.168.88.1`). + Remote: there is no standing tunnel to the mgmt VLAN — forward through a host + that *is* on `ether8`. See [access.md](../access.md) §A / §B. +- **Data path test:** plug into `ether2–7`, expect a `10.2.30.0/24` lease. + +## Diagnose + +| Symptom | Check | +|---------|-------| +| No link / no DHCP on an access port | Confirm you're on `ether2–7` (data), not `ether8` (mgmt). Verify uplink `ether1` is up to `10.2.30.1`. | +| Can't reach switch mgmt | Are you on `ether8`? Mgmt is reachable **nowhere else**. 
Confirm a `192.168.88.x` lease. | +| Suspected config drift | Diff live vs repo: run a backup play, compare `backups/crs310-maker/export.rsc` to git. | +| Lockout after a change | See recovery below. | + +Connectivity test via Ansible (from `MakerFLOSS_Mikrotik`, mamba on `ether8`): + +```bash +ansible -m community.routeros.command -a "commands='/system/resource/print'" crs310-maker +``` + +## Fix + +Changes land in **`MakerFLOSS_Mikrotik` on `main`** (per the repo's own +workflow). Device-touching rules — **do not skip**: + +- Run any device-touching play **twice**; the second run must report no changes + (idempotency). +- **Enable `vlan-filtering` last**, after bridge/PVID/mgmt-VLAN are proven. +- Network-affecting changes (mgmt IP/VLAN) should run as a **self-reverting + detached job** (240s timeout) so a bad flip auto-rolls-back. +- Keep a **WinBox MAC-telnet or serial** recovery channel open when touching + network settings. + +```bash +# from ~/Projects/MakerFLOSS_Mikrotik, mamba on ether8 +yamllint . && ansible-lint && ansible-playbook play_switch.yml --syntax-check +ansible-playbook play_switch.yml # full day-2 +ansible-playbook play_switch.yml --tags vlans # one domain +ansible-playbook play_backup.yml # snapshot config into the repo +``` + +## Recovery (lockout) + +Documented gotchas from the 2026-06-09 cutover (see the spec): +- **mamba NetworkManager flap** on the bench — pin the `crs310-bench` profile + `autoconnect yes`, static `192.168.88.2/24`. +- RouterOS `find ... address=` does **not** match IP prefixes — use + `find interface=` instead (caused a bridge-IP removal bug). +- If locked out over the network, recover via **WinBox MAC-telnet** on `ether8` + or serial console; the detached-job timeout should also self-revert. + +## Verify + +- `ansible-playbook play_switch.yml` second run → **no changes**. +- Access-port client gets a `10.2.30.0/24` lease and reaches the gateway. +- `ether8` client gets `192.168.88.x` and can SSH `192.168.88.1`. 
+- `export.rsc` committed and matches intent.
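
The first check above (second run reports no changes) can be tested mechanically instead of read by eye. This is a minimal sketch, assuming Ansible's usual `PLAY RECAP` line format (`host : ok=N changed=N unreachable=N failed=N ...`); the helper name `recap_is_idempotent` and the log file name in the usage comment are hypothetical and are not part of `MakerFLOSS_Mikrotik`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ansible-playbook output on stdin and succeed only
# if every recap line reports changed=0, failed=0 and unreachable=0.
#
# Usage (second run of the play):
#   ansible-playbook play_switch.yml | tee run2.log
#   recap_is_idempotent < run2.log && echo "second run: no changes"
recap_is_idempotent() {
  local recap
  # Keep only lines that carry a changed=<count> field (the recap lines).
  recap="$(grep -E 'changed=[0-9]+' || true)"
  # No recap lines at all means the run was not verified; treat that as failure.
  [ -n "$recap" ] || return 1
  # Any non-zero changed/failed/unreachable count fails the check.
  if printf '%s\n' "$recap" | grep -Eq '(changed|failed|unreachable)=[1-9]'; then
    return 1
  fi
  return 0
}
```

A passing check only shows that the play converged; the lease and SSH checks in the list still need to be done from a client on the relevant port.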