The release where chippy stops being one thing

v1.1 ended with a launcher. You typed chippy -nessy ROM, two binaries spawned, the chippy TUI attached over DAP to a separate nessy process running an NES game, and a 6502 emulator that had been written for the Apple-1 suddenly had pretensions of running Super Mario Bros. The DAP server was the seam. The NES variant was the doorway. v1.1.1 had hardened the cross-process handshake enough that the launcher survived a real second machine.

v1.2.0 is the release where the seam and the doorway become an architecture. Three big things land in this cut, each of which I want to talk about as its own thing:

  1. The 6502 core — cpu, dap, peripheral, symbols, expr, loader, trace — gets promoted out of internal/ and made part of the public library API. A semver contract is written down. A whole separate Go project can now go get github.com/nkane/chippy and use the same CPU chippy uses.
  2. The CPU↔PPU model for VariantNES flips from instruction-stepped to per-cycle interleave, ported from Mesen2’s master-clock model. The Blargg/nesdev accuracy ROMs that v1.1 couldn’t pass start passing.
  3. The nessy code carves out of the monorepo and into its own repository at github.com/nkane/nessy. v1.2.0 is the last chippy release that contains NES code.

Each of these is doing work the others depend on, in an order that mattered. You can’t carve nessy out into a separate repo until the library it depends on has a public API. You can’t make the library API stable until you’ve decided what’s in it. You can’t make VariantNES accurate enough for SMB1 to boot without per-cycle interleave. You can’t justify the per-cycle interleave to a library audience that only wants a Klaus-passing 6502 unless it’s gated to the variant that needs it.

Let me take the three in the order they actually shipped.

1. Out of internal/

The Go toolchain gave us internal/. It’s load-bearing. A package whose import path contains an internal/ element can only be imported by code rooted at the directory that holds that internal/ directory. For a top-level internal/, that means the whole module and nothing outside it: concretely, code in module-A/internal/cpu is invisible to code in module-B. It’s the one mechanism the language provides for saying “this is mine, hands off, I will not keep its API stable.”
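To make the rule concrete, here is a minimal sketch of the visibility check as I understand it. It is a simplification, not the go command's actual code: the function names are made up for illustration, it ignores the standard library's own internal/ paths, and the nessy import path is only an example.

```go
package main

import (
	"fmt"
	"strings"
)

// internalParent returns the path of the directory holding the last
// "internal" element of importPath, and whether such an element exists.
// Simplified: standard-library paths like "internal/foo" are not modeled.
func internalParent(importPath string) (string, bool) {
	if strings.HasSuffix(importPath, "/internal") {
		return strings.TrimSuffix(importPath, "/internal"), true
	}
	if i := strings.LastIndex(importPath, "/internal/"); i >= 0 {
		return importPath[:i], true
	}
	return "", false
}

// canImport reports whether package importer may import importPath
// under the internal-directory rule.
func canImport(importer, importPath string) bool {
	parent, ok := internalParent(importPath)
	if !ok {
		return true // no internal element, so the rule does not apply
	}
	return importer == parent || strings.HasPrefix(importer, parent+"/")
}

func main() {
	const internalCPU = "github.com/nkane/chippy/internal/cpu"

	// Inside the chippy module: allowed.
	fmt.Println(canImport("github.com/nkane/chippy/cmd/chippy", internalCPU))
	// From a separate module (illustrative nessy path): rejected.
	fmt.Println(canImport("github.com/nkane/nessy/cmd/nessy", internalCPU))
	// The promoted, non-internal path is importable from anywhere.
	fmt.Println(canImport("github.com/nkane/nessy/cmd/nessy", "github.com/nkane/chippy/cpu"))
}
```

The second call is the whole reason for the promotion: once nessy lives in its own module, nothing under chippy's internal/ is reachable from it.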

Through v1.1.x, chippy lived almost entirely under internal/. The CPU was at internal/cpu. The DAP server was at internal/dap. The TUI was at internal/tui. The peripheral interface was at internal/peripheral. This was a deliberate hedge: I didn’t want to promise an API surface I hadn’t lived with. If something turned out to need a breaking change, I could just make it, because no one outside the module could possibly be importing from internal/.

The carve-out plan needed nessy to live in its own repository, which meant nessy could no longer reach into chippy’s internal/. So in v1.2.0 the parts that nessy uses got promoted:

| Old path | New path | Purpose |
|---|---|---|
| internal/cpu | cpu | CPU + bus + RAM + variants + tickers |
| internal/dap | dap | Debug Adapter Protocol server + client |
| internal/peripheral | peripheral | Peripheral interface for MMIO devices |
| internal/symbols | symbols | cc65 .dbg parsing |
| internal/expr | expr | Breakpoint condition + watch evaluator |
| internal/loader | loader | .bin / .prg / .hex / .o toolchain support |
| internal/trace | trace | Execution trace recording + replay |

Two packages stayed private. internal/tui is still private — it’s chippy-specific Bubble Tea code, no one is importing it. internal/nes is briefly still private — for the rest of this release it lives inside chippy, on its way out the door to its own repo. It’s the carve-out’s last witness in the chippy tree.

What “stable” actually means

Promoting a package out of internal/ is the easy mechanical part. The harder part is committing to keep the shape of the API stable. So I wrote docs/api.md and committed it to the repo. Stripped down, the contract is:

The exported names in cpu, dap, peripheral, symbols, expr, loader, and trace constitute the public API. Adding new exports is a minor version. Removing or breaking-changing an existing export is a major version. The major-version contract types are cpu.Bus, cpu.Peripheral, cpu.Ticker, and cpu.Variant. If those four interface shapes change, that’s a major-version bump and a migration note.

I went back and forth on whether to write that contract before I had any external consumers. The argument for not writing it: you’re committing to a shape you might want to change. The argument for writing it: nessy is about to become the first external consumer, and a public API with no contract is exactly the kind of thing that ships breaking changes accidentally because nothing in the workflow flags them.

The contract won. The cost of being explicit is small. The cost of not being explicit and then breaking nessy on a refactor is large, because the refactor will not be obvious — none of chippy’s own tests will fail. Tests inside the package can’t see external consumers. The contract is the thing that catches the breakage at PR review time, not at go.mod bump time.

(There’s a related invariant from the v1.0 work that stayed invariant through the promotion: the opcode-init file-lexical init() order — opcodes.go < opcodes_cmos.go < opcodes_illegal.go. Renaming any of those three files breaks the CMOS table init. I documented it in the v1.0 post under D2 of ADR 0001. Promoting cpu out of internal/ didn’t change it; it’s still the same Go init order. But I want to call it out here because the kind of person who carefully renames packages on a promotion is exactly the kind of person who might accidentally rename a file to “make it cleaner” and silently break BCD-correct CMOS dispatch. Don’t do that. The CI catches it because Klaus’s 6502_65C02_functional_test regresses, but you don’t want to find out that way.)
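A quick way to see why a rename is dangerous: the go command hands a package's files to the compiler in sorted file-name order, and init() functions run in that order. The sketch below only sorts names; the renamed file opcodes_nmos.go is a hypothetical example of a "cleaner" name, not something in the chippy tree.

```go
package main

import (
	"fmt"
	"sort"
)

// initOrder returns the file names in the byte-wise sorted order that
// decides which file's init() runs first within a package.
func initOrder(files ...string) []string {
	out := append([]string(nil), files...)
	sort.Strings(out)
	return out
}

func main() {
	// Current layout: '.' sorts before '_', so the base table comes first.
	fmt.Println(initOrder("opcodes_illegal.go", "opcodes_cmos.go", "opcodes.go"))

	// Hypothetical rename of opcodes.go to opcodes_nmos.go: the base table
	// now initializes last, after the CMOS and illegal tables.
	fmt.Println(initOrder("opcodes_illegal.go", "opcodes_cmos.go", "opcodes_nmos.go"))
}
```

The first line prints the three files in the documented order; the second shows the base table moving to the end, which is exactly the silent reordering the invariant warns about.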

Bare semver tags

The other piece of v1.2.0’s library story is that the tags now mean something specific. v1.2.0 is the chippy library’s v1.2.0. If a third party — nessy, say — imports chippy at v1.2.0, the API shape they get is exactly the documented one and won’t shift under them until v2.0.

To keep this clean, nessy needed to leave the monorepo’s tag space entirely. In the monorepo, nessy releases were tagged nessy-v* so they didn’t collide with chippy’s bare vX.Y.Z. After the carve, in the nessy repo, the prefix doesn’t make sense — there’s no chippy tag to collide with, the nessy-v prefix breaks goreleaser’s defaults, and go install module@vX.Y.Z is the convention. So the carve renamed every historical nessy tag with git filter-repo --tag-rename nessy-v:v. nessy ships under bare semver from v0.1.0 forward, and the chippy repo no longer has any nessy-v* tags in it.
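The effect of --tag-rename nessy-v:v is a prefix swap on tag names. A minimal sketch of that mapping, with a helper name of my own invention (this is not how git filter-repo is implemented, just what the rename does to a name):

```go
package main

import (
	"fmt"
	"strings"
)

// renameTag swaps oldPrefix for newPrefix on tags that start with
// oldPrefix and leaves every other tag name untouched.
func renameTag(tag, oldPrefix, newPrefix string) string {
	if strings.HasPrefix(tag, oldPrefix) {
		return newPrefix + strings.TrimPrefix(tag, oldPrefix)
	}
	return tag
}

func main() {
	for _, tag := range []string{"nessy-v0.1.0", "nessy-v0.2.0", "some-other-tag"} {
		fmt.Printf("%s -> %s\n", tag, renameTag(tag, "nessy-v", "v"))
	}
}
```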

This is the kind of housekeeping that’s tedious to write and a five-minute job to do. The value of doing it now, on the carve, is that nobody ever has to ask “wait, is that the chippy v0.5.0 or the nessy v0.5.0?” There’s only one v0.5.0, and it lives in nessy.

2. Per-cycle CPU↔PPU interleave

Here is the chunk of v1.2.0 that I spent the most time on and that produced the largest accuracy improvement, and I want to walk through why it had to land the way it did, because the question “why isn’t this just always on” is the most interesting design question in chippy’s history.

The problem: instruction-stepped doesn’t pass the timing ROMs

The chippy CPU through v1.1 was instruction-stepped. Step() ran an opcode start-to-finish, returned the cycle count, and the bus ticker (if present) ticked the peripheral chain that many cycles at the end. For an Apple-1 emulator, this is fine — there’s no PPU, the keyboard is level-triggered, and nothing in the system cares that the ticks arrived in one batch at the end of the opcode.

For the NES it is not fine. The NES PPU runs 3 PPU dots per CPU cycle, every CPU cycle, and it exposes a status register ($2002) that the CPU can poll mid-instruction. There’s an entire class of game code (the Mario sprite-0-hit splits, the SMB3 status-bar split, basically every raster-timed effect) that polls a $2002 flag at a specific dot, and gets the wrong answer if you batch-tick a whole instruction’s worth of dots at the end of the opcode: anywhere from 6 dots for a two-cycle opcode to 21 for a seven-cycle one. The Blargg accuracy ROMs (ppu_vbl_nmi, cpu_interrupts_v2) probe exactly this. With batched ticking, my results were ppu_vbl_nmi 5/10, cpu_interrupts_v2 1/5. Not even close.

The Mesen2 reference (the open-source emulator that’s become the gold standard for cycle-precision NES emulation) implements this with a master-clock model. NTSC runs at 12 master clocks per CPU cycle and 4 master clocks per PPU dot. A read takes its cycle budget split +5 master clocks before the bus access and +7 master clocks after the bus access; a write is +7 pre and +5 post. The PPU is advanced to the master clock at each of those split points. The total per CPU cycle is +12 either way, so per-instruction cycle counts come out byte-identical to the batched path. But the PPU sees the dot it would see on real silicon, in the order it would see it on real silicon, with the right sub-cycle phase for each access.
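The arithmetic is easier to see in code. This is a toy sketch of the split under the numbers above (12 master clocks per CPU cycle, 4 per dot, 5/7 for reads, 7/5 for writes). The PPU here only counts dots, and the Bus, PPU, and cycle names are stand-ins I made up for the example; it is neither chippy's nor Mesen2's actual code.

```go
package main

import "fmt"

const (
	masterPerCPUCycle = 12
	masterPerPPUDot   = 4

	readPre, readPost   = 5, 7 // read: +5 before the access, +7 after
	writePre, writePost = 7, 5 // write: +7 before the access, +5 after
)

// PPU is a stand-in that only counts dots.
type PPU struct {
	clock uint64 // master-clock time the PPU has been advanced to
	Dots  uint64
}

// Run advances the PPU one dot at a time for as long as a whole dot
// fits before the master-clock deadline.
func (p *PPU) Run(deadline uint64) {
	for p.clock+masterPerPPUDot <= deadline {
		p.clock += masterPerPPUDot
		p.Dots++
	}
}

// Bus owns the master clock and catches the PPU up around each access.
type Bus struct {
	Master uint64
	PPU    PPU
}

func (b *Bus) cycle(pre, post uint64, access func()) {
	b.Master += pre
	b.PPU.Run(b.Master) // PPU state as of the bus access
	access()
	b.Master += post
	b.PPU.Run(b.Master) // finish the CPU cycle
}

func (b *Bus) Read(access func())  { b.cycle(readPre, readPost, access) }
func (b *Bus) Write(access func()) { b.cycle(writePre, writePost, access) }

func main() {
	b := &Bus{}
	var seen uint64

	b.Read(func() { seen = b.PPU.Dots })
	fmt.Println(b.Master, seen, b.PPU.Dots) // 12 1 3

	b.Write(func() { seen = b.PPU.Dots })
	fmt.Println(b.Master, seen, b.PPU.Dots) // 24 4 6

	fmt.Println(b.Master == 2*masterPerCPUCycle) // true
}
```

Either kind of access moves the master clock by 12 and the PPU by 3 dots, which is why the per-instruction totals match the batched path. What changes is how far the PPU has run at the moment the access happens.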

The decision for v1.2.0: port the Mesen2 master-clock model into chippy’s cpu.Step for VariantNES and only for VariantNES. The chippy debugger continues to use the batched tick for NMOS / 65C02 — same instruction boundaries, same cycle totals, byte-identical Klaus-functional results, no perfgate regression. Nothing about a TUI-driven 6502 debugger needs per-cycle interleave; the cost of always-on would be paid by the largest audience of the smallest emulator-debugger, and they’d get nothing for it.

What the per-cycle path actually looks like

For VariantNES, every bus access ticks the chain one CPU cycle, split into the master-clock budget I just described. Reads and writes both call PPU.Run(deadline) at their respective splits — the deadline is the master-clock count for the half-cycle. Addressing-mode dummy reads are issued per template:

  • A page-crossing indexed absolute read (abs,X or abs,Y) issues a dummy read of (oldPCH | newPCL), that is, the un-carried high byte combined with the already-added low byte, before fixing up the high byte. The PPU sees that dummy. (This becomes the load-bearing detail in the Tom Harte bus-trace work much later; see the v1.6 post.)
  • An indexed write always issues the dummy read, whether or not the page actually crossed (NMOS quirk; the address calculation always takes the path through the partial address). The PPU sees that one too.
  • An indirect JMP ($XXXX) issues two consecutive reads from XXXX and XXXX+1. When XXXX ends in $FF, the second read wraps within the page instead of carrying into the next one (the famous JMP indirect bug). Both ticks land on the PPU.
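The address arithmetic behind those three templates can be sketched as follows. These helpers are illustrative only; the function names are mine and chippy's real addressing-mode code is organized differently.

```go
package main

import "fmt"

// indexedReadAddrs returns the addresses an indexed absolute read touches:
// just the final address, or the partial (un-carried) address followed by
// the final one when the index crosses a page.
func indexedReadAddrs(base uint16, index uint8) []uint16 {
	final := base + uint16(index)
	partial := (base & 0xFF00) | (final & 0x00FF)
	if partial == final {
		return []uint16{final}
	}
	return []uint16{partial, final}
}

// indexedWriteAddrs returns the dummy-read address and the write address.
// The dummy read is always issued, page cross or not.
func indexedWriteAddrs(base uint16, index uint8) (dummyRead, write uint16) {
	final := base + uint16(index)
	return (base & 0xFF00) | (final & 0x00FF), final
}

// jmpIndirectAddrs returns where the low and high target bytes are read
// from. The high-byte address wraps within the pointer's page.
func jmpIndirectAddrs(ptr uint16) (lo, hi uint16) {
	return ptr, (ptr & 0xFF00) | uint16(uint8(ptr)+1)
}

func main() {
	fmt.Printf("%04X\n", indexedReadAddrs(0x20F0, 0x20)) // [2010 2110]
	fmt.Printf("%04X\n", indexedReadAddrs(0x2000, 0x20)) // [2020]

	d, w := indexedWriteAddrs(0x2000, 0x20)
	fmt.Printf("%04X %04X\n", d, w) // 2020 2020

	lo, hi := jmpIndirectAddrs(0x02FF)
	fmt.Printf("%04X %04X\n", lo, hi) // 02FF 0200
}
```

On the NES the dummy addresses matter because a read of a PPU register has side effects, so a stray read at the partial address is visible to the hardware being emulated.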

The instrCycles == accounted assertion at the end of each opcode is a guard rail. If you write an opcode whose dummy-cycle template doesn’t add up to its documented cycle count, the assertion fires in the test build. It has saved me from a handful of off-by-one bugs in the addressing-mode templates that would otherwise have shown up several frames later as a desync in an entirely different opcode.
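A minimal sketch of that kind of guard rail, assuming a template is just a function that performs its bus accesses through a callback. The checkCycles helper and the template shape are hypothetical; only the idea of comparing documented cycles to accounted cycles comes from the real code.

```go
package main

import "fmt"

// checkCycles runs an addressing-mode template, counts one cycle per bus
// access, and reports a mismatch against the documented cycle count.
func checkCycles(name string, documented int, template func(access func())) error {
	accounted := 0
	template(func() { accounted++ })
	if accounted != documented {
		return fmt.Errorf("%s: documented %d cycles, template accounted for %d",
			name, documented, accounted)
	}
	return nil
}

func main() {
	// LDA abs,X with a page cross: 4 base cycles + 1 for the fix-up.
	good := func(access func()) {
		access() // fetch opcode
		access() // fetch address low byte
		access() // fetch address high byte
		access() // dummy read at the partial address
		access() // read from the final address
	}
	// Same instruction with the dummy read forgotten.
	bad := func(access func()) {
		access() // fetch opcode
		access() // fetch address low byte
		access() // fetch address high byte
		access() // read from the final address
	}

	fmt.Println(checkCycles("LDA abs,X (page cross)", 5, good))
	fmt.Println(checkCycles("LDA abs,X (page cross)", 5, bad))
}
```

The first check passes and the second returns an error at the opcode that is actually wrong, rather than frames later somewhere else.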

What this gets you

This is the part where the receipts justify the work. With the per-cycle path landed:

| ROM | v1.1 result | v1.2 result |
|---|---|---|
| ppu_vbl_nmi | 5/10 | 10/10 |
| cpu_interrupts_v2 | 1/5 | 5/5 |
| instr_timing | fail | pass |
| apu_test | 4/8 | 8/8 |
| instr_misc | partial | 4/4 |
| instr_test-v5 (official) | 14/16 | 16/16 |
| oam_read | fail | pass |
| mmc3_test | n/a (no mapper) | 1, 2, 3, 5 pass |

The remaining gaps — mmc3_test 4 and 6, ppu_open_bus, oam_stress — are all PPU-side and live in the nessy repo’s accuracy harness from v0.4.0 onward. They are tracked, scoped, and closing on their own timeline. The chippy-side cycle-precision substrate is done.

There’s a cluster of Mesen2-aligned details I want to call out because it took me a day to find them. The NMI hijack check (the sub-cycle rule that handles an NMI asserted between push16(PC) and push(P) during BRK/IRQ service) sits between those two pushes in serviceVector. The NMI poll latch sampling (nmiPollPrev / irqPollPrev, one cycle delayed) is what makes cli_latency and nmi_and_brk pass. And the branch IRQ-poll quirk (a taken non-page-cross branch ignores an IRQ that was asserted at its last clock, because the poll has been rolled back) is what makes the cycle-perfect branch behavior right.

Those rules are silent and load-bearing. Each one is a five-line change. Getting any of them wrong breaks a different sub-test, which is exactly why you need a single oracle to align against. I’ll talk more about the choice to commit to Mesen2 as that oracle in the nessy post, but the rule applies in both repos: one reference emulator, every borrow traceable to a specific source-line citation.

What this costs you (it doesn’t)

The reason this change was safe to land: it’s VariantNES-gated. NMOS and 65C02 still use the batched tick. The perfgate sees no regression on the bare debugger inner loop. Klaus passes byte-identically. The decimal-mode sweep is unaffected. Nobody who isn’t running NES code pays anything for the existence of NES-accurate timing.

This is exactly the “optional behavior costs zero when absent” rule the v1.1 ticker hook was built around. The per-cycle path lives in code that runs only when c.Variant.PerCycle is true. For every other variant, the same Step function runs through the batched path it always did.
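Here is a toy version of that gate. Only Step and the Variant.PerCycle flag are names from chippy; the struct layouts, the Ticker shape, and the fake "instruction" that is just a number of bus cycles are assumptions made for the sketch.

```go
package main

import "fmt"

// Variant carries the per-cycle gate (field layout assumed).
type Variant struct {
	Name     string
	PerCycle bool
}

// Ticker is a stand-in for whatever advances the peripheral chain.
type Ticker interface{ Tick(cycles int) }

type countingTicker struct{ calls, total int }

func (t *countingTicker) Tick(n int) { t.calls++; t.total += n }

type CPU struct {
	Variant Variant
	Ticker  Ticker
}

// Step runs one fake instruction of busCycles bus accesses and returns
// its cycle count. Per-cycle variants tick on every access; all others
// tick once, in a batch, after the instruction finishes.
func (c *CPU) Step(busCycles int) int {
	for i := 0; i < busCycles; i++ {
		// (a real core would perform the bus access here)
		if c.Variant.PerCycle && c.Ticker != nil {
			c.Ticker.Tick(1)
		}
	}
	if !c.Variant.PerCycle && c.Ticker != nil {
		c.Ticker.Tick(busCycles)
	}
	return busCycles
}

func main() {
	batched, perCycle := &countingTicker{}, &countingTicker{}

	(&CPU{Variant: Variant{Name: "NMOS"}, Ticker: batched}).Step(4)
	(&CPU{Variant: Variant{Name: "NES", PerCycle: true}, Ticker: perCycle}).Step(4)

	fmt.Println(batched.calls, batched.total)   // 1 4
	fmt.Println(perCycle.calls, perCycle.total) // 4 4
}
```

Both paths account for the same four cycles. The only difference is when the ticker hears about them, and the non-NES path pays a single boolean check for the privilege.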

(There is a wrinkle that becomes load-bearing in v1.5: because the per-cycle path is VariantNES-gated and the NES variant disables BCD, the Tom Harte bus-trace harness — which wants to compare per-cycle bus activity on VariantNMOS with BCD intact — has to force-enable the per-cycle path on VariantNMOS for the test. The machinery to do that is a --per-cycle-force flag on the test rig. The library API doesn’t expose it, because no production consumer wants NMOS per-cycle. But the seam is there.)

3. Carving nessy out

OK, the carve. Three things made this safe to do here and not earlier:

  1. The library API was now stable. nessy could pin a chippy version in go.mod and have a contract about what wouldn’t shift under it.
  2. The CPU was now accurate enough. With per-cycle interleave landing, the nessy code didn’t need to be parallel-fixed in two repos. The accuracy work was done on the chippy side and flowed in via the dependency.
  3. The release pipeline already understood the split. Going back to v1.1.0, the monorepo had a release-split shape that cut binaries for chippy and nessy separately on a tag push. The tooling was rehearsed.

The actual mechanics:

  • git filter-repo extracted internal/nes/ and cmd/nessy* (and their test fixtures) into a new repo at github.com/nkane/nessy, preserving git history end-to-end. Every commit that touched those paths in chippy is in the nessy repo with its original SHA in the commit message. Blame still works.
  • --tag-rename nessy-v:v renamed the historical nessy releases. nessy now ships under bare v0.1.0, v0.1.2, v0.2.0 (etc.) in its own repo. The chippy repo no longer has any nessy-v* tags.
  • nessy’s go.mod pins github.com/nkane/chippy v1.2.0. Any future chippy bump on the nessy side is a one-line go get plus whatever migration the chippy release notes describe.
  • Several scattered shell scripts under chippy’s scripts/ that referenced nessy paths got cleaned out in a follow-up “chore: scrub stragglers from chippy after nessy carve-out” commit. The go mod tidy that came next dropped ebiten + xgb + sync from chippy’s dependency graph entirely.

That last bullet is the entire point of the carve. Before v1.2.0, anyone importing chippy as a Go library — even just cpu — pulled in Ebiten’s CGO dependency graph, the X11/GL dev-header chain, the audio driver pieces, all of it. After v1.2.0, chippy’s go.mod is pure Go, no CGO, no graphics deps. go install github.com/nkane/chippy/cmd/chippy@latest from a fresh container with no headers works. nessy carries the graphics weight, because nessy is the part that needs it.

What I worried about and what was actually fine

Versioning friction. A CPU-core fix now requires a chippy release plus a nessy go.mod bump. This is genuinely extra work. But the chippy release cadence is, by design, dictated by value to chippy users, not by nessy’s roadmap. nessy can pin a chippy version and stay on it; nothing about nessy’s work requires immediate uptake of every chippy commit. In practice the friction has shown up roughly twice per release cycle and each time the extra step is a go get -u and a CI run.

API drift between local-on-disk dev and pinned chippy. During chippy development I sometimes want to test a nessy interaction before the chippy fix is tagged. go.mod’s replace directive handles this — replace github.com/nkane/chippy => /path/to/local/chippy lets nessy build against an uncommitted chippy working tree. The rule we hold: a PR with a replace in go.mod does not merge to nessy main. The replace is for dev only, and the chippy fix has to ship first.
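One way a rule like that could be enforced mechanically is a small check that scans go.mod for replace directives. This is a hedged sketch, not the check nessy actually runs: the hasReplace helper, the go version line, and the ../chippy path are all made up for the example, and a production version would more likely parse the file with golang.org/x/mod/modfile.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasReplace reports whether go.mod text contains a replace directive,
// either single-line ("replace a => b") or block form ("replace (").
// Comment-only lines and trailing comments are ignored.
func hasReplace(goMod string) bool {
	sc := bufio.NewScanner(strings.NewReader(goMod))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		if fields[0] == "replace" || strings.HasPrefix(fields[0], "replace(") {
			return true
		}
	}
	return false
}

func main() {
	pinned := "module github.com/nkane/nessy\n\n" +
		"go 1.22\n\n" + // go version is an assumption
		"require github.com/nkane/chippy v1.2.0\n"

	// Local development: build against a chippy working tree on disk.
	dev := pinned + "\nreplace github.com/nkane/chippy => ../chippy\n"

	fmt.Println(hasReplace(pinned)) // false: safe to merge
	fmt.Println(hasReplace(dev))    // true: dev-only, must not merge
}
```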

Two CI surfaces. The chippy CI lost the NES accuracy job (it left with nessy). chippy’s CI is now just the library + the TUI + the CPU corpora. nessy’s CI is the Ebiten build + the NES accuracy job + the demo regression ROMs. Both are simpler. Neither runs the other’s tests anymore. This is honest — the chippy CI was already ignoring the NES stuff most of the time, and the perfgate was strictly chippy-side.

What I get to write next

v1.3.0 is the polish release for the TUI debugger — conditional breakpoints carried over DAP into the server, :mem for memory pokes through the bus, watch-array expansion with the cc65 .dbg finding I have to write a whole paragraph about (spoiler: .dbg doesn’t carry struct member layout, the entire feature scope reshapes), trace-replay search and side-by-side diff, and a deep-rewind keyframe ring that turns reverse-step from a hundreds-of-steps-back tool into a millions-of-steps-back tool.

v1.4 is two releases — v1.4.0 adds a single feature (a custom-request extension point in the DAP server) and v1.4.1 removes a feature that turned out to be a marketplace problem (the VS Code extension). It’s the smallest release pair in the history of the project. It’s also the post I’ve been most looking forward to writing, because it’s about deciding to take a thing out of your release pipeline and ship the next clean one immediately.

v1.5 is the architectural release. Inproc DAP transport. TUI-via-DAP. Live state streaming. Complete CPU ROM coverage (Klaus interrupt, AllSuiteA, Wolfgang Lorenz, Tom Harte). Host debug hooks that let nessy build NES-aware debugging on top of chippy without forking the protocol.

v1.6 is the cleanup-pass release where the accuracy tail closes (one decimal-mode ARR bug, ten branch / JSR / RTS per-cycle bus quirks, the WDC 65C02 Tom Harte set), the manual struct-overlay watch ships in the form the .dbg finding from v1.3 makes possible, dirty-region memory streams come through, and goreleaser brews: migrates to homebrew_casks:.

Then the nessy post. There are eight nessy releases — v0.1.0 through v0.8.0 — and unlike chippy they’re all pre-1.0. Each one is closing a specific accuracy gap or adding a class of mapper. The arc goes from “NROM, three hand-rolled demos, NMI works” to “every Konami expansion-audio chip, MMC3 with A12 scanline IRQ, a headless recorder that produces byte-identical capture, PAL and Dendy timing, and Mesen2 as the cycle-precision oracle.” That whole story compresses into one post.

In the meantime, v1.2 is the seam. The library exists. The carve is done. The accuracy gate is green. We can build things on top now.