the KK Computer:
a Radical 6502 Redesign


[Illustration: the expanded register set]

What Is It?

It's a microcomputer that uses a 65C02 CPU teamed up with a coprocessor: logic capable of interpreting the instruction stream in tandem with the CPU. To the programmer they appear seamlessly as a super-65C02, a new branch on the 65xx tree. Undefined 65C02 opcodes are used to support 6 new registers and 44 new instructions, including ITC NEXT (as used by Forth). The new instructions execute in hardware and run at full speed; they are not breaks to software emulation. The diagram below gives a general idea of what's under the hood.
[Figure: simplified KK block diagram]

What Can It Do?

The main feature is fast access to a flat, 24-bit address space. Although there are many ways to retrofit a 'C02 with expanded memory, AFAIK none can touch this implementation for speed. Given any random address in the 16 megabyte space, KK can fetch that byte in 6-8 cycles. It's fast because:

  • the extended addressing is part of the instruction set. There's no MMU to spoon-feed.
  • memory is organized as blocks of 64K (not a fraction thereof, such as 16K)

Units of 64K make it easy to treat the entire memory as a linear, 16 megabyte space. That's because the results of address arithmetic (a 24-bit addition, for example) are directly usable by the machine: the most-significant byte is the block address. This is far more efficient than a scheme using, say, 16K blocks, because a 16K-block system needs to mask and shift the addition result to isolate the block address before it can be passed to the hardware. Efficient linear addressing opens the door to a modest range of "big data" applications, tasks which would cause most expanded-memory 8-bit machines to hit the wall. They suffer an order-of-magnitude speed disadvantage simulating a linear space.
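
To make the 64K-versus-16K point concrete, here is a minimal Python sketch of a 24-bit addition done a byte at a time, the way an 8-bit CPU would do it. The function names and the sample addresses are made up for illustration, and the 16K scheme is a generic hypothetical rather than any particular machine.

```python
def add24(a, b):
    """Add two 24-bit values held as (low, mid, high) bytes, as an 8-bit CPU would."""
    result, carry = [], 0
    for x, y in zip(a, b):
        total = x + y + carry
        result.append(total & 0xFF)
        carry = total >> 8
    return tuple(result)            # carry out of the top byte is discarded


def bank_64k(addr_bytes):
    """64K banks: the high byte of the sum already is the bank address."""
    return addr_bytes[2]


def block_16k(addr_bytes):
    """Hypothetical 16K blocks: the block number straddles two bytes,
    so it has to be shifted and merged before the hardware can use it."""
    low, mid, high = addr_bytes
    return (high << 2) | (mid >> 6)


base = (0xF0, 0xFF, 0x12)           # $12FFF0, low byte first
step = (0x20, 0x00, 0x00)           # $000020
total = add24(base, step)           # $130010: the sum crosses a 64K boundary

print(hex(bank_64k(total)))         # bank byte, usable as-is
print(hex(block_16k(total)))        # block number, needs the extra shift/merge
```

With 64K banks the third byte of the sum is handed to the hardware untouched. The 16K variant needs extra shift-and-merge work on every access, and on a 6502 each of those shifts is a separate instruction.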

How Does It Work?

Most of the control signals originate as microcode fetched from an EPROM array. The microcode runs on a state machine that's clocked in lock-step with the CPU. (One state machine cycle = one CPU/bus cycle.) The CPU directs instruction fetching. Simultaneous execution by the CPU and the state machine determines the result.

Block (aka bank) addresses are stored in Register File A, whose read section (at the top of the diagram) feeds A23-A16. These new address lines, together with A15-A0 directly from the 65C02, are what address the 16 megabyte space. (You'll see there's a second register file which shadows the first. This makes it possible to read back the stored bank addresses when necessary.) A single instruction is all that's required to load a new bank address. That's considerably simpler than I/O operations on an MMU. And, in contrast to an MMU manipulation, there's no need to save and later restore A, X, Y or P. These registers remain undisturbed.

The coprocessor acts as an exoskeleton for the 65C02. Devices such as the register files connect to the data bus and receive microcoded cues to update themselves from it. Sometimes they drive the bus when the CPU thinks it's reading data, instruction opcodes or instruction operands from memory. Some of the 46 undefined (aka illegal) 65C02 opcodes get aliased by the 32 x 8 PROM before they reach the CPU. Others are used "as is." In fact, some of the 65C02's so-called NOPs actually generate an address and use the bus, even though the data is discarded. This odd behavior turns out to present important opportunities. But all these details are invisible; the programmer simply has 44 new instructions available (for a total of 254).

What Are The New Instructions and Registers?

  • instructions that load and save the bank addresses cued up in the register file (K0, K1, K2 & K3)
  • instructions that actually output a bank address onto A23-A16 (usually on a transient basis)
  • miscellaneous

K0 is presented almost continuously on address lines A23-A16, as it's used for all code fetches. K0 is also the default for data accesses. In the absence of a specification for K1, K2 or K3, it will be K0 that's used, which means the data access will occur in the same bank as the currently executing code. (An exception applies when stack and zero-page address modes are used. Such accesses always use bank $00, not to be confused with K0.)
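
The default rule can be written down in a few lines. This is only a sketch of the rule as stated above, for an access with no K1, K2 or K3 specified; the mode labels and function names are invented for the example and are not part of the KK.

```python
def default_data_bank(k0, mode):
    """Bank used for a data access when no K1/K2/K3 is specified.

    mode is a made-up label: "zero_page" and "stack" stand for the
    zero-page and stack address modes; anything else (e.g. "absolute")
    stands for an ordinary data access.
    """
    if mode in ("zero_page", "stack"):
        return 0x00                 # always bank $00, whatever K0 holds
    return k0                       # same bank as the executing code


def physical(bank, cpu_addr):
    """Join the bank (A23-A16) with the CPU's own 16 address bits (A15-A0)."""
    return (bank << 16) | (cpu_addr & 0xFFFF)


k0 = 0x05                           # code is running in bank $05
print(hex(physical(default_data_bank(k0, "absolute"), 0x1234)))
print(hex(physical(default_data_bank(k0, "zero_page"), 0x0080)))
```

The first access lands in bank $05 alongside the code. The second lands in bank $00 even though K0 holds $05.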

Three single-byte prefix instructions are associated with registers K1, K2 and K3. Use of a prefix is one way to specify a "Far" data access, that is, one whose bank address is independent of where the currently executing code resides. The prefix is followed by and acts upon any typical 65C02 instruction such as INC Absolute, CMP Indirect,Y etc. Here's the sequence. At run-time the CPU fetches the prefix byte but ignores it. Even the coprocessor takes no immediate action. Next comes the target instruction, and, as usual, off-chip logic co-executes every cycle. This includes the extra cycles for zero-page indirection and other variations. Then K1, K2 or K3 is read out "on cue" for the data transfer which is the final cycle of the instruction (final three cycles for Read-Modify-Write). All of the CPU's 64K possible addresses are re-mapped by the bank switch. Then K0 is re-selected for A23-A16, an opcode fetch occurs, and the program proceeds without missing a beat. (The only added delay was one cycle for the prefix.) Many combinations of instructions and address modes can use the prefixes and thus become Far. Considering all the combinations, you could say there are hundreds of new instructions, not just 44.
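
The sequence just described can be traced cycle by cycle. The sketch below is a rough model, not the actual microcode: it lists which bank register would be on A23-A16 during each bus cycle of a prefixed instruction. The cycle lists follow standard 65C02 timing for LDA Absolute (4 cycles) and INC Absolute (6 cycles); the cycle labels and the far_trace helper are made up for illustration.

```python
# Bus cycles of the unprefixed instructions (standard 65C02 cycle counts).
LDA_ABS = ["opcode fetch", "operand low", "operand high", "data read"]
INC_ABS = ["opcode fetch", "operand low", "operand high",
           "data read", "modify cycle", "data write"]


def far_trace(cycles, far_reg, data_cycles=1):
    """Return (cycle, bank register on A23-A16) pairs for a prefixed instruction.

    data_cycles is how many final cycles use the Far bank:
    1 for an ordinary access, 3 for Read-Modify-Write.
    """
    trace = [("prefix fetch", "K0")]            # the one added cycle
    first_far = len(cycles) - data_cycles
    for i, name in enumerate(cycles):
        trace.append((name, far_reg if i >= first_far else "K0"))
    return trace


for name, reg in far_trace(LDA_ABS, "K2"):
    print(f"{name:14s} {reg}")

for name, reg in far_trace(INC_ABS, "K1", data_cycles=3):
    print(f"{name:14s} {reg}")
```

The prefixed LDA takes five cycles instead of four, and only the last one leaves bank K0. The prefixed INC takes seven, with its final three cycles in the Far bank.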

Because they open a "Far" dimension for so many 65C02 instructions, the prefixes are very general in their applicability. But, as noted, a one-cycle penalty applies. Even this can be avoided, albeit with a loss of generality. The separate prefix may be omitted for six specific cases involving LDA and STA, since six specific, "all-in-one" opcodes are provided. Specific opcodes are also provided for JMP_K3, JSR_K3 and RTS_K3. These instructions include an operation that exchanges K3 and K0 and, because K0 is updated, a new 64K bank becomes the default. (Happily, this does not imply alternative zero-pages and stacks! Recall that accesses using stack and zero-page address modes always use bank zero, regardless of what K0 may contain.)
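
Here is a small sketch of the K3/K0 exchange on its own. It models only the swap; whatever else JSR_K3 and RTS_K3 do (the 16-bit jump itself, return-address handling) is left out, and it assumes the called code leaves K3 alone. The dictionary layout and bank values are invented for the example.

```python
def exchange_k3_k0(banks):
    """Return a copy of the bank registers with K0 and K3 exchanged."""
    swapped = dict(banks)
    swapped["K0"], swapped["K3"] = banks["K3"], banks["K0"]
    return swapped


# Code is running in bank $01, and K3 has been preloaded with target bank $40.
banks = {"K0": 0x01, "K3": 0x40}

in_callee = exchange_k3_k0(banks)          # exchange performed as part of JSR_K3
after_return = exchange_k3_k0(in_callee)   # exchange performed as part of RTS_K3

print(in_callee)       # K0 now names the callee's bank, K3 remembers the caller's
print(after_return)    # back where we started
```

Because the operation is an exchange rather than a plain copy, the caller's bank is not lost: it sits in K3 while the far routine runs, ready to be swapped back.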

Bank-address registers K1, K2 and K3 load themselves in the same manner that X, Y and A load themselves, that is, with specific opcodes provided for the purpose. The available address modes are Immediate, Absolute, Zero-pg and Zero-pg,X. These three registers can also be pushed to and pulled from the stack, and K0 can be pushed. Altogether there are 34 instructions for performing Far jumps, Far data accesses, and for loading and saving bank addresses.

Miscellaneous instructions and registers

A few highlights from this disparate group are as follows. The SCAN_K3 instruction forces the CPU to rapidly read a long string of bytes from memory, as part of a program that outputs video.

W is a 16-bit register readable in zero-page, one of whose functions is double-indexed addressing. This is a two-step process that starts with an instruction coded to use (Z-pg,X) mode. The CPU does the index addition then fetches a two-byte pointer from zero-page as it completes the instruction. KK copies the two-byte pointer on the fly, making it subsequently available in Z-pg at W (with no need to repeat the index addition). If a subsequent instruction is coded to use (W),Y then the result of the two-instruction sequence is (Z-pg,X),Y mode. It's equivalent to using X to index to a pointer, fetching the pointer, then indexing again into a data array. This is not so unusual, given that even a 16-bit word (two bytes) constitutes an array. Double-indexed addressing accelerates Forth mainstays such as @ and ! (fetch and store).
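
The two-step sequence can be modelled in a few lines of Python. This is a sketch of the addressing arithmetic only; memory is a plain dictionary, and the addresses, data values and function names are invented for illustration.

```python
def read16_zero_page(mem, zp_addr):
    """Fetch a little-endian pointer from zero page (address wraps within the page)."""
    low = mem.get(zp_addr & 0xFF, 0)
    high = mem.get((zp_addr + 1) & 0xFF, 0)
    return low | (high << 8)


def zp_x_indirect(mem, zp, x):
    """Step 1, (Z-pg,X): add X to the zero-page address and fetch the pointer there.
    The returned pointer is what KK would capture into W on the fly."""
    return read16_zero_page(mem, (zp + x) & 0xFF)


def w_indirect_y(w, y):
    """Step 2, (W),Y: index from the captured pointer without redoing step 1."""
    return (w + y) & 0xFFFF


mem = {
    0x24: 0x00, 0x25: 0x30,         # pointer $3000 stored at zero-page $24/$25
    0x3000: 0x11, 0x3001: 0x22,     # a two-byte "array" (one 16-bit word)
}

w = zp_x_indirect(mem, zp=0x20, x=4)        # pointer found at $20 + 4 = $24
low_byte = mem[w]                           # first instruction reads via (Z-pg,X)
high_byte = mem[w_indirect_y(w, 1)]         # second instruction reads via (W),Y

print(hex(w), hex(low_byte), hex(high_byte))
```

Fetching both bytes of a 16-bit cell this way is the pattern a Forth @ needs: X picks the stack slot holding the address, and Y steps through the bytes at that address.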

IP is a 16-bit register whose most notable function is as a pointer for the JMP((IP++)) instruction. This double-indirect jump with post-increment is a hardware realization of Forth's ubiquitous NEXT operation. Hardware NEXT has quadruple the speed of the code sequence it replaces. Overall Forth program speed increases by about 90%.
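
For readers who don't live in Forth, here is what a double-indirect jump with post-increment amounts to, as a minimal Python sketch. It follows the usual indirect-threaded NEXT: read the cell IP points at, read the code address stored there, jump, and advance IP by two. The memory contents and the next_step name are invented, and the sketch says nothing about how the KK sequences this internally.

```python
def read16(mem, addr):
    """Fetch a little-endian 16-bit cell."""
    return mem[addr] | (mem[(addr + 1) & 0xFFFF] << 8)


def next_step(mem, ip):
    """One NEXT: return (new program counter, new IP)."""
    cell = read16(mem, ip)          # first indirection: the cell IP points at
    pc = read16(mem, cell)          # second indirection: code address stored there
    return pc, (ip + 2) & 0xFFFF    # post-increment IP past the 2-byte cell


mem = {
    0x2000: 0x00, 0x2001: 0x30,     # thread: first cell holds $3000
    0x3000: 0x00, 0x3001: 0x80,     # at $3000: machine code lives at $8000
}

pc, ip = next_step(mem, 0x2000)
print(hex(pc), hex(ip))
```

In software, a stock 65C02 has to do all of this with a string of loads, stores, an increment with carry handling, and an indirect jump, which is the code sequence the single JMP((IP++)) opcode replaces.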

Conclusion

KimKlone's CPU and the off-chip accessories function as a unified whole, much the same as a monolithic device created in a wafer fab. The design is notable for introducing major features despite the limitations of a legacy Programming Model and even legacy silicon.

The linear memory organization allows efficient manipulation of objects larger than 64K, a capability which is absent from commercial 6502 microcomputers and from microprocessors such as the MOS 6509 and the Hudson Soft 6280, and which, in retrospect, is more suggestive of the WDC 65816. (The KK was created shortly after the 65816, but without any influence from it.)

copyright Jeff Laughton