The 65k project - Feature Discussion

This page discusses and defines the features of the 65k processor family and thus acts as the requirements definition for the new processor.

All requirements are based on the original NMOS 6502 processor, using the "legal" opcodes. The 65k should run original NMOS 6502 (maybe 65C02) code without modifications.

Currently this page is a somewhat fragmented list of requirements, neither complete nor necessarily consistent. It may be sorted later when requirements and features are more finalized. They will be condensed into the actual processor specifications.

The goal is that these requirements should be:

Implementable - it should be possible to actually implement them...
Useful - A feature should be reasonably easy to use and provide features that are deemed missing in the 6502
Simple - apply the KISS principle ("Keep It Simple Sweetheart"), i.e. basically try to minimize the lines of code to implement the features.
Elegant - make an elegant design. This is not really a measurable goal, but a goal nevertheless.
Least Surprise - a feature should be "natural" to use and not "surprise" developers and coders with strange or unexpected features
Keep the 6502-ishness - don't try to add a RISC CPU, or add (many) more complex (CISC) addressing modes or operations, but try to build on the 6502 advantages like zeropage addressing

On one hand the processor should fit use cases for embedded systems. It could implement a complete system-on-a-chip with a little ROM, some RAM, and I/O in a single FPGA. As a CPU-only implementation it should also be a (more or less) direct replacement for the 6502, with additional features, e.g. wider registers. On the other hand it should provide - maybe in a different packaging - an extended linear address space and features comparable to an early 68k processor.

News:

2010-10-23 Published the page
2010-10-23 Added section about Accumulator-Memory architecture
2010-10-17 Added section about effective address register
2010-10-03 First working draft finished
2010-09-18 Started this page

Table of contents

Preface
- License
- Disclaimer
- Contributors
- Changes
Modularization
Virtualization
Register Width Expansion
- Register expansion in the 65816
- Alternative 1: New Registers
- Alternative 2: Prefix Bytes
- Comparison
- Conclusion
Sign extension handling
- 65816, 80x86 Sign extension handling
- Operand vs. Operation extension
- Automatic Extension
- Force Zero
- Conclusion
Number of Registers
- Accumulator-Memory architecture
- Register Sets
- Zeropage as Registers
- Additional, Separate Registers
- Comparison
- Conclusion
Address Expansion
- Natural Address Expansion
- 65816 Address Expansion: Bank Registers
- CS/A65 Address Expansion: MMU
- 80x86 - Segmentation
- PowerPC Address Translation
- Address Virtualization
- Address Space Selection
- Comparison
- Conclusion
Advanced Bus Features
- 65816 Bus Features
- CS/A65 Bus Features
- Multiprocessor/-core Synchronization
- Multiprocessor/-core Synchronization with caches
- Prioritized IRQs
- Comparison
- Conclusion
Addressing Modes
- 65816 Addressing Modes
- 68000 Addressing Modes
- 65k Addressing Modes Draft
- Conclusion
Advanced Opcodes
- Base Register
- Width Handling of Stack Register, Base Register, ...
- IN?/DE? immediate
- Jump Subroutine
- Interrupts, RTI
- Branches
Operating Modes
- Stack Pointer
- Access to user mode
- Call to supervisor mode from user mode
Memory Interface
- Wide Memory bus
- Cache
- Write Pipeline
- Conclusion
Mathematics
- Integer Mathematics
- Bit Manipulation Operations
- Other Operations
- Floating Point Operations
Vector and Block Operations
- 65816 MVN/MVP Operations
- 68k MOVEM, MOVE16 Operations
- CS/A Blitter Operations
- Graphics Blitter Operations
- Comparison
- Conclusion
Effective Address Register
- Load effective address into AC
- Load effective address into new register
- Conclusion

Preface

License

This content is licensed under the Creative Commons Attribution Share-Alike license, CC-BY-SA version 3.0.

Note this "content" includes this web page, but does not include the 6502.org header and the left and right web page columns. Click on the "Maximize" link to see the contents covered by this license.

Disclaimer

The content comes with no warranty at all! There is no guarantee and no promise that this specification is correct, consistent, will actually work, or will ever be implemented at all.

To my understanding the techniques described here have been used by various processors for decades already. Still there is no guarantee that a processor according to this spec would not be covered by some patents.

Subject to change without notice!

Contributors

André Fachat - initial author: 8bit Homepage

Changes

This section describes the changes to the document:

Date	Author	Changes
2010-10-03	André Fachat	First working draft
2010-10-17	André Fachat	Added section about effective address register

Modularization

The NMOS 6502 basically always has the same core, i.e. the same instructions. Differences, for example to the 6504, consisted of how many address lines were routed from the core to the chip's outside.

Providing the right solution for everything from embedded systems to personal computing requires that the core can be customized, providing different features for different requirements. An embedded system or a 6502 replacement may not need any MMU or virtualization, but can very much profit from wider registers and wider arithmetic.

The 65k will provide a modular implementation. The minimum core will provide 6502-like features with wider registers. Virtualization, MMU, Cache, and different bus width may be modular options. More details to follow in the processor roadmap and the specs.

Virtualization

The 6502 does not have any virtualization features. Virtualization in its strongest form - "Full Virtualization" - means that a program running in a virtual machine cannot distinguish the virtual machine it runs in from a real processor. In particular this means that an unaltered operating system for that processor can run in a fully virtualized machine on that processor. The efficiency requirement states that the uncritical processor opcodes executed in the virtual machine are executed natively on the processor. The critical operations are those that modify system resources like I/O or the processor status register - not the arithmetic flags, but the system flags like "I". On the 6502, SEI would be a critical opcode. The virtual machine monitor must be in complete control of those resources.

A common way to handle critical operations is to "trap-and-emulate". When such an operation is detected, the processor traps into the virtual machine monitor (also known as hypervisor) and the opcode is emulated. This can even be required for simple reads or writes to I/O address space!
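As a minimal sketch of trap-and-emulate, the following Python model treats SEI and CLI as critical opcodes and lets everything else run "natively". The class and function names are assumptions made up for this illustration; only the opcode values are those of the real 6502.

```python
# Minimal trap-and-emulate model; all names here are illustrative assumptions.
SEI, CLI, NOP = 0x78, 0x58, 0xEA   # real 6502 opcode values
CRITICAL = {SEI, CLI}              # opcodes that touch the system "I" flag


class Hypervisor:
    def __init__(self):
        self.virtual_i_flag = False   # the guest's view of the "I" flag
        self.traps = 0

    def emulate(self, opcode):
        # The guest never touches the real flag; only its virtual copy changes.
        self.traps += 1
        if opcode == SEI:
            self.virtual_i_flag = True
        elif opcode == CLI:
            self.virtual_i_flag = False


def run_guest(code, hypervisor):
    """Run guest opcodes; uncritical ones run natively, critical ones trap."""
    native = 0
    for opcode in code:
        if opcode in CRITICAL:
            hypervisor.emulate(opcode)   # trap into the hypervisor
        else:
            native += 1                  # executed directly on the processor
    return native


hv = Hypervisor()
native_count = run_guest([NOP, SEI, NOP, CLI, SEI], hv)
```

The guest sees its "I" flag change as expected, while the real flag stays under the control of the hypervisor.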

Partial virtualization does not provide enough virtualization to run an unaltered operating system in the virtual machine.

Virtualization in the 6502 world has not been implemented yet, as the 6502 has no hardware protection features. A virtualized 6502 has the following requirements:

A virtual address space must be provided for the virtualized 6502.
Processor registers must be stored away on context switches to the hypervisor, similar to interrupts.
Interrupts must be caught by the hypervisor and dispatched to the virtualized 6502 as needed. This requires specific placement of interrupt vectors.
The I/O address space must be trapped.
Some 6502 instructions, like the SEI instruction, must be made privileged or be virtualized to fulfill the Popek and Goldberg virtualization requirement.
All registers - including memory management registers(!) - must be readable, or be available only to privileged instructions, so that the hypervisor can monitor changes. This has interesting implications, for example when an operating system that uses page tables is virtualized. Page tables are stored in main memory, but to track changes by the virtualized operating system, the hardware MMU cannot be used directly. These page tables are tracked by page faults on write, and the entries are translated and written to the physical MMU. This process can have a severe performance impact.

The 65k will provide partial virtualization, with a) a hypervisor and a user space mode, b) virtualization of critical CPU resources like the "I" flag, c) trapping of critical opcodes when in user space mode, d) virtual memory and I/O address mappings that can only be changed in hypervisor mode, and e) interrupt virtualization.

With these features, the 65k provides enough virtualization to run a pure 6502 operating system in user space mode.

Register Width Expansion

The 6502 has three main registers, AC, XR, YR. These three registers serve as accumulator and index registers, and are each 8 bit wide. Lacking 16 bit operations is one of the main issues with the 6502 in modern times. So the goal of the register expansion is to provide 16 bit registers and arithmetic operations.

Register expansion in the 65816

The 65816 has 16 bit registers as well. It expands the existing 8 bit registers to 16 bit using a mode bit. This bit (actually two, one for the AC and one for XR/YR) is changed by special instructions. When 16 bit mode is selected, the register operations are performed on all 16 bit, but using the original opcodes. I.e. an opcode

A9 00 : LDA #$00

becomes

A9 00 00 : LDA #$0000

for example.

The registers are always used as 16 bit. Modifying AC with an 8 bit width opcode leaves the upper 8 bit unchanged. This has strange consequences when an 8 bit AC is transferred into a 16 bit XR or YR: the unmodified high byte is transferred as well, resulting in a value that may not have been wanted.

This approach has advantages and disadvantages. First of all, the 16 bit code can be basically as short as the 8 bit code (not counting the larger data). On the other hand, switching between 8 bit and 16 bit operations requires an extra instruction. It also introduces "hidden state" into the program. A program is interpreted differently (and quite differently) depending on a mode bit that is evaluated at runtime. Depending on the mode bit an opcode can have a different length. So the assembler, as well as disassemblers (be they code or human), always need to know what mode (8 or 16 bit) the code is meant for.
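To illustrate the "hidden state" problem, here is a small Python sketch that splits the same byte sequence into LDA immediate instructions once for 8 bit and once for 16 bit accumulator mode. The function names are made up for this example, and the sketch deliberately understands only opcode $A9.

```python
def lda_immediate_length(accumulator_16bit):
    """Length in bytes of a 65816 'LDA #imm' (opcode $A9) for the given mode."""
    return 3 if accumulator_16bit else 2


def split_instructions(code, accumulator_16bit):
    """Split a byte string that consists only of LDA #imm instructions."""
    out, pos = [], 0
    step = lda_immediate_length(accumulator_16bit)
    while pos < len(code):
        if code[pos] != 0xA9:
            raise ValueError("this sketch only understands LDA #imm")
        out.append(code[pos:pos + step])
        pos += step
    return out


# The same six bytes, read under both assumptions about the mode bit.
code = bytes([0xA9, 0x00, 0xA9, 0xA9, 0xA9, 0x00])
as_8bit = split_instructions(code, accumulator_16bit=False)
as_16bit = split_instructions(code, accumulator_16bit=True)
```

In 8 bit mode the bytes read as three instructions (LDA #$00, LDA #$A9, LDA #$00), in 16 bit mode as two (LDA #$A900, LDA #$00A9). Nothing in the bytes themselves says which reading is the intended one.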

This situation led me to look for alternatives to get 16 bit operations.

Alternative 1: New Registers

The first idea to add 16 bit registers to the 6502 is to add new 16 bit registers U,V,W that work as 16 bit accumulator and index registers.

My first naive design approach is in 65k opcodes alternatives 1.txt

The approach shown in the file has a number of drawbacks:

It makes the system even more "non-symmetric". The new registers have different capabilities than their 8 bit counterparts. Already on the original 6502, the X and Y registers are not completely symmetric and these registers make the situation even worse.
Each new operation consumes an extra opcode, filling up the opcode space quite quickly.
Linked with the missing symmetry - it might be easier to implement these registers with their own complete ALU and internal busses. This would require more logic to implement and thus more chip area.
It's not elegant, and it would violate the design goals from above.

On the other hand they keep the code small and make fetching the code faster.

Alternative 2: Prefix Bytes

A Prefix Byte modifies the behaviour of the opcode following the prefix byte. Prefix bytes have a long history, they reach back to the Z80 and maybe even further. Also the 6809 used them to expand the opcode space. There are two types of prefix bytes:

Modifier prefix: the following opcode is basically the same as before, but modified for example by using a different number of bits (8 bit vs. 16 bit) or different registers
Multi-byte opcodes: a single byte starting a two-byte opcode enables 256 new opcodes in an otherwise single-byte opcode machine. These two-byte opcodes need not have anything in common with their single-byte counterparts.

The Z80 for example used both types of prefix bytes.

To expand the 6502 registers, a prefix byte could be used that modifies the existing opcode to use registers with an expanded number of bits.
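A modifier prefix can be sketched as a tiny decoder in Python. The prefix value $03 below is purely hypothetical and chosen only for this illustration (it is not taken from any 65k specification), as is the function name. The point is that the operand width follows from the instruction bytes alone.

```python
PREFIX_WIDE = 0x03   # hypothetical prefix value, chosen only for this sketch
LDA_IMM = 0xA9       # real 6502 opcode for LDA #imm


def decode_lda_immediate(code, pos=0):
    """Decode one optionally prefixed LDA #imm.

    Returns (width_in_bits, operand_value, instruction_length).
    """
    start = pos
    width = 8
    if code[pos] == PREFIX_WIDE:      # modifier prefix: same opcode, wider data
        width = 16
        pos += 1
    if code[pos] != LDA_IMM:
        raise ValueError("this sketch only understands LDA #imm")
    nbytes = width // 8
    operand = code[pos + 1:pos + 1 + nbytes]
    value = int.from_bytes(operand, "little")
    return width, value, pos + 1 + nbytes - start


plain = decode_lda_immediate(bytes([0xA9, 0x12]))
wide = decode_lda_immediate(bytes([PREFIX_WIDE, 0xA9, 0x34, 0x12]))
```

The prefixed form costs one extra byte per instruction, but a disassembler needs no knowledge of any runtime mode.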

Write Great Code page 280 (Google Books) - about 80x86 prefix bytes

Comparison

Here is a comparison of the different approaches to expand the 6502 register width:

	Mode bits	New registers	Prefix bytes
Description	"Hidden state" mode bits switch the existing registers and opcodes between 8 and 16 bit	A new set of 16 bit registers with complete new opcodes augment the existing 8 bit registers and opcodes	A modifier prefix byte modifies the existing opcodes to use a wider register size
Program size (and thus fetch speed)	+ (kept short)	+ (kept short)	- (longer for each operation)
Switching between 8 and 16 bit	- (extra instructions)	+ (by instruction, no extra cost)	+ (by instruction, no extra cost)
Interoperability between 8 and 16 bit	+ (same registers)	- (different registers, need transfer)	+ (same registers)
Opcode space usage	+ (Only single mode switch opcode)	0 (lots of new opcodes - but could be implemented as multi-byte opcode)	+ (one modifier prefix would basically suffice)
Implementation complexity	+ (single register set, modified opcodes)	- (new registers, new opcodes)	+ (single register set, modified opcodes)
Number of registers	- (no new registers)	+ (new set of registers)	- (no new registers)

Conclusion

All options basically compare at the same level. The difference is in how I weigh the different options. WDC obviously has chosen to go for the small code size route, taking the cost of extra switching opcodes between 8 bit and 16 bit operations.

I don't weigh the code size as much. For me it is important to not use hidden state in the CPU as an architectural principle, so I weigh the option to switch between 8 and 16 bit operation higher. That also allows me to easily use short 8 bit code sparsely intermingled with 16 bit operations.

The option to add new registers and new opcodes requires adding new registers to the processor, which takes more chip area. These registers are not "symmetric" to the existing ones (different width) and require new opcodes, maybe even a new ALU. Although this was my first design approach, this complexity rules it out.

The 65k will use modifier prefix bytes to extend the existing registers and opcodes to 16 bit width.

Sign extension handling

65816, 80x86 Sign extension handling

In a 65816 all registers are always used as 16 bit registers. As mentioned above, the 65816 does not modify the high byte of the 16 bit AC register when modifying the lower 8 bit only. This is similar to the behaviour of the 80x86 architecture, where no extension happens when computing 8- or 16-bit values.

This approach is - in case of the 80x86 - also motivated by the fact that historically the 16 bit registers actually are two 8 bit registers combined. In the case of the 65816 the AC high byte can be used to "store" extra values that can be swapped into AC low byte with the XBA opcode.

The 80x86 has extra instructions to sign-extend a register, for example CBW, CWD and MOVSX.

The 68k has two different types of registers, the address and the data registers. When assigning values to address registers, the value is sign-extended. When assigning to data registers, the register is only modified in the width as given by the instruction (which can be byte, word, or long - 8, 16 or 32 bit respectively).

One goal of the extension handling is that code written for a narrow register width will also run on a system with wider registers. Interoperability should be ensured when calling narrow code from wide code (e.g. by zero-extending narrow registers). It should be possible to have processor options with 16, 32 or maybe even 64 bit register width.

Operand vs. Operation extension

Sign extension can happen at two places. When reading an operand it can be extended to fit the operation size. For example a

ADC.L #$01

could extend the byte-sized operand to a 32 bit ("L") operation input. When the result of an operation is written back, it can be extended to the register size. An

	LDA.L #$01
	ADC.B #$80

could then result in AC being sign extended to $FFFFFF81. This has implications, though, when AC is being used as an unsigned value.
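The example above can be checked with a few lines of Python. The helper names are assumptions for this sketch, and the carry flag is assumed to be clear before the ADC.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the top bit of a from_bits wide value into all higher bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


def zero_extend(value, from_bits, to_bits):
    """Fill all bits above from_bits with zeros."""
    return value & ((1 << from_bits) - 1) & ((1 << to_bits) - 1)


ac = 0x00000001                      # LDA.L #$01
result8 = (ac + 0x80) & 0xFF         # ADC.B #$80, carry assumed clear
signed = sign_extend(result8, 8, 32)
unsigned = zero_extend(result8, 8, 32)
```

With sign extension AC becomes $FFFFFF81, with zero extension $00000081. Code that treats AC as an unsigned value only gets what it expects in the second case.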

Automatic Extension

Similar to the automatic setting of the status register bits (Z, N for example) in the 6502, the processor could automatically extend the sign of an operation. This could have advantages when the resulting value is a relative value. It would then automatically preserve whether the value is positive or negative.

On the other hand unsigned values above $7F would be extended to a value that is not expected.

The only operation with relative values is the branch operation. Here the default is 8 bit - but not even a register value.

Force Zero

The 65816 uses a different behaviour for AC and the index registers. The high byte of the XR and YR registers is forced to all zeros when 8 bit index register operation is selected. This allows using increment/decrement opcodes without caring about the register size.

Conclusion

After reset, the high byte could always be zero when not using 16 bit registers at all. 8 bit operations would only modify the low 8 bits, leaving the others at zero. Care would only have to be taken to set the high byte to zero when going back from 16 bit to 8 bit operations.

A new instruction could be used to clear a register completely - no matter what number of bits the register internally has (this should preferably be a single-byte opcode to even provide an improvement over the original 6502).

There will be no extension when reading an operand for an operation.

When writing to a register, to support the principle of least surprise, the 65k will automatically zero-extend the result from an operation to the full register width. I.e. the result of 8 bit instructions will be extended with zeros to the full register width, be it 16, 32 or even 64 bit. Results of 16 bit operations will be extended with zeros if the target register is 32 or 64 bit and so on.
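The difference between the two write-back policies discussed in this section can be sketched in Python. The function names and the 32 bit register width are assumptions for this illustration only.

```python
REG_BITS = 32                      # assumed register width for this sketch
REG_MASK = (1 << REG_BITS) - 1


def write_keep_high(reg, result, op_bits):
    """65816-style AC behaviour: only the low op_bits of the register change."""
    mask = (1 << op_bits) - 1
    return (reg & ~mask & REG_MASK) | (result & mask)


def write_zero_extend(reg, result, op_bits):
    """The rule chosen here: the result is zero-extended to the full width."""
    return result & ((1 << op_bits) - 1) & REG_MASK


reg = 0x12345678
kept = write_keep_high(reg, 0x81, 8)        # old upper bits survive
cleared = write_zero_extend(reg, 0x81, 8)   # old upper bits are gone
```

With zero extension the register content after an 8 bit operation does not depend on what a wider operation left behind earlier, which is the "least surprise" argument made above.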

The 65k will provide extension opcodes to extend sign, zeros, or ones. It will also provide separate instructions to clear the full register (no matter how wide the register actually is).

Optionally it may include a prefix operation to NOT zero-extend the result of an operation to the target register's full width.

Number of Registers

The 6502 has three main registers, AC, XR, YR. Compared to other (larger) processors this is a very low number. The 68k for example has 8 address and 8 data registers - each 32 bit wide. So how can the number of registers be increased?

Accumulator-Memory architecture

The 6502 has an accumulator-memory architecture, i.e. most opcodes combine the value from the accumulator or another register with a value from memory, or store a register to memory. This is different from a processor like the 68k, which has many operations involving two registers and no memory location.

With the 6502 you could say that, as almost every cycle is a valid memory access, the processor has many opportunities where it may have to wait for the memory. This is an inherent result of the 6502's accumulator-memory architecture.

One way around this limitation would be to increase the number of registers, so that more operations could work without memory accesses.

Register Sets

Register sets duplicate (or multiply) an existing set of registers with the same set of features, and make them available via specific exchange instructions. The Z80 for example provides a second set of registers (the ' registers) that are supposed to be used by fast interrupt routines for example.

Zeropage as Registers

The 6502 has a specific addressing mode, zeropage, addressing the lowest page in memory. The zeropage location is determined by the second byte of the opcode.

LDA $12

for example puts the value from zeropage location $12 into the accumulator. The zeropage location could be interpreted as a register number.

Unfortunately, even though the zeropage provides 256 bytes, it is still a scarce resource and also requires a memory access - which makes it slower than a simple register access. The 65816 provides a Direct register (D) to move the zeropage to anywhere in bank 0.

Additional, Separate Registers

The processor could simply get new registers in addition to the existing ones. In contrast to register sets this would mean new opcodes for operations on these new registers.

As the 6502 has no means of "numbering" registers, there is no easy means of extending the existing operations with new registers. A separate set of operations would have to be implemented; a prefix to existing opcodes would not be enough.

Comparison

Here is a comparison of the different approaches to expand the number of registers:

	Register Sets	Zeropage	More registers
Speed	+ (either prefix, or exchange operation, but no memory access)	- (zeropage "register number", plus memory access)	+ (either prefix, or new single byte ops)
Number of registers	- (small multiple of 3)	+ (256 byte resp. 128 word registers)	- (small)
Simplicity	- (new set of operations, prefix, or exchange opcodes)	+ (already existing opcodes)	- (new set of operations)
Interrupt Handling	- (Need to be explicitly saved - or not used either outside or inside the interrupt)	+ (no action needed)	- (Need to be explicitly saved - or not used either outside or inside the interrupt)

Conclusion

The zeropage alternative actually competes quite well, even though it is an "external" solution and requires memory access.

Zeropage register access is actually simple and "known" - the operations already exist in the 6502.

To speed up the memory access it should be possible to provide a separate zeropage (write-through) cache that does not require memory access (if the processor is faster than memory). A zeropage addressing base register could provide a means of easily replacing the "zeropage register set" with another one.

The 65k will use zeropage "registers", with zeropage cache where applicable, and a base register to move the "zeropage register set".
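A zeropage "register set" that can be moved with a base register can be modelled in a few lines of Python. The class name, the attribute names and the simple base-plus-offset address calculation are assumptions for this sketch, not a definition of how the 65k will do it.

```python
class ZeropageRegisters:
    """Model of zeropage 'registers' relocated by a base register."""

    def __init__(self, memory_size=0x10000):
        self.memory = bytearray(memory_size)
        self.base = 0                # hypothetical base register; 0 acts like a 6502

    def address(self, zp):
        # The 8 bit zeropage offset from the opcode is added to the base.
        return (self.base + (zp & 0xFF)) % len(self.memory)

    def store(self, zp, value):
        self.memory[self.address(zp)] = value & 0xFF

    def load(self, zp):
        return self.memory[self.address(zp)]


regs = ZeropageRegisters()
regs.store(0x12, 0xAA)       # "register" $12 of the first set
regs.base = 0x0200           # switch to another register set
regs.store(0x12, 0x55)
second = regs.load(0x12)
regs.base = 0x0000           # switch back, the first set is untouched
first = regs.load(0x12)
```

Switching the whole "register set" is a single write to the base register, which is what makes this attractive for interrupt routines or task switches.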

Address Expansion

The 6502 has a 16 bit address bus. This amounts to a whopping 64 kByte of memory. Even the old and famous C64 already had some bank switching schemes to expand the address space to more than that. So there is a need to expand the number of address lines available.

Also the stack registers and stack size are important here. The 6502 stack is only 256 bytes long - and needs to be expanded for larger systems as well.

One goal is to extend the address space beyond the original 64k. On the other hand it should be possible to run 16 bit programs in all of the expanded memory. This can only be achieved by some kind of address translation.

A Commodore PET would need 5 memory areas (if you don't count unmapped memory): up to 32k RAM, 2k video RAM (I/O), 24k ROM ($8800-$e7ff), 256 Byte I/O, and 5.75k ROM ($e900-$ffff). A Commodore 64 would need even more memory areas, 8 in the power-on default memory configuration: 0-1 CPU register, ~1k RAM ($0002-$03ff), 1k video RAM ($0400-$07ff), 38k RAM ($0800-$9fff), 8k ROM ($a000-$bfff), 4k RAM ($c000-$cfff), 4k I/O ($d000-$dfff), 8k ROM ($e000-$ffff). This could be reduced to 6 if a "low priority" RAM mapping for the whole 64k could be used that is overlaid by the other mappings.
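The C64 list can be written down as a table and checked for gaps and for the smallest mapping size. This is only a bookkeeping sketch in Python; the names are made up, and the 38k RAM area is taken to end at $9FFF.

```python
# (start, end, description) for the C64 power-on default configuration
C64_AREAS = [
    (0x0000, 0x0001, "CPU register"),
    (0x0002, 0x03FF, "RAM"),
    (0x0400, 0x07FF, "video RAM"),
    (0x0800, 0x9FFF, "RAM"),
    (0xA000, 0xBFFF, "ROM"),
    (0xC000, 0xCFFF, "RAM"),
    (0xD000, 0xDFFF, "I/O"),
    (0xE000, 0xFFFF, "ROM"),
]


def sizes(areas):
    """Size in bytes of every area."""
    return [end - start + 1 for start, end, _ in areas]


def is_contiguous(areas, total=0x10000):
    """True if the areas cover the address space without gaps or overlaps."""
    expected = 0
    for start, end, _ in areas:
        if start != expected:
            return False
        expected = end + 1
    return expected == total


area_sizes = sizes(C64_AREAS)
smallest = min(area_sizes)
```

The smallest area is only 2 bytes long, which is far below the granularity of a typical paged MMU.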

Switching the memory translation should be efficient, and as far as possible done in hardware. One test use case would be to switch from a simulated Commodore PET to a simulated Commodore 64 memory environment.

In addition to relocation, the memory management coming with the address expansion should allow for protection and sharing. Protection means that a process must not be able to access or modify memory areas it should not access or modify without permission. This "prevents a malicious or malfunctioning program from interfering with the operation of other running programs" (Wikipedia on Memory management, link below). The other feature that should be possible is to share memory areas between processes as a fast means of interprocess communication. A server process for example could directly write data into the calling process' memory.

The location of the reset and interrupt vectors also needs to be discussed. If they stay at $FFFA-$FFFF, they would probably be in the middle of the system's RAM instead of ROM. On the other hand RAM is faster than ROM these days.

Natural Address Expansion

One goal is to make the address expansion as "natural" as possible. This means that, similar to how absolute (16 bit) addressing expands zeropage addressing, addressing modes for the larger address space should expand the old 16 bit addressing. A consequence of this is that the program counter (PC) has the full number of address bits. 16 bit addressing would simply address the low 64 kByte of memory.

There is one caveat with this approach though - return addresses are two bytes on the stack. A JSR from outside the low 64k requires more than two bytes on the stack, and also its own RTS opcode to use the longer return address. This is probably the reason why WDC decided on their banking approach.

65816 Address Expansion: Bank Registers

The 65816 still has a 16 bit program counter register (PC). This register is extended by an 8 bit Program Bank Register (PBR) to give a 24 bit physical address.

Virtual data addresses are also still 16 bit values, and are extended by a Data Bank Register (DBR) to provide a 24 bit physical address. The program counter can thus not cross bank boundaries, it wraps around from $XXFFFF to $XX0000. Only special instructions that modify the PBR change the execution bank.
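The bank register scheme can be written out in Python as follows. The function names are made up for this sketch.

```python
def physical_address(bank, offset):
    """Combine an 8 bit bank register and a 16 bit address into 24 bits."""
    return ((bank & 0xFF) << 16) | (offset & 0xFFFF)


def next_pc(pbr, pc, length=1):
    """Advance the 16 bit PC; the program bank register is not touched."""
    return pbr, (pc + length) & 0xFFFF


start = physical_address(0x12, 0xFFFF)
pbr, pc = next_pc(0x12, 0xFFFF)
wrapped = physical_address(pbr, pc)
```

Advancing from $12FFFF ends up at $120000, not at $130000, because the carry out of the 16 bit PC never reaches the bank register.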

"Direct" addressing modes - formerly known as zeropage addressing - however is determined by the 16 bit Direct register (D) and always results in a physical address in bank 0 ($000000-$00FFFF). Also the stack can only be in bank 0 - its position is determined by the stack high byte register.

CS/A65 Address Expansion: MMU

The CS/A65 computer expands the 6502 CPU with an MMU, that maps any of the 16 4k-blocks of virtual (CPU) address space into 256 4k-blocks physical address space. The physical address space is 1 MByte and filled with RAM and ROM. The later versions also include memory management features like "no-execute", "write-protect" and "not-mapped" bits. This approach is commonly called paging.
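The mapping described above can be sketched in Python as a 16 entry table that is indexed with the upper four address bits. The names are assumptions for this example, and the protection bits of the later versions are left out.

```python
PAGE_BITS = 12                       # 4k blocks


def translate(mmu, vaddr):
    """Map a 16 bit CPU address to a 20 bit physical address."""
    block = (vaddr >> PAGE_BITS) & 0x0F        # one of 16 virtual blocks
    offset = vaddr & ((1 << PAGE_BITS) - 1)    # byte within the 4k block
    return ((mmu[block] & 0xFF) << PAGE_BITS) | offset


mmu = list(range(16))    # identity mapping of the low 64k
mmu[0x8] = 0xF3          # map virtual block $8 to physical block $F3
phys = translate(mmu, 0x8123)
unchanged = translate(mmu, 0x1234)
```

A context switch in this model means rewriting up to 16 table entries.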

The instruction pointer is still always 16 bit, as the CPU only sees 16 bit virtual address space.

Using this MMU requires loading up to 16 MMU registers during a context switch. This can prove costly in a multitasking operating system.

Modern systems use a slightly different approach. The program only sets a memory address to the MMU, and the MMU then loads the mappings from these memory locations - as needed. The loaded "Page Table Entries (PTEs)" are stored in a "Translation Lookaside Buffer (TLB)".

Loading PTEs on demand, however, can make opcode timing non-deterministic - when crossing an MMU block boundary an extra memory access to read the PTE may be inserted.

Wikipedia on MMU

80x86 - Segmentation

The 80x86 architecture has used segment registers since its first 8086 incarnation. In the 8088/8086 the segment register was 16 bit. Shifted 4 bits to the left, it was used as the base address for a 64k window into the 1 MByte physical address space.
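The 8086 address calculation is short enough to write down in Python. The function name is made up for this sketch.

```python
def real_mode_address(segment, offset):
    """8086 physical address: 16 bit segment shifted left by 4, plus offset."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF          # the 8086 has only 20 address lines


a = real_mode_address(0x1234, 0x0010)
b = real_mode_address(0x1235, 0x0000)     # a different pair, the same byte
top = real_mode_address(0xFFFF, 0x0010)   # wraps around past 1 MByte
```

As the 64k windows overlap every 16 bytes, many segment:offset pairs address the same physical byte.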

Since the 80286 the segment register content points to a descriptor table in memory, that describes the segment. This description includes the - physical - base address, segment size (which is being checked), write protection and execute-only protection.

Segmentation has the advantage that it can quickly be changed for example in case of a context switch.

Newer 80x86 also use an MMU and paging in addition to segmentation.

The 80x86 can have a 16 bit, 32 bit or 64 bit instruction pointer, depending on the mode of operation. The 16 bit is extended with a 48 bit value to create a 64 bit virtual address space. The 32 bit instruction pointer is zero-extended.

"Canonical addressing" means that all unused address bits on the upper end (say address bits 63 down to 48 when address bits 0-47 form the virtual address) are either all zeros or all ones, i.e. copies of the highest used address bit.

Wikipedia on Memory Segmentation
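The canonical address rule mentioned above can be expressed as a small Python check. The function name and the default of 48 used address bits are assumptions for this sketch.

```python
def is_canonical(addr, used_bits=48):
    """True if all bits above the used ones equal the highest used bit."""
    addr &= (1 << 64) - 1
    upper = addr >> (used_bits - 1)              # highest used bit and above
    all_ones = (1 << (64 - used_bits + 1)) - 1
    return upper == 0 or upper == all_ones


low_half = is_canonical(0x00007FFFFFFFFFFF)
high_half = is_canonical(0xFFFF800000000000)
hole = is_canonical(0x0000800000000000)
```

The first two addresses are canonical, the third one is not, because bit 47 is set while bits 48 to 63 are clear.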

PowerPC Address Translation

The PowerPC actually has two different types of address translation. The Block Address Translation (BAT) registers map variable sized memory regions from virtual to physical ("real" in PowerPC terms) addresses. There is only a limited number of these registers. The Segmented Translation "breaks virtual memory into segments, which are divided into 4 kByte pages, each representing physical memory".

For segmented translation, the virtual address is split into segment number, page address, and byte within the block. The segment number is used to look up a segment id, that is hashed with the page address, the hash is then used as index in the page table to retrieve the real page number. This lookup is cached, however, in TLBs.

For block address translation, which seems to be more interesting here, there are four (resp. eight) block address translation CPU registers each for data and instructions. Each BAT includes a virtual block address and a block length definition, which allows matching virtual address blocks from 128 kByte to 256 MByte. The virtual address is truncated to the "block length" and OR'd with the real page number for that block to get the physical address.
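The match-and-translate step of a BAT register can be sketched in Python like this. It is a simplified model with made-up names; real BAT registers encode the block length as a mask and carry protection bits, which are left out here.

```python
class Bat:
    """One block address translation entry (simplified model)."""

    def __init__(self, virt_base, phys_base, length):
        if length & (length - 1):
            raise ValueError("block length must be a power of two")
        if virt_base % length or phys_base % length:
            raise ValueError("block addresses must be aligned to the length")
        self.virt_base = virt_base
        self.phys_base = phys_base
        self.length = length

    def matches(self, vaddr):
        # Compare only the address bits above the block length.
        return (vaddr & ~(self.length - 1)) == self.virt_base

    def translate(self, vaddr):
        # Keep the offset within the block, OR in the real block address.
        return self.phys_base | (vaddr & (self.length - 1))


def translate(bats, vaddr):
    """Return the physical address, or None if no BAT entry matches."""
    for bat in bats:
        if bat.matches(vaddr):
            return bat.translate(vaddr)
    return None


bats = [
    Bat(0x00000000, 0x10000000, 128 * 1024),          # smallest block size
    Bat(0x40000000, 0x00000000, 256 * 1024 * 1024),   # largest block size
]
small = translate(bats, 0x00012345)
large = translate(bats, 0x4ABCDEF0)
miss = translate(bats, 0x80000000)
```

Two entries are enough here to map both a 128 kByte and a 256 MByte region, where a paged MMU would need many page table entries.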

The advantage of this approach is that due to the variable size, only a few entries suffice to map large memory areas.

Block address translation is not implemented in all PowerPC processors, notably not in the G4 and G5. But it is used in the 4xx embedded processors.

PowerPC Address Translation

Address Virtualization

Virtualization of a processor hides the physical properties of a processor from the running program and only shows an abstraction of the processor. The program running in the virtual machine should in the best case not be able to find out that it is running in a virtual machine and not on real hardware. The 65k should provide virtualization to such an extent that a normal 6502 operating system could run in the virtual machine.

Providing address space virtualization requires that any changes to the memory management subsystem are privileged operations and must be trapped when executed in user space mode.

When trapped, the processor enters hypervisor mode, and must be able to find out what a virtual address means in the physical address space. A specific instruction, possibly an opcode prefix, could calculate the effective virtual address of an opcode, and another instruction could then translate this virtual address to a physical address. The other way around is not possible, as a physical address could be mapped multiple times.

Address Space Selection

The 68k provides a function code mapping for address space translation. I.e. the translation registers can be loaded with function code and a function code mask, and if the operation's function code matches the translation register's function code, it is active.

Comparison

Comparing the different approaches above, there are two main approaches, MMU and BAT. A paged MMU is used by CS/A65, as well as all modern CPU architectures (above a certain size). They have evolved to provide different page sizes, from 4k as in the beginning up to 4M sizes. The BAT approach - conceptually an extension of segmentation - on the other hand provides differently sized segments from the start, and can map these segments to physical memory. Systems using segmentation or BAT have evolved to include paged mapping as well though.

All approaches can use TLBs - a translation cache - to speed up the mapping for addresses recently used.

Here is a comparison of the different approaches to expand the address space:

	MMU	BAT	BAT with select mask
Description	A paged memory management unit translates fixed sized blocks from virtual to physical addresses via page table lookups	A set of segment descriptors or BAT registers describe variable sized memory ranges and map them to physical addresses	Similar to BAT registers, but with additional match code to automatically select active entries
Lazy-Loading (Loading speed)	+ (a paged MMU can automatically traverse the page table tree when a translation is needed, as a specific virtual offset maps to defined entries in that tree)	- (At least the matching part of all BAT registers must be available to the CPU, as a lookup cannot be automated due to the variable sized blocks)	0 (BAT registers can be loaded in advance and selected dynamically using match codes. If more BAT registers are required than are available, they need to be reloaded dynamically though.)
Translation Speed	- (A lazy-loading MMU needs to traverse the page tables to actually find the correct mapping, which involves additional memory operations. Note that an MMU where the mapping is loaded on context switch is faster, but not feasible due to the large number of required mappings.)	+ (the BAT registers must be loaded at context switch, no memory operations are required, so translation speed is fast - maybe even TLBs can be avoided)	+ (the BAT registers must be loaded at context switch, no memory operations are required, so translation speed is fast - maybe even TLBs can be avoided)
Variable sized memory mappings (e.g. to virtualize a Commodore PET (min size 256 Byte) or even C64 (min size 2 Byte))	- (paged MMUs normally only have a single block size, typically 4k, or additionally a large page size of something like 4M)	+ (The BAT registers provide variable sized mappings, although not necessarily in the granularity required)	+ (The BAT registers provide variable sized mappings, although not necessarily in the granularity required)
System management	+ (PTEs can provide system management info like write protection, no-execute bits, tags, ...)	+ (BATs can provide system management info like write protection, no-execute bits, tags, ...)	+ (BATs can provide system management info like write protection, no-execute bits, tags, ...)

Two requirements working against each other weigh in here:

Differently-sized memory mappings like for a Commodore emulator, from 2 byte, via 256 byte, up to 38kByte. Boundaries on arbitrary addresses?
The runtime effort to map large memory areas on a personal computing system

A two-staged approach may be the solution here

Conclusion

The 65k will use a three-staged approach.
First the virtual user space address will be truncated to a definable number of address bits via masking.
Secondly the address is matched against one of 4 (or 8?) BAT-like segments. These segments have a start address and end address. Each segment has two modes. Either it directly provides a physical start address for the memory segment. Or it provides the start address of an MMU mapping, together with the selection of one of a few possible mapping block lengths.

A processor option may be to leave out the paged MMU, but the segment registers are required. For a processor with optional cache, an additional MMU bit is required that states whether caching is allowed.

Each memory mapping provides at least a read-only bit, no-execute protection, a valid bit, as well as a hypervisor bit (for mappings only available to the hypervisor). The segment registers also provide external (physical) bus width, either "native" or 8 bit (for I/O).

To select the active segment registers, match codes per register are used. Memory environments are identified by an (8 bit) environment number, which is mapped to the match code. The segment register is active when the memory environment number matches the match code.
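
The translation described in this conclusion can be sketched as a small Python model. This is only an illustration under stated assumptions: the names `Segment`, `translate` and `PAGE_SIZE` are made up here, the page table is modelled as a plain dictionary, the match code is compared directly with the environment number, and the paged lookup is treated as the third stage.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096  # assumed block length for the paged mode


@dataclass
class Segment:
    match_code: int                    # compared with the environment number
    start: int                         # first virtual address covered
    end: int                           # last virtual address covered (inclusive)
    phys_start: int = 0                # direct mode: physical start address
    page_table: Optional[dict] = None  # paged mode: page index -> physical page base
    read_only: bool = False


def translate(vaddr, env, mask, segments, write=False):
    """Return the physical address for vaddr in environment env."""
    vaddr &= mask                                  # stage 1: truncate via masking
    for seg in segments:                           # stage 2: match a segment
        if seg.match_code != env or not (seg.start <= vaddr <= seg.end):
            continue
        if write and seg.read_only:
            raise PermissionError("write to read-only segment")
        offset = vaddr - seg.start
        if seg.page_table is None:                 # direct mapping
            return seg.phys_start + offset
        page, within = divmod(offset, PAGE_SIZE)   # stage 3: paged mapping
        if page not in seg.page_table:
            raise LookupError("page not mapped")
        return seg.page_table[page] + within
    raise LookupError("no segment matches")
```

A failed lookup raises an exception in this model; in hardware this would correspond to a not-mapped or write-protect condition.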

Wikipedia on Memory Management

Advanced Bus Features

The 6502 has a very simple bus interface: clock, r/-w, address and data lines. The only special input signal is RDY, which allows halting the CPU to wait for slow memory. SYNC signals the system when an opcode is fetched.

These signals were extended very early on, for example by BE. This signal decouples the CPU bus (address, data, r/-w) from the system, so that a video processor (e.g. the C64 VIC) could take over the bus without extra bus drivers.

Signal	Feature	Reason
RDY	When asserted to the CPU, the CPU waits until it finishes the current memory access cycle.	Used to let the CPU wait for slow memory. Note: for reads only on the NMOS 6502, for reads and writes on the CMOS versions.
SYNC	CPU output. Signals an opcode fetch.	Can be used to single-step the CPU, or to catch bus errors when an opcode fetch is done on no-execute memory.

Other features already implemented in 6502 systems are ABORT, No execute, write protect, and bus error.

The signals decided upon here also need to be located either between CPU core and MMU (even if the MMU is integrated into the CPU), or between CPU (including MMU) and the system.

65816 Bus Features

The 65816 has a number of additional signals:

Signal	Feature	Reason
ABORT	When this input signal to the CPU is asserted, the CPU finishes its current opcode, but does not update the register values (including the PC), then fetches the ABORT vector similar to an interrupt.	This is used when an invalid memory location is accessed. The opcode is aborted, the CPU can change the memory mapping so that the memory location becomes valid, and then rerun the opcode.
VPA/VDA	These CPU output signals tell the system whether the current cycle is an opcode fetch (VDA+VPA), valid program address (VPA), valid data address (VDA) or invalid (none asserted).	Replaces the SYNC output. Allows invalid cycles to run without wait states when the CPU can run faster than the system. May be used for memory mapping.
/VP	Asserted by the CPU when an interrupt (IRQ, NMI, RESET, ABORT) vector is pulled.	Can be used to specifically map or dynamically replace interrupt vectors.
BE	Bus Enable. Input to decouple address, data and r/-w lines from the system.	When an external processor (like video) requires memory access, the CPU can be switched off the system bus without further drivers.
/ML	Memory Lock. Is asserted by the CPU during the read-modify-write cycles of such an opcode (like ROR ABS).	Locks the memory access to that address for other CPUs.
M/X	Outputs the AC and index register mode (8 bit vs. 16 bit)	May be used for memory management purposes.
E	Outputs the emulation mode (native vs. emulation)	May be used for memory management purposes.

Of these signals, M/X and E are 65816 specific. The ABORT, VPA/VDA, /VP, BE and /ML signals are candidates for the 65k.

CS/A65 Bus Features

In my CS/A65 system I have implemented some other advanced features:

Signal	Feature	Reason
/BE	Bus Enable. Input to decouple address, data and r/-w lines from the system.	When an external processor (like video) requires memory access, the CPU can be switched off the system bus without further drivers.
NOTMAPPED	The CPU board asserts this signal when a memory location is accessed that is not mapped in the board's MMU. The AUXCPU processor can then halt the main CPU (via RDY) and fix the error condition	This signal detects a bus error condition. The AUXCPU is a kind of replacement for the 65816's ABORT pin.
WPROT	The CPU board asserts this signal when a memory location is written to that is mapped as read-only in the board's MMU. The AUXCPU processor can then halt the main CPU (via RDY) and fix the error condition	This signal detects a bus error condition. The AUXCPU is a kind of replacement for the 65816's ABORT pin.
NOEXEC	The CPU board asserts this signal when an opcode fetch is performed on a memory location that is mapped as no-execute in the board's MMU. The AUXCPU processor can then halt the main CPU (via RDY) and fix the error condition	This signal detects a bus error condition. The AUXCPU is a kind of replacement for the 65816's ABORT pin.

COPRO	This board (not signal actually) implements a 6502 co-processor. It features a hardware register protected by optimistic locking. This is implemented by a hardware load-linked, store-conditional register access.	This feature is implemented to provide safe synchronization between the two processors.

Multiprocessor/-core Synchronization

More modern CPUs provide features to synchronize multiple cores and/or CPUs.

Feature	Description	Reason
Test-and-Set opcodes	This opcode reads a specific memory location, and changes that value in an atomic way (i.e. no other CPU can change the memory location between the read and write).	Used to synchronize multiple CPUs. This is a read-modify-write opcode and could be implemented using a memory lock signal.
Compare-and-Swap opcodes	This opcode checks that a specific value is in a memory location, and only when this is the case, changes that memory location to a new value.	Used to synchronize multiple CPUs. This is a read-modify-write opcode and could be implemented using a memory lock signal.
Load-Linked/Store-Conditional	When a memory location is read (load-linked), the CPU monitors changes to that location. When the CPU then writes to it (store-conditional) the write only succeeds when no modification has been done by other CPUs. This implements a lock-free atomic read-modify-write operation.	Used to synchronize multiple CPUs. Requires bus snooping.
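
The load-linked/store-conditional row can be illustrated with a small Python model. It is a sketch only: the class `Memory`, its one-reservation-per-CPU bookkeeping and the helper `atomic_increment` are assumptions made for this example, not a description of any real bus protocol.

```python
class Memory:
    """Shared byte memory with one load-linked reservation per CPU."""

    def __init__(self, size):
        self.data = [0] * size
        self.reservation = {}   # cpu id -> reserved address

    def load_linked(self, cpu, addr):
        self.reservation[cpu] = addr
        return self.data[addr]

    def store(self, cpu, addr, value):
        # A write seen on the bus cancels every other CPU's reservation
        # on the same address (this models the bus snooping).
        for other, reserved in list(self.reservation.items()):
            if other != cpu and reserved == addr:
                del self.reservation[other]
        self.data[addr] = value

    def store_conditional(self, cpu, addr, value):
        if self.reservation.get(cpu) != addr:
            return False        # reservation lost: the write is not performed
        del self.reservation[cpu]
        self.store(cpu, addr, value)
        return True


def atomic_increment(mem, cpu, addr):
    """Retry until the read-modify-write sequence went through undisturbed."""
    while True:
        new_value = (mem.load_linked(cpu, addr) + 1) & 0xFF
        if mem.store_conditional(cpu, addr, new_value):
            return new_value
```

The retry loop in `atomic_increment` is the usual way such a pair is used: no lock is held, the operation is simply repeated when another CPU got in between.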

Multiprocessor/-core Synchronization with caches

Modern CPUs use caches to improve performance in the presence of memory slower than the CPU would need it. For the 65k a cache is optional as well (see below). So these have to be taken into account for Multicore/-processor synchronization.

A good example of how a memory model has been refined in the light of upcoming multicore processors is the Java programming language memory model.

I will not go further into this discussion, but present the following conclusion.

To allow the implementation of a proper "happens-before" relationship, the 65k will provide memory barrier instructions to a) flush all writes to memory, and b) flush the read cache, both in total and for a given address (maybe address range).

Prioritized IRQs

The 6502 has a two-staged interrupt process. The standard IRQ interrupts the CPU, which then pushes the PC and the status onto the stack, fetches the interrupt vector and continues execution there. The IRQ can be prevented by setting the "I" flag in the processor status register. The NMI works similarly, but cannot be prevented. While the IRQ is level-triggered, the NMI is edge-triggered to prevent an infinite loop.

More modern CPUs have multiple interrupt lines, where each interrupt has a different priority. A CPU being in a specific interrupt level can be interrupted by a higher level interrupt.
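
The rule that only a higher level may interrupt the current one can be written down in a few lines of Python. The numbering (0 for normal execution, larger numbers for higher priority) and the function name `next_interrupt` are assumptions for this sketch.

```python
def next_interrupt(pending, current_level):
    """Return the level to service next, or None if nothing may interrupt.

    pending is a collection of interrupt levels that are currently requested,
    current_level is the level the CPU is executing at (0 = no interrupt).
    """
    candidates = [level for level in pending if level > current_level]
    return max(candidates) if candidates else None
```

For example, with levels 1 and 2 pending during normal execution, level 2 is serviced first; while the CPU is at level 2, a pending level 1 has to wait.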

Comparison

The features mentioned above work at different parts of the architecture. The ABORT functionality is a functionality of the processor core - each register must have a shadow copy, that is updated only when the operation completed successfully (i.e. without an ABORT).
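
A minimal Python sketch of the shadow copy idea follows. The class `Core` and the way an opcode is passed in as a function are assumptions chosen for illustration; a real core would do this per register in hardware.

```python
class Core:
    """Registers are only committed when the opcode completes without ABORT."""

    def __init__(self):
        self.regs = {"A": 0, "X": 0, "Y": 0, "PC": 0}   # committed state

    def execute(self, opcode_effect, abort=False):
        shadow = dict(self.regs)    # the opcode works on the shadow copy
        opcode_effect(shadow)
        if abort:
            return False            # shadow copy is discarded, PC unchanged
        self.regs = shadow          # commit on successful completion
        return True


def inx(regs):
    """Hypothetical one-byte opcode used for the example."""
    regs["X"] = (regs["X"] + 1) & 0xFF
    regs["PC"] += 1
```

Because the PC is part of the committed state, an aborted opcode can simply be executed again after the memory mapping has been fixed.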

The read-only, no-execute etc bits are features of the memory management unit, not of the processor core. A separate component can then use these bits - together with the CPU's R/-W, SYNC etc outputs - to generate the CPU's ABORT input.

The bus synchronization can be implemented by signals that the CPU generates (either ML, or one signal for LL/SC each), that are passed through the MMU and/or Cache, and are used by an external arbiter to synchronize the multiple processors.

Conclusion

The 65k core will provide the fetch type signals (SYNC, or VPA/VDA). It will have an ABORT input to provide for an opcode rollback.

The 65k MMU will provide at least read-only, no-execute, and not-mapped bits for each memory location. A separate component will use these bits to detect bus error conditions and create the ABORT signal.

The processor bus interface will provide tri-state bus drivers, as well as an arbiter interface (BE, RDY), so the external bus can be multiplexed. This includes ML resp. LL/SC signals for interprocessor synchronization.

In addition to the IRQ and NMI interrupt signals, the CPU has at least two more intermediary interrupt lines with interrupt prioritization. This includes appropriate opcodes and interrupt vectors.

Addressing Modes

The 6502 is lacking some important addressing modes that are needed for

object-oriented programming
completely relative programs

65816 Addressing Modes

The 65816 has a number of new addressing modes. Most importantly the memory address space is separated into banks of 64kByte each.

There are two new 8 bit registers, the program bank register and the data bank register that determine the bank of the opcode fetch and the bank for the data access respectively. The data bank register is used for the 6502 addressing modes that specify a 16 bit address only. Some of the new addressing modes are introduced to allow the specification of a 24 bit address directly, without the need for the data bank register.

Additionally there is the 16 bit direct register. Former zeropage addressing modes add the value in this register to the "zeropage" offset, and are now called "direct" addressing modes. The "Direct page" always is in bank zero, i.e. all former "zeropage" addressing modes work in bank zero.

Relative addressing for long branches has been extended to 16 bit, from -32768 to +32767.

Stack on the 65816 can also only be in bank zero, but in native mode the stack pointer is 16 bit. A new stack-relative addressing mode has been added, where the effective address is the sum of the stack pointer and an 8 bit offset. The last addressing mode is "Stack relative Indirect Indexed" - The stack pointer and an 8 bit offset are added to form an address in bank zero. From this address a 16 bit address is read, the data bank register is added to form a 24 bit address. Then finally YR is added to this 24 bit address.
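
The "Stack relative Indirect Indexed" calculation described above can be followed step by step in this Python sketch. Memory is modelled as a dictionary from address to byte, and the function name is made up for the example.

```python
def stack_relative_indirect_indexed(mem, s, offset, dbr, y):
    """Effective 24 bit address for the 65816 (offset,S),Y addressing mode."""
    ptr_addr = (s + offset) & 0xFFFF          # pointer location in bank zero
    lo = mem[ptr_addr]
    hi = mem[(ptr_addr + 1) & 0xFFFF]
    base = (dbr << 16) | (hi << 8) | lo       # data bank register supplies the bank
    return (base + y) & 0xFFFFFF              # finally YR is added
```

With the stack pointer at $01F0, an offset of 3, the pointer $C000 stored at $01F3, data bank $02 and YR = $10, the effective address is $02C010.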

6502 history and future

68000 Addressing Modes

The 68k opcodes are quite systematic and symmetric in terms of source and target operand. In the 6502, the target operand (or the source operand in case of stores) is defined by the opcode (LDA vs. LDY etc). In the 68k the opcode defines the operation, and two operands define the source and target operand for the operation. The 68k has a number of addressing modes, some of which are similar to the 6502 modes...

Register direct - defines a register as source or target operand
Absolute - the operand defines the address, from which the operand value is read
Program Counter Relative - program counter plus offset (plus index register on 68020 or later)
Register indirect - the content of a register is used as address, optionally with register predecrement, postincrement, or offsets (plus index register on 68020 and later)
Immediate - the value following the opcode
Implied - register implied as defined by the opcode

The 68k has pre-decrement and post-increment operations. I.e. when a register is used as an index, it can be decremented before or incremented after the actual operation. From 68020 and later there is an addressing mode that adds an index register (data or address register) to an address register, plus an offset, and uses the result as effective address.

65k Addressing Modes Draft

In the 65k all registers have the full length of the address bus, 16 bit on 16 bit options, 32 bit on 32 bit options and so on. The address space is not segmented, but linear. Only segment registers (see above) "confine" addressing to a defined address space.

With this approach, long index registers already provide easy access to the whole address space. Consider a 32 bit XR with

LDA $00,X

To "naturally" extend the addressing modes, zeropage and absolute addressing modes (plus their indexed variants) are extended by absolute long addressing modes that use 32 bit addresses. So it is possible to write

LDA $12345678,X

Indirect addressing modes are extended by long indexed addressing modes where the address pointed to by the opcode is a 32 bit address. For example

LDA [$00],Y

means that at address $00000000 there is a 32 bit (4 byte) address, which is added to YR to get the effective address.
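
As a sketch, the effective address of the long indirect indexed mode could be computed as follows. The little-endian byte order of the stored pointer is an assumption carried over from the 6502, and the function name is hypothetical.

```python
def long_indirect_y(mem, zp, y):
    """Effective address for LDA [zp],Y with a 32 bit pointer and 32 bit YR."""
    pointer = int.from_bytes(bytes(mem[zp + i] for i in range(4)), "little")
    return (pointer + y) & 0xFFFFFFFF
```

With the bytes $78 $56 $34 $12 stored from address $00 on and YR = $10, the operand is read from $12345688.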

To support relative code, branch opcodes will be augmented with wide (16 bit) and long (32 bit) offsets (when 32 bit option implemented).

For object-oriented programming and other modern programming styles, there will be one (maybe two) additional "base registers", and other (prefix) operations:

Add the value of the base register to the data address after all other processing (prefix to other opcodes)
Add the value of the stack register to the data address after all other processing (prefix to other opcodes)
push base register on stack and replace with value given
pull base register from stack
new jump and jump subroutine addressing modes

This allows for the following scenario: Consider an object-oriented setting. Each object instance contains, in its first address, a pointer to the class definition where the actual method code is stored.

class1	bra method1
	bra method2
	...

object1	.long class1
	.word attribute1	; data value
	.long object2		; object reference
	...
object2	.long class1
	.word attribute1
	.long 0

The base register could contain the address of the current object. When executing a method on object1, then calling a method of object2 could then be implemented like this:

	...			; context is object1 (i.e. base register contains address of object1)
	LDA.L 6,B		; load AC "L"ong (32 bit) from "B"ase address, 
				; 6 is offset of object2 ptr to base address (address of object1)
				; AC now contains base address for object2
	PRB			; push base address, replace with value of AC
				; now base address contains address of object2
	LDY.0 #3		; "method number" times 3 (zero-extend byte immediate value to full register size)
	JSR (0,B),Y		; jump subroutine, Add offset 0 to "B"ase address,
				; read address from there, add YR to get final (16 bit) address to JSR to
				; (similar to indirect-Y, but add base address before indirection)
				; this results in address of method2 on object2

	...			; execute method2 on object2
	RTS			; return subroutine (16 bit)

	PLB			; pull base register from stack
	...			; continue in the context of object1
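
The address arithmetic of this listing can be replayed in Python. This is a sketch under assumptions: pointers are stored as little-endian 32 bit values, each "bra" entry in the class table is taken to be 3 bytes long (which matches the LDY #3 for method2), and the helper names are invented for the example.

```python
BRANCH_SIZE = 3   # assumed size of one "bra" entry in the class table


def read_long(mem, addr):
    """Read a 32 bit little-endian value from a dict-based memory."""
    return int.from_bytes(bytes(mem[addr + i] for i in range(4)), "little")


def write_long(mem, addr, value):
    """Store a 32 bit little-endian value into a dict-based memory."""
    for i, byte in enumerate(value.to_bytes(4, "little")):
        mem[addr + i] = byte


def method_address(mem, base, method_index):
    """Model of JSR (0,B),Y: read the class pointer at base+0, then add YR."""
    class_ptr = read_long(mem, base)
    return class_ptr + method_index * BRANCH_SIZE
```

Loading the object reference at offset 6 of object1 corresponds to `read_long(mem, object1 + 6)`; the result becomes the new base for `method_address`.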

To be clear, normal opcodes still work without the base registers. Adding the base register, or similarly the stack pointer, requires an additional prefix. This increases memory use and the number of cycles, but the cost is offset because shorter addressing modes can be used when addressing relative to the base.

An extra opcode,

LEA addressing_mode ; Load Effective Address

loads AC with the effective address of an addressing mode. It is subject to the same prefixes (register width, base registers) as other opcodes.

No pre-/post-increment or decrement operations are provided, as the IN* opcodes are an effective replacement. An example could be

	LDA.W ($12),Y
	INY #2

INY could not be replaced with INY.W, as the latter would increment the Y register by 1, but using 16 bit width. For the advanced "INY immediate" opcode see below.

Conclusion

When interpreting the zeropage values as kind of registers, the 6502 addressing modes are quite powerful already. Adding a base register, resp. the stack register or the program counter to the address, the addressing modes become even more powerful.

The 68k offset addressing modes are similar to the 6502's zeropage indexed or absolute indexed addressing modes. The new 68020+ indexed addressing modes could be interpreted as similar to the 6502 indirect addressing modes, if the zeropage location is interpreted as a register.

The 65k will add absolute long, as well as long indirect addressing modes (as described above). Branch operations will allow wide and long relative offsets. Addresses can be offset by either a new base register, the stack register or the program counter.

Advanced Opcodes

This section summarizes requirements for new functionality and new opcodes, like

base register - extended addressing mode, see above
time stamp counter - count opcode cycles etc
advanced functionality - LEA, INY immediate, ...

Base Register

The base register is an extension to the existing addressing modes. It needs to be set, and to allow for easy object-oriented programming it also needs to be saved on the stack and restored from it.

TAB - transfer AC into base register
PRB - push old base register on stack, and transfer AC into base register (Push and Replace Base register)
PHB - push base register on stack
PLB - pull base register from stack

Width Handling of Stack Register, Base Register, ...

Upon reset the Stack register is set to $00000100. The existing 6502 opcodes TXS and TSX work on the lowest 8 bit, resulting in 6502 compatible operation. Using width extension prefixes is a natural extension to a wider stack register size.

Upon reset the base register is set to $00000000. Similarly the base register operations are subject to the operation width and width extension prefixes.

The SWP ("SWaP") opcode exchanges high order and low order parts of the operand. It can be applied to AC (implied) or a memory location, in which case it is a read-modify-write operation. The 8-bit version (no prefix) exchanges the high order nibble with the low order nibble in the byte operand. The wide version (16 bit operation width prefix) exchanges the high order byte with the low order byte, and so on.
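
The effect of SWP at the different operation widths can be sketched in Python. The function name `swp` and the `width_bits` parameter are illustrative; the width stands in for the operation width prefix.

```python
def swp(value, width_bits=8):
    """Exchange the high order and low order halves of the operand."""
    half = width_bits // 2
    half_mask = (1 << half) - 1
    value &= (1 << width_bits) - 1           # confine to the operation width
    return ((value & half_mask) << half) | (value >> half)
```

So `swp(0xA5)` gives $5A, `swp(0x1234, 16)` gives $3412, and `swp(0x12345678, 32)` gives $56781234.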

IN?/DE? immediate

To accommodate increments and decrements of the index registers for wide and long operations, the INY, INX, DEY and DEX opcodes get a new variation with an immediate operand, that determines the increment resp. decrement. The opcodes are applicable to the operation width and width extension prefixes.

Jump Subroutine

The jump subroutine gets more powerful addressing modes, to allow for better use in object-oriented programming for example.

JSR (zp),Y - indirect-Y
JSR (abs,X) - X-indirect

These opcodes use 16 bit addresses. To use wider addresses, a new opcode must be defined, plus a corresponding return opcode.

JSRL abs
JSRL abslong
JSRL (zp),Y
JSRL [zp],Y
RTSL

The operands that are not full-width are sign-extended to the full address register width. Note: these could possibly be implemented by applying operation width prefix to the normal JSR and RTS opcodes.

Interrupts, RTI

Interrupts always jump into hypervisor mode (see below). A BRK opcode works similarly (in hardware) to an interrupt; in fact the 6502 implements the interrupt as a BRK opcode. A 6502 interrupt pushes the interrupt location to the stack, then the status register.

However, this stack frame is not sufficient when the address is larger than 16 bit, or the status register is extended to more than 8 bit.

This stack frame can easily be extended, though. The 6502 status register has bit 5 always 1. If the status on the stack has bit 5 cleared, the stack frame can be different:

		byt STATUS		- existing status register, bit 5 cleared
		byt EXTSTATUS		- extended status (to be defined)
		long RETADDR		- long return address (possibly wide - 16 bit - depending on ext. status)

The RTI opcode then does not need to be prefixed depending on address space. Instead it reacts on the data on the stack.
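
How RTI could tell the two frame layouts apart is sketched below in Python. Several details are assumptions because the text leaves them open: the bytes are given in the order RTI pulls them, the long return address is stored little-endian, and the wide (16 bit) variant selected by the extended status is not modelled.

```python
def pull_rti_frame(stack_bytes):
    """Decode an interrupt stack frame; stack_bytes is in pull order."""
    it = iter(stack_bytes)
    status = next(it)
    if status & 0x20:
        # Bit 5 set: classic 6502 frame, status followed by PCL and PCH.
        pcl, pch = next(it), next(it)
        return {"status": status, "ext_status": None, "pc": pcl | (pch << 8)}
    # Bit 5 cleared: extended frame with extended status and long address.
    ext_status = next(it)
    pc = int.from_bytes(bytes(next(it) for _ in range(4)), "little")
    return {"status": status, "ext_status": ext_status, "pc": pc}
```

The same RTI opcode thus handles both cases, driven only by the data found on the stack.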

If the interrupt flag is set, this operation may be trapped as privileged operation.

Branches

The 65k will provide

BRA jump_target

"BRanch Always" relative jump, as well as

		BSR jsr_target
		BSRL jsr_target

"Branch SubRoutine" relative jump to subroutine opcodes. The BSR opcode puts a two-byte return address on the stack, BSRL puts a four-byte return address on the stack.

Operating Modes

The 65k provides two operating modes, user mode and supervisor mode. Within user mode, address space is virtualized, BRK is used to break out from user mode into supervisor mode. Special opcodes are provided to return from BRK resp. jump to interrupt vectors (reset, irq, nmi) in user space. In supervisor mode address space is not virtualized, but directly mapped to physical addresses.

The stack pointer exists in two versions, one for the user mode, one for the supervisor mode, to easily switch between modes.

Supervisor mode environment number is 1 (see address mapping above). The user mode environment number is stored in a separate register.

Stack Pointer

The opcodes TXS and TSX operate on the respective stack pointer, the supervisor stack pointer in supervisor mode, the user mode stack pointer in user mode.

In addition to the existing TXS and TSX operations there will be two more privileged operations to handle the user space stack pointer from supervisor mode:

TUSX - move user space stack pointer to XR
TXUS - move XR to user space stack pointer

Access to user mode

From supervisor mode data must be read from or written to user mode environments. For this purpose some specific privileged prefix opcodes are provided:

TUEX - move user mode environment number to XR
TXUE - move XR to user mode environment number
MEN - next opcode operates on user mode memory environment

Loading data from another - e.g. user mode - environment would look like:

	LDX #2		; environment number
	TXUE		; set user mode environment number from XR
	LDX usp		; load address of environment user stack pointer to XR
	MEN		; next opcode operates on the user mode memory environment
	LDA.W 0,X	; load word data from user mode stack in environment #2

With the operation size prefix for the load operation, automatically multiple bytes could be read from an environment - even across mapping borders.

To jump to user space, the MEN prefix could be used as well:

	LDX usp		; load address of environment user stack pointer to XR
	TXUS
	LDX #2
	TXUE
	...
	MEN
	JMP addr_in_env2

These prefixes do not work on jump subroutine. The prefixed jump operation automatically switches to user mode.

Call to supervisor mode from user mode

To switch from user mode to supervisor mode a new CALL opcode is used. It works similarly to the BRK opcode, but has a different opcode value (to keep compatibility with BRK operation). The byte behind the CALL opcode determines the operation. Program counter is put on the user mode stack before ...

The RTC opcode returns from a CALL operation by switching to user mode, pulling the program counter from the stack and resuming operation.

Memory Interface

The 6502 has an 8 bit wide memory interface. With given clock frequency this limits the maximum memory bandwidth. The bandwidth can only be increased by increasing the memory bus width.

Additionally a cache can be used to improve the memory bandwidth - by not requiring to read some data when it is in the cache already.

Wide Memory bus

Depending on the processor option, the memory bus could be 8, 16, or even 32 bit wide. The processor core on the other hand has internal 8 bit (opcode, small registers), but also 16 or maybe 32 bit reads and writes. Unfortunately the wide reads and writes are not always aligned with the memory width.

An (unaligned) 16 bit write to an odd address on a 16 bit bus may need a 16 bit read of the first address, a modification of its second byte and a write back of the 16 bit word, followed by a read of the second 16 bit address, a modification of its first byte and another write back.

Also an unaligned read of a wide data (or code) fragment could trigger two separate reads. In fact for reading a 16 bit absolute value on an 8 bit bus this is exactly what the 6502 currently does.

On the other hand, a 32 bit read at the PC contains a complete (6502) opcode. It fetches - in a single cycle - all the bytes of the opcode, for which the 6502 would need multiple cycles.

A fully wide bus interface is not the whole solution, though. I/O chips for example usually only have an 8 bit interface. Mapping them on the memory bus would mean that their addresses would have two bytes or even four bytes address difference. This would not allow any kind of old systems emulation, where I/O devices normally have consecutive addresses.

Cache

A cache only makes sense when the memory access is slower than the CPU can handle the data read. On the 6502 a cache does not make much sense - the processor can handle the data as fast as it comes in - it sometimes even needs "bogus" cycles where it does some internal work.

On a 65k the situation might be different. If the core is run at, say 100 MHz, and memory can only deliver at 10 MHz, a cache very well makes sense.

A cache can also help if the bus width is extended from 8 bit to say 32 bit. The wide bus interface allows to read the complete (6502) opcode, plus the following opcode. Discarding this information does not make sense, so instead of reading it again and again, it should be cached.

The data cache (if instruction and data are separate) should provide some kind of "zeropage" flag to give zeropage addresses priority in the cache, thus implementing the zeropage cache mentioned above.

Wikipedia on Cache

Write Pipeline

When the processor writes a wide data element, it needs to separate it into parts in case the data element is misaligned with the memory address. This is one cause for creating multiple cycles on a wide memory access. Another one is when only a part of a memory location is modified. If the memory interface does not provide a way to do that - e.g. by providing select lines per byte - this requires reading the full memory word, modifying the part, and writing the whole word back. So writing a byte back into a word-sized memory could require two cycles.

The memory interface should thus provide per-byte select lines. This can either be used to write only the affected data, or to trigger a special (optimized) read-modify-write memory access cycle, where only the affected data is modified.
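
What a write sequencer with per-byte select lines would have to produce can be sketched in Python. The function name `bus_cycles` and the convention that bit n of the mask selects the byte at the aligned address plus n are assumptions for this example.

```python
def bus_cycles(addr, size, bus_bytes):
    """Split an access of `size` bytes at `addr` into aligned bus cycles.

    Returns a list of (aligned_address, byte_select_mask) tuples, where
    bit n of the mask selects the byte at aligned_address + n.
    """
    cycles = []
    end = addr + size
    while addr < end:
        aligned = addr - (addr % bus_bytes)
        last = min(end, aligned + bus_bytes)     # end of this bus word or of the access
        mask = 0
        for byte in range(addr - aligned, last - aligned):
            mask |= 1 << byte
        cycles.append((aligned, mask))
        addr = last
    return cycles
```

A 16 bit write to the odd address $1001 on a 16 bit bus yields two cycles, one selecting the upper byte at $1000 and one selecting the lower byte at $1002. With select lines no preceding read is needed for either of them.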

If the 65k implements a cache, each write must either invalidate the cache for the relevant memory locations, or overwrite the cached value.

Conclusion

The 65k will have different options for memory bus widths, each bus width requires read- and write-sequencers to break down wider or misaligned memory accesses.

The segment registers of the MMU will provide information about whether the external (physical) memory is native width (depending on processor option), or 8 bit for I/O.

The 65k, when using a bus width larger than 8 bit will cache the wide data read at least for the next access - separately for instruction and data. Optionally a larger cache can be provided (Details to be determined).

The memory interface will provide per-byte select lines.

Mathematics

The 6502 only has simple mathematics operations, ADC and SBC. A multiply would be a great addition, but also operations for checksums, or even SIMD (single instruction, multiple data) operations. Bit manipulation operations have been implemented in the CMOS 6502 for example.

Floating point operations would also be a great addition.

Integer Mathematics

The 65k will provide integer arithmetic ADC and SBC operations at the same widths as available for the other operations. Decimal mode will be supported.

Integer multiply will be supported signed and unsigned with

	LDA.W #1	; operand 1
	MULS.W #2	; operand 2

Result will be in AC. If the MUL operation is defined as byte operation, the result will be wide (16 bit), and so on. If two values of the maximum register width are multiplied (e.g. 32x32 bit multiply with 32 bit registers) the high order half of the result is discarded. The overflow bit will be set if result does not fit in AC, i.e. if the high order half is not the sign extension of the low order half.
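
The result width and overflow rule for the signed multiply can be modelled in Python as follows. The function name `muls` and the `max_bits` parameter (the maximum register width of the processor option) are assumptions made for the sketch.

```python
def muls(a, b, width_bits, max_bits=32):
    """Signed multiply; returns (result_in_ac, overflow_flag)."""

    def signed(value, bits):
        value &= (1 << bits) - 1
        return value - (1 << bits) if value >> (bits - 1) else value

    product = signed(a, width_bits) * signed(b, width_bits)
    # The result is twice as wide as the operands, limited by the register width.
    result_bits = min(2 * width_bits, max_bits)
    result = product & ((1 << result_bits) - 1)
    # Overflow: the discarded high half is not the sign extension of the result.
    overflow = signed(result, result_bits) != product
    return result, overflow
```

A byte multiply of $FF (-1) by $02 gives $FFFE without overflow. A 32x32 bit multiply of $10000 by $10000 gives 0 with the overflow bit set, because the product does not fit into 32 bit.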

Integer division will also be provided in signed and unsigned variants. One variant divides AC by the given operand of the same size, and stores the resulting quotient in AC at the same size. The remainder is discarded. The extended variant uses the AC in twice the size of the operand. A byte division thus takes a 16 bit AC, divides it by an 8 bit operand, then stores the quotient in the lower half of AC, and the remainder in the high order half of AC. Details are to be defined.
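
The extended variant can be sketched for the unsigned case in Python. Since the details are still to be defined, the handling of a quotient that does not fit into the lower half (an exception here) is purely an assumption of this model, as is the name `divu_extended`.

```python
def divu_extended(ac, operand, width_bits=8):
    """Unsigned extended divide: AC is twice as wide as the operand.

    Returns the new AC with the quotient in the lower half and the
    remainder in the high order half.
    """
    mask = (1 << width_bits) - 1
    dividend = ac & ((1 << (2 * width_bits)) - 1)
    divisor = operand & mask
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder = divmod(dividend, divisor)
    if quotient > mask:
        raise OverflowError("quotient does not fit in the lower half")
    return (remainder << width_bits) | quotient
```

Dividing a 16 bit AC of $0107 (263) by the byte operand $10 (16) leaves $0710 in AC: quotient $10 in the lower half, remainder $07 in the upper half.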

Bit Manipulation Operations

The 65C02 provides a number of bit operations:

SMB - set a single memory bit in a zeropage location
RMB - clear a single memory bit in a zeropage location
BBR - branch on zeropage location bit reset
BBS - branch on zeropage location bit set
TSB - test and set memory bit (zeropage and absolute)
TRB - test and reset memory bit (zeropage and absolute)

TSB and TRB use the accumulator as a bit mask. The Z-flag is set according to the bitwise AND of the mask and the value of the memory location given. The memory location is then ORed with the mask and stored back (TSB), or ANDed with the complement of the mask and stored back (TRB). These opcodes implement an atomic test-and-set instruction that could be used for multithreaded synchronization (the 65C02 cannot synchronize multiple CPUs, though).
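The effect of TSB and TRB on memory and on the Z-flag can be written down as a small Python model. The function names and the use of a `bytearray` as memory are assumptions for illustration; `mask` stands for the mask byte the opcode works with.

```python
def tsb(memory, addr, mask):
    """Test and set bits: returns the Z-flag, then sets the masked bits."""
    z = (memory[addr] & mask) == 0      # Z set if none of the bits were set
    memory[addr] |= mask
    return z


def trb(memory, addr, mask):
    """Test and reset bits: returns the Z-flag, then clears the masked bits."""
    z = (memory[addr] & mask) == 0
    memory[addr] &= ~mask & 0xFF
    return z
```

Used as a lock, a thread would own the lock if `tsb` returns with Z set (the bit was clear before), and would release it again with `trb`.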

The SMB and RMB opcodes, as well as BBR and BBS, do not provide more functionality than can be achieved with the normal AND and ORA opcodes, or with BIT and BNE/BEQ, respectively. They are more efficient though, as they do not change (or do not even use) the AC. They work on zeropage locations only.

The 68k has these four operations:

BCHG - change a single bit in the operand
BSET - set a single bit in the operand
BCLR - clear a single bit in the operand
BTST - Test a single bit in the operand

Compared to the 65C02 they are more versatile in the addressing modes (not only zeropage), but handle single bits only. The 68k provides an extra test-and-set operation (TAS) on byte locations, which sets the high order bit (7) only.

To make the opcodes useful, they should provide more addressing modes than the 65C02 versions.

The 65k will provide single bit Change, Set, Clear and Test operations on byte-sized operands, with the "usual" addressing modes. Test-and-set and test-and-clear operations will provide the atomic synchronization operations.

65C02 opcodes, esp. the bit operations

Other Operations

The 65k will provide operations to count one-bits in an operand. The 65k will optionally provide sliced operations, i.e. N independent 8 bit operations within an N*8 bit register.
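Both ideas are easy to express in Python. The sketch below uses the hypothetical names `popcount` and `sliced_add`, and picks addition as the example of a sliced operation; which operations will actually be available in sliced form is not defined yet.

```python
def popcount(value, bits):
    """Count the one-bits in the low `bits` bits of value."""
    return bin(value & ((1 << bits) - 1)).count("1")


def sliced_add(a, b, slices):
    """Add `slices` independent 8 bit lanes; no carry crosses a lane."""
    result = 0
    for i in range(slices):
        lane_a = (a >> (8 * i)) & 0xFF
        lane_b = (b >> (8 * i)) & 0xFF
        result |= ((lane_a + lane_b) & 0xFF) << (8 * i)
    return result
```

Note how in `sliced_add(0x01FF, 0x0101, 2)` the low lane wraps around from $FF to $00 without carrying into the high lane.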

Floating Point Operations

At this point in time no floating point operations are planned. A prefix opcode will be reserved for floating point operations.

Vector and Block Operations

The 65816 already has block move operations, and the 68k has operations to move multiple data elements. For the CS/A computer I am working on a block transfer engine as well.

65816 MVN/MVP Operations

The MVN and MVP operations use the X and Y registers as source and destination start addresses, and two bytes following the opcode as their respective bank numbers. The 16 bit AC contains one less than the number of bytes to transfer (i.e. AC=$ffff transfers $10000 bytes).

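There is a "negative" MVN and a "positive" MVP operation. Despite what the names suggest, MVN increases both addresses after each transfer, while MVP decreases them (X and Y then hold the end addresses of the two areas). MVN is meant for moving a block to a lower address, MVP for moving it to a higher address. Having both makes it possible to transfer overlapping memory areas correctly.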

As there are bank bytes in the opcode, data can either be transferred within a bank, or from one bank to another. However, neither the source nor the destination memory area may span a bank boundary.
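Why two directions are needed for overlapping areas can be seen in a small Python model of the block move. The function `block_move` and its `step` parameter (+1 for the incrementing, -1 for the decrementing variant) are assumptions for this sketch; memory is simply a `bytearray` covering two banks.

```python
def block_move(memory, src_bank, dst_bank, x, y, a, step):
    """Model of a 65816 style block move.

    x, y  - 16 bit source and destination addresses within their banks
    a     - number of bytes to transfer minus one
    step  - +1 for the incrementing, -1 for the decrementing variant
    Returns the final (x, y). Addresses wrap within their bank.
    """
    count = (a & 0xFFFF) + 1
    for _ in range(count):
        memory[(dst_bank << 16) | y] = memory[(src_bank << 16) | x]
        x = (x + step) & 0xFFFF
        y = (y + step) & 0xFFFF
    return x, y


# Move 4 bytes one position upwards, so that source and destination overlap.
mem = bytearray(0x20000)
mem[0x10:0x14] = b"ABCD"
block_move(mem, 0, 0, 0x13, 0x14, 3, -1)     # start at the end, count down
good = bytes(mem[0x11:0x15])

mem[0x10:0x15] = b"ABCD\0"
block_move(mem, 0, 0, 0x10, 0x11, 3, +1)     # wrong direction for this case
bad = bytes(mem[0x11:0x15])
```

Here `good` holds the intact block, whereas in `bad` the first byte has been smeared over the whole destination, because each byte was overwritten before it was read.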

68k MOVEM, MOVE16 Operations

The 68k does not have a "bulk" data movement operation. There is, however, the MOVEM opcode, which transfers multiple registers to memory and back. This is helpful for interrupt routines, for example.

The 68040 extends this with the MOVE16 opcode. It can transfer a cache line of 16 bytes. Source or target cache lines must be aligned at 16 byte address boundaries - postincrement addressing however starts at the given (maybe unaligned) address, then wraps around at the cache line boundary.

Thus the data transfer always covers a full 16 consecutive bytes, and the data has to be aligned.

CS/A Blitter Operations

The CS/A blitter (block transfer engine) allows setting a full source address, a full destination address, and a counter of up to 256 bytes. Additionally an 8 bit increment can be set for each of the source and destination addresses. Each address can either increment or decrement independently of the other address (so a swap copy is possible). It works byte-wide though.
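In Python the transfer loop of such an engine could be sketched as follows. The function `csa_blit` and its parameter names are hypothetical and do not describe the actual register interface of the CS/A blitter; signed step values stand in for the separate increment and direction settings.

```python
def csa_blit(memory, src, dst, count, src_step=1, dst_step=1):
    """Byte-wide block transfer with independent address steps.

    count              - number of bytes, 1 up to 256
    src_step, dst_step - signed step added to each address after a transfer
    """
    if not 1 <= count <= 256:
        raise ValueError("count must be between 1 and 256")
    for _ in range(count):
        memory[dst] = memory[src]
        src += src_step
        dst += dst_step


mem = bytearray(64)
mem[0:4] = b"ABCD"
csa_blit(mem, 0, 19, 4, src_step=1, dst_step=-1)   # swap copy (reversed)
csa_blit(mem, 0, 32, 4, src_step=1, dst_step=2)    # interleaved copy
```

After the two calls, `mem[16:20]` contains the bytes in reversed order, and every second byte starting at address 32 contains the original sequence.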

Graphics Blitter Operations

The original "blitter" name derives from "Bit BLIT", which stands for "bit-block image transfer" (see Wikipedia link). It is "a computer graphics operation in which several bitmaps are combined into one using a raster operator". This means that during the transfer of data from one location to another a "raster operator" - basically a boolean formula - is used to manipulate the data. This could be to AND a source value with the destination, OR it, invert it or even XOR it. When using more than one source, a mask could be applied as well.
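A raster operator with an optional mask can be sketched in Python like this. The function `bitblt` and its signature are assumptions for illustration; source, destination and mask are byte sequences of equal length.

```python
import operator


def bitblt(dst, src, raster_op, mask=None):
    """Combine src into dst byte by byte using a boolean raster operator.

    raster_op - function of (source byte, destination byte)
    mask      - optional; only bits set in the mask take the new value
    Returns the new destination as a bytearray.
    """
    out = bytearray(len(dst))
    for i, (d, s) in enumerate(zip(dst, src)):
        value = raster_op(s, d) & 0xFF
        if mask is not None:
            value = (value & mask[i]) | (d & ~mask[i] & 0xFF)
        out[i] = value
    return out


xored = bitblt(bytes([0b1100]), bytes([0b1010]), operator.xor)
masked = bitblt(bytes([0x00]), bytes([0xFF]), operator.or_, mask=bytes([0xF0]))
```

The standard `operator.and_`, `operator.or_` and `operator.xor` functions cover the simple cases; an inverting copy could be passed as `lambda s, d: ~s`.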

However, "modern graphics hardware and software has almost completely replaced bitwise operations with mathematical operations such as alpha compositing". I.e. a blitter is not used for graphics anymore these days.

wikipedia on Bit blit

Comparison

Here is a feature comparison of the approaches mentioned above:

	65816	68k MOVE16	CS/A Blitter	Graphics Blitter
Transfer size	+ (1 byte up to 64k)	- (fixed 16 byte)	0 (1 up to 256 byte)	+ (flexible in general)
Source and Destination	0 (only within a bank)	- (aligned at 16 byte boundary)	+ (any address)	+ (any address)
increment/decrement	0 (only either increment or decrement both addresses by 1)	- (only postincrement)	+ (flexible increments of -256 to +256 interleave per address)	+ (flexible increments per address)
Speed, cycles	0 (7 cycles per byte transfer)	+ (burst read of cache line)	+ (2 cycles per byte transfer, which is maximum for a byte-wide bus)	depending on operation and implementation
Operations	- (None)	- (None)	- (None)	+ (boolean operations, even including more than two inputs)

Conclusion

The 65k will have options for bus interfaces of different widths: one byte, two bytes, or even more. Using an interleaved approach will cost performance, as each memory cycle is only partially used (e.g. one byte per 16 bit memory read), but interesting effects could be implemented with it.

The implementation effort for graphics operations could be quite high, and the benefits at this time are not really clear.

The 65k will implement two block move operations similar to the 65816 versions, but allowing for the X and Y registers to hold full addresses, and AC holding the number of bytes (maybe -1) to transfer.

The MVN and MVP operations may be interruptible (to improve interrupt latency). Carry could be set to signal an interruption, so that a loop like

dotrans	MVP
	BCS dotrans

would ensure that all bytes were transferred. A prefix byte opcode will be reserved for blitter operations implemented as a future extension.
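How such an interruptible block move could behave is sketched in the following Python model. Everything here is an assumption for illustration: the register dictionary, AC holding the plain number of remaining bytes (not minus one), incrementing addresses, and the `budget` parameter that stands in for an interrupt arriving after a number of transfers.

```python
def block_move(cpu, memory, budget):
    """Transfer bytes until AC reaches zero or `budget` bytes were moved.

    Sets and returns carry: True means the move was interrupted and bytes
    remain, with X, Y and AC already updated for the restart.
    """
    moved = 0
    while cpu["AC"] > 0 and moved < budget:
        memory[cpu["Y"]] = memory[cpu["X"]]
        cpu["X"] += 1
        cpu["Y"] += 1
        cpu["AC"] -= 1
        moved += 1
    cpu["C"] = cpu["AC"] > 0
    return cpu["C"]


def move_all(cpu, memory, budget):
    """Equivalent of the restart loop: repeat while carry is set."""
    restarts = 0
    while block_move(cpu, memory, budget):
        restarts += 1
    return restarts


mem = bytearray(range(32)) + bytearray(32)
cpu = {"X": 0, "Y": 32, "AC": 10, "C": False}
restarts = move_all(cpu, mem, budget=4)
```

The important point is that the registers always describe the remaining work, so the opcode can simply be executed again after the interrupt routine has returned.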

Effective Address Register

Several processors have an instruction to load the effective address of an addressing mode into a register. The 68k for example uses it to calculate a - possibly complex - address and reuse it in several consecutive opcodes. The same is true for the x86 architecture.

Such a register would also make it possible to reduce the number of instruction variations. One would just load the effective address, and each operation like ADC, ROL etc. would only need one addressing mode, namely the one using the EA register.

On the 6502, however, address computations are not necessarily as complex as on other processors. One could still imagine though that code like this

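	LEA (zp),y	; compute the effective address once, keep it in E
	LDA #12
	ADC (E)		; add the byte at the effective address
	STA (E)		; store the result back to the same address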

could be more efficient.
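What the sequence above does can be followed in a small Python model. It assumes 16 bit addresses, a clear carry before the ADC, and the hypothetical helper names `lea_indirect_y` and `add_to_memory`; the point is only that the address is computed once and then used twice.

```python
def lea_indirect_y(memory, zp, y):
    """Effective address of (zp),y: 16 bit pointer from zeropage plus Y."""
    base = memory[zp] | (memory[(zp + 1) & 0xFF] << 8)
    return (base + y) & 0xFFFF


def add_to_memory(memory, zp, y, value):
    """Model of LEA (zp),y / LDA #value / ADC (E) / STA (E)."""
    e = lea_indirect_y(memory, zp, y)   # LEA (zp),y
    ac = value                          # LDA #value
    ac = (ac + memory[e]) & 0xFF        # ADC (E), carry assumed clear
    memory[e] = ac                      # STA (E)
    return e


mem = bytearray(0x10000)
mem[0x20], mem[0x21] = 0x00, 0x30       # pointer $3000 in zeropage
mem[0x3005] = 30
e = add_to_memory(mem, 0x20, 5, 12)
```

Without the E register, the pointer fetch and the index addition would be repeated for both the ADC and the STA.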

Also this register could help adding new instructions that would only need a single addressing mode - namely "indirect-effective-address".

Load effective address into AC

The LEA opcodes of the other processors load the effective address into a general purpose register (or, on the 68k, into one of its 8 address registers). The only general purpose register the 6502 has is the accumulator.

One option would therefore be to provide an instruction to load the effective address into AC. Other opcodes could get a new addressing mode like "indirect A", that would take the address from AC and use it as memory address.

The advantage of this approach is that it does not require an additional register. The disadvantage is that using the Accumulator as address register severely limits its usability, as most relevant opcodes need and/or modify the accumulator.

Load effective address into new register

The LEA operation could load the address into a specific effective address register. This register could then be used in an indirect-E "(E)" addressing mode.

The disadvantage of this approach is that it does require an additional register. With this register come additional operations to push/pull or transfer the register. The advantage is that with an extra register the value is reusable across multiple operations, which greatly improves its value.

Conclusion

The 65k will provide an "effective address" register "E", with "LEA" opcodes to load an effective address from the usual addressing modes. Push/Pull/Increment/Decrement operations will be provided. Standard operations will also get a new addressing mode, "indirect-E", that takes the effective address for the operand from the E register.


Last modified: 2010-12-31