The 65k Project - Architecture Overview

This page gives an overview of the 65k architecture.

News:

2010-10-23 Published this page
2010-10-14 Started this page

Table of contents

Preface
General Considerations
Core Analysis
- Addressing Mode Analysis
- Opcode Data Path Analysis

Preface

License

This content is licensed under the Creative Commons Attribution Share-Alike license, CC-BY-SA version 3.0.

Note this "content" includes this web page, but does not include the 6502.org header and the left and right web page columns. Click on the "Maximize" link to see the contents covered by this license.

Disclaimer

The content comes with no warranty at all! There is no guarantee and no promise that this specification is correct, consistent, will actually work, or will ever be implemented at all.

To my understanding the techniques described here have been used by various processors for decades already. Still there is no guarantee that a processor according to this spec would not be covered by some patents.

Subject to change without notice!

Contributors

André Fachat - initial author: 8bit Homepage

Changes

This section describes the changes to the document:

Date	Author	Changes
2010-10-25	André Fachat	Updated Core Architecture Diagram and Description

General Considerations

This section describes general considerations for the processor design.

Read/Write Sequencers

The processor has one problem: addresses are at byte level, but bus and register widths are in general more than one byte wide, so misaligned accesses can happen:

Access Type	Bus width	Alignment	Comment
Byte data read	any width	yes/no	The data can be read at any width from the bus; the relevant byte is picked from it and passed to the core
Byte data write	byte width	automatic	Just write the byte
Byte data write	word or larger width	yes/no	If the data cannot be written at byte width but must use a larger width, the original data must be read, the relevant byte modified, and the data word written back
Word or larger data read/write access	byte width	automatic	The byte accesses can simply be executed one after the other
Word data read/write access	word width	aligned	Just do the access
Word data read/write access	word width	misaligned	The word access must be broken up into two byte accesses, each executed as a byte access on a word bus as described above
...	...	...	...

The table only shows a subset of the possible combinations. To resolve this problem, dedicated components are used: read and write sequencers. These components take a read or write request from the core and break it up into accesses that the external bus can execute.
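
As a rough illustration of what such a sequencer does, the following Python sketch breaks one request from the core into accesses that each stay within one bus-aligned word. The function name `split_access` and the tuple layout are made up for this example and are not part of the 65k specification.

```python
def split_access(addr, size, bus_width):
    """Break one core request into bus-aligned accesses.

    Returns a list of (aligned_address, first_lane, byte_count) tuples,
    one per external bus cycle. All names here are hypothetical.
    """
    accesses = []
    end = addr + size
    while addr < end:
        aligned = addr - (addr % bus_width)        # start of the bus word
        lane = addr - aligned                      # byte offset inside that word
        count = min(bus_width - lane, end - addr)  # bytes served by this cycle
        accesses.append((aligned, lane, count))
        addr += count
    return accesses


# A misaligned word access on a word-wide (2-byte) bus becomes two byte accesses:
misaligned = split_access(0x1001, 2, 2)
# misaligned == [(0x1000, 1, 1), (0x1002, 0, 1)]
```

For writes on a bus that cannot write single bytes, each partial access would additionally need the read-modify-write handling from the table above; that is left out of the sketch.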

If some address area is accessed as byte-wide I/O, and other parts of the address area as word-wide or wider memory, the processor must have multiple read and write sequencers, one per address width (or a sequencer that can handle multiple widths).

Misaligned accesses slow the processor down from its optimum speed. In general, however, a wider bus is still faster than a narrower one.
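
To put numbers on this, here is a minimal sketch that counts the bus accesses needed to read a given byte range. It assumes a simple model in which each bus cycle transfers exactly one aligned bus word; `bus_cycles` is a hypothetical helper, not something defined by the 65k.

```python
def bus_cycles(addr, size, bus_width):
    """Number of aligned bus words touched by bytes addr .. addr+size-1."""
    first_word = addr // bus_width
    last_word = (addr + size - 1) // bus_width
    return last_word - first_word + 1


# A misaligned 4-byte read at 0x1001 on 1-, 2- and 4-byte-wide buses:
misaligned_counts = [bus_cycles(0x1001, 4, width) for width in (1, 2, 4)]
# misaligned_counts == [4, 3, 2]

# The same read, aligned at 0x1000:
aligned_counts = [bus_cycles(0x1000, 4, width) for width in (1, 2, 4)]
# aligned_counts == [4, 2, 1]
```

In this model the misaligned access costs one extra cycle on the wider buses, but the widest bus still needs the fewest cycles.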

Simultaneous Multithreading

The 6502 - as well as the 65k - is very efficient concerning bus read/write cycles per opcode. If, however, a misaligned wide access is broken down into two or more smaller memory accesses, the core has to wait.

In this case the core could switch to a separate set of registers and execute code for a second processor - what is nowadays called SMT, simultaneous multithreading.

But as the processor is very memory access efficient, a second thread may not have time for many memory accesses on its own.

In any case, this is a topic for a later version.

Pipelining

Pipelining is a processor technique that divides the execution of an opcode into different stages such as fetch, decode, execute and store. Modern processors have pipelines of up to 31 such stages (Pentium 4, Pentium D).

An advantage of pipelining is that more than one opcode can be executed in parallel. The first opcode could be writing back data while the next one is executing, the one after that is being decoded, and the last one is being fetched. This way more functional units (fetch, decode, ...) are in use at any time, making the system more efficient. Even if each opcode requires more than one cycle, each cycle can start - and finish - an opcode, making the processor faster in terms of opcodes per cycle.
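
The gain can be estimated with a little arithmetic. The sketch below assumes an idealised pipeline in which every stage takes one cycle and there are no stalls or branches; both function names are invented for this example.

```python
def cycles_unpipelined(opcodes, stages):
    """Each opcode passes through all stages before the next one starts."""
    return opcodes * stages


def cycles_pipelined(opcodes, stages):
    """The first opcode fills the pipeline, then one opcode finishes per cycle."""
    if opcodes == 0:
        return 0
    return stages + opcodes - 1


# 100 opcodes on a four-stage pipeline (fetch, decode, execute, store):
without_pipeline = cycles_unpipelined(100, 4)   # 400 cycles
with_pipeline = cycles_pipelined(100, 4)        # 103 cycles
```

For long opcode sequences the pipelined count approaches one cycle per opcode, which is the effect described above.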

A disadvantage is that a branch can invalidate all the work that has already been done on the opcodes following it. Therefore branch prediction techniques have been developed to reduce the cost of pipeline invalidation due to branches.

The 6502 already has a limited form of pipelining. The last cycle of any opcode is actually the fetch of the next opcode. That is also the reason why the 6502 is little-endian: the processor fetches two-byte operands with the low byte first. After the first fetch an index register is added to the low byte while the high byte is being fetched. Then in the next cycle the carry is added to the high byte. In fact there is an optimization that eliminates this last cycle when no carry needs to be added to the high byte. Here too branches lead to problems: when a branch is taken, the "official" end of the opcode is not reached, and interrupt handling is suspended until the end of the next opcode.
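
The low-byte-first trick can be sketched as follows, using the 6502's absolute indexed read (for example LDA abs,X, which takes 4 cycles plus one more if a page boundary is crossed). The function name and the way the operand is passed in as two bytes are assumptions made for this illustration.

```python
def abs_indexed_read(lo, hi, index):
    """Model a 6502 absolute indexed read; returns (address, cycles).

    lo, hi: the two operand bytes in fetch order (low byte first)
    index:  8-bit index register value
    """
    low_sum = lo + index             # added while the high byte is being fetched
    carry = low_sum >> 8             # 1 if the add overflowed the low byte
    address = (((hi + carry) & 0xFF) << 8) | (low_sum & 0xFF)
    cycles = 4 + carry               # extra cycle only when the high byte needs fixing
    return address, cycles


no_carry = abs_indexed_read(0x80, 0x12, 0x10)    # (0x1290, 4)
with_carry = abs_indexed_read(0xF0, 0x12, 0x20)  # (0x1310, 5)
```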

In the current version, the 65k architecture will implement pipelining similar to that of the 6502.

Core Analysis

This section analyses the requirements for the core architecture.

Addressing Mode Analysis

To define the necessary data paths in the core between registers, ALU and other components, the addressing modes are analysed here. As the internal register and data path widths are always full width, arithmetic operations (adds) do not need to be broken up into smaller chunks. The example used here is loading a value into the accumulator. Each sequence starts with the program counter on the address bus and the opcode parameter on the data bus.

Note that reading a register and taking a new value into the same register can take place in the same clock cycle if registers are assumed (as opposed to transparent latches as in the original 6502).

Also note that the initial parameter arrives as an opcode parameter, and thus from a different input bus, which has to be taken into account in the core design.

Immediate

The immediate addressing mode is easy...

Step	Transfer	Description
1	Data bus -> AC	The fetched opcode parameter value on the data bus is transferred into the register

Zeropage and Absolute

The zeropage and absolute addressing modes - including the new long and quad modes - have one indirection: the opcode parameter is an address used to fetch the actual value.

Step	Transfer	Description
1	Data bus -> data bus input reg.	The fetched opcode parameter value on the data bus is transferred into the data bus input register (Note 1)
2	Data bus input reg. -> address bus	The data bus input register is put onto the address bus
2	data bus -> AC	The value read from the data bus is taken into the register

Zeropage and Absolute Indexed

In this addressing mode an index register value is added to the address before the actual value is read.

Step	Transfer	Description
1	Data bus -> data bus input reg.	The fetched opcode parameter value on the data bus is transferred into the data bus input register
2	Data bus input reg. -> ALU A	The data bus input register value is put to ALU input A (Note 3)
	index register -> ALU B	The index register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register
Step 3 is optional; it is only taken if B, S or PC is added (prefix OF bits)
3	temp -> ALU A	The temporary register value is put to ALU input A
	B, S or PC register -> ALU B	The B, S or PC register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register (Note 2)

4	temp -> address bus	The temporary register is put onto the address bus
4	data bus -> AC	The value read from the data bus is taken into the register (via Pass5)
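
The steps of the indexed load can be followed in a small Python model. This is only a sketch: the dict-based `memory`, the 16-bit `mask` and the `offset` parameter (standing in for the optional B, S or PC value) are assumptions for illustration, not part of the specification.

```python
def load_indexed(memory, operand, index, offset=None, mask=0xFFFF):
    """Walk through the register transfers of an indexed load.

    Returns (value loaded into AC, effective address).
    """
    dbin = operand                     # step 1: data bus -> data bus input reg.
    temp = (dbin + index) & mask       # step 2: ALU A + ALU B -> temp
    if offset is not None:             # step 3: only with a B, S or PC offset
        temp = (temp + offset) & mask
    address = temp                     # step 4: temp -> address bus
    ac = memory.get(address, 0)        #         data bus -> AC
    return ac, address


plain = load_indexed({0x1239: 0x42}, 0x1234, 5)                    # (0x42, 0x1239)
with_offset = load_indexed({0x1339: 7}, 0x1234, 5, offset=0x100)   # (7, 0x1339)
```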

Zeropage and Absolute Indexed Indirect

This is an extension of the previous addressing mode. The value read using the addressing mode above is interpreted as the address from which the actual value is read.

Step	Transfer	Description
1	Data bus -> data bus input reg.	The fetched opcode parameter value on the data bus is transferred into the data bus input register
2	Data bus input reg. -> ALU A	The data bus input register value is put to ALU input A
	index register -> ALU B	The index register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register
Step 3 is optional; it is only taken if B, S or PC is added (prefix OF bits)
3	temp -> ALU A	The temporary register value is put to ALU input A
	B, S or PC register -> ALU B	The B, S or PC register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register (Note 2)

4	temp -> address bus	The temporary register is put onto the address bus
4	data bus -> temp	The value read from the data bus is taken into the temp register
5	temp -> address bus	The temporary register is put onto the address bus
5	data bus -> AC	The value read from the data bus is taken into the register (via Pass5)

Zeropage and Absolute Indirect Indexed

Step	Transfer	Description
1	Data bus -> data bus input reg.	The fetched opcode parameter value on the data bus is transferred into the data bus input register
Step 2 is optional; it is only taken if B, S or PC is added (prefix OF bits)
2	data bus input reg. -> ALU A	The data bus input register value is put to ALU input A
	B, S or PC register -> ALU B	The B, S or PC register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register (Note 2)

3	temp/data bus input register -> address bus	The temporary register (resp. the data bus input register if step 2 is not taken) is put onto the address bus (ALU passthrough)
3	data bus -> temp	The value read from the data bus is taken into the temp register
4	temp -> ALU A	The temporary register value is put to ALU input A
	index register -> ALU B	The index register value is put to ALU input B
	ALU out -> temp	The ALU output is written to the temp register
5	temp -> address bus	The temporary register is put onto the address bus
5	data bus -> AC	The value read from the data bus is taken into the register
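
The two indirect modes differ only in where the index register is added: before the pointer is read (indexed indirect) or after it (indirect indexed). The following sketch contrasts them. It assumes a byte-addressed dict as memory, two-byte little-endian pointers and a 16-bit address mask, and it leaves out the optional B, S or PC offset step; all names are hypothetical.

```python
def read_pointer(memory, address, size=2):
    """Read a little-endian pointer of `size` bytes (modelled as one step)."""
    return sum(memory.get(address + i, 0) << (8 * i) for i in range(size))


def load_indexed_indirect(memory, operand, index, mask=0xFFFF):
    """Index first, then follow the pointer."""
    temp = (operand + index) & mask      # ALU: data bus input reg. + index -> temp
    temp = read_pointer(memory, temp)    # pointer read into temp
    return memory.get(temp, 0)           # final read into AC


def load_indirect_indexed(memory, operand, index, mask=0xFFFF):
    """Follow the pointer first, then index."""
    temp = read_pointer(memory, operand) # pointer read into temp
    temp = (temp + index) & mask         # ALU: temp + index -> temp
    return memory.get(temp, 0)           # final read into AC


memory = {
    0x20: 0x00, 0x21: 0x30,   # pointer 0x3000 at address 0x20
    0x24: 0x00, 0x25: 0x40,   # pointer 0x4000 at address 0x24
    0x3004: 0xAA,
    0x4000: 0xBB,
}
pre_indexed = load_indexed_indirect(memory, 0x20, 4)    # reads via 0x24 -> 0x4000
post_indexed = load_indirect_indexed(memory, 0x20, 4)   # reads via 0x20 -> 0x3000 + 4
```

With the same operand and index, the first call returns 0xBB and the second 0xAA.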

Relative

The relative addressing mode is used for jumps only.

Step	Transfer	Description
1	Data bus -> data bus input reg.	The fetched opcode parameter value on the data bus is transferred into the data bus input register
2	data bus input reg. -> ALU A	The data bus input register value is put to ALU input A
	PC -> ALU B	The Program counter value is put to ALU input B
	ALU out -> PC	The ALU output is written to the Program counter
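
As a small worked example of the add in step 2, the sketch below computes a branch target. It assumes an 8-bit signed offset as on the 6502 and a 16-bit program counter that already points to the next opcode; the 65k may well allow wider offsets, so treat the function and its parameters as illustrative only.

```python
def branch_target(pc, offset_byte, mask=0xFFFF):
    """Add a sign-extended 8-bit offset to the program counter."""
    if offset_byte & 0x80:
        offset = offset_byte - 0x100   # negative offset (two's complement)
    else:
        offset = offset_byte
    return (pc + offset) & mask


backward = branch_target(0x1002, 0xFE)   # 0x1000
forward = branch_target(0x1002, 0x10)    # 0x1012
```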

Notes

Note 1: The value written to the temp register could be written directly to the address bus output register and be put on the address bus in the next step, eliminating the need for the temp register here.
Note 2: The ALU output value could be written directly to the address bus. This, however, would add the ALU processing time to the setup time for the address bus, limiting the possible clock speeds.
Note 3: As the opcode operand is always read at full width before the register value is actually added, the high byte add cannot be optimized away. The processor always works as if there were a carry - and thus a 16 bit 65k with an 8 bit data bus will actually be one cycle slower than the 6502 if there is no carry.

Opcode Data Path Analysis

This section analyses the different types of opcodes.

Load/Store

Opcodes: LDA, LDX, LDY, STA, STX, STY, STZ

Load opcodes load a register with a value from memory. Store opcodes write data to a memory location. These opcodes work the same way as the bare addressing modes described above. The only difference is that for stores the register value is put on the bus in addition to the address value.

Note: this poses a problem for the zeropage and absolute addressing modes: the value read during parameter fetch must be available as the address on the data address bus while, at the same time, the register value has to be put onto the data bus. This has to be considered in the core design.

Load Effective Address

The LEA opcode can work similarly to the load opcodes above, i.e. by basically transferring the value of the address output register into E. A cycle can be saved, however, if the address value is stored directly in E instead of in the address output register. Which variant is used can be decided depending on implementation details.

Arithmetic Operations

Opcodes: ADC, SBC, CMP, ORA, AND, EOR, CPX, CPY, TSB, TRB, BIT, ADS, ADE, ADB, SBS, SBE, SBB

These opcodes read an operand value, process it in the ALU, then store the result in the AC (together with the relevant status bits). That is, in the last addressing mode cycle as described above, the value is read and stored in the temp register. In the following cycle, the temp register and the AC are put on the ALU A and B inputs respectively. The ALU output is passed on to the AC, which takes the value over at the end of the cycle. In this very cycle the next opcode can be read using the opcode fetch circuitry - as on the 6502.
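
What the ALU cycle produces can be sketched for ADC. The example models binary-mode addition only (no decimal mode) with the N, V, Z and C status bits as known from the 6502; the `width` parameter and the dict of flags are assumptions made for this illustration.

```python
def adc(ac, operand, carry_in, width=8):
    """Add with carry; returns (new AC value, status flags)."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    total = ac + operand + carry_in          # ALU A + ALU B + carry
    result = total & mask                    # value taken over by AC
    flags = {
        "C": total > mask,                   # carry out of the top bit
        "Z": result == 0,
        "N": bool(result & sign),
        # overflow: both inputs have the same sign, the result a different one
        "V": bool(~(ac ^ operand) & (ac ^ result) & sign),
    }
    return result, flags


result, flags = adc(0x50, 0x50, 0)
# result == 0xA0, with N and V set and C and Z clear
```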

Read-Modify-Write Operations

Opcodes: DEC, DEX, DEY, INC, INX, INY, ROL, ROR, ASL, LSR, SWP, BCN

For the accumulator addressing mode opcodes (ROL A, ROR A, LSR A, ASL A, SWP A, BCN A), it is simple. After opcode fetch and decode, the AC is given to the ALU and the result is transferred back to the AC at the end of the cycle.

For the other opcodes, the value is read into the temp register during the load cycle. During the second cycle the ALU performs the operation and stores the result in the data bus output driver, but no valid write is performed yet. During the third cycle the data is actually written. In fact, the next opcode fetch may already be performed during the second cycle (Note 1).
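
The three cycles can be traced with ASL on a memory location as an example. The sketch assumes a dict as memory and a configurable operand `width`; the variable names mirror the registers mentioned above but are otherwise invented.

```python
def asl_memory(memory, address, width=8):
    """Shift a memory location left by one bit; returns the carry out."""
    mask = (1 << width) - 1
    temp = memory[address]                    # cycle 1: load the value into temp
    carry = bool((temp >> (width - 1)) & 1)   # top bit is shifted into carry
    data_out = (temp << 1) & mask             # cycle 2: ALU result -> data bus output driver
    memory[address] = data_out                # cycle 3: the actual write
    return carry


memory = {0x20: 0x81}
carry = asl_memory(memory, 0x20)
# memory[0x20] is now 0x02 and carry is True
```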

Register Transfer Operations

Opcodes: TAX, TXA, TAY, TYA, TXS, TSX, TPA, TSY, TYS, TEA, TAE, TBA, TAB

These opcodes are simple. In the cycle after the opcode fetch the source register value is put on the internal bus, transferred to the register input bus (using the pass gates), and stored in the target register at the end of the cycle. During this transfer cycle the next opcode can be fetched.

Register Swap Operations

Opcodes: SAB, SAX, SAY, SXY, SAE, SAB

These opcodes are more complicated. In the first cycle, AC is transferred into the temp register. In the second cycle the other (S/X/Y) register is transferred into AC. In the third cycle then the temp register is stored into the other register. (Note 2)
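
The three cycles of a swap map directly onto three assignments. In this sketch the register file is a plain dict and `swap_ac` is a made-up name; it only illustrates the order of the transfers.

```python
def swap_ac(regs, other):
    """Swap AC with another register via the temp register."""
    temp = regs["AC"]            # cycle 1: AC -> temp
    regs["AC"] = regs[other]     # cycle 2: other register -> AC
    regs[other] = temp           # cycle 3: temp -> other register
    return regs


regs = swap_ac({"AC": 0x11, "X": 0x22}, "X")
# regs == {"AC": 0x22, "X": 0x11}
```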

Status Register Operations

Opcodes: SEC, CLC, SED, CLD, SEI, CLI, CLV

TODO

Clear Operations

Opcodes: CLY, CLX, CLA

TODO

Stack Operations

Opcodes: PLA, PHA, PLX, PHX, PLY, PHY, PLB, PHB, PLE, PHE, PRB

TODO

Jump Operations

Opcodes: JMP, JPU

TODO

Jump Subroutine Operations

Opcodes: JSR, BSR

TODO

Return Subroutine Operations

Opcodes: RTS, RTI, RTU

TODO

Branch Operations

Opcodes: BNE, BEQ, BPL, BMI, BVS, BVC, BCC, BCS, BRA

TODO

Move Operations

Opcodes: MVN, MVP, MVNTU, MVNFU, MVPTU, MVPFU

TODO

Fill Operations

Opcodes: FILU

TODO

Quick Operations

Opcodes: DEC, DEY, DEX, INC, INY, INX, ROL, ROR, ASL, LSR, INE, DEE, INB, DEB - quick addressing modes

TODO

Control Register Operations I

Opcodes: LCR, SCR, BCR

TODO

Control Register Operations II

Opcodes: SENV, SMMU

TODO

Memory Control Operations

Opcodes: SCA, LLA, WMB, RMB

TODO

Notes

Note 1: The ALU output value could be written directly to the data bus. This, however, would add the ALU processing time to the setup time for the data bus, limiting the possible clock speeds.
Note 2: If the ALU does not provide a pass-through mode, then a pass gate from the temp register output to the register input bus is required.

Last modified: 2012-04-11