The 65k Project - Architecture Overview
(C) 2010-2011 André Fachat
This page gives an overview of the 65k architecture.
License
This content is licensed under the Creative Commons Attribution Share-Alike license, CC-BY-SA version 3.0.
Note this "content" includes this web page, but does not include the 6502.org header and the left and right web page columns.
Disclaimer
The content comes with no warranty at all! There is no guarantee and no promise that this specification is correct, consistent, will actually work, or will ever be implemented at all.
To my understanding the techniques described here have been used by various processors for decades already. Still there is no guarantee that a processor according to this spec would not be covered by some patents.
Subject to change without notice!
Contributors
- André Fachat - initial author: 8bit Homepage
Changes
This section describes the changes to the document:
Date | Author | Changes |
---|---|---|
2010-10-25 | André Fachat | Updated Core Architecture Diagram and Description |
This section describes general considerations for the processor design.
Read/Write Sequencers
The processor has one problem: addresses are at byte level, while bus and register widths are in general more than one byte wide, so misaligned accesses can happen:
Access Type | Bus width | Alignment | Comment |
---|---|---|---|
Byte data read | any width | yes/no | The data can be read from the bus in any width; the relevant byte is picked from it and passed to the core |
Byte data write | byte width | automatic | Just write the byte |
Byte data write | word or larger width | yes/no | If the data cannot be written in byte width but must use a larger width, the original data word must be read, the relevant byte modified, and the data word written back |
Word or larger data read/write access | byte width | automatic | The byte accesses can simply be executed one after the other |
Word data read/write access | word width | aligned | Just do the access |
Word data read/write access | word width | misaligned | The word access must be broken up into two byte accesses, each executed as a byte access on a word bus as described above |
... | ... | ... | ... |
The table only shows a subset of the possible combinations. To resolve this problem, dedicated components are used: read and write sequencers. These components take a read or write request from the core and break it up into accesses that the external bus can execute.
If some address area is accessed as byte-wide I/O, and other parts of the address space as word-wide or wider memory, the processor must have multiple read and write sequencers, one per bus width (or a sequencer that can handle multiple widths).
Misaligned accesses slow the processor down from its optimum speed. In general, though, a wider bus is still faster than a narrower one.
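The splitting done by a sequencer can be illustrated with a minimal Python sketch. The function name `split_access` and the rule that a bus transaction never crosses an aligned bus-word boundary are assumptions made for this illustration; they are not taken from the 65k specification.

```python
def split_access(addr, size, bus_width):
    """Split an access of `size` bytes at byte address `addr` into bus
    transactions that never cross an aligned bus-word boundary.

    Returns a list of (address, length) tuples. Hypothetical model only.
    """
    parts = []
    end = addr + size
    while addr < end:
        # First address of the next aligned bus word after `addr`.
        boundary = (addr // bus_width + 1) * bus_width
        chunk_end = min(end, boundary)
        parts.append((addr, chunk_end - addr))
        addr = chunk_end
    return parts


print(split_access(0x1000, 2, 2))  # aligned word on a word bus: one access
print(split_access(0x1001, 2, 2))  # misaligned word: two byte accesses
print(split_access(0x1000, 2, 1))  # word on a byte bus: two byte accesses
```

In this model the misaligned word access costs two bus transactions instead of one, which is the slowdown described above.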
Simultaneous Multithreading
The 6502 - as well as the 65k - is very efficient in terms of bus read/write cycles per opcode. If, however, a misaligned wide access is broken down into two or more smaller memory accesses, the core has to wait.
In this case the core could switch to a separate set of registers and execute code for a second, virtual processor - a technique known today as SMT, simultaneous multithreading.
But as the processor is very memory access efficient, a second thread may not get enough free bus cycles for many memory accesses of its own.
This is a topic for a later version, though.
Pipelining
Pipelining is a processor technique that divides the execution of an opcode into different stages such as fetch, decode, execute and store. Some processors have used pipelines of up to 31 such stages (Pentium 4, Pentium D).
An advantage of pipelining is that more than one opcode can be executed in parallel. The first opcode could be writing back data while the next one is executing, the one after that is being decoded and the last one is being fetched. This way more functional units (fetch, decode, ...) are in use at any time, making the system more efficient. Even if each opcode requires more than one cycle, every cycle can start - and finish - an opcode, making the processor faster in terms of opcodes per cycle.
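The throughput gain can be estimated with a small Python sketch. It assumes an idealised pipeline in which every stage takes exactly one cycle and there are no stalls or branches; the function names are made up for this example.

```python
def cycles_sequential(n_opcodes, n_stages):
    """Each opcode passes through all stages before the next one starts."""
    return n_opcodes * n_stages


def cycles_pipelined(n_opcodes, n_stages):
    """Ideal pipeline: after the first opcode has filled the pipeline,
    one opcode finishes in every cycle."""
    if n_opcodes == 0:
        return 0
    return n_stages + n_opcodes - 1


# Ten opcodes on a four-stage pipeline (fetch, decode, execute, store).
print(cycles_sequential(10, 4))  # 40
print(cycles_pipelined(10, 4))   # 13
```

A taken branch that invalidates the pipeline would add refill cycles to the pipelined figure, which this idealised model does not account for.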
A disadvantage is that branches can invalidate all the work that has been done for the following cycles. Therefore branch prediction techniques have been developed to reduce the cost of pipeline invalidation due to branches.
The 6502 already has a limited form of pipelining. The last cycle of any opcode is actually the fetch of the next opcode. Overlapping work in this way is also the reason why the 6502 is little-endian: the processor fetches two-byte operands with the low byte first. After the first fetch an index register is added to the low byte while the high byte is being fetched. Then in the next cycle the carry is added to the high byte. In fact there is an optimization that eliminates this last cycle when no carry needs to be added to the high byte. Here too branches lead to problems: when a branch is taken, the "official" end of the opcode is not reached, and interrupt handling is suspended until the end of the next opcode.
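The low-byte-first address calculation can be sketched in Python as follows. This is a simplified model of an indexed read on the 6502, not a cycle-exact emulation, and the function name is an assumption for this example.

```python
def absolute_indexed_6502(operand_lo, operand_hi, index):
    """Model of 6502-style absolute indexed addressing.

    The index is added to the low byte first (while the high byte is
    still being fetched); the carry is applied to the high byte only
    if needed, which costs one extra cycle.
    """
    lo_sum = operand_lo + index
    carry = lo_sum >> 8                      # 1 if the low byte overflowed
    hi = (operand_hi + carry) & 0xFF
    address = (hi << 8) | (lo_sum & 0xFF)
    extra_cycle = carry == 1                 # high byte fix-up needed
    return address, extra_cycle


print(absolute_indexed_6502(0x10, 0x12, 0x20))  # (0x1230, False)
print(absolute_indexed_6502(0xF0, 0x12, 0x20))  # (0x1310, True)
```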
The 65k architecture will in the current version implement pipelining similar to the 6502.
This section analyses the requirements for the core architecture.
Addressing Mode Analysis
To define the necessary data paths in the core between registers, ALU and other components, the addressing modes are analysed here. As the internal registers and data paths are always full width, arithmetic operations (adds) do not need to be broken up into smaller chunks. The example used here is loading a value into the accumulator. Each sequence starts with the program counter on the address bus and the opcode parameter on the data bus.
Note that using a register and taking a new value into the same register can take place on the same clock cycle if registers are assumed (as opposed to transparent latches as in the original 6502).
Also note that the initial parameter fetch comes as opcode parameter, thus from a different input bus, which has to be taken into account in the core design.
Immediate
The immediate addressing mode is easy...
Step | Transfer | Description |
---|---|---|
1 | Data bus -> AC | The opcode parameter value fetched and on the data bus is transferred into the register |
Zeropage and Absolute
The zeropage and absolute addressing modes - including the new long and quad modes - have one indirection: the opcode parameter is an address used to fetch the actual value.
Step | Transfer | Description |
---|---|---|
1 | Data bus -> data bus input reg. | The opcode parameter value fetched and on the data bus is transferred into the data bus input register (Note 1) |
2 | Data bus input reg. -> address bus | The data bus input register is put onto the address bus |
2 | data bus -> AC | The value read from the data bus is taken into the register |
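The two steps of the table above can be traced with a minimal Python model. The register name `DIN` (for the data bus input register), the dictionary-based memory and the function name are assumptions of this sketch, not part of the 65k design.

```python
def load_absolute(memory, operand):
    """Trace a load with zeropage/absolute addressing.

    `memory` maps addresses to full-width values (hypothetical model).
    Returns the loaded value and the list of register transfers.
    """
    regs = {"DIN": 0, "AC": 0}
    trace = []

    # Step 1: the opcode parameter on the data bus is latched.
    regs["DIN"] = operand
    trace.append("data bus -> DIN")

    # Step 2: DIN drives the address bus, the value read goes to AC.
    address_bus = regs["DIN"]
    regs["AC"] = memory[address_bus]
    trace.append("DIN -> address bus, data bus -> AC")

    return regs["AC"], trace


value, trace = load_absolute({0x1234: 0x42}, 0x1234)
print(hex(value), trace)
```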
Zeropage and Absolute Indexed
In this addressing mode an index register value is added to the address before reading the actual value.
Step | Transfer | Description |
---|---|---|
1 | Data bus -> data bus input reg. | The opcode parameter value fetched and on the data bus is transferred into the data bus input register |
2 | Data bus input reg. -> ALU A | The data bus input register value is put to ALU input A (Note 3) |
2 | index register -> ALU B | The index register value is put to ALU input B |
2 | ALU out -> temp | The ALU output is written to the temp register |
Step 3 is optional: it is only executed if B, S or PC is added (prefix OF bits) | | |
3 | temp -> ALU A | The temporary register value is put to ALU input A |
3 | B, S or PC register -> ALU B | The B, S or PC register value is put to ALU input B |
3 | ALU out -> temp | The ALU output is written to the temp register (Note 2) |
4 | temp -> address bus | The temporary register is put onto the address bus |
4 | data bus -> AC | The value read from the data bus is taken into the register (via Pass5) |
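As a rough illustration of the indexed sequence, the following Python sketch adds the index register and, optionally, an offset register (B, S or PC) before the load. The 16-bit width, the dictionary-based memory and all names are assumptions made for this example.

```python
MASK = 0xFFFF  # assumed 16-bit register width for this sketch


def load_indexed(memory, operand, index, offset=None):
    """Indexed load. Returns (value, number of steps taken)."""
    din = operand                      # step 1: data bus -> input register
    temp = (din + index) & MASK        # step 2: ALU adds the index register
    steps = 2
    if offset is not None:             # step 3 (optional): add B, S or PC
        temp = (temp + offset) & MASK
        steps += 1
    ac = memory[temp]                  # step 4: temp -> address bus, read
    steps += 1
    return ac, steps


memory = {0x1005: 0x42, 0x3005: 0x99}
print(load_indexed(memory, 0x1000, 5))                 # without offset
print(load_indexed(memory, 0x1000, 5, offset=0x2000))  # with offset register
```

Because the add is always done at full width, the number of steps in this model does not depend on whether a carry occurs, in contrast to the 6502.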
Zeropage and Absolute Indexed Indirect
This is an extension of the previous addressing mode. The value read using the addressing mode above is interpreted as the address from which the actual value is read.
Step | Transfer | Description |
---|---|---|
1 | Data bus -> data bus input reg. | The opcode parameter value fetched and on the data bus is transferred into the temporary (or the data in) register |
2 | Data bus input reg. -> ALU A | The data bus input register value is put to ALU input A |
2 | index register -> ALU B | The index register value is put to ALU input B |
2 | ALU out -> temp | The ALU output is written to the temp register |
Step 3 is optional: it is only executed if B, S or PC is added (prefix OF bits) | | |
3 | temp -> ALU A | The temporary register value is put to ALU input A |
3 | B, S or PC register -> ALU B | The B, S or PC register value is put to ALU input B |
3 | ALU out -> temp | The ALU output is written to the temp register (Note 2) |
4 | temp -> address bus | The temporary register is put onto the address bus |
4 | data bus -> temp | The value read from the data bus is taken into the temp register |
5 | temp -> address bus | The temporary register is put onto the address bus |
5 | data bus -> AC | The value read from the data bus is taken into the register (via Pass5) |
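A compact Python sketch of the indexed indirect sequence is shown below. It assumes a 16-bit width and a memory model in which each read returns a full-width value; the names are hypothetical.

```python
MASK = 0xFFFF  # assumed 16-bit register width for this sketch


def load_indexed_indirect(memory, operand, index, offset=0):
    """Indexed indirect load: index first, then follow the pointer."""
    temp = (operand + index) & MASK    # steps 1-2: operand plus index
    temp = (temp + offset) & MASK      # step 3 (optional): add B, S or PC
    temp = memory[temp]                # step 4: read the pointer into temp
    return memory[temp]                # step 5: read the actual value


# A pointer table at 0x0020; entry 0x0024 points to 0x4000.
memory = {0x0024: 0x4000, 0x4000: 0x5A}
print(hex(load_indexed_indirect(memory, 0x0020, 4)))  # 0x5a
```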
Zeropage and Absolute Indirect Indexed
Step | Transfer | Description |
---|---|---|
1 | Data bus -> data bus input reg. | The opcode parameter value fetched and on the data bus is transferred into the data bus input register |
Step 2 is optional: it is only executed if B, S or PC is added (prefix OF bits) | | |
2 | data bus input reg. -> ALU A | The data bus input register value is put to ALU input A |
2 | B, S or PC register -> ALU B | The B, S or PC register value is put to ALU input B |
2 | ALU out -> temp | The ALU output is written to the temp register (Note 2) |
3 | temp/data bus input register -> address bus | The temporary register (or the data bus input register if step 2 is not taken) is put onto the address bus (ALU passthrough) |
3 | data bus -> temp | The value read from the data bus is taken into the temp register |
4 | temp -> ALU A | The temporary register value is put to ALU input A |
4 | index register -> ALU B | The index register value is put to ALU input B |
4 | ALU out -> temp | The ALU output is written to the temp register |
5 | temp -> address bus | The temporary register is put onto the address bus |
5 | data bus -> AC | The value read from the data bus is taken into the register |
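In the indirect indexed mode the pointer is read first and the index is added afterwards. A minimal Python sketch, again assuming a 16-bit width, full-width memory reads and made-up names:

```python
MASK = 0xFFFF  # assumed 16-bit register width for this sketch


def load_indirect_indexed(memory, operand, index, offset=0):
    """Indirect indexed load: follow the pointer first, then index."""
    pointer_addr = (operand + offset) & MASK  # steps 1-2: optional B/S/PC add
    temp = memory[pointer_addr]               # step 3: read the pointer
    temp = (temp + index) & MASK              # step 4: add the index register
    return memory[temp]                       # step 5: read the actual value


# The pointer at 0x0020 holds the base address 0x4000 of a data block.
memory = {0x0020: 0x4000, 0x4004: 0x77}
print(hex(load_indirect_indexed(memory, 0x0020, 4)))  # 0x77
```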
Relative
The relative addressing mode is for jumps only
Step | Transfer | Description |
---|---|---|
1 | Data bus -> data bus input reg. | The opcode parameter value fetched and on the data bus is transferred into the temporary (or the data in) register |
2 | data bus input reg. -> ALU A | The data bus input register value is put to ALU input A |
2 | PC -> ALU B | The program counter value is put to ALU input B |
2 | ALU out -> PC | The ALU output is written to the program counter |
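The add performed in step 2 can be sketched in Python. The sketch assumes, as on the 6502, that the operand is a signed offset and that the program counter already points behind the operand; operand size, register width and function names are assumptions of this example.

```python
def sign_extend(value, bits):
    """Interpret the lowest `bits` bits of `value` as a signed number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(pc, operand, operand_bits=8, width=16):
    """ALU add of the sign-extended operand and the program counter."""
    mask = (1 << width) - 1
    return (pc + sign_extend(operand, operand_bits)) & mask


print(hex(branch_target(0x1002, 0x10)))  # forward branch: 0x1012
print(hex(branch_target(0x1002, 0xFE)))  # backward branch by 2: 0x1000
```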
Notes
1. The value written to the temp register could be written directly to the address bus output register and put on the address bus in the next step, eliminating the need for the temp register here.
2. The ALU output value could be directly written to the address bus. This, however, would add the ALU processing time to the setup time for the address bus, limiting the possible clock speeds.
3. As the opcode operand is always read full width before actually adding the register value, there can be no optimization by eliminating the high byte add. The processor always works as if there was a carry - and thus a 16 bit 65k with 8 bit data bus will actually be one cycle slower than the 6502 if there is no carry.
Opcode Data Path Analysis
This section analyses the different types of opcodes.
Load/Store
Opcodes: LDA, LDX, LDY, STA, STX, STY, STZ
Load opcodes load a register with a value from memory. Store opcodes write data to a memory location. These opcodes work the same way as the bare addressing modes described above. For stores, however, not only the address value is put on the bus, but also the register value.
Note: this poses a problem for the zeropage and absolute addressing modes: right after the parameter fetch, the value read as parameter must be available as address on the data address bus, while at the same time the register value has to be put onto the data bus. This has to be considered in the core design.
Load Effective Address
The LEA opcode can work similarly to the load opcodes above, i.e. basically transferring the value of the address output register into E. But a cycle can be saved if the address value is stored directly in E instead of in the address output register. Which variant is used depends on implementation details.
Arithmetic Operations
Opcodes: ADC, SBC, CMP, ORA, AND, EOR, CPX, CPY, TSB, TRB, BIT, ADS, ADE, ADB, SBS, SBE, SBB
These opcodes read an operand value, process it in the ALU, then store the result in the AC (together with the relevant status bits). That is, in the last addressing mode cycle as above, the value is read and stored in the temp register. In the following cycle, the temp register and the AC are put on the ALU A and B inputs respectively. The ALU output is passed on to the AC, which takes the value over at the end of the cycle. In this very cycle the next opcode can be read using the opcode fetch circuitry - as on the 6502.
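The two cycles described above can be illustrated with an ADC-like operation in Python. The sketch assumes binary (non-decimal) mode and an 8-bit width, models only the C, Z and N status bits, and uses made-up function names.

```python
def alu_add(a, b, carry_in, width=8):
    """ALU add with carry; returns the result and the C, Z, N status bits."""
    mask = (1 << width) - 1
    total = a + b + carry_in
    result = total & mask
    flags = {
        "C": total > mask,                  # carry out of the top bit
        "Z": result == 0,                   # result is zero
        "N": bool(result >> (width - 1)),   # top bit of the result is set
    }
    return result, flags


def adc(memory, address, ac, carry_in):
    temp = memory[address]                  # last addressing cycle: value -> temp
    return alu_add(temp, ac, carry_in)      # next cycle: temp and AC -> ALU -> AC


print(adc({0x10: 0x20}, 0x10, 0xF0, 0))
# (16, {'C': True, 'Z': False, 'N': False})
```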
Read-Modify-Write Operations
Opcodes: DEC, DEX, DEY, INC, INX, INY, ROL, ROR, ASL, LSR, SWP, BCN
For the accumulator addressing mode opcodes (ROL A, ROR A, LSR A, ASL A, SWP A, BCN A), it is simple. After opcode fetch and decode, the AC is given to the ALU and the result is transferred back to AC at the end of the cycle.
For the other opcodes, the value is read into the temp register during the load cycle. During the second cycle the ALU performs the operation and stores the value in the data bus output driver, but no valid write is performed yet. During the third cycle the data is actually written. In fact the next opcode fetch may already be performed during the second cycle (Note 1).
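The three cycles of a memory read-modify-write operation can be sketched in Python, using an increment as the example. The 8-bit width, the dictionary-based memory and the variable names are assumptions of this sketch.

```python
def rmw_inc(memory, address, width=8):
    """Increment a memory location in three modelled cycles."""
    mask = (1 << width) - 1

    # Cycle 1: load the value into the temp register.
    temp = memory[address]

    # Cycle 2: the ALU computes the result, which is latched in the
    # data bus output driver; nothing is written to memory yet.
    data_out = (temp + 1) & mask

    # Cycle 3: the latched value is actually written back.
    memory[address] = data_out
    return data_out


memory = {0x10: 0xFF}
print(rmw_inc(memory, 0x10), memory)  # the value wraps around to 0
```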
Register Transfer Operations
Opcodes: TAX, TXA, TAY, TYA, TXS, TSX, TPA, TSY, TYS, TEA, TAE, TBA, TAB
These opcodes are simple. In the cycle after the opcode fetch the source register value is put on the internal bus, transferred to the register input bus (using the pass gates), and stored in the target register at the end of the cycle. During this transfer cycle the next opcode can be fetched.
Register Swap Operations
Opcodes: SAB, SAX, SAY, SXY, SAE, SAB
These opcodes are more complicated. In the first cycle, AC is transferred into the temp register. In the second cycle the other (S/X/Y) register is transferred into AC. In the third cycle the temp register is stored into the other register. (Note 2)
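The three-cycle swap through the temp register can be modelled in a few lines of Python. The dictionary-based register file and the function name are assumptions made for this illustration.

```python
def swap_ac_with(regs, other):
    """Swap AC with another register via the temp register."""
    regs = dict(regs)              # work on a copy of the register file
    regs["TEMP"] = regs["AC"]      # cycle 1: AC -> temp
    regs["AC"] = regs[other]       # cycle 2: other register -> AC
    regs[other] = regs["TEMP"]     # cycle 3: temp -> other register
    return regs


after = swap_ac_with({"AC": 0x11, "X": 0x22, "TEMP": 0}, "X")
print(after["AC"], after["X"])  # 34 17
```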
Status Register Operations
Opcodes: SEC, CLC, SED, CLD, SEI, CLI, CLV
TODO
Clear Operations
Opcodes: CLY, CLX, CLA
TODO
Stack Operations
Opcodes: PLA, PHA, PLX, PHX, PLY, PHY, PLB, PHB, PLE, PHE, PRB
TODO
Jump Operations
Opcodes: JMP, JPU
TODO
Jump Subroutine Operations
Opcodes: JSR, BSR
TODO
Return Subroutine Operations
Opcodes: RTS, RTI, RTU
TODO
Branch Operations
Opcodes: BNE, BEQ, BPL, BMI, BVS, BVC, BCC, BCS, BRA
TODO
Move Operations
Opcodes: MVN, MVP, MVNTU, MVNFU, MVPTU, MVPFU
TODO
Fill Operations
Opcodes: FILU
TODO
Quick Operations
Opcodes: DEC, DEY, DEX, INC, INY, INX, ROL, ROR, ASL, LSR, INE, DEE, INB, DEB - quick addressing modes
TODO
Control Register Operations I
Opcodes: LCR, SCR, BCR
TODO
Control Register Operations II
Opcodes: SENV, SMMU
TODO
Memory Control Operations
Opcodes: SCA, LLA, WMB, RMB
TODO
Notes
1. The ALU output value could be directly written to the data bus. This, however, would add the ALU processing time to the setup time for the data bus, limiting the possible clock speeds.
2. If the ALU does not provide a pass-through mode, then a pass gate from the temp register output to the register input bus is required.
Last modified: 2012-04-11