The Kraken: A 16-Bit RISC Microprocessor
Dennis Kim, Andrew Lin
You can download all of the Magic cells HERE. You can download all of the Verilog HERE.
Abstract
This report summarizes the design process of the Kraken, a 16-bit RISC microprocessor in the MOSIS 1.2 µm AMI process. The processor's architecture features 16-bit instruction words, 16 internal general-purpose registers, and 14 address lines to external memory. The processor is timed using two-phase, non-overlapping clocks, and each instruction is executed in two cycles. At the circuit level, the chip is designed mostly in static CMOS, with the exception of a dynamic arithmetic logic unit. The chip was tested in simulation at a maximum clock speed of 25 MHz. Click HERE to see a larger image (JPEG) of the chip. Click HERE for a medium sized image (JPEG).
Features
We implemented a 16-bit RISC microprocessor based on a simplified version of the MIPS architecture. The processor has 16-bit instruction words and 16 general purpose registers. Every instruction is completed in two cycles. Non-overlapping two-phase clocks are used as the timing mechanism for the control and datapath units. This section includes a summary of the main features of the processor, a description of the pins, a high level diagram of the external interface of the chip, and the instruction word formats. Here is a list of the major features of the Kraken:
| Pin | Direction | Description |
| --- | --- | --- |
| Mem[15:0] | Input/Output | 16-bit memory bus. |
| Mem_addr_s1[13:0] | Output | 14-bit address lines to address the main memory. |
| Mem_wrt_s1 | Output | Asserted high when the chip writes data to Mem[15:0]. Otherwise, the chip treats Mem[15:0] as inputs. |
| Phi1, Phi2 | Input | Two-phase, non-overlapping clocks for chip timing. |
| Reset_s1 | Input | Resets the chip when asserted high. The chip will switch to the idle state, and fetch from the main memory starting at address 0. |
| Vdd | Input | Power. |
| GND | Input | Ground. |
Here is a high-level block diagram that describes the external interface of the chip.
Instruction Set Architecture (ISA)
The ISA of the Kraken consists of 16 instructions with a fixed-size 4-bit operation code. The instruction words are 16 bits long. The following chart describes the instruction formats.
[Instruction format chart. Rows: ADD, SUB, AND, OR, XOR, NOT, SLA, SRA, LI, LW, SW, BIZ, BNZ, JAL, JMP, JR.]
The Kraken features five instruction classes:
1. Arithmetic (Two's Complement) ALU operations (2)
2. Logical and shift ALU operations (6)
3. Memory operations (3)
4. Conditional Branch operations (2)
5. Program Count Jump operations (3)
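The class counts can be cross-checked against the 16 mnemonics in the instruction chart. The sketch below is only a bookkeeping aid: the assignment of individual mnemonics to classes (including the name `logical_shift` and the placement of LI with the memory operations) is an assumption inferred from the counts, not something the chart states instruction by instruction.

```python
# Mnemonics are taken from the instruction chart. The grouping into classes
# is an ASSUMPTION inferred from the per-class counts.
INSTRUCTION_CLASSES = {
    "arithmetic": ["ADD", "SUB"],
    "logical_shift": ["AND", "OR", "XOR", "NOT", "SLA", "SRA"],
    "memory": ["LI", "LW", "SW"],
    "branch": ["BIZ", "BNZ"],
    "jump": ["JAL", "JMP", "JR"],
}

OPCODE_BITS = 4  # fixed-size operation code


def class_of(mnemonic):
    """Return the (assumed) class name for a mnemonic."""
    for name, members in INSTRUCTION_CLASSES.items():
        if mnemonic in members:
            return name
    raise KeyError(mnemonic)


# A 4-bit opcode can distinguish exactly 16 instructions.
TOTAL_INSTRUCTIONS = sum(len(m) for m in INSTRUCTION_CLASSES.values())
```

With this grouping the five classes add up to 16 instructions, which fills the 4-bit opcode space exactly.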
Chip Architecture
The processor can be broken up into two distinct units: control and datapath. The control unit generates eleven signals that control the flow of data to and from an external memory unit through the datapath. The inputs to the control unit are the opcode from the instruction register, the compare-to-zero signal, and the reset signal. Together they determine the state of the FSM and the control signals sent to the datapath units. For the details of the FSM flow and the related control signals, please refer to Appendix A (State Diagram)(PDF), Appendix B (Control Flow Table)(PDF) and Appendix C (Datapath Block Diagram)(PDF).
Control Unit Description:
The Control FSM has only three distinct states that determine the operation of the processor: IDLE, FETCH and EXECUTE. When the reset signal (reset_s1) goes high from any state, the FSM will be placed in the IDLE state. While in the IDLE state the control unit will send the PC write enable signal (pc_wrt_s2 = 1) and select zero (pc_sel_s2 = 0) as the current Program count.
When the reset signal goes low, the FSM's next state is FETCH, and the instruction at memory address 0 is loaded into the Instruction Register (IR) to begin program execution. When the next state is FETCH, the control unit generates the IR write (ir_wrt_s1), Operand A Select (opA_sel_s1), Operand B Select (opB_sel_s1 = 0010) and ALU add operation (alu_op_s1 = 00000001) signals to load the IR with the next instruction and increment the PC by 1. These events all occur on the first clock phase of the FETCH state. One-hot signals are used for alu_op_s1, opB_sel_s1, and data_sel_s2 to make decoding easier in the datapath units. The operation in the next phase of FETCH is determined by the opcode (opcode_s2) from the IR, except that the incremented PC is written in from the ALU output latch in all cases. The ALU operations load Operands A and B from the Register File. The Load word needs only Operand A, while the Store word needs both operands (one for the address and one for the data word). The Branch instructions use the offset in the instruction word and the PC + 1 count as operands into the ALU. The JAL stores the incremented PC in the Register File, while the JR loads the return address into Operand A.
After phase two of the FETCH state, the FSM enters the EXECUTE state. During the first phase for an ALU operation, the appropriate alu_op_s1 control signals are sent to the ALU as decoded from the opcode. The operand mux (opA_sel_s1 & opB_sel_s1) control signals are also generated to select the latch outputs. For the other operations (except LI), an add operation is required from the ALU. The operands chosen for the add are determined by the operation specified. The Load and Store words will access Memory on this first phase as well. The second phase of EXECUTE writes data into the register file or writes a new address into the PC. For the branch instruction, the control will look at the check zero signal from operand A to determine if the branch should be taken and the new PC should be written. The control returns the next state to FETCH to repeat the process for the next instruction.
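The state sequencing described above can be summarized in a few lines. This is a behavioral sketch of the three-state FSM only; the names `next_state` and `run` are illustrative, and the eleven control outputs and the two clock phases within each state are deliberately left out.

```python
IDLE, FETCH, EXECUTE = "IDLE", "FETCH", "EXECUTE"


def next_state(state, reset):
    """Next state of the control FSM for one full clock cycle."""
    if reset:
        return IDLE          # reset forces IDLE from any state
    if state == FETCH:
        return EXECUTE       # every instruction is FETCH then EXECUTE
    return FETCH             # IDLE -> FETCH and EXECUTE -> FETCH


def run(reset_sequence, start=IDLE):
    """Apply one reset value per cycle and return the visited states."""
    states = []
    state = start
    for reset in reset_sequence:
        state = next_state(state, reset)
        states.append(state)
    return states
```

For example, `run([1, 0, 0, 0, 0])` holds the FSM in IDLE for one cycle and then alternates FETCH and EXECUTE, which matches the two-cycle instruction timing.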
Custom Circuit Design and Layout
The datapath and register file are custom designed. Clock and control unit signals are routed vertically through the datapath in first layer metal (M1). Data signals are routed horizontally along the datapath to the register file in second layer metal (M2). The bit-slice pitch is 78 lambda, so the total height of the 16-bit datapath is 1248 lambda. The clock and control signals are buffered along the top of the datapath to reduce their delay as they travel from the control unit to the datapath. To reduce clock skew, all clock qualification logic is done locally in the datapath. To reduce signal integrity problems from coupling noise, we were careful to make sure dynamic nodes did not run in parallel with the clock lines. The chip floorplan is attached (appendix D (PDF)). The bitslice wiring plan that we used is also attached (appendix E (PDF)).
We implemented a dynamic arithmetic logic unit. The carry chain for the adder is a pre-charged Manchester carry chain, and we buffered this chain every 4 bits to reduce the delay of the carry. A compact programmable pre-charged logic unit design is used as well. A logic operation can be performed by turning on various combinations of foot transistors in the pull-down paths. Schematics of this logic unit (PDF), adder carry circuit (PDF), and the adder propagate and generate circuit (PDF) are attached.
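The logic function computed by a propagate/generate carry chain can be sketched bit by bit. This is a behavioral model only, assuming the usual definitions generate = a AND b and propagate = a XOR b; it says nothing about the pre-charge timing or the buffering every 4 bits, which affect delay rather than the result. The function name `add16` is illustrative.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def add16(a, b, carry_in=0):
    """Add two 16-bit words with a bitwise propagate/generate carry chain.

    Returns (sum, carry_out).
    """
    a &= MASK
    b &= MASK
    carry = carry_in & 1
    result = 0
    for i in range(WIDTH):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        generate = ai & bi        # this bit creates a carry by itself
        propagate = ai ^ bi       # this bit passes an incoming carry along
        result |= (propagate ^ carry) << i
        carry = generate | (propagate & carry)
    return result, carry
```

The worst case for the chain is a carry that propagates through all 16 bits, for example `add16(0xFFFF, 1)`, which is the path that buffering every 4 bits is meant to speed up.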
The register file is physically laid out as an 8-bit by 32-bit array. It is organized as sixteen 16-bit registers, with two read ports and a single write port. The bit-lines are pre-charged during the high phase of phi1. Four bits are required to address the register file. The three most significant bits are passed to 3-to-8 decoders, and the least significant bit selects which 8-bit physical row is written to or read from. Two 8-bit rows fit nicely in the 78 lambda datapath bitslice pitch. The word-lines are driven vertically from the decoders and buffers placed at the top of the array. The layout of the register file consists of two 3-to-8 decoders, two multiplexers, word-line drivers, SRAM cells, the read and write circuits, and the clock and control signal buffers.
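The address split can be sketched as follows. This is one reading of the description, assuming the three most significant address bits pick one of eight word-lines (one-hot) and the least significant bit picks one of the two 8-bit rows in each bit slice; the function name and return shape are illustrative.

```python
REGISTERS = 16
ROWS_PER_SLICE = 2   # two 8-bit rows fit in one bit-slice pitch
WORDLINES = 8        # outputs of a 3-to-8 decoder


def decode_register_address(addr):
    """Split a 4-bit register address into (one-hot word-line, row select)."""
    if not 0 <= addr < REGISTERS:
        raise ValueError("register address must be in 0..15")
    column = addr >> 1           # three most significant bits
    row_select = addr & 1        # least significant bit
    wordline = 1 << column       # one-hot 3-to-8 decoder output
    return wordline, row_select
```

Under this reading, register 11 (binary 1011) activates word-line 5 and selects row 1, and every one of the 16 addresses lands on a distinct (word-line, row) pair.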
The rest of the custom blocks consist of multiplexers and latches. The zero-detect block is implemented as a pre-charged OR gate. All multiplexers are controlled by a one-hot word.
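A one-hot select word needs no decoder at the multiplexer: each select bit gates exactly one input. The sketch below models that behavior; `one_hot_mux` is an illustrative name, and which datapath source sits at which input position is not specified here.

```python
def one_hot_mux(select, inputs):
    """Return the input whose position matches the single set bit of select."""
    is_one_hot = select > 0 and (select & (select - 1)) == 0
    if not is_one_hot or (select >> len(inputs)) != 0:
        raise ValueError("select must have exactly one bit set within range")
    out = 0
    for i, value in enumerate(inputs):
        if (select >> i) & 1:    # each select bit enables one input
            out |= value
    return out
```

For instance, a select word of 0010 passes the input at position 1 and blocks the others.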
After place and route, we noticed that the MEM_WRT signal was driven from the standard cells to 17 enable ports on the pad frame, so we buffered this signal as well to reduce the delay. A power and ground frame is routed around the periphery of the chip's core, close to the pad frame. Power and ground are connected at both sides and down the center of the custom datapath and register file blocks.
Verification Methodology
The HDL model of the microprocessor, written in Verilog, set up the building blocks for the circuit implementation of the control and datapath units. An assembler was written in Perl. Using this, we could generate the binary instruction words for an assembly program. The Verilog model itself was tested using an assembly language program (about 70 instructions). All the instructions were executed at least once during the first testing pass. More instructions were added to test for corner cases of the different units. This test program was verified by checking the contents of the register file after program execution and viewing the waveforms in Magellan to make sure the expected state and datapath timing transitions were taking place. This program was used to generate test vectors for different sections of the actual circuit implementation, including the top level schematics and layout.
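To illustrate what an assembler of this kind does, here is a minimal sketch that packs one instruction into a 16-bit word. It is not the Perl assembler from the project. Everything about the encoding is an assumption made for illustration: the opcode numbers simply follow the order of the instruction chart, the opcode is placed in the top four bits, and the three 4-bit register fields (`rd`, `rs`, `rt`) are hypothetical names and positions.

```python
# HYPOTHETICAL opcode numbering: the order of the instruction chart.
MNEMONICS = ["ADD", "SUB", "AND", "OR", "XOR", "NOT", "SLA", "SRA",
             "LI", "LW", "SW", "BIZ", "BNZ", "JAL", "JMP", "JR"]
OPCODES = {name: number for number, name in enumerate(MNEMONICS)}


def encode_three_register(mnemonic, rd, rs, rt):
    """Pack an instruction as opcode[15:12] rd[11:8] rs[7:4] rt[3:0].

    The field layout is an ASSUMPTION, not the Kraken's documented format.
    """
    for reg in (rd, rs, rt):
        if not 0 <= reg < 16:
            raise ValueError("register numbers must be in 0..15")
    return (OPCODES[mnemonic] << 12) | (rd << 8) | (rs << 4) | rt


def opcode_of(word):
    """Extract the 4-bit opcode from the (assumed) top four bits."""
    return (word >> 12) & 0xF
```

Under these assumptions, `encode_three_register("SUB", 3, 1, 2)` yields `0x1312`, and `format(word, "016b")` gives the binary string that a test bench would load into memory.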
The Register File and ALU schematics were the first major blocks of the datapath that were verified. We first tested an individual bit slice of the ALU as a stand-alone unit. After correcting some wiring errors in the ALU schematics, the full 16-bit ALU was verified against the test vectors created from Verilog. The same approach was taken to test the Register File. Some minor fixes to a row of the Register File, including the Read/Write and Decode circuitry, were needed after testing. The entire unit was then put together and verified against the test vectors.
The major debugging pains came up when the entire datapath was put together in SUE. The error came from undefined control signals in the input test vectors. These undefined control signals spread the dreaded X's throughout the circuit. It was difficult to track down the exact origin of the error because of the pass-gate muxes that were used in our datapath. The X's would spread from one input and feed back to the other input of the muxes. The original problem was thought to be in the circuit implementation, but it was eventually solved in Verilog by driving every control signal to a defined high or low value in every possible state.
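The way a single undefined control value contaminates stored state can be shown with a small three-valued model. This is a simplified sketch, assuming an ordinary unidirectional mux feeding a recirculating latch; it does not model the bidirectional behavior of pass gates, and the names `mux2` and `simulate_latch` are illustrative.

```python
X = "X"  # undefined logic value


def mux2(select, a, b):
    """Two-input mux in three-valued logic: select 0 picks a, 1 picks b."""
    if select == 0:
        return a
    if select == 1:
        return b
    # Undefined select: the output is known only if both inputs agree.
    return a if (a == b and a != X) else X


def simulate_latch(load_sequence, data_sequence, initial=0):
    """Latch that holds its state when load is 0 and takes data when load is 1."""
    state = initial
    trace = []
    for load, data in zip(load_sequence, data_sequence):
        state = mux2(load, state, data)
        trace.append(state)
    return trace
```

With `simulate_latch([1, X, 0, 0], [1, 0, 1, 1])`, one cycle of undefined `load` turns the stored value into X, and the following hold cycles keep recirculating it. Driving `load` to a defined value in every cycle keeps the trace free of X's.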
The verification of the layout went smoothly for the most part. The individual cells created in Magic were mapped to their SUE counterparts whenever possible and LVS checked with Gemini. This hierarchical approach caught many subtle layout errors with ease. After the Register File and the rest of the datapath were LVS clean, they were put together as a unit and also checked against the Verilog test vectors. Both checks came up clean, and the blocks were inserted into the padframe for final checks. After place and route of the control unit between the pads and datapath, the final LVS and snooper checks came up clean after fixing a minor connection error in the layout to one of the pads.
A Perl script was written to parse our main IRSIM command file and increase the clock frequency that the chip was simulated at. We were able to simulate our chip at a maximum of 25 MHz.
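The timing figures follow from simple arithmetic, sketched below together with a generic frequency sweep. The helper `max_passing_frequency` and its `passes` callback are hypothetical stand-ins for "rerun the simulation at this frequency and check the results"; they do not reproduce the Perl script or any IRSIM command syntax.

```python
CYCLES_PER_INSTRUCTION = 2  # every instruction completes in two cycles


def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def instruction_rate_mips(freq_mhz):
    """Millions of instructions per second at the given clock frequency."""
    return freq_mhz / CYCLES_PER_INSTRUCTION


def max_passing_frequency(candidates_mhz, passes):
    """Return the highest candidate for which passes(freq) is true, else None.

    `passes` is a HYPOTHETICAL callback wrapping one simulation run.
    """
    best = None
    for freq in sorted(candidates_mhz):
        if passes(freq):
            best = freq
    return best
```

At 25 MHz the clock period is 40 ns, so one instruction takes 80 ns, or 12.5 million instructions per second.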
To test the chip in silicon, we plan to use the test vectors that we generated from our Verilog model in conjunction with the hardware tester available to us. We can use our Verilog model to generate additional test vectors for debugging purposes or for more fault coverage.
Tools Comment
For the most part, the design tools worked quite well for us - we had no major problems. None of the tools took too much time to learn. We never found a way to update waveforms in Magellan after changing the Verilog. The only thing we could do was kill and restart Magellan. We also had some trouble with v2sue, since it does not take kindly to assign statements and backslash n labels. The online documentation for CellSnake could have been a bit more detailed. For example, we had some trouble getting ports2edif to work because the tutorial merely stated to run a command with a bunch of files - it did not state what the purpose of the files was. We also found that the FloatWell script did not report a few floating wells that we had in our layout. Luckily, Dean notified us that we had floating wells in the ALU, so we caught this bug before tapeout. After correcting this, we generated a cif file from the chip layout, and ran FloatWell and WellRoute again. We also viewed the cif file in Magic, and checked the n-wells by hand in case FloatWell missed something.
Time Spent
Dennis' Time Spent
1/4-1/15 - 20 hours
Research, write proposal
1/16-1/23 - 30 hours
Architecting, begin coding Verilog Control
1/25-2/5 - 45 hours
Create Verilog stimulus model, test & debug Verilog model, enter in SUE schematics of datapath cells, verify final SUE datapath
2/6-2/19 - 70 hours
Layout datapath units, verify datapath layout
2/22-3/1 - 15 hours
Final chip assembly and verification.
Total = 180 hours
Andrew's Time Spent
1/4-1/15 - 10 hours
Research, write proposal.
(My grandfather passed away during this part of the quarter,
so Dennis covered for me during this difficult time. I greatly appreciate
his help).
1/16-1/23 - 20 hours
Datapath Verilog and debug. Wrote assembler in Perl.
1/25-2/5 - 45 hours
Register file SUE schematics and verification.
2/6-2/19 - 70 hours
Layout of register file, control line qualifiers and drivers, and datapath elements.
2/22-3/1 - 20 hours
Place and route, and final chip assembly. Tested chip timing.
Total = 165 hours
Comments on the Class
We found that we gained a lot of practical experience in all aspects of chip design from designing a processor from scratch. After finishing the design, we did think about what we would do differently if we were to design another chip. We now have a better feel for how to floorplan a design, how to implement specific circuits, and when to be worried about certain circuit level issues (such as coupling noise, IR drop, and charge sharing). Running through the entire design flow with a chip design of our own is the best way to learn about chip design. We would like to thank the teaching staff for their help in making this project possible.
References
N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1993.
D. Patterson, J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, 1994.
M. Horowitz, EE271: Introduction to VLSI Systems Course Notes.