[front matter]
About the Authors
Foreword
Acknowledgements
Figure Acknowledgements
Preface
Content and Structure
The VEX (VLIW Example) computing system
Audience
Cross-cutting Topics
How to read this book
Chapter 1: An Introduction to Embedded Processing
1.1 What is Embedded Computing?
1.1.1 Attributes of embedded devices
1.1.2 Embedded is growing
1.2 Distinguishing Between Embedded and General Purpose Computing
1.2.1 The "run one program only" phenomenon
1.2.2 Backward and Binary compatibility
Feature compatibility
Source compatibility
Tool compatibility
Operating System compatibility
1.2.3 Physical limits in the embedded domain
Cost
Power consumption and energy efficiency
Physical dimensions
1.3 Characterizing Embedded Computing
1.3.1 Categorization by type of processing engine
Digital Signal Processors
Network Processors
1.3.2 Categorization by application area
Image Processing and Consumer
Printers
Cameras
Digital Video
Communication
Wired phone networks
Cellular networks
Cellular telephones
Routers
Automotive
1.3.3 Categorization by Workload differences
Controlling
Switching and routing
Media processing
1.4 Embedded market structure
1.4.1 The market for embedded processor cores
1.4.2 Business model of embedded processors
Getting a better yield
Controlling Area and Power Consumption
Testing costs
1.4.3 Costs and Product Volume
1.4.4 Software and The Embedded Software Market
1.4.5 Industry standards
1.4.6 Product lifecycle
1.4.7 The Transition to System-on-Chip Design
Effects of System-On-Chip on the business model
Centers of embedded design
1.4.8 The Future of Embedded Systems
Connectivity: Always-on Infrastructure
State: Personal Storage
Administration
Security
The Next Generation
1.5 Further Reading
1.6 Exercises
Chapter 2: An Overview of VLIW and ILP
2.1 Semantics and parallelism
2.1.1 Baseline: sequential program semantics
2.1.2 Pipelined execution, overlapped execution and multiple execution units
2.1.3 Dependence and program rearrangement
2.1.4 ILP and other forms of parallelism
2.2 Design philosophies
2.2.1 An illustration of design philosophies: RISC vs. CISC
2.2.2 First definition of VLIW
2.2.3 A design philosophy: VLIW
VLIW vs. Superscalar
VLIW vs. DSP
2.3 Role of the compiler
2.3.1 The phases of a high-performance compiler
2.3.2 Compiling for ILP and VLIW
Global instruction scheduling
Software Pipelining
Profiling
2.4 VLIW in the embedded and DSP domains
2.5 Historical Perspective and Further Reading
2.5.1 ILP hardware in the 1960s and 70s
Early supercomputer arithmetic units
Attached signal processors
Horizontal microcode
2.5.2 The development of ILP code generation in the 1980s
Acyclic Microcode Compaction Techniques
Inadequacy of local compaction
First global compaction attempts
Acyclic Region Scheduling Techniques
Cyclic Techniques: Software Pipelining
Systematic software pipelining
2.5.3 VLIW development in the 1980s
2.5.4 ILP in the 1990s and 2000s
2.6 Exercises
Chapter 3: An Overview of ISA Design
3.1 Overview: What to Hide
3.1.1 Architectural State: Memory and Registers
3.1.2 Pipelining and Operational Latency
3.1.3 Multiple Issue and Hazards
Exposing dependence and independence
Structural Hazards
Resource Hazards
3.1.4 Exception and Interrupt Handling
3.1.5 Discussion
3.2 Basic VLIW design principles
3.2.1 Implications for Compilers and Implementations
3.2.2 Execution model subtleties
"Equals" Model
"Less-than-or-Equals" Model
3.3 Designing a VLIW ISA for Embedded Systems
3.3.1 Application domain
3.3.2 ILP Style
3.3.3 Hardware/Software Trade-offs
3.4 Instruction-Set Encoding
3.4.1 A Larger Definition of Architecture
3.4.2 Encoding and Architectural Style
RISC encodings
CISC encodings
VLIW encodings
Why not Superscalar encodings?
DSP encodings
Vector encodings
3.5 VLIW Encoding
3.5.1 Operation encoding
3.5.2 Instruction encoding
Fixed-overhead encoding
Distributed encoding
Template-based encoding
3.5.3 Dispatching and opcode subspaces
3.6 Encoding and Instruction-Set Extensions
3.7 Further Reading
3.8 Exercises
Chapter 4: Architectural Structures in ISA design
4.1 The Datapath
4.1.1 Location of operands and results
4.1.2 Datapath Width
4.1.3 Operation Repertoire
Simple integer and compare operations
Carry, Overflow and Other Flags
Common bitwise utilities
Integer Multiplication
Fixed-Point multiplication
Integer division
Floating Point Operations
Saturated Arithmetic
4.1.4 Micro-SIMD Operations
Alignment issues
Precision issues
Dealing with control flow
Pack, unpack, and mix
Reductions
4.1.5 Constants
4.2 Registers and Clusters
4.2.1 Clustering
Architecturally Invisible Clustering
Architecturally Visible Clustering
4.2.2 Heterogeneous register files
4.2.3 Address and Data Registers
4.2.4 Special register file features
4.3 Memory Architecture
4.3.1 Addressing Modes
4.3.2 Access Sizes
4.3.3 Alignment issues
4.3.4 Caches and local memories
Prefetching
Local Memories and Lockable Caches
4.3.5 Exotic modes for embedded processing
4.4 Branch Architecture
4.4.1 Unbundling Branches
Two-step Branching
Three-step Branching
4.4.2 Multi-way branches
4.4.3 Multi-cluster branches
4.4.4 Branches and loops
4.5 Speculation and Predication
4.5.1 Speculation
Control Speculation
"Non-recovery" speculation model
"Recovery" speculation model
Data Speculation
4.5.2 Predication
Full predication
Partial predication
Cost and benefits of predication
Predication in the embedded domain
4.6 System Operations
4.7 Further Reading
4.8 Exercises
Chapter 5: Microarchitecture Design
5.1 Register File Design
5.1.1 Register File Structure
5.1.2 Register Files, Technology, and Clustering
5.1.3 Separate Address and Data Register Files
5.1.4 Special Registers and Register File Features
5.2 Pipeline Design
5.2.1 Balancing a Pipeline
5.3 VLIW Fetch, Sequencing and Decoding
5.3.1 Instruction Fetch
5.3.2 Alignment and Instruction Length
5.3.3 Decoding and Dispersal
5.3.4 Decoding and ISA Extensions
5.4 The Datapath
5.4.1 Execution Units
5.4.2 Bypassing and Forwarding Logic
5.4.3 Exposing Latencies
5.4.4 Predication and Selects
5.5 Memory Architecture
5.5.1 Local memory and caches
5.5.2 Byte manipulation
5.5.3 Addressing, Protection and Virtual Memory
5.5.4 Memories in Multiprocessor Systems
5.5.5 Memory Speculation
5.6 Control Unit
5.6.1 The Branch Architecture
5.6.2 Predication and Selects
5.6.3 Interrupts and Exceptions
5.6.4 Exceptions and pipelining
"Drain" and "Flush" Pipeline Models
Early Commit
Delayed Commit
5.7 Control Registers
5.8 Power Considerations
5.8.1 Energy Efficiency and ILP
System-Level Power Considerations
5.9 Further Reading
5.10 Exercises
Chapter 6: System Design and Simulation
6.1 System-on-Chip (SoC)
6.1.1 IP Blocks and Design Reuse
A concrete SoC Example
Virtual Components and the VSIA Alliance
6.1.2 Design Flows
Creation Flow
Verification Flow
6.1.3 SoC Buses
Data Widths
Masters, Slaves and Arbiters
Bus Transactions
Simple Transactions
Burst Transactions
Split Transactions
Test Modes
6.2 Processor Cores and System-On-Chip
6.2.1 Non-programmable accelerators
Reconfigurable logic
6.2.2 Multiprocessing on a chip
Symmetric Multiprocessing
Heterogeneous Multiprocessing
Example: a multi-core platform for mobile multimedia
6.3 Overview of Simulation
6.3.1 Using simulators
6.4 Simulating a VLIW architecture
6.4.1 Interpretation
6.4.2 Compiled simulation
Memory
Registers
Control Flow
Exceptions
Analysis of compiled simulation
Performance measurement and compiled simulation
6.4.3 Dynamic binary translation
6.4.4 Trace-Driven Simulation
6.5 System simulation
6.5.1 I/O and Concurrent Activities
6.5.2 Hardware simulation
Discrete Event Simulation
6.5.3 Accelerating Simulation
In-Circuit Emulation
Hardware accelerators for simulation
6.6 Validation and verification
6.6.1 Simulation, Verification and Test
Formal Verification
Design for Testability
Debugging support for SoC
6.7 Further Reading
6.8 Exercises
Chapter 7: Embedded Compiling and Toolchains
7.1 What is important in an ILP Compiler?
7.2 Embedded cross-development toolchains
7.2.1 Compiler
7.2.2 Assembler
7.2.3 Libraries
7.2.4 Linker
7.2.5 Post-link optimizer
7.2.6 Run-time program loader
7.2.7 Simulator
7.2.8 Debugger, monitor ROMs
7.2.9 Automated test systems
7.2.10 Profiling tools
7.2.11 Binary utilities
7.3 Structure of an ILP compiler
7.3.1 Front end
7.3.2 Machine-independent optimizer
7.3.3 Back end: machine-specific optimizations
7.4 Code Layout
7.4.1 Code Layout Techniques
DAG-based placement
The "Pettis-Hansen" technique
Procedure Inlining
Cache line coloring
Temporal-Order Placement
7.5 Embedded-specific trade-offs for compilers
7.5.1 Space vs. Time vs. Energy trade-offs
7.5.2 Power-specific optimizations
Fundamentals of Power Dissipation
Switching
Leakage
Power-aware software techniques
Reducing Switching Activity
Power-Aware Instruction Selection
Scheduling for Minimal Dissipation
Memory Access Optimizations
Data Remapping
7.6 DSP-Specific Compiler Optimizations
Compiler-visible features of DSPs
Heterogeneous registers
Addressing modes
Limited Connectivity
Local Memories
Harvard architecture
Instruction Selection and Scheduling
Address Computation and Offset Assignment
Local Memories
Register Assignment Techniques
Retargetable DSP and ASIP compilers
7.7 Further Reading
7.8 Exercises
Chapter 8: Compiling for VLIWs and ILP
8.1 Profiling
8.1.1 Kinds of profiles
8.1.2 Profile collection
8.1.3 Synthetic profiles (heuristics in lieu of profiles)
8.1.4 Profile Bookkeeping and Methodology
8.1.5 Profiles and embedded applications
8.2 Scheduling
8.2.1 Acyclic Region Types and Shapes
Basic blocks
Traces
Superblocks
Hyperblocks
Treegions
Percolation Scheduling
8.2.2 Region formation
Region selection
Enlargement techniques
Phase ordering considerations
8.2.3 Schedule construction
Analyzing programs for schedule construction
Compaction Techniques
Cycle vs. Operation scheduling
Linear techniques
Graph-based techniques (list scheduling)
Compensation Code
Another view of scheduling problems
8.2.4 Resource management during scheduling
Resource vectors
Finite state automata
8.2.5 Loop scheduling
Modulo scheduling
The Modulo Reservation Table (MRT)
Searching for the II
Scheduling heuristics
Prologues and epilogues
Modulo variable expansion
Iterative Modulo Scheduling
Advanced modulo scheduling techniques
8.2.6 Clustering
8.3 Register allocation
8.3.1 Phase Ordering Issues
Register allocation and scheduling
8.4 Speculation and Predication
8.4.1 Control and Data Speculation
8.4.2 Predicated execution
8.4.3 Prefetching
8.4.4 Data Layout methods
8.4.5 Static and hybrid branch prediction
8.5 Instruction selection
8.6 Further Reading
8.7 Exercises
Chapter 9: The Run-time System
9.1 Exceptions, interrupts, and traps
Exception Handling
9.2 Application Binary Interface considerations
9.2.1 Loading programs
9.2.2 Data layout
9.2.3 Accessing global data
9.2.4 Calling conventions
Registers
Call instructions
Call sites
Function prologues and epilogues
9.2.5 Advanced ABI topics
Variable-length argument lists
Dynamic stack allocation
Garbage collection
Linguistic exceptions
9.3 Code Compression
9.3.1 Motivations
9.3.2 Compression and Information Theory
9.3.3 Architectural Compression Options
Decompression on fetch
Decompression on refill
Load-time decompression
9.3.4 Compression Methods
Hand-tuned ISAs
Ad-hoc compression schemes
RAM decompression
Dictionary-based software compression
Cache-based compression
Quantifying compression benefits
9.4 Embedded Operating Systems
9.4.1 "Traditional" OS issues, revisited
9.4.2 Real-Time
Real-Time Scheduling
9.4.3 Multiple flows of control
Threads, processes and microkernels
Threads.
Processes.
Microkernels.
9.4.4 Market Considerations
Embedded Linux
9.4.5 Downloadable code and Virtual Machines
9.5 Multiprocessing and Multithreading
9.5.1 Multiprocessing in the embedded world
9.5.2 Multiprocessing and VLIW
9.6 Further Reading
9.7 Exercises
Chapter 10: Application Design and Customization
10.1 Programming Language choices
10.1.1 Overview of embedded programming languages
10.1.2 Traditional C and ANSI C
10.1.3 C++ and embedded C++
Embedded C++
10.1.4 Matlab
10.1.5 Embedded Java
The Allure of Embedded Java
Embedded Java: the dark side
10.1.6 C Extensions for Digital Signal Processing
Restricted Pointers
Fixed Point Data Types
Circular Arrays
Matrix referencing and operators
10.1.7 Pragmas, Intrinsics and Inline Assembler
Compiler pragmas and type annotations
Assembler inserts and intrinsics
10.2 Performance, Benchmarking and Tuning
10.2.1 Importance and methodology
10.2.2 Tuning an application for performance
Profiling
Performance Tuning and Compilers
Developing for ILP targets
Write code for predication
Localize variables
Help memory alias analysis
Eliminate redundant loads
Enable prefetching and exploit locality properties
Specialize code
Use look-up tables and memoization
10.2.3 Benchmarking
10.3 Scalability and Customizability
10.3.1 Scalability and Architecture Families
10.3.2 Exploration and Scalability
10.3.3 Customization
Customized implementations
10.3.4 Reconfigurable hardware
Using programmable logic
10.3.5 Customizable processors and tools
Describing Processors
10.3.6 Tools for customization
Assemblers
Debuggers
The operating system
Customizable compilers
10.3.7 Architecture exploration is harder than it seems
Difficulty in characterizing the workload.
Difficulty in writing retargetable tools.
Fast measurement of performance.
Dealing with the complexity
Other barriers to customization
10.4 Further Reading
10.5 Exercises
Chapter 11: Application Areas
11.1 Digital Printing and Imaging
11.1.1 Photo Printing Pipeline
JPEG decompression
Scaling
Colorspace Conversion
Dithering
11.1.2 Implementation and Performance
11.2 Telecom applications
11.2.1 Voice coding
Waveform codecs
Vocoders
Hybrid coders
11.2.2 Multiplexing
11.2.3 The GSM Enhanced Full Rate codec
11.3 Other application areas
11.3.1 Digital video
MPEG-1 and MPEG-2
MPEG-4
11.3.2 Automotive
Fail-safety and fault-tolerance
Engine control units
In-vehicle networking
11.3.3 Hard disk drives
Motor Control
Data decoding
Disk scheduling and on-disk management tasks
Disk scheduling and off-disk management tasks
11.3.4 Networking and Network Processors
Network Processors
11.4 Further Reading
11.5 Exercises
Appendix A: The VEX System
A.1 The VEX Instruction Set Architecture
VEX Assembler Notation
A.1.1 Clusters
A.1.2 Execution Model
A.1.3 Architecture State
A.1.4 Arithmetic and Logic Operations
Examples
A.1.5 Intercluster Communication
Examples
A.1.6 Memory Operations
A.1.7 Control Operations
Examples
A.1.8 Structure of the default VEX Cluster
Register Files and Immediates
A.1.9 VEX Semantics
A.2 The VEX Run-Time Architecture
A.2.1 Data Allocation and Layout
A.2.2 Register Usage
A.2.3 Stack Layout and Procedure Linkage
Procedure Linkage
A.3 The VEX C Compiler
A.3.1 Command-line Options
Output Files
Preprocessing
Optimization
Profiling
Language Definition
Libraries
Passing Options to Compile Phases
Terminal Output and Process Control
Other Options
A.3.2 Compiler Pragmas
Unrolling and Profiling
Assertions
Memory Disambiguation
Cache Control
A.3.3 Inline Expansion
A.3.4 Machine Model Parameters
A.3.5 Custom Instructions
A.4 Visualization Tools
A.5 The VEX Simulation System
Gprof support
Simulating custom instructions
Simulating Memory Hierarchy
A.6 Customizing the VEX Toolchain
Clusters
Machine Model Resources
Computational MEs
Register Bank MEs
Machine Resources
Memory Hierarchy Parameters
A.7 Examples of tool usage
A.7.1 Compile and run
A.7.2 Profiling
A.7.3 Custom architectures
A.8 Exercises
List of Technical Terms and Keywords
ASIC (application-specific integrated circuit)
ASIP (application-specific instruction-set processor)
ATM (automated teller machine)
Assertion
Automaton
BDTI
Backend
Base+offset addressing
Binding
Bitrate
Bluetooth
Branch registers
Byte code
CAD (computer-aided design)
CFG (control flow graph)
CISC (complex instruction-set computer)
CMOS (complementary metal-oxide semiconductor)
CODEC (coder/decoder)
COMA (cache-only memory architecture)
Callee-save register
Caller-save register
Checkpointing
Codesign
Compensation code
Compiler
Conditional move
Control registers
Cydrome
DAG (directed acyclic graph)
DAXPY
DMA (direct memory access)
DRAM (dynamic random-access memory)
DTMF (dual-tone multi-frequency)
Data flow
Datapath
Disambiguation
Dispersal
Dithering
Dynamically linked library (DLL)
ECC (error correction codes)
ECL (emitter-coupled logic)
EDA (electronic design automation)
EEMBC
Embedded Java
Endianness
Equals machine
Exception
Execution time
FFT (fast Fourier transform)
FIFO (first-in, first-out)
FIR (finite impulse response)
FPGA (field-programmable gate array)
Finite-state machine
Full predication
Function pointer
Function specialization
General-purpose registers
Handheld
Hard coding
Hard macro
Hyperblock
IBURG
IIR (infinite impulse response)
ILP (instruction-level parallelism)
IMMU (instruction MMU)
IP (intellectual property block)
IPC (interprocess communication)
ISA (instruction-set architecture)
If-conversion
Inlining
Instruction
Instruction opcode
Instruction scheduling
Instrumentation
Interrupt
LALR (look-ahead left-to-right) parser
Leak
Less-than-or-equals machine
Link time
Linux
MAC (multiply-accumulate) instruction
MAX (multimedia acceleration extensions)
MIPS
MMU (memory management unit)
MOPS
MSB (most significant bit)
MTU (maximum transfer unit)
Mathematical relation
Memoization
Microcontroller
Microkernel
Middleware
Misprediction
Modulo scheduling
Multiflow Computer
Multithreading
Mux (multiplexer)
NRE (non-recurring engineering)
NaN (not-a-number)
NaT (not-a-thing)
Name space
OS (operating system)
Operation
Overlapped execution
Own data
PC (program counter)
PC-relative addressing
PDA (personal digital assistant)
PIC (position-independent code)
PID (position-independent data)
POS (point-of-sale) terminal
Parallel execution
Parallel issue
Parallelism
Partial predication
Phase ordering
Pipeline
Pipelining
Position-independent code (PIC)
Pragma (#pragma)
Precomputation
Preconditioning
Predication
Prefetching
Preserved register
Profile
Program invocation time
Quicksort
RTL (register transfer level)
Reentrant
Retargetable
SDRAM (synchronous DRAM)
SMP (symmetric multiprocessing)
SSE (streaming SIMD extensions)
Scan chain
Scheduling
Scratch register
Scratch-pad memory
Select
SoC (system-on-a-chip)
SoPC (system-on-a-programmable-chip)
Soft macro
Software pipeline
Speculation
Speculative operation
Spill
Static binding
Stub
Subword
Superblock
Supercomputing
Superpipelining
Superscalar
Superword
Syllable
Synthesizable
TLB (translation lookaside buffer)
TTL (transistor-transistor logic)
Timestamp
Toolchain
Trace scheduling
Trampoline code
Treegion
Trip count
UART (universal asynchronous receiver/transmitter)
UDP (user datagram protocol)
UML (unified modeling language)
Unicode
VGA (video graphics array)
VLSI (very large-scale integration)
VM (virtual memory)
Varargs
Vectorizable
Verilog
Versioning
Virtual machine
Vocoder