Home ] Meet the Authors ] Errata Corrige ] Suggestions ] Table of Contents ] Downloads ] VEX forum ]

Embedded Computing

A VLIW Approach to Architecture, Compilers and Tools

Joseph A. Fisher, Paolo Faraboschi, Cliff Young

 

Table of Contents

 

About the Authors

Foreword

Preface

    Content and Structure

    The VEX (VLIW Example) computing system

    Audience

    Cross-cutting Topics

    How to read this book

Acknowledgements

Figure Acknowledgements

Chapter 1: An Introduction to Embedded Processing

    1.1 What is Embedded Computing?

        1.1.1 Attributes of embedded devices

        1.1.2 Embedded is growing

    1.2 Distinguishing Between Embedded and General Purpose Computing

        1.2.1 The "run one program only" phenomenon

        1.2.2 Backward and Binary compatibility

            Feature compatibility

            Source compatibility

            Tool compatibility

            Operating System compatibility

        1.2.3 Physical limits in the embedded domain

            Cost

            Power consumption and energy efficiency

            Physical dimensions

    1.3 Characterizing Embedded Computing

        1.3.1 Categorization by type of processing engine

            Digital Signal Processors

            Network Processors

        1.3.2 Categorization by application area

            Image Processing and Consumer

            Printers.        

            Cameras.

            Digital Video.

            Communication

            Wired phone networks.

            Cellular networks.

            Cellular telephones.

            Routers.

            Automotive

        1.3.3 Categorization by Workload differences

            Controlling

            Switching and routing

            Media processing

    1.4 Embedded market structure

        1.4.1 The market for embedded processor cores

        1.4.2 Business model of embedded processors

            Getting a better yield.

            Controlling Area and Power Consumption.

            Testing costs

        1.4.3 Costs and Product Volume

        1.4.4 Software and The Embedded Software Market

        1.4.5 Industry standards

        1.4.6 Product lifecycle

        1.4.7 The Transition to System-on-Chip Design

            Effects of System-On-Chip on the business model

            Centers of embedded design

        1.4.8 The Future of Embedded Systems

            Connectivity: Always-on Infrastructure

            State: Personal Storage

            Administration

            Security

            The Next Generation

    1.5 Further Reading

    1.6 Exercises

 

Chapter 2: An Overview of VLIW and ILP

    2.1 Semantics and parallelism

        2.1.1 Baseline: sequential program semantics

        2.1.2 Pipelined execution, overlapped execution and multiple execution units

        2.1.3 Dependence and program rearrangement

        2.1.4 ILP and other forms of parallelism

    2.2 Design philosophies

        2.2.1 An illustration of design philosophies: RISC vs. CISC

        2.2.2 First definition of VLIW

        2.2.3 A design philosophy: VLIW

            VLIW vs. Superscalar

            VLIW vs. DSP

    2.3 Role of the compiler

        2.3.1 The phases of a high-performance compiler

        2.3.2 Compiling for ILP and VLIW

            Global instruction scheduling

            Software Pipelining

            Profiling

    2.4 VLIW in the embedded and DSP domains

    2.5 Historical Perspective and Further Reading

        2.5.1 ILP hardware in the 1960s and 70s

            Early supercomputer arithmetic units.

            Attached signal processors

            Horizontal microcode

        2.5.2 The development of ILP code generation in the 1980s

            Acyclic Microcode Compaction Techniques

            Inadequacy of local compaction.

            First global compaction attempts.

            Acyclic Region Scheduling Techniques.

            Cyclic Techniques: Software Pipelining

            Systematic software pipelining.

        2.5.3 VLIW development in the 1980s

        2.5.4 ILP in the 1990s and 2000s

    2.6 Exercises

 

Chapter 3: An Overview of ISA Design

    3.1 Overview: What to Hide

        3.1.1 Architectural State: Memory and Registers

        3.1.2 Pipelining and Operational Latency

        3.1.3 Multiple Issue and Hazards

            Exposing dependence and independence

            Structural Hazards

            Resource Hazards

        3.1.4 Exception and Interrupt Handling

        3.1.5 Discussion

    3.2 Basic VLIW design principles

        3.2.1 Implications for Compilers and Implementations

        3.2.2 Execution model subtleties

            "Equals" Model.

            "Less-than-or-Equals" Model.

    3.3 Designing a VLIW ISA for Embedded Systems

        3.3.1 Application domain

        3.3.2 ILP Style

        3.3.3 Hardware/Software Trade-offs

    3.4 Instruction-Set Encoding

        3.4.1 A Larger Definition of Architecture

        3.4.2 Encoding and Architectural Style

            RISC encodings

            CISC encodings

            VLIW encodings

            Why not Superscalar encodings?

            DSP encodings

            Vector encodings

    3.5 VLIW Encoding

        3.5.1 Operation encoding

        3.5.2 Instruction encoding

            Fixed-overhead encoding

            Distributed encoding

            Template-based encoding

        3.5.3 Dispatching and opcode subspaces

    3.6 Encoding and Instruction-Set Extensions

    3.7 Further Reading

    3.8 Exercises

 

Chapter 4: Architectural Structures in ISA design

    4.1 The Datapath

        4.1.1 Location of operands and results

        4.1.2 Datapath Width

        4.1.3 Operation Repertoire

            Simple integer and compare operations.

            Carry, Overflow and Other Flags.

            Common bitwise utilities.

            Integer Multiplication.

            Fixed-Point multiplication.

            Integer division.

            Floating Point Operations.

            Saturated Arithmetic.

        4.1.4 Micro-SIMD Operations

            Alignment issues.

            Precision issues.

            Dealing with control flow.

            Pack, unpack, and mix.

            Reductions.

        4.1.5 Constants

    4.2 Registers and Clusters

        4.2.1 Clustering

            Architecturally Invisible Clustering.

            Architecturally Visible Clustering.

        4.2.2 Heterogeneous register files

        4.2.3 Address and Data Registers

        4.2.4 Special register file features

    4.3 Memory Architecture

        4.3.1 Addressing Modes

        4.3.2 Access Sizes

        4.3.3 Alignment issues

        4.3.4 Caches and local memories

            Prefetching

            Local Memories and Lockable Caches

        4.3.5 Exotic modes for embedded processing

    4.4 Branch Architecture

        4.4.1 Unbundling Branches

            Two-step Branching.

            Three-step Branching.

        4.4.2 Multi-way branches

        4.4.3 Multi-cluster branches

        4.4.4 Branches and loops

    4.5 Speculation and Predication

        4.5.1 Speculation

            Control Speculation

            "Non-Recovery" speculation model.

            "Recovery" Speculation Model.

            Data Speculation

        4.5.2 Predication

            Full prediction

            Partial predication

            Cost and benefits of predication

            Predication in the embedded domain

    4.6 System Operations

    4.7 Further Reading

    4.8 Exercises

 

Chapter 5: Microarchitecture Design

    5.1 Register File Design

        5.1.1 Register File Structure

        5.1.2 Register Files, Technology, and Clustering

        5.1.3 Separate Address and Data Register Files

        5.1.4 Special Registers and Register File Features

    5.2 Pipeline Design

        5.2.1 Balancing a Pipeline

    5.3 VLIW Fetch, Sequencing and Decoding

        5.3.1 Instruction Fetch

        5.3.2 Alignment and Instruction Length

        5.3.3 Decoding and Dispersal

        5.3.4 Decoding and ISA Extensions

    5.4 The Datapath

        5.4.1 Execution Units

        5.4.2 Bypassing and Forwarding Logic

        5.4.3 Exposing Latencies

        5.4.4 Predication and Selects

    5.5 Memory Architecture

        5.5.1 Local memory and caches

        5.5.2 Byte manipulation

        5.5.3 Addressing, Protection and Virtual Memory

        5.5.4 Memories in Multiprocessor Systems

        5.5.5 Memory Speculation

    5.6 Control Unit

        5.6.1 The Branch Architecture

        5.6.2 Predication and Selects

        5.6.3 Interrupts and Exceptions

        5.6.4 Exceptions and pipelining

            "Drain" and "Flush" Pipeline Models

            Early Commit

            Delayed Commit

    5.7 Control Registers

    5.8 Power Considerations

        5.8.1 Energy Efficiency and ILP

            System-Level Power Considerations.

    5.9 Further Reading

    5.10 Exercises

 

Chapter 6: System Design and Simulation

    6.1 System-on-Chip (SoC)

        6.1.1 IP Blocks and Design Reuse

            A concrete SoC Example

            Virtual Components and the VSIA Alliance

        6.1.2 Design Flows

            Creation Flow

            Verification Flow

        6.1.3 SoC Buses

            Data Widths

            Masters, Slaves and Arbiters

            Bus Transactions

            Simple Transactions.

            Burst Transactions.

            Split Transactions.

            Test Modes

    6.2 Processor Cores and System-On-Chip

        6.2.1 Non-programmable accelerators

            Reconfigurable logic

        6.2.2 Multiprocessing on a chip

            Symmetric Multiprocessing

            Heterogeneous Multiprocessing

            Example: a multi-core platform for mobile multimedia

    6.3 Overview of Simulation

        6.3.1 Using simulators

    6.4 Simulating a VLIW architecture

        6.4.1 Interpretation

        6.4.2 Compiled simulation

            Memory

            Registers

            Control Flow

            Exceptions

            Analysis of compiled simulation

            Performance measurement and compiled simulation

        6.4.3 Dynamic binary translation

        6.4.4 Trace-Driven Simulation

    6.5 System simulation

        6.5.1 I/O and Concurrent Activities

        6.5.2 Hardware simulation

            Discrete Event Simulation

        6.5.3 Accelerating Simulation

            In-Circuit Emulation

            Hardware accelerators for simulation

    6.6 Validation and verification

        6.6.1 Simulation, Verification and Test

            Formal Verification

            Design for Testability

            Debugging support for SoC

    6.7 Further Reading

    6.8 Exercises

 

Chapter 7: Embedded Compiling and Toolchains

    7.1 What is important in an ILP Compiler?

    7.2 Embedded cross-development toolchains

        7.2.1 Compiler

        7.2.2 Assembler

        7.2.3 Libraries

        7.2.4 Linker

        7.2.5 Post-link optimizer

        7.2.6 Run-time program loader

        7.2.7 Simulator

        7.2.8 Debugger, monitor ROMs

        7.2.9 Automated test systems

        7.2.10 Profiling tools

        7.2.11 Binary utilities

    7.3 Structure of an ILP compiler

        7.3.1 Front end

        7.3.2 Machine-independent optimizer

        7.3.3 Back end: machine-specific optimizations.

    7.4 Code Layout

        7.4.1 Code Layout Techniques

            DAG-based placement

            The "Pettis-Hansen" technique

            Procedure Inlining.

            Cache line coloring

            Temporal-Order Placement

    7.5 Embedded-specific trade-offs for compilers

        7.5.1 Space vs. Time vs. Energy trade-offs

        7.5.2 Power-specific optimizations

            Fundamentals of Power Dissipation

            Switching.

            Leakage

            Power-aware software techniques

            Reducing Switching Activity.

            Power-Aware Instruction Selection.

            Scheduling for Minimal Dissipation.

            Memory Access Optimizations.

            Data Remapping.

    7.6 DSP-Specific Compiler Optimizations

            Compiler-visible features of DSPs

            Heterogeneous registers.

            Addressing modes.

            Limited Connectivity.

            Local Memories.

            Harvard architecture.

            Instruction Selection and Scheduling

            Address Computation and Offset Assignment

            Local Memories

            Register Assignment Techniques

            Retargetable DSP and ASIP compilers

    7.7 Further Reading

    7.8 Exercises

 

Chapter 8: Compiling for VLIWs and ILP

    8.1 Profiling

        8.1.1 Kinds of profiles

        8.1.2 Profile collection

        8.1.3 Synthetic profiles (heuristics in lieu of profiles)

        8.1.4 Profile Bookkeeping and Methodology

        8.1.5 Profiles and embedded applications

    8.2 Scheduling

        8.2.1 Acyclic Region Types and Shapes

            Basic blocks

            Traces

            Superblocks

            Hyperblocks

            Treegions

            Percolation Scheduling

        8.2.2 Region formation

            Region selection

            Enlargement techniques

            Phase ordering considerations

        8.2.3 Schedule construction

            Analyzing programs for schedule construction

            Compaction Techniques

            Cycle vs. Operation scheduling

            Linear techniques

            Graph-based techniques (list scheduling)

            Compensation Code

            Another view of scheduling problems

        8.2.4 Resource management during scheduling

            Resource vectors

            Finite state automata

        8.2.5 Loop scheduling

            Modulo scheduling

            The Modulo Reservation Table (MRT).

            Searching for the II.

            Scheduling heuristics.

            Prologues and epilogues.

            Modulo variable expansion.

            Iterative Modulo Scheduling.

            Advanced modulo scheduling techniques.

        8.2.6 Clustering

    8.3 Register allocation

        8.3.1 Phase Ordering Issues

            Register allocation and scheduling

    8.4 Speculation and Predication

        8.4.1 Control and Data Speculation

        8.4.2 Predicated execution

        8.4.3 Prefetching

        8.4.4 Data Layout methods

        8.4.5 Static and hybrid branch prediction

    8.5 Instruction selection

    8.6 Further Reading

    8.7 Exercises

 

Chapter 9: The Run-time System

    9.1 Exceptions, interrupts, and traps

            Exception Handling

    9.2 Application Binary Interface considerations

        9.2.1 Loading programs

        9.2.2 Data layout

        9.2.3 Accessing global data

        9.2.4 Calling conventions

            Registers

            Call instructions

            Call sites

            Function prologues and epilogues

        9.2.5 Advanced ABI topics

            Variable-length argument lists

            Dynamic stack allocation

            Garbage collection

            Linguistic exceptions

    9.3 Code Compression

        9.3.1 Motivations

        9.3.2 Compression and Information Theory

        9.3.3 Architectural Compression Options

            Decompression on fetch

            Decompression on refill

            Load-time decompression.

        9.3.4 Compression Methods

            Hand-tuned ISAs.

            Ad-hoc compression schemes.

            RAM decompression.

            Dictionary-based software compression.

            Cache-based compression.

            Quantifying compression benefits

    9.4 Embedded Operating Systems

        9.4.1 "Traditional" OS issues, revisited

         9.4.2 Real-Time

            Real-Time Scheduling

        9.4.3 Multiple flows of control

            Threads, processes and microkernels

            Threads.

            Processes.

            Microkernels.

        9.4.4 Market Considerations

            Embedded Linux

        9.4.5 Downloadable code and Virtual Machines

    9.5 Multiprocessing and Multithreading

        9.5.1 Multiprocessing in the embedded world

        9.5.2 Multiprocessing and VLIW

    9.6 Further Reading

    9.7 Exercises

 

Chapter 10: Application Design and Customization

    10.1 Programming Language choices

        10.1.1 Overview of embedded programming languages

        10.1.2 Traditional C and ANSI C

        10.1.3 C++ and embedded C++

            Embedded C++

        10.1.4 Matlab

        10.1.5 Embedded Java

            The Allure of Embedded Java.

            Embedded Java: the dark side

        10.1.6 C Extensions for Digital Signal Processing

            Restricted Pointers

            Fixed Point Data Types

            Circular Arrays

            Matrix referencing and operators

        10.1.7 Pragmas, Intrinsics and Inline Assembler

            Compiler pragmas and type annotations.

            Assembler inserts and intrinsics.

    10.2 Performance, Benchmarking and Tuning

        10.2.1 Importance and methodology

        10.2.2 Tuning an application for performance

            Profiling

            Performance Tuning and Compilers

            Developing for ILP targets

            Write code for predication.

            Localize variables.

            Help memory alias analysis.

            Eliminate redundant loads.

            Enable prefetching and exploit locality properties.

            Specialize code.

            Use look-up tables and memoization.

        10.2.3 Benchmarking

    10.3 Scalability and Customizability

        10.3.1 Scalability and Architecture Families

        10.3.2 Exploration and Scalability

        10.3.3 Customization

            Customized implementations

        10.3.4 Reconfigurable hardware

            Using programmable logic

        10.3.5 Customizable processors and tools

            Describing Processors

        10.3.6 Tools for customization

            Assemblers

            Debuggers

            The operating system

            Customizable compilers

        10.3.7 Architecture exploration is harder than it seems

            Difficulty in characterizing the workload.

            Difficulty in writing retargetable tools.

            Fast measurement of performance.

            Dealing with the complexity

            Other barriers to customization

    10.4 Further Reading

    10.5 Exercises

 

Chapter 11: Application Areas

    11.1 Digital Printing and Imaging

        11.1.1 Photo Printing Pipeline

            JPEG decompression

            Scaling

            Colorspace Conversion

            Dithering

        11.1.2 Implementation and Performance

    11.2 Telecom applications

        11.2.1 Voice coding

            Waveform codecs,

            Vocoders

            Hybrid coders,

        11.2.2 Multiplexing

        11.2.3 The GSM Enhanced Full Rate codec

    11.3 Other application areas

        11.3.1 Digital video

            MPEG-1 and MPEG-2

            MPEG-4

        11.3.2 Automotive

            Fail-safety and fault-tolerance

            Engine control units

            In-vehicle networking

        11.3.3 Hard disk drives

            Motor Control

            Data decoding

            Disk scheduling and on-disk management tasks

            Disk scheduling and off-disk management tasks

        11.3.4 Networking and Network Processors

            Network Processors

    11.4 Further Reading

    11.5 Exercises

 

Appendix A: The VEX System

    A.1 The VEX Instruction Set Architecture

            VEX Assembler Notation

        A.1.1 Clusters

        A.1.2 Execution Model

        A.1.3 Architecture State

        A.1.4 Arithmetic and Logic Operations

            Examples

        A.1.5 Intercluster Communication

            Examples

        A.1.6 Memory Operations

        A.1.7 Control Operations

            Examples

        A.1.8 Structure of the default VEX Cluster

            Register Files and Immediates

        A.1.9 VEX Semantics

    A.2 The VEX Run-Time Architecture

        A.2.1 Data Allocation and Layout

        A.2.2 Register Usage

        A.2.3 Stack Layout and Procedure Linkage

            Procedure Linkage

    A.3 The VEX C Compiler

        A.3.1 Command-line Options

            Output Files

            Preprocessing

            Optimization

            Profiling

            Language Definition

            Libraries

            Passing Options to Compile Phases

            Terminal Output and Process Control

            Other Options

        A.3.2 Compiler Pragmas

            Unrolling and Profiling

            Assertions

            Memory Disambiguation

            Cache Control

        A.3.3 Inline Expansion

        A.3.4 Machine Model Parameters

        A.3.5 Custom Instructions

    A.4 Visualization Tools

    A.5 The VEX Simulation System

            Gprof support

            Simulating custom instructions

            Simulating Memory Hierarchy

    A.6 Customizing the VEX Toolchain

            Clusters

            Machine Model Resources

            Computational MEs

            Register Bank MEs

            Machine Resources

            Memory Hierarchy Parameters

    A.7 Examples of tool usage

        A.7.1 Compile and run

        A.7.2 Profiling

        A.7.3 Custom architectures

    A.8 Exercises

 

Appendix B: Glossary

Appendix C: Bibliography

List of Technical Terms / Keywords

ASIC (application-specific integrated circuit)
ASIP (application-specific instruction-set processor)
ATM (automated teller machine)
Assertion
Automaton
BDTI
Backend
Base+offset addressing
Binding
Bitrate
Bluetooth
Branch registers
Byte code
CAD (computer-aided design)
CFG (control flow graph)
CISC (complex instruction-set computer)
CMOS (complementary metal-oxide semiconductor)
CODEC (coder/decoder)
COMA (cache-only memory architecture)
Callee-save register
Caller-save register
Checkpointing
Codesign
Compensation code
Compiler
Conditional move
Control registers
Cydrome
DAG (directed acyclic graph)
DAXPY
DMA (direct memory access)
DRAM (dynamic random-access memory)
DTMF (dual-tone multi-frequency)
Data flow
Datapath
Disambiguation
Dispersal
Dithering
Dynamically linked library (DLL)
ECC (error correction codes)
ECL (emitter-coupled logic)
EDA (electronic design automation)
EEMBC
Embedded Java
Endianness
Equals machine
Exception
Execution time
FFT (fast Fourier transform)
FIFO (first-in, first-out)
FIR (finite impulse response)
FPGA (field-programmable gate array)
Finite-state machine
Full predication
Function pointer
Function specialization
General-purpose registers
Handheld
Hard coding
Hard macro
Hyperblock
IBURG
IIR (infinite impulse response)
ILP (instruction-level parallelism)
IMMU (instruction MMU)
IP (intellectual property block)
IPC (interprocess communication)
ISA (instruction-set architecture)
If-conversion
Inlining
Instruction
Instruction opcode
Instruction scheduling
Instrumentation
Interrupt
LALR (look-ahead left-to-right) Parser
Leak
Less-than-or-equals machine
Link time
Linux
MAC (multiply-accumulate) instruction
MAX (multimedia acceleration extensions)
MIPS
MMU (memory management unit)
MOPS
MSB (most significant bit)
MTU (maximum transfer unit)
Mathematical relation
Memoization
Microcontroller
Microkernel
Middleware
Misprediction
Modulo scheduling
Multiflow Computer
Multithreading
Mux (multiplexer)
NRE (non recurrent engineering)
NaN (not-a-number)
NaT (not-a-thing)
Name space
OS (operating system)
Operation
Overlapped execution
Own data
PC (program counter)
PC-relative addressing
PDA (personal digital assistant)
PIC (position-independent code)
PID (position-independent data)
POS (point-of-sale) terminal
Parallel execution
Parallel issue
Parallelism
Partial predication
Phase ordering
Pipeline
Pipelining
Position-independent code (PIC)
Pragma (#pragma)
Precomputation
Preconditioning
Predication
Prefetching
Preserved register
Profile
Program invocation time
Quicksort
RTL (register transfer level)
Reentrant
Retargetable
SDRAM (synchronous DRAM)
SMP (symmetric multiprocessing)
SSE (streaming SIMD extensions)
Scan chain
Scheduling
Scratch register
Scratch-pad memory
Select
SoC (system-on-a-chip)
SoPC (system-on-a-programmable-chip)
Soft macro
Software pipeline
Speculation
Speculative operation
Spill
Static binding
Stub
Subword
Superblock
Supercomputing
Superpipelining
Superscalar
Superword
Syllable
Synthesizable
TLB (translation lookaside buffer)
TTL (transistor-transistor logic)
Timestamp
Toolchain
Trace scheduling
Trampoline code
Treegion
Trip count
UART (universal asynchronous receiver/transmitter)
UARTs
UDP (user datagram protocol)
UML (unified modeling language)
Unicode
VGA (video graphics array)
VLSI (very large-scale integration)
VM (virtual memory)
VM was
Varargs
Vectorizable
Verilog
Versioning
Virtual machine
Vocoder

Home ] Meet the Authors ] Errata Corrige ] Suggestions ] Table of Contents ] Downloads ] VEX forum ]