Home ] Meet the Authors ] Errata Corrige ] Suggestions ] [ Table of Contents ] Downloads ] VEX forum ]

Embedded Computing

A VLIW Approach to Architecture, Compilers and Tools
Joseph A. Fisher, Paolo Faraboschi, Cliff Young

 

[click on each chapter to expand it or see the fully expanded flat ToC]

[front matter]
About the Authors
Foreword
Acknowledgements
Figure Acknowledgements
Preface
Content and Structure
The VEX (VLIW Example) computing system
Audience
Cross-cutting Topics
How to read this book
Chapter 1: An Introduction to Embedded Processing
    1.1 What is Embedded Computing?
        1.1.1 Attributes of embedded devices
        1.1.2 Embedded is growing
    1.2 Distinguishing Between Embedded and General Purpose Computing
        1.2.1 The "run one program only" phenomenon
        1.2.2 Backward and Binary compatibility
            Feature compatibility
            Source compatibility
            Tool compatibility
            Operating System compatibility
        1.2.3 Physical limits in the embedded domain
            Cost
            Power consumption and energy efficiency
            Physical dimensions
    1.3 Characterizing Embedded Computing
        1.3.1 Categorization by type of processing engine
            Digital Signal Processors
            Network Processors
        1.3.2 Categorization by application area
            Image Processing and Consumer
            Printers.        
            Cameras.
            Digital Video.
            Communication
            Wired phone networks.
            Cellular networks.
            Cellular telephones.
            Routers.
            Automotive
        1.3.3 Categorization by Workload differences
            Controlling
            Switching and routing
            Media processing
    1.4 Embedded market structure
        1.4.1 The market for embedded processor cores
        1.4.2 Business model of embedded processors
            Getting a better yield.
            Controlling Area and Power Consumption.
            Testing costs
        1.4.3 Costs and Product Volume
        1.4.4 Software and The Embedded Software Market
        1.4.5 Industry standards
        1.4.6 Product lifecycle
        1.4.7 The Transition to System-on-Chip Design
            Effects of System-On-Chip on the business model
            Centers of embedded design
        1.4.8 The Future of Embedded Systems
            Connectivity: Always-on Infrastructure
            State: Personal Storage
            Administration
            Security
            The Next Generation
    1.5 Further Reading
    1.6 Exercises
Chapter 2: An Overview of VLIW and ILP
    2.1 Semantics and parallelism
        2.1.1 Baseline: sequential program semantics
        2.1.2 Pipelined execution, overlapped execution and multiple execution units
        2.1.3 Dependence and program rearrangement
        2.1.4 ILP and other forms of parallelism
    2.2 Design philosophies
        2.2.1 An illustration of design philosophies: RISC vs. CISC
        2.2.2 First definition of VLIW
        2.2.3 A design philosophy: VLIW
            VLIW vs. Superscalar
            VLIW vs. DSP
    2.3 Role of the compiler
        2.3.1 The phases of a high-performance compiler
        2.3.2 Compiling for ILP and VLIW
            Global instruction scheduling
            Software Pipelining
            Profiling
    2.4 VLIW in the embedded and DSP domains
    2.5 Historical Perspective and Further Reading
        2.5.1 ILP hardware in the 1960s and 70s
            Early supercomputer arithmetic units.
            Attached signal processors
            Horizontal microcode
        2.5.2 The development of ILP code generation in the 1980s
            Acyclic Microcode Compaction Techniques
            Inadequacy of local compaction.
            First global compaction attempts.
            Acyclic Region Scheduling Techniques.
            Cyclic Techniques: Software Pipelining
            Systematic software pipelining.
        2.5.3 VLIW development in the 1980s
        2.5.4 ILP in the 1990s and 2000s
    2.6 Exercises
Chapter 3: An Overview of ISA Design
    3.1 Overview: What to Hide
        3.1.1 Architectural State: Memory and Registers
        3.1.2 Pipelining and Operational Latency
        3.1.3 Multiple Issue and Hazards
            Exposing dependence and independence
            Structural Hazards
            Resource Hazards
        3.1.4 Exception and Interrupt Handling
        3.1.5 Discussion
    3.2 Basic VLIW design principles
        3.2.1 Implications for Compilers and Implementations
        3.2.2 Execution model subtleties
            "Equals" Model.
            "Less-than-or-Equals" Model.
    3.3 Designing a VLIW ISA for Embedded Systems
        3.3.1 Application domain
        3.3.2 ILP Style
        3.3.3 Hardware/Software Trade-offs
    3.4 Instruction-Set Encoding
        3.4.1 A Larger Definition of Architecture
        3.4.2 Encoding and Architectural Style
            RISC encodings
            CISC encodings
            VLIW encodings
            Why not Superscalar encodings?
            DSP encodings
            Vector encodings
    3.5 VLIW Encoding
        3.5.1 Operation encoding
        3.5.2 Instruction encoding
            Fixed-overhead encoding
            Distributed encoding
            Template-based encoding
        3.5.3 Dispatching and opcode subspaces
    3.6 Encoding and Instruction-Set Extensions
    3.7 Further Reading
    3.8 Exercises
 
Chapter 4: Architectural Structures in ISA design
    4.1 The Datapath
        4.1.1 Location of operands and results
        4.1.2 Datapath Width
        4.1.3 Operation Repertoire
            Simple integer and compare operations.
            Carry, Overflow and Other Flags.
            Common bitwise utilities.
            Integer Multiplication.
            Fixed-Point multiplication.
            Integer division.
            Floating Point Operations.
            Saturated Arithmetic.
        4.1.4 Micro-SIMD Operations
            Alignment issues.
            Precision issues.
            Dealing with control flow.
            Pack, unpack, and mix.
            Reductions.
        4.1.5 Constants
    4.2 Registers and Clusters
        4.2.1 Clustering
            Architecturally Invisible Clustering.
            Architecturally Visible Clustering.
        4.2.2 Heterogeneous register files
        4.2.3 Address and Data Registers
        4.2.4 Special register file features
    4.3 Memory Architecture
        4.3.1 Addressing Modes
        4.3.2 Access Sizes
        4.3.3 Alignment issues
        4.3.4 Caches and local memories
            Prefetching
            Local Memories and Lockable Caches
        4.3.5 Exotic modes for embedded processing
    4.4 Branch Architecture
        4.4.1 Unbundling Branches
            Two-step Branching.
            Three-step Branching.
        4.4.2 Multi-way branches
        4.4.3 Multi-cluster branches
        4.4.4 Branches and loops
    4.5 Speculation and Predication
        4.5.1 Speculation
            Control Speculation
            "Non-Recovery" speculation model.
            "Recovery" Speculation Model.
            Data Speculation
        4.5.2 Predication
            Full prediction
            Partial predication
            Cost and benefits of predication
            Predication in the embedded domain
    4.6 System Operations
    4.7 Further Reading
    4.8 Exercises
 
Chapter 5: Microarchitecture Design
    5.1 Register File Design
        5.1.1 Register File Structure
        5.1.2 Register Files, Technology, and Clustering
        5.1.3 Separate Address and Data Register Files
        5.1.4 Special Registers and Register File Features
    5.2 Pipeline Design
        5.2.1 Balancing a Pipeline
    5.3 VLIW Fetch, Sequencing and Decoding
        5.3.1 Instruction Fetch
        5.3.2 Alignment and Instruction Length
        5.3.3 Decoding and Dispersal
        5.3.4 Decoding and ISA Extensions
    5.4 The Datapath
        5.4.1 Execution Units
        5.4.2 Bypassing and Forwarding Logic
        5.4.3 Exposing Latencies
        5.4.4 Predication and Selects
    5.5 Memory Architecture
        5.5.1 Local memory and caches
        5.5.2 Byte manipulation
        5.5.3 Addressing, Protection and Virtual Memory
        5.5.4 Memories in Multiprocessor Systems
        5.5.5 Memory Speculation
    5.6 Control Unit
        5.6.1 The Branch Architecture
        5.6.2 Predication and Selects
        5.6.3 Interrupts and Exceptions
        5.6.4 Exceptions and pipelining
            "Drain" and "Flush" Pipeline Models
            Early Commit
            Delayed Commit
    5.7 Control Registers
    5.8 Power Considerations
        5.8.1 Energy Efficiency and ILP
            System-Level Power Considerations.
    5.9 Further Reading
    5.10 Exercises
 
Chapter 6: System Design and Simulation
    6.1 System-on-Chip (SoC)
        6.1.1 IP Blocks and Design Reuse
            A concrete SoC Example
            Virtual Components and the VSIA Alliance
        6.1.2 Design Flows
            Creation Flow
            Verification Flow
        6.1.3 SoC Buses
            Data Widths
            Masters, Slaves and Arbiters
            Bus Transactions
            Simple Transactions.
            Burst Transactions.
            Split Transactions.
            Test Modes
    6.2 Processor Cores and System-On-Chip
        6.2.1 Non-programmable accelerators
            Reconfigurable logic
        6.2.2 Multiprocessing on a chip
            Symmetric Multiprocessing
            Heterogeneous Multiprocessing
            Example: a multi-core platform for mobile multimedia
    6.3 Overview of Simulation
        6.3.1 Using simulators
    6.4 Simulating a VLIW architecture
        6.4.1 Interpretation
        6.4.2 Compiled simulation
            Memory
            Registers
            Control Flow
            Exceptions
            Analysis of compiled simulation
            Performance measurement and compiled simulation
        6.4.3 Dynamic binary translation
        6.4.4 Trace-Driven Simulation
    6.5 System simulation
        6.5.1 I/O and Concurrent Activities
        6.5.2 Hardware simulation
            Discrete Event Simulation
        6.5.3 Accelerating Simulation
            In-Circuit Emulation
            Hardware accelerators for simulation
    6.6 Validation and verification
        6.6.1 Simulation, Verification and Test
            Formal Verification
            Design for Testability
            Debugging support for SoC
    6.7 Further Reading
    6.8 Exercises
 
Chapter 7: Embedded Compiling and Toolchains
    7.1 What is important in an ILP Compiler?
    7.2 Embedded cross-development toolchains
        7.2.1 Compiler
        7.2.2 Assembler
        7.2.3 Libraries
        7.2.4 Linker
        7.2.5 Post-link optimizer
        7.2.6 Run-time program loader
        7.2.7 Simulator
        7.2.8 Debugger, monitor ROMs
        7.2.9 Automated test systems
        7.2.10 Profiling tools
        7.2.11 Binary utilities
    7.3 Structure of an ILP compiler
        7.3.1 Front end
        7.3.2 Machine-independent optimizer
        7.3.3 Back end: machine-specific optimizations.
    7.4 Code Layout
        7.4.1 Code Layout Techniques
            DAG-based placement
            The "Pettis-Hansen" technique
            Procedure Inlining.
            Cache line coloring
            Temporal-Order Placement
    7.5 Embedded-specific trade-offs for compilers
        7.5.1 Space vs. Time vs. Energy trade-offs
        7.5.2 Power-specific optimizations
            Fundamentals of Power Dissipation
            Switching.
            Leakage
            Power-aware software techniques
            Reducing Switching Activity.
            Power-Aware Instruction Selection.
            Scheduling for Minimal Dissipation.
            Memory Access Optimizations.
            Data Remapping.
    7.6 DSP-Specific Compiler Optimizations
            Compiler-visible features of DSPs
            Heterogeneous registers.
            Addressing modes.
            Limited Connectivity.
            Local Memories.
            Harvard architecture.
            Instruction Selection and Scheduling
            Address Computation and Offset Assignment
            Local Memories
            Register Assignment Techniques
            Retargetable DSP and ASIP compilers
    7.7 Further Reading
    7.8 Exercises
 
Chapter 8: Compiling for VLIWs and ILP
    8.1 Profiling
        8.1.1 Kinds of profiles
        8.1.2 Profile collection
        8.1.3 Synthetic profiles (heuristics in lieu of profiles)
        8.1.4 Profile Bookkeeping and Methodology
        8.1.5 Profiles and embedded applications
    8.2 Scheduling
        8.2.1 Acyclic Region Types and Shapes
            Basic blocks
            Traces
            Superblocks
            Hyperblocks
            Treegions
            Percolation Scheduling
        8.2.2 Region formation
            Region selection
            Enlargement techniques
            Phase ordering considerations
        8.2.3 Schedule construction
            Analyzing programs for schedule construction
            Compaction Techniques
            Cycle vs. Operation scheduling
            Linear techniques
            Graph-based techniques (list scheduling)
            Compensation Code
            Another view of scheduling problems
        8.2.4 Resource management during scheduling
            Resource vectors
            Finite state automata
        8.2.5 Loop scheduling
            Modulo scheduling
            The Modulo Reservation Table (MRT).
            Searching for the II.
            Scheduling heuristics.
            Prologues and epilogues.
            Modulo variable expansion.
            Iterative Modulo Scheduling.
            Advanced modulo scheduling techniques.
        8.2.6 Clustering
    8.3 Register allocation
        8.3.1 Phase Ordering Issues
            Register allocation and scheduling
    8.4 Speculation and Predication
        8.4.1 Control and Data Speculation
        8.4.2 Predicated execution
        8.4.3 Prefetching
        8.4.4 Data Layout methods
        8.4.5 Static and hybrid branch prediction
    8.5 Instruction selection
    8.6 Further Reading
    8.7 Exercises
 
Chapter 9: The Run-time System
    9.1 Exceptions, interrupts, and traps
            Exception Handling
    9.2 Application Binary Interface considerations
        9.2.1 Loading programs
        9.2.2 Data layout
        9.2.3 Accessing global data
        9.2.4 Calling conventions
            Registers
            Call instructions
            Call sites
            Function prologues and epilogues
        9.2.5 Advanced ABI topics
            Variable-length argument lists
            Dynamic stack allocation
            Garbage collection
            Linguistic exceptions
    9.3 Code Compression
        9.3.1 Motivations
        9.3.2 Compression and Information Theory
        9.3.3 Architectural Compression Options
            Decompression on fetch
            Decompression on refill
            Load-time decompression.
        9.3.4 Compression Methods
            Hand-tuned ISAs.
            Ad-hoc compression schemes.
            RAM decompression.
            Dictionary-based software compression.
            Cache-based compression.
            Quantifying compression benefits
    9.4 Embedded Operating Systems
        9.4.1 "Traditional" OS issues, revisited
         9.4.2 Real-Time
            Real-Time Scheduling
        9.4.3 Multiple flows of control
            Threads, processes and microkernels
            Threads.
            Processes.
            Microkernels.
        9.4.4 Market Considerations
            Embedded Linux
        9.4.5 Downloadable code and Virtual Machines
    9.5 Multiprocessing and Multithreading
        9.5.1 Multiprocessing in the embedded world
        9.5.2 Multiprocessing and VLIW
    9.6 Further Reading
    9.7 Exercises
 
Chapter 10: Application Design and Customization
    10.1 Programming Language choices
        10.1.1 Overview of embedded programming languages
        10.1.2 Traditional C and ANSI C
        10.1.3 C++ and embedded C++
            Embedded C++
        10.1.4 Matlab
        10.1.5 Embedded Java
            The Allure of Embedded Java.
            Embedded Java: the dark side
        10.1.6 C Extensions for Digital Signal Processing
            Restricted Pointers
            Fixed Point Data Types
            Circular Arrays
            Matrix referencing and operators
        10.1.7 Pragmas, Intrinsics and Inline Assembler
            Compiler pragmas and type annotations.
            Assembler inserts and intrinsics.
    10.2 Performance, Benchmarking and Tuning
        10.2.1 Importance and methodology
        10.2.2 Tuning an application for performance
            Profiling
            Performance Tuning and Compilers
            Developing for ILP targets
            Write code for predication.
            Localize variables.
            Help memory alias analysis.
            Eliminate redundant loads.
            Enable prefetching and exploit locality properties.
            Specialize code.
            Use look-up tables and memoization.
        10.2.3 Benchmarking
    10.3 Scalability and Customizability
        10.3.1 Scalability and Architecture Families
        10.3.2 Exploration and Scalability
        10.3.3 Customization
            Customized implementations
        10.3.4 Reconfigurable hardware
            Using programmable logic
        10.3.5 Customizable processors and tools
            Describing Processors
        10.3.6 Tools for customization
            Assemblers
            Debuggers
            The operating system
            Customizable compilers
        10.3.7 Architecture exploration is harder than it seems
            Difficulty in characterizing the workload.
            Difficulty in writing retargetable tools.
            Fast measurement of performance.
            Dealing with the complexity
            Other barriers to customization
    10.4 Further Reading
    10.5 Exercises
 
Chapter 11: Application Areas
    11.1 Digital Printing and Imaging
        11.1.1 Photo Printing Pipeline
            JPEG decompression
            Scaling
            Colorspace Conversion
            Dithering
        11.1.2 Implementation and Performance
    11.2 Telecom applications
        11.2.1 Voice coding
            Waveform codecs,
            Vocoders
            Hybrid coders,
        11.2.2 Multiplexing
        11.2.3 The GSM Enhanced Full Rate codec
    11.3 Other application areas
        11.3.1 Digital video
            MPEG-1 and MPEG-2
            MPEG-4
        11.3.2 Automotive
            Fail-safety and fault-tolerance
            Engine control units
            In-vehicle networking
        11.3.3 Hard disk drives
            Motor Control
            Data decoding
            Disk scheduling and on-disk management tasks
            Disk scheduling and off-disk management tasks
        11.3.4 Networking and Network Processors
            Network Processors
    11.4 Further Reading
    11.5 Exercises
 
Appendix A: The VEX System
    A.1 The VEX Instruction Set Architecture
            VEX Assembler Notation
        A.1.1 Clusters
        A.1.2 Execution Model
        A.1.3 Architecture State
        A.1.4 Arithmetic and Logic Operations
            Examples
        A.1.5 Intercluster Communication
            Examples
        A.1.6 Memory Operations
        A.1.7 Control Operations
            Examples
        A.1.8 Structure of the default VEX Cluster
            Register Files and Immediates
        A.1.9 VEX Semantics
    A.2 The VEX Run-Time Architecture
        A.2.1 Data Allocation and Layout
        A.2.2 Register Usage
        A.2.3 Stack Layout and Procedure Linkage
            Procedure Linkage
    A.3 The VEX C Compiler
        A.3.1 Command-line Options
            Output Files
            Preprocessing
            Optimization
            Profiling
            Language Definition
            Libraries
            Passing Options to Compile Phases
            Terminal Output and Process Control
            Other Options
        A.3.2 Compiler Pragmas
            Unrolling and Profiling
            Assertions
            Memory Disambiguation
            Cache Control
        A.3.3 Inline Expansion
        A.3.4 Machine Model Parameters
        A.3.5 Custom Instructions
    A.4 Visualization Tools
    A.5 The VEX Simulation System
            Gprof support
            Simulating custom instructions
            Simulating Memory Hierarchy
    A.6 Customizing the VEX Toolchain
            Clusters
            Machine Model Resources
            Computational MEs
            Register Bank MEs
            Machine Resources
            Memory Hierarchy Parameters
    A.7 Examples of tool usage
        A.7.1 Compile and run
        A.7.2 Profiling
        A.7.3 Custom architectures
    A.8 Exercises
 
[Appendix B-C]
Appendix B: Glossary
Appendix C: Bibliography
 
List of Technical Terms and Keywords
    ASIC (application-specific integrated circuit)
    ASIP (application-specific instruction-set processor)
    ATM (automated teller machine)
    Assertion
    Automaton
    BDTI
    Backend
    Base+offset addressing
    Binding
    Bitrate
    Bluetooth
    Branch registers
    Byte code
    CAD (computer-aided design)
    CFG (control flow graph)
    CISC (complex instruction-set computer)
    CMOS (complementary metal-oxide semiconductor)
    CODEC (coder/decoder)
    COMA (cache-only memory architecture)
    Callee-save register
    Caller-save register
    Checkpointing
    Codesign
    Compensation code
    Compiler
    Conditional move
    Control registers
    Cydrome
    DAG (directed acyclic graph)
    DAXPY
    DMA (direct memory access)
    DRAM (dynamic random-access memory)
    DTMF (dual-tone multi-frequency)
    Data flow
    Datapath
    Disambiguation
    Dispersal
    Dithering
    Dynamically linked library (DLL)
    ECC (error correction codes)
    ECL (emitter-coupled logic)
    EDA (electronic design automation)
    EEMBC
    Embedded Java
    Endianness
    Equals machine
    Exception
    Execution time
    FFT (fast Fourier transform)
    FIFO (first-in, first-out)
    FIR (finite impulse response)
    FPGA (field-programmable gate array)
    Finite-state machine
    Full predication
    Function pointer
    Function specialization
    General-purpose registers
    Handheld
    Hard coding
    Hard macro
    Hyperblock
    IBURG
    IIR (infinite impulse response)
    ILP (instruction-level parallelism)
    IMMU (instruction MMU)
    IP (intellectual property block)
    IPC (interprocess communication)
    ISA (instruction-set architecture)
    If-conversion
    Inlining
    Instruction
    Instruction opcode
    Instruction scheduling
    Instrumentation
    Interrupt
    LALR (look-ahead left-to-right) Parser
    Leak
    Less-than-or-equals machine
    Link time
    Linux
    MAC (multiply-accumulate) instruction
    MAX (multimedia acceleration extensions)
    MIPS
    MMU (memory management unit)
    MOPS
    MSB (most significant bit)
    MTU (maximum transfer unit)
    Mathematical relation
    Memoization
    Microcontroller
    Microkernel
    Middleware
    Misprediction
    Modulo scheduling
    Multiflow Computer
    Multithreading
    Mux (multiplexer)
    NRE (non recurrent engineering)
    NaN (not-a-number)
    NaT (not-a-thing)
    Name space
    OS (operating system)
    Operation
    Overlapped execution
    Own data
    PC (program counter)
    PC-relative addressing
    PDA (personal digital assistant)
    PIC (position-independent code)
    PID (position-independent data)
    POS (point-of-sale) terminal
    Parallel execution
    Parallel issue
    Parallelism
    Partial predication
    Phase ordering
    Pipeline
    Pipelining
    Position-independent code (PIC)
    Pragma (#pragma)
    Precomputation
    Preconditioning
    Predication
    Prefetching
    Preserved register
    Profile
    Program invocation time
    Quicksort
    RTL (register transfer level)
    Reentrant
    Retargetable
    SDRAM (synchronous DRAM)
    SMP (symmetric multiprocessing)
    SSE (streaming SIMD extensions)
    Scan chain
    Scheduling
    Scratch register
    Scratch-pad memory
    Select
    SoC (system-on-a-chip)
    SoPC (system-on-a-programmable-chip)
    Soft macro
    Software pipeline
    Speculation
    Speculative operation
    Spill
    Static binding
    Stub
    Subword
    Superblock
    Supercomputing
    Superpipelining
    Superscalar
    Superword
    Syllable
    Synthesizable
    TLB (translation lookaside buffer)
    TTL (transistor-transistor logic)
    Timestamp
    Toolchain
    Trace scheduling
    Trampoline code
    Treegion
    Trip count
    UART (universal asynchronous receiver/transmitter)
    UARTs
    UDP (user datagram protocol)
    UML (unified modeling language)
    Unicode
    VGA (video graphics array)
    VLSI (very large-scale integration)
    VM (virtual memory)
    VM was
    Varargs
    Vectorizable
    Verilog
    Versioning
    Virtual machine
    Vocoder