Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools
Joseph A. Fisher, Paolo Faraboschi, Cliff Young
Table of Contents
About the Authors
Foreword
Preface
  Content and Structure
  The VEX (VLIW Example) Computing System
  Audience
  Cross-cutting Topics
  How to Read This Book
Acknowledgements
Figure Acknowledgements

Chapter 1: An Introduction to Embedded Processing
  1.1 What is Embedded Computing?
    1.1.1 Attributes of embedded devices
    1.1.2 Embedded is growing
  1.2 Distinguishing Between Embedded and General-Purpose Computing
    1.2.1 The "run one program only" phenomenon
    1.2.2 Backward and binary compatibility
      Feature compatibility
      Source compatibility
      Tool compatibility
      Operating system compatibility
    1.2.3 Physical limits in the embedded domain
      Cost
      Power consumption and energy efficiency
      Physical dimensions
  1.3 Characterizing Embedded Computing
    1.3.1 Categorization by type of processing engine
      Digital signal processors
      Network processors
    1.3.2 Categorization by application area
      Image processing and consumer
        Printers. Cameras. Digital video.
      Communication
        Wired phone networks. Cellular networks. Cellular telephones. Routers.
      Automotive
    1.3.3 Categorization by workload differences
      Controlling
      Switching and routing
      Media processing
  1.4 Embedded Market Structure
    1.4.1 The market for embedded processor cores
    1.4.2 Business model of embedded processors
      Getting a better yield. Controlling area and power consumption. Testing costs.
    1.4.3 Costs and product volume
    1.4.4 Software and the embedded software market
    1.4.5 Industry standards
    1.4.6 Product lifecycle
    1.4.7 The transition to system-on-chip design
      Effects of system-on-chip on the business model
      Centers of embedded design
    1.4.8 The future of embedded systems
      Connectivity: always-on infrastructure
      State: personal storage
      Administration
      Security
      The next generation
  1.5 Further Reading
  1.6 Exercises
Chapter 2: An Overview of VLIW and ILP
  2.1 Semantics and Parallelism
    2.1.1 Baseline: sequential program semantics
    2.1.2 Pipelined execution, overlapped execution, and multiple execution units
    2.1.3 Dependence and program rearrangement
    2.1.4 ILP and other forms of parallelism
  2.2 Design Philosophies
    2.2.1 An illustration of design philosophies: RISC vs. CISC
    2.2.2 First definition of VLIW
    2.2.3 A design philosophy: VLIW
      VLIW vs. superscalar
      VLIW vs. DSP
  2.3 Role of the Compiler
    2.3.1 The phases of a high-performance compiler
    2.3.2 Compiling for ILP and VLIW
      Global instruction scheduling
      Software pipelining
      Profiling
  2.4 VLIW in the Embedded and DSP Domains
  2.5 Historical Perspective and Further Reading
    2.5.1 ILP hardware in the 1960s and 70s
      Early supercomputer arithmetic units
      Attached signal processors
      Horizontal microcode
    2.5.2 The development of ILP code generation in the 1980s
      Acyclic microcode compaction techniques
        Inadequacy of local compaction. First global compaction attempts. Acyclic region scheduling techniques.
      Cyclic techniques: software pipelining
        Systematic software pipelining.
    2.5.3 VLIW development in the 1980s
    2.5.4 ILP in the 1990s and 2000s
  2.6 Exercises
Chapter 3: An Overview of ISA Design
  3.1 Overview: What to Hide
    3.1.1 Architectural state: memory and registers
    3.1.2 Pipelining and operational latency
    3.1.3 Multiple issue and hazards
      Exposing dependence and independence
      Structural hazards
      Resource hazards
    3.1.4 Exception and interrupt handling
    3.1.5 Discussion
  3.2 Basic VLIW Design Principles
    3.2.1 Implications for compilers and implementations
    3.2.2 Execution model subtleties
      "Equals" model. "Less-than-or-equals" model.
  3.3 Designing a VLIW ISA for Embedded Systems
    3.3.1 Application domain
    3.3.2 ILP style
    3.3.3 Hardware/software trade-offs
  3.4 Instruction-Set Encoding
    3.4.1 A larger definition of architecture
    3.4.2 Encoding and architectural style
      RISC encodings
      CISC encodings
      VLIW encodings
      Why not superscalar encodings?
      DSP encodings
      Vector encodings
  3.5 VLIW Encoding
    3.5.1 Operation encoding
    3.5.2 Instruction encoding
      Fixed-overhead encoding
      Distributed encoding
      Template-based encoding
    3.5.3 Dispatching and opcode subspaces
  3.6 Encoding and Instruction-Set Extensions
  3.7 Further Reading
  3.8 Exercises
Chapter 4: Architectural Structures in ISA Design
  4.1 The Datapath
    4.1.1 Location of operands and results
    4.1.2 Datapath width
    4.1.3 Operation repertoire
      Simple integer and compare operations. Carry, overflow, and other flags. Common bitwise utilities. Integer multiplication. Fixed-point multiplication. Integer division. Floating-point operations. Saturated arithmetic.
    4.1.4 Micro-SIMD operations
      Alignment issues. Precision issues. Dealing with control flow. Pack, unpack, and mix. Reductions.
    4.1.5 Constants
  4.2 Registers and Clusters
    4.2.1 Clustering
      Architecturally invisible clustering. Architecturally visible clustering.
    4.2.2 Heterogeneous register files
    4.2.3 Address and data registers
    4.2.4 Special register file features
  4.3 Memory Architecture
    4.3.1 Addressing modes
    4.3.2 Access sizes
    4.3.3 Alignment issues
    4.3.4 Caches and local memories
      Prefetching
      Local memories and lockable caches
    4.3.5 Exotic modes for embedded processing
  4.4 Branch Architecture
    4.4.1 Unbundling branches
      Two-step branching. Three-step branching.
    4.4.2 Multi-way branches
    4.4.3 Multi-cluster branches
    4.4.4 Branches and loops
  4.5 Speculation and Predication
    4.5.1 Speculation
      Control speculation
        "Non-recovery" speculation model. "Recovery" speculation model.
      Data speculation
    4.5.2 Predication
      Full predication
      Partial predication
      Costs and benefits of predication
      Predication in the embedded domain
  4.6 System Operations
  4.7 Further Reading
  4.8 Exercises
Chapter 5: Microarchitecture Design
  5.1 Register File Design
    5.1.1 Register file structure
    5.1.2 Register files, technology, and clustering
    5.1.3 Separate address and data register files
    5.1.4 Special registers and register file features
  5.2 Pipeline Design
    5.2.1 Balancing a pipeline
  5.3 VLIW Fetch, Sequencing, and Decoding
    5.3.1 Instruction fetch
    5.3.2 Alignment and instruction length
    5.3.3 Decoding and dispersal
    5.3.4 Decoding and ISA extensions
  5.4 The Datapath
    5.4.1 Execution units
    5.4.2 Bypassing and forwarding logic
    5.4.3 Exposing latencies
    5.4.4 Predication and selects
  5.5 Memory Architecture
    5.5.1 Local memory and caches
    5.5.2 Byte manipulation
    5.5.3 Addressing, protection, and virtual memory
    5.5.4 Memories in multiprocessor systems
    5.5.5 Memory speculation
  5.6 Control Unit
    5.6.1 The branch architecture
    5.6.2 Predication and selects
    5.6.3 Interrupts and exceptions
    5.6.4 Exceptions and pipelining
      "Drain" and "flush" pipeline models
      Early commit
      Delayed commit
  5.7 Control Registers
  5.8 Power Considerations
    5.8.1 Energy efficiency and ILP
      System-level power considerations.
  5.9 Further Reading
  5.10 Exercises
Chapter 6: System Design and Simulation
  6.1 System-on-Chip (SoC)
    6.1.1 IP blocks and design reuse
      A concrete SoC example
      Virtual components and the VSIA Alliance
    6.1.2 Design flows
      Creation flow
      Verification flow
    6.1.3 SoC buses
      Data widths
      Masters, slaves, and arbiters
      Bus transactions
        Simple transactions. Burst transactions. Split transactions.
      Test modes
  6.2 Processor Cores and System-on-Chip
    6.2.1 Non-programmable accelerators
      Reconfigurable logic
    6.2.2 Multiprocessing on a chip
      Symmetric multiprocessing
      Heterogeneous multiprocessing
      Example: a multi-core platform for mobile multimedia
  6.3 Overview of Simulation
    6.3.1 Using simulators
  6.4 Simulating a VLIW Architecture
    6.4.1 Interpretation
    6.4.2 Compiled simulation
      Memory
      Registers
      Control flow
      Exceptions
      Analysis of compiled simulation
      Performance measurement and compiled simulation
    6.4.3 Dynamic binary translation
    6.4.4 Trace-driven simulation
  6.5 System Simulation
    6.5.1 I/O and concurrent activities
    6.5.2 Hardware simulation
      Discrete event simulation
    6.5.3 Accelerating simulation
      In-circuit emulation
      Hardware accelerators for simulation
  6.6 Validation and Verification
    6.6.1 Simulation, verification, and test
      Formal verification
      Design for testability
      Debugging support for SoC
  6.7 Further Reading
  6.8 Exercises
Chapter 7: Embedded Compiling and Toolchains
  7.1 What is Important in an ILP Compiler?
  7.2 Embedded Cross-Development Toolchains
    7.2.1 Compiler
    7.2.2 Assembler
    7.2.3 Libraries
    7.2.4 Linker
    7.2.5 Post-link optimizer
    7.2.6 Run-time program loader
    7.2.7 Simulator
    7.2.8 Debugger, monitor ROMs
    7.2.9 Automated test systems
    7.2.10 Profiling tools
    7.2.11 Binary utilities
  7.3 Structure of an ILP Compiler
    7.3.1 Front end
    7.3.2 Machine-independent optimizer
    7.3.3 Back end: machine-specific optimizations
  7.4 Code Layout
    7.4.1 Code layout techniques
      DAG-based placement
      The Pettis-Hansen technique
      Procedure inlining
      Cache line coloring
      Temporal-order placement
  7.5 Embedded-Specific Trade-offs for Compilers
    7.5.1 Space vs. time vs. energy trade-offs
    7.5.2 Power-specific optimizations
      Fundamentals of power dissipation
        Switching. Leakage.
      Power-aware software techniques
        Reducing switching activity. Power-aware instruction selection. Scheduling for minimal dissipation. Memory access optimizations. Data remapping.
  7.6 DSP-Specific Compiler Optimizations
    Compiler-visible features of DSPs
      Heterogeneous registers. Addressing modes. Limited connectivity. Local memories. Harvard architecture.
    Instruction selection and scheduling
    Address computation and offset assignment
    Local memories
    Register assignment techniques
    Retargetable DSP and ASIP compilers
  7.7 Further Reading
  7.8 Exercises
Chapter 8: Compiling for VLIWs and ILP
  8.1 Profiling
    8.1.1 Kinds of profiles
    8.1.2 Profile collection
    8.1.3 Synthetic profiles (heuristics in lieu of profiles)
    8.1.4 Profile bookkeeping and methodology
    8.1.5 Profiles and embedded applications
  8.2 Scheduling
    8.2.1 Acyclic region types and shapes
      Basic blocks
      Traces
      Superblocks
      Hyperblocks
      Treegions
      Percolation scheduling
    8.2.2 Region formation
      Region selection
      Enlargement techniques
      Phase-ordering considerations
    8.2.3 Schedule construction
      Analyzing programs for schedule construction
      Compaction techniques
      Cycle vs. operation scheduling
      Linear techniques
      Graph-based techniques (list scheduling)
      Compensation code
      Another view of scheduling problems
    8.2.4 Resource management during scheduling
      Resource vectors
      Finite state automata
    8.2.5 Loop scheduling
      Modulo scheduling
        The modulo reservation table (MRT). Searching for the II. Scheduling heuristics. Prologues and epilogues. Modulo variable expansion. Iterative modulo scheduling. Advanced modulo scheduling techniques.
    8.2.6 Clustering
  8.3 Register Allocation
    8.3.1 Phase-ordering issues
      Register allocation and scheduling
  8.4 Speculation and Predication
    8.4.1 Control and data speculation
    8.4.2 Predicated execution
    8.4.3 Prefetching
    8.4.4 Data layout methods
    8.4.5 Static and hybrid branch prediction
  8.5 Instruction Selection
  8.6 Further Reading
  8.7 Exercises
Chapter 9: The Run-time System
  9.1 Exceptions, Interrupts, and Traps
    Exception handling
  9.2 Application Binary Interface Considerations
    9.2.1 Loading programs
    9.2.2 Data layout
    9.2.3 Accessing global data
    9.2.4 Calling conventions
      Registers
      Call instructions
      Call sites
      Function prologues and epilogues
    9.2.5 Advanced ABI topics
      Variable-length argument lists
      Dynamic stack allocation
      Garbage collection
      Linguistic exceptions
  9.3 Code Compression
    9.3.1 Motivations
    9.3.2 Compression and information theory
    9.3.3 Architectural compression options
      Decompression on fetch
      Decompression on refill
      Load-time decompression
    9.3.4 Compression methods
      Hand-tuned ISAs. Ad-hoc compression schemes. RAM decompression. Dictionary-based software compression. Cache-based compression.
      Quantifying compression benefits
  9.4 Embedded Operating Systems
    9.4.1 "Traditional" OS issues, revisited
    9.4.2 Real-time
      Real-time scheduling
    9.4.3 Multiple flows of control
      Threads, processes, and microkernels
        Threads. Processes. Microkernels.
    9.4.4 Market considerations
      Embedded Linux
    9.4.5 Downloadable code and virtual machines
  9.5 Multiprocessing and Multithreading
    9.5.1 Multiprocessing in the embedded world
    9.5.2 Multiprocessing and VLIW
  9.6 Further Reading
  9.7 Exercises
Chapter 10: Application Design and Customization
  10.1 Programming Language Choices
    10.1.1 Overview of embedded programming languages
    10.1.2 Traditional C and ANSI C
    10.1.3 C++ and Embedded C++
      Embedded C++
    10.1.4 Matlab
    10.1.5 Embedded Java
      The allure of Embedded Java. Embedded Java: the dark side.
    10.1.6 C extensions for digital signal processing
      Restricted pointers
      Fixed-point data types
      Circular arrays
      Matrix referencing and operators
    10.1.7 Pragmas, intrinsics, and inline assembler
      Compiler pragmas and type annotations. Assembler inserts and intrinsics.
  10.2 Performance, Benchmarking, and Tuning
    10.2.1 Importance and methodology
    10.2.2 Tuning an application for performance
      Profiling
      Performance tuning and compilers
      Developing for ILP targets
        Write code for predication. Localize variables. Help memory alias analysis. Eliminate redundant loads. Enable prefetching and exploit locality properties. Specialize code. Use look-up tables and memoization.
    10.2.3 Benchmarking
  10.3 Scalability and Customizability
    10.3.1 Scalability and architecture families
    10.3.2 Exploration and scalability
    10.3.3 Customization
      Customized implementations
    10.3.4 Reconfigurable hardware
      Using programmable logic
    10.3.5 Customizable processors and tools
      Describing processors
    10.3.6 Tools for customization
      Assemblers
      Debuggers
      The operating system
      Customizable compilers
    10.3.7 Architecture exploration is harder than it seems
      Difficulty in characterizing the workload. Difficulty in writing retargetable tools. Fast measurement of performance. Dealing with the complexity. Other barriers to customization.
  10.4 Further Reading
  10.5 Exercises
Chapter 11: Application Areas
  11.1 Digital Printing and Imaging
    11.1.1 Photo printing pipeline
      JPEG decompression
      Scaling
      Colorspace conversion
      Dithering
    11.1.2 Implementation and performance
  11.2 Telecom Applications
    11.2.1 Voice coding
      Waveform codecs. Vocoders. Hybrid coders.
    11.2.2 Multiplexing
    11.2.3 The GSM Enhanced Full Rate codec
  11.3 Other Application Areas
    11.3.1 Digital video
      MPEG-1 and MPEG-2
      MPEG-4
    11.3.2 Automotive
      Fail-safety and fault tolerance
      Engine control units
      In-vehicle networking
    11.3.3 Hard disk drives
      Motor control
      Data decoding
      Disk scheduling and on-disk management tasks
      Disk scheduling and off-disk management tasks
    11.3.4 Networking and network processors
      Network processors
  11.4 Further Reading
  11.5 Exercises
Appendix A: The VEX System
  A.1 The VEX Instruction Set Architecture
    VEX assembler notation
    A.1.1 Clusters
    A.1.2 Execution model
    A.1.3 Architecture state
    A.1.4 Arithmetic and logic operations
      Examples
    A.1.5 Intercluster communication
      Examples
    A.1.6 Memory operations
    A.1.7 Control operations
      Examples
    A.1.8 Structure of the default VEX cluster
      Register files and immediates
    A.1.9 VEX semantics
  A.2 The VEX Run-Time Architecture
    A.2.1 Data allocation and layout
    A.2.2 Register usage
    A.2.3 Stack layout and procedure linkage
      Procedure linkage
  A.3 The VEX C Compiler
    A.3.1 Command-line options
      Output files
      Preprocessing
      Optimization
      Profiling
      Language definition
      Libraries
      Passing options to compile phases
      Terminal output and process control
      Other options
    A.3.2 Compiler pragmas
      Unrolling and profiling
      Assertions
      Memory disambiguation
      Cache control
    A.3.3 Inline expansion
    A.3.4 Machine model parameters
    A.3.5 Custom instructions
  A.4 Visualization Tools
  A.5 The VEX Simulation System
    Gprof support
    Simulating custom instructions
    Simulating the memory hierarchy
  A.6 Customizing the VEX Toolchain
    Clusters
    Machine model resources
    Computational MEs
    Register bank MEs
    Machine resources
    Memory hierarchy parameters
  A.7 Examples of Tool Usage
    A.7.1 Compile and run
    A.7.2 Profiling
    A.7.3 Custom architectures
  A.8 Exercises
Appendix B: Glossary
Appendix C: Bibliography
List of Technical Terms / Keywords
  ASIC (application-specific integrated circuit)