

# ESSES 2003 European Summer School on Embedded Systems

# Lecture Notes Part X

# Embedded Sysems: Introduction and Overview





Editors: Ylva Boivie, Hans Hansson, Jane Kim, Sang Lyul Min



Strängnäs, August 20-22, 2003 ISSN 1404-3041 ISRN MDH-MRTC-106/2003-1-SE



### **Design of Embedded Systems**

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris





# Constraint Programming Approach

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris

### **Quotations**

"Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it."



Eugene C. Freuder CONSTRAINTS, April 1997



### **Introduction and Motivation**

Synthesis of the following code (inner loop of differential equation integrator)

### while c do

### begin

```
x1 := x + dx;

u1 := u - (3*x*u*dx);

y1 := y + u*dx;

c := x < a;

x := x1; y := y1; u := u1;

end;
```





### Register Allocation as Graph Coloring



### **Constraints:**

$$\begin{split} &[r_1,r_2,r_3,r_4,r_5,r_6] :: 0..2, \\ &r_1 \neq r_2, \, r_1 \neq r_3, \, r_2 \neq r_3, \\ &r_2 \neq r_4, \, r_3 \neq r_4, \, r_4 \neq r_6, \\ &r_5 \neq r_6. \end{split}$$



# Register Allocation as Clique Finding

- for all r<sub>i</sub>, r<sub>j</sub> which are not connected by an edge: r<sub>i</sub> ≠ 1 ∨ r\_j ≠ 1
- The maximal clique can found by maximizing the following cost function:

$$cost = \sum_{i} r_{i}$$





### **Constraint Consistency**

- All constraints are stored in the constraint store
- Consistency methods are applied to find inconsistent values and prune variables' domains
- Different types of consistency methods:
  - Node consistency
  - Arc consistency
  - Path consistency

· ...





### **Consistency Properties**

- Node consistency
  - A network is node consistent if in each node domain each value is consistent with unary constraint (e.g., X > 7)
- Arc consistency
  - A network is arc consistent if for each arc connecting variables V<sub>i</sub> and V<sub>j</sub> for each value in the domain of V<sub>i</sub> there exist a value in the domain of V<sub>j</sub> consistent with binary constraint (e.g., X > Y)

### **Node and Arc Consistency**

Example



1..6 
$$0..5$$

Not node consistent Not arc consistent node consistent arc consistent

### **Need for search**

- Node, arc and path consistency are in general not complete (complete for some problems with particular structures)
- Complete algorithm: N-consistency for N variable problems → exponential complexity
- Example:





### Search

 Solver is not complete and search for a solution is needed







### **Constraint properties**

- may specify partial information need not uniquely specify the values of its variables,
- non-directional typically one can infer a constraint on each present variable,
- declarative specify relationship, not a procedure to enforce this relationship,
- additive order of imposing constraints does not matter,
- rarely independent typically they share variables

# More realistic example Scheduling

### Scheduling of the data-flow graph



### **Constraints:**

- for all op<sub>i</sub> and op<sub>j</sub> such that op<sub>i</sub>
   before op<sub>j</sub>
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>j</sub>
- for all op<sub>i</sub> and op<sub>j</sub> that can use the same resource
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>j</sub> ∨ T<sub>j</sub> + D<sub>j</sub> ≤ T<sub>i</sub> ∨ R<sub>i</sub> ≠ R<sub>j</sub>

### **Problems**

Constraint propagation for

$$T_i + D_i \le T_i \lor T_i + D_i \le T_i \lor R_i \ne R_i$$
 is weak

- Not all solvers support disjunctive constraints.
  - Other solution (reified constraints):

$$\begin{split} T_i + D_i &\leq T_j \Leftrightarrow B1, \\ T_j + D_j &\leq T_i \Leftrightarrow B2, \\ R_i &\neq R_j \Leftrightarrow B3, \end{split}$$

B1 + B2 + B3 ≥ 1.







$$\begin{aligned} T_k + D_k &\leq T_n \vee T_n + D_n \leq T_k \vee R_k \neq R_n \\ T_m + D_m &\leq T_n \vee T_m + D_m \leq T_n \vee R_m \neq R_n \end{aligned}$$



### **Global constraints**

Non-overlapping rectangles



 $\begin{aligned} \text{diff2}([~[X_i,Y_i,DX_i,DY_i],\\ [X_j,Y_j,DX_j,DY_j]~]~) \end{aligned}$ 

- All knowledge in one "place" makes it possible to define good consistency methods (OR, mathematics, geometry, etc.)
- Specific algorithms for consistency more efficient



### **Scheduling Example Constraints**

```
\begin{split} &T1+2 \leq T6,\, T2+2 \leq T6,\\ &T3+2 \leq T7,\, T4+2 \leq T8,\\ &T5+1 \leq T9,\, T6+2 \leq T10,\\ &T7+2 \leq T11,\, T10+1 \leq T11,\\ &\text{diff2}([\ [T1,R1,2,1],\, [T2,R2,2,1],\, [T3,R3,2,1],\\ &\quad [T4,R4,2,1],\, [T6,R6,2,1],\, [T7,R7,2,1],\\ &\quad [T5,R5,1,1],\, [T8,R8,1,1],\, [T9,R9,1,1],\\ &\quad [T10,R10,1,1],\, [T11,R11,1,1]\ ]). \end{split}
```





# Other Synthesis Problems Defined with Constraints

- High-level synthesis:
  - Chaining,
  - Conditional execution,
  - Pipelined components,
  - Algorithmic pipelining,
  - Switching activity reduction (power consumption)
  - .
- System design
  - different aspects of design space exploration
  - scheduling
  - component assignment
  - memory allocation/data assignment
  - power/energy consumption
  - · ...





# Additional Constraints element

- Element constraints
  - element(N, [X<sub>1</sub>, X<sub>2</sub>, ..., X<sub>n</sub>], Value)
    - propagation from N to Value

propagation from Value to N

Value =  $x \rightarrow N=i$  and Xi = x ...

- Examples- element(N, [2, 3, 4, 4], V)
  - N :: 1..2, V :: {2, 3}
  - V = 4, N :: 3..4
- Used to define discrete cost functions and different relations









### **Edge Finding Algorithm**

- Martin-Shmoys algorithm with O(n²) complexity.
- Up phase
  - for each unique lct we create a set
    S = {t | LCT(t) <= lct} and make checking whether a task can be the first or before</p>
- Down phase
  - similar but using est and checking whether a task can be the last one or after all tasks.

### **System Synthesis Example**



- original MILP formulation- 47 timing variables, 225 binary (bus 153) and1081 constraints (bus 416)
- commercial linear programming package used to solve the problem (XLP, developed by XMP Software, Inc.)

|           |      |    | Execution time |    |    |    |    |    |    |    |
|-----------|------|----|----------------|----|----|----|----|----|----|----|
| Processor | Cost | S1 | S2             | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
| P1        | 4    | 2  | 2              | 1  | 1  | 1  | 1  | 3  | -  | 1  |
| P2        | 5    | 3  | 1              | 1  | 3  | 1  | 2  | 1  | 2  | 1  |
| P3        | 2    | 1  | 1              | 2  | -  | 3  | 1  | 4  | 1  | 4  |



# Modeling of cost and execution time

- Execution time element(P1, [2, 3, 1], D1)... element(P9, [1, 1, 4], D4)
- Cost
   (P1=1 ∨ P2=1 ∨ ... ∨ P9=1) ⇔ C1,
   ...
   (P1=6 ∨ P2=6 ∨ ... ∨ P9=6) ⇔ C6,

### **System synthesis results**

|                | Perform |        |          | rmance opti | mization | Cost optimization |           |  |
|----------------|---------|--------|----------|-------------|----------|-------------------|-----------|--|
| Design         | Cost    | (time  | MILP (s) | CLP (s)     | B&B      | CLP (s)           | B&B Nodes |  |
|                |         | units) |          |             | Nodes    |                   |           |  |
|                | 10      | 6      | 6438.00  | 0.43        | 84       | 0.55              | 92        |  |
| Bus            | 6       | 7      | 5371.80  | 0.53        | 114      | 0.68              | 144       |  |
|                | 5       | 15     | 3691.20  | 0.43        | 68       | 0.70              | 103       |  |
|                | 15      | 5      | 3732.00  | 0.43        | 20       | 1.67              | 125       |  |
| point-to-point | 12      | 6      | 26710.20 | 1.42        | 98       | 2.18              | 169       |  |
| links          | 8       | 7      | 32320.20 | 1.00        | 58       | 2.59              | 198       |  |
|                | 7       | 8      | 4510.80  | 1.64        | 75       | 2.02              | 112       |  |
|                | 5       | 15     | 38501.20 | 1.50        | 32       | 1.48              | 77        |  |

Cost = 4\*C1 + 4\*C2 + 5\*C3 + 5\*C4 + 2\*C5 + 2\*C6



# System synthesis results with local memory

|                |      | Performance  | Performa  | ince optin | Cost optimization |        |           |
|----------------|------|--------------|-----------|------------|-------------------|--------|-----------|
| Design         | Cost | (time units) | MILP (s)  | CLP        | B&B               | CLP    | B&B Nodes |
|                |      |              | * *       | (s)        | Nodes             | (s)    |           |
|                | 28   | 6            | 6592.20   | 0.71       | 76                | 2.58   | 252       |
|                | 23   | 7            | 5371.80   | 1.07       | 193               | 1.94   | 266       |
| Bus            | 22   | 8            | 123252.60 | 0.95       | 124               | 14.85  | 856       |
|                | 21   | 10           | 316860.60 | 114.92     | 4 534             | 119.55 | 8 799     |
|                | 18   | 11           | 236724.00 | 88.23      | 7 015             | 2.37   | 477       |
|                | 17   | 12           | 138004.20 | 0.93       | 268               | 10.39  | 3 076     |
|                | 14   | 15           | 3581.40   | 0.54       | 22                | 9.89   | 3 076     |
|                | 38   | 5            | -         | 0.56       | 24                | 2.08   | 107       |
|                | 30   | 6            | -         | 0.99       | 59                | 3.75   | 155       |
| point-to-point | 25   | 7            | -         | 1.60       | 79                | 5.58   | 314       |
| links          | 23   | 8            | -         | 1.82       | 57                | 3.21   | 184       |
|                | 22   | 10           | -         | 4.50       | 84                | 59.25  | 855       |
|                | 19   | 11           | -         | 27.34      | 794               | 101.03 | 2 851     |
|                | 18   | 12           | -         | 97.72      | 2 686             | 8.66   | 1 047     |
|                | 14   | 15           | -         | 1.18       | 14                | 4.95   | 328       |





|         |                |              | IVIC | app          | ıng          | s to        | Pr         | OCE         | 255        | ors         |
|---------|----------------|--------------|------|--------------|--------------|-------------|------------|-------------|------------|-------------|
| Task    | Uni-<br>versal | BMA<br>array | PAR1 | DCT<br>array | FIR<br>array | BMA<br>pipe | FIR<br>seq | FIR<br>pipe | DCT<br>seq | DCT<br>pipe |
| IN      | -              | -            | -    | -            | -            | -           | -          |             | -          | -           |
| FB1     |                | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| BMA     | 7234           | 484          | -    | -            | -            | 3617        | -          | -           | -          | -           |
| FIR     | 7234           | -            | -    | -            | 510          | -           | 3461       | 1170        | -          | -           |
| PRAE    | 1280           | -            | 128  | -            | -            | -           | -          | -           | -          | 474         |
| DCT     | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | 474         |
| Q<br>IQ | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| IDCT    | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | -<br>474    |
| REK     | 1536           | - [          | 256  | 132          | - [          | -           |            |             | 0130       | 4/4         |
| C       | 132            | -            | 230  |              | -            | -           | -          | -           | _          |             |
| FB2     | -              | _            | _    | _            | _            | _           | _          | _           | _          | _           |
| 1 02    | -              |              |      |              |              |             |            |             | (0)        | TAKE        |





#### **Experimental results** H.261 example AVERAGE EXECUTION PIPELINE DEGREE DEADLINE both greedy 1% -16% memory

### Scheduling of Mars Path Finder under Power Consumption Constraints

The mars rover operates on very limited power supply. The power is given by solar panels. The power obtained from solar panels was measured at different temperatures and the results were the following: 14.9W at -40°C, 12.0W at -60°C and 9.0W at -80°C. There is a battery power source too, which gives maximal 10.0W and it is not replenishable energy so the battery power should be used as little as possible. The mars rover has 6 driving and 4 steering motors, which need to be warmed up before respective driving and steering can be performed.



### Scheduling of Mars Path Finder under Power Consumption Constraints

| Operation                    | Duration |                                                  |  |
|------------------------------|----------|--------------------------------------------------|--|
| Heating steering motors      | 5s       | At least 5s and at most 50s before               |  |
| (HSM1&2, HSM3&4)             |          | steering starts                                  |  |
| Heating wheel motors         | 5s       | At least 5s and at most 50s before driving       |  |
| (HWM1&2, HWM3&4, HWM5&6)     |          | starts                                           |  |
| Hazard detection (HD1 & HD2) | 10s      | At least 10s before steering starts              |  |
| Steering (Steer1, Steer2)    | 5s       | At least 5s before driving                       |  |
| Driving (Drive1, Drive2)     | 10s      | At least 10s before next hazard detection starts |  |



## **Scheduling of Mars Path Finder under Power Consumption Constraints**

| Task                | Duration | Power -40°C | Power -60°C | Power -80°C |
|---------------------|----------|-------------|-------------|-------------|
| Heat two motors     | 5s       | 7.6W        | 9.5W        | 11.3W       |
| Drive               | 10s      | 7.5W        | 10.9W       | 13.8W       |
| Steer               | 5s       | 4.3W        | 6.2W        | 8.1W        |
| Hazard<br>Detection | 10s      | 5.1W        | 6.1W        | 7.3W        |
| CPU                 | Constant | 2.5W        | 3.1W        | 3.7W        |



### **Modeling**

Precedence constraints:

$$t_hd1 + d_hd1 \le t_steer1$$
,

. . .

 $t_steer1 + d_steer1 \le t_drive1$ ,

 $t_hwm12 \le t_steer1 + 50$ ,

Power consumption constraints:

cumulative([t\_hd1, ..., th\_sm12],

[p\_hd1, ..., p\_sm12],

[d\_hd1, ..., d\_sm12], Power)

Optimize "Power"







cycle(2, [ [2,6], [3,4], [1], [2,3], [2,6], [2,5]] )



[[2],[4],[1],[3],[6],[5]]



### Search

- Standard search uses depth-first-search with backtracking.
- Optimization uses branch-and-bound or similar methods.







### Search (cont'd)

[City1::2..4, City2::{1,3..4}, City3::{1..2,4}, City4::1..3]

- How to select order of variable assignment?
  - dynamic vs. static
  - criteria
- How to select values to be assigned from variable's domain?
  - a single value
  - sub-domain
  - ...



### **Variable Selection**

- Static and dynamic
  - input order (static)
  - first-fail principle (smallest size of the domain)
  - smallest value in the domain
  - largest value in the domain
  - largest difference between the smallest and second smallest value in its domain
  - smallest max value in the domain
  - .



### **Value Selection**

- Single value
  - minimum in the domain and then upwards
  - maximum in the domain and then downwards
  - middle and then towards smallest and largest
  - random
  - · ...
- Domain split
  - split into two sub-domains
  - split into N
  - · ...



### **Search improvements**

- Partial enumeration algorithms (instead of labeling)
  - Credit Search,
  - Limited Discrepancy Search (LDS).
- Assignment of subintervals instead of values to domain variables — possibly examines a bigger part of a solution space.
- Problem-dependent specific heuristics.
- Neighbourhood search...









### **Summary and conclusions**

### Advantages:

- focus on a specification of the problem, not on a solution method.
- unified framework for different algorithms to be used to solve a problem (by encapsulating them as constraints).
- easy definition of problems with many heterogeneous constraints.
- easy extension of a problem by adding new constraints.
- collaboration of solvers and distributed computing

### **Summary and conclusions**

- Limitations:
  - NP-hard problems.
  - often non-predictable behavior of a solver.
  - difficult to define and add new constraints:
    - into existing systems interface problems.
    - new propagation algorithms need to be developed.
  - difficult to match constraints with actual problems.



### **CP finite domain systems**

- SICStus Prolog
- CHIP from COSYTEC
- IF/Prolog
- ILOG
- Mozart/Oz
- Gnu Prolog
- JaCoP Java based our own solver



### **Selected CP Web resources**

- Constraints archive http://www.cs.unh.edu/ccc/archive
- Guide to constraints programming http://kti.ms.mff.cuni.cz/~bartak/constraints
- Sicstus manual http://www.sics.se/isl/sicstus/sicstus\_toc.html
- Gnu Prolog http://www.gnu.org/software/prolog/prolog.html
- Mozart/Oz http://www.mozart-oz.org/

### Other resources

- Book
  - K. Mariott and P. J. Stuckey Programming with Constraints: An Introduction, The MIT Press, 1998.
- Conferences
  - Principles and Practice of Constraint Programming (CP)
  - The Practical Application of Constraint Technologies and Logic Programming (PACLP)
- Journal
  - Constraints (Kluwer Academic Publishers)



### **Selected Papers**

- Kuchcinski, K., Embedded System Synthesis by Timing Constraints Solving, Proc. 10th International Symposium on System Synthesis, Antwerp, Belgium, September 17-19, 1997.
- Gruian, F. and Kuchcinski, K., Operation Binding and Scheduling for Low Power Using Constraint Logic
   Programming, Proc. 24th Euromicro Conference, Workshop on Digital System Design, Västerås, Sweden, August 25-27, 1998.
- Kuchcinski, K. and Wolinski, Ch., Global Approach to Assignment and Scheduling of Complex Behaviours based on HCDG and Constraint Programming, Journal of Systems Architecture, 2003, Elsevier Science.

### **Selected Papers (cont'd)**

- Kuchcinski, K., Constraints Driven Design Space Exploration for Distributed Embedded Systems, Journal of Systems Architecture, vol. 47, no. 3-4, pp. 241-261, 2001, Elsevier Science.
- Szymanek, R. and Kuchcinski, K., A Constructive Algorithm for Memory-Aware Task Assignment and Scheduling, Proc. 9th International Symposium on Hardware/Software Codesign, Copenhagen, Denmark, Apr. 2001.
- Szymanek, R. and Kuchcinski, K., Partial Assignment Technique for Task Graph Scheduling, 40th DAC, Anaheim, USA, June 2003.
- Kuchcinski, K. Constraint-driven scheduling and resource assignment, ACM Trans. on Design Automation of Electronic Systems, vol. 8, no. 3, pp. 355-383, 2003.



### Methodologies for Embedded System design

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris



### **Input to System Design**

- Executable specification (functional requirements):
  - usually provided as interacting processes/tasks,
  - very often multi-language specifications,
  - can be simulated and verified,
  - can be used to perform analysis, e.g, estimation.
- Specification languages: C, C++, VHDL, Verilog, SystemC, Esterel, SDL, etc.
- Set of (non-functional) design requirements (cost, speed, I/O rate, power consumption, etc.).

### **Output from System Design**

- A set of system modules assigned to system components (CPU's, DSP's, ASIC's, etc.).
- Communication modules.
- Each module can be further synthesized to hardware using *high-level synthesis* or *compiled to software*.





# **Basic Characteristics of the Methodology**

- Behavioral specification is given for the complete heterogeneous system, regardless of how different parts will be later implemented.
- Analysis techniques are provided; specially different estimation techniques.
- Synthesis tools are used to automatically explore a design space.
  - high-level synthesis, RTL synthesis,
  - compilers, cross-compilers,
  - interface generators,
  - etc.



### **Estimation of Design Parameters**

- Estimation of parameters such as size, cost, power consumption.
- Does not need to be very precise but has to be "consistent" — follows real design parameters.
- Usually 15%-20% inaccurate.
- Trade-off between accuracy and estimation time.



### Improvements of the Design Process

- High-level specification is made before architecture selection and implementation decisions can be made more accurate (better exploration of architectures).
- A uniform description of HW and SW makes it possible to move parts of the systems between HW and SW.
- HW and SW development is moved closer and the integration cost is reduced.
- An early evaluation of system characteristics is possible.





#### **Representation Example**



Process communication graph

#### **Allocation of System Components**

- Decides about the kind and number of components for implementation of the system
  - processing elements: μprocesosrs, microcontrollers, DSP's, ASIP's, ASIC's, FPGA's, etc.
  - storage elements: memories, register files, registers, etc.
  - communication devices: buses, point-to-point links, networks, etc.
  - specialized I/O devices: A/D, D/A, frame grabbers etc.

#### **Partitioning**

- Functional partitioning vs. structural partitioning.
- Abstraction level.
- Partitioning granularity (fine or course):
  - modules,
  - processes and procedures,
  - instructions.
- Partitioning objective:
  - performance,
  - minimal communication,
  - low power,
  - combination of several criteria.





#### **Communication Synthesis**

- Creation of abstract communication channels by communication clustering.
- Communication refinement
  - selection of communication lines width,
  - protocol selection,
  - etc.
- Interface generation:
  - device drivers,
  - communication hardware,
  - etc.





#### **Design Decisions**

- Different types of design decisions
  - selection of components, partitioning, assignments, scheduling, etc.
  - decisions regarding runtime system done off-line or are postponed to runtime (e.g., static vs. runtime scheduling)
- Design decisions are mutually dependent
- Huge design space



#### **Design Automation**

- Uses internal representations which are usually based on graphs.
- Graph algorithms (shortest path, Hamiltonian circuit, topological sort, depth-first-search, breadth-firstsearch, SAT, etc.).
- Optimization methods (M)ILP, CLP, heuristics, etc.
- Tractable and intractable problems.
- Decidable and undecidable problems.
- Decision problems and combinatorial optimization problems.

### **Design Automation Consequences**

- Most of the problems which need to be solved in design automation are NP-complete or NP-hard.
- Usually only small problems can be solved exactly.
- Need for algorithms which do not guarantee optimal solutions but "good enough" solutions
  - approximation algorithms guarantee a solution with a cost that is within some margin of the optimum,
  - heuristics algorithms that are constructed based on "rules-of-thumb"; nothing can be said in advance about the quality of the result.

### Examples of Embedded Systems (cont'd)

Anti-lock brakes
Auto-focus cameras
Automatic teller machines
Automatic teller machines
Automatic transmission
Avionic systems
Battery chargers
Camcorders
Cell-phone base stations
Cordless phones
Cruise control
Curbside check-in systems
Digital cameras
Disk drives
Electronic card readers
Electronic instruments

Electronic toys/games Factory control

Fingerprint identifiers

Home security systems Life-support systems Medical testing systems

Fax machines

MPEG decoders
Network cards
Network switches/routers
On-board navigation
Pagers
Photocopiers
Portable video games
Printers
Satellite phones
Scanners
Samart ovens/dishwashers
Speech recognizers
Stereo systems
Teleconferencing systems
Televisions
Temperature controllers
Theft tracking systems
TV set-top boxes
VCR's, DVD players

Video game consoles Video phones

Washers and dryers



Source: Embedded Systems Design: A Unified Hardware/Software Introduction, (c) 2000 Vahid/Givargis

# "A device that includes a programmable computer but is not itself a general-purpose computer."



#### **Embedded Systems (cont'd)**

- Computing systems embedded within electronic devices
- Hard to define. Nearly any computing system other than a desktop computer
- Billions of units produced yearly, versus millions of desktop units
- Perhaps 50 per household and per automobile



Introduction, (c) 2000 Vahid/Givargis

#### **Embedded Systems (cont'd)**

- Non User-Programmable.
- Based on programmable components (e.g., Microcontrollers, DSP's...) but often contain application specific hardware (IC's, ASIC's).
- Reactive Real-Time Systems:
  - React to external environment,
  - Maintain permanent interaction,
  - Ideally never terminate,
  - Are subject to external timing constraints (realtime).

### Characteristics of Embedded Systems

- Sophisticated functionality.
- Real-time operation.
- Low manufacturing cost.
- Low power.
- Designed to tight deadlines by small teams.
- "Resource conscious" vs. "Unlimited resources" programming

#### **SoC Embedded System**



- Assembly of "prefabricated components" often purchased from external vendors ("IP")
  - "black box" hierarchy
- Design & Verification at the System level
  - rather than the logic level
  - Interface and communication
- Great Importance of Software RV

Source: Alberto Sangiovanni-Vincentelli, 35th DAC









### Typical Hardware Components of DSP System

| Component class                 | Implements                                                  | Compiler                                              | Specification        |
|---------------------------------|-------------------------------------------------------------|-------------------------------------------------------|----------------------|
| DSP processor                   | Low data-rate DSP<br>Slow control loops<br>Appl. Spec. alg. | (Retargetable)<br>code generator<br>High level synth. | Assembly<br>C<br>DFL |
| Microcontroller                 | User interface<br>Slow control loops                        | C compiler                                            | С                    |
| Hardware accelerator            | High data-rate DSP                                          | High level synth.<br>RT level synth.                  | C, DFL<br>VHDL       |
| Communication blocks and memory | Internal & external communication Storage & buffering       | Memory mgmt. (A)synchronous interface synth.          | Data-sheets<br>STG   |
| Others                          | Usually FSMD's - clock generators                           | RT level synth. Asynchronous                          | VHDL *RV             |

Source: H. de Man, et. al. "Co-design of DSP Hardware/Software Co-design, Kluwer 1995.

#### Importance of Embedded System Design Methodologies

- Hardware complexity.
- Heterogeneous systems containing hardware (both digital and analog) and software.
- Heterogeneous components (CPU's, DSP's, ASIC's, buses, point-to-point links, etc.).
- Heterogeneous requirements performance, cost, power consumption, etc.
- System-on-chip.
- Shorter design cycles required by time-to-market constraints.

. . . .



### Software vs. Hardware Design short summary

- Software
  - flexibility,
  - reconfigurability, easy update, etc.,
  - complex functionality,
  - cost,
  - · ...
- Hardware
  - speed,
  - power consumption,
  - cost in large volumes,
  - ...



#### **Design of Embedded Systems**

- Need to be done using high-level specification, programming and hardware description languages not assembly languages and gate/transistor level design.
- Requires efficient design space exploration and synthesis/compilation tools.
- Different design requirements has to be taken into account, e.g., cost, performance, testability, quality of service, power consumption.
- Multi-language design framework.

### Importance of High-Level Design Methods

#### System Verification Processing Speeds

| System Implementation              | Processing time (s/frame) |  |
|------------------------------------|---------------------------|--|
| Behavioral model                   | 1 200 (20 min/frame)      |  |
| RTL model                          | 144 000 (1.6 days/frame)  |  |
| Gate model                         | 228 000 (2.6 days/frame)  |  |
| Gate model on hardware accelerator | 1 200                     |  |
| Rapid Prototype                    | 0.5                       |  |
| Target Hardware                    | 0.05                      |  |

Source: Paul Clemente, Ron Crevier, Peter Runstadle Synthesis A Case Study", VHDL Times, vol. 5, no. 1.



#### **Specification and Programming**

- Specification languages, such as UML, SDL.
- Programming languages, such as C, C++, Java, Esterel, assembly languages.
- Hardware description languages, such as VHDL, Verilog, SystemC.
- **Example**: combining SystemC and C++ gives unified simulation environment for hardware and software.

#### **Hardware Description Languages**

- Cover several levels of design abstraction as well as behavioral and structural description domain.
- Contain typical features of programming languages, such as data types and program statements.
- Special features:
  - time concept,
  - structure description,
  - parallelism.
- VHDL (IEEE standard), Verilog, SystemC.



### **Design Representations** (Computational Model)

- Used to represent/model digital systems under design.
- Generated by a compiler from system specification or coded directly in the model.
- Represent the semantics, structure and timing of the system.
- Usually based on some kind of annotated graph representation.
- Used internally by design automation systems of the modeler/designer.

#### **Design** — **Synthesis**

- Software translation into target code for a processor (real-time operating system might be used).
- Hardware synthesis translation of a behavioral representation of a design into a structural one.
- Communication synthesis generates hardware and software which interconnects system components.











#### **Summary**

- Embedded systems are important class of electronic systems which can be found everywhere,
- Combine hardware and software solutions,
- Cover several engineering and research areas:
  - microelectronics,
  - real-time systems,
  - software development,
  - etc.
- Need careful design which optimizes different design parameters.



# ESSES 2003 European Summer School on Embedded Systems

# Lecture Notes Part XII

Embedded Sysems: Introduction and Overview





Editors: Ylva Boivie, Hans Hansson, Jane Kim, Sang Lyul Min



Strängnäs, August 20-22, 2003 ISSN 1404-3041 ISRN MDH-MRTC-106/2003-1-SE

# **Embedded Computing Examples**

Wayne Wolf
Dept. of Electrical Engr.
Princeton University



#### **The Parapet Project**

- #Goal: design SoC networks for real-time distributed vision.
  - □ The best way to get a good design example is to create our own.
  - △ Video is a high-performance, low-power, cost-sensitive application.

#### **Parapet goals**

- #Algorithms (gesture recognition).
  - ☐ How do we adapt algorithms to the needs of realtime embedded video.
- # Distributed systems.
  - riangle Communicating cameras.
- # Embedded software.
  - △ Middleware, code optimization.
- SoC architecture.
  - △ Heterogeneous multiprocessors.









# **Tuning the smart camera software**

- #Initial C/Trimedia was direct translation from Matlab.
- **Goals**:
  - □ Increase frame rate.
  - □ Reduce latency.
  - ☐ Identify bottlenecks for next-generation architecture.

#### Real-time vs. just fast

- - □To satisfy the rate, must minimize variations in processing time.









#### **Optimizations**

- **#Change the program structure.**
- ★Change the instructions.

#### **Algorithmic changes**

- #Superellipses were expensive to fit and overkill.
  - □ Replaced with ellipse fitting.
- #Improved adjacency algorithm.

#### **Region finding**

- #Operates on 3 x 3 window.

Sequential algorithm---window moves one pixel per step.

#### **Program changes**

★Contour fitting is very control intensive:

□ Compares local configurations of bits.

**XTransformed into data-parallel operations** 

for VLIW:



#### **Instruction changes**

- #Trimedia provides library of intrinsic functions that map onto Trimedia instruction sequences.
- - △Loop unrolling.

# Before and after stage times



#### **Results**

#After: 31 frames/sec w/o HMM, 25 frames/sec with HMM.

- #Smaller variation in frame processing time.

#### **Architectural experiments**

#### **#Fritts/Wolf:**

- □ characterize applications;
- compare architectural styles (VLIW, superscalar);
- evaluate architectural parameters (clock rate, pipelining, etc.).



# **Workload characteristics experiments**

- #Goal: compare media workload characteristics to general-purpose load.
- #Compiled on Impact compiler, measured with with Impact simulator.

#### **Basic characteristics**

- # Comparison of operation frequencies with SPEC
  - △(ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
  - △Lower frequency of memory and floating-point operations
- **₩ Basic block statistics** 

  - Need global scheduling techniques to extract ILP

# Basic characteristics, cont'd

- **★Static branch prediction** 
  - △Average of 89.5% static branch prediction on training input
  - △ Average of 85.9% static branch prediction on evaluation input
- # Data types and sizes
  - Nearly 70% of all instructions require only 8 or 16 bit data types

# Multimedia looping characteristics

#### 

- △95% of CPU time in two innermost loop levels
- △About 10 iterations per loop on average

#### **♯Complex loop control**

- □ = average # of instructions executed per loop invocation/total # of loop instructions



# Instruction level parallelism

#### 

- △parallel model: 8-issue

### #Explores only parallel scheduling performance

- △assumes an ideal processor model



## Multiprocessor architectures for video

- #Interested in high-speed video processing.
  - △150 frames/sec.
- #Want reasonably low-power operation for pervasive applications.

#### **High-speed smart cameras**

- #High frame rates provide better motion capture.
- #Frame rate of 150 frames/sec is considered desirable.
- #Stanford CMOS camera can digitize at 10,000 frames/sec.





# **Memory structure**

- #Feed-forward data communication.
- #Allocating memory depends on data volumes, access patterns, flexibility.

# Average processing time by stage



## **Average IPC by stage**



# Tiehan's VLIW implementation

- **#Unroll loop to perform multiple** comparisons in parallel.
- #Pack results into bit vector to address results table.
- Register file, cache provide for reuse of pixel values.

## **Contour crawler machine**

**#Hardware implementation of VLIW code:** 



# **Crawler and memory**

- #Crawler performance depends on memory system.
- **#Access patterns vary in 2 dimensions:**

1 2 3

8 X 4

7 6 5

### **Memory system design**

- #Want to minimize number of partitions to reduce row/column overhead.
- #Only memory organization that allows for all parallel accesses is one-word partition.
- #Assume we fetch one row or column at a time---3 fetches/cycle.

# Single contour crawler

#Assuming row/column access pattern, crawler is faster than VLIW by a relatively small constant.

### **Multiple crawlers**

\*Assuming we can patch together contours, we can start multiple crawlers.



- # Multiple crawler performance is limited by memory.
  - △ Multiple crawlers' memory accesses can conflict.

#### **Full-frame SIMD**

- #Can build a large SIMD array with one processor per pixel.
- **%Area\*delay:** 

  - △PE is probably about the same size as the crawler.
  - Not clear it is worth the silicon.

# Heterogeneous system

- ₩ Region:
- **Contour:**
- ₩ Ellipse:
  - □Superscalar/RISC.
- ₩ Graph:
  - □RISC.

# Stage pipelining

#Significant amount of time in non-streaming stages.



# Heterogeneous vs. VLIW

#### 器VLIW:

- △Off-the-shelf IP.
- □ Easy to program.
- △10 mm<sup>2</sup> in 0.13 micron.

#### **#Heterogeneous:**

- □ Requires more design of blocks, memory.
- △ Pipelineable for 2.3X speed-up.

# Heterogeneous multiprocessor size

| stage              | PE         | area<br>(mm^2) |
|--------------------|------------|----------------|
| 1 1 1              | MIPS32     | 0.0            |
| background         | 4Km        | 0.9            |
| contour            | custom     | 0.001          |
| ellipse, graph     | MIPS64 5Kf | 5              |
| total frame proces | 5.901      |                |
| classification     | MIPS64 5Kf | 5              |
| number of frame p  | 3          |                |
| grand total        |            | 22.703         |

# Research problems in embedded computing

Wayne Wolf
Dept. of Electrical Engr.
Princeton University

## **Networks-on-chips**

- **#Link technology.**
- **#Irregular network architectures.**
- **#Network synthesis.**
- ₩QoS.

### **Platform architectures**

- #Figure out how to measure the reusability of a platform.

# **Memory systems**

- **⊞Controllable caches.**
- **#Distributed memory architectures.**
- **#Software methods for exploiting memory.**

# Low power design

- #Effects of leakage.

### **Processor architecture**

- **#Choosing specialized instruction sets.**
- **#Adaptations for networks-on-chips.**

### **Software**

- **#Exploting configurable caches.**

### **Networks of SoCs**

- #Integrating network functions into platforms.
- **♯On-the-fly adaptation to network conditions.**

### Kris Kuchcinski

Lund University



#### **Design of Embedded Systems**

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris



# Examples of Embedded Systems (cont'd)

Anti-lock brakes
Auto-focus cameras
Automatic teller machines
Automatic toll systems
Automatic transmission
Avionic systems
Battery chargers
Camcorders
Cell phones
Cell-phone base stations
Cordless phones
Cruise control
Curbside check-in systems
Digital cameras
Disk drives

Disk drives
Electronic card readers
Electronic instruments
Electronic toys/games
Factory control
Fax machines
Fingerprint identifiers
Home security systems
Life-support systems
Medical testing systems

Modems
MPEG decoders
Network cards
Network switches/routers
On-board navigation
Pagers
Photocopiers
Point-of-sale systems
Portable video games
Printers
Satellite phones
Scanners
Smart ovens/dishwashers
Speech recognizers
Stereo systems
Teleconferencing systems
Televisions

Televisions
Temperature controllers
Theft tracking systems
TV set-top boxes
VCR's, DVD players
Video game consoles
Video phones
Washers and dryers

Source: Embedded Systems Design: A Unified Hardware/Software Introduction, (c) 2000 Vahid/Givargis







#### **Embedded Systems (cont'd)**

- Computing systems embedded within electronic devices
- Hard to define. Nearly any computing system other than a desktop computer
- Billions of units produced yearly, versus millions of desktop units
- Perhaps 50 per household and per automobile



#### **Embedded Systems (cont'd)**

- Non User-Programmable.
- Based on programmable components (e.g., Microcontrollers, DSP's...) but often contain application specific hardware (IC's, ASIC's).
- Reactive Real-Time Systems:
  - React to external environment,
  - Maintain permanent interaction,
  - Ideally never terminate,
  - Are subject to external timing constraints (real time).

# Characteristics of Embedded Systems

- Sophisticated functionality.
- Real-time operation.
- Low manufacturing cost.
- Low power.
- Designed to tight deadlines by small teams.
- "Resource conscious" vs. "Unlimited resources" programming

#### **SoC Embedded System**



- Assembly of "prefabricated components" often purchased from external vendors ("IP")
  - "black box" hierarchy
- Design & Verification at the System level
  - rather than the logic level
  - Interface and communication
- Great Importance of Softwa

Source: Alberto Sangiovanni-Vincentelli, 35th DAC









# Typical Hardware Components of DSP System

| Component class                 | Implements                                                  | Compiler                                              | Specification        |
|---------------------------------|-------------------------------------------------------------|-------------------------------------------------------|----------------------|
| DSP processor                   | Low data-rate DSP<br>Slow control loops<br>Appl. Spec. alg. | (Retargetable)<br>code generator<br>High level synth. | Assembly<br>C<br>DFL |
| Microcontroller                 | User interface<br>Slow control loops                        | C compiler                                            | С                    |
| Hardware<br>accelerator         | High data-rate DSP                                          | High level synth.<br>RT level synth.                  | C, DFL<br>VHDL       |
| Communication blocks and memory | Internal & external communication Storage & buffering       | Memory mgmt. (A)synchronous interface synth.          | Data-sheets<br>STG   |
| Others                          | Usually FSMD's - clock generators - DMA blocks              | RT level synth. Asynchronous synth.                   | VHDL OUT R           |

Source: H. de Man, et. al. "Co-design of DSP Hardware/Software Co-design, Kluwer 1995.

#### Importance of Embedded System Design Methodologies

- Hardware complexity.
- Heterogeneous systems containing hardware (both digital and analog) and software.
- Heterogeneous components (CPU's, DSP's, ASIC's, buses, point-to-point links, etc.).
- Heterogeneous requirements performance, cost, power consumption, etc.
- System-on-chip.
- Shorter design cycles required by time-to-market constraints.

**...** 



# Software vs. Hardware Design short summary

- Software
  - flexibility,
  - reconfigurability, easy update, etc.,
  - complex functionality,
  - cost,
- Hardware
  - speed,
  - power consumption,
  - cost in large volumes,
  - ...



#### **Design of Embedded Systems**

- Need to be done using high-level specification, programming and hardware description languages not assembly languages and gate/transistor level design.
- Requires efficient design space exploration and synthesis/compilation tools.
- Different design requirements has to be taken into account, e.g., cost, performance, testability, quality of service, power consumption.
- Multi-language design framework.

# Importance of High-Level Design Methods

#### System Verification Processing Speeds

| System Implementation              | Processing time (s/frame) |  |
|------------------------------------|---------------------------|--|
| Behavioral model                   | 1 200 (20 min/frame)      |  |
| RTL model                          | 144 000 (1.6 days/frame)  |  |
| Gate model                         | 228 000 (2.6 days/frame)  |  |
| Gate model on hardware accelerator | 1 200                     |  |
| Rapid Prototype                    | 0.5                       |  |
| Target Hardware                    | 0.05                      |  |

Source: Paul Clemente, Ron Crevier, Peter Runstadler "I Synthesis A Case Study", VHDL Times, vol. 5, no. 1.



#### **Specification and Programming**

- Specification languages, such as UML, SDL.
- Programming languages, such as C, C++, Java, Esterel, assembly languages.
- Hardware description languages, such as VHDL, Verilog, SystemC.
- Example: combining SystemC and C++ gives unified simulation environment for hardware and software.

#### **Hardware Description Languages**

- Cover several levels of design abstraction as well as behavioral and structural description domain.
- Contain typical features of programming languages, such as data types and program statements.
- Special features:
  - time concept,
  - structure description,
  - parallelism.
- VHDL (IEEE standard), Verilog, SystemC.



# **Design Representations** (Computational Model)

- Used to represent/model digital systems under design.
- Generated by a compiler from system specification or coded directly in the model.
- Represent the semantics, structure and timing of the system.
- Usually based on some kind of annotated graph representation.
- Used internally by design automation systems of the modeler/designer.

### **Design** — Synthesis

- Software translation into target code for a processor (real-time operating system might be used).
- Hardware synthesis translation of a behavioral representation of a design into a structural one.
- Communication synthesis generates hardware and software which interconnects system components.











#### **Summary**

- Embedded systems are important class of electronic systems which can be found everywhere,
- Combine hardware and software solutions,
- Cover several engineering and research areas:
  - microelectronics,
  - ▼ real-time systems,
  - software development,
  - etc.
- Need careful design which optimizes different de parameters.



#### Methodologies for Embedded System design

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris



#### **Input to System Design**

- Executable specification (functional requirements):
  - usually provided as interacting processes/tasks,
  - very often multi-language specifications,
  - can be simulated and verified,
  - can be used to perform analysis, e.g, estimation.
- Specification languages: C, C++, VHDL, Verilog, SystemC, Esterel, SDL, etc.
- Set of (non-functional) design requirements (cost, speed, I/O rate, power consumption, etc.).

#### **Output from System Design**

- A set of system modules assigned to system components (CPU's, DSP's, ASIC's, etc.).
- Communication modules.
- Each module can be further synthesized to hardware using high-level synthesis or compiled to software.





# **Basic Characteristics of the Methodology**

- Behavioral specification is given for the complete heterogeneous system, regardless of how different parts will be later implemented.
- Analysis techniques are provided; specially different estimation techniques.
- Synthesis tools are used to automatically explore a design space.
  - high-level synthesis, RTL synthesis,
  - compilers, cross-compilers,
  - interface generators,
  - etc.



#### **Estimation of Design Parameters**

- Estimation of parameters such as size, cost, power consumption.
- Does not need to be very precise but has to be "consistent" — follows real design parameters.
- Usually 15%-20% inaccurate.
- Trade-off between accuracy and estimation time.



# Improvements of the Design Process

- High-level specification is made before architecture selection and implementation decisions can be made more accurate (better exploration of architectures).
- A uniform description of HW and SW makes it possible to move parts of the systems between HW and SW.
- HW and SW development is moved closer and the integration cost is reduced.
- An early evaluation of system characteristics is possible.





#### **Representation Example**



Process communication g

#### **Allocation of System Components**

- Decides about the kind and number of components for implementation of the system
  - processing elements: μprocesosrs, microcontrollers, DSP's, ASIP's, ASIC's, FPGA's, etc.
  - storage elements: memories, register files, registers, etc.
  - communication devices: buses, point-to-point links, networks, etc.
  - specialized I/O devices: A/D, D/A, frame grabbers, etc.

#### **Partitioning**

- Functional partitioning vs. structural partitioning.
- Abstraction level.
- Partitioning granularity (fine or course):
  - modules,
  - processes and procedures,
  - instructions.
- Partitioning objective:
  - performance,
  - minimal communication,
  - low power,
  - combination of several criteria.





#### **Communication Synthesis**

- Creation of abstract communication channels by communication clustering.
- Communication refinement
  - selection of communication lines width,
  - protocol selection,
  - etc.
- Interface generation:
  - device drivers,
  - communication hardware,
  - etc.





#### **Design Decisions**

- Different types of design decisions
  - selection of components, partitioning, assignments, scheduling, etc.
  - decisions regarding runtime system done off-line or are postponed to runtime (e.g., static vs. runtime scheduling)
- Design decisions are mutually dependent
- Huge design space



#### **Design Automation**

- Uses internal representations which are usually based on graphs.
- Graph algorithms (shortest path, Hamiltonian circuit, topological sort, depth-first-search, breadth-firstsearch, SAT, etc.).
- Optimization methods (M)ILP, CLP, heuristics, etc.
- Tractable and intractable problems.
- Decidable and undecidable problems.
- Decision problems and combinatorial optimization problems.

# Design Automation Consequences

- Most of the problems which need to be solved in design automation are NP-complete or NP-hard.
- Usually only small problems can be solved exactly.
- Need for algorithms which do not guarantee optimal solutions but "good enough" solutions
  - approximation algorithms guarantee a solution with a cost that is within some margin of the optimum,
  - heuristics algorithms that are constructed based on "rules-of-thumb"; nothing can be said advance about the quality of the result.



### Constraint Programming Approach

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris

#### **Quotations**

"Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it."



Eugene C. Freuder CONSTRAINTS, April 1997



### **Introduction and Motivation**

Synthesis of the following code (inner loop of differential equation integrator)

#### while c do

#### begin

$$x1 := x + dx;$$
  
 $u1 := u - (3*x*u*dx);$   
 $y1 := y + u*dx;$   
 $c := x < a;$ 

x := x1; y := y1; u := u1;

end;





### Register Allocation as Graph Coloring



#### **Constraints:**

$$\begin{split} &[r_1,r_2,r_3,r_4,r_5,r_6] :: 0..2, \\ &r_1 \neq r_2, \, r_1 \neq r_3, \, r_2 \neq r_3, \\ &r_2 \neq r_4, \, r_3 \neq r_4, \, r_4 \neq r_6, \\ &r_5 \neq r_6. \end{split}$$



# Register Allocation as Clique Finding

 $\,\,$  for all  $r_i,\,r_j$  which are not connected by an edge:

$$r_i \neq 1 \lor r_j \neq 1$$

The maximal clique can found by maximizing the following cost function:

$$cost = \sum_{i} r_{i}$$





### **Constraint Consistency**

- All constraints are stored in the constraint store
- Consistency methods are applied to find inconsistent values and prune variables' domains
- Different types of consistency methods:
  - Node consistency
  - Arc consistency
  - Path consistency

₩ ...





### **Consistency Properties**

- Node consistency
  - A network is node consistent if in each node domain each value is consistent with unary constraint (e.g., X > 7)
- Arc consistency
  - A network is arc consistent if for each arc connecting variables V<sub>i</sub> and V<sub>j</sub> for each value in the domain of V<sub>i</sub> there exist a value in the domain of V<sub>i</sub> consistent with binary constraint (e.g., X > Y)

### **Node and Arc Consistency**

Example



Not node consistent Not arc consistent node consistent arc consistent

### **Need for search**

- Node, arc and path consistency are in general not complete (complete for some problems with particular structures)
- Complete algorithm: N-consistency for N variable problems → exponential complexity
- Example:





### Search

Solver is not complete and search for a solution is needed







### **Constraint properties**

- may specify partial information need not uniquely specify the values of its variables,
- non-directional typically one can infer a constraint on each present variable,
- declarative specify relationship, not a procedure to enforce this relationship,
- additive order of imposing constraints does not matter,
- rarely independent typically they share variable

# More realistic example Scheduling

### Scheduling of the data-flow graph



#### **Constraints:**

- for all op<sub>i</sub> and op<sub>j</sub> such that op<sub>i</sub> before op<sub>j</sub>
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>j</sub>
- for all op<sub>i</sub> and op<sub>j</sub> that can use the same resource
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>i</sub> ∨ T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>i</sub> ∨ R<sub>i</sub> ≠ R

### **Problems**

Constraint propagation for

$$T_i + D_i \le T_i \lor T_i + D_i \le T_i \lor R_i \ne R_i$$
 is weak

- Not all solvers support disjunctive constraints.
  - Other solution (reified constraints):

$$T_i + D_i \le T_j \Leftrightarrow B1,$$
  
 $T_j + D_j \le T_i \Leftrightarrow B2,$   
 $R_i \ne R_j \Leftrightarrow B3,$   
 $B1 + B2 + B3 \ge 1.$ 







$$\begin{split} T_k + D_k &\leq T_n \vee T_n + D_n \leq T_k \vee R_k \neq R_n \\ T_m + D_m &\leq T_n \vee T_m + D_m \leq T_n \vee R_m \neq R_n \end{split}$$



### **Global constraints**

Non-overlapping rectangles



- All knowledge in one "place" makes it possible to define good consistency methods (OR, mathematics, geometry, etc.)
- Specific algorithms for consistency more efficient



### **Scheduling Example Constraints**

$$\begin{split} &T1+2 \leq T6,\, T2+2 \leq T6,\\ &T3+2 \leq T7,\, T4+2 \leq T8,\\ &T5+1 \leq T9,\, T6+2 \leq T10,\\ &T7+2 \leq T11,\, T10+1 \leq T11,\\ &\text{diff2}([\ [T1,R1,2,1],\ [T2,R2,2,1],\ [T3,R3,2,1],\\ &\quad [T4,R4,2,1],\ [T6,R6,2,1],\ [T7,R7,2,1],\\ &\quad [T5,R5,1,1],\ [T8,R8,1,1],\ [T9,R9,1,1],\\ &\quad [T10,R10,1,1],\ [T11,R11,1,1]\ ]). \end{split}$$





### Other Synthesis Problems Defined with Constraints

- High-level synthesis:
  - Chaining,
  - Conditional execution,
  - Pipelined components,
  - Algorithmic pipelining,
  - Switching activity reduction (power consumption)
  - ₩ ...
- System design
  - different aspects of design space exploration
  - scheduling
  - component assignment
  - memory allocation/data assignment
  - power/energy consumption
  - · ..





# Additional Constraints element

- Element constraints
  - element(N, [X<sub>1</sub>, X<sub>2</sub>, ..., X<sub>n</sub>], Value)
    - propagation from N to Value

 $N=i \rightarrow Value = X_i$ 

■ propagation from Value to N

Value =  $x \rightarrow N=i$  and Xi = x ...

- Examples- element(N, [2, 3, 4, 4], V)
  - N:: 1..2, V:: {2, 3}
  - \* V = 4, N :: 3..4
- Used to define discrete cost functions and different relations









### **Edge Finding Algorithm**

- Martin-Shmoys algorithm with O(n²) complexity.
- Up phase
  - for each unique lct we create a set
     S = {t | LCT(t) <= lct} and make checking whether a task can be the first or before</li>
- Down phase
  - similar but using est and checking whether a task can be the last one or after all tasks.

### **System Synthesis Example**



- original MILP formulation- 47 timing variables, 225 binary (bus 153) and1081 constraints (bus 416)
- commercial linear programming package used to solve the problem (XLP, developed by XMP Software, Inc.)

|           |          |              | Execution time   |                      |                          |                                                                                                                                                                                                  |                                                                                                                                                                                                                                 |                                                                                                                                                                                                                                                                |                                                                                                                                                                                                                                                                                               |                                                                                                                                                                                                                                                                                                                              |
|-----------|----------|--------------|------------------|----------------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Processor | Cost     | S1           | S2               | S3                   | S4                       | S5                                                                                                                                                                                               | S6                                                                                                                                                                                                                              | S7                                                                                                                                                                                                                                                             | S8                                                                                                                                                                                                                                                                                            | S9                                                                                                                                                                                                                                                                                                                           |
| P1        | 4        | 2            | 2                | 1                    | 1                        | 1                                                                                                                                                                                                | 1                                                                                                                                                                                                                               | 3                                                                                                                                                                                                                                                              | -                                                                                                                                                                                                                                                                                             | 1                                                                                                                                                                                                                                                                                                                            |
| P2        | 5        | 3            | 1                | 1                    | 3                        | 1                                                                                                                                                                                                | 2                                                                                                                                                                                                                               | 1                                                                                                                                                                                                                                                              | 2                                                                                                                                                                                                                                                                                             | 1                                                                                                                                                                                                                                                                                                                            |
| P3        | 2        | 1            | 1                | 2                    | -                        | 3                                                                                                                                                                                                | 1                                                                                                                                                                                                                               | 4                                                                                                                                                                                                                                                              | 1                                                                                                                                                                                                                                                                                             | 4                                                                                                                                                                                                                                                                                                                            |
|           | P1<br>P2 | P1 4<br>P2 5 | P1 4 2<br>P2 5 3 | P1 4 2 2<br>P2 5 3 1 | P1 4 2 2 1<br>P2 5 3 1 1 | Processor         Cost         S1         S2         S3         S4           P1         4         2         2         1         1           P2         5         3         1         1         3 | Processor         Cost         S1         S2         S3         S4         S5           P1         4         2         2         1         1         1           P2         5         3         1         1         3         1 | Processor         Cost         S1         S2         S3         S4         S5         S6           P1         4         2         2         1         1         1         1           P2         5         3         1         1         3         1         2 | Processor         Cost         S1         S2         S3         S4         S5         S6         S7           P1         4         2         2         1         1         1         1         3           P2         5         3         1         1         3         1         2         1 | Processor         Cost         S1         S2         S3         S4         S5         S6         S7         S8           P1         4         2         2         1         1         1         1         3         -           P2         5         3         1         1         3         1         2         1         2 |



# Modeling of cost and execution time

- Execution timeelement(P1, [2, 3, 1], D1)...element(P9, [1, 1, 4], D4)
- Cost (P1=1 ∨ P2=1 ∨ ... ∨ P9=1) ⇔ C1, ...

 $(P1=6 \lor P2=6 \lor ... \lor P9=6) \Leftrightarrow C6,$ Cost = 4\*C1 + 4\*C2 + 5\*C3 + 5\*C4 + 2\*C5 + 2\*C5

### **System synthesis results**

|                |      | Performance | Perfo    | rmance opti | Cost optimization |         |           |
|----------------|------|-------------|----------|-------------|-------------------|---------|-----------|
| Design         | Cost | (time       | MILP (s) | CLP (s)     | B&B               | CLP (s) | B&B Nodes |
| •              |      | ùnits)      | . ,      | ` ,         | Nodes             | ` '     |           |
|                | 10   | 6           | 6438.00  | 0.43        | 84                | 0.55    | 92        |
| Bus            | 6    | 7           | 5371.80  | 0.53        | 114               | 0.68    | 144       |
|                | 5    | 15          | 3691.20  | 0.43        | 68                | 0.70    | 103       |
|                | 15   | 5           | 3732.00  | 0.43        | 20                | 1.67    | 125       |
| point-to-point | 12   | 6           | 26710.20 | 1.42        | 98                | 2.18    | 169       |
| links          | 8    | 7           | 32320.20 | 1.00        | 58                | 2.59    | 198       |
|                | 7    | 8           | 4510.80  | 1.64        | 75                | 2.02    | 112       |
|                | 5    | 15          | 38501.20 | 1.50        | 32                | 1.48    | 77        |



# System synthesis results with local memory

|                |      | Performance  | Performa  | ince optin | nization | Cost   | ptimization |
|----------------|------|--------------|-----------|------------|----------|--------|-------------|
| Design         | Cost | (time units) | MILP (s)  | CLP        | B&B      | CLP    | B&B Nodes   |
| -              |      |              |           | (s)        | Nodes    | (s)    |             |
|                | 28   | 6            | 6592.20   | 0.71       | 76       | 2.58   | 252         |
|                | 23   | 7            | 5371.80   | 1.07       | 193      | 1.94   | 266         |
| Bus            | 22   | 8            | 123252.60 | 0.95       | 124      | 14.85  | 856         |
|                | 21   | 10           | 316860.60 | 114.92     | 4 534    | 119.55 | 8 799       |
|                | 18   | 11           | 236724.00 | 88.23      | 7 015    | 2.37   | 477         |
|                | 17   | 12           | 138004.20 | 0.93       | 268      | 10.39  | 3 076       |
|                | 14   | 15           | 3581.40   | 0.54       | 22       | 9.89   | 3 076       |
|                | 38   | 5            | -         | 0.56       | 24       | 2.08   | 107         |
|                | 30   | 6            | -         | 0.99       | 59       | 3.75   | 155         |
| point-to-point | 25   | 7            | -         | 1.60       | 79       | 5.58   | 314         |
| links          | 23   | 8            | -         | 1.82       | 57       | 3.21   | 184         |
|                | 22   | 10           | -         | 4.50       | 84       | 59.25  | 855         |
|                | 19   | 11           | -         | 27.34      | 794      | 101.03 | 2 851       |
|                | 18   | 12           | -         | 97.72      | 2 686    | 8.66   | 1 047       |
|                | 14   | 15           | -         | 1.18       | 14       | 4.95   | 328         |





### **Task Mappings to Processors**

| Task | Uni-<br>versal | BMA<br>array | PAR1 | DCT<br>array | FIR<br>array | BMA<br>pipe | FIR<br>seq | FIR<br>pipe | DCT<br>seq | DCT<br>pipe |
|------|----------------|--------------|------|--------------|--------------|-------------|------------|-------------|------------|-------------|
| IN   | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| FB1  | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| BMA  | 7234           | 484          | -    | -            | -            | 3617        | -          | -           | -          | -           |
| FIR  | 7234           | -            | -    | -            | 510          | -           | 3461       | 1170        | -          | -           |
| PRAE | 1280           | -            | 128  | -            | -            | -           | -          | -           | -          | -           |
| DCT  | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | 474         |
| Q    | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| IQ   | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| IDCT | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | 474         |
| REK  | 1536           | -            | 256  | -            | -            | -           | -          | -           | -          | -           |
| С    | 132            | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| FB2  | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |







#### **Experimental results** H.261 example AVERAGE EXECUTION both greedy -16% memory

### Scheduling of Mars Path Finder under Power Consumption Constraints

The mars rover operates on very limited power supply. The power is given by solar panels. The power obtained from solar panels was measured at different temperatures and the results were the following: 14.9W at -40°C, 12.0W at -60°C and 9.0W at -80°C. There is a battery power source too, which gives maximal 10.0W and it is not replenishable energy so the battery power should be used as little as possible. The mars rover has 6 driving and 4 steering motors, which need to be warmed up before respective driving and steering can be performed.



# Scheduling of Mars Path Finder under Power Consumption Constraints

| Operation                    | Duration |                                                  |  |
|------------------------------|----------|--------------------------------------------------|--|
| Heating steering motors      | 5s       | At least 5s and at most 50s before               |  |
| (HSM1&2, HSM3&4)             |          | steering starts                                  |  |
| Heating wheel motors         | 5s       | At least 5s and at most 50s before drivin        |  |
| (HWM1&2, HWM3&4, HWM5&6)     |          | starts                                           |  |
| Hazard detection (HD1 & HD2) | 10s      | At least 10s before steering starts              |  |
| Steering (Steer1, Steer2)    | 5s       | At least 5s before driving                       |  |
| Driving (Drive1, Drive2)     | 10s      | At least 10s before next hazard detection starts |  |



# **Scheduling of Mars Path Finder under Power Consumption Constraints**

| Task                | Duration | Power -40°C | Power -60°C | Power -80°C |
|---------------------|----------|-------------|-------------|-------------|
| Heat two<br>motors  | 5s       | 7.6W        | 9.5W        | 11.3W       |
| Drive               | 10s      | 7.5W        | 10.9W       | 13.8W       |
| Steer               | 5s       | 4.3W        | 6.2W        | 8.1W        |
| Hazard<br>Detection | 10s      | 5.1W        | 6.1W        | 7.3W        |
| CPU                 | Constant | 2.5W        | 3.1W        | 3.7W        |



### **Modeling**

Precedence constraints:

$$t_hd1 + d_hd1 \le t_steer1$$
,

...

t\_steer1 + d\_steer1 ≤ t\_drive1,

 $t_hwm12 \le t_steer1 + 50$ ,

Power consumption constraints:

cumulative([t\_hd1, ..., th\_sm12],

[p\_hd1, ..., p\_sm12],

[d\_hd1, ..., d\_sm12], Power)

■ Optimize "Power"



### Cycle (circuit) constraint



cycle(2, [ [2,6], [3,4], [1], [2,3], [2,6], [2,5]] )



[[2],[4],[1],[3],[6],[5]]



### Search

- Standard search uses depth-first-search with backtracking.
- Optimization uses branch-and-bound or similar methods.







### Search (cont'd)

[City1::2..4, City2::{1,3..4}, City3::{1..2,4}, City4::1..3]

- How to select order of variable assignment?
  - dynamic vs. static
- How to select values to be assigned from variable's domain?
  - a single value
  - sub-domain
  - **...**



### **Variable Selection**

- Static and dynamic
  - input order (static)
  - first-fail principle (smallest size of the domain)
  - smallest value in the domain
  - largest value in the domain
  - largest difference between the smallest and second smallest value in its domain
  - smallest max value in the domain
  - .



### **Value Selection**

- Single value
  - minimum in the domain and then upwards
  - maximum in the domain and then downwards
  - middle and then towards smallest and largest
  - random
  - **III**
- Domain split
  - split into two sub-domains
  - split into N
  - **...**



### **Search improvements**

- Partial enumeration algorithms (instead of labeling)
  - Credit Search,
  - Limited Discrepancy Search (LDS).
- Assignment of subintervals instead of values to domain variables — possibly examines a bigger part of a solution space.
- Problem-dependent specific heuristics.
- Neighbourhood search...









### **Summary and conclusions**

#### Advantages:

- focus on a specification of the problem, not on a solution method.
- unified framework for different algorithms to be used to solve a problem (by encapsulating them as constraints).
- easy definition of problems with many heterogeneous constraints.
- easy extension of a problem by adding new constraints.
- collaboration of solvers and distributed compute

### **Summary and conclusions**

- **Limitations:** 
  - NP-hard problems.
  - often non-predictable behavior of a solver.
  - difficult to define and add new constraints:
    - \* into existing systems interface problems.
    - \* new propagation algorithms need to be developed.
  - difficult to match constraints with actual problems.



### **CP finite domain systems**

- SICStus Prolog
- CHIP from COSYTEC
- IF/Prolog
- **ILOG**
- Mozart/Oz
- Gnu Prolog
- JaCoP Java based our own solver



#### **Selected CP Web resources**

- Constraints archive http://www.cs.unh.edu/ccc/archive
- Guide to constraints programming http://kti.ms.mff.cuni.cz/~bartak/constraints
- Sicstus manual http://www.sics.se/isl/sicstus/sicstus\_toc.html
- Gnu Prolog http://www.gnu.org/software/prolog/prolog.html
- Mozart/Oz http://www.mozart-oz.org/

### **Other resources**

- Book
  - K. Mariott and P. J. Stuckey *Programming with Constraints: An Introduction*, The MIT Press, 1998.
- Conferences
  - Principles and Practice of Constraint Programming (CP)
  - The Practical Application of Constraint Technologies and Logic Programming (PACLP)
- Journal
  - Constraints (Kluwer Academic Publishers)



### **Selected Papers**

- Kuchcinski, K., Embedded System Synthesis by Timing Constraints Solving, Proc. 10th International Symposium on System Synthesis, Antwerp, Belgium, September 17-19, 1997.
- Gruian, F. and Kuchcinski, K., Operation Binding and Scheduling for Low Power Using Constraint Logic
   Programming, Proc. 24th Euromicro Conference, Workshop on Digital System Design, Västerås, Sweden, August 25-27, 1998.
- Kuchcinski, K. and Wolinski, Ch., Global Approach to Assignment and Scheduling of Complex Behaviours based on HCDG and Constraint Programming, Journal of Systems Architecture, 2003, Elsevier Science.

### **Selected Papers (cont'd)**

- Kuchcinski, K., Constraints Driven Design Space Exploration for Distributed Embedded Systems, Journal of Systems Architecture, vol. 47, no. 3-4, pp. 241-261, 2001, Elsevier Science.
- Szymanek, R. and Kuchcinski, K., A Constructive Algorithm for Memory-Aware Task Assignment and Scheduling, Proc. 9th International Symposium on Hardware/Software Codesign, Copenhagen, Denmark, Apr. 2001.
- Szymanek, R. and Kuchcinski, K., Partial Assignment Technique for Task Graph Scheduling, 40th DAC, Anaheim, USA, June 2003.
- Kuchcinski, K. Constraint-driven scheduling and resource assignment, ACM Trans. on Design Automation of Electron Systems, vol. 8, no. 3, pp. 355-383, 2003.

# Sponsored by:













# ESSES 2003 European Summer School on Embedded Systems

# Lecture Notes Part XI

Embedded Sysems: Introduction and Overview





Editors: Ylva Boivie, Hans Hansson, Jane Kim, Sang Lyul Min



Strängnäs, August 20-22, 2003 ISSN 1404-3041 ISRN MDH-MRTC-106/2003-1-SE

# **Real-Time Communication**

Kang G. Shin

The University of Michigan

## **Real-Time Communication**

### Kang G. Shin

Real-Time Computing Laboratory
Department of Electrical Engineering and
Computer Science
The University of Michigan
Ann Arbor, MI 48109, USA
kgshin@eecs.umich.edu

### Why and What Real-Time Communication?

- Between sensors & control panel and processors Between processors
- Non RT protocols: throughput RT protocols: (absolute/prob.) delay guarantees
- Message delay = formatting and/or packetization + queueing + xmission + depacketization
- How to derive message deadlines?
- Message drop preference in case of congestion.
- RT traffic sources:
  - Constant rate
  - Variable rate
- Traffic characteristics may change inside the network

### **Network Topologies**

- Diameter, node degree, fault-tolerance
- Types:

**Shared:** 

**Point-to-point:** 

Switching

Packet SW: routing, flow control, & scheduling

**Circuit SW:** 

Cut-thru SW: Wormhole, virtual cut-thru

Interconnections: system and node levels,
 I/O connectivity

### Virtual-time CSMA

• 2 synchronized clocks, "real" and "virtual," for each station.

RC: to record message arrival times

**VC:** to count up to a virtual time to start xmission, VSX(M). *freezes* when channel is busy and *runs faster* than RC when channel is idle.



- When VC>VSX(M), packet M becomes eligible for xmission.
   If collision, modify VSX value.
- Question: How to initialize when channel becomes free, and how to modify VSX(M) when ∃ a collision involving M?

#### More on VTCSMA

au: Signal propagation time  $A_M$ : Arrival time of message M  $T_M$ : Time required to xmit M  $D_M$ : Deadline of M  $L_M$ : Latest time M to be sent to meet  $D_M$   $\Lambda_M(t)$ :  $D_M - T_M - \tau - t$ 

$$VSX(M) = \left\{ egin{array}{ll} A_M & ext{for VTCSMA-A} \ T_M & ext{for VTCSMA-T} \ L_M & ext{for VTCSMA-L} \ D_M & ext{for VTCSMA-D} \end{array} 
ight.$$

In case of collision, M is rexmitted immediately with prob p or not; and VSX(M) is modified to be a random # in

$$I = \left\{ \begin{array}{ll} (\mathsf{current} \ \mathsf{VC}, L_M) & \mathsf{for} \ \mathsf{VTCSMA-A} \\ (0, T_M) & \mathsf{for} \ \mathsf{VTCSMA-T} \\ (\mathsf{current} \ \mathsf{RC}, L_M) & \mathsf{for} \ \mathsf{VTCSMA-L} \\ (\mathsf{current} \ \mathsf{RC}, D_M) & \mathsf{for} \ \mathsf{VTCSMA-D} \end{array} \right.$$

### More on VTCSMA

• When channel changes from busy to idle

$$\label{eq:VC} \text{VC} = \left\{ \begin{array}{ll} \text{no change} & \text{for VTCSMA-A} \\ 0 & \text{for VTCSMA-T} \\ \text{RC} & \text{for VTCSMA-L and -D} \end{array} \right.$$

Effects of clock skew

|   | <b>.</b> | Actual RC at M's arrival | RC at node  | $D_M$ | $L_M$ |
|---|----------|--------------------------|-------------|-------|-------|
| 1 | 1        | 8                        | actual RC-1 | 32    | 16    |
| 2 | 2        | 9                        | actual RC+1 | 36    | 20    |

VC runs twice faster than RC.

 $N_1$  and  $N_2$  xmit  $M_1$  and  $M_2$  at their VC values 16 and 20, corresponding to their RC values 8 and 10 which are identical.

• Lower loss rate than CSMA but wastes net bandwidth by idling the net.

#### **Window Protocol**

- Use events on the bus to synchronize node actions
- Each node maintains a time "window"
- When  $L_M$  of a packet falls in this window and the channel is idle, the packet is *eligible* for xmission.
- Each node maintains a stack of window history, (upper bound of window, ID of collided message or 0).
- Shrink or enlarge window based on the events on the bus.
- What does a collision even when w = 1 mean and what to do?
- What's a slot?

### **Ethernet-based RT Communication**

- Advantages
  - low price & simple management
  - mature technology
- Diffi cult to provide RT guarantees CSMA/CD
  - packet collision
  - multiple collisions due to bursty non-RT packets

## **Ethernet MAC protocol**

- CSMA/CD protocol
  - No exclusive medium control
  - Listen before transmit
  - Stop transmission upon sensing collision
- Binary Exponential Backoff (BEB): backoff range increases with collisions.

# **A New Solution**



# Fixed-Rate Traffi c Smoothing



## Disadvantages of Fixed-Rate Smoothing

- Poor scalability due to network-wide input limit distributed to stations
- Low network utilization by non-RT traffi c
  - constant station input limit
- Solution adaptive-rate traffi c smoothing

# Fixed-Rate vs. Adaptive-Rate Smoothing



## Harmonic-Increase and Multiplicative-Decrease Adaptation

- Parameter: minimum packet inter-arrival time or refreshing period (RP)
- Uses NIC-collected collision statistics
- Adaptation
  - No collision: RP decreased by ∆ periodically
  - Upon collision: RP doubled

# Experimental Evaluation of Adaptive-Rate Smoothing

- 11 PCs
- 10BASE-T Ethernet LAN
- Collision domain diameter = 10 m

# Topology



## Traffi c Generation Statistics

- RT traffi c
  - a 100 byte long message every 0.3 sec from PC-n
  - total RT traffi c generation rate 53.3
     kbps
  - roundtrip deadline 129.6 msec
- non-RT traffi c
  - non-greedy mode
    - a burst (1 MB) every 25 sec from an activated node
    - \* 320 kbps from a single node
  - greedy mode
    - \* zero burst inter-arrival time
    - \* variable throughput

# RT Message-Loss Ratio (non-greedy mode)



# Avg Xmit Time of non-RT Bursts (non-greedy mode)



# RT Traffi c Roundtrip Delays





No smoothing

adaptive smoothing

# RT Message-Loss Ratio (greedy mode)



# Throughput of non-RT Traffi c



## Summary on Real-Time Ethernet

- Ethernet augmented with middleware for soft RT guarantees
  - compatible with TCP/IP, UDP/IP and Ethernet standard
  - implemented on Linux and Windows NT
- Prioritization and traffi c smoothing
- Fixed-rate traffi c smoothing
- Adaptive-rate traffi c smoothing

### **Token-Passing Protocols**



Overheads: med. propagation time, token xmission time and capture delay, and NI latency.

#### Timed-Token Protocol (TTP): 2 types of traffic

- Synchronous/RT: xmit up to h units of time (UOTs) every T UOTs.
- Asynchronous/non-RT: uses any bandwidth left unused by synchronous traffic.

# Target Token Rotation Time (TTRT)

- Token cycle time ≤ 2 × TTRT How do you determine TTRT?
- Time available to xmit packets per TTRT  $t_p = \text{TTRT} \Theta$
- Node *i* gets SBA,  $B \times \frac{h_i}{t_p}$ .
- If cycle time > (<) TTRT token is *late* (*early*). Xmit only synch. traffi c up to h
   (AND a certain amount of asynch. traffi c).
- Questions
  - How to determine  $h_i$ ?
  - Why token early or late?
  - Is avg token cycle time still ≤ TTRT?

### **Supporting Periodic Communication**

- Node i needs to xmit  $c_i$  bits of RT traffi c every  $P_i$  UOTs
  - TTRT  $\leq P_i/2$
  - Synch. bandwidth is greater than  $t_p B c_i / \lfloor \frac{P_i}{\mathsf{TTRT}} 1 \rfloor$
- How to deal with token loss?
   Claim-token (contains TTRT it requests)
   Beacon packet if every node doesn't recenditemizeve its own claim-token or a normal packet within TTRT(i) after xmitting claim-token.

Station immediately downstream from a break is the only node that will keep xmitting.

### **IEEE 802.5 Token Ring Protocol**



SD: Starting Delimter SA: Source Address AC: Access Control FS: Frame Status ED: Ending Delimiter 00 dest unavailable DA: Destination Address

10 frame uncopiable 11 frame copied

- FS field checked by the sender
- Sender removes data frame it sent
- Priority arbitration via AC field

   3 bits for current and reserved priorities
  - When data frame or a token goes by, a node checks the reserved priority beginitemizets: do nothing if higher else write its priority into AC
  - Upon completion of current xmission, the sender issues a token with priority in AC
  - Node that increased priority is responsible for restoring it to prior priority value
- Schedulability Analysis of Token Ring  $T_1, T_2, \dots, T_n$  are schedulable iff  $\forall i \exists t \leq d_i$  such that  $\sum_{j+1}^{i} e_j \lceil \frac{t}{P_i} \rceil$  + system overhead +  $b_i \leq t$ .

#### Polled Bus Protocol

- Synchronous contention resolution
- Stations contend for the channel during a contention period
- The station with the highest priority message gets access to the medium
- Example: Controller Area Network (CAN)

#### **Bus Acquisition Algorithm**

- Calculate a unique ID of m beginitemizets.
- If bus not busy then
  - Write its ID to the bus, one beginitemizet at a time, starting with MSB
  - Wait for a fi nite time and sample the bus
  - If the value read by the processor is different from the value it wrote into the bus, it drops out
  - After m rounds, the processor with the highest ID has sole control of the bus.
- The bus is assumed to wired-OR all impinging signals and all stations are synchronized.
- Key: how to design station IDs?

## Stop-and-Go or Framing Protocol

- Frame = a time interval
   Each frame type represents a diff time interval and is associated with a traffic class.
- Upon arrival of a type- $f_i$  pkt at an intermediate node n, it's held by n at least till the beginning of next instance of  $f_i$ .



- All nodes eligible for xmission are served in nonpreemptive priority order, with shorter-frame pkts having priority over longer-frame pkts
- If net load  $\leq$  a certain limit, a type-m pkt will be xmitted within  $f_m$  time units of benditemizeng eligible for xmission, i.e., bounded delay at each hop

### Hierarchical Round-Robeginitemizen Protocol

- Guarantees each traffi c class i to xmit  $m_i$  pkts every  $T_i$  UOTs.
- Traffi c is classifi ed into n classes, where each class i is associated with  $(n_i, b_i, \Phi_i)$ 
  - $n_i$ : max # of class-i pkts xmitable during any given frame of which source j is allocated a certain max #  $\alpha_i$ .
  - If  $\alpha_j$  pkts are xmitted or no class-i pkts left for xmission, class-(i+1) pkts if any are xmitted, etc., for a max of  $b_i$  pkts during that class-i frame
  - $\Phi_i$  = frame associated with class i  $\Phi_1 < \Phi_2 < \ldots < \Phi_n$ .

#### Point-to-Point Networks

- Attractive because of fault-tolerance capability
- Allow multiple conversations to go on simultaneously on different links
- Access to the links can be controlled easily
- Drawback: Higher latency
- Desirable: effi cient broadcast algorithms

## **Guaranteed Delivery**

- Message generation characteristics
  - Source, Destination
  - Maximum message length:  $S_{max}$  (bytes)
  - Minimum inter-arrival time:  $R_{max}$  (msgs/sec)
  - Maximum burst size:  $B_{max}$  (msgs)
  - Desired bound on message latency: D
- In any time interval of length t, the number of messages generated may not exceed  $B_{max} + t \cdot R_{max}$ .
- A pair of uni-directional real-time channels should be established between source and destination before messages can be transmitted on them.

## **Delivery Time Guarantee**

• The *logical generation time*,  $\ell(m)$ , for a message m is defined as

$$\ell(m_0) = t_0$$
  
$$\ell(m_i) = \max\{\ell(m_{i-1}) + I_{\min}, t_i\}$$

where  $t_i$  denotes the actual generation time of message  $m_i$ , and  $I_{min}$  is the reciprocal of  $R_{max}$ .

• If D is the end-to-end delay for the channel, the system guarantees that any message  $m_i$  will be delivered to the destination node by time  $\ell(m_i) + D$ .

# **Illustration**



#### **Channel Establishment**

- Select a source to destination route for the channel
- Compute a feasible worst-case delay at each link (if possible) based on the characteristics of all channels in the system
- Check whether the total delay is acceptable and redistribute the delays
- Compute the buffer requirement at each link
   <u>Note</u>: This computation is dependent upon the link
   message scheduling algorithm used during
   transmission.

## **Delay Computation**

- should maintain the feasibility of existing channels
- should obtain the minimum feasible delay
- should be linked to the run-time message scheduling policy
- distinguish between the feasibility testing and the run-time message scheduling policy
- use an optimality result proved by Dertouzos ('74)

## **Assignment Procedure**

- Arrange the channels in ascending order of their associated delay d<sub>i</sub>.
- Assign the highest priority to the new channel  $M_{k+1}$ . Assign priorities to the other channels based on their delay.
- Compute the new (worst-case) response times  $r'_i$  for the existing channels based on this priority assignment.
- Find the smallest position q such that  $r'_i \leq d_i$  for all channels with priority less than q.
- Assign priority q+1 to the new channel and compute the response time  $r'_{k+1}$ .

### Response Time

Consider a set of channels  $\{M_i=(C_i,d_i,p_i), i=1,\ldots,m\}$  which share a common link  $\ell$ .

$$S_i = \{d_i\} \bigcup \{kp_j \mid j = 1, ..., i-1; k = 1, ..., \lfloor (d_i/p_j) \rfloor \}$$

$$W_i(t) = \sum_{j=1}^{i-1} C_j \cdot \lceil t/p_j \rceil + C_i$$

The worst-case response time for messages belonging to  $M_i$  is the smallest value of t such that  $W_i(t) = t$ .

## Run-time Scheduling

- Problem with fi xed-priority scheduling: arrivals are not strictly periodic
  - arrival time at a node depends on the actual delay at the previous node
  - model allows burst arrivals
- High priority arrivals can disrupt the scheduling of lower priority messages

# **Illustration**



#### Run-time Scheduling

- Deadline Scheduling can be used to overcome this problem
- Based on the logical arrival time of the message at a node

The logical arrival time for  $m_i$  at node b,  $\ell_{c,b}(m_i)$ , is defined as

$$\ell_{c,b}(m_i) = \ell_{c,a}(m_i) + d_{c,a}$$

where  $d_{c,a}$  is the worst-case delay for messages on channel  $M_c$  at node a.

Is the feasibility testing still valid?

#### Run-time Scheduling

Uses a multi-class Earliest Due Date (EDD) algorithm

- **Queue 1** Packets belonging to real—time channels with  $\ell_c(m_i) \leq \text{current\_time}$ , arranged in the order of increasing deadlines.
- Queue 2 Other packets arranged in the order of increasing deadlines.
- Queue 3 Packets belonging to real-time channels with  $\ell_c(m_i) > \text{current\_time}$ , arranged in the order of increasing logical arrival time.

#### **Buffer Management**

- Buffer space is reserved for channels at the source, destination, and at intermediate nodes.
  - depends on  $B_{max}$ ,  $R_{max}$ , and the link delays
- Flow-control enforced: time-based
  - packets in Queue 3 are considered for transmission only when their logical arrival time < current time + horizon</li>

### **Buffer Requirement**

- min space =  $S_{max} \cdot \lceil (d_{prev} + d_{node}) / I_{min} \rceil$
- buf space =  $S_{max} \cdot \lceil (d_{prev} + d_{node} + H)/I_{min} \rceil$
- max space

$$= S_{max} \cdot \lceil d_{cumul}/I_{min} \rceil + S_{max} \cdot B_{max}$$

#### Fault-Tolerant Real-Time Channel

- Static routing makes real-time channels unable to tolerate component failures
- Dynamic routing would make it difficult to guarantee delivery-delay bounds
- Possible solutions:
  - Partially-dynamic routing or local detours
  - Multiplexed backkup channels
  - Reactive approach, i.e., do nothing until breaks.

## Partially-Dynamic Routing

- Set up a primary real-time channel
- Enhance the channel with some extra links and nodes
- Use the primary under normal circumstances, and use the extra links/nodes when the primary breaks down.

# Single Failure Immune RTC



# Isolated Failure Immune RTC



#### **Backup Channels**

#### Approach:

- A dependable connection = a primary channel + backup channels
- Reservation of spare resources in advance
- Advance recovery-route selection (off-line end-to-end rerouting)

#### Issues:

- Per-connection dependability-QoS control
- Spare resource allocation
- Channel failure detection
- Time-bounded failure recovery
- Resource reconfi guration

## Overview of Self-Healing Recovery



#### Summary

- E2E real-time communication is achieved via
  - connection establishment
  - run-time scheduling
  - buffer management
- Extensions for fault-tolerance
  - Local detours: SFI and IFI
  - Backup channels and their multiplexing
  - Reactive approach

References can be found from http://kabru.eecs.umich.edu

# Wayne Wolf

Princeton University

# **Challenges in Embedded Computing**

Wayne Wolf
Dept. of Electrical Engr.
Princeton University

# The push from Moore's Law

#Moore's Law continues---1.4 billion transistor chips @ 10 GHz in 2012.

microprocessor transistors/chip

1999 2001 2003 2006 2009 2012

21M 40M 76M 200M 520M 1.40B

#What do we do with all those transistors?

## Filling up chips

- # Effective use of VLSI manufacturing requires:
  - △ large volumes;
  - □standard services, but with product differentiation;
  - ☐ fast design time.
- **#Fast turnaround means:** 
  - □lots of memory;
  - △embedded processors.

# The application push

- #Internet is a platform for distributed computing.
  - □But must come up with interesting applications.
- \*\*Non-PC appliances will become increasingly important.

# **Potential applications**

- **₩Wireless** 
  - □already here
  - △3G pushes more complex applications
- **#Multimedia** 
  - △digital cameras with compression popular
  - more sophisticated applications are possible

# One way to use a billion transistors

- #Multimedia-rich Internet information appliances:
  - △ Multimedia allows people to use the Internet for person-person communication.
  - □ Complex functionality requires large software content.

  - △ High-volume market requires single chip.

# Characteristics of embedded systems

- # Multiple task, heterogeneous.
- ₩ Real-time.
- #Often low power.
- - ☐I reboot my piano every 4 months, my PC every day.

# System design challenges

- **#Single-chip multiprocessor architectures.**
- #Embedded real-time software.
- #IP-based design.
- #Design verification.

# Design productivity gap #From ITRS 99: The productivity gap #From ITRS 99: ### The productivity gap ### The productiv



#### What do we do?

- #Move some design tasks to software.
  - Need abstract models of hardware for performance, power.
- **#Create platform architectures.** 

  - Design reliablity, low power, etc. into the platform for use by software.

#### **Architectures and tools**

#### **%Tools:**

- Some horizontal abstractions (RTOS, compiler, etc.)

#### **\*\*Architectures:**

- ☐ There will be more than one platform.
- □ Domain knowledge is embodied in the architecture.

## Why multiple platforms?

- People still care about cost.
- #People care about power consumption.
- #Sufficiently general solutions don't fit on one chip.

# Multiprocessor systemson-chips

- #Heterogeneous multiprocessors.
  - □ Optimized for area, power.
- #Processing elements may be CPUs or hardwired (or analog).
- **#Custom memory architecture.**
- **#Moving toward networks-on-chips.**
- **#Need software development platforms.**









#### **Printers**

- ₩High-speed vs. low-cost:
  - □uniprocessor vs. multiprocessor hardware?







# TI Open Multimedia Applications Platform



http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm



# **ADI ADSL engine**

#### 



http://www.analog.com/industry/signal chains/auto/communications/comms 1.html

# **Agere StarPro platform**



http://www.lucent.com/micro/starpro/arch.html





#### **Division of labor**

- **#Platform design:** 
  - □ choose, characterize hardware units;
  - □ create the system architecture;
  - △optimize for performance, power.
- #Platform-based product design:

  - □ optimize programs.

# Semiconductor vs. systems house

- #Semiconductor house designs the platform.
- #Systems house customizes the platform for its system:
  - □ customization may be done in-house or by contractor.

## Platform design challenges

- #Does it satisfy the application's basic requirements?
- #Is it sufficiently customizable? And in the right ways?
- **#Is it cost-effective?**

# Platform design methodology

- #Size the problem.
  - ☐ How much horsepower? How much power?
- #Develop an initial architecture.
- #Evaluate for performance, power, etc.
- #Improve platform after each use.

# Platform use challenges

- #How do I understand the platform's design?
- **#How do I modify it to suit my needs?**
- How do I optimize for performance, power, etc.?

# Platform use methodology

- #Start with reference design, evaluate differences required for your features.
- **#Evaluate hardware changes.**
- #Implement hardware and software changes in parallel.

# **Summary**

- #Semiconductor houses are taking over system design functions.
- **\*Embedded multiprocessors solve a lot of problems.** 
  - riangle But there is no one universal platform.

# Hardware/Software Co-Design 1: Fundamentals and Beyond

Wayne Wolf
Dept. of Electrical Engineering
Princeton University

Sweden Summer School

© 2003 W. Wolf

#### Outline

- Models
- Scheduling theory
- Tasks in co-synthesis
- Classic co-synthesis: hardware/software partitioning.
- Improvement to classic co-synthesis.
- Hot swapping.

Sweden Summer School

© 2003 W. Wolf

# What is co-synthesis?

- Use tools to simultaneously generate hardware architecture and the software which runs on it.
  - Architecture may contain programmable CPUs, ASICs.
- Different styles of co-synthesis use different levels of inputs, synthesis steps, types of results.

Sweden Summer School

© 2003 W. Wolf

3

# Uses for co-synthesis

- Design space exploration:
  - architecture planning;
  - planning of multi-generation products;
  - marketing planning.
- Prototype creation.
- Implementation generation.

Sweden Summer School

© 2003 W. Wolf

#### **Terms**

- thread, process: a single execution; thread usually implies no firewalls between threads
- PE: processing element, may be CPU, ASIC
- custom ASIC: designed on the fly
- catalog ASIC: characteristics known in advance

Sweden Summer School

© 2003 W. Wolf







• Can model late arrivals, early departues by adding dummy processes.

Sweden Summer School

© 2003 W. Wolf



## Kahn process network

• Process has unbounded FIFO at each input:



- Each channel carries a possibly infinite sequence or stream.
- A process maps one or more input sequences to one or more output sequences.

Sweden Summer School

© 2003 W. Wolf

9

# Properties of processes

- Processes are usually required to be continuous: least upper boundedness can be moved across function boundary.
- Monotonicity:

$$-X \text{ in } X' \Longrightarrow F(X) \text{ in } F(X')$$

Sweden Summer School

© 2003 W. Wolf

#### Networks of processes

- A network of processes relates the streams of several processes.
- If I = input sequences, X = internal sequences + outputs, then network behavior fixed point is

$$-X = F(X,I)$$

Sweden Summer School

© 2003 W. Wolf

11

#### Network properties

- A network of monotonic processes is a monotonic process.
  - Even in the presence of feedback loops.
- Can add nondeterminism in several ways:
  - allow process to test for emptiness;
  - allow process to be internally nondeterminate;
  - allow more than one process to consume data from a channel;

Sweden Summer School

© 2003 W. Wolf

#### Processes as software objects

- A process is a unique execution of a program:
  - has its own state.
- A process may have a finite or infinite lifetime:
  - executed once or many times.
- A process may be executed periodically or aperiodically.

Sweden Summer School

© 2003 W. Wolf

13

#### Process timing characteristics

- Period: interval between successive initiations.
- Deadline: time at which an initiated process must complete.



## Process timing characteristics, cont'd

- Initiation time: time at which process goes into ready state. May come from:
  - external indeterminacy (jitter, etc.);
  - hidden data dependencies.



• Execution time: time required for process to complete assuming no interruptions.

Sweden Summer School

© 2003 W. Wolf

15

## Priority-driven scheduling

Each process is assigned a priority. Priority may be static or dynamic.

Highest-priority ready process becomes active.

A process runs until completion or preemption.

Sweden Summer School

© 2003 W. Wolf

#### Example

- P1: priority=1, execution time=10
- P2: priority=2, execution time =5
- P3: priority=3, execution time =20



### Rate-monotonic scheduling

- Liu and Layland: real-time scheduling with provable properties.
- Must guarantee that all processes meet their deadlines, independent of order of initiation.

#### RMS model

- All processes run on one CPU.
- Static priority for each process.
- No data dependencies between processes.
- Must meet all deadlines for any valid combination of initiation times.
- Ignore context switch overhead.

Sweden Summer School

© 2003 W. Wolf

19

#### RMS initiation times



Sweden Summer School

© 2003 W. Wolf



#### Rate monotonic analysis

- To meet deadlines, prioritize according to T<sub>i</sub>'s, with shortest-period process given highest priority.
  - Fixed-priority scheme.
  - Independent of C<sub>i</sub>'s.
- This priority assignment is optimal—no fixed priority scheme does better.

Sweden Summer School

#### Critical instant

• The critical instant for a process occurs when the process and all higher-priority processes are initiated simultaneously.



23

#### CPU utilization under RMS

- Given task set of size m, utilization U:
  - $U = \sum_{1 < i < m} C_i / T_i$
- Least upper bound for utilization:
  - $-U = m(2^{1/m} 1)$
- Utilization asymptotically approaches 69%.
- Some CPU cycles cannot be utilized because the CPU must have enough cycles available to respond to the tightest critical instant for the highest-priority process.
  Sweden Summer School © 2003

#### Earliest-deadline first

Priority is dynamically calculated: process with deadline occuring soonest has highest priority.

Liu and Layland showed EDF can achieve 100% utilization until overload occurs. Cannot guarantee meeting deadlines for arbitrary data arrival times.

Sweden Summer School

© 2003 W. Wolf



#### Schedule unrolling

To prove that a complex scheduling algorithm works, must unroll the schedule to the least-common multiple of all the period times.

MPEG-1 example: 44.1 kHz x 30 Hz requires 132,300 time intervals.

After unrolling, must consider arrival time combinations.

Sweden Summer School

© 2003 W. Wolf

27

#### Priority inversion

Problem arises when processes can request outside resources: bus, shared variable, etc.

Prioritization means that low-priority process can keep high-priority process from getting the resource, causing deadlock.

Sweden Summer School

© 2003 W. Wolf

#### Priority inversion example

#### Deadlock scenario:

- low-priority process L gets resource;
- preempted by higher-priority process H;
- H requests same resource;
- H can't get resource until L finishes, but H won't let L continue.

Can be solved by priority inheritance: L gets higher priority while it has the resource.

Sweden Summer School

© 2003 W. Wolf

29

#### Bus communication example



- Communication requires two resources: PE and channel.
- Long messages can hog communication resources.
- Bus communication must be prioritized.

Sweden Summer School

© 2003 W. Wolf

#### Synthesis tasks

- Scheduling: make sure that data is available when it is needed.
- Allocation: make sure that processes don't compete for the
- Partitioning: break operations into separate processes to increase parallelism, put serial operations in one process to reduce communication.
- Mapping: take PE, communication link characteristics into account.

Sweden Summer School

© 2003 W. Wolf

31

#### Scheduling and allocation

- Must schedule/allocate
  - computation
  - communication
- Performance may vary greatly with allocation choice.



Sweden Summer School

© 2003 W. Wolf

## Problems in scheduling/allocation

- Can multiple processes execute concurrently?
- Is the performance granularity of available components fine enough to allow efficient search of the solution space?
- Do computation and communication requirements conflict?
- How accurately can we estimate performance?
  - software

Sweden Summer School ASICs

© 2003 W. Wolf

33

#### TigerSwitch: allocation of DTMF

• Where do we put TouchTone (DTMF) detection?



### Partitioning example



## Problems in partitioning

- At what level of granularity must partitioning be performed?
- How well can you partition the system without an allocation?
- How does communication overhead figure into partitioning?

Sweden Summer School

© 2003 W. Wolf

## Tigerswitch partitoning problem: before

• Calls are parallel processes:



Sweden Summer School

© 2003 W. Wolf

37

## Tigerswitch partitioning problem: after

• Implement calls as array elements:



Sweden Summer School

© 2003 W. Wolf

#### Problems in mapping

- Mapping and allocation are strongly connected when the components vary widely in performance.
- Software performance depends on bus configuration as well as CPU type.
- Mappings of PEs and communication links are closely related.

Sweden Summer School

© 2003 W. Wolf

39

#### Hardware-software partitioning

Architectural template: CPU + 1 or more ASICs on a bus:



Sweden Summer School

© 2003 W. Wolf

# Properties of classic partitioning algorithms

- Single-rate.
- Single-threaded: CPU waits for ASIC.
- Type of CPU is known; ASIC is synthesized. Closely coupled to high-level synthesis.

Sweden Summer School

© 2003 W. Wolf

41

## Hardware/software partitioning styles

- Two major styles:
- Vulcan starts with all-ASIC solution and moves functions to software to reduce cost (primal method).
- COSYMA starts with all-software solution and moves functions to ASIC to meet performance goal (dual method).

Sweden Summer School

© 2003 W. Wolf

#### Vulcan

- Gupta and De Micheli: Target architecture:
   CPU + ASICs on bus
- Break behavior into threads at nondeterministic delay points; delay of thread is bounded
- Software threads run under RTOS; threads communicate via queues

Sweden Summer School

© 2003 W. Wolf

43

#### Specification and modeling

- Specified in Hardware C. Spec divided into threads at non-deterministic delay points.
- Hardware properties: size, # clock cycles.
- CPU/software thread properties:
  - thread latency
  - thread reaction rate
  - processor utilization
  - bus utilization

Sweden Summer School

© 2003 W. Wolf

## Vulcan thread modeling

## Hardware C allows conjunctive, disjunctive execution:

conjunctive: x=a; y=b; disjunctive: if (c>d) x=e; else y=f;



#### HW/SW allocation

- Start with unbounded-delay threads in CPU, rest of threads in ASIC.
- Optimization:
  - test one thread for move
  - if move to SW does not violate performance requirement, move the thread
  - feasibility depends on SW, HW run times, bus utilization
  - if thread is moved, immediately try moving its successor threads

Sweden Summer School

© 2003 W. Wolf

46

#### **COSYMA**

- Ernst et al.: moves operations from software to hardware.
- Operations are moved to hardware in units of basic blocks.
- Estimates communication overhead based on bus operations and register allocation.
- Hardware and software communicate by shared memory.

Sweden Summer School

© 2003 W. Wolf

47

## COSYMA design flow



Sweden Summer School

© 2003 W. Wolf

#### Cost estimation

- Speedup estimate for basic block *b*:
  - $\Box \Delta c(b) = w(t_{HW}(b) t_{SW}(b) + t_{com}(Z) t_{com}(Z+b)) * It(b)$
  - w = weight, It(b) = # iterations taken on b
- Sources of estimates:
  - Software execution time  $(t_{SW})$  is estimated from source code.
  - Hardware execution time  $(t_{HW})$  is estimated by list scheduling.
  - Communication time (t<sub>com</sub>) is estimated by data flow analysis of adjacent basic blocks.

Sweden Summer School

© 2003 W. Wolf

49

#### **COSYMA** optimization

- Goal: satisfy execution time. User specifies maximum number of function units in coprocessor.
- Start with all basic blocks in software.
- Estimate potential speedup in moving a basic block to software using execution profiling.
- Search using simulated annealing. Impose high cost penalty for solutions that don't meet execution time.

Sweden Summer School

© 2003 W. Wolf

#### Two-phase optimization

- Inner loop uses estimates to search through design space quickly.
- Outer loop uses detailed measurements to check validity of inner loop assumptions:
  - code is compiled and measured
  - ASIC is synthesized
- Results of detailed estimate are used to apply correction to current solution for next run of inner loop.

Sweden Summer School

© 2003 W. Wolf

51

#### SOS ILP formulation

- Prakash and Parker:
  - variables represent schedule;
  - constraints represent process dataflow;
  - -0/1 variables represent allocation.
- Minimize system cost, satisfy performance requirement.

Sweden Summer School

© 2003 W. Wolf

# Co-synthesis and ASIC performance optimization

- Need a more accurate model of ASIC performance.
  - Use high-level synthesis in the loop to estimate.

Sweden Summer School

© 2003 W. Wolf

53

## ASIC performance analysis

❖ Monet: a behavior-level architectural exploration system by Mentor Graphics, take behavioral description in VHDL or Verilog as input,

output RTL description



## ASIC performance analysis

\* ASIC implementation design space



Find out 2 extremes and search intermediate solutions between these 2 extremes.

Sweden Summer School

© 2003 W. Wolf





## Algorithm outline

- 1. Pre-process and find an initial solution
- 2. Iteratively reduce ASIC number and CPU cost

  ASIC\_to\_CPU procedure

  CPU cost reduction

  Allocation and scheduling
- 3. ASIC cost reduction global\_slack local slack

Sweden Summer School

© 2003 W. Wolf

## Iterative improvement

- 1. <u>ASIC\_to\_CPU</u>: 2 heuristics to reque processes from ASIC to CPU
  - b. Non-Critical path ASIC
- 2. <u>CPU cost reduction</u>: Reduce the existing CPU architecture cost.
  - a. reduce the number of CPU
  - b. replace the expensive CPU with a cheaper one
  - c. reduce cache cost
- 3. Allocation and scheduling
  - a. static urgency
  - b. dynamic urgency

Sweden Summer School

© 2003 W. Wolf



Sweden Summer School

© 2003 W. Wolf

-completion time ASIC j

### 1. Sort ASICs by decreasing Surfifecost (reductions (2s) nallest)

2. For each ASIC i in sorted list

replace the fastest ASIC with the smallest ASIC

if (meet deadline) use the smallest ASIC

else keep the fastest

3. Reduce the fastest ASIC cost:

for each ASIC\_j{

- a Set ASIC\_j\_ET=(fastest\_ET+global\_slack+local\_slack\_j)
- b foreach ASIC i in other ASICs

{ calculate local slack i and

set ASIC\_i\_ET=fastest\_ET+local\_slack\_i}

- c call ASIC analysis tool to get the total ASIC cost COST(i)}
- 4. Select the minimal COST(i) in step 3 and set corresponding ASIC ET.

Sweden Summer School

© 2003 W. Wolf



#### Characterization 1

- Lee/Henkel/Wolf: A certain task may have more than one implementation
- Different implementations of a task typically have different power consumption
- Swapping from one implementation of a certain task to another implementation of the same task implies an additional computation time and power consumption

Sweden Summer School

© 2003 W. Wolf

63

#### Characterization 2

- The basic functionality of the various implementations of a certain task is the same. At least one implementation is a superset of the other implementations
- Swapping can be triggered by two requirements: **constraints** and **cost minimization**







Swap to Task 1\_2

Sweden Summer School

Stay at Task 1\_1

#### Goal

- It is unlikely that all the most costly implementations of all tasks are scheduled to run at the same time since the requirements for swapping are different for different tasks
- Achieve an advantage using hot swapping as opposed to a system that has only one fixed implementation of a task which is typically the most costly one (in order to cover the worst case)

Sweden Summer School

© 2003 W. Wolf

65

#### Cost example

- RMS with Hot Swap
- Result: energy saving



Sweden Summer School

© 2003 W. Wolf

### Scheduling equations

$$t_{1b} = \sum_{\substack{\forall \tau_j \subset \{hp(i) \cup \tau_i\} \\ \bullet}} C \cdot \left\lceil \frac{t_{request} + 1}{T_i} \right\rceil$$

$$\bullet$$

$$\bullet$$

$$\bullet$$

$$\bullet$$

$$\bullet$$

$$t_{nb} = \sum_{\substack{\forall \tau_j \subset \{hp(i) \cup \tau_i\} \\ T_i}} C \cdot \left\lceil \frac{t_{nb-1} + 1}{T_i} \right\rceil$$

Sweden Summer School

© 2003 W. Wolf

67

### Feasibility test

- Case 1
  - Task τi has not yet started execution. Swapping is potentially possible if there is enough time left for conducting the swap (will be discussed later).
- Case 2
  - Task τi has already started execution. This case tells us that the feasibility test from above is not sufficient. We have to apply a second feasibility analysis

Sweden Summer School

© 2003 W. Wolf

#### Feasibility test example

• Swapping task 2 from implementation 1 to 2



Sweden Summer School

© 2003 W. Wolf

60

### Scheduling Strategy 1

- When is it safe to swap
- When to actually perform the swap (assumed that there has already been initiated a request to swap a.s.a.p.)
- *How* to swap (e.g. using intermediate swaps etc.)

Sweden Summer School

© 2003 W. Wolf

#### Scheduling Strategy 2

- At Design Time
  - Create database
  - Generate parameters
- At Run time
  - Perform feasibility test
  - Decide to swap or not
  - Decide which implementation to swap

Sweden Summer School

© 2003 W. Wolf

71

#### Summary

- Co-design is one branch of embedded system design.
- Models include functionality, HW and SW targets, performance.
- Scheduling real-time embedded systems is hard.
- Classic co-synthesis partitions functionality onto a CPU+ASIC template.

Sweden Summer School

© 2003 W. Wolf

# Hardware/Software Co-Design 2: Techniques

Wayne Wolf
Dept. of Electrical Engineering
Princeton University

ACM Workshop May 2001

© 2001 W. Wolf

1

#### Outline

- Distributed system co-synthesis.
- Communication.
- Reactive system co-synthesis.
- Low-power design.
- Memory systems.

ACM Workshop May 2001

© 2001 W. Wolf

### Why distributed systems?

- CPU cost is a non-linear function of performance.
  - Several small CPUs may be cheaper than one big one.
- Scheduling overhead must be paid for at the non-linear rate.

ACM Workshop May 2001

© 2001 W. Wolf

3

#### CPU cost vs. performance



#### Distributed system co-synthesis

- Can't take advantage of architectural template:
  - structure;
  - component characteristics.
- Generally multi-rate.

ACM Workshop May 2001

© 2001 W. Wolf

5

#### GCLP algorithm

Kalavade and Lee: global criticality/local phase; iterative algorithm.

Global criticality: critical path used to identify nodes to move onto ASIC.

Local phase: identify nodes which can be much more cheaply implemented in one medium than the other to reduce cost.

ACM Workshop May 2001

© 2001 W. Wolf

## Successive-refinement cosynthesis

- Wolf: scheduling, allocation, and mapping are intertwined:
  - process execution time depends on CPU type selection
  - scheduling depends on process execution times
  - process allocation depends on scheduling
  - CPU type selection depends on feasibility of scheduling
- Solution: allocate and map conservatively to meet deadlines, then re-synthesize to reduce implementation cost.

ACM Workshop May 2001

© 2001 W. Wolf

7

#### A heuristic algorithm

- 1. Allocate processes to CPUs and select CPU types to meet all deadlines.
- 2. Schedule processes based on current CPU type selection; analyze utilization.
- 3. Reallocate processes to CPUs to reduce cost.
- 4. Reallocate again to minimize inter-CPU communication.
- 5. Allocate communication channels to minimize cost.
- 6. Allocate devices, to internal CPU devices if possible.

ACM Workshop May 2001

© 2001 W. Wolf

#### Example



#### PE cost reduction step

- Step 3 contributes most to minimizing implementation cost. Want to eliminate unnecessary PEs.
- Iterative cost reduction:
- reallocate all processes in one PE;
- pairwise merge PEs;
- balance load in system.
- Repeat until system cost is not reduced.

ACM Workshop May 2001

© 2001 W. Wolf

# Technology mapping: interrupt allocation

- Rhodes and Wolf: assign limited interrupts to processes:
  - Allocate each process to a resource (processor);
  - Determine the **priority** of each process;
  - Decide which arcs should be interrupt driven and which polled, subject to a limit per resource.

ACM Workshop May 2001

© 2001 W. Wolf

11

# Analysis model used in Design

- POLLED data arrives "silently"
- INTERRUPT driven arc requires interrupt handling
  - Int. handlers assumed to have higher priority than any process
  - Interrupts are turned off while in an interrupt handler
    - Turned back on at conclusion
- No context switch overhead ACM Workshop May 2001 Switch © 2001 W. Wolf

# Communication and reactive systems

- Communication is specified as systems of communicating finite-state machines.
- Finite-state machine model is important: determines what designer must write and what can be synthesized.

ACM Workshop May 2001

© 2001 W. Wolf

13

#### SOLAR/PARTIF

- Ismail et al: partitioning and refinement for communication-rich systems.
- Specification in terms of Design Units, written in StateChart style. Design Unit can be composed of Design Units and Communication Units. State Table can be used to specify a Design Unit.
- Three types of architectural components: hardware, software, communication. Units can be connected in various topologies.

ACM Workshop May 2001

© 2001 W. Wolf

# SOLAR design process

- System is specified as a set of Design Units.
- DUs are assigned into hardware and software, but not allocated to particular units.
- Communication channels are allocated and bound.
- Design units are bound to particular architectural components.

ACM Workshop May 2001

© 2001 W. Wolf

15

#### **PARTIF**

- Interactive partitioning tool. Provides design statistics to guide partitioning moves.
- Basic transformations:
  - move—transfer one process to another point in the hierarchy
  - merge—fuse two processes into one
  - split—turn a sequential machine into parallel machines
  - cut—cut a set of parallel states into interconnected Design Units
  - map—transform communicating systems into interconnected subsystems

ACM Workshop May 2001

© 2001 W. Wolf

# Statecharts

- Harel: Early control-oriented specification model for reactive systems.
- Commercialized in the Statemate system.
- Provided hierarchical model for states.

ACM Workshop May 2001

© 2001 W. Wolf

17

# Statechart OR state



#### Esterel

- Berry: language for reactive real-time systems.
- Synchronous specification:
- Assumption: zero-time communication.
- Implement by forming product machine.

ACM Workshop May 2001

© 2001 W. Wolf

19

# **TOSCA**

Antoniazzi et al: targets control-dominated systems.

System modeled as communicating machines.

Use operators to explore design space:

- unfold entry/exit;
- unfold FSM emabling condition;
- flatten hierarchy;
- replace timers with counters;
- collapse several processes into one.

ACM Workshop May 2001

© 2001 W. Wolf

#### **POLIS**

Sangiovanni-Vincentelli et al: targets multirate reactive systems.

System modeled as network of Co-design FSMs (CFSMs).

Uses zero-delay hypothesis: communication happens instantaneously.



Polis, cont'd

Communication can be analyzed by forming product of communicating machines.

Partitioning assigns functions to hardware and software.

Library components can be described as CFSMs.

ACM Workshop May 2001

© 2001 W. Wolf

22

# Peformance analysis for reactive systems

Balarin and S-V: validate schedule for reactive system.

Specification is DAG of tasks with priorities and execution times.

Events initiate computation. Fpr every critical event from i to j, min time between 2 executions of i must be >= max time between execution of i and i.

Compute partial loads to find answer.

ACM Workshop May 2001

© 2001 W. Wolf

23

# Chinook

Borriello et al: targets control-dominated systems.

Major steps:

- HW/SW partitioning;
- device driver synthesis and low-level scheduling;
- I/O port allocation and interface synthesis;
- system-level scheduling;
- code generation.

ACM Workshop May 2001

© 2001 W. Wolf

# Power analysis and optimization

- Must estimate power/energy requirements:
  - multi-threaded
  - caches
- Are there tradeoffs between power and performance?

ACM Workshop May 2001

© 2001 W. Wolf

25

# Single-process power analysis

Tiwari and Malik: measure average current for instructions.

First-order model gives energy per instruction.

Second-order model takes into account inter-instruction interactions.



ACM Workshop May 2001

© 2001 W. Wolf

# TOSCA power estimation metrics

#### Types of systems and metrics:

- area-constrained: power\*area
- performance-constrained
  - fixed-throughput: energy per operation
  - maximum throughput: energy per op/max throughput = Power/T<sup>2</sup>
  - burst throughput:  $(E_{max} + E_{idle})/T_{max}$

ACM Workshop May 2001

© 2001 W. Wolf

27

# TOSCA power estimation

- Hardware:
  - IO section
  - data path
  - memory
  - control: primary inputs, state, combinational, outputs
- Software:
  - based on average current per instruction

ACM Workshop May 2001

© 2001 W. Wolf

# Low-power synthesis

Kirovski and Potkonjak: synthesis concentrates on processor allocation and task assignment.

Architectural template: multiple PEs on common bus with common memory.

Uses mixed strategy: power consumption per process + voltage scaling.

ACM Workshop May 2001

© 2001 W. Wolf

29

# Low power synthesis, cont'd

First step: allocate processors in template for minimum power for the task set:

- eliminate inferior PEs: more expensive, slower, or uses more energy
- iteratively improve to final configuration

Second step: assign tasks to PEs in the instantiated architecture.

ACM Workshop May 2001

© 2001 W. Wolf

# Policy optimization

Paleologo et al: describe power management using stochastic FSMs.

Power management policy determines how power is managed.

Finding optimal policy to maximize average performance can be solved in polynomial time.

ACM Workshop May 2001

© 2001 W. Wolf

31

# Stochastic model

Each component is a Markov chain:



ACM Workshop May 2001

© 2001 W. Wolf

# Stochastic model components

Sample Markov chain for service requestor:

0 1

 $PSR = 0 \mid 0.9 \ 0.1 \mid$ 

1 | 0.2 0.8 Probability of two requests in a row

Policy is the sequence of states taken by the power manager.

Optimal policy is stationary: depends only on current state of the system.

ACM Workshop May 2001

© 2001 W. Wolf

33

# Memory is important

- Relative energy per operation (Catthoor et al):
  - memory transfer: 33
  - external I/O: 10
  - SRAM write: 9
  - SRAM read: 4.4
  - multiply: 3.6
  - add: 1

ACM Workshop May 2001

© 2001 W. Wolf

# The cache's sweet spot

- Energy consumption has a sweet spot as cache size changes:
  - cache too small: program thrashes, burning energy on external memory accesses;
  - cache too large: cache itself burns too much power.

ACM Workshop May 2001

© 2001 W. Wolf

35

# Energy dissipation for embedded systems

Li and Henkel: concentrates on power consumption in memory hierarchy.

Memory models for cache and main memory:

- array
- peripheral logic

ACM Workshop May 2001

© 2001 W. Wolf

# Software energy and performance model

Based on behavioral simulation.

Total energy components:

- energy per instruction \* instructions
- data cache write miss penalty
- data read miss
- instruction fetch miss

ACM Workshop May 2001

© 2001 W. Wolf

37

# Design space exploration strategy

#### Software transformations:

- procedure in-lining
- loop unrolling

Cache may be made larger to increase performance at cost of higher energy utilization.

ACM Workshop May 2001

© 2001 W. Wolf



# Memory system methodology

- Catthoor et al:
  - optimize storage and transfer between components;
  - then merge components and apply optimizations to global model;
  - use loop transformations, reordering of computations, data type refinement, etc.

ACM Workshop May 2001

© 2001 W. Wolf

# Co-synthesis with memory hierarchies

Li and Wolf: use hierarchical memory model in co-synthesis system.

Target architecture:



# System architecture

Given processes and characteristics, PE technology information

#### Find:

- schedule for processes and allocation to PEs;
- cache sizes and allocation of processes to cache.

Two phases: parameter extraction for processes; synthesis.

ACM Workshop May 2001

© 2001 W. Wolf

42

# Cache assumptions

Only one level of cache; each process is well-behaved in that cache.

Caches are direct mapped; cache sizes are powers of two.

A task is mapped into a contiguous region of instruction cache.

ACM Workshop May 2001

© 2001 W. Wolf

43

# Mapping of tasks onto cache



#### Process characterization

- Characterize each process independently:
  - trace-based analysis;
  - identify cache footprint.
- Process parameters:
  - worst-case execution time;
  - best-case execution time;
  - average-case execution time.

ACM Workshop May 2001

© 2001 W. Wolf

45

# Task allocation and scheduling

#### Based on hierarchical scheduling algorithm:

- schedule each task independently;
- schedule system with task as atomic unit;
- instantiate tasks and optimize system schedule.



ACM Workshop May 2001

© 2001 W. Wolf

# Results

#### MPEG-2 encoder:

- without cache: 6 PEs, 121 CPU seconds
- fixed caches: 5 PEs, 157 CPU seconds
- synthesized caches: 4 PEs, 203 CPU seconds

Most CPU time goes into parameter evaluation, but this is outside synthesis loop and may be amortized across multiple designs.

ACM Workshop May 2001

© 2001 W. Wolf

47

# Code placement

- Potkonjak et al:
  - place code to minimize dynamic cache interference through placement and scheduling.



p1 p2

cache

code image
ACM Workshop May 2001

© 2001 W. Wolf

# Sponsored by:











# Cycle-Accurate Joulemeter for CMOS VLSI Circuits

Soo-lk Chae

Seoul National University, Korea





# Cycle-Accurate Joulemeter for CMOS VLSI Circuits

Soo-lk Chae Seoul National University chae@sdgroup.snu.ac.kr 2003. 8. 11

Inter-university Semiconductor Research Center

1



Seoul National University

#### **Contents**

|   |   | 4     |        |   |   | 4 . |   |   |
|---|---|-------|--------|---|---|-----|---|---|
|   | ı | 1 T P | $\sim$ | ~ |   | ~ti | ^ | n |
| _ |   |       | u      | ч | u |     | u |   |
| _ |   |       | _      | • | • |     | _ |   |

☐ CMOS Inverter Model

☐ Energy Estimation in Measurement System

- Energy model 1
- Energy model 2
- Energy model 3
- Energy model 4
- · Capacitance ratio constraint

#### □ Second Order Effects

- · Leakage current
- Overlapping current
- · Nonlinear capacitance
- □ Experimental Results
- □ Conclusion





#### **Motivation for Low-Power Design**

|  | Scaling | of | <b>CMOS</b> | techn | ology |
|--|---------|----|-------------|-------|-------|
|--|---------|----|-------------|-------|-------|

- · Higher functionality with smaller chips
- Higher performance at lower cost
- ☐ Cool chips and cool systems
  - Low-power design may ease the heat dissipation problem
- ☐ Recent portable compute-intensive applications
  - Multimedia
  - · Video display and capture
  - · Notebook computer
  - Personal communication systems
  - · Implantable medical electronics

Inter-university Semiconductor Research Center

3



Seoul National University

### **Power/Energy Estimation Techniques**

| ∟ Lov | v leve | l simul | lators |
|-------|--------|---------|--------|
|-------|--------|---------|--------|

- Precise estimation
- Large amount of simulation time due to the increase of circuit complexity
- ☐ High level energy simulators
  - · Fast simulation time
  - Lower accuracy

#### ■ Measurement-based energy estimators

- Digital multimeters, wattmeters: only can measure average or rms
  value
- Oscilloscope: little practical aspects in terms of time complexity and accuracy
- Need for cycle-accurate power/energy estimator





### **Need for Cycle-Accurate Power/Energy Estimator**

- Most digital systems consume very different amounts of energy for each cycle
- ☐ Need for cycle-accurate energy consumption profiles
- ☐ Software level energy optimization can be done
- ☐ Changing the energy sensitive factors while preserving the semantics of the original design

Inter-university Semiconductor Research Center





Seoul National University

### **Energy Flow during a Transition of a CMOS Inverter**

#### Inv Model A

#### Inv Model B



Inter-university Semiconductor Research Center



Seoul National University

### Inv Model A: when the output is rising





□ Before transition,

$$Q_1 = C_S V$$

$$Q_1 = C_S V_1$$
  $E_1 = \frac{1}{2} C_S V_1^2$ 

□ After transition,

$$Q_2 = (C_S + C_P + C_N)V_2$$

$$Q_2 = (C_S + C_P + C_N)V_2$$
  $E_2 = \frac{1}{2}(C_S + C_P + C_N)V_2^2$ 



### Inv Model A: when the output is rising





- $Q = (C_P + C_N)V_2$ ☐ Charge flow from C<sub>S</sub>:
- $oldsymbol{\square}$  Charge discharged locally:  $Q_{cancelled} = 0$
- $C_S V_1 = (C_S + C_P + C_N)V_2$ ☐ Charge conservation law:

Inter-university Semiconductor Research Center



Seoul National University

### Inv Model A: when the output is rising





Inter-university Semiconductor Research Center



Seoul National University

#### Inv Model A: when the output is falling





□ Before transition,

$$Q_1 = (C_S + C_P + C_N)V_S$$

$$Q_{1} = (C_{S} + C_{P} + C_{N})V_{2}$$

$$E_{1} = \frac{1}{2}(C_{S} + C_{P} + C_{N})V_{2}^{2}$$

□ After transition,

$$Q_2 = C_S V_2$$

$$Q_2 = C_S V_2 E_2 = \frac{1}{2} C_S V_2^2$$

Inter-university Semiconductor Research Center



Seoul National University

# Inv Model A: when the output falling





- ☐ Charge flow from C<sub>S</sub>: Q = 0
- $\Delta E = E_1 E_2 = \frac{1}{2} (C_P + C_N) V_2^2$ ☐ Energy consumed:
- $\ \ \, \Box \ \ \, \text{Charge discharged locally:} \quad \, Q_{cancelled} = \left(C_{P} + C_{N}\right) \!\! V_{2}$
- $C_S V_2 = C_S V_2$ ☐ Charge conservation law:





### Inv Model B: when the output rising





□ Before transition,

$$Q_1 = (C_S + C_P)V_1$$

$$Q_1 = (C_S + C_P)V_1$$
  $E_1 = \frac{1}{2}(C_S + C_P)V_1^2$ 

□ After transition,

$$Q_2 = (C_S + C_N)V_2$$

$$Q_2 = (C_S + C_N)V_2$$
  $E_2 = \frac{1}{2}(C_S + C_N)V_2^2$ 

Inter-university Semiconductor Research Center



Seoul National University

### Inv Model B: when the output is rising





- □ Charge flow from C<sub>s</sub>:
- $Q = C_N V_2$
- $\Box$  Charge dischargedd locally:  $Q_{cancelled} = C_P V_1$
- ☐ Charge conservation law:
- $C_S V_1 = (C_S + C_N) V_2$
- ☐ Energy consumed:
- $\Delta E = E_1 E_2 = \frac{1}{2} C_P V_1^2 + \frac{1}{2} C_N V_1 V_2$





### Inv Model B: when the output is falling





□ Before transition,

$$Q_1 = (C_S + C_N)V_2$$

$$Q_1 = (C_S + C_N)V_2$$
  $E_1 = \frac{1}{2}(C_S + C_N)V_2^2$ 

□ After transition,

$$Q_2 = (C_S + C_P)V_3$$

$$Q_2 = (C_S + C_P)V_3$$
  $E_2 = \frac{1}{2}(C_S + C_P)V_3^2$ 

Inter-university Semiconductor Research Center



Seoul National University

## Inv Model B: when the output is falling





- ☐ Charge flow from C<sub>s</sub>:  $Q = C_P V_3$
- $Q_{cancelled} = C_N V_2$ ☐ Charge cancelled locally:
- $C_S V_2 = (C_S + C_P) V_3$ ☐ Charge conservation law:
- $\Delta E = E_1 E_2 = \frac{1}{2}C_N V_2^2 + \frac{1}{2}C_P V_2 V_3$ ☐ Energy consumed:



#### **CMOS Inverter Model A vs Model B**

|        |                            | Simple Model A                                                    | Model B                                                                                         |
|--------|----------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| Output | Charge from C <sub>S</sub> | (C <sub>P</sub> +C <sub>N</sub> )V <sub>2</sub>                   | C <sub>N</sub> V <sub>2</sub>                                                                   |
|        | Energy consumed            | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sub>1</sub> V <sub>2</sub> | 1/2C <sub>P</sub> V <sub>1</sub> <sup>2</sup> + 1/2C <sub>N</sub> V <sub>1</sub> V <sub>2</sub> |
| Output | Charge from C <sub>S</sub> | 0                                                                 | C <sub>P</sub> V <sub>3</sub>                                                                   |
|        | Energy consumed            | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sub>2</sub> <sup>2</sup>   | 1/2C <sub>N</sub> V <sub>2</sub> <sup>2</sup> + 1/2C <sub>P</sub> V <sub>2</sub> V <sub>3</sub> |

Inter-university Semiconductor Research Center 17



Seoul National University

#### **CMOS Inverter Model A vs Model B**

Assuming  $C_s = \infty$ 

|        |                            | Simple Model A                                     | Model B                                            |
|--------|----------------------------|----------------------------------------------------|----------------------------------------------------|
| Output | Charge from C <sub>S</sub> | (C <sub>P</sub> +C <sub>N</sub> )V                 | C <sub>N</sub> V                                   |
|        | Energy consumed            | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sup>2</sup> | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sup>2</sup> |
| Output | Charge from C <sub>S</sub> | 0                                                  | C <sub>P</sub> V                                   |
|        | Energy consumed            | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sup>2</sup> | 1/2(C <sub>P</sub> +C <sub>N</sub> )V <sup>2</sup> |



# **Cycle-Accurate Energy Measurement System**



Inter-university Semiconductor Research Center

10



Seoul National University

## **Waveforms of Energy Measurement System**



☐ Three sampling points for each clock transition

Inter-university Semiconductor Research Center

#### **Settling Time Constraint**



- ☐ All data points must be sampled after V<sub>M</sub> is settled.
- $\Box$  T<sub>settling</sub> < T<sub>M</sub>/8
- $\Box$  T<sub>M</sub> = 1/f<sub>M</sub>, (f<sub>M</sub>: operating frequency during measurement)
- $\Box$  Typically  $f_M < f_O$  to meet the settling constraint. ( $f_O$ : actual operating frequency)

Inter-university Semiconductor Research Center

. .



Seoul National University

# Energy Model 1: Transferred energy under measurement





Difference of energy stored in  $C_{\rm M}$ 

$$E = \frac{1}{2}C_{M}V_{1}^{2} - \frac{1}{2}C_{M}V_{2}^{2}$$

- ☐ Energy transferred from capacitor C<sub>M</sub>
- ☐ Is this energy equal to the energy consumed in the chip?



**Seoul National University** 

# **Energy Model 1: C<sub>B</sub> Effect**



- $f \Box$   $f C_B$ : on-chip/off-chip bypass capacitor, parasitic capacitor on PCB or package
- ☐ Charge conservation law when S₂ on

$$C_M V_1(n) + C_B(n) V_3(n-1) = (C_M + C_B(n)) V_2(n)$$

... (1)

Inter-university Semiconductor Research Center

23



Seoul National University

# **Energy Model 1: Calculate C<sub>B</sub> & E(n)**





$$C_B(n) = \frac{V_1(n) - V_2(n)}{V_2(n) - V_3(n-1)} \cdot C_M$$

$$E(n) = \frac{1}{2} \left( C_M + C_B(n) \right) \left( V_2^2(n) - V_3^2(n) \right)$$

Inter-university Semiconductor Research Center





#### **Energy Model 1: Problem**





Difference of energy stored in  $C_{\mathrm{M}}$ 

$$E = \frac{1}{2} C_M V_1^2 - \frac{1}{2} C_M V_2^2$$

- ☐ Must consider the supply voltage drop during the energy flow
- ☐ Measured energy < real energy consumed with ideal supply

Inter-university Semiconductor Research Center

25



Seoul National University

## **Energy Model 2: Consumed energy**





$$E_{consumed} = C_L V_{DD}^{2}$$

- ☐ Energy consumption based on effective switching capacitance C₁
- $\hfill \square$  Assume that capacitance is time invariant and voltage independent
- ☐ No error due to the supply voltage variation



**Seoul National University** 

# **Energy Model 2: Calculate C**<sub>B</sub>





☐ Charge conservation law when S₂ on

$$C_M V_1(n) + C_B(n) V_3(n-1) = (C_M + C_B(n)) V_2(n)$$

... ①

☐ Same with energy model 1

$$C_B(n) = \frac{V_1(n) - V_2(n)}{V_2(n) - V_3(n-1)} \cdot C_M$$

Inter-university Semiconductor Research Center

27



Seoul National University

# **Energy Model 2: Calculate C<sub>L</sub>**





☐ Charge conservation law at the n-th edge of clock

$$(C_M + C_B(n))V_2(n) = (C_M + C_B(n) + C_L(n))V_3(n)$$

... ②

☐ Load capacitance for the n-th edge

$$C_L(n) = \frac{V_2(n) - V_3(n)}{V_3(n)} \cdot (C_M + C_B(n))$$





### **Energy Model 2: Calculate E(n)**



☐ Energy consumption for the n-th edge

$$E(n) = C_L(n) \cdot V_{DD}^{2}$$

Inter-university Semiconductor Research Center

20



Seoul National University

# Comparison between Models 1 and 2

☐ Model 1:

$$E(n) = \frac{1}{2} (C_M + C_B(n)) (V_2^2(n) - V_3^2(n))$$

☐ Model 2:

$$\begin{split} E(n) &= C_L(n) \cdot V_{DD}^{2} \\ &= C_L(n) \cdot V_2^{2}(n) \qquad (assume \ V_{DD} = V_2(n)) \\ &= \left(C_M + C_B(n)\right) \frac{V_2(n) - V_3(n)}{V_3(n)} \cdot V_2^{2}(n) \end{split}$$

 $\triangle$   $\triangle$  = E(n) of Model 1 - E(n) of Model 2

$$\begin{split} & \text{E(n) of Model 1 - E(n) of Model 2} \\ & \Delta E(n) = \frac{1}{2} \Big( C_M + C_B(n) \Big) \Big( V_2^{\ 2}(n) - V_3^{\ 2}(n) \Big) - \Big( C_M + C_B(n) \Big) \frac{V_2(n) - V_3(n)}{V_3(n)} \cdot V_2^{\ 2}(n) \\ & = \frac{1}{2} \Big( C_M + C_B(n) \Big) \cdot \Big( \Delta V \Bigg) \Big( -3\Delta V - \frac{2(\Delta V)^2}{V_3(n)} \Big) \\ & \cong -\frac{3}{2} \Big( C_M + C_B(n) \Big) \cdot \Big( \Delta V \Big)^2 \end{split}$$

Model 1 underestimates the consumed energy.

Inter-university Semiconductor Research Center



## Simplification In Energy Models 1 and 2



- ☐ Each load capacitor has only the capacitance to the ground line, not to the power line.
- ☐ CMOS inverter model A is used.

Inter-university Semiconductor Research Center

31



Seoul National University

# **Load Capacitors in CMOS VLSI Circuits**



☐ N: total number of load capacitors



### More realistic Models 3 and 4



- ☐ All the output nodes considered.
  - ☐ Group 1: stay low
  - ☐ Group 2: stay high
  - □ Group 3: switch from low to high□ Group 4: switch from high to low
- ☐ The load capacitors in the group 2 acts like C<sub>B</sub>
- ☐ CMOS inverter model B is used

Inter-university Semiconductor Research Center

33



Seoul National University

### **Classification of Load Capacitors**

| Group | Symbol                                       |                                         | Input States            |  |
|-------|----------------------------------------------|-----------------------------------------|-------------------------|--|
| 1     | C <sub>1N</sub> (n) / C <sub>1P</sub> (n)    | C <sub>2</sub> (n) / C <sub>1</sub> (n) | Stay in the low state   |  |
| 2     | C <sub>2N</sub> (n) / C <sub>2P</sub> (n)    | C <sub>4</sub> (n) / C <sub>3</sub> (n) | Stay in the high state  |  |
| 3     | C <sub>3N</sub> (n) /<br>C <sub>3P</sub> (n) | C <sub>6</sub> (n) / C <sub>5</sub> (n) | Switch from low to high |  |
| 4     | C <sub>4N</sub> (n) / C <sub>4P</sub> (n)    | C <sub>8</sub> (n) / C <sub>7</sub> (n) | Switch from high to low |  |



### **Energy Model 3: Transferred energy**



- $\Box$  Group 1 (stay low) :  $C_1(n)$ ,  $C_2(n)$
- $\square$  Group 2 (stay high) :  $C_3(n)$ ,  $C_4(n)$
- $\square$  Group 3 (low to high):  $C_5(n)$ ,  $C_6(n)$
- $\Box$  Group 4 (high to low):  $C_7(n)$ ,  $C_8(n)$

Inter-university Semiconductor Research Center

35



Seoul National University

### **Energy Model 3: Simplified**



$$E = (C_6 + C_7) \cdot V_{DD}^2$$

$$\Box$$
  $C_H(n) = C_B + C_1(n) + C_4(n)$ 



### **Energy Model 3: Calculate C<sub>H</sub>(n)**



☐ Charge conservation law before and after the clock edge

$$(C_M + C_H(n))V_2(n) = (C_M + C_{H67}(n))V_3(n)$$
 ... ①
$$C_H(n) = \frac{(C_M + C_{H67}(n))V_3(n)}{V_2(n)} - C_M = \left[\frac{V_3(n)}{V_2(n)} \left(\frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)}\right) - 1\right] \cdot C_M$$

 $\Box$   $C_{H67}(n) = C_H(n) + C_6(n) + C_7(n)$ . It will be calculated in next phase.

Inter-university Semiconductor Research Center

75 600

**Seoul National University** 

# **Energy Model 3: Calculate C**<sub>H67</sub>(n)



- $\square$  Open S<sub>2</sub> and close S<sub>1</sub> to charge C<sub>M</sub>
- $\Box$  Charge conservation law when  $S_2$  is closed ( $S_1$  open) after the edge n

$$C_{M}V_{1}(n+1) + C_{H67}(n)V_{3}(n) = (C_{M} + C_{H67}(n))V_{2}(n+1) \qquad \cdots \ \ \bigcirc$$

$$C_{H67}(n) = \frac{V_{1}(n+1) - V_{2}(n+1)}{V_{2}(n+1) - V_{3}(n)} \cdot C_{M}$$



### Energy Model 3: Calculate C<sub>L</sub>(n) & E(n)



$$E = (C_6 + C_7) \cdot V_{DD}^2$$

□ From 
$$C_{67}(n) = C_{H67}(n) - C_{H}(n)$$

$$C_{67}(n) = \left(1 - \frac{V_3(n)}{V_2(n)}\right) \left(\frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)}\right) \cdot C_M$$

Inter-university Semiconductor Research Center

39



Seoul National University

### **Energy Model 4: Consumed energy**



- $\square$  Group 1 (stay low) :  $C_1(n)$ ,  $C_2(n)$
- $\square$  Group 2 (stay high) :  $C_3(n)$ ,  $C_4(n)$
- $\square$  Group 3 (low to high):  $C_5(n)$ ,  $C_6(n)$
- $\Box$  Group 4 (high to low):  $C_7(n)$ ,  $C_8(n)$



### **Energy Model 4: Simplified**





$$E = \frac{1}{2} (C_5 + C_6 + C_7 + C_8) \cdot V_{DD}^2$$

- ☐ Heat dissipation
- $\Box C_{H}(n) = C_{B} + C_{1}(n) + C_{4}(n)$

Inter-university Semiconductor Research Center

44



Seoul National University

# Phase 1 of Model 4: Calculate C<sub>H58</sub>(n)



- $\Box$   $C_{H58}(n) = C_H(n) + C_5(n) + C_8(n)$
- ☐ Charge conservation law when S₂ is closed before the edge n

$$C_{M}V_{1}(n) + C_{H58}(n)V_{3}(n-1) = (C_{M} + C_{H58}(n))V_{2}(n) \qquad \cdots \bigcirc$$

$$C_{H58}(n) = \frac{V_{1}(n) - V_{2}(n)}{V_{2}(n) - V_{3}(n-1)} \cdot C_{M}$$





# Phase 2 of Model 4: Calculate C<sub>H</sub>(n)



☐ Charge conservation law during the clock edge

$$(C_M + C_H(n))V_2(n) = (C_M + C_{H67}(n))V_3(n)$$
 ... 
$$(C_M + C_H(n)) = \frac{(C_M + C_{H67}(n))V_3(n)}{V_2(n)} - C_M = \left[ \frac{V_3(n)}{V_2(n)} \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) - 1 \right] \cdot C_M$$

 $\Box$   $C_{H67}(n) = C_H(n) + C_6(n) + C_7(n)$ . It will be calculated in the next phase.

Inter-university Semiconductor Research Center



Seoul National University

# **Energy Model 4: Calculate C**<sub>H67</sub>(n)



☐ Charge conservation law when S₂ is closed after the edge n

$$C_{M}V_{1}(n+1) + C_{H67}(n)V_{3}(n) = (C_{M} + C_{H67}(n))V_{2}(n+1) \qquad \cdots \qquad \Im$$

$$C_{H67}(n) = \frac{V_{1}(n+1) - V_{2}(n+1)}{V_{2}(n+1) - V_{3}(n)} \cdot C_{M}$$



### Energy Model 4: Calculate C<sub>1</sub> (n) (n-th edge)

$$C_{H58}(n) = \frac{V_1(n) - V_2(n)}{V_2(n) - V_3(n-1)} \cdot C_M = \left[ \frac{V_1(n) - V_3(n-1)}{V_2(n) - V_3(n-1)} - 1 \right] C_M$$

$$C_{H67}(n) = \frac{V_1(n+1) - V_2(n+1)}{V_2(n+1) - V_3(n)} \cdot C_M = \left[ \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} - 1 \right] C_M$$

$$\left[ E = \frac{1}{2} C_L V_{DD}^2 \right] \quad C_H(n) = \left[ \frac{V_3(n)}{V_2(n)} \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) - 1 \right] C_M$$

$$C_{58}(n) = C_{H58}(n) - C_H = \left[ \frac{V_1(n) - V_3(n-1)}{V_2(n) - V_3(n-1)} - \frac{V_3(n)}{V_2(n)} \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) \right] C_M$$

$$C_{67}(n) = C_{H67}(n) - C_H = \left( 1 - \frac{V_3(n)}{V_2(n)} \right) \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) C_M$$

$$C_L(n) = C_{5678}(n) = \left[ \frac{V_1(n) - V_3(n-1)}{V_2(n) - V_3(n-1)} + \left( 1 - \frac{2V_3(n)}{V_2(n)} \right) \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) \right] C_M$$



**Seoul National University** 

### **Energy Model 4: Capacitance Ratio Constraint (1)**

☐ In the energy model 4, we obtained the following equation

$$C_{58}(n) = \left[ \frac{V_1(n) - V_3(n-1)}{V_2(n) - V_3(n-1)} - \frac{V_3(n)}{V_2(n)} \left( \frac{V_1(n+1) - V_3(n)}{V_2(n+1) - V_3(n)} \right) \right] C_M$$
$$= \left[ \Delta(n) - \frac{V_3(n)}{V_2(n)} \Delta(n+1) \right] C_M$$

- $\square$  But if the target chip is consists of large bypass capacitor  $C_B$ , relatively, the ratio of  $(C_I/C_B)$  becomes very small.
- □ And it results in very small signal level, that is,  $V_2(n)-V_3(n) \cong 0$ .
- If V₂(n)-V₃(n)≅0, then we may obtain the negative capacitance value of C₅8(n) for the small noise level.



### **Energy Model 4: Capacitance Ratio Constraint (2)**



- ☐ This plot is for the fixed noise level of 0.2mV.
- $\square$  Consequently, in case that (C<sub>L</sub>/C<sub>B</sub>) ratio is small, energy model 4 requires very precise measurement system.

Inter-university Semiconductor Research Center

47



Seoul National University

### **Energy Model Summary**

|         | Equation                                                    | Description                                                       |
|---------|-------------------------------------------------------------|-------------------------------------------------------------------|
| Model 1 | $E = \frac{1}{2}C_{M}V_{1}^{2} - \frac{1}{2}C_{M}V_{2}^{2}$ | Energy transferred from capacitor C <sub>M</sub> .                |
| Model 2 | $E = C_L V_{DD}^2$                                          | Energy consumption based on the load capacitance C <sub>L</sub> . |
| Model 3 | $E = \left(C_6 + C_7\right) \cdot V_{DD}^{2}$               | Energy consumed from supply.                                      |
| Model 4 | $E = \frac{1}{2} (C_5 + C_6 + C_7 + C_8) \cdot V_{DD}^{2}$  | Energy consumption in the chip. Heat dissipation.                 |

☐ Model 1, 2, 3 are related to the energy from supply, but model 4 is the energy consumed in the chip.



- □ Definition
  - $\Sigma_0$ : total energy transferred from supply
  - $\Sigma_1$ : total energy consumed in the chip
- $\ \square$  Using our energy models, we can write  $\Sigma_0$  and  $\Sigma_1$  as follows

$$\Sigma_0 = \sum_{n=1}^{N} E_3(n) = V_{DD}^2 \cdot \sum_{n=1}^{N} C_{67}(n)$$

$$\Sigma_1 = \sum_{n=1}^{N} E_4(n) = \frac{1}{2} V_{DD}^2 \cdot \sum_{n=1}^{N} C_{5678}(n)$$

Inter-university Semiconductor Research Center

49



Seoul National University

## Appendix (1): $\Sigma_0$ vs $\Sigma_1$

☐ For simplicity, we define as follows

$$\alpha_n = \frac{V_3(n)}{V_2(n)}$$

$$\Delta_n = \frac{V_1(n) - V_3(n-1)}{V_2(n) - V_3(n-1)}$$

 $\Box$  Then the C<sub>67</sub>(n) and C<sub>5678</sub>(n) in the previous section can be rewritten,

$$C_{67}(n) = (1 - \alpha_n) \Delta_{n+1} C_M$$

$$C_{5678}(n) = [\Delta_n + (1 - 2\alpha_n) \Delta_{n+1}] C_M$$



$$\begin{split} & \Sigma_{1} = \frac{1}{2}V_{DD}^{2} \cdot \sum_{n=1}^{2N} C_{5678}(n) \\ & = \frac{1}{2}V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ C_{5678}(2k-1) + C_{5678}(2k) \right] \\ & = \frac{1}{2}V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ \left( \Delta_{2k-1} + \left( 1 - 2\alpha_{2k-1} \right) \Delta_{2k} \right) + \left( \Delta_{2k} + \left( 1 - 2\alpha_{2k} \right) \Delta_{2k+1} \right) \right] C_{M} \\ & = \frac{1}{2}V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ \left( \Delta_{2k-1} - \Delta_{2k} + 2\left( 1 - \alpha_{2k-1} \right) \Delta_{2k} \right) + \left( \Delta_{2k} - \Delta_{2k+1} + 2\left( 1 - \alpha_{2k} \right) \Delta_{2k+1} \right) \right] C_{M} \\ & = \frac{1}{2}V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ \left( \Delta_{2k-1} - \Delta_{2k+1} \right) C_{M} + 2C_{67}(2k-1) + 2C_{67}(2k) \right] \\ & = V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ \left( C_{67}(2k-1) + C_{67}(2k) \right) \right] + \frac{1}{2}V_{DD}^{2} \cdot \sum_{k=1}^{N} \left[ \left( \Delta_{2k-1} - \Delta_{2k+1} \right) C_{M} \right] \\ & = \Sigma_{0} + \frac{1}{2}C_{M}V_{DD}^{2} \cdot \left( \Delta_{1} - \Delta_{2N+1} \right) \end{split}$$

Thus we can say that  $\Sigma_0 \cong \Sigma_1$  for large N, that is, (total transferred energy)=(total consumed energy).

Inter-university Semiconductor Research Center

51



Seoul National University

# Appendix (2): $\Sigma_1$ vs $\Sigma_2$

- □ Definition
  - $\Sigma_1$ : total energy consumed in the chip
  - $\Sigma_2$ : total energy consumed in the chip when  $V_2(n+1)$  is affected by noise
- $\square$   $\Sigma_2$   $\Sigma_1$  can be written as follows because the C<sub>5678</sub>(n) and C<sub>5678</sub>(n+1) are related with V<sub>2</sub>(n+1) and others are not.

$$\Sigma_{2} - \Sigma_{1} = \sum_{n=1}^{N} E_{4}^{*}(n) - \sum_{n=1}^{N} E_{4}(n)$$

$$= \frac{1}{2} V_{DD}^{2} \cdot \sum_{n=1}^{N} C_{5678}^{*}(n) - \frac{1}{2} V_{DD}^{2} \cdot \sum_{n=1}^{N} C_{5678}(n)$$

$$= \frac{1}{2} V_{DD}^{2} \left[ C_{5678}^{*}(n) + C_{5678}^{*}(n+1) - C_{5678}(n) - C_{5678}(n+1) \right]$$



☐ For simplicity, we define as follows

$$p_{n} = \frac{C_{58}(n)}{C_{M}} = \Delta_{n} - \alpha_{n} \Delta_{n+1}$$

$$q_{n} = \frac{C_{67}(n)}{C_{M}} = (1 - \alpha_{n}) \Delta_{n+1} = p_{n} + (\Delta_{n+1} - \Delta_{n})$$

$$\Delta V_{2} = V_{2}^{*}(n+1) - V_{2}(n+1) \quad (\Delta V_{2} : noise)$$

Inter-university Semiconductor Research Center

53



Seoul National University

### Appendix (2): $\Sigma_1$ vs $\Sigma_2$

 $\Box$  Then  $(\Sigma_2 - \Sigma_1)$  can be rewritten,

$$\begin{split} & \Sigma_{2} - \Sigma_{1} = \frac{1}{2} C_{M} V_{DD}^{2} \Big[ \Big( p_{n}^{\phantom{n}*} + q_{n+1}^{\phantom{n}*} + q_{n+1}^{\phantom{n}*} + q_{n+1}^{\phantom{n}*} \Big) - \Big( p_{n} + q_{n} + p_{n+1} + q_{n+1} \Big) \Big] \\ & = \frac{1}{2} C_{M} V_{DD}^{\phantom{DD}2} \Big[ \Big( \Delta_{n+1}^{\phantom{n}*} - \Delta_{n}^{\phantom{n}*} + 2 p_{n}^{\phantom{n}*} + \Delta_{n+2}^{\phantom{n}*} - \Delta_{n+1}^{\phantom{n}*} + 2 p_{n+1}^{\phantom{n}*} \Big) - \Big( \Delta_{n+1} - \Delta_{n} + 2 p_{n} + \Delta_{n+2} - \Delta_{n+1} + 2 p_{n+1} \Big) \Big] \\ & = \frac{1}{2} C_{M} V_{DD}^{\phantom{DD}2} \Big[ \Big( \Delta_{n+1}^{\phantom{n}*} - \Delta_{n} + 2 p_{n}^{\phantom{n}*} + \Delta_{n+2} - \Delta_{n+1}^{\phantom{n}*} + 2 p_{n+1}^{\phantom{n}*} \Big) - \Big( \Delta_{n+1} - \Delta_{n} + 2 p_{n} + \Delta_{n+2} - \Delta_{n+1} + 2 p_{n+1} \Big) \Big] \\ & = C_{M} V_{DD}^{\phantom{DD}2} \Big[ \Big( p_{n}^{\phantom{n}*} + p_{n+1}^{\phantom{n}*} \Big) - \Big( p_{n} + p_{n+1} \Big) \Big] & \cdots & \bigcirc \\ \end{split}$$

□ Note that  $\Delta_n^* = \Delta_n$  and  $\Delta_{n+2}^* = \Delta_{n+2}$  because they have no dependency on  $V_2(n+1)$ .



 $\Box$   $p_{n+1}^*$  can be written

$$\begin{split} p_{n+1}^{**} &= \Delta_{n+1}^{**} - \alpha_{n+1}^{**} \Delta_{n+2}^{**} \\ &= \frac{\Delta_{n}^{*} - p_{n}^{*}}{\alpha_{n}^{*}} - \frac{V_{3}^{*}(n+1)}{V_{2}^{*}(n+1)} \Delta_{n+2}^{**} \\ &= \frac{(p_{n} + \alpha_{n} \Delta_{n+1}) - p_{n}^{**}}{\alpha_{n}} - \frac{V_{3}(n+1)}{V_{2}(n+1) + \Delta V_{2}} \Delta_{n+2} \quad \left(\because p_{n} = \Delta_{n} - \alpha_{n} \Delta_{n+1}\right) \\ &= \Delta_{n+1} + \frac{p_{n} - p_{n}^{*}}{\alpha_{n}} - \frac{V_{3}(n+1)}{V_{2}(n+1)} \cdot \frac{1}{1 + \frac{\Delta V_{2}}{V_{2}(n+1)}} \Delta_{n+2} \\ &\cong \Delta_{n+1} + \frac{p_{n} - p_{n}^{*}}{\alpha_{n}} - \frac{V_{3}(n+1) \Delta_{n+2}}{V_{2}(n+1)} \cdot \left(1 - \frac{\Delta V_{2}}{V_{2}(n+1)}\right) \quad \left(\because \Delta V_{2} << V_{2}(n+1)\right) \\ &= \Delta_{n+1} - \alpha_{n+1} \Delta_{n+1} + \frac{p_{n} - p_{n}^{*}}{\alpha_{n}} + \alpha_{n+1} \Delta_{n+2} \cdot \frac{\Delta V_{2}}{V_{2}(n+1)} \\ &= p_{n+1} + p_{n} - p_{n}^{*} + \left(1 - \frac{1}{\alpha_{n}}\right) \left(p_{n}^{*} - p_{n}\right) + \alpha_{n+1} \Delta_{n+2} \cdot \frac{\Delta V_{2}}{V_{2}(n+1)} \quad \cdots \quad \textcircled{2} \end{split}$$

Inter-university Semiconductor Research Center





Seoul National University

### Appendix (2): $\Sigma_1$ vs $\Sigma_2$

 $\square$  In order to calculate  $\Delta V_2$  in Eq.  $\square$ , we solve the following equation for  $V_2(n+1)$ 

$$p_{n} = \Delta_{n} - \alpha_{n} \Delta_{n+1} = \Delta_{n} - \alpha_{n} \frac{V_{1}(n+1) - V_{3}(n)}{V_{2}(n+1) - V_{3}(n)}$$

☐ Then

$$V_2(n+1) = V_3(n) + (V_1(n+1) - V_3(n)) \cdot \frac{\alpha_n}{\Delta_n - p_n}$$

☐ In the same way

$$V_{2}^{*}(n+1) = V_{3}^{*}(n) + \left(V_{1}^{*}(n+1) - V_{3}^{*}(n)\right) \cdot \frac{\alpha_{n}^{*}}{\Delta_{n}^{*} - p_{n}^{*}}$$

$$= V_{3}(n) + \left(V_{1}(n+1) - V_{3}(n)\right) \cdot \frac{\alpha_{n}}{\Delta_{n} - p_{n}^{*}}$$



 $\square$   $\Delta V_2$  can be written

$$\begin{split} & \Delta V_2 = V_2^*(n+1) - V_2(n+1) \\ & = \left[ V_3(n) + \left( V_1(n+1) - V_3(n) \right) \cdot \frac{\alpha_n}{\Delta_n - p_n^*} \right] - \left[ V_3(n) + \left( V_1(n+1) - V_3(n) \right) \cdot \frac{\alpha_n}{\Delta_n - p_n} \right] \\ & = \alpha_n \left( V_1(n+1) - V_3(n) \right) \left( \frac{1}{\Delta_n - p_n^*} - \frac{1}{\Delta_n - p_n} \right) \\ & = \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n} \left( \frac{1}{1 - \frac{p_n^*}{\Delta_n}} - \frac{1}{1 - \frac{p_n}{\Delta_n}} \right) \\ & \cong \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n} \left( \frac{p_n^*}{\Delta_n} - \frac{p_n}{\Delta_n} \right) \quad (\because V_2(n) \cong V_3(n) \Rightarrow p_n << \Delta_n \right) \\ & = \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n^2} \left( p_n^* - p_n \right) \end{split}$$

Inter-university Semiconductor Research Center

31



Seoul National University

### Appendix (2): $\Sigma_1$ vs $\Sigma_2$

☐ In Eq. ②,

$$\begin{split} \alpha_{n+1} \Delta_{n+2} \frac{1}{V_2(n+1)} \Delta V_2 &= \alpha_{n+1} \Delta_{n+2} \frac{1}{V_2(n+1)} \cdot \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n^2} \left( p_n^* - p_n \right) \\ &= \left( \Delta_{n+1} - p_{n+1} \right) \frac{1}{V_2(n+1)} \cdot \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n^2} \left( p_n^* - p_n \right) \\ &= \left( \frac{\Delta_n - p_n}{\alpha_n} - p_{n+1} \right) \frac{1}{V_2(n+1)} \cdot \alpha_n \left( V_1(n+1) - V_3(n) \right) \frac{1}{\Delta_n^2} \left( p_n^* - p_n \right) \\ &= \left( 1 - \frac{p_n + \alpha_n p_{n+1}}{\Delta_n} \right) \frac{V_1(n+1) - V_3(n)}{V_2(n+1)} \cdot \frac{1}{\Delta_n} \left( p_n^* - p_n \right) \end{split}$$



 $\square$  From Eq. 1 and 2, we write  $(\Sigma_2 - \Sigma_1)$ ,

$$\begin{split} \Sigma_{2} - \Sigma_{1} &= C_{M} V_{DD}^{2} \left[ \left( p_{n}^{*} + p_{n+1}^{*} \right) - \left( p_{n} + p_{n+1} \right) \right] \\ &= C_{M} V_{DD}^{2} \left[ \left( 1 - \frac{1}{\alpha_{n}} \right) + \left( 1 - \frac{p_{n} + \alpha_{n} p_{n+1}}{\Delta_{n}} \right) \frac{V_{1}(n+1) - V_{3}(n)}{V_{2}(n+1)} \cdot \frac{1}{\Delta_{n}} \right] \left( p_{n}^{*} - p_{n} \right) \end{split}$$

- $\hfill \hfill \hfill$
- $\square$  And you can see that  $(\Sigma_2 \Sigma_1) \cong 0$ .
- □ Concluding remark
  - If  $V_2(n) \cong V_3(n)$ ,  $C_{5678}(n)$  is very sensitive to noise( $\Delta V_2$ ) and it sometimes goes to negative capacitance value.
  - But the total consumed energy is almost not changed by the noise( $\Delta V_2$ ).
  - From this, we can fix the error of C<sub>5678</sub>(n) when it goes to the negative value.

Inter-university Semiconductor Research Center

59



Seoul National University

### **Energy Model 2: Leakage Current Effect**



□ I<sub>leak</sub>: static current independent of clock. Leakage of static CMOS, analog sub-block current, and so on.



# **Energy Model 2: Calculate C<sub>B</sub> with I<sub>leak</sub>**





☐ Charge conservation law when S₂ on

$$C_M V_1(n) + C_B(n) V_3(n-1) = (C_M + C_B(n)) V_2(n) + I_{leak} \frac{3T_M}{8}$$
 ... ①

☐ Same with energy model 1

$$C_B(n) = \frac{C_M(V_1(n) - V_2(n)) - \frac{3I_{leak}T_M}{8}}{V_2(n) - V_3(n-1)}$$

Inter-university Semiconductor Research Center

61



Seoul National University

# Energy Model 2: Calculate $C_L$ with $I_{leak}$





 $\hfill \Box$  Charge conservation law at the n-th edge of clock

$$(C_M + C_B(n))V_2(n) = (C_M + C_B(n) + C_L(n))V_3(n) + I_{leak} \frac{T_M}{8} \qquad \dots \ \ \bigcirc$$

☐ Load capacitance for the n-th edge

$$C_L(n) = \frac{\left(C_M + C_B(n)\right)\left(V_2(n) - V_3(n)\right) - \frac{I_{leak}T_M}{8}}{V_3(n)}$$



# **Energy Model 2: I<sub>leak</sub> Measurement**



- ☐ Assume constant I<sub>leak</sub> for simplicity
- □ No clocking
- $\hfill \Box$  Two measurements with two different capacitor  $\mathbf{C}_{\mathrm{T1}},\,\mathbf{C}_{\mathrm{T2}}$  to determine  $\mathbf{C}_{\mathrm{B}}$
- $\hfill \square$  Voltage  $\hfill V_{TEST}$  slightly above  $\hfill V_{DD}$

Inter-university Semiconductor Research Center

63



Seoul National University

# **Energy Model 2: Calculate I**<sub>leak</sub>



$$(C_{T1} + C_B)\Delta V_{T1} = I_{leak}\Delta t$$

$$(C_{T2} + C_B)\Delta V_{T2} = I_{leak}\Delta t$$

$$C_{B} = \frac{C_{T2} \Delta V_{T2} - C_{T1} \Delta V_{T1}}{\Delta V_{T1} - \Delta V_{T2}}$$

$$I_{leak} = \frac{C_{T2} - C_{T1}}{\Delta V_{T1} - \Delta V_{T2}} \cdot \frac{\Delta V_{T1} \Delta V_{T2}}{\Delta t}$$



### **Energy Model 2: Frequency Compensation**

☐ Energy consumption for the n-th clock edge

$$E(n) = C_L(n) \cdot V_{DD}^2 + \frac{1}{2} I_{leak} T_O V_{DD}$$

- $\Box$  T<sub>o</sub> = 1/f<sub>o</sub>, f<sub>o</sub>: actual operating frequency
- $\Box$  C<sub>L</sub>(n)V<sub>DD</sub><sup>2</sup>: effective switching energy
- ☐ (1/2)I<sub>L</sub>T<sub>O</sub>V<sub>DD</sub>: energy due to static current during a half period

Inter-university Semiconductor Research Center

65



Seoul National University

# Energy Model 4: Calculate C<sub>H58</sub>(n) with I<sub>leak</sub>



☐ Charge conservation law when S₁ on

$$C_M V_1(n) + C_{H58}(n) V_3(n-1) = (C_M + C_{H58}(n)) V_2(n) + \frac{3I_{leak} T_M}{8}$$
 ... ①

$$C_{H58}(n) = \frac{C_M (V_1(n) - V_2(n)) - \frac{3I_{leak}T_M}{8}}{V_2(n) - V_3(n-1)}$$





# **Energy Model 4: Calculate C<sub>H</sub>(n) with I<sub>leak</sub>**



☐ Charge conservation law during clock edge

$$(C_M + C_H(n))V_2(n) = (C_M + C_{H67}(n))V_3(n) + \frac{I_{leak}T_M}{8} \qquad \cdots \ \ \bigcirc$$

$$C_{H}(n) = \frac{\left(C_{M} + C_{H67}(n)\right)V_{3}(n) + \frac{I_{leak}T_{M}}{8}}{V_{2}(n)} - C_{M}$$

Inter-university Semiconductor Research Center

67



Seoul National University

# **Energy Model 4: Calculate C<sub>H67</sub>(n) with I<sub>leak</sub>**



☐ Charge conservation law when S₁ on

$$C_M V_1(n+1) + C_{H67}(n) V_3(n) = (C_M + C_{H67}(n)) V_2(n+1) + \frac{3I_{leak} T_M}{8}$$
 ... 3

$$C_{H67}(n) = \frac{C_M (V_1(n+1) - V_2(n+1)) - \frac{3I_{leak}T_M}{8}}{V_2(n+1) - V_3(n)}$$



### **CMOS Short-Circuit Current**





- ☐ During transition from 0 to 1 or from 1 to 0.
- ☐ Both NMOS and PMOS are on for a short period of time.
- ☐ This results in a short current pulse from VDD to GND.
- $\hfill \square$   $\hfill I_{SC}$  may be misunderstood as load capacitance  $\hfill C_L$  in the measurement system.

Inter-university Semiconductor Research Center



Seoul National University

# $E_{real}(n)$ with $I_{SC}$





 $\hfill \square$  If we assume that the total charge loss by  $I_{SC}$  is  $Q_{SC}(n),$  the consumed energy E(n) in real operation

$$E_{real}(n) = C_{67,real}(n) \cdot V_{DD}^{2} + Q_{SC}(n) \cdot V_{DD}$$



### $E_{meas}(n)$ with $I_{SC}$

☐ E<sub>meas</sub>(n) can be written

$$E_{meas}(n) = C_{67,eff}(n) \cdot V_{DD}^{2} \qquad \cdots \quad ②$$

☐ In the CCL for the n-th edge

$$(C_M + C_H(n))V_2(n) = (C_M + C_H(n) + C_{67,eff}(n))V_3(n)$$
  
=  $(C_M + C_H(n) + C_{67,real}(n))V_3(n) + Q_{SC}(n)$  ... 3

☐ Assume that

$$V_3(n) = V_{DD} - \Delta V_3$$

Inter-university Semiconductor Research Center

71



Seoul National University

# $E_{meas}(n)$ with $I_{SC}$

☐ From Eq.② and Eq.③

$$\begin{split} E_{meas}(n) &= C_{67,eff}(n) \cdot V_{DD}^{2} \\ &= C_{67}(n) \cdot V_{DD}^{2} + \frac{Q_{SC}(n)}{V_{3}(n)} \cdot V_{DD}^{2} \\ E_{real}(n) &= C_{67}(n) \cdot V_{DD}^{2} + Q_{SC}(n) V_{DD} \\ \Delta E(n) &= E_{effl}(n) - E_{real}(n) \\ &= \left(\frac{V_{DD}}{V_{3}(n)} - 1\right) Q_{SC}(n) V_{DD} \\ &= \left(\frac{1}{1 - \frac{\Delta V_{3}}{V_{DD}}} - 1\right) Q_{SC}(n) V_{DD} \\ &= Q_{SC}(n) \Delta V_{3} \end{split}$$





### $E_{real}(n)$ vs $E_{meas}(n)$ Example

- Assumptions:
  - $Q_{SC}(n)\cdot V_{DD}$  in Eq. $\oplus$  is 10% of  $E_{real}(n)$
  - $V_{DD} = 3.0V$
  - $V_3(n) = 2.8V$
- ☐ Then, the % error

% 
$$error = \frac{E_{meas}(n) - E_{real}(n)}{E_{real}(n)} \cong 0.7\%$$

 $\hfill\Box$  Consequently, the error from calculating  $I_{SC}$  as  $C_L$  in our measurement system is not large.

Inter-university Semiconductor Research Center

73



Seoul National University

### **Voltage dependent Load Capacitance**





# **Differential Capacitance vs Static Capacitance**

$$C_1 = \frac{dQ}{dV}$$
 : dynamic

$$C_2 = \frac{Q}{V}$$
 : static

#### ☐ Q(v) and C(v) curves





Inter-university Semiconductor Research Center

Seoul National University

### **Differential Capacitance vs Static Capacitance**

$$\boxed{C_1 = \frac{dQ}{dV} \quad : dynamic} \qquad \boxed{C_2 = \frac{Q}{V} \quad : static}$$

$$C_2 = \frac{Q}{V}$$
 : static

☐ Relationship between C<sub>1</sub> & C<sub>2</sub>

$$Q(V) = \int \frac{dQ}{dV} dV = \int_{V}^{V} C_1(V) dV + Q(0)$$

$$C_2(V) = \frac{Q(V)}{V} = \frac{\int_0^V C_1(V)dV + Q(0)}{V}$$





### **Energy stored in the pMOS capacitance**



$$\begin{split} E_{p,gateox}^{stored} &= \int \!\! V_c dQ \\ &= \int \!\!\! \int^{V_{DD}} \!\! C_{1p}(V) \cdot V_C dV_c \\ &= \int \!\!\! \int^{V_{Tp}} \!\!\! \left| C_{Lp} V dV + \int^{V_{DD}}_{V_{Tp}} \!\!\! C_{Hp} V dV \right| & |\mathbf{V}_{\mathsf{Tp}}| & \mathbf{V}_{\mathsf{DD}} \\ &= \frac{1}{2} C_{Lp} \big| V_{Tp} \big|^2 + \frac{1}{2} C_{Hp} \Big( V_{DD}^2 - \big| V_{Tp} \big|^2 \Big) & Q_{p,gateox}^{transferred} &= C_{Hp} \Big( V_{DD} - \big| V_{Tp} \big| \Big) + C_{Lp} \big| V_{Tp} \big| \end{split}$$





Seoul National University

### **Energy stored in the nMOS capacitance**



$$\begin{split} E_{n,gateox}^{stored} &= \int V_o dQ \\ &= \int_0^{V_{DD}} C_{1n}(V) \cdot V_o dV_o \\ &= \int_0^{V_{Tn}} C_{Ln} V dV + \int_{V_{Tn}}^{V_{DD}} C_{Hn} V dV \\ &= \frac{1}{2} C_{Ln} V_{Tn}^2 + \frac{1}{2} C_{Hn} \left( V_{DD}^2 - V_{Tn}^2 \right) \end{split}$$



 $Q_{n,gateox}^{transferred} = C_{Ln}V_{Tn} + C_{Hn}(V_{DD} - V_{Tn})$ 



### **Energy Flow for Gate Capacitance Load**



$$\Box$$
 When Vout is rising,  $E_{gateox}^{loss}(\uparrow) = E_{p,gateox}^{stored} + E_{n,gateox}^{loss}$ 

$$\Box$$
 When Vout is falling,  $E_{gateox}^{loss}(\downarrow) = E_{n,gateox}^{stored} + E_{p,gateox}^{loss}$ 

Inter-university Semiconductor Research Center



Seoul National University

### **Energy Flow when the output is rising**



#### Before transition

$$E_{p,gateox}^{stored} = \frac{1}{2} \bigg[ C_{Lp} \Big| V_{Tp} \Big|^2 + C_{Hp} \bigg( V_{DD}^{\phantom{DD}2} - \Big| V_{Tp} \Big|^2 \bigg) \bigg]$$

### After transition

$$\begin{split} E_{gateox}^{transferred}\left(\uparrow\right) &= Q_{n,gateox}^{transferred}V_{DD} = \left\{C_{Ln}V_{Tn} + C_{Hn}\left(V_{DD} - V_{Tn}\right)\right\}V_{DD} \\ E_{n,gateox}^{stored} &= \frac{1}{2}\left[C_{Ln}{V_{Tn}}^2 + C_{Hn}\left({V_{DD}}^2 - {V_{Tn}}^2\right)\right] \\ E_{gateox}^{loss}\left(\uparrow\right) &= E_{gateox}^{transferred}\left(\uparrow\right) - E_{n,gateox}^{stored} + E_{p,gateox}^{stored} \end{split}$$



### **Energy Flow when the output is falling**

#### Before transition



$$E_{n,gateox}^{stored} = \frac{1}{2} \left[ C_{Ln} V_{Tn}^{2} + C_{Hn} \left( V_{DD}^{2} - V_{Tn}^{2} \right) \right]$$

#### After transition

$$\begin{split} E_{gateox}^{transferred}\left(\downarrow\right) &= Q_{p,gateox}^{transferred}V_{DD} = \left\{\!\!\!\! \left\{\!\!\!\! C_{Lp} \middle| V_{Tp} \middle| + C_{Hp} \middle(\!\!\! V_{DD} - \middle| V_{Tp} \middle| \right) \!\!\!\! \right\}\!\!\!\! V_{DD} \\ E_{p,gateox}^{stored} &= \frac{1}{2} \Big[\!\!\!\! \left[ C_{Lp} V_{Tp}^{\phantom{Tp}}^2 + C_{Hp} \middle(\!\!\! V_{DD}^{\phantom{Tp}}^2 - V_{Tp}^{\phantom{Tp}}^2 \right) \!\!\!\! \right] \\ E_{gateox}^{loss} \left(\downarrow\right) &= E_{gateox}^{transferred} \left(\downarrow\right) - E_{p,gateox}^{stored} + E_{n,gateox}^{stored} \end{split}$$

Inter-university Semiconductor Research Center

21



Seoul National University

### **Approximation of Energy Related to CMOS Gate**

- □ For simplicity
  - · The load consists of the gate oxide capacitance only
  - $V_T = V_{Tp} = V_{Tn}$
  - $C_H = C_{Hp}^{+} = C_{Hn}^{+}, C_L = C_{Lp}^{-} = C_{Ln}^{-}$

 $E_{loss} = E_{transferred}$ 

- $(V_T/V_{DD})=k$ ,  $(C_L/C_H)=\alpha$
- ☐ Then, for each transition

$$\begin{split} E_{\textit{transferred}} &= C_H V_{DD}^{\ \ 2} \big[ \big( 1 - k \big) + \alpha k \big] \\ E_{\textit{stored}} &= \frac{1}{2} \left. C_H V_{DD}^{\ \ 2} \big[ \alpha k^2 + \big( 1 - k^2 \big) \right] \end{split}$$



# $\textbf{E}_{\text{gate}}$ vs k, $\alpha$

|                          | k=0, α=0 | k=0.25, α=0.5 |  |
|--------------------------|----------|---------------|--|
| E <sub>transferred</sub> | 100%     | 87.5%         |  |
| E <sub>stored</sub>      | 50%      | 48.4%         |  |
| E <sub>loss</sub>        | 100%     | 87.5%         |  |

 $\square$  With the typical MOS parameters today, we have k=0.25,  $\alpha$ =0.5

Inter-university Semiconductor Research Center

83



Seoul National University

# **C**<sub>gate</sub> vs Supply Voltage Scaling



- ☐ With the technology scaling, the supply voltage has been lowered.
- $\Box$  then, k(=V<sub>T</sub>/V<sub>DD</sub>) becomes larger.
- $\hfill \Box$  Voltage dependency of  $\mathbf{C}_{\mathrm{gate}}$  is not negligible any longer.



### E<sub>transferred</sub> vs k

$$E_{\textit{transferred},0} = C_{\textit{H}} V_{\textit{DD}}^{\phantom{\textit{DD}}2}$$

☐ Then, the ratio

$$\begin{split} \frac{E_{\textit{transferred}}}{E_{\textit{transferred},0}} &= \left(1 - k\right) + \alpha k \\ &= 1 - \left(1 - \alpha\right) \cdot k \end{split}$$



Inter-university Semiconductor Research Center

25



Seoul National University

# $\mathbf{E}_{\text{stored}}$ vs k

$$E_{stored,0} = \frac{1}{2} C_H V_{DD}^2$$

☐ Then, the ratio

$$\begin{split} \frac{E_{stored}}{E_{stored,0}} &= \alpha k^2 + \left(1 - k^2\right) \\ &= 1 - \left(1 - \alpha\right) \cdot k^2 \end{split}$$





### **Diffusion Capacitors**

- ☐ C<sub>diff</sub> = Area\* C<sub>bottom</sub> + perimeter\* C<sub>sidewall</sub>
- □ Two types of junctions
  - ☐ Abrupt junction: bottom (m=1/2)
  - ☐ Graded junction: sidewall (m=1/3)



$$C_{depl}(V) = \frac{dQ}{dV} = \frac{\varepsilon_{Si}}{t}$$

$$t_{abrupt} = k_{abrupt} \sqrt{V}, \quad k_{abrupt} = \sqrt{\frac{2\varepsilon_{Si}}{e} \frac{N_D + N_A}{N_D N_A}}$$

$$t_{graded} = k_{graded} \sqrt[3]{V}, \quad k_{graded} = \sqrt[3]{\frac{12\varepsilon_{Si}}{eG}}$$
 where  $N_D - N_A = Gx$ 

Inter-university Semiconductor Research Center



Seoul National University

### **Capacitance for Abrupt Junctions**



$$C_{diff}(V) = \frac{dQ}{dV} = \frac{k}{2\sqrt{V}}, \quad k = \sqrt{2e\varepsilon_{Si} \frac{N_D N_A}{N_D + N_A}}$$

$$Q = k\sqrt{V}, \quad Q_{V_{DD}} = k\sqrt{\phi + V_{DD}}, \quad Q_0 = k\sqrt{\phi}$$

$$Q_{transferred} = Q_{V_{DD}} - Q_0 = k\left(\sqrt{\phi + V_{DD}} - \sqrt{\phi}\right)$$

$$E_{stored}(V) = \int V dQ = \int_{V_{DD}}^{\phi + V_{DD}} C_{diff}(V) \cdot V dV$$

$$\begin{split} E_{transferred} &= Q_{transferred} V_{DD} \\ &= k V_{DD} \Big( \sqrt{\phi + V_{DD}} - \sqrt{\phi} \Big) \\ E_{stored} &= \int_{\mathcal{Q}_{o}}^{\mathcal{Q}_{VDD}} V dQ \\ &= \int_{\mathcal{Q}_{o}}^{\mathcal{Q}_{VDD}} \frac{Q^2}{k^2} dQ = \frac{1}{3} \Big[ (\phi + V_{DD}) Q_{V_{DD}} - \phi Q_0 \Big] \end{split}$$

$$\begin{split} E_{stored}(V) &= \int \!\! V dQ = \int_{\phi}^{\phi + V_{DD}} C_{diff}(V) \cdot V dV \\ &= k V_{DD} \left( \sqrt{\phi + V_{DD}} - \sqrt{\phi} \right) \\ &= \frac{k}{2} \int_{\phi}^{\phi + V_{DD}} \frac{V}{\sqrt{V}} \, dV = \frac{k}{2} \int_{\phi}^{\phi + V_{DD}} \sqrt{V} \, dV \\ &= \int_{2o}^{Q_{V_{DD}}} V \, dQ \\ &= \frac{k}{3} \left[ V \sqrt{V} \right]_{\phi 0}^{\phi + V_{DD}} = \frac{k}{3} \left[ (\phi + V_{DD}) \sqrt{\phi + V_{DD}} - \phi \sqrt{\phi} \right] \\ &= \int_{2o}^{Q_{V_{DD}}} \frac{Q^2}{k^2} dQ = \frac{1}{3} \left[ (\phi + V_{DD}) Q_{V_{DD}} - \phi Q_0 \right] \\ &= \frac{1}{3} \left[ (\phi + V_{DD}) Q_{V_{DD}} - \phi Q_0 \right] \end{split}$$





### **Capacitance for Graded Junctions**



$$\begin{split} E_{transferred} &= Q_{transferred} V_{DD} \\ &= k V_{DD} \bigg[ (\phi + V_{DD})^{\frac{2}{3}} - (\phi)^{\frac{2}{3}} \bigg] \\ E_{stored} &= \int_{Q_{vD}}^{Q_{vDD}} V dQ \\ &= \int_{Q_{v}}^{Q_{vDD}} V dQ \\ &= \int_{Q_{v}}^{Q_{vDD}} \left( \frac{Q}{k} \right)^{\frac{3}{2}} dQ = \frac{2}{5} \big[ (\phi + V_{DD}) Q_{V_{DD}} - \phi Q_{0} \big] \\ \end{split}$$

$$C_{diff}(V) = \frac{dQ}{dV} = \frac{2}{3} \frac{k}{\sqrt[3]{V}}, \quad k = \frac{3}{2} \sqrt[3]{\frac{e\varepsilon_{Si}^2 G}{12}}$$

$$Q = k(V)^{\frac{2}{3}} \quad Q_{V_{DD}} = k(\phi + V)^{\frac{2}{3}}, \quad Q_0 = k(\phi)^{\frac{2}{3}}$$

$$Q_{transferred} = Q_{V_{DD}} - Q_0 = k\left[(\phi + V)^{\frac{2}{3}} - (\phi)^{\frac{2}{3}}\right]$$

$$\begin{split} E_{stored}(V) &= \int V dQ = \int_{\phi}^{\phi + V_{DD}} C_{diff}(V) \cdot V dV \\ &= \frac{2}{3} k \int_{\phi}^{\phi + V_{DD}} \frac{V}{\sqrt[3]{V}} dV = \frac{2}{3} k \int_{\phi}^{\phi + V_{DD}} (V)^{\frac{2}{3}} dV \\ &= \frac{2k}{5} \left[ V^{\frac{5}{3}} \right]_{\phi}^{\phi + V_{DD}} = \frac{2k}{5} \left[ (\phi + V_{DD})^{\frac{5}{3}} - \phi^{\frac{5}{3}} \right] \\ Q_{0} \right] \\ &= \frac{2}{5} \left[ (\phi + V_{DD}) Q_{V_{DD}} - \phi Q_{0} \right] \end{split}$$



Seoul National University

### **Energy Flow for Junction Capacitance Load**



- ☐ When Vout is rising,
- When Vout is falling,

$$E_{diff}^{loss}(\uparrow) = E_{p,diff}^{stored} + E_{n,diff}^{loss}$$

$$E_{diff}^{loss}(\downarrow) = E_{n,diff}^{stored} + E_{p,diff}^{loss}$$



### **Energy Flow when the output is rising**



$$\begin{split} Q_{p,m}^{V_{DD}} &= k_{p,m} \left( \phi_p + V_{DD} \right)^{1-m}, \quad Q_{p,m}^0 = k_{p,m} \phi_p^{1-m} \\ Q_{n,m}^{V_{DD}} &= k_{n,m} \left( \phi_n + V_{DD} \right)^{1-m}, \quad Q_{n,m}^0 = k_{n,m} \phi_n^{1-m} \end{split}$$

#### Before transition

$$E_{p,diff,m}^{stored} = \frac{1 - m}{2 - m} \left[ Q_{p,m}^{V_{DD}}(\phi_p + V_{DD}) - Q_{p,m}^{0} \phi_p \right]$$

#### After transition

$$\begin{split} E_{\textit{diff},m}^{\textit{transferred}}\left(\uparrow\right) &= Q_{n,\textit{diff},m}^{\textit{transferred}} V_{DD} = \left(Q_{n,\textit{diff},m}^{\textit{V}_{DD}} - Q_{n,\textit{diff},m}^{\text{o}}\right) V_{DD} \\ E_{\textit{n},\textit{diff},m}^{\textit{stored}} &= \frac{1-m}{2-m} \left[Q_{n,\textit{diff},m}^{\textit{V}_{DD}}\left(\phi_n + V_{DD}\right) - Q_{n,\textit{diff},m}^{\text{o}}\phi_n\right] \\ E_{\textit{diff},m}^{\textit{loss}}\left(\uparrow\right) &= E_{\textit{diff},m}^{\textit{transferred}}\left(\uparrow\right) - E_{\textit{n},\textit{diff},m}^{\textit{stored}} + E_{\textit{p},\textit{diff},m}^{\textit{stored}} \end{split}$$

Inter-university Semiconductor Research Center

91



Seoul National University

### **Energy Flow when the output is falling**



$$\begin{split} Q_{p,m}^{V_{DD}} &= k_{p,m} \Big( \! \phi_p + V_{DD} \Big)^{\! 1-m}, \quad Q_{p,m}^0 = k_{p,m} \! \phi_p^{1-m} \\ Q_{n,m}^{V_{DD}} &= k_{n,m} \Big( \! \phi_n + V_{DD} \Big)^{\! 1-m}, \quad Q_{n,m}^0 = k_{n,m} \! \phi_n^{1-m} \end{split}$$

#### Before transition

$$E_{n,diff,m}^{stored} = \frac{1-m}{2-m} \left[ Q_{n,diff,m}^{V_{DD}} (\phi_n + V_{DD}) - Q_{n,diff,m}^{0} \phi_n \right]$$

#### After transition

$$\begin{split} E_{\textit{diff},m}^{\textit{transferred}}\left(\downarrow\right) &= Q_{p,\textit{diff},m}^{\textit{transferred}}V_{DD} = \left(\!Q_{p,\textit{diff},m}^{v_{DD}}\!-\!Q_{p,\textit{diff},m}^{\scriptscriptstyle 0}\right)\!\!V_{DD} \\ E_{p,\textit{diff},m}^{\textit{stored}} &= \frac{1-m}{2-m}\!\left[\!Q_{p,\textit{diff},m}^{v_{DD}}\!\left(\!\phi_{pn}\!+\!V_{DD}\right)\!-\!Q_{p,\textit{diff},m}^{\scriptscriptstyle 0}\!\phi_{n}\right] \\ E_{\textit{diff},m}^{\textit{loss}}\!\left(\downarrow\right) &= E_{\textit{diff},m}^{\textit{transferred}}\!\left(\downarrow\right)\!-\!E_{p,\textit{diff},m}^{\textit{stored}}\!+\!E_{n,\textit{diff},m}^{\textit{stored}} \end{split}$$



### **Load Capacitance**



$$\begin{split} E_{\textit{transferred}} &= E_{\textit{transferred}}\left(\uparrow\right) + E_{\textit{transferred}}\left(\downarrow\right) \\ &= E_{\textit{transferred}}\left(\uparrow\right) + E_{\textit{gate}}^{\textit{transferred}}\left(\downarrow\right) + E_{\textit{wire}}^{\textit{transferred}}\left(\uparrow\right) + E_{\textit{wire}}^{\textit{transferred}}\left(\downarrow\right) \\ &+ E_{\textit{diff}}^{\textit{transferred}}\left(\uparrow\right) + E_{\textit{diff}}^{\textit{transferred}}\left(\downarrow\right) + E_{\textit{diff}}^{\textit{transferred}}\left(\uparrow\right) + E_{\textit{diff}}^{\textit{transferred}}\left(\downarrow\right) \end{split}$$

We need to find 16 constants to find the voltage dependency of the energy consumption. ?!

Inter-university Semiconductor Research Center

93



**Seoul National University** 

### **Typical CMOS Load Capacitance (1)**

|                       |      | E <sub>transferred</sub> | E <sub>stored</sub> | E <sub>consumed</sub> |
|-----------------------|------|--------------------------|---------------------|-----------------------|
| •                     | Rise | 1.81                     | 1.00                | 2.80                  |
| C <sub>gate</sub>     | Fall | 3.58                     | 1.99                | 2.59                  |
|                       | Rise | 2.06                     | 1.00                | 4.22                  |
| C <sub>junction</sub> | Fall | 6.38                     | 3.16                | 4.22                  |
|                       | Rise | 2.00                     | 1.00                | 2.00                  |
| C <sub>metal</sub>    | Fall | 2.00                     | 1.00                | 2.00                  |

#### □ Assumptions

- V<sub>DD</sub>=3.0V
- W<sub>N</sub>:W<sub>P</sub>=1:2
- V<sub>Tn</sub>=0.5V, |V<sub>Tp</sub>|=0.55V





### **Typical CMOS Load Capacitance (2)**

- □ Assumptions
  - C<sub>L</sub>=K<sub>gate</sub>·C<sub>gate</sub>+ K<sub>junction</sub>·C<sub>junction</sub>+ K<sub>metal</sub>·C<sub>metal</sub>
  - K<sub>gate</sub>: K<sub>junction</sub>: K<sub>metal</sub> = 2:1:3
  - (# of 0→1 transition nodes) : (# of 0→1 transition nodes) = 1:1 for every edge.
- ☐ Then,
  - $E_{transferred}$ :  $E_{stored}$ :  $E_{consumed}$  = 1.93 : 1 : 1.93 for each edge.
- ☐ From this we can rewrite the consumed energy E(n) in model 4, approximately,

$$E \cong \frac{1}{1.93} \left( C_5 + C_6 + C_7 + C_8 \right) \cdot V_{DD}^2$$

Inter-university Semiconductor Research Center

95



Seoul National University

## **Cycle-Accurate Energy Measurement System**







# **Cycle-Accurate Energy Measurement Board**



Inter-university Semiconductor Research Center

37



Seoul National University

### **Data Processing Software**

[Monitoring and Data Acquisition Program]

[Energy Calculation Excel Program]





# **Simulation Setup**



 $\hfill \square$  In order to verify energy models, we simulated for an inverter with  $\mathbf{C}_{\mathsf{L},\mathsf{GND}}$  and  $\mathbf{C}_{\mathsf{L},\mathsf{VDD}}.$ 

Inter-university Semiconductor Research Center



Seoul National University

### **Simulated Waveforms**





# Simulation Results (1): % Error vs V<sub>S</sub>

| * 7                                        | % error    |        |                  |        |                  |        |
|--------------------------------------------|------------|--------|------------------|--------|------------------|--------|
| $\begin{bmatrix} V_S \\ [V] \end{bmatrix}$ |            |        | Model 2          |        | Model 3, 4       |        |
| LVJ                                        | $C_{ m L}$ | Energy | $C_{\mathrm{L}}$ | Energy | $C_{\mathrm{L}}$ | Energy |
| 2.3                                        | -          | -20.98 | 6.01             | 5.63   | 0.01             | -0.35  |
| 2.4                                        | -          | -13.94 | 6.03             | 5.65   | 0.03             | -0.33  |
| 2.5                                        | -          | -6.70  | 5.93             | 5.55   | -0.06            | -0.42  |
| 2.6                                        | -          | 0.98   | 6.01             | 5.63   | 0.01             | -0.35  |
| 2.7                                        |            | 8.92   | 6.03             | 5.65   | 0.03             | -0.33  |
| 2.8                                        | -          | 17.15  | 6.05             | 5.67   | 0.05             | -0.31  |

- $\hfill \square$  Model 1 is very sensitive to  $\hfill V_S$  variation.
- ☐ Model 3 and 4 provide small errors.

Inter-university Semiconductor Research Center 101



Seoul National University

### Simulation Results (2): % Error vs C<sub>M</sub>

| <i>C</i>            |         | % error |         |        |            |        |
|---------------------|---------|---------|---------|--------|------------|--------|
| C <sub>M</sub> [nF] | Model 1 |         | Model 2 |        | Model 3, 4 |        |
| firi 1              | $C_L$   | Energy  | $C_{L}$ | Energy | $C_{L}$    | Energy |
| 2                   | -       | -8.42   | 9.78    | 9.38   | -0.19      | -0.55  |
| 3                   | -       | -2.03   | 7.47    | 7.08   | -0.03      | -0.39  |
| 4                   | -       | 0.98    | 6.01    | 5.63   | 0.01       | -0.35  |
| 5                   | -       | 2.72    | 5.08    | 4.70   | 0.07       | -0.29  |
| 6                   |         | 3.82    | 4.41    | 4.04   | 0.12       | -0.24  |
| 7                   | -       | 4.52    | 3.86    | 3.49   | 0.11       | -0.25  |

- $\square$  Model 1 and 2 are very sensitive to  $C_M$  variation.
- ☐ Model 3 and 4 provide small errors.



#### Measurement Results: % Error vs C<sub>M</sub>

| _       |         | % error |         |        |            |        |
|---------|---------|---------|---------|--------|------------|--------|
| $C_{M}$ | Model 1 |         | Model 2 |        | Model 3, 4 |        |
| [m.]    | $C_{L}$ | Energy  | $C_{L}$ | Energy | $C_{ m L}$ | Energy |
| 3.9     | -       | -10.0   | 8.2     | 8.2    | 0.8        | 0.8    |
| 5.3     | -       | -6.1    | 7.7     | 7.7    | 0.2        | 0.2    |
| 6.8     | -       | -5.3    | 5.4     | 5.4    | 0.5        | 0.5    |
| 8.2     | -       | -4.5    | 4.9     | 4.9    | 0.3        | 0.3    |
| 10      | -       | -3.8    | 3.5     | 3.5    | 0.5        | 0.5    |
| 22      | -       | -2.6    | 1.4     | 1.4    | 0.5        | 0.5    |

- $\hfill \square$  Using our measurement system, we measured for an inverter with  $\mathbf{C}_{\mathbf{L},\mathbf{GND}}$ and  $C_{L,VDD}$ .
- ☐ As the simulation results, model 3 and 4 provide small errors.

Inter-university Semiconductor Research Center 103



Seoul National University

#### **Comparison with Ammeter (1)**



- ☐ In order to compare our measurement system with the ammeter, we measured an inverter with  $C_{L,GND}$  and  $C_{L,VDD}$ .
- ☐ Theoretical E<sub>consumed</sub>=11.441nJ



#### **Comparison with Ammeter (2)**

|                | Theres |         | Our measurement system |         |            |  |
|----------------|--------|---------|------------------------|---------|------------|--|
|                | Theory | Ammeter | Model 1                | Model 2 | Model 3, 4 |  |
| Energy<br>[nJ] | 11.441 | 11.514  | 11.010                 | 11.840  | 11.497     |  |
| % error        | -      | 0.63    | -3.77                  | 3.49    | 0.49       |  |

☐ Note that although ammeter shows small error but it cannot measure the cycle-accurate energy.

Inter-university Semiconductor Research Center 105





#### **FPGA: Energy Consumed by Logic Operation**

☐ If the circuits in FPGA are composed of rising edge triggered logic, we can write as follows

$$\begin{split} E_{\textit{rising}} &\cong E_{\textit{logic}} - E_{\textit{clock}} \\ E_{\textit{falling}} &\cong E_{\textit{clock}} \end{split}$$

- □ Where
  - E<sub>logic</sub>: energy consumed by logic operation
  - E<sub>clock</sub>: energy consumed by clock tree circuit
- ☐ Then we can write

$$E_{logic} \cong E_{rising} - E_{falling}$$

Inter-university Semiconductor Research Center 107



Seoul National University

#### **FPGA: 4-Bit Counter Test Results (1)**





#### FPGA: 4-Bit Counter Test Results (2)

| Case         | Logic Cell<br>Number | Energy Type   | Energy by Hamming Distance (nJ) |      |      |       |
|--------------|----------------------|---------------|---------------------------------|------|------|-------|
|              |                      |               | 1                               | 2    | 3    | 4     |
| Counter 1ea  | 4                    | Energy.rise   | 3,38                            | 3.49 | 3.67 | 3.81  |
|              |                      | Energy.fall   | 3, 25                           | 3.25 | 3.21 | 3.15  |
|              |                      | Energy.logic  | 0.13                            | 0.24 | 0.46 | 0.65  |
| Counter 5ea  | 16                   | Energy.rise   | 3.71                            | 4.33 | 4.75 | 5.22  |
|              |                      | Energy.fall   | 3.22                            | 3.19 | 3.14 | 3.09  |
|              |                      | Energy, logic | 0.48                            | 1.15 | 1.61 | 2,14  |
|              | 34                   | Energy.rise   | 4.02                            | 5.10 | 5.96 | 6.71  |
| Counter 10ea |                      | Energy.fall   | 3,19                            | 3.13 | 3.08 | 3.00  |
|              |                      | Energy.logic  | 0.83                            | 1.97 | 2,88 | 3,71  |
|              | 46                   | Energy.rise   | 4.15                            | 5.62 | 6.77 | 7.79  |
| Counter 15ea |                      | Energy.fall   | 3.15                            | 3.08 | 3.01 | 2.93  |
|              |                      | Energy.logic  | 1.00                            | 2.54 | 3,76 | 4.85  |
|              | 61                   | Energy.rise   | 4.27                            | 6.35 | 8.41 | 10.14 |
| Counter 20ea |                      | Energy.fall   | 3.10                            | 3.02 | 2.95 | 2.87  |
|              |                      | Energy.logic  | 1,18                            | 3,33 | 5,46 | 7.27  |
| Counter 25ea | 76                   | Energy.rise   | 4.43                            | 6.85 | 9.23 | 11.38 |
|              |                      | Energy.fall   | 3.06                            | 2.95 | 2.87 | 2.79  |
|              |                      | Energy.logic  | 1.37                            | 3,89 | 6,36 | 8,59  |

Inter-university Semiconductor Research Center 109



Seoul National University

#### FPGA: 4-Bit Counter Test Results (3)



☐ Elogic vs Hamming distance plot

Inter-university Semiconductor Research Center 110



#### How to break software

#### James Whittaker

The Center for Information Assurance at Florida Tech

## How to Break Software

Brought to you by:
The Center for Information Assurance (CIA) at Florida Tech

Your host today:

James A. Whittaker, Ph.D.

Professor of Computer Science

#### Users Use, Testers Test

- Any fool can stumble across bugs
- Testing requires:
  - efficiency: finding bugs faster than normal use
  - effectiveness: finding bugs that users care about and that developers will fix
  - thoroughness: leaving no stone unturned

#### Where are the Bugs?

- To find them you could:
  - Seek out the weak developers!
  - Seek out the bad managers!
  - Seek out the doomed projects!
  - Create them yourself!
- Understanding where bugs are requires that we understand *how and why software fails*

#### Software Fault Models

- Fault models can be based on:
  - Process maturity
  - Programming language constructs
  - Software behavior
- How do we understand problematic behavior?
  - Read bug reports
  - Recognize patterns of failure

#### Where Bugs Come From

- Environment
  - The software misinterprets or cannot handle its environment
  - What are all the environmental considerations that we must face?
- Capabilities
  - The software incorrectly performs one or more of its capabilities
  - What are software's general capabilities?







#### Software Capabilities

- Although there are only four basic capabilities:
  - accepting input
  - producing output
  - storing data
  - performing computation
- Software can combine these capabilities to perform very complicated tasks

## Testing Software Capabilities

- Confronting such complexity in a massive frontal testing assault is often unproductive
- Instead, exploratory testers wage small (winnable) battles until the enemy submits

#### Testing is War

- The enemy: bugs in the software
- Bugs prevent capabilities
- Drive capabilities and you find bugs
- Method:
  - » Determine your enemy's strengths and remove them
  - Wage small wars on input capabilities, output capabilities, data storage and computation

## Testing Input

Software should only accept input that it can handle...

... But ensuring that this is the case is problematic

#### Testing Input

- How does software filter erroneous input?
  - GUIs
- by preventing input data of incorrect typeby preventing input data that is too small or too large
  - by forcing the user down specific control flow paths

## Testing Input

- How does software filter erroneous input?
  - Error checking code
  - "if" statements
- by ensuring that inputs received can actually be processed
  - but error checking code can also have errors!
    - writing error code might introduce errors
    - writing error code also means diverting your attention from the main-line code

#### Testing Input

- How does software filter erroneous input?
  - Exception handlers
- failing gracefully is extremely difficult error routines must reset state and cleanup side-effects...a very difficult endeavor

### Testing Output

- Software should generate only those outputs that are acceptable to its users
  - Displayed data must fit in its display area
  - Software cannot pass incorrect data types
  - Data must be correctly computed, software must *never* pass incorrect values to its users
  - Testing output requires domain expertise
- We must understand wrong answers and ensure that our software does not produce them

#### Testing Data

- Inputs and computation results are often stored internally
- Software will fail if it stores illegal data
- Stored data values must be acceptable individually and in combination with other data

## Testing Data

- The major difference between data and inputs is that data is *persistent*
- Persistence needs to be tested
  - data retrieval, data modification, data access
  - we must test that data structures can be operated on without failure
    - accessed, retrieved, modified, overflowed, underflowed, ...

#### Testing Computation

- Software can correctly filter inputs, validate outputs and store data...
  - ... and still fail
- x=x+1 will fail if it is executed enough times to overflow the value x
- Correct computation depends on operators, operands and result

## Testing Computation

- Another aspect of computation is feature interaction
  - Features can interact in ways that affect computation
  - One feature can *get in the way* of another feature's computation

#### An Overview of the Methodology

- Software possesses 4 basic capabilities
- Attack each capability by staging situations that commonly cause failure
  - input attacks
  - output attacks
  - data attacks
  - computation attacks
- Determine which attacks apply to your app and apply them, one-at-a-time

#### Exploring the Input Domain

- Banging on the keyboard is largely a waste of time, a strategy for rookies
- Each test should have a specific purpose
- A tester with clear goals is more likely to find a problem than a tester who is simply hacking away
- Testers must learn to target problematic input scenarios

- Explore the application under test
  - Apply the inputs that a user would apply to get real work done
- Observe the inputs
- Watch for attack opportunities

## Preview of Input Attacks

- 1. Force all error messages to occur
- 2. Force the software to establish default values
- 3. Explore allowable character sets and data types
- 4. Overflow input buffers
- 5. Find inputs that interact with other inputs
- 6. Repeatedly apply the same input/input sequence

#### Documenting an Attack

- When to apply the attack
- What faults make the attack successful
- How to determine if the attack exposes failures
- How to conduct the attack
- Example and analysis

#### First Attack:

Force Error Messages to Occur

- When to apply this attack:
  - Whenever an application must respond to erroneous input
  - Testers should consider the type of response:
    - » Input filter
    - » Input checking
    - » Exception handling

#### Force Error Messages to Occur

- What faults make this attack successful:
  - Programming error cases is difficult
    - Additional code must be written where there is code, there is often bad code
    - Writing error code takes the programmers attention away from writing functional code
  - Failing gracefully is difficult
    - » What data needs to be saved?
    - » What state is the app in?

Attack 1

#### Force Error Messages to Occur

- How to determine if the software fails:
  - This attacks finds
    - Missing error cases
    - Nonsense error messages
    - Uninformative error messages
  - Thus, manifestation of the defect ranges from crash to fully functional code

#### Force Error Messages to Occur

- How to conduct this attack:
  - Look thru the (ahem) specification for message definition
  - Consider *properties* of inputs
    - [type] entering invalid types often causes error messages
    - [length] a few too many characters in an input string will often elicit an error message
    - [boundary values] often represent special cases

Attack 1

#### Force Error Messages to Occur

- Target: Word® 2000
- Feature: Insert Index
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000
- Synopsis: Probably a duplicated block of code, very sloppy programming

Attack 1: Example

### Force Error Messages to Occur

Questions before we move to the next attack?

Attack 1: Example



- When to apply this attack:
  - All software uses variables
  - The lifecycle of a variable is
    - Declaration
    - Initialization
    - Use
  - Variables must be initialized before they are used

#### Force Default Values

- What faults make this attack successful:
  - Compilers are often good at catching these mistakes
    - But the compilers must be used properly
  - Implicit declaration can often get a programmer into trouble
  - When users skip input fields or leave them blank, defaults must be established in case those variables are used in computation

Attack 2

#### Force Default Values

- How to determine if the software fails:
  - Best case (for ease of verification) is that use of an un-initialized variable causes a memory violation
  - Harder to detect is that a random value gets assigned to the variable
    - » Look for garbage characters/"strange things"
    - » Incorrect results
    - » Too many values/too few values displayed
    - Incorrectly typed data being displayed

#### Force Default Values

- How to conduct this attack:
  - Determine the data that has defaults
    - Look for "options" or "configuration" buttons
    - Consult the source code's declaration section
  - Force the app to use its defaults
    - Accept any displayed data as defaults
    - » Enter a null value, if a value is displayed then delete it
    - » Change settings from their default values to a valid value and then back again

Attack 2

#### Force Default Values

- Target: Word® 2000
- Feature: Insert Table of Contents
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000
- Synopsis: Frankly, I would have difficulty coding this behavior if I wanted to!

Attack 2: Example

### Force Default Values

Questions before we move to the next attack?

Attack 2: Example

#### Third Attack:

Explore Character Sets/Data Types

- When to apply this attack:
  - Target: variable input
  - Special cases based on:
    - » Operating system reserved words
    - Programming language reserved words
    - » Character set boundaries (ASCII, UNICODE, etc.)
    - » spaces, quotation marks, delimiters, etc.

#### Explore Character Sets/Data Types

- What faults make this attack successful:
  - Special cases require special handling
  - Either:
    - Developers fail to recognize a special case
    - Developers put too much trust in interface controls
    - Developers fail to handle errors properly (we've already discussed that error code is hard to get right)

Attack 3

#### Explore Character Sets/Data Types

- How to determine if the software fails:
  - Unhandled exceptions cause a system to crash
  - Generalized error handlers often present nonsensical or uninformative messages
    - Watch out for loss-of-state often caused by exception handlers
  - Since we are dealing with character data, watch for rendering problems
  - Watch for "Easter eggs"

#### Explore Character Sets/Data Types

- How to conduct this attack:
  - Spend some time researching:
  - Operating system, programming language and character set keywords and ranges
  - Study documentation of each of the above
  - Beware outdated keywords

Attack 3

#### Explore Character Sets/Data Types

- Target: Microsoft® Internet Explorer® v5.5
- Feature: File Open
- Reproducibility: Works only on early or recent versions of IE
- Synopsis: AUX is an old DOS device, device names were not considered when the file open program was created

Attack 3: Example

#### Explore Character Sets/Data Types

Questions before we move to the next attack?

Attack 3: Example

## Fourth Attack: Overflow Input Buffers

- When to apply this attack:
  - The idea is simple: enter long strings into input fields
  - Don't neglect APIs/exposed internal objects
  - This is the hacker's choice because many buffer overflows create exploitable failure scenarios

# Fourth Attack: Overflow Input Buffers

- When to apply this attack:
  - This is an important bug because:
     copy/paste into inputs fields is a fairly common practice
     Buffer overrups result in crashes, risking data loss and costly
  - Buffer overruns result in crashes, risking data loss and costly rework
     be sure to check with developers to find
  - be sure to check with developers to find out tolerance for fixing long string bugs...thousands characters can get ridiculous

Attack 4

#### Overflow Input Buffers

- What faults make this attack successful:
  - Developers simply fail to constrain the amount of text the software will accept in an input sting
  - When the text is read input memory, fixedsized buffers are overflowed

#### Overflow Input Buffers

- How to determine if the software fails:
  - This bug almost always causes the software to crash
  - Other possibilities are extreme application instability (since memory is often overwritten by the long string)

Attack 4

#### Overflow Input Buffers

- How to conduct this attack:
  - Identify where strings are read as input
  - Start small and then grow the string to its maximum length
  - This attack is very repetitive
  - It's helpful to count 12345678901234567890

#### Overflow Input Buffers

Target: Word® 2000

Feature: Find/Replace

- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000
- Synopsis: Find field is string-length constrained but the Replace field is not

Attack 4: Example

## Overflow Input Buffers

Questions before we move to the next attack?

Attack 4: Example

## Fifth Attack: Find Interacting Inputs

- Up to now, we have dealt with inputs oneat-a-time
  - Find a spot where input is accepted and poke it until something breaks
- This next attack deals with *combinations* of inputs
  - Multiple inputs on a single input dialog
  - An API call with more than one input
- Constraining a single input is hard enough for developers...

Attack 5

#### Fifth Attack: Find Interacting Inputs

- When to apply this attack:
  - Some inputs affect other inputs
    - They might represent different properties of the same piece of data
    - They might be used in a single internal computation
  - Thus, individually correct inputs might be problematic when combined
  - These relationships should be watched for while executing the previous attacks

#### Find Interacting Inputs

- What faults make this attack successful:
  - Obviously, input relationships are hard to determine for both testers and developers
  - The logic involved in handling a single erroneous input is hard enough...
  - ...multiple error cases often require complex nested IF statements
  - Code changes make this situation worse

Attack 5

#### Find Interacting Inputs

- How to determine if the software fails:
  - Since inputs are often stored internally, slipping a bad input in means corruption of internal data
  - Once you suspect you tricked the app into accepting bad input, force that input to be used as much as possible
  - Carefully verify it every time it is displayed or used

#### Find Interacting Inputs

- How to conduct this attack:
  - Explore the app, identify possible relationships between inputs

Are they properties of a single data structure?

Are they used together in a single computation?

 Select boundary and extreme combinations:
 Large/small, small/large, large/large, small/small

Attack 5

#### Find Interacting Inputs

Target: Word® 2000

Feature: Insert a Table

Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000

Synopsis: These two parameters can be small-large, large-small, small-small but not large-large

Attack 5: Example

#### Find Interacting Inputs

Questions before we move to the next attack?

Attack 5: Example



Repeat Inputs Numerous Times

- When to apply this attack:
  - This attack is applicable whenever the app accepts input inside a loop:

Accept an input

Process it

Repeat

#### Repeat Inputs Numerous Times

- What faults make this attack successful:
  - Repetition has the effect of gobbling up resources
  - Many applications are unaware of available resources and assume unlimited memory and storage
  - Low resources can cause undesirable sideeffects

Attack 6

## Repeat Inputs Numerous Times

- How to determine if the software fails:
  - It is difficult to predict how memory stress will manifest
  - Watch for:
    - » Screen refresh problems
    - » Slow performance
  - Consider:
    - Using a memory leak detector

#### Repeat Inputs Numerous Times

- How to conduct this attack:
  - Pick an input or a sequence of inputs
  - Apply them over and over to test for undesirable side-effects
- Or:
  - Pick an object and apply the same input to it over and over
  - Pick multiple objects and apply the same series of inputs to each object

Attack 6

### Repeat Inputs Numerous Times

- Target: Word®
- Feature: Equation Editor
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® NT, 2000
- Synopsis: The number of parentheses is constrained at 10, but that number should, perhaps, be lowered

Attack 6: Example

# Repeat Inputs Numerous Times

Questions before we move to the next attack?

Attack 6: Example

# Review of Input Attacks

- 1. Force all error messages to occur
- 2. Force the software to establish default values
- 3. Explore allowable character sets and data types
- 4. Overflow input buffers
- 5. Find inputs that interact with other inputs
- 6. Repeatedly apply the same input/input sequence

- Some bugs are too difficult to find by concentrating on inputs alone
- Which inputs will generate incorrect results?
- Why not concentrate on what incorrect results *could* occur and then find the inputs to force them?

# Preview of Output Attacks

- 7. Force different outputs to be generated for each input
- 8. Force invalid outputs to be generated
- 9. Force output properties to change
- 10. Force the screen to be refreshed

# Seventh Attack: Vary Outputs for Each Input

- When to apply this attack:
  - Inputs are not independent of each other
  - Applying some inputs before (or after) other inputs can affect behavior
  - A good way to think about the most important cases is to ensure that you see every output that an input can cause

Attack 7

# Vary Outputs for Each Input

- What faults make this attack successful:
  - Input → Output is simple to code
  - Inputs that cause different outputs require complex logic to be coded
  - Complexity leads to bugs

# Vary Outputs for Each Input

- How to determine if the software fails:
  - The main bug is that there are behaviors that the developer miscoded or forgot to code
  - These are usually severity 1 problems that get found and fixed before release

Attack 7

# Vary Outputs for Each Input

- How to conduct this attack:
  - Testers must identify which inputs can cause multiple behaviors and understand the *context* in which these inputs can be applied

# Vary Outputs for Each Input

- To illustrate, consider a telephone switch
- How many outputs can the input "pick up the receiver" cause?
  - The switch could generate a dial tone ...
    - ... when the phone is idle -
  - The switch could connect two callers...situations
    - ... when the phone is ringing -

Attack 7: Example

# Vary Outputs for Each Input

Questions before we move to the next attack?

# Eighth Attack: Force Invalid Outputs

- When to apply this attack:
  - Domain expertise is essential to effectively carry out this next attack
    - It's hard to test a flight simulator without knowing how to fly an airplane
  - Testers must know the product well enough to enumerate wrong answers...
  - ...and then figure out how to drive the application to produce them

Attack 8

# Force Invalid Outputs

- What faults make this attack successful:
  - Just as testers can misunderstand the problem domain...so can developers
    - When this happens, they write bugs
  - In most cases, the problem is over-looked special cases

#### Force Invalid Outputs

- How to determine if the software fails:
  - This is a hard one...the software rarely fails in a spectacular (or even noticeable) manner
  - Testers should ignore type and format and concentrate instead on the value being displayed

Attack 8

# Force Invalid Outputs

- How to conduct this attack:
  - Testers need to focus on *known bad results*Enumerate wrong answers and then create the situation that forces them
  - Learn as much about the problem domain as you can

#### Force Invalid Outputs

- Target: Windows NT System Clock
- Feature: Calendar
- Reproducibility: Works on Windows® NT, Service Pack 4 and prior
- Synopsis: The calendar works fine as long as the cursor isn't fixed on day 29 of February

Attack 8: Example

# Force Invalid Outputs



Windows NT4 Clock

The bug has since been fixed!

- Choose Month as February
- Choose Year as 2000
- Click on the 29th
- Change Year to 2001 (using spin control)

VOILA! We have a whole new calendar!

Attack 8: Example

# Force Invalid Outputs

Questions before we move to the next attack?

Attack 8: Example

# Ninth Attack: Change Output Properties

- When to apply this attack:
  - This attack requires *persistent* outputs be present in the application (which often precludes its use on APIs)
  - An output is generated and then we change some property of the output

# Change Output Properties

- What faults make this attack successful:
  - Generating the output once tests that the software works with initial data settings
    - These are settings that the developer established
    - These are settings that the developer has anticipated
  - Changing the output ensures that the software will work user-defined setting
    - These settings might not be anticipated

Attack 9

# Change Output Properties

- How to determine if the software fails:
  - Since outputs are necessarily visuallyintensive, this often requires manual verification
  - But since testers have had to determine output properties in advance, it is easy to determine what to look for

# Change Output Properties

- How to conduct this attack:
  - First determine the property of interest Size, value, type, color, shape, direction, ...
  - Cause it to be displayed, then change it
    Size (bigger to smaller, smaller to bigger,...)
    Value(s) (high to low, positive to negative,...)
    Type (char to int, int to float,...)

...

Attack 9

# Change Output Properties

- Target: PowerPoint® 2000
- Features: Insert WordArt
- Reproducibility: Works on all PC versions of PowerPoint® and Word® and all versions of Windows®
- Synopsis: First time through, two things happen; next time through, something is forgotten

Attack 9: Example

# Change Output Properties

Questions before we move to the next attack?

Attack 9: Example



- When to apply this attack:
  - This attack is applicable only to GUI software with editable display areas
  - The attack is most useful at the boundaries of screen objects
    - » Objects created by the user
    - Partitions of the display area (frames, etc.) set by the software

# Force the Screen to Refresh

- What faults make this attack successful:
  - Refreshing the contents of a window, after those contents have changed, is problematic
  - Refresh too often:
    - Performance degrades
    - The screen flickers to annoy the user
  - Refresh too seldom
    - The screen becomes messy
  - Hitting the sweet spot is challenging

Attack 10

# Force the Screen to Refresh

- How to determine if the software fails:
  - Sigh...you guessed it...labor-intensive visual verification is the only resort

### Force the Screen to Refresh

- How to conduct this attack:
  - Add things to the screen
  - Move them around (varying the distance you move them)
  - Delete them
  - Edit them
- Do all these things with a mix of features and at or around object boundaries

Attack 10

# Force the Screen to Refresh

- Target: Word® 2000
- Features: Entering text at the page boundary
- Reproducibility: Works on Word® 2000 on any OS. The problem is much worse on prior versions of Word
- Synopsis: Removal of a similar bug in Word 97 caused this one

Attack 10: Example

# Force the Screen to Refresh

Questions before we move to the next attack?

Attack 10: Example

# Review of Output Attacks

- 7. Force different outputs to be generated for each input
- 8. Force invalid outputs to be generated
- 9. Force output properties to change
- 10. Force the screen to be refreshed

- While executing the input and output attacks, keep notes on persistent (stored) data
  - Data you see over and over is stored
  - Data you see over multiple executions is persistent
- Think about where this data comes from and how it can get corrupted

# Exploring Stored Data

- If you have the source code, data is easy to find
- If you don't have the source, you must be able to look through the interface and identify data
  - Put yourself in the shoes of the developer
  - How would you program it?

#### Preview of Stored Data Attacks

- 11. Apply inputs using a variety of initial conditions
- 12. Force a data structure to store too many/too few values
- 13. Investigate alternate ways to modify internal data constraints

# Eleventh Attack: Vary Initial Conditions

- When to apply this attack:
  - Inputs are often applicable in a variety of circumstances (states of the software)
  - This attack investigates the application of inputs from a variety of initial states
  - Try to pick states that might cause errant behavior
    - Pick states that are as different from each other as possible
    - Pick states that differ by only one data element

### Vary Initial Conditions

- What faults make this attack successful:
  - Checking that inputs are valid is tricky
  - Checking that inputs combined with internal state – are valid is even harder
  - The result is that some inputs only work on specific initial conditions

Attack 11

# Vary Initial Conditions

- How to determine if the software fails:
  - Since some cases of this bug mean missing code (failure to consider a specific case)
    - Watch for crashes
    - Watch for nonsense error messages

### Vary Initial Conditions

- How to conduct this attack:
  - Enumerating states is a time consuming endeavor:
  - Does an input cause different behavior in different circumstances?
  - Is an input applicable sometimes but not other times?

For more information read about *model-based testing* visit: www.model-based-testing.org

Attack 11

# Vary Initial Conditions

- Target: Word® 2000
- Features: Draw Group/Ungroup
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000
- Synopsis: Certain configurations of grouped objects do not ungroup cleanly

Attack 11: Example

# Vary Initial Conditions

Questions before we move to the next attack?

Attack 11: Example

# Twelfth Attack: Over/Underflow Data Structures

- When to apply this attack:
  - Fixed size structures can be overflowed by forcing the software to store too many values
  - Testers must be able to detect size constraints and then find ways to violate them

#### Over/Underflow Data Structures

- What faults make this attack successful:
  - Developers are often good at checking content and failing to consider size
  - Every program that adds or removes elements from a structure must have code to check that size constraints are not broken as a result of the add/remove operation
  - It only takes one program to forget...

Attack 12

#### Over/Underflow Data Structures

- How to determine if the software fails:
  - Reading or writing beyond the bounds of a data structures almost always causes the software to crash
  - But be alert for the usual signs of data corruption (performance, anomalous output/computation, etc.)

#### Over/Underflow Data Structures

- How to conduct this attack:
  - A structure can be underflowed by removing more values than the structure contains
  - Strategy:
    - $\sim$  add x number of values to a structure
    - delete x+1 values from the structure
  - Try this on internal structures and also list boxes that store persistent data

Attack 12

### Over/Underflow Data Structures

- Farget: Word® 2000
- Features: Tables
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x
- Synopsis: failure to constrain the upper bound on the number of rows

Attack 12: Example

### Over/Underflow Data Structures

Questions before we move to the next attack?

Attack 12: Example

# Thirteenth Attack: Find Back Doors to Stored Data

- When to apply this attack:
  - Whenever data is supposed to be constrained in any way:
    - A month must fall between 1 and 12
    - A year must fall between 1980 and 2095
    - The limit on undo operations is 20

. . .

#### Find Back Doors to Stored Data

- What faults make this attack successful:
  - Plain and simple: a lack of good software engineering practice
  - Information hiding (object-orientation) cures this problem:
    - » All data is private to a set of access routines
    - » These access routines enforce data constraints
    - A program must use the access routines to access or modify data

Attack 13

#### Find Back Doors to Stored Data

- How to determine if the software fails:
  - Broken data constraints are serious, often resulting in system instability or crash
  - Also, look for:
    - » System sluggishness
    - » Incorrect error messages
    - » Incorrectly computed results

#### Find Back Doors to Stored Data

- How to conduct this attack:
  - Identify data structures
  - Explore various ways to modify the contents or properties of the data structure

Attack 13

#### Find Back Doors to Stored Data

- Target: PowerPoint® 2000
- Features: Insert Table
- Reproducibility: Works on all PC versions of PowerPoint® and all versions of Windows® 9x, NT, 2000
- Synopsis: Table size is constrained at creation but can be expanded on edit

Attack 13: Example

#### Find Back Doors to Stored Data

Questions before we move to the next attack?

Attack 13: Example

# Review of Stored Data Attacks

- 11. Apply inputs using a variety of initial conditions
- 12. Force a data structure to store too many/too few values
- 13. Investigate alternate ways to modify internal data constraints

- While executing the input and output attacks:
  - make a feature list
  - keep notes on what computation is happening inside the application
- Think about ways to "get in the way" of the computation, using inputs, outputs, data, or even other features

# Preview of Computation Attacks

- 14. Experiment with invalid operand and operator combinations
- 15. Exploit recursion
- 16. Force computation results to be too large or too small
- 17. Find features that share data or interact poorly

# Fourteenth Attack: Experiment w/ Operator/Operands

- When to apply this attack:
  - Some operands cannot be used with certain operators
    - Divide by zero
    - Adding a character to a real number
  - Some operators conflict with each other
    - » Operators with an inverse relationship
    - $\rightarrow$  Ex: [SQRT(x)]<sup>2</sup>

Attack 14

#### Experiment w/ Operator/Operands

- What faults make this attack successful:
  - Developers have to write error code to weed out invalid operator/operand combinations
  - Consider divide-by-zero

```
if (x != 0)
  y = z / x;
else
  cout("cannot divide by zero");
```

#### Experiment w/ Operator/Operands

- How to determine if the software fails:
  - Become a problem domain expert to learn about the domain and range of the functions computed by your software
  - Apply invalid combinations and watch for:
    - Crashes (for unhandled exceptions)
    - » Incorrect results
    - » Null or nonsensical values appearing as output

Attack 14

#### Experiment w/ Operator/Operands

- How to conduct this attack:
  - Must be able to identify computation (operators) and the data objects (operands) on which the computation operates
  - Computation isn't just assignments
    - Graphical rendering
    - Screen layout, WYSIWYG

### Experiment w/ Operator/Operands

- Target: Windows® Calculator
- Features: Square Root/Exponent
- Reproducibility: Works on all versions of Windows® 9x, NT, 2000
- Synopsis: Round-off error

Attack 14: Example

### Experiment w/ Operator/Operands

Questions before we move to the next attack?

Attack 14: Example

# Fifteenth Attack: Exploit Recursion

- When to apply this attack:
  - Recursion occurs when a function calls itself

```
long int factorial(long int n)
{
  if (n <= 1)
    return(1);
  else
    return(n * factorial(n - 1));
}</pre>
```

• If this occurs too many times, stack overflow will occur

Attack 15



# **Exploit Recursion**

- What faults make this attack successful:
  - Developers fail to write code that ensures loop or recursive call termination
  - Since recursion only requires one line of code, constraining it often seems unnecessary to developers

### **Exploit Recursion**

- How to determine if the software fails:
  - Infinite recursion causes a heap overflow in the computer's memory
  - This will almost always result in the application crashing or hanging

Attack 15

# **Exploit Recursion**

- How to conduct this attack:
  - Recursion can be thought about in terms of black box objects too:
    - Can a document reference itself?
    - Can a hyperlink point to itself?
    - Can an email message contain itself?
    - Can a program that spawns processes call itself?
- Determine these objects and create the self-referencing situation

# **Exploit Recursion**

- This demo will shows how such a program behaves on Windows 2000...
- ...However, we will wait until the end of the session because the computer will not be usable after this program executes

Attack 15: Example

# **Exploit Recursion**

Questions before we move to the next attack?

Attack 15: Example

# Sixteenth Attack: Force Results to be too Large/Small

- When to apply this attack:
  - Review:
    - Inputs need to be constrained
      - an app should never permit invalid input to enter the system
    - Data needs to be constrained
      - an app should never store invalid data
  - But even if all inputs and data are correct, computation can still fail

Attack 16

#### Force Results to be too Large/Small

What faults make this attack successful:

x=x+1;

- This simple line of code will fail if it is executed enough times
- Developers often focus on operators and operands and ignore the result

#### Force Results to be too Large/Small

- How to determine if the software fails:
  - Since the result of this attack is underflow or overflow, the software most often crashes

Attack 16

### Force Results to be too Large/Small

- How to conduct this attack:
  - Usually, you have to force a computation to occur over and over and over
  - If you have control over the data used in the computation, rig it to begin at or around boundary values

# Force Results to be too Large/Small

Given the program:

```
#define count 2
main() {
  int sum, value[count];
  sum = 0;
  for (i = 0; i < count; ++i) {
     sum = sum + value[i];
  }
}</pre>
```

Consider the values:

```
value[0] = 32700, value[1] = 70count = 33000, value[0..32999] = 1
```

Attack 16: Example

Force Results to be too Large/Small

Questions before we move to the next attack?

# Seventeenth Attack: Feature Interaction

- When to apply this attack:
  - This is the attack that distinguishes novice testers from the pros
  - Given two or more features that work fine individually, can you make them break when they work together
    - on the same data
    - » at the same time
    - » getting in the way of each other

Attack 17

# Feature Interaction

- What faults make this attack successful:
  - Features often work well in isolation because they are developed in isolation
  - But two features may share data but require that different constraints on that data be enforced
  - Failure to enforce such constraints causes this class of failure

Attack 17

# Feature Interaction

- How to determine if the software fails:
  - Here is yet another example of painstaking manual verification
  - Keep a sharp eye out for:
    - » Improperly formatted output
    - Incorrect computation
    - Corrupted data
  - Failures often manifest long after the fault has been tripped over

Attack 17

# Feature Interaction

- How to conduct this attack:
  - Pick two or more features
    - That might interact with the same user input
    - That might contribute to computing a single output
    - That might share the same data
  - Try to make them "get in each other's way"

# Feature Interaction

- Target: Word® 2000
- Features: Footnotes/Dual Columns
- Reproducibility: Works on all PC versions of Word® and all versions of Windows® 9x, NT, 2000
- Synopsis: The footnote adopts the properties of the reference point, which can be incompatible with the page layout

Attack 17: Example

# Review of Computation Attacks

- 14. Experiment with invalid operand and operator combinations
- 15. Exploit recursion
- 16. Force computation results to be too large or too small
- 17. Find features that share data or interact poorly



# Summary

- Each interface presents number of challenging problems
- Testing any of the interfaces is formidable
- Testing them thoroughly is exquisitely difficult
- Some interfaces are well-understood, some we are just beginning to study

# Conclusions Testing each interface: The user interface study the attacks execute the attacks meet often, share best practices

# Testing each interface: • The kernel interface • acquire the tools • acquire the knowledge • study field bug reports and make sure you understand real stressed environments

# Conclusions

- Testing each interface:
  - The file system interface
    - » understand your app's file formats
- understand how storage media can fail and determine your customer's tolerance for losing their data

# Conclusions

- Testing each interface:
  - The software interface
    - understand the other apps that your software depends upon
    - » understand the data that your app gets from these resources
    - » determine how this communication can fail
    - » determine if failure of the resource will cause failure of your app

# Remember

- Know your app's environment
  - Understand what might go wrong and test that it doesn't
- Know your app's capabilities
  - Understand how it can screw things up, test that it doesn't
- Practice the attacks...always have a goal
  - Brain on, eyes open, Test!



# Sponsored by:













# **Design of Embedded Systems**

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris





# Constraint Programming Approach

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris

### **Quotations**

"Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it."



Eugene C. Freuder CONSTRAINTS, April 1997



# **Introduction and Motivation**

Synthesis of the following code (inner loop of differential equation integrator)

### while c do

### begin

```
x1 := x + dx;

u1 := u - (3*x*u*dx);

y1 := y + u*dx;

c := x < a;

x := x1; y := y1; u := u1;

end;
```





# Register Allocation as Graph Coloring



### **Constraints:**

$$\begin{split} &[r_1,r_2,r_3,r_4,r_5,r_6] :: 0..2, \\ &r_1 \neq r_2, \, r_1 \neq r_3, \, r_2 \neq r_3, \\ &r_2 \neq r_4, \, r_3 \neq r_4, \, r_4 \neq r_6, \\ &r_5 \neq r_6. \end{split}$$



# Register Allocation as Clique Finding

- for all r<sub>i</sub>, r<sub>j</sub> which are not connected by an edge: r<sub>i</sub> ≠ 1 ∨ r\_j ≠ 1
- The maximal clique can found by maximizing the following cost function:

$$cost = \sum_{i} r_{i}$$





# **Constraint Consistency**

- All constraints are stored in the constraint store
- Consistency methods are applied to find inconsistent values and prune variables' domains
- Different types of consistency methods:
  - Node consistency
  - Arc consistency
  - Path consistency

· ...





# **Consistency Properties**

- Node consistency
  - A network is node consistent if in each node domain each value is consistent with unary constraint (e.g., X > 7)
- Arc consistency
  - A network is arc consistent if for each arc connecting variables V<sub>i</sub> and V<sub>j</sub> for each value in the domain of V<sub>i</sub> there exist a value in the domain of V<sub>j</sub> consistent with binary constraint (e.g., X > Y)

# **Node and Arc Consistency**

Example



1..6 
$$0..5$$

Not node consistent Not arc consistent node consistent arc consistent

## **Need for search**

- Node, arc and path consistency are in general not complete (complete for some problems with particular structures)
- Complete algorithm: N-consistency for N variable problems → exponential complexity
- Example:





### Search

Solver is not complete and search for a solution is needed







# **Constraint properties**

- may specify partial information need not uniquely specify the values of its variables,
- non-directional typically one can infer a constraint on each present variable,
- declarative specify relationship, not a procedure to enforce this relationship,
- additive order of imposing constraints does not matter,
- rarely independent typically they share variables

# More realistic example Scheduling

### Scheduling of the data-flow graph



### **Constraints:**

- for all op<sub>i</sub> and op<sub>j</sub> such that op<sub>i</sub>
   before op<sub>j</sub>
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>j</sub>
- for all op<sub>i</sub> and op<sub>j</sub> that can use the same resource
   T<sub>i</sub> + D<sub>i</sub> ≤ T<sub>j</sub> ∨ T<sub>j</sub> + D<sub>j</sub> ≤ T<sub>i</sub> ∨ R<sub>i</sub> ≠ R<sub>j</sub>

# **Problems**

Constraint propagation for

$$T_i + D_i \le T_i \lor T_i + D_i \le T_i \lor R_i \ne R_i$$
 is weak

- Not all solvers support disjunctive constraints.
  - Other solution (reified constraints):

$$\begin{split} T_i + D_i &\leq T_j \Leftrightarrow B1, \\ T_j + D_j &\leq T_i \Leftrightarrow B2, \\ R_i &\neq R_j \Leftrightarrow B3, \end{split}$$

B1 + B2 + B3 ≥ 1.







$$\begin{aligned} T_k + D_k &\leq T_n \vee T_n + D_n \leq T_k \vee R_k \neq R_n \\ T_m + D_m &\leq T_n \vee T_m + D_m \leq T_n \vee R_m \neq R_n \end{aligned}$$



### **Global constraints**

Non-overlapping rectangles



 $\begin{aligned} \text{diff2}([~[X_i,Y_i,DX_i,DY_i],\\ [X_j,Y_j,DX_j,DY_j]~]~) \end{aligned}$ 

- All knowledge in one "place" makes it possible to define good consistency methods (OR, mathematics, geometry, etc.)
- Specific algorithms for consistency more efficient



# **Scheduling Example Constraints**

$$\begin{split} &T1+2 \leq T6,\, T2+2 \leq T6,\\ &T3+2 \leq T7,\, T4+2 \leq T8,\\ &T5+1 \leq T9,\, T6+2 \leq T10,\\ &T7+2 \leq T11,\, T10+1 \leq T11,\\ &\text{diff2}([\ [T1,R1,2,1],\, [T2,R2,2,1],\, [T3,R3,2,1],\\ &\quad [T4,R4,2,1],\, [T6,R6,2,1],\, [T7,R7,2,1],\\ &\quad [T5,R5,1,1],\, [T8,R8,1,1],\, [T9,R9,1,1],\\ &\quad [T10,R10,1,1],\, [T11,R11,1,1]\ ]). \end{split}$$





# Other Synthesis Problems Defined with Constraints

- High-level synthesis:
  - Chaining,
  - Conditional execution,
  - Pipelined components,
  - Algorithmic pipelining,
  - Switching activity reduction (power consumption)
  - .
- System design
  - different aspects of design space exploration
  - scheduling
  - component assignment
  - memory allocation/data assignment
  - power/energy consumption
  - · ..





# Additional Constraints element

- Element constraints
  - element(N, [X<sub>1</sub>, X<sub>2</sub>, ..., X<sub>n</sub>], Value)
    - propagation from N to Value

propagation from Value to N

Value =  $x \rightarrow N=i$  and Xi = x ...

- Examples- element(N, [2, 3, 4, 4], V)
  - N :: 1..2, V :: {2, 3}
  - V = 4, N :: 3..4
- Used to define discrete cost functions and different relations









# **Edge Finding Algorithm**

- Martin-Shmoys algorithm with O(n²) complexity.
- Up phase
  - for each unique lct we create a set
    S = {t | LCT(t) <= lct} and make checking whether a task can be the first or before</p>
- Down phase
  - similar but using est and checking whether a task can be the last one or after all tasks.

# **System Synthesis Example**



- original MILP formulation- 47 timing variables, 225 binary (bus 153) and1081 constraints (bus 416)
- commercial linear programming package used to solve the problem (XLP, developed by XMP Software, Inc.)

|           |      |    | Execution time |    |    |    |    |    |    |    |
|-----------|------|----|----------------|----|----|----|----|----|----|----|
| Processor | Cost | S1 | S2             | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
| P1        | 4    | 2  | 2              | 1  | 1  | 1  | 1  | 3  | -  | 1  |
| P2        | 5    | 3  | 1              | 1  | 3  | 1  | 2  | 1  | 2  | 1  |
| P3        | 2    | 1  | 1              | 2  | -  | 3  | 1  | 4  | 1  | 4  |



# Modeling of cost and execution time

- Execution time element(P1, [2, 3, 1], D1)... element(P9, [1, 1, 4], D4)
- Cost
   (P1=1 ∨ P2=1 ∨ ... ∨ P9=1) ⇔ C1,
   ...
   (P1=6 ∨ P2=6 ∨ ... ∨ P9=6) ⇔ C6,

# **System synthesis results**

|                | Perform |        |          | rmance opti | mization | Cost optimization |           |  |
|----------------|---------|--------|----------|-------------|----------|-------------------|-----------|--|
| Design         | Cost    | (time  | MILP (s) | CLP (s)     | B&B      | CLP (s)           | B&B Nodes |  |
|                |         | units) |          |             | Nodes    |                   |           |  |
|                | 10      | 6      | 6438.00  | 0.43        | 84       | 0.55              | 92        |  |
| Bus            | 6       | 7      | 5371.80  | 0.53        | 114      | 0.68              | 144       |  |
|                | 5       | 15     | 3691.20  | 0.43        | 68       | 0.70              | 103       |  |
|                | 15      | 5      | 3732.00  | 0.43        | 20       | 1.67              | 125       |  |
| point-to-point | 12      | 6      | 26710.20 | 1.42        | 98       | 2.18              | 169       |  |
| links          | 8       | 7      | 32320.20 | 1.00        | 58       | 2.59              | 198       |  |
|                | 7       | 8      | 4510.80  | 1.64        | 75       | 2.02              | 112       |  |
|                | 5       | 15     | 38501.20 | 1.50        | 32       | 1.48              | 77        |  |

Cost = 4\*C1 + 4\*C2 + 5\*C3 + 5\*C4 + 2\*C5 + 2\*C6



# System synthesis results with local memory

|                |      | Performance  | Performa  | ince optin | Cost optimization |        |           |
|----------------|------|--------------|-----------|------------|-------------------|--------|-----------|
| Design         | Cost | (time units) | MILP (s)  | CLP        | B&B               | CLP    | B&B Nodes |
|                |      |              | * *       | (s)        | Nodes             | (s)    |           |
|                | 28   | 6            | 6592.20   | 0.71       | 76                | 2.58   | 252       |
|                | 23   | 7            | 5371.80   | 1.07       | 193               | 1.94   | 266       |
| Bus            | 22   | 8            | 123252.60 | 0.95       | 124               | 14.85  | 856       |
|                | 21   | 10           | 316860.60 | 114.92     | 4 534             | 119.55 | 8 799     |
|                | 18   | 11           | 236724.00 | 88.23      | 7 015             | 2.37   | 477       |
|                | 17   | 12           | 138004.20 | 0.93       | 268               | 10.39  | 3 076     |
|                | 14   | 15           | 3581.40   | 0.54       | 22                | 9.89   | 3 076     |
|                | 38   | 5            | -         | 0.56       | 24                | 2.08   | 107       |
|                | 30   | 6            | -         | 0.99       | 59                | 3.75   | 155       |
| point-to-point | 25   | 7            | -         | 1.60       | 79                | 5.58   | 314       |
| links          | 23   | 8            | -         | 1.82       | 57                | 3.21   | 184       |
|                | 22   | 10           | -         | 4.50       | 84                | 59.25  | 855       |
|                | 19   | 11           | -         | 27.34      | 794               | 101.03 | 2 851     |
|                | 18   | 12           | -         | 97.72      | 2 686             | 8.66   | 1 047     |
|                | 14   | 15           | -         | 1.18       | 14                | 4.95   | 328       |





|         |                |              | IVIC | app          | ıng          | s to        | Pr         | OCE         | 255        | ors         |
|---------|----------------|--------------|------|--------------|--------------|-------------|------------|-------------|------------|-------------|
| Task    | Uni-<br>versal | BMA<br>array | PAR1 | DCT<br>array | FIR<br>array | BMA<br>pipe | FIR<br>seq | FIR<br>pipe | DCT<br>seq | DCT<br>pipe |
| IN      | -              | -            | -    | -            | -            | -           | -          |             | -          | -           |
| FB1     |                | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| BMA     | 7234           | 484          | -    | -            | -            | 3617        | -          | -           | -          | -           |
| FIR     | 7234           | -            | -    | -            | 510          | -           | 3461       | 1170        | -          | -           |
| PRAE    | 1280           | -            | 128  | -            | -            | -           | -          | -           | -          | 474         |
| DCT     | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | 474         |
| Q<br>IQ | -              | -            | -    | -            | -            | -           | -          | -           | -          | -           |
| IDCT    | 12312          | -            | -    | 132          | -            | -           | -          | -           | 6156       | -<br>474    |
| REK     | 1536           | - [          | 256  | 132          | - [          | -           |            |             | 0130       | 4/4         |
| C       | 132            | -            | 230  |              | -            | -           | -          | -           | _          |             |
| FB2     | -              | _            | _    | _            | _            | _           | _          | _           | _          | _           |
| 1 02    | -              |              |      |              |              |             |            |             | (0)        | TAKE        |





### **Experimental results** H.261 example AVERAGE EXECUTION PIPELINE DEGREE DEADLINE both greedy 1% -16% memory

# Scheduling of Mars Path Finder under Power Consumption Constraints

The mars rover operates on very limited power supply. The power is given by solar panels. The power obtained from solar panels was measured at different temperatures and the results were the following: 14.9W at -40°C, 12.0W at -60°C and 9.0W at -80°C. There is a battery power source too, which gives maximal 10.0W and it is not replenishable energy so the battery power should be used as little as possible. The mars rover has 6 driving and 4 steering motors, which need to be warmed up before respective driving and steering can be performed.



# Scheduling of Mars Path Finder under Power Consumption Constraints

| Operation                    | Duration |                                                  |  |
|------------------------------|----------|--------------------------------------------------|--|
| Heating steering motors      | 5s       | At least 5s and at most 50s before               |  |
| (HSM1&2, HSM3&4)             |          | steering starts                                  |  |
| Heating wheel motors         | 5s       | At least 5s and at most 50s before driving       |  |
| (HWM1&2, HWM3&4, HWM5&6)     |          | starts                                           |  |
| Hazard detection (HD1 & HD2) | 10s      | At least 10s before steering starts              |  |
| Steering (Steer1, Steer2)    | 5s       | At least 5s before driving                       |  |
| Driving (Drive1, Drive2)     | 10s      | At least 10s before next hazard detection starts |  |



# **Scheduling of Mars Path Finder under Power Consumption Constraints**

| Task                | Duration | Power -40°C | Power -60°C | Power -80°C |
|---------------------|----------|-------------|-------------|-------------|
| Heat two motors     | 5s       | 7.6W        | 9.5W        | 11.3W       |
| Drive               | 10s      | 7.5W        | 10.9W       | 13.8W       |
| Steer               | 5s       | 4.3W        | 6.2W        | 8.1W        |
| Hazard<br>Detection | 10s      | 5.1W        | 6.1W        | 7.3W        |
| CPU                 | Constant | 2.5W        | 3.1W        | 3.7W        |



# **Modeling**

Precedence constraints:

$$t_hd1 + d_hd1 \le t_steer1$$
,

. . .

 $t_steer1 + d_steer1 \le t_drive1$ ,

 $t_hwm12 \le t_steer1 + 50$ ,

Power consumption constraints:

cumulative([t\_hd1, ..., th\_sm12],

[p\_hd1, ..., p\_sm12],

[d\_hd1, ..., d\_sm12], Power)

Optimize "Power"







cycle(2, [ [2,6], [3,4], [1], [2,3], [2,6], [2,5]] )



[[2],[4],[1],[3],[6],[5]]



# Search

- Standard search uses depth-first-search with backtracking.
- Optimization uses branch-and-bound or similar methods.







# Search (cont'd)

[City1::2..4, City2::{1,3..4}, City3::{1..2,4}, City4::1..3]

- How to select order of variable assignment?
  - dynamic vs. static
  - criteria
- How to select values to be assigned from variable's domain?
  - a single value
  - sub-domain
  - ...



### **Variable Selection**

- Static and dynamic
  - input order (static)
  - first-fail principle (smallest size of the domain)
  - smallest value in the domain
  - largest value in the domain
  - largest difference between the smallest and second smallest value in its domain
  - smallest max value in the domain



### **Value Selection**

- Single value
  - minimum in the domain and then upwards
  - maximum in the domain and then downwards
  - middle and then towards smallest and largest
  - random
  - · ...
- Domain split
  - split into two sub-domains
  - split into N
  - · ...



# **Search improvements**

- Partial enumeration algorithms (instead of labeling)
  - Credit Search,
  - Limited Discrepancy Search (LDS).
- Assignment of subintervals instead of values to domain variables — possibly examines a bigger part of a solution space.
- Problem-dependent specific heuristics.
- Neighbourhood search...









# **Summary and conclusions**

### Advantages:

- focus on a specification of the problem, not on a solution method.
- unified framework for different algorithms to be used to solve a problem (by encapsulating them as constraints).
- easy definition of problems with many heterogeneous constraints.
- easy extension of a problem by adding new constraints.
- collaboration of solvers and distributed computing

### **Summary and conclusions**

- Limitations:
  - NP-hard problems.
  - often non-predictable behavior of a solver.
  - difficult to define and add new constraints:
    - into existing systems interface problems.
    - new propagation algorithms need to be developed.
  - difficult to match constraints with actual problems.



### **CP finite domain systems**

- SICStus Prolog
- CHIP from COSYTEC
- IF/Prolog
- ILOG
- Mozart/Oz
- Gnu Prolog
- JaCoP Java based our own solver



#### **Selected CP Web resources**

- Constraints archive http://www.cs.unh.edu/ccc/archive
- Guide to constraints programming http://kti.ms.mff.cuni.cz/~bartak/constraints
- Sicstus manual http://www.sics.se/isl/sicstus/sicstus\_toc.html
- Gnu Prolog http://www.gnu.org/software/prolog/prolog.html
- Mozart/Oz http://www.mozart-oz.org/

#### Other resources

- Book
  - K. Mariott and P. J. Stuckey Programming with Constraints: An Introduction, The MIT Press, 1998.
- Conferences
  - Principles and Practice of Constraint Programming (CP)
  - The Practical Application of Constraint Technologies and Logic Programming (PACLP)
- Journal
  - Constraints (Kluwer Academic Publishers)



#### **Selected Papers**

- Kuchcinski, K., Embedded System Synthesis by Timing Constraints Solving, Proc. 10th International Symposium on System Synthesis, Antwerp, Belgium, September 17-19, 1997.
- Gruian, F. and Kuchcinski, K., Operation Binding and Scheduling for Low Power Using Constraint Logic
   Programming, Proc. 24th Euromicro Conference, Workshop on Digital System Design, Västerås, Sweden, August 25-27, 1998.
- Kuchcinski, K. and Wolinski, Ch., Global Approach to Assignment and Scheduling of Complex Behaviours based on HCDG and Constraint Programming, Journal of Systems Architecture, 2003, Elsevier Science.

#### **Selected Papers (cont'd)**

- Kuchcinski, K., Constraints Driven Design Space Exploration for Distributed Embedded Systems, Journal of Systems Architecture, vol. 47, no. 3-4, pp. 241-261, 2001, Elsevier Science.
- Szymanek, R. and Kuchcinski, K., A Constructive Algorithm for Memory-Aware Task Assignment and Scheduling, Proc. 9th International Symposium on Hardware/Software Codesign, Copenhagen, Denmark, Apr. 2001.
- Szymanek, R. and Kuchcinski, K., Partial Assignment Technique for Task Graph Scheduling, 40th DAC, Anaheim, USA, June 2003.
- Kuchcinski, K. Constraint-driven scheduling and resource assignment, ACM Trans. on Design Automation of Electronic Systems, vol. 8, no. 3, pp. 355-383, 2003.



### Methodologies for Embedded System design

Kris Kuchcinski
Dept. of Computer Science
Lund University
Sweden

http://www.cs.lth/~kris



### **Input to System Design**

- Executable specification (functional requirements):
  - usually provided as interacting processes/tasks,
  - very often multi-language specifications,
  - can be simulated and verified,
  - can be used to perform analysis, e.g, estimation.
- Specification languages: C, C++, VHDL, Verilog, SystemC, Esterel, SDL, etc.
- Set of (non-functional) design requirements (cost, speed, I/O rate, power consumption, etc.).

### **Output from System Design**

- A set of system modules assigned to system components (CPU's, DSP's, ASIC's, etc.).
- Communication modules.
- Each module can be further synthesized to hardware using *high-level synthesis* or *compiled to software*.





### **Basic Characteristics of the Methodology**

- Behavioral specification is given for the complete heterogeneous system, regardless of how different parts will be later implemented.
- Analysis techniques are provided; specially different estimation techniques.
- Synthesis tools are used to automatically explore a design space.
  - high-level synthesis, RTL synthesis,
  - compilers, cross-compilers,
  - interface generators,
  - etc.



### **Estimation of Design Parameters**

- Estimation of parameters such as size, cost, power consumption.
- Does not need to be very precise but has to be "consistent" — follows real design parameters.
- Usually 15%-20% inaccurate.
- Trade-off between accuracy and estimation time.



### Improvements of the Design Process

- High-level specification is made before architecture selection and implementation decisions can be made more accurate (better exploration of architectures).
- A uniform description of HW and SW makes it possible to move parts of the systems between HW and SW.
- HW and SW development is moved closer and the integration cost is reduced.
- An early evaluation of system characteristics is possible.





### **Representation Example**



Process communication graph

### **Allocation of System Components**

- Decides about the kind and number of components for implementation of the system
  - processing elements: μprocesosrs, microcontrollers, DSP's, ASIP's, ASIC's, FPGA's, etc.
  - storage elements: memories, register files, registers, etc.
  - communication devices: buses, point-to-point links, networks, etc.
  - specialized I/O devices: A/D, D/A, frame grabbers etc.

### **Partitioning**

- Functional partitioning vs. structural partitioning.
- Abstraction level.
- Partitioning granularity (fine or course):
  - modules,
  - processes and procedures,
  - instructions.
- Partitioning objective:
  - performance,
  - minimal communication,
  - low power,
  - combination of several criteria.





### **Communication Synthesis**

- Creation of abstract communication channels by communication clustering.
- Communication refinement
  - selection of communication lines width,
  - protocol selection,
  - etc.
- Interface generation:
  - device drivers,
  - communication hardware,
  - etc.





### **Design Decisions**

- Different types of design decisions
  - selection of components, partitioning, assignments, scheduling, etc.
  - decisions regarding runtime system done off-line or are postponed to runtime (e.g., static vs. runtime scheduling)
- Design decisions are mutually dependent
- Huge design space



### **Design Automation**

- Uses internal representations which are usually based on graphs.
- Graph algorithms (shortest path, Hamiltonian circuit, topological sort, depth-first-search, breadth-firstsearch, SAT, etc.).
- Optimization methods (M)ILP, CLP, heuristics, etc.
- Tractable and intractable problems.
- Decidable and undecidable problems.
- Decision problems and combinatorial optimization problems.

### **Design Automation Consequences**

- Most of the problems which need to be solved in design automation are NP-complete or NP-hard.
- Usually only small problems can be solved exactly.
- Need for algorithms which do not guarantee optimal solutions but "good enough" solutions
  - approximation algorithms guarantee a solution with a cost that is within some margin of the optimum,
  - heuristics algorithms that are constructed based on "rules-of-thumb"; nothing can be said in advance about the quality of the result.

# Examples of Embedded Systems (cont'd)

Anti-lock brakes
Auto-focus cameras
Automatic teller machines
Automatic teller machines
Automatic transmission
Avionic systems
Battery chargers
Camcorders
Cell-phone base stations
Cordless phones
Cruise control
Curbside check-in systems
Digital cameras
Disk drives
Electronic card readers
Electronic instruments

Electronic toys/games Factory control

Fingerprint identifiers

Home security systems Life-support systems Medical testing systems

Fax machines

MPEG decoders
Network cards
Network switches/routers
On-board navigation
Pagers
Photocopiers
Portable video games
Printers
Satellite phones
Scanners
Samart ovens/dishwashers
Speech recognizers
Stereo systems
Teleconferencing systems
Televisions
Temperature controllers
Theft tracking systems
TV set-top boxes
VCR's, DVD players

Video game consoles Video phones

Washers and dryers



Source: Embedded Systems Design: A Unified Hardware/Software Introduction, (c) 2000 Vahid/Givargis

# "A device that includes a programmable computer but is not itself a general-purpose computer."



### **Embedded Systems (cont'd)**

- Computing systems embedded within electronic devices
- Hard to define. Nearly any computing system other than a desktop computer
- Billions of units produced yearly, versus millions of desktop units
- Perhaps 50 per household and per automobile



Introduction, (c) 2000 Vahid/Givargis

#### **Embedded Systems (cont'd)**

- Non User-Programmable.
- Based on programmable components (e.g., Microcontrollers, DSP's...) but often contain application specific hardware (IC's, ASIC's).
- Reactive Real-Time Systems:
  - React to external environment,
  - Maintain permanent interaction,
  - Ideally never terminate,
  - Are subject to external timing constraints (realtime).

### Characteristics of Embedded Systems

- Sophisticated functionality.
- Real-time operation.
- Low manufacturing cost.
- Low power.
- Designed to tight deadlines by small teams.
- "Resource conscious" vs. "Unlimited resources" programming

#### **SoC Embedded System**



- Assembly of "prefabricated components" often purchased from external vendors ("IP")
  - "black box" hierarchy
- Design & Verification at the System level
  - rather than the logic level
  - Interface and communication
- Great Importance of Software RV

Source: Alberto Sangiovanni-Vincentelli, 35th DAC









# Typical Hardware Components of DSP System

| Component class                 | Implements                                                  | Compiler                                              | Specification        |
|---------------------------------|-------------------------------------------------------------|-------------------------------------------------------|----------------------|
| DSP processor                   | Low data-rate DSP<br>Slow control loops<br>Appl. Spec. alg. | (Retargetable)<br>code generator<br>High level synth. | Assembly<br>C<br>DFL |
| Microcontroller                 | User interface<br>Slow control loops                        | C compiler                                            | С                    |
| Hardware accelerator            | High data-rate DSP                                          | High level synth.<br>RT level synth.                  | C, DFL<br>VHDL       |
| Communication blocks and memory | Internal & external communication Storage & buffering       | Memory mgmt. (A)synchronous interface synth.          | Data-sheets<br>STG   |
| Others                          | Usually FSMD's - clock generators                           | RT level synth. Asynchronous                          | VHDL *RV             |

Source: H. de Man, et. al. "Co-design of DSP Hardware/Software Co-design, Kluwer 1995.

### Importance of Embedded System Design Methodologies

- Hardware complexity.
- Heterogeneous systems containing hardware (both digital and analog) and software.
- Heterogeneous components (CPU's, DSP's, ASIC's, buses, point-to-point links, etc.).
- Heterogeneous requirements performance, cost, power consumption, etc.
- System-on-chip.
- Shorter design cycles required by time-to-market constraints.

. . . .



# Software vs. Hardware Design short summary

- Software
  - flexibility,
  - reconfigurability, easy update, etc.,
  - complex functionality,
  - cost,
  - · ...
- Hardware
  - speed,
  - power consumption,
  - cost in large volumes,
  - ...



#### **Design of Embedded Systems**

- Need to be done using high-level specification, programming and hardware description languages not assembly languages and gate/transistor level design.
- Requires efficient design space exploration and synthesis/compilation tools.
- Different design requirements has to be taken into account, e.g., cost, performance, testability, quality of service, power consumption.
- Multi-language design framework.

# Importance of High-Level Design Methods

#### System Verification Processing Speeds

| System Implementation              | Processing time (s/frame) |  |
|------------------------------------|---------------------------|--|
| Behavioral model                   | 1 200 (20 min/frame)      |  |
| RTL model                          | 144 000 (1.6 days/frame)  |  |
| Gate model                         | 228 000 (2.6 days/frame)  |  |
| Gate model on hardware accelerator | 1 200                     |  |
| Rapid Prototype                    | 0.5                       |  |
| Target Hardware                    | 0.05                      |  |

Source: Paul Clemente, Ron Crevier, Peter Runstadle Synthesis A Case Study", VHDL Times, vol. 5, no. 1.



### **Specification and Programming**

- Specification languages, such as UML, SDL.
- Programming languages, such as C, C++, Java, Esterel, assembly languages.
- Hardware description languages, such as VHDL, Verilog, SystemC.
- **Example**: combining SystemC and C++ gives unified simulation environment for hardware and software.

#### **Hardware Description Languages**

- Cover several levels of design abstraction as well as behavioral and structural description domain.
- Contain typical features of programming languages, such as data types and program statements.
- Special features:
  - time concept,
  - structure description,
  - parallelism.
- VHDL (IEEE standard), Verilog, SystemC.



### **Design Representations** (Computational Model)

- Used to represent/model digital systems under design.
- Generated by a compiler from system specification or coded directly in the model.
- Represent the semantics, structure and timing of the system.
- Usually based on some kind of annotated graph representation.
- Used internally by design automation systems of the modeler/designer.

### **Design** — **Synthesis**

- Software translation into target code for a processor (real-time operating system might be used).
- Hardware synthesis translation of a behavioral representation of a design into a structural one.
- Communication synthesis generates hardware and software which interconnects system components.











### Summary

- Embedded systems are important class of electronic systems which can be found everywhere,
- Combine hardware and software solutions,
- Cover several engineering and research areas:
  - microelectronics,
  - real-time systems,
  - software development,
  - etc.
- Need careful design which optimizes different design parameters.