zamiacad Wiki

Brought to you by: a0c, guenter, heikos, maksimj, valtih1978

Rejuvenation Application

Attachments

nand_interleaving.png (11727 bytes)

ugp3.zip (1019203 bytes)

zamiacad.zip (13852576 bytes)

NBTI (negative bias temperature instability) is a transistor aging effect: when a transistor stays under stress voltage for years, some atoms leak from the gate and its switching (threshold) voltage, Vth, increases, which lengthens the switching time, low to high or high to low. Only logic 0 at the transistor input is considered stressful, so we can reduce the degradation if we maximize the time that nets on the critical path spend at logic 1. The material explained here is the basis for the article "Rejuvenation of nanoscale logic at NBTI-critical paths using evolutionary TPG".

As a first step, we represented gates in terms of transistors and mapped the transistor delays to gate delays, that is, the transistor's switching time for a given input became the propagation delay from that input to the gate output. The delay depends on three things: Pz, the probability (fraction of time) that the input was under stress, i.e. at logic 0, during 10 years; the number of gate inputs, since both the delay and the aging effect depend on it; and the load, i.e. the number of gates attached to the gate output.

We assume that logic 0 at the input causes the stress and the transistor degradation, and that only a rising edge adds NBTI delay to the signal propagation, whereas a falling edge has the standard nominal, unaged delay:

prob = nzeri[gi] if edge == 'r' else 0
delay = gateDelay(gate, prob, formal)

Next, we assume that our whole combinatorial circuit is made of NAND, NOR and NOT gates only. Then, as you can see on the diagram (see the nand_interleaving.png attachment), a rising edge (0 to 1) on the input of a gate that is on the critical path causes a falling edge (1 to 0) at its output. Consequently, you get NBTI-added delay on every second gate in the critical path. This allows us to compute an NBTI-aware static timing analysis, find the longest path after, say, 10 years of operation and try to fix it.
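The alternation of edges can be illustrated with a few lines of standalone Python. This is only a sketch: the helper name `edges_along_path` and the example path are made up for illustration, and it assumes that every gate on the path is inverting (NAND, NOR or NOT), as stated above.

```python
def edges_along_path(input_edge, nets):
    """Return (net, edge) pairs for a path made of inverting gates only.

    input_edge is 'r' (rising) or 'f' (falling) at the first net; every
    NAND/NOR/NOT gate on the path flips the edge direction.
    """
    flip = {'r': 'f', 'f': 'r'}
    result = []
    edge = input_edge
    for net in nets:
        result.append((net, edge))
        edge = flip[edge]
    return result

# Hypothetical 4-net path; the net names merely follow the C17 style.
path = edges_along_path('r', ['I3', 'N8', 'N10', 'Y10'])
# Only nets that see a rising edge contribute NBTI-added delay.
aged = [net for net, edge in path if edge == 'r']
print(path)
print("nets with NBTI-added delay:", aged)
```

With a rising edge at the first net, only every second net on the path ends up in the `aged` list.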

In our approach we investigated whether it is possible to add short "exercises", some overhead computation that de-stresses the critical path. We used uGP, a genetic generator, to produce inputs for our combinatorial circuit, whereas the Zamia environment computes the delays and reports them as fitness to the genetic generator. This involves workload simulation, which is necessary to predict the aging (Pz at every gate input) over time, and static netlist analysis.

Runnable sources of our work can be found in the assembla repository. You may also need the uGP generator: just extract the executable next to the assembla git archive. You will also need to download the zamiacad sources to get access to zamiacad.bat and the python library. Alternatively, you may download only this part of the sources and add it to the PATH environment variable so that zamiacad.bat is runnable.

Go to the evaluator folder and run 'gen_universal.bat' to see what you can do with it. You basically need something like

  gen_universal.bat C17 -best 1

This will produce best_0.tst and best_0-rejuvenated.txt in the C17_tests folder. The first file, which looks like 6 vectors, is the overhead that should be periodically applied to the C17 inputs, as often as possible, because it is the least stressful workload. The second file has content like

universal rejuvenation
delay when all inputs zero prob = 1.0 is 14.79
delay when all inputs zero prob = 0.0 is 11.33

delay,2-vec_rnd0
0%,14.79
1e-10%,14.79
1e-05%,14.79
0.001%,14.77
0.01%,14.58
0.1%,13.83
1%,12.97
10%,12.37
100%,11.63

This says that after 10 years we end up with a delay of 14.79 units if we add no rejuvenation overhead (the previous file) to the user workload. Here, the user workload is specified in the file 2-vec_rnd0, which is just a pair of vectors that are applied all the time, one after another. If, however, we mix the best_0.tst overhead into the user workload such that the overhead is executed 1 percent of the time, the total delay is reduced to 12.97. It would be 11.63 if the rejuvenating 01011 was applied to the circuit all the time, and 11.33 if no aging were observed. 14.79 is also the delay in the impossible case when all nets are under stress all the time.
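Since the report is plain comma-separated text, it is easy to post-process. The sketch below is standalone Python and not part of the repository; `parse_report` and `recovered` are hypothetical helper names. It embeds the numbers shown above and computes which fraction of the aging-induced delay each overhead level wins back, taking 14.79 and 11.33 as the fully aged and the unaged delay.

```python
REPORT = """\
delay,2-vec_rnd0
0%,14.79
1e-10%,14.79
1e-05%,14.79
0.001%,14.77
0.01%,14.58
0.1%,13.83
1%,12.97
10%,12.37
100%,11.63
"""

AGED, FRESH = 14.79, 11.33  # delays at Pz = 1 and Pz = 0 from the report header

def parse_report(text):
    """Return (workload names, {overhead percent: [delay per workload]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    workloads = lines[0].split(',')[1:]
    rows = {}
    for line in lines[1:]:
        cells = line.split(',')
        rows[float(cells[0].rstrip('%'))] = [float(c) for c in cells[1:]]
    return workloads, rows

def recovered(delay):
    """Fraction of the aging-induced delay removed by the rejuvenation."""
    return (AGED - delay) / (AGED - FRESH)

workloads, rows = parse_report(REPORT)
for percent in sorted(rows):
    for name, delay in zip(workloads, rows[percent]):
        print("%s at %g%% overhead: delay %.2f, recovered %.0f%%"
              % (name, percent, delay, 100 * recovered(delay)))
```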

At this point, you are in a position to add your typical user load for the circuits C17, ALU4 and ALU32 from the Plasma processor to see how they normally age and rejuvenate. Instead of generating the best rejuvenations, you can simulate your workload with an existing best rejuvenation if you have a best.tst:

  gen_universal.bat C17 -sim c17_tests/best_0.tst

which generates best_0.tst-rejuvenated.txt. To add your own design instead of C17 and ALU, you need the deeper technical details.

Technical details

Implement this section as flow chart?

universal_overhead.py

gen_universal.bat delegates the high-level functionality to universal_overhead.py. This script starts by adjusting uGP's constraint.xml so that it generates input vectors of the right width for the selected design. It also selects the right nand_alu.vhd for analysis and simulation. It then elaborates the design and calls val-eval.py to make a preliminary analysis and create nzeri, the collection of nets that are inputs to the nand/nor gates in the analyzed circuit. As explained above, we need to capture Pz, the fraction of time that a net is at logic 0 during the desired number of years, and nzeri holds the nets to track.

Then, all .tst workloads in the C17_tests folder are simulated (assuming the C17 design was specified when starting gen_universal.bat). The nzeri values are collected after every simulation.

for test in os.listdir(iniTestDir):
    if test.endswith(testExtension):
        simulate(iniTestDir + test)
        tests[test] = nzeri.copy() # back up the nominal workload probabilities

If the -sim <best.tst> option is specified, that file is simulated and used for rejuvenation, and no uGP generation loop is invoked. Normally, however, you need to generate the best rejuvenation overhead.

In run_uGP, we start uGP with the number of rejuvenations to generate (1 is good enough). For every rejuvenation we loop with the body

        if not os.path.exists(responseFile): # uGP has consumed our evaluation. It means that it has supplied a new individual already
            with open (requestFile, 'r') as f: path = f.readline().strip()
            delay = simulate(ugpPath + path)
            writeResponse("%s" % bestTracker.update(delay)) ; logResult(path)
            ensureRemove(requestFile) # signal that we have finished writing output

As you can see, as soon as stimuli vectors are available from uGP for evaluation in the request file, they are fed to the simulator to produce the critical path delay, which is reported back to the uGP generator as fitness in writeResponse. In the end, the request file is removed to signal the counterpart that the response is ready. Such signaling is necessary since the receiver must not start reading a file before the sender has finished writing it.
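The handshake can be tried out without uGP or Zamia. The following standalone sketch models only the evaluator side of the protocol as it is described above; the function name `serve_one_request`, the file names and the constant fitness value are assumptions made for the demonstration, not the actual code of universal_overhead.py.

```python
import os
import tempfile

def serve_one_request(request_file, response_file, evaluate):
    """Answer one pending request; return True if a request was handled.

    Assumed protocol: the generator writes the path of an individual into
    request_file; we write the fitness into response_file and then delete
    request_file to signal that the response is complete.
    """
    if os.path.exists(response_file) or not os.path.exists(request_file):
        return False  # previous response not consumed yet, or nothing to do
    with open(request_file) as f:
        individual = f.readline().strip()
    fitness = evaluate(individual)
    with open(response_file, 'w') as f:
        f.write("%s\n" % fitness)
    os.remove(request_file)  # signal: the response is fully written
    return True

# Demonstration with a stand-in "generator" and a dummy fitness function.
workdir = tempfile.mkdtemp()
request = os.path.join(workdir, "request.txt")
response = os.path.join(workdir, "response.txt")
with open(request, 'w') as f:
    f.write("individual_1.tst\n")

handled = serve_one_request(request, response, lambda path: 12.97)
with open(response) as f:
    answer = f.read().strip()
print(handled, answer)
```

Removing the request file only after the response has been closed is what keeps the generator from reading a half-written response.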

The loop ends as soon as uGP returns a best individual instead of an ordinary individual to evaluate. The best individual is simulated again to obtain nzeri, the Pz at every net, and the script proceeds to rejuvenation as if the '-sim' command had been given.

The rejuvenate procedure expects that the best rejuvenation stimuli have just been simulated, so that nzeri contains the net => Pz map, i.e. the probability of zero at every net if the best rejuvenation workload is applied to your circuit all the time. The procedure then produces a user_test x overhead table. It takes the user Pz values obtained during the initial simulation of the user workloads and the current best nzeri, and mixes them with weights: u_zer * (1-overhead) + best_zer * overhead. The resulting Pz is used for one more NBTI-aware static delay analysis, and the resulting delay is entered into the table.
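The weighted mix is simple enough to show in isolation. In this standalone sketch `mix_pz` is a hypothetical helper name and the Pz values are made up; only the formula u_zer * (1-overhead) + best_zer * overhead comes from the description above.

```python
def mix_pz(user_pz, best_pz, overhead):
    """Blend two net -> Pz maps.

    overhead is the fraction of time (0..1) during which the rejuvenation
    workload is applied instead of the user workload.
    """
    return dict((net, user_pz[net] * (1 - overhead) + best_pz[net] * overhead)
                for net in user_pz)

user = {'I2': 1.0, 'N8': 0.5, 'N10': 0.0}  # made-up user workload Pz
best = {'I2': 0.2, 'N8': 0.5, 'N10': 0.0}  # made-up rejuvenation Pz

for overhead in (0.0, 0.01, 0.1, 1.0):
    print(overhead, mix_pz(user, best, overhead))
```

At overhead 0 the result equals the user map and at overhead 1 it equals the rejuvenation map; each mixed map would then be passed to the static analysis to fill one table cell.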

val-eval.py

The actual simulation, where the Pz values are counted and the critical path delay is computed, takes place in val-eval.py. It starts by running through all the gates in the NAND/NOR netlist of interest

for smt in module.getStructure().getStatements():
    if isinstance(smt, org.zamia.instgraph.IGInstantiation):
        gates.add(Gate(smt))

which not only creates the list of gates but also, in the Gate constructor, adds all gate inputs into the nzeri map, telling which signals need to be traced in the simulation. Similarly, the drivers map registers which gate drives which line. Moreover, initially, all inputs of all gates are registered as primary inputs (PI) and all outputs as primary outputs (PO). This is a trick: subtracting drivers.keys from PI and nzeri.keys from PO, we are left with the true primary inputs and outputs. Indeed, if we subtract all gate inputs in the netlist from all gate outputs, we are left with the outputs that no gate consumes, which are considered PO.
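The subtraction trick can be reproduced on a toy netlist with plain Python sets. The three-gate netlist below is made up, and the variable names only loosely follow those of val-eval.py.

```python
# Hypothetical netlist: gate output net -> list of gate input nets.
netlist = {
    'N1': ['A', 'B'],
    'N2': ['B', 'C'],
    'Y':  ['N1', 'N2'],
}

driven = set(netlist)  # nets driven by some gate (keys of the drivers map)
gate_inputs = set(i for ins in netlist.values() for i in ins)  # keys of nzeri

pi = gate_inputs - driven   # gate inputs that no gate drives
po = driven - gate_inputs   # gate outputs that no gate consumes

print("PI:", sorted(pi))
print("PO:", sorted(po))
```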

Since we do not know how many gates load the primary outputs, we set all PO loads to 1. For the other gate outputs, the load is the number of gate inputs that the output drives

# compute fan-outs
for g in gates:
    for i,_ in g.inputs():
        if i not in pi: drivers[i].load += 1

Next, we define a delay function that computes the gate delay given the gate type, the gate load, the input number, and Pz at that input. After that, we register a monitor that wakes up at every clk in the simulator and counts the number of times 0 is observed at the nets of our NAND/NOR netlist. That count divided by the number of clocks is the Pz of that net. Please note that this method of computing Pz assumes that signal transitions are instantaneous and that the value then stays for the whole clock period.
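The Pz bookkeeping amounts to a per-net zero counter. The sketch below is standalone Python with a made-up trace; `count_pz` is a hypothetical name, and the real monitor is driven by the Zamia simulator rather than by a list of samples.

```python
def count_pz(samples):
    """samples: one dict net -> 0/1 per clock; returns the map net -> Pz."""
    zeros = {}
    for sample in samples:
        for net, value in sample.items():
            zeros[net] = zeros.get(net, 0) + (1 if value == 0 else 0)
    clocks = len(samples)
    return dict((net, count / float(clocks)) for net, count in zeros.items())

# Made-up trace of two nets over four clocks.
trace = [
    {'I2': 0, 'N8': 1},
    {'I2': 1, 'N8': 0},
    {'I2': 1, 'N8': 0},
    {'I2': 1, 'N8': 1},
]
pz = count_pz(trace)
print(pz)
```

Sampling once per clock is exactly where the assumption of instantaneous transitions enters: whatever happens between two samples is not counted.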

Then, the gates are levelized

while len(gatesToProcess) > 0:
    newLevel = Set(); postponed = Set()
    for gate in gatesToProcess:
        # a gate is ready once all of its input nets have been processed
        canProcess = all(gi in processedNets for gi,_ in gate.inputs())
        #printf("%s can process %s at level %s", gate.label(), canProcess, len(gateLevels))
        (newLevel if canProcess else postponed).add(gate)
    gatesToProcess = postponed
    for g in newLevel: processedNets.add(g.out)
    gateLevels.append(newLevel) # append the level once, outside the loop over its gates

Levelization speeds up the aging-aware static analysis about 4 times. AgingAwareStaticAnalysis traverses the gates level by level, tracing the longest delay from the primary inputs to each gate. Once this is complete, the delays to the primary outputs are known. It does so twice, once for a rising and once for a falling edge at the gate output, recording both delays. After that, the worst path is found.
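To make the traversal concrete, here is a minimal standalone sketch of a levelized, edge-aware longest-delay computation. It is not the code of val-eval.py: the netlist is made up, the delay model (`NOMINAL`, `MAX_AGING`, `gate_delay`) is a toy assumption that ignores gate type and load, and all gates are assumed to be inverting.

```python
NOMINAL = 1.0     # assumed unaged gate delay, arbitrary units
MAX_AGING = 0.5   # assumed extra delay at Pz = 1, arbitrary units
FLIP = {'r': 'f', 'f': 'r'}

def levelize(gates, primary_inputs):
    """Group gate outputs into levels; a gate is ready once its inputs are known."""
    known = set(primary_inputs)
    todo = set(gates)
    levels = []
    while todo:
        level = set(g for g in todo if all(i in known for i in gates[g]))
        if not level:
            raise ValueError("combinational loop or undriven net")
        levels.append(level)
        known |= level
        todo -= level
    return levels

def gate_delay(input_edge, pz):
    """Toy model: only a rising input edge pays the NBTI penalty."""
    return NOMINAL + (MAX_AGING * pz if input_edge == 'r' else 0.0)

def longest_delay(gates, primary_inputs, pz):
    """Largest arrival time over all nets, tracked per output edge."""
    arrival = dict(((pi, e), 0.0) for pi in primary_inputs for e in 'rf')
    for level in levelize(gates, primary_inputs):
        for out in level:
            for out_edge in 'rf':
                in_edge = FLIP[out_edge]  # inverting gate flips the edge
                arrival[(out, out_edge)] = max(
                    arrival[(i, in_edge)] + gate_delay(in_edge, pz[i])
                    for i in gates[out])
    return max(arrival.values())

# Hypothetical netlist: gate output net -> list of gate input nets.
gates = {'N1': ['A', 'B'], 'N2': ['B', 'C'], 'Y': ['N1', 'N2']}
inputs = ['A', 'B', 'C']
nets = ['A', 'B', 'C', 'N1', 'N2']

fresh = longest_delay(gates, inputs, dict.fromkeys(nets, 0.0))
aged = longest_delay(gates, inputs, dict.fromkeys(nets, 1.0))
print(fresh, aged)
```

In this toy model the two-gate path ages by only one penalty, because the edge alternates and only one of the two gates sees a rising input.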

self-test.py

self-test.py contains useful examples of using the simulation and the aging-aware static analysis. You can run the whole self-test or any part of it in the ZamiaCAD python editor after creating a ZamiaCAD project. When creating the new Zamia project, specify the evaluator folder that contains the vhdl files instead of the "default location".

Examples include

u268 = drivers['N265'] # a gate existing in ALU4
delay = gateDelay(u268, 1, 'A')

for computing gate delay from input A to output or

design = 'C17_TB' ; CurDir = str(project.fBasePath) + "/"
#rebuild() # Project must be built in Zamia beforehand.
execfile(CurDir + "val-eval.py")

#compute delays if all pz are one and the same constant value
def sa(pz): printf("%s after %s years at Pz %s => delay " % (design, yearsOfAging, pz)
     + (" %.2f" % agingAwareStatisAnalysis(dict.fromkeys(nzeri, pz)).delay()))

sa(pz = 0) ; sa(pz = 1) #for i in [0.0, 0.0001, 0.001, 0.01, 0.1, 0.999, 0.5, 1]: sa(i)

for pz in [0, .5, 1]:
    path = agingAwareStatisAnalysis(dict.fromkeys(nzeri, pz)) ;
    printf("pz %s: longest = %s" % (pz, path))

to get (unrealistic but informative) delays when all nets have the same probability of logic 0, which prints

C17_TB after 10 years at Pz 0 => delay  11.33
C17_TB after 10 years at Pz 1 => delay  14.79
pz 0: longest = fI3 + rN8=4.66666666667 + fN10=8.0 + rY10=11.3333333333
pz 0.5: longest = rI2 + fN8=5.05822451192 + rN10=8.39155784526 + fY10=12.0045753538
pz 1: longest = rI3 + fN8=6.70975666667 + rN10=10.04309 + fY10=14.7916733333

which agrees well with our generator.
Finally, there is a fold function on the path for user convenience. Once we have obtained the longest path

design = 'C17_TB' ; execfile(CurDir + "val-eval.py")
iniTestDir =  re.sub('_TB', '', re.sub('_VECTORIZED', '', CurDir + design + "_tests")) + "/"
#for test in filter(lambda file: file.endswith(".tst"), os.listdir(iniTestDir)):
simulate(iniTestDir + "best_0.tst") ;
longestPath = agingAwareStatisAnalysis(nzeri); printf("longest = %s" % longestPath)

we can use fold

printf("folded nets = %s", longestPath.parent().fold((), lambda acc, path: acc + (path.computedNet.net,)))
lpz = longestPath.parent().fold((), lambda acc, path: acc + (nzeri[path.computedNet.net],))
def static(pzets): return sum(pz > doubleAgingProb for pz in pzets)
printf("fraction of static nets %s = %s/%s" % (lpz, static(lpz), len(lpz)))

where the first fold simply collects the nets on the path, whereas the second collects their Pz values, from which the number of nets stuck at 0 is counted. Thus, the output is

longest = rI2 + fN8=4.95916568168 + rN10=8.29249901502 + fY10=11.6258323483
folded nets = (u'I2', u'N8', u'N10')
fraction of static nets (0.16666666666666666, 0.5, 0.0) = 0/3

It is fine that the last line shows no stuck-at-0 nets, because we look at the critical path after simulating the best (from the aging point of view) workload, in which all critical nets are periodically rejuvenated.

userload-specific_overhead.py

This python script mostly mimics universal_overhead.py. However, whereas the universal_overhead script generates rejuvenation stimuli that are good for the circuit regardless of the workload, userload-specific_overhead.py generates stimuli for a given user workload and overhead percentage. Figuratively speaking, universal overhead is the physical exercise that makes your life longer if you do it all the time. Userload-specific overhead, in contrast, takes into account your daily activity, which may result in a different set of exercises for you.

An example command to run the script is given in its first line

set "ZAMIA_DATA_DIR=work" && del ..\ugp3.lok && zamiacad -q -f userload-specific_overhead.py -s "design='ALU4'"

Making nand/nor netlists

Although this is not directly related to rejuvenation, we used Synopsys Design Vision to synthesize RTL descriptions into nand/nor netlists.

Create a folder with your design files (alu.vhd and mlite.vhd) and run the following commands in Design Vision

cd temp/alu
set link_library "* class.db"
analyze -library WORK -format vhdl {alu.vhd  mlite.vhd}
elaborate ALU -architecture LOGIC -library DEFAULT
set_dont_use {class/AN2 class/AN2I class/AN2P class/AN3 class/AN3P class/AN4 class/AN4P class/AO1 class/AO1P class/AO2 class/AO2P class/AO3 class/AO3P class/AO4 class/AO4P class/AO5 class/AO5P class/AO6 class/AO6P class/AO7 class/AO7P class/B2I class/B2IP class/B3I class/B3IP class/B4I class/B4IP class/B5I class/B5IP class/BIDI class/BTS5 class/EN class/EN3 class/EN3P class/ENI class/ENP class/EO class/EO1 class/EO1P class/EO3 class/EO3P class/EOI class/EON1 class/EON1P class/EOP class/FD1 class/FD1S class/FD2 class/FD2S class/FD4 class/FD4S class/IBUF1 class/IBUF2 class/IBUF3 class/IBUF4 class/IBUF5 class/IV class/IVA class/IVAP class/IVDA class/IVDAP class/IVI class/IVP class/LD1 class/MUX21H class/MUX21HP class/MUX21L class/MUX21LA class/MUX21LAP class/MUX21LP class/MUX31L class/MUX31LP class/ND2 class/ND2I class/ND2P class/ND3 class/ND3P class/ND4 class/ND4P class/ND5 class/ND5P class/ND6 class/ND6P class/ND8 class/ND8P class/NR2 class/NR2I class/NR2P class/NR3 class/NR3P class/NR4 class/NR4P class/NR5 class/NR5P class/NR6 class/NR6P class/NR8 class/NR8P class/NR16 class/NR16P class/OBUF1 class/OBUF2 class/OR2 class/OR2I class/OR2P class/OR3 class/OR3P class/OR4 class/OR4P}
#remove_attribute { class/IV class/IVI class/ND2 class/ND2I class/NR2 class/NR2I} dont_use
remove_attribute { class/IV class/ND2 class/NR2} dont_use
set target_library "class.db"
compile

If synthesis produces no errors, you should be able to select the synthesized design and see its schematic. Finally, you can export the resulting netlist as VHDL.
