`
`41
`To optimize an iformat design for a particular application
`program, !he iforrnat system selects custom templates from
`operation issue statistics obtained from scheduling the pro(cid:173)
`gram. The iformat system then generates an iformat based
`on a combination of the custom templates and an abstract
`ISA specification.
`Tbe system uses a re-targetablc compiler to generate the
`operation issues statistics for a particular processor design.
`A:;; shown in FIG. 12, a module called the MOES extractor
`560 generates a machine description in a format called JO
`MOES.
`Tb is machine description re targets the compiler 564 to the
`processor design based on its abstract ISA specification 510
`and datapath specification 514. The compiler 564 then
`schedules a given application program 566 and generates
`operation issue statistics 568 regarding the usage of the
`operation groups in the instruction format templates. '01e
`system thea uses tbe frequency of use of the operations in
`each template by the application program lo compute cus(cid:173)
`tomized templates as shown in step 569. The customization
`process is automated in that it selects custom templates by
`minimizing a cost function that quantifies the static or
`dynamic code size and the decode cost (e.g., measured in
`chip area).
`The process of selecting instruction templates in the
`iformat based on scheduling statistics may be conducted as
`a stand-alone proce&s, or may be conducted in conjunction
`with the automated iformat design process. In the latter case,
`it may be used to provide an initial input specification of the
`desired ILP constraints to the automated iformat design JO
`process. Additionally, it·may be used lo optimize an existing
`iformat design.
`The system may also perform additional oplimizatioa by
`using variable-length field encodings to further reduce the
`instruction size. These optimized designs can lead lo dra- 35
`matic reductions in code size, as showa in the detailed
`description below.
`7.2 !mplementation of the Input Specification
`The principal input of the iformal design process is an
`Abstract Instruction Set Architecture (ISA) specification 40
`510. In the current implementation, the user or another
`program module may provide this specification as an Arch(cid:173)
`Spec in texrual form.
`An ArchSpec reader module converts the textual form of
`the ArchSpec to an abstract ISA spec data structure, which 45
`contains a machine-readable set of tabular parameters and
`constraints, including register file entries, operation groups,
`and exclusion/concurrency relationships.
`7.3 Instruction Syntax
`VLIW processors issue instructions having multiple so
`instruction fields. An instruction field is a set of bit positions
`intended lo be interpreted as an atomic unit within some
`instruction context. Familiar examples are opcode fields,
`source and destination register specifier fields, and literal
`fields. Bits from each of these fields flow from the instruc- 55
`lion register lo control ports in the data path. For example,
`opcode bits flow to functional uoits, and source register bits
`flow lo register file read address ports. Another common
`type of instruction field is a select field. Select fields encode
`a choice between disjoint alternatives and communicate this 60
`context to the decoder. For example, a select bit may indicate
`whether an operand field is to be interpreted as a register
`specifier or as a short literal value.
`An operation is the smallest unit of execution; ii com(cid:173)
`prises an opcode, source operaods, and destination operands. 65
`Each operand may support one or more operand types. A set
`of possible operand types is called an io-set. A list of io-sets,
`
`42
`one per operand, form an operation's io-formal. For
`example, suppose an add operation permits its left source
`operand to be either an integer register or a short literal
`value, and suppose iL' right source and destination operands
`source and sink from integer registers. 'foe corresponding
`io-sets are (gpr, s}, {gpr}, {gpr}. The io-fonnal is simply
`this list of io-sets, which are abbreviated in shorthand
`notation as follows:
`gpr s, gpr . gpr
`Closely related operations such as add and subtract often
`have the same io-formal. One reason for this is that related
`operations may be implemented by a single, multi-function
`unit (macro-cell). As discussed above, to simplify tbe
`instruction formal design process, related operations are
`15 grouped into operation groups.
`111e instruction format assigns sets of op groups (called
`super groups) to slots of an instruction. The processor issues
`operations within an instruction from these slots concur(cid:173)
`rently. To fully specify an operation, the instruction format
`20 specifies both an op-group and an opcode (specific lo that
`opgroup ). In effect, tbis organization factors a fiat opcode
`name space into a multi-tier encoding. In rare cases, this
`factorization may increase the encoding length by one bit
`per level. However, it should be noted that this approach
`25 does not preclude a fiat encoding space: placing each
`operation in its own op-group eliminates the factorization.
`More importantly, hierarchical encoding often gives the
`same benefits as variable-length field encoding, but is sim-
`pler lo implement.
`7 ,4 The Instruction Format Tree
`In a flat, horizontal instruction formal, all instruction
`fields are encoded in disjomt position.s within a single, wide
`instruction. A hierarchical instruction format allows exclu(cid:173)
`sive instruction fields (those that are not used simultaneously
`in any instruction) lo be encoded in overlapping bit
`positions, thereby reducing the overall instruction width, In
`the instruction formal design system shown in FIG. 12, the
`hierarchical relationship between instruction fields is repre(cid:173)
`sented by an instruction formal tree (if-tree). The leaves of
`an if-tree are instruction fields; where each leaf points to a
`control port in the data path, such as a register file address
`port, or an opcode input of a FU.
`FIG. 13 illustrates the structure of an if-tree used in the
`current implementation. 1be overall structure of the tree
`defines how each instruction is built. Each part of the tree
`represents a node, with the lowest nodes (the cut-off-box(cid:173)
`shaped nodes) forming the tree's leaves. The oval-shaped
`nodes are "OR" nodes, while the boxed-shaped nodes are
`"AND" nodes. The OR nodes denote a selection between the
`children of the node such that only one choice (one branch)
`extends lo the next level. Conversely, an AND node allows
`all of the components of the node to form new branches.
`Stated another way, each level of the tree is either a
`conjunction (AND) or disjunction (OR) of the subtrees al the
`lower level.
`The root node 632 of tbe tree is the overall machine
`instruction. This is an OR node representing a choice of
`instruction templates. A template select field (template ID) is
`used lo identify the particular template. This select field is
`illustrated as the leaf node labeled "steer" connected lo the
`instruction node 632.
`[ndividual instructions are based on instruction templates,
`which are the AND-type child nodes of the root node (See,
`e.g., template.s 634 and 636). 1be templates each encode the
`sets of operations that issue concurrently. Since the 1mmber
`of combinatioos of operations that may issue concurrently is
`astronomical, it is necessary to impose some structure oo the
`
`0181
`
`Volkswagen 1008 - Part 2 of 3
`
`
`
`US 6,385,757 Bl
`
`43
`encoding within each lemplale. Hence, each template is
`parlitiorled into one or more operation issue sloL>. Every
`combination of operations assigned lo th"''e slots may be
`issued concmrently.
`In addition, each template has a consume to end-of-packet s
`bit field (CEP) that indicates whether the next instruction
`directly follows the current instruction or ii starts at lhe next
`packet boundary. This capability is used to align certain
`inslrnclions (e.g. branch targets) lo known address bound(cid:173)
`aries. Each template also specifies the number of spare bits 10
`that may be used to encode the number of no-op cycle to
`follow the current instruction. These spare bits may arise due
`to a need for packet alignment or quantized allocation.
`1he next level of the tree defines each of the concurrent
`issue sloL~. Each slot is an OR node supporting a set of 15
`operation groups, called a super group (i.e., nodes 638, 640,
`642), that are all mutually exclusive and have tne same
`concurreoey pattern. A select field chooses among the vari(cid:173)
`ous operation groups within a super group. Again, this select
`field is illustrated as the leaf node labeled "steer'' connected 20
`to super group 640.
`Below each super group lie operation groups as defined in
`the input specification as described above. Each operation
`group (e.g., operation group 643) is an OR node that has a
`select field ("steer") to choose among tbe various operation
`formats supported by operation group. PIG. 13 shows this
`situation where one operation format allows a literal field on
`the left port, while the other allows it on the right port.
`Each operation format (e.g., IO fonnat descriptors 644,
`646) :is an AND node consisting of the opcode field 654, the 30
`predicate field (if any) 656, and a sequence of source and
`destination field types(shown as IO sets 648, 660, 652). The
`traditional three-address operation encoding is defined at
`this level.
`Each IO set is an OR node consisting of a singleton or a
`set of instruction fields that identify the exact kind and
`location of the operaod. IO sets with multiple choices (e.g.,
`650) have a select field to identify which instruction field is
`intended. For example, one of the IO set nodes 650 repre(cid:173)
`sents a selection between instruction fields 660, 662, which
`is controlled via a multiplexor select field 664. The other 10
`sets each have only one kind of field, and thus, have a single
`child node representing that field (nodes 658, 666). The
`instruction fields point to the datapath control ports 668.
`In implementing an instruction format, one principal 45
`design choice is whether lo use a single, fixed-length instruc(cid:173)
`tion format, or allow variable-length instructions. 'Il1e ifor(cid:173)
`mat design system supports both fixed and variable length
`instructions. The use of variable-length instructions pro(cid:173)
`duces more-compact code but increases decode complexity. 50
`The trade-off between code size and instruction decode
`complexity is a primary design consideration. A single,
`fixed-lengtb instruction format simplifies decode logic and
`of operations to functional units,
`the data path for
`but it often results
`poor code density, since the single 55
`format must accommodate the worst-case (longest) instruc(cid:173)
`tion. For example, if the longest instruction in a fixed-length
`instruction formal is 128 bil5 long, then all of the instruc(cid:173)
`tions in the instruction set must be 128 hits long. In order lo
`maintain a constant instruction length, many instructions 60
`will require the use of wasted bits whose sole purpose is to
`fill in unused space in the instructions. These wasted bits
`lead to increased code size. Conversely, variable-length
`instruclions can accommodate both wide and compact,
`restricted instruction formats without wasting bits, which 65
`results in a reduction in code size. By using variable-length
`instructions, the instruction format can accommodate the
`
`44
`wid"'st instructions where necessary, and make use of
`compact, restricted instruction formals, such as instructions
`that do not encode long literals.
`FIG. 14 shows tbe format of an instruction and its
`building blocks. Al the heart of the instruction is an instruc(cid:173)
`tion template 670. An instruction template encodes sets of
`operations that issue concurrently. Each template includes
`multiple concurrent slots 672, where each slot comprises a
`set of exclusive operation groups 674. Since all of the
`operations in an operation group are exclusive, all of the
`operations in each slot are also exclusive. Each template
`encodes the cross-product of the operations in each of its
`slors.
`11ie length of each template is variable, depending in part
`on the length and number of the slots in the template. For
`example, some templates might have two slots, while other
`templates
`have three or four slots. Furthermore, the
`width of
`slot will depend on the width of the widest
`operation group within that slot, plus overhead, as shown in
`the lower portion of FIG. 14. There :is considerable similarity
`and overlap among the opcodes within an operation group
`by construction, so very little encoding space is wasted
`within the operation group. But the opcode field now must
`be split into an operation group selection field 676 and an
`opcode selection field 678 within the operation group. With
`logarithmic encoding, this requires at most one additional bit
`for encoding the opcode. For example, 15 opcodes may be
`encoded in 4 bits, while splitting them into 3 operation
`groups of 5 opcodes each requires (log,(3))+(1og,(5)]+5
`bits. In addition, every slot has a reserved no-op encoding.
`In cases where an op group has alternative operation
`formats, there is yet another select field to select the opera(cid:173)
`tion format.
`Each instruction also includes a consume to end-of-packet
`35 bit 680, and a template specifier 682. The template specifier
`identifies the template. An instruction format having t tem(cid:173)
`plates will need [log2(t) J bits to encode the template speci(cid:173)
`fier. This template specifier is in a fixed position within every
`instruction, and from its value, the instruction sequencer in
`40 the proces.sor's control path determines the overall instnic(cid:173)
`tion length, and thus the address of the subsequent instmc(cid:173)
`tioo.
`In the current implementation, the length of the instruc(cid:173)
`tion is variable, but each length is a multiple of a pre(cid:173)
`determined number of bits called a quantum. For instance, if
`the quantum is 8 bits, the length of the instruction could pe
`any number equal to or above some minimum value (say 32
`bit.s) that is divisible by 8, such as 64 bits, 72 bits, 80 biL<;,
`etc. One or more dnmmy bits may be placed as appropriate.
`within the instruction to ensure that the length of the
`instruction falls on a quantum boundary.
`The iformat system builds the level-, of the if-tree in an
`incremental fashion. It construct-, the top three levels, con(cid:173)
`sisting of the iostrnction, the templates, and the super groups
`from the abstract !SA specification, and optionally, custom
`templates. It constrncls the middle layers, including the
`operation groups, the operation formats, and the field types
`from the abstract !SA specification. Finally, it coastmcts the
`instruction fields from the contents oft.be various field types
`in tbe abstract ISA specification and the individual control
`ports in the datapath that each field is supposed to control.
`7.5 Instruction Templates
`A primary objective of tbe instruction format desigo
`system is to produce a set of instruction templates that
`support the encoding of all of the sets of operation groups
`th<tl can be issued concurrently. 1b initiate lhe template
`design process, the instruction format design system starts
`
`0182
`
`
`
`US 6,385,757 Bl
`
`46
`cess. The process starts witb the !LP constraints 681, wbicb
`define a set of exclusion rdationsbips 683 between operation
`groups 684. From these exclusion relationships, the iformat
`design system builds a boolean exclusion matrix 686. In the
`exclusion matrix 686, the rows and columns are matched up
`with respective operation groups, e.g., "A" corresponds to
`the operation group A, "B" corresponds to the operation
`gmup B, etc. The 1 's in the matrix indicate an exclusion
`relationship, while a blank indicates that the corresponding
`operation groups may be issued concurrently. (1be blanks
`are actually O's in the real matrix--blanks are used here for
`clarity). The system then builds a concurrency matrix 688
`from the exclusion matrix 686. The concurrency matrix 688
`is the complement of the exclusion matrix 686. The "?''s
`along the diagonal of the concurrency matrix 688 can be
`interpreted as either a 1 or 0.
`The rows in the concurrency matrix determine a set of
`concurrency neigbbors for each operation group.Agraphical
`representation of the relationships defined by the ccncur(cid:173)
`rency matrix 688 is shown in concunency graph 692. Eacb
`node represents an operation group, while each connecting
`"edge" represents a concurrency relation. A clique is a set of
`nodes from a graph where every pair of nodes is connected
`by an edge. For instance, there are 16 cliques in lbe
`concurrency graph 692.
`After the concurrency matrix is generated, the system
`compares the rows in the concurrency matrix to identify
`equivalent operation groups. The super groups are formed
`from the equivalent operation groups. Two operation groups
`are said to be equivalent if they have the same set of
`concurrency neighbors. Note that two mutually exclusive
`nperation groups tbat have the same set of concurrency
`neighbors can replace each other in any template without
`violating any exclusion constraint and therefore can be
`treated equivalently. Similarly, two concurrent operation
`groups that have tbe same set of concurrency neighbors
`(other than themselves) can always be placed together in a
`template without violating any exclusion constraints and
`therefore can be treated equivalently.
`An example of psendocode for performing equivalence
`checking and partitioning into super groups is illustrated
`below.
`
`25
`
`45
`out with the architecture specification, which defines the
`exclusion and concurrency constraints for a particular
`design. ln one implementation, the architecture specification
`directly provides the exclusion relationships between
`to s
`tion groups. However, the iformal design process
`know which opcodes can be issued concurrently, i.e., the
`concurrency relationship, rather tlran which opcodes must be
`exclusive.
`In such an implementation, the concurrency relationship
`is taken to be the complement of the exclusion relationship. rn
`One way of determining the concurrency relation is lo take
`the complement of the exclusion relations among opcodes
`implied by the architecture specification and treat each set of
`concurrent opcodes as a potential candidate for becoming an
`instruction template. While this provides an excellent start- 15
`ing point, it unfortunately does not lead to a practical
`solution, since the number of combinations of operations
`that may issue concurrently quickly becomes intractable.
`For example, a typical VLJW machiue specification may
`include 2 integer ALUs, 1 floating point ALU and 1 memory 20
`u.nit, with 50 opcodes each. In such a machine tbe total
`number of distinct 4-issue instructions is 502 x50x50=6,250,
`000. Specializing instructions to 1, 2, and 3-issue templates
`would add many more. It is therefore necessary lo impose
`some structure on the encoding within each template.
`Our current implementation uses several mechanisms to
`reduce the complexity of the problem. These mechanisms
`represent iformat design decisions and affect the final
`instruction fonnat layout and size. In most cases there may
`al.so be a tradeoff between the simplicity and orthogonality 30
`of the field layout (and hence the decode hardware) and the
`size of the instruction template. These tradeoffs will be
`described as the design process is detailed below.
`AF. a first axiom, all templates must satisfy an exclusion
`constraint between two opcodes, i.e. these opcodes must 35
`never occupy separate slots in any template. 'D1is is because
`tbese opcodes may share hardware resources during
`execution, and therefore, the scheduler should never put
`these opcodes together within the same instruction. Oo the
`other hand, a concurrency constraint between two opcodes 40
`implies that the scheduler is free lo issue these opcodes
`together in a single instruction and therefore there should be
`some template in which these two opcodes are allowed to
`occur together. In particular, that template may contain
`additional slots that can be filled with noops, if necessary.
`Tberefore, it is unnecessary to generate a special template
`for each concurrency constraint, but rather all that is needed
`is a set of templates that can effectively cover all possible
`sets of concurrently scheduled opcodes_
`The problem becomes greatly simplified when the con(cid:173)
`currency of operation groups is considered instead of indi(cid:173)
`vidual opcodes. As introduced above, operation groups are
`defined as sets of opcode instances that are generally similar
`in nature in terms of their latency and connectivity to
`physical register files and are expected to be mutually
`exclusive with respect to operation issue. All opcodes within
`an operation group must be mutually exclusive by definition.
`Furthermore, the instruction format is designed so that all
`opcodes within an operation group sbare the same instruc(cid:173)
`tion fields. Tbus, the operation group is an obvious choice
`for the primary building block for creating templates.
`Another simplification involves classifying mutually(cid:173)
`exclusive operation groups into equivalence classes called
`super groups based on the constraints provided in the
`architecture specification. FIG. 15 illustrates an example that 65
`shows how the opera lion groups (sbown as letters) ancl
`exclusion relations arc used in the template selection pro-
`
`Proceduref'iodSuperGroups (BitM:atri.x con.cur)
`45 1:
`// "concuf' is a (numNodes x nurn.Nodes) boolean matrix
`{/First, initialize supergroup hash table and id counter
`2:
`3:
`Hash.Map;Bi.tVector, inll SGmap
`int Sf;COUOt ... O;
`4:
`for (i = 0 to numNodes-1) do
`5:
`//cxt:acl C8ch node's vector of n.eighbo1s w/ an.d w/o self
`6:
`SO 7:
`Bit Vector AND-group - ooncur.row(i) .aet_bit(i);
`8:
`BilVectcr OR~group - c.oncur.row(i) .reset_bit(i);
`9:
`//Check for existing AND-style st1pergroup for this node
`if (SGruap(AND·group) is already bound) then
`10:
`SGkind(i) = SO-AND;
`11:
`SGid(i) = SGtnap(AND·group);
`12:
`55 13:
`//Check for existing OR-style suporgroup for this node
`else if (SGmap(OR·gruup) is already bound) then
`14:
`15:
`SGkind(i) - SG·OR
`16:
`SGid(i) - SGmap(OR-group);
`//If neither neighbor relation is present, start a new
`17:
`18;
`f/supergroup with the ne:w neighbor re[ations
`GO ~~:
`else SGid(i) = S<Ycount;
`21:
`SGmap(AND-group) SG01ap(OR-group) SGcount;
`SGcount = SGcount + 1;
`22:
`cndif
`23:
`24:
`e1td[or
`
`The equivalence check and the partitioning can be per(cid:173)
`formed quickly by employing the pigeon-hole principle. The
`
`0183
`
`
`
`US 6,385,757 Bl
`
`47
`algorithm hashes each operation group using its set of
`neighbors in the concurrency matnx as the key, 11ie neigh(cid:173)
`bor relations (neighbor keys) for each operation group (each
`row) are converted 1D bilvectors. The algorithm hashes in
`two ways: once by treating each operation group as concur(cid:173)
`rent with itself (AND-style) thereby finding equivalent con(cid:173)
`current operation groups, and the second time by treating
`each operation group as exclusive with itself (OR-s!yle)
`thereby finding equivalent exclusive operation groups, Thi~
`hashing approach results io two bitvectors for each operation
`grou!'-"ne with the"?" entry changed to a 1 (AND-style),
`and one with the "?" entry changed to a 0 (OR-style).
`Bitvectors (operation groups) that hash to the same bucket
`necessarily have the same concurrency neighbors and there(cid:173)
`fore become part of the same super group. For example in
`FIG. 15, operation groups A, B, and C have the same 15
`concurrency neighbors and thus form the super group {A, B,
`C}. The other super groups, {P, Q}, {X, Y}, and {M, N}, are
`similarly formed. The set of all distioct super groups is
`defined by all the distinct neighbor keys. This partitioning
`leads to a reduced-concurrency (super group) graph 694, 20
`comprisiog the super groups and their concurrency relations.
`Instruction templates 696 are obtained from the reduced
`concurrency graph, as described below.
`Each operation group identifies whether it is anAND·lYPe
`or OR-type super grou]J. This information is used in the final 25
`template expansion, where each operation group from an
`AND-type super group is given a separate slot, while all
`operation groups from an OR-type super group are put into
`the same slot.
`In the concurrency matrix 690 shown in FIG. 15, the"?" 30
`entries o( the "A", "B", and "C" operation group bitvectors
`have been changed to O's so that their corresponding bitvec(cid:173)
`tors are identical. 'Thus, "A:', "B", and "C" form an OR-type
`super group {A, B, C}, and each operation group is placed
`in the same slot.
`FIG. 16 shows a case with an AND-type and an OR-type
`super group. In order to obtain identical bitvectors, the "A",
`"B", and "C" operation groups are treated as being concur(cid:173)
`rent with tbernselves. As a result, they form an AND-tYPe
`super group and are placed in separate template slots. In 40
`contrast, the "M", "N", "X", and 11Y" operation groups are
`treated as exclusive with them.selves and form two different
`sets of OR-type super groups {M,N} and {X,Y}, which each
`occupy a single slot
`For a homogenous VLIW-style machine with multiple, 45
`orthogonal functional units thL'i process yields tremendous
`savings by reducing the complexity of the problem to just a
`few iodependent super groups. Tbe resulting instruction
`templates closely match super groups to independent issue
`slots for each functional unit. For a more heterogeneous 50
`machine with shared resources, the resulting number of
`templates may be larger and the decoding is more complex
`but partitioning the operation groups into super groups still
`reduces the complexity of the problem significantly.
`7.6 Concurrency Cliques and Templates
`Once the super groups have been determined, each clique
`in the reduced concurrency graph is a candidate for an
`instruction template since it denotes a set of super groups
`tbat may be issued in parallel by the scheduler. A clique is
`a subgraph in which every node is a neighbor of every other 60
`node. Clearly, enumerating all cliques would lead to a large
`number of templates. On the other hand, unless the concur(cid:173)
`rency among super groups is restricted in some other way,
`it is necessary lo choose a set of templates that cover all
`possible cliques of the super group graph to ensure that !lie 65
`scheduler is not restricted in any way other than that
`specified in the Ard1Spec.
`
`48
`As an example, suppose super groups A, B and C only
`have pairwise concurrency constraints, i.e., {AB}. {AC},
`and {BC}. These pairwise concurrencies can be covered in
`one of two ways. First, the pairwise concurrency constraints
`can be treated as three independent templates AB, AC, and
`BC, each requiring lwo issne slots. A second possibility is lo
`treat the pairwise concurrencies as being simultaneously
`concurrent, thereby requiring only one template (ABC) with
`three issue slois. Strictly speaking, this allows more paral(cid:173)
`lelism than what was intended, lf the compiler never sched(cid:173)
`uled all three operations simultaneously, the second design
`would end up carrying one ooop in every instruction 111ereby
`wasting one-third of the program space, On the other hand,
`the first design requires additional decoding logic lo select
`among the three templates and more complex dispersal of
`the instruction bits to the various functional units,
`In the present scheme, this tradeoff is made tow•rds
`initially choosing a reduced number of possibly longer
`templates, 11iis is partly due to the fact that the ArchSpec
`does not directly specify concurrency in most instances, but
`rather specifies exclusion relations among operation groups
`that are then complemented to obtain concuneocy relations.
`During the initial template design phase, choosiog the maxi·
`mally concurrent templates covers all possible concurrency
`relations with as few templates as possible.
`The maximally concurrent !emplates may be determined
`by finding tbe cliques of the super group graph. An example
`of a simple reduced super group concurrency graph fo shown
`in FIG. 17. The graph comprises super groups 1-7, and their
`interconnecting edges. The maximal cliques for such a
`simple graph ca.n be determined by hand by siniply identi-
`fying sets of nodes that are completely connected-that is
`each node in a clique must connect to the remaining nodes
`in the clique, For instance, { 1, 3, 7} is a clique, while {2, 4,
`35 5, 6} is not (nodes 5 and 6 are not connected). In the
`supergraph of FIG. 6, there are seven maximal cliques, and
`thus seven maximally concurrent templates,
`It is necessary to use computational means to calculate the
`cliques for more complex super group graphs, The instruc(cid:173)
`tion format designer uses the same approach for finding
`cliques as the datapath synthesizer described above.
`7.7 Set-Up of Bit Allocation Problem
`Once the templates are selected, the iforrnat system con(cid:173)
`structs the lower levels of the IF tree, 'The tern plates form the
`upper level of the tree. For each of the operation groups in
`a template, the system extracts the inputs and outputs for
`each operation based on their f/0 formats in the .abstract ISA
`specification and adds this information to the IF tree, Using
`the extracted 1/0 fom1ats, the system enumerates the instruc·
`tion fields for each of the operation groups associated with
`the templates. Next, it builds field conflicts, partitions
`instruction fields into superfields, and extracts bit widtb
`requirements.
`7 .7 .1 Instruction Fields
`As shown in FIG, 13, the instructi.on fields form the leaves
`of the if-tree. Each instruction field corresponds to a data(cid:173)
`path control port such as register file read/write address
`ports, predicate and opcode port> of functional units, and
`selector port5 o[ multiplexors, Each field reserves a certain
`number of instruction bits to control the corresponding
`control port.
`1ne iforrnat designer assigns each field to a control port
`by traversing the if tree lo find the operation group associ(cid:173)
`ated with the field, and then extracting the fonctional unit
`assigned to the operation group in the datapath specification.
`The following sub-sections describe various kinds of
`instruction fields. FIG. 20 is annotated with letters S, A, L,
`
`10
`
`55
`
`0184
`
`
`
`us 6,385,757 Bl
`
`50
`49
`7.7 .2 Computing Field Conflicts
`op and C to illustrate examples of lhe information flowing
`Before performing graph coloring, tbe system computes
`from these fields in the instruction register to the control
`the pairwise conflict relation between instruction fields,
`ports in the data path.
`which are represented as an undirected conflict graph.
`Select fields (S). At each level of the if-tree that is an OR
`two leaf nodes (instruction fields) conflict if
`In the
`node, there is a select field that chooses among the various
`Least-common ancestor is an AND node.
`and only if
`alternatives. The number of alternatives is given by the
`The system computes pairwise conflict relations using a
`number of children, n, of the OR node in the if-tree.
`bottom-up data flow analysis of the if-tree. TI1e procedure in
`Assuming a simple binary encoding, the bit requirement of
`the implementation maintains a field set, F, and a conflict
`the select field is then log2(n) bits.
`Different select fields are used lo control different aspects 10 relation, R. Set F" is lbe set of instruction fields in the subtree
`rooted al node n. Relation R" is the conflict relation for the
`of the datapalh. 'Die root of the if-tree has a template select
`subtree rooted at node n.
`field that is routed directly to the instruction unit control
`The procedure proces.ses nodes in bottom-up order as
`logic in order lo determine the template width. ll also
`follows:
`specifies where the supergroup select fields are positioned. 15
`Leaf node: Al a leaf node, /, the field set is initialized lo
`Therefore, lb.is field must be allocated at a fixed position
`contain !he leaf node, and the conflict relation is empty.
`within tbe instruction. Together with the template select
`or-node: At an OR-node, the field set is the union of field
`fields, the select fields al super group and operatio