THE ART OF
COMPUTER PROGRAMMING

FASCICLE 1

MMIX

DONALD E. KNUTH  Stanford University

ADDISON–WESLEY
PREFACE

fas·ci·cle \ˈfasəkəl\ n . . . 1: a small bundle . . . an inflorescence consisting of
a compacted cyme less capitate than a glomerule
. . . 2: one of the divisions of a book published in parts

This is the first of a series of updates that I plan to make available at
regular intervals as I continue working toward the ultimate editions of The Art
of Computer Programming.

I was inspired to prepare fascicles like this by the example of Charles Dickens,
who issued his novels in serial form; he published a dozen installments of Oliver
Twist before having any idea what would become of Bill Sikes! I was thinking
also of James Murray, who began to publish 350-page portions of the Oxford
English Dictionary in 1884, finishing the letter B in 1888 and the letter C in
1895. (Murray died in 1915 while working on the letter T; my task is, fortunately,
much simpler than his.)

Unlike Dickens and Murray, I have computers to help me edit the material,
so that I can easily make changes before putting everything together in its final
form. Although I'm trying my best to write comprehensive accounts that need
no further revision, I know that every page brings me hundreds of opportunities
to make mistakes and to miss important ideas. My files are bursting with notes
about beautiful algorithms that have been discovered, but computer science has
grown to the point where I cannot hope to be an authority on all the material
I wish to cover. Therefore I need extensive feedback from readers before I can
finalize the official volumes.

In other words, I think these fascicles will contain a lot of Good Stuff, and I'm
excited about the opportunity to present everything I write to whoever wants
to read it, but I also expect that beta-testers like you can help me make it
Way Better. As usual, I will gratefully pay a reward of $2.56 to the first
person who reports anything that is technically, historically, typographically,
or politically incorrect.

Charles Dickens usually published his work once a month, sometimes once
a week; James Murray tended to finish a 350-page installment about once every
18 months. My goal, God willing, is to produce two 128-page fascicles per year.
Most of the fascicles will represent new material destined for Volumes 4 and
higher; but sometimes I will be presenting amendments to one or more of the
earlier volumes. For example, Volume 4 will need to refer to topics that belong
in Volume 3, but weren't invented when Volume 3 first came out. With luck,
the entire work will make sense eventually.
Fascicle Number One is about \texttt{MMIX}, the long-promised replacement for \texttt{MIX}. Thirty years have passed since the \texttt{MIX} computer was designed, and computer architecture has been converging during those years towards a rather different style of machine. Therefore I decided in 1990 to replace \texttt{MIX} with a new computer that would contain even less saturated fat than its predecessor.

Exercise 1.3.1–25 in the first three editions of Volume 1 spoke of an extended \texttt{MIX} called MixMaster, which was upward compatible with the old version. But MixMaster itself has long been hopelessly obsolete. It allowed for several gigabytes of memory, but one couldn't even use it with ASCII code to print lowercase letters. And ouch, its standard subroutine calling convention was irrevocably based on self-modifying instructions! Decimal arithmetic and self-modifying code were popular in 1962, but they sure have disappeared quickly as machines have gotten bigger and faster. Fortunately the new RISC machines have a very appealing structure, so I've had a chance to design a new computer that is not only up to date but also fun.

Many readers are no doubt thinking, "Why does Knuth replace \texttt{MIX} by another machine instead of just sticking to a high-level programming language? Hardly anybody uses assemblers these days." Such people are entitled to their opinions, and they need not bother reading the machine-language parts of my books. But the reasons for machine language that I gave in the preface to Volume 1, written in the early 1960s, remain valid today:

- One of the principal goals of my books is to show how high-level constructions are actually implemented in machines, not simply to show how they are applied. I explain coroutine linkage, tree structures, random number generation, high-precision arithmetic, radix conversion, packing of data, combinatorial searching, recursion, etc., from the ground up.
- The programs needed in my books are generally so short that their main points can be grasped easily.
- People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird.
- Machine language is necessary in any case, as output of some of the software that I describe.
- Expressing basic methods like algorithms for sorting and searching in machine language makes it possible to carry out meaningful studies of the effects of cache and RAM size and other hardware characteristics (memory speed, pipelining, multiple issue, lookaside buffers, the size of cache blocks, etc.) when comparing different schemes.

Moreover, if I did use a high-level language, what language should it be? In the 1960s I would probably have chosen Algol W; in the 1970s, I would then have had to rewrite my books using Pascal; in the 1980s, I would surely have changed everything to C; in the 1990s, I would have had to switch to C++ and then probably to Java. In the 2000s, yet another language will no doubt be de rigueur. I cannot afford the time to rewrite my books as languages go in and out of fashion; languages aren’t the point of my books, the point is rather what you can do in your favorite language. My books focus on timeless truths.

Therefore I will continue to use English as the high-level language in *The Art of Computer Programming*, and I will continue to use a low-level language to indicate how machines actually compute. Readers who only want to see algorithms that are already packaged in a plug-in way, using a trendy language, should buy other people’s books.

The good news is that programming for MMIX is pleasant and simple. This fascicle presents

1) a programmer’s introduction to the machine (replacing Section 1.3.1 of Volume 1);
2) the MMIX assembly language (replacing Section 1.3.2);
3) new material on subroutines, coroutines, and interpretive routines (replacing Sections 1.4.1, 1.4.2, and 1.4.3).

Of course, MIX appears in many places throughout Volumes 1–3, and dozens of programs need to be rewritten for MMIX. Readers who would like to help with this conversion process are encouraged to join the MMIXmasters, a happy group of volunteers based at mmixmasters.sourceforge.net.

I am extremely grateful to all the people who helped me with the design of MMIX. In particular, John Hennessy and Richard L. Sites deserve special thanks for their active participation and substantial contributions. Thanks also to Vladimir Ivanović for volunteering to be the MMIX grandmaster/webmaster.

Stanford, California
May 1999

D. E. K.

You can, if you want, rewrite forever.

CONTENTS

Chapter 1: Basic Concepts
1.3’. MMIX
   1.3.1’. Description of MMIX
   1.3.2’. The MMIX Assembly Language
1.4’. Some Fundamental Programming Techniques
   1.4.1’. Subroutines
   1.4.2’. Coroutines
   1.4.3’. Interpretive Routines

Answers to Exercises

Index and Glossary
1.3’. MMIX

In many places throughout this book we will have occasion to refer to a computer's internal machine language. The machine we use is a mythical computer called "MMIX." MMIX—pronounced EM-micks—is very much like nearly every general-purpose computer designed since 1985, except that it is, perhaps, nicer. The language of MMIX is powerful enough to allow brief programs to be written for most algorithms, yet simple enough so that its operations are easily learned.

The reader is urged to study this section carefully, since MMIX language appears in so many parts of this book. There should be no hesitation about learning a machine language; indeed, the author once found it not uncommon to be writing programs in a half dozen different machine languages during the same week! Everyone with more than a casual interest in computers will probably get to know at least one machine language sooner or later. Machine language helps programmers understand what really goes on inside their computers. And once one machine language has been learned, the characteristics of another are easy to assimilate. Computer science is largely concerned with an understanding of how low-level details make it possible to achieve high-level goals.

Software for running MMIX programs on almost any real computer can be downloaded from the website for this book (see page ii). The complete source code for the author's MMIX routines appears in the book MMIXware [Lecture Notes in Computer Science 1750 (1999)]; that book will be called "the MMIXware document" in the following pages.

1.3.1’. Description of MMIX

MMIX is a polyunsaturated, 100% natural computer. Like most machines, it has an identifying number: the 2009. This number was found by taking 14 actual computers very similar to MMIX and on which MMIX could easily be simulated, then averaging their numbers with equal weight:

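\[
\begin{aligned}
&\bigl(\text{Cray I} + \text{IBM 801} + \text{RISC II} + \text{Clipper C300} + \text{AMD 29K} + \text{Motorola 88K} \\
&\quad + \text{IBM 601} + \text{Intel i960} + \text{Alpha 21164} + \text{POWER 2} + \text{MIPS R4000} \\
&\quad + \text{Hitachi SuperH4} + \text{StrongARM 110} + \text{Sparc 64}\bigr)/14 \\
&\quad = 28126/14 = 2009.
\end{aligned}
\] (1)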

The same number may also be obtained in a simpler way by taking Roman numerals.
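
The arithmetic behind the identifying number can be checked with a short sketch. The dictionary below simply reads one number off each machine name in (1); the helper names `identifying_number` and `roman_to_int` are illustrative, not part of any MMIX software.

```python
# One number per machine, read off the names in equation (1).
MODEL_NUMBERS = {
    "Cray I": 1, "IBM 801": 801, "RISC II": 2, "Clipper C300": 300,
    "AMD 29K": 29, "Motorola 88K": 88, "IBM 601": 601, "Intel i960": 960,
    "Alpha 21164": 21164, "POWER 2": 2, "MIPS R4000": 4000,
    "Hitachi SuperH4": 4, "StrongARM 110": 110, "Sparc 64": 64,
}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def identifying_number():
    """Average the 14 model numbers with equal weight."""
    return sum(MODEL_NUMBERS.values()) // len(MODEL_NUMBERS)


def roman_to_int(numeral):
    """Evaluate a Roman numeral, subtracting a digit that precedes a larger one."""
    total = 0
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN[current]
        total += -value if ROMAN.get(following, 0) > value else value
    return total


print(identifying_number(), roman_to_int("MMIX"))
```

Both routes give the same value, which is the point of the remark about Roman numerals.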

Bits and bytes. MMIX works with patterns of 0s and 1s, commonly called binary digits or bits, and it usually deals with 64 bits at a time. For example, the 64-bit quantity

\[1001111000110111011110011011100101111111010010100111110000010110\] (2)

is a typical pattern that the machine might encounter. Long patterns like this can be expressed more conveniently if we group the bits four at a time and use
hexadecimal digits to represent each group. The sixteen hexadecimal digits are

\[
\begin{align*}
0 &= 0000, & 4 &= 0100, & 8 &= 1000, & c &= 1100, \\
1 &= 0001, & 5 &= 0101, & 9 &= 1001, & d &= 1101, \\
2 &= 0010, & 6 &= 0110, & a &= 1010, & e &= 1110, \\
3 &= 0011, & 7 &= 0111, & b &= 1011, & f &= 1111.
\end{align*}
\]

We shall always use a distinctive typeface for hexadecimal digits, as shown here, so that they won’t be confused with the decimal digits 0–9; and we will usually also put the symbol # just before a hexadecimal number, to make the distinction even clearer. For example, (2) becomes

\[
\texttt{\#9e3779b97f4a7c16}
\] (4)

in hexadecimalese. Uppercase digits ABCDEF are often used instead of abcdef, because \texttt{\#9E3779B97F4A7C16} looks better than \texttt{\#9e3779b97f4a7c16} in some contexts; there is no difference in meaning.
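
As a small illustration of the grouping rule, the following sketch converts a bit string to hexadecimal four bits at a time. The constant `BITS` is the 64-bit pattern whose hexadecimal form is 9e3779b97f4a7c16, and `bits_to_hex` is a hypothetical helper name.

```python
BITS = "1001111000110111011110011011100101111111010010100111110000010110"


def bits_to_hex(bits):
    """Group the bits four at a time and map each group to one hex digit."""
    if len(bits) % 4 != 0:
        raise ValueError("the number of bits must be a multiple of 4")
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join("0123456789abcdef"[int(group, 2)] for group in groups)


print("#" + bits_to_hex(BITS))
```

Applying `str.upper` to the result gives the uppercase form, which denotes the same number.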

A sequence of eight bits, or two hexadecimal digits, is commonly called a byte. Most computers now consider bytes to be their basic, individually addressable units of information; we will see that an MMIX program can refer to as many as \(2^{64}\) bytes, each with its own address from \texttt{\#0000000000000000} to \texttt{\#ffffffffffffffff}. Letters, digits, and punctuation marks of languages like English are often represented with one byte per character, using the American Standard Code for Information Interchange (ASCII). For example, the ASCII equivalent of MMIX is \texttt{\#4d4d4958}. ASCII is actually a 7-bit code with control characters \texttt{\#00}–\texttt{\#1f}, printing characters \texttt{\#20}–\texttt{\#7e}, and a “delete” character \texttt{\#7f} [see CACM 8 (1965), 207–214; 11 (1968), 849–852; 12 (1969), 166–178]. It was extended during the 1980s to an international standard 8-bit code known as Latin-1 or ISO 8859-1, thereby encoding accented letters: pâté is \texttt{\#70e274e9}.

“Of the 256th squadron?”

“Of the fighting 256th Squadron,” Yossarian replied.

... “That’s two to the fighting eighth power.”

— JOSEPH HELLER, Catch-22 (1961)

A 16-bit code that supports nearly every modern language became an international standard during the 1990s. This code, known as Unicode or ISO/IEC 10646 UCS-2, includes not only Greek letters like Σ and σ (\texttt{\#03a3} and \texttt{\#03c3}), Cyrillic letters like Щ and щ (\texttt{\#0429} and \texttt{\#0449}), Armenian letters like Շ and շ (\texttt{\#0547} and \texttt{\#0577}), Hebrew letters like ש (\texttt{\#05e9}), Arabic letters like ش (\texttt{\#0634}), and Indian letters like श (\texttt{\#0936}) or শ (\texttt{\#09b6}) or ଶ (\texttt{\#0b36}) or ஷ (\texttt{\#0bb7}), etc., but also tens of thousands of East Asian ideographs such as the Chinese character for mathematics and computing, 算 (\texttt{\#7b97}). It even has special codes for Roman numerals: MMIX = \texttt{\#216f216f21602169}. Ordinary ASCII or Latin-1 characters are represented by simply giving them a leading byte of zero: pâté is \texttt{\#007000e2007400e9} à l’Unicode.
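
The character codes quoted above can be reproduced with Python's standard codecs. This is only a sketch: Python has no codec named UCS-2, so `utf-16-be` is used here on the assumption that every character involved lies in the 16-bit range, where the two encodings agree. The helper name `hex_of` is illustrative.

```python
def hex_of(text, encoding):
    """Encode text and show the resulting bytes in the '#' hexadecimal style."""
    return "#" + text.encode(encoding).hex()


PATE = "p\u00e2t\u00e9"                    # p, a-circumflex, t, e-acute
ROMAN_MMIX = "\u216f\u216f\u2160\u2169"    # Roman numeral characters M, M, I, X

print(hex_of("MMIX", "ascii"))
print(hex_of(PATE, "latin-1"))
print(hex_of(PATE, "utf-16-be"))
print(hex_of(ROMAN_MMIX, "utf-16-be"))
```

The 16-bit form of an ASCII or Latin-1 character is the 8-bit form preceded by a zero byte, as the third line of output shows.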
We will use the convenient term *wyde* to describe a 16-bit quantity like the wide characters of Unicode, because two-byte quantities are quite important in practice. We also need convenient names for four-byte and eight-byte quantities, which we shall call *tetrabytes* (or “tetras”) and *octabytes* (or “octas”). Thus

- 2 bytes = 1 wyde;
- 2 wydes = 1 tetra;
- 2 tetrabytes = 1 octa.

One octabyte equals four wydes equals eight bytes equals sixty-four bits.

Bytes and multibyte quantities can, of course, represent numbers as well as alphabetic characters. Using the binary number system,

- an unsigned byte can express the numbers 0 .. 255;
- an unsigned wyde can express the numbers 0 .. 65,535;
- an unsigned tetra can express the numbers 0 .. 4,294,967,295;
- an unsigned octa can express the numbers 0 .. 18,446,744,073,709,551,615.

Integers are also commonly represented by using *two’s complement notation*, in which the leftmost bit indicates the sign: If the leading bit is 1, we subtract \( 2^n \) to get the integer corresponding to an \( n \)-bit number in this notation. For example, \(-1\) is the signed byte \texttt{\#ff}; it is also the signed wyde \texttt{\#ffff}, the signed tetrabyte \texttt{\#ffffffff}, and the signed octabyte \texttt{\#ffffffffffffffff}. In this way

- a signed byte can express the numbers \(-128 \ldots 127\);
- a signed wyde can express the numbers \(-32,768 \ldots 32,767\);
- a signed tetra can express the numbers \(-2,147,483,648 \ldots 2,147,483,647\);
- a signed octa can express the numbers \(-9,223,372,036,854,775,808 \ldots 9,223,372,036,854,775,807\).
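
The two interpretations of a bit pattern can be sketched as follows. The function names `u` and `s` mirror notation used later in this section; `signed_range` is a hypothetical helper added only to reproduce the ranges listed above.

```python
def u(pattern, nbits):
    """Unsigned value of the low nbits of pattern."""
    return pattern & ((1 << nbits) - 1)


def s(pattern, nbits):
    """Two's complement value: subtract 2**nbits when the leading bit is 1."""
    value = u(pattern, nbits)
    return value - (1 << nbits) if value >> (nbits - 1) else value


def signed_range(nbits):
    """Smallest and largest integers expressible in nbits signed bits."""
    return -(1 << (nbits - 1)), (1 << (nbits - 1)) - 1


for nbits in (8, 16, 32, 64):
    all_ones = (1 << nbits) - 1
    print(nbits, s(all_ones, nbits), signed_range(nbits))
```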

**Memory and registers.** From a programmer’s standpoint, an *MMIX* computer has \(2^{64}\) cells of memory and \(2^8\) general-purpose registers, together with \(2^5\) special registers (see Fig. 13). Data is transferred from the memory to the registers, transformed in the registers, and transferred from the registers to the memory. The cells of memory are called \(M[0], M[1], \ldots, M[2^{64} - 1]\); thus if \(x\) is any octabyte, \(M[x]\) is a byte of memory. The general-purpose registers are called \(\$0, \$1, \ldots, \$255\); thus if \(x\) is any byte, \(\$x\) is an octabyte.

The \(2^{64}\) bytes of memory are grouped into \(2^{63}\) wydes, \(M_2[0] = M_2[1] = M[0]M[1], M_2[2] = M_2[3] = M[2]M[3], \ldots\); each wyde consists of two consecutive bytes \(M[2k]M[2k + 1] = M[2k] \times 2^8 + M[2k + 1]\), and is denoted either by \(M_2[2k]\) or by \(M_2[2k + 1]\). Similarly there are \(2^{62}\) tetrabytes

\[ M_4[4k] = M_4[4k + 1] = \cdots = M_4[4k + 3] = M[4k]M[4k + 1] \cdots M[4k + 3], \]

and \(2^{61}\) octabytes

\[ M_8[8k] = M_8[8k + 1] = \cdots = M_8[8k + 7] = M[8k]M[8k + 1] \cdots M[8k + 7]. \]

In general if \(x\) is any octabyte, the notations \(M_2[x], M_4[x],\) and \(M_8[x]\) denote the wyde, the tetra, and the octa that contain byte \(M[x]\); we ignore the least significant \(\lg t\) bits of \(x\) when referring to \(M_t[x]\). For completeness, we also write \(M_1[x] = M[x]\), and we define \(M[x] = M[x \bmod 2^{64}]\) when \(x < 0\) or \(x \geq 2^{64}\).

Fig. 13. The MMIX computer, as seen by a programmer, has 256 general-purpose registers and 32 special-purpose registers, together with \(2^{64}\) bytes of virtual memory. Each register holds 64 bits of data.
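
A minimal sketch of the \(M_t[x]\) notation follows. The `Memory` class, its sparse dictionary, and the choice that unset cells read as zero are all assumptions made for illustration; the sample bytes are arbitrary.

```python
class Memory:
    """Sparse byte-addressed memory; unset cells read as zero (an assumption)."""

    def __init__(self):
        self.cells = {}

    def set_byte(self, x, value):
        self.cells[x % 2**64] = value & 0xFF

    def read(self, x, t):
        """Return M_t[x] as an unsigned integer, for t in (1, 2, 4, 8)."""
        base = (x % 2**64) & ~(t - 1)   # ignore the low lg t address bits
        value = 0
        for i in range(t):              # most significant byte comes first
            value = (value << 8) | self.cells.get(base + i, 0)
        return value


memory = Memory()
for offset, byte in enumerate(bytes.fromhex("0123456789abcdef")):
    memory.set_byte(1000 + offset, byte)

print(hex(memory.read(1002, 2)), hex(memory.read(1003, 2)))
```

Addresses 1002 and 1003 name the same wyde, and any address is first reduced mod \(2^{64}\).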

The 32 special registers of MMIX are called rA, rB, ..., rZ, rBB, rTT, rWW, rXX, rYY, and rZZ. Like their general-purpose cousins, they each hold an octabyte. Their uses will be explained later; for example, we will see that rA controls arithmetic interrupts while rR holds the remainder after division.

Instructions. MMIX’s memory contains instructions as well as data. An instruction or “command” is a tetrabyte whose four bytes are conventionally called OP, X, Y, and Z. OP is the operation code (or “opcode,” for short); X, Y, and Z specify the operands. For example, \texttt{\#20010203} is an instruction with OP = \texttt{\#20}, X = \texttt{\#01}, Y = \texttt{\#02}, and Z = \texttt{\#03}, and it means “Set \$1 to the sum of \$2 and \$3.” The operand bytes are always regarded as unsigned integers.

Each of the 256 possible opcodes has a symbolic form that is easy to remember. For example, opcode $\#20$ is ADD. We will deal almost exclusively with symbolic opcodes; the numeric equivalents can be found, if needed, in Table 1 below, and also in the endpapers of this book.

The X, Y, and Z bytes also have symbolic representations, consistent with the assembly language that we will discuss in Section 1.3.2’. For example, the instruction \texttt{\#20010203} is conventionally written ‘ADD \$1,\$2,\$3’, and the addition instruction in general is written ‘ADD \$X,\$Y,\$Z’. Most instructions have three operands, but some of them have only two, and a few have only one. When there are two operands, the first is X and the second is the two-byte quantity YZ; the symbolic notation then has only one comma. For example, the instruction ‘INCL \$X,YZ’ increases register \$X by the amount YZ. When there is only one operand, it is the unsigned three-byte number XYZ, and the symbolic notation has no comma at all. For example, we will see that ‘JMP @+4*XYZ’ tells MMIX to find its next instruction by skipping ahead XYZ tetrabytes; the instruction ‘JMP @+1000000’ has the hexadecimal form \texttt{\#f003d090}, because JMP = \texttt{\#f0} and 250000 = \texttt{\#03d090}.
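
The packing of OP, X, Y, and Z into a tetrabyte can be sketched as below. Only the two opcodes quoted in the text (ADD = #20 and JMP = #f0) are included; the function names are hypothetical and this is not an assembler.

```python
OPCODES = {"ADD": 0x20, "JMP": 0xF0}   # just the two opcodes quoted in the text


def decode(tetra):
    """Split an instruction tetrabyte into its (OP, X, Y, Z) bytes."""
    return tuple((tetra >> shift) & 0xFF for shift in (24, 16, 8, 0))


def encode_three(op, x, y, z):
    """Three-operand form: one byte each for X, Y, and Z."""
    return (op << 24) | (x << 16) | (y << 8) | z


def encode_one(op, xyz):
    """One-operand form: XYZ is a single unsigned three-byte number."""
    if not 0 <= xyz < (1 << 24):
        raise ValueError("XYZ must fit in three bytes")
    return (op << 24) | xyz


print(decode(0x20010203))
print(hex(encode_one(OPCODES["JMP"], 1000000 // 4)))
```

The jump distance of 1000000 bytes is divided by 4 because XYZ counts tetrabytes.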

We will describe each MMIX instruction both informally and formally. For example, the informal meaning of ‘ADD \$X,\$Y,\$Z’ is “Set \$X to the sum of \$Y and \$Z”; the formal definition is ‘$s(\$X) \leftarrow s(\$Y) + s(\$Z)$’. Here $s(x)$ denotes the signed integer corresponding to the bit pattern $x$, according to the conventions of two’s complement notation. An assignment like $s(x) \leftarrow N$ means that $x$ is to be set to the bit pattern for which $s(x) = N$. (Such an assignment causes integer overflow if $N$ is too large or too small to fit in $x$. For example, an ADD will overflow if $s(\$Y) + s(\$Z)$ is less than $-2^{63}$ or greater than $2^{63} - 1$. When we’re discussing an instruction informally, we will often gloss over the possibility of overflow; the formal definition, however, will make everything precise. In general the assignment $s(x) \leftarrow N$ sets $x$ to the binary representation of $N \bmod 2^n$, where $n$ is the number of bits in $x$, and it signals overflow if $N < -2^{n-1}$ or $N \geq 2^{n-1}$; see exercise 5.)
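
The general rule for the assignment $s(x) \leftarrow N$ can be written out directly. In this sketch `assign_signed` and `add` are illustrative names, and overflow is reported as a boolean rather than as a machine event.

```python
def s(pattern, nbits=64):
    """Signed integer denoted by an nbits-wide bit pattern."""
    return pattern - (1 << nbits) if pattern >> (nbits - 1) else pattern


def assign_signed(n, nbits=64):
    """Model s(x) <- N: return (bit pattern of N mod 2**nbits, overflow flag)."""
    pattern = n % (1 << nbits)
    overflow = not (-(1 << (nbits - 1)) <= n < (1 << (nbits - 1)))
    return pattern, overflow


def add(y, z):
    """ADD on two 64-bit register patterns."""
    return assign_signed(s(y) + s(z))


pattern, overflow = add(0x7fffffffffffffff, 1)
print(hex(pattern), overflow)
```

Adding 1 to the largest signed octabyte wraps around to the most negative pattern and signals overflow.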

**Loading and storing.** Although MMIX has 256 different opcodes, we will see that they fall into a few easily learned categories. Let’s start with the instructions that transfer information between the registers and the memory.

Each of the following instructions has a memory address $A$ obtained by adding \$Y to \$Z. Formally,

$$A = (u(\$Y) + u(\$Z)) \bmod 2^{64} \qquad (5)$$

is the sum of the unsigned integers represented by \$Y and \$Z, reduced to a 64-bit number by ignoring any carry that occurs at the left when those two integers are added. In this formula the notation $u(x)$ is analogous to $s(x)$, but it considers $x$ to be an unsigned binary number.

- **LDB \$X,\$Y,\$Z (load byte):** $s(\$X) \leftarrow s(M_1[A])$.
- **LDW \$X,\$Y,\$Z (load wyde):** $s(\$X) \leftarrow s(M_2[A])$.
- **LDT \$X,\$Y,\$Z (load tetra):** $s(\$X) \leftarrow s(M_4[A])$.
- **LDO \$X,\$Y,\$Z (load octa):** $s(\$X) \leftarrow s(M_8[A])$.

These instructions bring data from memory into register \$X, changing the data if necessary from a signed byte, wyde, or tetrabyte to a signed octabyte of the same value. For example, suppose the octabyte $M_8[1002] = M_8[1000]$ is

$$M[1000]M[1001]\ldots M[1007] = \texttt{\#0123456789abcdef}. \qquad (6)$$

Then if \$2 = 1000 and \$3 = 2, we have $A = 1002$, and

- **LDB \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#0000000000000045}$;
- **LDW \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#0000000000004567}$;
- **LDT \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#0000000001234567}$;
- **LDO \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#0123456789abcdef}$.

But if \$3 = 5, so that $A = 1005$,

- **LDB \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#ffffffffffffffab}$;
- **LDW \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#ffffffffffff89ab}$;
- **LDT \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#ffffffff89abcdef}$;
- **LDO \$1,\$2,\$3** sets $\$1 \leftarrow \texttt{\#0123456789abcdef}$.

When a signed byte or wyde or tetra is converted to a signed octa, its sign bit is "extended" into all positions to the left.

- LDBU \$X,\$Y,\$Z (load byte unsigned): $u(\$X) \leftarrow u(M_1[A])$.
- LDWU \$X,\$Y,\$Z (load wyde unsigned): $u(\$X) \leftarrow u(M_2[A])$.
- LDTU \$X,\$Y,\$Z (load tetra unsigned): $u(\$X) \leftarrow u(M_4[A])$.
- LDOU \$X,\$Y,\$Z (load octa unsigned): $u(\$X) \leftarrow u(M_8[A])$.

These instructions are analogous to LDB, LDW, LDT, and LDO, but they treat the memory data as unsigned; bit positions at the left of the register are set to zero when a short quantity is being lengthened. Thus, in the example above, LDBU \$1,\$2,\$3 with \$2 + \$3 = 1005 would set $\$1 \leftarrow \texttt{\#00000000000000ab}$.

The instructions LDO and LDOU actually have exactly the same behavior, because no sign extension or padding with zeros is necessary when an octabyte is loaded into a register. But a good programmer will use LDO when the sign is relevant and LDOU when it is not; then readers of the program can better understand the significance of what is being loaded.

- LDHT \$X,\$Y,\$Z (load high tetra): $u(\$X) \leftarrow u(M_4[A]) \times 2^{32}$.

Here the tetrabyte $M_4[A]$ is loaded into the left half of \$X, and the right half is set to zero. For example, LDHT \$1,\$2,\$3 sets $\$1 \leftarrow \texttt{\#89abcdef00000000}$, assuming (6) with \$2 + \$3 = 1005.
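
The load instructions can be imitated with Python's `int.from_bytes`, which performs the sign extension or zero padding for us. This is a sketch under stated assumptions: the dictionary `MEMORY` holds only the eight bytes M[1000] through M[1007] with the value #0123456789abcdef, other cells read as zero, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

# Bytes M[1000] .. M[1007], holding the octabyte #0123456789abcdef.
MEMORY = {1000 + i: b for i, b in enumerate(bytes.fromhex("0123456789abcdef"))}


def field(address, size):
    """Bytes of M_size[address]; the low-order address bits are ignored."""
    base = address - address % size
    return bytes(MEMORY.get(base + i, 0) for i in range(size))


def load(address, size, signed):
    """LDB/LDW/LDT/LDO when signed is True, LDBU/LDWU/LDTU/LDOU otherwise."""
    value = int.from_bytes(field(address, size), "big", signed=signed)
    return value & MASK64            # register contents as a 64-bit pattern


def load_high_tetra(address):
    """LDHT: the tetrabyte goes into the left half of the register."""
    return (load(address, 4, signed=False) << 32) & MASK64


for size in (1, 2, 4, 8):
    print(f"{load(1002, size, True):016x}  {load(1005, size, True):016x}")
```

Each printed row pairs the result for address 1002 with the result for address 1005, for byte, wyde, tetra, and octa loads in turn.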

- LDA \$X,\$Y,\$Z (load address): $u(\$X) \leftarrow A$.

This instruction, which puts a memory address into a register, is essentially the same as the ADDU instruction described below. Sometimes the words “load address” describe its purpose better than the words “add unsigned.”

- STB \$X,\$Y,\$Z (store byte): $s(M_1[A]) \leftarrow s(\$X)$.
- STW \$X,\$Y,\$Z (store wyde): $s(M_2[A]) \leftarrow s(\$X)$.
- STT \$X,\$Y,\$Z (store tetra): $s(M_4[A]) \leftarrow s(\$X)$.
- STO \$X,\$Y,\$Z (store octa): $s(M_8[A]) \leftarrow s(\$X)$.

These instructions go the other way, placing register data into the memory. Overflow is possible if the (signed) number in the register lies outside the range of the memory field. For example, suppose register \$1 contains the number $-65536 = \texttt{\#ffffffffffff0000}$. Then if \$2 = 1000, \$3 = 2, and (6) holds,

- STB \$1,\$2,\$3 sets $M_8[1000] \leftarrow \texttt{\#0123006789abcdef}$ (with overflow);
- STW \$1,\$2,\$3 sets $M_8[1000] \leftarrow \texttt{\#0123000089abcdef}$ (with overflow);
- STT \$1,\$2,\$3 sets $M_8[1000] \leftarrow \texttt{\#ffff000089abcdef}$;
- STO \$1,\$2,\$3 sets $M_8[1000] \leftarrow \texttt{\#ffffffffffff0000}$.
• STBU \$X,\$Y,\$Z (store byte unsigned): $u(M_1[A]) \leftarrow u(\$X) \bmod 2^8$.
• STWU \$X,\$Y,\$Z (store wyde unsigned): $u(M_2[A]) \leftarrow u(\$X) \bmod 2^{16}$.
• STTU \$X,\$Y,\$Z (store tetra unsigned): $u(M_4[A]) \leftarrow u(\$X) \bmod 2^{32}$.
• STOU \$X,\$Y,\$Z (store octa unsigned): $u(M_8[A]) \leftarrow u(\$X)$.
These instructions have exactly the same effect on memory as their signed counterparts STB, STW, STT, and STO, but overflow never occurs.
• STHT \$X,\$Y,\$Z (store high tetra): $u(M_4[A]) \leftarrow \lfloor u(\$X)/2^{32} \rfloor$.
The left half of register \$X is stored in memory tetrabyte $M_4[A]$.
• STCO X,\$Y,\$Z (store constant octabyte): $u(M_8[A]) \leftarrow X$.
A constant between 0 and 255 is stored in memory octabyte $M_8[A]$.
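
The signed and unsigned stores can be sketched on a single octabyte. Here the eight bytes of $M_8[1000]$ are modeled as a `bytearray`, `offset` is the address minus 1000, and the overflow flag is returned as a boolean; all names are hypothetical.

```python
def to_signed(pattern, nbits=64):
    """Signed integer denoted by an nbits-wide bit pattern."""
    return pattern - (1 << nbits) if pattern >> (nbits - 1) else pattern


def store(octa, offset, size, register, signed=True):
    """Store the low `size` bytes of `register` into the aligned field that
    contains byte `offset` of `octa`; return True if a signed store overflows."""
    start = offset - offset % size
    low_bits = register & ((1 << (8 * size)) - 1)
    octa[start:start + size] = low_bits.to_bytes(size, "big")
    if not signed:
        return False                 # STBU, STWU, STTU, STOU never overflow
    limit = 1 << (8 * size - 1)
    return not (-limit <= to_signed(register) < limit)


def demo(size, signed=True):
    """Store -65536 at address 1002 into a fresh copy of #0123456789abcdef."""
    octa = bytearray.fromhex("0123456789abcdef")
    overflow = store(octa, 2, size, 0xffffffffffff0000, signed)
    return octa.hex(), overflow


for size in (1, 2, 4, 8):
    print(size, demo(size))
```

The memory contents come out the same for the signed and unsigned variants; only the overflow report differs.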

**Arithmetic operators.** Most of MMIX’s operations take place strictly between registers. We might as well begin our study of the register-to-register operations by considering addition, subtraction, multiplication, and division, because computers are supposed to be able to compute.
• ADD \$X,\$Y,\$Z (add): $s(\$X) \leftarrow s(\$Y) + s(\$Z)$.
• SUB \$X,\$Y,\$Z (subtract): $s(\$X) \leftarrow s(\$Y) - s(\$Z)$.
• MUL \$X,\$Y,\$Z (multiply): $s(\$X) \leftarrow s(\$Y) \times s(\$Z)$.
• DIV \$X,\$Y,\$Z (divide): $s(\$X) \leftarrow \lfloor s(\$Y)/s(\$Z) \rfloor$ $[\$Z \neq 0]$, and $s(\mathrm{rR}) \leftarrow s(\$Y) \bmod s(\$Z)$.
Sums, differences, and products need no further discussion. The DIV command forms the quotient and remainder as defined in Section 1.2.4; the remainder goes into the special *remainder register* rR, where it can be examined by using the instruction GET \$X,rR described below. If the divisor \$Z is zero, DIV sets $\$X \leftarrow 0$ and $\mathrm{rR} \leftarrow \$Y$ (see Eq. 1.2.4–(1)); an “integer divide check” also occurs.
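
The quotient and remainder rules for DIV use the floor function, which happens to match Python's `//` and `%` operators on integers. The sketch below works on signed integer values rather than bit patterns and ignores overflow and the divide check; `div` is an illustrative name.

```python
def div(y, z):
    """Return (quotient, remainder) for signed values y and z, as DIV defines them."""
    if z == 0:
        return 0, y                  # quotient 0, remainder is the dividend
    return y // z, y % z             # floor quotient; remainder has the sign of z


for y, z in [(7, 2), (-7, 2), (7, -2), (-7, -2), (5, 0)]:
    print(y, z, div(y, z))
```

With this definition the identity y = quotient × z + remainder holds in every case, including division by zero.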
• ADDU \$X,\$Y,\$Z (add unsigned): $u(\$X) \leftarrow (u(\$Y) + u(\$Z)) \bmod 2^{64}$.
• SUBU \$X,\$Y,\$Z (subtract unsigned): $u(\$X) \leftarrow (u(\$Y) - u(\$Z)) \bmod 2^{64}$.
• MULU \$X,\$Y,\$Z (multiply unsigned): $u(\mathrm{rH}\,\$X) \leftarrow u(\$Y) \times u(\$Z)$.
• DIVU \$X,\$Y,\$Z (divide unsigned): $u(\$X) \leftarrow \lfloor u(\mathrm{rD}\,\$Y)/u(\$Z) \rfloor$, $u(\mathrm{rR}) \leftarrow u(\mathrm{rD}\,\$Y) \bmod u(\$Z)$, if $u(\$Z) > u(\mathrm{rD})$; otherwise $\$X \leftarrow \mathrm{rD}$, $\mathrm{rR} \leftarrow \$Y$.
Arithmetic on unsigned numbers never causes overflow. A full 16-byte product is formed by the MULU command, and the upper half goes into the special *himult register* rH. For example, when the unsigned number \texttt{\#9e3779b97f4a7c16} in (2) and (4) above is multiplied by itself we get
\[ \mathrm{rH} \leftarrow \texttt{\#61c8864680b583ea}, \quad \$X \leftarrow \texttt{\#1bb32095ccdd51e4}. \]
In this case the value of rH has turned out to be exactly $2^{64}$ minus the original number \texttt{\#9e3779b97f4a7c16}; this is not a coincidence! The reason is that (2) actually gives the first 64 bits of the binary representation of the golden ratio $\phi^{-1} = \phi - 1$, if we place a binary radix point at the left. (See Table 2 in Appendix A.) Squaring gives us an approximation to the binary representation of $\phi^{-2} = 1 - \phi^{-1}$, with the radix point now at the left of rH.
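
Python integers are unbounded, so the 16-byte product of MULU can be formed directly and then split into halves. In this sketch `mulu` returns the pair (rH, \$X), and `GOLDEN` is the constant #9e3779b97f4a7c16; both names are illustrative.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9e3779b97f4a7c16


def mulu(y, z):
    """Return (upper half, lower half) of the 128-bit product of two octabytes."""
    product = (y & MASK64) * (z & MASK64)
    return product >> 64, product & MASK64


high, low = mulu(GOLDEN, GOLDEN)
print(f"rH = #{high:016x}, $X = #{low:016x}")
print(high == 2**64 - GOLDEN)
```

The second line of output confirms the observation that rH equals $2^{64}$ minus the original number.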


Division with DIVU yields the 8-byte quotient and remainder of a 16-byte dividend with respect to an 8-byte divisor. The upper half of the dividend appears in the special dividend register rD, which is zero at the beginning of a program; this register can be set to any desired value with the command PUT rD,\$Z described below. If rD is greater than or equal to the divisor, DIVU \$X,\$Y,\$Z simply sets $\$X \leftarrow \mathrm{rD}$ and $\mathrm{rR} \leftarrow \$Y$. (This case always arises when \$Z is zero.) But DIVU never causes an integer divide check.
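
A sketch of the DIVU rule, with the contents of rD passed in as an explicit argument: the function name `divu` and its return convention (quotient, remainder) are assumptions for illustration.

```python
def divu(rd, y, z):
    """Divide the 16-byte number (rd, y) by z; return (quotient, remainder)."""
    if rd >= z:                      # always the case when z is zero
        return rd, y
    dividend = (rd << 64) | y
    return dividend // z, dividend % z


print(divu(0, 17, 5))
print(divu(0, 17, 0))
print(divu(1, 0, 2))
```

Because the upper half is smaller than the divisor in the ordinary case, the quotient always fits in 8 bytes.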

The ADDU instruction computes a memory address $A$, according to definition (5); therefore, as discussed earlier, we sometimes give ADDU the alternative name LDA. The following related commands also help with address calculation.

- **2ADDU \$X,\$Y,\$Z (times 2 and add unsigned):** $u(\$X) \leftarrow (u(\$Y) \times 2 + u(\$Z)) \bmod 2^{64}$.
- **4ADDU \$X,\$Y,\$Z (times 4 and add unsigned):** $u(\$X) \leftarrow (u(\$Y) \times 4 + u(\$Z)) \bmod 2^{64}$.
- **8ADDU \$X,\$Y,\$Z (times 8 and add unsigned):** $u(\$X) \leftarrow (u(\$Y) \times 8 + u(\$Z)) \bmod 2^{64}$.
- **16ADDU \$X,\$Y,\$Z (times 16 and add unsigned):** $u(\$X) \leftarrow (u(\$Y) \times 16 + u(\$Z)) \bmod 2^{64}$.

It is faster to execute the command 2ADDU \$X,\$Y,\$Y than to multiply by 3, if overflow is not an issue.

- **NEG \$X,Y,\$Z (negate):** $s(\$X) \leftarrow Y - s(\$Z)$.
- **NEGU \$X,Y,\$Z (negate unsigned):** $u(\$X) \leftarrow (Y - u(\$Z)) \bmod 2^{64}$.

In these commands Y is simply an unsigned constant, not a register number (just as X was an unsigned constant in the STCO instruction). Usually Y is zero, in which case we can write simply NEG \$X,\$Z or NEGU \$X,\$Z.

- **SL \$X,\$Y,\$Z (shift left):** $s(\$X) \leftarrow s(\$Y) \times 2^{u(\$Z)}$.
- **SLU \$X,\$Y,\$Z (shift left unsigned):** $u(\$X) \leftarrow (u(\$Y) \times 2^{u(\$Z)}) \bmod 2^{64}$.
- **SR \$X,\$Y,\$Z (shift right):** $s(\$X) \leftarrow \lfloor s(\$Y) / 2^{u(\$Z)} \rfloor$.
- **SRU \$X,\$Y,\$Z (shift right unsigned):** $u(\$X) \leftarrow \lfloor u(\$Y) / 2^{u(\$Z)} \rfloor$.

SL and SLU both produce the same result in \$X, but SL might overflow while SLU never does. SR extends the sign when shifting right, but SRU shifts zeros in from the left. Therefore SR and SRU produce the same result in \$X if and only if \$Y is nonnegative or \$Z is zero. The SL and SR instructions are much faster than MUL and DIV by powers of 2. An SLU instruction is much faster than MULU by a power of 2, although it does not affect rH as MULU does. An SRU instruction is much faster than DIVU by a power of 2, although it is not affected by rD. The notation $y \ll z$ is often used to denote the result of shifting a binary value $y$ to the left by $z$ bits; similarly, $y \gg z$ denotes shifting to the right.
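
The four shift instructions can be sketched on 64-bit patterns as follows. Shift amounts of 64 or more are handled separately so that Python never builds an enormous intermediate integer; the function names are hypothetical, and `sl` reports overflow as a boolean.

```python
MASK64 = (1 << 64) - 1


def s(pattern):
    """Signed integer denoted by a 64-bit pattern."""
    return pattern - (1 << 64) if pattern >> 63 else pattern


def slu(y, z):
    return (y << z) & MASK64 if z < 64 else 0


def sl(y, z):
    """Return (result, overflow); the bits are the same as for slu."""
    result = slu(y, z)
    # Overflow means the exact value s(y) * 2**z does not fit in 64 signed bits.
    overflow = (y != 0) if z >= 64 else (s(result) != s(y) << z)
    return result, overflow


def sr(y, z):
    return (s(y) >> min(z, 63)) & MASK64    # sign bits enter from the left


def sru(y, z):
    return y >> z if z < 64 else 0          # zeros enter from the left


minus_16 = 0xfffffffffffffff0
print(f"{sr(minus_16, 4):016x}  {sru(minus_16, 4):016x}")
```

For a negative operand such as −16 the two right shifts disagree, while for a nonnegative operand they coincide.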

- **CMP \$X,\$Y,\$Z (compare):** $s(\$X) \leftarrow [s(\$Y) > s(\$Z)] - [s(\$Y) < s(\$Z)]$.
- **CMPU \$X,\$Y,\$Z (compare unsigned):** $s(\$X) \leftarrow [u(\$Y) > u(\$Z)] - [u(\$Y) < u(\$Z)]$.

These instructions each set \$X to either −1, 0, or 1, depending on whether register \$Y is less than, equal to, or greater than register \$Z.
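
The bracket notation translates almost literally into Python, where a comparison yields a boolean that behaves as 0 or 1 in arithmetic. The names `cmp_signed` and `cmp_unsigned` are illustrative, and the result is given as a plain integer rather than a register pattern.

```python
def s(pattern):
    """Signed integer denoted by a 64-bit pattern."""
    return pattern - (1 << 64) if pattern >> 63 else pattern


def cmp_signed(y, z):
    return (s(y) > s(z)) - (s(y) < s(z))


def cmp_unsigned(y, z):
    return (y > z) - (y < z)


all_ones = 0xffffffffffffffff        # -1 when signed, 2**64 - 1 when unsigned
print(cmp_signed(all_ones, 1), cmp_unsigned(all_ones, 1))
```

The same pair of bit patterns can compare as "less" under CMP and "greater" under CMPU.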
Conditional instructions. Several instructions base their actions on whether a register is positive, or negative, or zero, etc.

- CSN \$X,\$Y,\$Z (conditional set if negative): if $s(\$Y) < 0$, set $\$X \leftarrow \$Z$.
- CSZ \$X,\$Y,\$Z (conditional set if zero): if $\$Y = 0$, set $\$X \leftarrow \$Z$.
- CSP \$X,\$Y,\$Z (conditional set if positive): if $s(\$Y) > 0$, set $\$X \leftarrow \$Z$.
- CSOD \$X,\$Y,\$Z (conditional set if odd): if $s(\$Y) \bmod 2 = 1$, set $\$X \leftarrow \$Z$.
- CSNN \$X,\$Y,\$Z (conditional set if nonnegative): if $s(\$Y) \geq 0$, set $\$X \leftarrow \$Z$.
- CSNZ \$X,\$Y,\$Z (conditional set if nonzero): if $\$Y \neq 0$, set $\$X \leftarrow \$Z$.
- CSNP \$X,\$Y,\$Z (conditional set if nonpositive): if $s(\$Y) \leq 0$, set $\$X \leftarrow \$Z$.
- CSEV \$X,\$Y,\$Z (conditional set if even): if $s(\$Y) \bmod 2 = 0$, set $\$X \leftarrow \$Z$.

If register \$Y satisfies the stated condition, register \$Z is copied to register \$X; otherwise nothing happens. A register is negative if and only if its leading (leftmost) bit is 1. A register is odd if and only if its trailing (rightmost) bit is 1.

- ZSN $X$, $Y$, $Z$ (zero or set if negative): $X \leftarrow Z [s(Y) < 0]$.
- ZSZ $X$, $Y$, $Z$ (zero or set if zero): $X \leftarrow Z [Y = 0]$.
- ZSP $X$, $Y$, $Z$ (zero or set if positive): $X \leftarrow Z [s(Y) > 0]$.
- ZSOD $X$, $Y$, $Z$ (zero or set if odd): $X \leftarrow Z [s(Y) \mod 2 = 1]$.
- ZSNN $X$, $Y$, $Z$ (zero or set if nonnegative): $X \leftarrow Z [s(Y) \geq 0]$.
- ZSNZ $X$, $Y$, $Z$ (zero or set if nonzero): $X \leftarrow Z [Y \neq 0]$.
- ZSNP $X$, $Y$, $Z$ (zero or set if nonpositive): $X \leftarrow Z [s(Y) \leq 0]$.
- ZSEV $X$, $Y$, $Z$ (zero or set if even): $X \leftarrow Z [s(Y) \mod 2 = 0]$.

If register $Y$ satisfies the stated condition, register $Z$ is copied to register $X$; otherwise register $X$ is set to zero.
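As an illustration of how comparison and conditional-set instructions fit together, here is a hedged Python sketch. The function names are hypothetical, registers are modeled as 64-bit patterns, and the branch-free maximum is one possible use rather than anything prescribed by the text.

```python
MASK = (1 << 64) - 1

def s(x):
    """Signed value of a 64-bit pattern."""
    return x - (1 << 64) if x >> 63 else x

def cmp_(y, z):
    """CMP: -1, 0, or +1 (as a 64-bit pattern), comparing signed values."""
    return ((s(y) > s(z)) - (s(y) < s(z))) & MASK

def csn(x, y, z):
    """CSN: the new X is Z if Y is negative; otherwise X is unchanged."""
    return z if s(y) < 0 else x

def zsn(y, z):
    """ZSN: the new X is Z if Y is negative; otherwise zero."""
    return z if s(y) < 0 else 0

def signed_max(a, b):
    """Maximum without branching: t <- CMP a,b; then CSN a,t,b."""
    t = cmp_(a, b)
    return csn(a, t, b)

print(s(signed_max(3, (-5) & MASK)))  # 3
print(s(signed_max((-5) & MASK, 3)))  # 3
```

The only difference between the two families is what happens when the condition fails: `csn` keeps the old value, while `zsn` produces zero.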

**Bitwise operations.** We often find it useful to think of an octabyte $x$ as a vector $v(x)$ of 64 individual bits, and to perform operations simultaneously on each component of two such vectors.

- AND $X$, $Y$, $Z$ (bitwise and): $v(X) \leftarrow v(Y) \land v(Z)$.
- OR $X$, $Y$, $Z$ (bitwise or): $v(X) \leftarrow v(Y) \lor v(Z)$.
- XOR $X$, $Y$, $Z$ (bitwise exclusive-or): $v(X) \leftarrow v(Y) \oplus v(Z)$.
- ANDN $X$, $Y$, $Z$ (bitwise and-not): $v(X) \leftarrow v(Y) \land \neg v(Z)$.
- ORN $X$, $Y$, $Z$ (bitwise or-not): $v(X) \leftarrow v(Y) \lor \neg v(Z)$.
- NAND $X$, $Y$, $Z$ (bitwise not-and): $v(X) \leftarrow \neg(v(Y) \land v(Z))$.
- NOR $X$, $Y$, $Z$ (bitwise not-or): $v(X) \leftarrow \neg(v(Y) \lor v(Z))$.
- NXOR $X$, $Y$, $Z$ (bitwise not-exclusive-or): $v(X) \leftarrow \neg(v(Y) \oplus v(Z))$.

Here $\neg v$ denotes the complement of vector $v$, obtained by changing 0 to 1 and 1 to 0. The binary operations $\land$, $\lor$, and $\oplus$, defined by the rules

\[
\begin{align*}
0 \land 0 &= 0, & 0 \lor 0 &= 0, & 0 \oplus 0 &= 0, \\
0 \land 1 &= 0, & 0 \lor 1 &= 1, & 0 \oplus 1 &= 1, \\
1 \land 0 &= 0, & 1 \lor 0 &= 1, & 1 \oplus 0 &= 1, \\
1 \land 1 &= 1, & 1 \lor 1 &= 1, & 1 \oplus 1 &= 0,
\end{align*}
\]

are applied independently to each bit. Anding is the same as multiplying or taking the minimum; oring is the same as taking the maximum. Exclusive-or is the same as adding mod 2.

• **MUX \$X, \$Y, \$Z** (bitwise multiplex): $v(\$X) \leftarrow (v(\$Y) \land v(rM)) \lor (v(\$Z) \land \lnot v(rM))$.

The **MUX** operation combines two bit vectors by looking at the special *multiplex mask register* rM, choosing bits of \$Y where rM is 1 and bits of \$Z where rM is 0.

• **SADD \$X, \$Y, \$Z** (sideways add): $s(\$X) \leftarrow s(\sum(v(\$Y) \land \lnot v(\$Z)))$.

The **SADD** operation counts the number of bit positions in which register $Y$ has a 1 while register $Z$ has a 0.
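A short Python sketch may make MUX and SADD concrete. The names `mux` and `sadd` are hypothetical helpers, and the mask register rM is passed as an ordinary argument for simplicity.

```python
MASK = (1 << 64) - 1

def mux(y, z, rM):
    """MUX: take bits of y where rM has a 1, bits of z where rM has a 0."""
    return (y & rM) | (z & ~rM & MASK)

def sadd(y, z):
    """SADD: count positions where y has a 1 and z has a 0."""
    return bin(y & ~z & MASK).count("1")

print(hex(mux(0xAAAA, 0x5555, 0xFF00)))  # 0xaa55
print(sadd(0xF0F0, 0xFF00))              # 4
print(sadd(MASK, 0))                     # 64
```

With a zero second operand, `sadd` simply counts the 1 bits of its first operand.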

**Bytewise operations.** Similarly, we can regard an octabyte $x$ as a vector $b(x)$ of eight individual bytes, each of which is an integer between 0 and 255; or we can think of it as a vector $w(x)$ of four individual wydes, or a vector $t(x)$ of two unsigned tetras. The following operations deal with all components at once.

• **BDIF \$X, \$Y, \$Z** (byte difference): $b(\$X) \leftarrow b(\$Y) \dot{-} b(\$Z)$.

• **WDIF \$X, \$Y, \$Z** (wyde difference): $w(\$X) \leftarrow w(\$Y) \dot{-} w(\$Z)$.

• **TDIF \$X, \$Y, \$Z** (tetra difference): $t(\$X) \leftarrow t(\$Y) \dot{-} t(\$Z)$.

• **ODIF \$X, \$Y, \$Z** (octa difference): $u(\$X) \leftarrow u(\$Y) \dot{-} u(\$Z)$.

Here $\dot{-}$ denotes the operation of *saturating subtraction*,

$$y \dot{-} z = \max(0, y - z).$$

These operations have important applications to text processing, as well as to computer graphics (when the bytes or wydes represent pixel values). Exercises 27–30 discuss some of their basic properties.
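The following Python sketch models BDIF under the definition above; `bdif` is a hypothetical helper name. It also checks one simple consequence of saturating subtraction, namely that $z + (y \mathbin{\dot{-}} z) = \max(y, z)$ in each byte.

```python
def bdif(y, z):
    """BDIF: saturating subtraction applied to each of the eight bytes."""
    result = 0
    for k in range(8):
        shift = 8 * k
        yb = (y >> shift) & 0xFF
        zb = (z >> shift) & 0xFF
        result |= max(0, yb - zb) << shift
    return result

y, z = 0x0510, 0x0A08
d = bdif(y, z)
print(hex(d))      # 0x8: the byte 05 - 0a saturates to 00, and 10 - 08 = 08
print(hex(z + d))  # 0xa10: the bytewise maximum of y and z
```

No carries cross byte boundaries in `z + d`, because each byte of the sum is at most 255.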

We can also regard an octabyte as an $8 \times 8$ Boolean matrix, that is, as an $8 \times 8$ array of 0s and 1s. Let $m(x)$ be the matrix whose rows from top to bottom are the bytes of $x$ from left to right; and let $m^T(x)$ be the transposed matrix, whose columns are the bytes of $x$. For example, if $x = \texttt{\#9e3779b97f4a7c16}$ is the octabyte (2), we have

$$m(x) = \begin{pmatrix}
1 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 0
\end{pmatrix}, \quad m^T(x) = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 1 & 1 & 0 & 0 & 0
\end{pmatrix}. \quad (10)$$

This interpretation of octabytes suggests two operations that are quite familiar to mathematicians, but we will pause a moment to define them from scratch.

If $A$ is an $m \times n$ matrix and $B$ is an $n \times s$ matrix, and if $\circ$ and $\bullet$ are binary operations, the *generalized matrix product* $A \mathbin{{}^{\circ}_{\bullet}} B$ is the $m \times s$ matrix $C$ defined by

$$C_{ij} = (A_{i1} \bullet B_{1j}) \circ (A_{i2} \bullet B_{2j}) \circ \cdots \circ (A_{in} \bullet B_{nj}) \quad (11)$$

for $1 \leq i \leq m$ and $1 \leq j \leq s$. [See K. E. Iverson, *A Programming Language* (Wiley, 1962), 23–24; we assume that $\circ$ is associative.] An ordinary matrix product is obtained when $\circ$ is $+$ and $\bullet$ is $\times$, but we obtain important operations on Boolean matrices if we let $\circ$ be $\lor$ or $\oplus$:

$$(A \mathbin{{}^{\lor}_{\times}} B)_{ij} = A_{i1} B_{1j} \lor A_{i2} B_{2j} \lor \cdots \lor A_{in} B_{nj}; \quad (12)$$

$$(A \mathbin{{}^{\oplus}_{\times}} B)_{ij} = A_{i1} B_{1j} \oplus A_{i2} B_{2j} \oplus \cdots \oplus A_{in} B_{nj}. \quad (13)$$

Notice that if the rows of $A$ each contain at most one 1, at most one term in (12) or (13) is nonzero. The same is true if the columns of $B$ each contain at most one 1. Therefore $A \mathbin{{}^{\lor}_{\times}} B$ and $A \mathbin{{}^{\oplus}_{\times}} B$ both turn out to be the same as the ordinary matrix product $A \times B = AB$ in such cases.

- $\text{MOR} \; \$X, \$Y, \$Z$ (multiple or): $\text{m}^T(\$X) \leftarrow \text{m}^T(\$Y) \mathbin{{}^{\lor}_{\times}} \text{m}^T(\$Z)$; equivalently, $\text{m}(\$X) \leftarrow \text{m}(\$Z) \mathbin{{}^{\lor}_{\times}} \text{m}(\$Y)$. (See exercise 32.)
- $\text{MXOR} \; \$X, \$Y, \$Z$ (multiple exclusive-or): $\text{m}^T(\$X) \leftarrow \text{m}^T(\$Y) \mathbin{{}^{\oplus}_{\times}} \text{m}^T(\$Z)$; equivalently, $\text{m}(\$X) \leftarrow \text{m}(\$Z) \mathbin{{}^{\oplus}_{\times}} \text{m}(\$Y)$.

These operations essentially set each byte of $\$X$ by looking at the corresponding byte of $\$Z$ and using its bits to select bytes of $\$Y$; the selected bytes are then ored or xored together. If, for example, we have

$$\$Z = \texttt{\#0102040810204080},$$

then both $\text{MOR}$ and $\text{MXOR}$ will set register $\$X$ to the byte reversal of register $\$Y$: The $k$th byte from the left of $\$X$ will be set to the $k$th byte from the right of $\$Y$, for $1 \leq k \leq 8$. On the other hand if $\$Z = \texttt{\#00000000000000ff}$, $\text{MOR}$ and $\text{MXOR}$ will set all bytes of $\$X$ to zero except for the rightmost byte, which will become either the $\text{OR}$ or the $\text{XOR}$ of all eight bytes of $\$Y$.
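The byte-selection view of MOR and MXOR can be sketched in Python as follows. This is an illustrative model, not the machine's implementation; `bytes_of`, `select_bytes`, `mor`, and `mxor` are hypothetical names.

```python
def bytes_of(x):
    """The eight bytes of an octabyte, from left to right."""
    return list(x.to_bytes(8, "big"))

def select_bytes(y, z, combine):
    """Each byte of the result combines the bytes of y selected by the
    bits of the corresponding byte of z (bit j from the left picks byte j)."""
    yb = bytes_of(y)
    out = []
    for zbyte in bytes_of(z):
        acc = 0
        for j in range(8):
            if zbyte & (0x80 >> j):
                acc = combine(acc, yb[j])
        out.append(acc)
    return int.from_bytes(bytes(out), "big")

def mor(y, z):
    return select_bytes(y, z, lambda a, b: a | b)

def mxor(y, z):
    return select_bytes(y, z, lambda a, b: a ^ b)

y = 0x0123456789ABCDEF
print(hex(mor(y, 0x0102040810204080)))   # 0xefcdab8967452301 (byte reversal)
print(hex(mxor(y, 0x0102040810204080)))  # 0xefcdab8967452301
print(hex(mor(y, 0xFF)))                 # 0xef (OR of all eight bytes)
```

When every byte of the second operand has at most one 1 bit, as in the byte-reversal mask, `mor` and `mxor` necessarily agree.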

Exercises 33–37 illustrate some of the many practical applications of these versatile commands.

**Floating point operators.** MMIX includes a full implementation of the famous IEEE/ANSI Standard 754 for floating point arithmetic. Complete details of the floating point operations appear in Section 4.2 and in the MMIXware document; a rough summary will suffice for our purposes here.

Every octabyte $x$ represents a floating binary number $f(x)$ determined as follows: The leftmost bit of $x$ is the sign ($0 = \text{‘+’}, 1 = \text{‘−’}$); the next 11 bits are the exponent $E$; the remaining 52 bits are the fraction $F$. The value represented is then

$$\begin{array}{ll}
\pm 0.0, & \text{if } E = F = 0 \text{ (zero);} \\
\pm 2^{-1074}F, & \text{if } E = 0 \text{ and } F \neq 0 \text{ (denormal);} \\
\pm 2^{E-1023}(1 + F/2^{52}), & \text{if } 0 < E < 2047 \text{ (normal);} \\
\pm \infty, & \text{if } E = 2047 \text{ and } F = 0 \text{ (infinite);} \\
\pm \text{NaN}(F/2^{52}), & \text{if } E = 2047 \text{ and } F \neq 0 \text{ (Not-a-Number).}
\end{array}$$

The “short” floating point number $f(t)$ represented by a tetrabyte $t$ is similar, but its exponent part has only 8 bits and its fraction has only 23; the normal case $0 < E < 255$ of a short float represents $\pm 2^{E-127}(1 + F/2^{23})$.
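The octabyte decoding rule can be checked against a host machine's own double-precision format with a short Python sketch. The function name `f` mirrors the notation of the text; `native` is a hypothetical helper used only for comparison, and NaN payloads are not modeled.

```python
import math
import struct

def f(x):
    """Decode a 64-bit pattern by the sign/exponent/fraction rules."""
    sign = -1.0 if x >> 63 else 1.0
    E = (x >> 52) & 0x7FF          # 11-bit exponent
    F = x & ((1 << 52) - 1)        # 52-bit fraction
    if E == 2047:
        return sign * math.inf if F == 0 else math.nan
    if E == 0:
        return sign * math.ldexp(F, -1074)             # zero or denormal
    return sign * math.ldexp((1 << 52) + F, E - 1075)  # 2^(E-1023) (1 + F/2^52)

def native(x):
    """The same eight bytes as interpreted by the host's IEEE 754 double."""
    return struct.unpack(">d", x.to_bytes(8, "big"))[0]

for bits in (0x3FF0000000000000, 0xC008000000000000, 0x0000000000000001):
    assert f(bits) == native(bits)
print(f(0x3FF0000000000000), f(0xC008000000000000))  # 1.0 -3.0
```

The exponent `E - 1075` in the normal case is just $E - 1023 - 52$, since the integer $2^{52} + F$ equals $2^{52}(1 + F/2^{52})$.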

- $\text{FADD} \; \$X, \$Y, \$Z$ (floating add): $f(\$X) \leftarrow f(\$Y) + f(\$Z)$.
- $\text{FSUB} \; \$X, \$Y, \$Z$ (floating subtract): $f(\$X) \leftarrow f(\$Y) - f(\$Z)$.
- $\text{FMUL} \; \$X, \$Y, \$Z$ (floating multiply): $f(\$X) \leftarrow f(\$Y) \times f(\$Z)$.
- $\text{FDIV} \; \$X, \$Y, \$Z$ (floating divide): $f(\$X) \leftarrow f(\$Y)/f(\$Z)$.
- **FREM \$X, \$Y, \$Z** (floating remainder): $f(\$X) \leftarrow f(\$Y) \operatorname{rem} f(\$Z)$.
- **FSQRT \$X, \$Z or FSQRT \$X, Y, \$Z** (floating square root): $f(\$X) \leftarrow f(\$Z)^{1/2}$.
- **FINT \$X, \$Z or FINT \$X, Y, \$Z** (floating integer): $f(\$X) \leftarrow \text{int}\, f(\$Z)$.
- **FCMP \$X, \$Y, \$Z** (floating compare): $s(\$X) \leftarrow [f(\$Y) > f(\$Z)] - [f(\$Y) < f(\$Z)]$.
- **FEQL \$X, \$Y, \$Z** (floating equal to): $s(\$X) \leftarrow [f(\$Y) = f(\$Z)]$.
- **FUN \$X, \$Y, \$Z** (floating unordered): $s(\$X) \leftarrow [f(\$Y) \parallel f(\$Z)]$.
- **FCMPE \$X, \$Y, \$Z** (floating compare with respect to epsilon): $s(\$X) \leftarrow [f(\$Y) \succ f(\$Z)\ (f(rE))] - [f(\$Y) \prec f(\$Z)\ (f(rE))]$, see 4.2.2–(21).
- **FEQLE \$X, \$Y, \$Z** (floating equivalent with respect to epsilon): $s(\$X) \leftarrow [f(\$Y) \approx f(\$Z)\ (f(rE))]$, see 4.2.2–(24).
- **FUNE \$X, \$Y, \$Z** (floating unordered with respect to epsilon): $s(\$X) \leftarrow [f(\$Y) \parallel f(\$Z)\ (f(rE))]$.
- **FIX \$X, \$Z or FIX \$X, Y, \$Z** (convert floating to fixed): $s(\$X) \leftarrow \text{int}\, f(\$Z)$.
- **FIXU \$X, \$Z or FIXU \$X, Y, \$Z** (convert floating to fixed unsigned): $u(\$X) \leftarrow (\text{int}\, f(\$Z)) \bmod 2^{64}$.
- **FLOT \$X, \$Z or FLOT \$X, Y, \$Z** (convert fixed to floating): $f(\$X) \leftarrow s(\$Z)$.
- **FLOTU \$X, \$Z or FLOTU \$X, Y, \$Z** (convert fixed to floating unsigned): $f(\$X) \leftarrow u(\$Z)$.
- **SFLOT \$X, \$Z or SFLOT \$X, Y, \$Z** (convert fixed to short float): $f(\$X) \leftarrow f(T) \leftarrow s(\$Z)$.
- **SFLOTU \$X, \$Z or SFLOTU \$X, Y, \$Z** (convert fixed to short float unsigned): $f(\$X) \leftarrow f(T) \leftarrow u(\$Z)$.
- **LDSF \$X, \$Y, \$Z or LDSF \$X, A** (load short float): $f(\$X) \leftarrow f(M_4[A])$.
- **STSF \$X, \$Y, \$Z or STSF \$X, A** (store short float): $f(M_4[A]) \leftarrow f(\$X)$.

Assignment to a floating point quantity uses the current rounding mode to determine the appropriate value when an exact value cannot be assigned. Four rounding modes are supported: 1 (ROUND_OFF), 2 (ROUND_UP), 3 (ROUND_DOWN), and 4 (ROUND_NEAR). The Y field of FSQRT, FINT, FIX, FIXU, FLOT, FLOTU, SFLOT, and SFLOTU can be used to specify a rounding mode other than the current one, if desired. For example, FIX \$X, ROUND_UP, \$Z sets $s(\$X) \leftarrow \lceil f(\$Z) \rceil$. Operations SFLOT and SFLOTU first round as if storing into an anonymous tetrabyte T, then they convert that number to octabyte form.

The 'int' operation rounds to an integer. The operation $y \operatorname{rem} z$ is defined to be $y - nz$, where $n$ is the nearest integer to $y/z$, or the nearest even integer in case of a tie. Special rules apply when the operands are infinite or NaN, and special conventions govern the sign of a zero result. The values $+0.0$ and $-0.0$ have different floating point representations, but FEQL calls them equal. All such technicalities are explained in the MMIXware document, and Section 4.2 explains why the technicalities are important.
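A rough Python sketch of the 'int' and 'rem' operations follows. It assumes the usual reading of the mode names (ROUND_OFF toward zero, ROUND_UP toward $+\infty$, ROUND_DOWN toward $-\infty$, ROUND_NEAR to nearest with ties to even); only the ROUND_UP case is spelled out in the text above. The helper `to_int` is hypothetical, and Python's `math.remainder` is used because it follows the same nearest-integer, ties-to-even definition.

```python
import math

ROUND_OFF, ROUND_UP, ROUND_DOWN, ROUND_NEAR = 1, 2, 3, 4

def to_int(value, mode=ROUND_NEAR):
    """Round a finite float to an integer under one of the four modes."""
    if mode == ROUND_OFF:
        return math.trunc(value)   # toward zero (assumed meaning)
    if mode == ROUND_UP:
        return math.ceil(value)    # toward +infinity
    if mode == ROUND_DOWN:
        return math.floor(value)   # toward -infinity (assumed meaning)
    return round(value)            # nearest, ties to even

print(to_int(2.5), to_int(3.5))   # 2 4
print(to_int(-2.5, ROUND_OFF))    # -2
print(to_int(2.1, ROUND_UP))      # 3
print(math.remainder(7.0, 2.0))   # -1.0, since 7/2 = 3.5 ties to n = 4
print(math.remainder(5.0, 3.0))   # -1.0, since the nearest integer to 5/3 is 2
```

Note that the remainder can be negative even when both operands are positive, unlike the result of a floor-based `mod`.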

**Immediate constants.** Programs often need to deal with small constant numbers. For example, we might want to add or subtract 1 from a register, or we might want to shift by 32, etc. In such cases it's a nuisance to load the small constant from memory into another register. So MMIX provides a general mechanism by which such constants can be obtained "immediately" from an instruction itself: Every instruction we have discussed so far has a variant in which \$Z is replaced by the number Z, unless the instruction treats \$Z as a floating point number.

For example, ‘ADD \$X, \$Y, \$Z’ has a counterpart ‘ADD \$X, \$Y, Z’, meaning $s(\$X) \leftarrow s(\$Y) + Z$; ‘SRU \$X, \$Y, \$Z’ has a counterpart ‘SRU \$X, \$Y, Z’, meaning $u(\$X) \leftarrow \lfloor u(\$Y) / 2^Z \rfloor$; ‘FLOT \$X, \$Z’ has a counterpart ‘FLOT \$X, Z’, meaning $f(\$X) \leftarrow Z$. But ‘FADD \$X, \$Y, \$Z’ has no immediate counterpart.

The opcode for ‘ADD \$X, \$Y, \$Z’ is #20 and the opcode for ‘ADD \$X, \$Y, Z’ is #21; we use the same symbol ADD in both cases for simplicity. In general the opcode for the immediate variant of an operation is one greater than the opcode for the register variant.

Several instructions also feature wyde immediate constants, which range from #0000 = 0 to #ffff = 65535. These constants, which appear in the YZ bytes, can be shifted into the high, medium high, medium low, or low wyde positions of an octabyte.

- SETH \$X, YZ (set high wyde): $u(\$X) \leftarrow YZ \times 2^{48}$.
- SETMH \$X, YZ (set medium high wyde): $u(\$X) \leftarrow YZ \times 2^{32}$.
- SETML \$X, YZ (set medium low wyde): $u(\$X) \leftarrow YZ \times 2^{16}$.
- SETL \$X, YZ (set low wyde): $u(\$X) \leftarrow YZ$.
- INCH \$X, YZ (increase by high wyde): $u(\$X) \leftarrow (u(\$X) + YZ \times 2^{48}) \bmod 2^{64}$.
- INCMH \$X, YZ (increase by medium high wyde): $u(\$X) \leftarrow (u(\$X) + YZ \times 2^{32}) \bmod 2^{64}$.
- INCML \$X, YZ (increase by medium low wyde): $u(\$X) \leftarrow (u(\$X) + YZ \times 2^{16}) \bmod 2^{64}$.
- INCL \$X, YZ (increase by low wyde): $u(\$X) \leftarrow (u(\$X) + YZ) \bmod 2^{64}$.
- ORH \$X, YZ (bitwise or with high wyde): $v(\$X) \leftarrow v(\$X) \lor v(YZ \ll 48)$.
- ORMH \$X, YZ (bitwise or with medium high wyde): $v(\$X) \leftarrow v(\$X) \lor v(YZ \ll 32)$.
- ORML \$X, YZ (bitwise or with medium low wyde): $v(\$X) \leftarrow v(\$X) \lor v(YZ \ll 16)$.
- ORL \$X, YZ (bitwise or with low wyde): $v(\$X) \leftarrow v(\$X) \lor v(YZ)$.
- ANDNH \$X, YZ (bitwise and-not high wyde): $v(\$X) \leftarrow v(\$X) \land \lnot v(YZ \ll 48)$.
- ANDNMH \$X, YZ (bitwise and-not medium high wyde): $v(\$X) \leftarrow v(\$X) \land \lnot v(YZ \ll 32)$.
- ANDNML \$X, YZ (bitwise and-not medium low wyde): $v(\$X) \leftarrow v(\$X) \land \lnot v(YZ \ll 16)$.
- ANDNL \$X, YZ (bitwise and-not low wyde): $v(\$X) \leftarrow v(\$X) \land \lnot v(YZ)$.

Using at most four of these instructions, we can get any desired octabyte into a register without loading anything from the memory. For example, the commands

```
SETH $0,#0123; INCMH $0,#4567; INCML $0,#89ab; INCL $0,#cdef
```

put #0123456789abcdef into register \$0.
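The four-instruction sequence can be traced with a small Python model. The helper names simply mirror the opcodes and are hypothetical; register \$0 is an ordinary variable here.

```python
MASK = (1 << 64) - 1

def seth(yz):
    """SETH: place the wyde in the high position, zeros elsewhere."""
    return (yz << 48) & MASK

def incmh(x, yz):
    """INCMH: add the wyde, shifted to the medium high position."""
    return (x + (yz << 32)) & MASK

def incml(x, yz):
    """INCML: add the wyde, shifted to the medium low position."""
    return (x + (yz << 16)) & MASK

def incl(x, yz):
    """INCL: add the wyde in the low position."""
    return (x + yz) & MASK

r0 = seth(0x0123)
r0 = incmh(r0, 0x4567)
r0 = incml(r0, 0x89AB)
r0 = incl(r0, 0xCDEF)
print(hex(r0))  # 0x123456789abcdef
```

Because the SETH step clears the three lower wydes, the additions cannot carry, so OR-type instructions would give the same result in this particular sequence.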

The MMIX assembly language allows us to write SET as an abbreviation for SETL, and SET \$X, \$Y as an abbreviation for the common operation OR \$X, \$Y, 0.

**Jumps and branches.** Instructions are normally executed in their natural sequence. In other words, the command that is performed after MMIX has obeyed the tetrabyte in memory location @ is normally the tetrabyte found in memory location @ + 4. (The symbol @ denotes the place where we’re “at.”) But jump and branch instructions allow this sequence to be interrupted.

- **JMP RA** (jump): @ ← RA.
  Here RA denotes a three-byte *relative address*, which could be written more explicitly as @+4∗XYZ, namely XYZ tetrabytes following the current location @. For example, ‘JMP @+4*2’ is a symbolic form for the tetrabyte #f0000002; the bytecode for ‘JMP @-4*2’ is #f1fffffe. Opcode #f0 tells the computer to “jump forward” and opcode #f1 tells it to “jump backward,” but we write both as JMP. In fact, we usually write simply ‘JMP Addr’ when we want to jump to location Addr, and the MMIX assembly program figures out the appropriate opcode and the appropriate value of XYZ. Such a jump will be possible unless we try to stray more than about 67 million bytes from our present location.

- **GO \$X, \$Y, \$Z** (go): $u(\$X) \leftarrow @ + 4$, then @ ← A.
  The GO instruction allows us to jump to an *absolute address*, anywhere in memory; this address A is calculated by formula (5), exactly as in the load and store commands. Before going to the specified address, the location of the instruction that would ordinarily have come next is placed into register \$X. Therefore we could return to that location later by saying, for example, ‘GO \$X, \$X, 0’, with Z = 0 as an immediate constant.

- **BN \$X, RA** (branch if negative): if $s(\$X) < 0$, set @ ← RA.
- **BZ \$X, RA** (branch if zero): if $\$X = 0$, set @ ← RA.
- **BP \$X, RA** (branch if positive): if $s(\$X) > 0$, set @ ← RA.
- **BOD \$X, RA** (branch if odd): if $s(\$X) \bmod 2 = 1$, set @ ← RA.
- **BNN \$X, RA** (branch if nonnegative): if $s(\$X) \geq 0$, set @ ← RA.
- **BNZ \$X, RA** (branch if nonzero): if $\$X \neq 0$, set @ ← RA.
- **BNP \$X, RA** (branch if nonpositive): if $s(\$X) \leq 0$, set @ ← RA.
- **BEV \$X, RA** (branch if even): if $s(\$X) \bmod 2 = 0$, set @ ← RA.

A *branch* instruction is a conditional jump that depends on the contents of register \$X. The range of destination addresses RA is more limited than it was with JMP, because only two bytes are available to express the relative offset; but still we can branch to any tetrabyte between $@ - 2^{18}$ and $@ + 2^{18} - 4$.
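The relative-address encoding for jumps and branches can be sketched in Python. This is a hedged illustration: `encode_jmp` and `encode_branch` are hypothetical helpers, the backward variant of a branch is assumed to have an opcode one greater than the forward variant (as with JMP, #f0 and #f1), and #42 is assumed to be the forward opcode of BZ.

```python
def encode_jmp(offset):
    """Tetrabyte for a JMP to @+offset; offset must be a multiple of 4."""
    tetras, rem = divmod(offset, 4)
    if rem or not -(1 << 24) <= tetras < (1 << 24):
        raise ValueError("JMP target out of range")
    if tetras >= 0:
        return (0xF0 << 24) | tetras            # jump forward
    return (0xF1 << 24) | (tetras + (1 << 24))  # jump backward

def encode_branch(opcode, x, offset):
    """Tetrabyte for a branch on register x; opcode is the forward form."""
    tetras, rem = divmod(offset, 4)
    if rem or not -(1 << 16) <= tetras < (1 << 16):
        raise ValueError("branch target out of range")
    if tetras >= 0:
        return (opcode << 24) | (x << 16) | tetras
    return ((opcode + 1) << 24) | (x << 16) | (tetras + (1 << 16))

print(hex(encode_jmp(8)))                # 0xf0000002
print(hex(encode_jmp(-8)))               # 0xf1fffffe
print(hex(encode_branch(0x42, 1, -8)))   # 0x4301fffe
```

The range checks reproduce the limits quoted above: a branch can reach offsets from $-2^{18}$ to $2^{18} - 4$, and a jump from $-2^{26}$ to $2^{26} - 4$.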

- **PBN \$X, RA** (probable branch if negative): if $s(\$X) < 0$, set @ ← RA.
- **PBZ \$X, RA** (probable branch if zero): if $\$X = 0$, set @ ← RA.
- **PBP \$X, RA** (probable branch if positive): if $s(\$X) > 0$, set @ ← RA.
- **PBOD \$X, RA** (probable branch if odd): if $s(\$X) \bmod 2 = 1$, set @ ← RA.
- **PBNN \$X, RA** (probable branch if nonnegative): if $s(\$X) \geq 0$, set @ ← RA.
- **PBNZ \$X, RA** (probable branch if nonzero): if $\$X \neq 0$, set @ ← RA.
- **PBNP \$X, RA** (probable branch if nonpositive): if $s(\$X) \leq 0$, set @ ← RA.
- **PBEV \$X, RA** (probable branch if even): if $s(\$X) \bmod 2 = 0$, set @ ← RA.

High-speed computers usually work fastest if they can anticipate when a branch will be taken, because foreknowledge helps them look ahead and get ready for future instructions. Therefore MMIX encourages programmers to give hints about whether branching is likely or not. Whenever a branch is expected to be taken more than half of the time, a wise programmer will say PB instead of B.

**\*Subroutine calls.** MMIX also has several instructions that facilitate efficient communication between subprograms, via a register stack. The details are somewhat technical and we will defer them until Section 1.4; an informal description will suffice here. Short programs do not need to use these features.

- **PUSHJ \$X, RA** (push registers and jump): push(X) and set rJ ← @ + 4, then set @ ← RA.
- **PUSHGO \$X, \$Y, \$Z** (push registers and go): push(X) and set rJ ← @ + 4, then set @ ← A.

The special *return-jump register* rJ is set to the address of the tetrabyte following the PUSH command. The action “push(X)” means, roughly speaking, that local registers \$0 through \$X are saved and made temporarily inaccessible. What used to be \$(X+1) is now \$0, what used to be \$(X+2) is now \$1, etc. But all registers \$k for $k \geq$ rG remain unchanged; rG is the special *global threshold register*, whose value always lies between 32 and 255, inclusive.

Register \$k is called *global* if $k \geq$ rG. It is called *local* if $k <$ rL; here rL is the special *local threshold register*, which tells how many local registers are currently active. Otherwise, namely if rL $\leq k <$ rG, register \$k is called *marginal*, and \$k is equal to zero whenever it is used as a source operand in a command. If a marginal register \$k is used as a destination operand in a command, rL is automatically increased to $k + 1$ before the command is performed, thereby making \$k local.

- **POP X, YZ** (pop registers and return): pop(X), then @ ← rJ + 4∗YZ.

Here “pop(X)” means, roughly speaking, that all but X of the current local registers become marginal, and then the local registers hidden by the most recent “push” that has not yet been “popped” are restored to their former values. Full details appear in Section 1.4, together with numerous examples.

- **SAVE \$X, 0** (save process state): $u(\$X) \leftarrow$ context.
- **UNSAVE \$Z** (restore process state): context $\leftarrow u(\$Z)$.

The SAVE instruction stores all current registers in memory at the top of the register stack, and puts the address of the topmost stored octabyte into $u(\$X)$. Register \$X must be global; that is, X must be $\geq$ rG. All of the currently local and global registers are saved, together with special registers like rA, rD, rE, rG, rH, rJ, rM, rR, and several others that we have not yet discussed. The UNSAVE instruction takes the address of such a topmost octabyte and restores the associated context, essentially undoing a previous SAVE. The value of rL is set to zero by SAVE, but restored by UNSAVE. MMIX has special registers called the *register stack offset* (rO) and *register stack pointer* (rS), which control the PUSH, POP, SAVE, and UNSAVE operations. (Again, full details can be found in Section 1.4.)

**\*System considerations.** Several opcodes, intended primarily for ultrafast and/or parallel versions of the MMIX architecture, are of interest only to advanced users, but we should at least mention them here. Some of the associated operations are similar to the “probable branch” commands, in the sense that they give hints to the machine about how to plan ahead for maximum efficiency. Most programmers do not need to use these instructions, except perhaps SYNCID.

- **LDUNC \$X, \$Y, \$Z** (load octa uncached): \( s(\$X) \leftarrow s(M_8[A]) \).
- **STUNC \$X, \$Y, \$Z** (store octa uncached): \( s(M_8[A]) \leftarrow s(\$X) \).

These commands perform the same operations as LDO and STO, but they also inform the machine that the loaded or stored octabyte and its near neighbors will probably not be read or written in the near future.

- **PRELD X, \$Y, \$Z** (preload data).

Says that many of the bytes \( M[A] \) through \( M[A + X] \) will probably be loaded or stored in the near future.

- **PREST X, \$Y, \$Z** (prestore data).

Says that all of the bytes \( M[A] \) through \( M[A + X] \) will definitely be written (stored) before they are next read (loaded).

- **PREGO X, \$Y, \$Z** (prefetch to go).

Says that many of the bytes \( M[A] \) through \( M[A + X] \) will probably be used as instructions in the near future.

- **SYNCID X, \$Y, \$Z** (synchronize instructions and data).

Says that all of the bytes \( M[A] \) through \( M[A + X] \) must be fetched again before being interpreted as instructions. MMIX is allowed to assume that a program’s instructions do not change after the program has begun, unless the instructions have been prepared by SYNCID. (See exercise 57.)

- **SYNCD X, \$Y, \$Z** (synchronize data).

Says that all of the bytes \( M[A] \) through \( M[A + X] \) must be brought up to date in the physical memory, so that other computers and input/output devices can read them.

- **SYNC XYZ** (synchronize).

Restricts parallel activities so that different processors can cooperate reliably; see MMIXware for details. XYZ must be 0, 1, 2, or 3.

- **CSWAP $X,$Y, $Z** (compare and swap octabytes).

If \( u(M_8[A]) = u(rP) \), where \( rP \) is the special prediction register, set \( u(M_8[A]) \leftarrow u(\$X) \) and \( u(\$X) \leftarrow 1 \). Otherwise set \( u(rP) \leftarrow u(M_8[A]) \) and \( u(\$X) \leftarrow 0 \). This is an atomic (indivisible) operation, useful when independent computers share a common memory.
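The logic of CSWAP can be written out as a Python sketch. This models only the data movement: the real instruction is atomic, which this code does not attempt to reproduce, and the dictionary-based `memory` and `registers` are hypothetical stand-ins.

```python
def cswap(memory, address, registers, x):
    """Compare M8[address] with rP; swap in registers[x] on a match."""
    if memory[address] == registers["rP"]:
        memory[address] = registers[x]     # store the new value
        registers[x] = 1                   # report success
    else:
        registers["rP"] = memory[address]  # remember what was actually there
        registers[x] = 0                   # report failure

memory = {0x1000: 5}
registers = {"rP": 5, "$1": 99}
cswap(memory, 0x1000, registers, "$1")
print(memory[0x1000], registers["$1"])   # 99 1

registers["$1"] = 42
cswap(memory, 0x1000, registers, "$1")   # rP is still 5, memory now holds 99
print(registers["rP"], registers["$1"])  # 99 0
```

After a failed attempt, rP holds the current memory contents, so a caller can recompute its new value and try again.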

- **LDVTS $X,$Y, $Z** (load virtual translation status).

This instruction, described in MMIXware, is for the operating system only.

**\*Interrupts.** The normal flow of instructions from one tetrabyte to the next can be changed not only by jumps and branches but also by less predictable events like overflow or external signals. Real-world machines must also cope with such things as security violations and hardware failures. MMIX distinguishes two kinds of program interruptions: "trips" and "traps." A trip sends control to a *trip handler*, which is part of the user's program; a trap sends control to a *trap handler*, which is part of the operating system.

Eight kinds of exceptional conditions can arise when MMIX is doing arithmetic, namely integer divide check (D), integer overflow (V), float-to-fix overflow (W), invalid floating operation (I), floating overflow (O), floating underflow (U), floating division by zero (Z), and floating inexact (X). The special *arithmetic status register* rA holds current information about all these exceptions. The eight bits of its rightmost byte are called its *event bits*, and they are named D_BIT (#80), V_BIT (#40), ..., X_BIT (#01), in order DVWIOUZX.

The eight bits just to the left of the event bits in rA are called the *enable bits*; they appear in the same order DVWIOUZX. When an exceptional condition occurs during some arithmetic operation, MMIX looks at the corresponding enable bit before proceeding to the next instruction. If the enable bit is 0, the corresponding event bit is set to 1; otherwise the machine invokes a trip handler by "tripping" to location #10 for exception D, #20 for exception V, ..., #80 for exception X. Thus the event bits of rA record the exceptions that have not caused trips. (If more than one enabled exception occurs, the leftmost one takes precedence. For example, simultaneous O and X is handled by O.)

The two bits of rA just to the left of the enable bits hold the current rounding mode, mod 4. The other 46 bits of rA should be zero. A program can change the setting of rA at any time, using the PUT command discussed below.
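The layout of rA described above can be unpacked with a few shifts and masks. The following Python sketch uses hypothetical helper names (`flag_names`, `decode_rA`, `trip_location`) and an arbitrary sample value.

```python
EXCEPTIONS = "DVWIOUZX"  # left-to-right order of both the enable and event bits

def flag_names(byte):
    """Letters of the exceptions whose bits are set in an 8-bit field."""
    return "".join(c for k, c in enumerate(EXCEPTIONS) if byte & (0x80 >> k))

def decode_rA(rA):
    """Split rA into rounding mode (mod 4), enable bits, and event bits."""
    return {
        "rounding": (rA >> 16) & 3,
        "enabled": flag_names((rA >> 8) & 0xFF),
        "events": flag_names(rA & 0xFF),
    }

def trip_location(exception):
    """Handler address: 0x10 for D, 0x20 for V, ..., 0x80 for X."""
    return 0x10 * (EXCEPTIONS.index(exception) + 1)

print(decode_rA(0x24001))       # {'rounding': 2, 'enabled': 'V', 'events': 'X'}
print(hex(trip_location("V")))  # 0x20
```

In the sample value, integer overflow would cause a trip, while a floating inexact result has already been recorded as an event.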

- **TRIP X, Y, Z or TRIP X, YZ or TRIP XYZ (trip).**

  This command forces a trip to the handler at location #00.

  Whenever a trip occurs, MMIX uses five special registers to record the current state: the *bootstrap register* rB, the *where-interrupted register* rW, the *execution register* rX, the *Y operand register* rY, and the *Z operand register* rZ. First rB is set to \$255, then \$255 is set to rJ, and rW is set to @ + 4. The left half of rX is set to #80000000, and the right half is set to the instruction that tripped. If the interrupted instruction was not a store command, rY is set to \$Y and rZ is set to \$Z (or to Z in case of an immediate constant); otherwise rY is set to A (the memory address of the store command) and rZ is set to \$X (the quantity to be stored). Finally control passes to the handler by setting @ to the handler address (#00 or #10 or . . . or #80).

- **TRAP X, Y, Z or TRAP X, YZ or TRAP XYZ (trap).**

  This command is analogous to TRIP, but it forces a trap to the operating system. Special registers rBB, rWW, rXX, rYY, and rZZ take the place of rB, rW, rX, rY, and rZ; the special *trap address register* rT supplies the address of the trap handler, which is placed in @. Section 1.3.2 describes several TRAP commands that provide simple input/output operations. The normal way to conclude a program is to say ‘TRAP 0’; this instruction is the tetrabyte #00000000, so you might run into it by mistake.

The MMIXware document gives further details about external interrupts, which are governed by the special *interrupt mask register* rK and *interrupt request register* rQ. Dynamic traps, which arise when rK ∧ rQ ≠ 0, are handled at address rTT instead of rT.

- **RESUME 0** (resume after interrupt).

  If s(rX) is negative, MMIX simply sets @ ← rW and takes its next instruction from there. Otherwise, if the leading byte of rX is zero, MMIX sets @ ← rW − 4 and executes the instruction in the lower half of rX as if it had appeared in that location. (This feature can be used even if no interrupt has occurred. The inserted instruction must not itself be RESUME.) Otherwise MMIX performs special actions described in the MMIXware document and of interest primarily to the operating system; see exercise 1.4.3–14.

**The complete instruction set.** Table 1 shows the symbolic names of all 256 opcodes, arranged by their numeric values in hexadecimal notation. For example, ADD appears in the upper half of the row labeled #2x and in the column labeled #0 at the top, so ADD is opcode #20; ORL appears in the lower half of the row labeled #Ex and in the column labeled #B at the bottom, so ORL is opcode #EB.

Table 1 actually says ‘ADD[I]’, not ‘ADD’, because the symbol ADD really stands for two opcodes. Opcode #20 arises from ADD \$X, \$Y, \$Z using register \$Z, while opcode #21 arises from ADD \$X, \$Y, Z using the immediate constant Z. When a distinction is necessary, we say that opcode #20 is ADD and opcode #21 is ADDI (“add immediate”); similarly, #F0 is JMP and #F1 is JMPB (“jump backward”). This gives every opcode a unique name. However, the extra I and B are generally dropped for convenience when we write MMIX programs.

We have discussed nearly all of MMIX's opcodes. Two of the stragglers are

- **GET \$X, Z** (get from special register): $u(\$X) \leftarrow u(g[Z])$, where $0 \leq Z < 32$.
- **PUT X, \$Z** (put into special register): $u(g[X]) \leftarrow u(\$Z)$, where $0 \leq X < 32$.

Each special register has a code number between 0 and 31. We speak of registers rA, rB, ..., as aids to human understanding; but register rA is really g[21] from the machine's point of view, and register rB is really g[0], etc. The code numbers appear in Table 2 on page 21.

GET commands are unrestricted, but certain things cannot be PUT: No value can be put into rG that is greater than 255, less than 32, or less than the current setting of rL. No value can be put into rA that is greater than #3ffff. If a program tries to increase rL with the PUT command, rL will stay unchanged. Moreover, a program cannot PUT anything into rC, rN, rO, rS, rI, rT, rTT, rK, rQ, rU, or rV; these “extraspecial” registers have code numbers in the range 8–18.

Most of the special registers have already been mentioned in connection with specific instructions, but \texttt{MMIX} also has a “clock register” or \textit{cycle counter}, \texttt{rC}, which keeps advancing; an \textit{interval counter}, \texttt{rI}, which keeps decreasing, and which requests an interrupt when it reaches zero; a \textit{serial number register}, \texttt{rN}, which gives each \texttt{MMIX} machine a unique number; a \textit{usage counter}, \texttt{rU}, which
Table 1
THE OPCODES OF MMIX

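<table>
<thead>
<tr><th></th><th>#0</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th><th>#5</th><th>#6</th><th>#7</th></tr>
</thead>
<tbody>
<tr><td rowspan="2">#0x</td><td>TRAP 5υ</td><td>FCMP υ</td><td>FUN υ</td><td>FEQL υ</td><td>FADD 4υ</td><td>FIX 4υ</td><td>FSUB 4υ</td><td>FIXU 4υ</td></tr>
<tr><td colspan="2">FLOT[I] 4υ</td><td colspan="2">FLOTU[I] 4υ</td><td colspan="2">SFLOT[I] 4υ</td><td colspan="2">SFLOTU[I] 4υ</td></tr>
<tr><td rowspan="2">#1x</td><td>FMUL 4υ</td><td>FCMPE 4υ</td><td>FUNE υ</td><td>FEQLE 4υ</td><td>FDIV 40υ</td><td>FSQRT 40υ</td><td>FREM 4υ</td><td>FINT 4υ</td></tr>
<tr><td colspan="2">MUL[I] 10υ</td><td colspan="2">MULU[I] 10υ</td><td colspan="2">DIV[I] 60υ</td><td colspan="2">DIVU[I] 60υ</td></tr>
<tr><td rowspan="2">#2x</td><td colspan="2">ADD[I] υ</td><td colspan="2">ADDU[I] υ</td><td colspan="2">SUB[I] υ</td><td colspan="2">SUBU[I] υ</td></tr>
<tr><td colspan="2">2ADDU[I] υ</td><td colspan="2">4ADDU[I] υ</td><td colspan="2">8ADDU[I] υ</td><td colspan="2">16ADDU[I] υ</td></tr>
<tr><td rowspan="2">#3x</td><td colspan="2">CMP[I] υ</td><td colspan="2">CMPU[I] υ</td><td colspan="2">NEG[I] υ</td><td colspan="2">NEGU[I] υ</td></tr>
<tr><td colspan="2">SL[I] υ</td><td colspan="2">SLU[I] υ</td><td colspan="2">SR[I] υ</td><td colspan="2">SRU[I] υ</td></tr>
<tr><td rowspan="2">#4x</td><td colspan="2">BN[B] υ+π</td><td colspan="2">BZ[B] υ+π</td><td colspan="2">BP[B] υ+π</td><td colspan="2">BOD[B] υ+π</td></tr>
<tr><td colspan="2">BNN[B] υ+π</td><td colspan="2">BNZ[B] υ+π</td><td colspan="2">BNP[B] υ+π</td><td colspan="2">BEV[B] υ+π</td></tr>
<tr><td rowspan="2">#5x</td><td colspan="2">PBN[B] 3υ−π</td><td colspan="2">PBZ[B] 3υ−π</td><td colspan="2">PBP[B] 3υ−π</td><td colspan="2">PBOD[B] 3υ−π</td></tr>
<tr><td colspan="2">PBNN[B] 3υ−π</td><td colspan="2">PBNZ[B] 3υ−π</td><td colspan="2">PBNP[B] 3υ−π</td><td colspan="2">PBEV[B] 3υ−π</td></tr>
<tr><td rowspan="2">#6x</td><td colspan="2">CSN[I] υ</td><td colspan="2">CSZ[I] υ</td><td colspan="2">CSP[I] υ</td><td colspan="2">CSOD[I] υ</td></tr>
<tr><td colspan="2">CSNN[I] υ</td><td colspan="2">CSNZ[I] υ</td><td colspan="2">CSNP[I] υ</td><td colspan="2">CSEV[I] υ</td></tr>
<tr><td rowspan="2">#7x</td><td colspan="2">ZSN[I] υ</td><td colspan="2">ZSZ[I] υ</td><td colspan="2">ZSP[I] υ</td><td colspan="2">ZSOD[I] υ</td></tr>
<tr><td colspan="2">ZSNN[I] υ</td><td colspan="2">ZSNZ[I] υ</td><td colspan="2">ZSNP[I] υ</td><td colspan="2">ZSEV[I] υ</td></tr>
<tr><td rowspan="2">#8x</td><td colspan="2">LDB[I] μ+υ</td><td colspan="2">LDBU[I] μ+υ</td><td colspan="2">LDW[I] μ+υ</td><td colspan="2">LDWU[I] μ+υ</td></tr>
<tr><td colspan="2">LDT[I] μ+υ</td><td colspan="2">LDTU[I] μ+υ</td><td colspan="2">LDO[I] μ+υ</td><td colspan="2">LDOU[I] μ+υ</td></tr>
<tr><td rowspan="2">#9x</td><td colspan="2">LDSF[I] μ+υ</td><td colspan="2">LDHT[I] μ+υ</td><td colspan="2">CSWAP[I] 2μ+2υ</td><td colspan="2">LDUNC[I] μ+υ</td></tr>
<tr><td colspan="2">LDVTS[I] υ</td><td colspan="2">PRELD[I] υ</td><td colspan="2">PREGO[I] υ</td><td colspan="2">GO[I] 3υ</td></tr>
<tr><td rowspan="2">#Ax</td><td colspan="2">STB[I] μ+υ</td><td colspan="2">STBU[I] μ+υ</td><td colspan="2">STW[I] μ+υ</td><td colspan="2">STWU[I] μ+υ</td></tr>
<tr><td colspan="2">STT[I] μ+υ</td><td colspan="2">STTU[I] μ+υ</td><td colspan="2">STO[I] μ+υ</td><td colspan="2">STOU[I] μ+υ</td></tr>
<tr><td rowspan="2">#Bx</td><td colspan="2">STSF[I] μ+υ</td><td colspan="2">STHT[I] μ+υ</td><td colspan="2">STCO[I] μ+υ</td><td colspan="2">STUNC[I] μ+υ</td></tr>
<tr><td colspan="2">SYNCD[I] υ</td><td colspan="2">PREST[I] υ</td><td colspan="2">SYNCID[I] υ</td><td colspan="2">PUSHGO[I] 3υ</td></tr>
<tr><td rowspan="2">#Cx</td><td colspan="2">OR[I] υ</td><td colspan="2">ORN[I] υ</td><td colspan="2">NOR[I] υ</td><td colspan="2">XOR[I] υ</td></tr>
<tr><td colspan="2">AND[I] υ</td><td colspan="2">ANDN[I] υ</td><td colspan="2">NAND[I] υ</td><td colspan="2">NXOR[I] υ</td></tr>
<tr><td rowspan="2">#Dx</td><td colspan="2">BDIF[I] υ</td><td colspan="2">WDIF[I] υ</td><td colspan="2">TDIF[I] υ</td><td colspan="2">ODIF[I] υ</td></tr>
<tr><td colspan="2">MUX[I] υ</td><td colspan="2">SADD[I] υ</td><td colspan="2">MOR[I] υ</td><td colspan="2">MXOR[I] υ</td></tr>
<tr><td rowspan="2">#Ex</td><td>SETH υ</td><td>SETMH υ</td><td>SETML υ</td><td>SETL υ</td><td>INCH υ</td><td>INCMH υ</td><td>INCML υ</td><td>INCL υ</td></tr>
<tr><td>ORH υ</td><td>ORMH υ</td><td>ORML υ</td><td>ORL υ</td><td>ANDNH υ</td><td>ANDNMH υ</td><td>ANDNML υ</td><td>ANDNL υ</td></tr>
<tr><td rowspan="2">#Fx</td><td colspan="2">JMP[B] υ</td><td colspan="2">PUSHJ[B] υ</td><td colspan="2">GETA[B] υ</td><td colspan="2">PUT[I] υ</td></tr>
<tr><td>POP 3υ</td><td>RESUME 5υ</td><td colspan="2">[UN]SAVE 20μ+υ</td><td>SYNC υ</td><td>SWYM υ</td><td>GET υ</td><td>TRIP 5υ</td></tr>
</tbody>
<tfoot>
<tr><th></th><th>#8</th><th>#9</th><th>#A</th><th>#B</th><th>#C</th><th>#D</th><th>#E</th><th>#F</th></tr>
</tfoot>
</table>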

π = 2υ if the branch is taken, π = 0 if the branch is not taken

increases by 1 whenever specified opcodes are executed; and a virtual translation register, rV, which defines a mapping from the “virtual” 64-bit addresses used in programs to the “actual” physical locations of installed memory. These special registers help make MMIX a complete, viable machine that could actually be built and run successfully; but they are not of importance to us in this book. The MMIXware document explains them fully.

- GETA $X, RA (get address): u($X) ← RA.

This instruction loads a relative address into register $X, using the same conventions as the relative addresses in branch commands. For example, GETA $0,@ will set $0 to the address of the instruction itself.
Table 2
SPECIAL REGISTERS OF MMIX

<table>
<thead>
<tr>
<th>Register</th>
<th>Description</th>
<th>Code</th>
<th>Saved?</th>
<th>Put?</th>
</tr>
</thead>
<tbody>
<tr>
<td>rA</td>
<td>arithmetic status register</td>
<td>21</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rB</td>
<td>bootstrap register (trip)</td>
<td>0</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rC</td>
<td>cycle counter</td>
<td>8</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rD</td>
<td>dividend register</td>
<td>1</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rE</td>
<td>epsilon register</td>
<td>2</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rF</td>
<td>failure location register</td>
<td>22</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rG</td>
<td>global threshold register</td>
<td>19</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rH</td>
<td>himult register</td>
<td>3</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rI</td>
<td>interval counter</td>
<td>12</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rJ</td>
<td>return-jump register</td>
<td>4</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rK</td>
<td>interrupt mask register</td>
<td>15</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rL</td>
<td>local threshold register</td>
<td>20</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rM</td>
<td>multiplex mask register</td>
<td>5</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rN</td>
<td>serial number</td>
<td>9</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rO</td>
<td>register stack offset</td>
<td>10</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rP</td>
<td>prediction register</td>
<td>23</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rQ</td>
<td>interrupt request register</td>
<td>16</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rR</td>
<td>remainder register</td>
<td>6</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rS</td>
<td>register stack pointer</td>
<td>11</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rT</td>
<td>trap address register</td>
<td>13</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rU</td>
<td>usage counter</td>
<td>17</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rV</td>
<td>virtual translation register</td>
<td>18</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rW</td>
<td>where-interrupted register (trip)</td>
<td>24</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rX</td>
<td>execution register (trip)</td>
<td>25</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rY</td>
<td>Y operand (trip)</td>
<td>26</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rZ</td>
<td>Z operand (trip)</td>
<td>27</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>rBB</td>
<td>bootstrap register (trap)</td>
<td>7</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rTT</td>
<td>dynamic trap address register</td>
<td>14</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rWW</td>
<td>where-interrupted register (trap)</td>
<td>28</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rXX</td>
<td>execution register (trap)</td>
<td>29</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rYY</td>
<td>Y operand (trap)</td>
<td>30</td>
<td></td>
<td></td>
</tr>
<tr>
<td>rZZ</td>
<td>Z operand (trap)</td>
<td>31</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- **SWYM X, Y, Z or SWYM X, YZ or SWYM XYZ** (sympathize with your machinery).

The last of MMIX’s 256 opcodes is, fortunately, the simplest of all. In fact, it is often called a no-op, because it performs no operation. It does, however, keep the machine running smoothly, just as real-world swimming helps to keep programmers healthy. Bytes X, Y, and Z are ignored.

**Timing.** In later parts of this book we will often want to compare different MMIX programs to see which is faster. Such comparisons aren’t easy to make, in general, because the MMIX architecture can be implemented in many different ways. Although MMIX is a mythical machine, its mythical hardware exists in cheap, slow versions as well as in costly high-performance models. The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of $2^{64}$ virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.

For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that's what we will do. Each operation will be assumed to take an integer number of $\upsilon$, where $\upsilon$ (pronounced “oops”)* is a unit that represents the clock cycle time in a pipelined implementation. Although the value of $\upsilon$ decreases as technology improves, we always keep up with the latest advances because we measure time in units of $\upsilon$, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or *mems* that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs $\mu + \upsilon$, where $\mu$ is the average cost of a memory reference. The total running time of a program might be reported as, say, $35\mu + 1000\upsilon$, meaning “35 mems plus 1000 oops.” The ratio $\mu/\upsilon$ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that $\mu$ and $\upsilon$ deserve to be considered independently.

Table 1, which is repeated also in the endpapers of this book, displays the assumed running time together with each opcode. Notice that most instructions take just $1\upsilon$, while loads and stores take $\mu + \upsilon$. A branch or probable branch takes $1\upsilon$ if predicted correctly, $3\upsilon$ if predicted incorrectly. Floating point operations usually take $4\upsilon$ each, although FDIV and FSQRT cost $40\upsilon$. Integer multiplication takes $10\upsilon$; integer division weighs in at $60\upsilon$.
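The fixed-cost rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `COST` lists a handful of opcodes with the costs quoted above, and the helper names `estimate`, `branch_oops`, and `show` are hypothetical, not part of any MMIX software.

```python
# Assumed per-instruction costs as (mems, oops), following the rules quoted
# in the text; only a few opcodes are listed here for illustration.
COST = {
    "ADD": (0, 1),
    "CMP": (0, 1),
    "LDO": (1, 1),    # loads and stores cost mu + upsilon
    "STO": (1, 1),
    "FADD": (0, 4),
    "FDIV": (0, 40),
    "FSQRT": (0, 40),
    "MUL": (0, 10),
    "DIV": (0, 60),
}


def estimate(counts):
    """Sum (mems, oops) over a dict mapping opcode name -> times executed."""
    mems = sum(COST[op][0] * k for op, k in counts.items())
    oops = sum(COST[op][1] * k for op, k in counts.items())
    return mems, oops


def branch_oops(predicted_right, predicted_wrong):
    """A branch costs 1 oop when predicted correctly and 3 when it is not."""
    return predicted_right + 3 * predicted_wrong


def show(mems, oops):
    """Format a cost in the style used in the text."""
    return f"{mems}\u03bc + {oops}\u03c5"


if __name__ == "__main__":
    # 35 loads and 965 additions give 35 mems plus 1000 oops.
    print(show(*estimate({"LDO": 35, "ADD": 965})))
```

Keeping mems and oops as separate components, rather than collapsing them into one number, mirrors the remark that $\mu$ and $\upsilon$ deserve to be considered independently.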

Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore $\mu$ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. Therefore the MMIXware package includes not only a simple simulator, which calculates running times by the rules of Table 1, but also a comprehensive meta-simulator, which runs MMIX programs under a wide range of different technological assumptions. Users of the meta-simulator can specify the characteristics of the memory bus and the parameters of such things as caches for instructions and data, virtual address translation, pipelining and simultaneous instruction issue, branch prediction, etc. Given a configuration file and a program file, the meta-simulator determines precisely how long the specified hardware would need to run the program. Only the meta-simulator can be trusted to give reliable information about a program’s actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That’s why we often resort to the much simpler estimates of Table 1.

* The Greek letter upsilon ($\upsilon$) is wider than an italic letter vee ($v$), but the author admits that this distinction is rather subtle. Readers who prefer to say vee instead of oops are free to do as they wish. The symbol is, however, an upsilon.

No benchmark result should ever be taken at face value.
— BRIAN KERNIGHAN and CHRISTOPHER VAN WYK (1998)

**MMIX versus reality.** A person who understands the rudiments of MMIX programming has a pretty good idea of what today’s general-purpose computers can do easily; MMIX is very much like all of them. But MMIX has been idealized in several ways, partly because the author has tried to design a machine that is somewhat “ahead of its time” so that it won’t become obsolete too quickly. Therefore a brief comparison between MMIX and the computers actually being built at the turn of the millennium is appropriate. The main differences between MMIX and those machines are:

- Commercial machines do not ignore the low-order bits of memory addresses, as MMIX does when accessing $M_8[A]$; they usually insist that $A$ be a multiple of 8. (We will find many uses for those precious low-order bits.)
- Commercial machines are usually deficient in their support of integer arithmetic. For example, they never produce the true quotient $\lfloor x/y \rfloor$ and true remainder $x \mod y$ when $x$ is negative or $y$ is negative; they often throw away the upper half of a product. They don’t treat left and right shifts as strict equivalents of multiplication and division by powers of 2. Sometimes they do not implement division in hardware at all; and when they do handle division, they usually assume that the upper half of the 128-bit dividend is zero. Such restrictions make high-precision calculations more difficult.
- Commercial machines do not perform FINT and FREM efficiently.
- Commercial machines do not (yet?) have the powerful MOR and MXOR operations. They usually have a half dozen or so ad hoc instructions that handle only the most common special cases of MOR.
- Commercial machines rarely have more than 64 general-purpose registers. The 256 registers of MMIX significantly decrease program length, because many variables and constants of a program can live entirely in those registers instead of in memory. Furthermore, MMIX’s register stack is more flexible than the comparable mechanisms in existing computers.
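
The point about the true quotient $\lfloor x/y \rfloor$ and true remainder $x \bmod y$ can be illustrated with a small Python sketch. Python's own `//` happens to round toward minus infinity, so `floor_divmod` shows the floor convention, while `trunc_divmod` imitates the round-toward-zero behavior that many machines deliver; both function names are hypothetical.

```python
def floor_divmod(x, y):
    """Quotient rounded toward minus infinity; the remainder has the sign of y."""
    q = x // y
    return q, x - q * y


def trunc_divmod(x, y):
    """Quotient rounded toward zero, as many machines and languages deliver."""
    q = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        q = -q
    return q, x - q * y


if __name__ == "__main__":
    for x, y in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
        print(x, y, floor_divmod(x, y), trunc_divmod(x, y))
    # An arithmetic right shift by 1 agrees with the floor quotient by 2.
    print((-7) >> 1, floor_divmod(-7, 2)[0])
```

The two conventions agree whenever the operands have the same sign; they differ only when exactly one operand is negative and the division is inexact, which is also where a right shift and a truncating division by a power of 2 part ways.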

All of these plusses for MMIX have associated minuses, because computer design always involves tradeoffs. The primary design goal for MMIX was to keep the machine as simple and clean and consistent and forward-looking as possible, without sacrificing speed and realism too greatly.

**Summary.** MMIX is a programmer-friendly computer that operates on 64-bit quantities called octabytes. It has the general characteristics of a so-called RISC (“reduced instruction set computer”); that is, its instructions have only a few different formats (OP X, Y, Z or OP X, YZ or OP XYZ), and each instruction either transfers data between memory and a register or involves only registers. Table 1 summarizes the 256 opcodes and their default running times; Table 2 summarizes the special registers that are sometimes important.

The following exercises give a quick review of the material in this section. Most of them are quite simple, and the reader should try to do nearly all of them.

EXERCISES

1. [00] The binary form of 2009 is (11111011001)₂; what is 2009 in hexadecimal?
2. [05] Which of the letters {A, B, C, D, E, F, a, b, c, d, e, f} are odd when considered as (a) hexadecimal digits? (b) ASCII characters?
3. [10] Four-bit quantities — half-bytes, or hexadecimal digits — are often called **nibbles**. Suggest a good name for two-bit quantities, so that we have a complete binary nomenclature ranging from bits to octabytes.
4. [15] A kilobyte (kB or KB) is 1000 bytes, and a megabyte (MB) is 1000 kB. What are the official names and abbreviations for larger numbers of bytes?
5. [M13] If α is any string of 0s and 1s, let s(α) and u(α) be the integers that it represents when regarded as a signed or unsigned binary number. Prove that, if x is any integer, we have
   \[ x = s(\alpha) \quad \text{if and only if} \quad x \equiv u(\alpha) \pmod{2^n} \text{ and } -2^{n-1} \leq x < 2^{n-1}, \]
   where n is the length of α.
6. [M20] Prove or disprove the following rule for negating an n-bit number in two's complement notation: “Complement all the bits, then add 1.” (For example, **00...01** becomes **ff...fe**, then **ff...ff**; also **ff...ff** becomes **00...00**, then **00...01**.)
7. [M15] Could the formal definitions of **LDHT** and **STHT** have been stated as
   \[ s(\$X) \leftarrow s(M_4[A]) \times 2^{32} \quad \text{and} \quad s(M_4[A]) \leftarrow \lfloor s(\$X) / 2^{32} \rfloor, \]
   thus treating the numbers as signed rather than unsigned?
8. [10] If registers $Y and $Z represent numbers between 0 and 1 in which the binary radix point is assumed to be at the left of each register, (7) illustrates the fact that **MULU** forms a product in which the assumed radix point appears at the left of register rH. Suppose, on the other hand, that $Z is an integer, with the radix point assumed at its right, while $Y is a fraction between 0 and 1 as before. Where does the radix point lie after **MULU** in such a case?
9. [M10] Does the equation \( s(\$Y) = s(\$X) \cdot s(\$Z) + s(\mathrm{rR}) \) always hold after the instruction **DIV** $X,$Y,$Z has been performed?
10. [M16] Give an example of DIV in which overflow occurs.

11. [M16] True or false: (a) Both MUL $X,$Y,$Z and MULU $X,$Y,$Z produce the same result in $X. (b) If register rD is zero, both DIV $X,$Y,$Z and DIVU $X,$Y,$Z produce the same result in $X.

12. [M20] Although ADDU $X,$Y,$Z never signals overflow, we might want to know if a carry occurs at the left when adding $Y to $Z. Show that the carry can be computed with two further instructions.

13. [M21] Suppose MMIX had no ADD command, only its unsigned counterpart ADDU. How could a programmer tell whether overflow occurred when computing \( s(\$Y) + s(\$Z) \)?

14. [M21] Suppose MMIX had no SUB command, only its unsigned counterpart SUBU. How could a programmer tell whether overflow occurred when computing \( s(\$Y) - s(\$Z) \)?

15. [M25] The product of two signed octabytes always lies between $-2^{126}$ and $2^{126}$, so it can always be expressed as a signed 16-byte quantity. Explain how to calculate the upper half of such a signed product.

16. [M23] Suppose MMIX had no MUL command, only its unsigned counterpart MULU. How could a programmer tell whether overflow occurred when computing \( s(\$Y) \times s(\$Z) \)?

17. [M22] Prove that unsigned integer division by 3 can always be done by multiplication: If register $Y contains any unsigned integer $y$, and if register $1 contains the constant #aaaaaaaaaaaaaaab, then the sequence

\begin{verbatim}
MULU $0,$Y,$1; GET $0,rH; SRU $X,$0,1
\end{verbatim}

puts $\lfloor y/3 \rfloor$ into register $X.

18. [M23] Continuing the previous exercise, prove or disprove that the instructions

\begin{verbatim}
MULU $0,$Y,$1; GET $0,rH; SRU $X,$0,2
\end{verbatim}

put $\lfloor y/5 \rfloor$ into $X if $1 is an appropriate constant.

19. [M26] Continuing exercises 17 and 18, prove or disprove the following statement: Unsigned integer division by a constant can always be done using “high multiplication” followed by a right shift. More precisely, if $2^e < z < 2^{e+1}$, we can compute $\lfloor y/z \rfloor$ by computing $\lfloor ay/2^{64+e} \rfloor$, where $a = \lceil 2^{64+e}/z \rceil$, for $0 \leq y < 2^{64}$.

20. [16] Show that two cleverly chosen MMIX instructions will multiply by 25 faster than the single instruction MUL $X,$Y,25, if we assume that overflow will not occur.

21. [15] Describe the effects of SL, SLU, SR, and SRU when the unsigned value in register $Z is 64 or more.

22. [15] Mr. B. C. Dull wrote a program in which he wanted to branch to location Case1 if the signed number in register $1 was less than the signed number in register $2. His solution was to write ‘SUB $0,$1,$2; BN $0,Case1’.

What terrible mistake did he make? What should he have written instead?

23. [10] Continuing the previous exercise, what should Dull have written if his problem had been to branch if s($1) was less than or equal to s($2)?

24. [M10] If we represent a subset $S$ of \(\{0, 1, \ldots, 63\}\) by the bit vector

\[ ([0 \in S], [1 \in S], \ldots, [63 \in S]), \]

the bitwise operations $\land$ and $\lor$ correspond respectively to set intersection ($S \cap T$) and set union ($S \cup T$). Which bitwise operation corresponds to set difference ($S \setminus T$)?

25. [10] The **Hamming distance** between two bit vectors is the number of positions in which they differ. Show that two MMIX instructions suffice to set register $X equal to the Hamming distance between $Y and $Z.

26. [10] What's a good way to compute 64 bit differences, \(v(\$X) \leftarrow v(\$Y) \mathbin{\dot{-}} v(\$Z)\)?

27. [20] Show how to use BDIF to compute the maximum and minimum of eight bytes at a time: \(b(\$X) \leftarrow \max(b(\$Y), b(\$Z))\), \(b(\$W) \leftarrow \min(b(\$Y), b(\$Z))\).

28. [16] How would you calculate eight **absolute pixel differences** \(|b(\$Y) - b(\$Z)|\) simultaneously?

29. [21] The operation of **saturating addition** on \(n\)-bit pixels is defined by the formula

\[ y \mathbin{\dot{+}} z = \min(2^n - 1, y + z). \]

Show that a sequence of three MMIX instructions will set \(b(\$X) \leftarrow b(\$Y) \mathbin{\dot{+}} b(\$Z)\).

30. [25] Suppose register $0 contains eight ASCII characters. Find a sequence of three MMIX instructions that counts the number of **blank spaces** among those characters. (You may assume that auxiliary constants have been preloaded into other registers. A blank space is ASCII code #20.)

31. [22] Continuing the previous exercise, show how to count the number of characters in $0 that have **odd parity** (an odd number of 1 bits).

32. [M20] True or false: If \(C = A \cdot B\) then \(C^T = B^T \cdot A^T\). (See (11).)

33. [20] What is the shortest sequence of MMIX instructions that will **cyclically shift** a register eight bits to the right? For example, #9e3779b97f4a7c16 would become #169e3779b97f4a7c.

34. [21] Given eight bytes of ASCII characters in $Z, explain how to convert them to the corresponding eight wyde characters of Unicode, using only two MMIX instructions to place the results in $X and $Y. How would you go the other way (back to ASCII)?

35. [22] Show that two cleverly chosen MOR instructions will reverse the left-to-right order of all 64 bits in a given register $Y.

36. [20] Using only two instructions, create a mask that has #ff in all byte positions where $Y differs from $Z, #00 in all byte positions where $Y equals $Z.

37. [HM30] (Finite fields.) Explain how to use MXOR for arithmetic in a field of 256 elements; each element of the field should be represented by a suitable octabyte.

38. [20] What does the following little program do?

```
SETL $1,0; SR $2,$0,56; ADD $1,$1,$2; SLU $0,$0,8; PBNZ $0,@-4*3.
```

39. [20] Which of the following equivalent sequences of code is faster, based on the timing information of Table 1?

- a) BN $0,@+4*2; ADDU $1,$2,$3 versus ADDU $4,$2,$3; CSNN $1,$0,$4.
- b) BN $0,@+4*3; SET $1,$2; JMP @+4*2; SET $1,$3 versus CSNN $1,$0,$2; CSN $1,$0,$3.
- c) BN $0,@+4*3; ADDU $1,$2,$3; JMP @+4*2; ADDU $1,$4,$5 versus ADDU $1,$2,$3; ADDU $6,$4,$5; CSN $1,$0,$6.
- d, e, f) Same as (a), (b), and (c), but with PBN in place of BN.

40. [10] What happens if you GO to an address that is not a multiple of 4?
41. [20] True or false:
   a) The instructions CSOD $X, $Y, 0 and ZSEV $X, $Y, $X have exactly the same effect.
   b) The instructions CMPU $X, $Y, 0 and ZSNZ $X, $Y, 1 have exactly the same effect.
   c) The instructions MOR $X, $Y, 1 and AND $X, $Y, #ff have exactly the same effect.
   d) The instructions MXOR $X, $Y, #80 and SR $X, $Y, 56 have exactly the same effect.

42. [20] What is the best way to set register $1 to the absolute value of the number in register $0, if $0 holds (a) a signed integer? (b) a floating point number?

43. [28] Given a nonzero octabyte in $Z, what is the fastest way to count how many leading and trailing zero bits it has? (For example, #13fd8124f32434a2 has three leading zeros and one trailing zero.)

44. [M25] Suppose you want to emulate 32-bit arithmetic with MMIX. Show that it is easy to add, subtract, multiply, and divide signed tetrabytes, with overflow occurring whenever the result does not lie in the interval $[-2^{31}, 2^{31})$.

45. [10] Think of a way to remember the sequence DVWIOUZX.

46. [05] The all-zeros tetrabyte #00000000 halts a program when it occurs as an MMIX instruction. What does the all-ones tetrabyte #ffffffff do?

47. [05] What are the symbolic names of opcodes #DF and #55?

48. [11] The text points out that opcodes LDO and LDOU perform exactly the same operation, with the same efficiency, regardless of the operand bytes X, Y, and Z. What other pairs of opcodes are equivalent in this sense?

49. [22] After the following “number one” program has been executed, what changes to registers and memory have taken place? (For example, what is the final setting of $1? of rA? of rB?)

   NEG $1,1
   STCO 1,$1,1
   CMPU $1,$1,1
   STB $1,$1,$1
   LDOU $1,$1,$1
   INCH $1,1
   16ADDU $1,$1,$1
   MULU $1,$1,$1
   PUT rA,1
   STW $1,$1,1
   SADD $1,$1,1
   FLOT $1,$1
   PUT rB,$1
   XOR $1,$1,1
   PBOD $1,@-4*1
   NOR $1,$1,$1
   SR $1,$1,1
   SRU $1,$1,1

50. [14] What is the execution time of the program in the preceding exercise?

51. [14] Convert the “number one” program of exercise 49 to a sequence of tetrabytes in hexadecimal notation.

52. [22] For each MMIX opcode, consider whether there is a way to set the X, Y, and Z bytes so that the result of the instruction is precisely equivalent to SWYM (except that the execution time may be longer). Assume that nothing is known about the contents of any registers or any memory locations. Whenever it is possible to produce a no-op, state how it can be done. Examples: INCL is a no-op if \( X = 255 \) and \( Y = Z = 0 \). BZ is a no-op if \( Y = 0 \) and \( Z = 1 \). MULU can never be a no-op, since it affects rH.

53. [15] List all MMIX opcodes that can possibly change the value of rH.

54. [20] List all MMIX opcodes that can possibly change the value of rA.

55. [21] List all MMIX opcodes that can possibly change the value of rL.

56. [28] Location #2000000000000000 contains a signed integer number, \( x \). Write two programs that compute \( x^{18} \) in register $0. One program should use the minimum number of MMIX memory locations; the other should use the minimum possible execution time. Assume that \( x^{18} \) fits into a single octabyte, and that all necessary constants have been preloaded into global registers.

57. [20] When a program changes one or more of its own instructions in memory, it is said to have self-modifying code. MMIX insists that a SYNCID command be issued before such modified commands are executed. Explain why self-modifying code is usually undesirable in a modern computer.

58. [50] Write a book about operating systems, which includes the complete design of an NNIX kernel for the MMIX architecture.

"Them fellers is a-mommixin' everything."
— V. RANDOLPH and G. P. WILSON, Down in the Holler (1953)

1.3.2'. The MMIX Assembly Language

A symbolic language is used to make MMIX programs considerably easier to read and to write, and to save the programmer from worrying about tedious clerical details that often lead to unnecessary errors. This language, MMIXAL ("MMIX Assembly Language"), is an extension of the notation used for instructions in the previous section. Its main features are the optional use of alphabetic names to stand for numbers, and a label field to associate names with memory locations and register numbers.

MMIXAL can readily be comprehended if we consider first a simple example.

The following code is part of a larger program; it is a subroutine to find the maximum of \( n \) elements \( X[1], \ldots, X[n] \), according to Algorithm 1.2.10.M.

Program M (Find the maximum). Initially \( n \) is in register \( $0 \), and the address of \( X[0] \) is in register \( x0 \), a global register defined elsewhere.

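<table>
<thead>
<tr><th>Assembled code</th><th>Line no.</th><th>LABEL</th><th>OP</th><th>EXPR</th><th>Times</th><th>Remarks</th></tr>
</thead>
<tbody>
<tr><td></td><td>01</td><td>j</td><td>IS</td><td>$0</td><td></td><td>j</td></tr>
<tr><td></td><td>02</td><td>m</td><td>IS</td><td>$1</td><td></td><td>m</td></tr>
<tr><td></td><td>03</td><td>kk</td><td>IS</td><td>$2</td><td></td><td>8k</td></tr>
<tr><td></td><td>04</td><td>xk</td><td>IS</td><td>$3</td><td></td><td>X[k]</td></tr>
<tr><td></td><td>05</td><td>t</td><td>IS</td><td>$255</td><td></td><td>Temp storage</td></tr>
<tr><td></td><td>06</td><td></td><td>LOC</td><td>#100</td><td></td><td></td></tr>
<tr><td>#100</td><td>07</td><td>Maximum</td><td>SL</td><td>kk,$0,3</td><td>1</td><td>M1. Initialize. k ← n, j ← n.</td></tr>
<tr><td>#104</td><td>08</td><td></td><td>LDO</td><td>m,x0,kk</td><td>1</td><td>m ← X[n].</td></tr>
<tr><td>#108</td><td>09</td><td></td><td>JMP</td><td>DecrK</td><td>1</td><td>To M2 with k ← n − 1.</td></tr>
<tr><td>#10c</td><td>10</td><td>Loop</td><td>LDO</td><td>xk,x0,kk</td><td>n − 1</td><td>M3. Compare.</td></tr>
<tr><td>#110</td><td>11</td><td></td><td>CMP</td><td>t,xk,m</td><td>n − 1</td><td>t ← [X[k] > m] − [X[k] < m].</td></tr>
<tr><td>#114</td><td>12</td><td></td><td>PBNP</td><td>t,DecrK</td><td>n − 1</td><td>To M5 if X[k] ≤ m.</td></tr>
<tr><td>#118</td><td>13</td><td>ChangeM</td><td>SET</td><td>m,xk</td><td>A</td><td>M4. Change m. m ← X[k].</td></tr>
<tr><td>#11c: #3d000203</td><td>14</td><td></td><td>SR</td><td>j,kk,3</td><td>A</td><td>j ← k.</td></tr>
<tr><td>#120</td><td>15</td><td>DecrK</td><td>SUB</td><td>kk,kk,8</td><td>n</td><td>M5. Decrease k. k ← k − 1.</td></tr>
<tr><td>#124</td><td>16</td><td></td><td>PBP</td><td>kk,Loop</td><td>n</td><td>M2. All tested? To M3 if k > 0.</td></tr>
<tr><td>#128</td><td>17</td><td></td><td>POP</td><td>2,0</td><td>1</td><td>Return to main program.</td></tr>
</tbody>
</table>
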
This program is an example of several things simultaneously:

a) The columns headed “LABEL”, “OP”, and “EXPR” are of principal interest; they contain a program in the MMIXAL symbolic machine language, and we shall explain the details of this program below.

b) The column headed “Assembled code” shows the actual numeric machine language that corresponds to the MMIXAL program. MMIXAL has been designed so that any MMIXAL program can easily be translated into numeric machine language; the translation is usually carried out by another computer program called an assembly program or assembler. Thus, programmers can do all of their machine language programming in MMIXAL, never bothering to determine the equivalent numeric codes by hand. Virtually all MMIX programs in this book are written in MMIXAL.

c) The column headed “Line no.” is not an essential part of the MMIXAL program; it is merely included with MMIXAL examples in this book so that we can readily refer to parts of the program.

d) The column headed “Remarks” gives explanatory information about the program, and it is cross-referenced to the steps of Algorithm 1.2.10M. The reader should compare that algorithm (page 96) with the program above. Notice that a little “programmer’s license” was used during the transcription into MMIX code; for example, step M2 has been put last.

e) The column headed “Times” will be instructive in many of the MMIX programs we will be studying in this book; it represents the profile, the number of times the instruction on that line will be executed during the course of the program. Thus, line 10 will be performed \( n-1 \) times, etc. From this information we can determine the length of time required to perform the subroutine; it is \( n\mu + (5n + 4A + 5)\upsilon \), where \( A \) is the quantity that was analyzed carefully in Section 1.2.10. (The PBNP instruction costs \( (n - 1 + 2A)\upsilon \).)
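
The way a profile turns into a running time can be checked with a short Python sketch. The per-line breakdown below is an assumption reconstructed from the text (instructions on lines 07 through 17, the stated PBNP cost, one mem per LDO, 3 oops for POP and for a mispredicted branch); the function names `program_m_cost` and `count_changes` are hypothetical.

```python
def program_m_cost(n, A):
    """Return (mems, oops) for Program M on n elements with A changes of m.

    Assumed per-line totals, as (mems, oops):
    """
    lines = [
        (0, 1),              # 07 SL    executed once
        (1, 1),              # 08 LDO   executed once
        (0, 1),              # 09 JMP   executed once
        (n - 1, n - 1),      # 10 LDO   executed n-1 times
        (0, n - 1),          # 11 CMP   executed n-1 times
        (0, n - 1 + 2 * A),  # 12 PBNP  falls through A times (3 oops each)
        (0, A),              # 13 SET   executed A times
        (0, A),              # 14 SR    executed A times
        (0, n),              # 15 SUB   executed n times
        (0, n + 2),          # 16 PBP   taken n-1 times, falls through once
        (0, 3),              # 17 POP   executed once
    ]
    return sum(m for m, _ in lines), sum(o for _, o in lines)


def count_changes(xs):
    """Count A: how often the running maximum changes, scanning right to left."""
    m, a = xs[-1], 0
    for x in reversed(xs[:-1]):
        if x > m:
            m, a = x, a + 1
    return a


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(program_m_cost(len(data), count_changes(data)))
```

Under these assumptions the totals collapse to $n$ mems and $5n + 4A + 5$ oops for every valid pair $(n, A)$, in agreement with the formula above.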

Now let’s discuss the MMIXAL part of Program M. Line 01, ‘j IS $0’, says that symbol \( j \) stands for register $0; lines 02–05 are similar. The effect of lines 01 and 03 can be seen on line 14, where the numeric equivalent of the instruction ‘SR j,kk,3’ appears as #3d000203, that is, ‘SR $0,$2,3’.

Line 06 says that the locations for succeeding lines should be chosen sequentially, beginning with #100. Therefore the symbol Maximum that appears in the label field of line 07 becomes equivalent to the number #100; the symbol Loop in line 10 is three tetrabytes further along, so it is equivalent to #10c.

On lines 07 through 17 the OP field contains the symbolic names of MMIX instructions: SL, LDO, etc. But the symbolic names IS and LOC, found in the OP column of lines 01–06, are somewhat different; IS and LOC are called pseudo-operations, because they are operators of MMIXAL but not operators of MMIX. Pseudo-operations provide special information about a symbolic program, without being instructions of the program itself. Thus the line ‘j IS $0’ only talks about Program M; it does not signify that any variable is to be set equal to the contents of register $0 when the program is run. Notice that no instructions are assembled for lines 01–06.

Line 07 is a “shift left” instruction that sets $k \leftarrow n$ by setting $kk \leftarrow 8n$. This program works with the value of $8k$, not $k$, because $8k$ is needed for octabyte addresses in lines 08 and 10.

Line 09 jumps the control to line 15. The assembler, knowing that this JMP instruction is in location #108 and that DecrK is equivalent to #120, computes the relative offset (#120 − #108)/4 = 6. Similar relative addresses are computed for the branch commands in lines 12 and 16.
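
The relative-address arithmetic can be sketched in Python as follows. The helper names are hypothetical, and the opcode values #F0 (JMP) and #F1 (JMPB, the backward form whose 24-bit field holds the offset plus $2^{24}$) are assumptions taken from the opcode chart rather than from this paragraph.

```python
def relative_offset(instr_loc, target_loc):
    """Offset in tetrabytes from the instruction's own location to the target."""
    delta = target_loc - instr_loc
    if delta % 4 != 0:
        raise ValueError("instruction addresses must differ by a multiple of 4")
    return delta // 4


def encode_jmp(instr_loc, target_loc):
    """Assemble a JMP tetrabyte: opcode byte followed by a 24-bit field."""
    off = relative_offset(instr_loc, target_loc)
    if not -(1 << 24) <= off < (1 << 24):
        raise ValueError("target is out of range for JMP")
    if off >= 0:
        return (0xF0 << 24) | off            # forward form (assumed opcode)
    return (0xF1 << 24) | (off + (1 << 24))  # backward form (assumed opcode)


if __name__ == "__main__":
    print(relative_offset(0x108, 0x120))       # the JMP on line 09
    print(hex(encode_jmp(0x108, 0x120)))
```

Because the offset counts tetrabytes rather than bytes, a field of a given width reaches four times as far as a byte offset would.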

The rest of the symbolic code is self-explanatory. As mentioned earlier, Program M is intended to be part of a larger program; elsewhere the sequence

\begin{verbatim}
SET $2,100
PUSHJ $1, Maximum
STO $1, Max
\end{verbatim}

would, for example, jump to Program M with $n$ set to 100. Program M would then find the largest of the elements $X[1], \ldots, X[100]$ and would return to the instruction ‘STO $1,Max’ with the maximum value in $1 and with its position, $j$, in $2. (See exercise 3.)

Let’s look now at a program that is complete, not merely a subroutine. If the following program is named Hello, it will print out the famous message ‘Hello, world’ and stop.

**Program H (Hail the world).**

<table>
<thead>
<tr><th>Assembled code</th><th>Line</th><th>LABEL</th><th>OP</th><th>EXPR</th><th>Remarks</th></tr>
</thead>
<tbody>
<tr><td></td><td>01</td><td>argv</td><td>IS</td><td>$1</td><td>The argument vector</td></tr>
<tr><td></td><td>02</td><td></td><td>LOC</td><td>#100</td><td></td></tr>
<tr><td>#100: 8fff0100</td><td>03</td><td>Main</td><td>LDOU</td><td>$255,argv,0</td><td>$255 ← address of program name.</td></tr>
<tr><td>#104: 00000701</td><td>04</td><td></td><td>TRAP</td><td>0,Fputs,StdOut</td><td>Print that name.</td></tr>
<tr><td>#108: f4ff0003</td><td>05</td><td></td><td>GETA</td><td>$255,String</td><td>$255 ← address of ", world".</td></tr>
<tr><td>#10c: 00000701</td><td>06</td><td></td><td>TRAP</td><td>0,Fputs,StdOut</td><td>Print that string.</td></tr>
<tr><td>#110: 00000000</td><td>07</td><td></td><td>TRAP</td><td>0,Halt,0</td><td>Stop.</td></tr>
<tr><td>#114: 2c20776f</td><td>08</td><td>String</td><td>BYTE</td><td>", world",#a,0</td><td>String of characters</td></tr>
<tr><td>#118: 726c640a</td><td>09</td><td></td><td></td><td></td><td>with newline</td></tr>
<tr><td>#11c: 00</td><td>10</td><td></td><td></td><td></td><td>and terminator</td></tr>
</tbody>
</table>

Readers who have access to an MMIX assembler and simulator should take a moment to prepare a short computer file containing the LABEL OP EXPR portions of Program H before reading further. Name the file ‘Hello.mms’ and assemble it by saying, for example, ‘mmixal Hello.mms’. (The assembler will produce a file called ‘Hello.mmo’; the suffix .mms means “MMIX symbolic” and .mmo means “MMIX object.”) Now invoke the simulator by saying ‘mmix Hello’.

The MMIX simulator implements some of the simplest features of a hypothetical operating system called NNIX. If an object file called, say, foo.mmo is present, NNIX will launch it when a command line such as

\[ \text{foo bar xyzzy} \quad (1) \]

is given. You can obtain the corresponding behavior by invoking the simulator with the command line `mmix ⟨options⟩ foo bar xyzzy`, where ⟨options⟩ is a sequence of zero or more special requests. For example, option `-P` will print a profile of the program after it has halted.

An MMIX program always begins at symbolic location `Main`. At that time register $0 contains the number of command line arguments, namely the number of words on the command line. Register $1 contains the memory address of the first such argument, which is always the name of the program. The operating system has placed all of the arguments into consecutive octabytes, starting at the address in $1 and ending with an octabyte of all zeros. Each argument is represented as a `string`, meaning that it is the address in memory of a sequence of zero or more nonzero bytes followed by a byte that is zero; the nonzero bytes are the `characters` of the string.

For example, the command line (1) would cause $0 to be initially 3, and we might have

\[
\begin{align*}
\text{\$1} &= \text{\#4000000000000008} \quad \text{Pointer to the first string} \\
\text{M}_8[\text{\#4000000000000008}] &= \text{\#4000000000000028} \quad \text{First argument, the string "foo"} \\
\text{M}_8[\text{\#4000000000000010}] &= \text{\#4000000000000030} \quad \text{Second argument, the string "bar"} \\
\text{M}_8[\text{\#4000000000000018}] &= \text{\#4000000000000038} \quad \text{Third argument, the string "xyzzy"} \\
\text{M}_8[\text{\#4000000000000020}] &= \text{\#0000000000000000} \quad \text{Null pointer after the last argument} \\
\text{M}_8[\text{\#4000000000000028}] &= \text{\#666f6f0000000000} \quad \text{"f", "o", "o", 0, 0, 0, 0, 0} \\
\text{M}_8[\text{\#4000000000000030}] &= \text{\#6261720000000000} \quad \text{"b", "a", "r", 0, 0, 0, 0, 0} \\
\text{M}_8[\text{\#4000000000000038}] &= \text{\#78797a7a79000000} \quad \text{"x", "y", "z", "z", "y", 0, 0, 0}
\end{align*}
\]

NNIX sets up each argument string so that its characters begin at an octabyte boundary; strings in general can, however, start anywhere within an octabyte.

The first instruction of Program H, in line 03, puts the string pointer M₈[$1] into register $255; this string is the program name ‘Hello’. Line 04 is a special TRAP instruction, which asks the operating system to put string $255 into the standard output file. Similarly, lines 05 and 06 ask NNIX to contribute ‘, world’ and a newline character to the standard output. The symbol Fputs is predefined to equal 7, and the symbol StdOut is predefined to equal 1. Line 07, ‘TRAP 0,Halt,0’, is the normal way to terminate a program. We will discuss all such special TRAP commands at the end of this section.
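
The argument layout described above can be imitated in a few lines of Python. This is a sketch under stated assumptions, not the simulator's actual code: the helper name `build_argv` is hypothetical, and the starting address `POOL_SEGMENT + 8` is taken from the sample values shown for command line (1), where the text says only that we "might have" those addresses.

```python
POOL_SEGMENT = 0x4000000000000000

def build_argv(args, base=POOL_SEGMENT + 8):
    """Return (argc, argv_address, memory); memory maps octabyte addresses to values."""
    memory = {}
    # The pointer table holds one octabyte per argument plus a null terminator;
    # the strings themselves follow, each starting on an octabyte boundary.
    string_addr = base + 8 * (len(args) + 1)
    for i, arg in enumerate(args):
        memory[base + 8 * i] = string_addr
        data = arg.encode("ascii") + b"\0"
        data += b"\0" * (-len(data) % 8)          # pad to a multiple of 8 bytes
        for off in range(0, len(data), 8):
            memory[string_addr + off] = int.from_bytes(data[off:off + 8], "big")
        string_addr += len(data)
    memory[base + 8 * len(args)] = 0              # null pointer after the last argument
    return len(args), base, memory

argc, argv, mem = build_argv(["foo", "bar", "xyzzy"])
print(argc, hex(argv))
print(hex(mem[0x4000000000000038]))
```

With the three words of command line (1), the octabyte at #4000000000000038 holds the characters of "xyzzy" followed by three zero bytes.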

The characters of the string output by lines 05 and 06 are generated by the BYTE command in line 08. BYTE is a pseudo-operation of MMIXAL, not an operation of MMIX; but BYTE is different from pseudo-ops like IS and LOC, because it does assemble data into memory. In general, BYTE assembles a sequence of expressions into one-byte constants. The construction ", world" in line 08 is MMIXAL’s shorthand for the list

\[
\text{',', ' ', 'w', 'o', 'r', 'l', 'd'}
\]
of seven one-character constants. The constant #a on line 08 is the ASCII newline character, which causes a new line to begin when it appears in a file being printed. The final ‘0’ on line 08 terminates the string. Thus line 08 is a list of nine expressions, and it leads to the nine bytes shown at the left of lines 08–10.
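
A short Python sketch makes the byte count concrete. The helper `assemble_bytes` is our own illustrative name; it only mimics how BYTE expands a string constant and one-byte values, and it assumes ASCII characters.

```python
def assemble_bytes(*items):
    """Expand BYTE operands: a string gives one byte per character, an integer gives one byte."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            out.extend(item.encode("ascii"))
        else:
            if not 0 <= item < 256:
                raise ValueError("BYTE operand does not fit in one byte")
            out.append(item)
    return bytes(out)

data = assemble_bytes(", world", 0x0A, 0)
print(len(data))        # 9
print(data[0:4].hex())  # 2c20776f
print(data[4:8].hex())  # 726c640a
print(data[8:].hex())   # 00
```

The three printed groups correspond to the assembled code shown for lines 08, 09, and 10.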

Our third example introduces a few more features of the assembly language. The object is to compute and print a table of the first 500 prime numbers, with 10 columns of 50 numbers each. The table should appear as follows, when the standard output of our program is listed as a text file:

First Five Hundred Primes
0002 0233 0547 0877 1229 1597 1993 2371 2749 3187
0003 0239 0557 0881 1231 1601 1997 2377 2753 3191
0005 0241 0563 0883 1237 1607 1999 2381 2767 3203
. . .
0229 0541 0863 1223 1583 1987 2357 2741 3181 3571

We will use the following method.

Algorithm \textbf{P} (Print table of 500 primes). This algorithm has two distinct parts: Steps P1–P8 prepare an internal table of 500 primes, and steps P9–P11 print the answer in the form shown above.

\textbf{P1.} [Start table.] Set PRIME[1] $\leftarrow 2$, $n \leftarrow 3$, $j \leftarrow 1$. (In this program, $n$ runs through the odd numbers that are candidates for primes; $j$ keeps track of how many primes have been found so far.)

\textbf{P2.} [$n$ is prime.] Set $j \leftarrow j + 1$, PRIME[$j$] $\leftarrow n$.

\textbf{P3.} [500 found?] If $j = 500$, go to step P9.

\textbf{P4.} [Advance \textit{n}.] Set $n \leftarrow n + 2$.

\textbf{P5.} [$k \leftarrow 2$] Set $k \leftarrow 2$. (PRIME[$k$] will run through \textit{n}'s possible prime divisors.)

\textbf{P6.} [PRIME[$k$] \textbackslash n?] Divide \textit{n} by PRIME[$k$]; let \textit{q} be the quotient and \textit{r} the remainder. If \textit{r} $= 0$ (hence \textit{n} is not prime), go to P4.

\textbf{P7.} [PRIME[$k$] large?] If $q \leq$ PRIME[$k$], go to P2. (In such a case, \textit{n} must be prime. The proof of this fact is interesting and a little unusual; see exercise 11.)

\textbf{P8.} [Advance \textit{k}.] Increase \textit{k} by 1, and go to P6.

\textbf{P9.} [Print title.] Now we are ready to print the table. Output the title line and set $m \leftarrow 1$.

\textbf{P10.} [Print line.] Output a line that contains PRIME[$m$], PRIME[$50 + m$], . . . , PRIME[$450 + m$] in the proper format.

\textbf{P11.} [500 printed?] Increase \textit{m} by 1. If \textit{m} $\leq 50$, return to P10; otherwise the algorithm terminates.
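
Before turning to the MMIX version, it may help to see steps P1–P11 in a high-level language. The following Python sketch follows the steps literally; the function name `algorithm_p` and its parameters are our own, and the line layout (three blanks, then a blank and four digits per column) is an assumption that mirrors the strings used by the program below.

```python
def algorithm_p(count=500, columns=10):
    """Follow steps P1-P11 and return the output text.

    Assumes count >= 2 and that columns divides count.
    """
    prime = [0, 2]                       # P1: index 0 unused, so PRIME[1] = 2
    n = 3
    while True:
        prime.append(n)                  # P2: j <- j + 1, PRIME[j] <- n
        if len(prime) - 1 == count:      # P3: enough primes found?
            break
        while True:
            n += 2                       # P4: next odd candidate
            k = 2                        # P5
            while True:
                q, r = divmod(n, prime[k])       # P6
                if r == 0 or q <= prime[k]:      # composite, or P7 says prime
                    break
                k += 1                   # P8
            if r != 0:                   # the loop ended in P7, so n is prime
                break
    rows = count // columns
    lines = ["First Five Hundred Primes"]        # P9
    for m in range(1, rows + 1):                 # P10 and P11
        lines.append("   " + "".join(" %04d" % prime[rows * c + m]
                                     for c in range(columns)))
    return "\n".join(lines) + "\n"

print(algorithm_p(), end="")
```

The test in P7 relies on the fact proved in exercise 11; the sketch does not check it independently.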

Program \textbf{P} (Print table of 500 primes). This program has deliberately been written in a slightly clumsy fashion in order to illustrate most of the features of MMIXAL in a single program.

01  % Example program ... Table of primes
02  L   IS  500          The number of primes to find
03  t   IS  $255         Temporary storage
04  n   GREG 0          Prime candidate
05  q   GREG 0          Quotient
06  r   GREG 0          Remainder
07  jj  GREG 0          Index for PRIME[j]
08  kk  GREG 0          Index for PRIME[k]
09  pk  GREG 0          Value of PRIME[k]
10  mm  IS  kk           Index for output lines
11      LOC Data_Segment
12  PRIME1 WYDE 2       PRIME[1] = 2
13      LOC PRIME1+2*L
14  ptop  GREG @        Address of PRIME[501]
15  j0   GREG PRIME1+2-@ Initial value of jj
16  BUF OCTA 0           Place to form decimal string
17
18      LOC #100
19  Main SET n,3         P1. Start table. n ← 3.
20   SET jj,j0           j ← 1.
21  2H STWU n,ptop,jj    P2. n is prime. PRIME[j+1] ← n.
22   INCL jj,2            j ← j + 1.
23  3H BZ jj,2F          P3. 500 found?
24  4H INCL n,2           P4. Advance n.
25  5H SET kk,j0         P5. k ← 2.
26  6H LDWU pk,ptop,kk   P6. PRIME[k]\n?
27   DIV q,n,pk            q ← ⌊n/PRIME[k]⌋.
28   GET r,rR             r ← n mod PRIME[k].
29   BZ r,4B               To P4 if r = 0.
30  7H CMP t,q,pk         P7. PRIME[k] large?
31   BNP t,2B              To P2 if q ≤ PRIME[k].
32  8H INCL kk,2           P8. Advance k. k ← k + 1.
33   JMP 6B                To P6.

Fig. 14. Algorithm P.
34      GREG @                     Base address
35  Title BYTE "First Five Hundred Primes"
36  NewLn BYTE #a,0                Newline and string terminator
37  Blanks BYTE "   ",0            String of three blanks
38  2H LDA t,Title                 P9. Print title.
39   TRAP 0,Fputs,StdOut
40   NEG mm,2                      Initialize m.
41  3H ADD mm,mm,j0                P10. Print line.
42   LDA t,Blanks                  Output "   ".
43   TRAP 0,Fputs,StdOut
44  2H LDWU pk,ptop,mm             pk ← prime to be printed.
45  0H GREG #2030303030000000      " 0000",0,0,0
46   STOU 0B,BUF                   Prepare buffer for decimal conversion.
47   LDA t,BUF+4                   t ← position of units digit.
48  1H DIV pk,pk,10                pk ← ⌊pk/10⌋.
49   GET r,rR                      r ← next digit.
50   INCL r,'0'                    r ← ASCII digit r.
51   STBU r,t,0                    Store r in the buffer.
52   SUB t,t,1                     Move one byte to the left.
53   PBNZ pk,1B                    Repeat on remaining digits.
54   LDA t,BUF                     Output " " and four digits.
55   TRAP 0,Fputs,StdOut
56   INCL mm,2*L/10                Advance by 50 wydes.
57   PBN mm,2B
58   LDA t,NewLn                   Output a newline.
59   TRAP 0,Fputs,StdOut
60   CMP t,mm,2*(L/10-1)           P11. 500 printed?
61   PBNZ t,3B                     To P10 if not done.
62   TRAP 0,Halt,0

The following points of interest should be noted about this program:

1. Line 01 begins with a percent sign and line 17 is blank. Such "comment" 
   lines are merely explanatory; they have no effect on the assembled program.

   Each non-comment line has three fields called LABEL, OP, and EXPR, 
   separated by spaces. The EXPR field contains one or more symbolic expressions 
   separated by commas. Comments may follow the EXPR field.

2. As in Program M, the pseudo-operation IS sets the equivalent of a symbol. 
   For example, in line 02 the equivalent of L is set to 500, which is the number 
   of primes to be computed. Notice that in line 03, the equivalent of t is set to $255, 
   a register number, while L’s equivalent was 500, a pure number. Some symbols 
   have register number equivalents, ranging from $0 to $255; others have pure 
   equivalents, which are octabytes. We will generally use symbolic names that 
   begin with a lowercase letter to denote registers, and names that begin with an 
   uppercase letter to denote pure values, although MMIXAL does not enforce this 
   convention.

3. The pseudo-op GREG on line 04 allocates a global register. Register $255 
   is always global; the first GREG causes $254 to be global, and the next GREG does the same for $253, etc. Lines 04–09 therefore allocate six global registers, and they cause the symbols n, q, r, jj, kk, pk to be respectively equivalent to $254, $253, $252, $251, $250, $249. Line 10 makes mm equivalent to $250.

If the EXPR field of a GREG definition is zero, as it is on lines 04–09, the global register is assumed to have a dynamically varying value when the program is run. But if a nonzero expression is given, as on lines 14, 15, 34, and 45, the global register is assumed to be constant throughout a program’s execution. MMIXAL uses such global registers as base addresses when subsequent instructions refer to memory. For example, consider the instruction ‘LDA t,BUF+4’ in line 47. MMIXAL is able to discover that global register ptop holds the address of BUF; therefore ‘LDA t,BUF+4’ can be assembled as ‘LDA t,ptop,4’. Similarly, the LDA instructions on lines 38, 42, and 58 make use of the nameless base address introduced by the instruction ‘GREG @’ on line 34. (Recall from Section 1.3.1′ that @ denotes the current location.)
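
The search for a base address can be sketched as follows. This Python fragment is an illustration only: the name `resolve_base` is hypothetical, and the rule of preferring the candidate with the smallest offset is an assumption of the sketch, since the text requires only that the address be a base plus a one-byte value.

```python
def resolve_base(address, constant_globals):
    """Pick a constant global register whose value lies 0..255 bytes at or below `address`."""
    best = None
    for name, value in constant_globals.items():
        offset = address - value
        if 0 <= offset < 256 and (best is None or offset < best[1]):
            best = (name, offset)
    if best is None:
        raise ValueError("no base address within reach")
    return best

DATA_SEGMENT = 0x2000000000000000
L = 500
PRIME1 = DATA_SEGMENT
ptop = PRIME1 + 2 * L            # line 14: ptop GREG @
j0 = (PRIME1 + 2 - ptop) % 2**64 # line 15, as an unsigned octabyte
BUF = ptop                       # line 16: BUF starts at the location ptop points to

print(resolve_base(BUF + 4, {"ptop": ptop, "j0": j0}))  # ('ptop', 4)
```

Here j0 is also a constant global register, but its value is nowhere near BUF, so only ptop qualifies.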

4. A good assembly language should mimic the way a programmer thinks about machine programs. One example of this philosophy is the automatic allocation of global registers and base addresses. Another example is the idea of local symbols such as the symbol 2H, which appears in the label field of lines 21, 38, and 44.

Local symbols are special symbols whose equivalents can be redefined as many times as desired. A global symbol like PRIME1 has but one significance throughout a program, and if it were to appear in the label field of more than one line an error would be indicated by the assembler. But local symbols have a different nature; we write, for example, 2H ("2 here") in the LABEL field, and 2F ("2 forward") or 2B ("2 backward") in the EXPR field of an MMIXAL line:

2B means the closest previous label 2H;
2F means the closest following label 2H.

Thus the 2F in line 23 refers to line 38; the 2B in line 31 refers back to line 21; and the 2B in line 57 refers to line 44. The symbols 2F and 2B never refer to their own line. For example, the MMIXAL instructions

```
2H   IS  $10
2H   BZ  2B, 2F
2H   IS  2B-4
```

are virtually equivalent to the single instruction

```
BZ  $10,@-4
```

The symbols 2F and 2B should never be used in the LABEL field; the symbol 2H should never be used in the EXPR field. If 2B occurs before any appearance of 2H, it denotes zero. There are ten local symbols, which can be obtained by replacing ‘2’ in these examples by any digit from 0 to 9.

The idea of local symbols was introduced by M. E. Conway in 1958, in connection with an assembly program for the UNIVAC I. Local symbols free us from the obligation to choose a symbolic name when we merely want to refer to
an instruction a few lines away. There often is no appropriate name for nearby locations, so programmers have tended to introduce meaningless symbols like X1, X2, X3, etc., with the potential danger of duplication.
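
The local symbol convention amounts to a simple renaming pass. The Python sketch below is one way to picture it; the function `resolve_local_symbols` and the generated names such as `_2H_1` are hypothetical, and the sketch ignores string constants, which a real assembler would have to skip.

```python
import re

def resolve_local_symbols(lines):
    """Rename dH labels and dB/dF references; `lines` are (label, op, expr) triples."""
    here = {str(d): [] for d in range(10)}       # line indices of each dH label
    for i, (label, _, _) in enumerate(lines):
        if re.fullmatch(r"\dH", label):
            here[label[0]].append(i)

    def unique(digit, index):
        return f"_{digit}H_{index}"

    resolved = []
    for i, (label, op, expr) in enumerate(lines):
        def replace(match):
            digit, direction = match.group(1), match.group(2)
            if direction == "B":
                earlier = [j for j in here[digit] if j < i]
                return unique(digit, earlier[-1]) if earlier else "0"
            later = [j for j in here[digit] if j > i]
            if not later:
                raise ValueError(f"{digit}F has no following {digit}H")
            return unique(digit, later[0])
        # The lookbehind keeps hexadecimal constants such as #2B untouched.
        new_expr = re.sub(r"(?<![#\w$])(\d)([BF])\b", replace, expr)
        new_label = unique(label[0], i) if re.fullmatch(r"\dH", label) else label
        resolved.append((new_label, op, new_expr))
    return resolved

example = [("2H", "IS", "$10"), ("2H", "BZ", "2B,2F"), ("2H", "IS", "2B-4")]
for line in resolve_local_symbols(example):
    print(line)
```

Applied to the three-line example above, the BZ instruction ends up referring to the first label (register $10) and to the third label, whose value is the BZ's own location minus 4.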

5. The reference to Data_Segment on line 11 introduces another new idea. In most embodiments of MMIX, the $2^{64}$-byte virtual address space is broken into two parts, called user space (addresses #0000000000000000 .. #7fffffffffffffff) and kernel space (addresses #8000000000000000 .. #ffffffffffffffff). The “negative” addresses of kernel space are reserved for the operating system.

User space is further subdivided into four segments of $2^{61}$ bytes each. First comes the text segment; the user’s program generally resides here. Then comes the data segment, beginning at virtual address #2000000000000000; this is for variables whose memory locations are allocated once and for all by the assembler, and for other variables allocated by the user without the help of the system library. Next is the pool segment, beginning at #4000000000000000; command line arguments and other dynamically allocated data go here. Finally the stack segment, which starts at #6000000000000000, is used by the MMIX hardware to maintain the register stack governed by PUSH, POP, SAVE, and UNSAVE. Three symbols,

\[
\begin{align*}
\text{Data\_Segment} & = \text{\#2000000000000000}, \\
\text{Pool\_Segment} & = \text{\#4000000000000000}, \\
\text{Stack\_Segment} & = \text{\#6000000000000000},
\end{align*}
\]

are predefined for convenience in MMIXAL. Nothing should be assembled into the pool segment or the stack segment, although a program may refer to data found there. References to addresses near the beginning of a segment might be more efficient than references to addresses that come near the end; for example, MMIX might not be able to access the last byte of the text segment, M[#1fffffffffffffff], as fast as it can read the first byte of the data segment.

Our programs for MMIX will always consider the text segment to be read-only: Everything in memory locations less than #2000000000000000 will remain constant once a program has been assembled and loaded. Therefore Program P puts the prime table and the output buffer into the data segment.
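
The division of the address space can be summarized by a small lookup. The following Python sketch is illustrative; the function name `segment_of` is our own, and "kernel" is used as a label for kernel space even though it is not one of the four user segments.

```python
REGIONS = [
    (0x0000000000000000, "text"),
    (0x2000000000000000, "data"),
    (0x4000000000000000, "pool"),
    (0x6000000000000000, "stack"),
    (0x8000000000000000, "kernel"),
]

def segment_of(address):
    """Name the region of the 64-bit virtual address space that contains `address`."""
    if not 0 <= address < 1 << 64:
        raise ValueError("address must be an unsigned octabyte")
    name = None
    for start, label in REGIONS:      # regions are listed in increasing order
        if address >= start:
            name = label
    return name

print(segment_of(0x100))                      # text
print(segment_of(0x2000000000000000 + 1000))  # data
print(segment_of(0x4000000000000008))         # pool
```

Program P's instructions at #100 fall in the text segment, while its prime table and buffer fall in the data segment.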

6. The text and data segments are entirely zero at the beginning of a program, except for instructions and data that have been loaded in accordance with the MMIXAL specification of the program. If two or more bytes of data are destined for the same cell of memory, the loader will fill that cell with their bitwise exclusive-or.

7. The symbolic expression ‘PRIME1+2*L’ on line 13 indicates that MMIXAL has the ability to do arithmetic on octabytes. See also the more elaborate example ‘2*(L/10-1)’ on line 60.

8. As a final note about Program P, we can observe that its instructions have been organized so that registers are counted towards zero, and tested against zero, whenever possible. For example, register jj holds a quantity that is related to the positive variable j of Algorithm P, but jj is normally negative; this change
makes it easy for the machine to decide when \( j \) has reached 500 (line 23). Lines
40–61 are particularly noteworthy in this regard, although perhaps a bit tricky.
The binary-to-decimal conversion routine in lines 45–55, based on division by 10,
is simple but not the fastest possible. More efficient methods are discussed in
Section 4.4.
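
The conversion loop of lines 45–55 can be paraphrased in Python as follows. This is a sketch, with the hypothetical name `to_decimal_field`; it assumes values below 10000, which holds for every prime that Program P prints.

```python
def to_decimal_field(value):
    """Fill a ' 0000' template from the right, one digit per division by 10."""
    if not 0 <= value < 10000:
        raise ValueError("the five-character field holds at most four digits")
    buf = bytearray(b" 0000\0\0\0")      # line 45: the octabyte #2030303030000000
    t = 4                                # line 47: position of the units digit
    while True:
        value, r = divmod(value, 10)     # lines 48-49: quotient and next digit
        buf[t] = ord("0") + r            # lines 50-51: store the ASCII digit
        t -= 1                           # line 52: move one byte to the left
        if value == 0:                   # line 53: stop when no digits remain
            break
    return buf[:buf.index(0)].decode("ascii")   # output stops at the first zero byte

print(repr(to_decimal_field(2)))     # ' 0002'
print(repr(to_decimal_field(3571)))  # ' 3571'
```

Because the template already contains zeros, short numbers need no special padding step.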

It may be of interest to note a few of the statistics observed when Program P
was actually run. The division instruction in line 27 was executed 9538 times.
The total time to perform steps P1–P8 (lines 19–33) was 10036 \( \mu + 641543 \upsilon \); steps
P9–P11 cost an additional 2804\( \mu + 124559\upsilon \), not counting the time taken by the
operating system to handle TRAP requests.

**Language summary.** Now that we have seen three examples of what can be
done in MMIXAL, it is time to discuss the rules more carefully, observing in
particular the things that cannot be done. The following comparatively few rules
define the language.

1. A **symbol** is a string of letters and/or digits, beginning with a letter. The
underscore character ‘_’ is regarded as a letter, for purposes of this definition,
and so are all Unicode characters whose code value exceeds 126. **Examples:**
PRIME1, Data_Segment, Main, _, pâte.

   The special constructions dH, dF, and dB, where d is a single digit, are effectively replaced by unique symbols according to the “local symbol” convention explained above.

2. A **constant** is either
   a) a **decimal constant**, consisting of one or more decimal digits \( \{0, 1, 2, 3, 4,
      5, 6, 7, 8, 9\} \), representing an unsigned octabyte in radix 10 notation; or
   b) a **hexadecimal constant**, consisting of a hash mark # followed by one or
      more hexadecimal digits \( \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, A, B, C, D, E, F\} \),
      representing an unsigned octabyte in radix 16 notation; or
   c) a **character constant**, consisting of a quote character ' followed by any
      character other than newline, followed by another quote '; this represents
      the ASCII or Unicode value of the quoted character.

   A **string constant** is a double-quote character " followed by one or more
   characters other than newline or double-quote, followed by another double-
   quote ". This construction is equivalent to a sequence of character constants
   for the individual characters, separated by commas.

3. Each appearance of a symbol in an MMIXAL program is said to be either a
   “defined symbol” or a “future reference.” A **defined symbol** is a symbol that
   has appeared in the LABEL field of a preceding line of this MMIXAL program. A
   **future reference** is a symbol that has not yet been defined in this way.

   A few symbols, like rR and ROUND_NEAR and V_BIT and W_Handler and
   Fputs, are predefined because they refer to constants associated with the MMIX
hardware or with its rudimentary operating system. Such symbols can be redefined, because MMIXAL does not assume that every programmer knows all their names. But no symbol should appear as a label more than once.

Every defined symbol has an equivalent value, which is either pure (an unsigned octabyte) or a register number (\$0 or \$1 or … or \$255).

4. A primary is either
   a) a symbol; or
   b) a constant; or
   c) the character @, denoting the current location; or
   d) an expression enclosed in parentheses; or
   e) a unary operator followed by a primary.

   The unary operators are + (affirmation, which does nothing), - (negation, which subtracts from zero), ~ (complementation, which changes all 64 bits), and $ (registerization, which converts a pure value to a register number).

5. A term is a sequence of one or more primaries separated by strong binary operators; an expression is a sequence of one or more terms separated by weak binary operators. The strong binary operators are * (multiplication), / (division), // (fractional division), % (remainder), << (left shift), >> (right shift), and & (bitwise and). The weak binary operators are + (addition), - (subtraction), | (bitwise or), and ^ (bitwise exclusive-or). These operations act on unsigned octabytes; x//y denotes $\lfloor 2^{64}x/y \rfloor$ if $x < y$, and it is undefined if $x \geq y$. Binary operators of the same strength are performed from left to right; thus a/b/c is (a/b)/c and a-b+c is (a-b)+c.

Example: #ab<<32+k&~(k-1) is an expression, the sum of terms #ab<<32 and k&~(k-1). The latter term is the bitwise and of primaries k and ~(k-1). The latter primary is the complement of (k-1), a parenthesized expression that is the difference of two terms k and 1. The term 1 is also a primary, and also a constant, in fact it is a decimal constant. If symbol k is equivalent to #cdef00, say, the entire expression #ab<<32+k&~(k-1) is equivalent to #ab00000100.

Binary operators are allowed only on pure numbers, except in cases like \$1+2 = \$3 and \$3−\$1 = 2. Future references cannot be combined with anything else; an expression like 2F+1 is always illegal, because 2F never corresponds to a defined symbol.
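
The grouping of the example in rule 5 can be checked with ordinary integer arithmetic. The Python sketch below spells out the strong-before-weak grouping with explicit parentheses, because Python's own operator precedence differs from MMIXAL's; the sample value k = #cdef00 and the helper name `fractional_divide` are choices made for this illustration.

```python
MASK = (1 << 64) - 1   # every result is reduced to an unsigned octabyte

def fractional_divide(x, y):
    """x//y in MMIXAL notation: floor(2**64 * x / y), defined only when x < y."""
    if not 0 <= x < y:
        raise ValueError("fractional division requires x < y")
    return (x << 64) // y

k = 0xCDEF00
term_1 = (0xAB << 32) & MASK          # strong operator <<
term_2 = k & (~(k - 1) & MASK)        # strong operator & applied to k and ~(k-1)
value = (term_1 + term_2) & MASK      # the weak operator + joins the two terms

print(hex(value))                     # 0xab00000100
print(hex(fractional_divide(1, 2)))   # 0x8000000000000000
```

The second term isolates the lowest 1 bit of k, which is #100 for this value of k.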

6. An instruction consists of three fields:
   a) the LABEL field, which is either blank or a symbol;
   b) the OP field, which is either an MMIX opcode or an MMIXAL pseudo-op;
   c) the EXPR field, which is a list of one or more expressions separated by commas. The EXPR field can also be blank, in which case it is equivalent to the single expression 0.

7. Assembly of an instruction takes place in three steps:

a) The current location @ is aligned, if necessary, by increasing it to the next multiple of

8, if OP is OCTA;
4, if OP is TETRA or an MMIX opcode;
2, if OP is WYDE.

b) The symbol in LABEL, if present, is defined to be @, unless OP = IS or OP = GREG.

c) If OP is a pseudo-operation, see rule 8. Otherwise OP is an MMIX instruction; the OP and EXPR fields define a tetrabyte as explained in Section 1.3.1, and @ advances by 4. Some MMIX opcodes have three operands in the EXPR field, others have two, and others have only one.

If OP is ADD, say, MMIXAL will expect three operands, and will check that the first and second operands are register numbers. If the third operand is pure, MMIXAL will change the opcode from #20 (“add”) to #21 (“add immediate”), and will check that the immediate value is less than 256.

If OP is SETH, say, MMIXAL will expect two operands. The first operand should be a register number; the second should be a pure value less than 65536.

An OP like BNZ takes two operands: a register and a pure number. The pure number should be expressible as a relative address; in other words, its value should be expressible as @ + 4k where −65536 ≤ k < 65536.

Any OP that refers to memory, like LDB or GO, has a two-operand form \$X,A as well as the three-operand forms \$X,\$Y,\$Z or \$X,\$Y,Z. The two-operand option can be used when the memory address A is expressible as the sum \$Y+Z of a base address and a one-byte value; see rule 8(b).
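
Two of the mechanical parts of rule 7 are easy to sketch in Python: the alignment of step (a) and the opcode adjustment for ADD in step (c). The function names `align` and `add_opcode` are hypothetical, and treating every OP not listed in the table as an MMIX opcode is a simplification of this sketch.

```python
PSEUDO_ALIGNMENT = {"OCTA": 8, "TETRA": 4, "WYDE": 2,
                    "BYTE": 1, "IS": 1, "GREG": 1, "LOC": 1}

def align(location, op):
    """Step (a): round the current location up to the multiple that `op` requires."""
    size = PSEUDO_ALIGNMENT.get(op, 4)    # anything else is taken to be an MMIX opcode
    return (location + size - 1) // size * size

def add_opcode(third_is_register, third_value=0):
    """Step (c) for ADD: #20 with a register operand, #21 with a pure value below 256."""
    if third_is_register:
        return 0x20
    if not 0 <= third_value < 256:
        raise ValueError("immediate value must be less than 256")
    return 0x21

print(hex(align(0x11d, "OCTA")))     # 0x120
print(hex(align(0x11d, "WYDE")))     # 0x11e
print(hex(add_opcode(False, 100)))   # 0x21
```

A location that is already a suitable multiple is left unchanged by `align`.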

8. MMIXAL includes the following pseudo-operations.

a) OP = IS: The EXPR should be a single expression; the symbol in LABEL, if present, is made equivalent to the value of this expression.

b) OP = GREG: The EXPR should be a single expression with a pure equivalent, $x$.

The symbol in LABEL, if present, is made equivalent to the largest previously unallocated global register number, and this global register will contain $x$ when the program begins. If $x \neq 0$, the value of $x$ is considered to be a base address, and the program should not change that global register.

c) OP = LOC: The EXPR should be a single expression with a pure equivalent, $x$.

The value of @ is set to $x$. For example, the instruction ‘T LOC @+1000’ defines symbol T to be the address of the first of a sequence of 1000 bytes, and advances @ to the byte following that sequence.

d) OP = BYTE, WYDE, TETRA, or OCTA: The EXPR field should be a list of pure expressions that each fit in 1, 2, 4, or 8 bytes, respectively.

9. MMIXAL restricts future references so that the assembly process can work quickly in one pass over the program. A future reference is permitted only

a) in a relative address: as the operand of JMP, or as the second operand of a branch, probable branch, PUSHJ, or GETA; or

b) in an expression assembled by OCTA.
% Example program ... Table of primes
L IS 500    The number of primes to find
t IS $255    Temporary storage
n GREG      ; Prime candidate
q GREG /* Quotient */
r GREG // Remainder
jj GREG 0   Index for PRIME[j]

PBN mm,2B
LDA t,NewLn; TRAP 0,Fputs,StdOut
CMP t,mm,2*(L/10-1) ; PBNZ t,3B; TRAP 0,Halt,0

Fig. 15. Program P as a computer file: The assembler tolerates many formats.

MMIXAL also has a few additional features relevant to system programming that do not concern us here. Complete details of the full language appear in the MMIXware document, together with the complete logic of a working assembler.

A free format can be used when presenting an MMIXAL program to the assembler (see Fig. 15). The LABEL field starts at the beginning of a line and continues up to the first blank space. The next nonblank character begins the OP field, which continues to the next blank, etc. The whole line is a comment if the first nonblank character is not a letter or digit; otherwise comments start after the EXPR field. Notice that the GREG definitions for n, q, and r in Fig. 15 have a blank EXPR field (which is equivalent to the single expression '0'); therefore the comments on those lines need to be introduced by some sort of special delimiter. But no such delimiter is necessary on the GREG line for jj, because an explicit EXPR of 0 appears there.

The final lines of Fig. 15 illustrate the fact that two or more instructions can be placed on a single line of input to the assembler, if they are separated by semicolons. If an instruction following a semicolon has a nonblank label, the label must immediately follow the ';'.

A consistent format would obviously be better than the hodgepodge of different styles shown in Fig. 15, because computer files are easier to read when they aren’t so chaotic. But the assembler itself is very forgiving; it doesn’t mind occasional sloppiness.

**Primitive input and output.** Let us conclude this section by discussing the special TRAP operations supported by the MMIX simulator. These operations provide basic input and output functions on which facilities at a much higher level could be built. A two-instruction sequence of the form

\[ \text{SET } \$255,\langle \text{arg} \rangle; \text{ TRAP } 0,\langle \text{function} \rangle,\langle \text{handle} \rangle \quad (2) \]

is usually used to invoke such a function, where \( \langle \text{arg} \rangle \) points to a parameter and \( \langle \text{handle} \rangle \) identifies the relevant file. For example, Program H uses

\[ \text{GETA } \$255,\text{String}; \text{ TRAP } 0,\text{Fputs},\text{StdOut} \]

to put a string into the standard output file, and Program P is similar.
After the TRAP has been serviced by the operating system, register $255 will contain a return value. In each case this value will be negative if and only if an error occurred. Programs H and P do not check for file errors, because they assume that the correctness or incorrectness of the standard output will speak for itself; but error detection and error recovery are usually important in well-written programs.

- **Fopen(handle, name, mode)**. Each of the ten primitive input/output traps applies to a handle, which is a one-byte integer. Fopen associates handle with an external file whose name is the string name, and prepares to do input and/or output on that file. The third parameter, mode, must be one of the values TextRead, TextWrite, BinaryRead, BinaryWrite, or BinaryReadWrite, all of which are predefined in MMIXAL. In the three ...Write modes, any previous file contents are discarded. The value returned is 0 if the handle was successfully opened, otherwise $-1$.

The calling sequence for Fopen is

\[
\text{LDA \$255,Arg;} \quad \text{TRAP 0,Fopen,(handle)} \quad (3)
\]

where Arg is a two-octabyte sequence

\[
\text{Arg OCTA (name),(mode)} \quad (4)
\]

that has been placed elsewhere in memory. For example, to call the function Fopen(5, "foo", BinaryWrite) in an MMIXAL program, we could put

\[
\begin{align*}
\text{Arg OCTA 1F,BinaryWrite} \\
1H \quad &\text{BYTE "foo",0}
\end{align*}
\]

into, say, the data segment, and then give the instructions

\[
\text{LDA \$255,Arg;} \quad \text{TRAP 0,Fopen,5.}
\]

This would open handle 5 for writing a new file of binary output,* to be named "foo".

Three handles are already open at the beginning of each program: The standard input file StdIn (handle 0) has mode TextRead; the standard output file StdOut (handle 1) has mode TextWrite; the standard error file StdErr (handle 2) also has mode TextWrite.

- **Fclose(handle)**. If handle has been opened, Fclose causes it to be closed, hence no longer associated with any file. Again the result is 0 if successful, or $-1$ if the file was already closed or unclosable. The calling sequence is simply

\[
\text{TRAP 0,Fclose,(handle)} \quad (5)
\]

because there is no need to put anything in \$255.

* Different computer systems have different notions of what constitutes a text file and what constitutes a binary file. Each MMIX simulator adopts the conventions of the operating system on which it resides.

- **Fread**(handle, buffer, size). The file handle should have been opened with mode TextRead, BinaryRead, or BinaryReadWrite. The next size bytes are read from the file into MMIX’s memory starting at address buffer. The value \( n - size \) is returned, where \( n \) is the number of bytes successfully read and stored, or \(-1 - size\) if an error occurred. The calling sequence is

\[
\text{LDA \$255,Arg; TRAP 0,Fread,(handle)} \quad (6)
\]

with two octabytes for the other arguments

\[
\text{Arg OCTA (buffer),(size)} \quad (7)
\]
as in (3) and (4).

- **Fgets**(handle, buffer, size). The file handle should have been opened with mode TextRead, BinaryRead, or BinaryReadWrite. One-byte characters are read into MMIX’s memory starting at address buffer, until either size−1 characters have been read and stored or a newline character has been read and stored; the next byte in memory is then set to zero. If an error or end of file occurs before reading is complete, the memory contents are undefined and the value −1 is returned; otherwise the number of characters successfully read and stored is returned. The calling sequence is the same as (6) and (7), except of course that Fgets replaces Fread in (6).

- **Fgetws**(handle, buffer, size). This command is the same as Fgets, except that it applies to wyde characters instead of one-byte characters. Up to size−1 wyde characters are read; a wyde newline is #000a.

- **Fwrite**(handle, buffer, size). The file handle should have been opened with one of the modes TextWrite, BinaryWrite, or BinaryReadWrite. The next size bytes are written from MMIX’s memory starting at address buffer. The value \( n - size \) is returned, where \( n \) is the number of bytes successfully written. The calling sequence is analogous to (6) and (7).

- **Fputs**(handle, string). The file handle should have been opened with mode TextWrite, BinaryWrite, or BinaryReadWrite. One-byte characters are written from MMIX’s memory to the file, starting at address string, up to but not including the first byte equal to zero. The number of bytes written is returned, or −1 on error. The calling sequence is

\[
\text{SET \$255,(string); TRAP 0,Fputs,(handle)} \quad (8)
\]

- **Fputws**(handle, string). This command is the same as Fputs, except that it applies to wyde characters instead of one-byte characters.

- **Fseek**(handle, offset). The file handle should have been opened with mode BinaryRead, BinaryWrite, or BinaryReadWrite. This operation causes the next input or output operation to begin at offset bytes from the beginning of the file, if offset ≥ 0, or at −offset−1 bytes before the end of the file, if offset < 0. (For example, offset = 0 “rewinds” the file to its very beginning; offset = −1 moves forward all the way to the end.) The result is 0 if successful, or −1 if the stated positioning could not be done. The calling sequence is

\[
\text{SET \$255,(offset); TRAP 0,Fseek,(handle)} \quad (9)
\]

An Fseek command must be given when switching from input to output or from output to input in BinaryReadWrite mode.

- **Ftell**(handle). The given file handle should have been opened with mode BinaryRead, BinaryWrite, or BinaryReadWrite. This operation returns the current file position, measured in bytes from the beginning, or −1 if an error has occurred. The calling sequence is simply

\[
\text{TRAP 0,Ftell,(handle)} \quad (10)
\]

Complete details about all ten of these input/output functions appear in the MMIXware document, together with a reference implementation. The symbols

\[
\begin{align*}
\text{Fopen} & = 1, & \text{Fwrite} & = 6, & \text{TextRead} & = 0, \\
\text{Fclose} & = 2, & \text{Fputs} & = 7, & \text{TextWrite} & = 1, \\
\text{Fread} & = 3, & \text{Fputws} & = 8, & \text{BinaryRead} & = 2, \\
\text{Fgets} & = 4, & \text{Fseek} & = 9, & \text{BinaryWrite} & = 3, \\
\text{Fgetws} & = 5, & \text{Ftell} & = 10, & \text{BinaryReadWrite} & = 4
\end{align*}
\]

are predefined in MMIXAL; also Halt = 0.
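The return-value and positioning conventions described above can be modeled in a few lines. The following Python sketch is only an illustration, not MMIX code; the function names `fread_bytes_read` and `fseek_target` are hypothetical, and `file_size` is assumed to be the current length of the file in bytes.

```python
from typing import Optional


def fread_bytes_read(result: int, size: int) -> Optional[int]:
    """Recover n, the number of bytes read, from Fread's return value n - size.

    The special value -1 - size signals an error, reported here as None.
    """
    if result == -1 - size:
        return None
    return result + size


def fseek_target(offset: int, file_size: int) -> int:
    """Absolute byte position selected by Fseek(handle, offset).

    A nonnegative offset counts from the beginning of the file; a negative
    offset selects the position -offset-1 bytes before the end.
    """
    if offset >= 0:
        return offset
    return file_size - (-offset - 1)
```

A return value of 0 from Fread therefore means that the whole request was satisfied, and any negative value greater than −1 − size reports a short read.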

**EXERCISES — First set**

1. [05] (a) What is the meaning of ‘4B’ in line 29 of Program P? (b) Would the program still work if the label of line 24 were changed to ‘2H’ and the EXPR field of line 29 were changed to ‘r,2B’?

2. [10] Explain what happens if an MMIXAL program contains several instances of the line

\[
9H \ IS \ 9B+1
\]

and no other occurrences of 9H.

3. [25] What is the effect of the following program?

```plaintext
        LOC   Data_Segment
X0      IS    @
N       IS    100
x0      GREG  X0

(Insert Program M here)

Main    GETA  t,9F; TRAP 0,Fread,StdIn
        SET   $0,N<<3
1H      SR    $2,$0,3; PUSHJ $1,Maximum
        LDO   $3,x0,$0
        SL    $2,$2,3
        STO   $1,x0,$0; STO $3,x0,$2
        SUB   $0,$0,1<<3; PBNZ $0,1B
        GETA  t,9F; TRAP 0,Fwrite,StdOut
        TRAP  0,Halt,0
9H      OCTA  X0+1<<3,N<<3
```

4. [10] What is the value of the constant #112233445566778899?


6. [15] True or false: The single instruction TETRA (expr1),(expr2) always has the same effect as the pair of instructions TETRA (expr1); TETRA (expr2).

7. [05] John H. Quick (a student) was shocked, shocked to find that the instruction `GETA $0,@+1` gave the same result as `GETA $0,@`. Explain why he should not have been surprised.

8. [15] What's a good way to align the current location @ so that it is a multiple of 16, increasing it by 0 .. 15 as necessary?

9. [10] What changes to Program P will make it print a table of 600 primes? 

10. [25] Assemble Program P by hand. (It won't take as long as you think.) What are the actual numerical contents of memory, corresponding to that symbolic program? 

11. [HM20] (a) Show that every nonprime $n > 1$ has a divisor $d$ with $1 < d \leq \sqrt{n}$. 

(b) Use this fact to show that $n$ is prime if it passes the test in step P7 of Algorithm P. 

12. [15] The GREG instruction on line 34 of Program P defines a base address that is used for the string constants Title, NewLn, and Blank on lines 38, 42, and 58. Suggest a way to avoid using this extra global register, without making the program run slower. 

13. [20] Unicode characters make it possible to print the first 500 primes as 

[Display: a title line in Arabic, followed by the table of primes set in Arabic-Indic digits.]

with "authentic" Arabic numerals. One simply uses wyde characters instead of bytes, translating the English title and then substituting Arabic-Indic digits #0660–#0669 for the ASCII digits #30–#39. (Arabic script is written from right to left, but numbers still appear with their least significant digits at the right. The bidirectional presentation rules of Unicode automatically take care of the necessary reversals when the output is formatted.) What changes to Program P will accomplish this?
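The digit substitution mentioned in this exercise is a fixed shift of code points. The Python sketch below is only a model of that mapping, not a change to Program P; the helper name `to_arabic_indic` is hypothetical, and big-endian UTF-16 is used to show the wyde that would be stored for each digit.

```python
def to_arabic_indic(text: str) -> str:
    """Replace ASCII digits #30..#39 by Arabic-Indic digits #0660..#0669."""
    return "".join(
        chr(ord(ch) - 0x30 + 0x0660) if "0" <= ch <= "9" else ch
        for ch in text
    )


def as_wydes(text: str) -> bytes:
    """Encode text as two-byte (wyde) characters, most significant byte first."""
    return text.encode("utf-16-be")
```

For instance, the ASCII digit 7 (#37) becomes the wyde #0667.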

14. [27] Change Program P so that it uses floating point arithmetic for the divisibility test in step P6. (The FREM instruction always gives an exact result.) Use $\sqrt{n}$ instead of q in step P7. Do these changes increase or decrease the running time? 

15. [22] What does the following program do? (Do not run it on a computer; figure it out by hand!)

* Mystery Program 

a  GREG  '11'

b  GREG  ', ' 

c  GREG  Data_Segment  

LOC  #100 

Main  NEG  $1,1,75 

SET  $2,0 

2H  ADD  $3,1,75 

3H  STB  $2,9,82 

ADD  $2,8,1
16. [46] **MMIXAL** was designed with simplicity and efficiency in mind, so that people can easily prepare machine language programs for MMIX when those programs are relatively short. Longer programs are usually written in a higher-level language like C or Java, ignoring details at the machine level. But sometimes there is a need to write large-scale programs specifically for a particular machine, and to have precise control over each instruction. In such cases we ought to have a machine-oriented language with a much richer structure than the line-for-line approach of a traditional assembler.

Design and implement a language called PL/MMIX, which is analogous to Niklaus Wirth’s PL/360 language [JACM 15 (1968), 37–74]. Your language should also incorporate the ideas of literate programming [D. E. Knuth, Literate Programming (1992)].

**EXERCISES — Second set**

The next exercises are short programming problems, representing typical computer applications and covering a wide range of techniques. Every reader is encouraged to choose a few of these problems in order to get some experience using MMIX, as well as to practice basic programming skills. If desired, these exercises may be worked concurrently as the rest of Chapter 1 is being read. The following list indicates the types of programming techniques that are involved:

- The use of switching tables for multiway decisions: exercise 17.
- Computation with two-dimensional arrays: exercises 18, 28, and 35.
- Text and string manipulation: exercises 24, 25, and 35.
- Integer and scaled decimal arithmetic: exercises 21, 27, 30, and 32.
- Elementary floating point arithmetic: exercises 27 and 32.
- The use of subroutines: exercises 23, 24, 32, 33, 34, and 35.
- List processing: exercise 29.
- Real-time control: exercise 34.
- Typographic display: exercise 35.

Whenever an exercise in this book says “write an MMIX program” or “write an MMIX subroutine,” you need only write symbolic MMIXAL code for what is asked. This code will not be complete in itself; it will merely be a fragment of a (hypothetical) complete program. No input or output need be done in a code fragment, if the data is to be supplied externally; one need write only LABEL, OP, and EXPR fields of MMIXAL instructions, together with appropriate remarks. The numeric machine language, line number, and “Times” columns (see Program M) are not required unless specifically requested, nor will there be a Main label.

On the other hand, if an exercise says “write a complete MMIX program,” it implies that an executable program should be written in MMIXAL, including in particular the Main label. Such programs should preferably be tested with the help of an MMIX assembler and simulator.

17. [25] Register `$0` contains the address of a tetrabyte that purportedly is a valid, unprivileged MMIX instruction. (This means that `$0` ≥ 0 and that the X, Y, and Z bytes of $M_4[\$0]$ obey all restrictions imposed by the OP byte, according to the rules of Section 1.3.1′. For example, a valid instruction with opcode `FIX` will have Y ≤ `ROUND_NEAR`; a valid instruction with opcode `PUT` will have Y = 0 and either X < 8 or 18 < X < 32. The opcode `LDVTS` is always privileged, for use by the operating system only. But most opcodes define instructions that are valid and unprivileged for all X, Y, and Z.) Write an MMIX subroutine that checks the given tetrabyte for validity in this sense; try to make your program as efficient as possible.

Note: Inexperienced programmers tend to tackle a problem like this by writing a long series of tests on the OP byte, such as “`SR op,tetra,24; CMP t,op,#18; BN t,1F; CMP t,op,#98; BN t,2F; ...`”. This is not good practice! The best way to make multiway decisions is to prepare an auxiliary table containing information that encapsulates the desired logic. For example, a table of 256 octabytes, one for each opcode, could be accessed by saying “`SR t,tetra,21; LDO t,Table,t`”, followed perhaps by a `GO` instruction if many different kinds of actions need to be done. A tabular approach often makes a program dramatically faster and more flexible.

18. [37] Assume that a $9 \times 8$ matrix of signed one-byte elements

$$
\begin{pmatrix}
  a_{11} & a_{12} & a_{13} & \ldots & a_{18} \\
  a_{21} & a_{22} & a_{23} & \ldots & a_{28} \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  a_{91} & a_{92} & a_{93} & \ldots & a_{98}
\end{pmatrix}
$$

has been stored so that $a_{ij}$ is in location $A + 8i + j$ for some constant $A$. The matrix therefore appears as follows in MMIX’s memory:

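$$
\begin{pmatrix}
  M[A + 9] & M[A + 10] & M[A + 11] & \ldots & M[A + 16] \\
  M[A + 17] & M[A + 18] & M[A + 19] & \ldots & M[A + 24] \\
  \vdots & \vdots & \vdots & \ddots & \vdots \\
  M[A + 73] & M[A + 74] & M[A + 75] & \ldots & M[A + 80]
\end{pmatrix}
$$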

An $m \times n$ matrix is said to have a “saddle point” if some position is the smallest value in its row and the largest value in its column. In symbols, $a_{ij}$ is a saddle point if

$$
a_{ij} = \min_{1 \leq k \leq n} a_{ik} = \max_{1 \leq k \leq m} a_{kj}.
$$

Write an MMIX program that computes the location of a saddle point (if there is at least one) or zero (if there is no saddle point), and puts this value in register `$0`.
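A reference model can be handy for checking an MMIX solution against the definition. The Python sketch below simply applies the definition of a saddle point and the addressing rule A + 8i + j; it makes no attempt at efficiency, and the name `saddle_location` and its `base` and `stride` parameters are hypothetical.

```python
def saddle_location(matrix, base, stride=8):
    """Return base + stride*i + j for some saddle point a_ij (1-based), else 0.

    A saddle point is the smallest value in its row and the largest value
    in its column.
    """
    column_max = [max(column) for column in zip(*matrix)]
    for i, row in enumerate(matrix, start=1):
        row_min = min(row)
        for j, value in enumerate(row, start=1):
            if value == row_min and value == column_max[j - 1]:
                return base + stride * i + j
    return 0
```

With the 9 × 8 test matrix whose entries are 10j − i, the only candidate is the top left element, so the result is A + 9.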

19. [M29] What is the probability that the matrix in the preceding exercise has a saddle point, assuming that the 72 elements are distinct and assuming that all $72!$ permutations are equally likely? What is the corresponding probability if we assume instead that the elements of the matrix are zeros and ones, and that all $2^{72}$ such matrices are equally likely?

20. [HM42] Two solutions are given for exercise 18 (see page 102), and a third is suggested; it is not clear which of them is better. Analyze the algorithms, using each of the assumptions of exercise 19, and decide which is the better method.

21. [25] The ascending sequence of all reduced fractions between 0 and 1 that have denominators ≤ n is called the “Farey series of order n.” For example, the Farey series of order 7 is

\[
\frac{0}{1},\ \frac{1}{7},\ \frac{1}{6},\ \frac{1}{5},\ \frac{1}{4},\ \frac{2}{7},\ \frac{1}{3},\ \frac{2}{5},\ \frac{3}{7},\ \frac{1}{2},\ \frac{4}{7},\ \frac{3}{5},\ \frac{2}{3},\ \frac{5}{7},\ \frac{3}{4},\ \frac{4}{5},\ \frac{5}{6},\ \frac{6}{7},\ \frac{1}{1}.
\]

If we denote this series by \(x_0/y_0, x_1/y_1, x_2/y_2, \ldots\), exercise 22 proves that

\[
\begin{align*}
x_0 & = 0, \quad y_0 = 1; \quad x_1 = 1, \quad y_1 = n; \\
x_{k+2} & = \left\lfloor \frac{y_k + n}{y_{k+1}} \right\rfloor x_{k+1} - x_k; \\
y_{k+2} & = \left\lfloor \frac{y_k + n}{y_{k+1}} \right\rfloor y_{k+1} - y_k.
\end{align*}
\]
Write an MMIX subroutine that computes the Farey series of order \(n\), by storing the values of \(x_k\) and \(y_k\) in tetrabytes \(X+4k\) and \(Y+4k\), respectively. (The total number of terms in the series is approximately \(3n^2/\pi^2\); thus we may assume that \(n < 2^{32}\).)
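The recurrence is easy to try out before coding it for MMIX. The following Python sketch is a reference model only; the function name `farey` is hypothetical, and the loop is assumed to stop when the fraction 1/1 has been produced.

```python
def farey(n):
    """Return the Farey series of order n as a list of (x_k, y_k) pairs."""
    xs, ys = [0, 1], [1, n]          # x0/y0 = 0/1 and x1/y1 = 1/n
    while (xs[-1], ys[-1]) != (1, 1):
        q = (ys[-2] + n) // ys[-1]   # floor((y_k + n) / y_{k+1})
        xs.append(q * xs[-1] - xs[-2])
        ys.append(q * ys[-1] - ys[-2])
    return list(zip(xs, ys))
```

For n = 7 this yields the 19 fractions displayed above, from 0/1 through 1/1.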

22. [M30] (a) Show that the numbers \(x_k\) and \(y_k\) defined by the recurrence in the preceding exercise satisfy the relation \(x_{k+1} y_k - x_k y_{k+1} = 1\). (b) Show that the fractions \(x_k/y_k\) are indeed the Farey series of order \(n\), using the fact proved in (a).

23. [25] Write an MMIX subroutine that sets \(n\) consecutive bytes of memory to zero, given a starting address in `$0` and an integer \(n \geq 0\) in `$1`. Try to make your subroutine blazingly fast, when \(n\) is large; use an MMIX pipeline simulator to obtain realistic running-time statistics.

24. [30] Write an MMIX subroutine that copies a string, starting at the address in `$0`, to bytes of memory starting at the address in `$1`. Strings are terminated by null characters (that is, bytes equal to zero). Assume that there will be no overlap in memory between the string and its copy. Your routine should minimize the number of memory references by loading and storing eight bytes at a time when possible, so that long strings are copied efficiently. Compare your program to the trivial byte-at-a-time code

```
SUBU $1,$1,$0; 1H LDBU $2,$0,0; STBU $2,$0,$1; INCL $0,1; PBNZ $2,1B
```
which takes \((2n + 2)\mu + (4n + 7)\upsilon\) to copy a string of length \(n\).
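The stated running time can be checked by a short tally. The count below assumes the usual costs of Table 1 in Section 1.3.1′: \(\mu + \upsilon\) for each of LDBU and STBU, \(\upsilon\) for each of SUBU and INCL, and \(\upsilon\) for a correctly predicted PBNZ with a penalty of \(2\upsilon\) for the final, mispredicted one. The loop body is executed \(n + 1\) times, because the terminating null byte is copied too:

\[
\underbrace{\upsilon}_{\text{SUBU}} + (n+1)\,(2\mu + 4\upsilon) + \underbrace{2\upsilon}_{\text{misprediction}} = (2n + 2)\mu + (4n + 7)\upsilon .
\]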

25. [26] A cryptanalyst wants to count how often each character occurs in a long string of ciphertext. Write an MMIX program that computes 255 frequency counts, one for each nonnull character; the first null byte ends the given string. Try for a solution that is efficient in terms of the “mems and oops” criteria of Table 1 in Section 1.3.1′.

26. [32] Improve the solution to the previous exercise by optimizing its performance with respect to realistic configurations of the MMIX pipeline simulator.

27. [26] (Fibonacci approximations) Equation 1.2.8-(15) states that the formula \(F_n = \text{round}(\phi^n/\sqrt{5})\) holds for all \(n \geq 0\), where ‘round’ denotes rounding to the nearest integer. (a) Write a complete MMIX program to test how well this formula behaves with respect to floating point arithmetic: Compute straightforward approximations to \(\phi^n/\sqrt{5}\) for \(n = 0, 1, 2, \ldots\), and find the smallest \(n\) for which the approximation does not round to \(F_n\). (b) Exercise 1.2.8–28 proves that \(F_n = \text{round}(\phi F_{n-1})\) for all \(n \geq 3\). Find the smallest \(n \geq 3\) for which this equation fails when we compute \(\phi F_{n-1}\) approximately by fixed point multiplication of unsigned octabytes. (See Eq. 1.3.1–(17).)

28. [26] A magic square of order \(n\) is an arrangement of the numbers 1 through \(n^2\) in a square array in such a way that the sum of each row and column is \(n(n^2 + 1)/2\), and so is the sum of the two main diagonals. Figure 16 shows a magic square of order 7.
The rule for generating it is easily seen: Start with 1 just below the middle square, then go down and to the right diagonally until reaching a filled square; if you run off the edge, “wrap around” by imagining an entire plane tiled with squares. When you reach a nonempty position, drop down two spaces from the most-recently-filled square and continue. This method works whenever $n$ is odd.

Using memory allocated in a fashion like that of exercise 18, write a complete MMIX program to generate a $19 \times 19$ magic square by the method above, and to format the result in the standard output file. [This algorithm is due to Ibn al-Haytham, who was born in Basra about 965 and died in Cairo about 1040. Many other magic square constructions make good programming exercises; see W. W. Rouse Ball, *Mathematical Recreations and Essays*, revised by H. S. M. Coxeter (New York: Macmillan, 1939), Chapter 7.]
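The construction rule can be tried out in a high-level language before it is coded for MMIX. The Python sketch below is a reference model only; the name `magic_square` is hypothetical, and rows and columns are numbered from 0 at the top left.

```python
def magic_square(n):
    """Build an odd-order magic square by the diagonal rule described above."""
    if n % 2 == 0:
        raise ValueError("the method requires odd n")
    square = [[0] * n for _ in range(n)]
    row, col = n // 2 + 1, n // 2            # just below the middle square
    for value in range(1, n * n + 1):
        square[row][col] = value
        next_row, next_col = (row + 1) % n, (col + 1) % n   # down and right
        if square[next_row][next_col]:                      # already filled:
            next_row, next_col = (row + 2) % n, col         # drop down two
        row, col = next_row, next_col
    return square
```

For n = 7 the result agrees with Fig. 16, whose rows, columns, and diagonals all sum to 175.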

29. [30] (The Josephus problem.) There are $n$ men arranged in a circle. Beginning at a particular position, we count around the circle and brutally execute every $m$th man; the circle closes as men die. For example, the execution order when $n = 8$ and $m = 4$ is 54613872, as shown in Fig. 17: The first man is fifth to go, the second man is fourth, etc. Write a complete MMIX program that prints out the order of execution when $n = 24$, $m = 11$. Try to design a clever algorithm that works at high speed when $m$ and $n$ are large (it may save your life). *Reference:* W. Ahrens, *Mathematische Unterhaltungen und Spiele* 2 (Leipzig: Teubner, 1918), Chapter 15.
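The small example can be confirmed by direct simulation. The Python sketch below is deliberately naive: it removes men from a list one at a time, so it is not the clever high-speed algorithm that the exercise asks for. The name `execution_order` is hypothetical.

```python
def execution_order(n, m):
    """Return a list whose entry k-1 tells when man k is executed (1 = first)."""
    men = list(range(1, n + 1))
    order = [0] * n
    index = 0
    for turn in range(1, n + 1):
        index = (index + m - 1) % len(men)   # count m men around the circle
        victim = men.pop(index)
        order[victim - 1] = turn
    return order
```

For n = 8 and m = 4 the entries are 5, 4, 6, 1, 3, 8, 7, 2, in agreement with the string 54613872 quoted above.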

30. [31] We showed in Section 1.2.7 that the sum $1 + \frac{1}{2} + \frac{1}{3} + \cdots$ becomes infinitely large. But if it is calculated with finite accuracy by a computer, the sum actually exists, in some sense, because the terms eventually get so small that they contribute nothing to the sum if added one by one. For example, suppose we calculate the sum by rounding to one decimal place; then we have $1 + 0.5 + 0.3 + 0.2 + 0.2 + 0.2 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.0 + \cdots = 3.7$.

More precisely, let $r_n(x)$ be the number $x$ rounded to $n$ decimal places, rounding to an even digit in case of ties. For the purposes of this problem we can use the formula $r_n(x) = \lceil 10^nx - \frac{1}{2} \rceil / 10^n$. Then we wish to find

$$S_n = r_n(1) + r_n(\tfrac{1}{2}) + r_n(\tfrac{1}{3}) + \cdots;$$

we know that $S_1 = 3.7$, and the problem is to write a complete MMIX program that calculates and prints $S_n$ for $1 \leq n \leq 10$.

| 22 | 47 | 16 | 41 | 10 | 35 | 04 |
| 05 | 23 | 48 | 17 | 42 | 11 | 29 |
| 30 | 06 | 24 | 49 | 18 | 36 | 12 |
| 13 | 31 | 07 | 25 | 43 | 19 | 37 |
| 38 | 14 | 32 | 01 | 26 | 44 | 20 |
| 21 | 39 | 08 | 33 | 02 | 27 | 45 |
| 46 | 15 | 40 | 09 | 34 | 03 | 28 |

Fig. 16. A magic square.

Note: There is a much faster way to do this than the simple procedure of adding \( r_n(1/m) \), one number at a time, until \( r_n(1/m) \) becomes zero. For example, we have \( r_5(1/m) = 0.00001 \) for all values of \( m \) from 66667 to 199999; it’s wise to avoid calculating \( 1/m \) all 133333 times! An algorithm along the following lines is better.

**H1.** Start with \( m_1 = 1 \), \( S \leftarrow 1 \), \( k \leftarrow 1 \).

**H2.** Calculate \( r \leftarrow r_n(1/(m_k + 1)) \), and stop if \( r = 0 \).

**H3.** Find \( m_{k+1} \), the largest \( m \) for which \( r_n(1/m) = r \).

**H4.** Set \( S \leftarrow S + (m_{k+1} - m_k)r \), \( k \leftarrow k + 1 \), and return to H2.
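The steps above can be sketched in exact integer arithmetic. The Python model below keeps every quantity in units of \(10^{-n}\). It assumes the rounding formula \( r_n(x) = \lceil 10^n x - \frac{1}{2} \rceil / 10^n \), from which the largest \( m \) with \( 10^n r_n(1/m) = j \) is \( \lfloor (2 \cdot 10^n - 1)/(2j - 1) \rfloor \). The function names are hypothetical, and `brute_force_sum` is included only as a cross-check for small \( n \).

```python
from fractions import Fraction


def rounded_units(m, n):
    """Return 10**n * r_n(1/m) = ceil(10**n / m - 1/2), using integers only."""
    scale = 10 ** n
    return -((m - 2 * scale) // (2 * m))


def rounded_harmonic_sum(n):
    """Compute S_n by grouping equal terms, following steps H1 to H4."""
    scale = 10 ** n
    m = 1
    total = scale                                   # H1: S <- r_n(1) = 1
    while True:
        r = rounded_units(m + 1, n)                 # H2
        if r == 0:
            break
        m_next = (2 * scale - 1) // (2 * r - 1)     # H3
        total += (m_next - m) * r                   # H4
        m = m_next
    return Fraction(total, scale)


def brute_force_sum(n):
    """Add the terms one at a time; terms with m > 2 * 10**n are all zero."""
    scale = 10 ** n
    units = sum(rounded_units(m, n) for m in range(1, 2 * scale + 1))
    return Fraction(units, scale)
```

For n = 1 the groups are m = 1, 2, 3, then m = 4 to 6 (three terms of 0.2), then m = 7 to 19 (thirteen terms of 0.1), giving 3.7.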

31. [HM30] Using the notation of the preceding exercise, prove or disprove the formula

\[
\lim_{n \to \infty} (S_{n+1} - S_n) = \ln 10.
\]

32. [31] The following algorithm, due to the Neapolitan astronomer Aloysius Lilius and the German Jesuit mathematician Christopher Clavius in the late 16th century, is used by most Western churches to determine the date of Easter Sunday for any year after 1582.

**Algorithm E** (Date of Easter). Let \( Y \) be the year for which the date of Easter is desired.

**E1.** [Golden number.] Set \( G \leftarrow (Y \bmod 19) + 1 \). (\( G \) is the so-called “golden number” of the year in the 19-year Metonic cycle.)

**E2.** [Century.] Set \( C \leftarrow \lfloor Y/100 \rfloor + 1 \). (When \( Y \) is not a multiple of 100, \( C \) is the century number; for example, 1984 is in the twentieth century.)

**E3.** [Corrections.] Set \( X \leftarrow \lfloor 3C/4 \rfloor - 12 \), \( Z \leftarrow \lfloor (8C + 5)/25 \rfloor - 5 \). (Here \( X \) is the number of years, such as 1900, in which leap year was dropped in order to keep in step with the sun; \( Z \) is a special correction designed to synchronize Easter with the moon’s orbit.)

**E4.** [Find Sunday.] Set \( D \leftarrow \lfloor 5Y/4 \rfloor - X - 10 \). (March \( ((-D) \bmod 7) \) will actually be a Sunday.)

**E5.** [Epact.] Set \( E \leftarrow (11G + 20 + Z - X) \bmod 30 \). If \( E = 25 \) and the golden number \( G \) is greater than 11, or if \( E = 24 \), increase \( E \) by 1. (This number \( E \) is the epact, which specifies when a full moon occurs.)

**E6.** [Find full moon.] Set \( N \leftarrow 44 - E \). If \( N < 21 \) then set \( N \leftarrow N + 30 \). (Easter is supposedly the first Sunday following the first full moon that occurs on or after March 21. Actually perturbations in the moon’s orbit do not make this strictly true, but we are concerned here with the “calendar moon” rather than the actual moon. The \( N \)th of March is a calendar full moon.)

**E7.** [Advance to Sunday.] Set \( N \leftarrow N + 7 - ((D + N) \bmod 7) \).

**E8.** [Get month.] If \( N > 31 \), the date is \( (N - 31) \) APRIL; otherwise the date is \( N \) MARCH.

Write a subroutine to calculate and print the date of Easter, given the year, assuming that the year is less than 100000. The output should have the form “dd MONTH yyyy” where \( dd \) is the day and \( yyyy \) is the year. Write a complete MMIX program that uses this subroutine to prepare a table of the dates of Easter from 1950 through 2000.
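A reference model is useful for checking the table that the MMIX program prints. The Python sketch below follows steps E1 through E8 directly, with every division taken as a floor; it relies on the fact that Python’s `%` operator gives a nonnegative remainder for a positive modulus. The function name `easter` and the returned (day, month) pair are choices made for this illustration only.

```python
def easter(year):
    """Return (day, month name) of Easter Sunday according to Algorithm E."""
    g = year % 19 + 1                       # E1: golden number
    c = year // 100 + 1                     # E2: century
    x = 3 * c // 4 - 12                     # E3: dropped leap years
    z = (8 * c + 5) // 25 - 5               #     lunar correction
    d = 5 * year // 4 - x - 10              # E4: find Sunday
    e = (11 * g + 20 + z - x) % 30          # E5: epact
    if (e == 25 and g > 11) or e == 24:
        e += 1
    n = 44 - e                              # E6: calendar full moon
    if n < 21:
        n += 30
    n = n + 7 - (d + n) % 7                 # E7: advance to Sunday
    if n > 31:                              # E8: get month
        return (n - 31, "APRIL")
    return (n, "MARCH")
```

The years 1950 and 2000, the endpoints of the requested table, give 9 APRIL and 23 APRIL respectively.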

33. [M30] Some computers (not MMIX!) give a negative remainder when a negative number is divided by a positive number. Therefore a program for calculating the date of Easter by the algorithm in the previous exercise might fail when the quantity \((11G + 20 + Z - X)\) in step E5 is negative. For example, in the year 14250 we obtain \( G = 1, X = 95, Z = 40\); so if we had \( E = -24 \) instead of \( E = +6 \) we would get the ridiculous answer “42 APRIL”. [See CACM 5 (1962), 556.] Write a complete MMIX program that finds the earliest year for which this error would actually cause the wrong date to be calculated for Easter.
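The figures quoted for the year 14250 can be checked by carrying out the steps of Algorithm E by hand, first with the correct epact and then with the faulty remainder:

\[
\begin{align*}
G & = (14250 \bmod 19) + 1 = 1, \qquad C = \lfloor 14250/100 \rfloor + 1 = 143, \\
X & = \lfloor 429/4 \rfloor - 12 = 95, \qquad Z = \lfloor 1149/25 \rfloor - 5 = 40, \\
D & = \lfloor 71250/4 \rfloor - 95 - 10 = 17707, \\
11G + 20 + Z - X & = -24 \equiv 6 \pmod{30}; \\
E = 6: \quad & N = 38, \quad N \leftarrow 38 + 7 - (17745 \bmod 7) = 45, \quad \text{that is, 14 APRIL}; \\
E = -24: \quad & N = 68, \quad N \leftarrow 68 + 7 - (17775 \bmod 7) = 73, \quad \text{that is, “42 APRIL”}.
\end{align*}
\]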

34. [32] Assume that an MMIX computer has been wired up to the traffic signals at the corner of Del Mar Boulevard and Berkeley Avenue, via special “files” named /dev/lights and /dev/sensor. The computer activates the lights by outputting one byte to /dev/lights, specifying the sum of four two-bit codes as follows:

- Del Mar traffic light: #00 off, #40 green, #c0 amber, #80 red;
- Berkeley traffic light: #00 off, #10 green, #30 amber, #20 red;
- Del Mar pedestrian light: #00 off, #04 WALK, #0c DON’T WALK;
- Berkeley pedestrian light: #00 off, #01 WALK, #03 DON’T WALK.

Cars or pedestrians wishing to travel on Berkeley across the boulevard must activate a sensor; if this condition never occurs, the light for Del Mar should remain green. When MMIX reads a byte from /dev/sensor, the input is nonzero if and only if the sensor has been activated since the previous input.

Cycle times are as follows:
- Del Mar traffic light is green \( \geq 30 \) sec, amber 8 sec;
- Berkeley traffic light is green 20 sec, amber 5 sec.

When a traffic light is green or amber for one direction, the other direction has a red light. When the traffic light is green, the corresponding WALK light is on, except that DON'T WALK flashes for 12 sec just before a green light turns to amber, as follows:

- DON'T WALK \( \frac{1}{2} \) sec, off \( \frac{1}{2} \) sec, repeated 8 times;
- DON'T WALK 4 sec (and remains on through amber and red cycles).

If the sensor is activated while the Berkeley light is green, the car or pedestrian will pass on that cycle. But if it is activated during the amber or red portions, another cycle will be necessary after the Del Mar traffic has passed.

Write a complete MMIX program that controls these lights, following the stated protocol. Assume that the special clock register rC increases by 1 exactly \( \rho \) times per second, where the integer \( \rho \) is a given constant.

35. [37] This exercise is designed to give some experience in the many applications of computers for which the output is to be displayed graphically rather than in the usual tabular form. The object is to “draw” a crossword puzzle diagram.

You are given as input a matrix of zeros and ones. An entry of zero indicates a white square; a one indicates a black square. The output should generate a diagram of the puzzle, with the appropriate squares numbered for words across and down.

For example, given the matrix

\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\]

the corresponding puzzle diagram would be as shown in Fig. 18. A square is numbered if it is a white square and either (a) the square below it is white and there is no white square immediately above, or (b) the square to its right is white and there is no white square immediately to its left. If black squares occur at the edges, they should be removed from the diagram. This is illustrated in Fig. 18, where the black squares at the corners were dropped. A simple way to accomplish this is to artificially insert rows and columns of $-1$’s at the top, bottom, and sides of the given input matrix, then to change every $+1$ that is adjacent to a $-1$ into a $-1$ until no $+1$ remains next to any $-1$.

Fig. 18. Diagram corresponding to the matrix in exercise 35.

Figure 18 was produced by the METAPOST program shown in Fig. 19. Simple changes to the uses of line and black, and to the coordinates in the for loop, will produce any desired diagram.

Write a complete MMIX program that reads a $25 \times 25$ matrix of zeros and ones in the standard input file and writes a suitable METAPOST program on the standard output file. The input should consist of 25 lines, each consisting of 25 digits followed by “newline”; for example, the first line corresponding to the matrix above would be ‘1000011111111111111111111’, using extra 1s to extend the original $6 \times 6$ array. The diagram will not necessarily be symmetrical, and it might have long paths of black squares that are connected to the outside in strange ways.

beginfig(18)
transform t; t=identity rotated -90 scaled 17pt;
def line(expr i,j,ii,jj) =
draw ((i,j)--(ii,jj)) transformed t;
enddef;
def black(expr i,j) =
fill ((i,j)--(i+1,j)--(i+1,j+1)--(i,j+1)--cycle) transformed t;
enddef;
line (1,2,1,6); line (2,1,2,7); line (3,1,3,7); line (4,1,4,7);
line (5,1,5,7); line (6,1,6,7); line (7,2,7,6);
line (2,1,6,1); line (1,2,7,2); line (1,3,7,3); line (1,4,7,4);
line (1,5,7,5); line (1,6,7,6); line (2,7,6,7);
numeric n; n=0;
for p = (1,2),(1,4),(1,5),(2,1),(2,4),(2,6),
    (3,1),(3,3),(4,3),(4,5),(5,1),(5,2),(5,5),(6,2):
    n:=n+1; label.lrt(decimal n infont "cmr8", p transformed t);
endfor
black(2,3); black(3,5); black(4,2); black(5,4);
endfig;

Fig. 19. The METAPOST program that generated Fig. 18.
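The edge-trimming and numbering rules of exercise 35 can be modeled compactly before they are coded for MMIX. The Python sketch below is an illustration only: it assumes that “adjacent” means horizontally or vertically adjacent, it takes the matrix as a list of digit strings, and the name `trim_and_number` is hypothetical. It returns the numbered squares in reading order together with the black squares that survive trimming, both as 1-based (row, column) pairs.

```python
def trim_and_number(rows):
    """Return (numbered squares, remaining black squares) for a 0/1 matrix."""
    height, width = len(rows), len(rows[0])
    # Surround the matrix with a border of -1 entries.
    grid = [[-1] * (width + 2)]
    grid += [[-1] + [int(ch) for ch in row] + [-1] for row in rows]
    grid += [[-1] * (width + 2)]
    # Change every +1 next to a -1 into -1 until nothing changes.
    changed = True
    while changed:
        changed = False
        for i in range(1, height + 1):
            for j in range(1, width + 1):
                neighbors = (grid[i - 1][j], grid[i + 1][j],
                             grid[i][j - 1], grid[i][j + 1])
                if grid[i][j] == 1 and -1 in neighbors:
                    grid[i][j] = -1
                    changed = True

    def white(i, j):
        return grid[i][j] == 0

    cells = [(i, j) for i in range(1, height + 1) for j in range(1, width + 1)]
    numbered = [(i, j) for (i, j) in cells
                if white(i, j)
                and ((white(i + 1, j) and not white(i - 1, j))
                     or (white(i, j + 1) and not white(i, j - 1)))]
    black = [(i, j) for (i, j) in cells if grid[i][j] == 1]
    return numbered, black
```

Applied to the 6 × 6 example whose black squares match the `black` calls of Fig. 19, the function yields the same fourteen labeled positions that appear in the `for` loop of that program.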

1.3.3’. Applications to Permutations

The MIX programs in the former Section 1.3.3 will all be converted to MMIX programs, and so will the MIX programs in Chapters 2, 3, 4, 5, and 6. Anyone who wishes to help with this instructive conversion project is invited to join the MMIXmasters (see page v).

1.4’. SOME FUNDAMENTAL PROGRAMMING TECHNIQUES

1.4.1’. Subroutines

When a certain task is to be performed at several different places in a program, we usually don’t want to repeat the coding over and over. To avoid this situation, the coding (called a subroutine) can be put into one place only, and a few extra instructions can be added to restart the main routine properly after the subroutine is finished. Transfer of control between subroutines and main programs is called subroutine linkage.

Each machine has its own peculiar way to achieve efficient subroutine linkage, usually by using special instructions. Our discussion will be based on MMIX machine language, but similar remarks will apply to subroutine linkage on most other general-purpose computers.

Subroutines are used to save space in a program. They do not save any time, other than the time implicitly saved by having less space — for example, less time to load the program, and better use of high-speed memory on machines with several grades of memory. The extra time taken to enter and leave a subroutine is usually negligible, except in critical innermost loops.

Subroutines have several other advantages. They make it easier to visualize the structure of a large and complex program; they form a logical segmentation of the entire problem, and this usually makes debugging of the program easier. Many subroutines have additional value because they can be used by people other than the programmer of the subroutine.

Most computer installations have built up a large library of useful subroutines, and such a library greatly facilitates the programming of standard computer applications that arise. A programmer should not think of this as the only purpose of subroutines, however; subroutines should not always be regarded as general-purpose programs to be used by the community. Special-purpose subroutines are just as important, even when they are intended to appear in only one program. Section 1.4.3 contains several typical examples.

The simplest subroutines are those that have only one entrance and one exit, such as the Maximum subroutine we have already considered (see Program M in Section 1.3.2 and exercise 1.3.2–3). Let’s look at that program again, recasting it slightly so that a fixed number of cells, 100, is searched for the maximum:

* Maximum of \(X[1..100]\)

\[
\begin{align*}
&\text{j IS \$0 ;m IS \$1 ;kk IS \$2 ;xk IS \$3} \\
&\text{Max100 SETL kk,100*8} \quad \text{M1. Initialize.} \\
&\quad \text{LDO m,x0,kk} \\
&\quad \text{JMP 1F} \\
&\text{3H} \quad \text{LDO xk,x0,kk} \quad \text{M3. Compare.} \\
&\quad \text{CMP t,xk,m} \\
&\quad \text{PBNP t,5F} \\
&\text{4H} \quad \text{SET m,xk} \quad \text{M4. Change m.} \\
&\text{1H} \quad \text{SR j,kk,3} \\
&\text{5H} \quad \text{SUB kk,kk,8} \quad \text{M5. Decrease k.} \\
&\quad \text{PBP kk,3B} \quad \text{M2. All tested?} \\
&\text{6H} \quad \text{POP 2,0} \quad \text{Return to main program.} \qquad (1)
\end{align*}
\]
This subroutine is assumed to be part of a larger program in which the symbol `t` has been defined to stand for register `$255`, and the symbol `x0` has been defined to stand for a global register such that X[k] appears in location `x0` + 8k. In that larger program, the single instruction “`PUSHJ $1,Max100`” will cause register `$1` to be set to the current maximum value of {X[1], …, X[100]}, and the position of the maximum will appear in `$2`. Linkage in this case is achieved by the PUSHJ instruction that invokes the subroutine, together with “`POP 2,0`” at the subroutine’s end. These MMIX instructions cause local registers to be renumbered while the subroutine is active; furthermore, the PUSHJ inserts a return address into special register rJ, and the POP jumps to this location.

We can also accomplish subroutine linkage in a simpler, rather different way, by using MMIX’s GO instruction instead of pushing and popping. We might, for instance, use the following code in place of (1):

```plaintext
* Maximum of X[1..100]

j        GREG  ;m GREG ;kk GREG ;xk GREG
         GREG  @            Base address
GoMax100 SETL  kk,100*8     M1. Initialize.
         LDO   m,x0,kk
         JMP   1F

3H       ...                (Continue as in (1))

6H       GO    kk,$0,0      Return to main program.    (2)
```

Now the instruction “`GO $0,GoMax100`” will transfer control to the subroutine, placing the address of the following instruction into `$0`; the subsequent “`GO kk,$0,0`” at the subroutine’s end will return to this address. In this case the maximum value will appear in global register m, and its location will be in global register j. Two additional global registers, kk and xk, have also been set aside for use by this subroutine. Furthermore, the “`GREG @`” provides a base address so that we can GO to GoMax100 in a single instruction; otherwise a two-step sequence like “`GETA $0,GoMax100; GO $0,$0,0`” would be necessary. Subroutine linkage like (2) is commonly used on machines that have no built-in register stack mechanism.

It is not hard to obtain quantitative statements about the amount of code saved and the amount of time lost when subroutines are used. Suppose that a piece of coding requires \( k \) tetrabytes and that it appears in \( m \) places in the program. Rewriting this as a subroutine, we need a PUSHJ or GO instruction in each of the \( m \) places where the subroutine is called, plus a single POP or GO instruction to return control. This gives a total of \( m + k + 1 \) tetrabytes, rather than \( mk \), so the amount saved is

\[
(m - 1)(k - 1) - 2. \qquad (3)
\]

If \( k = 1 \) or \( m = 1 \) we cannot possibly save any space by using subroutines; this, of course, is obvious. If \( k = 2 \), \( m \) must be greater than 3 in order to gain, etc.
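Formula (3) is easy to check numerically. The following Python sketch compares the in-line size \( mk \) with the subroutine size \( m + k + 1 \); the function names are illustrative only and do not come from the text.

```python
def inline_size(m, k):
    """Tetrabytes used when a k-tetrabyte piece of code is copied into m places."""
    return m * k


def subroutine_size(m, k):
    """Tetrabytes used with a subroutine: m one-instruction calls,
    the k-tetrabyte body, and one return instruction."""
    return m + k + 1


def space_saved(m, k):
    """Net saving in tetrabytes; algebraically equal to (m - 1)(k - 1) - 2."""
    return inline_size(m, k) - subroutine_size(m, k)


if __name__ == "__main__":
    # For k = 2 the saving becomes positive only when m exceeds 3.
    for m in range(1, 7):
        print(m, space_saved(m, 2))
```

Tabulating `space_saved(m, k)` for small values shows why short pieces of code seldom deserve to become subroutines: the saving grows only as the product of \( m - 1 \) and \( k - 1 \).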

The amount of time lost is the time taken for the PUSHJ, POP, and/or GO instructions in the linkage. If the subroutine is invoked \( t \) times during a run of the program, and if we assume that running time is governed by the approximations in Table 1.3.1’–1, the extra cost is \( 4t\upsilon \) in case (1), or \( 6t\upsilon \) in case (2).
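The two estimates can be reproduced with a few lines of Python. The per-instruction costs below (PUSHJ \( = \upsilon \), POP \( = 3\upsilon \), GO \( = 3\upsilon \)) are assumed values chosen to agree with the totals just quoted; the table itself is not reproduced here, and the names `COST` and `linkage_overhead` are hypothetical.

```python
# Assumed running times in units of upsilon (consistent with 4tv and 6tv above).
COST = {"PUSHJ": 1, "POP": 3, "GO": 3}


def linkage_overhead(t, call, ret):
    """Extra time, in units of upsilon, for t invocations that use
    the instruction `call` to enter and `ret` to leave the subroutine."""
    return t * (COST[call] + COST[ret])


if __name__ == "__main__":
    t = 1000
    print("case (1):", linkage_overhead(t, "PUSHJ", "POP"))
    print("case (2):", linkage_overhead(t, "GO", "GO"))
```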

These estimates must be taken with a grain of salt, because they were given for an idealized situation. Many subroutines cannot be called simply with a single \texttt{PUSHJ} or \texttt{GO} instruction. Furthermore, if code is replicated in many parts of a program without using a subroutine approach, each instance can be customized to take advantage of special characteristics of the particular part of the program in which it lies. With a subroutine, on the other hand, the code must be written for the most general case; this will often add several additional instructions.

When a subroutine is written to handle a general case, it is expressed in terms of \textit{parameters}. Parameters are values that govern a subroutine’s actions; they are subject to change from one call of the subroutine to another.

The coding in the outside program that transfers control to a subroutine and gets it properly started is known as the \textit{calling sequence}. Particular values of parameters, supplied when the subroutine is called, are known as \textit{arguments}. With our \texttt{GoMax100} subroutine, the calling sequence is simply “\texttt{GO \$0,GoMax100}”, but a longer calling sequence is generally necessary when arguments must be supplied.

For example, we might want to generalize (2) to a subroutine that finds the maximum of the first $n$ elements of an array, given \textit{any} constant $n$, by placing $n$ in the instruction stream with the two-step calling sequence

$$
\texttt{GO \$0,GoMax}; \quad \texttt{TETRA}\ n. \qquad (4)
$$

The \texttt{GoMax} subroutine could then take the form

\begin{verbatim}
* Maximum of X[1..n]
j GREG ;m GREG ;kk GREG ;xk GREG
  GREG @ Base address
GoMax LDT kk,$0,0 Fetch the argument.
  SL kk,kk,3
  LDO m,x0,kk
  JMP 1F
3H ... (Continue as in (1))
  PBP kk,3B
6H GO kk,$0,4 Return to caller.   (5)
\end{verbatim}

Still better would be to communicate the parameter $n$ by putting it into a register. We could, for example, use the two-step calling sequence

$$
\texttt{SET \$1,}n; \quad \texttt{GO \$0,GoMax} \qquad (6)
$$

together with a subroutine of the form

\begin{verbatim}
GoMax SL kk,$1,3 Fetch the argument.
  LDO m,x0,kk
  ...
6H GO kk,$0,0 Return.   (7)
\end{verbatim}

This variation is faster than (5), and it allows $n$ to vary dynamically without modifying the instruction stream.

Notice that the address of array element X[0] is also essentially a parameter to subroutines (1), (2), (5), and (7). The operation of putting this address into register x0 may be regarded as part of the calling sequence, in cases when the array is different each time.

If the calling sequence occupies \( c \) tetrabytes of memory, formula (3) for the amount of space saved changes to

\[
(m - 1)(k - c) - \text{constant} \qquad (8)
\]

and the time lost for subroutine linkage is slightly increased.
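The text leaves the constant in (8) unspecified. As a sketch under two stated assumptions (the subroutine body still occupies \( k \) tetrabytes, and the return is still a single instruction), the total space becomes \( mc + k + 1 \), and the constant works out to \( c + 1 \). The Python check below uses hypothetical function names.

```python
def space_saved_general(m, k, c):
    """Saving in tetrabytes when each of the m calls needs a c-tetrabyte
    calling sequence (assumption: k-tetrabyte body plus one return instruction)."""
    return m * k - (m * c + k + 1)


def constant_term(c):
    """The 'constant' of formula (8) under the assumptions stated above."""
    return c + 1


if __name__ == "__main__":
    # With c = 1 the result agrees with formula (3).
    print(space_saved_general(5, 10, 1), (5 - 1) * (10 - 1) - 2)
    # With a two-step calling sequence the saving shrinks.
    print(space_saved_general(5, 10, 2))
```

Setting \( c = 1 \) recovers formula (3), since the constant is then 2.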

A further correction to the formulas above can be necessary because certain registers might need to be saved and restored. For example, in the GoMax subroutine we must remember that by writing “SET $1, n; GO $0, GoMax” we are not only computing the maximum value in register m and its position in register j, we are also changing the values of global registers kk and xk. We have implemented (2), (5), and (7) with the implicit assumption that registers kk and xk are for the exclusive use of the maximum-finding routine, but many computers are not blessed with a large number of registers. Even MMIX will run out of registers if a lot of subroutines are present simultaneously. We might therefore want to revise (7) so that it will work with kk \( \equiv \$2 \) and xk \( \equiv \$3 \), say, without clobbering the contents of those registers. We could do this by writing

\begin{verbatim}
j      GREG ;m GREG ;kk IS $2 ;xk IS $3
       GREG @            Base address
GoMax  STO  kk,Tempkk    Save previous register contents.
       STO  xk,Tempxk
       SL   kk,$1,3      Fetch the argument.
       LDO  m,x0,kk
       ...
       LDO  kk,Tempkk    Restore previous register contents.
       LDO  xk,Tempxk
6H     GO   $0,$0,0      Return.   (9)
\end{verbatim}

and by setting aside two octabytes called Tempkk and Tempxk in the data segment. Of course this change adds potentially significant overhead cost to each use of the subroutine.

A subroutine may be regarded as an extension of the computer’s machine language. For example, whenever the GoMax subroutine is present in memory we have a single machine instruction (namely, “GO $0, GoMax”) that is a maximum-finder. It is important to define the effect of each subroutine just as carefully as the machine language operators themselves have been defined; a programmer should therefore be sure to write down the relevant characteristics, even though nobody else will be making use of the routine or its specification. In the case of GoMax as given in (7) or (9), the characteristics are as follows:

Calling sequence: GO \$0,GoMax.

Entry conditions: \( \$1 = n \geq 1 \); x0 = address of \( X[0] \).     (10)

Exit conditions: \( m = \max_{1 \leq k \leq n} X[k] = X[j]. \)

A specification should mention all changes to quantities that are external to the subroutine. If registers kk and xk are not considered “private” to the variant of GoMax in (7), we should include the fact that those registers are affected, as part of that subroutine’s exit conditions. The subroutine also changes register t, namely register \$255; but that register is conventionally used for temporary quantities of only momentary significance, so we needn’t bother to list it explicitly.

Now let’s consider multiple entrances to subroutines. Suppose we have a program that requires the general subroutine \( \text{GoMax} \), but it usually wants to use the special case \( \text{GoMax}100 \) in which \( n = 100 \). The two can be combined as follows:

\[
\begin{align*}
\text{GoMax100} & \text{ SET } \$1,100 \quad \text{First entrance} \\
\text{GoMax} & \text{ ...} \quad \text{Second entrance; continue as in (7) or (9).} \quad (11)
\end{align*}
\]

We could also add a third entrance, say \( \text{GoMax50} \), by putting the code

\[
\text{GoMax50 SET } \$1,50; \quad \text{JMP GoMax}
\]

in some convenient place.

A subroutine might also have multiple exits, meaning that it is supposed to return to one of several different locations, depending on conditions that it has detected. For example, we can extend subroutine (11) yet again by assuming that an upper bound parameter is given in global register \( b \); the subroutine is now supposed to exit to one of the two tetrabytes following the \( \text{GO} \) instruction that calls it:

\[
\begin{align*}
\text{Calling sequence for general } n & \quad \text{Calling sequence for } n = 100 \\
\text{SET } \$1,n; \quad \text{GO } \$0,\text{GoMax} & \quad \text{GO } \$0,\text{GoMax100} \\
\text{Exit here if } m \leq 0 \text{ or } m \geq b. & \quad \text{Exit here if } m \leq 0 \text{ or } m \geq b. \\
\text{Exit here if } 0 < m < b. & \quad \text{Exit here if } 0 < m < b.
\end{align*}
\]

(In other words, we skip the tetrabyte after the \( \text{GO} \) when the maximum value is positive and less than the upper bound. A subroutine like this would be useful in a program that often needs to make such distinctions after computing a maximum value.) The implementation is easy:

\begin{verbatim}
* Maximum of X[1..n] with bounds check
j        GREG ;m GREG ;kk GREG ;xk GREG
         GREG @          Base address
GoMax100 SET  $1,100     Entrance for n = 100
GoMax    SL   kk,$1,3    Entrance for general n
         LDO  m,x0,kk
         JMP  1F
3H       ...             (Continue as in (1))
         PBP  kk,3B
         BNP  m,1F       Branch if m <= 0.
         CMP  kk,m,b
         BN   kk,2F      Branch if m < b.
1H       GO   kk,$0,0    Take first exit if m <= 0 or m >= b.
2H       GO   kk,$0,4    Otherwise take second exit.   (12)
\end{verbatim}

Notice that this program combines the instruction-stream linking technique of (5) with the register-setting technique of (7). The location to which a subroutine exits is, strictly speaking, a parameter; hence the locations of multiple exits must be supplied as arguments. When a subroutine accesses one of its parameters all the time, the corresponding argument is best passed in a register, but when an argument is constant and not always needed it is best kept in the instruction stream.
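The behavior of the two-exit subroutine can be modeled at a higher level. The Python sketch below is not a translation of the MMIX code; it merely mirrors the downward scan of the maximum-finding loop and reports which exit would be taken, expressed as the byte offset (0 or 4) added to the return address. The function name `go_max` and the convention that `x[0]` is unused are assumptions made for illustration.

```python
def go_max(x, n, b):
    """Model of a maximum-finder with two exits.

    x is a list in which x[0] is unused, so that x[k] plays the role of X[k].
    Requires n >= 1. Returns (m, j, exit_offset), where m = X[j] is the
    maximum of X[1..n] and exit_offset is 4 when 0 < m < b, otherwise 0.
    """
    k = n
    m, j = x[k], k              # initialize with the last element
    k -= 1
    while k > 0:                # all tested?
        if x[k] > m:            # compare; only a strictly larger value wins
            m, j = x[k], k      # change m
        k -= 1                  # decrease k
    exit_offset = 4 if 0 < m < b else 0
    return m, j, exit_offset


if __name__ == "__main__":
    data = [None, 3, 9, 2, 9]
    print(go_max(data, 4, 10))   # maximum within bounds: second exit
    print(go_max(data, 4, 9))    # maximum reaches the bound: first exit
```

Because the scan runs from \( k = n \) down to 1 and replaces \( m \) only on a strict increase, ties are resolved in favor of the larger index in this model.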

Subroutines may call on other subroutines. Indeed, complicated programs often have subroutine calls nested more than five deep. The only restriction that must be followed when using the GO-type linkage described above is that all temporary storage locations and registers must be distinct; thus no subroutine may call on any other subroutine that is (directly or indirectly) calling on it. For example, consider the following scenario:

<table>
<thead>
<tr>
<th>[Main program]</th>
<th>[Subroutine A]</th>
<th>[Subroutine B]</th>
<th>[Subroutine C]</th>
</tr>
</thead>
<tbody>
<tr>
<td>:</td>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>GO $0,A</td>
<td>GO $1,B</td>
<td>GO $2,C</td>
<td>GO $0,A</td>
</tr>
<tr>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td></td>
<td>GO $0,$0,0</td>
<td>GO $1,$1,0</td>
<td>GO $2,$2,0</td>
</tr>
</tbody>
</table>

(13)

If the main program calls A, which calls B, which calls C, and then C calls on A, the address in $0 referring to the main program is destroyed, and there is no way to return to that program.
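The failure in scenario (13) can be seen in a tiny Python model in which each GO simply records a return address in its link register. The symbolic addresses such as `"Main+4"` are hypothetical labels used only for this illustration.

```python
def run_scenario():
    """Replay the four GO instructions of scenario (13) and return the
    final contents of the link registers $0, $1, and $2."""
    reg = {}

    def go(link, return_to):
        """Model 'GO $link,target': the link register receives the address
        of the instruction that follows the GO."""
        reg[link] = return_to

    go(0, "Main+4")   # main program:  GO $0,A
    go(1, "A+4")      # subroutine A:  GO $1,B
    go(2, "B+4")      # subroutine B:  GO $2,C
    go(0, "C+4")      # subroutine C:  GO $0,A   (overwrites $0)
    return reg


if __name__ == "__main__":
    print(run_scenario())
```

After the fourth call, no register holds the address in the main program any longer, which is exactly the loss described above.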

**Using a memory stack.** Recursive situations like (13) do not often arise in simple programs, but a great many important applications do have a natural recursive structure. Fortunately there is a straightforward way to avoid interference between subroutine calls, by letting each subroutine keep its local variables on a stack. For example, we can set aside a global register called sp (the “stack pointer”) and use GO $0,Sub to invoke each subroutine. If the code for the subroutine has the form

```
Sub  STO  $0,sp,0
     ADD  sp,sp,8
     ...
     SUB  sp,sp,8
     LDO  $0,sp,0
     GO   $0,$0,0
```

(14)

register $0 will always contain the proper return address; the problem of (13) no longer arises. (Initially we set sp to an address in the data segment, following all other memory locations needed.) Moreover, the STO/ADD and SUB/LDO instructions of (14) can be omitted if Sub is a so-called leaf subroutine — a subroutine that doesn’t call any other subroutines.

A stack can be used to hold parameters and other local variables besides the return addresses stored in (14). Suppose, for example, that subroutine Sub needs 20 octabytes of local data, in addition to the return address; then we can use a scheme like this:

\begin{verbatim}
Sub  STO  fp,sp,0    Save the old frame pointer.
     SET  fp,sp      Establish the new frame pointer.
     INCL sp,8*22    Advance the stack pointer.
     STO  $0,fp,8    Save the return address.
     ...
     LDO  $0,fp,8    Restore the return address.
     SET  sp,fp      Restore the stack pointer.
     LDO  fp,sp,0    Restore the frame pointer.
     GO   $0,$0,0    Return to caller.
\end{verbatim}

(15)

Here \(\text{fp}\) is a global register called the \textit{frame pointer}. Within the “…” part of the subroutine, local quantity number \(k\) is equivalent to the octabyte in memory location \(\text{fp} + 8k + 8\), for \(1 \leq k \leq 20\). The instructions at the beginning are said to “push” local quantities onto the “top” of the stack; the instructions at the end “pop” those quantities off, leaving the stack in the condition it had when the subroutine was entered.
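The frame layout implied by (15) can be tabulated directly. In the Python sketch below, the starting value of sp is an arbitrary example address and the function name `frame_layout` is hypothetical; the offsets themselves follow the scheme described above.

```python
OCTA = 8  # bytes per octabyte


def frame_layout(sp_on_entry, n_locals=20):
    """Addresses used by one activation of Sub under scheme (15).

    The frame holds the old frame pointer, the return address, and
    n_locals local quantities, for n_locals + 2 octabytes in all.
    """
    fp = sp_on_entry
    layout = {"old_fp": fp, "return_address": fp + OCTA}
    for k in range(1, n_locals + 1):
        layout[f"local_{k}"] = fp + 8 * k + 8
    layout["new_sp"] = sp_on_entry + OCTA * (n_locals + 2)
    return layout


if __name__ == "__main__":
    for name, address in frame_layout(0x1000).items():
        print(f"{name:15s} {address:#x}")
```

The last local quantity ends exactly where the advanced stack pointer begins, so the 22 octabytes reserved by `INCL sp,8*22` are fully accounted for.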

**Using the register stack.** We have discussed \(\text{GO}\)-type subroutine linkage at length because many computers have no better alternative. But \texttt{MMIX} has built-in instructions \texttt{PUSHJ} and \texttt{POP}, which handle subroutine linkage in a more efficient way, avoiding most of the overhead in schemes like (9) and (15). These instructions allow us to keep most parameters and local variables entirely in registers, instead of storing them into a memory stack and loading them again later. With \texttt{PUSHJ} and \texttt{POP}, most of the details of stack maintenance are done automatically by the machine.

The basic idea is quite simple, once the general idea of a stack is understood. \texttt{MMIX} has a \textit{register stack} consisting of octabytes \(S[0], S[1], \ldots, S[\tau-1]\) for some number \(\tau \geq 0\). The topmost \(L\) octabytes in the stack (namely \(S[\tau-L], S[\tau-L+1], \ldots, S[\tau-1]\)) are the current local registers \(\$0, \$1, \ldots, \$(L-1)\); the other \(\tau - L\) octabytes of the stack are currently inaccessible to the program, and we say they have been “pushed down.” The current number of local registers, \(L\), is kept in \texttt{MMIX}’s special register \(\text{rL}\), although a programmer rarely needs to know this. Initially \(L = 2, \tau = 2\), and local registers \(\$0\) and \(\$1\) represent the command line as in Program 1.3.2’H.

\texttt{MMIX} also has \textit{global registers}, namely \(\$G, \$(G+1), \ldots, \$255\); the value of \(G\) is kept in special register \(\text{rG}\), and we always have \(0 \leq L \leq G \leq 255\). (In fact, we also always have \(G \geq 32\).) Global registers are \textit{not} part of the register stack.

Registers that are neither local nor global are called \textit{marginal}. These registers, namely \(\$L, \$(L+1), \ldots, \$(G-1)\), have the value \textit{zero} whenever they are used as input operands to an \texttt{MMIX} instruction.

The register stack grows when a marginal register is given a value. This marginal register becomes local, and so do all marginal registers with smaller numbers. For example, if eight local registers are currently in use, the instruction \texttt{ADD \$10,\$20,5} causes \(\$8\), \(\$9\), and \(\$10\) to become local; more precisely, if \(\text{rL} = 8\), the instruction \texttt{ADD \$10,\$20,5} sets \(\$8 \leftarrow 0\), \(\$9 \leftarrow 0\), \(\$10 \leftarrow 5\), and \(\text{rL} \leftarrow 11\). (Register \(\$20\) remains marginal.)

If \(\$X\) is a local register, the instruction \texttt{PUSHJ \$X,Sub} decreases the number of local registers and changes their effective register numbers: Local registers previously called \(\$(X+1), \$(X+2), \ldots, \$(L-1)\) are called \(\$0, \$1, \ldots, \$(L-X-2)\) inside the subroutine, and the value of \(L\) decreases by \(X+1\). Thus the register stack remains unchanged, but \(X+1\) of its entries have become inaccessible; the subroutine cannot damage those entries, and it has \(X+1\) newly marginal registers to play with.

If \(X \geq G\), so that \(\$X\) is a global register, the action of \texttt{PUSHJ \$X,Sub} is similar, but a new entry is placed on the register stack and then \(L+1\) registers are pushed down instead of \(X+1\). In this case \(L\) is zero when the subroutine begins; all of the formerly local registers have been pushed down, and the subroutine starts out with a clean slate.

The register stack shrinks only when a POP instruction is given, or when a program explicitly decreases the number of local registers with an instruction such as \texttt{PUT rL,5}. The purpose of \texttt{POP X,YZ} is to make the items pushed down by the most recent PUSHJ accessible again, as they were before, and to remove items from the register stack if they are no longer necessary. In general the X field of a POP instruction is the number of values “returned” by the subroutine, if \(X \leq L\). If \(X > 0\), the main value returned is \(\$(X-1)\); this value is removed from the register stack, together with all entries above it, and the return value is placed in the position specified by the PUSHJ command that invoked the subroutine. The behavior of POP is similar when \(X > L\), but in this case the register stack remains intact and zero is placed in the position of the PUSHJ.

The rules we have just stated are a bit complicated, because many different cases can arise in practice. A few examples will, however, make everything clear. Suppose we are writing a routine A and we want to call subroutine B; suppose further that routine A has 5 local registers that should not be accessible to B. These registers are \(\$0\), \(\$1\), \(\$2\), \(\$3\), and \(\$4\). We reserve the next register, \(\$5\), for the main result of subroutine B. If B has, say, three parameters, we set \(\$6 \leftarrow \text{arg0}\), \(\$7 \leftarrow \text{arg1}\), and \(\$8 \leftarrow \text{arg2}\), then issue the command \texttt{PUSHJ \$5,B}; this invokes B and the arguments are now found in \(\$0\), \(\$1\), and \(\$2\).

If B returns no result, it will conclude with the command \texttt{POP 0,YZ}; this will restore \(\$0\), \(\$1\), \(\$2\), \(\$3\), and \(\$4\) to their former values and set \(L \leftarrow 5\).

If B returns a single result \(x\), it will place \(x\) in \(\$0\) and conclude with the command \texttt{POP 1,YZ}. This will restore \(\$0\), \(\$1\), \(\$2\), \(\$3\), and \(\$4\) as before; it will also set \(\$5 \leftarrow x\) and \(L \leftarrow 6\).

If B returns two results \(x\) and \(a\), it will place the main result \(x\) in \(\$1\) and the auxiliary result \(a\) in \(\$0\). Then \texttt{POP 2,YZ} will restore \(\$0\) through \(\$4\) and set \(\$5 \leftarrow x\), \(\$6 \leftarrow a\), \(L \leftarrow 7\). Similarly, if B returns ten results \((x, a_0, \ldots, a_8)\), it will place the main result \(x\) in \(\$9\) and the others in the first nine registers: \(\$0 \leftarrow a_0\), \(\$1 \leftarrow a_1\), \(\ldots\), \(\$8 \leftarrow a_8\). Then \texttt{POP 10,YZ} will restore \(\$0\) through \(\$4\) and set \(\$5 \leftarrow x\), \(\$6 \leftarrow a_0\), \(\ldots\), \(\$14 \leftarrow a_8\). (The curious permutation of registers that arises when two or more results are returned may seem strange at first. But it makes sense, because it leaves the register stack unchanged except for the main result. For example, if subroutine B wants arg0, arg1, and arg2 to reappear in \(\$6\), \(\$7\), and \(\$8\) after it has finished its work, it can leave them as auxiliary results in \(\$0\), \(\$1\), and \(\$2\) and then say \texttt{POP 4,YZ}.)
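The examples above can be replayed with a toy model of the register stack. The Python class below is only a sketch under simplifying assumptions: global registers, the return address in rJ, and the spilling of stack entries to memory are all ignored, and the class and method names are hypothetical. It does follow the rules stated in the text: a write to a marginal register makes it local, PUSHJ hides \(X+1\) entries, and POP puts the main result into the position named by the PUSHJ.

```python
class RegisterStack:
    """Toy model: a list S whose last L entries are local registers $0..$(L-1)."""

    def __init__(self, G=255):
        self.S = []   # S[0], ..., S[tau-1]
        self.L = 0    # current number of local registers (rL)
        self.G = G    # number of the first global register (rG)

    def get(self, n):
        """Read $n; a marginal register reads as zero."""
        assert n < self.G, "global registers are not modeled"
        return self.S[len(self.S) - self.L + n] if n < self.L else 0

    def set(self, n, value):
        """Write $n; marginal registers up to $n become local (zero-filled)."""
        assert n < self.G, "global registers are not modeled"
        while self.L <= n:
            self.S.append(0)
            self.L += 1
        self.S[len(self.S) - self.L + n] = value

    def pushj(self, X):
        """Register renumbering of PUSHJ $X for a non-global $X."""
        self.set(X, X)        # the reserved position remembers the count X
        self.L -= X + 1       # $0..$X become inaccessible

    def pop(self, X):
        """Register renumbering of POP X,YZ (the jump itself is not modeled)."""
        current = [self.get(i) for i in range(self.L)]
        if X > self.L:                     # more results requested than exist:
            X = self.L + 1                 # the main result is a marginal zero
            current.append(0)
        hole = len(self.S) - self.L - 1    # position reserved by the PUSHJ
        pushed = self.S[hole]              # how many registers lie below it
        del self.S[hole:]
        if X > 0:
            self.S.append(current[X - 1])  # main result fills the hole
            self.S.extend(current[:X - 1]) # auxiliary results follow
        self.L = pushed + X


if __name__ == "__main__":
    rs = RegisterStack()
    for i in range(5):
        rs.set(i, 100 + i)                 # routine A's registers $0..$4
    rs.set(6, "arg0"); rs.set(7, "arg1"); rs.set(8, "arg2")
    rs.pushj(5)                            # PUSHJ $5,B
    rs.set(1, "x"); rs.set(0, "a")         # B's main and auxiliary results
    rs.pop(2)                              # POP 2,YZ
    print(rs.L, [rs.get(i) for i in range(rs.L)])
```

In this model the permutation of results falls out of two list operations: the main result is written into the reserved position, and the auxiliary results keep their places directly above it.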

The YZ field of a POP instruction is usually zero, but in general the instruction \texttt{POP X,YZ} returns to the instruction that is \(\text{YZ} + 1\) tetrabytes after the PUSHJ that invoked the current subroutine. This generality is useful for subroutines with multiple exits. More precisely, a PUSHJ instruction in location @ sets special register rJ to \(@ + 4\) before jumping to the subroutine; a POP instruction then returns to location \(\text{rJ} + 4\,\text{YZ}\).
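The address arithmetic is simple enough to state as two one-line functions. In this Python sketch the function names and the example location `0x100` are hypothetical; only the rule (rJ is four bytes past the PUSHJ, and POP returns to rJ plus four times YZ) comes from the text.

```python
def rj_after_pushj(pushj_location):
    """Value placed in rJ by a PUSHJ located at pushj_location."""
    return pushj_location + 4


def pop_target(pushj_location, yz=0):
    """Address to which POP X,YZ returns for a call made at pushj_location."""
    return rj_after_pushj(pushj_location) + 4 * yz


if __name__ == "__main__":
    call_site = 0x100
    print(hex(pop_target(call_site, 0)))   # first exit: the next instruction
    print(hex(pop_target(call_site, 1)))   # second exit: one tetrabyte further
```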

We can now recast the programs previously written with GO linkage so that they use PUSHJ/POP linkage instead. For example, the two-entrance, two-exit subroutine for maximum-finding in (12) takes the following form when MMIX’s register stack mechanism is used:

\begin{verbatim}
* Maximum of X[1..n] with bounds check
j      IS   $0 ;m IS $1 ;kk IS $2 ;xk IS $3
Max100 SET  $0,100     Entrance for n = 100
Max    SL   kk,$0,3    Entrance for general n
       LDO  m,x0,kk
       JMP  1F
       ...             (Continue as in (12))
       BN   kk,2F
1H     POP  2,0        Take first exit if max <= 0 or max >= b.
2H     POP  2,1        Otherwise take second exit.
\end{verbatim}

(16)

\[
\begin{align*}
\text{Calling sequence for general } n & \quad \text{Calling sequence for } n = 100 \\
\text{SET } \$A,n; \quad \text{PUSHJ } \$R,\text{Max} & \quad \text{PUSHJ } \$R,\text{Max100} \\
\text{Exit here if } \$R \leq 0 \text{ or } \$R \geq b. & \quad \text{Exit here if } \$R \leq 0 \text{ or } \$R \geq b. \\
\text{Exit here if } 0 < \$R < b. & \quad \text{Exit here if } 0 < \$R < b.
\end{align*}
\]

The local result register \(\$R\) in the PUSHJ of this calling sequence is arbitrary, depending on the number of local variables the caller wishes to retain. The local argument register \(\$A\) is then \(\$(R+1)\). After the call, \(\$R\) will contain the main result (the maximum value) and \(\$A\) will contain the auxiliary result (the array index of that maximum). If there are several arguments and/or auxiliaries, they are conventionally called \(A_0\), \(A_1\), \(\ldots\), and we conventionally assume that \(A_0 = R + 1\), \(A_1 = R + 2\), \(\ldots\) when PUSHJ/POP calling sequences are written down.

A comparison of (12) and (16) shows only mild advantages for (16): The new form does not need to allocate global registers for j, m, kk, and xk, nor does it need a global base register for the address of the GO command. (Recall from Section 1.3.1’ that GO takes an absolute address, while PUSHJ has a relative address.) A GO instruction is slightly slower than PUSHJ; it is no slower than POP, according to Table 1.3.1’–1, although high-speed implementations of MMIX could implement POP more efficiently. Programs (12) and (16) both have the same length.

The advantages of PUSHJ/POP linkage over GO linkage begin to manifest themselves when we have non-leaf subroutines (namely, subroutines that call other subroutines, possibly themselves). Then the GO-based code of (14) can be replaced by

\begin{verbatim}
Sub  GET  retadd,rJ
     ...
     PUT  rJ,retadd
     POP  X,0
\end{verbatim}

(17)

where \textit{retadd} is a local register. (For example, \textit{retadd} might be $\$5$; its register number is generally greater than or equal to the number of returned results \textit{X}, so the \textit{POP} instruction will automatically remove it from the register stack.) Now the costly memory references of (14) are avoided.

A non-leaf subroutine with many local variables and/or parameters is significantly better off with a register stack than with the memory stack scheme of (15), because we can often perform the computations entirely in registers. We should note, however, that \texttt{MMIX}’s register stack applies only to local variables that are \textit{scalar}, not to local array variables that must be accessed by address computation. Subroutines that need non-scalar local variables should use a scheme like (15) for all such variables, while keeping scalars on the register stack. Both approaches can be used simultaneously, with \texttt{fp} and \texttt{sp} updated only by subroutines that need a memory stack.

If the register stack becomes extremely large, \texttt{MMIX} will automatically store its bottom entries in the stack segment of memory, using a behind-the-scenes procedure that we will study in Section 1.4.3. (Recall from Section 1.3.2 that the stack segment begins at address \texttt{\#6000000000000000}.) \texttt{MMIX} stores register stack items in memory also when a \texttt{SAVE} command saves a program’s entire current context. Saved stack items are automatically restored from memory when a \texttt{POP} command needs them or when an \texttt{UNSAVE} command restores a saved context. But in most cases \texttt{MMIX} is able to push and pop local registers without actually accessing memory, and without actually changing the contents of very many internal machine registers.

Stacks have many other uses in computer programs; we will study their basic properties in Section 2.2.1. We will get a further taste of nested subroutines and recursive procedures in Section 2.3, when we consider operations on trees. Chapter 8 studies recursion in detail.

*\textbf{Assembly language features.} The \texttt{MMIX} assembly language supports the writing of subroutines in three ways that were not mentioned in Section 1.3.2. The most important of these is the \texttt{PREFIX} operation, which makes it easy to define “private” symbols that will not interfere with symbols defined elsewhere in a large program. The basic idea is that a symbol can have a structured form like \texttt{Sub:X} (meaning symbol \texttt{X} of subroutine \texttt{Sub}), possibly carried to several levels like \texttt{Lib:Sub:X} (meaning symbol \texttt{X} of subroutine \texttt{Sub} in library \texttt{Lib}).

Structured symbols are accommodated by extending rule 1 of \texttt{MMIXAL} in Section 1.3.2 slightly, allowing the colon character ‘:’ to be regarded as a “letter” that can be used to construct symbols. Every symbol that does not begin with a colon is implicitly extended by placing the \textit{current prefix} in front of it. The \textit{current prefix} is initially ‘:’, but the user can change it with the PREFIX command. For example,

```
ADD    x, y, z  means  ADD :x, :y, :z
PREFIX Foo:  current prefix is :Foo:
ADD    x, y, z  means  ADD :Foo:x, :Foo:y, :Foo:z
PREFIX Bar:  current prefix is :Foo:Bar:
ADD    :x, y, :z  means  ADD :x, :Foo:Bar:y, :z
PREFIX :  current prefix reverts to :
```
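The prefix rule can be expressed in two small functions. This Python sketch assumes, as the example above suggests, that the operand of PREFIX is itself treated like any other symbol (extended by the current prefix unless it begins with a colon); the function names are hypothetical.

```python
def qualify(symbol, current_prefix):
    """Return the full form of a symbol: unchanged if it begins with a
    colon, otherwise with the current prefix placed in front of it."""
    return symbol if symbol.startswith(":") else current_prefix + symbol


def apply_prefix(operand, current_prefix):
    """New current prefix after 'PREFIX operand' (assumed to follow the
    same extension rule as ordinary symbols)."""
    return qualify(operand, current_prefix)


if __name__ == "__main__":
    prefix = ":"
    print([qualify(s, prefix) for s in ("x", "y", "z")])
    prefix = apply_prefix("Foo:", prefix)
    print(prefix, [qualify(s, prefix) for s in ("x", "y", "z")])
    prefix = apply_prefix("Bar:", prefix)
    print(prefix, [qualify(s, prefix) for s in (":x", "y", ":z")])
    prefix = apply_prefix(":", prefix)
    print(prefix)
```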

One way to use this idea is to replace the opening lines of (16) by

```
        PREFIX Max:
j       IS   $0 ;m IS $1 ;kk IS $2 ;xk IS $3
x0      IS   :x0 ;b IS :b ;t IS :t    External symbols
:Max100 SET  $0,100       Entrance for n = 100
:Max    SL   kk,$0,3      Entrance for general n
        LDO  m,x0,kk
        JMP  1F
        ...               (Continue as in (16))
```

and to add “PREFIX :” at the end. Then the symbols j, m, kk, and xk are free for use in the rest of the program or in the definition of other subroutines. Further examples of the use of prefixes appear in Section 1.4.3.

MMIXAL also includes a pseudo-operation called LOCAL. The assembly command “LOCAL $40” means, for example, that an error message should be given at the end of assembly if GREG commands allocate so many registers that $40 will be global. (This feature is needed only when a subroutine uses more than 32 local registers, because “LOCAL $31” is always implicitly true.)
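The condition that LOCAL checks can be written as a single comparison. The Python sketch below is a hedged illustration, not the assembler’s actual code: `local_ok` is a hypothetical name, and the bounds on G are taken from the statement earlier in this section that \( 32 \leq G \leq 255 \).

```python
MIN_G = 32    # G is never smaller than this
MAX_G = 255   # $255 is always global


def local_ok(n, G):
    """Return True if 'LOCAL $n' is satisfied, that is, if $n is not global
    when the first global register is $G."""
    if not MIN_G <= G <= MAX_G:
        raise ValueError("G must lie between 32 and 255")
    return n < G


if __name__ == "__main__":
    print(local_ok(40, 41))   # $40 is still below the globals
    print(local_ok(40, 40))   # GREG allocations have reached $40
```

Since G never drops below 32, `local_ok(31, G)` holds for every legal G, which is why an explicit LOCAL is needed only for register numbers of 32 or more.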

A third feature for subroutine support, BSPEC . . . ESPEC, is also provided. It allows information to be passed to the object file so that debugging routines and other system programs know what kind of linkage is being used by each subroutine. This feature is discussed in the MMIXware document; it is primarily of interest in the output of compilers.

**Strategic considerations.** When ad hoc subroutines are written for special-purpose use, we can afford to use GREG instructions liberally, so that plenty of global registers are filled with basic constants that make our program run fast. Comparatively few local registers are needed, unless the subroutines are used recursively.

But when dozens or hundreds of general-purpose subroutines are written for inclusion in a large library, with the idea of allowing any user program to include whatever subroutines it needs, we obviously can’t allow each subroutine to allocate a substantial number of globals. Even one global variable per subroutine might be too much.

Thus we want to use GREG generously when we have only a few subroutines, but we want to use it sparingly when the number of subroutines is potentially huge. In the latter case we probably can make good use of local variables without too much loss of efficiency.

Let’s conclude this section by discussing briefly how we might go about writing a complex and lengthy program. How can we decide what kind of subroutines we will need? What calling sequences should be used? One successful way to determine this is to use an iterative procedure:

**Step 0** (Initial idea). First we decide vaguely upon the general plan of attack that the program will use.

**Step 1** (A rough sketch of the program). We start now by writing the “outer levels” of the program, in any convenient language. A somewhat systematic way to go about this has been described very nicely by E. W. Dijkstra, *Structured Programming* (Academic Press, 1972), Chapter 1, and by N. Wirth, *CACM* 14 (1971), 221–227. First we break the whole program into a small number of pieces, which might be thought of temporarily as subroutines although they are called only once. These pieces are successively refined into smaller and smaller parts, having correspondingly simpler jobs to do. Whenever some computational task arises that seems likely to occur elsewhere or that has already occurred elsewhere, we define a subroutine (a real one) to do that job. We do not write the subroutine at this point; we continue writing the main program, assuming that the subroutine has performed its task. Finally, when the main program has been sketched, we tackle the subroutines in turn, trying to take the most complex subroutines first and then their sub-subroutines, etc. In this manner we will come up with a list of subroutines. The actual function of each subroutine has probably already changed several times, so that the first parts of our sketch will by now be incorrect; but that is no problem, since we are merely making a sketch. We now have a reasonably good idea about how each subroutine will be called and how general-purpose it should be. We should consider extending the generality of each subroutine, at least a little.

**Step 2** (First working program). The next step goes in the opposite direction from step 1. We now write in computer language, say **MMIXAL** or **PL/MMIX** or (most probably) a higher-level language. We start this time with the lowest level subroutines, and do the main program last. As far as possible, we try never to write any instructions that call a subroutine before the subroutine itself has been coded. (In step 1, we tried the opposite, never considering a subroutine until all of its calls had been written.)

As more and more subroutines are written during this process, our confidence gradually grows, since we are continually extending the power of the machine we are programming. After an individual subroutine is coded, we should immediately prepare a complete description of what it does, and what its calling sequences are, as in (10). It is also important to be sure that global variables are not used for two conflicting purposes at the same time; when preparing the sketch in step 1, we didn’t have to worry about such problems.

**Step 3** (Reexamination). The result of step 2 should be very nearly a working program, but we may be able to improve it. A good way is to reverse direction again, studying for each subroutine all of the places it is called. Perhaps the subroutine should be enlarged to do some of the more common things that are always done by the outside routine just before or after the subroutine is called. Perhaps several subroutines should be merged into one; or perhaps a subroutine is called only once and should not be a subroutine at all. Perhaps a subroutine is never called and can be dispensed with entirely.

At this point, it is often a good idea to scrap everything and start over again at step 1, or even at step 0! This is not intended to be a facetious remark; the time spent in getting this far has not been wasted, for we have learned a great deal about the problem. With hindsight, we will probably have discovered several improvements that could be made to the program's overall organization. There's no reason to be afraid to go back to step 1 — it will be much easier to go through steps 2 and 3 again, now that a similar program has been done already. Moreover, we will quite probably save as much debugging time later on as it will take to rewrite everything. Some of the best computer programs ever written owe much of their success to the fact that all the work was unintentionally lost, at about this stage, and the authors were forced to begin again.

On the other hand, there is probably never a point when a complex computer program cannot be improved somehow, so steps 1 and 2 should not be repeated indefinitely. When significant improvements can clearly be made, the additional time required to start over is well spent, but eventually a point of diminishing returns is reached.

**Step 4** (Debugging). After a final polishing of the program, including perhaps the allocation of storage and other last-minute details, it is time to look at it in still another direction from the three that were used in steps 1, 2, and 3: Now we study the program in the order in which the computer will perform it. This may be done by hand or, of course, by machine. The author has found it quite helpful at this point to make use of system routines that trace each instruction the first two times it is executed; it is important to rethink the ideas underlying the program and to check that everything is actually taking place as expected.

Debugging is an art that needs much further study, and the way to approach it is highly dependent on the facilities available at each computer installation. A good start towards effective debugging is often the preparation of appropriate test data. The most successful debugging techniques are typically designed and built into the program itself: Many of today's best programmers devote nearly half of their programs to facilitating the debugging process in the other half. The first half, which usually consists of fairly straightforward routines that display relevant information in a readable format, will eventually be of little importance, but the net result is a surprising gain in productivity.

Another good debugging practice is to keep a record of every mistake made. Even though this will probably be quite embarrassing, such information is invaluable to anyone doing research on the debugging problem, and it will also help you learn how to cope with future errors.

Note: The author wrote most of the preceding comments in 1964, after he had successfully completed several medium-sized software projects but before he had developed a mature programming style. Later, during the 1980s, he
learned that an additional technique, called structured documentation or literate programming, is probably even more important. A summary of his current beliefs about the best way to write programs of all kinds appears in the book *Literate Programming* (Cambridge University Press, first published in 1992). Incidentally, Chapter 11 of that book contains a detailed record of all bugs removed from the \TeX{} program during the period 1978–1991.

*Up to a point it is better to let the snags [bugs] be there than to spend such time in design that there are none (how many decades would this course take?).*

— A. M. TURING, *Proposals for ACE* (1945)

**EXERCISES**

1. [20] Write a subroutine \texttt{GoMaxR} that generalizes Algorithm 1.2.10M by finding the maximum value of \(\{X[a], X[a + r], X[a + 2r], \ldots, X[n]\}\), where \(r\) and \(n\) are positive parameters and \(a\) is the smallest positive number with \(a \equiv n \pmod{r}\), namely \(a = 1 + (n - 1) \bmod r\). Give a special entrance \texttt{GoMax} for the case \(r = 1\), using a \texttt{GO}-style calling sequence so that your subroutine is a generalization of (7).

2. [20] Convert the subroutine of exercise 1 from \texttt{GO} linkage to \texttt{PUSHJ/POP} linkage.

3. [15] How can scheme (15) be simplified when \texttt{Sub} is a leaf subroutine?

4. [15] The text in this section speaks often of \texttt{PUSHJ}, but Section 1.3.1 mentions also a command called \texttt{PUSHGO}. What is the difference between \texttt{PUSHJ} and \texttt{PUSHGO}?

5. [10] True or false: The number of marginal registers is \(G - L\).

6. [10] What is the effect of the instruction \texttt{DIVU $5,$5,$5} if \$5 is a marginal register?

7. [10] What is the effect of the instruction \texttt{INCML $5,#abcd} if \$5 is a marginal register?

8. [15] Suppose the instruction \texttt{SET $15,0} is performed when there are 10 local registers. This increases the number of local registers to 16; but the newly local registers (including \$15) are all zero, so they still behave essentially as if they were marginal. Is the instruction \texttt{SET $15,0} therefore entirely redundant in such a case?

9. [20] When a trip interrupt has been enabled for some exceptional condition like arithmetic overflow, the trip handler might be called into action at unpredictable times. We don’t want to clobber any of the interrupted program’s registers; yet a trip handler can’t do much unless it has “elbow room.” Explain how to use \texttt{PUSHJ} and \texttt{POP} so that plenty of local registers are safely available to a handler.

10. [20] True or false: If an \texttt{MMIX} program never uses the instructions \texttt{PUSHJ}, \texttt{PUSHGO}, \texttt{POP}, \texttt{SAVE}, or \texttt{UNSAVE}, all 256 registers \$0, \$1, \ldots, \$255 are essentially equivalent, in the sense that the distinction between local, global, and marginal registers is irrelevant.

11. [20] Guess what happens if a program issues more \texttt{POP} instructions than \texttt{PUSH} instructions.

12. [10] True or false:

   a) The current prefix in an \texttt{MMIXAL} program always begins with a colon.

   b) The current prefix in an \texttt{MMIXAL} program always ends with a colon.

   c) The symbols : and :: are equivalent in \texttt{MMIXAL} programs.

13. [21] Write two \texttt{MMIX} subroutines to calculate the Fibonacci number \( F_n \bmod 2^{64} \),
given \( n \). The first subroutine should call itself recursively, using the definition
\[
F_n = n \quad \text{if } n \leq 1; \quad F_n = F_{n-1} + F_{n-2} \quad \text{if } n > 1.
\]
The second subroutine should not be recursive. Both subroutines should use \texttt{PUSH/POP}
linkage and should avoid global variables entirely.

14. [M21] What is the running time of the subroutines in exercise 13?

15. [21] Convert the recursive subroutine of exercise 13 to \texttt{GO}-style linkage, using a
memory stack as in (15) instead of \texttt{MMIX}'s register stack. Compare the efficiency of the
two versions.

16. [25] (Nonlocal \texttt{goto} statements.) Sometimes we want to jump out of a subroutine,
to a location that is not in the calling routine. For example, suppose subroutine \texttt{A}
calls subroutine \texttt{B}, which calls subroutine \texttt{C}, which calls itself recursively a number of
times before deciding that it wants to exit directly to \texttt{A}. Explain how to handle such situations
when using \texttt{MMIX}'s register stack. (We can't simply \texttt{JMP} from \texttt{C} to \texttt{A}; the stack must be
properly popped.)

**1.4.2. Coroutines**

Subroutines are special cases of more general program components, called \textit{coroutines}. In contrast to the unsymmetric relationship between a main routine
and a subroutine, there is complete symmetry between coroutines, which \textit{call on each other}.

To understand the coroutine concept, let us consider another way of thinking
about subroutines. The viewpoint adopted in the previous section was that a
subroutine was merely an extension of the computer hardware, introduced to save
lines of coding. This may be true, but another point of view is also possible:
We may consider the main program and the subroutine as a \textit{team} of programs,
each member of the team having a certain job to do. The main program, in
the course of doing its job, will activate the subprogram; the subprogram will
perform its own function and then activate the main program. We might stretch
our imagination to believe that, from the subroutine's point of view, when it
exits \textit{it} is calling the \textit{main} routine; the main routine continues to perform its
duty, then "exits" to the subroutine. The subroutine acts, then calls the main
routine again.

This egalitarian philosophy may sound far-fetched, but it actually rings
true with respect to coroutines. There is no way to distinguish which of two
coroutines is subordinate to the other. Suppose a program consists of coroutines
\texttt{A} and \texttt{B}; when programming \texttt{A}, we may think of \texttt{B} as our subroutine, but when
programming \texttt{B}, we may think of \texttt{A} as our subroutine. Whenever a coroutine is
activated, it resumes execution of its program at the point where the action was
last suspended.

The coroutines \texttt{A} and \texttt{B} might, for example, be two programs that play chess.
We can combine them so that they will play against each other.

Such coroutine linkage is easy to achieve with \texttt{MMIX} if we set aside two
global registers, \texttt{a} and \texttt{b}. In coroutine \texttt{A}, the instruction \texttt{GO a,b,0} is used to
activate coroutine \texttt{B}; in coroutine \texttt{B}, the instruction \texttt{GO b,a,0} is used to activate coroutine \texttt{A}. This scheme requires only \(3\upsilon\) time to transfer control each way.

The essential difference between routine-subroutine and coroutine-coroutine linkage can be seen by comparing the \texttt{GO}-type linkage of the previous section with the present scheme. A subroutine is always initiated \textit{at its beginning}, which is usually a fixed place; the main routine or a coroutine is always initiated \textit{at the place following} where it last terminated.
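The contrast between the two kinds of initiation can be imitated in a higher-level language. The following Python sketch is only an analogy, not MMIX code: it assumes that a Python generator is an acceptable stand-in for a coroutine (generators are resumed by their caller, so the symmetry is incomplete), and the names `subroutine`, `coroutine`, and `demo` are hypothetical.

```python
def subroutine(log):
    """An ordinary subroutine: every call starts again at the top."""
    log.append("sub: step 1")
    return "done"


def coroutine(log):
    """A generator: each resumption continues just after the last yield."""
    log.append("co: step 1")
    yield "first"
    log.append("co: step 2")
    yield "second"
    log.append("co: step 3")
    yield "third"


def demo():
    log = []
    # Two calls of the subroutine both begin at its first statement.
    subroutine(log)
    subroutine(log)
    # Two resumptions of the coroutine pick up where it left off.
    co = coroutine(log)
    results = [next(co), next(co)]
    return log, results


if __name__ == "__main__":
    log, results = demo()
    print(log)
    print(results)
```

The log records "step 1" twice for the subroutine, but "step 1" followed by "step 2" for the coroutine, which is the distinction drawn above.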

Coroutines arise most naturally in practice when they are connected with algorithms for input and output. For example, suppose it is the duty of coroutine A to read a file and to perform some transformation on the input, reducing it to a sequence of items. Another coroutine, which we will call B, does further processing of those items, and outputs the answers; B will periodically call for the successive input items found by A. Thus, coroutine B jumps to A whenever it wants the next input item, and coroutine A jumps to B whenever an input item has been found. The reader may say, “Well, B is the main program and A is merely a subroutine for doing the input.” This, however, becomes less true when the process A is very complicated; indeed, we can imagine A as the main routine and B as a subroutine for doing the output, and the above description remains valid. The usefulness of the coroutine idea emerges midway between these two extremes, when both A and B are complicated and each one calls the other in numerous places. It is not easy to find short, simple examples of coroutines that illustrate the importance of the idea; the most useful coroutine applications are generally quite lengthy.

In order to study coroutines in action, let us consider a contrived example. Suppose we want to write a program that translates one code into another. The input code to be translated is a sequence of 8-bit characters terminated by a period, such as

\begin{equation}
\texttt{a2b5e3426fg0zyw3210pq89r}.
\end{equation}

This code appears on the standard input file, interspersed with whitespace characters in an arbitrary fashion. For our purposes a “whitespace character” will be any byte whose value is less than or equal to \texttt{#20}, the ASCII code for \texttt{' '}.

All whitespace characters in the input are ignored; the other characters should be interpreted as follows, when they are read in sequence: (1) If the next character is one of the decimal digits \texttt{0} or \texttt{1} or \ldots or \texttt{9}, say \texttt{n}, it indicates \((n + 1)\) repetitions of the following character, whether the following character is a digit or not. (2) A nondigit simply denotes itself. The output of our program is to consist of the resulting sequence separated into groups of three characters each, until a period appears; the last group may have fewer than three characters. For example, (1) should be translated into

\begin{equation}
\texttt{abb bee eee e44 446 66f gzy w22 220 0pq 999 999 999 r}.
\end{equation}

Notice that \texttt{3426f} does not mean \texttt{3427} repetitions of the letter \texttt{f}; it means \texttt{4} fours and \texttt{3} sixes followed by \texttt{f}. If the input sequence is \texttt{1.}, the output is simply \texttt{.}, not \texttt{..}, because the first period terminates the output. The goal of our program is to produce a sequence of lines on the standard output file, with 16 three-character groups per line (except, of course, that the final line might be shorter). The three-character groups should be separated by blank spaces, and each line should end as usual with the ASCII newline character \texttt{#a}.
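Before turning to coroutines, it may help to pin the specification down with a direct, single-routine rendering of the rules. The following Python sketch is not part of the MMIX program; the function name `translate` and its parameter `groups_per_line` are hypothetical, and the sketch assumes that input lacking a final period should be treated as if a period were present.

```python
def translate(code, groups_per_line=16):
    """Translate `code` by the rules in the text; return the output lines."""
    # Ignore whitespace, meaning any character whose code is at most 0x20.
    chars = [c for c in code if ord(c) > 0x20]
    chars.append(".")  # assumption: a missing final period is supplied

    expanded = []
    i = 0
    while True:
        c = chars[i]
        i += 1
        repeat = 1
        if "0" <= c <= "9":       # rule (1): digit n means n + 1 copies
            repeat = int(c) + 1
            c = chars[i]
            i += 1
        if c == ".":              # the first period ends the output
            expanded.append(".")
            break
        expanded.extend(c * repeat)

    # Split into groups of three, then into lines of `groups_per_line` groups.
    text = "".join(expanded)
    groups = [text[k:k + 3] for k in range(0, len(text), 3)]
    return [" ".join(groups[k:k + groups_per_line])
            for k in range(0, len(groups), groups_per_line)]


if __name__ == "__main__":
    for line in translate("a2b5e3426fg0zyw3210pq89r."):
        print(line)
```

Applied to the input (1), this sketch yields the single line shown in (2); applied to `1.`, it yields a lone period.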

To accomplish this translation, we will write two coroutines and a subroutine. The program begins by giving symbolic names to three global registers, one for temporary storage and the others for coroutine linkage.

01 * An example of coroutines
02 t      IS   $255   Temporary data of short duration
03 in     GREG 0      Address for resuming the first coroutine
04 out    GREG 0      Address for resuming the second coroutine

The next step is to set aside the memory locations used for working storage.

05 * Input and output buffers
06        LOC   Data_Segment
07        GREG  @                         Base address
08 OutBuf TETRA "               ",#a,0   (see exercise 3)
09 Period BYTE  '.'
10 InArgs OCTA  InBuf,1000
11 InBuf  LOC   #100

Now we turn to the program itself. The subroutine we need, called NextChar, is designed to find non-whitespace characters of the input, and to return the next such character:

12 * Subroutine for character input
13 inptr    GREG 0             (the current input position)
14 1H       LDA  t,InArgs      Fill the input buffer.
15          TRAP 0,Fgets,StdIn
16          LDA  inptr,InBuf   Start at beginning of buffer.
17 0H       GREG Period
18          CSN  inptr,t,0B    If error occurred, read a '.'.
19 NextChar LDBU $0,inptr,0    Fetch the next character.
20          INCL inptr,1
21          BZ   $0,1B         Branch if at end of buffer.
22          CMPU t,$0,' '
23          BNP  t,NextChar    Branch if character is whitespace.
24          POP  1,0           Return to caller.

This subroutine has the following characteristics:

Calling sequence: PUSHJ $R,NextChar.
Entry conditions: inptr points to the first unread character.
Exit conditions: $R = next non-whitespace character of input;
inptr is ready for the next entry to NextChar.

The subroutine also changes register t, namely register $255, but we usually omit that register from such specifications, as we did in 1.4.1–(10).

Our first coroutine, called In, finds the characters of the input code with the proper replication. It begins initially at location In1:

```assembly
25 * First coroutine
26 count  GREG  0              (the repetition counter)
27 1H     GO    in,out,0       Send a character to the Out coroutine.
28 In1    PUSHJ $0,NextChar    Get a new character.
29        CMPU  t,$0,'9'
30        PBP   t,1B           Branch if it exceeds '9'.
31        SUB   count,$0,'0'
32        BN    count,1B       Branch if it is less than '0'.
33        PUSHJ $0,NextChar    Get another character.
34 1H     GO    in,out,0       Send it to Out.
35        SUB   count,count,1  Decrease the repetition counter.
36        PBNN  count,1B       Repeat if necessary.
37        JMP   In1            Otherwise begin a new cycle.
```

This coroutine has the following characteristics:

Calling sequence (from Out): GO out, in, 0.
Exit conditions (to Out): $0 = next input character with proper replication.
Entry conditions (upon return): $0 unchanged from its value at exit.

Register count is private to In and need not be mentioned.

The other coroutine, called Out, puts the code into three-character groups and sends them to the standard output file. It begins initially at Out1:

```assembly
38 * Second coroutine
39 outptr GREG 0               (the current output position)
40 1H     LDA  t,OutBuf        Empty the output buffer.
41        TRAP 0,Fputs,StdOut
42 Out1   LDA  outptr,OutBuf   Start at beginning of buffer.
43 2H     GO   out,in,0        Get a new character from In.
44        STBU $0,outptr,0     Store it as the first of three.
45        CMP  t,$0,'.'
46        BZ   t,1F            Branch if it was '.'.
47        GO   out,in,0        Otherwise get another character.
48        STBU $0,outptr,1     Store it as the second of three.
49        CMP  t,$0,'.'
50        BZ   t,2F            Branch if it was '.'.
51        GO   out,in,0        Otherwise get another character.
52        STBU $0,outptr,2     Store it as the third of three.
53        CMP  t,$0,'.'
54        BZ   t,3F            Branch if it was '.'.
55        INCL outptr,4        Otherwise advance to next group.
56 0H     GREG OutBuf+4*16
57        CMP  t,outptr,0B
58        PBNZ t,2B            Branch if fewer than 16 groups.
59        JMP  1B              Otherwise finish the line.
60 3H     INCL outptr,1        Move past a stored character.
61 2H     INCL outptr,1        Move past a stored character.
62 0H     GREG #a              (newline character)
63 1H     STBU 0B,outptr,1     Store newline after period.
64 0H     GREG 0               (null character)
65        STBU 0B,outptr,2     Store null after newline.
66        LDA  t,OutBuf
67        TRAP 0,Fputs,StdOut  Output the final line.
68        TRAP 0,Halt,0        Terminate the program.
```

The characteristics of Out are designed to complement those of In:

Calling sequence (from In):      GO in,out,0.
Exit conditions (to In):         $0 unchanged from its value at entry.
Entry conditions (upon return):  $0 = next input character with proper replication.

To complete the program, we need to get everything off to a good start.
Initialization of coroutines tends to be a little tricky, although not really difficult.

69 * Initialization
70 Main   LDA  inptr,InBuf   Initialize NextChar.
71        GETA in,In1        Initialize In.
72        JMP  Out1          Start with Out (see exercise 2).

This completes the program. The reader should study it carefully, noting in particular how each coroutine can be read and written independently as though the other coroutine were its subroutine.
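The same division of labor can be imitated in a higher-level language. The following Python sketch is an analogy under stated assumptions rather than a translation of the MMIX code: Python generators are not fully symmetric, so a small driver `run` (hypothetical, like the names `next_chars`, `in_coroutine`, and `out_coroutine`) passes control back and forth where the MMIX program uses `GO`. End of input is assumed to read as a period, in the spirit of line 18.

```python
def next_chars(text):
    """Stand-in for NextChar: supply the non-whitespace characters of text."""
    for c in text:
        if ord(c) > 0x20:
            yield c
    while True:
        yield "."  # assumption: end of input reads as a period (cf. line 18)


def in_coroutine(text):
    """Stand-in for In: yield each character with its proper replication."""
    chars = next_chars(text)
    while True:
        c = next(chars)
        if "0" <= c <= "9":
            count = int(c)
            c = next(chars)
            for _ in range(count + 1):
                yield c          # plays the role of "GO in,out,0"
        else:
            yield c


def out_coroutine(lines):
    """Stand-in for Out: collect groups of three, 16 groups per line."""
    groups = []
    while True:
        group = ""
        for _ in range(3):
            c = yield            # plays the role of "GO out,in,0"
            group += c
            if c == ".":
                lines.append(" ".join(groups + [group]))
                return           # plays the role of "TRAP 0,Halt,0"
        groups.append(group)
        if len(groups) == 16:
            lines.append(" ".join(groups))
            groups = []


def run(text):
    """Driver: start with Out, then hand each character from In to Out."""
    lines = []
    producer = in_coroutine(text)
    consumer = out_coroutine(lines)
    next(consumer)               # Out runs until its first request
    try:
        while True:
            consumer.send(next(producer))
    except StopIteration:
        pass                     # Out has halted after seeing a period
    return lines


if __name__ == "__main__":
    print("\n".join(run("a2b5e3426fg0zyw3210pq89r.")))
```

Neither `in_coroutine` nor `out_coroutine` mentions the other; each is written as if it simply produced or requested one character at a time, which is the independence noted above.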

We learned in Section 1.4.1 that MMIX’s PUSHJ and POP instructions are superior to the GO command with respect to subroutine linkage. But with coroutines the opposite is true: Pushing and popping are quite unsymmetrical, and MMIX’s register stack can get hopelessly entangled if two or more coroutines try to use it simultaneously. (See exercise 6.)

There is an important relation between coroutines and multipass algorithms. For example, the translation process we have just described could have been done in two distinct passes: We could first have done just the In coroutine, applying it to the entire input and writing each character with the proper amount of replication into an intermediate file. After this was finished, we could have read that file and done just the Out coroutine, taking the characters in groups of three. This would be called a “two-pass” process. (Intuitively, a “pass” denotes a complete scan of the input. This definition is not precise, and in many algorithms the number of passes taken is not at all clear; but the intuitive concept of “pass” is useful in spite of its vagueness.)

Figure 22(a) illustrates a four-pass process. Quite often we will find that the same process can be done in just one pass, as shown in part (b) of the figure, if we substitute four coroutines A, B, C, D for the respective passes A, B, C, D. Coroutine A will jump to B when pass A would have written an item of output on File 1; coroutine B will jump to A when pass B would have read an item of input from File 1, and B will jump to C when pass B would have written an item
of output on File 2, etc. UNIX® users will recognize this as a “pipe,” denoted by “PassA | PassB | PassC | PassD”. The programs for passes B, C, and D are sometimes referred to as “filters.”

\[\text{(a)}\quad \text{Input} \rightarrow \text{Pass } A \rightarrow \text{File 1} \rightarrow \text{Pass } B \rightarrow \text{File 2} \rightarrow \text{Pass } C \rightarrow \text{File 3} \rightarrow \text{Pass } D \rightarrow \text{Output}\]

\[\text{(b)}\quad \text{Input} \rightarrow \text{Coroutine } A \leftrightarrow \text{Coroutine } B \leftrightarrow \text{Coroutine } C \leftrightarrow \text{Coroutine } D \rightarrow \text{Output}\]

**Fig. 22.** Passes: (a) a four-pass algorithm, and (b) a one-pass algorithm.
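The correspondence between the two parts of the figure can be sketched in Python, assuming that lazily evaluated generators are an acceptable stand-in for coroutines and that a fully built list stands in for an intermediate file. The four stages and their names are hypothetical; they were chosen only to make the data flow visible.

```python
def stage_a(tokens):
    """Pass A: convert text tokens to integers."""
    for s in tokens:
        yield int(s)


def stage_b(numbers):
    """Pass B: square each number."""
    for n in numbers:
        yield n * n


def stage_c(numbers):
    """Pass C: keep only the odd values."""
    for n in numbers:
        if n % 2 == 1:
            yield n


def stage_d(numbers):
    """Pass D: emit running totals."""
    total = 0
    for n in numbers:
        total += n
        yield total


def four_pass(tokens):
    """Part (a): each pass finishes and writes its whole 'file' first."""
    file1 = list(stage_a(tokens))
    file2 = list(stage_b(file1))
    file3 = list(stage_c(file2))
    return list(stage_d(file3))


def one_pass(tokens):
    """Part (b): the stages are linked, so items flow through one at a time."""
    return list(stage_d(stage_c(stage_b(stage_a(tokens)))))


if __name__ == "__main__":
    data = ["1", "2", "3", "4", "5"]
    print(four_pass(data))
    print(one_pass(data))
```

Both functions compute the same result; they differ only in whether the intermediate data is materialized between stages.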

Conversely, a process done by \(n\) coroutines can often be transformed into an \(n\)-pass process. Due to this correspondence it is worthwhile to compare multipass algorithms with one-pass algorithms.

a) *Psychological difference.* A multipass algorithm is generally easier to create and to understand than a one-pass algorithm for the same problem. A process that has been broken into a sequence of small steps, which happen one after the other, is easier to comprehend than an involved process in which many transformations take place simultaneously.

Also, if a very large problem is being tackled and if many people are supposed to cooperate in producing a computer program, a multipass algorithm provides a natural way to divide up the job.

These advantages of a multipass algorithm are present in coroutines as well, since each coroutine can be written essentially separate from the others. The linkage makes an apparently multipass algorithm into a single-pass process.

b) *Time difference.* The time required to pack, write, read, and unpack the intermediate data that flows between passes (for example, the information in the files of Fig. 22) is avoided in a one-pass algorithm. For this reason, a one-pass algorithm will be faster.

c) *Space difference.* The one-pass algorithm requires space to hold all the programs in memory simultaneously, while a multipass algorithm requires space for only one at a time. This requirement may affect the speed, even to a greater extent than indicated in statement (b). For example, many computers have a limited amount of “fast memory” and a larger amount of slower memory; if each
pass just barely fits into the fast memory, the result will be considerably faster than if we use coroutines in a single pass (since the use of coroutines would presumably force most of the program to appear in the slower memory or to be repeatedly swapped in and out of fast memory).

Occasionally there is a need to design algorithms for several computer configurations at once, some of which have larger memory capacity than others. In such cases it is possible to write the program in terms of coroutines, and to let the memory size govern the number of passes: Load together as many coroutines as feasible, and supply input or output subroutines for the missing links.

Although this relationship between coroutines and passes is important, we should keep in mind that coroutine applications cannot always be split into multipass algorithms. If coroutine B gets input from A and also sends back crucial information to A, as in the example of chess play mentioned earlier, the sequence of actions can’t be converted into pass A followed by pass B.

Conversely, it is clear that some multipass algorithms cannot be converted to coroutines. Some algorithms are inherently multipass; for example, the second pass may require cumulative information from the first pass, like the total number of occurrences of a certain word in the input. There is an old joke worth noting in this regard:

Little old lady, riding a bus. “Little boy, can you tell me how to get off at Pasadena Street?”

Little boy. “Just watch me, and get off two stops before I do.”

(The joke is that the little boy gives a two-pass algorithm.)
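A small Python sketch may make the obstacle concrete. It assumes a hypothetical task in which every occurrence of a given word must be labeled with the total number of times that word occurs; the function name `annotate_with_totals` is likewise hypothetical. No output for the first occurrence can be produced until the entire input has been scanned, so the two scans cannot be interleaved.

```python
def annotate_with_totals(words, target):
    """Label each occurrence of `target` with its total count in `words`.

    `words` must be a list (or other re-readable sequence), because the
    input is scanned twice.
    """
    # Pass 1: scan the whole input to gather the cumulative information.
    total = sum(1 for w in words if w == target)
    # Pass 2: rescan the input, now that the total is known.
    return [f"{w}({total})" if w == target else w for w in words]


if __name__ == "__main__":
    print(annotate_with_totals(["a", "b", "a"], "a"))
```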

So much for multipass algorithms. Coroutines also play an important role in discrete system simulation; see Section 2.2.5. When several more-or-less independent coroutines are controlled by a master process, they are often called threads of a computation. We will see further examples of coroutines in numerous places throughout this series of books. The important idea of replicated coroutines is discussed in Chapter 8, and some interesting applications of this idea may be found in Chapter 10.

**EXERCISES**

1. [10] Explain why short, simple examples of coroutines are hard for the author of a textbook to find.

2. [20] The program in the text starts up the Out coroutine first. What would happen if In were the first to be executed instead (that is, if lines 71 and 72 were changed to “GETA out,Out1; JMP In1”)?

3. [15] Explain the TETRA instruction on line 08 of the program in the text. (There are exactly fifteen blank spaces between the double-quote marks.)

4. [20] Suppose two coroutines A and B want to treat MMIX’s remainder register rR as if it were their private property, although both coroutines do division. (In other words, when one coroutine jumps to the other, it wants to be able to assume that the contents of rR will not have been altered when the other coroutine returns.) Devise a coroutine linkage that allows them this freedom.

5. [20] Could MMIX do reasonably efficient coroutine linkage by using its PUSH and POP instructions, without any GO commands?

6. [20] The program in the text uses MMIX’s register stack only in a very limited way, namely when In calls NextChar. Discuss to what extent two cooperating coroutines could both make use of the register stack.

7. [30] Write an MMIX program that reverses the translation done by the program in the text. That is, your program should convert a file containing three-character groups like (2) into a file containing code like (1). The output should be as short a string of characters as possible, except for newlines; thus, for example, the zero before the z in (1) would not really be produced from (2).

**1.4.3. Interpretive Routines**

In this section we will investigate a common type of program known as an interpretive routine, often called an interpreter for short. An interpretive routine is a computer program that performs the instructions of another program, where the other program is written in some machine-like language. By a machine-like language, we mean a way of representing instructions, where the instructions typically have operation codes, addresses, etc. (This definition, like most definitions of today’s computer terms, is not precise, nor should it be; we cannot draw the line exactly and say just which programs are interpreters and which are not.)

Historically, the first interpreters were built around machine-like languages designed specially for simple programming; such languages were easier to use than a real machine language. The rise of symbolic languages for programming soon eliminated the need for interpretive routines of that kind, but interpreters have by no means begun to die out. On the contrary, their use has continued to grow, to the extent that an effective use of interpretive routines may be regarded as one of the essential characteristics of modern programming. The new applications of interpreters are made chiefly for the following reasons:

a) a machine-like language is able to represent a complicated sequence of decisions and actions in a compact, efficient manner; and

b) such a representation provides an excellent way to communicate between passes of a multipass process.

In such cases, special purpose machine-like languages are developed for use in a particular program, and programs in those languages are often generated only by computers. (Today’s expert programmers are also good machine designers. They not only create an interpretive routine, they also define a virtual machine whose language is to be interpreted.)

The interpretive technique has the further advantage of being relatively machine-independent, since only the interpreter must be revised when changing computers. Furthermore, helpful debugging aids can readily be built into an interpretive system.

Examples of type (a) interpreters appear in several places later in this series of books; see, for example, the recursive interpreter in Chapter 8 and the “Parsing
Machine” in Chapter 10. We typically need to deal with situations in which a
great many special cases arise, all similar, but having no really simple pattern.

For example, consider writing an algebraic compiler in which we want to generate efficient machine-language instructions that add two quantities together. There might be ten classes of quantities (constants, simple variables, subscripted variables, fixed or floating point, signed or unsigned, etc.) and the combination of all pairs yields 100 different cases. A long program would be required to do the proper thing in each case. The interpretive solution to this problem is to make up an ad hoc language whose “instructions” fit in one byte. Then we simply prepare a table of 100 “programs” in this language, where each program ideally fits in a single word. The idea is then to pick out the appropriate table entry and to perform the program found there. This technique is simple and efficient.
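The table-driven idea can be sketched in Python on a much smaller scale. Everything in the sketch is hypothetical and chosen only for illustration: it assumes just two classes of quantity (`"const"` and `"var"`) instead of ten, a made-up set of five one-byte instructions, and made-up target instructions `LOAD`, `LOADI`, `ADD`, and `ADDI` that do not belong to any real machine.

```python
# Each one-byte "instruction" of the ad hoc language appends one
# (made-up) target instruction to the output list.
def op_load(out, x, y):
    out.append(f"LOAD {x[1]}")


def op_load_imm(out, x, y):
    out.append(f"LOADI {x[1]}")


def op_add(out, x, y):
    out.append(f"ADD {y[1]}")


def op_add_imm(out, x, y):
    out.append(f"ADDI {y[1]}")


def op_fold(out, x, y):
    # Both operands are constants, so add them at compile time.
    out.append(f"LOADI {x[1] + y[1]}")


ACTIONS = {
    ord("L"): op_load,
    ord("I"): op_load_imm,
    ord("A"): op_add,
    ord("K"): op_add_imm,
    ord("F"): op_fold,
}

# One short "program" per pair of operand classes.
TABLE = {
    ("const", "const"): b"F",
    ("const", "var"): b"IA",
    ("var", "const"): b"LK",
    ("var", "var"): b"LA",
}


def generate_add(x, y):
    """Pick the table entry for the operand classes and interpret it.

    Each operand is a (class, value) pair such as ("var", "x").
    """
    out = []
    for opcode in TABLE[(x[0], y[0])]:   # iterating over bytes gives ints
        ACTIONS[opcode](out, x, y)
    return out


if __name__ == "__main__":
    print(generate_add(("var", "x"), ("const", 3)))
```

With ten classes the table would grow to 100 entries, but `generate_add` itself would not change; only new table entries and perhaps a few new one-byte instructions would be needed.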

An example interpreter of type (b) appears in the article “Computer-Drawn
Flowcharts” by D. E. Knuth, CACM 6 (1963), 555–563. In a multipass program,
the earlier passes must transmit information to the later passes. This information
is often transmitted most efficiently in a machine-like language, as a set of
instructions for the later pass; the later pass is then nothing but a special purpose
interpretive routine, and the earlier pass is a special purpose “compiler.” This
philosophy of multipass operation may be characterized as telling the later pass
what to do, whenever possible, rather than simply presenting it with a lot of
facts and asking it to figure out what to do.

Another example of a type-(b) interpreter occurs in connection with compilers for special languages. If the language includes many features that are not easily done on the machine except by subroutine, the resulting object programs will be very long sequences of subroutine calls. This would happen, for example, if the language were concerned primarily with multiple precision arithmetic. In such a case the object program would be considerably shorter if it were expressed in an interpretive language. See, for example, the book *ALGOL 60 Implementation*, by B. Randell and L. J. Russell (New York: Academic Press, 1964), which describes a compiler to translate from ALGOL 60 into an interpretive language, and which also describes the interpreter for that language; and see “An ALGOL 60 Compiler,” by Arthur Evans, Jr., *Ann. Rev. Auto. Programming* 4 (1964), 87–124, for examples of interpretive routines used within a compiler. The rise of microprogrammed machines and of special-purpose integrated circuit chips has made this interpretive approach even more valuable.

The TeX program, which produced the pages of the book you are now
reading, converted a file that contained the text of this section into an interpretive
language called DVI format, designed by D. R. Fuchs in 1979. [See D. E.
Knuth, TeX: The Program (Reading, Mass.: Addison–Wesley, 1986), Part 31.] 
The DVI file that TeX produced was then processed by an interpreter called
dvips, written by T. G. Rokicki, and converted to a file of instructions in
another interpretive language called PostScript® [Adobe Systems Inc., PostScript
Language Reference, 3rd edition (Reading, Mass.: Addison–Wesley, 1999)]. The
PostScript file was sent to the publisher, who sent it to a commercial printer,
who used a PostScript interpreter to produce printing plates. This three-pass
operation illustrates interpreters of type (b); TeX itself also includes a small
interpreter of type (a) to process the so-called ligature and kerning information
for characters that are being printed [TeX: The Program, §545].

There is another way to look at a program written in interpretive language:
It may be regarded as a series of subroutine calls, one after another. Such a pro-
gram may in fact be expanded into a long sequence of calls on subroutines, and,
conversely, such a sequence can usually be packed into a coded form that is read-
ily interpreted. The advantages of interpretive techniques are the compactness of
representation, the machine independence, and the increased diagnostic capability.
An interpreter can often be written so that the amount of time spent in inter-
pretation of the code itself and branching to the appropriate routine is negligible.

*An MMIX simulator. When the language presented to an interpretive routine
is the machine language of another computer, the interpreter is often called a
simulator (or sometimes an emulator).

In the author’s opinion, entirely too much programmers’ time has been
spent in writing such simulators and entirely too much computer time has been
wasted in using them. The motivation for simulators is simple: A computer
installation buys a new machine and still wants to run programs written for
the old machine (rather than rewriting the programs). However, this usually
costs more and gives poorer results than if a special task force of program-
mers were given temporary employment to do the reprogramming. For example, the
author once participated in such a reprogramming project, and a serious error
was discovered in the original program, which had been in use for several years;
the new program worked at five times the speed of the old, besides giving the
right answers for a change! (Not all simulators are bad; for example, it is usually
advantageous for a computer manufacturer to simulate a new machine before it
has been built, so that software for the new machine may be developed as soon as
possible. But that is a very specialized application.) An extreme example of the
inefficient use of computer simulators is the true story of machine A simulating
machine B running a program that simulates machine C. This is the way to
make a large, expensive computer give poorer results than its cheaper cousin.

In view of all this, why should such a simulator rear its ugly head in this
book? There are three reasons:

a) The simulator we will describe below is a good example of a typical inter-
pretive routine; the basic techniques employed in interpreters are illustrated here.
It also illustrates the use of subroutines in a moderately long program.

b) We will describe a simulator of the MMIX computer, written in (of all things)
the MMIX language. This will reinforce our knowledge of the machine. It also will
facilitate the writing of MMIX simulators for other computers, although we will
not plunge deeply into the details of 64-bit integer or floating point arithmetic.

c) Our simulation of MMIX explains how the register stack can be implemented
efficiently in hardware, so that pushing and popping are accomplished with very
little work. Similarly, the simulator presented here clarifies the SAVE and UNSAVE
operators, and it provides details about the behavior of trip interrupts. Such
things are best understood by looking at a reference implementation, so that we can see how the machine really works.

Computer simulators as described in this section should be distinguished from discrete system simulators. Discrete system simulators are important programs that will be discussed in Section 2.2.5.

Now let’s turn to the task of writing an MMIX simulator. We begin by making a tremendous simplification: Instead of attempting to simulate all the things that happen simultaneously in a pipelined computer, we will interpret only one instruction at a time. Pipeline processing is extremely instructive and important, but it is beyond the scope of this book; interested readers can find a complete program for a full-fledged pipeline “meta-simulator” in the MMIXware document. We will content ourselves here with a simulator that is blithely unaware of such things as cache memory, virtual address translation, dynamic instruction scheduling, reorder buffers, etc., etc. Moreover, we will simulate only the instructions that ordinary MMIX user programs can do; privileged instructions like LDVT, which are reserved for the operating system, will be considered erroneous if they arise. Trap interrupts will not be simulated by our program unless they perform rudimentary input or output as described in Section 1.3.2’.

The input to our program will be a binary file that specifies the initial contents of memory, just as the memory would be set up by an operating system when running a user program (including command line data). We want to mimic the behavior of MMIX’s hardware, pretending that MMIX itself is interpreting the instructions that begin at symbolic location Main; thus, we want to implement the specifications that were laid down in Section 1.3.1’, in the run-time environment that was discussed in Section 1.3.2’. Our program will, for example, maintain an array of 256 octabytes g[0], g[1], . . . , g[255] for the simulated global registers. The first 32 elements of this array will be the special registers listed in Table 1.3.1’–2; one of those special registers will be the simulated clock, rC. We will assume that each instruction takes a fixed amount of time, as specified by Table 1.3.1’–1; the simulated rC will increase by 2^{32} for each μ and by 1 for each ν. Thus, for example, after we have simulated Program 1.3.2’P, the simulated rC will contain #0000 3228 000b b091, which represents 12840μ + 766097ν.
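
The clock convention can be checked with a few lines of Python. This is only a sketch outside the simulator itself; the helper names `pack_clock` and `unpack_clock` are ours, and the sketch assumes that the μ count occupies the upper 32 bits of rC and the ν count the lower 32 bits.

```python
MU_UNIT = 1 << 32          # assumption: each μ adds 2^32 to the simulated rC
MASK64 = (1 << 64) - 1     # rC is a 64-bit register


def pack_clock(mems, oops):
    """Combine a μ count and a ν count into one simulated rC value."""
    return (mems * MU_UNIT + oops) & MASK64


def unpack_clock(rc):
    """Split a simulated rC value into (μ count, ν count)."""
    return rc >> 32, rc & 0xFFFFFFFF


if __name__ == "__main__":
    rc = pack_clock(12840, 766097)
    print(f"{rc:016x}")        # 00003228000bb091
    print(unpack_clock(rc))    # (12840, 766097)
```

Keeping both counts in one octabyte means that a single addition updates the running time of an instruction that costs, say, μ + ν.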

The program is rather long, but it has many points of interest and we will study it in short easy pieces. It begins as usual by defining a few symbols and by specifying the contents of the data segment. We put the array of 256 simulated global registers first in that segment; for example, the simulated $255 will be the octabyte g[255], in memory location Global+8*255. This global array is followed by a similar array called the local register ring, where we will keep the top items of the simulated register stack. The size of this ring is set to 256, although 512 or any higher power of 2 would also work. (A large ring of local registers costs more, but it might be noticeably faster when a program uses the register stack heavily. One of the purposes of a simulator is to find out whether additional hardware would be worth the expense.) The main portion of the data segment, starting at Chunk0, will be devoted to the simulated memory.
001 * MMIX Simulator (Simplified)
002 t IS $255   Volatile register for temporary info
003 lring_size IS 256   Size of the local register ring
004 LOC Data_Segment   Start at location #2000000000000000
005 Global LOC @+8*256   256 octabytes for global registers
006 g GREG Global      Base address for globals
007 Local LOC @+8*lring_size        lring_size octabytes for local registers
008 l GREG Local      Base address for locals
009 e GREG @      Base address for IOArg and Chunk0
010 IOArg OCTA 0,BinaryRead     (See exercise 20)
011 Chunk0 IS @      Beginning of simulated memory area
012 LOC #100      Put everything else in the text segment.

One of the key subroutines we will need is called MemFind. Given a 64-bit address A, this subroutine returns the resulting address R where the simulated contents of Mem[A] can be found. Of course 2^{64} bytes of simulated memory cannot be squeezed into a 2^{61}-byte data segment; but the simulator remembers all addresses that have occurred before, and it assumes that all locations not yet encountered are equal to zero.

Memory is divided into "chunks" of 2^{12} bytes each. MemFind looks at the leading 64 - 12 = 52 bits of A to see what chunk it belongs to, and extends the list of known chunks, if necessary. Then it computes R by adding the trailing 12 bits of A to the starting address of the relevant simulated chunk. (The chunk size could be any power of 2, as long as each chunk contains at least one octabyte. Small chunks cause MemFind to search through longer lists of chunks-in-hand; large chunks cause MemFind to waste space for bytes that will never be accessed.)

Each simulated chunk is encapsulated in a "node," which occupies 2^{12} + 24 bytes of memory. The first octabyte of such a node, called the KEY, identifies the simulated address of the first byte in the chunk. The second octabyte, called the LINK, points to the next node on MemFind's list; it is zero on the last node of the list. The LINK is followed by 2^{12} bytes of simulated memory called the DATA. Finally, each node ends with eight all-zero bytes, which are used as padding in the implementation of input-output (see exercises 15-17).

MemFind maintains its list of chunk nodes in order of use: The first node, pointed to by head, is the one that MemFind found on the previous call, and it links to the next-most-recently-used chunk, etc. If the future is like the past, MemFind will therefore not have to search far down its list. (Section 6.1 discusses such "self-organizing" list searches in detail.) Initially head points to Chunk0, whose KEY and LINK and DATA are all zero. The allocation pointer alloc is set initially to the place where the next chunk node will appear when it is needed, namely Chunk0+nodesize.

We implement MemFind with the PREFIX operation of MMIXAL discussed in Section 1.4.1', so that the private symbols head, key, addr, etc., will not conflict with any symbols in the rest of the program. The calling sequence will be

    SET arg,A;  PUSHJ res,MemFind        (1)

after which the resulting address R will appear in register res.
013  PREFIX :Mem:  (Begin private symbols for MemFind)
014  head  GREG  0  Address of first chunk
015  curkey  GREG  0  KEY(head)
016  alloc  GREG  0  Address of next chunk to allocate
017  Chunk  IS  #1000  Bytes per chunk, must be a power of 2
018  addr  IS  $0  The given address A
019  key  IS  $1  Its chunk address
020  test  IS  $2  Temporary register for key search
021  newlink  IS  $3  The second most recently used node
022  p  IS  $4  Temporary pointer register
023  t  IS  :t  External temporary register
024  KEY  IS  0
025  LINK  IS  8
026  DATA  IS  16
027  nodesize  GREG  Chunk+3*8
028  mask  GREG  Chunk-1
029  :MemFind  ANDN  key, addr, mask
030  CMPU  t, key, curkey
031  PBZ  t, 4F  Branch if head is the right chunk.
032  BN  addr, :Error  Disallow negative addresses A.
033  SET  newlink, head  Prepare for the search loop.
034  1H  SET  p, head  p ← head.
035  LDOU  head, p, LINK  head ← LINK(p).
036  PBNZ  head, 2F  Branch if head ≠ 0.
037  SET  head, alloc  Otherwise allocate a new node.
038  STOU  key, head, KEY
039  ADDU  alloc, alloc, nodesize
040  JMP  3F
041  2H  LDOU  test, head, KEY
042  CMPU  t, test, key
043  BNZ  t, 1B  Loop back if KEY(head) ≠ key.
044  3H  LDOU  t, head, LINK  Adjust pointers: t ← LINK(head).
045  STOU  newlink, head, LINK  LINK(head) ← newlink.
046  SET  curkey, key  curkey ← key.
047  STOU  t, p, LINK  LINK(p) ← t.
048  4H  SUBU  t, addr, key  t ← chunk offset.
049  LDA  $0, head, DATA  $0 ← address of DATA(head).
050  ADDU  $0, t, $0
051  POP  1, 0  Return R.
052  PREFIX :  (End of the ':Mem:' prefix)
053  res  IS  $2  Result register for PUSHJ
054  arg  IS  res+1  Argument register for PUSHJ
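
The search strategy of MemFind can be modeled in a few lines of Python. The sketch below is not a translation of the MMIX code: Python objects stand in for nodes, `None` stands in for a zero LINK, and the names `Node`, `SimMemory`, `find`, and `keys` are ours. It does follow the same move-to-front discipline, in which the chunk found most recently becomes the head of the list.

```python
CHUNK = 0x1000                      # bytes per chunk, as on line 017


class Node:
    """One chunk node: a KEY, a LINK, and CHUNK bytes of DATA."""
    def __init__(self, key):
        self.key = key
        self.link = None            # None plays the role of a zero LINK
        self.data = bytearray(CHUNK)


class SimMemory:
    def __init__(self):
        self.head = Node(0)         # Chunk0: KEY, LINK, and DATA all zero

    def find(self, addr):
        """Return (node, offset) for simulated address addr.

        The node that holds addr is moved to the front of the list."""
        key = addr & ~(CHUNK - 1)
        if self.head.key != key:
            if addr >> 63:
                raise ValueError("negative address")
            newlink = p = self.head
            node = p.link
            while node is not None and node.key != key:
                p, node = node, node.link
            if node is None:
                node = Node(key)    # allocate a fresh all-zero chunk
            else:
                p.link = node.link  # unlink the node that was found
            node.link = newlink     # the old head becomes second
            self.head = node
        return self.head, addr - key

    def keys(self):
        """Chunk keys in list order, most recently used first."""
        out, node = [], self.head
        while node is not None:
            out.append(node.key)
            node = node.link
        return out
```

For example, after calls of `find` on addresses in chunks #2000000000000000, 0, and #3000 (in that order), `keys()` lists the chunks in the reverse of that order.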

We come next to the most interesting aspect of the simulator, the implementation of MMIX's register stack. Recall from Section 1.4.1' that the register stack is conceptually a list of \( \tau \) items \( S[0], S[1], \ldots, S[\tau - 1] \). The final item \( S[\tau - 1] \) is said to be at the “top” of the stack, and MMIX's local registers \( \$0, \$1, \ldots, \$(L-1) \) are the topmost \( L \) items \( S[\tau - L], S[\tau - L + 1], \ldots, S[\tau - 1] \); here \( L \) is the value of special register rL. We could simulate the stack by simply keeping
it entirely in the simulated memory; but an efficient machine wants its registers
to be instantly accessible, not in a relatively slow memory unit. Therefore we
will simulate an efficient design that keeps the topmost stack items in an array
of internal registers called the local register ring.

The basic idea is quite simple. Suppose the local register ring has $\rho$ elements,
$l[0], l[1], \ldots, l[\rho - 1]$. Then we keep local register $\$k$ in $l[(\alpha + k) \mod \rho]$, where
$\alpha$ is an appropriate offset. (The value of $\rho$ is chosen to be a power of 2, so that
remainders mod $\rho$ require no expensive computation. Furthermore we want $\rho$
to be at least 256, so that there is room for all of the local registers.) A push
operation, which renumbers the local registers so that what once was, say, $\$3$ is
now called $\$0$, simply increases the value of $\alpha$ by 3; a pop operation restores the
previous state by decreasing $\alpha$. Although the registers change their numbers, no
data actually needs to be pushed down or popped up.

Of course we need to use memory as a backup when the register stack gets
large. The status of the ring at any time is best visualized in terms of three
variables, $\alpha$, $\beta$, and $\gamma$:

Elements $l[\alpha], l[\alpha + 1], \ldots, l[\beta - 1]$ of the ring are the current local registers
$\$0, \$1, \ldots, \$(L-1)$; elements $l[\beta], l[\beta + 1], \ldots, l[\gamma - 1]$ are currently unused;
and elements $l[\gamma], l[\gamma + 1], \ldots, l[\alpha - 1]$ contain items of the register stack that
have been pushed down. If $\gamma \neq \alpha$, we can increase $\gamma$ by 1 if we first store $l[\gamma]$
in memory. If $\gamma \neq \beta$, we can decrease $\gamma$ by 1 if we then load $l[\gamma]$. MMIX has two
special registers called the stack pointer rS and the stack offset rO, which hold
the memory addresses where $l[\gamma]$ and $l[\alpha]$ will be stored, if necessary. The values
of $\alpha$, $\beta$, and $\gamma$ are related to rL, rS, and rO by the formulas

$$\alpha = (rO/8) \bmod \rho, \quad \beta = (\alpha + rL) \bmod \rho, \quad \gamma = (rS/8) \bmod \rho. \qquad (3)$$
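
Formulas (3) are easy to try out numerically. The following Python sketch is for experimentation only; the function names `ring_pointers` and `local_slot` are ours, and ρ is taken to be 256.

```python
RHO = 256  # ring size ρ, assumed here to be 256 (any power of 2 ≥ 256 works)


def ring_pointers(rO, rL, rS, rho=RHO):
    """Return (alpha, beta, gamma) as defined by formulas (3)."""
    alpha = (rO // 8) % rho
    beta = (alpha + rL) % rho
    gamma = (rS // 8) % rho
    return alpha, beta, gamma


def local_slot(k, rO, rho=RHO):
    """Index of the ring element that holds local register $k."""
    return (rO // 8 + k) & (rho - 1)   # mod ρ is a mask because ρ is a power of 2


if __name__ == "__main__":
    rO, rL, rS = 8 * 300, 5, 8 * 290
    print(ring_pointers(rO, rL, rS))    # (44, 49, 34)
    # A push of three registers adds 24 to rO; the old $3 becomes the new $0.
    print(local_slot(3, rO), local_slot(0, rO + 24))   # 47 47
```

The last line illustrates why pushing moves no data: the ring slot stays put while the register number changes.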

The simulator keeps most of MMIX’s special registers in the first 32 positions
of the global register array. For example, the simulated remainder register rR is
the octabyte in location Global+8*rR. But eight of the special registers, including
rS, rO, rL, and rG, are potentially relevant to every simulated instruction,
so the simulator maintains them separately in its own global registers. Thus, for
example, register ss holds the simulated value of rS, and register ll holds eight
times the simulated value of rL:

055 ss  GREG  0  The simulated stack pointer, rS
056 oo  GREG  0  The simulated stack offset, rO
057 ll  GREG  0  The simulated local threshold register, rL, times 8
058 gg  GREG  0  The simulated global threshold register, rG, times 8

Here is a subroutine that obtains the current value of the simulated register $k, given k. The calling sequence is

    SLU arg,k,3;  PUSHJ res,GetReg

after which the desired value will be in res.
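
What such a subroutine has to compute can be sketched in Python. This is a model based on the usual MMIX conventions (a register is global when its number is at least rG, local when it is less than rL, and otherwise marginal, reading as zero); it is not a transcription of the MMIX listing. The parameter names follow the simulator's scaled registers oo, ll, and gg, while the lists `g` and `l` and the constant names are ours.

```python
LRING_SIZE = 256                    # size of the local register ring
LRING_MASK = 8 * LRING_SIZE - 1     # mask for byte offsets into the ring


def get_reg(arg, g, l, oo, ll, gg):
    """Return the value of simulated register $k, where arg = 8*k.

    g and l are lists of octabytes (global array and local ring);
    oo is the simulated rO, and ll and gg are 8*rL and 8*rG."""
    if arg >= gg:                   # global register
        return g[arg // 8]
    if arg < ll:                    # local register: l[(α + k) mod ρ]
        return l[((oo + arg) & LRING_MASK) // 8]
    return 0                        # marginal register


if __name__ == "__main__":
    g = [0] * 256
    l = [0] * LRING_SIZE
    g[255] = 99
    l[(300 + 2) % LRING_SIZE] = 7   # $2 when rO = 8*300
    oo, ll, gg = 8 * 300, 8 * 5, 8 * 32
    print(get_reg(8 * 255, g, l, oo, ll, gg))  # 99
    print(get_reg(8 * 2, g, l, oo, ll, gg))    # 7
    print(get_reg(8 * 10, g, l, oo, ll, gg))   # 0
```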

Notice the colon in the label field of line 064. This colon is redundant, because the current prefix is `:` (see line 052); the colon on line 029 was, however, necessary for the external symbol MemFind, because at that time the current prefix was `:Mem:`. Colons in the label field, redundant or not, give us a handy way to advertise the fact that a subroutine is being defined.

The next subroutines, StackStore and StackLoad, simulate the operations of increasing γ by 1 and decreasing γ by 1 in the diagram (2). They return no result. StackStore is called only when γ ≠ α; StackLoad is called only when γ ≠ β. Both of them must save and restore rJ, because they are not leaf subroutines.

```
074  :StackStore GET  $0,rJ             Save the return address.
075              AND  t,ss,lring_mask
076              LDOU $1,l,t            $1 ← l[γ].
077              SET  arg,ss
078              PUSHJ res,MemFind
079              STOU $1,res,0          M8[rS] ← $1.
080              ADDU ss,ss,8           Increase rS by 8.
081              PUT  rJ,$0             Restore the return address.
082              POP  0                 Return to caller.
083  :StackLoad  GET  $0,rJ             Save the return address.
084              SUBU ss,ss,8           Decrease rS by 8.
085              SET  arg,ss
086              PUSHJ res,MemFind
087              LDOU $1,res,0          $1 ← M8[rS].
088              AND  t,ss,lring_mask
089              STOU $1,l,t            l[γ] ← $1.
090              PUT  rJ,$0             Restore the return address.
091              POP  0                 Return to caller.
```

(Register rJ on lines 074, 081, 083, and 090 is, of course, the real rJ, not the simulated rJ. When we simulate a machine on itself, we have to remember to keep such things straight!)

The StackRoom subroutine is called when we have just increased β. It checks whether β = γ and, if so, it increases γ.

:StackRoom SUBU t,ss,oo
         SUBU t,t,ll
         AND t,t,lring_mask
         PBNZ t,1F     Branch if (rS−rO)/8 ≢ rL (modulo ρ).
         GET $0,rJ     Oops, we're not a leaf subroutine.
         PUSHJ res,StackStore  Advance rS.
         PUT rJ,$0     Restore the return address.
1H       POP 0         Return to caller.
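
The interplay of the three subroutines can be seen in a small Python model. It is a sketch under simplifying assumptions: a dictionary stands in for the simulated memory that MemFind manages, the class name `RegisterStack` and its method names are ours, and the bookkeeping for rJ is of course unnecessary here.

```python
LRING_SIZE = 256
LRING_MASK = 8 * LRING_SIZE - 1


class RegisterStack:
    def __init__(self):
        self.l = [0] * LRING_SIZE   # the local register ring
        self.mem = {}               # stand-in for simulated memory
        self.ss = 0                 # simulated rS
        self.oo = 0                 # simulated rO
        self.ll = 0                 # 8 times the simulated rL

    def stack_store(self):
        """Increase γ by 1: M8[rS] ← l[γ], then rS ← rS + 8."""
        self.mem[self.ss] = self.l[(self.ss & LRING_MASK) // 8]
        self.ss += 8

    def stack_load(self):
        """Decrease γ by 1: rS ← rS − 8, then l[γ] ← M8[rS]."""
        self.ss -= 8
        self.l[(self.ss & LRING_MASK) // 8] = self.mem.get(self.ss, 0)

    def stack_room(self):
        """Called after β has grown; store one item if β has run into γ."""
        if (self.ss - self.oo - self.ll) & LRING_MASK == 0:
            self.stack_store()


if __name__ == "__main__":
    rs = RegisterStack()
    rs.oo, rs.ll = 8 * 250, 8 * 5   # 250 items pushed down, 5 locals
    rs.l[0] = 111                   # oldest pushed-down item, at l[γ]
    rs.ll += 8                      # a sixth local register appears: β = γ
    rs.stack_room()
    print(rs.mem, rs.ss)            # {0: 111} 8
```

In this example the ring was full, so making room for one more local register forces the oldest pushed-down item out to memory.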

Now we come to the heart of the simulator, its main simulation loop. An interpretive routine generally has a central control section that is called into action between interpreted instructions. In our case, the program transfers to location Fetch when it is ready to simulate a new command. We keep the address @ of the next simulated instruction in the global register inst_ptr. Fetch usually sets loc ← inst_ptr and advances inst_ptr by 4; but if we are simulating a RESUME command that inserts the simulated rX into the instruction stream, Fetch sets loc ← inst_ptr−4 and leaves inst_ptr unchanged. This simulator considers an instruction to be ineligible for execution unless its location loc is in the text segment (that is, loc < #2000 0000 0000 0000).

100  * The main loop
101  loc GREG 0 Where the simulator is at
102  inst_ptr GREG 0 Where the simulator will be next
103  inst GREG 0 The current instruction being simulated
104  resuming GREG 0 Are we resuming an instruction in rX?
105  Fetch PBZ resuming,1F Branch if not resuming.
106       SUBU loc,inst_ptr,4 loc ← inst_ptr−4.
107       LDTU inst,g,8*rX+4 inst ← right half of rX.
108       JMP 2F
109 1H  SET loc,inst_ptr loc ← inst_ptr.
110       SET arg,loc
111       PUSHJ res,MemFind
112       LDTU inst,res,0 inst ← M4[loc].
113       ADDU inst_ptr,loc,4 inst_ptr ← loc + 4.
114 2H  CMPU t,loc,g
115       BNN t,Error Branch if loc ≥ Data_Segment.

The main control routine does the things common to all instructions. It unpacks the current instruction into its various parts and puts the parts into
convenient registers for later use. Most importantly, it sets global register f to
64 bits of “info” corresponding to the current opcode. A master table, which
starts at location Info, contains such information for each of MMIX’s 256 opcodes.
(See Table 1 on page 88.) For example, f is set to an odd value if and only if the
Z field of the current opcode is an “immediate” operand or the opcode is JMP;
similarly f ∧ #40 is nonzero if and only if the instruction has a relative address.
Later steps of the simulator will be able to decide quickly what needs to be done
with respect to the current instruction because most of the relevant information
appears in register f.

\begin{verbatim}
116 op  GREG  0  Opcode of the current instruction
117 xx   GREG  0  X field of the current instruction
118 yy   GREG  0  Y field of the current instruction
119 zz   GREG  0  Z field of the current instruction
120 yz   GREG  0  YZ field of the current instruction
121 f    GREG  0  Packed information about the current opcode
122 xxx  GREG  0  X field times 8
123 x    GREG  0  X operand and/or result
124 y    GREG  0  Y operand
125 z    GREG  0  Z operand
126 xptr GREG  0  Location where x should be stored
127 exc  GREG  0  Arithmetic exceptions
128 Z_is_immed_bit IS #1  Flag bits possibly set in f
129 Z_is_source_bit IS #2
130 Y_is_immed_bit IS #4
131 Y_is_source_bit IS #8
132 X_is_source_bit IS #10
133 X_is_dest_bit IS #20
134 Rel_addr_bit IS #40
135 Mem_bit IS #80
136 Info IS #1000
137 Done IS Info+8*256
138 info GREG Info  (Base address for the master info table)
139 c255 GREG 8*255  (A handy constant)
140 c256 GREG 8*256  (Another handy constant)
141 MOR op,inst,#8  op ← inst ≫ 24.
142 MOR xx,inst,#4  xx ← (inst ≫ 16) ∧ #ff.
143 MOR yy,inst,#2  yy ← (inst ≫ 8) ∧ #ff.
144 MOR zz,inst,#1  zz ← inst ∧ #ff.
145 0H GREG -#10000
146 ANDN yz,inst,0B  yz ← inst ∧ #ffff.
147 SLU xxx,xx,3
148 SLU t,op,3
149 LDOU f,info,t  f ← Info[op].
150 SET x,0  x ← 0 (default value).
151 SET y,0  y ← 0 (default value).
152 SET z,0  z ← 0 (default value).
153 SET exc,0  exc ← 0 (default value).  
\end{verbatim}
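
The unpacking performed by lines 141–147 amounts to a few shifts and masks. Here is a Python sketch of the same field extraction; the function name `unpack` is ours, and ordinary shifts are used instead of the MOR instruction.

```python
def unpack(inst):
    """Split a 32-bit MMIX instruction into (op, xx, yy, zz, yz, xxx)."""
    op = inst >> 24                 # opcode byte
    xx = (inst >> 16) & 0xFF        # X field
    yy = (inst >> 8) & 0xFF         # Y field
    zz = inst & 0xFF                # Z field
    yz = inst & 0xFFFF              # Y and Z fields combined
    xxx = xx << 3                   # X field times 8
    return op, xx, yy, zz, yz, xxx


if __name__ == "__main__":
    # ADD $1,$2,$3 is the tetrabyte #20010203.
    print([hex(v) for v in unpack(0x20010203)])
    # ['0x20', '0x1', '0x2', '0x3', '0x203', '0x8']
```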
The first thing we do, after having unpacked the instruction into its various fields, is convert a relative address to an absolute address if necessary.

```
154  AND  t,f,Rel_addr_bit
155  PBZ  t,1F       Branch if not a relative address.
156  PBEV  f,2F      Branch if op isn't JMP or JMPB.
157  9H   GREG -#1000000
158  ANDN yz,inst,9B  yz ← inst ∧ #ffffff (namely XYZ).
159  ADDU t,yz,9B    t ← XYZ − 2^{24}.
160  JMP  3F
161  2H  ADDU t,yz,0B  t ← YZ − 2^{16}.
162  3H  CSOD yz,op,t  Set yz ← t if op is odd ("backward").
163  SL   t,yz,2
164  ADDU yz,loc,t    yz ← loc + yz ≪ 2.
```
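
The address arithmetic of lines 154–164 can be restated as a short Python function. This is a sketch, with the function name `branch_target` chosen by us; it relies on the standard MMIX opcode numbering, in which JMP is #f0, JMPB is #f1, and every backward variant has an odd opcode.

```python
MASK64 = (1 << 64) - 1


def branch_target(loc, inst):
    """Absolute target of the relative-address instruction inst at loc."""
    op = inst >> 24
    if op in (0xF0, 0xF1):                  # JMP, JMPB: 24-bit field XYZ
        field, span = inst & 0xFFFFFF, 1 << 24
    else:                                   # branches etc.: 16-bit field YZ
        field, span = inst & 0xFFFF, 1 << 16
    if op & 1:                              # odd opcode means "backward"
        field -= span
    return (loc + (field << 2)) & MASK64    # offsets count tetrabytes


if __name__ == "__main__":
    print(hex(branch_target(0x100, 0x42010003)))  # BZ forward 3:   0x10c
    print(hex(branch_target(0x100, 0x4301FFFF)))  # BZB back 1:     0xfc
    print(hex(branch_target(0x100, 0xF0000010)))  # JMP forward 16: 0x140
    print(hex(branch_target(0x100, 0xF1FFFFFE)))  # JMPB back 2:    0xf8
```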

The next task is critical for most instructions: We install the operands specified by the Y and Z fields into global registers y and z. Sometimes we also install a third operand into global register x, specified by the X field or coming from a special register like the simulated rD or rM.

```
165  1H   PBNN resuming,Install_X  Branch unless resuming < 0.
...                                (See exercise 14.)
174  Install_X AND t,f,X_is_source_bit
175  PBZ  t,1F       Branch unless $X is a source.
176  SET  arg,xxx
177  PUSHJ res,GetReg
178  SET  x,res      x ← $X.
179  1H  SRU  t,f,5
180  AND  t,t,#f8    t ← special register number, times 8.
181  PBZ  t,Install_Z
182  LDOU x,g,t      If t ≠ 0, set x ← g[t].
183  Install_Z AND t,f,Z_is_source_bit
184  PBZ  t,1F       Branch unless $Z is a source.
185  SLU  arg,zz,3
186  PUSHJ res,GetReg
187  SET  z,res      z ← $Z.
188  JMP  Install_Y
189  1H  CSOD z,f,zz  If Z is immediate, z ← Z.
190  AND  t,op,#f0
191  CMPU t,t,#e0
192  PBNZ t,Install_Y  Branch unless #e0 ≤ op < #f0.
193  AND  t,op,#3
194  NEG  t,3,t
195  SLU  t,t,4
196  SLU  z,yz,t      z ← yz ≪ (48, 32, 16, or 0).
197  SET  y,x       y ← x.
198  Install_Y AND t,f,Y_is_immed_bit
199  PBZ  t,1F       Branch unless Y is immediate.
200  SET  y,yy
201  SLU  t,yy,40
202  ADDU f,f,t      Insert Y into left half of f.
203 1H       AND t,f,Y_is_source_bit
204         BZ t,1F  Branch unless $Y is a source.
205         SLU arg,yy,3
206         PUSHJ res,GetReg
207         SET y,res  y ← $Y.
```
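
Lines 190–196 handle the sixteen opcodes from #e0 through #ef, whose operand is a 16-bit "wyde" placed in one of four positions. The shift computation can be sketched in Python as follows; the function name `wyde_operand` is ours.

```python
MASK64 = (1 << 64) - 1


def wyde_operand(op, yz):
    """Return z for an opcode in the range #e0..#ef (SETH, ..., ANDNL).

    The low two bits of op select the wyde position:
    0 = high, 1 = medium high, 2 = medium low, 3 = low."""
    if not 0xE0 <= op < 0xF0:
        raise ValueError("not a wyde-immediate opcode")
    shift = (3 - (op & 3)) << 4     # 48, 32, 16, or 0
    return (yz << shift) & MASK64


if __name__ == "__main__":
    print(hex(wyde_operand(0xE0, 0x1234)))  # SETH:  0x1234000000000000
    print(hex(wyde_operand(0xE3, 0xABCD)))  # SETL:  0xabcd
    print(hex(wyde_operand(0xE6, 0x0001)))  # INCML: 0x10000
```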

When the X field specifies a destination register, we set xptr to the memory address where we will eventually store the simulated result; this address will be either in the Global array or the Local ring. The simulated register stack grows at this point if the destination register must be changed from marginal to local.

208 1H       AND t,f,X_is_dest_bit
209         BZ t,1F  Branch unless $X is a destination.
210     XDest CMPU t,xxx,gg
211         BN t,3F  Branch if $X is not global.
212         LDA xptr,g,xxx  xptr ← address of g[X].
213         JMP 1F
214 2H       ADDU t,oo,ll
215         AND t,t,lring_mask
216         STCO 0,l,t  l[(α + L) mod ρ] ← 0.
217         INCL ll,8  rL ← rL + 1.  ($L becomes local.)
218         PUSHJ res,StackRoom  Make sure β ≠ γ.
219     3H CMPU t,xxx,ll
220         BNN t,2B  Branch if $X is not local.
221         ADD t,xxx,oo
222         AND t,t,lring_mask
223         LDA xptr,l,t  xptr ← address of l[(α + X) mod ρ].
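
The loop formed by lines 214–220, which turns marginal registers into local ones until $X itself is local, can be modeled in Python. The sketch below uses our own function name `make_local`, represents the ring as a Python list, and leaves out the call of StackRoom that the simulator makes after each increase of rL.

```python
LRING_SIZE = 256
LRING_MASK = 8 * LRING_SIZE - 1


def make_local(x, oo, ll, l):
    """Grow rL until $x is local; return (ring index of $x, new ll).

    oo is the simulated rO and ll is 8 times the simulated rL.
    Each register that becomes local is first cleared to zero."""
    while 8 * x >= ll:
        l[((oo + ll) & LRING_MASK) // 8] = 0    # l[(α + L) mod ρ] ← 0
        ll += 8                                 # rL ← rL + 1
        # (the simulator calls StackRoom here; omitted in this sketch)
    return ((oo + 8 * x) & LRING_MASK) // 8, ll


if __name__ == "__main__":
    ring = [9] * LRING_SIZE
    slot, ll = make_local(4, 8 * 254, 8 * 2, ring)
    print(slot, ll // 8, ring[:4])   # 2 5 [0, 0, 0, 9]
```

In the example, rL grows from 2 to 5, and the three newly local registers wrap around the end of the ring into slots 0, 1, and 2.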

Finally we reach the climax of the main control cycle: We simulate the current instruction by essentially doing a 256-way branch, based on the current opcode. The left half of register f is, in fact, an MMIX instruction that we perform at this point, by inserting it into the instruction stream via a RESUME command. For example, if we are simulating an ADD command, we put “ADD x,y,z” into the right half of rX and clear the exception bits of rA; the RESUME command will then cause the sum of registers y and z to be placed in register x, and rA will record whether overflow occurred. After the RESUME, control will pass to location Done, unless the inserted instruction was a branch or jump.

224 1H       AND t,f,Mem_bit
225         PBZ t,1F  Branch unless inst accesses memory.
226         ADDU arg,y,z
227         CMPU t,op,#A0  t ← [op is a load instruction].
228         BN t,2F
229         CMPU t,arg,g
230         BN t,Error  Error if storing into the text segment.
231     2H PUSHJ res,MemFind  res ← address of M[y + z].
232     1H SRU t,f,32
233         PUT rX,t  rX ← left half of f.
234         PUT rM,x  rM ← x  (prepare for MUX).
235         PUT rE,x  rE ← x  (prepare for FCMPE, FUNE, FEQLE).

Some instructions can’t be simulated by simply “performing themselves” like an ADD command and jumping to Done. For example, a MULU command must insert the high half of its computed product into the simulated rH. A branch command must change inst_ptr if the branch is taken. A PUSHJ command must push the simulated register stack, and a POP command must pop it. SAVE, UNSAVE, RESUME, TRAP, etc., all need special care; therefore the next part of the simulator deals with all cases that don’t fit the nice “x equals y op z” pattern.

Let’s start with multiplication and division, since they are easy:

\begin{verbatim}
243  Mulu MULU x,y,z  Multiply y by z, unsigned.
244  GET t,rH  Set t ← upper half of the product.
245  STOU t,g,8*rH  g[rH] ← upper half product.
246  JMP XDone  Finish by storing x.
247  Div DIV x,y,z  (For division, see exercise 6.)
\end{verbatim}

If the simulated instruction was a branch command, say “BZ $X,RA”, the main control routine will have converted the relative address RA to an absolute address in register yz (line 164), and it will also have placed the contents of the simulated $X into register x (line 178). The RESUME command will then execute the instruction “BZ x,BTaken” (line 242); and control will pass to BTaken instead of Done if the simulated branch is taken. BTaken adds 2ν to the simulated running time, changes inst_ptr, and jumps to Update.

\begin{verbatim}
254  BTaken ADDU cc,cc,4  Increase rC by 4ν.
255  PBTaken SUBU cc,cc,2  Decrease rC by 2ν.
256  SET inst_ptr,yz  inst_ptr ← branch address.
257  JMP Update  Finish the command.
258  Go SET x,inst_ptr  GO instruction: Set x ← loc + 4.
259  ADDU inst_ptr,y,z  inst_ptr ← (y + z) mod 2^{64}.
260  JMP XDone  Finish by storing x.
\end{verbatim}

(Line 257 could have jumped to Done, but that would be slower; a shortcut to Update is justified because a branch command doesn’t store x and cannot cause an arithmetic exception. See lines 500–541 below.)

A PUSHJ or PUSHGO command pushes the simulated register stack down by increasing the α pointer of (2); this means increasing the simulated rO, namely register oo. If the command is “PUSHJ $X,RA” and if $X is local, we push X + 1 octabytes down by first setting $X ← X and then increasing oo by 8(X + 1). (The value we have put in $X will be used later by POP to determine how to restore oo to its former value. Simulated register $X will then be set to the
result of the subroutine, as explained in Section 1.4.1'.) If $X is global, we push rL + 1 octabytes down in a similar way.

```
261 PushGo ADDU yz,y,z yz ← (y + z) mod 2^{64}.
262 PushJ SET inst_ptr,yz inst_ptr ← yz.
263 CMPU t,xxx,gg
264 PBN t,1F Branch if $X is local.
265 SET xxx,ll Pretend that X = rL.
266 SRU xx,xxx,3
267 INCL ll,8 Increase rL by 1.
268 PUSHJ 0,StackRoom Make sure β ≠ γ in (2).
269 1H ADDU t,xxx,oo
270 AND t,t,lring_mask
271 STOU xx,l,t l[(α + X) mod ρ] ← X.
272 ADDU t,loc,4
273 STOU t,g,8*rJ g[rJ] ← loc + 4.
274 INCL xxx,8
275 SUBU ll,ll,xxx Decrease rL by X + 1.
276 ADDU oo,oo,xxx Increase rO by 8(X + 1).
277 JMP Update Finish the command.
```
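
The effect of a push on rO and rL can be followed in a short Python model of lines 263–276. The function name `push` is ours; the sketch assumes that $X is either local or global, and it omits the StackRoom call together with the updates of rJ and inst_ptr.

```python
LRING_SIZE = 256
LRING_MASK = 8 * LRING_SIZE - 1


def push(x, oo, ll, gg, l):
    """Simulate the register-stack part of PUSHJ $x; return (new oo, new ll).

    oo is the simulated rO; ll and gg are 8*rL and 8*rG; l is the ring."""
    xxx = 8 * x
    if xxx >= gg:                   # $x is global: pretend that X = rL
        xxx = ll
        x = xxx // 8
        ll += 8                     # (StackRoom would be called here)
    l[((oo + xxx) & LRING_MASK) // 8] = x   # l[(α + X) mod ρ] ← X
    xxx += 8
    return oo + xxx, ll - xxx       # rO grows and rL shrinks by X + 1


if __name__ == "__main__":
    ring = [0] * LRING_SIZE
    print(push(2, 0, 8 * 5, 8 * 32, ring), ring[2])     # (24, 16) 2
    ring = [0] * LRING_SIZE
    print(push(255, 0, 8 * 5, 8 * 32, ring), ring[5])   # (48, 0) 5
```

In the first call, registers $0, $1, and $2 are pushed down and the old $3 and $4 become the new $0 and $1, so rL drops from 5 to 2. In the second call all five locals are pushed, together with the slot that records the count.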

Special routines are needed also to simulate POP, SAVE, UNSAVE, and several other opcodes including RESUME. Those routines deal with interesting details about MMIX, and we will consider them in the exercises; but we’ll skip them for now, since they do not involve any techniques related to interpretive routines that we haven’t seen already.

We might as well present the code for SYNC and TRIP, however, since those routines are so simple. (Indeed, there’s nothing to do for “SYNC XYZ” except to check that XYZ $\leq 3$, since we aren’t simulating cache memory.) Furthermore, we will take a look at the code for TRAP, which is interesting because it illustrates the important technique of a jump table for multiway switching:

```
278 Sync BNZ xx,Error Branch if X ≠ 0.
279 CMPU t,yz,4
280 BNN t,Error Branch if YZ ≥ 4.
281 JMP Update Finish the command.
282 Trip SET xx,0 Initiate a trip to location 0.
283 JMP TakeTrip (See exercise 13.)
284 Trap STOU inst_ptr,g,8*rWW g[rWW] ← inst_ptr.
285 0H GREG #8000000000000000
286 ADDU t,inst,0B
287 STOU t,g,8*rXX g[rXX] ← inst + 2^{63}.
288 STOU y,g,8*rYY g[rYY] ← y.
289 STOU z,g,8*rZZ g[rZZ] ← z.
290 SRU y,inst,6
291 CMPU t,y,4*11
292 BNN t,Error Branch if X ≠ 0 or Y > Ftell.
293 LDOU t,g,c255 t ← g[255].
294    0H GREG @+4
295    GO y,0B,y       Jump to @ + 4 + 4Y.
296    JMP SimHalt     Y = Halt: Jump to SimHalt.
297    JMP SimFopen     Y = Fopen: Jump to SimFopen.
298    JMP SimFclose    Y = Fclose: Jump to SimFclose.
299    JMP SimFread     Y = Fread: Jump to SimFread.
300    JMP SimFgets    Y = Fgets: Jump to SimFgets.
301    JMP SimFgetws   Y = Fgetws: Jump to SimFgetws.
302    JMP SimFwrite   Y = Fwrite: Jump to SimFwrite.
303    JMP SimFputs    Y = Fputs: Jump to SimFputs.
304    JMP SimFputws   Y = Fputws: Jump to SimFputws.
305    JMP SimFseek    Y = Fseek: Jump to SimFseek.
306    JMP SimFtell    Y = Ftell: Jump to SimFtell.
307    TrapDone STO t,g,8*rBB Set g[rBB] ← t.
308    STO t,g,c255    A trap ends with g[255] ← g[rBB].
309    JMP Update      Finish the command.
```
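
The jump-table idea of lines 290–306 carries over directly to other languages. In the Python sketch below, a list of labels plays the role of the eleven JMP instructions; the function name `trap_dispatch` is ours, and the handlers are represented only by their names.

```python
# One entry per value of Y, in the order of lines 296-306.
HANDLERS = ["SimHalt", "SimFopen", "SimFclose", "SimFread", "SimFgets",
            "SimFgetws", "SimFwrite", "SimFputs", "SimFputws",
            "SimFseek", "SimFtell"]


def trap_dispatch(inst):
    """Return the label to which a simulated TRAP instruction transfers."""
    y = inst >> 6                   # equals 4Y + (Z >> 6) when X = 0
    if y >= 4 * len(HANDLERS):      # X ≠ 0 or Y > Ftell
        return "Error"
    return HANDLERS[y >> 2]         # a single indexed jump, no comparisons


if __name__ == "__main__":
    print(trap_dispatch(0x00000701))   # TRAP 0,Fputs,StdOut -> SimFputs
    print(trap_dispatch(0x00000000))   # TRAP 0,Halt,0       -> SimHalt
    print(trap_dispatch(0x00010000))   # X ≠ 0               -> Error
```

One range check followed by one indexed transfer replaces what would otherwise be a chain of up to eleven comparisons.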

(See exercises 15–17 for SimFopen, SimFclose, SimFread, etc.)

Now let’s look at the master Info table (Table 1), which allows the simulator to deal rather painlessly with 256 different opcodes. Each table entry is an octabyte consisting of (i) a four-byte MMIX instruction, which will be invoked by the RESUME instruction on line 242; (ii) two bytes that define the simulated running time, one byte for μ and one byte for ν; (iii) a byte that names a special register, if such a register ought to be loaded into x on line 182; and (iv) a byte that is the sum of eight 1-bit flags, expressing special properties of the opcode. For example, the info for opcode FIX is

    FIX x,0,z;  BYTE 0,4,0,#26;

it means that (i) the instruction FIX x,0,z should be performed, to round a floating point number to a fixed point integer; (ii) the simulated running time should be increased by \(0\mu + 4\nu\); (iii) no special register is needed as an input operand; and (iv) the flag byte

    #26 = X_is_dest_bit + Y_is_immed_bit + Z_is_source_bit

determines the treatment of registers x, y, and z. (The Y_is_immed_bit actually causes the Y field of the simulated instruction to be inserted into the Y field of “FIX x,0,z”; see line 202.)
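
The layout of an Info entry can be made concrete with a small Python decoder. This is a sketch with names of our own choosing (`decode_info`, `FLAG_NAMES`); the instruction word used in the example is a placeholder with opcode #05 (FIX) and zero register fields, because the actual global register numbers assigned to x and z are not stated here.

```python
FLAG_NAMES = {
    0x01: "Z_is_immed_bit", 0x02: "Z_is_source_bit",
    0x04: "Y_is_immed_bit", 0x08: "Y_is_source_bit",
    0x10: "X_is_source_bit", 0x20: "X_is_dest_bit",
    0x40: "Rel_addr_bit", 0x80: "Mem_bit",
}


def decode_info(f):
    """Split a 64-bit Info entry into its four parts."""
    return {
        "inst": f >> 32,                    # instruction for RESUME
        "mu": (f >> 24) & 0xFF,             # running time: μ count
        "nu": (f >> 16) & 0xFF,             # running time: ν count
        "special": (f >> 8) & 0xFF,         # special register number, or 0
        "flags": [name for bit, name in FLAG_NAMES.items() if f & bit],
    }


if __name__ == "__main__":
    placeholder_fix = 0x05000000            # hypothetical encoding of FIX x,0,z
    f = (placeholder_fix << 32) | (0 << 24) | (4 << 16) | (0 << 8) | 0x26
    print(decode_info(f))
```

For this entry the decoder reports 0μ + 4ν, no special register, and the three flags Z_is_source_bit, Y_is_immed_bit, and X_is_dest_bit. Note also that `(f >> 5) & 0xF8` yields eight times the special register number, which is the computation of lines 179–180.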

One interesting aspect of the Info table is that the RESUME command of line 242 executes the instruction as if it were in location Done-4, since rW = Done. Therefore, if the instruction is a JMP, the address must be relative to Done-4; but MMIXAL always assembles JMP commands with an address relative to the assembled location @. We trick the assembler into doing the right thing by writing, for example, “JMP Trap+@-O”, where O is defined to equal Done-4. Then the RESUME command will indeed jump to location Trap as desired.
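
The arithmetic behind this trick is easy to verify. In the following Python sketch, `at` denotes the location @ where a JMP is assembled and `O` denotes Done-4; the address chosen for Trap is hypothetical, the function names are ours, and offsets are kept in bytes rather than tetrabytes for simplicity.

```python
INFO = 0x1000               # line 136
DONE = INFO + 8 * 256       # line 137
O = DONE - 4                # where RESUME pretends the instruction lives


def assembled_offset(operand, at):
    """Relative offset that the assembler stores for 'JMP operand' at location at."""
    return operand - at


def executed_target(offset, exec_loc):
    """Where a JMP with the given offset goes when executed at exec_loc."""
    return exec_loc + offset


if __name__ == "__main__":
    TRAP = 0x0A40           # hypothetical address of the label Trap
    at = INFO + 8 * 0x00    # the Info entry for opcode #00
    naive = assembled_offset(TRAP, at)
    tricky = assembled_offset(TRAP + at - O, at)
    print(hex(executed_target(naive, O)))    # 0x123c, not Trap
    print(hex(executed_target(tricky, O)))   # 0xa40, which is Trap
```

The stored offset is Trap − O regardless of where the table entry sits, so the jump lands on Trap when it is executed as if from location O.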

Table 1
MASTER INFORMATION TABLE FOR SIMULATOR CONTROL

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Time, register, and flag bytes (opcode)</th>
</tr>
</thead>
<tbody>
<tr>
<td>LDB x,res,0</td>
<td>BYTE 1,1,0,#aa (LDB)</td>
</tr>
<tr>
<td>LDB x,res,0</td>
<td>BYTE 1,1,0,#aa (LDB)</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>JMP Cswap+@-O</td>
<td>BYTE 2,2,0,#ba (CSWAP)</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>SUB x,y,z</td>
<td>BYTE 0,4,0,#2a (SUB)</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>FPMUL x,y,z</td>
<td>BYTE 0,4,0,#2a (FPMUL)</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
<tr>
<td>CMPI x,y,z</td>
<td>BYTE 0,1,0,#26 (CMPI)</td>
</tr>
<tr>
<td>...</td>
<td></td>
</tr>
</tbody>
</table>

Entries not shown here explicitly follow a pattern that is easily deduced from the examples shown. (See, for example, exercise 1.)

After we have executed the special instruction inserted by RESUME, we normally get to location Done. From here on everything is anticlimactic; but we can take satisfaction in the fact that an instruction has been simulated successfully and the current cycle is nearly finished. Only a few details still need to be wrapped up: We must store the result x in the appropriate place, if the X_is_dest_bit flag is present, and we must check if an arithmetic exception has triggered a trip interrupt:

500 Done   AND   t,f,X_is_dest_bit
501        BZ    t,1F             Branch unless $X is a destination.
502 XDest  STOU  x,x_ptr,0        Store x in simulated $X.
503 1H     GET   t,rA
504        AND   t,t,#ff          t ← new arithmetic exceptions.
505        OR    exc,exc,t        exc ← exc ∨ t.
506        AND   t,exc,U_BIT+X_BIT
507        CMPU  t,t,U_BIT
508        PBNZ  t,1F             Branch unless underflow is exact.
509 0H     GREG  U_BIT<<8
510        AND   t,aa,0B
511        BNZ   t,1F             Branch if underflow is enabled.
512        ANDNL exc,U_BIT        Ignore U if exact and not enabled.
513 1H     PBZ   exc,Update
514        SRU   t,aa,8
515        AND   t,t,exc
516        PBZ   t,4F             Branch unless trip interrupt needed.
517        ...                    (See exercise 13.)
539 4H     OR    aa,aa,exc        Record new exceptions in rA. ■

Line number 500 is used here for convenience, although several hundred instructions and the entire Info table actually intervene between line 309 and this part of the program. Incidentally, the label Done on line 500 does not conflict with the label Done on line 137, because both of them define the same equivalent value for this symbol.

After line 505, register exc contains the bit codes for all arithmetic exceptions triggered by the instruction just simulated. At this point we must deal with a curious asymmetry in the rules for IEEE standard floating point arithmetic: An underflow exception (U) is suppressed unless the underflow trip has been enabled in rA or unless an inexact exception (X) has also occurred. (We had to enable the underflow trip in line 238 for precisely this reason; the simulator ends with the commands

\[
\text{LOC U_Handler; ORL exc, U_BIT; JMP Done}
\]

so that exc will properly record underflow exceptions in cases where a floating point computation was exact but produced a denormal result.)
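The underflow rule can be modeled in a few lines. The following Python sketch mirrors the logic that begins at line 503; it assumes the usual layout of rA, in which the event bits DVWIOUZX occupy the low byte (so that U_BIT = 4 and X_BIT = 1) and the enable bits sit eight positions to the left. It reports whether a trip is needed instead of simulating one.

```python
U_BIT = 0x04   # assumed position of the underflow event bit in rA
X_BIT = 0x01   # assumed position of the inexact event bit in rA

def finish_exceptions(exc: int, aa: int, new_events: int):
    """Return (exc, aa, trip_needed) after merging the newly raised events."""
    exc |= new_events & 0xFF                  # exc <- exc OR new exceptions
    if exc & (U_BIT | X_BIT) == U_BIT:        # underflow occurred and was exact
        if not aa & (U_BIT << 8):             # underflow trip is not enabled
            exc &= ~U_BIT                     # ignore U
    if exc == 0:
        return exc, aa, False                 # nothing to record
    if (aa >> 8) & exc:
        return exc, aa, True                  # an enabled event: trip needed
    return exc, aa | exc, False               # record new exceptions in rA
```

For example, an exact underflow with no trip enabled leaves both `exc` and `aa` unchanged, while an inexact underflow is recorded in `aa`.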

Finally—Hurray!—we are able to close the cycle of operations that began long ago at location Fetch. We update the runtime clocks, take a deep breath, and return to Fetch again:

540 0H GREG #0000000800000004
541 Update MOR t,f,0B 2^32 mems + oops
542 ADDU cc,cc,t Increase the simulated clock, rC.
543 ADDU uu,uu,1 Increase the usage counter, rU.
544 SUBU ii,ii,1 Decrease the interval counter, rI.
545 AllDone PBZ resuming,Fetch Go to Fetch if resuming = 0.
546 CMPU t,op,#F9 Otherwise set t ← [op = RESUME].
547 CSNZ resuming,t,0 Clear resuming if not resuming,
548 JMP Fetch and go to Fetch. ■
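The single MOR on line 541 extracts both time bytes of the Info entry at once. A Python model of MOR makes this visible; the model assumes the convention that the bytes of an octabyte, from left to right, are the rows of an 8 × 8 Boolean matrix, and that row i of the result is the OR of those rows k of the first operand for which bit k of row i of the second operand is 1. The sample entries use placeholder instruction bytes.

```python
def mor(y: int, z: int) -> int:
    """Model of MOR $X,$Y,$Z on 64-bit values (rows are bytes, leftmost first)."""
    yb = y.to_bytes(8, "big")
    zb = z.to_bytes(8, "big")
    out = bytearray(8)
    for i in range(8):
        for k in range(8):
            if zb[i] & (0x80 >> k):   # bit k (from the left) of row i of z
                out[i] |= yb[k]       # OR in row k of y
    return int.from_bytes(out, "big")

CLOCK_MASK = 0x0000000800000004       # the constant of line 540

def clock_increment(info_entry: int) -> int:
    """2^32 * mems + oops, the amount added to the simulated clock."""
    return mor(info_entry, CLOCK_MASK)

# Placeholder entries with time bytes (mu, upsilon) = (1, 1) and (0, 4).
load_entry = int.from_bytes(bytes([0x80, 1, 2, 0, 1, 1, 0, 0xAA]), "big")
fix_entry = int.from_bytes(bytes([0x05, 0, 0, 0, 0, 4, 0, 0x26]), "big")
```

Row 3 of the mask selects byte 4 of the entry (the mems) and row 7 selects byte 5 (the oops), so the result has the mems in its upper half and the oops in its lower half.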

Our simulation program is now complete, except that we still must initialize everything properly. We assume that the simulator will be run with a command line that names a binary file. Exercise 20 explains the simple format of that file, which specifies what should be loaded into the simulated memory before simulation begins. Once the program has been loaded, we launch it as follows:

At line 576 below, register loc will contain a location from which a simulated UNSAVE command will get the program off to a good start. (In fact, we simulate an UNSAVE that is being simulated by a simulated RESUME. The code is tricky, perhaps, but it works.)

549 Infile IS 3 (Handle for binary input file)
550 Main LDA Mem:head,Chunk0 Initialize MemFind.
551 ADDU Mem:alloc,Mem:head,Mem:nodesize
552 GET t,rN
553 INCL t,1
554 STOU t,g,8*rN g[rN] ← (our rN) + 1.
555 LDOU t,$1,8 t ← binary file name (argv[1]).
556 STOU t,IOArgs
557 LDA t,IOArgs (See line 010)
558 TRAP 0,Fopen,Infile Open the binary file.
559 BN t,Error

... Now load the file (see exercise 20).
576 STOU loc,g,c255 g[255] ← place to UNSAVE.
577 SUBU arg,loc,8*13 arg ← place where $255 appears.
578 PUSHJ res,MemFind
579 LDOU inst_ptr,res,0 inst_ptr ← Main.
580 SET arg,#90
581 PUSHJ res,MemFind
582 LDTU x,res,0 x ← M4[#90].
583 SET resuming,1 resuming ← 1.
584 CSNZ inst_ptr,x,#90 If x ≠ 0, set inst_ptr ← #90.
585 0H GREG #FB<<24+255
586 STOU 0B,g,8*rX g[rX] ← “UNSAVE $255”.
587 SET gg,c255 G ← 255.
588 JMP Fetch Start the ball rolling.
589 Error NEG t,22 t ← -22 for error exit.
590 Exit TRAP 0,Halt,0 End of simulation.
591 LOC Global+8*rK; OCTA -1
592 LOC Global+8*rT; OCTA #8000000500000000
593 LOC Global+8*rTT; OCTA #8000000600000000
594 LOC Global+8*rV; OCTA #369c200400000000

The simulated program’s Main starting address will be in the simulated register $255 after the simulated UNSAVE. Lines 580–584 of this code implement a feature that wasn’t mentioned in Section 1.3.2: If an instruction is loaded into location #90, the program begins there instead of at Main. (This feature allows a subroutine library to initialize itself before starting a user program at Main.)

Lines 591–594 initialize the simulated rK, rT, rTT, and rV to appropriate constant values. Then the program is finished; it ends with the trip-handler instructions of (5).

Whee! Our simulator has turned out to be pretty long—longer, in fact, than any other program that we will encounter in this book. But in spite of its length, the program above is incomplete in several respects because the author did not want to make it even longer:

a) Several parts of the code have been left as exercises.
b) The program simply branches to Error and quits, when it detects a problem. A decent simulator would distinguish between different types of error, and would have a way to keep going.
c) The program doesn’t gather any statistics, except for the total running time (cc) and the total number of instructions simulated (uu). A more complete program would, for example, remember how often the user guessed correctly with respect to branches versus probable branches; it would also record the number of times the StackLoad and StackStore subroutines need to access simulated memory. It might also analyze its own algorithms, studying for example the efficiency of the self-organizing search technique used by MemFind.
d) The program has no diagnostic facilities. A useful simulator would, for example, allow interactive debugging, and would output selected snapshots of the simulated program’s execution; such features would not be difficult to add. The ability to monitor a program easily is, in fact, one of the main reasons for the importance of interpretive routines in general.

EXERCISES

1. [20] Table 1 shows the Info entries only for selected opcodes. What entries are appropriate for (a) opcode #3F (SRUI)? (b) opcode #55 (PBPB)? (c) opcode #D9 (MUXI)? (d) opcode #E6 (INCML)?

2. [26] How much time does it take the simulator to simulate the instructions (a) ADDU $255,$Y,$Z; (b) STHT $X,$Y,$Z; (c) PBNZ $X,@-4?

3. [23] Explain why $\gamma \neq \alpha$ when StackRoom calls StackStore on line 097.

4. [20] Criticize the fact that MemFind never checks to see if alloc has gotten too large. Is this a serious blunder?

5. [20] If the MemFind subroutine branches to Error, it does not pop the register stack. How many items might be on the register stack at such a time?


7. [21] Complete the simulation of CSWAP instructions, by writing appropriate code.
8. [28] Complete the simulation of GET instructions, by writing appropriate code.
9. [29] Complete the simulation of PUT instructions, by writing appropriate code.
10. [24] Complete the simulation of POP instructions, by writing appropriate code.
    *Note:* If the normal action of POP as described in Section 1.4.1 would leave \( rL > rG \),
    MMIX will pop entries off the top of the register stack so that \( rL = rG \). For example, if
    the user pushes 250 registers down with PUSHJ and then says "PUT rG, 32; POP", only
    32 of the pushed-down registers will survive.
11. [25] Complete the simulation of SAVE instructions, by writing appropriate code.
    *Note:* SAVE pushes all the local registers down and stores the entire register stack in
    memory, followed by $G, $(G+1), ..., $255, followed by rB, rD, rE, rH, rJ, rM, rR,
    rP, rW, rX, rY, and rZ (in that order), followed by the octabyte \( 2^{56}\,rG + rA \).
12. [26] Complete the simulation of UNSAVE instructions, by writing appropriate code.
    *Note:* The very first simulated UNSAVE is part of the initial loading process (see lines
    583–588), so it should not update the simulated clocks.
13. [27] Complete the simulation of trip interrupts, by filling in the missing code of
    lines 517–538.
14. [28] Complete the simulation of RESUME instructions, by writing appropriate code.
    *Note:* When rX is nonnegative, its most significant byte is called the "ropcode";
    ropcodes 0, 1, 2 are available for user programs. Line 242 of the simulator uses
    ropcode 0, which simply inserts the lower half of rX into the instruction stream.
    Ropcode 1 is similar, but the instruction in rX is performed with \( y \leftarrow rY \) and \( z \leftarrow rZ \)
    in place of the normal operands; this variant is allowed only when the first hexadecimal
    digit of the inserted opcode is #0, #1, #2, #3, #6, #7, #C, #D, or #E. Ropcode 2
    sets \( \$X \leftarrow rZ \) and \( exc \leftarrow Q \), where X is the third byte from the right of rX and Q is
    the third byte from the left; this makes it possible to set the value of a register and
    simultaneously raise any subset of the arithmetic exceptions DVWIOUZX. Ropcodes
    1 and 2 can be used only when $X is not marginal. Your solution to this exercise
    should cause RESUME to set resuming \( \leftarrow 0 \) if the simulated rX is negative, otherwise
    resuming \( \leftarrow \{1, -1, -2\} \) for ropcodes \( \{0, 1, 2\} \). You should also supply the code that is
    missing from lines 166–173.

15. [25] Write the routine SimFputs, which simulates the operation of outputting a
    string to the file corresponding to a given handle.
16. [25] Write the routine SimFopen, which opens a file corresponding to a given
    handle. (The simulator can use the same handle number as the user program.)
17. [25] Continuing the previous exercises, write the routine SimFread, which reads
    a given number of bytes from a file corresponding to a given handle.
18. [27] Would this simulator be of any use if lring_size were less than 256, for
    example if lring_size = 32?
19. [14] Study all the uses of the StackRoom subroutine (namely in line 218, line 268,
    and in the answer to exercise 11). Can you suggest a better way to organize the code?
    (See step 3 in the discussion at the end of Section 1.4.1.)
20. [20] The binary files input by the simulator consist of one or more groups of
    octabytes each having the simple form
    \[
    \lambda, x_0, x_1, \ldots, x_{l-1}, 0
    \]
for some $l \geq 0$, where $x_0, x_1, \ldots, x_{l-1}$ are nonzero; the meaning is

$$M_8[\lambda + 8k] \leftarrow x_k, \quad \text{for } 0 \leq k < l.$$

The file ends after the last group. Complete the simulator by writing MMIX code to load such input (lines 560–575 of the program). The final value of register loc should be the location of the last octabyte loaded, namely $\lambda + 8(l - 1)$.
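The file format of exercise 20 can be illustrated outside of MMIX. The following Python sketch reads such a file image into a dictionary that stands for simulated memory; it assumes big-endian octabytes, and its error handling is an invention for the sake of the example, not something the exercise specifies.

```python
import struct

MASK = (1 << 64) - 1

def load_image(data: bytes):
    """Load groups lambda, x_0, ..., x_{l-1}, 0; return (memory, loc)."""
    if len(data) % 8:
        raise ValueError("file length must be a multiple of 8")
    octas = [v for (v,) in struct.iter_unpack(">Q", data)]
    memory, loc, i = {}, None, 0
    while i < len(octas):
        lam = octas[i]                    # starting address of this group
        i += 1
        k = 0
        while True:
            if i >= len(octas):
                raise ValueError("group lacks its terminating zero")
            x = octas[i]
            i += 1
            if x == 0:                    # a zero octabyte ends the group
                break
            loc = (lam + 8 * k) & MASK    # M8[lambda + 8k] <- x_k
            memory[loc] = x
            k += 1
    return memory, loc

# Two groups: octabytes 7 and 9 at #100, then octabyte 5 at #2000000000000000.
image = struct.pack(">7Q", 0x100, 7, 9, 0, 0x2000000000000000, 5, 0)
memory, loc = load_image(image)
```

After the call, `loc` holds the address of the last octabyte stored, which is the value the exercise asks for.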

21. [20] Is the simulation program of this section able to simulate itself? If so, is it able to simulate itself simulating itself? And if so, is it ⋯?

22. [40] Implement an efficient jump trace routine for MMIX. This is a program that records all transfers of control in the execution of another given program by recording a sequence of pairs $(x_1, y_1), (x_2, y_2), \ldots$, meaning that the given program jumped from location $x_1$ to $y_1$, then (after performing the instructions in locations $y_1, y_1 + 4, \ldots, x_2$) it jumped from $x_2$ to $y_2$, etc. [From this information it is possible for a subsequent routine to reconstruct the flow of the program and to deduce how frequently each instruction was performed.]

A trace routine differs from a simulator because it allows the traced program to occupy its normal memory locations. A jump trace modifies the instruction stream in memory, but does so only to the extent necessary to retain control. Otherwise it allows the machine to execute arithmetic and memory instructions at full speed. Some restrictions are necessary; for example, the program being traced shouldn’t modify itself. But you should try to keep such restrictions to a minimum.
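The "subsequent routine" mentioned in exercise 22 is simple to sketch. The Python function below turns a jump trace into instruction frequencies. It assumes that each instruction occupies one tetrabyte and that the first and last locations executed are known; both `start` and `stop` are parameters introduced here for illustration.

```python
from collections import Counter

def instruction_counts(start, jumps, stop):
    """Count how often each location was executed.

    start -- location of the first instruction performed (assumed known)
    jumps -- the recorded pairs (x, y): a jump from x to y
    stop  -- location of the last instruction performed (assumed known)
    """
    counts = Counter()
    here = start
    for x, y in jumps:
        for loc in range(here, x + 4, 4):   # straight-line run here, here+4, ..., x
            counts[loc] += 1
        here = y
    for loc in range(here, stop + 4, 4):    # final run up to the last instruction
        counts[loc] += 1
    return counts

# A loop at #104..#108 whose branch is taken twice, then falls through to #10c.
counts = instruction_counts(0x100, [(0x108, 0x104), (0x108, 0x104)], 0x10C)
```

Only the jumps need to be recorded; everything between two consecutive pairs is reconstructed from the fact that execution was sequential.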
ANSWERS TO EXERCISES

SECTION 1.3.1

1. #7d9 or #7D9.

2. (a) \{B, D, F, b, d, f\}. (b) \{A, C, E, a, c, e\}. An odd fact of life.

3. (Solution by Gregor N. Purdy.) 2 bits = 1 nyp; 2 nyps = 1 nybble; 2 nybbles = 1 byte. Incidentally, the word “byte” was coined in 1956 by members of IBM’s Stretch computer project; see W. Buchholz, *BYTE* 2, 2 (February 1977), 144.

4. 1000 MB = 1 gigabyte (GB), 1000 GB = 1 terabyte (TB), 1000 TB = 1 petabyte (PB), 1000 PB = 1 exabyte (EB), 1000 EB = 1 zettabyte (ZB), 1000 ZB = 1 yottabyte (YB), according to the 19th Conférence Générale des Poids et Mesures (1991).

   (Some people, however, use $2^{10}$ instead of 1000 in these formulas, claiming for example that a kilobyte is 1024 bytes. To resolve the ambiguity, such units should preferably be called *large kilobytes*, *large megabytes*, etc., and denoted by KKB, MMB, ... to indicate their binary nature.)

5. If $-2^{n-1} \leq x < 2^{n-1}$, then $-2^n < x - s(\alpha) < 2^n$; hence $x \neq s(\alpha)$ implies that $x \not\equiv s(\alpha) \pmod{2^n}$. But $s(\alpha) = u(\alpha) - 2^n[\alpha \text{ begins with } 1] \equiv u(\alpha) \pmod{2^n}$.

6. Using the notation of the previous exercise, we have $u(\bar\alpha) = 2^n - 1 - u(\alpha)$; hence $u(\bar\alpha) + 1 \equiv -u(\alpha) \pmod{2^n}$, and it follows that $s(\bar\alpha) + 1 = -s(\alpha)$. Overflow might occur, however, when adding 1. In that case $\alpha = 10\ldots0$, $s(\alpha) = -2^{n-1}$, and $-s(\alpha)$ is not representable.
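Answers 5 and 6 can be checked exhaustively for small n. The following Python sketch uses bit strings for α; the helper names `u`, `s`, and `complement` follow the notation of the answers, while `check` is introduced here for illustration.

```python
from itertools import product

def u(bits: str) -> int:
    """Unsigned value of a bit string."""
    return int(bits, 2)

def s(bits: str) -> int:
    """Two's complement (signed) value of a bit string."""
    return u(bits) - (1 << len(bits)) * (bits[0] == "1")

def complement(bits: str) -> str:
    """Complement every bit."""
    return "".join("1" if b == "0" else "0" for b in bits)

def check(n: int) -> bool:
    """Verify s(complement(a)) + 1 == -s(a) over all n-bit strings a."""
    strings = ("".join(p) for p in product("01", repeat=n))
    return all(s(complement(a)) + 1 == -s(a) for a in strings)
```

The identity holds for every string when the arithmetic is done on unbounded integers; the one case α = 10...0 is exactly the case where the result does not fit in n bits.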

7. Yes. (See the discussion of shifting.)

8. The radix point now falls between rH and $X. (In general, if the binary radix point is $m$ positions from the end of $Y and $n$ positions from the end of $Z, it is $m + n$ positions from the end of the product.)

9. Yes, except when $X = Y$, or $X = Z$, or overflow occurs.

10. $Y = #8000000000000000, $Z = #ffffffffffffffff is the only example!

11. (a) True, because s($Y) ≡ u($Y) and s($Z) ≡ u($Z) (modulo $2^{64}$) by exercise 5. (b) Clearly true if s($Y) ≥ 0 and s($Z) ≥ 0, because s($Y) = u($Y) and s($Z) = u($Z) in such a case. Also true if $Z = 0 or $Z = 1 or $Z = $Y or $Y = 0. Otherwise false.

12. If X ≠ Y, say ‘ADDU $X,$Y,$Z; CMPU carry,$X,$Y; ZSN carry,carry,1’. But if X = Y = Z, say ‘ZSN carry,$X,1; ADDU $X,$X,$X’.

13. Overflow occurs on signed addition if and only if $Y and $Z have the same sign but their unsigned sum has the opposite sign. Thus

   XOR $0,$Y,$Z; ADDU $X,$Y,$Z; XOR $1,$X,$Y; ANDN $1,$1,$0; ZSN ovfl,$1,1

determines the presence or absence of overflow when X ≠ Y.
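The instruction sequences of answers 12 and 13 can be modeled with 64-bit masks. The Python sketch below follows them step by step (for the case X ≠ Y) so that the results can be compared with arithmetic on unbounded integers; the function names are introduced here for illustration.

```python
MASK = (1 << 64) - 1

def signed(v: int) -> int:
    """Interpret a 64-bit pattern as a two's complement number."""
    return v - (1 << 64) if v >> 63 else v

def add_with_flags(y: int, z: int):
    """Return (sum, carry, overflow) for the 64-bit addition y + z."""
    x = (y + z) & MASK                   # ADDU $X,$Y,$Z
    carry = 1 if x < y else 0            # CMPU carry,$X,$Y; ZSN carry,carry,1
    t0 = y ^ z                           # XOR $0,$Y,$Z
    t1 = (x ^ y) & ~t0 & MASK            # XOR $1,$X,$Y; ANDN $1,$1,$0
    ovfl = 1 if signed(t1) < 0 else 0    # set ovfl to 1 when $1 is negative
    return x, carry, ovfl
```

The sign bit of `t1` is 1 exactly when the operands agree in sign but the sum disagrees with them, which is the condition stated in answer 13.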

14. Interchange X and Y in the previous answer. (Overflow occurs when computing $x = y - z$ if and only if it occurs when computing $y = x + z$.)

15. Let $\dot{y}$ and $\dot{z}$ be the sign bits of $y$ and $z$, so that $s(y) = y - 2^{64}\dot{y}$ and $s(z) = z - 2^{64}\dot{z}$; we want to calculate $s(y)s(z) \bmod 2^{128} = (yz - 2^{64}(\dot{y}z + y\dot{z})) \bmod 2^{128}$. Thus the program

MULU $X,$Y,$Z; GET $0,rH; ZSN $1,$Y,$Z; SUBU $0,$0,$1; ZSN $1,$Z,$Y; SUBU $0,$0,$1 puts the desired octabyte in $0.
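The correction of answer 15 can be verified against exact integer arithmetic. The following Python sketch computes the upper half of a signed 128-bit product from the unsigned product, subtracting one operand from the high half for each negative factor; the function names are introduced here for illustration.

```python
MASK = (1 << 64) - 1

def signed(v: int) -> int:
    """Interpret a 64-bit pattern as a two's complement number."""
    return v - (1 << 64) if v >> 63 else v

def mul_high_signed(y: int, z: int) -> int:
    """Upper 64 bits of the signed product, computed as in answer 15."""
    hi = (y * z) >> 64                # MULU, then GET the high half from rH
    if y >> 63:                       # y negative: subtract z
        hi = (hi - z) & MASK
    if z >> 63:                       # z negative: subtract y
        hi = (hi - y) & MASK
    return hi

def mul_high_reference(y: int, z: int) -> int:
    """The same quantity, from Python's unbounded signed arithmetic."""
    return ((signed(y) * signed(z)) >> 64) & MASK
```

The term $2^{128}\dot{y}\dot{z}$ of the full expansion vanishes modulo $2^{128}$, which is why only two conditional subtractions are needed.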

16. After the instructions in the previous answer, check that the upper half is the sign extension of the lower half, by saying ‘SR $1,$X,63; CMP $1,$0,$1; ZSNZ ovfl,$1,1’.

17. Let $a$ be the stated constant, which is $(2^{65} + 1)/3$. Then $ay/2^{65} = y/3 + y/(3 \cdot 2^{65})$, so $\lfloor ay/2^{65} \rfloor = \lfloor y/3 \rfloor$ for $0 \leq y < 2^{65}$.

18. By a similar argument, $\lfloor ay/2^{66} \rfloor = \lfloor y/5 \rfloor$ for $0 \leq y < 2^{66}$ when $a = (2^{66} + 1)/5$ = #cccccccccccccccd.

19. This statement is widely believed, and it has been implemented by compiler writers who did not check the math. But it is false when $z = 7, 21, 25, 29, 31, 39, 47, 49, 53, 55, 61, 63, 71, 81, 89, \ldots$, and in fact for 189 odd divisors $z$ less than 1000!

    Let $\epsilon = ay/2^{64+e} - y/z = (z - r)y/(2^{64+e}z)$, where $r = 2^{64+e} \bmod z$. Then $0 \leq \epsilon < 2/z$, hence trouble can arise only when $y \equiv -1$ (modulo $z$) and $\epsilon \geq 1/z$. It follows that the formula $\lfloor ay/2^{64+e} \rfloor = \lfloor y/z \rfloor$ holds for all unsigned octabytes $y$, $0 \leq y < 2^{64}$, if and only if it holds for the single value $y = 2^{64} - 1 - (2^{64} \bmod z)$.

    (The formula is, however, always correct in the restricted range $0 \leq y < 2^{63}$. And Michael Yoder observes that high-multiplication by $\lceil 2^{64+e+1}/z \rceil - 2^{64}$, followed by addition of $y$ and right-shift by $e + 1$, does work in general.)
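The single-value criterion of answer 19 is easy to try out. The Python sketch below assumes that the statement under discussion uses the multiplier $a = \lceil 2^{64+e}/z \rceil$ with $2^e < z < 2^{e+1}$; that reading is an assumption, and the function names are introduced here for illustration.

```python
def magic(z: int):
    """Return (a, e) with 2^e < z < 2^(e+1) and a = ceil(2^(64+e) / z)."""
    e = z.bit_length() - 1
    a = -(-(1 << (64 + e)) // z)      # ceiling division on exact integers
    return a, e

def works_for_all_octabytes(z: int) -> bool:
    """Test only the critical value y = 2^64 - 1 - (2^64 mod z)."""
    a, e = magic(z)
    y = (1 << 64) - 1 - ((1 << 64) % z)
    return (a * y) >> (64 + e) == y // z
```

For z = 3 and z = 5 the multipliers are the constants of answers 17 and 18, and the test succeeds; for z = 7 it fails, in agreement with the first entry of the list above.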

20. 4ADDU $X,$Y,$Y; 4ADDU $X,$X,$X.

21. SL sets $X to zero, signaling overflow if $Y was nonzero. SLU and SRU set $X to zero. SR sets $X to 64 copies of the sign bit of $Y, namely to −[$Y < 0]. (Notice that shifting left by −1 does not shift right.)

22. Dull's program takes the wrong branch when the SUB instruction causes overflow. For example, it treats every nonnegative number as less than $-2^{63}$; it treats $2^{63} - 1$ as less than every negative number. Although no error arises when $1 and $2 have the same sign, or when the numbers in $1 and $2 are both less than $2^{62}$ in absolute value, the correct formulation ‘CMP $0,$1,$2; BN $0,Case1’ is much better. (Similar errors have been made by programmers and compiler writers since the 1950s, often causing significant and mysterious failures.)

23. CMP $0,$1,$2; BNP $0,Case1.

24. ANDN.

25. XOR $X,$Y,$Z; SADD $X,$X,0.

26. ANDN $X,$Y,$Z.

27. BDIF $W,$Y,$Z; ADDU $X,$Z,$W; SUBU $W,$Y,$W.

28. BDIF $0,$Y,$Z; BDIF $X,$Z,$Y; OR $X,$0,$X.

29. NOR $0,$Y,0; BDIF $0,$0,$Z; NOR $X,$0,0. (This sequence computes $2^n - 1 - \max(0, (2^n - 1 - y) - z)$ in each byte position.)
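Answers 27–29 all rest on the bytewise saturating difference. The following Python sketch models it on octabytes and then follows the three instruction sequences; the function names are introduced here for illustration.

```python
MASK = (1 << 64) - 1

def bdif(y: int, z: int) -> int:
    """Bytewise saturating difference: max(0, y_i - z_i) in each byte."""
    yb, zb = y.to_bytes(8, "big"), z.to_bytes(8, "big")
    return int.from_bytes(bytes(max(0, a - b) for a, b in zip(yb, zb)), "big")

def byte_max_min(y: int, z: int):
    """Answer 27: bytewise maximum and minimum."""
    w = bdif(y, z)                           # w_i = max(0, y_i - z_i)
    return (z + w) & MASK, (y - w) & MASK    # no carries or borrows cross bytes

def byte_abs_diff(y: int, z: int) -> int:
    """Answer 28: bytewise |y_i - z_i|."""
    return bdif(y, z) | bdif(z, y)

def byte_sat_add(y: int, z: int) -> int:
    """Answer 29: bytewise min(255, y_i + z_i)."""
    t = ~y & MASK                            # complement y
    t = bdif(t, z)                           # max(0, (255 - y_i) - z_i)
    return ~t & MASK                         # complement again
```

Ordinary 64-bit addition and subtraction are safe in `byte_max_min` because each byte of the result stays within 0..255.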

30. XOR $1,$0,$2; BDIF $1,$3,$1; SADD $1,$1,0 when $2 = #2020202020202020 and $3 = #0101010101010101.

31. MXOR $1,$4,$0; SADD $1,$1,0 when $4 = #0101010101010101.

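32. $C^T_{ji} = C_{ij} = (A_{i1} \bullet B_{1j}) \circ \cdots \circ (A_{in} \bullet B_{nj}) = (B^T_{j1} \bullet A^T_{1i}) \circ \cdots \circ (B^T_{jn} \bullet A^T_{ni})$, which is the $(j,i)$ entry of the product of $B^T$ and $A^T$, if $\bullet$ is commutative.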
33. `MOR` (or `MXOR`) with the constant #0180402010080402.

34. `MOR $X,$Z,[#0080004000200010]`; `MOR $Y,$Z,[#0008000400020001]`. (Here we use brackets to denote registers that contain auxiliary constants.)

To go back, also checking that an 8-bit code is sufficient:

```
PUT rM,[#00ff00ff00ff00ff]
MOR $0,$X,[#4020100804020180]
MUX $1,$0,$Y
BNZ $1,BadCase
MUX $1,$Y,$0
MOR $Z,$1,[#8020080240100401]
```

35. `MOR $X,$Y,$Z`; `MOR $X,$Z,$X`; here $Z is the constant (14).

36. `XOR $0,$Y,$Z`; `MOR $0,[-1],$0`. Notes: Changing XOR to BDIF gives a mask for the bytes where $Y exceeds $Z. Given such a mask, AND it with #8040201008040201 and MOR with #ff to get a one-byte encoding of the relevant byte positions.

37. Let the elements of the field be polynomials in the Boolean matrix

\[
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 0
\end{pmatrix}
\]

For example, this matrix is \(m[\#402010080402018e]\), and if we square it with MXOR we get the matrix \(m[\#2010080402018e47]\). The sum and product of such field elements are then obtained by XOR and MXOR, respectively.

(A field with \(2^k\) elements for \(2 \leq k \leq 7\) is obtained in a similar way from polynomials in the matrices #0103, #020105, #04020109, #0804020112, #100804020121, #20100804020141. Matrices of size up to \(16 \times 16\) can be represented as four octabytes; then multiplication requires eight MXORs and four XORs. We can, however, do multiplication in a field of \(2^{16}\) elements by performing only five MXORs and three XORs, if we represent the large field as a quadratic extension of the field of \(2^8\) elements.)
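The squaring claimed in answer 37 can be confirmed with a small model of MXOR. The Python sketch below assumes the same row convention as MOR (bytes from left to right are rows, and row i of the result combines those rows k of the first operand selected by the bits of row i of the second), with XOR in place of OR.

```python
def mxor(y: int, z: int) -> int:
    """Model of MXOR $X,$Y,$Z: Boolean matrix product over GF(2)."""
    yb = y.to_bytes(8, "big")
    zb = z.to_bytes(8, "big")
    out = bytearray(8)
    for i in range(8):
        for k in range(8):
            if zb[i] & (0x80 >> k):   # bit k (from the left) of row i of z
                out[i] ^= yb[k]       # XOR in row k of y
    return int.from_bytes(out, "big")

A = 0x402010080402018E                # the matrix of answer 37
IDENTITY = 0x8040201008040201         # the 8 x 8 identity matrix
A_SQUARED = mxor(A, A)
```

Field addition is then a plain XOR of two octabytes, and field multiplication is one call of `mxor`.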

38. It sets $1 to the sum of the eight signed bytes initially in $0; it also sets $2 to the rightmost nonzero such byte, or zero; and it sets $0 to zero. (Changing SR to SRU would treat the bytes as unsigned. Changing SLU to SL would often overflow.)

39. The assumed running times are (a) (3υ or 2υ) versus 2υ; (b) (4υ or 3υ) versus 2υ; (c) (4υ or 3υ) versus 3υ; (d) (υ or 4υ) versus 2υ; (e) (2υ or 5υ) versus 2υ; (f) (2υ or 5υ) versus 3υ. So we should use the conditional instructions in cases (a,d) and (c,f), unless $0 is negative with probability > 2/3; in the latter case we should use the `PBN` variants, (d) and (f). The conditionals always win in cases (b,e).

If the `ADDU` commands had been `ADD`, the instructions would not have been equivalent, because of possible overflows.

40. Suppose you `GO` to address #101; this sets @ ← #101. The tetrabyte \(M_4[\#101]\) is the same as the tetrabyte \(M_4[\#100]\). If the opcode of that instruction is, say, `PUSHJ`, register rJ will be set to #105. Similarly, if that instruction is `GETA $0,@`, register $0 will be set to #101. In such situations the value for @ in `MMIX` assembly language is slightly different from the actual value during program execution.

Programmers could use these principles to send some sort of signal to a subroutine, based on the two trailing bits of @. (Tricky, but hey, why not use the bits we’ve got?)

41. (a) True. (b) True. (c) True. (d) False, but true with SRU in place of SR.

42. (a) NEGU $1,$0; CSNN $1,$0,$0. (b) ANDN $1,$0,[#8000000000000000].

43. Trailing zeros (solution by J. Dalles): SUBU $0,$Z,1; SADD $0,$0,$Z.

Leading zeros: FLOTU $0,1,$Z; SRU $0,$0,52; SUB $0,[1086],$0. (If $Z could be zero, add the command CSZ $0,$Z,64.) This is the shortest program, but not the fastest; we save 2υ if we reverse all bits (exercise 35) and count trailing zeros.
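The trailing-zeros trick of answer 43 is a two-line computation once SADD is modeled. In the Python sketch below, `sadd` counts the positions where its first operand has a 1 and its second operand has a 0; both function names are introduced here for illustration.

```python
MASK = (1 << 64) - 1

def sadd(y: int, z: int) -> int:
    """Model of SADD: number of bit positions where y is 1 and z is 0."""
    return bin(y & ~z & MASK).count("1")

def trailing_zeros(z: int) -> int:
    """Count the trailing zero bits of a 64-bit value."""
    t = (z - 1) & MASK        # subtracting 1 turns the trailing zeros into ones
    return sadd(t, z)         # and those are the only ones of t not also in z
```

No special case is needed for zero: then z − 1 is all ones and the count is 64.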

44. Use “high tetra arithmetic,” in which each 32-bit number appears in the left half of a register. LDHT and STHT load and store such quantities (see exercise 7); SETMH loads an immediate constant. To add, subtract, multiply, or divide high tetras $Y and $Z, producing a high tetra $X with correct attention to integer overflow and divide check, the following commands work perfectly: (a) ADD $X,$Y,$Z. (b) SUB $X,$Y,$Z. (c) SR $X,$Z,32; MUL $X,$Y,$X (assuming that we have X ≠ Y). (d) DIV $X,$Y,$Z; SL $X,$X,32; now rR is the high tetra remainder.

46. It causes a trip to location 0.

47. #DF is MXORI (“multiple exclusive-or immediate”); #55 is PBPB (“probable branch positive backward”). But in a program we use the names MXOR and PBP; the assembler silently adds the I and B when required.

48. STO and STOU; also the “immediate” variants LDOI and LDOUI, STOI and STOUI; also NEGI and NEGUI, although NEG is not equivalent to NEGU; also any two of the four opcodes FLOTI, FLOTUI, SFLOTI, and SFLOTUI.

(Every MMIX operation on signed numbers has a corresponding operation on unsigned numbers, obtained by adding 2 to the opcode. This consistency makes the machine design easier to learn, the machine easier to build, and the compilers easier to write. But of course it also makes the machine less versatile, because it leaves no room for other operations that might be desired.)

49. Octabyte $M_8[0]$ is set to #0000010000000001; rH is set to #0000101243210000; $M_2[0244420000000000]$ is set to #0121; rA is set to #00001 (because overflow occurs on the STW); rB is set to $f(7)$ = #401c000000000000; and $1 ← f8ffffffffffff.

(Also rL $\leftarrow 2$, if rL was originally 0 or 1.) We assume that the program is not located in such a place that the STC0, STB, or STW instructions could clobber it.

50. $4u + 34v = u + (u+v) + v + (u+v) + v + v + 4v + u + v + v + v + v + + v + v.$

51. 35010001 a0010101 2e010101 a5010101 f6000001 c4010101
   b5010101 8e010101 1a010101 db010101 c7010101 3d010101
   33010001 e4010001 f7150001 08010001 5701ffff 3f010101

52. Opcodes ADDI, ADDUI, SUBI, SUBUI, SLI, SLUI, SRI, SRUI, ORI, XORI, ANDNI, BDIFI, WDIFI, TDIFI, ODIFI: X = Y = 255, Z = 0. Opcode MULI: X = Y = 255, Z = 1. Opcodes INCH, INCMH, INCML, INCL, ORH, ORMH, ORML, ORL, ANDNH, ANDNMH, ANDNML, ANDNL: X = 255, Y = Z = 0. Opcodes OR, AND, MUX: X = Y = Z = 255. Opcodes CSN, CSZ, ..., CSEV: X = Z = 255, Y arbitrary. Opcodes BN, BZ, ..., PBEV: X arbitrary, Y = 0, Z = 1. Opcode JMP: X = Y = 0, Z = 1. Opcodes PRELD, PRELDI, PREGO, PREGOI, SWYM: X, Y, Z arbitrary. (Subtle point: An instruction that sets register $X is not a no-op when X is marginal, because it causes rL to increase; and all registers except $255 are marginal when rL = 0 and rG = 255.)
53. MULU, MULUI, PUT, PUTI, UNSAVE.

54. FCMP, FADD, FIX, FSUB, ..., FCMPE, FEQLE, ..., FINT, MUL, MULI, DIV, DIVI, ADD, ADDI, SUB, SUBI, NEG, NEGI, SL, SLI, STB, STBI, STW, STWI, STT, STTI, STSF, STSFI, PUT, PUTI, UNSAVE. (This was not quite a fair question, because the complete rules for floating point operations appear only elsewhere. One fine point is that FCMP might change the I_BIT of rA, if $Y or $Z is Not-a-Number, but FEQL and FUN never cause exceptions.)

55. FCMP, FUN, ..., SRUI, CSN, CSNI, ..., LDUNCI, GO, GOI, PUSHGO, PUSHGOI, OR, ORI, ..., ANDNL, PUSHJ, PUSHJB, GETA, GETAB, PUT, PUTI, POP, SAVE, UNSAVE, GET.

56. Minimum space:

    LDO  $1,x
    SET  $0,$1
    SETL $2,12
    MUL  $0,$0,$1
    SUB  $2,$2,1
    PBP  $2,@-4*2

Space = 6 × 4 = 24 bytes, time = μ + 149υ. Faster solutions are possible.

Minimum time: The assumption that $|x^{13}| \leq 2^{63}$ implies that $|x| < 2^5$ and $x^8 < 2^{39}$. The following solution, based on an idea of Y. N. Patt, exploits this fact.

    LDO  $0,x              $0 = x
    MUL  $1,$0,$0          $1 = x^2
    MUL  $1,$1,$1          $1 = x^4
    SL   $2,$1,25          $2 = 2^25 x^4
    SL   $3,$0,39          $3 = 2^39 x
    ADD  $3,$3,$1          $3 = 2^39 x + x^4
    MULU $1,$3,$2          u($1) = 2^25 x^8, rH = x^5 + 2^25 x^4 [x < 0]
    GET  $2,rH             $2 ≡ x^5 (modulo 2^25)
    PUT  rM,[#1ffffff]
    MUX  $2,$2,$0          $2 = x^5
    SRU  $1,$1,25          $1 = x^8
    MUL  $0,$1,$2          $0 = x^13

Space = 12 × 4 = 48 bytes, time = μ + 48υ. At least five multiplications are “necessary,” according to the theory developed in Section 4.6.3; yet this program uses only four! And in fact there is a way to avoid multiplication altogether.

True minimum time: As R. W. Floyd points out, we have $|x| \leq 28$, so the minimum execution time is achieved by referring to a table (unless μ > 45υ):

          LDO   $0,x             $0 = x
          8ADDU $0,$0,[Table]
          LDO   $0,$0,8*28       $0 = x^13
          ...
    Table OCTA  -28*28*28*28*28*28*28*28*28*28*28*28*28
          OCTA  -27*27*27*27*27*27*27*27*27*27*27*27*27
          ...
          OCTA  28*28*28*28*28*28*28*28*28*28*28*28*28

Space = 3 × 4 + 57 × 8 = 468 bytes, time = 2μ + 3υ.

57. (1) An operating system can allocate high-speed memory more efficiently if program blocks are known to be “read-only.” (2) An instruction cache in hardware will be faster and less expensive if instructions cannot change. (3) Same as (2), with “pipeline” in place of “cache.” If an instruction is modified after entering a pipeline, the pipeline needs to be flushed; the circuitry needed to check this condition is complex and time-consuming. (4) Self-modifying code cannot be used by more than one process at once. (5) Self-modifying code can defeat techniques for “profiling” (that is, for computing the number of times each instruction is executed).
SECTION 1.3.2’

1. (a) It refers to the label of line 24. (b) No indeed. Line 23 would refer to line 24 instead of line 38; line 31 would refer to line 24 instead of line 21.

2. The current value of 9B will be a running count of the number of such lines that have appeared earlier.

3. Read in 100 octabytes from standard input; exchange their maximum with the last of them; exchange the maximum of the remaining 99 with the last of those, etc. Eventually the 100 octabytes will become completely sorted into nondecreasing order. The result is then written to the standard output. (Compare with Algorithm 5.2.3S.)

4. #2233445566778899. [Large values are reduced mod 2^64.]

5. BYTE “silly”; but this trick is not recommended.

6. False; TETRA @, @ is not the same as TETRA @; TETRA @.

7. He forgot that relative addresses are to tetrabyte locations; the two trailing bits are ignored.

8. LOC 16*((@+15)/16) or LOC -@/16*-16 or LOC (@+15)&-16, etc.
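Three such expressions for rounding a location up to the next multiple of 16 can be compared directly. The Python sketch below uses floor division on ordinary integers, which agrees with the intended result for nonnegative locations; the function name and the parameter `loc` (standing for the current location) are introduced here for illustration.

```python
def align16_variants(loc: int):
    """Three ways to round loc up to a multiple of 16."""
    a = 16 * ((loc + 15) // 16)      # add 15, divide, multiply back
    b = (-loc // 16) * -16           # negate, divide (rounding down), negate back
    c = (loc + 15) & -16             # add 15, then clear the four low bits
    return a, b, c
```

All three agree; the second works because rounding −loc down is the same as rounding loc up.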

9. Change 500 to 600 on line 02; change Five to Six on line 35. [Five-digit numbers are not needed unless 1230 or more primes are to be printed. Each of the first 6542 primes will fit in a single wyde.]

10. M[#2000000000000000] = #0002, and the following nonzero data goes into the text segment:

$$\begin{align*}
$100: & 0003 \\
$104: & 0070 \\
$108: & 0002 \\
$110: & 0013 \\
$114: & 0002 \\
$118: & 0070 \\
$120: & 0000 \\
$124: & 0000 \\
$128: & 0000 \\
$130: & 0000 \\
$133: & 0000 \\
$137: & 0000 \\
$141: & 0000 \\
$145: & 0000 \\
$150: & 0000 \\
$154: & 0000 \\
$158: & 0000 \\
\end{align*}$$

(Notice that SET becomes SETL in #100, but ORI in #104. The current location @ is aligned to #15c at line 38, according to rule 7(a).) When the program begins, rG will be #f5, and we will have $248 = #20000000000003e8, $247 = #fffffffffffffc1a, $246 = #13c, $245 = #2030303030000000.
11. (a) If \( n \) is not prime, by definition \( n \) has a divisor \( d \) with \( 1 < d < n \). If \( d > \sqrt{n} \), then \( n/d \) is a divisor with \( 1 < n/d < \sqrt{n} \). (b) If \( n \) is not prime, \( n \) has a prime divisor \( d \) with \( 1 < d \leq \sqrt{n} \). The algorithm has verified that \( n \) has no prime divisors \( \leq p = \text{PRIME}[k] \); also \( n = pq + r < pq + p \leq p^2 + p < (p + 1)^2 \). Any prime divisor of \( n \) is therefore \( \geq p + 1 > \sqrt{n} \).

We must also prove that there will be a sufficiently large prime less than \( n \) when \( n \) is prime, namely that the \((k + 1)\text{st}\) prime \( p_{k+1} \) is less than \( p_k^2 + p_k \); otherwise \( k \) would exceed \( j \) and \( \text{PRIME}[k] \) would be zero when we needed it to be large. The necessary proof follows from “Bertrand’s postulate”: If \( p \) is prime there is a larger prime less than \( 2p \).
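The argument of answer 11 can be exercised with a small Python sketch of the trial-division scheme. This is an illustration rather than a transcription of Program P: the function name `first_primes` is hypothetical, and the list indexing stands in for the PRIME table. If the index `k` ever ran past the primes found so far, the lookup `primes[k]` would raise an `IndexError`; the Bertrand-postulate argument is what guarantees that this does not happen.

```python
def first_primes(count: int) -> list[int]:
    """Find primes by dividing only by earlier primes, stopping when q <= p."""
    primes = [2, 3]                 # plays the role of PRIME[1], PRIME[2]
    n = 3
    while len(primes) < count:
        n += 2                      # try the next odd number
        k = 1                       # start with the divisor 3
        while True:
            p = primes[k]           # would fail if k exceeded the primes known
            q, r = divmod(n, p)
            if r == 0:
                break               # n is composite
            if q <= p:              # no prime divisor <= sqrt(n) exists
                primes.append(n)
                break
            k += 1
    return primes[:count]


table = first_primes(500)
print(table[-1])
# Consequence of Bertrand's postulate used in the answer:
print(all(b < a * a + a for a, b in zip(table, table[1:])))
```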

12. We could move \texttt{Title}, \texttt{NewLn}, and \texttt{Blanks} to the data segment following \texttt{BUF}, where they could use \texttt{ptop} as their base address. Or we could change the \texttt{LDA} instructions on lines 38, 42, and 58 to \texttt{SETL}, knowing that the string addresses happen to fit in two bytes because this program is short. Or we could change \texttt{LDA} to \texttt{GETA}; but in that case we would have to align each string modulo 4, for example by saying

\begin{verbatim}
Title   BYTE  "First Five Hundred Primes",#a,0
        LOC   (@+3)&-4
NewLn   BYTE  #a,0
        LOC   (@+3)&-4
Blanks  BYTE  "   ",0
\end{verbatim}

(See exercises 7 and 8.)

13. Line 35 gets the new title; change \texttt{BYTE} to \texttt{WYDE} on lines 35–37. Change \texttt{Fputs} to \texttt{Fputws} in lines 39, 43, 55, 59. Change the constant in line 45 to \#0020066006600660. Change \texttt{BUF+4} to \texttt{BUF+2*4} on line 47. And change lines 50–52 to

\begin{verbatim}
INCL r,#0660; STWU r,t,0; SUB t,t,2.
\end{verbatim}

Incidentally, the new title line might look like

\begin{verbatim}
Title WYDE "أول خمس مسجات الأرقام الآلية"
\end{verbatim}

when it is printed bidirectionally, but in the computer file the individual characters actually appear in “logical” order without ligatures. Thus a spelled-out sequence like

\begin{verbatim}
Title WYDE 1\text{"i",3",j",4",l",5",6",7",8"}
\end{verbatim}

would give an equivalent result, by the rule for string constants (rule 2).

14. We can, for example, replace lines 26–30 of Program P by

\begin{verbatim}
fn      GREG  0
sqrtn   GREG  0
        FLOT  fn,n
        FSQRT sqrtn,fn
6H      LDWU  pk,ptop,kk
        FLOT  t,pk
        FREM  r,fn,t
        BZ    r,4B
7H      FCMP  t,sqrtn,t
\end{verbatim}

The new \texttt{FREM} instruction is performed 9597 times, not 9538, because the new test in step P7 is not quite as effective as before. In spite of this, the floating point calculations reduce the running time by \( 426192\upsilon - 59\mu \), a notable improvement (unless of course \( \mu/\upsilon > 7000 \)). An additional savings of \( 38169\upsilon \) can be achieved if the primes are stored as short floats instead of as unsigned wydes.

The number of divisibility tests can actually be reduced to 9357 if we replace \( q \) by \( \sqrt{n} - 1.9999 \) in step P7 (see the answer to exercise 11). But the extra subtractions cost more than they save, unless \( \mu/v > 15 \).

15. It prints a string consisting of a blank space followed by an asterisk followed by two blanks followed by an asterisk … followed by \( k \) blanks followed by an asterisk … followed by 74 blanks followed by an asterisk; a total of \( 2+3+\cdots+75 = \binom{76}{2} - 1 = 2849 \) characters. The total effect is one of OP art.

17. The following subroutine returns zero if and only if the instruction is OK.

\begin{verbatim}
a        IS    #ffffffff   Table entry when anything goes
b        IS    #ffff04ff   Table entry when Y ≤ ROUND_NEAR
c        IS    #001f00ff   Table entry for PUT and PUTI
d        IS    #ff000000   Table entry for RESUME
e        IS    #ffff0000   Table entry for SAVE
f        IS    #ff0000ff   Table entry for UNSAVE
g        IS    #ff000003   Table entry for SYNC
h        IS    #ffff00ff   Table entry for GET
table    GREG  @
         TETRA a,a,a,a,a,b,a,b,b,b,b,b,b,b,b,b   0x
         TETRA a,a,a,a,a,b,a,b,a,a,a,a,a,a,a,a   1x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   2x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   3x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   4x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   5x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   6x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   7x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   8x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   9x
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   Ax
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   Bx
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   Cx
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   Dx
         TETRA a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a   Ex
         TETRA a,a,a,a,a,a,c,c,a,d,e,f,g,a,h,a   Fx
tetra    IS    $1
maxXYZ   IS    $2
InstTest BN    $0,9F              Invalid if address is negative.
         LDTU  tetra,$0,0         Fetch the tetrabyte.
         SR    $0,tetra,22        Extract its opcode (times 4).
         LDT   maxXYZ,table,$0    Get Xmax, Ymax, Zmax.
         BDIF  $0,tetra,maxXYZ    Check if any max is exceeded.
         PBN   maxXYZ,9F          If not a PUT, we are done.
         ANDNML $0,#ff00          Zero out the OP byte.
         BNZ   $0,9F              Branch if any max is exceeded.
         MOR   tetra,tetra,#4     Extract the X byte.
         CMP   $0,tetra,18
         CSP   tetra,$0,0         Set X ← 0 if 18 < X < 32.
         ODIF  $0,tetra,7         Set $0 ← X ∸ 7.
9H       POP   1,0                Return $0 as the answer.
\end{verbatim}

This solution does not consider a tetrabyte to be invalid if it would jump to a negative address, nor is ‘SAVE \$0,0’ called invalid (although \$0 can never be a global register).

18. The catch to this problem is that there may be several places in a row or column where the minimum or maximum occurs, and each is a potential saddle point.

\textit{Solution 1:} In this solution we run through each row in turn, making a list of all columns in which the row minimum occurs and then checking each column on the list to see if the row minimum is also a column maximum. Notice that in all cases the terminating condition for a loop is that a register is \( \leq 0 \).

* Solution 1

\begin{verbatim}
t       IS    $255
a00     GREG  Data_Segment      Address of "a00"
a10     GREG  Data_Segment+8    Address of "a10"
ij      IS    $0                Element index and return register
j       GREG  0                 Column index
k       GREG  0                 Size of list of minimum indices
x       GREG  0                 Current minimum value
y       GREG  0                 Current element
Saddle  SET   ij,9*8
RowMin  SET   j,8
        LDB   x,a10,ij          Candidate for row minimum
2H      SET   k,0               Set list empty.
4H      INCL  k,1
        STB   j,a00,k           Put column index in list.
1H      SUB   ij,ij,1           Go left one.
        SUB   j,j,1
        BZ    j,ColMax          Done with row?
        LDB   y,a10,ij
        SUB   t,x,y
        PBN   t,1B              Is x still minimum?
        SET   x,y
        PBP   t,2B              New minimum?
        JMP   4B                Remember another minimum.
ColMax  LDB   $1,a00,k          Get column from list.
        ADD   j,$1,9*8-8
1H      LDB   y,a10,j
        CMP   t,x,y
        PBN   t,No              Is row min < column element?
        SUB   j,j,8
        PBP   j,1B              Done with column?
Yes     ADD   ij,ij,$1          Yes; ij ← index of saddle.
        LDA   ij,a10,ij
        POP   1,0
No      SUB   k,k,1             Is list empty?
        BP    k,ColMax          If not, try again.
        PBP   ij,RowMin         Have all rows been tried?
        POP   1,0               Yes; $0 = 0, no saddle.
\end{verbatim}

Solution 2: An infusion of mathematics gives a different algorithm.

**Theorem.** Let \( R(i) = \min_j a_{i,j} \), \( C(j) = \max_i a_{i,j} \). The element \( a_{i_0,j_0} \) is a saddle point if and only if \( R(i_0) = \max_i R(i) = C(j_0) = \min_j C(j) \).

Proof. If \( a_{i_0 j_0} \) is a saddle point, then for any fixed \( i \), \( R(i_0) = C(j_0) \geq a_{i j_0} \geq R(i) \); so \( R(i_0) = \max_i R(i) \). Similarly \( C(j_0) = \min_j C(j) \). Conversely, we have \( R(i) \leq a_{ij} \leq C(j) \) for all \( i \) and \( j \); hence \( R(i_0) = C(j_0) \) implies that \( a_{i_0 j_0} \) is a saddle point.

(This proof shows that we always have \( \max_i R(i) \leq \min_j C(j) \). So there is no saddle point if and only if all the \( R \)'s are less than all the \( C \)'s.)

According to the theorem, it suffices to find the smallest column maximum, then to search for an equal row minimum.
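The theorem can be checked against the definition on random matrices. The Python sketch below is only an illustration of the mathematics, not of the MMIX code; the names `saddle_points` and `saddle_by_theorem` are hypothetical, and the matrices use zero-based indices.

```python
import random


def saddle_points(a):
    """All (i, j) whose entry is smallest in its row and largest in its column."""
    row_min = [min(row) for row in a]
    col_max = [max(col) for col in zip(*a)]
    return [(i, j)
            for i, row in enumerate(a)
            for j, v in enumerate(row)
            if v == row_min[i] and v == col_max[j]]


def saddle_by_theorem(a):
    """Use the theorem: a saddle exists iff max_i R(i) equals min_j C(j)."""
    row_min = [min(row) for row in a]          # R(i)
    col_max = [max(col) for col in zip(*a)]    # C(j)
    z = min(col_max)                           # smallest column maximum
    if max(row_min) != z:
        return None                            # every R is less than every C
    return row_min.index(z), col_max.index(z)


random.seed(1)
for _ in range(1000):
    a = [[random.randint(-2, 2) for _ in range(8)] for _ in range(9)]
    found = saddle_by_theorem(a)
    assert (found is None) == (not saddle_points(a))
    assert found is None or found in saddle_points(a)
print("theorem agrees with the definition on 1000 random 9x8 matrices")
```

The second function corresponds to the strategy of Solution 2: find the smallest column maximum first, then look for a row whose minimum equals it.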

* Solution 2

\begin{verbatim}
t       IS    $255
a00     GREG  Data_Segment
a10     GREG  Data_Segment+8
a20     GREG  Data_Segment+8*2
ij      GREG  0
ii      GREG  0
j       GREG  0
x       GREG  0
y       GREG  0
z       GREG  0
ans     IS    $0
Phase1  SET   j,8               Start at column 8.
        SET   z,1000            z ← ∞ (more or less).
3H      ADD   ij,j,9*8-2*8
        LDB   x,a20,ij
1H      LDB   y,a10,ij
        CMP   t,x,y             Is x < y?
        CSN   x,t,y             If so, update the maximum.
2H      SUB   ij,ij,8           Move up one.
        PBP   ij,1B
        STB   x,a10,ij          Store column maximum.
        CMP   t,x,z             Is x < z?
        CSN   z,t,x             If so, update the min max.
        SUB   j,j,1             Move left a column.
        PBP   j,3B
Phase2  SET   ii,9*8-8          (At this point z = min_j C(j).)
3H      ADD   ij,ii,8           Prepare to search a row.
        SET   j,8
1H      LDB   x,a10,ij
        SUB   t,z,x             Is z > a_ij?
        PBP   t,No              There's no saddle in this row.
        PBN   t,2F
        LDB   x,a00,j           Is a_ij = C(j)?
        CMP   t,x,z
        CSZ   ans,t,ij          If so, remember a possible saddle point.
2H      SUB   j,j,1             Move left in row.
        SUB   ij,ij,1
        PBP   j,1B
        LDA   ans,a10,ans       A saddle point was found here.
        POP   1,0
No      SUB   ii,ii,8
        PBP   ii,3B             Try another row.
        SET   ans,0
        POP   1,0               ans = 0; no saddle.
\end{verbatim}

We leave it to the reader to invent a solution in which Phase 1 records all possible rows that are candidates for the row search in Phase 2. It is not necessary to search all rows, just those \( i_0 \) for which \( C(j_0) = \min_j C(j) \) implies \( a_{i_0, j_0} = C(j_0) \). Usually there is at most one such row.

In some trial runs with elements selected at random from \( \{-2, -1, 0, 1, 2\} \), Solution 1 required approximately \( 147 \mu + 863 \nu \) to run, while Solution 2 took about \( 95 \mu + 510 \nu \). Given a matrix of all zeros, Solution 1 found a saddle point in \( 26 \mu + 188 \nu \), Solution 2 in \( 96 \mu + 517 \nu \).

If an \( m \times n \) matrix has distinct elements, and \( m \geq n \), we can solve the problem by looking at only \( O(m + n) \) of them and doing \( O(m \log n) \) auxiliary operations. See Bienstock, Chung, Fredman, Schäffer, Shor, and Suri, AMM 98 (1991), 418-419.

19. Assume an \( m \times n \) matrix. (a) By the theorem in the answer to exercise 18, all saddle points of a matrix have the same value, so (under our assumption of distinct elements) there is at most one saddle point. By symmetry the desired probability is \( mn \) times the probability that \( a_{11} \) is a saddle point. This latter is \( 1/(mn)! \) times the number of permutations with \( a_{12} > a_{11}, \ldots, a_{1n} > a_{11}, a_{11} > a_{21}, \ldots, a_{11} > a_{m1} \); and this is \( 1/(m + n - 1)! \) times the number of permutations of \( m + n - 1 \) elements in which the first is greater than the next \( m - 1 \) and less than the remaining \( n - 1 \), namely \((m - 1)!(n - 1)! \). The answer is therefore
\[
mn\,\frac{(m - 1)!\,(n - 1)!}{(m + n - 1)!} = (m + n) \Big/ \binom{m + n}{n}.
\]
In our case this is \( 17/\binom{17}{9} \), only one chance in 1430. (b) Under the second assumption, an entirely different method must be used since there can be multiple saddle points; in fact either a whole row or whole column must consist entirely of saddle points. The probability equals the probability that there is a saddle point with value zero plus the probability that there is a saddle point with value one. The former is the probability that there is at least one column of zeros; the latter is the probability that there is at least one row of ones. The answer is \((1 - (1 - 2^{-m})^n) + (1 - (1 - 2^{-n})^m)\); in our case, 924744796234036231/18446744073709551616, about 1 in 19.9. An approximate answer is \( n2^{-m} + m2^{-n} \).
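Both closed forms of answer 19 can be evaluated exactly and compared with brute-force enumeration on small matrices. The following Python sketch is an illustration under the definition used in exercise 18 (an entry that is smallest in its row and largest in its column); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import comb


def has_saddle(a):
    """True if some entry is the minimum of its row and the maximum of its column."""
    col_max = [max(col) for col in zip(*a)]
    return any(v == min(row) and v == col_max[j]
               for row in a for j, v in enumerate(row))


def prob_distinct(m, n):
    """Part (a): (m+n) / C(m+n, n) for distinct entries in random order."""
    return Fraction(m + n, comb(m + n, n))


def prob_binary(m, n):
    """Part (b): independent random 0/1 entries, m rows and n columns."""
    return (1 - (1 - Fraction(1, 2**m))**n) + (1 - (1 - Fraction(1, 2**n))**m)


def brute_force(m, n, fillings):
    """Fraction of the given fillings (flat tuples of m*n entries) with a saddle."""
    hits = total = 0
    for flat in fillings:
        a = [flat[i * n:(i + 1) * n] for i in range(m)]
        hits += has_saddle(a)
        total += 1
    return Fraction(hits, total)


print(prob_distinct(9, 8))                    # 1/1430
print(prob_binary(9, 8))                      # the exact fraction quoted in (b)
print(brute_force(2, 3, permutations(range(6))) == prob_distinct(2, 3))
print(brute_force(2, 3, product((0, 1), repeat=6)) == prob_binary(2, 3))
```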

20. M. Hofri and P. Jacquet [Algorithmica 22 (1998), 516-528] have analyzed the case when the \( m \times n \) matrix entries are distinct and in random order. The running times of the two \text{MMIX} programs are then \( (mn + mH_n + 2m + 1 + (m + 1)/(n - 1))\mu + (6mn + 7mH_n + 5m + 11 + 7(m + 1)/(n - 1))\upsilon + O\bigl((m + n)^2/\binom{m + n}{m}\bigr) \) and \( (m + 1)n\mu + (5mn + 6m + 4n + 7H_n + 8)\upsilon + O(1/n) + O((\log n)^2/m) \), respectively, as \( m \to \infty \) and \( n \to \infty \), assuming that \( (\log n)/m \to 0 \).

21. Farey SET y,1; ... POP.

This answer is the first of many in Volumes 1–3 for which MMIXmasters are being asked to contribute elegant solutions. (See the website information on page ii.) The fourth edition of this book will present the best parts of the best programs submitted. Note: Please reveal your full name, including all middle names, if you enter this competition, so that proper credit can be given!

22. (a) Induction. (b) Let \( k \geq 0 \) and \( X = ax_{k+1} - x_k \), \( Y = ay_{k+1} - y_k \), where \( a = \lfloor (y_k + n)/y_{k+1} \rfloor \). By part (a) and the fact that \( 0 < Y \leq n \), we have \( X \perp Y \) and \( X/Y > x_{k+1}/y_{k+1} \). So if \( X/Y \neq x_{k+2}/y_{k+2} \) we have, by definition, \( X/Y > x_{k+2}/y_{k+2} \). But this implies that

\[
\begin{aligned}
\frac{1}{Y y_{k+1}} &= \frac{X y_{k+1} - Y x_{k+1}}{Y y_{k+1}} = \frac{X}{Y} - \frac{x_{k+1}}{y_{k+1}} \\
&= \left( \frac{X}{Y} - \frac{x_{k+2}}{y_{k+2}} \right) + \left( \frac{x_{k+2}}{y_{k+2}} - \frac{x_{k+1}}{y_{k+1}} \right) \\
&\geq \frac{1}{Y y_{k+2}} + \frac{1}{y_{k+1} y_{k+2}} = \frac{y_{k+1} + Y}{Y y_{k+1} y_{k+2}} \\
&> \frac{n}{Y y_{k+1} y_{k+2}} \geq \frac{1}{Y y_{k+1}}.
\end{aligned}
\]

This is a contradiction, so \( X/Y = x_{k+2}/y_{k+2} \).

Historical notes: C. Haros gave a (more complicated) rule for constructing such sequences, in *J. de l’École Polytechnique* 4, 11 (1802), 364–368; his method was correct, but his proof was inadequate. Several years later, the geologist John Farey independently conjectured that \( x_k/y_k \) is always equal to \( (x_{k-1} + x_{k+1})/(y_{k-1} + y_{k+1}) \) [Philos. Magazine and Journal 47 (1816), 385–386]; a proof was supplied shortly afterwards by A. Cauchy [Bull. Société Philomatique de Paris (3) 3 (1816), 133–135], who attached Farey’s name to the series. For more of its interesting properties, see G. H. Hardy and E. M. Wright, *An Introduction to the Theory of Numbers*, Chapter 3.
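The recurrence of answer 22(b) is easy to try out. The Python sketch below assumes the starting values \( x_0/y_0 = 0/1 \) and \( x_1/y_1 = 1/n \) and stops when 1/1 is reached; the name `farey` is hypothetical. It also checks the relation that Farey conjectured, in the equivalent cross-multiplied form.

```python
from fractions import Fraction
from math import gcd


def farey(n: int) -> list[tuple[int, int]]:
    """Generate the fractions x_k/y_k of order n, in increasing order."""
    x0, y0, x1, y1 = 0, 1, 1, n
    seq = [(x0, y0), (x1, y1)]
    while (x1, y1) != (1, 1):
        a = (y0 + n) // y1                       # a = floor((y_k + n) / y_{k+1})
        x0, y0, x1, y1 = x1, y1, a * x1 - x0, a * y1 - y0
        seq.append((x1, y1))
    return seq


seq = farey(7)
print(seq)
# Adjacent fractions satisfy x_{k+1} y_k - x_k y_{k+1} = 1 (part (a)).
print(all(x1 * y0 - x0 * y1 == 1 for (x0, y0), (x1, y1) in zip(seq, seq[1:])))
# Each interior fraction equals the mediant of its two neighbors.
print(all(Fraction(x0 + x2, y0 + y2) == Fraction(x1, y1)
          for (x0, y0), (x1, y1), (x2, y2) in zip(seq, seq[1:], seq[2:])))
```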

23. The following routine should do reasonably well on most pipeline and cache configurations.

\begin{verbatim}
a       IS    $0
n       IS    $1
z       IS    $2
t       IS    $255
1H      STB   z,a,0
        SUB   n,n,1
        ADD   a,a,1
Zero    BZ    n,9F
        SET   z,0
        AND   t,a,7
        BNZ   t,1B
        CMP   t,n,64
        PBNN  t,3F
        JMP   5F
2H      STCO  0,a,0
        SUB   n,n,8
        ADD   a,a,8
3H      AND   t,a,63
        PBNZ  t,2B
        CMP   t,n,64
        BN    t,5F
4H      PREST 63,a,0
        SUB   n,n,64
        CMP   t,n,64
        STCO  0,a,0
        STCO  0,a,8
        STCO  0,a,16
        STCO  0,a,24
        STCO  0,a,32
        STCO  0,a,40
        STCO  0,a,48
        STCO  0,a,56
        ADD   a,a,64
        PBNN  t,4B
5H      CMP   t,n,8
        BN    t,7F
6H      STCO  0,a,0
        SUB   n,n,8
        ADD   a,a,8
        CMP   t,n,8
        PBNN  t,6B
7H      BZ    n,9F
8H      STB   z,a,0
        SUB   n,n,1
        ADD   a,a,1
        PBNZ  n,8B
9H      POP
\end{verbatim}

24. The following routine merits careful study; comments are left to the reader. A faster program would be possible if we treated \$0 ≡ \$1 (modulo 8) as a special case.

\begin{verbatim}
in      IS    $2
out     IS    $3
r       IS    $4
l       IS    $5
m       IS    $6
t       IS    $7
mm      IS    $8
tt      IS    $9
flip    GREG  #0102040810204080
ones    GREG  #0101010101010101
        LOC   #100
StrCpy  AND   in,$0,#7
        SLU   in,in,3
        AND   out,$1,#7
        SLU   out,out,3
        SUB   r,out,in
        LDOU  out,$1,0
        SUB   $1,$1,$0
        NEG   m,0,1
        SRU   m,m,in
        LDOU  in,$0,0
        PUT   rM,m
        NEG   mm,0,1
        BN    r,1F
        NEG   l,64,r
        SLU   tt,out,r
        MUX   in,in,tt
        SUBU  t,mm,1
        BDIF  t,ones,in
        AND   t,t,m
        SRU   mm,mm,r
        PUT   rM,mm
        JMP   4F
1H      NEG   l,0,r
        INCL  r,64
\end{verbatim}

The running time, approximately \((n/4+4)\mu+(n+40)v\) plus the time to \text{POP}, is less than the cost of the trivial code when \(n \geq 8\) and \(\mu \geq v\).

25. We assume that register \text{p} initially contains the address of the first byte, and that this address is a multiple of 8. Other local or global registers \(a, b, \ldots\) have also been declared. The following solution starts by counting the wyde frequencies first, since this requires only half as many operations as it takes to count byte frequencies. Then the byte frequencies are obtained as row and column sums of a \(256 \times 256\) matrix.

\begin{verbatim}
* Cryptanalysis Problem (CLASSIFIED)
        LOC   Data_Segment
count   GREG  @              Base address for wyde counts
        LOC   @+8*(1<<16)    Space for the wyde frequencies
freq    GREG  @              Base address for byte counts
        LOC   @+8*(1<<8)     Space for the byte frequencies
p       GREG  @
        BYTE  "abracadabra",0,"abc"   Trivial test data
ones    GREG  #0101010101010101
        LOC   #100
2H      SRU   b,a,45         Isolate next wyde.
        LDO   c,count,b      Load old count.
        INCL  c,1
        STO   c,count,b      Store new count.
        SLU   a,a,16         Delete one wyde.
        PBNZ  a,2B           Done with octabyte?
Phase1  LDOU  a,p,0          Start here: Fetch the next eight bytes.
        INCL  p,8
        BDIF  t,ones,a       Test if there's a zero byte.
        PBZ   t,2B           Do main loop, unless near the end.
2H      SRU   b,a,45         Isolate next wyde.
        LDO   c,count,b      Load old count.
        INCL  c,1
        STO   c,count,b      Store new count.
        SRU   b,t,48
        SLU   a,a,16
        BDIF  t,ones,a
        PBZ   b,2B           Continue unless done.
Phase2  SET   p,8*255        Now get ready to sum rows and columns.
1H      SL    a,p,8
        LDA   a,count,a      a ← address of row p.
        SET   b,8*255
        LDO   c,a,0
        SET   t,p
2H      INCL  t,#800
        LDO   x,count,t      Element of column p
        LDO   y,a,b          Element of row p
        ADD   c,c,x
        ADD   c,c,y
        SUB   b,b,8
        PBP   b,2B
        STO   c,freq,p
        SUB   p,p,8
        PBP   p,1B
        POP
\end{verbatim}

How long is “long”? This two-phase method is inferior to a simple one-phase approach when the string length n is less than $2^{17}$, but it takes only about $10/17$ as much time as the one-phase scheme when $n \approx 10^6$. A slightly faster routine can be obtained by “unrolling” the inner loop, as in the next answer.
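The two-phase idea of answer 25 can be sketched in a few lines of Python. This is an illustration of the counting scheme only, not of the MMIX instruction sequence: the name `byte_frequencies` is hypothetical, the string is assumed to end at the first zero byte, and a string of odd length is padded with that zero so that it splits into wydes.

```python
def byte_frequencies(data: bytes) -> list[int]:
    """Count nonzero bytes of a zero-terminated string via a 256 x 256 wyde table."""
    text = data.split(b"\0", 1)[0]          # the string proper
    if len(text) % 2:
        text += b"\0"                       # last wyde holds the terminator
    # Phase 1: count wydes; count[hi][lo] is indexed by the two bytes of a wyde.
    count = [[0] * 256 for _ in range(256)]
    for k in range(0, len(text), 2):
        count[text[k]][text[k + 1]] += 1
    # Phase 2: frequency of byte p = (row p sum) + (column p sum).
    freq = [0] * 256
    for p in range(1, 256):
        row_sum = sum(count[p])                              # p is the first byte
        col_sum = sum(count[i][p] for i in range(1, 256))    # p is the second byte
        freq[p] = row_sum + col_sum
    return freq


freq = byte_frequencies(b"abracadabra\0abc")
print({chr(p): f for p, f in enumerate(freq) if f})
```

Phase 1 touches each pair of bytes once, which is why it needs only about half as many counting steps as a byte-by-byte loop; Phase 2 costs a fixed amount of work regardless of the string length.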

Another approach, which uses a jump table and keeps the counts in 128 registers, is worthy of consideration when $\mu/v$ is large.

[This problem has a long history. See, for example, Charles P. Bourne and Donald F. Ford, “A study of the statistics of letters in English words,” Information and Control 4 (1961), 48–67.]

26. The wyde-counting trick in the previous solution will backfire if the machine’s primary cache holds fewer than $2^{19}$ bytes, unless comparatively few of the wyde counts are nonzero. Therefore the following program computes only one-byte frequencies. This code avoids stalls, in a conventional pipeline, by never using the result of a LDO in the immediately following instruction.

\begin{verbatim}
Start   LDOU  a,p,0             INCL  c,1
        INCL  p,8               SRU   bb,bb,53
        BDIF  t,ones,a          STO   c,freq,b
        BNZ   t,3F              LDO   c,freq,bb
        SRU   b,a,53            LDOU  a,p,0
        LDO   c,freq,b          INCL  p,8
        SLU   bb,a,8            INCL  c,1
        INCL  c,1               BDIF  t,ones,a
        SRU   bb,bb,53          STO   c,freq,b
        STO   c,freq,bb         SRU   b,b,3
        LDO   c,freq,b          SLU   a,a,8
        ...                     PBNZ  b,3B
        SLU   bb,a,56           POP
\end{verbatim}

Another solution works better on a superscalar machine that issues two instructions simultaneously:

\begin{verbatim}
Start   LDOU  a,p,0             SLU   bbb,a,48
        INCL  p,8               SLU   bbbb,a,56
        BDIF  t,ones,a          INCL  c,1
        SLU   bb,a,8            INCL  cc,1
        BNZ   t,3F              SRU   bbb,bbbb,53
        SRU   b,a,53            SRU   bbb,bbbb,53
        SRU   bb,bb,53          STO   c,freq,b
        LDO   c,freq,b          STO   cc,freq,bb
        LDO   c,freq,bb         LDO   c,freq,bb
        SLU   bbb,a,16          LDO   cc,freq,bb
        SLU   bbbb,a,24         LDOU  a,p,0
        INCL  c,1               INCL  p,8
        INCL  cc,1              INCL  c,1
        SRU   bbb,bb,53         INCL  cc,1
        SRU   bbbb,bbbbb,53     BDIF  t,ones,a
        STO   c,freq,b          SLU   bb,a,8
        STO   cc,freq,bb        STO   c,freq,bb
        LDO   c,freq,bb         STO   cc,freq,bbbbb
        LDO   cc,freq,bbbbb     PBNZ  t,2B
        SLU   b,a,32            SRU   b,a,53
        ...                     ...
\end{verbatim}

In this case we must keep two separate frequency tables (and combine them at the end); otherwise an “aliasing” problem would lead to incorrect results in cases where b and bb both represent the same character.
27. (a)

\begin{verbatim}
t       IS    $255
n       IS    $0
new     GREG
old     GREG
phi     GREG
rt5     GREG
acc     GREG
f       GREG
Main    FLOT  t,5
        FSQRT rt5,t
        FLOT  t,1
        FADD  phi,t,rt5
        INCH  phi,#fff0
        FDIV  acc,phi,rt5
        SET   n,1
        SET   new,1
1H      ADDU  new,new,old
        INCL  n,1
        CMPU  t,new,old
        BN    t,9F
        SUBU  old,new,old
        FMUL  acc,acc,phi
        FIXU  f,acc
        CMP   t,f,new
        PBZ   t,1B
        SET   t,1
9H      TRAP  0,Halt,0
\end{verbatim}

(b)

\begin{verbatim}
t       IS    $255
n       IS    $0
new     GREG
old     GREG
phii    GREG  #9e3779b97f4a7c16
lo      GREG
hi      GREG
hihi    GREG
Main    SET   n,2
        SET   old,1
        SET   new,1
1H      ADDU  new,new,old
        INCL  n,1
        CMPU  t,new,old
        BN    t,9F
        SUBU  old,new,old
        MULU  lo,old,phii
        GET   hi,rH
        ADDU  hi,hi,old
        ADDU  hihi,hi,1
        CSN   hi,lo,hihi
        CMP   t,hi,new
        PBZ   t,1B
        SET   t,1
9H      TRAP  0,Halt,0
\end{verbatim}

Program (a) halts with \( t = 1 \) and \( n = 71 \); the floating point representation of \( \phi \) is slightly high, hence errors ultimately accumulate until \( \phi^{71}/\sqrt{5} \) is approximated by \( F_{71} + .7 \), which rounds to \( F_{71} + 1 \). Program (b) halts with \( t = -1 \) and \( n = 94 \); unsigned overflow occurs before the approximation fails. (Indeed, \( F_{93} < 2^{64} < F_{94} \).)
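The behavior claimed for program (b) can be reproduced with exact integer arithmetic. The Python sketch below is a hedged simulation, not the MMIX program itself: it assumes that the program multiplies the previous Fibonacci number by the 64-bit constant #9e3779b97f4a7c16 (about \( 2^{64}(\phi - 1) \)), adds the integer part, and rounds up when the leading bit of the low half of the product is 1. The names `run_program_b` and `fib` are hypothetical.

```python
MASK = (1 << 64) - 1
PHII = 0x9E3779B97F4A7C16      # constant quoted in the listing for program (b)


def fib(k: int) -> int:
    """Exact Fibonacci number F_k, with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def run_program_b() -> tuple[int, int]:
    """Return (t, n) at the moment the simulated program halts."""
    n, old, new = 2, 1, 1
    while True:
        new = (new + old) & MASK           # next Fibonacci number, mod 2**64
        n += 1
        if new < old:                      # unsigned comparison detects overflow
            return -1, n
        old = (new - old) & MASK           # previous Fibonacci number
        product = old * PHII               # 128-bit product
        lo, hi = product & MASK, product >> 64
        hi = (hi + old) & MASK             # integer part of phi * old
        if lo >> 63:                       # fraction part >= 1/2: round up
            hi = (hi + 1) & MASK
        if hi != new:                      # the rounding rule failed
            return 1, n


print(run_program_b())
print(fib(93) < 2**64 < fib(94))
```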

29. The last man is in position 15. The total time before output is ...

MMIXmasters, please help! What is the neatest program that is analogous to the solution to exercise 1.3.2–22 in the third edition? Also, what would D. Ingalls do in the new situation? (Find a trick analogous to his previous scheme, but do not use self-modifying code.)

An asymptotically faster method appears in exercise 5.1.1–5.

30. Work with scaled numbers, \( R_n = 10^n r_n \). Then \( R_n(1/m) = R \) if and only if \( 10^n/(R + \frac{1}{2}) \leq m < 10^n/(R - \frac{1}{2}) \); thus we find \( m_{k+1} = \lfloor (2 \cdot 10^n - 1)/(2R - 1) \rfloor \).

\begin{verbatim}
* Sum of Rounded Harmonic Series
MaxN    IS    10
a       GREG  0              Accumulator
c       GREG  0              2·10^n
d       GREG  0              Divisor or digit
r       GREG  0              Scaled reciprocal
s       GREG  0              Scaled sum
m       GREG  0              m_k
mm      GREG  0              m_{k+1}
nn      GREG  0              n − MaxN
        LOC   Data_Segment
dec     GREG  @+3            Decimal point location
        BYTE  "   ."
        LOC   #100
Main    NEG   nn,MaxN-1      n ← 1.
        SET   c,20
1H      SET   m,1
        SR    s,c,1          S ← 10^n.
        JMP   2F
3H      SUB   a,c,1
        SL    d,r,1
        SUB   d,d,1
        DIV   mm,a,d
4H      SUB   a,mm,m
        MUL   a,r,a
        ADD   s,s,a
        SET   m,mm           k ← k + 1.
2H      ADD   a,c,m
        2ADDU d,m,2
        DIV   r,a,d
        PBNZ  r,3B
5H      ADD   a,nn,MaxN+1
        SET   d,#a           Newline
        JMP   7F
6H      DIV   s,s,10         Convert digits.
        GET   d,rR
        INCL  d,'0'
7H      STB   d,dec,a
        SUB   a,a,1
        BZ    a,@-4
        PBNZ  s,6B
8H      SUB   $255,dec,3
        TRAP  0,Fputs,StdOut
9H      INCL  nn,1           n ← n + 1.
        MUL   c,c,10
        PBNP  nn,1B
        TRAP  0,Halt,0
\end{verbatim}

The outputs are respectively 3.7, 6.13, 8.445, 10.7504, 13.05357, 15.356255, 17.6588268, 19.96140681, 22.263991769, 24.5665766342, in \( 82\mu + 40659359\upsilon \). The calculation would work for \( n \) up to 17 without overflow, but the running time is of order \( 10^{n/2} \). (We could save about half the time by calculating \( R_n(1/m) \) directly when \( m < 10^{n/2} \), and by using the fact that \( R_n(m_{k+1}) = R_n(m_k - 1) \) for larger values of \( m \).)
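The grouping idea of answer 30 can be compared with a term-by-term summation in Python. The sketch below assumes, as the inequality in the answer implies, that a reciprocal lying exactly halfway between two representable values is rounded downward; the function names are hypothetical.

```python
def scaled_sum_brute(n: int) -> int:
    """Sum the rounded scaled reciprocals R_n(1/m) one term at a time."""
    c = 2 * 10**n
    total, m = 0, 1
    while True:
        r = (c + m - 1) // (2 * m)     # 10^n / m rounded to nearest, ties downward
        if r == 0:
            return total               # all later terms are zero too
        total += r
        m += 1


def scaled_sum_grouped(n: int) -> int:
    """Same sum, adding each run of equal terms with one multiplication."""
    c = 2 * 10**n
    m, s = 1, c // 2                   # the term for m = 1 is 10^n
    while True:
        r = (c + m) // (2 * m + 2)     # rounded reciprocal of m + 1
        if r == 0:
            return s
        mm = (c - 1) // (2 * r - 1)    # last m whose rounded reciprocal is r
        s += (mm - m) * r
        m = mm


def show(n: int) -> str:
    """Format the scaled sum with n decimal places."""
    s = scaled_sum_grouped(n)
    return f"{s // 10**n}.{s % 10**n:0{n}d}"


print([show(n) for n in range(1, 6)])
```

The grouped version performs about \( 10^{n/2} \) iterations instead of \( 2 \cdot 10^n \), which matches the running-time estimate quoted above.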

31. Let \( N = \lfloor 2 \cdot 10^n / (2m + 1) \rfloor \). Then \( S_n = H_N + O(N/10^n) + \sum_{k=1}^m \left( \lfloor 2 \cdot 10^n / (2k - 1) \rfloor - \lfloor 2 \cdot 10^n / (2k + 1) \rfloor \right) k/10^n = H_N + O(m^{-1}) + O(m/10^n) - 1 + 2H_{2m} - H_m = n \ln 10 + 2\gamma - 1 + 2 \ln 2 + O(10^{-n/2}) \) if we sum by parts and set \( m \approx 10^{n/2} \).

Our approximation to \( S_{10} \) is \( \approx 24.5665766209 \), which is closer than predicted.

32. To make the problem more challenging, the following ingenious solution due in part to —— uses a lot of trickery in order to reduce execution time. Can the reader squeeze out any more nanoseconds?

MMIXmasters: Please help fill in the blanks! Note, for example, that remainders mod 7, 19, and 30 are most rapidly computed by FREM; division by 100 can be reduced to multiplication by \( 1/100 + \epsilon \) (see exercise 1.3.1’–19); etc.

[To calculate Easter in years \( \leq 1582 \), see CACM 5 (1962), 209–210. The first systematic algorithm for calculating the date of Easter was the \text{canon paschalis} due to Victorinus of Aquitania (A.D. 457). There are many indications that the sole nontrivial application of arithmetic in Europe during the Middle Ages was the calculation of Easter date, hence such algorithms are historically significant. See \text{Puzzles and Paradoxes} by T. H. O’Beirne (London: Oxford University Press, 1965), Chapter 10, for further commentary; and see the book \text{Calendrical Calculations} by E. M. Reingold and N. Dershowitz (Cambridge Univ. Press, 2001) for date-oriented algorithms of all kinds.]

33. The first such year is A.D. 10317, although the error almost leads to failure in A.D. \( 10108 + 19k \) for \( 0 \leq k \leq 10 \).

Incidentally, T. H. O’Beirne pointed out that the date of Easter repeats with a period of exactly 5,700,000 years. Calculations by Robert Hill show that the most common date is April 19 (220400 times per period), while the earliest and least common is March 22 (27550 times); the latest, and next-to-least common, is April 25 (42000 times). Hill found a nice explanation for the curious fact that the number of times any particular day occurs in the period is always a multiple of 25.

34. The following program follows the protocol to within a dozen or so \( \upsilon \); this is more than sufficiently accurate, since \( \rho \) is typically more than \( 10^8 \), and \( \rho\upsilon = 1 \text{ sec} \). All computation takes place in registers, except when a byte is input.

\begin{verbatim}
* Traffic Signal Problem
rho         GREG  250000000            Assume 250 MHz clock rate
t           IS    $255
Sensor_Buf  IS    Data_Segment
            GREG  Sensor_Buf
            LOC   #100
Lights      IS    3                    Handle for /dev/lights
Sensor      IS    4                    Handle for /dev/sensor
Lights_Name BYTE  "/dev/lights",0
Sensor_Name BYTE  "/dev/sensor",0
Lights_Args OCTA  Lights_Name,BinaryWrite
Sensor_Args OCTA  Sensor_Name,BinaryRead
Read_Sensor OCTA  Sensor_Buf,1
Boulevard   BYTE  #77,0                Green/red, WALK/DON'T
            BYTE  #7f,0                Green/red, DON'T/DON'T
            BYTE  #73,0                Green/red, off/DON'T
            BYTE  #bf,0                Amber/red, DON'T/DON'T
Avenue      BYTE  #dd,0                Red/green, DON'T/WALK
            BYTE  #df,0                Red/green, DON'T/DON'T
            BYTE  #dc,0                Red/green, DON'T/off
            BYTE  #ef,0                Red/amber, DON'T/DON'T
goal        GREG  0                    Transition time for lights
Main        GETA  t,Lights_Args        Open the files: Fopen(Lights,
            TRAP  0,Fopen,Lights         "/dev/lights",BinaryWrite)
            GETA  t,Sensor_Args        Fopen(Sensor,
            TRAP  0,Fopen,Sensor         "/dev/sensor",BinaryRead)
            GET   goal,rC
            JMP   2F
delay_go    GREG
Delay       GET   t,rC                 Subroutine for busy-waiting:
            SUBU  t,t,goal             (N.B. Not CMPU; see below)
            PBN   t,Delay              Repeat until rC passes goal.
            GO    delay_go,delay_go,0  Return to caller.
flash_go    GREG
n           GREG  0                    Iteration counter
green       GREG  0                    Boulevard or Avenue
temp        GREG
Flash       SET   n,8                  Subroutine to flash the lights:
1H          ADD   t,green,2*1
            TRAP  0,Fputs,Lights       DON'T WALK
            ADD   temp,goal,rho
            SR    t,rho,1
            ADDU  goal,goal,t
            GO    delay_go,Delay
            ADD   t,green,2*2
            TRAP  0,Fputs,Lights       (off)
            SET   goal,temp
            GO    delay_go,Delay
            SUB   n,n,1
            PBP   n,1B                 Repeat eight times.
            ADD   t,green,2*1
            TRAP  0,Fputs,Lights       DON'T WALK
            MUL   t,rho,4
            ADDU  goal,goal,t
            GO    delay_go,Delay       Hold for 4 sec.
            ADD   t,green,2*3
            TRAP  0,Fputs,Lights       DON'T WALK, amber
            GO    flash_go,flash_go,0  Return to caller.
Wait        GET   goal,rC              Extend the 18 sec green.
1H          GETA  t,Read_Sensor
            TRAP  0,Fread,Sensor
            LDB   t,Sensor_Buf
            BZ    t,Wait
            GETA  green,Boulevard
            GO    flash_go,Flash
            MUL   t,rho,8
            ADDU  goal,goal,t
            GO    delay_go,Delay       Amber for 8 sec.
            GETA  t,Avenue
            TRAP  0,Fputs,Lights       Green light for Berkeley.
            MUL   t,rho,8
            ADDU  goal,goal,t
            GO    delay_go,Delay
            GETA  green,Avenue
            GO    flash_go,Flash       Finish the avenue cycle.
            GETA  t,Read_Sensor
            TRAP  0,Fread,Sensor       Ignore sensor during green time.
            MUL   t,rho,5
            ADDU  goal,goal,t
            GO    delay_go,Delay       Amber for 5 sec.
2H          GETA  t,Boulevard
            TRAP  0,Fputs,Lights       Green light for Del Mar.
            MUL   t,rho,18
            ADDU  goal,goal,t
            GO    delay_go,Delay       At least 18 sec to WALK.
            JMP   1B
\end{verbatim}

The SUBU instruction in the Delay subroutine is an interesting example of a case where the comparison should be done with SUBU, not with CMPU, in spite of the comments in exercise 1.3.1'-22. The reason is that the two quantities being compared, rC and goal, “wrap around” modulo 2^64.
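
The point can be checked with a small sketch in Python (not MMIX code; the helper names are invented for this illustration, and rC is modeled as a 64-bit counter). It sets a goal shortly before the counter wraps and compares the signed-difference test used by Delay with a plain unsigned comparison:

```python
MASK = (1 << 64) - 1  # assumption: rC and goal are 64-bit quantities


def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return x - (1 << 64) if x >> 63 else x


def reached_by_subu(rc, goal):
    """Delay's test: SUBU t,rC,goal, then keep looping while t is negative."""
    return to_signed((rc - goal) & MASK) >= 0


def reached_by_cmpu(rc, goal):
    """The tempting alternative: an unsigned comparison rC >= goal."""
    return rc >= goal


# A goal set 100 ticks ahead, shortly before the counter wraps around.
start = MASK - 5              # current value of rC
goal = (start + 100) & MASK   # wraps around to the small value 94

assert not reached_by_subu(start, goal)             # keeps waiting, as it should
assert reached_by_cmpu(start, goal)                 # would stop waiting at once
assert reached_by_subu((start + 100) & MASK, goal)  # done after 100 ticks
```

The signed difference stays meaningful as long as the two values are less than 2^63 apart, which is why the subtraction is the safer test for a free-running counter.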

SECTION 1.4.1'

1. j GREG ;m GREG ;kk GREG ;xk GREG ;rr GREG
   GREG @ Base address
   GoMax SET $2,1 Special entrance for r = 1
   GoMaxR SL rr,$2,3 Multiply arguments by 8.
   SL kk,$1,3
   LDO m,x0,kk
   ...
   (Continue as in (1))
   5H SUB kk,kk,rr k ← k − r.
   PBP kk,3B Repeat if k > 0.
   6H GO kk,$0,0 Return to caller.

The calling sequence for the general case is SET $2,r; SET $1,n; GO $0,GoMaxR.

2. j IS $0 ;m IS $1 ;kk IS $2 ;xk IS $3 ;rr IS $4
   Max100 SET $0,100 Special entrance for n = 100 and r = 1
   Max SET $1,1 Special entrance for r = 1
   MaxR SL rr,$1,3 Multiply arguments by 8.
   SL kk,$0,3
   LDO m,x0,kk
   ...
   (Continue as in (1))
   5H SUB kk,kk,rr k ← k − r.
   PBP kk,3B Repeat if k > 0.
   6H POP 2,0 Return to caller.

In this case the general calling sequence is SET $A1,r; SET $A0,n; PUSHJ $R,MaxR.

3. Just Sub ...; GO $0,$0,0. The local variables can be kept entirely in registers.

4. **PUSHJ $X,RA** has a relative address, allowing us to jump to any subroutine within ±2^18 bytes of our current location. **PUSHGO $X,$Y,$Z** or **PUSHGO $X,A** has an absolute address, allowing us to jump to any desired place.

5. True. There are 256 – $G$ globals and $L$ locals.

6. **$5 ← rD** and **rR ← 0** and **rL ← 6**. All other newly local registers are also set to zero; for example, if rL was 3, this DIVU instruction would set $3 ← 0 and $4 ← 0.

7. $0 ← 0, ..., $4 ← 0, $5 ← #abcd0000, rL ← 6.

8. Usually such an instruction has no essential impact, except that context switching with **SAVE** and **UNSAVE** generally takes longer when fewer marginal registers are present. However, an important difference can arise in certain scenarios. For example, a subsequent **PUSHJ $255,Sub** followed by **POP 1,0** would leave a result in $P$ instead of $L$.

9. **PUSHJ $255,Handler** will make at least 32 marginal registers available (because G ≥ 32); then **POP 0** will restore the previous local registers, and two additional instructions "GET $255,rB; RESUME" will restart the program as if nothing had happened.

10. Basically true. **MMIX** will start a program with rG set to 255 minus the number of assembled GREG operations, and with rL set to 2. Then, in the absence of **PUSHJ**, **PUSHGO**, **POP**, **SAVE**, **UNSAVE**, **GET**, and **PUT**, the value of rG will never change. The value of rL will increase if the program puts anything into $2, $3, ..., or $(rG − 1), but the effect will be the same as if all registers were equivalent. The only register with slightly different behavior is $255, which is affected by trip interrupts and used for communication in I/O traps. We could permute register numbers $2, $3, ..., $254 arbitrarily in any **PUSH/POP/SAVE/UNSAVE/RESUME-free** program that does not GET rL or PUT anything into rL or rG; the permuted program would produce identical results.

    The distinction between local, global, and marginal is irrelevant also with respect to **SAVE**, **UNSAVE**, and **RESUME**, in the absence of **PUSH** and **POP**, except that the destination register of **SAVE** must be global and the destination register of certain instructions inserted by **RESUME** mustn’t be marginal (see exercise 1.4.3–14).

11. The machine tries to access virtual address #5ffffffffffffff8, which is just below the stack segment. Nothing has been stored there, so a “page fault” occurs and the operating system aborts the program.

    (The behavior is, however, much more bizarre if a **POP** is given just after a **SAVE**, because **SAVE** essentially begins a new register stack immediately following the saved context. Anybody who tries such things is asking for trouble.)

12. (a) True. (Similarly, the name of the current “working directory” in a UNIX shell always begins with a slash.) (b) False. But confusion can arise if such prefixes are defined, so their use is discouraged. (c) False. (In this respect **MMIX**’s structured symbols are not analogous to UNIX directory names.)

13. Fib   CMP   $1,$0,2      Fib1  CMP   $1,$0,2      Fib2  CMP   $1,$0,1
          PBN   $1,1F              BN    $1,1F              BNP   $1,1F
          GET   $1,rJ              SUB   $2,$0,1            SUB   $2,$0,1
          SUB   $3,$0,1            SET   $0,1               SET   $0,0
          PUSHJ $2,Fib             SET   $1,0         2H    ADDU  $0,$0,$1
          SUB   $4,$0,2      2H    ADDU  $0,$0,$1           ADDU  $1,$0,$1
          PUSHJ $3,Fib             SUBU  $1,$0,$1           SUB   $2,$2,2
          ADDU  $0,$2,$3           SUB   $2,$2,1            PBP   $2,2B
          PUT   rJ,$1              PBNZ  $2,2B              CSZ   $0,$2,$1
    1H    POP   1,0          1H    POP   1,0          1H    POP   1,0

Here Fib2 is a faster alternative to Fib1. In each case the calling sequence has the form
"SET $A,n; PUSHJ $R,Fib...", where A = R + 1.

14. Mathematical induction shows that the POP instruction in Fib is executed exactly
\( 2F_{n+1} - 1 \) times and the ADDU instruction is executed \( F_{n+1} - 1 \) times. The instruction
at 2H is performed \( n - [n \neq 0] \) times in Fib1, \( \lfloor n/2 \rfloor \) times in Fib2. Thus the total cost,
including the two instructions in the calling sequence, comes to \( (19F_{n+1} - 12)v \) for Fib,
\( (4n + 8)v \) for Fib1, and \( (4\lfloor n/2 \rfloor + 12)v \) for Fib2, assuming that \( n > 1 \).

(The recursive subroutine Fib is a terrible way to compute Fibonacci numbers,
because it forgets the values it has already computed. It spends more than \( 10^{22}v \) units
of time just to compute \( F_{100} \).)
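
The counts claimed in this answer can be checked mechanically. The following Python sketch (an illustration only; the function names are invented here) mirrors the recursive scheme, charging one POP per call and one ADDU per call with n ≥ 2, and compares the tallies with the closed forms. It also evaluates the three cost formulas, which the answer states for n > 1.

```python
def fib(n):
    """Iterative Fibonacci number F_n, used only as a reference value."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def count_recursive(n):
    """Return (pops, addus) for the doubly recursive scheme.

    Every call ends with one POP; a call with n >= 2 also performs one ADDU
    after its two recursive calls.
    """
    if n < 2:
        return 1, 0
    pops1, addus1 = count_recursive(n - 1)
    pops2, addus2 = count_recursive(n - 2)
    return pops1 + pops2 + 1, addus1 + addus2 + 1


for n in range(20):
    pops, addus = count_recursive(n)
    assert pops == 2 * fib(n + 1) - 1
    assert addus == fib(n + 1) - 1


def cost_fib(n):
    """Cost of Fib in units of v, from the formula in the answer (n > 1)."""
    return 19 * fib(n + 1) - 12


def cost_fib1(n):
    """Cost of Fib1 in units of v (n > 1)."""
    return 4 * n + 8


def cost_fib2(n):
    """Cost of Fib2 in units of v (n > 1)."""
    return 4 * (n // 2) + 12


# The recursive version is hopeless for n = 100; the iterative ones are not.
assert cost_fib(100) > 10**22
assert cost_fib1(100) == 408 and cost_fib2(100) == 212
```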

15. n    GREG                   GO    $0,Fib
    fn   IS    n                STO   fn,fp,24
         GREG  @                LDO   n,fp,16
    Fib  CMP   $1,n,2           SUB   n,n,2
         PBN   $1,1F            GO    $0,Fib
         STO   fp,sp,0          LDO   $0,fp,24
         SET   fp,sp            ADDU  fn,fn,$0
         INCL  sp,8*4           LDO   $0,fp,8
         STO   $0,fp,8          SET   sp,fp
         STO   n,fp,16          LDO   fp,sp,0
         SUB   n,n,1        1H  GO    $0,$0,0

(The left column is executed first; the right column continues it.) The calling sequence is SET n,n; GO $0,Fib; the answer is returned in global register fn.
The running time comes to \( (8F_{n+1} - 8)\mu + (32F_{n+1} - 23)v \), so the ratio between this
version and the register stack subroutine of exercise 13 is approximately \( (8\mu/v + 32)/19 \).
Although exercise 14 points out that we shouldn’t really calculate Fibonacci numbers
recursively, this analysis does demonstrate the advantage of a register stack. Even if
we are generous and assume that \( \mu = v \), the memory stack costs more than twice as
much in this example. A similar behavior occurs with respect to other subroutines,
but the analysis for Fib is particularly simple.

In the special case of Fib we can do without the frame pointer, because fp is
always a fixed distance from sp. A memory-stack subroutine based on this observation
runs about \( (6\mu/v + 29)/19 \) slower than the register-stack version; it’s better than the
version with general frames, but still not very good.
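
To make the comparison concrete, here is a short Python sketch (illustrative only; the function names are invented) that evaluates the two ratios quoted above for a given value of μ/v, using exact fractions:

```python
from fractions import Fraction


def general_frame_ratio(mu_over_v):
    """Memory stack with frame pointer versus register stack: (8*mu/v + 32)/19."""
    return (8 * Fraction(mu_over_v) + 32) / 19


def fixed_offset_ratio(mu_over_v):
    """Memory stack without frame pointer versus register stack: (6*mu/v + 29)/19."""
    return (6 * Fraction(mu_over_v) + 29) / 19


# The generous assumption mu = v from the text.
assert general_frame_ratio(1) == Fraction(40, 19)   # a little more than 2
assert fixed_offset_ratio(1) == Fraction(35, 19)    # roughly 1.84

# A more expensive memory reference widens the gap.
assert general_frame_ratio(10) == Fraction(112, 19)
```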

16. This is an ideal setup for a subroutine with two exits. Let’s assume for convenience that
B and C do not return any value, and that they each save rJ in $1 (because they are
not leaf subroutines). Then we can proceed as follows: A calls B by saying PUSHJ $R,B
as usual. B calls C by saying PUSHJ $R,C; PUT rJ,$1; POP 0,0 (with perhaps a different
value of R than used by subroutine A). C calls itself by saying PUSHJ $R,C; PUT rJ,$1; POP 0,0
(with perhaps a different value of R than used by B). C jumps to A by saying
PUT rJ,$1; POP 0,0. C exits normally by saying PUT rJ,$1; POP 0,2. (A POP with YZ = 2
returns two instructions past the usual point, thereby skipping the caller's PUT rJ,$1; POP 0,0.)

Extensions of this idea, in which values are returned and an arbitrary jump address
can be part of the returned information, are clearly possible. Similar schemes apply to
the GO-oriented memory stack protocol of (15).

**SECTION 1.4.2**

1. If one coroutine calls the other only once, it is nothing but a subroutine; so we
need an application in which each coroutine calls the other in at least two distinct
places. Even then, it is often easy to set some sort of switch or to use some property
of the data, so that upon entry to a fixed place within one coroutine it is possible to branch to one of two desired places; again, nothing more than a subroutine would be required. Coroutines become correspondingly more useful as the number of references between them grows larger.

2. The first character found by In would be lost.

3. This is an MMIXAL trick to make OutBuf contain fifteen tetrabytes TETRA ' ', followed by TETRA #a, followed by zero; and TETRA ' ' is equivalent to BYTE 0,0,0,' '. The output buffer is therefore set up to receive a line of 16 three-character groups separated by blank spaces.
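
The layout can be pictured with a small Python sketch (an illustration, not part of the program; the function names are invented, and the final zero is modeled as one more tetrabyte, which is an assumption). Each big-endian tetrabyte leaves three zero bytes in front of its blank or newline, and those three bytes are where a character group goes:

```python
import struct


def make_outbuf():
    """Fifteen tetrabytes ' ', then tetrabyte #a, then a zero tetrabyte."""
    blank = struct.pack(">I", ord(" "))   # same bytes as BYTE 0,0,0,' '
    newline = struct.pack(">I", 0x0A)     # same bytes as BYTE 0,0,0,#a
    zero = struct.pack(">I", 0)           # assumed terminator
    return blank * 15 + newline + zero


def fill_groups(buf, groups):
    """Copy each three-character group into the zero bytes of its tetrabyte."""
    out = bytearray(buf)
    for i, group in enumerate(groups):
        out[4 * i:4 * i + 3] = group.encode("ascii")
    return bytes(out)


buf = make_outbuf()
assert buf[:4] == b"\x00\x00\x00 "

line = fill_groups(buf, ["abc"] * 16)
# Sixteen groups, separated by blanks, ending with a newline.
assert line[:64] == b"abc " * 15 + b"abc\n"
```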

4. If we include the code

```
rR_A GREG
rR_B GREG
GREG @
A GET rR_B,rR
PUT rR,rR_A
GO t,a,0
B GET rR_A,rR
PUT rR,rR_B
GO t,b,0
```

then A can invoke B by "GO a,B" and B can invoke A by "GO b,A".

5. If we include the code

```
a GREG
b GREG
GREG 0
A GET b,rJ
PUT rJ,a
POP 0
B GET a,rJ
PUT rJ,b
POP 0
```

then A can invoke B by “PUSHJ $255,B” and B can invoke A by “PUSHJ $255,A”. Notice the similarity between this answer and the previous one. The coroutines should not use the register stack for other purposes except as permitted by the following exercise.

6. Suppose coroutine A has something in the register stack when invoking B. Then B is obliged to return the stack to the same state before returning to A, although B might push and pop any number of items in the meantime.

Coroutines might, of course, be sufficiently complicated that they each do require a register stack of their own. In such cases MMIX’s SAVE and UNSAVE operations can be used, with care, to save and restore the context needed by each coroutine.

**SECTION 1.4.3**

1. (a) SRU x,y,z; BYTE 0,1,0,#29.  (b) PBP x,PBTaken+@-O; BYTE 0,3,0,#50.  (c) MUX x,y,z; BYTE 0,1,0,#29.  (d) ADDU x,x,z; BYTE 0,1,0,#30.

2. The running time of MemFind is $9v + (2\mu + 8v)C + (3\mu + 6v)U + (2\mu + 11v)A$, where $C$ is the number of key comparisons on line 042, $U = [\text{key} \neq \text{curkey}]$, and $A = [\text{new node needed}]$. The running time of GetReg is $\mu + 6v + 6vL$, where $L = [\text{\$}k \text{ is local}]$.
If we assume that \( C = U = A = L = 0 \) on each call, the time for simulation can be broken down as follows:

<table>
<thead>
<tr>
<th>Action</th>
<th>(a)</th>
<th>(b)</th>
<th>(c)</th>
</tr>
</thead>
<tbody>
<tr>
<td>fetching (lines 105–115)</td>
<td>( \mu + 17v )</td>
<td>( \mu + 17v )</td>
<td>( \mu + 17v )</td>
</tr>
<tr>
<td>unpacking (lines 141–153)</td>
<td>( \mu + 12v )</td>
<td>( \mu + 12v )</td>
<td>( \mu + 12v )</td>
</tr>
<tr>
<td>relating (lines 154–164)</td>
<td>( 2v )</td>
<td>( 2v )</td>
<td>( 9v )</td>
</tr>
<tr>
<td>installing X (lines 174–182)</td>
<td>( \mu + 13v )</td>
<td>( \mu + 13v )</td>
<td>( \mu + 13v )</td>
</tr>
<tr>
<td>installing Z (lines 183–197)</td>
<td>( 7v )</td>
<td>( \mu + 17v )</td>
<td>( \mu + 17v )</td>
</tr>
<tr>
<td>installing Y (lines 198–207)</td>
<td>( \mu + 13v )</td>
<td>( \mu + 13v )</td>
<td>( \mu + 13v )</td>
</tr>
<tr>
<td>destining (lines 208–231)</td>
<td>( 8v )</td>
<td>( 6v )</td>
<td>( 6v )</td>
</tr>
<tr>
<td>resuming (lines 232–242)</td>
<td>( 14v )</td>
<td>( \mu + 14v )</td>
<td>( 16v - \pi )</td>
</tr>
<tr>
<td>postprocessing (lines 243–539)</td>
<td>( \mu + 10v )</td>
<td>( 11v )</td>
<td>( 11v - 4\pi )</td>
</tr>
<tr>
<td>updating (lines 540–548)</td>
<td>( 5v )</td>
<td>( 5v )</td>
<td>( 5v )</td>
</tr>
<tr>
<td>total</td>
<td>( 5\mu + 101v )</td>
<td>( 5\mu + 120v )</td>
<td>( 3\mu + 105v - 5\pi )</td>
</tr>
</tbody>
</table>

To these times we must add \( 6v \) for each occurrence of a local register as a source, plus penalties for the times when \texttt{MemFind} doesn't immediately have the correct chunk. In case (b), \texttt{MemFind} must miss on line 231, and again on line 111 when fetching the following instruction. (We would be better off with two \texttt{MemFind} routines, one for data and one for instructions.) The most optimistic net cost of (b) is therefore obtained by taking \( C = A = 2 \), for a total running time of \( 13\mu + 158v \). (On long runs of the simulator simulating itself, the empirical average values per call of \texttt{MemFind} were \( C \approx .29 \), \( U \approx .00001 \), \( A \approx .16 \).)
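
The arithmetic behind the figure for case (b) can be reproduced as follows. This Python sketch (illustrative; the names are invented) represents a running time as a pair (μ count, v count), takes the MemFind formula from the start of this answer, and adds the penalty for C = A = 2 to the tabulated total for (b), which already includes the 9v baseline of each call.

```python
def memfind_extra(C, U, A):
    """Extra (mu, v) cost of MemFind beyond its 9v baseline.

    From the formula 9v + (2mu + 8v)C + (3mu + 6v)U + (2mu + 11v)A.
    """
    mu = 2 * C + 3 * U + 2 * A
    v = 8 * C + 6 * U + 11 * A
    return mu, v


def add_cost(x, y):
    """Add two (mu, v) pairs componentwise."""
    return x[0] + y[0], x[1] + y[1]


base_b = (5, 120)                        # tabulated total for case (b)
penalty = memfind_extra(C=2, U=0, A=2)   # the two unavoidable misses
assert penalty == (8, 38)
assert add_cost(base_b, penalty) == (13, 158)   # 13 mu + 158 v
```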

3. We have \( \beta = \gamma \) and \( L > 0 \) on line 097. Thus \( \alpha = \gamma \) can arise, but only in extreme circumstances when \( L = 256 \) (see line 268 and exercise 11). Luckily \( L \) will soon become 0 in that case.

4. No problem can occur until a node invades the pool segment, which begins at address \( \#4000000000000000 \); then remnants of the command line might interfere with the program's assumption that a newly allocated node is initially zero. But the data segment is able to accommodate \( \lfloor (2^{61} - 2^{12} - 2^4)/(2^{12} + 24) \rfloor = 559,670,633,304,293 \) nodes, so we will not live long enough to experience any problem from this “bug.”
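
The node count is a single integer division, easy to confirm with exact arithmetic. In this Python sketch the variable names are descriptive guesses at what the terms of the formula stand for, not names from the simulator:

```python
def max_nodes():
    """Evaluate floor((2^61 - 2^12 - 2^4) / (2^12 + 24)) exactly."""
    available_bytes = 2**61 - 2**12 - 2**4   # numerator of the formula
    bytes_per_node = 2**12 + 24              # denominator of the formula
    return available_bytes // bytes_per_node


assert max_nodes() == 559_670_633_304_293
```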

5. Line 218 calls \texttt{StackRoom} calls \texttt{StackStore} calls \texttt{MemFind}; this is as deep as it gets. Line 218 has pushed 3 registers down; \texttt{StackRoom} has pushed only 2 (since \( rL = 1 \) on line 097); \texttt{StackStore} has pushed 3. The value of \( rL \) on line 032 is 2 (although \( rL \) increases to 5 on line 034). Hence the register stack contains \( 3 + 2 + 3 + 2 = 10 \) unpopped items in the worst case.

The program halts shortly after branching to \texttt{Error}; and even if it were to continue, the extra garbage at the bottom of the stack won't hurt anything—we could simply ignore it. However, we could clear the stack by providing second exits as in exercise 1.4.1–16. A simpler way to flush an entire stack is to pop repeatedly until \( rO \) equals its initial value, \texttt{Stack_Segment}.

6. 247 Div DIV x,y,z Divide y by z, signed.
    248 JMP 1F
    249 DivU PUT rD,x Put simulated rD into real rD.
    250 DIVU x,y,z Divide y by z, unsigned.
    251 1H GET t,rR
    252 STO t,g,8*rR g[rR] ← remainder.
    253 JMP XDone Finish by storing x.

7. (The following instructions should be inserted between line 309 of the text and the Info table, together with the answers to the next several exercises.)

Cswap LDOU z,g,8*rP
      LDOU y,res,0
      CMPU t,y,z
      BNZ  t,1F          Branch if M8[A] ≠ g[rP].
      STOU x,res,0       Otherwise set M8[A] ← $X.
      JMP  2F
1H    STOU y,g,8*rP      Set g[rP] ← M8[A].
2H    ZSZ  x,t,1         x ← result of equality test.
      JMP  XDone         Finish by storing x.

8. Here we store the simulated registers that we’re keeping in actual registers. (This approach is better than a 32-way branch to see which register is being gotten; it’s also better than the alternative of storing the registers every time we change them.)

Get   CMPU t,yz,32
      BNN  t,Error       Make sure that YZ < 32.
      STOU ii,g,8*rI     Put the correct value into g[rI].
      STOU cc,g,8*rC     Put the correct value into g[rC].
      STOU oo,g,8*rO     Put the correct value into g[rO].
      STOU ss,g,8*rS     Put the correct value into g[rS].
      STOU uu,g,8*rU     Put the correct value into g[rU].
      STOU aa,g,8*rA     Put the correct value into g[rA].
      SR   t,ll,3
      STOU t,g,8*rL      Put the correct value into g[rL].
      SR   t,gg,3
      STOU t,g,8*rG      Put the correct value into g[rG].
      SLU  t,zz,3
      LDOU x,g,t         Set x ← g[Z].
      JMP  XDone         Finish by storing x.

9. Put   BNZ  yy,Error      Make sure that Y = 0.
      CMPU t,xx,32
      BNN  t,Error       Make sure that X < 32.
      CMPU t,xx,rC
      BN   t,PutOK       Branch if X < 8.
      CMPU t,xx,rF
      BN   t,1F          Branch if X < 22.
PutOK STOU z,g,xxx       Set g[X] ← z.
      JMP  Update        Finish the command.
1H    CMPU t,xx,rG
      BN   t,Error       Branch if X < 19.
      SUB  t,xx,rL
      PBP  t,PutA        Branch if X = rA.
      BN   t,PutG        Branch if X = rG.
PutL  SLU  z,z,3         Otherwise X = rL.
      CMPU t,z,ll
      CSN  ll,t,z        Set rL ← min(z, rL).
      JMP  Update        Finish the command.
0H    GREG #40000
PutA  CMPU t,z,0B
      BNN  t,Error       Make sure z ≤ #3ffff.
      SET  aa,z          Set rA ← z.
      JMP  Update        Finish the command.
PutG  SRU  t,z,8
      BNZ  t,Error       Make sure z < 256.
      CMPU t,z,32
      BN   t,Error       Make sure z ≥ 32.
      SLU  z,z,3
      CMPU t,z,ll
      BN   t,Error       Make sure z ≥ rL.
      JMP  2F
1H    SUBU gg,gg,8       G ← G − 1. ($G becomes global.)
      STCO 0,g,gg        g[G] ← 0. (Compare with line 216.)
2H    CMPU t,z,gg
      PBN  t,1B          Branch if z < G.
      SET  gg,z          Set rG ← z.
      JMP  Update        Finish the command.

In this case the nine commands that branch to either PutOK, PutA, PutG, PutL, or
Error are tedious, yet still preferable to a 32-way switching table.

10. Pop  SUBU oo,oo,8
      BZ   xx,1F         Branch if X = 0.
      CMPU t,ll,xxx
      BN   t,1F          Branch if X > L.
      ADDU t,xxx,oo
      AND  t,t,lring_mask
      LDOU y,l,t         y ← result to return.
1H    CMPU t,oo,ss
      PBNN t,1F          Branch unless α = γ.
      PUSHJ 0,StackLoad
1H    AND  t,oo,lring_mask
      LDOU z,l,t         z ← number of additional registers to pop.
      AND  z,z,#ff       Make sure z ≤ 255 (in case of weird error).
      SLU  z,z,3
1H    SUBU t,oo,ss
      CMPU t,t,z
      PBNN t,1F          Branch unless z registers not all in the ring.
      PUSHJ 0,StackLoad  (See note below.)
      JMP  1B            Repeat until all necessary registers are loaded.
1H    ADDU ll,ll,8
      CMPU t,xxx,ll
      CSN  ll,t,xxx      Set L ← min(X, L + 1).
      ADDU ll,ll,z       Then increase L by z.
      CMPU t,gg,ll
      CSN  ll,t,gg       Set L ← min(L, G).
      CMPU t,z,ll
      BNN  t,1F          Branch if returned result should be discarded.
      AND  t,oo,lring_mask
      STOU y,l,t         Otherwise set l[(α − 1) mod ρ] ← y.

Here it is convenient to decrease oo in two steps, first by 8 and then by 8 times z. The program is complicated in general, but in most cases comparatively little computation actually needs to be done. If β = γ when the second StackLoad call is given, we implicitly decrease β by 1 (thereby discarding the topmost item of the register stack). That item will not be needed unless it is the value being returned, but the latter value has already been placed in y.

11. The Save routine:

```
Save  BNZ   yz,Error
      CMPU  t,xxx,gg
      BN    t,Error
      ADDU  t,oo,ll
      AND   t,t,lring_mask
      SRU   y,ll,3
      STOU  y,l,t
      INCL  ll,8
      PUSHJ 0,StackRoom
      ADDU  oo,oo,ll
      SET   ll,0          Push down all local registers and set rL ← 0.
1H    PUSHJ 0,StackStore
      CMPU  t,ss,oo
      PBNZ  t,1B
      SUBU  y,gg,8        Set k ← G − 1. (Here y ≡ 8k.)
4H    ADDU  y,y,8         Increase k by 1.
1H    SET   arg,ss
      PUSHJ res,MemFind
      CMPU  t,y,8*(rZ+1)
      LDOU  z,g,y
      PBNZ  t,2F
      SLU   z,gg,56-3
      ADDU  z,z,aa        If k = rZ + 1, set z ← 2^56 rG + rA.
2H    STOU  z,res,0       Store z in M8[rS].
      INCL  ss,8
      BZ    t,1F          Branch if we just stored rG and rA.
      CMPU  t,y,c255
      BZ    t,2F          Branch if we just stored $255.
      CMPU  t,y,8*rR
      PBNZ  t,4B          Branch unless we just stored rR.
      SET   y,8*rP        Set k ← rP.
      JMP   1B
2H    SET   y,8*rB        Set k ← rB.
      JMP   1B
1H    SET   oo,ss
      SUBU  x,oo,8
      JMP   XDone         Finish by storing x.
```

(The special registers saved are those with codes 0–6 and 23–27, plus (rG, rA).)
12. Unsave BNZ xx,Error     Make sure X = 0.
      BNZ  yy,Error      Make sure Y = 0.
      ANDNL z,#7         Make sure z is a multiple of 8.
      ADDU ss,z,8        Set rS ← z + 8.
      SET  y,8*(rZ+2)    Set k ← rZ + 2. (y ≡ 8k)
1H    SUBU y,y,8         Decrease k by 1.
4H    SUBU ss,ss,8       Decrease rS by 8.
      SET  arg,ss
      PUSHJ res,MemFind
      LDOU x,res,0       Set x ← M8[rS].
      CMPU t,y,8*(rZ+1)
      PBNZ t,2F
      SRU  gg,x,56-3     If k = rZ + 1, initialize rG and rA.
      SLU  aa,x,64-18
      SRU  aa,aa,64-18
      JMP  1B
2H    STOU x,g,y         Otherwise set g[k] ← x.
3H    CMPU t,y,8*rP
      CSZ  y,t,8*(rR+1)  If k = rP, set k ← rR + 1.
      CSZ  y,y,c256      If k = rB, set k ← 256.
      CMPU t,y,gg
      PBNZ t,1B          Repeat the loop unless k = G.
      PUSHJ 0,StackLoad
      AND  t,ss,lring_mask
      LDOU x,l,t         x ← the number of local registers.
      AND  x,x,#ff       Make sure x ≤ 255 (in case of weird error).
      BZ   x,1F
      SET  y,x           Now load x local registers into the ring.
2H    PUSHJ 0,StackLoad
      SUBU y,y,1
      PBNZ y,2B
      SLU  x,x,3
1H    SET  ll,x
      CMPU t,gg,x
      CSN  ll,t,gg       Set rL ← min(x, rG).
      SET  oo,ss         Set rO ← rS.
      PBNZ uu,Update     Branch, if not the first time.
      BZ   resuming,Update  Branch, if first command is UNSAVE.
      JMP  AllDone       Otherwise clear resuming and finish.

---

A straightforward answer is as good as a kiss of friendship.

― Proverbs 24:26
13. 517   SET  xx,0
      SLU  t,t,55        Loop to find highest trip bit.
2H    INCL xx,1
      SLU  t,t,1
      PBNN t,2B
      SET  t,#100        Now xx = index of trip bit.
      SRU  t,t,xx        t ← corresponding event bit.
      ANDN exc,exc,t     Remove t from exc.
TakeTrip STOU inst_ptr,g,8*rW
      SLU  inst_ptr,xx,4 inst_ptr ← xx ≪ 4.
      INCH inst,#8000
      STOU inst,g,8*rX   g[rX] ← inst + 2^63.
      AND  t,f,Mem_bit
      PBZ  t,1F          Branch if op doesn't access memory.
      ADDU y,y,z         Otherwise set y ← (y + z) mod 2^64.
      SET  z,x           z ← x.
1H    STOU y,g,8*rY      g[rY] ← y.
      STOU z,g,8*rZ      g[rZ] ← z.
      LDOU t,g,c255
      STOU t,g,8*rB      g[rB] ← g[255].
      LDOU t,g,8*rJ
      STOU t,g,c255      g[255] ← g[rJ].


14. The Resume routine, followed by the replacement for lines 166–173:

```
Resume SLU  t,inst,40
      BNZ  t,Error       Make sure XYZ = 0.
      LDOU inst_ptr,g,8*rW  inst_ptr ← g[rW].
      LDOU x,g,8*rX
      BN   x,Update      Finish the command if rX is negative.
      SRU  xx,x,56       Otherwise let xx be the ropcode.
      SUBU t,xx,2
      BNN  t,1F          Branch if the ropcode is ≥ 2.
      PBZ  xx,2F         Branch if the ropcode is 0.
      SRU  y,x,28        Otherwise the ropcode is 1:
      AND  y,y,#f        y ← k, the leading nybble of the opcode.
      SET  z,1
      SLU  z,z,y         z ← 2^k.
      ANDNL z,#70cf      Zero out the acceptable values of z.
      BNZ  z,Error       Make sure the opcode is “normal.”
1H    BP   t,Error       Make sure the ropcode is ≤ 2.
      SRU  t,x,13
      AND  t,t,c255
      CMPU y,t,ll
      BN   y,2F
      CMPU y,t,gg
      BN   y,Error       Otherwise make sure $X is global.
2H    MOR  t,x,#8
      CMPU t,t,#F9       Make sure the opcode isn’t RESUME.
      BZ   t,Error
      NEG  resuming,xx
      CSNN resuming,resuming,1  Set resuming as specified.
      JMP  Update        Finish the command.

166   LDOU y,g,8*rY      y ← g[rY].
167   LDOU z,g,8*rZ      z ← g[rZ].
168   BOD  resuming,Install_Y  Branch if ropcode was 1.
169 0H GREG #C1<<56+(x-$0)<<48+(z-$0)<<40+1<<16+X_is_dest_bit
170   SET  f,0B          Otherwise change f to an ORI instruction.
171   LDOU exc,g,8*rX
172   MOR  exc,exc,#20   exc ← third-from-left byte of rX.
173   JMP  XDest         Continue as for ORI.
```

15. We need to deal with the fact that the string to be output might be split across two or more chunks of the simulated memory. One solution is to output eight bytes at a time with `Fwrite` until reaching the last octabyte of the string; but that approach is complicated by the fact that the string might start in the middle of an octabyte. Alternatively, we could simply `Fwrite` only one byte at a time; but that would be almost obscenely slow. The following method is much better:

```
SimFputs SET xx,0 (xx will be the number of bytes written)
       SET z,t Set z ← virtual address of string.
1H SET arg,z
       PUSHJ res,MemFind
       SET t,res Set t ← actual address of string.
       GO $0,DoInst (See below.)
       BN t,TrapDone If error occurred, pass the error to user.
       BZ t,1F Branch if the string was empty.
       ADD xx,xx,t Otherwise accumulate the number of bytes.
       ADDU z,z,t Find the address following the string output.
       AND t,z,Mem:mask
       BZ t,1B Continue if string ended at chunk boundary.
1H SET t,xx t ← number of bytes successfully put.
       JMP TrapDone Finish the operation.
```

Here `DoInst` is a little subroutine that inserts `inst` into the instruction stream. We provide it with additional entrances that will be useful in the next answers:

```
GREG 0 Base address
:SimInst LDA t,IOArgs DoInst to IOArgs and return.
       JMP DoInst
SimFinish LDA t,IOArgs DoInst to IOArgs and finish.
SimFclose GETA $0,TrapDone DoInst and finish.
:DoInst PUT rW,$0 Put return address into rW.
       PUT rX,inst Put inst into rX.
       RESUME 0 And do it.
```

16. Again we need to worry about chunk boundaries (see the previous answer), but a byte-at-a-time method is tolerable since file names tend to be fairly short.

```
SimFopen PUSHJ 0,GetArgs (See below.)
       ADDU xx,Mem:alloc,Mem:nodesize
       STOU xx,IOArgs
       SET x,xx (We’ll copy the file name into this open space.)
1H     SET arg,z
       PUSHJ res,MemFind
       LDBU t,res,0
       STBU t,x,0 Copy byte M[z].
       INCL x,1
       INCL z,1
       PBNZ t,1B Repeat until the string has ended.
       GO $0,SimInst Now open the file.
3H     STCO 0,x,0 Now zero out the copied string.
       CMPU z,xx,x
       SUB x,x,8
       PBN z,3B Repeat until it is surely obliterated.
       JMP TrapDone Pass the result t to the user.
```

Here GetArgs is a subroutine that will be useful also in the implementation of other I/O commands. It sets up IOArgs and computes several other useful results in global registers.

:GetArgs GET $0,rJ Save the return address.
SET y,t y ← g[255].
SET arg,t
PUSHJ res,MemFind
LDOU z,res,0 z ← virtual address of first argument.
SET arg,z
PUSHJ res,MemFind
SET x,res x ← internal address of first argument.
STO x,IOArgs
SET xx,Mem:Chunk
AND zz,x,Mem:mask
SUB xx,xx,zz xx ← bytes from x to chunk end.
ADDU arg,y,8
PUSHJ res,MemFind
LDOU zz,res,0 zz ← second argument.
STOU zz,IOArgs+8 Convert IOArgs to internal form.
PUT rJ,$0 Restore the return address.
POP 0

17. This solution, which uses the subroutines above, works also for SimFwrite.

SimFread PUSHJ 0,GetArgs Massage the input arguments.
SET y,zz y ← number of bytes to read.
1H CMP t,xx,y
PBNN t,SimFinish Branch if we can stay in one chunk.
STO xx,IOArgs+8 Oops, we have to work piecewise.
SUB y,y,xx
GO $0,SimInst
BN t,1F Branch if an error occurs.
ADD z,z,xx
SET arg,z
PUSHJ res,MemFind
STOU res,IOArgs Reduce to the previous problem.
STO y,IOArgs+8
ADD xx,Mem:mask,1
JMP 1B
1H SUB t,t,y Compute the correct number of missing bytes.
JMP TrapDone

SimFwrite IS SimFread ;SimFseek IS SimFclose ;SimFtell IS SimFclose

(The program assumes that no file-reading error will occur if the first Fread was
successful.) Analogous routines for SimFgets, SimFgetws, and SimFputws can be found
in the file sim.mms, which is one of many demonstration files included with the author's
MMIXware programs.

18. The stated algorithms will work with any MMIX program for which the number of
local registers, $L$, never exceeds $\rho - 1$, where $\rho$ is the lring_size.

19. In all three cases the preceding instruction is INCL ll,8, and a value is stored in
location l + ((oo + ll) ∧ lring_mask). So we could shorten the program slightly.

20.  560  1H  GETA  t,OctaArgs
    561  TRAP 0,Fread,Infile  Input $\lambda$ into $g[255]$.
    562  BN  t,9F  Branch if end of file.
    563  LDOU loc,g,c255  loc $\leftarrow \lambda$.
    564  2H  GETA  t,OctaArgs
    565  TRAP 0,Fread,Infile  Input an octabyte $x$ into $g[255]$.
    566  LDOU x,g,c255
    567  BN  t,Error  Branch on unexpected end of file.
    568  SET  arg,loc
    569  BZ  x,1B  Start a new sequence if $x = 0$.
    570  PUSHJ res,MemFind
    571  STOU x,res,0  Otherwise store $x$ in $M[\text{loc}]$.
    572  INCL loc,8  Increase loc by 8.
    573  JMP  2B  Repeat until encountering a zero.
    574  9H  TRAP 0,Fclose,Infile  Close the input file.
    575  SUBU loc,loc,8  Decrease loc by 8.

Also put “OctaArgs OCTA Global+8*255,8” in some convenient place.

21. Yes it is, up to a point; but the question is interesting and nontrivial.

To analyze it quantitatively, let sim.mms be the simulator in MMIXAL, and let
sim.mmo be the corresponding object file produced by the assembler. Let Hello.mmo be the object file corresponding to Program 1.3.2H. Then the command line ‘Hello’
presented to MMIX's operating system will output ‘Hello, world’ and stop after $\mu + 17\upsilon$, not counting the time taken by the operating system to load it and to take care of
input/output operations.

Let Hello0.mmb be the binary file that corresponds to the command line ‘Hello’,
in the format of exercise 20. (This file is 176 bytes long.) Then the command line ‘sim
Hello0.mmb’ will output ‘Hello, world’ and stop after $168\mu + 1699\upsilon$.

Let Hello1.mmb be the binary file that corresponds to the command line ‘sim
Hello0.mmb’. (This file is 5768 bytes long.) Then the command line ‘sim Hello1.mmb’
will output ‘Hello, world’ and stop after $10549\mu + 169505\upsilon$.

Let Hello2.mmb be the binary file that corresponds to the command line ‘sim
Hello1.mmb’. (This file also turns out to be 5768 bytes long.) Then the command line
‘sim Hello2.mmb’ will output ‘Hello, world’ and stop after $789730\mu + 15117686\upsilon$.

Let Hello3.mmb be the binary file that corresponds to the command line ‘sim
Hello2.mmb’. (Again, 5768 bytes.) Then the command line ‘sim Hello3.mmb’ will output ‘Hello, world’ if we wait sufficiently long.

Now let recurse.mmb be the binary file that corresponds to the command line ‘sim recurse.mmb’. Then the command line ‘sim recurse.mmb’ runs the simulator simulating itself simulating itself simulating itself · · · ad infinitum. The file handle Infile is first opened at time $3\mu + 13\upsilon$, when recurse.mmb begins to be read by the simulator at level 1. That handle is closed at time $1464\mu + 16438\upsilon$ when loading is complete; but the simulated simulator at level 2 opens it at time $1800\mu + 19689\upsilon$, and begins to load recurse.mmb into simulated simulated memory. The handle is closed again at time $99659\mu + 1484347\upsilon$, then reopened by the simulated simulated simulator at time $116999\mu + 1794455\upsilon$. The third level finishes loading at time $6827574\mu + 131658624\upsilon$ and the fourth level starts at time $8216888\mu + 159327275\upsilon$.

But the recursion cannot go on forever; indeed, the simulator running itself is a finite-state system, and a finite-state system cannot produce Fopen–Fclose events at exponentially longer and longer intervals. Eventually the memory will fill up (see exercise 4) and the simulation will go awry. When will this happen? The exact answer is not easy to determine, but we can estimate it as follows: If the $k$th level simulator needs $n_k$ chunks of memory to load the $(k+1)$st level simulator, the value of $n_{k+1}$ is at most $4 + \lceil (2^{12} + 16 + (2^{12} + 24)n_k)/2^{12} \rceil$, with $n_0 = 0$. We have $n_k = 6k$ for $k < 30$, but this sequence eventually grows exponentially; it first surpasses $2^{61}$ when $k = 6066$. Thus we can simulate at least $10^{12132}$ instructions before any problem arises, if we assume that each level of simulation introduces a factor of at least 100 (see exercise 2).
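
The bound on $n_k$ is easy to iterate mechanically. The Python sketch below reads the brackets in the bound as a ceiling, which is the reading that reproduces $n_k = 6k$ for small $k$; the function names are illustrative only, and the recurrence is treated as an equality rather than an upper bound.

```python
def ceil_div(a, b):
    """Ceiling of a/b for positive integers, using exact integer arithmetic."""
    return -(-a // b)

def next_chunks(n):
    """One step of the recurrence: n_{k+1} = 4 + ceil((2^12 + 16 + (2^12 + 24) n_k) / 2^12)."""
    return 4 + ceil_div(2**12 + 16 + (2**12 + 24) * n, 2**12)

def chunks_needed(levels):
    """Return the list [n_0, n_1, ..., n_levels], starting from n_0 = 0."""
    values = [0]
    for _ in range(levels):
        values.append(next_chunks(values[-1]))
    return values

def first_level_exceeding(bound):
    """Return the smallest k for which n_k > bound."""
    n, k = 0, 0
    while n <= bound:
        n = next_chunks(n)
        k += 1
    return k

if __name__ == "__main__":
    print(chunks_needed(5))                 # the linear phase, n_k = 6k
    print(first_level_exceeding(2**61))     # compare with the value of k quoted above
```

The linear phase ends because the fractional part $(16 + 144k)/2^{12}$ of the quotient exceeds 1 once $k \ge 29$, so the ceiling jumps by an extra unit from $n_{30}$ onward.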

22. The pairs $(x_k, y_k)$ can be stored in memory following the trace program itself, which should appear after all other instructions in the text segment of the program being traced. (The operating system will give the trace routine permission to modify the text segment.) The main idea is to scan ahead from the current location in the traced program to the next branch or GO or PUSH or POP or JMP or RESUME or TRIP instruction, then to replace that instruction temporarily in memory with a TRIP command. The tetrabytes in locations #0, #10, #20, ..., #80 of the traced program are changed so that they jump to appropriate locations within the trace routine; then all control transfers will be traced, including transfers due to arithmetic interrupts. The original instructions in those locations can be traced via RESUME, as long as they are not themselves RESUME commands.

INDEX AND GLOSSARY

When an index entry refers to a page containing a relevant exercise, see also the answer to that exercise for further information. An answer page is not indexed here unless it refers to a topic not included in the statement of the exercise.

: (colon), 61–62, 65, 80.
" (double-quote), 31, 37, 44, 72, 100.
_ (underscore), 37.
@ (at sign), 15, 35, 38, 81.
$0, 31, 58.
$1, 31, 58.
2ADDU (times 2 and add unsigned), 9.
4ADDU (times 4 and add unsigned), 9.
8ADDU (times 8 and add unsigned), 9.
16ADDU (times 16 and add unsigned), 9.
$255, 34, 40–43, 56, 68, 114.
μ (average memory access time), 22.
φ (golden ratio), 8, 47.
υ (instruction cycle time), 22.
Absolute address, 15.
Absolute difference, 26.
Absolute value, 26, 27.
ACE computer, 65.
ADD, 8.
Addition, 8, 12, 14, 25.
Addition chains, 98.
ADDU (add unsigned), 8.
Adobe Systems, 74.
Ahrens, Wilhelm Ernst Martin Georg, 48.
ALGOL language, 74.
ALGOL W language, iv.
Alhazen, see Ibn al-Haytham.
Aliasing, 108.
Alignment, 39, 44.
Alpha 21164 computer, 2.
AMD 29000 computer, 2.
AND (bitwise and), 10.
ANDN (bitwise and-not), 10.
ANDNH (bitwise and-not high wyde), 14.
ANDNL (bitwise and-not low wyde), 14.
ANDNMH (bitwise and-not medium high wyde), 14.
ANDNML (bitwise and-not medium low wyde), 14.
ANSI: The American National Standards Institute, 12.
Arabic numerals, 44.
Arabic script, 44, 100.
Arguments, 54.
Arithmetic exceptions, 18, 89.
Arithmetic operators of MMIX, 8–9.
Arithmetic overflow, 6, 7, 19, 25, 27, 65, 84, 95, 100.
Arithmetic status register, 18.
Assembly language for MMIX, 28–44.
Assembly program, 29, 30, 40.
Associative law: (a ◦ b) ◦ c = a ◦ (b ◦ c), 11.
At sign (@), 15, 35, 38, 81.
Atomic instruction, 17.

b(x), 11.
Ball, Walter William Rouse, 48.
Base address, 35, 39.
BDIF (byte difference), 11, 26, 101.
Bertrand, Joseph Louis François, postulate, 100.
BEV (branch if even), 15.
Bidirectional typesetting, 44.
Bienstock, Daniel, 104.
Big-endian convention: Most significant byte first, 4–7, 116.
Binary file, 41.
  for programs, 90, 92–93, 125.
Binary number system, 4.
Binary operators in MMIX, 10.
Binary radix point, 8, 24.
Binary-to-decimal conversion, 37.
BinaryRead mode, 43.
BinaryWrite mode, 43.
Bit: “Binary digit,” either zero or unity, 2.
Bit difference, 26.
Bit reversal, 26, 97.
Bit vectors, 10.
Bitwise difference, 14.
Bitwise operators of MMIX, 10, 14, 25.
Blank space, 26, 40, 67.
BN (branch if negative), 15.
BNN (branch if nonnegative), 15.
BNP (branch if nonpositive), 15.
BNZ (branch if nonzero), 15.
BOD (branch if odd), 15.
Boolean matrix, 11, 96.
Bootstrap register, 18.
Bourne, Charles Percy, 97.
BP (branch if positive), 15.
Branch operators of MMIX, 15, 85.
BSPEC (begin special data), 62.
Buchholz, Werner, 94.
Byte: An 8-bit quantity, 3, 24, 94.
Byte difference, 11, 26.
BYTE operator, 31, 39.
Byte reversal, 12.
BZ (branch if zero), 15.

C language, iv, 45.
C++ language, iv.
Cache memory, 17, 22–23, 72, 98, 105, 107.
Calendar, 49.
Calling sequence, 54–56, 60, 68–70.
Carry, 25.
Cauchy, Augustin Louis, 105.
Ceiling, 13.
Character constant, 37.
Chess, 66.
Chung, Fan Rong King (鍾金芳蓉), 104.
Chunks, 77, 123.
Clavius, Christopher, 49.
Clipper C300 computer, 2.
Clock register, 19, 76, 112.
CMP (compare), 9.
CMPU (compare unsigned), 9, 113.
Colon (:), 61, 65, 80.
Command line arguments, 31, 90, 125.
Comments, 29.
Commutative law: a ◦ b = b ◦ a, 95.
Comparison operators of MMIX, 9, 13, 25, 113.
Compiler algorithms, 62, 74.
Complement, 10, 24.
Complete MMIX program, 30, 45.
Conditional operators of MMIX, 10, 26.
Conversion operators of MMIX, 13.
Conway, Melvin Edward, 35.
Copying a string, 47.
Coroutines, 60–73.
  linkage, 66, 72–73.
Counting bits, 11.
Coxeter, Harold Scott Macdonald, 48.
Cray I computer, 2.
Crossword puzzle, 50–51.
Cryptanalysis, 47.
CSEV (conditional set if even), 10.
CSN (conditional set if negative), 10.
CSNN (conditional set if nonnegative), 10.
CSNP (conditional set if nonpositive), 10.
CSNZ (conditional set if nonzero), 10.
CSOD (conditional set if odd), 10.
CSP (conditional set if positive), 10.
CSWAP (compare and swap), 17, 91.
CSZ (conditional set if zero), 10.
Current prefix, 61, 65.
Cycle counter, 19.
Cyclic shift, 26.
D_BIT (integer divide check bit), 18.
Dallos, József, 97.
Data segment of memory, 36, 57, 76–77, 81, 117.
Debugging, 64–65, 73, 91.
Decimal constant, 37.
Defined symbol, 37.
Denormal floating point number, 12, 89.
Dershowitz, Nachum (נחום דרשוביץ), 111.
Dickens, Charles John Huffam, iii.
Dictionaries, iii.
Dijkstra, Edsger Wybe, 63.
Discrete system simulators, 76.
DIV (divide), 8, 24–25.
Divide check, 8, 18.
Dividend register, 9.
Division, 9, 13, 24–25, 49, 91.
  by small constants, 25.
  by zero, 18.
  converted to multiplication, 25, 111.
DIVU (divide unsigned), 8.
Double-quote ("), 31, 37, 44, 72, 100.
Dull, Brutus Cyclops, 25.
DVWIOUZX, 18, 27, 89, 92.
Dynamic traps, 19.
Easter date, 49.
Emulator, 75.
Enable bits, 18, 85.
Ending a program, 19, 31.
Entrances to subroutines, 52–57, 123.
Epsilon register, 13.
Equivalent of MMIXAL symbol, 38.
Error recovery, 91.
ESPEC (end special data), 62.
Evaluation of powers, 28, 98.
Evans, Arthur, Jr., 74.
Event bits, 18, 85.
Exabyte, 94.
Exceptions, 18, 89.
Execution register, 18.
Exiting from a program, 19, 31.
Exits from subroutines, 52–57, 115.
Exponent of a floating point number, 12.
Exponentiation, 28.
EXPR field of MMIXAL line, 29, 38.
Expression, in MMIXAL, 38.
Extending the sign bit, 7, 9, 95.
f(x), 12.
FADD (floating add), 12.
Fallacies, 95.
Farey, John, 105.
  series, 47.
Fascicles, iii.
Fclose operation, 41, 43.
FCMP (floating compare), 13, 98.
FCMPE (floating compare with respect to epsilon), 13.
FDIV (floating divide), 12.
FEQL (floating equal to), 13, 98.
FEQLE (floating equivalent with respect to epsilon), 13.
Fgets operation, 42, 43.
Fgetws operation, 42, 43.
Fibonacci, Leonardo, of Pisa.
  numbers, 47, 66.
Filters, 71.
Finite fields, 26.
FINT (floating integer), 13, 23.
FIX (convert floating to fixed), 13.
Fixed point arithmetic, 45.
FIXU (convert floating to fixed unsigned), 13.
Flag bits, 82, 87.
Floating binary number, 12.
Floating point arithmetic, 12–13, 44, 45, 89.
Floating point operators of MMIX, 12–13.
FLOT (convert fixed to floating), 13.
FLOTU (convert fixed to floating unsigned), 13, 97.
Floyd, Robert W., 98.
FMUL (floating multiply), 12.
Fopen operation, 41, 43, 92.
Ford, Donald Floyd, 107.
Forward reference, see Future reference.
Fputs operation, 42, 43, 92.
Fputws operation, 42, 43.
Fraction of a floating point number, 12.
Frame pointer, 58, 115.
Fread operation, 42, 43, 92.
Fredman, Michael Lawrence, 104.
FREM (floating remainder), 13, 23, 44, 111.
Fseek operation, 42, 43.
FSQRT (floating square root), 13.
FSUB (floating subtract), 12.
Ftell operation, 43.
Fuchs, David Raymond, 27, 74.
FUN (floating unordered), 13, 98.
FUNE (floating unordered with respect to epsilon), 13.
Future reference, 37, 39.
Fwrite operation, 42, 43, 124.

GET (get from special register), 19, 92.
GETA (get address), 39, 100.
Gigabyte, 94.
Global registers, 16, 34, 58, 65, 79, 80, 84, 92.
Global threshold register, 16.
GO, 15, 26, 53–58.
Gove, Philip Babcock, iii.
Graphical display, 50–51.
GREG (allocate global register), 34–35, 39, 62.

Half-bytes, 24.
Halt operation, 31, 43.
Handles, 41.
Handlers, 18, 65, 89.
Hardy, Godfrey Harold, 105.
Harmonic convergence, 48.
Harmonic series, 48–49.
Haros, C., 105.
Heller, Joseph, 3.
Hello, world, 30–32, 125.
Hexadecimal constants, 37.
Hexadecimal digits, 3, 24.
Hexadecimal notation, 3, 19.
High tetra arithmetic, 97.
Hill, Robert, 111.
Himult register, 8.
Hints to MMIX, 16–17.
Hitachi SuperH4 computer, 2.
Hofri, Micha (מיכה חפרי), 104.
I_BIT (invalid floating operation bit), 18, 98.
IBM 601 computer, 2.
IBM 801 computer, 2.
IEEE: The Institute of Electrical and Electronics Engineers.
  floating point standard, 12, 89.
Immediate constants, 13–14, 19.
INCH (increase by high wyde), 14.
INCL (increase by low wyde), 14.
INCMH (increase by medium high wyde), 14.
INCML (increase by medium low wyde), 14.
Inexact exception, 18, 89.
Ingalls, Daniel Henry Holmes, 109.
Initialization, 31, 91.
  of coroutines, 70.
Infinite floating point number, 12.
int x, 13.
Input-output operations, 19, 31, 40–43, 92.
Instruction, machine language: A code that, when interpreted by the circuitry of a computer, causes the computer to perform some action.
  in MMIX, 5–28.
  numeric form, 27–29, 44.
  symbolic form, 28–40.
Integer overflow, 6, 7, 18, 25, 27, 65, 84, 95, 109.
Intel i860 computer, 2.
Internet, ii, v.
Interpreter, 73–75.
Interrupt mask register, 19.
Interrupt request register, 19.
Interrupts, 18–19, 86, 89, 92.
Interval counter, 19.
Invalid floating operation, 18.
IS, 30, 34, 39.
ISO: The International Organization for Standardization, 3.
Ivanovic, Vladimir Gresham, v.
Iverson, Kenneth Eugene, 11.

Jacquet, Philippe Pierre, 104.
Java language, iv, 45.
JMP (jump), 15.
Joke, 72.
Josephus, Flavius, son of Matthias (Φλάβιος Ἰώσηπος Ματθίου), problem, 48.
Jump operators of MMIX, 15.
Jump table, 86–87.
Jump trace, 93.

Kernel space, 36.
Kernighan, Brian Wilson, 23.
Kilobyte, 24, 94.
KKB (large kilobyte), 94.
Knuth, Donald Ervin (高德纳), i, v, 45, 65, 74, 89.

LABEL field of MMIXAL line, 29, 38.
Large kilobyte, 94.
Large programs, 63–65.
LDA (load address), 7, 9, 100.
LDB (load byte), 6.
LDBU (load byte unsigned), 7.
LDHT (load high tetra), 7, 24, 97.
LDO (load octa), 6.
LDOU (load octa unsigned), 7.
LDSF (load short float), 13.
LDT (load tetra), 6.
LDTU (load tetra unsigned), 7.
LDUNC (load octa uncached), 17.
LDVTS (load virtual translation status), 17.
LDW (load wyde), 6.
LDWU (load wyde unsigned), 7.
Leaf subroutine, 57, 65, 80.
Library of subroutines, 52, 61, 62, 91.
Lilius, Aloysius, 49.
Linked allocation, 77–78.
Literate programming, 45, 65.
Little-endian convention: Least significant byte first, see Bidirectional typesetting, Byte reversal.
Loader, 36.
Loading operators of MMIX, 6–7.
LOC (change location), 30, 39.
LOCAL (guarantee locality), 62.
Local registers, 16, 58, 65, 80, 84, 92.
  ring of, 76, 79–81, 92.
Local symbols, 35–37, 43.
Local threshold register, 16.
Loop optimization, 115.

m(x), 11.
Machine language, 2.
Main location, 31, 91.
Marginal registers, 16, 58, 65, 80, 84, 97.
Matrix: A two-dimensional array, 46, 106.
Matrix multiplication, generalized, 11, 26.
Maximum, 26.
  subroutine, 28–29, 52–56.
Megabyte, 24, 94.
Memory: Part of a computer system used to store data, 4–6.
  address, 6.
  hierarchy, 17, 22–23, 72, 98, 105, 107.
Memory stack, 57–58, 115.
Mems: Memory accesses, 22.
Meta-simulator, 22–23, 47, 76.
METAPOST language, 51.
Minimum, 26.
Minus zero, 13.
MIPS 4000 computer, 2.
MIX computer, iv.
.mmb (MMIX binary file), 125.
MMB (large megabyte), 94.
MMIX computer, iv, 2–28.
MMIX simulator, 22–23, 30.
  in MMIX, 75–93.
MMIXmasters, v, 51, 105, 111.
MMIXware document, 2.
.mmo (MMIX object file), 30, 125.
.mms (MMIX symbolic file), 30, 125.
MOR (multiple or), 12, 23, 26.
Motorola 88000 computer, 2.
Move-to-front heuristic, 77–78.
Mu (µ), 22.
MUL (multiply), 8.
Multipass algorithms, 70–72, 74.
Multiple entrances, 56, 123.
Multiple exits, 56–57, 60, 115.
Multiplex mask register, 11.
Multiplication, 8, 12, 25, 85.
  by small constants, 9, 25.
Multiway decisions, 45, 46, 82, 86–88, 119.
MULU (multiply unsigned), 8, 25.
Murray, James Augustus Henry, iii.
MUX (multiplex), 11.
MXOR (multiple exclusive-or), 12, 23, 26.

NaN (Not-a-Number), 12, 98.
NAND (bitwise not-and), 10.
NEG (negate), 9.
Negation, 9, 24.
NEGU (negate unsigned), 9.
Newline, 32, 42.
NNIX operating system, 28, 31.
No-op, 21, 28.
Nonlocal goto statements, 66, 91, 117.
NOR (bitwise not-or), 10.
Normal floating point number, 12.
Not-a-Number, 12, 98.

Notational conventions:
  b(x), 11.
  f(x), 12.
  int x, 13.
  m(x), 11.
  s(x), 6, 24.
  t(x), 11.
  u(x), 6, 24.
  v(x), 10.
  v̅(x), 10.
  w(x), 11.
  x ∸ y, 11.
  x ≪ y, 9.
  x ≫ y, 9.
  x ∧ y, 10.
  x ∨ y, 10.
  x ⊕ y, 10.
  x rem y, 13.
  XYZ, 6.
NXOR (bitwise not-exclusive-or), 10.
Nybble: A 4-bit quantity, 24.
Nyp: A 2-bit quantity, 94.

O_BIT (floating overflow bit), 18.
O'Beirne, Thomas Hay, 111.
Object file, 30–31, 125.
OCTA operator, 39.
Octabyte: A 64-bit quantity, 4.
Octabyte difference, 11, 102.
Oops, 22.
OP field of MMIXAL line, 29, 38.
Opcode: Operation code, 5, 19.
  chart, 20.
Operands, 5, 83-84.
Operating system, 28, 36, 40-43.
Optimization of loops, 47.
OR (bitwise or), 10.
ORH (bitwise or with high wyde), 14.
ORL (bitwise or with low wyde), 14.
ORMH (bitwise or with medium high wyde), 14.
ORML (bitwise or with medium low wyde), 14.
ORN (bitwise or-not), 10.
Overflow, 6, 7, 18, 25, 27, 65, 84, 95, 100.
Oxford English Dictionary, iii.

Packed data, 82, 87-88.
Page fault, 114.
Parameters, 54.
Parity, 26.
Pascal language, iv.
Pass, in a program, 70-72.
Patt, Yale Nance, 98.
PBEV (probable branch if even), 16.
PBN (probable branch if negative), 15.
PBNN (probable branch if nonnegative), 15.
PBNP (probable branch if nonpositive), 16.
PBNZ (probable branch if nonzero), 16.
PBOD (probable branch if odd), 15.
PBP (probable branch if positive), 15.
PBZ (probable branch if zero), 15.
Petabyte, 94.
Phi (\( \phi \)), 8, 47.
Pipe, 71.
Pipelining, 22, 47, 76, 98.
Pixel values, 11, 26.
PL/360 language, 45.
PL/I language, 45, 63.
Pool segment of memory, 36, 117.
POP (pop registers and return), 16, 53, 59, 73, 92.
Population counting, 11.
PostScript language, 74.
POWER 2 computer, 2.
Power of number, evaluation, 28.
Predefined symbols, 36-38, 43.
Prediction register, 17.
PREFIX specification, 61–62, 65, 77–78, 80.
Prefetching, 17, 22.
Prefixes for units of measure, 91.
PREGO (prefetch to go), 17.
PRELD (preload data), 17.
Primary, in MMIXAL, 38.
Prime numbers, program to compute, 32-34, 37.
Privileged instructions, 46, 76.
Probable branch, 15-16, 22, 26, 85.
Profile of a program: The number of times each instruction is performed, 29, 31, 93, 95.
Program construction, 63-65.
Programming languages, iv, 63.
Pseudo-operations, 30-31.
Purdy, Gregor Neal, 94.
PUSHGO (push registers and go), 16, 65, 73, 85–86.
PUSHJ (push registers and jump), 16, 33, 59, 73, 85–86.
PUT (put into special register), 19, 92.
Quick, Jonathan Horatio, 44.
rA (arithmetic status register), 18, 28.
RA (relative address), 15.
Radix point, 8, 24.
Randell, Brian, 74.
Randolph, Vance, 28.
Rational numbers, 47.
rB (bootstrap register for trips), 18.
rBB (bootstrap register for traps), 18.
rC (cycle counter), 19, 112.
rD (dividend register), 9.
rE (epsilon register), 13.
Reachability, 51.
Read-only access, 36.
Recursive use of subroutines, 57, 66, 125-126.
Register $0, 31, 58.
Register $1, 31, 58.
Register $255, 34, 40–43, 56, 68, 114.
Register number, 34, 58.
Register stack, 16, 58-61, 65-66, 70, 73, 78-81, 84-86, 115.
Register stack offset, 17.
Register stack pointer, 17.
Registers: Portions of a computer’s internal circuitry in which data is most accessible.
  of MMIX, 4–5, 21, 23, 76, 79.
  saving and restoring, 55; see also SAVE, UNSAVE.
Reingold, Edward Martin (ריינגולד, יצחק משה בן חיים), 111.
Relative addresses, 15-16, 20, 30, 83, 87, 90.
Remainder, 8, 13, 49.
Remainder register, 8.
Replicated coroutines, 72.
Reprogramming, 75.
RESUME (resume after interrupt), 19, 84, 92, 114, 128.
Return-jump register, 16.
Reversal of bits and bytes, 12, 28, 97.
Rewinding a file, 42.
Rewrites, v, 64.
rG (global threshold register), 16, 58, 92.
rH (himult register), 8, 28, 85, 94.
rI (interval counter), 19.
Ring of local registers, 76, 79–81, 92.
RISC: Reduced Instruction Set Computer, 24.
RISC II computer, 2.
rJ (return-jump register), 16, 60, 80, 81.
rK (interrupt mask register), 19, 90–91.
rL (local threshold register), 16, 28, 58, 79, 92, 97, 117.
rM (multiplex mask register), 11.
rN (serial number), 19.
rO (register stack offset), 17, 79.
Rokicki, Tomas Gerhard, 74.
Roman numerals, 2, 3.
Ropcodes, 19, 92.
ROUND_DOWN mode, 13.
ROUND_NEAR mode, 13, 37.
ROUND_OFF mode, 13.
ROUND_UP mode, 13.
Rounding, 13, 18, 47, 48.
Row major order, 46.
rP (prediction register), 17.
rQ (interrupt request register), 19.
rR (remainder register), 8.
rS (register stack pointer), 17, 79.
rT (trap address register), 18, 90–91.
rTT (dynamic trap address register), 19, 90-91.
rU (usage counter), 19.
Running time, 20-23.
Russell, Lawford John, 74.
rV (virtual translation register), 20, 90-91.
rW (where-interrupted register for trips), 18.
rWW (where-interrupted register for traps), 18.
rX (execution register for trips), 18.
rXX (execution register for traps), 18.
rY (Y operand register for trips), 18.
rYY (Y operand register for traps), 18.
rZ (Z operand register for trips), 18.
rZZ (Z operand register for traps), 18.
s(x), 6, 24.
SADD (sideways add), 11.
Saddle point, 46.
Saturation addition, 26.
Saturation subtraction, 11.
SAVE (save process state), 16, 61, 92, 114, 116.
Saving and restoring registers, 55; see also SAVE, UNSAVE.
Scalar variables, 61.
Schiffer, Alejandro Alberto, 104.
Segments of user space, 36.
Self-modifying code, iv, 28, 93.
Self-organizing list search, 77-78.
Self-reference, 126, 132.
Sequential array allocation, 46.
Serial number register, 19.
SET, 14, 90.
Set difference, 25.
Set intersection, 25.
Set union, 25.
SETH (set high wyde), 14.
SETL (set low wyde), 14, 100.
SETM (set medium high wyde), 14, 97.
SETML (set medium low wyde), 14.
SFLOT (convert fixed to short float), 13.
SFLOTU (convert fixed to short float unsigned), 13.
Shift operators of MMIX, 9.
Shor, Peter Williston, 104.
Short float format, 12-13.
Sideways addition, 11.
Sign extension, 7, 9, 95.
Sign of floating point number, 12.
Signed integers, 4, 6-7, 25.
Sikes, William, iii.
Simon, Marvin Neil, v.
Simulation of computers, 75-76.
SL (shift left), 9, 25.
SLU (shift left unsigned), 9, 25.
Small constant numbers, 9, 13.
  division by, 25.
  multiplication by, 9, 25.
Sparc 64 computer, 2.
Special registers of MMIX, 5, 19, 21, 76, 118.

Square root, 13.
SR (shift right), 9, 25.
SRU (shift right unsigned), 9, 25.
Stack offset register, 79.
Stack operators of MMIX, 16–17.
Stack pointer register, 57–58, 79.
Stack segment of memory, 30, 61, 114, 117.
Stacks, see Memory stack, Register stack.
Stalling a pipeline, 108.
Standard error file, 41.
Standard input file, 41.
Standard output file, 31, 41.
Starting a program, 31, 70, 91.
STB (store byte), 7.
STBU (store byte unsigned), 8.
STCO (store constant octabyte), 8.
StdErr (standard error file), 41.
StdIn (standard input file), 41.
StdOut (standard output file), 30–31, 41.
STHT (store high tetra), 8, 24, 97.
STO (store octa), 7.
Storing operators of MMIX, 7–8.
STOU (store octa unsigned), 8.
Stretch computer, 94.
String constant in MMIXAL, 31, 37, 100.
String manipulation, 26, 47.
Strong binary operators, 38.
StrongArm 110 computer, 2.
STSF (store short float), 13.
STT (store tetra), 7.
STTU (store tetra unsigned), 8.
STUNC (store octa uncached), 17.
STW (store wyde), 7.
STWU (store wyde unsigned), 8.
SUB (subtract), 8.
Subroutines, 30, 45, 52–70, 75, 77–81, 92.
  linkage of, 52–61.
Subsets, representation of, 25.
Subtraction, 8, 12, 25.
SUBU (subtract unsigned), 8.
Superscalar machine, 108.
Suri, Subhash (सुरी सुभाष), 104.
Switching tables, 45, 46, 82, 86–88, 119.
SWYM (sympathize with your machinery), 21.
SYNC (synchronize), 17, 86.
SYNCD (synchronize data), 17.
SYNCID (synchronize instructions and data), 17, 28.
System operators of MMIX, 17.
System/360 computer, 45.

Terabyte, 94.
Term, in MMIXAL, 38.
Terminating a program, 19, 31.
Tetra: Short form of “tetrabyte”, 4.
Tetra difference, 11.
TETRA operator, 39, 72.
Tetrabyte: A 32-bit quantity, 4.
Tetrabyte arithmetic, 27.
TeX, 65, 74–75.
Text file, 41.
Text segment of memory, 30, 77, 81.
TextRead mode, 43.
TextWrite mode, 43.
Threads, 72.
Trace routine, 64, 93.
Traffic signals, 30.
TRAP (force trap interrupt), 18–19, 40, 86–87.
Trap address register, 18.
Trap handlers, 18–19.
TRIP (force trip interrupt), 18, 86.
Trip handlers, 18, 89.
Trip interrupts, 65, 92.
Turing, Alan Mathison, 65.
Twist, Oliver, iii.
Two's complement notation, 4, 24.

u(x), 6, 24.
U_BIT (floating underflow bit), 18, 85, 89.
U_Handler: Address of an underflow trip, 89.
UCS: Universal Multiple-Octet Coded Character Set, 3.
Underflow, 18, 89.
Underscore (_), 37.
Unicode, 3, 26, 37, 44.
Units of measure, 94.
UNIVAC I computer, 35.
UNIX operating system, 71, 114.
Unpacking, 82.
Unrolling a loop, 107.
UNSAVE (restore process state), 16, 61, 90, 92, 116.
Unsigned integers, 4, 6–8.
Upsilon (υ), 22.
Usage counter, 19.
User space, 36.

v(x), v̅(x), 10.
V_BIT (integer overflow bit), 18.
Valid MMIX instruction, 46.
Van Wyk, Christopher John, 23.
Vector, 10.
Victorius of Aquitania, 111.
Virtual address translation, 17.
Virtual machine, 73.
Virtual translation register, 20.

w(x), 11.
W_BIT (float-to-fix overflow bit), 18.
W_Handler: Address of a float-to-fix overflow trip, 37.
WDIF (wyde difference), 11.
Weak binary operators, 38.
Webster, Noah, iii.
Where-interrupted register, 18.
Whitespace character, 67.
Wide strings, 42.
Wilson, George Pickett, 28.
Wirth, Niklaus Emil, 45, 63.
Wordsworth, William, 24.
Wright, Edward Maitland, 105.
Wyde: A 16-bit quantity, 4.
Wyde difference, 11.
Wyde immediate, 14.
WYDE operator, 39.
X field of MMIX instruction, 5.
X_BIT (floating inexact bit), 18, 89.
XOR (bitwise exclusive-or), 10.
XYZ field of MMIX instruction, 6.
Y field of MMIX instruction, 5.
Y operand register, 18.
Yoder, Michael Franz, 95.
Yossarian, John, 3.
Yottabyte, 94.
YZ field of MMIX instruction, 5-6.
Z field of MMIX instruction, 5.
  as immediate constant, 14.
Z operand register, 18.
Z_BIT (floating division by zero bit), 18.
Zero or set instructions of MMIX, 10.
Zettabyte, 94.
ZSEV (zero or set if even), 10.
ZSN (zero or set if negative), 10.
ZSNN (zero or set if nonnegative), 10.
ZSNP (zero or set if nonpositive), 10.
ZSNZ (zero or set if nonzero), 10.
ZSOD (zero or set if odd), 10.
ZSP (zero or set if positive), 10.
ZSZ (zero or set if zero), 10.
### ASCII Characters

<table>
<thead>
<tr><th></th><th>#0</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th><th>#5</th><th>#6</th><th>#7</th><th>#8</th><th>#9</th><th>#a</th><th>#b</th><th>#c</th><th>#d</th><th>#e</th><th>#f</th></tr>
</thead>
<tbody>
<tr><td>#2x</td><td>␣</td><td>!</td><td>&quot;</td><td>#</td><td>$</td><td>%</td><td>&amp;</td><td>'</td><td>(</td><td>)</td><td>*</td><td>+</td><td>,</td><td>-</td><td>.</td><td>/</td></tr>
<tr><td>#3x</td><td>0</td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>9</td><td>:</td><td>;</td><td>&lt;</td><td>=</td><td>&gt;</td><td>?</td></tr>
<tr><td>#4x</td><td>@</td><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td><td>F</td><td>G</td><td>H</td><td>I</td><td>J</td><td>K</td><td>L</td><td>M</td><td>N</td><td>O</td></tr>
<tr><td>#5x</td><td>P</td><td>Q</td><td>R</td><td>S</td><td>T</td><td>U</td><td>V</td><td>W</td><td>X</td><td>Y</td><td>Z</td><td>[</td><td>\</td><td>]</td><td>^</td><td>_</td></tr>
<tr><td>#6x</td><td>`</td><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td><td>f</td><td>g</td><td>h</td><td>i</td><td>j</td><td>k</td><td>l</td><td>m</td><td>n</td><td>o</td></tr>
<tr><td>#7x</td><td>p</td><td>q</td><td>r</td><td>s</td><td>t</td><td>u</td><td>v</td><td>w</td><td>x</td><td>y</td><td>z</td><td>{</td><td>|</td><td>}</td><td>~</td><td>DEL</td></tr>
</tbody>
</table>
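
Each row of the chart is just sixteen consecutive character codes, so it can be regenerated rather than memorized. The Python sketch below is an illustration only; the function name ascii_chart_row and the choice to show the blank as SP and code #7f as DEL are assumptions of this sketch.

```python
def ascii_chart_row(high_digit):
    """Return the 16 characters whose codes are 16*high_digit + 0, ..., 16*high_digit + 15."""
    return [chr(16 * high_digit + low) for low in range(16)]

if __name__ == "__main__":
    for high in range(2, 8):
        cells = ascii_chart_row(high)
        # The blank and the DEL code have no visible glyph, so show them by name.
        shown = ["SP" if c == " " else "DEL" if c == "\x7f" else c for c in cells]
        print(f"#{high:x}x", " ".join(shown))
```

Reading the chart in the other direction, the code of a character is 16 times its row digit plus its column digit; the letter A, for example, sits in row #4x, column #1.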

### MMIX Operation Codes

<table>
<thead>
<tr><th></th><th>#0</th><th>#1</th><th>#2</th><th>#3</th><th>#4</th><th>#5</th><th>#6</th><th>#7</th></tr>
</thead>
<tbody>
<tr><td>#0x</td><td>TRAP 5υ</td><td>FCMP υ</td><td>FUN υ</td><td>FEQL υ</td><td>FADD 4υ</td><td>FIX 4υ</td><td>FSUB 4υ</td><td>FIXU 4υ</td></tr>
<tr><td>#1x</td><td>FMUL 4υ</td><td>FCMPE 4υ</td><td>FUNE υ</td><td>FEQLE 4υ</td><td>FDIV 40υ</td><td>FSQRT 40υ</td><td>FREM 4υ</td><td>FINT 4υ</td></tr>
<tr><td>#Ex</td><td>SETH υ</td><td>SETMH υ</td><td>SETML υ</td><td>SETL υ</td><td>INCH υ</td><td>INCMH υ</td><td>INCML υ</td><td>INCL υ</td></tr>
</tbody>
</table>

$\pi = 2\upsilon$ if the branch is taken, $\pi = 0$ if the branch is not taken.
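
The penalty $\pi$ matters only for branch instructions. The Python sketch below shows its effect, assuming the charges $\upsilon + \pi$ for an ordinary branch and $3\upsilon - \pi$ for a probable branch; those two formulas are taken from the full MMIX opcode chart rather than from the line above, and the function name branch_cost is hypothetical.

```python
def branch_cost(taken, probable=False):
    """Running time of a branch in units of upsilon.

    Assumes an ordinary branch costs upsilon + pi and a probable branch
    costs 3*upsilon - pi, where pi is 2*upsilon if taken and 0 otherwise.
    """
    penalty = 2 if taken else 0
    if probable:
        return 3 - penalty
    return 1 + penalty

if __name__ == "__main__":
    for probable in (False, True):
        for taken in (False, True):
            kind = "probable branch" if probable else "branch"
            outcome = "taken" if taken else "not taken"
            print(f"{kind}, {outcome}: {branch_cost(taken, probable)} upsilon")
```

Under these assumptions a correct guess always costs one υ and a wrong guess three, which is why the programmer's choice between the two forms of branch is itself a prediction.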