03-InputOutput

Operating systems control all the computer’s I/O (Input/Output) devices.
An OS must issue commands to the devices, catch interrupts, and handle errors.
It provides an interface between the devices and the rest of the system.
That interface should be simple and easy to use.
The interface should be as similar as possbile for all devices (device independence).
The I/O code represents a significant fraction of the total operating system.
Thus to really understand what an operating system does,
you have to understand how I/O works.

Then we will look at I/O software in general.
I/O software can be structured in layers,
with each layer having a well defined task to perform.

After that, comes a section on deadlocks.
We will define deadlocks precisely,
show how they are caused,
give two models for analyzing them,
and discuss some algorithms for preventing their occurrence.

Then we will move on to look at MINIX3.
We will start with a bird’s-eye view of I/O in MINIX3,
including interrupts, device drivers, device-dependent I/O, and device-independent I/O.
Following that introduction,
we will look at several I/O devices in detail:
disks, keyboards, and displays.
For each device, we will look at its hardware and software.

1.1 Principles of I/O hardware

Electrical engineers look at it in terms of:
chips, wires, power supplies, motors,
and all the other physical components that make up the hardware.

Programmers look at the interface presented to the software:
the commands the hardware accepts,
the functions it carries out,
and the errors that can be reported back.

We are concerned with programming I/O devices,
not designing, building, or maintaining them,
so our interest will be restricted to how the hardware is programmed,
not how it works inside.
Programming of I/O devices is connected with their internal operation.
Thus, we provide general background on I/O hardware,
as it relates to programming.

1.1.1 I/O Devices

I/O devices can be roughly divided into two categories:
1) block devices and
2) character devices.

1.1.1.1 Block Devices

A block device is one that stores information in fixed-size blocks,
each one with its own address.
Common block sizes range from 512 bytes to 32,768 bytes.
The essential property of a block device is that:
it is possible to read or write each block independently of all the other ones.
Disks are the most common block devices.

However, the boundary between devices that are block addressable,
and those that are not, is not well defined.

Everyone agrees that a disk is a block addressable device,
because no matter where the arm currently is,
it is always possible to seek to another cylinder,
and then wait for the required block to rotate under the head.

Now consider a tape drive used for making disk backups.
Tapes contain a sequence of blocks.
If the tape drive is given a command to read block N,
then it can always rewind the tape,
and go forward until it comes to block N.
This operation is analogous to a disk doing a seek,
except that it takes much longer.
It may or may not be possible to rewrite one block in the middle of a tape.
Even if it were possible to use tapes as random access block devices,
that is stretching the point somewhat:
they are not normally used that way.

1.1.1.2 Character devices

The other type of I/O device is the character device.
A character device delivers or accepts a stream of characters,
without regard to any block structure.
It is not addressable, and does not have any seek operation.
Printers, network interfaces, mice (for pointing),
and most other devices that are not disk-like,
can be seen as character devices.

1.1.1.3 Other devices

Clocks, for example, are not block addressable.
Nor do they generate or accept character streams.
All they do is cause interrupts at well-defined intervals.

Still, the model of block and character devices is general enough,
that it can make some of the operating system software for I/O, device independent.
For example, the file system deals only with abstract block devices,
and leaves the device-dependent part to lower-level software called device drivers.

I/O devices cover a huge range in speeds.
The figure below shows the data rates of some common devices.
03-InputOutput/f3-01.png

Some typical device, network, and bus data rates.
Most of these devices tend to get faster as time goes on.

1.1.2 Device Controllers

I/O units typically consist of an electrical component and a mechanical component.
It is often possible to separate the two portions,
to provide a more modular and general design.

The electronic component is called the device controller or adapter.
On personal computers, it often takes the form of a printed circuit card,
that can be inserted into an expansion slot.

The mechanical component is the device itself.
This arrangement is shown:
03-InputOutput/f3-02.png

A model for connecting the CPU, memory, controllers, and I/O devices.

The controller card usually has a connector on it,
into which a cable leading to the device itself can be plugged.
Many controllers can handle two, four, or even eight identical devices.
If the interface between the controller and device is a standard interface,
either an official ANSI, IEEE, or ISO standard or a de-facto one,
then companies can make controllers or devices that fit that interface.
For example, many companies make disk drives that match either the:
IDE (Integrated Drive Electronics) or
SCSI (Small Computer System Interface) interfaces.

We mention this distinction between controller and device,
because the operating system nearly always deals with the controller,
not the device.

Most personal computers and servers use the bus model above,
for communication between the CPU and the controllers.

Large mainframes often use a different model,
with specialized I/O computers called I/O channels,
taking some of the load off the main CPU.

The interface between the controller and the device is often low-level.
A disk, for example, might be formatted,
with 1024 sectors of 512 bytes per track.
What actually comes off the drive, however,
is a serial bit stream, starting with a preamble,
then the 4096 bits in a sector, and finally a checksum,
also called an Error-Correcting Code (ECC).
The preamble is written when the disk is formatted,
and contains the cylinder and sector number,
the sector size, and similar data.

The controller converts the serial bit stream into a block of bytes,
and performs any error correction necessary.
The block of bytes is typically first assembled,
bit by bit, in a buffer inside the controller.
After its checksum has been verified,
and the block declared to be free of errors,
then it can then be copied to main memory.

A display monitor’s controller also works as a bit serial device,
at an equally low level.
It reads bytes containing the characters to be displayed from memory,
and generates the signals used to modulate the CRT beam.

The controller also generates signals,
for making a CRT beam do a horizontal retrace,
after it has finished a scan line,
as well as for making it do a vertical retrace,
after the entire screen has been scanned.

On an LCD screen, these signals select individual pixels,
and control their brightness,
simulating the effect of the electron beam in a CRT.

If it were not for the video controller,
then the operating system programmer would have to program the scanning explicitly.
The operating system initializes the controller with a few parameters,
such as the number of characters or pixels per line,
and number of lines per screen,
and lets the controller take care of actually drawing the display.

Modern disk controllers often have excess memory inside the controller.
When a read is being processed,
as soon as the arm gets to the correct cylinder,
the controller begins reading and storing data,
even if it has not yet reached the sector it needs.
This cached data may come in handy for satisfying subsequent requests.
Even after the requested data has been obtained,
the controller may continue to cache data from subsequent sectors,
since they are likely to be needed later.
Many disk reads can be handled without any disk activity at all.

1.1.3 Memory-Mapped I/O

Each controller has a few registers,
that are used for communicating with the CPU.

By writing into these registers,
the operating system can command the device to:
deliver data, accept data, switch itself on or off,
or otherwise perform some action.

By reading from these registers,
the operating system can learn what the device’s state is,
whether it is prepared to accept a new command, and so on.

In addition to the control registers,
many devices have a data buffer,
that the operating system can read and write.
For example, a common way for computers to display pixels on the screen,
is to have a video RAM, which is basically just a data buffer,
available for programs or the operating system to write into.

The issue thus arises,
of how the CPU communicates with the control registers,
and the device data buffers.
Two alternatives exist:

1.1.3.1 Memory, ports

In the first approach,
each control register is assigned an I/O port number,
an 8- or 16-bit integer.

Using a special I/O instruction such as
IN REG,PORT
the CPU can read in control register PORT,
and store the result in CPU register REG.

Similarly, using
OUT PORT,REG
the CPU can write the contents of REG to a control register.

Most early computers, including nearly all mainframes,
such as the IBM 360 and all of its successors, worked this way.
In this scheme, the address spaces for memory and I/O are different,
as shown in (a):
03-InputOutput/f3-03.png

(a) Separate I/O and memory space.
(b) Memory-mapped I/O.
(c) Hybrid.

1.1.3.2 One memory address space

In the second approach,
I/O registers are part of the regular memory address space,
as shown above in (b)
This scheme is called memory-mapped I/O,
and was introduced with the PDP-11 minicomputer.
Each control register is assigned a unique memory address,
to which no memory is assigned.
Usually, the assigned addresses are at the top of the address space.

1.1.3.3 Two memory address spaces

A hybrid scheme,
with memory-mapped I/O data buffers,
and separate I/O ports for the control registers,
is shown in (c).
The Pentium uses this architecture,
with addresses 640K to 1M being reserved,
for device data buffers in IBM PC compatibles,
in addition to I/O buffer 0 through 64K.

1.1.3.4 How do these schemes work?

In all cases, when the CPU wants to read a word,
either from memory or from an I/O port,
it puts the address it needs on the address lines of the bus,
and then asserts a READ signal on a bus control line.
A second signal line is used,
to tell whether I/O space or memory space is needed.

If there is only memory space, as in (b),
then every memory module and every I/O device,
compares the address lines to the range of addresses that it services.
If the address falls in its range,
then it responds to the request.
Since no address is ever assigned to both memory and an I/O device,
there is no ambiguity and no conflict.

1.1.4 Interrupts

Usually, controller registers have one or more status bits,
that can be tested to determine:
if an output operation is complete, or
if new data is available from an input device.

1.1.4.1 Polling

A CPU can execute a loop, testing a status bit each time,
until a device is ready to accept or provide new data.
This is called polling or busy waiting.
We saw this busy waiting concept before,
as a possible method to deal with critical sections,
and in that context it was dismissed,
as something to be avoided in most circumstances.
In the realm of I/O, where you might have to wait a very long time,
for the outside world to accept or produce data, polling is not acceptable,
except for very small dedicated systems, not running multiple processes.

1.1.4.2 Interrupts

In addition to status bits, many controllers use interrupts,
to tell the CPU when they are ready to have their registers read or written.
We saw how interrupts are handled by the CPU.
In the context of I/O, all you need to know is that:
most interface devices provide an output,
which is logically the same as the previous register status bit,
indicating “operation complete” or “data ready”,
but which drives one of the IRQ (Interrupt ReQuest) lines of the system bus.
Thus when an interrupt-enabled operation completes,
it interrupts the CPU, and starts the interrupt handler running.
This piece of code informs the operating system that I/O is complete.
The operating system may then check the status bits,
to verify that all went well,
and either harvest the resulting data,
or initiate a retry.

The number of inputs to the interrupt controller may be limited;
Pentium-class PCs have only 15 available for I/O devices.
Some controllers are hard-wired onto the system parentboard,
for example, the disk and keyboard controllers of an IBM PC.
On older systems, a switch or jumper associated with the controller,
set the IRQ used by the device.
If a user bought a new plug-in board,
then they had to manually set the IRQ,
to avoid conflicts with existing IRQs.
Few users could do this correctly,
which led the industry to develop Plug ’n Play,
in which the BIOS can automatically assign IRQs to devices,
at boot time to avoid conflicts.

1.1.5 Direct Memory Access (DMA)

Whether or not a system has memory-mapped I/O,
its CPU needs to address the device controllers,
to exchange data with them.
The CPU can request data from an I/O controller, one byte at a time,
but doing so for a device like a disk,
that produces a large block of data,
wastes the CPU’s time, so a different scheme,
called DMA (Direct Memory Access) is often used.
The operating system can only use DMA,
if the hardware has a DMA controller,
which most systems do.
Sometimes this controller is integrated into disk controllers and other controllers,
but such a design requires a separate DMA controller for each device.
More commonly, a single DMA controller is available
(on the parentboard) for regulating transfers to multiple devices,
often concurrently.

No matter where it is physically located,
the DMA controller has access to the system bus,
independent of the CPU, as shown:
03-InputOutput/f3-04.png

Operation of a DMA transfer.

The DMA contains several registers that can be written and read by the CPU.
These include:
a memory address register,
a byte count register,
and one or more control registers.

The control registers specify:
which I/O port to use,
the direction of the transfer (reading or writing to the I/O device),
the transfer unit (byte at a time, or word at a time), and
the number of bytes to transfer in one burst.

1.1.5.1 Without DMA

Let us first look at how disk reads occur, when DMA is not used.
First the controller reads the block (one or more sectors) from the drive serially,
bit by bit, until the entire block is in the controller’s internal buffer.
Next, it computes the checksum,
to verify that no read errors have occurred.
Then the controller causes an interrupt.
When the operating system starts running,
it can read the disk block from the controller’s buffer,
a byte or a word at a time by executing a loop,
with each iteration:
reading one byte or word from a controller device register,
storing it in main memory,
incrementing the memory address,
and decrementing the count of items to be read,
until it reaches zero.

1.1.5.2 With DMA

Step 1:
When DMA is used, the procedure is different.
First the CPU programs the DMA controller,
by setting its registers, so it knows what to transfer where
It also issues a command to the disk controller,
telling it to read data from the disk into its internal buffer,
and verify the checksum.
When valid data are in the disk controller’s buffer,
DMA can begin.

Step 2:
The DMA controller initiates the transfer,
by issuing a read request over the bus to the disk controller.
This read request looks like any other read request,
and the disk controller does not know or care whether:
it came from the CPU or from a DMA controller.
Typically, the memory address to write to is on the address lines of the bus,
so when the disk controller fetches the next word from its internal buffer,
it knows where to write it.

Step 3:
The write to memory is another standard bus cycle.
When the write is complete,

Step 4:
The disk controller sends an acknowledgement signal to the disk controller,
also over the bus.
The DMA controller then increments the memory address to use,
and decrements the byte count.
If the byte count is still greater than 0,
then steps 2 through 4 are repeated,
until the count reaches 0.

At this point the controller causes an interrupt.
When the operating system starts up,
it does not have to copy the block to memory;
it is already there.

Why does the controller not just store the bytes in main memory,
as soon as it gets them from the disk.
Why does it need an internal buffer?
There are two reasons:

First, by doing internal buffering,
the disk controller can verify the checksum,
before starting a transfer.
If the checksum is incorrect,
then an error is signaled,
and no transfer to memory is done.

Second, once a disk transfer has started,
the bits keep arriving from the disk at a constant rate,
whether the controller is ready for them or not.
If the controller tried to write data directly to memory,
then it would have to go over the system bus for each word transferred.
If the bus were busy, due to some other device using it,
then the controller would have to wait.
If the next disk word arrived before the previous one had been stored,
then the controller would have to store it somewhere.
If the bus were very busy,
then the controller might end up storing quite a few words,
and having a lot of administration to do as well.
When the block is buffered internally,
the bus is not needed until the DMA begins,
so the design of the controller is much simpler,
because the DMA transfer to memory is not time critical.

1.1.5.3 Speed of DMA

Not all computers use DMA.
The argument against it is that:
the main CPU is often far faster than the DMA controller,
and can do the job much faster
(when the limiting factor is not the speed of the I/O device).
If there is no other work for it to do,
then having the (fast) CPU wait for the (slow) DMA controller to finish is pointless.
Also, getting rid of the DMA controller,
and having the CPU do all the work in software saves money,
important on low-end (embedded) computers.

1.2 Principles of I/O software

Let us now turn away from the I/O hardware,
and look at the I/O software.
First we will look at the goals of the I/O software,
and then at the different ways I/O can be done,
from the point of view of the operating system.

1.2.1 Goals of the I/O Software

1.2.1.1 Device independence

A key concept in the design of I/O software is device independence.
What this means is that:
it should be possible to write programs that can access any I/O device,
without having to specify the device in advance.
For example, a program that reads a file as input,
should be able to read a file on any disk,
a floppy disk, on a hard disk, or on a CD-ROM,
without having to modify the program for each different device.
Similarly, one should be able to type a shell command such as:
sort <input >output
and have it work with input coming from a:
floppy disk, an IDE disk, a SCSI disk, or the keyboard,
and the output going to any kind of disk or the screen.
It is up to the operating system to take care of the problems,
caused by the fact that these devices really are different,
and require very different command sequences to read or write.

1.2.1.2 Uniform naming

Closely related to device independence,
is the goal of uniform naming.
The name of a file or a device should simply be a string or an integer,
and not depend on the device in any way.
In UNIX and MINIX3, all disks can be integrated into the file system hierarchy,
in arbitrary ways, so the user need not be aware of which name corresponds to which device.
For example, a floppy disk can be mounted on top of the directory /usr/ast/backup
so that copying a file to that directory copies the file to the diskette.
In this way, all files and devices are addressed the same way:
by a path name.

1.2.1.3 Error handling

Another important issue for I/O software is error handling.
In general, errors should be handled as close to the hardware as possible.
If the controller discovers a read error,
then it should try to correct the error itself if it can.
If it cannot, then the device driver should handle it,
perhaps by just trying to read the block again.
Many errors are transient,
such as read errors caused by specks of dust on the read head,
and will go away if the operation is repeated.
Only if the lower layers are not able to deal with the problem,
should the upper layers be told about it.
In many cases, error recovery can be done transparently at a low level,
without the upper levels even knowing about the error.

1.2.1.4 Sync vs. Asynch

Still another key issue is synchronous (blocking) versus asynchronous (interrupt-driven) transfers.

Most physical I/O is asynchronous.
The CPU starts the transfer,
and goes off to do something else,
until the interrupt arrives.

User programs are much easier to write, if the I/O operations are blocking.
After a receive system call, the program is automatically suspended,
until the data are available in the buffer.

It is up to the operating system to make operations that are actually interrupt-driven,
look blocking to the user programs.

1.2.1.5 Buffering

Another issue for the I/O software is buffering.
Often data that come off a device cannot be stored directly in the final destination.
For example, when a packet comes in to the kernel from the network,
the operating system does not know where to put it,
until it has stored the packet somewhere and examined it.
Also, some devices have severe real-time constraints
(for example, digital audio devices),
so the data must be put into an output buffer in advance,
to decouple the rate at which the buffer is filled,
from the rate at which it is emptied,
in order to avoid buffer under-runs.
Buffering involves considerable copying,
and often has a major impact on I/O performance.

1.2.1.6 Sharable vs dedicated devices

The final concept that we will mention here is sharable versus dedicated devices.

Some I/O devices, such as disks,
can be used by many users at the same time.
No problems are caused by multiple users having open files,
on the same disk at the same time.

Other devices, such as tape drives,
have to be dedicated to a single user until that user is finished.
Then another user can have the tape drive.
Having two or more users writing blocks, intermixed at random,
to the same tape, will definitely not work.

Introducing dedicated (unshared) devices also introduces a variety of problems,
such as deadlocks.
The operating system must be able to handle both shared and dedicated devices,
in a way that avoids problems.

1.2.2 Layering in I/O Software

I/O software is often organized in four layers, as shown:
03-InputOutput/f3-05.png

Layers of the I/O software system.
We will look at each in turn, starting at the bottom.
The emphasis in this lecture is on the device drivers (layer 2),
but we will summarize the rest of the I/O software,
to show how the pieces of the I/O system fit together.

1.2.2.1 Interrupt Handlers

Interrupts are an unpleasant fact of life.
Although they cannot be avoided, they should be hidden away,
deep in the bowels of the operating system,
so that as little of the operating system as possible knows about them.
The best way to hide them,
is to have the driver starting an I/O operation, block,
until the I/O has completed, and the interrupt occurs.
The driver can block itself by doing a:
down on a semaphore,
a wait on a condition variable,
a receive on a message,
or something similar.

When the interrupt happens,
the interrupt procedure handles the interrupt.
Then it can unblock the driver that started it.
In some cases it will just complete up on a semaphore.
In others it will do a signal on a condition variable in a monitor.
In still others, it will send a message to the blocked driver.
In all cases, the net effect of the interrupt will be that:
a driver that was previously blocked, will now be able to run.
If drivers are structured as independent processes,
with their own states, stacks, and program counters.
then this model works better.

1.2.2.2 Device Drivers

Earlier, we said each device controller has registers,
used to give it commands, or to read out its status, or both.
The number of registers, and the nature of the commands,
vary radically from device to device.

For example, a mouse driver has to accept information from the mouse,
telling how far it has moved, and which buttons are currently depressed.

In contrast, a disk driver has to know about:
sectors, tracks, cylinders, heads, arm motion, motor drives, head settling times,
and all the other mechanics of making the disk work properly.

Thus, each I/O device attached to a computer,
needs some device-specific code for controlling it.
This code, called the device driver,
is generally written by the device’s manufacturer,
and delivered along with the device.
Since each operating system needs its own drivers,
device manufacturers often supply drivers for popular operating systems.

Each device driver normally handles one device type,
or one class of closely related devices.

For example, it would probably be a good idea to have a single mouse driver,
even if the system supports several different brands of mice.

As another example, a disk driver can usually handle multiple disks,
of different sizes and different speeds, and perhaps a CD-ROM as well.

On the other hand, a mouse and a disk are so different,
that different drivers are necessary.

In order to access the device’s hardware,
meaning the controller’s registers,
the device driver traditionally has been part of the system kernel.
This approach gives the good performance,
but has worse reliability,
since a bug in any device driver, can crash the entire system.
MINIX3 departs from this model, to enhance reliability and security.
In MINIX3 each device driver is now a separate user-mode process.

Operating systems usually classify drivers:
as block devices, such as disks, or
character devices, such as keyboards and printers.
Most operating systems define a standard interface,
that all block drivers must support,
and a second standard interface,
that all character drivers must support.
These interfaces consist of a number of procedures,
that the rest of the operating system can call,
to get the driver to do work for it.

The job of a device driver is to accept abstract requests,
from the device-independent software above it,
and see to it that the request is executed.
A typical request to a disk driver is to read block n.
If the driver is idle at the time a request comes in,
then it starts carrying out the request immediately.
If, however, it is already busy with a request,
it can enter the new request into a queue of pending requests,
to be dealt with as soon as possible.

The first step in actually carrying out an I/O request,
is to check that the input parameters are valid,
and if they are not, to return an error.
If the request is valid,
then the next step is to translate it,
from abstract to concrete terms.

For a disk driver, this means:
figuring out where on the disk the requested block actually is,
checking to see if the drive’s motor is running,
determining if the arm is positioned on the proper cylinder,

and so on.
The driver must decide:
which controller operations are required,
and in what sequence.

Once the driver has determined which commands to issue to the controller,
it starts issuing them, by writing into the controller’s device registers.
Simple controllers can handle only one command at a time.
More sophisticated controllers accept a linked list of commands,
which they then carry out by themselves,
without further help from the operating system.

After the command or commands have been issued,
one of two situations will apply.

Blocking
In many cases, the device driver must wait,
until the controller does some work for it,
so it blocks itself,
until an interrupt arrives, to unblock it.
The blocked driver will later be awakened by the interrupt.

No blocking
In other cases, the operation finishes without delay,
so the driver need not block.

As an example of the latter situation,
scrolling the screen on some graphics cards,
requires just writing a few bytes into the controller’s registers.
No mechanical motion is needed,
so the entire operation can be completed in a few microseconds.
The driver never goes to sleep.

Either way, after the operation has been completed,
it must check for errors.
If everything is all right,
then the driver may have data to pass,
to the device-independent software (e.g., a block just read).
Finally, it returns some status information,
for error reporting, back to its caller.

If any other requests are queued,
then one of them can now be selected and started.
If nothing is queued,
then the driver blocks waiting for the next request.

Drivers mainly serve requests for reading and writing,
but there may be other requirements.
For example, the driver may initialize a device,
at system startup, or the first time it is used.
Also, there may be a need to manage power requirements,
handle Plug ’n Play, or log events.

1.2.2.3 Device-Independent I/O Software

Although some of the I/O software is device specific,
a large fraction of it is device independent.
Divisions between drivers and the device-independent software are system dependent,
because some functions that could be device-independent,
may actually be done in the drivers,
for efficiency or other reasons.
Device-independent software typically performs the following functions:
03-InputOutput/f3-06.png

In MINIX3, most of the device-independent software is part of the file system.
Although we will study the file system upcoming,
we will take a quick look at the device-independent software here,
to provide some perspective on I/O,
and show better where the drivers fit in.

The basic function of the device-independent software,
is to perform the I/O functions that are common to all devices,
and to provide a uniform interface to the user-level software.
Below we will look at the above issues in more detail.

1.2.2.3.1 Uniform Interfacing for Device Drivers

One major issue in operating systems is:
how to make all I/O devices and drivers look more-or-less the same.
If disks, printers, monitors, keyboards, etc.,
are all interfaced in different ways,
then every time a new peripheral device comes along,
the operating system must be modified for the new device:

With a standard interface,
it is much easier to plug in a new driver,
providing it conforms to the driver interface.
It also means that driver writers know what is expected of them
(e.g., what functions they must provide
and what kernel functions they may call).

In practice, not all devices are absolutely identical,
but usually there are only a small number of device types,
and even these are generally almost the same.
For example, even block and character devices have many functions in common.

Another aspect of having a uniform interface is how I/O devices are named.
The device-independent software maps symbolic device names onto the proper driver.In UNIX and MINIX3, a device name, such as /dev/disk0,
uniquely specifies the i-node for a special file,
and this i-node contains the major device number,
which is used to locate the appropriate driver.
The inode also contains the minor device number,
which is passed as a parameter to the driver,
in order to specify the unit to be read or written.
All devices have major and minor numbers,
and all drivers are accessed by using the major device number,
to select the driver.

Closely related to naming is protection.
How does the system prevent users from accessing devices,
that they are not entitled to access?
In UNIX, MINIX3, and also in later Windows versions,
such as Windows 2000 and Windows XP,
devices appear in the file system as named objects,
which means that the usual protection rules for files,
also apply to I/O devices.
The system administrator can then set the proper permissions
(i.e., in UNIX the rwx bits) for each device.

1.2.2.3.2 Buffering

Block:
For block devices.
hardware generally reads and writes entire blocks at once,
but user processes are free to read and write in arbitrary units.
If a user process writes half a block,
the operating system will normally keep the data around internally,
until the rest of the data are written,
at which time the block can go out to the disk.

Character:
For character devices,
users can write data to the system faster than it can be output,
necessitating buffering.
Keyboard input that arrives before it is needed also requires buffering.

1.2.2.3.3 Error Reporting

Errors are far more common in the context of I/O,
than in any other context!
When they occur, the operating system must handle them.
Many errors are device-specific,
so only the driver knows what to do
(e.g., retry, ignore, or panic).

An example error is:
one caused by a disk block that has been damaged,
and cannot be read any more.
After the driver has tried to read the block a certain number of times,
it gives up, and informs the device-independent software.
How the error is treated from here on, is device independent.

If the error occurred while reading a user file,
it may be sufficient to report the error back to the caller.

However, if it occurred while reading a critical system data structure,
such as the block containing the bitmap showing which blocks are free,
the operating system may have to display an error message and terminate.

1.2.2.3.4 Allocating and Releasing Dedicated Devices

Some devices, such as CD-ROM recorders,
can be used only by a single process at any given moment.
It is up to the operating system to examine requests for device usage,
and accept or reject them,
depending on whether the requested device is available or not.
A simple way to handle these requests,
is to require processes to perform an open,
on the special files for devices directly.
If the device is unavailable, the open fails.
Closing such a dedicated device then releases it.

1.2.2.3.5 Device-Independent Block Size

Not all disks have the same sector size.
It is up to the device-independent software to hide this fact,
and provide a uniform block size to higher layers,
for example, by treating several sectors as a single logical block.
In this way, the higher layers only deal with abstract devices,
that all use the same logical block size,
independent of the physical sector size.

Some character devices deliver their data one byte at a time (e.g., modems),
while others deliver theirs in larger units (e.g., network interfaces).
These differences may also be hidden.

1.2.2.4 User-Space I/O Software

Although most of the I/O software is within the operating system,
a small portion consists of libraries linked together with user programs,
and even whole programs running outside the kernel.
System calls, including the I/O system calls,
are normally made by library procedures.

For example, when a C program contains the call
count = write(fd, buffer, nbytes);
the library procedure write will be linked with the program,
and contained in the binary program present in memory at run time.
The collection of all these library procedures is clearly part of the I/O system.

These procedures merely put their parameters in place for the system call,
but there are other I/O procedures that actually do real work.

In particular, formatting of input and output is done by library procedures.
For example, the C printf,
which takes as input, a format string, and possibly some variables,
builds an ASCII string,
and then calls write to output the string.
Consider the statement:
printf("The square of %3d is %6d\n", i, i*i);
It formats a string consisting of the 14-character string “The square of”,
followed by the value i as a 3-character string,
then the 4-character string ” is “,
then i² as six characters,
and finally a newline feed.

An example of a similar procedure for input is scanf
which reads input, and stores it into variables,
in a format described in a format string,
using the same syntax as printf.

The standard I/O library contains a number of procedures that involve I/O,
and all run as part of user programs.

Not all user-level I/O software consists of library procedures.
Another important category is the spooling system.
Spooling deals with dedicated I/O devices in a multiprogramming system.

Consider a typical spooled device: a printer.
One could let any user process open the character special file for the printer.
However, suppose a process opened it and then did nothing for hours?
No other process could print anything.

Instead what is done is to create:
a special process, called a daemon,
and a special directory, called a spooling directory.

To print a file, a process first generates the entire file to be printed,
and puts it in the spooling directory.
It is up to the daemon to print the files in the directory.
The daemon is the only process with permission to use the printer’s special file.
By protecting the special file against direct use by users,
the problem of having someone keeping it open unnecessarily long is eliminated.

Spooling is used not only for printers,
but also in various other situations.
For example, electronic mail (e-mail) usually uses a daemon.
When a message is submitted, it is put in a mail spool directory.
Later on the mail daemon tries to send it.
At any given instant of time,
a particular destination may be temporarily unreachable,
so the daemon leaves the message in the spool,
with status information indicating it should be tried again.
The daemon may also send a message back to the sender,
saying delivery is delayed, or,
after a delay of hours or days,
saying the message cannot be delivered.
All of this is outside the operating system.

The image below summarizes the I/O system,
showing the layers and principal functions of each layer.
Starting at the bottom, the layers are:
the hardware,
interrupt handlers,
device drivers,
device-independent software, and
user processes.
03-InputOutput/f3-08.png

Layers of the I/O system, and the main functions of each layer.
The arrows show the flow of control.

For example, when a user program tries to read a block from a file,
the operating system is invoked, to carry out the call.
The device-independent software looks for it in the buffer cache.
If the needed block is not there,
then it calls the device driver,
to issue the request to the hardware,
to go get it from the disk.
The process is then blocked,
until the disk operation has been completed.

When the disk is finished, the hardware generates an interrupt.
The interrupt handler is run, to discover what has happened,
that is, which device wants attention right now.
It then extracts the status from the device,
and wakes up the sleeping process,
to finish off the I/O request,
and let the user process continue.

1.3 Deadlocks

Computer systems are full of resources that can only be used by one process at a time.
Common examples include:
printers, drives, and slots in the system’s internal tables.

Having two processes simultaneously writing to the printer leads to gibberish.
Having two processes using the same file system table slot,
will invariably lead to a corrupted file system.
Operating systems temporarily grant a process exclusive access to certain resources,
both hardware and software.

Suppose, for example, two processes each want to record a scanned document on a CD.
Process A requests permission to use the scanner, and is granted it.
Process B is programmed differently,
and requests the CD recorder first, and is also granted it.
Now A asks for the CD recorder, but the request is denied, until B releases it.
Unfortunately, instead of releasing the CD recorder, B asks for the scanner.
At this point, both processes are blocked, and will remain so forever.
This situation is called a resource deadlock.

Resource deadlocks can occur in a variety of situations,
besides requesting dedicated I/O devices.
In a database system for example,
a program may have to lock several records it is using,
to avoid race conditions.
If process A locks record R1, and process B locks record R2,
and then each process tries to lock the other one’s record,
we also have a deadlock.

Although this material is about deadlocks in the context of operating systems,
they also occur in database systems, and many other contexts in computer science,
so this material is actually applicable to a wide variety of multiprocess systems.

1.3.1 Resources

Deadlocks can occur when processes have been granted exclusive access to devices, files, etc.
To make the discussion of deadlocks as general as possible,
we will refer to the objects granted as resources.
A resource can be either a hardware device (e.g., a tape drive),
or a piece of information (e.g., a locked record in a database).
A computer will normally have many different resources that can be acquired.
For some resources, several identical instances may be available,
such as three tape drives.
When interchangeable copies of a resource are available,
called fungible resources, any one of them can be used,
to satisfy any request for the resource.
A resource is anything that can be used by only a single process,
at any instant of time.

A preemptable resource is one that can be taken away from the process owning it,
with no ill effects.
Memory is an example of a preemptable resource.
Consider, for example,
a system with 64 MB of user memory,
one printer,
and two 64-MB processes, that each want to print something.
Process A requests and gets the printer,
then starts to compute the values to print.
Before it has finished with the computation,
it exceeds its time quantum and is swapped or paged out.

Process B now runs and tries, unsuccessfully, to acquire the printer.
Potentially, we now have a deadlock situation,
because A has the printer and B has the memory,
and neither can proceed without the resource held by the other.
Fortunately, it is possible to preempt (take away) the memory from B,
by swapping it out (to disk for example), and swapping A in.
Now A can run, do its printing, and then release the printer.
No deadlock occurs.

A non-preemptable resource, is one that cannot be taken away from its current owner,
without causing the computation to fail.
If a process has begun to burn a CD-ROM,
then suddenly taking the CD recorder away from it,
and giving it to another process,
will result in a garbled CD.
CD recorders are not preemptable at an arbitrary moment.

In general, deadlocks involve non-preemptable resources.
Potential deadlocks that involve preemptable resources,
can usually be resolved,
by reallocating resources from one process to another.
Thus our treatment will focus on non-preemptable resources.

The sequence of events required to use a resource is given below,
in an abstract form:

If the resource is not available when it is requested,
then the requesting process is forced to wait.
In some operating systems,
when a resource request fails,
the process is automatically blocked,
and awakened when it becomes available.
In other systems, the request fails with an error code,
and it is up to the calling process to wait a little while,
and try again.

1.3.2 Principles of Deadlock

A set of processes is deadlocked,
if each process in the set is waiting for an event,
that only another process in the set can cause.

Because all the processes are waiting,
none of them will ever cause any of the needed events,
that could wake up any of the other members of the set,
and all the processes continue to wait forever.
For this model, we assume that processes have only a single thread,
and that there are no interrupts possible,
to wake up a blocked process.
The no-interrupts condition is needed for deadlock,
to prevent an otherwise deadlocked process,
from being awakened by, for example, an alarm,
and then causing events that release other processes in the set.

In most cases, the event that each process is waiting for,
is the release of some resource,
currently possessed by another member of the set.
Each member of the set of deadlocked processes is waiting for a resource,
that is owned by a deadlocked process.
None of the processes can run,
none of them can release any resources,
and none of them can be awakened.
The number of processes, and the number and kind of resources possessed and requested,
are unimportant.
This result holds for any kind of resource,
including both hardware and software.

1.3.2.1 Conditions for Deadlock

Coffman et al. (1971) showed that four conditions must hold for there to be a deadlock:

All four of these conditions must be present for a deadlock to occur.
If one of them is absent, then no deadlock is possible.

In a series of papers,
Levine (2003a, 2003b, 2005) points out that there are various situations,
all called deadlock in the literature,
and that Coffman et al.’s conditions apply more narrowly,
only to what should properly be called resource deadlock.
The literature contains other examples of “deadlock”,
that do not really meet all of these conditions.

For example, if four vehicles arrive simultaneously at a crossroad,
and try to obey the rule that each should yield to the vehicle on the right,
then none can proceed,
but this is not a case where one process already has possession of a unique resource.
Rather, this problem is a “scheduling deadlock”,
which can be resolved by a decision about priorities,
imposed from outside by a policeman.

It is worth noting that:
each condition relates to a policy that a system can have, or not have.
Can a given resource be assigned to more than one process at once?
Can a process hold a resource and ask for another?
Can resources be preempted?
Can circular waits exist?
Later on, we will see how deadlocks can be attacked,
by trying to negate some of these conditions.

1.3.2.2 Deadlock Modeling

Holt (1972) showed how these four conditions can be modeled using directed graphs.
The graphs have two kinds of nodes:
processes, shown as circles, and
resources, shown as squares.

An arc from a resource node (square) to a process node (circle),
means that the resource has previously been
requested by, granted to, and is currently held by that process.
An arc from a process to a resource,
means that the process is currently blocked waiting for that resource.
03-InputOutput/f3-09.png

Resource allocation graphs.

Now let us see how resource graphs can be used.
Imagine that we have three processes, A, B, and C,
and three resources, R, S, and T.
03-InputOutput/f3-10.png

An example of how deadlock occurs and how it can be avoided.

(a-c) The requests and releases of the three processes are given.
The operating system is free to run any unblocked process at any instant,
so it could decide to run A until A finished all its work,
then run B to completion, and finally run C.
This ordering does not lead to any deadlocks
(because there is no competition for resources)
but it also has no parallelism at all.
In addition to requesting and releasing resources,
processes compute and do I/O.
When the processes are run sequentially,
there is no possibility that while one process is waiting for I/O,
another can use the CPU.
Thus running the processes strictly sequentially may not be optimal.
On the other hand, if none of the processes do any I/O at all,
shortest job first is better than round robin,
so under such circumstances,
running all processes sequentially may be the best way.

The operating system is not required to run the processes in any special order.
In particular, if granting a particular request might lead to deadlock,
then the operating system can simply suspend the process,
without granting the request until it is safe
(i.e., just not schedule the process).
If the operating system knew about the impending deadlock,
then it could suspend B instead of granting it S.

After step (q), process B can be granted S,
because A is finished and C has everything it needs.
Even if B should eventually block when requesting T, no deadlock can occur.
B will just wait until C is finished.

Some algorithms can make allocation decisions that do not lead to deadlock.
For now, the point to understand is that resource graphs are a tool,
that let us see if a given request/release sequence leads to deadlock.
We just carry out the requests and releases step by step,
and after every step check the graph,
to see if it contains any cycles.
If so, we have a deadlock;
if not, there is no deadlock.

This example was of case of a single resource of each type,
but resource graphs can also be generalized,
to handle multiple resources of the same type (Holt, 1972).
However, Levine (2003a, 2003b) points out, that with fungible resources,
this can get very complicated indeed.
If even one branch of the graph is not part of a cycle, that is,
if one process which is not deadlocked, holds a copy of one of the resources,
then deadlock may not occur.

1.3.3 The Ostrich Algorithm

The first and simplest approach is the ostrich algorithm:
stick your head in the sand, and pretend there is no problem at all.
Different people react to this strategy in very different ways.
Mathematicians find it completely unacceptable,
and say that deadlocks must be prevented at all costs.
Engineers ask how often the problem is expected,
how often the system crashes for other reasons,
and how serious a deadlock is.
If deadlocks occur on the average once every five years,
but system crashes due to other bugs occur once a week,
then most engineers would not be willing to pay a large penalty,
in performance or convenience, to eliminate deadlocks.

To make this contrast more specific,
UNIX (and MINIX3) potentially suffer from deadlocks that are not even detected,
let alone automatically broken.

The total number of processes in a system,
is determined by the number of entries in the process table.
Thus process table slots are finite resources.
If a fork fails because the table is full,
a reasonable approach for the program doing the fork,
is to wait a random time and try again.

Now suppose that a MINIX3 system has 100 process slots.
Ten programs are running,
each of which needs to create 12 (sub)processes.
After each process has created 9 processes,
the 10 original processes and the 90 new processes have exhausted the table.
Each of the 10 original processes now sits in an endless loop,
forking and failing—a deadlock.
The probability of this happening is minuscule,
but it could happen.
Should we abandon processes and the fork call to eliminate the problem?

The maximum number of open files is similarly restricted,
by the size of the i-node table,
so a similar problem occurs when it fills up.

Swap space on the disk is another limited resource.
In fact, almost every table in the operating system represents a finite resource.
Should we abolish all of these, because it might happen,
that a collection of n processes might each claim 1/n of the total,
and then each try to claim another one?

Most operating systems, including UNIX, MINIX3, and Windows,
just ignore the problem, on the assumption that:
most users would prefer an occasional deadlock,
to a rule restricting all users to only one open resource.

If deadlocks could be eliminated for free,
then there would not be much discussion.
The problem is that, the price is high,
mostly in terms of putting inconvenient restrictions on processes,
as we will see shortly.
Thus we are faced with an unpleasant trade-off,
between convenience and correctness,
and a great deal of discussion about which is more important, and to whom.
Under these conditions, general solutions are hard to find.

1.3.4 Detection and Recovery

A second technique is detection and recovery.
The system only monitors the requests and releases of resources,
rather than preventing, just fixed errors after they occur.
Every time a resource is requested or released,
the resource graph is updated,
and a check is made, to see if any cycles exist.
If a cycle exists, then one of the processes in the cycle is killed.
If this does not break the deadlock,
another process is killed,
and so on until the cycle is broken.

A somewhat cruder method,
is not even to maintain the resource graph,
but instead periodically to check timing,
to see if there are any processes that have been continuously blocked,
for more than some fixed time, perhaps, 1 hour.
Such processes are then killed.

Detection and recovery is the strategy often used on large mainframe computers,
especially batch systems, in which killing a process and then restarting it,
is usually acceptable.
Care must be taken to restore any modified files to their original state,
and undo any other side effects that may have occurred.

1.3.5 Deadlock Prevention

The third deadlock strategy is to impose suitable restrictions on processes,
so that deadlocks are structurally impossible.
The four conditions stated by Coffman et al.
(1971) provide a clue to some possible solutions.

1.3.5.1 Eliminate mutal exclusion?

First let us attack the mutual exclusion condition.
If no resource was ever assigned exclusively to a single process,
then we would never have deadlocks.

For example:
Allowing two processes to write on the printer at the same time will lead to chaos.
But, by spooling printer output,
several processes can generate output at the same time.
The only process that actually requests the physical printer is the printer daemon.
Since the daemon never requests any other resources,
we can eliminate deadlock for the printer.

Unfortunately, not all devices can be spooled
(the process table does not lend itself well to being spooled).
Furthermore, competition for disk space for spooling can itself lead to deadlock.
What if two processes each filled up half of the available spooling space with output,
and neither was finished producing output?

If the daemon was programmed to begin printing even before all the output was spooled,
and an output process decided to wait several hours after the first burst of output,
then the printer could still lie idle.

Daemons are normally programmed to print only after the complete output file is available.
In this case, we can have two processes that have each finished part,
but not all, of their output, and cannot continue.
Neither process will ever finish, so we have a deadlock on the disk.

1.3.5.2 Eliminate hold and wait?

The second of the conditions stated by Coffman et al. looks slightly more promising.
If we can prevent processes that hold resources,
from waiting for more resources,
then we can eliminate deadlocks.
One way to achieve this goal, while still allowing multiple resource access,
is to require all processes to request all their resources before starting execution.
If everything is available,
then the process will be allocated whatever it needs,
and can run to completion.
If one or more of the desired resources are busy,
then nothing will be allocated and the process would just wait.

An immediate problem with this approach is that:
many processes do not know how many resources they will need,
until after they have started running.

Another problem is that:
resources will not be used optimally with this approach.

Take, as an example, a process that reads data from an input tape,
analyzes it for an hour, and then writes an output tape,
as well as plotting the results.
If all resources must be requested in advance,
then the process will tie up the output tape drive and the plotter for an hour.

A slightly different way to break the hold-and-wait condition,
is to require that a process requesting a resource,
to first temporarily release all the resources it currently holds.
Then it tries to get everything it needs all at once.

1.3.5.3 Eliminate no preemption?

Attacking the third condition (no preemption) is even less promising,
compared to attacking the second one.
If a process has been assigned the printer,
and is in the middle of printing its output,
then forcibly taking away the printer,
because a needed plotter is not available,
is tricky at best, and impossible at worst.

1.3.5.4 Eliminate circular wait?

Only one condition is left.
The circular wait can be eliminated in several ways.

One way is simply to have a rule saying that:
a process is entitled only to a single resource at any moment.
If it needs a second one, then it must release the first one.
For a process that needs to copy a huge file from a tape to a printer,
this restriction is unacceptable.

Another way to avoid the circular wait is to:
provide a global numbering of all the resources, as shown in (a).
03-InputOutput/f3-11.png

(a) Numerically ordered resources.
(b) A resource graph.

Now the rule is this:
processes can request resources whenever they want to,
but all requests must be made in numerical order.
A process may request a scanner first, and then a tape drive,
but it may not request a plotter first, and then a scanner.

With this rule, the resource allocation graph can never have cycles.
Let us see why this is true for the case of two processes (b).
We can get a deadlock only if:
A requests resource j,
and B requests resource i.
Assuming i and j are distinct resources,
they will have different numbers.

If i > j, then A is not allowed to request j,
because that is lower than what it already has.

If i < j, then B is not allowed to request i,
because that is lower than what it already has.

With multiple processes, the same logic holds.
At every instant, one of the assigned resources will be highest.
The process holding that resource will never ask for a resource already assigned.
It will either finish, or at worst,
request even higher numbered resources,
all of which are available.
Eventually, it will finish and free its resources.
At this point, some other process will hold the highest resource,
and can also finish.
In short, there exists a scenario in which all processes finish,
so no deadlock is present.

A minor variation of this algorithm is to:
drop the requirement that resources be acquired in strictly increasing sequence,
and merely insist that no process request a resource lower than what it is already holding.
If a process initially requests 9 and 10, and then releases both of them,
it is effectively starting all over,
so there is no reason to prohibit it from now requesting resource 1.

Although numerically ordering the resources eliminates the problem of deadlocks,
it may be impossible to find an ordering that satisfies everyone.
When the resources include process table slots, disk spooler space,
locked database records, and other abstract resources,
the number of potential resources and different uses may be so large,
that no ordering could possibly work.

Also, as Levine (2005) points out, ordering resources negates fungibility,
a perfectly good and available copy of a resource,
could be inaccessible with such a rule.

The various approaches to deadlock prevention are summarized:
03-InputOutput/f3-12.png

Summary of approaches to deadlock prevention.

1.3.6 Deadlock Avoidance

We saw that deadlock was avoidable,
by carefully analyzing each resource request,
to see if it could be safely granted.

Question:
is there an algorithm that can always avoid deadlock,
by making the right choice all the time?

Answer:
The answer is a qualified yes, we can avoid deadlocks,
but only if certain information is available in advance.
Deadlock can be avoided by careful resource allocation.

1.3.6.1 The Banker’s Algorithm for a Single Resource

The Banker’s Algorithm can avoid deadlocks (Dijkstra, 1965).
It is modeled on the way a small-town banker might lend,
in dealing with a group of customers to whom he has granted lines of credit.
The banker does not necessarily have enough cash on hand,
to simultaneously lend every customer the full amount of each one’s line of credit.
In (a) we see four customers, A, B, C, and D,
each of whom has been granted a certain number of credit units
(e.g., 1 unit is 1K dollars):
03-InputOutput/f3-13.png

Three resource allocation states:
(a) Safe.
(b) Safe.
(c) Unsafe.

The banker knows that not all customers will need their maximum credit immediately,
so he has reserved only 10 units rather than 22 to service them.
He also trusts every customer to be able to repay his loan,
very soon after receiving his total line of credit (it is a small town),
A customer can only repay their loan,
after having received it all.
so he knows eventually he can service all the requests.
In this analogy:
customers are like processes,
units are like drives,
and the banker is the operating system.

Each part of the figure shows a state of the system,
with respect to resource allocation, that is,
a list of customers showing the money already loaned (drives already assigned)
and the maximum credit available (maximum number of drives needed at once later).
A state is safe, if there exists a sequence of other states,
that leads to all customers getting loans up to their credit limits
(all processes getting all their resources and terminating).

The customers go about their respective businesses,
making loan requests from time to time (i.e., asking for resources).
At a certain moment, the situation is as shown in (b).
This state is safe because with two units left,
the banker can delay any requests except C’s,
thus letting C finish, and release all four of his resources.
With four units in hand, the banker can let either D or B have the necessary units, and so on.

Consider what would happen:
If a request from B for one more unit were granted in (b).
Then, we would have situation (c), which is unsafe.
If all the customers suddenly asked for their maximum loans,
then the banker could not satisfy any of them,
and we would have a deadlock.
An unsafe state does not have to lead to deadlock,
since a customer might not need the entire credit line available,
but the banker cannot count on this behavior.

The banker’s algorithm considers each request as it occurs,
and calculates if granting it leads to a safe state.
If it does, the request is granted;
otherwise, it is postponed until later.
To see if a state is safe,
the banker checks to see if he has enough resources to satisfy some customer.
If so, those loans are assumed to be repayable,
and the customer now closest to the limit is checked, and so on.
If all loans can eventually be repaid,
the state is safe and the initial request can be granted.

1.3.6.2 Resource Trajectories

The above algorithm was described in terms of a single resource class
(e.g., only tape drives or only printers, but not some of each).
Below is a model for dealing with two processes and two resources,
for example, a printer and a plotter.
03-InputOutput/f3-14.png

Two process resource trajectories.

The horizontal axis represents the number of instructions executed by process A.
The vertical axis represents the number of instructions executed by process B.
At I₁ A requests a printer; at I₂ it needs a plotter.
The printer and plotter are released at I₃ and I₄ respectively.
Process B needs the plotter from I₅ to I₇ and the printer from I₆ to I₈.

Every point in the diagram represents a joint state of the two processes.
Initially, the state is at p,
with neither process having executed any instructions.
If the scheduler chooses to run A first,
then we get to the point q,
in which A has executed some number of instructions,
but B has executed none.
At point q the trajectory becomes vertical,
indicating that the scheduler has chosen to run B.
With a single processor, all paths must be horizontal or vertical, never diagonal.
Furthermore, motion is always to the north or east,
never to the south or west (processes cannot run backward).

When A crosses the I₁ line on the path from r to s,
it requests and is granted the printer.
When B reaches point t, it requests the plotter.

The regions that are shaded are especially interesting.
and represent both processes having the resource.
The mutual exclusion rule makes it impossible to enter this region.
Under no conditions can the system enter the shaded regions.

If the system ever enters the box bounded by:
I₁ and I₂ on the sides and I₅ and I₆ top and bottom,
then it will eventually deadlock,
when it gets to the intersection of I₂ and I₆.
At this point, A is requesting the plotter,
and B is requesting the printer,
and both are already assigned.
The entire box is unsafe, and must not be entered.
At point t, the only safe thing to do,
is run process A until it gets to I₄.
Beyond that, any trajectory to u will do.

The important thing to see here is that:
at point t, B is requesting a resource.
The system must decide whether to grant it or not.
If the grant is made,
then the system will enter an unsafe region,
and eventually deadlock.
To avoid the deadlock, B should be suspended,
until A has requested and released the plotter.

1.3.6.3 The Banker’s Algorithm for Multiple Resources

The previous graphical model is difficult to apply to the general case,
with an arbitrary number of processes,
and an arbitrary number of resource classes,
each with multiple instances (e.g., two plotters, three tape drives).
However, the banker’s algorithm can be generalized to do the job:
03-InputOutput/f3-15.png

We see two matrices.

The one on the left:
shows how many of each resource are currently assigned,
to each of the five processes.

The matrix on the right:
shows how many resources each process still needs,
in order to complete.

As in the single resource case,
processes must state their total resource needs before executing,
so that the system can compute the right-hand matrix at each instant.

The three vectors at the right of the figure show;
the existing resources, E,
the possessed resources, P, and
the available resources, A, respectively.
From E we see that the system has:
six tape drives, three plotters, four printers, and two CD-ROM drives.
Of these, five tape drives, three plotters, two printers, and two CD-ROM drives are currently assigned.
This fact can be seen,
by adding up the four resource columns in the left-hand matrix.
The available resource vector is simply the difference between what the system has,
and what is currently in use.

If several processes are eligible to be chosen in step 1,
then it does not matter which one is selected:
the pool of available resources either gets larger, or stays the same.

Now let us get back to the example image above.
The current state is safe.
Suppose that process B now requests a printer.
This request can be granted, because the resulting state is still safe
(process D can finish, and then processes A or E, followed by the rest).

Now imagine that:
after giving B one of the two remaining printers,
E wants the last printer.
Granting that request would reduce the vector of available resources to (1 0 0 0),
which leads to deadlock.
Clearly E’s request must be deferred for a while.

The banker’s algorithm was first published by Dijkstra in 1965.
Since that time, nearly every book on operating systems has described it in detail.
Innumerable papers have been written about various aspects of it.
Few authors have had the audacity to point out that:
although in theory the algorithm is wonderful,
in practice it is essentially useless,
because processes rarely know in advance what their maximum resource needs will be.
In addition, the number of processes is not fixed,
but dynamically varying, as new users log in and out.
Furthermore, resources that were thought to be available,
can suddenly vanish (tape drives can break).
Thus in practice, few, if any, existing systems use the banker’s algorithm for avoiding deadlocks.

1.3.6.4 In summary:

The schemes described earlier under the name “prevention”,
are overly restrictive,
and the algorithm described here as “avoidance”,
requires information that is usually not available.
If you can think of a general-purpose algorithm,
that does the job in practice as well as in theory,
then write it up and send it to your local computer science journal.

Although both avoidance and prevention are not promising in the general case,
for specific applications, many excellent special-purpose algorithms are known.

As an example, in many database systems,
an operation that occurs frequently is:
requesting locks on several records,
and then updating all the locked records.
When multiple processes are running at the same time,
there is a real danger of deadlock.
To eliminate this problem, special techniques are used.

The approach most often used is called two-phase locking.
In the first phase,
the process tries to lock all the records it needs,
one at a time.
If it succeeds, then it begins the second phase,
performing its updates and releasing the locks.
No real work is done in the first phase.

If during the first phase,
some record is needed that is already locked,
then the process just releases all its locks,
and starts the first phase all over.
This approach is similar to requesting all the resources needed in advance,
or at least before anything irreversible is done.
In some versions of two-phase locking,
if a lock is encountered during the first phase,
then there is no release and restart.
In these versions, deadlock can occur.

However, this strategy is not applicable in general.
In real-time systems and process control systems, for example,
it is not acceptable to just terminate a process partway through,
because a resource is not available, and start all over again.
Neither is it acceptable to start over,
if the process has read or written messages to the network, updated files,
or anything else that cannot be safely repeated.
The algorithm works only in some situations,
where the programmer has very carefully arranged things,
so that the program can be stopped at any point during the first phase, and restarted.
Many applications cannot be structured this way.

1.4 Locking in the Linux kernel

1.5 Overview of I/O in minix3

The top four layers correspond to the four-layered structure of MINIX 3.
Interrupt handling was already covered.
In the following sections we will look briefly at each of the layers,
with an emphasis on the device drivers.
The device-independent I/O will be discussed later,
when we discuss the file system itself in-depth.

1.5.1 Interrupt Handlers and I/O Access in MINIX 3

Many device drivers start some I/O device, and then block,
waiting for a message to arrive.
That message is usually generated by the interrupt handler for the device.
Other device drivers do not start any physical I/O
(e.g., reading from RAM disk and writing to a memory-mapped display),
do not use interrupts, and do not wait for a message from an I/O device.
has been presented in great detail, and we will say no more about it here.
Here we will discuss interrupts for I/O in device drivers.

For disk devices,
input and output involves commanding a device to perform its operation,
and then waiting until the operation is complete.
The disk controller does most of the work,
and very little is required of the interrupt handler.

However, there is sometimes more for the low-level handler to do.
The message passing mechanism has a cost.
When an interrupt may occur frequently,
but the amount of I/O handled per interrupt is small,
it may be ideal to make the handler itself do somewhat more work,
and to postpone sending a message to the driver until a subsequent interrupt,
when there is more for the driver to do.
But, in MINIX 3 this is not possible for most I/O,
because the low level handler in the kernel is a general purpose routine,
used for almost all devices.

The clock is an exception to that constraint.
Because it is compiled with the kernel,
the clock can have its own handler that does extra work.
On many clock ticks, there is very little to be done,
except for maintaining the time.
This is done without sending a message to the clock task itself.
The clock’s interrupt handler increments a variable,
appropriately named realtime,
possibly adding a correction for ticks counted during a BIOS call.
The handler does some additional very simple arithmetic;
it increments counters for user time and billing time,
decrements the ticks_left counter for the current process,
and tests to see if a timer has expired.
Only if the current process has used up its quantum,
or a timer has expired,
is a message sent to the clock task.

The clock interrupt handler is unique compared to other devices in MINIX 3,
because the clock is the only interrupt driven device that runs in kernel space.
The clock hardware is integral to the PC.
The clock interrupt line does not connect to any pin on the sockets where add-on I/O controllers can be plugged in.
So it is impossible to install a clock upgrade package,
with replacement clock hardware,
with a driver provided by another manufacturer.
It is reasonable, then, for the clock driver to be compiled into the kernel,
and have access to any variable in kernel space.
MINIX 3 aims to make it unnecessary for any other device driver to have that kind of access.

The remaining device drivers that run in user space are restricted,
and cannot directly access kernel memory or I/O ports.
Although possible, it would also violate the design principles of MINIX 3,
to allow an interrupt service routine to make a call to execute a service routine within the text segment of a user process.
The kernel should not trust code provided by a user program.

There are several different levels of I/O access,
that might be needed by a user-space device driver.

1.5.1.1 Accessing outside process memory

The first case, access to extra memory segments,
uses hardware segmentation support provided by Intel processors.
A normal process has access only to its own text, data, and stack segments.
However the system task allows other segments to be defined,
and accessed by user-space processes.

The memory driver can access a memory region reserved for use as a RAM disk,
as well as other regions designated for special access.
The console driver accesses memory on a video display adapter in the same way.

1.5.1.2 Accessing IO ports

For the second case, MINIX 3 provides kernel calls,
which user drivers call to use I/O instructions.
The system task does the actual I/O,
on behalf of a less-privileged process.
The hard disk driver uses this service.
The disk driver may have to write to a single output port to select a disk,
then read from another port, in order to verify the device is ready.
If response is normally expected to be very quick, polling can be done.
There are kernel calls to specify a port and data to be written,
or a location for receipt of data read.
This requires that a call to read a port be nonblocking,
and in fact, kernel calls do not block.

Some insurance against device failure is useful.
A polling loop could include a counter,
that terminates the loop if the device does not become ready,
after a certain time.
A way is provided by the MINIX 3 system library,
which provides a getuptime function.
This uses a kernel call to retrieve a counter of clock ticks,
since system startup, maintained by the clock task.
The cost of using this information to keep track of time spent in a loop,
is the overhead of an additional kernel call on each iteration.
Another possibility is to ask the system task to set a watchdog timer.
But to receive a notification from a timer on a receive operation,
which will block, is required.
This is not a good solution if a fast response is expected.

The hard disk also makes use of variants of the kernel calls for I/O,
that make it possible to send to the system task,
a list of ports and data to write, or variables to be altered.
The hard disk driver writes a sequence of byte values to seven output ports,
to initiate an operation.
The last byte in the sequence is a command,
and when the disk controller completes a command,
it generates an interrupt.
All this can be accomplished with a single kernel call,
greatly reducing the number of messages needed.

1.5.1.3 Responding to predictable interrupts

This brings us to the third item in the list:
responding to an expected interrupt.
As noted in the discussion of the system task,
when an interrupt is initialized on behalf of a user space program
(using a sys_irqctl kernel call),
the handler routine for the interrupt is always generic_handler,
a function defined as part of the system task.
This routine converts the interrupt,
into a notification message to the process,
on whose behalf the interrupt was set.
The device driver therefore must initiate a receive operation,
after the kernel call that issues the command to the controller.
When the notification is received,
then the device driver services the interrupt.

Although in this case an interrupt is expected,
it is possibility that something might go wrong.
Given the possibility that an interrupt might fail to be triggered,
a process can request the system task to set up a watchdog timer.
Watchdog timers also generate notification messages,
and thus the receive operation could get a notification,
either because an interrupt occurred, or because a timer expired.
This is not a problem because the notification message indicates its origin,
although a notification does not convey much other information.
Although both notifications are generated by the system task,
notification of an interrupt will appear to come from HARDWARE,
and notification of a timer expiring will appear to come from CLOCK.

There is another problem.
If an interrupt is received in a timely way,
and a watchdog timer has been set,
then expiration of the timer at some future time,
will be detected by another receive operation,
possibly in the main loop of the driver.
One solution is to make a kernel call,
to disable the timer when the notification from HARDWARE is received.
Alternatively, if it is likely that the next receive operation will be one where a message from CLOCK is not expected, such a message could be ignored and receive called again.
Although less likely, it is conceivable that a disk operation could occur after an unexpectedly long delay, generating the interrupt only after the watchdog has timed out.
The same solutions apply here.
When a timeout occurs a kernel call can be made to disable an interrupt, or a receive operation that does not expect an interrupt could ignore any message from HARDWARE.

This is a good time to mention that when an interrupt is first enabled, a kernel call can be made to set a “policy” for the interrupt.
The policy is simply a flag that determines whether the interrupt should be automatically reenabled or whether it should remain disabled until the device driver it serves makes a kernel call to reenable it.
For the disk driver there may be a substantial amount of work to be done after an interrupt, and thus it may be best to leave the interrupt disabled until all data has been copied.

1.5.1.4 Responding to un-predictable interrupts

The fourth item in our list is the most problematic.
Keyboard support is part of the tty driver, which provides output as well as input.
Furthermore, multiple devices may be supported.
So input may come from a local keyboard, but it can also come from a remote user connected by a serial line or a network connection.
And several processes may be running, each producing output for a different local or remote terminal.
When you do not know when, if ever, an interrupt might occur, you cannot just make a blocking receive call to accept input from a single source if the same process may need to respond to other input and output sources.

MINIX 3 uses several techniques to deal with this problem.
The principal technique used by the terminal driver for dealing with keyboard input is to make the interrupt response as fast as possible, so characters will not be lost.
The minimum possible amount of work is done to get characters from the keyboard hardware to a buffer.
Additionally, when data has been fetched from the keyboard in response to an interrupt, as soon as the data is buffered the keyboard is read again before returning from the interrupt.
Interrupts generate notification messages, which do not block the sender; this helps to prevent loss of input.
A nonblocking receive operation is available, too, although it is only used to handle messages during a system crash.
Watchdog timers are also used to activate the routine that checks the keyboard.

1.5.2 Device Drivers in MINIX 3

For each class of I/O device present in a MINIX 3 system, a separate I/O device driver is present.
These drivers are full-fledged processes, each one with its own state, registers, stack, and so on.
Device drivers communicate with the file system using the standard message passing mechanism used by all MINIX 3 processes.
A simple device driver may be written as a single source file.
For the RAM disk, hard disk, and floppy disk there is a source file to support each type of device, as well as a set of common routines in driver.c and drvlib.c to support all blcok device types.
This separation of the hardware-dependent and hardwareindependent parts of the software makes for easy adaptation to a variety of different hardware configurations.
Although some common source code is used, the driver for each disk type runs as a separate process, in order to support rapid data transfers and isolate drivers from each other.

The terminal driver source code is organized in a similar way, with the hardware-independent code in tty.c and source code to support different devices, such as memory-mapped consoles, the keyboard, serial lines, and pseudo terminals in separate files.
In this case, however, a single process supports all of the different device types.

For groups of devices such as disk devices and terminals, for which there are several source files, there are also header files.
Driver.h supports all the block device drivers.
Tty.h provides common definitions for all the terminal devices.

The MINIX 3 design principle of running components of the operating system as completely separate processes in user space is highly modular and moderately efficient.
It is also one of the few places where MINIX 3 differs from UNIX in an essential way.
In MINIX 3 a process reads a file by sending a message to the file system process.
The file system, in turn, may send a message to the disk driver asking it to read the needed block.
The disk driver uses kernel calls to ask the system task to do the actual I/O and to copy data between processes.
This sequence (slightly simplified from reality) is shown in Fig. 3-16(a).
By making these interactions via the message mechanism, we force various parts of the system to interface in standard ways with other parts.

In UNIX all processes have two parts: a user-space part and a kernel-space part, as shown in Fig. 3-16(b).
When a system call is made, the operating system switches from the user-space part to the kernel-space part in a somewhat magical way.
This structure is a remnant of the MULTICS design, in which the switch was just an ordinary procedure call, rather than a trap followed by saving the state of the user-part, as it is in UNIX .

Device drivers in UNIX are simply kernel procedures that are called by the kernel-space part of the process.
When a driver needs to wait for an interrupt, it calls a kernel procedure that puts it to sleep until some interrupt handler wakes it up.
Note that it is the user process itself that is being put to sleep here, because the kernel and user parts are really different parts of the same process.

Among operating system designers, arguments about the merits of monolithic systems, as in UNIX, versus process-structured systems, as in MINIX 3, are endless.
The MINIX 3 approach is better structured (more modular), has cleaner interfaces between the pieces, and extends easily to distributed systems in which the various processes run on different computers.
The UNIX approach is more efficient, because procedure calls are much faster than sending messages.
MINIX 3 was split into many processes because we believe that with increasingly powerful personal computers available, cleaner software structure was worth making the system slightly slower.
The performance loss due to having most of the operating system run in user space is typically in the range of 5–10%.
Be warned that some operating system designers do not share the belief that it is worth sacrificing a little speed for a more modular and more reliable system.

In this chapter, drivers for RAM disk, hard disk, clock, and terminal are discussed.
The standard MINIX 3 configuration also includes drivers for the floppy disk and the printer, which are not discussed in detail.
The MINIX 3 software distribution contains source code for additional drivers for RS-232 serial lines, CD-ROMs, various Ethernet adapter, and sound cards.
These may be compiled separately and started on the fly at any time.

All of these drivers interface with other parts of the MINIX 3 system in the same way: request messages are sent to the drivers.
The messages contain a variety of fields used to hold the operation code (e.g., READ or WRITE) and its parameters.
A driver attempts to fulfill a request and returns a reply message.

For block devices, the fields of the request and reply messages are shown in Fig. 3-17.
The request message includes the address of a buffer area containing data to be transmitted or in which received data are expected.
The reply includes status information so the requesting process can verify that its request was properly carried out.
The fields for the character devices are basically similar but can vary slightly from driver to driver.
Messages to the terminal driver can contain the address of a data structure which specifies all of the many configurable aspects of a terminal, such as the characters to use for the intraline editing functions erase-character and kill-line.

The function of each driver is to accept requests from other processes, normally the file system, and carry them out.
All the block device drivers have been written to get a message, carry it out, and send a reply.
Among other things, this decision means that these drivers are strictly sequential and do not contain any internal multiprogramming, to keep them simple.
When a hardware request has been issued, the driver does a receive operation specifying that it is interested only in accepting interrupt messages, not new requests for work.
Any new request messages are just kept waiting until the current work has been done (rendezvous principle).
The terminal driver is slightly different, since a single driver services several devices.
Thus, it is possible to accept a new request for input from the keyboard while a request to read from a serial line is still being fulfilled.
Nevertheless, for each device a request must be completed before beginning a new one.

The main program for each block device driver is structurally the same and is outlined in Fig. 3-18.
When the system first comes up, each one of the drivers is started up in turn to give each a chance to initialize internal tables and similar things.
Then each device driver blocks by trying to get a message.
When a message comes in, the identity of the caller is saved, and a procedure is called to carry out the work, with a different procedure invoked for each operation available.
After the work has been finished, a reply is sent back to the caller, and the driver then goes back to the top of the loop to wait for the next request.

Figure 3-17.
Fields of the messages sent by the file system to the block device drivers and fields of the replies sent back.

Each of the dev_XXX procedures handles one of the operations of which the driver is capable.
It returns a status code telling what happened.
The status code, which is included in the reply message as the field REP STATUS, is the count of bytes transferred (zero or positive) if all went well, or the error number (negative) if something went wrong.
This count may differ from the number of bytes requested.
When the end of a file is reached, the number of bytes available may be less than number requested.
On terminals at most one line is returned (except in raw mode), even if the count requested is larger.

1.5.3 Device-Independent I/O Software in MINIX 3

In MINIX 3 the file system process contains all the device-independent I/O code.
The I/O system is so closely related to the file system that they were merged into one process.
The functions performed by the file system are those shown in Fig. 3-6, except for requesting and releasing dedicated devices, which do not exist in MINIX 3 as it is presently configured.
They could, however, easily be added to the relevant device drivers should the need arise in the future.

message mess; /* message buffer */
void io_driver() {
    initialize(); /* only done once, during system init.
    */
    while (TRUE) {
        receive(ANY, &mess); /* wait for a request for work */
        caller = mess.source; /* process from whom message came */
        switch(mess.type) {
            case READ:
            rcode = dev_read(&mess); break;
            case WRITE:
            rcode = dev_write(&mess); break;
            /* Other cases go here, including OPEN, CLOSE, and IOCTL */
            default:
            rcode = ERROR;
        }
        mess.type = DRIVER_REPLY;
        mess.status = rcode; /* result code */
        send(caller, &mess); /* send reply message back to caller */
    }
}

In addition to handling the interface with the drivers, buffering, and block allocation, the file system also handles protection and the management of i-nodes, directories, and mounted file systems.
This will be covered in detail in Chap.
5.

1.5.4 User-Level I/O Software in MINIX 3

The general model outlined earlier in this chapter also applies here.
Library procedures are available for making system calls and for all the C functions required by the POSIX standard, such as the formatted input and output functions printf and scanf.
The standard MINIX 3 configuration contains one spooler daemon, lpd, which spools and prints files passed to it by the lp command.
The standard MINIX 3 software distribution also provides a number of daemons that support various network functions.
The MINIX 3 configuration described in this book supports most network operations, all that is needed is to enable the network server and drivers for ethernet adapters at startup time.
Recompiling the terminal driver with pseudo terminals and serial line support will add support for logins from remote terminals and networking over serial lines (including modems).
The network server runs at the same priority as the memory manager and the file system, and like them, it runs as a user process.

1.5.5 Deadlock Handling in MINIX 3

True to its heritage, MINIX 3 follows the same path as UNIX with respect to deadlocks of the types described earlier in this chapter: it just ignores the problem.
Normally, MINIX 3 does not contain any dedicated I/O devices, although if someone wanted to hang an industry standard DAT tape drive on a PC, making the software for it would not pose any special problems.
In short, the only place deadlocks can occur are with the implicit shared resources, such as process table slots, i-node table slots, and so on.
None of the known deadlock algorithms can deal with resources like these that are not requested explicitly.

Actually, the above is not strictly true.
Accepting the risk that user processes could deadlock is one thing, but within the operating system itself a few places do exist where considerable care has been taken to avoid problems.
The main one is the message-passing interaction between processes.
For instance, user processes are only allowed to use the sendrec messaging method, so a user process should never lock up because it did a receive when there was no process with an interest in sending to it.
Servers only use send or sendrec to communicate with device drivers, and device drivers only use send or sendrec to communicate with the system task in the kernel layer.
In the rare case where servers must communicate between themselves, such as exchanges between the process manager and the file system as they initialize their parts of the process table, the order of communication is very carefully designed to avoid deadlock.
Also, at the very lowest level of the message passing system there is a check to make sure that when a process is about to do a send that the destination process is not trying to the same thing.

In addition to the above restrictions, in MINIX 3 the new notify message primitive is provided to handle those situations in which a message must be sent in the “upstream” direction.
Notify is nonblocking, and notifications are stored when a recipient is not immediately available.
As we examine the implementation of MINIX 3 device drivers in this chapter we will see that notify is used extensively.

Locks are another mechanism that can prevent deadlocks.
It is possible to lock devices and files even without operating system support.
A file name can serve as a truly global variable, whose presence or absence can be noted by all other processes.
A special directory, /usr/spool/locks/, is usually present on MINIX 3 systems, as on most UNIX -like systems, where processes can create lock files, to mark any resources they are using.
The MINIX 3 file system also supports POSIX-style advisory file locking.
But neither of these mechanisms is enforceable.
They depend upon the good behavior of processes, and there is nothing to prevent a program from trying to use a resource that is locked by another process.
This is not exactly the same thing as preemption of the resource, because it does not prevent the first process from attempting to continue its use of the resource.
In other words, there is no mutual exclusion.
The result of such an action by an ill-behaved process is likely to be a mess, but no deadlock results.

1.6 Block devices in minix 3

MINIX 3 supports several different block devices, so we will begin by discussing common aspects of all block devices.
Then we will discuss the RAM disk, the hard disk, and the floppy disk.
Each of these is interesting for a different reason.
The RAM disk is a good example to study because it has all the properties of block devices in general except the actual I/O—because the “disk” is actually just a portion of memory.
This simplicity makes it a good place to start.
The hard disk shows what a real disk driver looks like.
One might expect the floppy disk to be easier to support than the hard disk, but, in fact, it is not.
We will not discuss all the details of the floppy disk, but we will point out several of the complications to be found in the floppy disk driver.

Looking ahead, after the discussion of block drivers, we will discuss the terminal (keyboard + display) driver, which is important on all systems, and, furthermore is a good example of a character device driver.

Each of these sections describes the relevant hardware, the software principles behind the driver, an overview of the implementation, and the code itself.
This structure may make the sections useful reading even for readers who are not interested in the details of the code itself.

1.6.1 Overview of Block Device Drivers in MINIX 3

We mentioned earlier that the main procedures of all I/O drivers have a similar structure.
MINIX 3 always has at least two block device drivers compiled into the system: the RAM disk driver, and either one of several possible hard disk drivers or a floppy disk driver.
Usually, there are three block devices—both the floppy disk driver and an IDE (Integrated Drive Electronics) hard disk driver are present.
The driver for each block device driver is compiled independently, but a common library of source code is shared by all of them.

In older versions of MINIX a separate CD-ROM driver was sometimes present, and could be added if necessary.
Separate CD-ROM drivers are now obsolete.
They used to be necessary to support the proprietary interfaces of different drive manufacturers, but modern CD-ROM drives are usually connected to the IDE controller, although on notebook computers some CD-ROMs are USB.
The full version of the MINIX 3 hard disk device driver includes CD-ROM support, but we have taken the CD-ROM support out of the driver as described in this text and listed in Appendix B.

Each block device driver has to do some initialization, of course.
The RAM disk driver has to reserve some memory, the hard disk driver has to determine the parameters of the hard disk hardware, and so on.
All of the disk drivers are called individually for hardware-specific initialization.
After doing whatever may be necessary, each driver then calls the function containing its main loop.
This loop is executed forever; there is no return to the caller.
Within the main loop a message is received, a function to perform the operation needed by each message is called, and then a reply message is generated.

The common main loop called by each disk driver process is compiled when drivers/libdriver/driver.c and the other files in its directory are compiled, and then a copy of the object file driver.o is linked into each disk driver’s executable file.

The technique used is to have each driver pass to the main loop a parameter consisting of a pointer to a table of the addresses of the functions that driver will use for each operation and then to call these functions indirectly.

If the drivers were compiled together in a single executable file only one copy of the main loop would be needed.
This code was, in fact, first written for an earlier version of MINIX in which all the drivers were compiled together.
The emphasis in MINIX 3 is on making individual operating system components as independent as possible, but using common source code for separate programs is still a good way to increase reliability.
Assuming you get it right once, it will be right for all the drivers.
Or, a bug found in one use might very well exist unnoticed in other uses.
Thus, shared source code gets tested more thoroughly.

A number of other functions potentially useful to multiple disk drivers are defined in drivers/libdriver/drvlib.c, and linking drvlib.o makes these available.
All of the functionality could have been provided in a single file, but not all of it is needed by every disk driver.
For instance, the memory driver, which is simpler than other drivers, links in only driver.o.
The at_wini driver links in both driver.o and drvlib.o.

Figure 3-19 shows an outline of the main loop, in a form similar to that of Fig. 3-18.
Statements like

are indirect function calls.
A different dev_read function is called by each driver, even though each driver is executing a main loop compiled from the same source file.
But some other operations, for example close, are simple enough that more than one device can call the same function.

There are six possible operations that can be requested of any device driver.
These correspond to the possible values that can be found in the m.m_type field of the message of Fig. 3-17.
They are:
1. OPEN
2. CLOSE
3. READ
4. WRITE
5. IOCTL
6. SCATTERED_IO

Many of these operations are most likely familiar to readers with programming experience.
At the device driver level most operations are related to system calls with the same name.
For instance, the meanings of READ and WRITE should be fairly clear.
For each of these operations, a block of data is transferred from the device to the memory of the process that initiated the call, or vice versa.
A READ operation normally does not result in a return to the caller until the data transfer is complete, but an operating system may buffer data transferred during a WRITE for actual transfer to the destination at a later time, and return to the caller immediately.
That is fine as far as the caller is concerned; it is then free to reuse the buffer from which the operating system has copied the data to write.
OPEN and CLOSE for a device have similar meanings to the way the open and close system calls apply to operations on files: an OPEN operation should verify that the device is accessible, or return an error message if not, and a CLOSE should guarantee that any buffered data that were written by the caller are completely transferred to their final destination on the device.

message mess; /* message buffer */
void shared_io_driver(struct driver_table *entry_points) {
    /* initialization is done by each driver before calling this */
    while (TRUE) {
        receive(ANY, &mess);
        caller = mess.source;
        switch(mess.type) {
            case READ:
            rcode = (*entry_points−>dev_read)(&mess); break;
            case WRITE:
            rcode = (*entry_points−>dev_write)(&mess); break;
            /* Other cases go here, including OPEN, CLOSE, and IOCTL */
            default:
            rcode = ERROR;
        }
        mess.type = DRIVER_REPLY;
        mess.status = rcode; /* result code */
        send(caller, &mess);
    }
}

The IOCTL operation may not be so familiar.
Many I/O devices have operational parameters which occasionally must be examined and perhaps changed.
IOCTL operations do this.
A familiar example is changing the speed of transmission or the parity of a communications line.
For block devices, IOCTL operations are less common.
Examining or changing the way a disk device is partitioned is done using an IOCTL operation in MINIX 3 (although it could just as well have been done by reading and writing a block of data).

The SCATTERED_IO operation is no doubt the least familiar of these.
Except with exceedingly fast disk devices (for example, the RAM disk), satisfactory disk I/O performance is difficult to obtain if all disk requests are for individual blocks, one at a time.
A SCATTERED_IO request allows the file system to make a request to read or write multiple blocks.
In the case of a READ operation, the additional blocks may not have been requested by the process on whose behalf the call is made; the operating system attempts to anticipate future requests for data.
In such a request not all the transfers requested are necessarily honored by the device driver.
The request for each block may be modified by a flag bit that tells the device driver that the request is optional.
In effect the file system can say: “It would be nice to have all these data, but I do not really need them all right now.” The device can do what is best for it.
The floppy disk driver, for instance, will return all the data blocks it can read from a single track, effectively saying, “I will give you these, but it takes too long to move to another track; ask me again later for the rest.”

When data must be written, there is no question of its being optional; every write is mandatory.
Nevertheless, the operating system may buffer a number of write requests in the hope that writing multiple blocks can be done more efficiently than handling each request as it comes in.
In a SCATTERED_IO request, whether for reading or writing, the list of blocks requested is sorted, and this makes the operation more efficient than handling the requests randomly.
In addition, making only one call to the driver to transfer multiple blocks reduces the number of messages sent within MINIX 3.

1.6.2 Common Block Device Driver Software

Definitions that are needed by all of the block device drivers are located in drivers/libdriver/driver.h.
The most important thing in this file is the driver structure, which is used by each driver to pass a list of the addresses of the functions it will use to perform each part of its job.
Also defined here is the device structure which holds the most important information about partitions, the base address, and the size, in byte units.
This format was chosen so no conversions are necessary when working with memory-based devices, maximizing speed of response.
With real disks there are so many other factors delaying access that converting to sectors is not a significant inconvenience.

The source of the main loop and common functions of all the block device drivers are in driver.c.
After doing whatever hardware-specific initialization may be necessary, each driver calls driver_task, passing a driver structure as the argument to the call.
After obtaining the address of a buffer to use for DMA operations the main loop is entered.

In the switch statement in the main loop, the first five message types, DEV_OPEN, DEV_CLOSE, DEV_IOCTL, DEV_CANCEL, and DEV_SELECT result in indirect calls using addresses passed in the driver structure.
The DEV_READ and DEV_WRITE messages both result in direct calls to do_rdwt; DEV_GATHER and DEV_SCATTER messages both result in direct calls to do_vrdwt.
The driver structure is passed as an argument by all the calls from within the switch, whether direct or indirect, so all called functions can make further use of it as needed.
Do_rdwt and do_vrdwt do some preliminary processing, but then they too make indirect calls to device-specific routines.

The other cases, HARD_INT, SYS_SIG, and SYN_ALARM, respond to notifications.
These also result in indirect calls, but upon completion each of these executes a continue statement.
This causes control to return to the top of the loop, bypassing the cleanup and reply message steps.

After doing whatever is requested in the message, some sort of cleanup may be necessary, depending upon the nature of the device.
For a floppy disk, for instance, this might involve starting a timer to turn off the disk drive motor if another request does not arrive soon.
An indirect call is used for this as well.
Following the cleanup, a reply message is constructed and sent to the caller.
It is possible for a routine that services one of the message types to return a EDONTREPLY value to suppress the reply message, but none of the current drivers use this option.

The first thing each driver does after entering the main loop is to make a call to init_buffer, which assigns a buffer for use in DMA operations.
That this initialization is even necessary at all is due to a quirk of the hardware of the original IBM PC, which requires that the DMA buffer not cross a 64K boundary.
That is, a 1-KB DMA buffer may begin at 64510, but not at 64514, because a buffer starting at the latter address extends just beyond the 64K boundary at 65536.

This annoying rule occurs because the IBM PC used an old DMA chip, the Intel 8237A, which contains a 16-bit counter.
A bigger counter is needed because DMA uses absolute addresses, not addresses relative to a segment register.
On older machines that can address only 1M of memory, the low-order 16 bits of the DMA address are loaded into the 8237A, and the high-order 4 bits are loaded into a 4-bit latch.
Newer machines use an 8-bit latch and can address 16M.
When the 8237A goes from 0xFFFF to 0x0000, it does not generate a carry into the latch, so the DMA address suddenly jumps down by 64K in memory.

A portable C program cannot specify an absolute memory location for a data structure, so there is no way to prevent the compiler from placing the buffer in an unusable location.
The solution is to allocate an array of bytes twice as large as necessary at buffer and to reserve a pointer tmp_buf to use for actually accessing this array.
Init_buffer makes a trial setting of tmp_buf pointing to the beginning of buffer, then tests to see if that allows enough space before a 64K boundary is hit.
If the trial setting does not provide enough space, tmp_buf is incremented by the number of bytes actually required.
Thus some space is always wasted at one end or the other of the space allocated in buffer, but there is never a failure due to the buffer falling on a 64K boundary.

Newer computers of the IBM PC family have better DMA controllers, and this code could be simplified, and a small amount of memory reclaimed, if one could be sure that one’s machine were immune to this problem.
If you are considering this, however, consider how the bug will manifest itself if you are wrong.
If a 1K DMA buffer is desired, the chance is 1 in 64 that there will be a problem on a machine with the old DMA chip.
Every time the kernel source code is modified in a way that changes the size of the compiled kernel, there is the same probability that the problem will manifest itself.
Most likely, when the failure occurs next month or next year, it will be attributed to the code that was last modified.
Unexpected hardware “features” like this can cause weeks of time spent looking for exceedingly obscure bugs (all the more so when, like this one, the technical reference manual says nary a word about them).

Do_rdwt is the next function in driver.c.
It, in turn calls two device-dependent functions pointed to by the dr_prepare and dr_transfer fields in the driver structure.
Here and in what follows we will use the C language-like notation (*function_pointer) to indicate we are talking about the function pointed to by function_pointer.

After checking to see that the byte count in the request is positive, do_rdwt calls (*dr_prepare).
This operation fills in the base and size of the disk, partition, or subpartition being accessed in a device structure.
For the memory driver, which does not support partitions, it just checks that the minor device number is valid.
For the hard disk it uses the minor device number to get the size of the partition or subpartition indicated by the minor device number.
This should succeed, since (*dr_prepare) can fail only if an invalid device is specified in an open operation.
Next, an iovec_t structure (which is defined in include/minix/type.h), iovec1, is filled in.
This structure specifies the virtual address and size of the local buffer to or from which data will be copied by the system task.
This is the same structure that is used as an element of an array of requests when the call is for multiple blocks.
The address of a variable and the address of the first element of an array of the same type of variable can be handled exactly the same way.
Then comes another indirect call, this time to (*dr_transfer), which performs the data copy and I/O operations required.
The routines that handle transfers all expect to receive an array of requests.
In do_rdwt the last argument to the call is 1, specifying an array of one element.

As we will see in the discussion of disk hardware in the next section, responding to disk requests in the order they are received can be inefficient, and this routine allows a particular device to handle requests in the way that is best for the device.
The indirection here masks much possible variation in the way individual devices perform.
For the RAM disk, dr_transfer points to a routine that makes a kernel call to ask the system task to copy data from one part of physical memory to another, if the minor device being accessed is /dev/ram, /dev/mem, /dev/kmem, /dev/boot, or /dev/zero.
(No copying is required to access /dev/null, of course.) For a real disk, the code pointed to by dr_transfer also has to ask the system task for a data transfer.
But before the copy operation (for a read) or after it (for a write) a kernel call must also be made to ask the system task to do actual I/O, writing bytes to registers that are part of the disk controller to select the location on the disk and the size and direction of the transfer.

In the transfer routine the iov_size count in the iovec1 structure is modified, returning an error code (a negative number) if there was an error or a positive number indicating the number of bytes transferred.
It is not necessarily an error if no bytes are transferred; this indicates that the end of the device has been reached.
Upon returning to the main loop, the error code or the byte count is returned in the REP_STATUS field in the reply message from driver_task.

The next function, do_vrdwt, handles scattered I/O requests.
A message that requests a scattered I/O request uses the ADDRESS field to point to an array of iovec_t structures, each of which specifies the address of a buffer and the number of bytes to transfer.
In MINIX 3 such a request can be made only for contiguous blocks on the disk; the initial offset on the device and whether the operation is a read or a write are in the message.
So all the operations in one request will be for either reading or writing, and they will be sorted into block order on the device.
On line 11198 a check is done to see if this call is being done on behalf of a kernel-space I/O task; this is a vestige of an early phase of the development of MINIX 3 before all the disk drivers had been rewritten to run in user space.

Fundamentally, the code for this operation is very similar to that for the simple read or write performed by do_rdwt.
The same indirect calls to the device-dependent (*dr_prepare) and (*dr_transfer) routines are made.
The looping in order to handle multiple requests is all done internal to the function pointed to by (*dr_transfer).
The last argument in this case is not 1, it is the size of the array of iovec_t elements.
After termination of the loop the array of requests is copied back where it came from.
The io_size field of each element in the array will show the number of bytes transferred for that request, and although the total is not passed back directly in the reply message that driver_task constructs, the caller can extract the total from this array.

The next few routines in driver.c are for general support of the above operations.
A (*dr_name) call can be used to return the name of a device.
For a device with no specific name the no_name function returns the string “noname”.
Some devices may not require a particular service, for instance, a RAM disk does not require that anything special be done upon a DEV_CLOSE request.
The do_nop function fills in here, returning various codes depending upon the kind of request made.
Additional functions, nop_signal, nop_alarm, nop_prepare, nop_cleanup, and nop_cancel, are similar dummy routines for devices that do not need these services.

Finally, do_diocntl carries out DEV_IOCTL requests for a block device.
It is an error if any DEV_IOCTL operation other than reading (DIOCGETP) or writing (DIOCSETP) partition information is requested.
Do_diocntl calls the device’s (*dr_prepare) function to verify the device is valid and to get a pointer to the device structure that describes the partition base and size in byte units.
On a request to read, it calls the device’s (*dr_geometry) function to get the last cylinder, head, and sector information about the partition.
In each case a sys_datacopy kernel call is made to request that the system task copy the data between the memory spaces of the driver and the requesting process.

1.6.3 The Driver Library

The files drvlib.h and drvlib.c contain system-dependent code that supports disk partitions on IBM PC compatible computers.

Partitioning allows a single storage device to be divided up into subdevices.
It is most commonly used with hard disks, but MINIX 3 provides support for partitioning floppy disks, as well.
Some reasons to partition a disk device are:

Support for disk partitions is platform specific.
This specificity is not related to the hardware.
Partition support is device independent.
But if more than one operating system is to run on a particular set of hardware, all must agree on a format for the partition table.
On IBM PCs the standard is set by the MS-DOS fdisk command, and other OSs, such as MINIX 3, Windows, and Linux, use this format so they can coexist with MS-DOS.
When MINIX 3 is ported to another machine type, it makes sense to use a partition table format compatible with other operating systems used on the new hardware.
Thus the MINIX 3 source code to support partitions on IBM computers is put in drvlib.c, rather than being included in driver.c, for two reasons.
First, not all disk types support partitions.
As noted earlier, the memory driver links to driver.o but does not use the functions compiled into drvlib.o.
Second, this makes it easier to port MINIX 3 to different hardware.
It is easier to replace one small file than to edit a large one with many sections to be conditionally compiled for different environments.

The basic data structure inherited from the firmware designers is defined in include/ibm/partition.h, which is included by a #include statement in drvlib.h.
This includes information on the cylinder-head-sector geometry of each partition, as well as codes identifying the type of file system on the partition and an active flag indicating if it is bootable.
Most of this information is not needed by MINIX 3 once the file system is verified.

The partition function (in drvlib.c, line 11426) is called the first time a block device is opened.
Its arguments include a driver structure, so it can call device-specific functions, an initial minor device number, and a parameter indicating whether the partitioning style is floppy disk, primary partition, or subpartition.
It calls the device-specific (*dr_prepare) function to verify the device is valid and to get the base address and the size into a device structure of the type mentioned in the previous section.
Then it calls get_part_table to determine if a partition table is present and, if so, to read it.
If there is no partition table, the work is done.
Otherwise the minor device number of the first partition is computed, using the rules for numbering minor devices that apply to the style of partitioning specified in the original call.
In the case of primary partitions the partition table is sorted so the order of the partitions is consistent with that used by other operating systems.

At this point another call is made to (*dr_prepare), this time using the newly calculated device number of the first partition.
If the subdevice is valid, then a loop is made over all the entries in the table, checking that the values read from the table on the device are not out of the range obtained earlier for the base and size of the entire device.
If there is a discrepancy, the table in memory is adjusted to conform.
This may seem paranoid, but since partition tables may be written by different operating systems, a programmer using another system may have cleverly tried to use the partition table for something unexpected or there could be garbage in the table on disk for some other reason.
We put the most trust in the numbers we calculate using MINIX 3.
Better safe than sorry.

Still within the loop, for all partitions on the device, if the partition is identified as a MINIX 3 partition, partition is called recursively to gather subpartition information.
If a partition is identified as an extended partition, the next function, extpartition, is called instead.

Extpartition has nothing to do with MINIX 3 itself, so we will not discuss details.
Some other operating systems (e.g., Windows) use extended partitions.
These use linked lists rather than fixed-size arrays to support subpartitions.
For simplicity MINIX 3 uses the same mechanism for subpartitions as for primary partitions.
However, minimal support for extended partitions is provided to support MINIX 3 commands to read and write files and directories of other operating systems.
These operations are easy; providing full support for mounting and otherwise using extended partitions in the same way as primary partitions would be much more complicated.

Get_part_table calls do_rdwt to get the sector on a device (or subdevice) where a partition table is located.
The offset argument is zero if it is called to get a primary partition or nonzero for a subpartition.
It checks for the magic number (0xaa55) and returns true or false status to indicate whether a valid partition table was found.
If a table is found, it copies it to the table address that was passed as an argument.

Finally, sort sorts the entries in a partition table by lowest sector.
Entries that are marked as having no partition are excluded from the sort, so they come at the end, even though they may have a zero value in their low sector field.
The sort is a simple bubble sort; there is no need to use a fancy algorithm to sort a list of four items.

1.7 RAM disks

Now we will get back to the individual block device drivers and study several of them in detail.
The first one we will look at is the memory driver.
It can be used to provide access to any part of memory.
Its primary use is to allow a part of memory to be reserved for use like an ordinary disk, and we will also refer to it as the RAM disk driver.
A RAM disk does not provide permanent storage, but once files have been copied to this area they can be accessed extremely quickly.

A RAM disk is also useful for initial installation of an operating system on a computer with only one removable storage device, whether a floppy disk, CD-ROM, or some other device.
By putting the root device on the RAM disk, removable storage devices can be mounted and unmounted as needed to transfer data to the hard disk.
Putting the root device on a floppy disk would make it impossible to save files on floppies, since the root device (the only floppy) cannot be unmounted.
RAM disks also are used with “live” CD-ROMs that allow one to run an operating system for tests and demonstrations, without copying any files onto the hard disk.
Having the root device on the RAM disk makes the system highly flexible: any combination of floppy disks or hard disks can be mounted on it.
MINIX 3 and many other operating systems are distributed on live CD-ROMs.

As we shall see, the memory driver supports several other functions in addition to a RAM disk.
It supports straightforward random access to any part of memory, byte by byte or in chunks of any size.
Used this way it acts as a character device rather than as a block device.
Other character devices supported by the memory driver are /dev/zero, and /dev/null, otherwise known as the great bit bucket in the sky.

1.7.1 RAM Disk Hardware and Software

The idea behind a RAM disk is simple.
A block device is a storage medium with two commands: write a block and read a block.
Normally, these blocks are stored on rotating memories, such as floppy disks or hard disks.
A RAM disk is simpler.
It just uses a preallocated portion of main memory for storing the blocks.
A RAM disk has the advantage of having instant access (no seek or rotational delay), making it suitable for storing programs or data that are frequently accessed.

As an aside, it is worth briefly pointing out a difference between systems that support mounted file systems and those that do not (e.g., MS-DOS and Windows).
With mounted file systems, the root device is always present and in a fixed location, and removable file systems (i.e., disks) can be mounted in the file tree to form an integrated file system.
Once everything has been mounted, the user need not worry at all about which device a file is on.

In contrast, with systems like MS-DOS, the user must specify the location of each file, either explicitly as in B: DIR FILE or by using certain defaults (current device, current directory, and so on).
With only one or two floppy disks, this burden is manageable, but on a large computer system, with dozens of disks, having to keep track of devices all the time would be unbearable.
Remember that UNIX -like operating systems run on hardware ranging from small home and office machines to supercomputers such as the IBM Blue Gene/L supercomputer, the world’s fastest computer as of this writing; MS-DOS runs only on small systems.

Figure 3-20 shows the idea behind a RAM disk.
The RAM disk is split up into n blocks, depending on how much memory has been allocated for it.
Each block is the same size as the block size used on the real disks.
When the driver receives a message to read or write a block, it just computes where in the RAM disk memory the requested block lies and reads from it or writes to it, instead of from or to a floppy or hard disk.
Ultimately the system task is called to carry out the transfer.
This is done by phys_copy, an assembly language procedure in the kernel that copies to or from the user program at the maximum speed of which the hardware is capable.

A RAM disk driver may support several areas of memory used as RAM disk, each distinguished by a different minor device number.
Usually, these areas are distinct, but in some fairly specific situations it may be convenient to have them overlap, as we shall see in the next section.

1.7.2 Overview of the RAM Disk Driver in MINIX 3

The MINIX 3 RAM disk driver is actually six closely related drivers in one.
Each message to it specifies a minor device as follows:

The first special file listed above, /dev/ram, is a true RAM disk.
Neither its size nor its origin is built into the driver.
They are determined by the file system when MINIX 3 is booted.
If the boot parameters specify that the root file system is to be on the RAM disk but the RAM disk size is not specified, a RAM disk of the same size as the root file system image device is created.
A boot parameter can be used to specify a RAM disk larger than the root file system, or if the root is not to be copied to the RAM, the specified size may be any value that fits in memory and leaves enough memory for system operation.
Once the size is known, a block of memory big enough is found and removed from the memory pool by the process manager during its initialization.
This strategy makes it possible to increase or reduce the amount of RAM disk present without having to recompile the operating system.

The next two minor devices are used to read and write physical memory and kernel memory, respectively.
When /dev/mem is opened and read, it yields the contents of physical memory locations starting at absolute address zero (the real-mode interrupt vectors).
Ordinary user programs never do this, but a system program concerned with debugging the system might possibly need this facility.
Opening /dev/mem and writing on it will change the interrupt vectors.
Needless to say, this should only be done with the greatest of caution by an experienced user who knows exactly what he is doing.

The special file /dev/kmem is like /dev/mem, except that byte 0 of this file is byte 0 of the kernel’s data memory, a location whose absolute address varies, depending on the size of the MINIX 3 kernel text segment.
It too is used mostly for debugging and very special programs.
Note that the RAM disk areas covered by these two minor devices overlap.
If you know exactly how the kernel is placed in memory, you can open /dev/mem, seek to the beginning of the kernel’s data area, and see exactly the same thing as reading from the beginning of /dev/kmem.
But, if you recompile the kernel, changing its size, or if in a subsequent version of MINIX 3 the kernel is moved somewhere else in memory, you will have to seek a different amount in /dev/mem to see the same thing you now see at the start of /dev/kmem.
Both of these special files should be protected to prevent everyone except the superuser from using them.

The next file in this group, /dev/null, is a special file that accepts data and throws them away.
It is commonly used in shell commands when the program being called generates output that is not needed.
For example,

runs the program a.out but discards its output.
The RAM disk driver effectively treats this minor device as having zero size, so no data are ever copied to or from it.
If you read from it you will get an immediate EOF (End of File).

If you have looked at the directory entries for these files in /dev/ you may have noticed that, of those mentioned so far, only /dev/ram is a block special file.
All the others are character devices.
There is one more block device supported by the memory driver.
This is /dev/boot.
From the point of view of the device driver it is another block device implemented in RAM, just like /dev/ram.
However, it is meant to be initialized by copying a file appended to the boot image after init into memory, rather than starting with an empty block of memory, as is done for /dev/ram.
Support for this device is provided for future use and it is not used in MINIX 3 as described in this text.

Finally, the last device supported by the memory driver is another character special file, /dev/zero.
It is sometimes convenient to have a source of zeros.
Writing to /dev/zero is like writing to /dev/null; it throws data away.
But reading /dev/zero gives you zeros, in any quantity you want, whether a single character or a disk full.

At the driver level, the code for handling /dev/ram, /dev/mem, /dev/kmem, and /dev/boot is identical.
The only difference among them is that each one corresponds to a different region of memory, indicated by the arrays ram_origin and ram_limit, each indexed by minor device number.
The file system manages devices at a higher level.
The file system interprets devices as character or block devices, and thus can mount /dev/ram and /dev/boot and manage directories and files on these devices.
For the devices defined as character devices the file system can only read and write streams of data (although a stream read from /dev/null gets only EOF).
### Implementation of the RAM Disk Driver in MINIX 3
As with other disk drivers, the main loop of the RAM disk driver is in the file driver.c.
The device-specific support for memory devices is in memory.c.
When the memory driver is compiled, a copy of the object file called drivers/libdriver/driver.o, produced by compiling drivers/libdriver/driver.c, is linked with the object file drivers/memory/memory.o , the product of compiling drivers/memory/memory.c .

It may be worth taking a moment to consider how the main loop is compiled.
The declaration of the driver structure in driver.h defines a data structure, but does not create one.
The declaration of m_dtab creates an instance of this with each part of the structure filled in with a pointer to a function.
Some of these functions are generic code compiled when driver.c is compiled, for instance, all of the nop functions.
Others are code compiled when memory.c is compiled, for instance, m_do_open.
Note that for the memory driver seven of the entries are do-little or do-nothing routines and the last two are defined as NULL (which means these functions will never be called, there is no need even for a do_nop).
All this is a sure clue that the operation of a RAM disk is not terribly complicated.

The memory device does not require definition of a large number of data structures, either.
The array m_geom[NR_DEVS] holds the base and size of each of the six memory devices in bytes, as 64 bit unsigned integers, so there is no immediate danger of MINIX 3 not being able to have a big enough RAM disk.
The next line defines an interesting structure that will not be seen in other drivers.
M_seg[NR_DEVS] is apparently just an aray of integers, but these integers are indices that allow segment descriptors to be found.
The memory device driver is unusual among user-space processes in having the ability to access regions of memory outside of the ordinary text, data, and stack segments every process owns.
This array holds the information that allows access to the designated additional memory regions.
The variable m_device just holds the index into these arrays of the currently active minor device.

To use /dev/ram as the root device the memory driver must be initialized very early during startup of MINIX 3.
The kinfo and machine structures that are defined next will hold data retrieved from the kernel during startup that is necessary for initializing the memory driver.

One other data structure is defined before the executable code begins.
This is dev_zero, an array of 1024 bytes, used to supply data when a read call is made to /dev/zero.

The main procedure main calls one function to do some local initialization.
After that, it calls the main loop, which gets messages, dispatches to the appropriate procedures, and sends the replies.
There is no return to main upon completion.

The next function, m_name, is trivial.
It returns the string “memory” when called.

On a read or write operation, the main loop makes three calls: one to prepare a device, one to do the actual data transfer, and one to do cleanup.
For a memory device, a call to m_prepare is the first of these.
It checks that a valid minor device has been requested and then returns the address of the structure that holds the base address and size of the requested RAM area.
The second call is for m_transfer.
This does all the work.
As we saw in driver.c, all calls to read or write data are transformed into calls to read or write multiple contiguous blocks of data—if only one block is needed the request is passed on as a request for multiple blocks with a count of one.
So only two kinds of transfer requests are passed on to the driver, DEV_GATHER, requesting a read of one or more blocks, and DEV_SCATTER, a request to write one or more blocks.
Thus, after getting the minor device number, m_transfer enters a loop, repeated for the number of transfers requested.
Within the loop there is a switch on the device type.

The first case is for /dev/null, and the action is to return immediately on a DEV_GATHER request or on a DEV_SCATTER request to fall through to the end of the switch.
This is so the number of bytes transferred (although this number is zero for /dev/null) can be returned, as would be done for any write operation.

For all of the device types that refer to real locations in memory the action is similar.
The requested offset is checked against the size of the device to determine that the request is within the bounds of the memory allocated to the device.
Then a kernel call is made to copy data either to or from the memory of the caller.
There are two chunks of code that do this, however.
For /dev/ram, /dev/kmem, and /dev/boot virtual addresses are used, which requires retrieving the segment address of the memory region to be accessed from the m_seg array, and then making a sys_vircopy kernel call.
For /dev/mem a physical address is used and the call is to sys_physcopy.

The remaining operation is a read or write to /dev/zero.
For reading the data is taken from the dev_zero array mentioned earlier.
You might ask, why not just generate zero values as needed, rather than copying from a buffer full of them? Since the copying of the data to its destination has to be done by a kernel call, such a method would require either an inefficient copying of single bytes from the memory driver to the system task, or building code to generate zeros into the system task.
The latter approach would increase the complexity of kernel-space code, something that we would like to avoid in MINIX 3.

A memory device does not need a third step to finish a read or write operation, and the corresponding slot in m_dtab is a call to nop_finish.

Opening a memory device is done by m_do_open.
The job is done by calling m_prepare to check that a valid device is being referenced.
More interesting than the code that exists is a comment about code that was found here in older versions of MINIX.
Previously a trick was hidden here.
A call by a user process to open /dev/mem or /dev/kmem would also magically confer upon the caller the ability to execute instructions which access I/O ports.
Pentium-class CPUs implement four privilege levels, and user processes normally run at the least-privileged level.
The CPU generates a general protection exception when an process tries to execute an instruction not allowed at its privilege level.
Providing a way to get around this was considered safe because the memory devices could only be accessed by a user with root privileges.
In any case, this possibly risky “feature” is absent from MINIX 3 because kernel calls that allow I/O access via the system task are now available.
The comment remains, to point out that if MINIX 3 is ported to hardware that uses memory-mapped I/O such a feature might need to be reintroduced.
The function to do this, enable_iop, remains in the kernel code to show how this can be done, although it is now an orphan.

The next function, m_init, is called only once, when mem_task is called for the first time.
This routine uses a number of kernel calls, and is worth study to see how MINIX 3 drivers interact with kernel space by using system task services.
First a sys_getkinfo kernel call is made to get a copy of the kernel’s kinfo data.
From this data it copies the base address and size of /dev/kmem into the corresponding fields of the m_geom data structure.
A different kernel call, sys_segctl, converts the physical address and size of /dev/kmem into the segment descriptor information needed to treat the kernel memory as a virtual memory space.
If an image of a boot device has been compiled into the system boot image, the field for the base address of /dev/boot will be non-zero.
If this is so, then information to access the memory region for this device is set up in exactly the same way it was done for /dev/kmem.
Next the array used to supply data when /dev/zero is accessed is explicitly filled with zeros.
This is probably unnecessary; C compilers are supposed to initialize newly created static variables to all zeros.

Finally, m_init uses a sys_getmachine kernel call to get another set of data from the kernel, the machine structure which flags various possible hardware alternatives.
In this case the information needed is whether or not the CPU is capable of protected mode operation.
Based on this information the size of /dev/mem is set to either 1 MB, or 4 GB − 1, depending upon whether MINIX 3 is running in 8088 or 80386 mode.
These sizes are the maximum sizes supported by MINIX 3 and do not have anything to do with how much RAM is installed in the machine.
Only the size of the device is set; the compiler is trusted to set the base address correctly to zero.
Also, since /dev/mem is accessed as physical (not virtual) memory there is no need to make a sys_segctl kernel call to set up a segment descriptor.

Before we leave m_init we should mention another kernel call used here, although it is not obvious in the code.
Many of the actions taken during initialization of the memory driver are essential to proper functioning of MINIX 3, and thus several tests are made and panic is called if a test fails.
In this case panic is a library routine which ultimately results in a sys_exit kernel call.
The kernel and (as we shall see) the process manager and the file system have their own panic routines.
The library routine is provided for device drivers and other small system components.

Surprisingly, the function we just examined, m_init, does not initialize the quintessential memory device, /dev/ram.
This is taken care of in the next function, m_ioctl.
In fact, there is only one ioctl operation defined for the RAM disk; this is MIOCRAMSIZE, which is used by the file system to set the RAM disk size.
Much of the job is done without requiring any services from the kernel.
The call to allocmem is a system call, but not a kernel call.
It is handled by the process manager, which maintains all of the information necessary to find an available region of memory.
However, at the end one kernel call is needed.
At line 11894 a sys_segctl call is made to convert the physical address and size returned by allocmem into the segment information needed for further access.

The last function defined in memory.c is m_geometry.
This is a fake.
Obviously, cylinders, heads, and sectors are irrelevant in addressing semiconductor memory, but if a request is made for such information for a memory device this function pretends it has 64 heads and 32 sectors per track, and calculates from the size how many cylinders there are.

1.8 Disks

All modern computers except embedded ones have disk drives.
For that reason, we will now study them, starting with the hardware, then moving on to say some general things about disk software.
After that we will delve into the way MINIX 3 controls its disks.

1.8.1 Disk Hardware

All real disks are organized into cylinders, each one containing as many tracks as there are heads stacked vertically.
The tracks are divided into sectors, with the number of sectors around the circumference typically being 8 to 32 on floppy disks, and up to several hundred on some hard disks.
The simplest designs have the same number of sectors on each track.
All sectors contain the same number of bytes, although a little thought will make it clear that sectors close to the outer rim of the disk will be physically longer than those close to the hub.
The time to read or write each sector will be same, however.
The data density is obviously higher on the innermost cylinders, and some disk designs require a change in the drive current to the read-write heads for the inner tracks.
This is handled by the disk controller hardware and is not visible to the user (or the implementer of an operating system).

The difference in data density between inner and outer tracks means a sacrifice in capacity, and more sophisticated systems exist.
Floppy disk designs that rotate at higher speeds when the heads are over the outer tracks have been tried.
This allows more sectors on those tracks, increasing disk capacity.
Such disks are not supported by any system for which MINIX 3 is currently available, however.
Modern large hard drives also have more sectors per track on outer tracks than on inner tracks.
These are IDE (Integrated Drive Electronics) drives, and the sophisticated processing done by the drive’s built-in electronics masks the details.
To the operating system they appear to have a simple geometry with the same number of sectors on each track.

The drive and controller electronics are as important as the mechanical hardware.
The main element of the disk controller is a specialized integrated circuit, really a small microcomputer.
Once this would have been on a card plugged into the computer’s backplane, but on modern systems, the disk controller is on the parentboard.
For a modern hard disk this disk controller circuitry may be simpler than for a floppy disk, since a hard drive has a powerful electronic controller integrated into the drive itself.

A device feature that has important implications for the disk driver is the possibility of a controller doing seeks on two or more drives at the same time.
These are known as overlapped seeks.
While the controller and software are waiting for a seek to complete on one drive, the controller can initiate a seek on another drive.
Many controllers can also read or write on one drive while seeking on one or more other drives, but a floppy disk controller cannot read or write on two drives at the same time.
(Reading or writing requires the controller to move bits on a microsecond time scale, so one transfer uses up most of its computing power.) The situation is different for hard disks with integrated controllers, and in a system with more than one of these hard drives they can operate simultaneously, at least to the extent of transferring between the disk and the controller’s buffer memory.
Only one transfer between the controller and the system memory is possible at once, however.
The ability to perform two or more operations at the same time can reduce the average access time considerably.

One thing to be aware of in looking at the specifications of modern hard disks is that the geometry specified, and used by the driver software, is almost always different from the physical format.
In fact, if you look up the “recommended setup parameters” for a large hard disk, you are likely to find it specified as 16383 cylinders, 16 heads, and 63 sectors per track, no matter what the size of the disk.
These numbers correspond to a disk size of 8 GB, but are used for all disks this size or larger.
The designers of the original IBM PC ROM BIOS allotted a 6-bit field for the sector count, 4 bits to specify the head, and 14 bits to select a cylinder.
With 512 byte sectors this comes out to 8 GB.
So if you try to install a large hard drive into a very old computer you may find you can access only 8 GB, even though you have a much bigger disk.
The usual way around this limitation is to use logical block addressing in which disk sectors are just numbered consecutively starting at zero, without regard to the disk geometry.

The geometry of a modern disk is a fiction, anyway.
On a modern disk the surface is divided into 20 or more zones.
Zones closer to the center of the disk have fewer sectors per track than zones nearer the periphery.
Thus sectors have approximately the same physical length no matter where they are located on the disk, making more efficient use of the disk surface.
Internally, the integrated controller addresses the disk by calculating the zone, cylinder, head, and sector.
But this is never visible to the user, and the details are rarely found in published specifications.
The bottom line is, there is no point to using cylinder, head, sector addressing of a disk unless you are working with a very old computer that does not support logical block addressing.
Also, it does not make sense to buy a new 400 GB drive for the PC-XT you bought in 1983; you will get no more than 8 GB use out of it.

This is a good place to mention a confusing point about disk capacity specifications.
Computer professionals are accustomed to using powers of 2—a Kilobyte (KB) is 210 = 1024 bytes, a Megabyte (MB) is 220 = 10242 bytes, etc., to express the size of memory devices.
A Gigabyte (GB), then, should be 10243 , or 230 bytes.
However, disk manufacturers have adopted the habit of using the term “Gigabyte” to mean 109 , which (on paper) instantly increases the size of their products.
Thus the 8 GB limit mentioned above is an 8.4 GB disk in the language of the disk salesman.
Recently there has been a move toward using the term Gibibyte (GiB) to mean 230 .
However, in this text the authors, being set in their ways and in protest of the hijacking of tradition for advertising purposes, will continue to use terms like Megabyte and Gigabyte to mean what they have always meant.

1.8.2 RAID

Although modern disks are much faster than older ones, improvements in CPU performance have far exceeded improvements in disk performance.
It has occurred to various people over the years that parallel disk I/O might be helpful.
Thus has come about a new class of I/O device called a RAID, an acronym for Redundant Array of Independent Disks.
Actually, the designers of RAID (at Berkeley) originally used the acronym RAID to stand for “Redundant Array of Inexpensive Disks” to contrast this design with a SLED (Single Large Expensive Disk).
However, when RAID became commercially popular, disk manufacturers changed the meaning of the acronym because it was tough to sell an expensive product whose name stood for “inexpensive.” The basic idea behind a RAID is to install a box full of disks next to the computer, typically a large server, replace the disk controller card with a RAID controller, copy the data over to the RAID, and then continue normal operation.

The independent disks can be used together in a variety of ways.
We do not have space for an exhaustive description of all of these, and MINIX 3 does not (yet) support RAID, but an introduction to operating systems should at least mention some of the possibilities.
RAID can be used both to speed disk access and to make data more secure.

For example, consider a very simple RAID of two drives.
When multiple sectors of data are to be written to the “disk” the RAID controller sends sectors 0, 2, 4, etc., to the first drive, and sectors 1, 3, 5, etc., to the second drive.
The controller divides up the data and the two disks are written simultaneously, doubling the writing speed.
When reading, both drives are read simultaneously, but the controller reassembles the data in the proper order, and to the rest of the system it just looks like the reading speed is twice as fast.
This technique is called striping.
This is a simple example of RAID level 0.
In practice four or more drives would be used.
This works best when data are usually read or written in large blocks.
Obviously, nothing is gained if a typical disk request is for a single sector at a time.

The previous example shows how multiple drives can increase speed.
What about reliability? RAID level 1 works like RAID level 0, except the data is duplicated.
Again, a very simple array of two drives could be used, and all of the data could be written to both of them.
This provides no speedup, but there is 100% redundancy.
If an error is detected during reading there is no need for a retry if the other drive reads the data correctly.
The controller just has to make sure the correct data is passed on to the system.
It probably would not be a good idea to skip retries if errors are detected while writing, however.
And if errors occur frequently enough that skipping retries actually makes reading noticeably faster it is probably time to decide complete failure is imminent.
Typically the drives used for RAIDs are hot-swappable, meaning they can be replaced without powering down the system.

More complex arrays of multiple disks can increase both speed and reliability.
Consider, for instance, an array of 7 disks.
Bytes could be split into 4-bit nybbles, with each bit being recorded on one of four drives and with the other three drives being used to record a three bit error-correcting code.
If a drive goes bad and needs to be hot-swapped for a new one, a missing drive is equivalent to one bad bit, so the system can keep running while maintenance is done.
For the cost of seven drives you get reliable performance that is four times as fast as one drive, and no downtime.

1.8.3 Disk Software

In this section we will look at some issues related to disk drivers in general.
First, consider how long it takes to read or write a disk block.
The time required is determined by three factors:

For most disks, the seek time dominates the other two times, so reducing the mean seek time can improve system performance substantially.

Disk devices are prone to errors.
Some kind of error check, a checksum or a cyclic redundancy check, is always recorded along with the data in each sector on a disk.
Even the sector addresses recorded when the disk is formatted have check data.
Floppy disk controller hardware can usually report when an error is detected, but the software must then decide what to do about it.
Hard disk controllers often take on much of this burden.

Particularly with hard disks, the transfer time for consecutive sectors within a track can be very fast.
Thus reading more data than requested and caching it in memory can be very effective in speeding disk access.

If the disk driver accepts requests one at a time and carries them out in that order, that is, First-Come, First-Served (FCFS), little can be done to optimize seek time.
However, another strategy is possible when the disk is heavily loaded.
It is likely that while the arm is seeking on behalf of one request, other disk requests may be generated by other processes.
Many disk drivers maintain a table, indexed by cylinder number, with all pending requests for each cylinder chained together in a linked list headed by the table entries.

Given this kind of data structure, we can improve upon the first-come, firstserved scheduling algorithm.
To see how, consider a disk with 40 cylinders.
A request comes in to read a block on cylinder 11.
While the seek to cylinder 11 is in progress, new requests come in for cylinders 1, 36, 16, 34, 9, and 12, in that order.
They are entered into the table of pending requests, with a separate linked list for each cylinder.
The requests are shown in Fig. 3-21.

When the current request (for cylinder 11) is finished, the disk driver has a choice of which request to handle next.
Using FCFS, it would go next to cylinder 1, then to 36, and so on.
This algorithm would require arm motions of 10, 35, 20, 18, 25, and 3, respectively, for a total of 111 cylinders.

Alternatively, it could always handle the closest request next, to minimize seek time.
Given the requests of Fig. 3-21, the sequence is 12, 9, 16, 1, 34, and 36, as shown as the jagged line at the bottom of Fig. 3-21.
With this sequence, the arm motions are 1, 3, 7, 15, 33, and 2, for a total of 61 cylinders.
This algorithm, Shortest Seek First (SSF), cuts the total arm motion almost in half compared to FCFS.

Unfortunately, SSF has a problem.
Suppose that more requests keep coming in while the requests of Fig. 3-21 are being processed.
For example, if, after going to cylinder 16, a new request for cylinder 8 is present, that request will have priority over cylinder 1.
If a request for cylinder 13 then comes in, the arm will next go to 13, instead of 1.
With a heavily loaded disk, the arm will tend to stay in the middle of the disk most of the time, so requests at either extreme will have to wait until a statistical fluctuation in the load causes there to be no requests near the middle.
Requests far from the middle may get poor service.
The goals of minimal response time and fairness are in conflict here.

Tall buildings also have to deal with this trade-off.
The problem of scheduling an elevator in a tall building is similar to that of scheduling a disk arm.
Requests come in continuously calling the elevator to floors (cylinders) at random.

The microprocessor running the elevator could easily keep track of the sequence in which customers pushed the call button and service them using FCFS.
It could also use SSF.

However, most elevators use a different algorithm to reconcile the conflicting goals of efficiency and fairness.
They keep moving in the same direction until there are no more outstanding requests in that direction, then they switch directions.
This algorithm, known both in the disk world and the elevator world as the elevator algorithm, requires the software to maintain 1 bit: the current direction bit, UP or DOWN.
When a request finishes, the disk or elevator driver checks the bit.
If it is UP, the arm or cabin is moved to the next highest pending request.
If no requests are pending at higher positions, the direction bit is reversed.
When the bit is set to DOWN, the move is to the next lowest requested position, if any.

Figure 3-22 shows the elevator algorithm using the same seven requests as Fig. 3-21, assuming the direction bit was initially UP.
The order in which the cylinders are serviced is 12, 16, 34, 36, 9, and 1, which yields arm motions of 1, 4, 18, 2, 27, and 8, for a total of 60 cylinders.
In this case the elevator algorithm is slightly better than SSF, although it is usually worse.
One nice property that the elevator algorithm has is that given any collection of requests, the upper bound on the total motion is fixed: it is just twice the number of cylinders.

A slight modification of this algorithm that has a smaller variance in response times is to always scan in the same direction (Teory, 1972).
When the highest numbered cylinder with a pending request has been serviced, the arm goes to the lowest-numbered cylinder with a pending request and then continues moving in an upward direction.
In effect, the lowest-numbered cylinder is thought of as being just above the highest-numbered cylinder.

Some disk controllers provide a way for the software to inspect the current sector number under the head.
With such a controller, another optimization is possible.
If two or more requests for the same cylinder are pending, the driver can issue a request for the sector that will pass under the head next.
Note that when multiple tracks are present in a cylinder, consecutive requests can be for different tracks with no penalty.
The controller can select any of its heads instantaneously, because head selection involves neither arm motion nor rotational delay.

With a modern hard disk, the data transfer rate is so much faster than that of a floppy disk that some kind of automatic caching is necessary.
Typically any request to read a sector will cause that sector and up to the rest of the current track to be read, depending upon how much space is available in the controller’s cache memory.
Current caches are often 8 MB or more.

When several drives are present, a pending request table should be kept for each drive separately.
Whenever any drive is idle, a seek should be issued to move its arm to the cylinder where it will be needed next (assuming the controller allows overlapped seeks).
When the current transfer finishes, a check can be made to see if any drives are positioned on the correct cylinder.
If one or more are, the next transfer can be started on a drive that is already on the right cylinder.
If none of the arms is in the right place, the driver should issue a new seek on the drive that just completed a transfer and wait until the next interrupt to see which arm gets to its destination first.

RAM disks do not have to worry about seek or rotational optimization: at any instant all blocks can be read or written without any physical motion.
Another area in which RAM disks are simpler than real disks is error handling.
RAM disks always work; real ones do not always work.
They are subject to a wide variety of errors.
Some of the more common ones are:

Programming errors occur when the driver tells the controller to seek to a nonexistent cylinder, read from a nonexistent sector, use a nonexistent head, or transfer to or from nonexistent memory.
Most controllers check the parameters given to them and complain if they are invalid.
In theory, these errors should never occur, but what should the driver do if the controller indicates that one has happened? For a home-grown system, the best thing to do is stop and print a message like “Call the programmer” so the error can be tracked down and fixed.
For a commercial software product in use at thousands of sites around the world, this approach is less attractive.
Probably the only thing to do is terminate the current disk request with an error and hope it will not recur too often.

Transient checksum errors are caused by specks of dust in the air that get between the head and the disk surface.
Most of the time they can be eliminated by just repeating the operation a few times.
If the error persists, the block has to be marked as a bad block and avoided.

One way to avoid bad blocks is to write a very special program that takes a list of bad blocks as input and carefully hand crafts a file containing all the bad blocks.
Once this file has been made, the disk allocator will think these blocks are occupied and never allocate them.
As long as no one ever tries to read the bad block file, no problems will occur.

Not reading the bad block file is easier said than done.
Many disks are backed up by copying their contents a track at a time to a backup tape or disk drive.
If this procedure is followed, the bad blocks will cause trouble.
Backing up the disk one file at a time is slower but will solve the problem, provided that the backup program knows the name of the bad block file and refrains from copying it.

Another problem that cannot be solved with a bad block file is the problem of a bad block in a file system data structure that must be in a fixed location.
Almost every file system has at least one data structure whose location is fixed, so it can be found easily.
On a partitioned file system it may be possible to repartition and work around a bad track, but a permanent error in the first few sectors of either a floppy or hard disk generally means the disk is unusable.

“Intelligent” controllers reserve a few tracks not normally available to user programs.
When a disk drive is formatted, the controller determines which blocks are bad and automatically substitutes one of the spare tracks for the bad one.
The table that maps bad tracks to spare tracks is kept in the controller’s internal memory and on the disk.
This substitution is transparent (invisible) to the driver, except that its carefully worked out elevator algorithm may perform poorly if the controller is secretly using cylinder 800 whenever cylinder 3 is requested.
The technology of manufacturing disk recording surfaces is better than it used to be, but it is still not perfect.
However, the technology of hiding the imperfections from the user has also improved.
Many controllers also manage new errors that may develop with use, permanently assigning substitute blocks when they determine that an error is unrecoverable.
With such disks the driver software rarely sees any indication that there any bad blocks.

Seek errors are caused by mechanical problems in the arm.
The controller keeps track of the arm position internally.
To perform a seek, it issues a series of pulses to the arm motor, one pulse per cylinder, to move the arm to the new cylinder.
When the arm gets to its destination, the controller reads the actual cylinder number (written when the drive was formatted).
If the arm is in the wrong place, a seek error has occurred and some corrective action is required.

Most hard disk controllers correct seek errors automatically, but many floppy controllers (including the IBM PCs) just set an error bit and leave the rest to the driver.
The driver handles this error by issuing a recalibrate command, to move the arm as far out as it will go and reset the controller’s internal idea of the current cylinder to 0.
Usually, this solves the problem.
If it does not, the drive must be repaired.

As we have seen, the controller is really a specialized little computer, complete with software, variables, buffers, and occasionally, bugs.
Sometimes an unusual sequence of events such as an interrupt on one drive occurring simultaneously with a recalibrate command for another drive will trigger a bug and cause the controller to go into a loop or lose track of what it was doing.
Controller designers usually plan for the worst and provide a pin on the chip which, when asserted, forces the controller to forget whatever it was doing and reset itself.
If all else fails, the disk driver can set a bit to invoke this signal and reset the controller.
If that does not help, all the driver can do is print a message and give up.

The time required to seek to a new cylinder is usually much more than the rotational delay, and always vastly more than the transfer time to read or write one sector.
In other words, once the driver has gone to the trouble of moving the arm somewhere, it hardly matters whether it reads one sector or a whole track.
This effect is especially true if the controller provides rotational sensing, so the driver can see which sector is currently under the head and issue a request for the next sector, thereby making it possible to read an entire disk track in a single rotation time.
(Normally it takes half a rotation plus one sector time just to read a single sector, on the average.)

Some disk drivers take advantage of these timing properties by maintaining a secret track-at-a-time cache, unknown to the device-independent software.
If a sector that is in the cache is needed, no disk transfer is required.
A disadvantage of track-at-a-time caching (in addition to the software complexity and buffer space needed) is that transfers from the cache to the calling program will have to be done by the CPU using a programmed loop, rather than letting the DMA hardware do the job.

Some controllers take this process a step further, and do track-at-a-time caching in their own internal memory, transparent to the driver, so that transfer between the controller and memory can use DMA.
If the controller works this way, there is little point in having the disk driver do it as well.
Note that both the controller and the driver are in a good position to read and write entire tracks in one command, but that the device-independent software cannot, because it regards a disk as a linear sequence of blocks, without regard to how they are divided up into tracks and cylinders.
Only the controller knows the true geometry for sure.

1.8.4 Overview of the Hard Disk Driver in MINIX 3

The hard disk driver is the first part of MINIX 3 we have looked at that has to deal with a range of different types of hardware.
Before we discuss the driver, we will briefly consider some of the problems hardware differences can cause.

The “PC” is really a family of different computers.
Not only are different processors used in different members of the family, there are also some major differences in the basic hardware.
MINIX 3 has been developed on and for newer systems with Pentium-class CPUs, but even among these there are differences.
For instance, the oldest Pentium systems use the 16-bit AT bus originally designed for the 80286 processor.
A feature of the AT bus is that it was cleverly designed so older 8-bit peripherals could still be used.
Later systems added a 32-bit PCI bus for peripherals, while still providing AT bus slots.
The newest designs have dropped AT-bus support, providing only a PCI bus.
But it is reasonable to expect that users with computers of a certain age may want to be able to use MINIX 3 with a mix of 8-bit, 16-bit, and 32-bit peripherals.

For every bus there is a different family of I/O adapters.
On older systems these are separate circuit boards which plug into the system parentboard.
On newer systems many standard adapters, especially disk controllers, are integrated parts of the parentboard chipset.
In itself this is not a problem for the programmer, as integrated adapters usually have a software interface identical to that of removable devices.
Also, integrated controllers can usually be disabled.
This allows use of a more advanced add-on device, such as a SCSI controller, in place of a built-in device.
To take advantage of this flexibility the operating system should not be restricted to using just one kind of adapter.

In the IBM PC family, as in most other computer systems, each bus design also comes with firmware in the Basic I/O System Read-Only Memory (the BIOS ROM) which is designed to bridge the gap between the operating system and the peculiarities of the hardware.
Some peripheral devices may even provide extensions to the BIOS in ROM chips on the peripheral cards themselves.
The difficulty faced by an operating system implementer is that the BIOS in IBM-type computers (certainly the early ones) was designed for an operating system, MS-DOS, that does not support multiprogramming and that runs in 16-bit real mode, the lowest common denominator of the various modes of operation available from the 80x86 family of CPUs.

The implementer of a new operating system for the IBM PC is thus faced with several choices.
One is whether to use the driver support for peripherals in the BIOS or to write new drivers from scratch.
This was not a hard choice in the design of early versions of MINIX, since the BIOS was in many ways not suitable to its needs.
Of course, to start MINIX 3 the boot monitor uses the BIOS to do the initial loading of the system, whether from hard disk, CD-ROM, or floppy disk— there is no practical alternative to doing it this way.
Once we have loaded the system, including our own I/O drivers, we can do better than the BIOS.

The second choice then must be faced: without the BIOS support how are we going to make our drivers adapt to the varied kinds of hardware on different systems? To make the discussion concrete, consider that there are two fundamentally different types of hard disk controller usable on the modern 32-bit Pentium systems for which MINIX 3 has been designed: the integrated IDE controller and add-on SCSI controllers for the PCI bus.
If you would like to take advantage of older hardware and adapt MINIX 3 to work on the hardware targeted by earlier versions of MINIX, there are four hard disk controller types to consider: the original 8-bit XT-type controller, the 16-bit AT-type controller, and two different controllers for two different types of IBM PS/2 series computers.
There are several possible ways to deal with all these alternatives:

The first way is really the best way in the long run.
For use on a particular installation there is no need to use up disk and memory space with code for alternative drivers that will never be used.
However, it is a nightmare for the distributor of the software.
Supplying four different startup disks and advising users on how to use them is expensive and difficult.
Thus, another method is advisable, at least for the initial installation.

The second method is to have the operating system probe the peripherals, by reading the ROM on each card or writing and reading I/O ports to identify each card.
This is possible (and works better on newer IBM-type systems than on older ones), but it does not accommodate nonstandard I/O devices.
Also, probing I/O ports to identify one device sometimes can activate another device which seizes control and disables the system.
This method complicates the startup code for each device, and yet still does not work very well.
Operating systems that do use this method generally have to provide some kind of override, typically a mechanism such as we use with MINIX 3.

The third method, used in MINIX 3, is to allow inclusion of several drivers in the boot image.
The MINIX 3 boot monitor allows various boot parameters to be read at startup time.
These can be entered by hand, or stored permanently on the disk.
At startup time, if a boot parameter of the form

is found, this forces the IDE disk controller (at_wini) to be used when MINIX 3 is started.
This depends upon the at_wini driver being assigned this label.
Labels are assigned when the boot image is compiled.

There are two other things MINIX 3 does to try to minimize problems with multiple hard disk drivers.
One is that there is, after all, a driver that interfaces between MINIX 3 and the ROM BIOS hard disk support.
This driver is almost guaranteed to work on any system and can be selected by use of a

boot parameter.
Generally, this should be a last resort, however.
MINIX 3 as described here runs only in protected mode on systems with an 80386 or better processor, but the BIOS code always runs in real (8086) mode.
Switching out of protected mode and back again whenever a routine in the BIOS is called is very slow.
The other strategy MINIX 3 uses in dealing with drivers is to postpone initialization until the last possible moment.
Thus, if on some hardware configuration none of the hard disk drivers work, we can still start MINIX 3 from a floppy disk and do some useful work.
MINIX 3 will have no problems as long as no attempt is made to access the hard disk.
This may not seem like a major breakthrough in user friendliness, but consider this: if all the drivers try to initialize immediately on system startup, the system can be totally paralyzed by improper configuration of some device we do not need anyway.
By postponing initialization of each driver until it is needed, the system can continue with whatever does work, while the user tries to resolve the problems.

We learned this lesson the hard way: earlier versions of MINIX tried to initialize the hard disk as soon as the system was booted.
If no hard disk was present, the system hung.
This behavior was especially unfortunate because MINIX would run quite happily on a system without a hard disk, albeit with restricted storage capacity and reduced performance.

In the discussion in this section and the next, we will take as our model the AT-style hard disk driver, which is the default driver in the standard MINIX 3 distribution.
This is a versatile driver that handles hard disk controllers from the ones used in the earliest 80286 systems to modern EIDE (Extended Integrated Drive Electronics) controllers that handle gigabyte capacity hard disks.
Modern EIDE controllers also support standard CD-ROM drives.
However, in order to simplify our discussion the extensions that support CD-ROMs have been taken out of the code listed in Appendix B.
The general aspects of hard disk operation we discuss in this section apply to the other supported drivers as well.

The main loop of the hard disk driver is the same common code we have already discussed, and supports the standard nine kinds of requests that can be made.
A DEV_OPEN request can entail a substantial amount of work, as there are always partitions and may be subpartitions on a hard disk.
These must be read when a device is opened, (i.e., when it is first accessed).
When CD-ROMs are supported, on a DEV_OPEN the presence of the medium must be verified, since it is removable.
On a CD-ROM a DEV_CLOSE operation also has meaning: it requires that the door be unlocked and the CD-ROM ejected.
There are other complications of removable media that are more applicable to floppy drives, so we will discuss these in a later section.
For CD-ROMs a DEV_IOCTL operation is used to set a flag to mark that the medium should be ejected from the drive upon a DEV_CLOSE.
A DEV_IOCTL operation is also used to read and write partition tables.

DEV_READ, DEV_WRITE, DEV_GATHER and DEV_SCATTER requests are each handled in two phases, prepare and transfer, as we saw previously.
For the hard disk DEV_CANCEL and DEV_SELECT calls are ignored.

No scheduling is done by the hard disk device driver at all, that is done by the file system, which assembles the vector requests for gather/scatter I/O.
Requests come from the file system cache as DEV__ GATHER or DEV_SCATTER requests for multiples of blocks (4-KB in the default configuration of MINIX 3), but the hard disk driver is able to handle requests for any multiple of a sector (512 bytes).
In any case, as we have seen, the main loop of all disk drivers transforms requests
for single blocks of data into one element vector requests.

Requests for reading and writing are not mixed in a vector of requests, nor can requests be marked as optional.
The elements of a request vector are for contiguous disk sectors, and the vector is sorted by the file system before being passed to the device driver, so it suffices to specify just the starting position on the disk for an entire array of requests.

The driver is expected to succeed in reading or writing at least the first request in a request vector, and to return when a request fails.
It is up to the file system to decide what to do; the file system will try to complete a write operation but will return to the calling process only as much data as it can get on a read.

The file system itself, by using scattered I/O, can implement something similar to Teory’s version of the elevator algorithm—recall that in a scattered I/O request the list of requests is sorted on the block number.
The second step in scheduling takes place in the controller of a modern hard disk.
Such controllers are “smart” and can buffer large quantities of data, using internally programmed algorithms to retrieve data in the most efficient order, irrespective of the order of receipt of the requests.

1.8.5 Implementation of the Hard Disk Driver in MINIX 3

Small hard disks used on microcomputers are sometimes called “winchester” disks.
The term was IBM’s code name for the project that developed the disk technology in which the read/write heads fly on a thin cushion of air and land on the recording medium when the disk stops spinning.
The explanation of the name is that an early model had two data modules, a 30-Mbyte fixed and a 30-Mbyte removable one.
Supposedly this reminded the developers of the Winchester 30-30 firearm which figures in many tales of the United States’ western frontier.
Whatever the origin of the name, the basic technology remains the same, although today’s typical PC disk is much smaller and the capacity is much larger than the 14-inch disks that were typical of the early 1970s when the winchester technology was developed.

The MINIX 3 AT-style hard disk driver is in at_wini.c.
This is a complicated driver for a sophisticated device, and there are several pages of macro definitions specifying controller registers, status bits and commands, data structures, and prototypes.
As with other block device drivers, a driver structure, w_dtab, is initialized with pointers to the functions that actually do the work.
Most of them are defined in at_wini.c, but as the hard disk requires no special cleanup operation, its dr_cleanup entry points to the common nop_cleanup in driver.c, shared with other drivers that have no special cleanup requirement.
Several other possible functions are also irrelevant for this driver and also are initialized to point to nop_functions.
The entry function, called at_winchester_task, calls a procedure that does hardware-specific initialization and then calls the main loop in driver.c, passing the address of w_dtab.
The main loop, driver_task in libdriver/driver.c, runs forever, dispatching calls to the various functions pointed to by the driver table.

Since we are now dealing with real electromechanical storage devices, there is a substantial amount of work to be done by init_params to initialize the hard disk driver.
Various parameters about the hard disks are kept in the wini table, which has an element for each of the MAX_DRIVES (8) drives supported, up to four conventional IDE drives, and up to four drives on the PCI bus, either plug-in IDE controllers or SATA (Serial AT Attachment) controllers.

Following the policy of postponing initialization steps that could fail until the first time they are truly necessary, init_params does not do anything that requires accessing the disk devices themselves.
The main thing it does is to copy information about the hard disk logical configuration into the wini array.
The ROM BIOS on a Pentium-class computer retrieves basic configuration information from the CMOS memory used to preserve basic configuration data.
The BIOS does this when the computer is first turned on, before the first part of the MINIX 3 loading process begins.
The information is copied from the BIOS.
Many of the constants used here, such as NR_HD_DRIVES_ADDR are defined in include/ibm/bios.h, a file which is not listed in Appendix B but which can be found on the MINIX 3 CD-ROM.
It is not necessarily fatal if this information cannot be retrieved.
If the disk is a modern one, the information can be retrieved directly from the disk when it is accessed for the first time.
Following the entry of data obtained from the BIOS, additional disk information is filled in for each drive using a call to the next function, init_drive.

On older systems with IDE controllers, the disk functions as if it were an AT-style peripheral card, even though it may be integrated on the parentboard.
Modern drive controllers usually function as PCI devices, with a 32-bit data path to the CPU, rather than the 16-bit AT bus.
Fortunately for us, once initialization is complete, the interface to both generations of disk controller appears the same to the programmer.
To make this work, init_params_pci is called if necessary to get the parameters of the PCI devices.
We will not describe the details of this routine, but a few points should be mentioned.
First, the boot parameter ata_instance is used to set the value of the variable w_instance.
If the boot parameter is not explicitly set the value will be zero.
If it is set and greater than zero the test causes querying the BIOS and initialization of standard IDE drives to be skipped.
In this case only drives found on the PCI bus will be registered.

The second point is that a controller found on the PCI bus will be identified as controlling devices c0d4 through c0d7.
If w_instance is non-zero the drive identifiers c0d0 through c0d3 will be skipped, unless a PCI bus controller identifies itself as “compatible.” Drives handled by a compatible PCI bus controller will be designated c0d0 through c0d3.
For most MINIX 3 users all of these complications can probably be ignored.
A computer with less than four drives (including the CD-ROM drive), will most likely appear to the user to have the classical configuration, with drives designated c0d0 to c0d3, whether they are connected to IDE or PCI controllers, and whether or not they use the classic 40-pin parallel connectors or the newer serial connectors.
But the programming required to create this illusion is complicated.

After the call to the common main loop, nothing may happen for a while until the first attempt is made to access the hard disk.
When the first attempt to access a disk is made a message requesting a DEV_OPEN operation will be received by the main loop and w_do_open will be indirectly called.
In turn, w_do_open calls w_prepare to determine if the device requested is valid, and then w_identify to identify the type of device and initialize some more parameters in the wini array.
Finally, a counter in the wini array is used to test whether this is first time the device has been opened since MINIX 3 was started.
After being examined, the counter is incremented.
If it is the first DEV_OPEN operation, the partition function (in drvlib.c) is called.

The next function, w_prepare, accepts an integer argument, device, which is the minor device number of the drive or partition to be used, and returns a pointer to the device structure that indicates the base address and size of the device.
In the C language, the use of an identifier to name a structure does not preclude use of the same identifier to name a variable.
Whether a device is a drive, a partition, or a subpartition can be determined from the minor device number.
Once w_prepare has completed its job, none of the other functions used to read or write the disk need to concern themselves with partitioning.
As we have seen, w_prepare is called when a DEV_OPEN request is made; it is also one phase of the prepare/transfer cycle used by all data transfer requests.

Software-compatible AT-style disks have been in use for quite a while, and w_identify has to distinguish between a number of different designs that have been introduced over the years.
The first step is to see that a readable and writeable I/O port exists where one should exist on all disk controllers in this family.
This is the first example we have seen of I/O port access by a user-space driver, and the operation merits a description.
For a disk device I/O is done using a command structure, which is filled in with a series of byte values.
We will describe this in a bit more detail later; for the moment note that two bytes of this structure are filled in, one with a value ATA_IDENTIFY, interpreted as a command that asks an ATA (AT Attached) drive to identify itself, and another with a bit pattern that selects the drive.
Then com_simple is called.

This function hides all the work of constructing a vector of seven I/O port addresses and bytes to be written to them, sending this information to the system task, waiting for an interrupt, and checking the status returned.
This tests that the drive is alive and allows a string of 16-bit values to be read by the sys_insw kernel call.
Decoding this information is a messy process, and we will not describe it in detail.
Suffice it to say that a considerable amount of information is retrieved, including a string that identifies the model of the disk, and the preferred physical cylinder, head, and sector parameters for the device.
(Note that the “physical” configuration reported may not be the true physical configuration, but we have no alternative to accepting what the disk drive claims.) The disk information also indicates whether or not the disk is capable of Logical Block Addressing (LBA).
If it is, the driver can ignore the cylinder, head, and sector parameters and can address the disk using absolute sector numbers, which is much simpler.

As we mentioned earlier, it is possible that init_params may not recover the logical disk configuration information from the BIOS tables.
If that happens, the code tries to create an appropriate set of parameters based on what it reads from the drive itself.
The idea is that the maximum cylinder, head, and sector numbers can be 1023, 255, and 63 respectively, due to the number of bits allowed for these fields in the original BIOS data structures.

If the ATA_IDENTIFY command fails, it may simply mean that the disk is an older model that does not support the command.
In this case the logical configuration values previously read by init_params are all we have.
If they are valid, they are copied to the physical parameter fields of wini; otherwise an error is returned and the disk is not usable.

Finally, MINIX 3 uses a u32_t variable to count addresses in bytes.
This limits the size of a partition to 4 GB.
However, the device structure used to record the base and size of a partition (defined in drivers/libdriver/driver.h) uses u64_t numbers, and a 64 bit multiplication operation is used to calculate the size of the drive on, and the base and size of the whole drive are then entered into the wini array, and w_specify is called, twice if necessary, to pass the parameters to be used back to the disk controller.
Finally, more kernel calls are made: a sys_irqsetpolicy call ensures that when a disk controller interrupt occurs and is serviced the interrupt will be automatically reenabled in preparation for the next one.
Following that, a sys_irqenablecall actually enables the interrupt.

W_name returns a pointer to a string containing the device name, which will be either “AT-D0,” “AT-D1” “AT-D2,” or “AT-D3.” When an error message must be generated this function tells which drive produced it.

It is possible that a drive will turn out to be incompatible with MINIX 3 for some reason.
The function w_io_test is provided to test each drive the first time an attempt is made to open it.
This routine tries to read the first block on the drive, with shorter timeout values than are used in normal operation.
If the test fails the drive is permanently marked as unavailable.

W_specify, in addition to passing the parameters to the controller, also recalibrates the drive (if it is an older model), by doing a seek to cylinder zero.

Do_transfer does what its name implies, it assembles a command structure with all the byte values needed to request transfer of a chunk of data (possibly as many as 255 disk sectors), and then it calls com_out, which sends the command to the disk controller.
The data must be formatted differently depending upon how the disk is to be addressed, that is, whether by cylinder, head, and sector or by LBA.
Internally MINIX 3 addresses disk blocks linearly, so if LBA is supported the first three byte-wide fields are filled in by shifting the sector count an appropriate number of bits to the right and then masking to get 8-bit values.
The sector count is a 28 bit number, so the last masking operation uses a 4-bit mask.
If the disk does not support LBA then cylinder, head, and sector values are calculated, based on the parameters of the disk in use.

The code contains a hint of a future enhancement.
LBA addressing with a 28-bit sector count limits MINIX 3 to fully utilizing disks of 128 GB or smaller size.
(You can use a bigger disk, but MINIX 3 can only access the first 128 GB).
The programmers have been thinking about, but have not yet implemented, use of the newer LBA48 method, which uses 48 bits to address disk blocks.
On line 12824 a test is made for whether this is enabled.
The test will always fail with the version of MINIX 3 described here.
This is good, because no code is provided to be executed if the test succeeds.
Keep in mind if you decide to modify MINIX 3 yourself to use LBA48 that you need to do more than just add some code here.
You will have to make changes in many places to handle the 48-bit addresses.
You might find it easier to wait until MINIX 3 has been ported to a 64-bit processor, too.
But if a 128 GB disk is not big enough for you, LBA48 will give you access to 128 PB (Petabytes).

Now we will briefly look at how a data transfer takes place at a higher level.
W_prepare, which we have already discussed, is called first.
If the transfer operation requested was for multiple blocks (that is, a DEV_GATHER or DEV_SCATTER request), w_transfer line 12848 is called immediately afterward.
If the transfer is for a single block (a DEV_READ or DEV_WRITE request), a one element scatter/gather vector is created, and then w_transfer is called.
Accordingly, w_transfer is written to expect a vector of iovec_t requests.
Each element of the request vector consists of a buffer address and the size of the buffer, constrained that the size must be a multiple of the size of a disk sector.
All other information needed is passed as an argument to the call, and applies to the entire request vector.

The first thing done is a simple test to see if the disk address requested for the start of the transfer is aligned on a sector boundary.
Then the outer loop of the function is entered.
This loop repeats for each element of the request vector.
Within the loop, as we have seen many times before, a number of tests are made before the real work of the function is done.
First the total number of bytes remaining in the request is calculated by summing the iov_size fields of each element of the request vector.
This result is checked to be sure it is an exact multiple of the size of a sector.
Other tests check that the starting position is not at or beyond the end of the device, and if the request would end past the end of the device the size of the request is truncated.
All calculations so far have been in bytes, but a calculation is made of the block position on the disk, using 64 bit arithmetic.
Note that although the variable used is named block, this is a number of disk blocks, that is, 512 byte sectors, not the “block” used internally by MINIX 3, normally 4096 bytes.
After this one more adjustment is made.
Every drive has a maximum number of bytes that can be requested at one time, and the request is scaled back to this quantity if necessary.
After verifying that the disk has been initialized, and doing so again if necessary, a request for a chunk of data is made by calling do_transfer.

After a transfer request has been made the inner loop is entered, which repeats for each sector.
For a read or write operation an interrupt will be generated for each sector.
On a read the interrupt signifies data is ready and can be transferred.
The sys_insw kernel call asks the system task to read the specified I/O port repeatedly, transferring the data to a virtual address in the data space of the specified process.
For a write operation the order is reversed.
The sys_outsw call a few lines further down writes a string of data to the controller, and the interrupt comes from the disk controller when the transfer to the disk is complete.
In the case of either a read or a write, at_intr_wait is called to receive the interrupt, for example, following the write operation.
Although the interrupt is expected, this function provides a way to abort the wait if a malfunction occurs and the interrupt never arrives.
At_intr_wait also reads the disk controller’s status register and returns various codes.
On an error when either reading or writing, there is a break which skips over the section where results are recorded and poiners and counters adjusted for the next sector, so the next time through the inner loop will be a retry of the same sector, if another try is allowed.
If the disk controller reports a bad sector w_transfer terminates immediately.
For other errors a counter is incremented and the function is allowed to continue if max_errors has not been reached.

The next function we will discuss is com_out, which sends the command to the disk controller, but before we look at its code let us first look at the controller as it is seen by the software.
The disk controller is controlled through a set of registers, which could be memory mapped on some systems, but on an IBM compatible appear as I/O ports.
We will look at these registers and discuss a few aspects of how they (and I/O control registers in general) are used.
In MINIX 3 there is the added complication that drivers run in user space and cannot execute the instructions that read or write registers.
This will provide an opportunity to look at how kernel calls are used to work around this restriction.

The registers used by a standard IBM-AT class hard disk controller are shown in Fig. 3-23.

Figure 3-23.
(a) The control registers of an IDE hard disk controller.
The numbers in parentheses are the bits of the logical block address selected by each register in LBA mode.
(b) The fields of the Select Drive/Head register.

We have mentioned several times reading and writing to I/O ports, but we tacitly treated them just like memory addresses.
In fact, I/O ports often behave differently from memory addresses.
For one thing, input and output registers that happen to have the same I/O port address are not the same register.
Thus, the data written to a particular address cannot necessarily be retrieved by a subsequent read operation.
For example, the last register address shown in Fig. 3-23 shows the status of the disk controller when read and is used to issue commands to the controller when written to.
It is also common that the very act of reading or writing an I/O device register causes an action to occur, independently of the details of the data transferred.
This is true of the command register on the AT disk controller.
In use, data are written to the lower-numbered registers to select the disk address to be read from or written to, and then the command register is written last with an operation code.
The data written to the command register determines what the operation will be.
The act of writing the operation code into the command register starts the operation.

It is also the case that the use of some registers or fields in the registers may vary with different modes of operation.
In the example given in the figure, writing a 0 or a 1 to the LBA bit, bit 6 of register 6, selects whether CHS (Cylinder-Head-Sector) or LBA (Logical Block Addressing) mode is used.
The data written to or read from registers 3, 4, and 5, and the low four bits of register 6 are interpreted differently according to the setting of the LBA bit.

Now let us take a look at how a command is sent to the controller by calling com_out.
This function is called after setting up a cmd structure (with do_transfer, which we saw earlier).
Before changing any registers, the status register is read to determine that the controller is not busy.
This is done by testing the STATUS_BSY bit.
Speed is important here, and normally the disk controller is ready or will be ready in a short time, so busy waiting is used.
On line 12960 w_waitfor is called to test STATUS_BSY.
W_waitfor uses a kernel call to ask the system task to read an I/O port so w_waitfor can test a bit in the status register.
It loops until the bit is ready or until there is a timeout.
The loop is programmed for a quick return when the disk is ready.
Thus the returned value will be true with the minimum possible delay if the controller is ready, true after a delay if it is temporarily unavailable, or false if it is not ready after the timeout period.
We will have more to say about the timeout when we discuss w_waitfor itself.

A controller can handle more than one drive, so once it is determined that the controller is ready, a byte is written to select the drive, head, and mode of operation and w_waitfor is called again.
A disk drive sometimes fails to carry out a command or to properly return an error code—it is, after all, a mechanical device that can stick, jam, or break internally—and as insurance a sys_setalarm kernel call is made to have the system task schedule a call to a wakeup routine.
Following this, the command is issued by first writing all the parameters to the various registers and finally writing the command code itself to the command register.
This is done with a sys_voutb kernel call, which sends a vector of (value, address) pairs to the system task.
The system task writes each value to the I/O port specified by the address in order.
The vector of data for the sys_voutb call is constructed by use of a macro, pv_set, which is defined in include/minix/devio.h.
The act of writing the operation code to the command register makes the operation begin.
When it is complete, an interrupt is generated and a notification message is sent.
If the command times out the alarm will expire and a synchronous alarm notification will wake up the disk driver.

The next several functions are short.
W_need_reset is called when timeouts occur while waiting for the disk to interrupt or become ready.
The action of w_need_reset is just to mark the state variable for every drive in the wini array to force initialization on the next access.

W_do_close has very little to do for a conventional hard disk.
Additional code is needed to support CD-ROMs.

Com_simple is called to issue controller commands that terminate immediately without a data transfer phase.
Commands that fall into this category include those that retrieve the disk identification, setting of some parameters, and recalibration.
We saw an example of its use in w_identify.
Before it is called the command structure must be correctly initialized.
Note that immediately after the call to com_out a call to at_intr_wait is made.
This eventually does a receive which blocks until a notification arrives signifying that an interrupt has occurred.

We noted that com_out does a sys_setalarm kernel call before asking the system task to write the registers which set up and execute a command.
As we mentioned in the overview section, the next receive operation normally should receive a notification indicating an interrupt.
If an alarm has been set and no interrupt occurs, the next message will be a SYN_ALARM.
In this case w_timeout line 13046 is called.
What needs to be done depends on the current command in w_command.
The timeout might have been left over from a previous operation, and w_command may have the value CMD_IDLE, meaning the disk completed its operation.
In that case there is nothing to do.
If the command does not complete and the operation is a read or write, it may help to reduce the size of I/O requests.
This is done in two steps, first reducing the maximum number of sectors that can be requested to 8, and then to 1.
For all timeouts a message is printed and w_need_reset is called to force re-initialization of all drives on the next attempted access.

When a reset is required, w_reset is called.
This function makes use of a library function, tickdelay, that sets a watchdog timer and then waits for it to expire.
After an initial delay to give the drive time to recover from previous operations, a bit in the disk controller’s control register is strobed—that is, set to a logical 1 level for a definite period, then returned to the logical 0 level.
Following this operation, w_waitfor is called to give the drive a reasonable period to signal it is ready.
In case the reset does not succeed, a message is printed and an error status returned.

Commands to the disk that involve data transfer normally terminate by generating an interrupt, which sends a message back to the disk driver.
In fact, an interrupt is generated for each sector read or written.
The function w_intr_wait calls receive in a loop, and if a SYN_ALARM message is received w_timeout is called.
The only other message type this function should see is HARD_INT.
When this is received the status register is read and ack_args is called to reinitialize the interrupt.

W_intr_wait is not called directly; when an interrupt is expected the function called is the next one, at_intr_wait.
After an interrupt is received by at_intr_wait a quick check is made of the drive status bits.
All is OK if the bits corresponding to busy, write fault, and error are all clear.
Otherwise a closer look is taken.
If the register could not be read at all, it is panic time.
If the problem was a bad sector a specific error is returned, any other problem results in a general error code.
In all cases the STATUS_ADMBSY bit is set, to be reset later by the caller.

We have seen several places where w_waitfor is called to do busy waiting on a bit in the disk controller status register.
This is used in situations where it is expected the bit might be clear on the first test, and a quick test is desirable.
For the sake of speed, a macro that read the I/O port directly was used in earlier versions of MINIX—this is, of course, not allowable for a user-space driver in MINIX 3.
The solution here is to use a do …
while loop with a minimum of overhead before the first test is made.
If the bit being tested is clear there is an immediate return from within the loop.
To deal with the possibility of failure a timeout is implemented within the loop by keeping track of clock ticks.
If a timeout does occur w_need_reset is called.

The timeout parameter that is used by the w_waitfor function is defined by DEF_TIMEOUT_TICKS as 300 ticks, or 5 seconds.
A similar parameter, WAKEUP, used to schedule wakeups from the clock task, is set to 31 seconds.
These are very long periods of time to spend busy waiting, when you consider that an ordinary process only gets 100 msec to run before it will be evicted.
But, these numbers are based upon the published standard for interfacing disk devices to AT-class computers, which states that up to 31 seconds must be allowed for a disk to “spin up” to speed.
The fact is, of course, that this is a worst-case specification, and that on most systems spin up will only occur at power-on time, or possibly after long periods of inactivity, at least for hard disks.
For CD-ROMs or other devices which must spin up frequently this may be a more important issue.

There are a few more functions in at_wini.c.
W_geometry returns the logical maximum cylinder, head, and sector values of the selected hard disk device.
In this case the numbers are real ones, not made up as they were for the RAM disk driver.
W_other is a catch-all for unrecognized commands and ioctls.
In fact, it is not used in the current release of MINIX 3, and we should probably have removed it from the Appendix B listing.
W_hw_int is called when a hardware interrupt is received when it is not expected.
In the overview we mentioned that this can happen when a timeout expires before an expected interrupt occurs.
This will satisfy a receive operation that was blocked waiting for the interrupt, but the interrupt notification may then be found by a subsequent receive.
The only thing to be done is to reenable the interrupt, which is done by calling the next function, ack_irqs.
It cycles through all the known drives and uses the sys_irqenable kernel call to ensure all interrupts are enabled.
Finally, at the end of at_wini.c two strange little functions are found, strstatus and strerr.
These use macros defined just ahead of them to concatenate error codes into strings.
These functions are not used in MINIX 3 as described here.

1.8.6 Floppy Disk Handling

The floppy disk driver is longer and more complicated than the hard disk driver.
This may seem paradoxical, since floppy disk mechanisms are simpler than those of hard disks, but the simpler mechanism has a more primitive controller that requires more attention from the operating system.
Also, the fact that the medium is removable adds complications.
In this section we will describe some of the things an implementer must consider in dealing with floppy disks.
However, we will not go into the details of the MINIX 3 floppy disk driver code.
In fact, we have not listed the floppy disk driver in Appendix B.
The most important parts are similar to those for the hard disk.

One of the things we do not have to worry about with the floppy driver is the multiple types of controller to support that we had to deal with in the case of the hard disk driver.
Although the high-density floppy disks currently used were not supported in the design of the original IBM PC, the floppy disk controllers of all computers in the IBM PC family are supported by a single software driver.
The contrast with the hard disk situation is probably due to lack of motivation to increase floppy disk performance.
Floppy disks are rarely used as working storage during operation of a computer system; their speed and data capacity are too limited compared to those of hard disks.
Floppy disks at one time were important for distribution of new software and for backup, but as networks and larger-capacity removable storage devices have become common, PCs rarely come standard with a floppy disk drives any more.

The floppy disk driver does not use the SSF or the elevator algorithm.
It is strictly sequential, accepting a request and carrying it out before even accepting another request.
In the original design of MINIX it was felt that, since MINIX was intended for use on personal computers, most of the time there would be only one process active.
Thus the chance of a disk request arriving while another was being carried out was small.
There would be little to gain from the considerable increase in software complexity that would be required for queueing requests.
Complexity is even less worthwhile now, since floppy disks are rarely used for anything but transferring data into or out of a system with a hard disk.

That said, the floppy driver, like any other block driver, can handle a request for scattered I/O.
However, in the case of the floppy driver the array of requests is smaller than for the hard disk, limited to the maximum number of sectors per track on a floppy diskette.

The simplicity of the floppy disk hardware is responsible for some of the complications in floppy disk driver software.
Cheap, slow, low-capacity floppy drives do not justify the sophisticated integrated controllers that are part of modern hard drives, so the driver software has to deal explicitly with aspects of disk operation that are hidden in the operation of a hard drive.
As an example of a complication caused by the simplicity of floppy drives, consider positioning the read/write head to a particular track during a SEEK operation.
No hard disk has ever required the driver software to explicitly call for a SEEK.
For a hard disk the cylinder, head, and sector geometry visible to the programmer often do not correspond to the physical geometry.
In fact, the physical geometry may be quite complicated.
Typically there are multiple zones (groups of cylinders) with more sectors per track on outer zones than on inner ones.
This is not visible to the user, however.
Modern hard disks accept Logical Block Addressing (LBA), addressing by the absolute sector number on the disk, as an alternative to cylinder, head, and sector addressing.
Even if addressing is done by cylinder, head, and sector, any geometry that does not address nonexistent sectors may be used, since the integrated controller on the disk calculates where to move the read/write heads and does a seek operation when required.

For a floppy disk, however, explicit programming of SEEK operations is needed.
In case a SEEK fails, it is necessary to provide a routine to perform a RECALIBRATE operation, which forces the heads to cylinder 0.
This makes it possible for the controller to advance them to a desired track position by stepping the heads a known number of times.
Similar operations are necessary for the hard drive, of course, but the controller handles them without detailed guidance from the device driver software.

Some hard disk controllers provide for removable media, for instance, on a CD-ROM drive, but the drive controller is generally able to handle any complications without support in the device driver software.
With a floppy disk, however, the built-in support is not there, and yet it is needed more.
Some of the most common uses for floppy disks—installing new software or backing up files—are likely to require switching of disks in and out of the drives.
It will cause grief if data intended for one diskette are written onto another.
The device driver should do what it can to prevent this.
This is not always possible, as not all floppy drive hardware allows determination of whether the drive door has been opened since the last access.
Another problem that can be caused by removable media is that a system can become hung up if an attempt is made to access a floppy drive that currently has no diskette inserted.
This can be solved if an open door can be detected, but since this is not always possible some provision must be made for a timeout and an error return if an operation on a floppy disk does not terminate in a reasonable time.

Removable media can be replaced with other media, and in the case of floppy disks there are many different possible formats.
IBM compatible hardware supports both 3.5-inch and 5.25-inch disk drives and the diskettes can be formatted in a variety of ways to hold from 360 KB up to 1.2 MB (on a 5.25-inch diskette) or 1.44 MB (on a 3.5-inch diskette).

MINIX 3 supports seven different floppy disk formats.
Two possible solutions are possible for the problem this causes.
One way is to refer to each possible format as a distinct drive and provide multiple minor devices.
Older versions of MINIX did this.
Fourteen different devices were defined, ranging from /dev/pc0, a 360 KB 5.25-inch diskette in the first drive, to /dev/PS1, a 1.44 MB 3.5-inch diskette in the second drive.
This was a cumbersome solution.
MINIX 3 uses another method: when the first floppy disk drive is addressed as /dev/fd0, or the second as /dev/fd1, the floppy disk driver tests the diskette currently in the drive when it is accessed, in order to determine the format.
Some formats have more cylinders, and others have more sectors per track than other formats.
Determination of the format of a diskette is done by attempting to read the higher numbered sectors and tracks.
By a process of elimination the format can be determined.
This takes time, but on modern computers only 1.44 MB 3.5-inch diskettes are likely to be found, and this format is probed first.
Another possible problem is that a disk with bad sectors could be misidentified.
A utility program is available for testing disks; doing so automatically in the operating system would be too slow.

The final complication of the floppy disk driver is motor control.
Diskettes cannot be read or written unless they are revolving.
Hard disks are designed to run for thousands of hours on end without wearing out, but leaving the motors on all the time causes a floppy drive and diskette to wear out quickly.
If the motor is not already on when a drive is accessed, it is necessary to issue a command to start the drive and then to wait about a half second before attempting to read or write data.
Turning the motors on or off is slow, so MINIX 3 leaves a drive motor on for a few seconds after a drive is used.
If the drive is used again within this interval, the timer is extended for another few seconds.
If the drive is not used in this interval, the motor is turned off.

1.9 Terminals

For decades, users have communicated with computers using devices consisting of a keyboard for user input and a display for computer output.
For many years, these were combined into free-standing devices called terminals, which were connected to the computer by a wire.
Large mainframes used in the financial and travel industries sometimes still use these terminals, typically connected to the mainframe via a modem, especially when they are far from the mainframe.
However, with the emergence of the personal computer, the keyboard and display have become separate peripherals rather than a single device, but they are so closely interrelated that we will discuss them together here under the combined name of “terminal.”

Historically, terminals have come in a variety of forms.
It is up to the terminal driver to hide all these differences, so that the device-independent part of the operating system and the user programs do not have to be rewritten for each kind of terminal.
In the following sections we will follow our now-standard approach of first discussing terminal hardware and software in general, and then discussing the MINIX 3 software.

1.9.1 Terminal Hardware

From the operating system’s point of view, terminals can be divided into three broad categories based on how the operating system communicates with them as well as their actual hardware characteristics.
The first category consists of memory-mapped terminals, which consist of a keyboard and a display, both of which are hardwired to the computer.
This model is used in all personal computers for the keyboard and the monitor.
The second category consists of terminals that interface via a serial communication line using the RS-232 standard, most frequently over a modem.
This model is still used on some mainframes, but PCs also have serial line interfaces.
The third category consists of terminals that are connected to the computer via a network.
This taxonomy is shown in Fig. 3-24.

The first broad category of terminals named in Fig. 3-24 consists of memory-mapped terminals.
These are an integral part of the computers themselves, especially personal computers.
They consist of a display and a keyboard.
Memory-mapped displays are interfaced via a special memory called a video RAM, which forms part of the computer’s address space and is addressed by the CPU the same way as the rest of memory (see Fig. 3-25).

Also on the video RAM card is a chip called a video controller.
This chip pulls bytes out of the video RAM and generates the video signal used to drive the display.
Displays are usually one of two types: CRT monitors or flat panel displays.
A CRT monitor generates a beam of electrons that scans horizontally across the screen, painting lines on it.
Typically the screen has 480 to 1200 lines from top to bottom, with 640 to 1920 points per line.
These points are called pixels.
The video controller signal modulates the intensity of the electron beam, determining whether a given pixel will be light or dark.
Color monitors have three beams, for red, green, and blue, which are modulated independently.

A flat panel display works very differently internally, but a CRT-compatible flat-panel display accepts the same synchronization and video signals as a CRT and uses these to control a liquid crystal element at each pixel position.

A simple monochrome display might fit each character in a box 9 pixels wide by 14 pixels high (including the space between characters), and have 25 lines of 80 characters.
The display would then have 350 scan lines of 720 pixels each.
Each of these frames is redrawn 45 to 70 times a second.
The video controller could be designed to fetch the first 80 characters from the video RAM, generate 14 scan lines, fetch the next 80 characters from the video RAM, generate the following 14 scan lines, and so on.
In fact, most fetch each character once per scan line to eliminate the need for buffering in the controller.
The 9-by-14 bit patterns for the characters are kept in a ROM used by the video controller.
(RAM may also be used to support custom fonts.) The ROM is addressed by a 12-bit address, 8 bits from the character code and 4 bits to specify a scan line.
The 8 bits in each byte of the ROM control 8 pixels; the 9th pixel between characters is always blank.
Thus 14 × 80 = 1120 memory references to the video RAM are needed per line of text on the screen.
The same number of references are made to the character generator ROM.

The original IBM PC had several modes for the screen.
In the simplest one, it used a character-mapped display for the console.
In Fig. 3-26(a) we see a portion of the video RAM.
Each character on the screen of Fig. 3-26(b) occupied two characters in the RAM.
The low-order character was the ASCII code for the character to be displayed.
The high-order character was the attribute byte, which was used to specify the color, reverse video, blinking, and so on.
The full screen of 25 by 80 characters required 4000 bytes of video RAM in this mode.
All modern displays still support this mode of operation.

Figure 3-26.
(a) A video RAM image for the IBM monochrome display.
The ×s are attribute bytes.
(b) The corresponding screen.

Contemporary bitmap displays use the same principle, except that each pixel on the screen is individually controlled.
In the simplest configuration, for a monochrome display, each pixel has a corresponding bit in the video RAM.
At the other extreme, each pixel is represented by a 24-bit number, with 8 bits each for red, green, and blue.
A 768 × 1024 color display with 24 bits per pixel requires 2 MB of RAM to hold the image.

With a memory-mapped display, the keyboard is completely decoupled from the screen.
It may be interfaced via a serial or parallel port.
On every key action the CPU is interrupted, and the keyboard driver extracts the character typed by reading an I/O port.

On a PC, the keyboard contains an embedded microprocessor which communicates through a specialized serial port with a controller chip on the main board.
An interrupt is generated whenever a key is struck and also when one is released.
Furthermore, all that the keyboard hardware provides is the key number, not the ASCII code.
When the A key is struck, the key code (30) is put in an I/O register.
It is up to the driver to determine whether it is lower case, upper case, CTRL-A, ALT-A, CTRL-ALT-A, or some other combination.
Since the driver can tell which keys have been depressed but not yet released (e.g., shift), it has enough information to do the job.
Although this keyboard interface puts the full burden on the software, it is extremely flexible.
For example, user programs may be interested in whether a digit just typed came from the top row of keys or the numeric key pad on the side.
In principle, the driver can provide this information.

RS-232 terminals are devices containing a keyboard and a display that communicate using a serial interface, one bit at a time (see Fig. 3-27).
These terminals use a 9-pin or 25-pin connector, of which one pin is used for transmitting data, one pin is for receiving data, and one pin is ground.
The other pins are for various control functions, most of which are not used.
To send a character to an RS-232 terminal, the computer must transmit it 1 bit at a time, prefixed by a start bit, and followed by 1 or 2 stop bits to delimit the character.
A parity bit which provides rudimentary error detection may also be inserted preceding the stop bits, although this is commonly required only for communication with mainframe systems.
Common transmission rates are 14,400 and 56,000 bits/sec, the former being for fax and the latter for data.
RS-232 terminals are commonly used to communicate with a remote computer using a modem and a telephone line.

Figure 3-27.
An RS-232 terminal communicates with a computer over a communication line, one bit at a time.
The computer and the terminal are completely independent.

Since both computers and terminals work internally with whole characters but must communicate over a serial line a bit at a time, chips have been developed to do the character-to-serial and serial-to-character conversions.
They are called UARTs (Universal Asynchronous Receiver Transmitters).
UARTs are attached to the computer by plugging RS-232 interface cards into the bus as illustrated in Fig. 3-27.
On modern computers the UART and RS-232 interface is frequently part of the parentboard chipset.
It may be possible disable the on-board UART to allow use of a modem interface card plugged into the bus or two of them may be able to coexist.
A modem also provides a UART (although it may be integrated with other functions in a multi-purpose chip), and the communication channel is a telephone line rather than a serial cable.
However, to the computer the UART looks the same whether the medium is a dedicated serial cable or a telephone line.

RS-232 terminals are gradually dying off, being replaced by PCs, but they are still encountered on older mainframe systems, especially in banking, airline reservation, and similar applications.
Terminal programs that allow a remote computer to simulate a terminal are still widely used, however.

To print a character, the terminal driver writes the character to the interface card, where it is buffered and then shifted out over the serial line one bit at a time by the UART.
Even at 56,000 bps, it takes just over 140 microsec to send a character.
As a result of this slow transmission rate, the driver generally outputs a character to the RS-232 card and blocks, waiting for the interrupt generated by the interface when the character has been transmitted and the UART is able to accept another character.
The UART can simultaneously send and receive characters, as its name implies.
An interrupt is also generated when a character is received, and usually a small number of input characters can be buffered.
The terminal driver must check a register when an interrupt is received to determine the cause of the interrupt.
Some interface cards have a CPU and memory and can handle multiple lines, taking over much of the I/O load from the main CPU.

RS-232 terminals can be subdivided into categories, as mentioned above.
The simplest ones were hardcopy (printing) terminals.
Characters typed on the keyboard were transmitted to the computer.
Characters sent by the computer were typed on the paper.
These terminals are obsolete and rarely seen any more.

Dumb CRT terminals work the same way, only with a screen instead of paper.
These are frequently called “glass ttys” because they are functionally the same as hardcopy ttys.
(The term “tty” is an abbreviation for Teletype, ® a former company that pioneered in the computer terminal business; “tty” has come to mean any terminal.) Glass ttys are also obsolete.

Intelligent CRT terminals are in fact miniature, specialized computers.
They have a CPU and memory and contain software, usually in ROM.
From the operating system’s viewpoint, the main difference between a glass tty and an intelligent terminal is that the latter understands certain escape sequences.
For example, by sending the ASCII ESC character (033), followed by various other characters, it may be possible to move the cursor to any position on the screen, insert text in the middle of the screen, and so forth.

1.9.2 Terminal Software

The keyboard and display are almost independent devices, so we will treat them separately here.
(They are not quite independent, since typed characters must be displayed on the screen.) In MINIX 3 the keyboard and screen drivers are part of the same process; in other systems they may be split into distinct drivers.

The basic job of the keyboard driver is to collect input from the keyboard and pass it to user programs when they read from the terminal.
Two possible philosophies can be adopted for the driver.
In the first one, the driver’s job is just to accept input and pass it upward unmodified.
A program reading from the terminal gets a raw sequence of ASCII codes.
(Giving user programs the key numbers is too primitive, as well as being highly machine dependent.)

This philosophy is well suited to the needs of sophisticated screen editors such as emacs, which allow the user to bind an arbitrary action to any character or sequence of characters.
It does, however, mean that if the user types dste instead of date and then corrects the error by typing three backspaces and ate, followed by a carriage return, the user program will be given all 11 ASCII codes typed.

Most programs do not want this much detail.
They just want the corrected input, not the exact sequence of how it was produced.
This observation leads to the second philosophy: the driver handles all the intraline editing, and just delivers corrected lines to the user programs.
The first philosophy is character-oriented; the second one is line-oriented.
Originally they were referred to as raw mode and cooked mode, respectively.
The POSIX standard uses the less-picturesque term canonical mode to describe line-oriented mode.
On most systems canonical mode refers to a well-defined configuration.
Noncanonical mode is equivalent to raw mode, although many details of terminal behavior can be changed.
POSIX-compatible systems provide several library functions that support selecting either mode and changing many aspects of terminal configuration.
In MINIX 3 the ioctl system call supports these functions.

The first task of the keyboard driver is to collect characters.
If every key-stroke causes an interrupt, the driver can acquire the character during the interrupt.
If interrupts are turned into messages by the low-level software, it is possible to put the newly acquired character in the message.
Alternatively, it can be put in a small buffer in memory and the message used to tell the driver that something has arrived.
The latter approach is actually safer if a message can be sent only to a waiting process and there is some chance that the keyboard driver might still be busy with the previous character.

Once the driver has received the character, it must begin processing it.
If the keyboard delivers key numbers rather than the character codes used by application software, then the driver must convert between the codes by using a table.
Not all IBM “compatibles” use standard key numbering, so if the driver wants to support these machines, it must map different keyboards with different tables.
A simple approach is to compile a table that maps between the codes provided by the keyboard and ASCII (American Standard Code for Information Interchange) codes into the keyboard driver, but this is unsatisfactory for users of languages other than English.
Keyboards are arranged differently in different countries, and the ASCII character set is not adequate even for the majority of people in the Western Hemisphere, where speakers of Spanish, Portuguese, and French need accented characters and punctuation marks not used in English.
To respond to the need for flexibility of keyboard layouts to provide for different languages, many operating systems provide for loadable keymaps or code pages, which make it possible to choose the mapping between keyboard codes and codes delivered to the application, either when the system is booted or later.

If the terminal is in canonical (i.e., cooked) mode, characters must be stored until an entire line has been accumulated, because the user may subsequently decide to erase part of it.
Even if the terminal is in raw mode, the program may not yet have requested input, so the characters must be buffered to allow type ahead.
(System designers who do not allow users to type far ahead ought to be tarred and feathered, or worse yet, be forced to use their own system.)

Two approaches to character buffering are common.
In the first one, the driver contains a central pool of buffers, each buffer holding perhaps 10 characters.
Associated with each terminal is a data structure, which contains, among other items, a pointer to the chain of buffers for input collected from that terminal.
As more characters are typed, more buffers are acquired and hung on the chain.
When the characters are passed to a user program, the buffers are removed and put back in the central pool.

The other approach is to do the buffering directly in the terminal data structure itself, with no central pool of buffers.
Since it is common for users to type a command that will take a little while (say, a compilation) and then type a few lines ahead, to be safe the driver should allocate something like 200 characters per terminal.
In a large-scale timesharing system with 100 terminals, allocating 20K all the time for type ahead is clearly overkill, so a central buffer pool with space for perhaps 5K is probably enough.
On the other hand, a dedicated buffer per terminal makes the driver simpler (no linked list management) and is to be preferred on personal computers with only one or two terminals.
Figure 3-28 shows the difference between these two methods.

Although the keyboard and display are logically separate devices, many users have grown accustomed to seeing the characters they have just typed appear on the screen.
Some (older) terminals oblige by automatically displaying (in hardware) whatever has just been typed, which is not only a nuisance when passwords are being entered but greatly limits the flexibility of sophisticated editors and other programs.
Fortunately, PC keyboards display nothing when keys are struck.
It is therefore up to the software to display the input.
This process is called echoing.

Echoing is complicated by the fact that a program may be writing to the screen while the user is typing.
At the very least, the keyboard driver has to figure out where to put the new input without it being overwritten by program output.

Echoing also gets complicated when more than 80 characters are typed on a terminal with 80-character lines.
Depending on the application, wrapping around to the next line may be appropriate.
Some drivers just truncate lines to 80 characters by throwing away all characters beyond column 80.

Another problem is tab handling.
All keyboards have a tab key, but displays can handle tab on output.
It is up to the driver to compute where the cursor is currently located, taking into account both output from programs and output from echoing, and compute the proper number of spaces to be echoed.

Now we come to the problem of device equivalence.
Logically, at the end of a line of text, one wants a carriage return, to move the cursor back to column 1, and a linefeed, to advance to the next line.
Requiring users to type both at the end of each line would not sell well (although some old terminals had a key which generated both, with a 50 percent chance of doing so in the order that the software wanted them).
It was (and still is) up to the driver to convert whatever comes in to the standard internal format used by the operating system.

If the standard form is just to store a linefeed (the convention in UNIX and all its descendants), carriage returns should be turned into linefeeds.
If the internal format is to store both, then the driver should generate a linefeed when it gets a carriage return and a carriage return when it gets a linefeed.
No matter what the internal convention, the terminal may require both a linefeed and a carriage return to be echoed in order to get the screen updated properly.
Since a large computer may well have a wide variety of different terminals connected to it, it is up to the keyboard driver to get all the different carriage return/linefeed combinations converted to the internal system standard and arrange for all echoing to be done right.

A related problem is the timing of carriage return and linefeeds.
On some terminals, it may take longer to display a carriage return or linefeed than a letter or number.
If the microprocessor inside the terminal actually has to copy a large block of text to achieve scrolling, then linefeeds may be slow.
If a mechanical print head has to be returned to the left margin of the paper, carriage returns may be slow.
In both cases it is up to the driver to insert filler characters (dummy null characters) into the output stream or just stop outputting long enough for the terminal to catch up.
The amount of time to delay is often related to the terminal speed; for example, at 4800 bps or slower, no delays may be needed, but at 9600 bps or higher one filler character might be required.
Terminals with hardware tabs, especially hardcopy ones, may also require a delay after a tab.

When operating in canonical mode, a number of input characters have special meanings.
Figure 3-29 shows all of the special characters required by POSIX and the additional ones recognized by MINIX 3.
The defaults are all control characters that should not conflict with text input or codes used by programs, but all except the last two can be changed using the stty command, if desired.
Older versions of UNIX used different defaults for many of these.

The ERASE character allows the user to rub out the character just typed.
In MINIX 3 it is the backspace (CTRL-H).
It is not added to the character queue but instead removes the previous character from the queue.
It should be echoed as a sequence of three characters, backspace, space, and backspace, in order to remove the previous character from the screen.
If the previous character was a tab, erasing it requires keeping track of where the cursor was prior to the tab.
In most systems, backspacing will only erase characters on the current line.
It will not erase a carriage return and back up into the previous line.

When the user notices an error at the start of the line being typed in, it is often convenient to erase the entire line and start again.
The KILL character (in MINIX 3 CTRL-U) erases the entire line.
MINIX 3 makes the erased line vanish from the screen, but some systems echo it plus a carriage return and linefeed because some users like to see the old line.
Consequently, how to echo KILL is a matter of taste.
As with ERASE it is usually not possible to go further back than the current line.
When a block of characters is killed, it may or may not be worth the trouble for the driver to return buffers to the pool, if one is used.

Sometimes the ERASE or KILL characters must be entered as ordinary data.
The LNEXT character serves as an escape character.
In MINIX 3 CTRL-V is the default.
As an example, older UNIX systems normally used the @ sign for KILL, but the Internet mail system uses addresses of the form linda@cs.washington.edu.
Someone who feels more comfortable with older conventions might redefine KILL as @, but then need to enter an @ sign literally to address e-mail.
This can be done by typing CTRL-V @.
The CTRL-V itself can be entered literally by typing CTRL-V CTRL-V.
After seeing a CTRL-V, the driver sets a flag saying that the next character is exempt from special processing.
The LNEXT character itself is not entered in the character queue.

To allow users to stop a screen image from scrolling out of view, control codes are provided to freeze the screen and restart it later.
In MINIX 3 these are STOP (CTRL-S) and START (CTRL-Q), respectively.
They are not stored but are used to set and clear a flag in the terminal data structure.
Whenever output is attempted, the flag is inspected.
If it is set, no output occurs.
Usually, echoing is also suppressed along with program output.

It is often necessary to kill a runaway program being debugged.
The INTR (CTRL-C) and QUIT (CTRL-) characters can be used for this purpose.
In MINIX 3, CTRL-C sends the SIGINT signal to all the processes started up from the terminal.
Implementing CTRL-C can be quite tricky.
The hard part is getting the information from the driver to the part of the system that handles signals, which, after all, has not asked for this information.
CTRL- is similar to CTRL-C, except that it sends the SIGQUIT signal, which forces a core dump if not caught or ignored.

When either of these keys is struck, the driver should echo a carriage return and linefeed and discard all accumulated input to allow for a fresh start.
Historically, DEL was commonly used as the default value for INTR on many UNIX systems.
Since many programs use DEL interchangeably with the backspace for editing, CTRL-C is now preferred.

Another special character is EOF (CTRL-D), which in MINIX 3 causes any pending read requests for the terminal to be satisfied with whatever is available in the buffer, even if the buffer is empty.
Typing CTRL-D at the start of a line causes the program to get a read of 0 bytes, which is conventionally interpreted as end-of-file and causes most programs to act the same way as they would upon seeing end-of-file on an input file.

Some terminal drivers allow much fancier intraline editing than we have sketched here.
They have special control characters to erase a word, skip backward or forward characters or words, go to the beginning or end of the line being typed, and so forth.
Adding all these functions to the terminal driver makes it much larger and, furthermore, is wasted when using fancy screen editors that work in raw mode anyway.

To allow programs to control terminal parameters, POSIX requires that several functions be available in the standard library, of which the most important are tcgetattr and tcsetattr.
Tcgetattr retrieves a copy of the structure shown in Fig. 3-30, the termios structure, which contains all the information needed to change special characters, set modes, and modify other characteristics of a terminal.
A program can examine the current settings and modify them as desired.
Tcsetattr then writes the structure back to the terminal driver.

struct termios {
    tcflag_t c_iflag; /* input modes */
    tcflag_t c_oflag; /* output modes */
    tcflag_t c_cflag; /* control modes */
    tcflag_t c_lflag; /* local modes */
    speed_t c_ispeed; /* input speed */
    speed_t c_ospeed; /* output speed */
    cc_t c_cc[NCCS]; /* control characters */
};

Figure 3-30.
The termios structure.
In MINIX 3 tc_flag_t is a short, speed_t is an int, and cc_t is a char.

The POSIX standard does not specify whether its requirements should be implemented through library functions or system calls.
MINIX 3 provides a system call, ioctl, called by

that is used to examine and modify the configurations of many I/O devices.
This call is used to implement the tcgetattr and tcsetattr functions.
The variable request specifies whether the termios structure is to be read or written, and in the latter case, whether the request is to take effect immediately or should be deferred until all currently queued output is complete.
The variable argp is a pointer to a termios structure in the calling program.
This particular choice of communication between program and driver was chosen for its UNIX compatibility, rather than for its inherent beauty.

A few notes about the termios structure are in order.
The four flag words provide a great deal of flexibility.
The individual bits in c_iflag control various ways input is handled.
For instance, the ICRNL bit causes CR characters to be converted into NL on input.
This flag is set by default in MINIX 3.
The c_oflag holds bits that affect output processing.
For instance, the OPOST bit enables output processing.
It and the ONLCR bit, which causes NL characters in the output to be converted into a CR NL sequence, are also set by default in MINIX 3.
The c_cflag is the control flags word.
The default settings for MINIX 3 enable a line to receive 8-bit characters and cause a modem to hang up if a user logs out on the line.
The c_lflag is the local mode flags field.
One bit, ECHO, enables echoing (this can be turned off during a login to provide security for entering a password).
Its most important bit is the ICANON bit, which enables canonical mode.
With ICANON off, several possibilities exist.
If all other settings are left at their defaults, a mode identical to the traditional cbreak mode is entered.
In this mode, characters are passed to the program without waiting for a full line, but the INTR, QUIT, START, and STOP characters retain their effects.
All of these can be disabled by resetting bits in the flags, however, to produce the equivalent of traditional raw mode.

The various special characters that can be changed, including those which are MINIX 3 extensions, are held in the c_cc array.
This array also holds two parameters which are used in noncanonical mode.
The quantity MIN, stored in c_cc[VMIN], specifies the minimum number of characters that must be received to satisfy a read call.
The quantity TIME in c_cc[VTIME] sets a time limit for such calls.
MIN and TIME interact as shown in Fig. 3-31.
A call that asks for N bytes is illustrated.
With TIME = 0 and MIN = 1, the behavior is similar to the traditional raw mode.

Figure 3-31.
MIN and TIME determine when a call to read returns in noncanonical mode.
N is the number of bytes requested.

Output is simpler than input, but drivers for RS-232 terminals are radically different from drivers for memory-mapped terminals.
The method that is commonly used for RS-232 terminals is to have output buffers associated with each terminal.
The buffers can come from the same pool as the input buffers, or be dedicated, as with input.
When programs write to the terminal, the output is first copied to the buffers.
Similarly, output from echoing is also copied to the buffers.
After all the output has been copied to the buffers (or the buffers are full), the first character is output, and the driver goes to sleep.
When the interrupt comes in, the next character is output, and so on.

With memory-mapped terminals, a simpler scheme is possible.
Characters to be printed are extracted one at a time from user space and put directly in the video RAM.
With RS-232 terminals, each character to be output is just put on the line to the terminal.
With memory mapping, some characters require special treatment, among them, backspace, carriage return, linefeed, and the audible bell (CTRL-G).
A driver for a memory-mapped terminal must keep track in software of the current position in the video RAM, so that printable characters can be put there and the current position advanced.
Backspace, carriage return, and linefeed all require this position to be updated appropriately.
Tabs also require special processing.

In particular, when a linefeed is output on the bottom line of the screen, the screen must be scrolled.
To see how scrolling works, look at Fig. 3-26.
If the video controller always began reading the RAM at 0xB0000, the only way to scroll the screen when in character mode would be to copy 24 × 80 characters (each character requiring 2 bytes) from 0xB00A0 to 0xB0000, a time-consuming proposition.
In bitmap mode, it would be even worse.

Fortunately, the hardware usually provides some help here.
Most video controllers contain a register that determines where in the video RAM to begin fetching bytes for the top line on the screen.
By setting this register to point to 0xB00A0 instead of 0xB0000, the line that was previously number two moves to the top, and the whole screen scrolls up one line.
The only other thing the driver must do is copy whatever is needed to the new bottom line.
When the video controller gets to the top of the RAM, it just wraps around and continues merrily fetching bytes starting at the lowest address.
Similar hardware assistance is provided in bitmap mode.

Another issue that the driver must deal with on a memory-mapped terminal is cursor positioning.
Again, the hardware usually provides some assistance in the form of a register that tells where the cursor is to go.
Finally, there is the problem of the bell.
It is sounded by outputting a sine or square wave to the loudspeaker, a part of the computer quite separate from the video RAM.

Screen editors and many other sophisticated programs need to be able to update the screen in more complex ways than just scrolling text onto the bottom of the display.
To accommodate them, many terminal drivers support a variety of escape sequences.
Although some terminals support idiosyncratic escape sesequence sets, it is advantageous to have a standard to facilitate adapting software from one system to another.
The American National Standards Institute (ANSI) has defined a set of standard escape sequences, and MINIX 3 supports a subset of the ANSI sequences, shown in Fig. 3-32, that is adequate for many common operations.
When the driver sees the character that starts the escape sequences, it sets a flag and waits until the rest of the escape sequence comes in.
When everything has arrived, the driver must carry it out in software.
Inserting and deleting text require moving blocks of characters around the video RAM.
The hardware is of no help with anything except scrolling and displaying the cursor.

Figure 3-32.
The ANSI escape sequences accepted by the terminal driver on output.
ESC denotes the ASCII escape character (0x1B), and n, m, and s are optional numeric parameters.

1.9.3 Overview of the Terminal Driver in MINIX 3

The terminal driver is contained in four C files (six if RS-232 and pseudo terminal support are enabled) and together they far and away constitute the largest driver in MINIX 3.
The size of the terminal driver is partly explained by the observation that the driver handles both the keyboard and the display, each of which is a complicated device in its own right, as well as two other optional types of terminals.
Still, it comes as a surprise to most people to learn that terminal I/O requires thirty times as much code as the scheduler.
(This feeling is reinforced by looking at the numerous books on operating systems that devote thirty times as much space to scheduling as to all I/O combined.)

The terminal driver accepts more than a dozen message types.
The most important are:

Other message types are used for special purposes such as generating diagnostic displays when function keys are pressed or triggering panic dumps.

The messages used for reading and writing have the same format as shown in Fig. 3-17, except that no POSITION field is needed.
With a disk, the program has to specify which block it wants to read.
With a keyboard, there is no choice: the program always gets the next character typed in.
Keyboards do not support seeks.

The POSIX functions tcgetattr and tcgetattr, used to examine and modify terminal attributes (properties), are supported by the ioctl system call.
Good programming practice is to use these functions and others in include/termios.h and leave it to the C library to convert library calls to ioctl system calls.
There are, however, some control operations needed by MINIX 3 that are not provided for in POSIX, for example, loading an alternate keymap, and for these the programmer must use ioctl explicitly.

The message sent to the driver by an ioctl system call contains a function request code and a pointer.
For the tcsetattr function, an ioctl call is made with a TCSETS, TCSETSW, or TCSETSF request type, and a pointer to a termios structure like the one shown in Fig. 3-30.
All such calls replace the current set of attributes with a new set, the differences being that a TCSETS request takes effect immediately, a TCSETSW request does not take effect until all output has been transmitted, and a TCSETSF waits for output to finish and discards all input that has not yet been read.
Tcgetattr is translated into an ioctl call with a TCGETS request type and returns a filled in termios structure to the caller, so the current state of a device can be examined.
Ioctl calls that do not correspond to functions defined by POSIX, like the KIOCSMAP request used to load a new keymap, pass pointers to other kinds of structures, in this case to a keymap_t which is a 1536-byte structure (16-bit codes for 128 keys × 6 modifiers).
Figure 3-39 summarizes how standard POSIX calls are converted into ioctl system calls.

The terminal driver uses one main data structure, tty_table, which is an array of tty structures, one per terminal.
A standard PC has only one keyboard and display, but MINIX 3 can support up to eight virtual terminals, depending upon the amount of memory on the display adapter card.
This permits the person at the console to log on multiple times, switching the display output and keyboard input from one “user” to another.
With two virtual consoles, pressing ALT-F2 selects the second one and ALT-F1 returns to the first.
ALT plus the arrow keys also can be used.
In addition, serial lines can support two users at remote locations, connected by RS-232 cable or modem, and pseudo terminals can support users connected through a network.
The driver has been written to make it easy to add additional terminals.
The standard configuration illustrated in the source code in this text has two virtual consoles, with serial lines and pseudo terminals disabled.

Each tty structure in tty_table keeps track of both input and output.
For input, it holds a queue of all characters that have been typed but not yet read by the program, information about requests to read characters that have not yet been received, and timeout information, so input can be requested without the driver blocking permanently if no character is typed.
For output, it holds the parameters of write requests that are not yet finished.
Other fields hold various general variables, such as the termios structure discussed above, which affects many properties of both input and output.
There is also a field in the tty structure to point to information which is needed for a particular class of devices but is not needed in the tty_table entry for every device.
For instance, the hardware-dependent part of the console driver needs the current position on the screen and in the video RAM, and the current attribute byte for the display, but this information is not needed to support an RS-232 line.
The private data structures for each device type are also where the buffers that receive input from the interrupt service routines are located.
Slow devices, such as the keyboard, do not need buffers as large as those needed by fast devices.

To better understand how the driver works, let us first look at how characters typed in on the keyboard work their way through the system to the program that wants them.
Although this section is intended as an overview we will use line number references to help the reader find each function used.
You may find this a wild ride, getting input exercises code in tty.c, keyboard.c, and console.c, all of which are large files,

When a user logs in on the system console, a shell is created for him with /dev/console as standard input, standard output, and standard error.
The shell starts up and tries to read from standard input by calling the library procedure read.
This procedure sends a message that contains the file descriptor, buffer address, and count to the file system.
This message is shown as (1) in Fig. 3-33.
After sending the message, the shell blocks, waiting for the reply.
(User processes execute only the sendrec primitive, which combines a send with a receive from the process sent to.)

The file system gets the message and locates the i-node corresponding to the specified file descriptor.
This i-node is for the character special file /dev/console and contains the major and minor device numbers for the terminal.
The major device type for terminals is 4; for the console the minor device number is 0.

The file system indexes into its device map, dmap, to find the number of the terminal driver, TTY.
Then it sends a message to TTY, shown as (2) in Fig. 3-33.
Normally, the user will not have typed anything yet, so the terminal driver will be unable to satisfy the request.
It sends a reply back immediately to unblock the file system and report that no characters are available, shown as (3).
The file system records the fact that a process is waiting for terminal (i.e., keyboard) input in the console’s structure in tty_table and then goes off to get the next request for work.
The user’s shell remains blocked until the requested characters arrive, of course.

When a character is finally typed on the keyboard, it causes two interrupts, one when the key is depressed and one when it is released.
An important point is that a PC keyboard does not generate ASCII codes; each key generates a scan code when pressed, and a different code when released.
The lower 7 bits of the “press” and “release” codes are identical.
The difference is the most significant bit, which is a 0 when the key is pressed and a 1 when it is released.
This also applies to modifier keys such as CTRL and SHIFT.
Although ultimately these keys do not cause ASCII codes to be returned to the user process, they generate scan codes indicating which key was pressed (the driver can distinguish between the left and right shift keys if desired), and they still cause two interrupts per key.

Figure 3-33.
Read request from the keyboard when no characters are pending.
FS is the file system.
TTY is the terminal driver.
The TTY receives a message for every keypress and queues scan codes as they are entered.
Later these are interpreted and assembled into a buffer of ASCII codes which is copied to the user process.

The keyboard interrupt is IRQ 1.
This interrupt line is not accessible on the system bus, and can not be shared by any other I/O adapter.
When_hwint01 calls intr_handle there will not be a long list of hooks to traverse to find that the TTY should be notified.
In Fig. 3-33 we show the system task originating the notification message (4) because it is generated by generic_handler in system/do_irqctl.c (not listed), but this routine is called directly by the low-level interrupt processing routines.
The system task process is not activated.
Upon receiving a HARD_INT message tty_task dispatches to kbd_interrupt which in turn calls scan_keyboard.
Scan_keyboard makes three kernel calls (5, 6, 7) to cause the system task to read from and write to several I/O ports, which ultimately returns the scan code, then is added to a circular buffer.
A tty_events flag is then set to indicate this buffer contains characters and is not empty.

No message is needed as of this point.
Every time the main loop of tty_task starts another cycle it inspects the tty_events flag for each terminal device, and, for each device which has the flag set, calls handle_events.
The tty_events flag can signal various kinds of activity (although input is the most likely), so handle_events always calls the device-specific functions for both input and output.
For input from the keyboard this results in a call to kb_read, which keeps track of keyboard codes that indicate pressing or releasing of the CTRL, SHIFT, and ALT keys and converts scan codes into ASCII codes.
Kb_read in turn calls in_process, which processes the ASCII codes, taking into account special characters and different flags that may be set, including whether or not canonical mode is in effect.
The effect is normally to add characters to the console’s input queue in tty_table, although some codes, for instance BACKSPACE, have other effects.
Normally, also, in_process initiates echoing of the ASCII codes to the display.

When enough characters have come in, the terminal driver makes another kernel call (8) to ask the system task to copy the data to the address requested by the shell.
The copying of the data is not message passing and for that reason is shown by dashed lines (9) in Fig. 3-33.
More than one such line is shown because there may be more than one such operation before the user’s request has been completely fulfilled.
When the operation is finally complete, the terminal driver sends a message to the file system telling it that the work has been done (10), and the file system reacts to this message by sending a message back to the shell to unblock it (11).

The definition of when enough characters have come in depends upon the terminal mode.
In canonical mode a request is complete when a linefeed, end-of-line, or end-of-file code is received, and, in order for proper input processing to be done, a line of input cannot exceed the size of the input queue.
In noncanonical mode a read can request a much larger number of characters, and in_process may have to transfer characters more than once before a message is returned to the file system to indicate the operation is complete.

Note that the system task copies the actual characters directly from the TTY’s address space to that of the shell.
They do not go through the file system.
With block I/O, data pass through the file system to allow it to maintain a buffer cache of the most recently used blocks.
If a requested block happens to be in the cache, the request can be satisfied directly by the file system, without doing any actual disk I/O.

For keyboard I/O, a cache makes no sense.
Furthermore, a request from the file system to a disk driver can always be satisfied in at most a few hundred milliseconds, so there is no harm in having the file system wait.
Keyboard I/O may take hours to complete, or may never be complete (in canonical mode the terminal driver waits for a complete line, and it may also wait a long time in noncanonical mode, depending upon the settings of MIN and TIME).
Thus, it is unacceptable to have the file system block until a terminal input request is satisfied.

Later on, it may happen that the user has typed ahead, and that characters are available before they have been requested, from previous interrupts and event 4.
In that case, events 1, 2, and 5 through 11 all happen in quick succession after the read request; 3 does not occur at all.

Readers who are familiar with earlier versions of MINIX may remember that in these versions the TTY driver (and all other drivers) were compiled together with the kernel.
Each driver had its own interrupt handler in kernel space.
In the case of the keyboard driver, the interrupt handler itself could buffer a certain number of scan codes, and also do some preliminary processing (scan codes for most key releases could be dropped, only for modifier keys like the shift key is it necessary to buffer the release codes).
The interrupt handler itself did not send messages to the TTY driver, because the probability was high that the TTY would not be blocked on a receive and able to receive a message at any given time.
Instead, the clock interrupt handler awakened the TTY driver periodically.
These techniques were adopted to avoid losing keyboard input.

Earlier we made something of a point of the differences between handling expected interrupts, such as those generated by a disk controller, and handling unpredictable interrupts like those from a keyboard.
But in MINIX 3 nothing special seems to have been done to deal with the problems of unpredictable interrupts.
How is this possible? One thing to keep in mind is the enormous difference in performance between the computers for which the earliest versions of MINIX were written and current designs.
CPU clock speeds have increased, and the number of clock cycles needed to execute an instruction has decreased.
The minimum processor recommended for use with MINIX 3 is an 80386.
A slow 80386 will execute instructions approximately 20 times as fast as the original IBM PC.
A 100 MHz Pentium will execute perhaps 25 times as fast as the slow 80386.
So perhaps CPU speed is enough.

Another thing to keep in mind is that keyboard input is very slow by computer standards.
At 100 words per minute a typist enters fewer than 10 characters per second.
Even with a fast typist the terminal driver will probably be sent an interrupt message for each character typed at the keyboard.
However, in the case of other input devices higher data rates are probable—rates 1000 or more times faster than those of a typist are possible from a serial port connected to a 56,000-bps modem.
At that speed approximately 120 characters may be received by the modem between clock ticks, but to allow for data compression on the modem link the serial port connected to the modem must be able to handle at least twice as many.

One thing to consider with a serial port, however, is that characters, not scan codes, are transmitted, so even with an old UART that does no buffering, there will be only one interrupt per keypress instead of two.
And newer PCs are equipped with UARTs that typically buffer at least 16, and perhaps as many 128 characters.
So one interrupt per character is not required.
For instance, a UART with a 16-character buffer might be configured to interrupt when 14 characters are in the buffer.
Ethernet-based networks can deliver characters at a rate much faster than a serial line, but ethernet adapters buffer entire packets, and only one interrupt is necessary per packet.

We will complete our overview of terminal input by summarizing the events that occur when the terminal driver is first activated by a read request and when it is reactivated after receipt of keyboard input (see Fig. 3-34).
In the first case, when a message comes in to the terminal driver requesting characters from the keyboard, the main procedure, tty_task calls do_read to handle the request.
Do_read stores the parameters of the call in the keyboard’s entry in tty_table, in case there are insufficient characters buffered to satisfy the request.

Figure 3-34.
Input handling in the terminal driver.
The left branch of the tree is taken to process a request to read characters.
The right branch is taken when a keyboard message is sent to the driver before a user has requested input.
[figure 3-X to be revised]

Then it calls in_transfer to get any input already waiting, and then handle_events which in turn calls (via the function pointer (*tp->tty_devread)) kb_read and then in_transfer once again, in order to try to milk the input stream for a few more characters.
Kb_read calls several other procedures not shown in Fig. 3-34 to accomplish its work.
The result is that whatever is immediately available is copied to the user.
If nothing is available then, nothing is copied.
If the read is completed by in_transfer or by handle_events, a message is sent to the file system when all characters have been transferred, so the file system can unblock the caller.
If the read was not completed (no characters, or not enough characters) do_read reports back to the file system, telling it whether it should suspend the original caller, or, if a nonblocking read was requested, cancel the read.

The right side of Fig. 3-34 summarizes the events that occur when the terminal driver is awakened subsequent to an interrupt from the keyboard.
When a character is typed, the interrupt “handler” kbd_interrupt calls scan_keyboard which calls the system task to do the I/O.
(We put “handler” in quotes because it is not a real handler called when an interrupt occurs, it is activated by a message sent to tty_task from generic_handler in the system task.) Then kbd_interrupt puts the scan code into the keyboard buffer, ibuf, and sets a flag to identify that the console device has experienced an event.
When kbd_interrupt returns control to tty_task a continue statement results in starting another iteration of the main loop.
The event flags of all terminal devices are checked and handle_events is called for each device with a raised flag.
In the case of the keyboard, handle_events calls kb_read and in_transfer, just as was done on receipt of the original read request.
The events shown on the right side of the figure may occur several times, until enough characters are received to fulfill the request accepted by do_read after the first message from the FS.
If the FS tries to initiate a request for more characters from the same device before the first request is complete, an error is returned.
Of course, each device is independent; a read request on behalf of a user at a remote terminal is processed separately from one for a user at the console.

The functions not shown in Fig. 3-34 that are called by kb_read include map_key, which converts the key codes (scan codes) generated by the hardware into ASCII codes, make_break, which keeps track of the state of modifier keys such as the SHIFT key, and in_process, which handles complications such as attempts by the user to backspace over input entered by mistake, other special characters, and options available in different input modes.
In_process also calls tty_echo, so the typed characters will be displayed on the screen.

In general, console output is simpler than terminal input, because the operating system is in control and does not need to be concerned with requests for output arriving at inconvenient times.
Also, because the MINIX 3 console is a memory-mapped display, output to the console is particularly simple.
No interrupts are needed; the basic operation is to copy data from one memory region to another.
On the other hand, all the details of managing the display, including handling escape sequences, must be handled by the driver software.
As we did with keyboard input in the previous section, we will trace through the steps involved in sending characters to the console display.
We will assume in this example that the active display is being written; minor complications caused by virtual consoles will be discussed later.

When a process wants to print something, it generally calls printf.
Printf calls write to send a message to the file system.
The message contains a pointer to the characters that are to be printed (not the characters themselves).
The file system then sends a message to the terminal driver, which fetches them and copies them to the video RAM.
Figure 3-35 shows the main procedures involved in output.

Figure 3-35.
Major procedures used in terminal output.
The dashed line indicates characters copied directly to ramqueue by cons_write.

When a message comes in to the terminal driver requesting it to write on the screen, do_write is called to store the parameters in the console’s tty struct in the tty_table.
Then handle_events (the same function called whenever the tty_events flag is found set) is called.
On every call this function calls both the input and output routines for the device selected in its argument.
In the case of the console display this means that any keyboard input that is waiting is processed first.
If there is input waiting, characters to be echoed are added to whatever characters are already awaiting output.
Then a call is made to cons_write, the output procedure for memory-mapped displays.
This procedure uses phys_copy to copy blocks of characters from the user process to a local buffer, possibly repeating this and the following steps a number of times, since the local buffer holds only 64 bytes.
When the local buffer is full, each 8-bit byte is transferred to another buffer, ramqueue.
This is an array of 16-bit words.
Alternate bytes are filled in with the current value of the screen attribute byte, which determines foreground and background colors and other attributes.
When possible, characters are transferred directly into ramqueue, but certain characters, such as control characters or characters that are parts of escape sequences, need special handling.
Special handling is also required when a character’s screen position would exceed the width of the screen, or when ramqueue becomes full.
In these cases out_char is called to transfer the characters and take whatever additional action is called for.
For instance, scroll_screen is called when a linefeed character is received while addressing the last line of the screen, and parse_escape handles characters during an escape sequence.
Usually out_char calls flush which copies the contents of ramqueue to the video display memory, using the assembly language routine mem_vid_copy.
Flush is also called after the last character is transferred into ramqueue to be sure all output is displayed.
The final result of flush is to command the 6845 video controller chip to display the cursor in the correct position.

Logically, the bytes fetched from the user process could be written into the video RAM one per loop iteration.
However, accumulating the characters in ram-queue and then copying the block with a call to mem_vid_copy are more efficient in the protected memory environment of Pentium-class processors.
Interestingly, this technique was introduced in early versions of MINIX 3 that ran on older processors without protected memory.
The precursor of mem_vid_copy dealt with a timing problem—with older video displays the copy into the video memory had to be done when the screen was blanked during vertical retrace of the CRT beam to avoid generating visual garbage all over the screen.
MINIX 3 no longer provides this support for obsolete equipment as the performance penalty is too great.
However, the modern version of MINIX 3 benefits in other ways from copying ram-queue as a block.

The video RAM available to a console is delimited in the console structure by the fields c_start and c_limit.
The current cursor position is stored in the c_column and c_row fields.
The coordinate (0, 0) is in the upper left corner of the screen, which is where the hardware starts to fill the screen.
Each video scan begins at the address given by c_org and continues for 80 × 25 characters (4000 bytes).
In other words, the 6845 chip pulls the word at offset c_org from the video RAM and displays the character byte in the upper left-hand corner, using the attribute byte to control color, blinking, and so forth.
Then it fetches the next word and displays the character at (1, 0).
This process continues until it gets to (79, 0), at which time it begins the second line on the screen, at coordinate (0, 1).

When the computer is first started, the screen is cleared, output is written into the video RAM starting at location c_start, and c_org is assigned the same value as c_start.
Thus the first line appears on the top line of the screen.
When output must go to a new line, either because the first line is full or because a newline character is detected by out_char, output is written into the location given by c_start plus 80.
Eventually, all 25 lines are filled, and scrolling of the screen is required.
Some programs, editors, for example, require scrolling in the downward direction too, when the cursor is on the top line and further movement upward within the text is required.

There are two ways scrolling the screen can be managed.
In software scrolling, the character to be displayed at position (0, 0) is always in the first location in video memory, word 0 relative to the position pointed to by c_start, and the video controller chip is commanded to display this location first by keeping the same address in c_org.
When the screen is to be scrolled, the contents of relative location 80 in the video RAM, the beginning of the second line on the screen, is copied to relative location 0, word 81 is copied to relative location 1, and so on.
The scan sequence is unchanged, putting the data at location 0 in the memory at screen position (0, 0) and the image on the screen appears to have moved up one line.
The cost is that the CPU has moved 80 × 24 = 1920 words.
In hardware scrolling, the data are not moved in the memory; instead the video controller chip is instructed to start the display at a different point, for instance, with the data at word 80.
The bookkeeping is done by adding 80 to the contents of c_org, saving it for future reference, and writing this value into the correct register of the video controller chip.
This requires either that the controller be smart enough to wrap around the video RAM, taking data from the beginning of the RAM (the address in c_start) when it reaches the end (the address in c_limit), or that the video RAM have more capacity than just the 80 × 2000 words necessary to store a single screen of display.

Older display adapters generally have smaller memory but are able to wrap around and do hardware scrolling.
Newer adapters generally have much more memory than needed to display a single screen of text, but the controllers are not able to wrap.
Thus an adapter with 32,768 bytes of display memory can hold 204 complete lines of 160 bytes each, and can do hardware scrolling 179 times before the inability to wrap becomes a problem.
But, eventually a memory copy operation will be needed to move the data for the last 24 lines back to location 0 in the video memory.
Whichever method is used, a row of blanks is copied to the video RAM to ensure that the new line at the bottom of the screen is empty.

When virtual consoles are enabled, the available memory within a video adapter is divided equally between the number of consoles desired by properly initializing the c_start and c_limit fields for each console.
This has an effect on scrolling.
On any adapter large enough to support virtual consoles, software scrolling takes place every so often, even though hardware scrolling is nominally in effect.
The smaller the amount of memory available to each console display, the more frequently software scrolling must be used.
The limit is reached when the maximum possible number of consoles is configured.
Then every scroll operation will be a software scroll operation.

The position of the cursor relative to the start of the video RAM can be derived from c_column and c_row, but it is faster to store it explicitly (in c_cur).
When a character is to be printed, it is put into the video RAM at location c_cur, which is then updated, as is c_column.
Figure 3-36 summarizes the fields of the console structure that affect the current position and the display origin.

Figure 3-36.
Fields of the console structure that relate to the current screen position.

The characters that affect the cursor position (e.g., linefeed, backspace) are handled by adjusting the values of c column, c row, and c cur.
This work is done at the end of flush by a call to set 6845 which sets the registers in the video controller chip.

The terminal driver supports escape sequences to allow screen editors and other interactive programs to update the screen in a flexible way.
The sequences supported are a subset of an ANSI standard and should be adequate to allow many programs written for other hardware and other operating systems to be easily ported to MINIX 3.
There are two categories of escape sequences: those that never contain a variable parameter, and those that may contain parameters.
In the first category the only representative supported by MINIX 3 is ESC M, which reverse indexes the screen, moving the cursor up one line and scrolling the screen downward if the cursor is already on the first line.
The other category can have one or two numeric parameters.
Sequences in this group all begin with ESC [.
The “[” is the control sequence introducer.
A table of escape sequences defined by the ANSI standard and recognized by MINIX 3 was shown in Fig. 3-32.

Parsing escape sequences is not trivial.
Valid escape sequences in MINIX 3 can be as short as two characters, as in ESC M, or up to 8 characters long in the case of a sequence that accepts two numeric parameters that each can have a two-digit values as in ESC [20;60H, which moves the cursor to line 20, column 60.
In a sequence that accepts a parameter, the parameter may be omitted, and in a sequence that accepts two parameters either or both of them may be omitted.

When a parameter is omitted or one that is outside the valid range is used, a default is substituted.
The default is the lowest valid value.

Consider the following ways a program could construct a sequence to move to the upper-left corner of the screen:

These examples are presented not to suggest one should deliberately use invalid parameters but to show that the code that parses such sequences is nontrivial.

MINIX 3 implements a finite state automaton to do this parsing.
The variable c_esc_state in the console structure normally has a value of 0.
When out_char detects an ESC character, it changes c_esc_state to 1, and subsequent characters are processed by parse_escape.
If the next character is the control sequence introducer, state 2 is entered; otherwise the sequence is considered complete, and do_escape is called.
In state 2, as long as incoming characters are numeric, a parameter is calculated by multiplying the previous value of the parameter (initially 0) by 10 and adding the numeric value of the current character.
The parameter values are kept in an array and when a semicolon is detected the processing shifts to the next cell in the array.
(The array in MINIX 3 has only two elements, but the principle is the same).
When a nonnumeric character that is not a semicolon is encountered the sequence is considered complete, and again do_escape is called.
The current character on entry to do_escape then is used to select exactly what action to take and how to interpret the parameters, either the defaults or those entered in the character stream.
This is illustrated in Fig. 3-44.

The IBM PC keyboard does not generate ASCII codes directly.
The keys are each identified with a number, starting with the keys that are located in the upper left of the original PC keyboard—1 for the “ESC” key, 2 for the “1”, and so on.
Each key is assigned a number, including modifier keys like the left SHIFT and right SHIFT keys, numbers 42 and 54.
When a key is pressed, MINIX 3 receives the key number as a scan code.
A scan code is also generated when a key is released, but the code generated upon release has the most significant bit set (equivalent to adding 128 to the key number).
Thus a key press and a key release can be distinguished.
By keeping track of which modifier keys have been pressed and not yet released, a large number of combinations are possible.
For ordinary purposes, of course, two-finger combinations, such as SHIFT-A or CTRL-D, are most manageable for two-handed typists, but for special occasions three-key (or more) combinations are possible, for instance, CTRL-SHIFT-A, or the well-known CTRL-ALT-DEL combination that PC users recognize as the way to reset and reboot the system.

The complexity of the PC keyboard allows for a great deal of flexibility in how it used.
A standard keyboard has 47 ordinary character keys defined (26 alphabetic, 10 numeric, and 11 punctuation).
If we are willing to use three-fingered modifier key combinations, such as CTRL-ALT-SHIFT, we can support a character set of 376 (8 × 47) members.
This is by no means the limit of what is possible, but for now let us assume we do not want to distinguish between the left- and right-hand modifier keys, or use any of the numeric keypad or function keys.
Indeed, we are not limited to using just the CTRL, ALT, and SHIFT keys as modifiers; we could retire some keys from the set of ordinary keys and use them as modifiers if we desired to write a driver that supported such a system.

Operating systems that use such keyboards use a keymap to determine what character code to pass to a program based upon the key being pressed and the modifiers in effect.
The MINIX 3 keymap logically is an array of 128 rows, representing possible scan code values (this size was chosen to accommodate Japanese keyboards; U.S.
and European keyboards do not have this many keys) and 6 columns.
The columns represent no modifier, the SHIFT key, the Control key, the left ALT key, the right ALT key, and a combination of either ALT key plus the SHIFT key.
There are thus 720 ((128 − 6) × 6) character codes that can be generated by this scheme, given an adequate keyboard.
This requires that each entry in the table be a 16-bit quantity.
For U.S.
keyboards the ALT and ALT2 columns are identical.
ALT2 is named ALTGR on keyboards for other languages, and many of these keymaps support keys with three symbols by using this key as a modifier.

in keyboard.c) is compiled into the MINIX 3 kernel at compilation time, but an

call can be used to load a different map into the kernel at address keymap.
A full keymap occupies 1536 bytes (128 × 6 × 2).
Extra keymaps are stored in compressed form.
A program called genmap is used to make a new compressed keymap.
When compiled, genmap includes the keymap.src code for a particular keymap, so the map is compiled within genmap.
Normally, genmap is executed immediately after being compiled, at which time it outputs the compressed version to a file, and then the genmap binary is deleted.
The command loadkeys reads a compressed keymap, expands it internally, and then calls ioctl to transfer the keymap into the kernel memory.
MINIX 3 can execute loadkeys automatically upon starting, and the program can also be invoked by the user at any time.

The source code for a keymap defines a large initialized array, and in the interest of saving space a keymap file is not printed in Appendix B.
Figure 3-37 shows in tabular form the contents of a few lines of src/kernel/keymaps/us-std.src which illustrate several aspects of keymaps.
There is no key on the IBM-PC keyboard that generates a scan code of 0.
The entry for code 1, the ESC key, shows that the value returned is unchanged when the SHIFT key or CTRL key are pressed, but that a different code is returned when an ALT key is pressed simultaneously with the ESC key.
The values compiled into the various columns are determined by macros defined in include/minix/keymap.h:

The first three of these macros manipulate bits in the code for the quoted character to produce the necessary code to be returned to the application.
The last one sets the HASCAPS bit in the high byte of the 16-bit value.
This is a flag that indicates that the state of the capslock variable has to be checked and the code possibly modified before being returned.
In the figure, the entries for scan codes 2, 13, and 16 show how typical numeric, punctuation, and alphabetic keys are handled.
For code 28 a special feature is seen—normally the ENTER key produces a CR (0x0D) code, represented here as C(‘M’) .
Because the newline character in UNIX files is the LF (0x0A) code, and it is sometimes necessary to enter this directly, this keyboard map provides for a CTRL-ENTER combination, which produces this code, C(‘J’) .

Scan code 29 is one of the modifier codes and must be recognized no matter what other key is pressed, so the CTRL value is returned regardless of any other key that may be pressed.
The function keys do not return ordinary ASCII values, and the row for scan code 59 shows symbolically the values (defined in include/minix/keymap.h) that are returned for the F1 key in combination with other modifiers.
These values are F1: 0x0110, SF1: 0x1010, AF1: 0x0810, ASF1: 0x0C10, and CF1: 0x0210.
The last entry shown in the figure, for scan code 127, is typical of many entries near the end of the array.
For many keyboards, certainly most of those used in Europe and the Americas, there are not enough keys to generate all the possible codes, and these entries in the table are filled with zeroes.

Early PCs had the patterns for generating characters on a video screen stored only in ROM, but the displays used on modern systems provide RAM on the video display adapters into which custom character generator patterns can be loaded.
This is supported by MINIX 3 with a

ioctl operation.
MINIX 3 supports an 80 lines × 25 rows video mode, and font files contain 4096 bytes.
Each byte represents a line of 8 pixels that are illuminated if the bit value is 1, and 16 such lines are needed to map each character.
However the video display adapter uses 32 bytes to map each character, to provide higher resolution in modes not currently supported by MINIX 3.
The loadfont command is provided to convert these files into the 8192-byte font structure referenced by the ioctl call and to use that call to load the font.
As with the keymaps, a font can be loaded at startup time, or at any time during normal operation.
However, every video adapter has a standard font built into its ROM that is available by default.
There is no need to compile a font into MINIX 3 itself, and the only font support necessary in the kernel is the code to carry out the TIOCSFON ioctl operation.

1.9.4 Implementation of the Device-Independent Terminal Driver

In this section we will begin to look at the source code of the terminal driver in detail.
We saw when we studied the block devices that multiple drivers supporting several different devices could share a common base of software.
The case with the terminal devices is similar, but with the difference that there is one terminal driver that supports several different kinds of terminal device.
Here we will start with the device-independent code.
In later sections we will look at the device-dependent code for the keyboard and the memory-mapped console display.

The file tty.h contains definitions used by the C files which implement the terminal drivers.
Since this driver supports many different devices, the minor device numbers must be used to distinguish which device is being supported on a particular call.

Within tty.h, the definitions of the O_NOCTTY and O_NONBLOCK flags (which are optional arguments to the open call) are duplicates of definitions in include/fcntl.h but they are repeated here so as not to require including another file.
The devfun_t and devfunarg_t types are used to define pointers to functions, in order to provide for indirect calls using a mechanism similar to what we saw in the code for the main loop of the disk drivers.

Many variables declared in this file are identified by the prefix tty_.
The most important definition in tty.h is the tty structure.
There is one such structure for each terminal device (the console display and keyboard together count as a single terminal).
The first variable in the tty structure, tty_events, is the flag that is set when an interrupt causes a change that requires the terminal driver to attend to the device.

The rest of the tty structure is organized to group together variables that deal with input, output, status, and information about incomplete operations.
In the input section, tty_inhead and tty_intail define the queue where received characters are buffered.
Tty_incount counts the number of characters in this queue, and tty_eotct counts lines or characters, as explained below.
All device-specific calls are done indirectly, with the exception of the routines that initialize the terminals, which are called to set up the pointers used for the indirect calls.
The tty_devread and tty_icancel fields hold pointers to device-specific code to perform the read and input cancel operations.
Tty_min is used in comparisons with tty_eotct.
When the latter becomes equal to the former, a read operation is complete.
During canonical input, tty_min is set to 1 and tty_eotct counts lines entered.
During noncanonical input, tty_eotct counts characters and tty_min is set from the MIN field of the termios structure.
The comparison of the two variables thus tells when a line is ready or when the minimum character count is reached, depending upon the mode.
Tty_tmr is a timer for this tty, used for the TIME field of termios.

Since queueing of output is handled by the device-specific code, the output section of tty declares no variables and consists entirely of pointers to device-specific functions that write, echo, send a break signal, and cancel output.
In the status section the flags tty_reprint, tty_escaped, and tty_inhibited indicate that the last character seen has a special meaning; for instance, when a CTRL-V (LNEXT) character is seen, tty_escaped is set to 1 to indicate that any special meaning of the next character is to be ignored.

The next part of the structure holds data about DEV_READ, DEV_WRITE, and DEV_IOCTL operations in progress.
There are two processes involved in each of these operations.
The server managing the system call (normally FS) is identified in tty_incaller.
The server calls the tty driver on behalf of another process that needs to do an I/O operation, and this client is identified in tty_inproc.
As described in Fig. 3-33, during a read, characters are transferred directly from the terminal driver to a buffer within the memory space of the original caller.
Tty_inproc and tty_in_vir locate this buffer.
The next two variables, tty_inleft and tty_incum, count the characters still needed and those already transferred.
Similar sets of variables are needed for a write system call.
For ioctl there may be an immediate transfer of data between the requesting process and the driver, so a virtual address is needed, but there is no need for variables to mark the progress of an operation.
An ioctl request may be postponed, for instance, until current output is complete, but when the time is right the request is carried out in a single operation.

Finally, the tty structure includes some variables that fall into no other category, including pointers to the functions to handle the DEV_IOCTL and DEV_CLOSE operations at the device level, a POSIX-style termios structure, and a winsize structure that provides support for window-oriented screen displays.
The last part of the structure provides storage for the input queue itself in the array tty_inbuf.
Note that this is an array of u16_t, not of 8-bit char characters.
Although applications and devices use 8-bit codes for characters, the C language requires the input function getchar to work with a larger data type so it can return a symbolic EOF value in addition to all 256 possible byte values.

The tty_table, an array of tty structures, is declared as extern.
There is one array element for each terminal enabled by the NR_CONS, NR_RS_LINES, and NR_PTYS definitions in include/minix/config.h.
For the configuration discussed in this book, two consoles are enabled, but MINIX 3 may be recompiled to add additional virtual consoles, one or two 2 serial lines, and up to 64 pseudo terminals.

There is one other extern definition in tty.h.
Tty_timers is a pointer used by the timer to hold the head of a linked list of timer_t fields.
The tty.h header file is included in many files and storage for tty_table and tty_timers is allocated during compilation of tty.c.

Two macros, buflen and bufend, are defined.
These are used frequently in the terminal driver code, which does much copying of data into and out of buffers.

The main terminal driver and the device-independent supporting functions are all in tty.c.
Following this there are a number of macro definitions.
If a device is not initialized, the pointers to that device’s device-specific functions will contain zeroes put there by the C compiler.
This makes it possible to define the tty_active macro, which returns FALSE if a null pointer is found.
Of course, the initialization code for a device cannot be accessed indirectly if part of its job is to initialize the pointers that make indirect access possible.
There are conditional macro definitions to equate initialization calls for RS-232 or pseudo terminal devices to calls to a null function when these devices are not configured.
Do_pty may be similarly disabled in this section.
This makes it possible to omit the code for these devices entirely if it is not needed.

Since there are so many configurable parameters for each terminal, and there may be quite a few terminals on a networked system, a termios_defaults structure is declared and initialized with default values (all of which are defined in include/termios.h).
This structure is copied into the tty_table entry for a terminal whenever it is necessary to initialize or reinitialize it.
The defaults for the special characters were shown in Fig. 3-29.
Figure 3-38 shows the default values for the various flags used.
On the following line the winsize_defaults structure is similarly declared.
It is left to be initialized to all zeroes by the C compiler.
This is the proper default action; it means “window size is unknown, use /etc/termcap.”

The final set of definitions before executable code begins are the PUBLIC declarations of global variables previously declared as extern in tty.h.

The entry point for the terminal driver is tty_task.
Before entering the main loop, a call is made to tty_init.
Information about the host machine that will be needed to initialize the keyboard and the console is obtained by a sys_getmachine kernel call, and then the keyboard hardware is initialized.
The routine called for this is kb_init_once.
It is so named to distinguish it from another initialization routine which is called as part of initialization of each virtual console later on.
Finally, a single 0 is printed to exercise the output system and kick anything that does not get initialized until first use.
The source code shows a call to printf, but this is not the same printf used by user programs, it is a special version that calls a local function in the console driver called putk.

The main loop is, in principle, like the main loop of any driver—it receives a message, executes a switch on the message type to call the appropriate function, and then generates a return message.
However, there are some complications.
The first one is that since the last interrupt additional characters may have been read or characters to be written to an output device may be ready.
Before attempting to receive a message, the main loop always checks the tp−>tty_events flags for all terminals and handle_events is called as necessary to take care of unfinished business.
Only when nothing demands immediate attention is a call made to receive.

The diagram showing message types in the comments near the beginning of tty.c shows the most often used types.
A number of message types requesting specialized services from the terminal driver are not shown.
These are not specific to any one device.
The tty_task main loop checks for these and handles them before checking for device-specific messages.
First a check is made for a SYN_ALARM message, and, if this is the message type a call is made to expire_timers to cause a watchdog routine to execute.
Then comes a continue statement.
In fact all of the next few cases we will look at are followed by continue.
We will say more about this soon.

The next message type tested for is HARD_INT.
This is most likely the result of a key being pressed or released on the local keyboard.
It could also mean bytes have been received by a serial port, if serial ports are enabled—in the configuration we are studying they are not, but we left conditional code in the file here to illustrate how serial port input would be handled.
A bit field in the message is used to determine the source of the interrupt.

Next a check is made for SYS_SIG.
System processes (drivers and servers) are expected to block waiting for messages.
Ordinary signals are received only by active processes, so the standard UNIX signaling method does not work with system processes.
A SYS_SIG message is used to signal a system process.
A signal to the terminal driver can mean the kernel is shutting down (SIGKSTOP), the terminal driver is being shut down (SIGTERM), or the kernel needs to print a message to the console (SIGKMESS), and appropriate routines are called for these cases.

The last group of non-device-specific messages are PANIC_DUMPS, DIAGNOSTICS, and FKEY_CONTROL.
We will say more about these when we get to the functions that service them.

Now, about the continue statements: in the C language, a continue statement short-circuits a loop, and returns control to the top of the loop.
So if any one of the message types mentioned so far is detected, as soon as it is serviced control returns to the top of the main loop, at line 13764, the check for events is repeated, and receive is called again to await a new message.
Particularly in the case of input it is important to be ready to respond again as quickly as possible.
Also, if any of the message-type tests in the first part of the loop succeeded there is no point in making any of the tests that come after the first switch.

Above we mentioned complications that the terminal driver must deal with.
The second complication is that this driver services several devices.
If the interrupt is not a hardware interrupt the TTY_LINE field in the message is used to determine which device should respond to the message.
The minor device number is decoded by a series of comparisons, by means of which tp is pointed to the correct entry in the tty_table.
If the device is a pseudo terminal, do_pty (in pty.c) is called and the main loop is restarted.
In this case do_pty generates the reply message.
Of course, if pseudo terminals are not enabled, the call to do_pty uses the dummy macro defined earlier.
One would hope that attempts to access nonexistent devices would not occur, but it is always easier to add another check than to verify that there are no errors elsewhere in the system.
In case the device does not exist or is not configured, a reply message with an ENXIO error message is generated and, again, control returns to the top of the loop.

The rest of this driver resembles what we have seen in the main loop of other drivers, a switch on the message type.
The appropriate function for the type of request, do_read, do_write, and so on, is called.
In each case the called function generates the reply message, rather than pass the information needed to construct the message back to the main loop.
A reply message is generated at the end of the main loop only if a valid message type was not received, in which case an EINVAL error message is sent.
Because reply messages are sent from many different places within the terminal driver a common routine, tty_reply, is called to handle the details of constructing reply messages.

If the message received by tty_task is a valid message type, not the result of an interrupt, and does not come from a pseudo terminal, the switch at the end of the main loop will dispatch to one of the functions do_read, do_write, do_ioctl, do_open, do_close, do_select, or do_cancel.
The arguments to each of these calls are tp, a pointer to a tty structure, and the address of the message.
Before looking at each of them in detail, we will mention a few general considerations.
Since tty_task may service multiple terminal devices, these functions must return quickly so the main loop can continue.

However, do_read, do_write, and do_ioctl may not be able to complete all the requested work immediately.
In order to allow FS to service other calls, an immediate reply is required.
If the request cannot be completed immediately, the SUSPEND code is returned in the status field of the reply message.
This corresponds to the message marked (3) in Fig. 3-33 and suspends the process that initiated the call, while unblocking the FS.
Messages corresponding to (10) and (11) in the figure will be sent later when the operation can be completed.
If the request can be fully satisfied, or an error occurs, either the count of bytes transferred or the error code is returned in the status field of the return message to the FS.
In this case a message will be sent immediately from the FS back to the process that made the original call, to wake it up.

Reading from a terminal is fundamentally different from reading from a disk device.
The disk driver issues a command to the disk hardware and eventually data will be returned, barring a mechanical or electrical failure.
The computer can display a prompt upon the screen, but there is no way for it to force a person sitting at the keyboard to start typing.
For that matter, there is no guarantee that anybody will be sitting there at all.
In order to make the speedy return that is required, do_read starts by storing information that will enable the request to be completed later, when and if input arrives.
There are a few error checks to be made first.
It is an error if the device is still expecting input to fulfill a previous request, or if the parameters in the message are invalid.
If these tests are passed, information about the request is copied into the proper fields in the device’s tp−>tty_table entry.
The last step, setting tp−>tty_inleft to the number of characters requested, is important.
This variable is used to determine when the read request is satisfied.
In canonical mode tp−>tty_inleft is decremented by one for each character returned, until an end of line is received, at which point it is suddenly reduced to zero.
In noncanonical mode it is handled differently, but in any case it is reset to zero whenever the call is satisfied, whether by a timeout or by receiving at least the minimum number of bytes requested.
When tp−>tty_inleft reaches zero, a reply message is sent.
As we will see, reply messages can be generated in several places.
It is sometimes necessary to check whether a reading process still expects a reply; a nonzero value of tp−>tty_inleft serves as a flag for that purpose.

In canonical mode a terminal device waits for input until either the number of characters asked for in the call has been received, or the end of a line or the end of the file is reached.
The ICANON bit in the termios structure is tested to see if canonical mode is in effect for the terminal.
If it is not set, the termios MIN and TIME values are checked to determine what action to take.

In Fig. 3-31 we saw how MIN and TIME interact to provide different ways a read call can behave.
TIME is tested.
A value of zero corresponds to the left-hand column in Fig. 3-31, and in this case no further tests are needed at this point.
If TIME is nonzero, then MIN is tested.
If it is zero, settimer is called to start the timer that will terminate the DEV_READ request after a delay, even if no bytes have been received.
Tp−>tty_min is set to 1 here, so the call will terminate immediately if one or more bytes are received before the timeout.
At this point no check for possible input has yet been made, so more than one character may already be waiting to satisfy the request.
In that case, as many characters as are ready, up to the number specified in the read call, will be returned as soon as the input is found.
If both TIME and MIN are nonzero, the timer has a different meaning.
The timer is used as an inter-character timer in this case.
It is started only after the first character is received and is restarted after each successive character.
Tp−>tty_eotct counts characters in noncanonical mode, and if it is zero at line 13993, no characters have been received yet and the inter-byte timer is inhibited.

In any case, at line 14001, in_transfer is called to transfer any bytes already in the input queue directly to the reading process.
Next there is a call to handle_events, which may put more data into the input queue and which calls in_transfer again.
This apparent duplication of calls requires some explanation.
Although the discussion so far has been in terms of keyboard input, do_read is in the device-independent part of the code and also services input from remote terminals connected by serial lines.
It is possible that previous input has filled the RS-232 input buffer to the point where input has been inhibited.
The first call to in_transfer does not start the flow again, but the call to handle_events can have this effect.
The fact that it then causes a second call to in_transfer is just a bonus.
The important thing is to be sure the remote terminal is allowed to send again.
Either of these calls may result in satisfaction of the request and sending of the reply message to the FS.
Tp−>tty_inleft is used as a flag to see if the reply has been sent; if it is still nonzero at line 14004, do_read generates and sends the reply message itself.
(We assume here that no use has been made of the select system call, and therefore there will be no call to select_retry).

If the original request specified a nonblocking read, the FS is told to pass an EAGAIN error code back to original caller.
If the call is an ordinary blocking read, the FS receives a SUSPEND code, unblocking it but telling it to leave the original caller blocked.
In this case the terminal’s tp−>tty_inrepcode field is set to REVIVE.
When and if the read is later satisfied, this code will be placed in the reply message to the FS to indicate that the original caller was put to sleep and needs to be revived.

Do_write is similar to do_read, but simpler, because there are fewer options to be concerned about in handling a write system call.
Checks similar to those made by do_read are made to see that a previous write is not still in progress and that the message parameters are valid, and then the parameters of the request are copied into the tty structure.
Handle_events is then called, and tp−>tty_outleft is checked to see if the work was done.
If so, a reply message already has been sent by handle_events and there is nothing left to do.
If not, a reply message is generated.
with the message parameters depending upon whether or not the original write call was called in nonblocking mode.

The next function, do_ioctl, is a long one, but not difficult to understand.
The body of do_ioctl is two switch statements.
The first determines the size of the parameter pointed to by the pointer in the request message.
If the size is not zero, the parameter’s validity is tested.
The contents cannot be tested here, but what can be tested is whether a structure of the required size beginning at the specified address fits within the segment it is specified to be in.
The rest of the function is another switch on the type of ioctl operation requested.

Unfortunately, supporting the POSIX-required operations with the ioctl call meant that names for ioctl operations had to be invented that suggest, but do not duplicate, names required by POSIX.
Figure 3-39 shows the relationship between the POSIX request names and the names used by the MINIX 3 ioctl call.
A TCGETS operation services a tcgetattr call by the user and simply returns a copy of the terminal device’s tp−>tty_termios structure.
The next four request types share code.
The TCSETSW, TCSETSF, and TCSETS request types correspond to user calls to the POSIX-defined function tcsetattr, and all have the basic action of copying a new termios structure into a terminal’s tty structure.
The copying is done immediately for TCSETS calls and may be done for TCSETSW and TCSETSF calls if output is complete, by a sys vircopy kernel call to get the data from the user, followed by a call to setattr.
If tcsetattr was called with a modifier requesting postponement of the action until completion of current output, the parameters for the request are placed in the terminal’s tty structure for later processing if the test of tp−>tty outleft reveals output is not complete.
Tcdrain suspends a program until output is complete and is translated into an ioctl call of type TCDRAIN.
If output is already complete, it has nothing more to do.
If output is not complete, it also must leave information in the tty structure.

The POSIX tcflush function discards unread input and/or unsent output data, according to its argument, and the ioctl translation is straightforward, consisting of a call to the tty_icancel function that services all terminals, and/or the device-specific function pointed to by tp−>tty_ocancel.
Tcflow is similarly translated in a straightforward way into an ioctl call.
To suspend or restart output, it sets a TRUE or FALSE value into tp−>tty_inhibited and then sets the tp−>tty_events flag.
To suspend or restart input, it sends the appropriate STOP (normally CTRL-S) or START (CTRL-Q) code to the remote terminal, using the device-specific echo routine pointed to by tp−>tty_echo.

Most of the rest of the operations handled by do_ioctl are handled in one line of code, by calling an appropriate function.
In the cases of the KIOCSMAP (load keymap) and TIOCSFON (load font) operations, a test is made to be sure the device really is a console, since these operations do not apply to other terminals.
If virtual terminals are in use the same keymap and font apply to all consoles, the hardware does not permit any easy way of doing otherwise.
The window size operations copy a winsize structure between the user process and the terminal driver.
Note, however, the comment under the code for the TIOCSWINSZ operation.
When a process changes its window size, the kernel is expected to send a SIGWINCH signal to the process group under some versions of UNIX .
The signal is not required by the POSIX standard and is not implemented in MINIX 3.
However, anyone thinking of using these structures should consider adding code here to initiate this signal.

The last two cases in do_ioctl support the POSIX required tcgetpgrp and tcsetpgrp functions.
There is no action associated with these cases, and they always return an error.
There is nothing wrong with this.
These functions support job control, the ability to suspend and restart a process from the keyboard.
Job control is not required by POSIX and is not supported by MINIX 3.
However, POSIX requires these functions, even when job control is not supported, to ensure portability of programs.

Do_open has a simple basic action to perform—it increments the variable tp−>tty_openct for the device so it can be verified that it is open.
However, there are some tests to be done first.
POSIX specifies that for ordinary terminals the first process to open a terminal is the session leader, and when a session leader dies, access to the terminal is revoked from other processes in its group.
Daemons need to be able to write error messages, and if their error output is not redirected to a file, it should go to a display that cannot be closed.

For this purpose a device called /dev/log exists in MINIX 3.
Physically it is the same device as /dev/console, but it is addressed by a separate minor device number and is treated differently.
It is a write-only device, and thus do_open returns an EACCESS error if an attempt is made to open it for reading.
The other test done by do_open is to test the O_NOCTTY flag.
If it is not set and the device is not /dev/log, the terminal becomes the controlling terminal for a process group.
This is done by putting the process number of the caller into the tp−>tty_pgrp field of the tty_table entry.
Following this, the tp−>tty_openct variable is incremented and the reply message is sent.

A terminal device may be opened more than once, and the next function, do_close, has nothing to do except decrement tp−>tty_openct.
The test foils an attempt to close the device if it happens to be /dev/log.
If this operation is the last close, input is canceled by calling tp−>tty_icancel.
Device-specific routines pointed to by tp−>tty_ocancel and tp−>tty_close are also called.
Then various fields in the tty structure for the device are set back to their default values and the reply message is sent.

The last message type handler we will consider is do_cancel.
This is invoked when a signal is received for a process that is blocked trying to read or write.
There are three states that must be checked:

A test is made for each case, and the general tp−>tty_icancel, or the device-specific routine pointed to by tp−>tty_ocancel, is called as necessary.
In the last case the only action required is to reset the flag tp−>tty_ioreq, to indicate the ioctl operation is now complete.
Finally, the tp−>tty_events flag is set and a reply message is sent.

Now that we have looked at the top-level functions called in the main loop of tty_task, it is time to look at the code that supports them.
We will start with handle_events.
As mentioned earlier, on each pass through the main loop of the terminal driver, the tp−>tty_events flag for each terminal device is checked and handle_events is called if it shows that attention is required for a particular terminal.
Do_read and do_write also call handle_events.
This routine must work fast.
It resets the tp−>tty_events flag and then calls device-specific routines to read and write, using the pointers to the functions tp−>tty_devread and tp−>tty devwrite.

These functions are called unconditionally, because there is no way to test whether a read or a write caused the raising of the flag—a design choice was made here, that checking two flags for each device would be more expensive than making two calls each time a device was active.
Also, most of the time a character received from a terminal must be echoed, so both calls will be necessary.
As noted in the discussion of the handling of tcsetattr calls by do_ioctl, POSIX may postpone control operations on devices until current output is complete, so immediately after calling the device-specific tty_devwrite function is a good time take care of ioctl operations.
This is done where dev_ioctl is called if there is a pending control request.

Since the tp−>tty_events flag is raised by interrupts, and characters may arrive in a rapid stream from a fast device, there is a chance that by the time the calls to the device-specific read and write routines and dev_ioctl are completed, another interrupt will have raised the flag again.
A high priority is placed on getting input moved along from the buffer where the interrupt routine places it initially.
Thus handle_events repeats the calls to the device-specific routines as long as the tp−>tty_events flag is found raised at the end of the loop.
When the flow of input stops (it also could be output, but input is more likely to make such repeated demands), in_transfer is called to transfer characters from the input queue to the buffer within the process that called for the read operation.
In_transfer itself sends a reply message if the transfer completes the request, either by transferring the maximum number of characters requested or by reaching the end of a line (in canonical mode).
If it does so, tp−>tty_left will be zero upon the return to handle_events.
Here a further test is made and a reply message is sent if the number of characters transferred has reached the minimum number requested.
Testing tp−>tty_inleft prevents sending a duplicate message.

Next we will look at in_transfer, which is responsible for moving data from the input queue in the driver’s memory space to the buffer of the user process that requested the input.
However, a straightforward block copy is not possible here.
The input queue is a circular buffer and characters have to be checked to see that the end of the file has not been reached, or, if canonical mode is in effect, that the transfer only continues up through the end of a line.
Also, the input queue is a queue of 16-bit quantities, but the recipient’s buffer is an array of 8-bit characters.
Thus an intermediate local buffer is used.
Characters are checked one by one as they are placed in the local buffer, and when it fills up or when the input queue has been emptied, sys_vircopy is called to move the contents of the local buffer to the receiving process’ buffer.

Three variables in the tty structure, tp−>tty_inleft, tp−>tty_eotct, and tp−>tty_min, are used to decide whether in_transfer has any work to do, and the first two of these control its main loop.
As mentioned earlier, tp−>tty_inleft is set initially to the number of characters requested by a read call.
Normally, it is decremented by one whenever a character is transferred but it may be abruptly decreased to zero when a condition signaling the end of input is reached.
Whenever it becomes zero, a reply message to the reader is generated, so it also serves as a flag to indicate whether or not a message has been sent.
Thus in the test, finding that tp−>tty_inleft is already zero is a sufficient reason to abort execution of in_transfer without sending a reply.

In the next part of the test, tp−>tty_eotct and tp−>tty_min are compared.
In canonical mode both of these variables refer to complete lines of input, and in noncanonical mode they refer to characters.
Tp−>tty_eotct is incremented whenever a “line break” or a byte is placed in the input queue and is decremented by in_transfer whenever a line or byte is removed from the queue.
In other words, it counts the number of lines or bytes that have been received by the terminal driver but not yet passed on to a reader.
Tp−>tty_min indicates the minimum number of lines (in canonical mode) or characters (in noncanonical mode) that must be transferred to complete a read request.
Its value is always 1 in canonical mode and may be any value from 0 up to MAX_INPUT (255 in MINIX 3) in noncanonical mode.
The second half of the test causes in_transfer to return immediately in canonical mode if a full line has not yet been received.
The transfer is not done until a line is complete so the queue contents can be modified if, for instance, an ERASE or KILL character is subsequently typed in by the user before the ENTER key is pressed.
In noncanonical mode an immediate return occurs if the minimum number of characters is not yet available.

A few lines later, tp−>tty_inleft and tp−>tty_eotct are used to control the main loop of in_transfer.
In canonical mode the transfer continues until there is no longer a complete line left in the queue.
In noncanonical mode tp−>tty_eotct is a count of pending characters.
Tp−>tty_min controls whether the loop is entered but is not used in determining when to stop.
Once the loop is entered, either all available characters or the number of characters requested in the original call will be transferred, whichever is smaller.

Figure 3-40.
The fields in a character code as it is placed into the input queue.

Characters are 16-bit quantities in the input queue.
The actual character code to be transferred to the user process is in the low 8 bits.
Fig. 3-40 shows how the high bits are used.
Three are used to flag whether the character is being escaped (by CTRL-V), whether it signifies end-of-file, or whether it represents one of several codes that signify a line is complete.
Four bits are used for a count to show how much screen space is used when the character is echoed.
The test checks whether the IN EOF bit (D in the figure) is set.
This is tested at the top of the inner loop because an end-of-file (CTRL-D) is not itself transferred to a reader, nor is it counted in the character count.
As each character is transferred, a mask is applied to zero the upper 8 bits, and only the ASCII value in the low 8 bits is transferred into the local buffer.

There is more than one way to signal the end of input, but the device-specific input routine is expected to determine whether a character received is a linefeed, CTRL-D, or other such character and to mark each such character.
In transfer only needs to test for this mark, the IN EOT bit (N in Fig. 3-40).
If this is detected, tp−>tty eotct is decremented.
In noncanonical mode every character is counted this way as it is put into the input queue, and every character is also marked with the IN EOT bit at that time, so tp−>tty eotct counts characters not yet removed from the queue.
The only difference in the operation of the main loop of in transfer in the two different modes is found.
Here tp−>tty inleft is zeroed in response to finding a character marked as a line break, but only if canonical mode is in effect.
Thus when control returns to the top of the loop, the loop terminates properly after a line break in canonical mode, but in noncanonical line breaks are ignored.

When the loop terminates there is usually a partially full local buffer to be transferred.
Then a reply message is sent if tp−>tty_inleft has reached zero.
This is always the case in canonical mode, but if noncanonical mode is in effect and the number of characters transferred is less than the full request, the reply is not sent.
This may be puzzling if you have a good enough memory for details to remember that where we have seen calls to in_transfer (in do_read and handle_events), the code following the call to in_transfer sends a reply message if in_transfer returns having transferred more than the amount specified in tp−>tty_min, which will certainly be the case here.
The reason why a reply is not made unconditionally from in_transfer will be seen when we discuss the next function, which calls in_transfer under a different set of circumstances.

That next function is in_process.
It is called from the device-specific software to handle the common processing that must be done on all input.
Its parameters are a pointer to the tty structure for the source device, a pointer to the array of 8-bit characters to be processed, and a count.
The count is returned to the caller.
In_process is a long function, but its actions are not complicated.
It adds 16-bit characters to the input queue that is later processed by in_transfer.

Let us look first at a completely normal situation: an ordinary character, such as “x” (ASCII code 0x78), typed in the middle of a short line, with no escape sequence in effect, on a terminal that is set up with the standard MINIX 3 default properties.
As received from the input device this character occupies bits 0 through 7 in Fig. 3-40.
On line 14504 it would have its most significant bit, bit 7, reset to zero if the ISTRIP bit were set, but the default in MINIX 3 is not to strip the bit, allowing full 8-bit codes to be entered.
This would not affect our “x” anyway.
The MINIX 3 default is to allow extended processing of input, so the test of the IEXTEN bit in tp−>tty_termios.c_lflag passes, but the succeeding tests fail under the conditions we postulate: no character escape is in effect, this input is not itself the character escape character, and this input is not the REPRINT character.

Tests on the next several lines find that the input character is not the special_POSIX_VDISABLE character, nor is it a CR or an NL.
Finally, a positive result:

canonical mode is in effect, this is the normal default.
However our “x” is not the ERASE character, nor is it any of the KILL, EOF (CTRL-D), NL, or EOL characters, so by line 14576 still nothing will have happened to it.
Here it is found that the IXON bit is set, by default, allowing use of the STOP (CTRL-S) and START (CTRL-Q) characters, but in the ensuing tests for these no match is found.
On line 14597 it is found that the ISIG bit, enabling the use of the INTR and QUIT characters, is set by default, but again no match is found.

In fact, the first interesting thing that might happen to an ordinary character occurs, where a test is made to see if the input queue is already full.
If this were the case, the character would be discarded at this point, since canonical mode is in effect, and the user would not see it echoed on the screen.
(The continue statement discards the character, since it causes the outer loop to restart).
However, since we postulate completely normal conditions for this illustration, let us assume the buffer is not full yet.
The next test, to see if special noncanonical mode processing is needed, fails, causing a jump forward to line 14629.
Here echo is called to display the character to the user, since the ECHO bit in tp−>tty_termios.c_lflag is set by default.

Finally, the character is disposed of by being put into the input queue.
At this time tp−>tty_incount is incremented, but since this is an ordinary character, not marked by the EOT bit, tp−>tty_eotct is not changed.

The last line in the loop calls in_transfer if the character just transferred into the queue fills it.
However, under the ordinary conditions we postulate for this example, in_transfer would do nothing, even if called, since (assuming the queue has been serviced normally and previous input was accepted when the previous line of input was complete) tp−>tty_eotct is zero, tp−>tty_min is one, and the test at the start of in_transfer causes an immediate return.

Having passed through in_process with an ordinary character under ordinary conditions, let us now go back to the start of in_process and look at what happens in less ordinary circumstances.
First, we will look at the character escape, which allows a character which ordinarily has a special effect to be passed on to the user process.
If a character escape is in effect, the tp−>tty_escaped flag is set, and when this is detected the flag is reset immediately and the IN_ESC bit, bit V in Fig. 3-40, is added to the current character.
This causes special processing when the character is echoed—escaped control characters are displayed as “ˆ” plus the character to make them visible.
The IN_ESC bit also prevents the character from being recognized by tests for special characters.

The next few lines process the escape character itself, the LNEXT character (CTRL-V by default).
When the LNEXT code is detected the tp−>tty_escaped flag is set, and rawecho is called twice to output a “ˆ” followed by a backspace.
This reminds the user at the keyboard that an escape is in effect, and when the following character is echoed, it overwrites the “ˆ”.
The LNEXT character is an example of one that affects later characters (in this case, only the very next character).
It is not placed in the queue, and the loop restarts after the two calls to rawecho.
The order of these two tests is important, making it possible to enter the LNEXT character itself twice in a row, in order to pass the second copy on to a process as actual data.

The next special character processed by in_process is the REPRINT character (CTRL-R).
When it is found a call to reprint ensues, causing the current echoed output to be redisplayed.
The REPRINT itself is then discarded with no effect upon the input queue.

Going into detail on the handling of every special character would be tedious, and the source code of in_process is straightforward.
We will mention just a few more points.
One is that the use of special bits in the high byte of the 16-bit value placed in the input queue makes it easy to identify a class of characters that have similar effects.
Thus, EOT (CTRL-D), LF, and the alternate EOL character (undefined by default) are all marked by the EOT bit, bit D in Fig. 3-40, making later recognition easy.

Finally, we will justify the peculiar behavior of in_transfer noted earlier.
A reply is not generated each time it terminates, although in the calls to in_transfer we have seen previously, it seemed that a reply would always be generated upon return.
Recall that the call to in_transfer made by in_process when the input queue is full has no effect when canonical mode is in effect.
But if noncanonical processing is desired, every character is marked with the EOT bit, and thus every character is counted by tp−>tty_eotct.
In turn, this causes entry into the main loop of in_transfer when it is called because of a full input queue in noncanonical mode.
On such occasions no message should be sent at the termination of in_transfer, because there are likely to be more characters read after returning to in_process.
Indeed, although in canonical mode input to a single read is limited by the size of the input queue (255 characters in MINIX 3), in noncanonical mode a read call must be able to deliver the POSIX-required constant_POSIX_SSIZE_MAX number of characters.
Its value in MINIX 3 is 32767.

The next few functions in tty.c support character input.
Tty_echo treats a few characters in a special way, but most just get displayed on the output side of the same device being used for input.
Output from a process may be going to a device at the same time input is being echoed, which makes things messy if the user at the keyboard tries to backspace.
To deal with this, the tp−>tty_reprint flag is always set to TRUE by the device-specific output routines when normal output is produced, so the function called to handle a backspace can tell that mixed output has been produced.
Since tty_echo also uses the device-output routines, the current value of tp−>tty_reprint is preserved while echoing, using the local variable rp.
However, if a new line of input has just begun, rp is set to FALSE instead of taking on the old value, thus assuring that tp−>tty_reprint will be reset when echo terminates.

You may have noticed that tty_echo returns a value, for instance, in the call in in_process:

The value returned by echo contains the number of spaces used on the screen for the echo display, which may be up to eight if the character is a TAB.
This count is placed in the cccc field in Fig. 3-40.
Ordinary characters occupy one space on the screen, but if a control character (other than TAB, NL, or CR or a DEL (0x7F) is echoed, it is displayed as a “ˆ” plus a printable ASCII character and occupies two positions on the screen.
On the other hand an NL or CR occupies zero spaces.
The actual echoing must be done by a device-specific routine, of course, and whenever a character must be passed to the device, an indirect call is made using tp−>tty_echo, as, for instance, for ordinary characters.

The next function, rawecho, is used to bypass the special handling done by echo.
It checks to see if the ECHO flag is set, and if it is, sends the character along to the device-specific tp−>tty_echo routine without any special processing.
A local variable rp is used here to prevent rawecho’s own call to the output routine from changing the value of tp−>tty_reprint.

When a backspace is found by in_process, the next function, back_over, is called.
It manipulates the input queue to remove the previous head of the queue if backing up is possible—if the queue is empty or if the last character is a line break, then backing up is not possible.
Here the tp−>tty_reprint flag mentioned in the discussions of echo and rawecho is tested.
If it is TRUE, then reprint is called to put a clean copy of the output line on the screen.
Then the len field of the last character displayed (the cccc field of Fig. 3-40) is consulted to find out how many characters have to be deleted on the display, and for each character a sequence of backspace-space-backspace characters is sent through rawecho to remove the unwanted character from the screen and have it replaced by a space.

Reprint is the next function.
In addition to being called by back_over, it may be invoked by the user pressing the REPRINT key (CTRL-R).
The loop searches backward through the input queue for the last line break.
If it is found in the last position filled, there is nothing to do and reprint returns.
Otherwise, it echos the CTRL-R, which appears on the display as the two character sequence “ˆR”, and then moves to the next line and redisplays the queue from the last line break to the end.

Now we have arrived at out_process.
Like in_process, it is called by device-specific output routines, but it is simpler.
It is called by the RS-232 and pseudo terminal device-specific output routines, but not by the console routine.
Out_process works upon a circular buffer of bytes but does not remove them from the buffer.
The only change it makes to the array is to insert a CR character ahead of an NL character in the buffer if the OPOST (enable output processing) and ONLCR (map NL to CR-NL) bits in tp−>tty_termios.oflag are both set.
Both bits are set by default in MINIX 3.
Its job is to keep the tp−>tty_position variable in the device’s tty structure up to date.
Tabs and backspaces complicate life.

The next routine is dev_ioctl.
It supports do_ioctl in carrying out the tcdrain function and the tcsetattr function when it is called with either the TCSADRAIN or TCSAFLUSH options.
In these cases, do_ioctl cannot complete the action immediately if output is incomplete, so information about the request is stored in the parts of the tty structure reserved for delayed ioctl operations.
Whenever handle_events runs, it first checks the tp−>tty_ioreq field after calling the device-specific output routine and calls dev_ioctl if an operation is pending.
Dev_ioctl tests tp−>tty_outleft to see if output is complete, and if so, carries out the same actions that do_ioctl would have carried out immediately if there had been no delay.
To service tcdrain, the only action is to reset the tp−>tty_ioreq field and send the reply message to the FS, telling it to wake up the process that made the original call.
The TCSAFLUSH variant of tcsetattr calls tty_icancel to cancel input.
For both variants of tcsetattr, the termios structure whose address was passed in the original call to ioctl is copied to the device’s tp−>tty_termios structure.
Setattr is then called, followed, as with tcdrain, by sending a reply message to wake up the blocked original caller.

Setattr is the next procedure.
As we have seen, it is called by do_ioctl or dev_ioctl to change the attributes of a terminal device, and by do_close to reset the attributes back to the default settings.
Setattr is always called after copying a new termios structure into a device’s tty structure, because merely copying the parameters is not enough.
If the device being controlled is now in noncanonical mode, the first action is to mark all characters currently in the input queue with the IN_EOT bit, as would have been done when these characters were originally entered in the queue if noncanonical mode had been in effect then.
It is easier just to go ahead and do this than to test whether the characters already have the bit set.
There is no way to know which attributes have just been changed and which still retain their old values.

The next action is to check the MIN and TIME values.
In canonical mode tp−>tty_min is always 1; that is set.
In noncanonical mode the combination of the two values allows for four different modes of operation, as we saw in Fig. 3-31.
On lines 14931 to 14933 tp−>tty_min is first set up with the value passed in tp−>tty_termios.cc[VMIN], which is then modified if it is zero and tp−>tty_termios.cc[VTIME] is not zero.

Finally, setattr makes sure output is not stopped if XON/XOFF control is disabled, sends a SIGHUP signal if the output speed is set to zero, and makes an indirect call to the device-specific routine pointed to by tp−>tty_ioctl to do what can only be done at the device level.

The next function, tty_reply has been mentioned many times in the preceding discussion.
Its action is entirely straightforward, constructing a message and sending it.
If for some reason the reply fails, a panic ensues.
The following functions are equally simple.
Sigchar asks MM to send a signal.
If the NOFLSH flag is not set, queued input is removed—the count of characters or lines received is zeroed and the pointers to the tail and head of the queue are equated.
This is the default action.
When a SIGHUP signal is to be caught, NOFLSH can be set, to allow input and output to resume after catching the signal.
Tty_icancel unconditionally discards pending input in the way described for sigchar, and in addition calls the device-specific function pointed to by tp−>tty_icancel, to cancel input that may exist in the device itself or be buffered in the low-level code.

Tty_init is called when tty_task first starts.
It loops through all possible terminals and sets up defaults.
Initially, a pointer to tty_devnop, a dummy function that does nothing, is set into the tp−>tty_icancel, tp−>tty_ocancel, tp−>tty_ioctl, and tp−>tty_close variables.
Tty_init then calls a device-specific initialization functions for the appropriate category of terminal (console, serial line, or pseudo terminal).
These set up the real pointers to indirectly called device-specific functions.
Recall that if there are no devices at all configured in a particular device category, a macro that returns immediately is created, so no part of the code for a nonconfigured device need be compiled.
The call to scr_init initializes the console driver and also calls the initialization routine for the keyboard.

The next three functions support timers.
A watchdog timer is initialized with a pointer to a function to run when the timer expires.
Tty_timed_out is that function for most timers set by the terminal task.
It sets the events flag to force processing of input and output.
Expire_timers handles the terminal driver’s timer queue.
Recall that this is the function called from the main loop of tty_task when a SYN_ALARM message is received.
A library routine, tmrs_exptimers, is used to traverse the linked list of timers, expiring and calling the watchdog functions of any that have timed out.
On returning from the library function, if the queue is still active a sys_setalarm kernel call is made to ask for another SYN_ALARM.
Finally, settimer, sets timers for determining when to return from a read call in noncanonical mode.
It is called with parameters of tty_ptr, a pointer to a tty structure, and enable, an integer which represents TRUE or FALSE.
Library functions tmrs_settimer and tmrs_clrtimer are used to enable or disable a timer as determined by the enable argument.
When a timer is enabled, the watchdog function is always tty_timed_out, described previously.

A description of tty_devnop is necessarily longer than its executable code, since it has none.
It is a “no-operation” function to be indirectly addressed where a device does not require a service.
We have seen tty_devnop used in tty_init as the default value entered into various function pointers before calling the initialization routine for a device.

The final item in tty.c needs some explanation.
Select is a system call used when multiple I/O devices may require service at unpredictable times by a single process.
A classic example is a communications program which needs to pay attention to a local keyboard and a remote system, perhaps connected by a modem.
The select call allows opening several device files and monitoring all of them to see when they can be read from or written to without blocking.
Without select it is necessary to use two processes to handle two-way communication, one acting as a master and handling communication in one direction, the other a slave handling communication in the other direction.
Select is an example of a feature that is very nice to have, but which substantially complicates the system.
One of the design goals of MINIX 3 is to be simple enough to be understood with reasonable effort in a reasonable time, and we have to set some limits.
For that reason we will not discuss do_select and the support routines select _try and select_retry here.

1.9.5 Implementation of the Keyboard Driver

Now we turn to the device-dependent code that supports the MINIX 3 console, which consists of an IBM PC keyboard and a memory-mapped display.
The physical devices that support these are entirely separate: on a standard desktop system the display uses an adapter card (of which there are at least a half-dozen basic types) plugged into the backplane, while the keyboard is supported by circuitry built into the parentboard which interfaces with an 8-bit single-chip computer inside the keyboard unit.
The two subdevices require entirely separate software support, which is found in the files keyboard.c and console.c.

The operating system sees the keyboard and console as parts of the same device, /dev/console.
If there is enough memory available on the display adapter, virtual console support may be compiled, and in addition to /dev/console there may be additional logical devices, /dev/ttyc1, /dev/ttyc2, and so on.
Output from only one goes to the display at any given time, and there is only one keyboard to use for input to whichever console is active.
Logically the keyboard is subservient to the console, but this is manifested in only two relatively minor ways.
First, tty_table contains a tty structure for the console, and where separate fields are provided for input and output, for instance, the tty_devread and tty_devwrite fields, pointers to functions in keyboard.c and console.c are filled in at startup time.
However, there is only one tty_priv field, and this points to the console’s data structures only.
Second, before entering its main loop, tty_task calls each logical device once to initialize it.
The routine called for /dev/console is in console.c, and the initialization code for the keyboard is called from there.
The implied hierarchy could just as well have been reversed, however.
We have always looked at input before output in dealing with I/O devices and we will continue that pattern, discussing keyboard.c in this section and leaving the discussion of console.c for the following section.

Keyboard.c begins, like most source files we have seen, with several #include statements.
One of these is unusual, however.
The file keymaps/us-std.src is not an ordinary header;
it is a C source file that results in compilation of the default keymap within keyboard.o as an initialized array.
The keymap source file is not included in Appendix B because of its size, but some representative entries are illustrated in Fig. 3-37.
Following the #include statements are macros to define various constants.
The first group are used in low-level interaction with the keyboard controller.
Many of these are I/O port addresses or bit combinations that have meaning in these interactions.
The next group includes symbolic names for special keys.
On line 15249 the size of the circular keyboard input buffer is symbolically defined as KB_IN_BYTES, with a value of 32, and the buffer itself and variables to manage it are defined next.
Since there is only one of these buffers care must be taken to make sure all of its contents are processed before virtual consoles are changed.

The next group of variables are used to hold various states that must be remembered to properly interpret a key press.
They are used in different ways.
For instance, the value of the caps_down flag is toggled between TRUE and FALSE each time the Caps Lock key is pressed.
The shift flag is set to TRUE when either Shift key is pressed and to FALSE when both Shift keys are released.
The esc variable is set when a scan code escape is received.
It is always reset upon receipt of the following character.

Map_key0 is defined as a macro.
It returns the ASCII code that corresponds to a scan code, ignoring modifiers.
This is equivalent to the first column (unshifted) in the keymap array.
Its big brother is map_key, which performs the complete mapping of a scan code to an ASCII code, including accounting for (multiple) modifier keys that are depressed at the same time as ordinary keys.

The keyboard interrupt service routine is kbd_interrupt, called whenever a key is pressed or released.
It calls scode to get the scan code from the keyboard controller chip.
The most significant bit of the scan code is set when a key release causes the interrupt, such codes could be ignored unless they were one of the modifier keys.
However, in the interest of doing as little as possible in order to service an interrupt as quickly as possible, all raw scan codes are placed in the circular buffer and the tp−>tty_events flag for the current console is raised.
For purposes of this discussion we will assume, as we did earlier, that no select calls have been made, and that kbd_interrupt returns immediately after this.
Figure 3-41 shows scan codes in the buffer for a short line of input that contains two upper case characters, each preceded by the scan code for depression of a shift key and followed by the code for the release of the shift key.
Initially codes for both key presses and releases are stored.

When a HARD_INT from the keyboard is received by tty_task, the complete main loop is not executed.
A continue statement at line 13795 causes a new iteration of the main loop to begin immediately, at line 13764.
(This is slightly simplified, we left some conditional code in the listing to show that if the serial line driver is enabled its user-space interrupt handler could also be called.) When execution transfers to the top of the loop the tp−>tty_events flag for the console device is now found to be set, and kb_read, the device-specific routine, is called using the pointer in the tp−>tty_devread field of the console’s tty structure.

Figure 3-41.
Scan codes in the input buffer, with corresponding key actions below, for a line of text entered at the keyboard.
L and R represent the left and right Shift keys.
+ and - indicate a key press and a key release.
The code for a release is 128 more than the code for a press of the same key.

Kb read takes scan codes from the keyboard’s circular buffer and places ASCII codes in its local buffer, which is large enough to hold the escape sequences that must be generated in response to some scan codes from the numeric keypad.
Then it calls in process in the hardware-independent code to put the characters into the input queue.
On line 15379 icount is decremented.
The call to make break returns the ASCII code as an integer.
Special keys, such as keypad and function keys, have values greater than 0xFF at this point.
Codes in the range from HOME to INSRT (0x101 to 0x10C, defined in file include/minix/keymap.h) result from pressing the numeric keypad, and are converted into 3-character escape sequences shown in Fig. 3-42 using the numpad map array.

The sequences are then passed to in process.
Higher codes are not passed on to in process.
Instead, a check is made for the codes for ALT-LEFT-ARROW, ALT-RIGHT-ARROW, and ALT-F1 through ALT-F12, and if one of these is found, select console is called to switch virtual consoles.
CTRL-F1 through CTRL-F12 are similarly given special handling.
CTRL-F1 shows the mappings of function keys (more on this later).
CTRL-F3 toggles between hardware scrolling and software scrolling of the console screen.
CTRL-F7, CTRL-F8, and CTRL-F9 generate signals with the same effects as CTRL-, CTRL-C, and CTRL-U, respectively, except these cannot be changed by the stty command.

Make break converts scan codes into ASCII and then updates the variables that keep track of the state of modifier keys.
First, however, it checks for the magic CTRL-ALT-DEL combination that PC users all know as the way to force a reboot under MS-DOS.
Note the comment that it would be better to do this at a lower level.
However, the simplicity of MINIX 3 interrupt handling in kernel space makes detecting CTRL-ALT-DEL impossible there, when an interrupt notification is sent the scan code has not yet been read.

An orderly shutdown is desirable, so rather than try to start the PC BIOS routines, a sys kill kernel call is made to initiate sending a SIGKILL signal TO init, the parent process of all other processes.
Init is expected to catch this signal and interpret it as a command to begin an orderly process of shutting down, prior to causing a return to the boot monitor, from which a full restart of the system or a reboot of MINIX 3 can be commanded.

Figure 3-42.
Escape codes generated by the numeric keypad.
When scan codes for ordinary keys are translated into ASCII codes the special keys are assigned “pseudo ASCII” codes with values greater than 0xFF.

this signal and interpret it as a command to begin an orderly process of shutting down, prior to causing a return to the boot monitor, from which a full restart of the system or a reboot of MINIX 3 can be commanded.

Of course, it is not realistic to expect this to work every time.
Most users understand the dangers of an abrupt shutdown and do not press CTRL-ALT-DEL until something is really going wrong and normal control of the system has become impossible.
At this point it is likely that the system may be so disrupted that signaling another process may be impossible.
This is why there is a static variable CAD count in make break.
Most system crashes leave the interrupt system still functioning, so keyboard input can still be received and the terminal driver will remain active.
Here MINIX 3 takes advantage of the expected behavior of computer users, who are likely to bang on the keys repeatedly when something does not seem to work correctly (possibly evidence our ancestors really were apes).
If the attempt to kill init fails and the user presses CTRL-ALT-DEL twice more, a sys abort kernel call is made, causing a return to the monitor without going through the call to init.

The main part of make break is not hard to follow.
The variable make records whether the scan code was generated by a key press or a key release, and then the call to map key returns the ASCII code to ch.
Next is a switch on ch.
Let us consider two cases, an ordinary key and a special key.
For an ordinary key, none of the cases match, and in the default case, the key code is returned if make is true.
If somehow an ordinary key code is accepted at key release, a value of −1 is substituted here, and this is ignored by the caller, kb_read.
A special key, for example CTRL, is identified at the appropriate place in the switch.
The corresponding variable, in this case ctrl, records the state of make, and −1 is substituted for the character code to be returned (and ignored).
The handling of the ALT, CALOCK, NLOCK, and SLOCK keys is more complicated, but for all of these special keys the effect is similar: a variable records either the current state (for keys that are only effective while pressed) or toggles the previous state (for the lock keys).

There is one more case to consider, that of the EXTKEY code and the esc variable.
This is not to be confused with the ESC key on the keyboard, which returns the ASCII code 0x1B.
There is no way to generate the EXTKEY code alone by pressing any key or combination of keys; it is the PC keyboard’s extended key prefix, the first byte of a 2-byte scan code that signifies that a key that was not part of the original PC’s complement of keys but that has the same scan code, has been pressed.
In many cases software treats the two keys identically.
For instance, this is almost always the case for the normal “/” key and the gray “/” key on the numeric keyboard.
In other cases, one would like to distinguish between such keys.
For instance, many keyboard layouts for languages other than English treat the left and right ALT keys differently, to support keys that must generate three different character codes.
Both ALT keys generate the same scan code (56), but the EXTKEY code precedes this when the right-hand ALT is pressed.
When the EXTKEY code is returned, the esc flag is set.
In this case, make_break returns from within the switch, thus bypassing the last step before a normal return, which sets esc to zero in every other case.
This has the effect of making the esc effective only for the very next code received.
If you are familiar with the intricacies of the PC keyboard as it is ordinarily used, this will be both familiar and yet a little strange, because the PC BIOS does not allow one to read the scan code for an ALT key and returns a different value for the extended code than does MINIX 3.

Set_leds turns on and off the lights that indicate whether the Num Lock, Caps Lock, or Scroll Lock keys on a PC keyboard have been pressed.
A control byte, LED_CODE, is written to an output port to instruct the keyboard that the next byte written to that port is for control of the lights, and the status of the three lights is encoded in 3 bits of that next byte.
These operations are, of course, carried out by kernel calls which ask the system task write to the outport ports.
The next two functions support this operation.
Kb_wait is called to determine that the keyboard is ready to receive a command sequence, and kb_ack is called to verify that the command has been acknowledged.
Both of these commands use busy waiting, continually reading until a desired code is seen.
This is not a recommended technique for handling most I/O operations, but turning lights on and off on the keyboard is not going to be done very often and doing it inefficiently does not waste much time.
Note also that both kb_wait and kb_ack could fail, and one can determine from the return code if this happens.
Timeouts are handled by limiting the number of retries by means of a counter in the loop.
But setting the light on the keyboard is not important enough to merit checking the value returned by either call, and set_leds just proceeds blindly.

Since the keyboard is part of the console, its initialization routine, kb_init, is called from scr_init in console.c, not directly from tty_init in tty.c.
If virtual consoles are enabled, (i.e., NR_CONS in include/minix/config.h is greater than 1), kb_init is called once for each logical console.
The next function, kb_init_once, is called just once, as its name implies.
It sets the lights on the keyboard, and scans the keyboard to be sure no leftover keystroke is read.
Then it initializes two arrays, fkey_obs and sfkey_obs which are used to bind function keys to the processes that must respond to them.
When all is ready, it makes two kernel calls, sys_irqsetpolicy and sys_irqenable to set up the IRQ for the keyboard and configure it to automatically reenable, so a notification message will be sent to tty_task whenever a key is pressed or released.

Although we will soon have more opportunities to discuss how function keys work, this is a good place to describe the fkey_obs and sfkey_obs arrays.
Each has twelve elements, since modern PC keyboards have twelve F-keys.
The first array is for unmodified F-keys, the second is used when a shifted F-key is detected.
They are composed of elements of type obs_t, which is a structure that can hold a process number and an integer.
This structure and these arrays are declared in keyboard.c.
Initialization stores a special value, symbolically represented as NONE, in the proc_nr component of the structure to indicate it is not in use.
NONE is a value outside the range of valid process numbers.
Note that the process number is not a pid, it identifies a slot in the process table.
This terminology may be confusing.
But to send a notification a process number rather than a pid is used, because process numbers are used to index the priv table which determines whether a process is allowed to receive notifications.
The integer events is also initially set to zero.
It will be used to count events.

The next three functions are all rather simple.
Kbd_loadmap is almost trivial.
It is called by do_ioctl in tty.c to do the copying of a keymap from user space to overwrite the default keymap.
The default is compiled by the inclusion of a keymap source file at the start of keyboard.c.

From its first release, MINIX has always provided for dumps of various kinds of system information or other special actions in response to pressing the function keys F1, F2, etc., on the system console.
This is not a service generally provided in other operating systems, but MINIX was always intended to be a teaching tool.
Users are encouraged to tinker with it, which means users may need extra help for debugging.
In many cases the output produced by pressing one of the F-keys will be available even when the system has crashed.
Figure 3-43 summarizes these keys and their effects.

These keys fall into two categories.
As noted earlier, the CTRL-F1 through CTRL-F12 key combinations are detected by kb_read.
These trigger events that can be handled by the terminal driver.
These events are not necessarily display dumps.
In fact, currently only CTRL-F1 provides an information display; it lists function key bindings.
CTRL-F3 toggles hardware and software scrolling of the console screen, and the others cause signals.

Function keys pressed by themselves or together with the shift key are used to trigger events that cannot be handled by the terminal driver.
They may result in notification messages to a server or driver.
Because servers and drivers can be loaded, enabled, and disabled after MINIX 3 is already running, static binding of these keys at compilation time is not satisfactory.
To enable run-time changes tty task accepts messages of type FKEY CONTROL.
Do fkey ctl services such requests.
Request types are FKEY MAP, FKEY UNMAP, or FKEY EVENTS.
The first two register or unregister a process with a key specified in a bitmap in the message, and the third message type returns a bitmap of keys belonging to the caller which have been pressed and resets the events field for these keys.
A server process, the information server, (or IS) initializes the settings for processes in the boot image and also mediates generating responses.
But individual drivers can also register to respond to a function key.
Ethernet drivers typically do this, as a dump that shows packet statistics can be helpful in solving network problems.

Func_key is called from kb_read to see if a special key meant for local processing has been pressed.
This is done for every scan code received, prior to any other processing.
If it is not a function key at most three comparisons are made before control is returned to kb_read.
If a function key is registered a notification message is sent to the appropriate process.
If the process is one that has registered only one key the notification by itself is adequate for the process to know what to do.
If a process is the information server or another that has registered several keys, a dialogue is required—the process must send an FKEY_EVENTS request to the terminal driver, to be processed by do_fkey_ctl which will inform the caller which keys have been active.
The caller can then dispatch to the routine for each key that has been pressed.

Scan_keyboard works at the hardware interface level, by reading and writing bytes from I/O ports.
The keyboard controller is informed that a character has been read by the sequence, which reads a byte, writes it again with the most significant bit set to 1, and then rewrites it with the same bit rest to 0.
This prevents the same data from being read on a subsequent read.
There is no status checking in reading the keyboard, but there should be no problems in any case, since scan_keyboard is only called in response to an interrupt.

The last function in keyboard.c is do_panic_dumps.
If invoked as a result of a system panic, it provides an opportunity for the user to use the function keys to display debugging information.
The loop is another example of busy waiting.
The keyboard is read repeatedly until an ESC is typed.
Certainly no one can claim that a more efficient technique is needed after a crash, while awaiting a command to reboot.
Within the loop, the rarely-used nonblocking receive operation, nb_receive, is used to permit alternately accepting messages, if available, and testing the keyboard for input, which can be expected to be one of the options suggested in the message

printed on entering this function.
In the next section we will see the code that implements do_newkmess and do_diagnostics.

1.9.6 Implementation of the Display Driver

The IBM PC display may be configured as several virtual terminals, if sufficient memory is available.
We will examine the console’s device-dependent code in this section.
We will also look at the debug dump routines that use low-level services of the keyboard and display.
These provide support for limited interaction with the user at the console, even when other parts of the MINIX 3 system are not functioning and can provide useful information even following a near-total system crash.

Hardware-specific support for console output to the PC memory-mapped screen is in console.c.
The console structure is defined.
In a sense this structure is an extension of the tty structure defined in tty.c.
At initialization the tp−>tty_priv field of a console’s tty structure is assigned a pointer to its own console structure.
The first item in the console structure is a pointer back to the corresponding tty structure.
The components of a console structure are what one would expect for a video display: variables to record the row and column of the cursor location, the memory addresses of the start and limit of memory used for the display, the memory address pointed to by the controller chip’s base pointer, and the current address of the cursor.
Other variables are used for managing escape sequences.
Since characters are initially received as 8-bit bytes and must be combined with attribute bytes and transferred as 16-bit words to video memory, a block to be transferred is built up in c_ramqueue, an array big enough to hold an entire 80-column row of 16-bit character-attribute pairs.
Each virtual console needs one console structure, and the storage is allocated in the array cons_table.
As we have done with the tty and other structures, we will usually refer to the elements of a console structure using a pointer, for example, cons−>c_tty.

The function whose address is stored in each console’s tp−>tty_devwrite entry is cons_write.
It is called from only one place, handle_events in tty.c.
Most of the other functions in console.c exist to support this function.
When it is called for the first time after a client process makes a write call, the data to be output are in the client’s buffer, which can be found using the tp−>tty_outproc and tp−>out_vir fields in the tty structure.
The tp−>tty_outleft field tells how many characters are to be transferred, and the tp−>tty_outcum field is initially zero, indicating none have yet been transferred.
This is the usual situation upon entry to cons_write, because normally, once called, it transfers all the data requested in the original call.
However, if the user wants to slow the process in order to review the data on the screen, he may enter a STOP (CTRL-S) character at the keyboard, resulting in raising of the tp−>tty_inhibited flag.
Cons_write returns immediately when this flag is raised, even though the write has not been completed.
In such a case handle_events will continue to call cons_write, and when tp−>tty_inhibited is finally reset, by the user entering a START (CTRL-Q) character, cons_write continues with the interrupted transfer.

Cons_write’s first argument is a pointer to the particular console’s tty structure, so the first thing that must be done is to initialize cons, the pointer to this console’s console structure.
Then, because handle_events calls cons_write whenever it runs, the first action is a test to see if there really is work to be done.
A quick return is made if not.
Following this the main loop is entered.
This loop is similar in structure to the main loop of in_transfer in tty.c.
A local buffer that can hold 64 characters is filled by using the sys_vircopy kernel call to get the data from the client’s buffer.
Following this, the pointer to the source and the counts are updated, and then each character in the local buffer is transferred to the cons−>c_ramqueue array, along with an attribute byte, for later transfer to the screen by flush.

The transfer of characters from cons−>c_ramqueue can be done in more than one way, as we saw in Fig. 3-35.
Out_char can be called to do this for each character, but it is predictable that none of the special services of out_char will be needed if the character is a visible character, an escape sequence is not in progress, the screen width has not been exceeded, and cons−>c_ramqueue is not full.
If the full service of out_char is not needed, the character is placed directly into cons−>c_ramqueue, along with the attribute byte (which is retrieved from cons−>c_attr), and cons−>c_rwords (which is the index into the queue), cons−>c_column (which keeps track of the column on the screen), and tbuf, the pointer into the buffer, are all incremented.
This direct placement of characters into cons−>c_ramqueue corresponds to the dashed line on the left side of Fig. 3-35.
If needed, out_char is called.
It does all of the bookkeeping, and additionally calls flush, which does the final transfer to screen memory, when necessary.

The transfer from the user buffer to the local buffer to the queue is repeated as long as tp−>tty_outleft indicates there are still characters to be transferred and the flag tp−>tty_inhibited has not been raised.
When the transfer stops, whether because the write operation is complete or because tp−>tty_inhibited has been raised, flush is called again to transfer the last characters in the queue to screen memory.
If the operation is complete (tested by seeing if tp−>tty_outleft is zero), a reply message is sent by calling tty_reply lines 16096 and 16097).

In addition to calls to cons_write from handle_events, characters to be displayed are also sent to the console by echo and rawecho in the hardware-independent part of the terminal driver.
If the console is the current output device, calls via the tp−>tty_echo pointer are directed to the next function, cons_echo.
Cons_echo does all of its work by calling out_char and then flush.
Input from the keyboard arrives character by character and the person doing the typing wants to see the echo with no perceptible delay, so putting characters into the output queue would be unsatisfactory.

Out_char.
does a test to see if an escape sequence is in progress, calling parse_escape and then returning immediately if so.
Otherwise, a switch is entered to check for special cases: null, backspace, the bell character, and so on.
The handling of most of these is easy to follow.
The linefeed and the tab are the most complicated, since they involve complicated changes to the position of the cursor on the screen and may require scrolling as well.
The last test is for the ESC code.
If it is found, the cons−>c_esc_state flag is set, and future calls to out_char are diverted to parse_escape until the sequence is complete.
At the end, the default is taken for printable characters.
If the screen width has been exceeded, the screen may need to be scrolled, and flush is called.
Before a character is placed in the output queue a test is made to see that the queue is not full, and flush is called if it is.
Putting a character into the queue requires the same bookkeeping we saw earlier in cons_write.

The next function is scroll_screen.
Scroll_screen handles both scrolling up, the normal situation that must be dealt with whenever the bottom line on the screen is full, and scrolling down, which occurs when cursor positioning commands attempt to move the cursor beyond the top line of the screen.
For each direction of scroll there are three possible methods.
These are required to support different kinds of video cards.

We will look at the scrolling up case.
To begin, chars is assigned the size of the screen minus one line.
Softscrolling is accomplished by a single call to vid_vid_copy to move chars characters lower in memory, the size of the move being the number of characters in a line.
Vid_vid_copy can wrap, that is, if asked to move a block of memory that overflows the upper end of the block assigned to the video display, it fetches the overflow portion from the low end of the memory block and moves it to an address higher than the part that is moved lower, treating the entire block as a circular array.
The simplicity of the call hides a fairly slow operation, even though vid_vid_copy is an assembly language routine (defined in drivers/tty/vidcopy.s, not listed in Appendix B).
This call requires the CPU to move 3840 bytes, which is a large job even in assembly language.

The softscroll method is never the default; the operator is supposed to select it only if hardware scrolling does not work or is not desired for some reason.
One reason might be a desire to use the screendump command, either to save the screen memory in a file or to view the main console display when working from a remote terminal.
When hardware scrolling is in effect, screendump is likely to give unexpected results, because the start of the screen memory is likely not to coincide with the start of the visible display.

On line 16226 the wrap variable is tested as the first part of a compound test.
Wrap is true for older displays that can support hardware scrolling, and if the test fails, simple hardware scrolling occurs, where the origin pointer used by the video controller chip, cons−>c_org, is updated to point to the first character to be displayed at the upper-left corner of the display.
If wrap is FALSE, the compound test continues with a test of whether the block to be moved up in the scroll operation overflows the bounds of the memory block designated for this console.
If this is so, vid_vid_copy is called again to make a wrapped move of the block to the start of the console’s allocated memory, and the origin pointer is updated.
If there is no overlap, control passes to the simple hardware scrolling method always used by older video controllers.
This consists of adjusting cons−>c_org and then putting the new origin in the correct register of the controller chip.
The call to do this is executed later, as is a call to blank the bottom line on the screen to achieve the “scrolling” effect.

The code for scrolling down is very similar to that for scrolling up.
Finally, mem_vid_copy is called to blank out the line at the bottom (or top) addressed by new_line.
Then set_6845 is called to write the new origin from cons−>c_org into the appropriate registers, and flush makes sure all changes become visible on the screen.

We have mentioned flush several times.
It transfers the characters in the queue to the video memory using mem_vid_copy, updates some variables, and then makes sure the row and column numbers are reasonable, adjusting them if, for instance, an escape sequence has tried to move the cursor to a negative column position.
Finally, a calculation of where the cursor ought to be is made and is compared with cons−>c_cur.
If they do not agree, and if the video memory that is currently being handled belongs to the current virtual console, a call to set_6845 is made to set the correct value in the controller’s cursor register.

Figure 3-44 shows how escape sequence handling can be represented as a finite state machine.
This is implemented by parse_escape which is called at the start of out_char if cons−>c_esc_state is nonzero.
An ESC itself is detected by out_char and makes cons−>c_esc_state equal to 1.
When the next character is received, parse_escape prepares for further processing by putting a “\0” in cons−>c_esc_intro, a pointer to the start of the array of parameters, cons−>c_esc_parmv[0] into cons−>c_esc_parmp, and zeroes into the parameter array itself.
Then the first character directly following the ESC is examined—valid values are either “[” or “M”.
In the first case the “[” is copied to cons−>c_esc_intro and the state is advanced to 2.
In the second case, do_escape is called to carry out the action, and the escape state is reset to zero.
If the first character after the ESC is not one of the valid ones, it is ignored and succeeding characters are once again displayed normally.

When an ESC [ sequence has been seen, the next character entered is processed by the escape state 2 code.
There are three possibilities at this point.
If the character is a numeric character, its value is extracted and added to 10 times the existing value in the position currently pointed to by cons−>c_esc_parmp, initially cons−>c_esc_parmv[0] (which was initialized to zero).
The escape state does not change.
This makes it possible to enter a series of decimal digits and accumulate a large numeric parameter, although the maximum value currently recognized by MINIX 3 is 80, used by the sequence that moves the cursor to an arbitrary position.
If the character is a semicolon there is another parameter, so the pointer to the parameter string is advanced, allowing succeeding numeric values to be accumulated in the second parameter.
If MAX_ESC_PARMS were to be changed to allocate a larger array for the parameters, this code would not have to be altered to accumulate additional numeric values after entry of additional parameters.
Finally, if the character is neither a numeric digit nor a semicolon, do_escape is called.

Do_escape is one of the longer functions in the MINIX 3 system source code, even though MINIX 3’s complement of recognized escape sequences is relatively modest.
For all its length, however, the code should be easy to follow.
After an initial call to flush to make sure the video display is fully updated, there is a simple if choice, depending upon whether the character immediately following the ESC character was a special control sequence introducer or not.
If not, there is only one valid action, moving the cursor up one line if the sequence was ESC M.
Note that the test for the “M” is done within a switch with a default action, as a validity check and in anticipation of addition of other sequences that do not use the ESC [ format.
The action is typical of many escape sequences: the cons−>c_row variable is inspected to determine if scrolling is required.
If the cursor is already on row 0, a SCROLL_DOWN call is made to scroll_screen; otherwise the cursor is moved up one line.
The latter is accomplished just by decrementing cons−>c_row and then calling flush.
If a control sequence introducer is found, the code following the else is taken.
A test is made for “[”, the only control sequence introducer currently recognized by MINIX 3.
If the sequence is valid, the first parameter found in the escape sequence, or zero if no numeric parameter was entered, is assigned to value.
If the sequence is invalid, nothing happens except that the large switch that ensues is skipped and the escape state is reset to zero before returning from do_escape.
In the more interesting case that the sequence is valid, the switch is entered.
We will not discuss all the cases; we will just note several that are representative of the types of actions governed by escape sequences.

The first five sequences are generated, with no numeric arguments, by the four “arrow” keys and the Home key on the IBM PC keyboard.
The first two, ESC [A and ESC [B, are similar to ESC M, except they can accept a numeric parameter and move up and down by more than one line, and they do not scroll the screen if the parameter specifies a move that exceeds the bounds of the screen.
In such cases, flush catches requests to move out of bounds and limits the move to the last row or the first row, as appropriate.
The next two sequences, ESC [C and ESC [D, which move the cursor right and left, are similarly limited by flush.
When generated by the “arrow” keys there is no numeric argument, and thus the default movement of one line or column occurs.

ESC [H can take two numeric parameters, for instance, ESC [20;60H.
The parameters specify an absolute position rather than one relative to the current position and are converted from 1-based numbers to 0-based numbers for proper interpretation.
The Home key generates the default (no parameters) sequence which moves the cursor to position (1, 1).

ESC [sJ and ESC [sK clear a part of either the entire screen or the current line, depending upon the parameter that is entered.
In each case a count of characters is calculated.
For instance, for ESC [1J, count gets the number of characters from the start of the screen to the cursor position, and the count and a position parameter, dst, which may be the start of the screen, cons−>c_org, or the current cursor position, cons−>c_cur, are used as parameters to a call to mem_vid_copy.
This procedure is called with a parameter that causes it to fill the specified region with the current background color.

The next four sequences insert and delete lines and spaces at the cursor position, and their actions do not require detailed explanation.
The last case, ESC [nm (note the n represents a numeric parameter, but the “m” is a literal character) has its effect upon cons−>c_attr, the attribute byte that is interleaved between the character codes when they are written to video memory.

The next function, set_6845, is used whenever it is necessary to update the video controller chip.
The 6845 has internal 16-bit registers that are programmed 8 bits at a time, and writing a single register requires four I/O port write operations.
These are carried out by setting up an array (vector) of (port, value) pairs and invoking a sys_voutb kernel call to get the system task to do the I/O.
Some of the registers of the 6845 video controller chip are shown in Fig. 3-45

The next function is get 6845, which returns the values of readable video controller registers.
It also uses kernel calls to accomplish its job.
It does not appear to be called from anywhere in the current MINIX 3 code, but it may be useful for future enhancements such as adding graphics support.

The beep function is called when a CTRL-G character must be output.
It takes advantage of the built-in support provided by the PC for making sounds by sending a square wave to the speaker.
The sound is initiated by more of the kind of magic manipulation of I/O ports that only assembly language programmers can love.
The more interesting part of the code is using the ability to set an alarm to turn off the beep.
As a process with system privileges (i.e., a slot in the priv table), the terminal driver is allowed to set a timer using the library function tmrs_settimers.
On line 16655 this is done, with the next function, stop_beep, specified as the function to run when the timer expires.
This timer is put into the terminal task’s own timer queue.
The sys_setalarm kernel call that follows asks the system task to set a timer in the kernel.
When that expires, a SYN_ALARM message is detected by the main loop of the terminal driver, tty_task, which calls expire_timers to deal with all timers belonging to the terminal driver, one of which is the one set by beep.

The next routine, stop_beep, is the one whose address is put into the tmr_func field of the timer initiated by beep.
It stops the beep after the designated time has elapsed and also resets the beeping flag.
This prevents superfluous calls to the beep routine from having any effect.

Scr_init is called by tty_init NR_CONS times.
Each time its argument is a pointer to a tty structure, one element of the tty_table.
On lines 16693 and 16694 line, to be used as the index into the cons_table array, is calculated, tested for validity, and, if valid, used to initialize cons, the pointer to the current console table entry.
At this point the cons−>c_tty field can be initialized with the pointer to the main tty structure for the device, and, in turn, tp−>tty_priv can be pointed to this device’s console_t structure.
Next, kb_init is called to initialize the keyboard, and then the pointers to device specific routines are set up, tp−>tty_devwrite pointing to cons_write, tp−>tty_echo pointing to cons_echo, and tp−>tty_ioctl pointing to cons_ioctl.
The I/O address of the base register of the CRT controller is fetched from the BIOS, the address and size of the video memory are determined, and the wrap flag (used to determine how to scroll) is set according to the class of video controller in use.
On line 16735 the segment descriptor for the video memory is initialized in the global descriptor table by the system task.

Next comes the initialization of virtual consoles.
Each time scr_init is called, the argument is a different value of tp, and thus a different line and cons are used to provide each virtual console with its own share of the available video memory.
Each screen is then blanked, ready to start, and finally console 0 is selected to be the first active one.

Several routines display output on behalf of the terminal driver itself, the kernel, or another system component.
The first one, kputc just calls putk, a routine to output text a byte at a time, to be described below.
This routine is here because the library routine that provides the printf function used within system components is written to be linked to a character printing routine with this name, but other functions in the terminal driver expect one named putk.

Do_new_kmess is used to print messages from the kernel.
Actually, “messages” is not the best word to use here; we do not mean messages as used for interprocess communication.
This function is for displaying text on the console to report information, warnings, or errors to the user.

The kernel needs a special mechanism to display information.
It needs to be robust, too, so it can be used during startup, before all components of MINIX 3 are running, or during a panic, another time when major parts of the system may be unavailable.
The kernel writes text into a circular character buffer, part of a structure that also contains pointers to the next byte to write and the size of the yet-to-be processed text.
The kernel sends a SYS_SIG message to the terminal driver when there is new text, and do_new_kmess is called when the main loop in tty_task is running.
When things are not going so smoothly, (i.e., when the system crashes) the SYS_SIG will be detected by the loop that includes a nonblocking read operation in do_panic_dumps, which we saw in keyboard.c, and do_new_kmess will be called from there.
In either case, the kernel call sys_getkmessages retrieves a copy of the kernel structure, and the bytes are displayed, one by one, by passing them to putk, followed by a final call to putk with a null byte to force it to flush the output.
A local static variable is used to keep track of the position in the buffer between messages.

Do_diagnostics has a function similar to that of do_new_kmess, but do_diagnostics is used to display messages from system processes, rather than the kernel.
A DIAGNOSTICS message can be received either by the tty_task main loop or the loop in do_panic_dumps, and in either case a call is made to do_diagnostics.
The message contains a pointer to a buffer in the calling process and a count of the size of the message.
No local buffer is used; instead repeated sys_vircopy kernel calls are made to get the text one byte at a time.
This protects the terminal driver; if something goes wrong and a process starts generates an excessive amount of output there is no buffer to overrun.
The characters are output one by one by calling putk, followed by a null byte.

Putk can print characters on behalf of any code linked into the terminal driver, and is used by the functions just described to output text on behalf of the kernel or other system components.
It just calls out_char for each non-null byte received, and then calls flush for the null byte at the end of the string.

The remaining routines in console.c are short and simple and we will review them quickly.
Toggle_scroll does what its name says, it toggles the flag that determines whether hardware or software scrolling is used.
It also displays a message at the current cursor position to identify the selected mode.
Cons_stop reinitializes the console to the state that the boot monitor expects, prior to a shutdown or reboot.
Cons_org0 is used only when a change of scrolling mode is forced by the F3 key, or when preparing to shut down.
Select_console selects a virtual console.
It is called with the new index and calls set_6845 twice to get the video controller to display the proper part of the video memory.

The next two routines are highly hardware-specific.
Con_loadfont loads a font into a graphics adapter, in support of the ioctl TIOCSFON operation.
It calls ga_program to do a series of magical writes to an I/O port that cause the video adapter’s font memory, which is normally not addressable by the CPU, to be visible.
Then phys_copy is called to copy the font data to this area of memory, and another magic sequence is invoked to return the graphics adapter to its normal mode of operation.

The last function is cons_ioctl.
It performs only one function, setting the screen size, and is called only by scr_init, which uses values obtained from the BIOS.
If there were a need for a real ioctl call to change the sizeMINIX 3screen code to provide the new dimensions would have to be written.

1.10 Summary

Input/output is an important topic that is often neglected.
A substantial fraction of any operating system is concerned with I/O.
But I/O device drivers are often responsible for operating system problems.
Drivers are often written by programmers working for device manufacturers.
Conventional operating system designs usually require allowing drivers to have access to critical resources, such as interrupts, I/O ports, and memory belonging to other processes.
The design of MINIX 3 isolates drivers as independent processes with limited privileges, so a bug in a driver cannot crash the entire system.

We started out by looking at I/O hardware, and the relation of I/O devices to I/O controllers, which are what the software has to deal with.
Then we moved on to the four levels of I/O software: the interrupt routines, the device drivers, the device-independent I/O software, and the I/O libraries and spoolers that run in user space.

Then we examined the problem of deadlock and how it can be tackled.
Deadlock occurs when a group of processes each have been granted exclusive access to some resources, and each one wants yet another resource that belongs to another process in the group.
All of them are blocked and none will ever run again.
Deadlock can be prevented by structuring the system so it can never occur, for example, by allowing a process to hold only one resource at any instant.
It can also be avoided by examining each resource request to see if it leads to a situation in which deadlock is possible (an unsafe state) and denying or delaying those that lead to trouble.

Device drivers in MINIX 3 are implemented as independent processes running in user space.
We have looked at the RAM disk driver, hard disk driver, and terminal driver.
Each of these drivers has a main loop that gets requests and processes them, eventually sending back replies to report on what happened.
Source code for the main loops and common functions of the RAM disk, hard disk, and floppy disk drivers is provided in a common driver library, but each driver is compiled and linked with its own copy of the library routines.
Each device driver runs in its own address space.
Several different terminals, using the system console, the serial lines, and network connections, are all supported by a single terminal driver process.

Device drivers have varying relationships to the interrupt system.
Devices which can complete their work rapidly, such as the RAM disk and the memorymapped display, do not use interrupts at all.
The hard disk driver does most of its work in the driver code itself, and the interrupt handlers just return status information.
Interrupts are always expected, and a receive can be done to wait for one.
A keyboard interrupt can happen at any time.
Messages generated by all interrupts for the terminal driver are received and processed in the main loop of the driver.
When a keyboard interrupt occurs the first stage of processing the input is done as quickly as possible in order to be ready for subsequent interrupts.

MINIX 3 drivers have limited privileges, and cannot handle interrupts or access I/O ports on their own.
Interrupts are handled by the system task, which sends a message to notify a driver when an interrupt occurs.
Access to I/O ports is similarly mediated by the system task.
Drivers cannot read or write I/O ports directly.

1 03-InputOutput