Operating systems control all the computer’s I/O (Input/Output)
devices.
An OS must issue commands to the devices, catch interrupts, and handle
errors.
It provides an interface between the devices and the rest of the
system.
That interface should be simple and easy to use.
The interface should be as similar as possbile for all devices (device
independence).
The I/O code represents a significant fraction of the total operating
system.
Thus to really understand what an operating system does,
you have to understand how I/O works.
First we will look at some of the principles of how I/O hardware is organized.
Then we will look at I/O software in general.
I/O software can be structured in layers,
with each layer having a well defined task to perform.
After that, comes a section on deadlocks.
We will define deadlocks precisely,
show how they are caused,
give two models for analyzing them,
and discuss some algorithms for preventing their occurrence.
Then we will move on to look at MINIX3.
We will start with a bird’s-eye view of I/O in MINIX3,
including interrupts, device drivers, device-dependent I/O, and
device-independent I/O.
Following that introduction,
we will look at several I/O devices in detail:
disks, keyboards, and displays.
For each device, we will look at its hardware and software.
Different professions look at I/O hardware in different ways.
Electrical engineers look at it in terms of:
chips, wires, power supplies, motors,
and all the other physical components that make up the hardware.
Programmers look at the interface presented to the software:
the commands the hardware accepts,
the functions it carries out,
and the errors that can be reported back.
We are concerned with programming I/O devices,
not designing, building, or maintaining them,
so our interest will be restricted to how the hardware is
programmed,
not how it works inside.
Programming of I/O devices is connected with their internal
operation.
Thus, we provide general background on I/O hardware,
as it relates to programming.
I/O devices can be roughly divided into two categories:
1) block devices and
2) character devices.
A block device is one that stores information in fixed-size
blocks,
each one with its own address.
Common block sizes range from 512 bytes to 32,768 bytes.
The essential property of a block device is that:
it is possible to read or write each block independently of all the
other ones.
Disks are the most common block devices.
However, the boundary between devices that are block
addressable,
and those that are not, is not well defined.
Everyone agrees that a disk is a block addressable device,
because no matter where the arm currently is,
it is always possible to seek to another cylinder,
and then wait for the required block to rotate under the head.
Now consider a tape drive used for making disk backups.
Tapes contain a sequence of blocks.
If the tape drive is given a command to read block N,
then it can always rewind the tape,
and go forward until it comes to block N.
This operation is analogous to a disk doing a seek,
except that it takes much longer.
It may or may not be possible to rewrite one block in the middle of a
tape.
Even if it were possible to use tapes as random access block
devices,
that is stretching the point somewhat:
they are not normally used that way.
The other type of I/O device is the character device.
A character device delivers or accepts a stream of characters,
without regard to any block structure.
It is not addressable, and does not have any seek operation.
Printers, network interfaces, mice (for pointing),
and most other devices that are not disk-like,
can be seen as character devices.
This classification scheme is not perfect.
Some devices just do not fit in.
Clocks, for example, are not block addressable.
Nor do they generate or accept character streams.
All they do is cause interrupts at well-defined intervals.
Still, the model of block and character devices is general
enough,
that it can make some of the operating system software for I/O, device
independent.
For example, the file system deals only with abstract block
devices,
and leaves the device-dependent part to lower-level software called
device drivers.
I/O devices cover a huge range in speeds.
The figure below shows the data rates of some common devices.
Some typical device, network, and bus data rates.
Most of these devices tend to get faster as time goes on.
+++++++++++++++++ Cahoot-03-1
I/O units typically consist of an electrical component and a
mechanical component.
It is often possible to separate the two portions,
to provide a more modular and general design.
The electronic component is called the device controller or
adapter.
On personal computers, it often takes the form of a printed circuit
card,
that can be inserted into an expansion slot.
The mechanical component is the device itself.
This arrangement is shown:
A model for connecting the CPU, memory, controllers, and I/O
devices.
The controller card usually has a connector on it,
into which a cable leading to the device itself can be plugged.
Many controllers can handle two, four, or even eight identical
devices.
If the interface between the controller and device is a standard
interface,
either an official ANSI, IEEE, or ISO standard or a de-facto one,
then companies can make controllers or devices that fit that
interface.
For example, many companies make disk drives that match either
the:
IDE (Integrated Drive Electronics) or
SCSI (Small Computer System Interface) interfaces.
We mention this distinction between controller and device,
because the operating system nearly always deals with the
controller,
not the device.
Most personal computers and servers use the bus
model above,
for communication between the CPU and the controllers.
Large mainframes often use a different model,
with specialized I/O computers called I/O channels,
taking some of the load off the main CPU.
The interface between the controller and the device is often
low-level.
A disk, for example, might be formatted,
with 1024 sectors of 512 bytes per track.
What actually comes off the drive, however,
is a serial bit stream, starting with a preamble,
then the 4096 bits in a sector, and finally a checksum,
also called an Error-Correcting Code (ECC).
The preamble is written when the disk is formatted,
and contains the cylinder and sector number,
the sector size, and similar data.
The controller converts the serial bit stream into a block of
bytes,
and performs any error correction necessary.
The block of bytes is typically first assembled,
bit by bit, in a buffer inside the controller.
After its checksum has been verified,
and the block declared to be free of errors,
then it can then be copied to main memory.
A display monitor’s controller also works as a bit serial
device,
at an equally low level.
It reads bytes containing the characters to be displayed from
memory,
and generates the signals used to modulate the CRT beam.
The controller also generates signals,
for making a CRT beam do a horizontal retrace,
after it has finished a scan line,
as well as for making it do a vertical retrace,
after the entire screen has been scanned.
On an LCD screen, these signals select individual pixels,
and control their brightness,
simulating the effect of the electron beam in a CRT.
If it were not for the video controller,
then the operating system programmer would have to program the scanning
explicitly.
The operating system initializes the controller with a few
parameters,
such as the number of characters or pixels per line,
and number of lines per screen,
and lets the controller take care of actually drawing the display.
Modern disk controllers often have excess memory inside the
controller.
When a read is being processed,
as soon as the arm gets to the correct cylinder,
the controller begins reading and storing data,
even if it has not yet reached the sector it needs.
This cached data may come in handy for satisfying subsequent
requests.
Even after the requested data has been obtained,
the controller may continue to cache data from subsequent sectors,
since they are likely to be needed later.
Many disk reads can be handled without any disk activity at all.
Each controller has a few registers,
that are used for communicating with the CPU.
By writing into these registers,
the operating system can command the device to:
deliver data, accept data, switch itself on or off,
or otherwise perform some action.
By reading from these registers,
the operating system can learn what the device’s state is,
whether it is prepared to accept a new command, and so on.
In addition to the control registers,
many devices have a data buffer,
that the operating system can read and write.
For example, a common way for computers to display pixels on the
screen,
is to have a video RAM, which is basically just a data buffer,
available for programs or the operating system to write into.
The issue thus arises,
of how the CPU communicates with the control registers,
and the device data buffers.
Two alternatives exist:
In the first approach,
each control register is assigned an I/O port number,
an 8- or 16-bit integer.
Using a special I/O instruction such as
IN REG,PORT
the CPU can read in control register PORT,
and store the result in CPU register REG.
Similarly, using
OUT PORT,REG
the CPU can write the contents of REG to a control register.
Most early computers, including nearly all mainframes,
such as the IBM 360 and all of its successors, worked this way.
In this scheme, the address spaces for memory and I/O are
different,
as shown in (a):
(a) Separate I/O and memory space.
(b) Memory-mapped I/O.
(c) Hybrid.
In the second approach,
I/O registers are part of the regular memory address space,
as shown above in (b)
This scheme is called memory-mapped I/O,
and was introduced with the PDP-11 minicomputer.
Each control register is assigned a unique memory address,
to which no memory is assigned.
Usually, the assigned addresses are at the top of the address space.
A hybrid scheme,
with memory-mapped I/O data buffers,
and separate I/O ports for the control registers,
is shown in (c).
The Pentium uses this architecture,
with addresses 640K to 1M being reserved,
for device data buffers in IBM PC compatibles,
in addition to I/O buffer 0 through 64K.
In all cases, when the CPU wants to read a word,
either from memory or from an I/O port,
it puts the address it needs on the address lines of the bus,
and then asserts a READ
signal on a bus control line.
A second signal line is used,
to tell whether I/O space or memory space is needed.
If it is memory space,
then the memory responds to the request.
If it is I/O space,
then the I/O device responds to the request.
If there is only memory space, as in (b),
then every memory module and every I/O device,
compares the address lines to the range of addresses that it
services.
If the address falls in its range,
then it responds to the request.
Since no address is ever assigned to both memory and an I/O
device,
there is no ambiguity and no conflict.
Usually, controller registers have one or more status bits,
that can be tested to determine:
if an output operation is complete, or
if new data is available from an input device.
A CPU can execute a loop, testing a status bit each time,
until a device is ready to accept or provide new data.
This is called polling or busy waiting.
We saw this busy waiting concept before,
as a possible method to deal with critical sections,
and in that context it was dismissed,
as something to be avoided in most circumstances.
In the realm of I/O, where you might have to wait a very long
time,
for the outside world to accept or produce data, polling is not
acceptable,
except for very small dedicated systems, not running multiple
processes.
In addition to status bits, many controllers use interrupts,
to tell the CPU when they are ready to have their registers read or
written.
We saw how interrupts are handled by the CPU.
In the context of I/O, all you need to know is that:
most interface devices provide an output,
which is logically the same as the previous register status bit,
indicating “operation complete” or “data ready”,
but which drives one of the IRQ (Interrupt ReQuest) lines of the system
bus.
Thus when an interrupt-enabled operation completes,
it interrupts the CPU, and starts the interrupt handler running.
This piece of code informs the operating system that I/O is
complete.
The operating system may then check the status bits,
to verify that all went well,
and either harvest the resulting data,
or initiate a retry.
The number of inputs to the interrupt controller may be
limited;
Pentium-class PCs have only 15 available for I/O devices.
Some controllers are hard-wired onto the system parentboard,
for example, the disk and keyboard controllers of an IBM PC.
On older systems, a switch or jumper associated with the
controller,
set the IRQ used by the device.
If a user bought a new plug-in board,
then they had to manually set the IRQ,
to avoid conflicts with existing IRQs.
Few users could do this correctly,
which led the industry to develop Plug ’n Play,
in which the BIOS can automatically assign IRQs to devices,
at boot time to avoid conflicts.
Whether or not a system has memory-mapped I/O,
its CPU needs to address the device controllers,
to exchange data with them.
The CPU can request data from an I/O controller, one byte at a
time,
but doing so for a device like a disk,
that produces a large block of data,
wastes the CPU’s time, so a different scheme,
called DMA (Direct Memory Access) is often used.
The operating system can only use DMA,
if the hardware has a DMA controller,
which most systems do.
Sometimes this controller is integrated into disk controllers and other
controllers,
but such a design requires a separate DMA controller for each
device.
More commonly, a single DMA controller is available
(on the parentboard) for regulating transfers to multiple devices,
often concurrently.
No matter where it is physically located,
the DMA controller has access to the system bus,
independent of the CPU, as shown:
Operation of a DMA transfer.
The DMA contains several registers that can be written and read by
the CPU.
These include:
a memory address register,
a byte count register,
and one or more control registers.
The control registers specify:
which I/O port to use,
the direction of the transfer (reading or writing to the I/O
device),
the transfer unit (byte at a time, or word at a time), and
the number of bytes to transfer in one burst.
Let us first look at how disk reads occur, when DMA is not
used.
First the controller reads the block (one or more sectors) from the
drive serially,
bit by bit, until the entire block is in the controller’s internal
buffer.
Next, it computes the checksum,
to verify that no read errors have occurred.
Then the controller causes an interrupt.
When the operating system starts running,
it can read the disk block from the controller’s buffer,
a byte or a word at a time by executing a loop,
with each iteration:
reading one byte or word from a controller device register,
storing it in main memory,
incrementing the memory address,
and decrementing the count of items to be read,
until it reaches zero.
Step 1:
When DMA is used, the procedure is different.
First the CPU programs the DMA controller,
by setting its registers, so it knows what to transfer where
It also issues a command to the disk controller,
telling it to read data from the disk into its internal buffer,
and verify the checksum.
When valid data are in the disk controller’s buffer,
DMA can begin.
Step 2:
The DMA controller initiates the transfer,
by issuing a read request over the bus to the disk controller.
This read request looks like any other read request,
and the disk controller does not know or care whether:
it came from the CPU or from a DMA controller.
Typically, the memory address to write to is on the address lines of the
bus,
so when the disk controller fetches the next word from its internal
buffer,
it knows where to write it.
Step 3:
The write to memory is another standard bus cycle.
When the write is complete,
Step 4:
The disk controller sends an acknowledgement signal to the disk
controller,
also over the bus.
The DMA controller then increments the memory address to use,
and decrements the byte count.
If the byte count is still greater than 0,
then steps 2 through 4 are repeated,
until the count reaches 0.
At this point the controller causes an interrupt.
When the operating system starts up,
it does not have to copy the block to memory;
it is already there.
Why does the controller not just store the bytes in main
memory,
as soon as it gets them from the disk.
Why does it need an internal buffer?
There are two reasons:
First, by doing internal buffering,
the disk controller can verify the checksum,
before starting a transfer.
If the checksum is incorrect,
then an error is signaled,
and no transfer to memory is done.
Second, once a disk transfer has started,
the bits keep arriving from the disk at a constant rate,
whether the controller is ready for them or not.
If the controller tried to write data directly to memory,
then it would have to go over the system bus for each word
transferred.
If the bus were busy, due to some other device using it,
then the controller would have to wait.
If the next disk word arrived before the previous one had been
stored,
then the controller would have to store it somewhere.
If the bus were very busy,
then the controller might end up storing quite a few words,
and having a lot of administration to do as well.
When the block is buffered internally,
the bus is not needed until the DMA begins,
so the design of the controller is much simpler,
because the DMA transfer to memory is not time critical.
Not all computers use DMA.
The argument against it is that:
the main CPU is often far faster than the DMA controller,
and can do the job much faster
(when the limiting factor is not the speed of the I/O device).
If there is no other work for it to do,
then having the (fast) CPU wait for the (slow) DMA controller to finish
is pointless.
Also, getting rid of the DMA controller,
and having the CPU do all the work in software saves money,
important on low-end (embedded) computers.
+++++++++++++++++ Cahoot-03-2
Let us now turn away from the I/O hardware,
and look at the I/O software.
First we will look at the goals of the I/O software,
and then at the different ways I/O can be done,
from the point of view of the operating system.
A key concept in the design of I/O software is device
independence.
What this means is that:
it should be possible to write programs that can access any I/O
device,
without having to specify the device in advance.
For example, a program that reads a file as input,
should be able to read a file on any disk,
a floppy disk, on a hard disk, or on a CD-ROM,
without having to modify the program for each different device.
Similarly, one should be able to type a shell command such as:
sort <input >output
and have it work with input coming from a:
floppy disk, an IDE disk, a SCSI disk, or the keyboard,
and the output going to any kind of disk or the screen.
It is up to the operating system to take care of the problems,
caused by the fact that these devices really are different,
and require very different command sequences to read or write.
Closely related to device independence,
is the goal of uniform naming.
The name of a file or a device should simply be a string or an
integer,
and not depend on the device in any way.
In UNIX and MINIX3, all disks can be integrated into the file system
hierarchy,
in arbitrary ways, so the user need not be aware of which name
corresponds to which device.
For example, a floppy disk can be mounted on top of the directory
/usr/ast/backup
so that copying a file to that directory copies the file to the
diskette.
In this way, all files and devices are addressed the same way:
by a path name.
Another important issue for I/O software is error handling.
In general, errors should be handled as close to the hardware as
possible.
If the controller discovers a read error,
then it should try to correct the error itself if it can.
If it cannot, then the device driver should handle it,
perhaps by just trying to read the block again.
Many errors are transient,
such as read errors caused by specks of dust on the read head,
and will go away if the operation is repeated.
Only if the lower layers are not able to deal with the problem,
should the upper layers be told about it.
In many cases, error recovery can be done transparently at a low
level,
without the upper levels even knowing about the error.
Still another key issue is synchronous (blocking) versus asynchronous (interrupt-driven) transfers.
Most physical I/O is asynchronous.
The CPU starts the transfer,
and goes off to do something else,
until the interrupt arrives.
User programs are much easier to write, if the I/O operations are
blocking.
After a receive system call, the program is automatically
suspended,
until the data are available in the buffer.
It is up to the operating system to make operations that are actually
interrupt-driven,
look blocking to the user programs.
Another issue for the I/O software is buffering.
Often data that come off a device cannot be stored directly in the final
destination.
For example, when a packet comes in to the kernel from the
network,
the operating system does not know where to put it,
until it has stored the packet somewhere and examined it.
Also, some devices have severe real-time constraints
(for example, digital audio devices),
so the data must be put into an output buffer in advance,
to decouple the rate at which the buffer is filled,
from the rate at which it is emptied,
in order to avoid buffer under-runs.
Buffering involves considerable copying,
and often has a major impact on I/O performance.
The final concept that we will mention here is sharable versus dedicated devices.
Some I/O devices, such as disks,
can be used by many users at the same time.
No problems are caused by multiple users having open files,
on the same disk at the same time.
Other devices, such as tape drives,
have to be dedicated to a single user until that user is finished.
Then another user can have the tape drive.
Having two or more users writing blocks, intermixed at random,
to the same tape, will definitely not work.
Introducing dedicated (unshared) devices also introduces a variety of
problems,
such as deadlocks.
The operating system must be able to handle both shared and dedicated
devices,
in a way that avoids problems.
I/O software is often organized in four layers, as shown:
Layers of the I/O software system.
We will look at each in turn, starting at the bottom.
The emphasis in this lecture is on the device drivers (layer 2),
but we will summarize the rest of the I/O software,
to show how the pieces of the I/O system fit together.
Interrupts are an unpleasant fact of life.
Although they cannot be avoided, they should be hidden away,
deep in the bowels of the operating system,
so that as little of the operating system as possible knows about
them.
The best way to hide them,
is to have the driver starting an I/O operation, block,
until the I/O has completed, and the interrupt occurs.
The driver can block itself by doing a:
down
on a semaphore,
a wait
on a condition variable,
a receive
on a message,
or something similar.
When the interrupt happens,
the interrupt procedure handles the interrupt.
Then it can unblock the driver that started it.
In some cases it will just complete up
on a
semaphore.
In others it will do a signal
on a condition variable in a
monitor.
In still others, it will send a message to the blocked driver.
In all cases, the net effect of the interrupt will be that:
a driver that was previously blocked, will now be able to run.
If drivers are structured as independent processes,
with their own states, stacks, and program counters.
then this model works better.
Earlier, we said each device controller has registers,
used to give it commands, or to read out its status, or both.
The number of registers, and the nature of the commands,
vary radically from device to device.
For example, a mouse driver has to accept information from the
mouse,
telling how far it has moved, and which buttons are currently
depressed.
In contrast, a disk driver has to know about:
sectors, tracks, cylinders, heads, arm motion, motor drives, head
settling times,
and all the other mechanics of making the disk work properly.
Obviously, these drivers will be very different.
Thus, each I/O device attached to a computer,
needs some device-specific code for controlling it.
This code, called the device driver,
is generally written by the device’s manufacturer,
and delivered along with the device.
Since each operating system needs its own drivers,
device manufacturers often supply drivers for popular operating
systems.
Each device driver normally handles one device type,
or one class of closely related devices.
For example, it would probably be a good idea to have a single mouse
driver,
even if the system supports several different brands of mice.
As another example, a disk driver can usually handle multiple
disks,
of different sizes and different speeds, and perhaps a CD-ROM as
well.
On the other hand, a mouse and a disk are so different,
that different drivers are necessary.
In order to access the device’s hardware,
meaning the controller’s registers,
the device driver traditionally has been part of the system
kernel.
This approach gives the good performance,
but has worse reliability,
since a bug in any device driver, can crash the entire system.
MINIX3 departs from this model, to enhance reliability and
security.
In MINIX3 each device driver is now a separate user-mode process.
Operating systems usually classify drivers:
as block devices, such as disks, or
character devices, such as keyboards and printers.
Most operating systems define a standard interface,
that all block drivers must support,
and a second standard interface,
that all character drivers must support.
These interfaces consist of a number of procedures,
that the rest of the operating system can call,
to get the driver to do work for it.
The job of a device driver is to accept abstract requests,
from the device-independent software above it,
and see to it that the request is executed.
A typical request to a disk driver is to read block n.
If the driver is idle at the time a request comes in,
then it starts carrying out the request immediately.
If, however, it is already busy with a request,
it can enter the new request into a queue of pending requests,
to be dealt with as soon as possible.
The first step in actually carrying out an I/O request,
is to check that the input parameters are valid,
and if they are not, to return an error.
If the request is valid,
then the next step is to translate it,
from abstract to concrete terms.
For a disk driver, this means:
figuring out where on the disk the requested block actually is,
checking to see if the drive’s motor is running,
determining if the arm is positioned on the proper cylinder,
and so on.
The driver must decide:
which controller operations are required,
and in what sequence.
Once the driver has determined which commands to issue to the
controller,
it starts issuing them, by writing into the controller’s device
registers.
Simple controllers can handle only one command at a time.
More sophisticated controllers accept a linked list of commands,
which they then carry out by themselves,
without further help from the operating system.
After the command or commands have been issued,
one of two situations will apply.
Blocking
In many cases, the device driver must wait,
until the controller does some work for it,
so it blocks itself,
until an interrupt arrives, to unblock it.
The blocked driver will later be awakened by the interrupt.
No blocking
In other cases, the operation finishes without delay,
so the driver need not block.
As an example of the latter situation,
scrolling the screen on some graphics cards,
requires just writing a few bytes into the controller’s registers.
No mechanical motion is needed,
so the entire operation can be completed in a few microseconds.
The driver never goes to sleep.
Either way, after the operation has been completed,
it must check for errors.
If everything is all right,
then the driver may have data to pass,
to the device-independent software (e.g., a block just read).
Finally, it returns some status information,
for error reporting, back to its caller.
If any other requests are queued,
then one of them can now be selected and started.
If nothing is queued,
then the driver blocks waiting for the next request.
Drivers mainly serve requests for reading and writing,
but there may be other requirements.
For example, the driver may initialize a device,
at system startup, or the first time it is used.
Also, there may be a need to manage power requirements,
handle Plug ’n Play, or log events.
Although some of the I/O software is device specific,
a large fraction of it is device independent.
Divisions between drivers and the device-independent software are system
dependent,
because some functions that could be device-independent,
may actually be done in the drivers,
for efficiency or other reasons.
Device-independent software typically performs the following
functions:
In MINIX3, most of the device-independent software is part of the
file system.
Although we will study the file system upcoming,
we will take a quick look at the device-independent software here,
to provide some perspective on I/O,
and show better where the drivers fit in.
The basic function of the device-independent software,
is to perform the I/O functions that are common to all devices,
and to provide a uniform interface to the user-level software.
Below we will look at the above issues in more detail.
One major issue in operating systems is:
how to make all I/O devices and drivers look more-or-less the
same.
If disks, printers, monitors, keyboards, etc.,
are all interfaced in different ways,
then every time a new peripheral device comes along,
the operating system must be modified for the new device:
Without a standard driver interface, illustrates symbolically a
situation,
in which each device driver has a different interface to the operating
system.
With a standard drive interface, shows a different design,
in which all drivers have the same interface.
With a standard interface,
it is much easier to plug in a new driver,
providing it conforms to the driver interface.
It also means that driver writers know what is expected of them
(e.g., what functions they must provide
and what kernel functions they may call).
In practice, not all devices are absolutely identical,
but usually there are only a small number of device types,
and even these are generally almost the same.
For example, even block and character devices have many functions in
common.
Another aspect of having a uniform interface is how I/O devices are
named.
The device-independent software maps symbolic device names onto the
proper driver.In UNIX and MINIX3, a device name, such as
/dev/disk0
,
uniquely specifies the i-node for a special file,
and this i-node contains the major device number,
which is used to locate the appropriate driver.
The inode also contains the minor device number,
which is passed as a parameter to the driver,
in order to specify the unit to be read or written.
All devices have major and minor numbers,
and all drivers are accessed by using the major device number,
to select the driver.
Closely related to naming is protection.
How does the system prevent users from accessing devices,
that they are not entitled to access?
In UNIX, MINIX3, and also in later Windows versions,
such as Windows 2000 and Windows XP,
devices appear in the file system as named objects,
which means that the usual protection rules for files,
also apply to I/O devices.
The system administrator can then set the proper permissions
(i.e., in UNIX the rwx bits) for each device.
Buffering is also an issue for both block and character devices.
Block:
For block devices.
hardware generally reads and writes entire blocks at once,
but user processes are free to read and write in arbitrary units.
If a user process writes half a block,
the operating system will normally keep the data around
internally,
until the rest of the data are written,
at which time the block can go out to the disk.
Character:
For character devices,
users can write data to the system faster than it can be output,
necessitating buffering.
Keyboard input that arrives before it is needed also requires
buffering.
Errors are far more common in the context of I/O,
than in any other context!
When they occur, the operating system must handle them.
Many errors are device-specific,
so only the driver knows what to do
(e.g., retry, ignore, or panic).
An example error is:
one caused by a disk block that has been damaged,
and cannot be read any more.
After the driver has tried to read the block a certain number of
times,
it gives up, and informs the device-independent software.
How the error is treated from here on, is device independent.
If the error occurred while reading a user file,
it may be sufficient to report the error back to the caller.
However, if it occurred while reading a critical system data
structure,
such as the block containing the bitmap showing which blocks are
free,
the operating system may have to display an error message and
terminate.
Some devices, such as CD-ROM recorders,
can be used only by a single process at any given moment.
It is up to the operating system to examine requests for device
usage,
and accept or reject them,
depending on whether the requested device is available or not.
A simple way to handle these requests,
is to require processes to perform an open
,
on the special files for devices directly.
If the device is unavailable, the open
fails.
Closing such a dedicated device then releases it.
Not all disks have the same sector size.
It is up to the device-independent software to hide this fact,
and provide a uniform block size to higher layers,
for example, by treating several sectors as a single logical
block.
In this way, the higher layers only deal with abstract devices,
that all use the same logical block size,
independent of the physical sector size.
Some character devices deliver their data one byte at a time (e.g.,
modems),
while others deliver theirs in larger units (e.g., network
interfaces).
These differences may also be hidden.
Although most of the I/O software is within the operating
system,
a small portion consists of libraries linked together with user
programs,
and even whole programs running outside the kernel.
System calls, including the I/O system calls,
are normally made by library procedures.
For example, when a C program contains the call
count = write(fd, buffer, nbytes);
the library procedure write
will be linked with the
program,
and contained in the binary program present in memory at run time.
The collection of all these library procedures is clearly part of the
I/O system.
These procedures merely put their parameters in place for the system
call,
but there are other I/O procedures that actually do real work.
In particular, formatting of input and output is done by library
procedures.
For example, the C printf
,
which takes as input, a format string, and possibly some
variables,
builds an ASCII string,
and then calls write
to output the string.
Consider the statement:
printf("The square of %3d is %6d\n", i, i*i);
It formats a string consisting of the 14-character string “The square
of”,
followed by the value i as a 3-character string,
then the 4-character string ” is “,
then i2 as six characters,
and finally a newline feed.
An example of a similar procedure for input is
scanf
which reads input, and stores it into variables,
in a format described in a format string,
using the same syntax as printf
.
The standard I/O library contains a number of procedures that involve
I/O,
and all run as part of user programs.
Not all user-level I/O software consists of library procedures.
Another important category is the spooling system.
Spooling deals with dedicated I/O devices in a multiprogramming
system.
Consider a typical spooled device: a printer.
One could let any user process open the character special file for the
printer.
However, suppose a process opened it and then did nothing for
hours?
No other process could print anything.
Instead what is done is to create:
a special process, called a daemon,
and a special directory, called a spooling directory.
To print a file, a process first generates the entire file to be
printed,
and puts it in the spooling directory.
It is up to the daemon to print the files in the directory.
The daemon is the only process with permission to use the printer’s
special file.
By protecting the special file against direct use by users,
the problem of having someone keeping it open unnecessarily long is
eliminated.
Spooling is used not only for printers,
but also in various other situations.
For example, electronic mail (e-mail) usually uses a daemon.
When a message is submitted, it is put in a mail spool directory.
Later on the mail daemon tries to send it.
At any given instant of time,
a particular destination may be temporarily unreachable,
so the daemon leaves the message in the spool,
with status information indicating it should be tried again.
The daemon may also send a message back to the sender,
saying delivery is delayed, or,
after a delay of hours or days,
saying the message cannot be delivered.
All of this is outside the operating system.
The image below summarizes the I/O system,
showing the layers and principal functions of each layer.
Starting at the bottom, the layers are:
the hardware,
interrupt handlers,
device drivers,
device-independent software, and
user processes.
Layers of the I/O system, and the main functions of each layer.
The arrows show the flow of control.
For example, when a user program tries to read a block from a
file,
the operating system is invoked, to carry out the call.
The device-independent software looks for it in the buffer cache.
If the needed block is not there,
then it calls the device driver,
to issue the request to the hardware,
to go get it from the disk.
The process is then blocked,
until the disk operation has been completed.
When the disk is finished, the hardware generates an interrupt.
The interrupt handler is run, to discover what has happened,
that is, which device wants attention right now.
It then extracts the status from the device,
and wakes up the sleeping process,
to finish off the I/O request,
and let the user process continue.
+++++++++++++++++ Cahoot-03-3
https://en.wikipedia.org/wiki/Deadlock
Computer systems are full of resources that can only be used by one
process at a time.
Common examples include:
printers, drives, and slots in the system’s internal tables.
Having two processes simultaneously writing to the printer leads to
gibberish.
Having two processes using the same file system table slot,
will invariably lead to a corrupted file system.
Operating systems temporarily grant a process exclusive access to
certain resources,
both hardware and software.
A process may need exclusive access to not just one resource, but several.
Suppose, for example, two processes each want to record a scanned
document on a CD.
Process A requests permission to use the scanner, and is granted
it.
Process B is programmed differently,
and requests the CD recorder first, and is also granted it.
Now A asks for the CD recorder, but the request is denied, until B
releases it.
Unfortunately, instead of releasing the CD recorder, B asks for the
scanner.
At this point, both processes are blocked, and will remain so
forever.
This situation is called a resource deadlock.
Resource deadlocks can occur in a variety of situations,
besides requesting dedicated I/O devices.
In a database system for example,
a program may have to lock several records it is using,
to avoid race conditions.
If process A locks record R1, and process B locks record R2,
and then each process tries to lock the other one’s record,
we also have a deadlock.
Deadlocks can occur on both hardware resources or on software resources.
Although this material is about deadlocks in the context of operating
systems,
they also occur in database systems, and many other contexts in computer
science,
so this material is actually applicable to a wide variety of
multiprocess systems.
Deadlocks can occur when processes have been granted exclusive access
to devices, files, etc.
To make the discussion of deadlocks as general as possible,
we will refer to the objects granted as resources.
A resource can be either a hardware device (e.g., a tape drive),
or a piece of information (e.g., a locked record in a database).
A computer will normally have many different resources that can be
acquired.
For some resources, several identical instances may be available,
such as three tape drives.
When interchangeable copies of a resource are available,
called fungible resources, any one of them can be used,
to satisfy any request for the resource.
A resource is anything that can be used by only a single process,
at any instant of time.
Resources come in two types:
preemptable and non-preemptable.
A preemptable resource is one that can be taken away from the process
owning it,
with no ill effects.
Memory is an example of a preemptable resource.
Consider, for example,
a system with 64 MB of user memory,
one printer,
and two 64-MB processes, that each want to print something.
Process A requests and gets the printer,
then starts to compute the values to print.
Before it has finished with the computation,
it exceeds its time quantum and is swapped or paged out.
Process B now runs and tries, unsuccessfully, to acquire the
printer.
Potentially, we now have a deadlock situation,
because A has the printer and B has the memory,
and neither can proceed without the resource held by the other.
Fortunately, it is possible to preempt (take away) the memory from
B,
by swapping it out (to disk for example), and swapping A in.
Now A can run, do its printing, and then release the printer.
No deadlock occurs.
A non-preemptable resource, is one that cannot be taken away from its
current owner,
without causing the computation to fail.
If a process has begun to burn a CD-ROM,
then suddenly taking the CD recorder away from it,
and giving it to another process,
will result in a garbled CD.
CD recorders are not preemptable at an arbitrary moment.
In general, deadlocks involve non-preemptable resources.
Potential deadlocks that involve preemptable resources,
can usually be resolved,
by reallocating resources from one process to another.
Thus our treatment will focus on non-preemptable resources.
The sequence of events required to use a resource is given
below,
in an abstract form:
If the resource is not available when it is requested,
then the requesting process is forced to wait.
In some operating systems,
when a resource request fails,
the process is automatically blocked,
and awakened when it becomes available.
In other systems, the request fails with an error code,
and it is up to the calling process to wait a little while,
and try again.
Resource deadlock can be defined formally as follows:
A set of processes is deadlocked,
if each process in the set is waiting for an event,
that only another process in the set can cause.
Because all the processes are waiting,
none of them will ever cause any of the needed events,
that could wake up any of the other members of the set,
and all the processes continue to wait forever.
For this model, we assume that processes have only a single
thread,
and that there are no interrupts possible,
to wake up a blocked process.
The no-interrupts condition is needed for deadlock,
to prevent an otherwise deadlocked process,
from being awakened by, for example, an alarm,
and then causing events that release other processes in the set.
In most cases, the event that each process is waiting for,
is the release of some resource,
currently possessed by another member of the set.
Each member of the set of deadlocked processes is waiting for a
resource,
that is owned by a deadlocked process.
None of the processes can run,
none of them can release any resources,
and none of them can be awakened.
The number of processes, and the number and kind of resources possessed
and requested,
are unimportant.
This result holds for any kind of resource,
including both hardware and software.
Coffman et al. (1971) showed that four conditions must hold for there to be a deadlock:
Mutual exclusion condition.
Each resource is either currently assigned to exactly one process, or is
available.
Hold and wait condition.
Processes currently holding resources that were granted earlier, can
request new resources.
No preemption condition.
Resources previously granted cannot be forcibly taken away from a
process.
They must be explicitly released by the process holding them.
Circular wait condition.
There must be a circular chain of two or more processes,
each of which is waiting for a resource held by the next member of the
chain.
All four of these conditions must be present for a deadlock to
occur.
If one of them is absent, then no deadlock is possible.
In a series of papers,
Levine (2003a, 2003b, 2005) points out that there are various
situations,
all called deadlock in the literature,
and that Coffman et al.’s conditions apply more narrowly,
only to what should properly be called resource deadlock.
The literature contains other examples of “deadlock”,
that do not really meet all of these conditions.
For example, if four vehicles arrive simultaneously at a
crossroad,
and try to obey the rule that each should yield to the vehicle on the
right,
then none can proceed,
but this is not a case where one process already has possession of a
unique resource.
Rather, this problem is a “scheduling deadlock”,
which can be resolved by a decision about priorities,
imposed from outside by a policeman.
It is worth noting that:
each condition relates to a policy that a system can have, or not
have.
Can a given resource be assigned to more than one process at once?
Can a process hold a resource and ask for another?
Can resources be preempted?
Can circular waits exist?
Later on, we will see how deadlocks can be attacked,
by trying to negate some of these conditions.
Holt (1972) showed how these four conditions can be modeled using
directed graphs.
The graphs have two kinds of nodes:
processes, shown as circles, and
resources, shown as squares.
An arc from a resource node (square) to a process node
(circle),
means that the resource has previously been
requested by, granted to, and is currently held by that process.
An arc from a process to a resource,
means that the process is currently blocked waiting for that
resource.
Resource allocation graphs.
Holding a resource.
In the image, resource R is currently assigned to process A:
Requesting a resource.
Process B is waiting for resource S.
Deadlock.
Process C is waiting for resource T,
which is currently held by process D.
Process D is not about to release resource T,
because it is waiting for resource U, held by C.
Both processes will wait forever.
A cycle in the graph, means that there is a deadlock,
involving the processes and resources in the cycle
(assuming that there is one resource of each kind).
In this example, the cycle is C−T−D−U−C.
Now let us see how resource graphs can be used.
Imagine that we have three processes, A, B, and C,
and three resources, R, S, and T.
An example of how deadlock occurs and how it can be avoided.
(a-c) The requests and releases of the three processes are
given.
The operating system is free to run any unblocked process at any
instant,
so it could decide to run A until A finished all its work,
then run B to completion, and finally run C.
This ordering does not lead to any deadlocks
(because there is no competition for resources)
but it also has no parallelism at all.
In addition to requesting and releasing resources,
processes compute and do I/O.
When the processes are run sequentially,
there is no possibility that while one process is waiting for I/O,
another can use the CPU.
Thus running the processes strictly sequentially may not be
optimal.
On the other hand, if none of the processes do any I/O at all,
shortest job first is better than round robin,
so under such circumstances,
running all processes sequentially may be the best way.
The operating system is not required to run the processes in any
special order.
In particular, if granting a particular request might lead to
deadlock,
then the operating system can simply suspend the process,
without granting the request until it is safe
(i.e., just not schedule the process).
If the operating system knew about the impending deadlock,
then it could suspend B instead of granting it S.
After step (q), process B can be granted S,
because A is finished and C has everything it needs.
Even if B should eventually block when requesting T, no deadlock can
occur.
B will just wait until C is finished.
Some algorithms can make allocation decisions that do not lead to
deadlock.
For now, the point to understand is that resource graphs are a
tool,
that let us see if a given request/release sequence leads to
deadlock.
We just carry out the requests and releases step by step,
and after every step check the graph,
to see if it contains any cycles.
If so, we have a deadlock;
if not, there is no deadlock.
This example was of case of a single resource of each type,
but resource graphs can also be generalized,
to handle multiple resources of the same type (Holt, 1972).
However, Levine (2003a, 2003b) points out, that with fungible
resources,
this can get very complicated indeed.
If even one branch of the graph is not part of a cycle, that is,
if one process which is not deadlocked, holds a copy of one of the
resources,
then deadlock may not occur.
In general, four strategies are used for dealing with deadlocks.
Just ignore the problem altogether.
Maybe if you ignore it, it will ignore you.
Detection and recovery.
Let deadlocks occur, detect them, and take action.
Dynamic avoidance by careful resource allocation.
Prevention, by structurally negating one of the four conditions necessary to cause a deadlock.
We will examine each of these methods in turn in the next four sections.
+++++++++++++++++ Cahoot-03-4
The first and simplest approach is the ostrich algorithm:
stick your head in the sand, and pretend there is no problem at
all.
Different people react to this strategy in very different ways.
Mathematicians find it completely unacceptable,
and say that deadlocks must be prevented at all costs.
Engineers ask how often the problem is expected,
how often the system crashes for other reasons,
and how serious a deadlock is.
If deadlocks occur on the average once every five years,
but system crashes due to other bugs occur once a week,
then most engineers would not be willing to pay a large penalty,
in performance or convenience, to eliminate deadlocks.
To make this contrast more specific,
UNIX (and MINIX3) potentially suffer from deadlocks that are not even
detected,
let alone automatically broken.
The total number of processes in a system,
is determined by the number of entries in the process table.
Thus process table slots are finite resources.
If a fork fails because the table is full,
a reasonable approach for the program doing the fork,
is to wait a random time and try again.
Now suppose that a MINIX3 system has 100 process slots.
Ten programs are running,
each of which needs to create 12 (sub)processes.
After each process has created 9 processes,
the 10 original processes and the 90 new processes have exhausted the
table.
Each of the 10 original processes now sits in an endless loop,
forking and failing—a deadlock.
The probability of this happening is minuscule,
but it could happen.
Should we abandon processes and the fork call to eliminate the
problem?
The maximum number of open files is similarly restricted,
by the size of the i-node table,
so a similar problem occurs when it fills up.
Swap space on the disk is another limited resource.
In fact, almost every table in the operating system represents a finite
resource.
Should we abolish all of these, because it might happen,
that a collection of n processes might each claim 1/n of the
total,
and then each try to claim another one?
Most operating systems, including UNIX, MINIX3, and Windows,
just ignore the problem, on the assumption that:
most users would prefer an occasional deadlock,
to a rule restricting all users to only one open resource.
If deadlocks could be eliminated for free,
then there would not be much discussion.
The problem is that, the price is high,
mostly in terms of putting inconvenient restrictions on processes,
as we will see shortly.
Thus we are faced with an unpleasant trade-off,
between convenience and correctness,
and a great deal of discussion about which is more important, and to
whom.
Under these conditions, general solutions are hard to find.
A second technique is detection and recovery.
The system only monitors the requests and releases of resources,
rather than preventing, just fixed errors after they occur.
Every time a resource is requested or released,
the resource graph is updated,
and a check is made, to see if any cycles exist.
If a cycle exists, then one of the processes in the cycle is
killed.
If this does not break the deadlock,
another process is killed,
and so on until the cycle is broken.
A somewhat cruder method,
is not even to maintain the resource graph,
but instead periodically to check timing,
to see if there are any processes that have been continuously
blocked,
for more than some fixed time, perhaps, 1 hour.
Such processes are then killed.
Detection and recovery is the strategy often used on large mainframe
computers,
especially batch systems, in which killing a process and then restarting
it,
is usually acceptable.
Care must be taken to restore any modified files to their original
state,
and undo any other side effects that may have occurred.
The third deadlock strategy is to impose suitable restrictions on
processes,
so that deadlocks are structurally impossible.
The four conditions stated by Coffman et al.
(1971) provide a clue to some possible solutions.
First let us attack the mutual exclusion condition.
If no resource was ever assigned exclusively to a single process,
then we would never have deadlocks.
For example:
Allowing two processes to write on the printer at the same time will
lead to chaos.
But, by spooling printer output,
several processes can generate output at the same time.
The only process that actually requests the physical printer is the
printer daemon.
Since the daemon never requests any other resources,
we can eliminate deadlock for the printer.
Unfortunately, not all devices can be spooled
(the process table does not lend itself well to being spooled).
Furthermore, competition for disk space for spooling can itself lead to
deadlock.
What if two processes each filled up half of the available spooling
space with output,
and neither was finished producing output?
If the daemon was programmed to begin printing even before all the
output was spooled,
and an output process decided to wait several hours after the first
burst of output,
then the printer could still lie idle.
Daemons are normally programmed to print only after the complete
output file is available.
In this case, we can have two processes that have each finished
part,
but not all, of their output, and cannot continue.
Neither process will ever finish, so we have a deadlock on the disk.
The second of the conditions stated by Coffman et al. looks slightly
more promising.
If we can prevent processes that hold resources,
from waiting for more resources,
then we can eliminate deadlocks.
One way to achieve this goal, while still allowing multiple resource
access,
is to require all processes to request all their resources before
starting execution.
If everything is available,
then the process will be allocated whatever it needs,
and can run to completion.
If one or more of the desired resources are busy,
then nothing will be allocated and the process would just wait.
An immediate problem with this approach is that:
many processes do not know how many resources they will need,
until after they have started running.
Another problem is that:
resources will not be used optimally with this approach.
Take, as an example, a process that reads data from an input
tape,
analyzes it for an hour, and then writes an output tape,
as well as plotting the results.
If all resources must be requested in advance,
then the process will tie up the output tape drive and the plotter for
an hour.
A slightly different way to break the hold-and-wait condition,
is to require that a process requesting a resource,
to first temporarily release all the resources it currently holds.
Then it tries to get everything it needs all at once.
Attacking the third condition (no preemption) is even less
promising,
compared to attacking the second one.
If a process has been assigned the printer,
and is in the middle of printing its output,
then forcibly taking away the printer,
because a needed plotter is not available,
is tricky at best, and impossible at worst.
Only one condition is left.
The circular wait can be eliminated in several ways.
One way is simply to have a rule saying that:
a process is entitled only to a single resource at any moment.
If it needs a second one, then it must release the first one.
For a process that needs to copy a huge file from a tape to a
printer,
this restriction is unacceptable.
Another way to avoid the circular wait is to:
provide a global numbering of all the resources, as shown in (a).
(a) Numerically ordered resources.
(b) A resource graph.
Now the rule is this:
processes can request resources whenever they want to,
but all requests must be made in numerical order.
A process may request a scanner first, and then a tape drive,
but it may not request a plotter first, and then a scanner.
With this rule, the resource allocation graph can never have
cycles.
Let us see why this is true for the case of two processes (b).
We can get a deadlock only if:
A requests resource j,
and B requests resource i.
Assuming i and j are distinct resources,
they will have different numbers.
If i > j, then A is not allowed to request j,
because that is lower than what it already has.
If i < j, then B is not allowed to request i,
because that is lower than what it already has.
Either way, deadlock is impossible.
With multiple processes, the same logic holds.
At every instant, one of the assigned resources will be highest.
The process holding that resource will never ask for a resource already
assigned.
It will either finish, or at worst,
request even higher numbered resources,
all of which are available.
Eventually, it will finish and free its resources.
At this point, some other process will hold the highest resource,
and can also finish.
In short, there exists a scenario in which all processes finish,
so no deadlock is present.
A minor variation of this algorithm is to:
drop the requirement that resources be acquired in strictly increasing
sequence,
and merely insist that no process request a resource lower than what it
is already holding.
If a process initially requests 9 and 10, and then releases both of
them,
it is effectively starting all over,
so there is no reason to prohibit it from now requesting resource 1.
Although numerically ordering the resources eliminates the problem of
deadlocks,
it may be impossible to find an ordering that satisfies everyone.
When the resources include process table slots, disk spooler
space,
locked database records, and other abstract resources,
the number of potential resources and different uses may be so
large,
that no ordering could possibly work.
Also, as Levine (2005) points out, ordering resources negates
fungibility,
a perfectly good and available copy of a resource,
could be inaccessible with such a rule.
The various approaches to deadlock prevention are summarized:
Summary of approaches to deadlock prevention.
We saw that deadlock was avoidable,
by carefully analyzing each resource request,
to see if it could be safely granted.
Question:
is there an algorithm that can always avoid deadlock,
by making the right choice all the time?
Answer:
The answer is a qualified yes, we can avoid deadlocks,
but only if certain information is available in advance.
Deadlock can be avoided by careful resource allocation.
The Banker’s Algorithm can avoid deadlocks (Dijkstra, 1965).
It is modeled on the way a small-town banker might lend,
in dealing with a group of customers to whom he has granted lines of
credit.
The banker does not necessarily have enough cash on hand,
to simultaneously lend every customer the full amount of each one’s line
of credit.
In (a) we see four customers, A, B, C, and D,
each of whom has been granted a certain number of credit units
(e.g., 1 unit is 1K dollars):
Three resource allocation states:
(a) Safe.
(b) Safe.
(c) Unsafe.
The banker knows that not all customers will need their maximum
credit immediately,
so he has reserved only 10 units rather than 22 to service them.
He also trusts every customer to be able to repay his loan,
very soon after receiving his total line of credit (it is a small
town),
A customer can only repay their loan,
after having received it all.
so he knows eventually he can service all the requests.
In this analogy:
customers are like processes,
units are like drives,
and the banker is the operating system.
Each part of the figure shows a state of the system,
with respect to resource allocation, that is,
a list of customers showing the money already loaned (drives already
assigned)
and the maximum credit available (maximum number of drives needed at
once later).
A state is safe, if there exists a sequence of other states,
that leads to all customers getting loans up to their credit
limits
(all processes getting all their resources and terminating).
The customers go about their respective businesses,
making loan requests from time to time (i.e., asking for
resources).
At a certain moment, the situation is as shown in (b).
This state is safe because with two units left,
the banker can delay any requests except C’s,
thus letting C finish, and release all four of his resources.
With four units in hand, the banker can let either D or B have the
necessary units, and so on.
Consider what would happen:
If a request from B for one more unit were granted in (b).
Then, we would have situation (c), which is unsafe.
If all the customers suddenly asked for their maximum loans,
then the banker could not satisfy any of them,
and we would have a deadlock.
An unsafe state does not have to lead to deadlock,
since a customer might not need the entire credit line available,
but the banker cannot count on this behavior.
The banker’s algorithm considers each request as it occurs,
and calculates if granting it leads to a safe state.
If it does, the request is granted;
otherwise, it is postponed until later.
To see if a state is safe,
the banker checks to see if he has enough resources to satisfy some
customer.
If so, those loans are assumed to be repayable,
and the customer now closest to the limit is checked, and so on.
If all loans can eventually be repaid,
the state is safe and the initial request can be granted.
The above algorithm was described in terms of a single resource
class
(e.g., only tape drives or only printers, but not some of each).
Below is a model for dealing with two processes and two resources,
for example, a printer and a plotter.
Two process resource trajectories.
The horizontal axis represents the number of instructions executed by
process A.
The vertical axis represents the number of instructions executed by
process B.
At I1 A requests a printer; at I2 it needs a
plotter.
The printer and plotter are released at I3 and I4
respectively.
Process B needs the plotter from I5 to I7 and the
printer from I6 to I8.
Every point in the diagram represents a joint state of the two
processes.
Initially, the state is at p,
with neither process having executed any instructions.
If the scheduler chooses to run A first,
then we get to the point q,
in which A has executed some number of instructions,
but B has executed none.
At point q the trajectory becomes vertical,
indicating that the scheduler has chosen to run B.
With a single processor, all paths must be horizontal or vertical, never
diagonal.
Furthermore, motion is always to the north or east,
never to the south or west (processes cannot run backward).
When A crosses the I1 line on the path from r to s,
it requests and is granted the printer.
When B reaches point t, it requests the plotter.
The regions that are shaded are especially interesting.
and represent both processes having the resource.
The mutual exclusion rule makes it impossible to enter this
region.
Under no conditions can the system enter the shaded regions.
If the system ever enters the box bounded by:
I1 and I2 on the sides and I5 and
I6 top and bottom,
then it will eventually deadlock,
when it gets to the intersection of I2 and
I6.
At this point, A is requesting the plotter,
and B is requesting the printer,
and both are already assigned.
The entire box is unsafe, and must not be entered.
At point t, the only safe thing to do,
is run process A until it gets to I4.
Beyond that, any trajectory to u will do.
The important thing to see here is that:
at point t, B is requesting a resource.
The system must decide whether to grant it or not.
If the grant is made,
then the system will enter an unsafe region,
and eventually deadlock.
To avoid the deadlock, B should be suspended,
until A has requested and released the plotter.
The previous graphical model is difficult to apply to the general
case,
with an arbitrary number of processes,
and an arbitrary number of resource classes,
each with multiple instances (e.g., two plotters, three tape
drives).
However, the banker’s algorithm can be generalized to do the job:
We see two matrices.
The one on the left:
shows how many of each resource are currently assigned,
to each of the five processes.
The matrix on the right:
shows how many resources each process still needs,
in order to complete.
As in the single resource case,
processes must state their total resource needs before executing,
so that the system can compute the right-hand matrix at each
instant.
The three vectors at the right of the figure show;
the existing resources, E,
the possessed resources, P, and
the available resources, A, respectively.
From E we see that the system has:
six tape drives, three plotters, four printers, and two CD-ROM
drives.
Of these, five tape drives, three plotters, two printers, and two CD-ROM
drives are currently assigned.
This fact can be seen,
by adding up the four resource columns in the left-hand matrix.
The available resource vector is simply the difference between what the
system has,
and what is currently in use.
The algorithm for checking to see if a state is safe can now be stated.
Look for a row, R,
whose unmet resource needs are all smaller than, or equal to A.
If no such row exists,
then the system will eventually deadlock,
since no process can run to completion.
Assume the process of the row chosen,
requests all the resources it needs (which is guaranteed to be possible)
and finishes.
Mark that process as terminated,
and add all its resources to the A vector.
Repeat steps 1 and 2,
until either all processes are marked terminated,
in which case the initial state was safe,
or until a deadlock occurs,
in which case it was not.
If several processes are eligible to be chosen in step 1,
then it does not matter which one is selected:
the pool of available resources either gets larger, or stays the
same.
Now let us get back to the example image above.
The current state is safe.
Suppose that process B now requests a printer.
This request can be granted, because the resulting state is still
safe
(process D can finish, and then processes A or E, followed by the
rest).
Now imagine that:
after giving B one of the two remaining printers,
E wants the last printer.
Granting that request would reduce the vector of available resources to
(1 0 0 0),
which leads to deadlock.
Clearly E’s request must be deferred for a while.
The banker’s algorithm was first published by Dijkstra in 1965.
Since that time, nearly every book on operating systems has described it
in detail.
Innumerable papers have been written about various aspects of it.
Few authors have had the audacity to point out that:
although in theory the algorithm is wonderful,
in practice it is essentially useless,
because processes rarely know in advance what their maximum resource
needs will be.
In addition, the number of processes is not fixed,
but dynamically varying, as new users log in and out.
Furthermore, resources that were thought to be available,
can suddenly vanish (tape drives can break).
Thus in practice, few, if any, existing systems use the banker’s
algorithm for avoiding deadlocks.
The schemes described earlier under the name “prevention”,
are overly restrictive,
and the algorithm described here as “avoidance”,
requires information that is usually not available.
If you can think of a general-purpose algorithm,
that does the job in practice as well as in theory,
then write it up and send it to your local computer science journal.
Although both avoidance and prevention are not promising in the
general case,
for specific applications, many excellent special-purpose algorithms are
known.
As an example, in many database systems,
an operation that occurs frequently is:
requesting locks on several records,
and then updating all the locked records.
When multiple processes are running at the same time,
there is a real danger of deadlock.
To eliminate this problem, special techniques are used.
The approach most often used is called two-phase locking.
In the first phase,
the process tries to lock all the records it needs,
one at a time.
If it succeeds, then it begins the second phase,
performing its updates and releasing the locks.
No real work is done in the first phase.
If during the first phase,
some record is needed that is already locked,
then the process just releases all its locks,
and starts the first phase all over.
This approach is similar to requesting all the resources needed in
advance,
or at least before anything irreversible is done.
In some versions of two-phase locking,
if a lock is encountered during the first phase,
then there is no release and restart.
In these versions, deadlock can occur.
However, this strategy is not applicable in general.
In real-time systems and process control systems, for example,
it is not acceptable to just terminate a process partway through,
because a resource is not available, and start all over again.
Neither is it acceptable to start over,
if the process has read or written messages to the network, updated
files,
or anything else that cannot be safely repeated.
The algorithm works only in some situations,
where the programmer has very carefully arranged things,
so that the program can be stopped at any point during the first phase,
and restarted.
Many applications cannot be structured this way.
https://docs.kernel.org/locking/
https://docs.kernel.org/kernel-hacking/locking.html
The top four layers correspond to the four-layered structure of MINIX
3.
Interrupt handling was already covered.
In the following sections we will look briefly at each of the
layers,
with an emphasis on the device drivers.
The device-independent I/O will be discussed later,
when we discuss the file system itself in-depth.
Many device drivers start some I/O device, and then block,
waiting for a message to arrive.
That message is usually generated by the interrupt handler for the
device.
Other device drivers do not start any physical I/O
(e.g., reading from RAM disk and writing to a memory-mapped
display),
do not use interrupts, and do not wait for a message from an I/O
device.
has been presented in great detail, and we will say no more about it
here.
Here we will discuss interrupts for I/O in device drivers.
For disk devices,
input and output involves commanding a device to perform its
operation,
and then waiting until the operation is complete.
The disk controller does most of the work,
and very little is required of the interrupt handler.
However, there is sometimes more for the low-level handler to
do.
The message passing mechanism has a cost.
When an interrupt may occur frequently,
but the amount of I/O handled per interrupt is small,
it may be ideal to make the handler itself do somewhat more work,
and to postpone sending a message to the driver until a subsequent
interrupt,
when there is more for the driver to do.
But, in MINIX 3 this is not possible for most I/O,
because the low level handler in the kernel is a general purpose
routine,
used for almost all devices.
The clock is an exception to that constraint.
Because it is compiled with the kernel,
the clock can have its own handler that does extra work.
On many clock ticks, there is very little to be done,
except for maintaining the time.
This is done without sending a message to the clock task itself.
The clock’s interrupt handler increments a variable,
appropriately named realtime
,
possibly adding a correction for ticks counted during a BIOS call.
The handler does some additional very simple arithmetic;
it increments counters for user time and billing time,
decrements the ticks_left
counter for the current
process,
and tests to see if a timer has expired.
Only if the current process has used up its quantum,
or a timer has expired,
is a message sent to the clock task.
The clock interrupt handler is unique compared to other devices in
MINIX 3,
because the clock is the only interrupt driven device that runs in
kernel space.
The clock hardware is integral to the PC.
The clock interrupt line does not connect to any pin on the sockets
where add-on I/O controllers can be plugged in.
So it is impossible to install a clock upgrade package,
with replacement clock hardware,
with a driver provided by another manufacturer.
It is reasonable, then, for the clock driver to be compiled into the
kernel,
and have access to any variable in kernel space.
MINIX 3 aims to make it unnecessary for any other device driver to have
that kind of access.
The remaining device drivers that run in user space are
restricted,
and cannot directly access kernel memory or I/O ports.
Although possible, it would also violate the design principles of MINIX
3,
to allow an interrupt service routine to make a call to execute a
service routine within the text segment of a user process.
The kernel should not trust code provided by a user program.
There are several different levels of I/O access,
that might be needed by a user-space device driver.
A driver might need access to memory outside its normal data
space.
The memory driver, which manages the RAM disk,
is an example of a driver which needs only this kind of access.
A driver may need to read and write to I/O ports.
These machine-level instructions are available only in kernel
mode.
For example, the hard disk driver needs this kind of access.
A driver may need to respond to predictable interrupts.
For example, the hard disk driver writes commands to the disk
controller,
which causes an interrupt to occur when the desired operation is
complete.
A driver may need to respond to unpredictable interrupts.
The keyboard driver is in this category.
This could be considered a subclass of the preceding item,
but unpredictability complicates things.
All of these cases are supported by kernel calls handled by the system task.
The first case, access to extra memory segments,
uses hardware segmentation support provided by Intel processors.
A normal process has access only to its own text, data, and stack
segments.
However the system task allows other segments to be defined,
and accessed by user-space processes.
The memory driver can access a memory region reserved for use as a
RAM disk,
as well as other regions designated for special access.
The console driver accesses memory on a video display adapter in the
same way.
For the second case, MINIX 3 provides kernel calls,
which user drivers call to use I/O instructions.
The system task does the actual I/O,
on behalf of a less-privileged process.
The hard disk driver uses this service.
The disk driver may have to write to a single output port to select a
disk,
then read from another port, in order to verify the device is
ready.
If response is normally expected to be very quick, polling can be
done.
There are kernel calls to specify a port and data to be written,
or a location for receipt of data read.
This requires that a call to read a port be nonblocking,
and in fact, kernel calls do not block.
Some insurance against device failure is useful.
A polling loop could include a counter,
that terminates the loop if the device does not become ready,
after a certain time.
A way is provided by the MINIX 3 system library,
which provides a getuptime function.
This uses a kernel call to retrieve a counter of clock ticks,
since system startup, maintained by the clock task.
The cost of using this information to keep track of time spent in a
loop,
is the overhead of an additional kernel call on each iteration.
Another possibility is to ask the system task to set a watchdog
timer.
But to receive a notification from a timer on a receive operation,
which will block, is required.
This is not a good solution if a fast response is expected.
The hard disk also makes use of variants of the kernel calls for
I/O,
that make it possible to send to the system task,
a list of ports and data to write, or variables to be altered.
The hard disk driver writes a sequence of byte values to seven output
ports,
to initiate an operation.
The last byte in the sequence is a command,
and when the disk controller completes a command,
it generates an interrupt.
All this can be accomplished with a single kernel call,
greatly reducing the number of messages needed.
This brings us to the third item in the list:
responding to an expected interrupt.
As noted in the discussion of the system task,
when an interrupt is initialized on behalf of a user space program
(using a sys_irqctl
kernel call),
the handler routine for the interrupt is always
generic_handler
,
a function defined as part of the system task.
This routine converts the interrupt,
into a notification message to the process,
on whose behalf the interrupt was set.
The device driver therefore must initiate a receive operation,
after the kernel call that issues the command to the controller.
When the notification is received,
then the device driver services the interrupt.
Although in this case an interrupt is expected,
it is possibility that something might go wrong.
Given the possibility that an interrupt might fail to be
triggered,
a process can request the system task to set up a watchdog timer.
Watchdog timers also generate notification messages,
and thus the receive operation could get a notification,
either because an interrupt occurred, or because a timer expired.
This is not a problem because the notification message indicates its
origin,
although a notification does not convey much other information.
Although both notifications are generated by the system task,
notification of an interrupt will appear to come from HARDWARE,
and notification of a timer expiring will appear to come from CLOCK.
There is another problem.
If an interrupt is received in a timely way,
and a watchdog timer has been set,
then expiration of the timer at some future time,
will be detected by another receive operation,
possibly in the main loop of the driver.
One solution is to make a kernel call,
to disable the timer when the notification from HARDWARE is
received.
Alternatively, if it is likely that the next receive operation will be
one where a message from CLOCK is not expected, such a message could be
ignored and receive called again.
Although less likely, it is conceivable that a disk operation could
occur after an unexpectedly long delay, generating the interrupt only
after the watchdog has timed out.
The same solutions apply here.
When a timeout occurs a kernel call can be made to disable an interrupt,
or a receive operation that does not expect an interrupt could ignore
any message from HARDWARE.
This is a good time to mention that when an interrupt is first
enabled, a kernel call can be made to set a “policy” for the
interrupt.
The policy is simply a flag that determines whether the interrupt should
be automatically reenabled or whether it should remain disabled until
the device driver it serves makes a kernel call to reenable it.
For the disk driver there may be a substantial amount of work to be done
after an interrupt, and thus it may be best to leave the interrupt
disabled until all data has been copied.
The fourth item in our list is the most problematic.
Keyboard support is part of the tty driver, which provides output as
well as input.
Furthermore, multiple devices may be supported.
So input may come from a local keyboard, but it can also come from a
remote user connected by a serial line or a network connection.
And several processes may be running, each producing output for a
different local or remote terminal.
When you do not know when, if ever, an interrupt might occur, you cannot
just make a blocking receive call to accept input from a single source
if the same process may need to respond to other input and output
sources.
MINIX 3 uses several techniques to deal with this problem.
The principal technique used by the terminal driver for dealing with
keyboard input is to make the interrupt response as fast as possible, so
characters will not be lost.
The minimum possible amount of work is done to get characters from the
keyboard hardware to a buffer.
Additionally, when data has been fetched from the keyboard in response
to an interrupt, as soon as the data is buffered the keyboard is read
again before returning from the interrupt.
Interrupts generate notification messages, which do not block the
sender; this helps to prevent loss of input.
A nonblocking receive operation is available, too, although it is only
used to handle messages during a system crash.
Watchdog timers are also used to activate the routine that checks the
keyboard.
For each class of I/O device present in a MINIX 3 system, a separate
I/O device driver is present.
These drivers are full-fledged processes, each one with its own state,
registers, stack, and so on.
Device drivers communicate with the file system using the standard
message passing mechanism used by all MINIX 3 processes.
A simple device driver may be written as a single source file.
For the RAM disk, hard disk, and floppy disk there is a source file to
support each type of device, as well as a set of common routines in
driver.c and drvlib.c to support all blcok device types.
This separation of the hardware-dependent and hardwareindependent parts
of the software makes for easy adaptation to a variety of different
hardware configurations.
Although some common source code is used, the driver for each disk type
runs as a separate process, in order to support rapid data transfers and
isolate drivers from each other.
The terminal driver source code is organized in a similar way, with
the hardware-independent code in tty.c and source code to support
different devices, such as memory-mapped consoles, the keyboard, serial
lines, and pseudo terminals in separate files.
In this case, however, a single process supports all of the different
device types.
For groups of devices such as disk devices and terminals, for which
there are several source files, there are also header files.
Driver.h supports all the block device drivers.
Tty.h provides common definitions for all the terminal devices.
The MINIX 3 design principle of running components of the operating
system as completely separate processes in user space is highly modular
and moderately efficient.
It is also one of the few places where MINIX 3 differs from UNIX in an
essential way.
In MINIX 3 a process reads a file by sending a message to the file
system process.
The file system, in turn, may send a message to the disk driver asking
it to read the needed block.
The disk driver uses kernel calls to ask the system task to do the
actual I/O and to copy data between processes.
This sequence (slightly simplified from reality) is shown in Fig.
3-16(a).
By making these interactions via the message mechanism, we force various
parts of the system to interface in standard ways with other parts.
Figure 3-16.
Two ways of structuring user-system communication.
In UNIX all processes have two parts: a user-space part and a
kernel-space part, as shown in Fig. 3-16(b).
When a system call is made, the operating system switches from the
user-space part to the kernel-space part in a somewhat magical
way.
This structure is a remnant of the MULTICS design, in which the switch
was just an ordinary procedure call, rather than a trap followed by
saving the state of the user-part, as it is in UNIX .
Device drivers in UNIX are simply kernel procedures that are called
by the kernel-space part of the process.
When a driver needs to wait for an interrupt, it calls a kernel
procedure that puts it to sleep until some interrupt handler wakes it
up.
Note that it is the user process itself that is being put to sleep here,
because the kernel and user parts are really different parts of the same
process.
Among operating system designers, arguments about the merits of
monolithic systems, as in UNIX, versus process-structured systems, as in
MINIX 3, are endless.
The MINIX 3 approach is better structured (more modular), has cleaner
interfaces between the pieces, and extends easily to distributed systems
in which the various processes run on different computers.
The UNIX approach is more efficient, because procedure calls are much
faster than sending messages.
MINIX 3 was split into many processes because we believe that with
increasingly powerful personal computers available, cleaner software
structure was worth making the system slightly slower.
The performance loss due to having most of the operating system run in
user space is typically in the range of 5–10%.
Be warned that some operating system designers do not share the belief
that it is worth sacrificing a little speed for a more modular and more
reliable system.
In this chapter, drivers for RAM disk, hard disk, clock, and terminal
are discussed.
The standard MINIX 3 configuration also includes drivers for the floppy
disk and the printer, which are not discussed in detail.
The MINIX 3 software distribution contains source code for additional
drivers for RS-232 serial lines, CD-ROMs, various Ethernet adapter, and
sound cards.
These may be compiled separately and started on the fly at any time.
All of these drivers interface with other parts of the MINIX 3 system
in the same way: request messages are sent to the drivers.
The messages contain a variety of fields used to hold the operation code
(e.g., READ or WRITE) and its parameters.
A driver attempts to fulfill a request and returns a reply message.
For block devices, the fields of the request and reply messages are
shown in Fig. 3-17.
The request message includes the address of a buffer area containing
data to be transmitted or in which received data are expected.
The reply includes status information so the requesting process can
verify that its request was properly carried out.
The fields for the character devices are basically similar but can vary
slightly from driver to driver.
Messages to the terminal driver can contain the address of a data
structure which specifies all of the many configurable aspects of a
terminal, such as the characters to use for the intraline editing
functions erase-character and kill-line.
The function of each driver is to accept requests from other
processes, normally the file system, and carry them out.
All the block device drivers have been written to get a message, carry
it out, and send a reply.
Among other things, this decision means that these drivers are strictly
sequential and do not contain any internal multiprogramming, to keep
them simple.
When a hardware request has been issued, the driver does a receive
operation specifying that it is interested only in accepting interrupt
messages, not new requests for work.
Any new request messages are just kept waiting until the current work
has been done (rendezvous principle).
The terminal driver is slightly different, since a single driver
services several devices.
Thus, it is possible to accept a new request for input from the keyboard
while a request to read from a serial line is still being
fulfilled.
Nevertheless, for each device a request must be completed before
beginning a new one.
The main program for each block device driver is structurally the
same and is outlined in Fig. 3-18.
When the system first comes up, each one of the drivers is started up in
turn to give each a chance to initialize internal tables and similar
things.
Then each device driver blocks by trying to get a message.
When a message comes in, the identity of the caller is saved, and a
procedure is called to carry out the work, with a different procedure
invoked for each operation available.
After the work has been finished, a reply is sent back to the caller,
and the driver then goes back to the top of the loop to wait for the
next request.
Figure 3-17.
Fields of the messages sent by the file system to the block device
drivers and fields of the replies sent back.
Each of the dev_XXX procedures handles one of the operations of which
the driver is capable.
It returns a status code telling what happened.
The status code, which is included in the reply message as the field REP
STATUS, is the count of bytes transferred (zero or positive) if all went
well, or the error number (negative) if something went wrong.
This count may differ from the number of bytes requested.
When the end of a file is reached, the number of bytes available may be
less than number requested.
On terminals at most one line is returned (except in raw mode), even if
the count requested is larger.
In MINIX 3 the file system process contains all the
device-independent I/O code.
The I/O system is so closely related to the file system that they were
merged into one process.
The functions performed by the file system are those shown in Fig. 3-6,
except for requesting and releasing dedicated devices, which do not
exist in MINIX 3 as it is presently configured.
They could, however, easily be added to the relevant device drivers
should the need arise in the future.
message mess; /* message buffer */
void io_driver() {
initialize(); /* only done once, during system init.
*/
while (TRUE) {
receive(ANY, &mess); /* wait for a request for work */
caller = mess.source; /* process from whom message came */
switch(mess.type) {
case READ:
rcode = dev_read(&mess); break;
case WRITE:
rcode = dev_write(&mess); break;
/* Other cases go here, including OPEN, CLOSE, and IOCTL */
default:
rcode = ERROR;
}
mess.type = DRIVER_REPLY;
mess.status = rcode; /* result code */
send(caller, &mess); /* send reply message back to caller */
}
}
Figure 3-18.
Outline of the main procedure of an I/O device driver.
In addition to handling the interface with the drivers, buffering,
and block allocation, the file system also handles protection and the
management of i-nodes, directories, and mounted file systems.
This will be covered in detail in Chap.
5.
The general model outlined earlier in this chapter also applies
here.
Library procedures are available for making system calls and for all the
C functions required by the POSIX standard, such as the formatted input
and output functions printf and scanf.
The standard MINIX 3 configuration contains one spooler daemon, lpd,
which spools and prints files passed to it by the lp command.
The standard MINIX 3 software distribution also provides a number of
daemons that support various network functions.
The MINIX 3 configuration described in this book supports most network
operations, all that is needed is to enable the network server and
drivers for ethernet adapters at startup time.
Recompiling the terminal driver with pseudo terminals and serial line
support will add support for logins from remote terminals and networking
over serial lines (including modems).
The network server runs at the same priority as the memory manager and
the file system, and like them, it runs as a user process.
True to its heritage, MINIX 3 follows the same path as UNIX with
respect to deadlocks of the types described earlier in this chapter: it
just ignores the problem.
Normally, MINIX 3 does not contain any dedicated I/O devices, although
if someone wanted to hang an industry standard DAT tape drive on a PC,
making the software for it would not pose any special problems.
In short, the only place deadlocks can occur are with the implicit
shared resources, such as process table slots, i-node table slots, and
so on.
None of the known deadlock algorithms can deal with resources like these
that are not requested explicitly.
Actually, the above is not strictly true.
Accepting the risk that user processes could deadlock is one thing, but
within the operating system itself a few places do exist where
considerable care has been taken to avoid problems.
The main one is the message-passing interaction between processes.
For instance, user processes are only allowed to use the sendrec
messaging method, so a user process should never lock up because it did
a receive when there was no process with an interest in sending to
it.
Servers only use send or sendrec to communicate with device drivers, and
device drivers only use send or sendrec to communicate with the system
task in the kernel layer.
In the rare case where servers must communicate between themselves, such
as exchanges between the process manager and the file system as they
initialize their parts of the process table, the order of communication
is very carefully designed to avoid deadlock.
Also, at the very lowest level of the message passing system there is a
check to make sure that when a process is about to do a send that the
destination process is not trying to the same thing.
In addition to the above restrictions, in MINIX 3 the new notify
message primitive is provided to handle those situations in which a
message must be sent in the “upstream” direction.
Notify is nonblocking, and notifications are stored when a recipient is
not immediately available.
As we examine the implementation of MINIX 3 device drivers in this
chapter we will see that notify is used extensively.
Locks are another mechanism that can prevent deadlocks.
It is possible to lock devices and files even without operating system
support.
A file name can serve as a truly global variable, whose presence or
absence can be noted by all other processes.
A special directory, /usr/spool/locks/, is usually present on MINIX 3
systems, as on most UNIX -like systems, where processes can create lock
files, to mark any resources they are using.
The MINIX 3 file system also supports POSIX-style advisory file
locking.
But neither of these mechanisms is enforceable.
They depend upon the good behavior of processes, and there is nothing to
prevent a program from trying to use a resource that is locked by
another process.
This is not exactly the same thing as preemption of the resource,
because it does not prevent the first process from attempting to
continue its use of the resource.
In other words, there is no mutual exclusion.
The result of such an action by an ill-behaved process is likely to be a
mess, but no deadlock results.
MINIX 3 supports several different block devices, so we will begin by
discussing common aspects of all block devices.
Then we will discuss the RAM disk, the hard disk, and the floppy
disk.
Each of these is interesting for a different reason.
The RAM disk is a good example to study because it has all the
properties of block devices in general except the actual I/O—because the
“disk” is actually just a portion of memory.
This simplicity makes it a good place to start.
The hard disk shows what a real disk driver looks like.
One might expect the floppy disk to be easier to support than the hard
disk, but, in fact, it is not.
We will not discuss all the details of the floppy disk, but we will
point out several of the complications to be found in the floppy disk
driver.
Looking ahead, after the discussion of block drivers, we will discuss the terminal (keyboard + display) driver, which is important on all systems, and, furthermore is a good example of a character device driver.
Each of these sections describes the relevant hardware, the software
principles behind the driver, an overview of the implementation, and the
code itself.
This structure may make the sections useful reading even for readers who
are not interested in the details of the code itself.
We mentioned earlier that the main procedures of all I/O drivers have
a similar structure.
MINIX 3 always has at least two block device drivers compiled into the
system: the RAM disk driver, and either one of several possible hard
disk drivers or a floppy disk driver.
Usually, there are three block devices—both the floppy disk driver and
an IDE (Integrated Drive Electronics) hard disk driver are
present.
The driver for each block device driver is compiled independently, but a
common library of source code is shared by all of them.
In older versions of MINIX a separate CD-ROM driver was sometimes
present, and could be added if necessary.
Separate CD-ROM drivers are now obsolete.
They used to be necessary to support the proprietary interfaces of
different drive manufacturers, but modern CD-ROM drives are usually
connected to the IDE controller, although on notebook computers some
CD-ROMs are USB.
The full version of the MINIX 3 hard disk device driver includes CD-ROM
support, but we have taken the CD-ROM support out of the driver as
described in this text and listed in Appendix B.
Each block device driver has to do some initialization, of
course.
The RAM disk driver has to reserve some memory, the hard disk driver has
to determine the parameters of the hard disk hardware, and so on.
All of the disk drivers are called individually for hardware-specific
initialization.
After doing whatever may be necessary, each driver then calls the
function containing its main loop.
This loop is executed forever; there is no return to the caller.
Within the main loop a message is received, a function to perform the
operation needed by each message is called, and then a reply message is
generated.
The common main loop called by each disk driver process is compiled when drivers/libdriver/driver.c and the other files in its directory are compiled, and then a copy of the object file driver.o is linked into each disk driver’s executable file.
The technique used is to have each driver pass to the main loop a parameter consisting of a pointer to a table of the addresses of the functions that driver will use for each operation and then to call these functions indirectly.
If the drivers were compiled together in a single executable file
only one copy of the main loop would be needed.
This code was, in fact, first written for an earlier version of MINIX in
which all the drivers were compiled together.
The emphasis in MINIX 3 is on making individual operating system
components as independent as possible, but using common source code for
separate programs is still a good way to increase reliability.
Assuming you get it right once, it will be right for all the
drivers.
Or, a bug found in one use might very well exist unnoticed in other
uses.
Thus, shared source code gets tested more thoroughly.
A number of other functions potentially useful to multiple disk
drivers are defined in drivers/libdriver/drvlib.c, and linking drvlib.o
makes these available.
All of the functionality could have been provided in a single file, but
not all of it is needed by every disk driver.
For instance, the memory driver, which is simpler than other drivers,
links in only driver.o.
The at_wini driver links in both driver.o and drvlib.o.
Figure 3-19 shows an outline of the main loop, in a form similar to
that of Fig. 3-18.
Statements like
code = (*entry_points−>dev_read)(&mess);
are indirect function calls.
A different dev_read function is called by each driver, even though each
driver is executing a main loop compiled from the same source
file.
But some other operations, for example close, are simple enough that
more than one device can call the same function.
There are six possible operations that can be requested of any device
driver.
These correspond to the possible values that can be found in the
m.m_type field of the message of Fig. 3-17.
They are:
1. OPEN
2. CLOSE
3. READ
4. WRITE
5. IOCTL
6. SCATTERED_IO
Many of these operations are most likely familiar to readers with
programming experience.
At the device driver level most operations are related to system calls
with the same name.
For instance, the meanings of READ and WRITE should be fairly
clear.
For each of these operations, a block of data is transferred from the
device to the memory of the process that initiated the call, or vice
versa.
A READ operation normally does not result in a return to the caller
until the data transfer is complete, but an operating system may buffer
data transferred during a WRITE for actual transfer to the destination
at a later time, and return to the caller immediately.
That is fine as far as the caller is concerned; it is then free to reuse
the buffer from which the operating system has copied the data to
write.
OPEN and CLOSE for a device have similar meanings to the way the open
and close system calls apply to operations on files: an OPEN operation
should verify that the device is accessible, or return an error message
if not, and a CLOSE should guarantee that any buffered data that were
written by the caller are completely transferred to their final
destination on the device.
message mess; /* message buffer */
void shared_io_driver(struct driver_table *entry_points) {
/* initialization is done by each driver before calling this */
while (TRUE) {
receive(ANY, &mess);
caller = mess.source;
switch(mess.type) {
case READ:
rcode = (*entry_points−>dev_read)(&mess); break;
case WRITE:
rcode = (*entry_points−>dev_write)(&mess); break;
/* Other cases go here, including OPEN, CLOSE, and IOCTL */
default:
rcode = ERROR;
}
mess.type = DRIVER_REPLY;
mess.status = rcode; /* result code */
send(caller, &mess);
}
}
Figure 3-19.
An I/O driver main procedure using indirect calls.
The IOCTL operation may not be so familiar.
Many I/O devices have operational parameters which occasionally must be
examined and perhaps changed.
IOCTL operations do this.
A familiar example is changing the speed of transmission or the parity
of a communications line.
For block devices, IOCTL operations are less common.
Examining or changing the way a disk device is partitioned is done using
an IOCTL operation in MINIX 3 (although it could just as well have been
done by reading and writing a block of data).
The SCATTERED_IO operation is no doubt the least familiar of
these.
Except with exceedingly fast disk devices (for example, the RAM disk),
satisfactory disk I/O performance is difficult to obtain if all disk
requests are for individual blocks, one at a time.
A SCATTERED_IO request allows the file system to make a request to read
or write multiple blocks.
In the case of a READ operation, the additional blocks may not have been
requested by the process on whose behalf the call is made; the operating
system attempts to anticipate future requests for data.
In such a request not all the transfers requested are necessarily
honored by the device driver.
The request for each block may be modified by a flag bit that tells the
device driver that the request is optional.
In effect the file system can say: “It would be nice to have all these
data, but I do not really need them all right now.” The device can do
what is best for it.
The floppy disk driver, for instance, will return all the data blocks it
can read from a single track, effectively saying, “I will give you
these, but it takes too long to move to another track; ask me again
later for the rest.”
When data must be written, there is no question of its being
optional; every write is mandatory.
Nevertheless, the operating system may buffer a number of write requests
in the hope that writing multiple blocks can be done more efficiently
than handling each request as it comes in.
In a SCATTERED_IO request, whether for reading or writing, the list of
blocks requested is sorted, and this makes the operation more efficient
than handling the requests randomly.
In addition, making only one call to the driver to transfer multiple
blocks reduces the number of messages sent within MINIX 3.
Definitions that are needed by all of the block device drivers are
located in drivers/libdriver/driver.h.
The most important thing in this file is the driver structure, which is
used by each driver to pass a list of the addresses of the functions it
will use to perform each part of its job.
Also defined here is the device structure which holds the most important
information about partitions, the base address, and the size, in byte
units.
This format was chosen so no conversions are necessary when working with
memory-based devices, maximizing speed of response.
With real disks there are so many other factors delaying access that
converting to sectors is not a significant inconvenience.
The source of the main loop and common functions of all the block
device drivers are in driver.c.
After doing whatever hardware-specific initialization may be necessary,
each driver calls driver_task, passing a driver structure as the
argument to the call.
After obtaining the address of a buffer to use for DMA operations the
main loop is entered.
In the switch statement in the main loop, the first five message
types, DEV_OPEN, DEV_CLOSE, DEV_IOCTL, DEV_CANCEL, and DEV_SELECT result
in indirect calls using addresses passed in the driver structure.
The DEV_READ and DEV_WRITE messages both result in direct calls to
do_rdwt; DEV_GATHER and DEV_SCATTER messages both result in direct calls
to do_vrdwt.
The driver structure is passed as an argument by all the calls from
within the switch, whether direct or indirect, so all called functions
can make further use of it as needed.
Do_rdwt and do_vrdwt do some preliminary processing, but then they too
make indirect calls to device-specific routines.
The other cases, HARD_INT, SYS_SIG, and SYN_ALARM, respond to
notifications.
These also result in indirect calls, but upon completion each of these
executes a continue statement.
This causes control to return to the top of the loop, bypassing the
cleanup and reply message steps.
After doing whatever is requested in the message, some sort of
cleanup may be necessary, depending upon the nature of the device.
For a floppy disk, for instance, this might involve starting a timer to
turn off the disk drive motor if another request does not arrive
soon.
An indirect call is used for this as well.
Following the cleanup, a reply message is constructed and sent to the
caller.
It is possible for a routine that services one of the message types to
return a EDONTREPLY value to suppress the reply message, but none of the
current drivers use this option.
The first thing each driver does after entering the main loop is to
make a call to init_buffer, which assigns a buffer for use in DMA
operations.
That this initialization is even necessary at all is due to a quirk of
the hardware of the original IBM PC, which requires that the DMA buffer
not cross a 64K boundary.
That is, a 1-KB DMA buffer may begin at 64510, but not at 64514, because
a buffer starting at the latter address extends just beyond the 64K
boundary at 65536.
This annoying rule occurs because the IBM PC used an old DMA chip,
the Intel 8237A, which contains a 16-bit counter.
A bigger counter is needed because DMA uses absolute addresses, not
addresses relative to a segment register.
On older machines that can address only 1M of memory, the low-order 16
bits of the DMA address are loaded into the 8237A, and the high-order 4
bits are loaded into a 4-bit latch.
Newer machines use an 8-bit latch and can address 16M.
When the 8237A goes from 0xFFFF to 0x0000, it does not generate a carry
into the latch, so the DMA address suddenly jumps down by 64K in
memory.
A portable C program cannot specify an absolute memory location for a
data structure, so there is no way to prevent the compiler from placing
the buffer in an unusable location.
The solution is to allocate an array of bytes twice as large as
necessary at buffer and to reserve a pointer tmp_buf to use for actually
accessing this array.
Init_buffer makes a trial setting of tmp_buf pointing to the beginning
of buffer, then tests to see if that allows enough space before a 64K
boundary is hit.
If the trial setting does not provide enough space, tmp_buf is
incremented by the number of bytes actually required.
Thus some space is always wasted at one end or the other of the space
allocated in buffer, but there is never a failure due to the buffer
falling on a 64K boundary.
Newer computers of the IBM PC family have better DMA controllers, and
this code could be simplified, and a small amount of memory reclaimed,
if one could be sure that one’s machine were immune to this
problem.
If you are considering this, however, consider how the bug will manifest
itself if you are wrong.
If a 1K DMA buffer is desired, the chance is 1 in 64 that there will be
a problem on a machine with the old DMA chip.
Every time the kernel source code is modified in a way that changes the
size of the compiled kernel, there is the same probability that the
problem will manifest itself.
Most likely, when the failure occurs next month or next year, it will be
attributed to the code that was last modified.
Unexpected hardware “features” like this can cause weeks of time spent
looking for exceedingly obscure bugs (all the more so when, like this
one, the technical reference manual says nary a word about them).
Do_rdwt is the next function in driver.c.
It, in turn calls two device-dependent functions pointed to by the
dr_prepare and dr_transfer fields in the driver structure.
Here and in what follows we will use the C language-like notation
(*function_pointer
) to indicate we are talking about the
function pointed to by function_pointer.
After checking to see that the byte count in the request is positive,
do_rdwt calls (*dr_prepare
).
This operation fills in the base and size of the disk, partition, or
subpartition being accessed in a device structure.
For the memory driver, which does not support partitions, it just checks
that the minor device number is valid.
For the hard disk it uses the minor device number to get the size of the
partition or subpartition indicated by the minor device number.
This should succeed, since (*dr_prepare
) can fail only if
an invalid device is specified in an open operation.
Next, an iovec_t structure (which is defined in include/minix/type.h),
iovec1, is filled in.
This structure specifies the virtual address and size of the local
buffer to or from which data will be copied by the system task.
This is the same structure that is used as an element of an array of
requests when the call is for multiple blocks.
The address of a variable and the address of the first element of an
array of the same type of variable can be handled exactly the same
way.
Then comes another indirect call, this time to
(*dr_transfer
), which performs the data copy and I/O
operations required.
The routines that handle transfers all expect to receive an array of
requests.
In do_rdwt the last argument to the call is 1, specifying an array of
one element.
As we will see in the discussion of disk hardware in the next
section, responding to disk requests in the order they are received can
be inefficient, and this routine allows a particular device to handle
requests in the way that is best for the device.
The indirection here masks much possible variation in the way individual
devices perform.
For the RAM disk, dr_transfer points to a routine that makes a kernel
call to ask the system task to copy data from one part of physical
memory to another, if the minor device being accessed is /dev/ram,
/dev/mem, /dev/kmem, /dev/boot, or /dev/zero.
(No copying is required to access /dev/null, of course.) For a real
disk, the code pointed to by dr_transfer also has to ask the system task
for a data transfer.
But before the copy operation (for a read) or after it (for a write) a
kernel call must also be made to ask the system task to do actual I/O,
writing bytes to registers that are part of the disk controller to
select the location on the disk and the size and direction of the
transfer.
In the transfer routine the iov_size count in the iovec1 structure is
modified, returning an error code (a negative number) if there was an
error or a positive number indicating the number of bytes
transferred.
It is not necessarily an error if no bytes are transferred; this
indicates that the end of the device has been reached.
Upon returning to the main loop, the error code or the byte count is
returned in the REP_STATUS field in the reply message from
driver_task.
The next function, do_vrdwt, handles scattered I/O requests.
A message that requests a scattered I/O request uses the ADDRESS field
to point to an array of iovec_t structures, each of which specifies the
address of a buffer and the number of bytes to transfer.
In MINIX 3 such a request can be made only for contiguous blocks on the
disk; the initial offset on the device and whether the operation is a
read or a write are in the message.
So all the operations in one request will be for either reading or
writing, and they will be sorted into block order on the device.
On line 11198 a check is done to see if this call is being done on
behalf of a kernel-space I/O task; this is a vestige of an early phase
of the development of MINIX 3 before all the disk drivers had been
rewritten to run in user space.
Fundamentally, the code for this operation is very similar to that
for the simple read or write performed by do_rdwt.
The same indirect calls to the device-dependent
(*dr_prepare
) and (*dr_transfer
) routines are
made.
The looping in order to handle multiple requests is all done internal to
the function pointed to by (*dr_transfer
).
The last argument in this case is not 1, it is the size of the array of
iovec_t elements.
After termination of the loop the array of requests is copied back where
it came from.
The io_size field of each element in the array will show the number of
bytes transferred for that request, and although the total is not passed
back directly in the reply message that driver_task constructs, the
caller can extract the total from this array.
The next few routines in driver.c are for general support of the
above operations.
A (*dr_name
) call can be used to return the name of a
device.
For a device with no specific name the no_name function returns the
string “noname”.
Some devices may not require a particular service, for instance, a RAM
disk does not require that anything special be done upon a DEV_CLOSE
request.
The do_nop function fills in here, returning various codes depending
upon the kind of request made.
Additional functions, nop_signal, nop_alarm, nop_prepare, nop_cleanup,
and nop_cancel, are similar dummy routines for devices that do not need
these services.
Finally, do_diocntl carries out DEV_IOCTL requests for a block
device.
It is an error if any DEV_IOCTL operation other than reading (DIOCGETP)
or writing (DIOCSETP) partition information is requested.
Do_diocntl calls the device’s (*dr_prepare
) function to
verify the device is valid and to get a pointer to the device structure
that describes the partition base and size in byte units.
On a request to read, it calls the device’s (*dr_geometry
)
function to get the last cylinder, head, and sector information about
the partition.
In each case a sys_datacopy kernel call is made to request that the
system task copy the data between the memory spaces of the driver and
the requesting process.
The files drvlib.h and drvlib.c contain system-dependent code that supports disk partitions on IBM PC compatible computers.
Partitioning allows a single storage device to be divided up into
subdevices.
It is most commonly used with hard disks, but MINIX 3 provides support
for partitioning floppy disks, as well.
Some reasons to partition a disk device are:
Disk capacity is cheaper per unit in large disks.
If two or more operating systems with different file systems are used,
it is more economical to partition a single large disk than to install
multiple smaller disks for each operating system.
Operating systems may have limits to the device size they can
handle.
The version of MINIX 3 discussed here can handle a 4-GB file system, but
older versions are limited to 256 MB.
Any disk space beyond that is wasted.
Two or more different file systems may be used by an operating
system.
For example, a standard file system may be used for ordinary files and a
differently structured file system may be used for virtual memory swap
space.
It may be convenient to put a portion of a system’s files on a
separate logical device.
Putting the MINIX 3 root file system on a small device makes it easy to
back up and facilitates copying it to a RAM disk at boot time.
Support for disk partitions is platform specific.
This specificity is not related to the hardware.
Partition support is device independent.
But if more than one operating system is to run on a particular set of
hardware, all must agree on a format for the partition table.
On IBM PCs the standard is set by the MS-DOS fdisk command, and other
OSs, such as MINIX 3, Windows, and Linux, use this format so they can
coexist with MS-DOS.
When MINIX 3 is ported to another machine type, it makes sense to use a
partition table format compatible with other operating systems used on
the new hardware.
Thus the MINIX 3 source code to support partitions on IBM computers is
put in drvlib.c, rather than being included in driver.c, for two
reasons.
First, not all disk types support partitions.
As noted earlier, the memory driver links to driver.o but does not use
the functions compiled into drvlib.o.
Second, this makes it easier to port MINIX 3 to different
hardware.
It is easier to replace one small file than to edit a large one with
many sections to be conditionally compiled for different
environments.
The basic data structure inherited from the firmware designers is
defined in include/ibm/partition.h, which is included by a #include
statement in drvlib.h.
This includes information on the cylinder-head-sector geometry of each
partition, as well as codes identifying the type of file system on the
partition and an active flag indicating if it is bootable.
Most of this information is not needed by MINIX 3 once the file system
is verified.
The partition function (in drvlib.c, line 11426) is called the first
time a block device is opened.
Its arguments include a driver structure, so it can call device-specific
functions, an initial minor device number, and a parameter indicating
whether the partitioning style is floppy disk, primary partition, or
subpartition.
It calls the device-specific (*dr_prepare
) function to
verify the device is valid and to get the base address and the size into
a device structure of the type mentioned in the previous section.
Then it calls get_part_table to determine if a partition table is
present and, if so, to read it.
If there is no partition table, the work is done.
Otherwise the minor device number of the first partition is computed,
using the rules for numbering minor devices that apply to the style of
partitioning specified in the original call.
In the case of primary partitions the partition table is sorted so the
order of the partitions is consistent with that used by other operating
systems.
At this point another call is made to (*dr_prepare
),
this time using the newly calculated device number of the first
partition.
If the subdevice is valid, then a loop is made over all the entries in
the table, checking that the values read from the table on the device
are not out of the range obtained earlier for the base and size of the
entire device.
If there is a discrepancy, the table in memory is adjusted to
conform.
This may seem paranoid, but since partition tables may be written by
different operating systems, a programmer using another system may have
cleverly tried to use the partition table for something unexpected or
there could be garbage in the table on disk for some other reason.
We put the most trust in the numbers we calculate using MINIX 3.
Better safe than sorry.
Still within the loop, for all partitions on the device, if the
partition is identified as a MINIX 3 partition, partition is called
recursively to gather subpartition information.
If a partition is identified as an extended partition, the next
function, extpartition, is called instead.
Extpartition has nothing to do with MINIX 3 itself, so we will not
discuss details.
Some other operating systems (e.g., Windows) use extended
partitions.
These use linked lists rather than fixed-size arrays to support
subpartitions.
For simplicity MINIX 3 uses the same mechanism for subpartitions as for
primary partitions.
However, minimal support for extended partitions is provided to support
MINIX 3 commands to read and write files and directories of other
operating systems.
These operations are easy; providing full support for mounting and
otherwise using extended partitions in the same way as primary
partitions would be much more complicated.
Get_part_table calls do_rdwt to get the sector on a device (or
subdevice) where a partition table is located.
The offset argument is zero if it is called to get a primary partition
or nonzero for a subpartition.
It checks for the magic number (0xaa55) and returns true or false status
to indicate whether a valid partition table was found.
If a table is found, it copies it to the table address that was passed
as an argument.
Finally, sort sorts the entries in a partition table by lowest
sector.
Entries that are marked as having no partition are excluded from the
sort, so they come at the end, even though they may have a zero value in
their low sector field.
The sort is a simple bubble sort; there is no need to use a fancy
algorithm to sort a list of four items.
Now we will get back to the individual block device drivers and study
several of them in detail.
The first one we will look at is the memory driver.
It can be used to provide access to any part of memory.
Its primary use is to allow a part of memory to be reserved for use like
an ordinary disk, and we will also refer to it as the RAM disk
driver.
A RAM disk does not provide permanent storage, but once files have been
copied to this area they can be accessed extremely quickly.
A RAM disk is also useful for initial installation of an operating
system on a computer with only one removable storage device, whether a
floppy disk, CD-ROM, or some other device.
By putting the root device on the RAM disk, removable storage devices
can be mounted and unmounted as needed to transfer data to the hard
disk.
Putting the root device on a floppy disk would make it impossible to
save files on floppies, since the root device (the only floppy) cannot
be unmounted.
RAM disks also are used with “live” CD-ROMs that allow one to run an
operating system for tests and demonstrations, without copying any files
onto the hard disk.
Having the root device on the RAM disk makes the system highly flexible:
any combination of floppy disks or hard disks can be mounted on
it.
MINIX 3 and many other operating systems are distributed on live
CD-ROMs.
As we shall see, the memory driver supports several other functions
in addition to a RAM disk.
It supports straightforward random access to any part of memory, byte by
byte or in chunks of any size.
Used this way it acts as a character device rather than as a block
device.
Other character devices supported by the memory driver are /dev/zero,
and /dev/null, otherwise known as the great bit bucket in the sky.
The idea behind a RAM disk is simple.
A block device is a storage medium with two commands: write a block and
read a block.
Normally, these blocks are stored on rotating memories, such as floppy
disks or hard disks.
A RAM disk is simpler.
It just uses a preallocated portion of main memory for storing the
blocks.
A RAM disk has the advantage of having instant access (no seek or
rotational delay), making it suitable for storing programs or data that
are frequently accessed.
As an aside, it is worth briefly pointing out a difference between
systems that support mounted file systems and those that do not (e.g.,
MS-DOS and Windows).
With mounted file systems, the root device is always present and in a
fixed location, and removable file systems (i.e., disks) can be mounted
in the file tree to form an integrated file system.
Once everything has been mounted, the user need not worry at all about
which device a file is on.
In contrast, with systems like MS-DOS, the user must specify the
location of each file, either explicitly as in B: DIR FILE or by using
certain defaults (current device, current directory, and so on).
With only one or two floppy disks, this burden is manageable, but on a
large computer system, with dozens of disks, having to keep track of
devices all the time would be unbearable.
Remember that UNIX -like operating systems run on hardware ranging from
small home and office machines to supercomputers such as the IBM Blue
Gene/L supercomputer, the world’s fastest computer as of this writing;
MS-DOS runs only on small systems.
Figure 3-20 shows the idea behind a RAM disk.
The RAM disk is split up into n blocks, depending on how much memory has
been allocated for it.
Each block is the same size as the block size used on the real
disks.
When the driver receives a message to read or write a block, it just
computes where in the RAM disk memory the requested block lies and reads
from it or writes to it, instead of from or to a floppy or hard
disk.
Ultimately the system task is called to carry out the transfer.
This is done by phys_copy, an assembly language procedure in the kernel
that copies to or from the user program at the maximum speed of which
the hardware is capable.
Figure 3-20.
A RAM disk.
A RAM disk driver may support several areas of memory used as RAM
disk, each distinguished by a different minor device number.
Usually, these areas are distinct, but in some fairly specific
situations it may be convenient to have them overlap, as we shall see in
the next section.
The MINIX 3 RAM disk driver is actually six closely related drivers
in one.
Each message to it specifies a minor device as follows:
0: /dev/ram
1: /dev/mem
2: /dev/kmem
3: /dev/null
4: /dev/boot
5: /dev/zero
The first special file listed above, /dev/ram, is a true RAM
disk.
Neither its size nor its origin is built into the driver.
They are determined by the file system when MINIX 3 is booted.
If the boot parameters specify that the root file system is to be on the
RAM disk but the RAM disk size is not specified, a RAM disk of the same
size as the root file system image device is created.
A boot parameter can be used to specify a RAM disk larger than the root
file system, or if the root is not to be copied to the RAM, the
specified size may be any value that fits in memory and leaves enough
memory for system operation.
Once the size is known, a block of memory big enough is found and
removed from the memory pool by the process manager during its
initialization.
This strategy makes it possible to increase or reduce the amount of RAM
disk present without having to recompile the operating system.
The next two minor devices are used to read and write physical memory
and kernel memory, respectively.
When /dev/mem is opened and read, it yields the contents of physical
memory locations starting at absolute address zero (the real-mode
interrupt vectors).
Ordinary user programs never do this, but a system program concerned
with debugging the system might possibly need this facility.
Opening /dev/mem and writing on it will change the interrupt
vectors.
Needless to say, this should only be done with the greatest of caution
by an experienced user who knows exactly what he is doing.
The special file /dev/kmem is like /dev/mem, except that byte 0 of
this file is byte 0 of the kernel’s data memory, a location whose
absolute address varies, depending on the size of the MINIX 3 kernel
text segment.
It too is used mostly for debugging and very special programs.
Note that the RAM disk areas covered by these two minor devices
overlap.
If you know exactly how the kernel is placed in memory, you can open
/dev/mem, seek to the beginning of the kernel’s data area, and see
exactly the same thing as reading from the beginning of /dev/kmem.
But, if you recompile the kernel, changing its size, or if in a
subsequent version of MINIX 3 the kernel is moved somewhere else in
memory, you will have to seek a different amount in /dev/mem to see the
same thing you now see at the start of /dev/kmem.
Both of these special files should be protected to prevent everyone
except the superuser from using them.
The next file in this group, /dev/null, is a special file that
accepts data and throws them away.
It is commonly used in shell commands when the program being called
generates output that is not needed.
For example,
a.out >/dev/null
runs the program a.out but discards its output.
The RAM disk driver effectively treats this minor device as having zero
size, so no data are ever copied to or from it.
If you read from it you will get an immediate EOF (End of File).
If you have looked at the directory entries for these files in /dev/
you may have noticed that, of those mentioned so far, only /dev/ram is a
block special file.
All the others are character devices.
There is one more block device supported by the memory driver.
This is /dev/boot.
From the point of view of the device driver it is another block device
implemented in RAM, just like /dev/ram.
However, it is meant to be initialized by copying a file appended to the
boot image after init into memory, rather than starting with an empty
block of memory, as is done for /dev/ram.
Support for this device is provided for future use and it is not used in
MINIX 3 as described in this text.
Finally, the last device supported by the memory driver is another
character special file, /dev/zero.
It is sometimes convenient to have a source of zeros.
Writing to /dev/zero is like writing to /dev/null; it throws data
away.
But reading /dev/zero gives you zeros, in any quantity you want, whether
a single character or a disk full.
At the driver level, the code for handling /dev/ram, /dev/mem,
/dev/kmem, and /dev/boot is identical.
The only difference among them is that each one corresponds to a
different region of memory, indicated by the arrays ram_origin and
ram_limit, each indexed by minor device number.
The file system manages devices at a higher level.
The file system interprets devices as character or block devices, and
thus can mount /dev/ram and /dev/boot and manage directories and files
on these devices.
For the devices defined as character devices the file system can only
read and write streams of data (although a stream read from /dev/null
gets only EOF).
### Implementation of the RAM Disk Driver in MINIX 3
As with other disk drivers, the main loop of the RAM disk driver is in
the file driver.c.
The device-specific support for memory devices is in memory.c.
When the memory driver is compiled, a copy of the object file called
drivers/libdriver/driver.o, produced by compiling
drivers/libdriver/driver.c, is linked with the object file
drivers/memory/memory.o , the product of compiling
drivers/memory/memory.c .
It may be worth taking a moment to consider how the main loop is
compiled.
The declaration of the driver structure in driver.h defines a data
structure, but does not create one.
The declaration of m_dtab creates an instance of this with each part of
the structure filled in with a pointer to a function.
Some of these functions are generic code compiled when driver.c is
compiled, for instance, all of the nop functions.
Others are code compiled when memory.c is compiled, for instance,
m_do_open.
Note that for the memory driver seven of the entries are do-little or
do-nothing routines and the last two are defined as NULL (which means
these functions will never be called, there is no need even for a
do_nop).
All this is a sure clue that the operation of a RAM disk is not terribly
complicated.
The memory device does not require definition of a large number of
data structures, either.
The array m_geom[NR_DEVS]
holds the base and size of each
of the six memory devices in bytes, as 64 bit unsigned integers, so
there is no immediate danger of MINIX 3 not being able to have a big
enough RAM disk.
The next line defines an interesting structure that will not be seen in
other drivers.
M_seg[NR_DEVS] is apparently just an aray of integers, but these
integers are indices that allow segment descriptors to be found.
The memory device driver is unusual among user-space processes in having
the ability to access regions of memory outside of the ordinary text,
data, and stack segments every process owns.
This array holds the information that allows access to the designated
additional memory regions.
The variable m_device just holds the index into these arrays of the
currently active minor device.
To use /dev/ram as the root device the memory driver must be
initialized very early during startup of MINIX 3.
The kinfo and machine structures that are defined next will hold data
retrieved from the kernel during startup that is necessary for
initializing the memory driver.
One other data structure is defined before the executable code
begins.
This is dev_zero, an array of 1024 bytes, used to supply data when a
read call is made to /dev/zero.
The main procedure main calls one function to do some local
initialization.
After that, it calls the main loop, which gets messages, dispatches to
the appropriate procedures, and sends the replies.
There is no return to main upon completion.
The next function, m_name, is trivial.
It returns the string “memory” when called.
On a read or write operation, the main loop makes three calls: one to
prepare a device, one to do the actual data transfer, and one to do
cleanup.
For a memory device, a call to m_prepare is the first of these.
It checks that a valid minor device has been requested and then returns
the address of the structure that holds the base address and size of the
requested RAM area.
The second call is for m_transfer.
This does all the work.
As we saw in driver.c, all calls to read or write data are transformed
into calls to read or write multiple contiguous blocks of data—if only
one block is needed the request is passed on as a request for multiple
blocks with a count of one.
So only two kinds of transfer requests are passed on to the driver,
DEV_GATHER, requesting a read of one or more blocks, and DEV_SCATTER, a
request to write one or more blocks.
Thus, after getting the minor device number, m_transfer enters a loop,
repeated for the number of transfers requested.
Within the loop there is a switch on the device type.
The first case is for /dev/null, and the action is to return
immediately on a DEV_GATHER request or on a DEV_SCATTER request to fall
through to the end of the switch.
This is so the number of bytes transferred (although this number is zero
for /dev/null) can be returned, as would be done for any write
operation.
For all of the device types that refer to real locations in memory
the action is similar.
The requested offset is checked against the size of the device to
determine that the request is within the bounds of the memory allocated
to the device.
Then a kernel call is made to copy data either to or from the memory of
the caller.
There are two chunks of code that do this, however.
For /dev/ram, /dev/kmem, and /dev/boot virtual addresses are used, which
requires retrieving the segment address of the memory region to be
accessed from the m_seg array, and then making a sys_vircopy kernel
call.
For /dev/mem a physical address is used and the call is to
sys_physcopy.
The remaining operation is a read or write to /dev/zero.
For reading the data is taken from the dev_zero array mentioned
earlier.
You might ask, why not just generate zero values as needed, rather than
copying from a buffer full of them? Since the copying of the data to its
destination has to be done by a kernel call, such a method would require
either an inefficient copying of single bytes from the memory driver to
the system task, or building code to generate zeros into the system
task.
The latter approach would increase the complexity of kernel-space code,
something that we would like to avoid in MINIX 3.
A memory device does not need a third step to finish a read or write operation, and the corresponding slot in m_dtab is a call to nop_finish.
Opening a memory device is done by m_do_open.
The job is done by calling m_prepare to check that a valid device is
being referenced.
More interesting than the code that exists is a comment about code that
was found here in older versions of MINIX.
Previously a trick was hidden here.
A call by a user process to open /dev/mem or /dev/kmem would also
magically confer upon the caller the ability to execute instructions
which access I/O ports.
Pentium-class CPUs implement four privilege levels, and user processes
normally run at the least-privileged level.
The CPU generates a general protection exception when an process tries
to execute an instruction not allowed at its privilege level.
Providing a way to get around this was considered safe because the
memory devices could only be accessed by a user with root
privileges.
In any case, this possibly risky “feature” is absent from MINIX 3
because kernel calls that allow I/O access via the system task are now
available.
The comment remains, to point out that if MINIX 3 is ported to hardware
that uses memory-mapped I/O such a feature might need to be
reintroduced.
The function to do this, enable_iop, remains in the kernel code to show
how this can be done, although it is now an orphan.
The next function, m_init, is called only once, when mem_task is
called for the first time.
This routine uses a number of kernel calls, and is worth study to see
how MINIX 3 drivers interact with kernel space by using system task
services.
First a sys_getkinfo kernel call is made to get a copy of the kernel’s
kinfo data.
From this data it copies the base address and size of /dev/kmem into the
corresponding fields of the m_geom data structure.
A different kernel call, sys_segctl, converts the physical address and
size of /dev/kmem into the segment descriptor information needed to
treat the kernel memory as a virtual memory space.
If an image of a boot device has been compiled into the system boot
image, the field for the base address of /dev/boot will be
non-zero.
If this is so, then information to access the memory region for this
device is set up in exactly the same way it was done for
/dev/kmem.
Next the array used to supply data when /dev/zero is accessed is
explicitly filled with zeros.
This is probably unnecessary; C compilers are supposed to initialize
newly created static variables to all zeros.
Finally, m_init uses a sys_getmachine kernel call to get another set
of data from the kernel, the machine structure which flags various
possible hardware alternatives.
In this case the information needed is whether or not the CPU is capable
of protected mode operation.
Based on this information the size of /dev/mem is set to either 1 MB, or
4 GB − 1, depending upon whether MINIX 3 is running in 8088 or 80386
mode.
These sizes are the maximum sizes supported by MINIX 3 and do not have
anything to do with how much RAM is installed in the machine.
Only the size of the device is set; the compiler is trusted to set the
base address correctly to zero.
Also, since /dev/mem is accessed as physical (not virtual) memory there
is no need to make a sys_segctl kernel call to set up a segment
descriptor.
Before we leave m_init we should mention another kernel call used
here, although it is not obvious in the code.
Many of the actions taken during initialization of the memory driver are
essential to proper functioning of MINIX 3, and thus several tests are
made and panic is called if a test fails.
In this case panic is a library routine which ultimately results in a
sys_exit kernel call.
The kernel and (as we shall see) the process manager and the file system
have their own panic routines.
The library routine is provided for device drivers and other small
system components.
Surprisingly, the function we just examined, m_init, does not
initialize the quintessential memory device, /dev/ram.
This is taken care of in the next function, m_ioctl.
In fact, there is only one ioctl operation defined for the RAM disk;
this is MIOCRAMSIZE, which is used by the file system to set the RAM
disk size.
Much of the job is done without requiring any services from the
kernel.
The call to allocmem is a system call, but not a kernel call.
It is handled by the process manager, which maintains all of the
information necessary to find an available region of memory.
However, at the end one kernel call is needed.
At line 11894 a sys_segctl call is made to convert the physical address
and size returned by allocmem into the segment information needed for
further access.
The last function defined in memory.c is m_geometry.
This is a fake.
Obviously, cylinders, heads, and sectors are irrelevant in addressing
semiconductor memory, but if a request is made for such information for
a memory device this function pretends it has 64 heads and 32 sectors
per track, and calculates from the size how many cylinders there
are.
All modern computers except embedded ones have disk drives.
For that reason, we will now study them, starting with the hardware,
then moving on to say some general things about disk software.
After that we will delve into the way MINIX 3 controls its disks.
All real disks are organized into cylinders, each one containing as
many tracks as there are heads stacked vertically.
The tracks are divided into sectors, with the number of sectors around
the circumference typically being 8 to 32 on floppy disks, and up to
several hundred on some hard disks.
The simplest designs have the same number of sectors on each
track.
All sectors contain the same number of bytes, although a little thought
will make it clear that sectors close to the outer rim of the disk will
be physically longer than those close to the hub.
The time to read or write each sector will be same, however.
The data density is obviously higher on the innermost cylinders, and
some disk designs require a change in the drive current to the
read-write heads for the inner tracks.
This is handled by the disk controller hardware and is not visible to
the user (or the implementer of an operating system).
The difference in data density between inner and outer tracks means a
sacrifice in capacity, and more sophisticated systems exist.
Floppy disk designs that rotate at higher speeds when the heads are over
the outer tracks have been tried.
This allows more sectors on those tracks, increasing disk
capacity.
Such disks are not supported by any system for which MINIX 3 is
currently available, however.
Modern large hard drives also have more sectors per track on outer
tracks than on inner tracks.
These are IDE (Integrated Drive Electronics) drives, and the
sophisticated processing done by the drive’s built-in electronics masks
the details.
To the operating system they appear to have a simple geometry with the
same number of sectors on each track.
The drive and controller electronics are as important as the
mechanical hardware.
The main element of the disk controller is a specialized integrated
circuit, really a small microcomputer.
Once this would have been on a card plugged into the computer’s
backplane, but on modern systems, the disk controller is on the
parentboard.
For a modern hard disk this disk controller circuitry may be simpler
than for a floppy disk, since a hard drive has a powerful electronic
controller integrated into the drive itself.
A device feature that has important implications for the disk driver
is the possibility of a controller doing seeks on two or more drives at
the same time.
These are known as overlapped seeks.
While the controller and software are waiting for a seek to complete on
one drive, the controller can initiate a seek on another drive.
Many controllers can also read or write on one drive while seeking on
one or more other drives, but a floppy disk controller cannot read or
write on two drives at the same time.
(Reading or writing requires the controller to move bits on a
microsecond time scale, so one transfer uses up most of its computing
power.) The situation is different for hard disks with integrated
controllers, and in a system with more than one of these hard drives
they can operate simultaneously, at least to the extent of transferring
between the disk and the controller’s buffer memory.
Only one transfer between the controller and the system memory is
possible at once, however.
The ability to perform two or more operations at the same time can
reduce the average access time considerably.
One thing to be aware of in looking at the specifications of modern
hard disks is that the geometry specified, and used by the driver
software, is almost always different from the physical format.
In fact, if you look up the “recommended setup parameters” for a large
hard disk, you are likely to find it specified as 16383 cylinders, 16
heads, and 63 sectors per track, no matter what the size of the
disk.
These numbers correspond to a disk size of 8 GB, but are used for all
disks this size or larger.
The designers of the original IBM PC ROM BIOS allotted a 6-bit field for
the sector count, 4 bits to specify the head, and 14 bits to select a
cylinder.
With 512 byte sectors this comes out to 8 GB.
So if you try to install a large hard drive into a very old computer you
may find you can access only 8 GB, even though you have a much bigger
disk.
The usual way around this limitation is to use logical block addressing
in which disk sectors are just numbered consecutively starting at zero,
without regard to the disk geometry.
The geometry of a modern disk is a fiction, anyway.
On a modern disk the surface is divided into 20 or more zones.
Zones closer to the center of the disk have fewer sectors per track than
zones nearer the periphery.
Thus sectors have approximately the same physical length no matter where
they are located on the disk, making more efficient use of the disk
surface.
Internally, the integrated controller addresses the disk by calculating
the zone, cylinder, head, and sector.
But this is never visible to the user, and the details are rarely found
in published specifications.
The bottom line is, there is no point to using cylinder, head, sector
addressing of a disk unless you are working with a very old computer
that does not support logical block addressing.
Also, it does not make sense to buy a new 400 GB drive for the PC-XT you
bought in 1983; you will get no more than 8 GB use out of it.
This is a good place to mention a confusing point about disk capacity
specifications.
Computer professionals are accustomed to using powers of 2—a Kilobyte
(KB) is 210 = 1024 bytes, a Megabyte (MB) is 220 = 10242 bytes, etc., to
express the size of memory devices.
A Gigabyte (GB), then, should be 10243 , or 230 bytes.
However, disk manufacturers have adopted the habit of using the term
“Gigabyte” to mean 109 , which (on paper) instantly increases the size
of their products.
Thus the 8 GB limit mentioned above is an 8.4 GB disk in the language of
the disk salesman.
Recently there has been a move toward using the term Gibibyte (GiB) to
mean 230 .
However, in this text the authors, being set in their ways and in
protest of the hijacking of tradition for advertising purposes, will
continue to use terms like Megabyte and Gigabyte to mean what they have
always meant.
Although modern disks are much faster than older ones, improvements
in CPU performance have far exceeded improvements in disk
performance.
It has occurred to various people over the years that parallel disk I/O
might be helpful.
Thus has come about a new class of I/O device called a RAID, an acronym
for Redundant Array of Independent Disks.
Actually, the designers of RAID (at Berkeley) originally used the
acronym RAID to stand for “Redundant Array of Inexpensive Disks” to
contrast this design with a SLED (Single Large Expensive Disk).
However, when RAID became commercially popular, disk manufacturers
changed the meaning of the acronym because it was tough to sell an
expensive product whose name stood for “inexpensive.” The basic idea
behind a RAID is to install a box full of disks next to the computer,
typically a large server, replace the disk controller card with a RAID
controller, copy the data over to the RAID, and then continue normal
operation.
The independent disks can be used together in a variety of
ways.
We do not have space for an exhaustive description of all of these, and
MINIX 3 does not (yet) support RAID, but an introduction to operating
systems should at least mention some of the possibilities.
RAID can be used both to speed disk access and to make data more
secure.
For example, consider a very simple RAID of two drives.
When multiple sectors of data are to be written to the “disk” the RAID
controller sends sectors 0, 2, 4, etc., to the first drive, and sectors
1, 3, 5, etc., to the second drive.
The controller divides up the data and the two disks are written
simultaneously, doubling the writing speed.
When reading, both drives are read simultaneously, but the controller
reassembles the data in the proper order, and to the rest of the system
it just looks like the reading speed is twice as fast.
This technique is called striping.
This is a simple example of RAID level 0.
In practice four or more drives would be used.
This works best when data are usually read or written in large
blocks.
Obviously, nothing is gained if a typical disk request is for a single
sector at a time.
The previous example shows how multiple drives can increase
speed.
What about reliability? RAID level 1 works like RAID level 0, except the
data is duplicated.
Again, a very simple array of two drives could be used, and all of the
data could be written to both of them.
This provides no speedup, but there is 100% redundancy.
If an error is detected during reading there is no need for a retry if
the other drive reads the data correctly.
The controller just has to make sure the correct data is passed on to
the system.
It probably would not be a good idea to skip retries if errors are
detected while writing, however.
And if errors occur frequently enough that skipping retries actually
makes reading noticeably faster it is probably time to decide complete
failure is imminent.
Typically the drives used for RAIDs are hot-swappable, meaning they can
be replaced without powering down the system.
More complex arrays of multiple disks can increase both speed and
reliability.
Consider, for instance, an array of 7 disks.
Bytes could be split into 4-bit nybbles, with each bit being recorded on
one of four drives and with the other three drives being used to record
a three bit error-correcting code.
If a drive goes bad and needs to be hot-swapped for a new one, a missing
drive is equivalent to one bad bit, so the system can keep running while
maintenance is done.
For the cost of seven drives you get reliable performance that is four
times as fast as one drive, and no downtime.
In this section we will look at some issues related to disk drivers
in general.
First, consider how long it takes to read or write a disk block.
The time required is determined by three factors:
The seek time (the time to move the arm to the proper cylinder).
The rotational delay (the time for the proper sector to rotate under the head).
The actual data transfer time.
For most disks, the seek time dominates the other two times, so reducing the mean seek time can improve system performance substantially.
Disk devices are prone to errors.
Some kind of error check, a checksum or a cyclic redundancy check, is
always recorded along with the data in each sector on a disk.
Even the sector addresses recorded when the disk is formatted have check
data.
Floppy disk controller hardware can usually report when an error is
detected, but the software must then decide what to do about it.
Hard disk controllers often take on much of this burden.
Particularly with hard disks, the transfer time for consecutive
sectors within a track can be very fast.
Thus reading more data than requested and caching it in memory can be
very effective in speeding disk access.
Disk Arm Scheduling Algorithms
If the disk driver accepts requests one at a time and carries them
out in that order, that is, First-Come, First-Served (FCFS), little can
be done to optimize seek time.
However, another strategy is possible when the disk is heavily
loaded.
It is likely that while the arm is seeking on behalf of one request,
other disk requests may be generated by other processes.
Many disk drivers maintain a table, indexed by cylinder number, with all
pending requests for each cylinder chained together in a linked list
headed by the table entries.
Given this kind of data structure, we can improve upon the
first-come, firstserved scheduling algorithm.
To see how, consider a disk with 40 cylinders.
A request comes in to read a block on cylinder 11.
While the seek to cylinder 11 is in progress, new requests come in for
cylinders 1, 36, 16, 34, 9, and 12, in that order.
They are entered into the table of pending requests, with a separate
linked list for each cylinder.
The requests are shown in Fig. 3-21.
Figure 3-21.
Shortest Seek First (SSF) disk scheduling algorithm.
When the current request (for cylinder 11) is finished, the disk
driver has a choice of which request to handle next.
Using FCFS, it would go next to cylinder 1, then to 36, and so on.
This algorithm would require arm motions of 10, 35, 20, 18, 25, and 3,
respectively, for a total of 111 cylinders.
Alternatively, it could always handle the closest request next, to
minimize seek time.
Given the requests of Fig. 3-21, the sequence is 12, 9, 16, 1, 34, and
36, as shown as the jagged line at the bottom of Fig. 3-21.
With this sequence, the arm motions are 1, 3, 7, 15, 33, and 2, for a
total of 61 cylinders.
This algorithm, Shortest Seek First (SSF), cuts the total arm motion
almost in half compared to FCFS.
Unfortunately, SSF has a problem.
Suppose that more requests keep coming in while the requests of Fig.
3-21 are being processed.
For example, if, after going to cylinder 16, a new request for cylinder
8 is present, that request will have priority over cylinder 1.
If a request for cylinder 13 then comes in, the arm will next go to 13,
instead of 1.
With a heavily loaded disk, the arm will tend to stay in the middle of
the disk most of the time, so requests at either extreme will have to
wait until a statistical fluctuation in the load causes there to be no
requests near the middle.
Requests far from the middle may get poor service.
The goals of minimal response time and fairness are in conflict
here.
Tall buildings also have to deal with this trade-off.
The problem of scheduling an elevator in a tall building is similar to
that of scheduling a disk arm.
Requests come in continuously calling the elevator to floors (cylinders)
at random.
The microprocessor running the elevator could easily keep track of
the sequence in which customers pushed the call button and service them
using FCFS.
It could also use SSF.
However, most elevators use a different algorithm to reconcile the
conflicting goals of efficiency and fairness.
They keep moving in the same direction until there are no more
outstanding requests in that direction, then they switch
directions.
This algorithm, known both in the disk world and the elevator world as
the elevator algorithm, requires the software to maintain 1 bit: the
current direction bit, UP or DOWN.
When a request finishes, the disk or elevator driver checks the
bit.
If it is UP, the arm or cabin is moved to the next highest pending
request.
If no requests are pending at higher positions, the direction bit is
reversed.
When the bit is set to DOWN, the move is to the next lowest requested
position, if any.
Figure 3-22 shows the elevator algorithm using the same seven
requests as Fig. 3-21, assuming the direction bit was initially
UP.
The order in which the cylinders are serviced is 12, 16, 34, 36, 9, and
1, which yields arm motions of 1, 4, 18, 2, 27, and 8, for a total of 60
cylinders.
In this case the elevator algorithm is slightly better than SSF,
although it is usually worse.
One nice property that the elevator algorithm has is that given any
collection of requests, the upper bound on the total motion is fixed: it
is just twice the number of cylinders.
Figure 3-22.
The elevator algorithm for scheduling disk requests.
A slight modification of this algorithm that has a smaller variance
in response times is to always scan in the same direction (Teory,
1972).
When the highest numbered cylinder with a pending request has been
serviced, the arm goes to the lowest-numbered cylinder with a pending
request and then continues moving in an upward direction.
In effect, the lowest-numbered cylinder is thought of as being just
above the highest-numbered cylinder.
Some disk controllers provide a way for the software to inspect the
current sector number under the head.
With such a controller, another optimization is possible.
If two or more requests for the same cylinder are pending, the driver
can issue a request for the sector that will pass under the head
next.
Note that when multiple tracks are present in a cylinder, consecutive
requests can be for different tracks with no penalty.
The controller can select any of its heads instantaneously, because head
selection involves neither arm motion nor rotational delay.
With a modern hard disk, the data transfer rate is so much faster
than that of a floppy disk that some kind of automatic caching is
necessary.
Typically any request to read a sector will cause that sector and up to
the rest of the current track to be read, depending upon how much space
is available in the controller’s cache memory.
Current caches are often 8 MB or more.
When several drives are present, a pending request table should be
kept for each drive separately.
Whenever any drive is idle, a seek should be issued to move its arm to
the cylinder where it will be needed next (assuming the controller
allows overlapped seeks).
When the current transfer finishes, a check can be made to see if any
drives are positioned on the correct cylinder.
If one or more are, the next transfer can be started on a drive that is
already on the right cylinder.
If none of the arms is in the right place, the driver should issue a new
seek on the drive that just completed a transfer and wait until the next
interrupt to see which arm gets to its destination first.
Error Handling
RAM disks do not have to worry about seek or rotational optimization:
at any instant all blocks can be read or written without any physical
motion.
Another area in which RAM disks are simpler than real disks is error
handling.
RAM disks always work; real ones do not always work.
They are subject to a wide variety of errors.
Some of the more common ones are:
It is up to the disk driver to handle each of these as best it can.
Programming errors occur when the driver tells the controller to seek
to a nonexistent cylinder, read from a nonexistent sector, use a
nonexistent head, or transfer to or from nonexistent memory.
Most controllers check the parameters given to them and complain if they
are invalid.
In theory, these errors should never occur, but what should the driver
do if the controller indicates that one has happened? For a home-grown
system, the best thing to do is stop and print a message like “Call the
programmer” so the error can be tracked down and fixed.
For a commercial software product in use at thousands of sites around
the world, this approach is less attractive.
Probably the only thing to do is terminate the current disk request with
an error and hope it will not recur too often.
Transient checksum errors are caused by specks of dust in the air
that get between the head and the disk surface.
Most of the time they can be eliminated by just repeating the operation
a few times.
If the error persists, the block has to be marked as a bad block and
avoided.
One way to avoid bad blocks is to write a very special program that
takes a list of bad blocks as input and carefully hand crafts a file
containing all the bad blocks.
Once this file has been made, the disk allocator will think these blocks
are occupied and never allocate them.
As long as no one ever tries to read the bad block file, no problems
will occur.
Not reading the bad block file is easier said than done.
Many disks are backed up by copying their contents a track at a time to
a backup tape or disk drive.
If this procedure is followed, the bad blocks will cause trouble.
Backing up the disk one file at a time is slower but will solve the
problem, provided that the backup program knows the name of the bad
block file and refrains from copying it.
Another problem that cannot be solved with a bad block file is the
problem of a bad block in a file system data structure that must be in a
fixed location.
Almost every file system has at least one data structure whose location
is fixed, so it can be found easily.
On a partitioned file system it may be possible to repartition and work
around a bad track, but a permanent error in the first few sectors of
either a floppy or hard disk generally means the disk is unusable.
“Intelligent” controllers reserve a few tracks not normally available
to user programs.
When a disk drive is formatted, the controller determines which blocks
are bad and automatically substitutes one of the spare tracks for the
bad one.
The table that maps bad tracks to spare tracks is kept in the
controller’s internal memory and on the disk.
This substitution is transparent (invisible) to the driver, except that
its carefully worked out elevator algorithm may perform poorly if the
controller is secretly using cylinder 800 whenever cylinder 3 is
requested.
The technology of manufacturing disk recording surfaces is better than
it used to be, but it is still not perfect.
However, the technology of hiding the imperfections from the user has
also improved.
Many controllers also manage new errors that may develop with use,
permanently assigning substitute blocks when they determine that an
error is unrecoverable.
With such disks the driver software rarely sees any indication that
there any bad blocks.
Seek errors are caused by mechanical problems in the arm.
The controller keeps track of the arm position internally.
To perform a seek, it issues a series of pulses to the arm motor, one
pulse per cylinder, to move the arm to the new cylinder.
When the arm gets to its destination, the controller reads the actual
cylinder number (written when the drive was formatted).
If the arm is in the wrong place, a seek error has occurred and some
corrective action is required.
Most hard disk controllers correct seek errors automatically, but
many floppy controllers (including the IBM PCs) just set an error bit
and leave the rest to the driver.
The driver handles this error by issuing a recalibrate command, to move
the arm as far out as it will go and reset the controller’s internal
idea of the current cylinder to 0.
Usually, this solves the problem.
If it does not, the drive must be repaired.
As we have seen, the controller is really a specialized little
computer, complete with software, variables, buffers, and occasionally,
bugs.
Sometimes an unusual sequence of events such as an interrupt on one
drive occurring simultaneously with a recalibrate command for another
drive will trigger a bug and cause the controller to go into a loop or
lose track of what it was doing.
Controller designers usually plan for the worst and provide a pin on the
chip which, when asserted, forces the controller to forget whatever it
was doing and reset itself.
If all else fails, the disk driver can set a bit to invoke this signal
and reset the controller.
If that does not help, all the driver can do is print a message and give
up.
Track-at-a-Time Caching
The time required to seek to a new cylinder is usually much more than
the rotational delay, and always vastly more than the transfer time to
read or write one sector.
In other words, once the driver has gone to the trouble of moving the
arm somewhere, it hardly matters whether it reads one sector or a whole
track.
This effect is especially true if the controller provides rotational
sensing, so the driver can see which sector is currently under the head
and issue a request for the next sector, thereby making it possible to
read an entire disk track in a single rotation time.
(Normally it takes half a rotation plus one sector time just to read a
single sector, on the average.)
Some disk drivers take advantage of these timing properties by
maintaining a secret track-at-a-time cache, unknown to the
device-independent software.
If a sector that is in the cache is needed, no disk transfer is
required.
A disadvantage of track-at-a-time caching (in addition to the software
complexity and buffer space needed) is that transfers from the cache to
the calling program will have to be done by the CPU using a programmed
loop, rather than letting the DMA hardware do the job.
Some controllers take this process a step further, and do
track-at-a-time caching in their own internal memory, transparent to the
driver, so that transfer between the controller and memory can use
DMA.
If the controller works this way, there is little point in having the
disk driver do it as well.
Note that both the controller and the driver are in a good position to
read and write entire tracks in one command, but that the
device-independent software cannot, because it regards a disk as a
linear sequence of blocks, without regard to how they are divided up
into tracks and cylinders.
Only the controller knows the true geometry for sure.
The hard disk driver is the first part of MINIX 3 we have looked at
that has to deal with a range of different types of hardware.
Before we discuss the driver, we will briefly consider some of the
problems hardware differences can cause.
The “PC” is really a family of different computers.
Not only are different processors used in different members of the
family, there are also some major differences in the basic
hardware.
MINIX 3 has been developed on and for newer systems with Pentium-class
CPUs, but even among these there are differences.
For instance, the oldest Pentium systems use the 16-bit AT bus
originally designed for the 80286 processor.
A feature of the AT bus is that it was cleverly designed so older 8-bit
peripherals could still be used.
Later systems added a 32-bit PCI bus for peripherals, while still
providing AT bus slots.
The newest designs have dropped AT-bus support, providing only a PCI
bus.
But it is reasonable to expect that users with computers of a certain
age may want to be able to use MINIX 3 with a mix of 8-bit, 16-bit, and
32-bit peripherals.
For every bus there is a different family of I/O adapters.
On older systems these are separate circuit boards which plug into the
system parentboard.
On newer systems many standard adapters, especially disk controllers,
are integrated parts of the parentboard chipset.
In itself this is not a problem for the programmer, as integrated
adapters usually have a software interface identical to that of
removable devices.
Also, integrated controllers can usually be disabled.
This allows use of a more advanced add-on device, such as a SCSI
controller, in place of a built-in device.
To take advantage of this flexibility the operating system should not be
restricted to using just one kind of adapter.
In the IBM PC family, as in most other computer systems, each bus
design also comes with firmware in the Basic I/O System Read-Only Memory
(the BIOS ROM) which is designed to bridge the gap between the operating
system and the peculiarities of the hardware.
Some peripheral devices may even provide extensions to the BIOS in ROM
chips on the peripheral cards themselves.
The difficulty faced by an operating system implementer is that the BIOS
in IBM-type computers (certainly the early ones) was designed for an
operating system, MS-DOS, that does not support multiprogramming and
that runs in 16-bit real mode, the lowest common denominator of the
various modes of operation available from the 80x86 family of CPUs.
The implementer of a new operating system for the IBM PC is thus
faced with several choices.
One is whether to use the driver support for peripherals in the BIOS or
to write new drivers from scratch.
This was not a hard choice in the design of early versions of MINIX,
since the BIOS was in many ways not suitable to its needs.
Of course, to start MINIX 3 the boot monitor uses the BIOS to do the
initial loading of the system, whether from hard disk, CD-ROM, or floppy
disk— there is no practical alternative to doing it this way.
Once we have loaded the system, including our own I/O drivers, we can do
better than the BIOS.
The second choice then must be faced: without the BIOS support how
are we going to make our drivers adapt to the varied kinds of hardware
on different systems? To make the discussion concrete, consider that
there are two fundamentally different types of hard disk controller
usable on the modern 32-bit Pentium systems for which MINIX 3 has been
designed: the integrated IDE controller and add-on SCSI controllers for
the PCI bus.
If you would like to take advantage of older hardware and adapt MINIX 3
to work on the hardware targeted by earlier versions of MINIX, there are
four hard disk controller types to consider: the original 8-bit XT-type
controller, the 16-bit AT-type controller, and two different controllers
for two different types of IBM PS/2 series computers.
There are several possible ways to deal with all these alternatives:
Recompile a unique version of the operating system for each type of hard disk controller we need to accommodate.
Compile several different hard disk drivers into the boot image and have the system automatically determine at startup time which one to use.
Compile several different hard disk drivers into the boot image and provide a way for the user to determine which one to use.
As we shall see, these are not mutually exclusive.
The first way is really the best way in the long run.
For use on a particular installation there is no need to use up disk and
memory space with code for alternative drivers that will never be
used.
However, it is a nightmare for the distributor of the software.
Supplying four different startup disks and advising users on how to use
them is expensive and difficult.
Thus, another method is advisable, at least for the initial
installation.
The second method is to have the operating system probe the
peripherals, by reading the ROM on each card or writing and reading I/O
ports to identify each card.
This is possible (and works better on newer IBM-type systems than on
older ones), but it does not accommodate nonstandard I/O devices.
Also, probing I/O ports to identify one device sometimes can activate
another device which seizes control and disables the system.
This method complicates the startup code for each device, and yet still
does not work very well.
Operating systems that do use this method generally have to provide some
kind of override, typically a mechanism such as we use with MINIX 3.
The third method, used in MINIX 3, is to allow inclusion of several
drivers in the boot image.
The MINIX 3 boot monitor allows various boot parameters to be read at
startup time.
These can be entered by hand, or stored permanently on the disk.
At startup time, if a boot parameter of the form
label = AT
is found, this forces the IDE disk controller (at_wini) to be used
when MINIX 3 is started.
This depends upon the at_wini driver being assigned this label.
Labels are assigned when the boot image is compiled.
There are two other things MINIX 3 does to try to minimize problems
with multiple hard disk drivers.
One is that there is, after all, a driver that interfaces between MINIX
3 and the ROM BIOS hard disk support.
This driver is almost guaranteed to work on any system and can be
selected by use of a
label=BIOS
boot parameter.
Generally, this should be a last resort, however.
MINIX 3 as described here runs only in protected mode on systems with an
80386 or better processor, but the BIOS code always runs in real (8086)
mode.
Switching out of protected mode and back again whenever a routine in the
BIOS is called is very slow.
The other strategy MINIX 3 uses in dealing with drivers is to postpone
initialization until the last possible moment.
Thus, if on some hardware configuration none of the hard disk drivers
work, we can still start MINIX 3 from a floppy disk and do some useful
work.
MINIX 3 will have no problems as long as no attempt is made to access
the hard disk.
This may not seem like a major breakthrough in user friendliness, but
consider this: if all the drivers try to initialize immediately on
system startup, the system can be totally paralyzed by improper
configuration of some device we do not need anyway.
By postponing initialization of each driver until it is needed, the
system can continue with whatever does work, while the user tries to
resolve the problems.
We learned this lesson the hard way: earlier versions of MINIX tried
to initialize the hard disk as soon as the system was booted.
If no hard disk was present, the system hung.
This behavior was especially unfortunate because MINIX would run quite
happily on a system without a hard disk, albeit with restricted storage
capacity and reduced performance.
In the discussion in this section and the next, we will take as our
model the AT-style hard disk driver, which is the default driver in the
standard MINIX 3 distribution.
This is a versatile driver that handles hard disk controllers from the
ones used in the earliest 80286 systems to modern EIDE (Extended
Integrated Drive Electronics) controllers that handle gigabyte capacity
hard disks.
Modern EIDE controllers also support standard CD-ROM drives.
However, in order to simplify our discussion the extensions that support
CD-ROMs have been taken out of the code listed in Appendix B.
The general aspects of hard disk operation we discuss in this section
apply to the other supported drivers as well.
The main loop of the hard disk driver is the same common code we have
already discussed, and supports the standard nine kinds of requests that
can be made.
A DEV_OPEN request can entail a substantial amount of work, as there are
always partitions and may be subpartitions on a hard disk.
These must be read when a device is opened, (i.e., when it is first
accessed).
When CD-ROMs are supported, on a DEV_OPEN the presence of the medium
must be verified, since it is removable.
On a CD-ROM a DEV_CLOSE operation also has meaning: it requires that the
door be unlocked and the CD-ROM ejected.
There are other complications of removable media that are more
applicable to floppy drives, so we will discuss these in a later
section.
For CD-ROMs a DEV_IOCTL operation is used to set a flag to mark that the
medium should be ejected from the drive upon a DEV_CLOSE.
A DEV_IOCTL operation is also used to read and write partition
tables.
DEV_READ, DEV_WRITE, DEV_GATHER and DEV_SCATTER requests are each
handled in two phases, prepare and transfer, as we saw previously.
For the hard disk DEV_CANCEL and DEV_SELECT calls are ignored.
No scheduling is done by the hard disk device driver at all, that is
done by the file system, which assembles the vector requests for
gather/scatter I/O.
Requests come from the file system cache as DEV__ GATHER or DEV_SCATTER
requests for multiples of blocks (4-KB in the default configuration of
MINIX 3), but the hard disk driver is able to handle requests for any
multiple of a sector (512 bytes).
In any case, as we have seen, the main loop of all disk drivers
transforms requests
for single blocks of data into one element vector requests.
Requests for reading and writing are not mixed in a vector of
requests, nor can requests be marked as optional.
The elements of a request vector are for contiguous disk sectors, and
the vector is sorted by the file system before being passed to the
device driver, so it suffices to specify just the starting position on
the disk for an entire array of requests.
The driver is expected to succeed in reading or writing at least the
first request in a request vector, and to return when a request
fails.
It is up to the file system to decide what to do; the file system will
try to complete a write operation but will return to the calling process
only as much data as it can get on a read.
The file system itself, by using scattered I/O, can implement
something similar to Teory’s version of the elevator algorithm—recall
that in a scattered I/O request the list of requests is sorted on the
block number.
The second step in scheduling takes place in the controller of a modern
hard disk.
Such controllers are “smart” and can buffer large quantities of data,
using internally programmed algorithms to retrieve data in the most
efficient order, irrespective of the order of receipt of the
requests.
Small hard disks used on microcomputers are sometimes called
“winchester” disks.
The term was IBM’s code name for the project that developed the disk
technology in which the read/write heads fly on a thin cushion of air
and land on the recording medium when the disk stops spinning.
The explanation of the name is that an early model had two data modules,
a 30-Mbyte fixed and a 30-Mbyte removable one.
Supposedly this reminded the developers of the Winchester 30-30 firearm
which figures in many tales of the United States’ western
frontier.
Whatever the origin of the name, the basic technology remains the same,
although today’s typical PC disk is much smaller and the capacity is
much larger than the 14-inch disks that were typical of the early 1970s
when the winchester technology was developed.
The MINIX 3 AT-style hard disk driver is in at_wini.c.
This is a complicated driver for a sophisticated device, and there are
several pages of macro definitions specifying controller registers,
status bits and commands, data structures, and prototypes.
As with other block device drivers, a driver structure, w_dtab, is
initialized with pointers to the functions that actually do the
work.
Most of them are defined in at_wini.c, but as the hard disk requires no
special cleanup operation, its dr_cleanup entry points to the common
nop_cleanup in driver.c, shared with other drivers that have no special
cleanup requirement.
Several other possible functions are also irrelevant for this driver and
also are initialized to point to nop_functions.
The entry function, called at_winchester_task, calls a procedure that
does hardware-specific initialization and then calls the main loop in
driver.c, passing the address of w_dtab.
The main loop, driver_task in libdriver/driver.c, runs forever,
dispatching calls to the various functions pointed to by the driver
table.
Since we are now dealing with real electromechanical storage devices,
there is a substantial amount of work to be done by init_params to
initialize the hard disk driver.
Various parameters about the hard disks are kept in the wini table,
which has an element for each of the MAX_DRIVES (8) drives supported, up
to four conventional IDE drives, and up to four drives on the PCI bus,
either plug-in IDE controllers or SATA (Serial AT Attachment)
controllers.
Following the policy of postponing initialization steps that could
fail until the first time they are truly necessary, init_params does not
do anything that requires accessing the disk devices themselves.
The main thing it does is to copy information about the hard disk
logical configuration into the wini array.
The ROM BIOS on a Pentium-class computer retrieves basic configuration
information from the CMOS memory used to preserve basic configuration
data.
The BIOS does this when the computer is first turned on, before the
first part of the MINIX 3 loading process begins.
The information is copied from the BIOS.
Many of the constants used here, such as NR_HD_DRIVES_ADDR are defined
in include/ibm/bios.h, a file which is not listed in Appendix B but
which can be found on the MINIX 3 CD-ROM.
It is not necessarily fatal if this information cannot be
retrieved.
If the disk is a modern one, the information can be retrieved directly
from the disk when it is accessed for the first time.
Following the entry of data obtained from the BIOS, additional disk
information is filled in for each drive using a call to the next
function, init_drive.
On older systems with IDE controllers, the disk functions as if it
were an AT-style peripheral card, even though it may be integrated on
the parentboard.
Modern drive controllers usually function as PCI devices, with a 32-bit
data path to the CPU, rather than the 16-bit AT bus.
Fortunately for us, once initialization is complete, the interface to
both generations of disk controller appears the same to the
programmer.
To make this work, init_params_pci is called if necessary to get the
parameters of the PCI devices.
We will not describe the details of this routine, but a few points
should be mentioned.
First, the boot parameter ata_instance is used to set the value of the
variable w_instance.
If the boot parameter is not explicitly set the value will be
zero.
If it is set and greater than zero the test causes querying the BIOS and
initialization of standard IDE drives to be skipped.
In this case only drives found on the PCI bus will be registered.
The second point is that a controller found on the PCI bus will be
identified as controlling devices c0d4 through c0d7.
If w_instance is non-zero the drive identifiers c0d0 through c0d3 will
be skipped, unless a PCI bus controller identifies itself as
“compatible.” Drives handled by a compatible PCI bus controller will be
designated c0d0 through c0d3.
For most MINIX 3 users all of these complications can probably be
ignored.
A computer with less than four drives (including the CD-ROM drive), will
most likely appear to the user to have the classical configuration, with
drives designated c0d0 to c0d3, whether they are connected to IDE or PCI
controllers, and whether or not they use the classic 40-pin parallel
connectors or the newer serial connectors.
But the programming required to create this illusion is complicated.
After the call to the common main loop, nothing may happen for a
while until the first attempt is made to access the hard disk.
When the first attempt to access a disk is made a message requesting a
DEV_OPEN operation will be received by the main loop and w_do_open will
be indirectly called.
In turn, w_do_open calls w_prepare to determine if the device requested
is valid, and then w_identify to identify the type of device and
initialize some more parameters in the wini array.
Finally, a counter in the wini array is used to test whether this is
first time the device has been opened since MINIX 3 was started.
After being examined, the counter is incremented.
If it is the first DEV_OPEN operation, the partition function (in
drvlib.c) is called.
The next function, w_prepare, accepts an integer argument, device,
which is the minor device number of the drive or partition to be used,
and returns a pointer to the device structure that indicates the base
address and size of the device.
In the C language, the use of an identifier to name a structure does not
preclude use of the same identifier to name a variable.
Whether a device is a drive, a partition, or a subpartition can be
determined from the minor device number.
Once w_prepare has completed its job, none of the other functions used
to read or write the disk need to concern themselves with
partitioning.
As we have seen, w_prepare is called when a DEV_OPEN request is made; it
is also one phase of the prepare/transfer cycle used by all data
transfer requests.
Software-compatible AT-style disks have been in use for quite a
while, and w_identify has to distinguish between a number of different
designs that have been introduced over the years.
The first step is to see that a readable and writeable I/O port exists
where one should exist on all disk controllers in this family.
This is the first example we have seen of I/O port access by a
user-space driver, and the operation merits a description.
For a disk device I/O is done using a command structure, which is filled
in with a series of byte values.
We will describe this in a bit more detail later; for the moment note
that two bytes of this structure are filled in, one with a value
ATA_IDENTIFY, interpreted as a command that asks an ATA (AT Attached)
drive to identify itself, and another with a bit pattern that selects
the drive.
Then com_simple is called.
This function hides all the work of constructing a vector of seven
I/O port addresses and bytes to be written to them, sending this
information to the system task, waiting for an interrupt, and checking
the status returned.
This tests that the drive is alive and allows a string of 16-bit values
to be read by the sys_insw kernel call.
Decoding this information is a messy process, and we will not describe
it in detail.
Suffice it to say that a considerable amount of information is
retrieved, including a string that identifies the model of the disk, and
the preferred physical cylinder, head, and sector parameters for the
device.
(Note that the “physical” configuration reported may not be the true
physical configuration, but we have no alternative to accepting what the
disk drive claims.) The disk information also indicates whether or not
the disk is capable of Logical Block Addressing (LBA).
If it is, the driver can ignore the cylinder, head, and sector
parameters and can address the disk using absolute sector numbers, which
is much simpler.
As we mentioned earlier, it is possible that init_params may not
recover the logical disk configuration information from the BIOS
tables.
If that happens, the code tries to create an appropriate set of
parameters based on what it reads from the drive itself.
The idea is that the maximum cylinder, head, and sector numbers can be
1023, 255, and 63 respectively, due to the number of bits allowed for
these fields in the original BIOS data structures.
If the ATA_IDENTIFY command fails, it may simply mean that the disk
is an older model that does not support the command.
In this case the logical configuration values previously read by
init_params are all we have.
If they are valid, they are copied to the physical parameter fields of
wini; otherwise an error is returned and the disk is not usable.
Finally, MINIX 3 uses a u32_t variable to count addresses in
bytes.
This limits the size of a partition to 4 GB.
However, the device structure used to record the base and size of a
partition (defined in drivers/libdriver/driver.h) uses u64_t numbers,
and a 64 bit multiplication operation is used to calculate the size of
the drive on, and the base and size of the whole drive are then entered
into the wini array, and w_specify is called, twice if necessary, to
pass the parameters to be used back to the disk controller.
Finally, more kernel calls are made: a sys_irqsetpolicy call ensures
that when a disk controller interrupt occurs and is serviced the
interrupt will be automatically reenabled in preparation for the next
one.
Following that, a sys_irqenablecall actually enables the interrupt.
W_name returns a pointer to a string containing the device name, which will be either “AT-D0,” “AT-D1” “AT-D2,” or “AT-D3.” When an error message must be generated this function tells which drive produced it.
It is possible that a drive will turn out to be incompatible with
MINIX 3 for some reason.
The function w_io_test is provided to test each drive the first time an
attempt is made to open it.
This routine tries to read the first block on the drive, with shorter
timeout values than are used in normal operation.
If the test fails the drive is permanently marked as unavailable.
W_specify, in addition to passing the parameters to the controller, also recalibrates the drive (if it is an older model), by doing a seek to cylinder zero.
Do_transfer does what its name implies, it assembles a command
structure with all the byte values needed to request transfer of a chunk
of data (possibly as many as 255 disk sectors), and then it calls
com_out, which sends the command to the disk controller.
The data must be formatted differently depending upon how the disk is to
be addressed, that is, whether by cylinder, head, and sector or by
LBA.
Internally MINIX 3 addresses disk blocks linearly, so if LBA is
supported the first three byte-wide fields are filled in by shifting the
sector count an appropriate number of bits to the right and then masking
to get 8-bit values.
The sector count is a 28 bit number, so the last masking operation uses
a 4-bit mask.
If the disk does not support LBA then cylinder, head, and sector values
are calculated, based on the parameters of the disk in use.
The code contains a hint of a future enhancement.
LBA addressing with a 28-bit sector count limits MINIX 3 to fully
utilizing disks of 128 GB or smaller size.
(You can use a bigger disk, but MINIX 3 can only access the first 128
GB).
The programmers have been thinking about, but have not yet implemented,
use of the newer LBA48 method, which uses 48 bits to address disk
blocks.
On line 12824 a test is made for whether this is enabled.
The test will always fail with the version of MINIX 3 described
here.
This is good, because no code is provided to be executed if the test
succeeds.
Keep in mind if you decide to modify MINIX 3 yourself to use LBA48 that
you need to do more than just add some code here.
You will have to make changes in many places to handle the 48-bit
addresses.
You might find it easier to wait until MINIX 3 has been ported to a
64-bit processor, too.
But if a 128 GB disk is not big enough for you, LBA48 will give you
access to 128 PB (Petabytes).
Now we will briefly look at how a data transfer takes place at a
higher level.
W_prepare, which we have already discussed, is called first.
If the transfer operation requested was for multiple blocks (that is, a
DEV_GATHER or DEV_SCATTER request), w_transfer line 12848 is called
immediately afterward.
If the transfer is for a single block (a DEV_READ or DEV_WRITE request),
a one element scatter/gather vector is created, and then w_transfer is
called.
Accordingly, w_transfer is written to expect a vector of iovec_t
requests.
Each element of the request vector consists of a buffer address and the
size of the buffer, constrained that the size must be a multiple of the
size of a disk sector.
All other information needed is passed as an argument to the call, and
applies to the entire request vector.
The first thing done is a simple test to see if the disk address
requested for the start of the transfer is aligned on a sector
boundary.
Then the outer loop of the function is entered.
This loop repeats for each element of the request vector.
Within the loop, as we have seen many times before, a number of tests
are made before the real work of the function is done.
First the total number of bytes remaining in the request is calculated
by summing the iov_size fields of each element of the request
vector.
This result is checked to be sure it is an exact multiple of the size of
a sector.
Other tests check that the starting position is not at or beyond the end
of the device, and if the request would end past the end of the device
the size of the request is truncated.
All calculations so far have been in bytes, but a calculation is made of
the block position on the disk, using 64 bit arithmetic.
Note that although the variable used is named block, this is a number of
disk blocks, that is, 512 byte sectors, not the “block” used internally
by MINIX 3, normally 4096 bytes.
After this one more adjustment is made.
Every drive has a maximum number of bytes that can be requested at one
time, and the request is scaled back to this quantity if
necessary.
After verifying that the disk has been initialized, and doing so again
if necessary, a request for a chunk of data is made by calling
do_transfer.
After a transfer request has been made the inner loop is entered,
which repeats for each sector.
For a read or write operation an interrupt will be generated for each
sector.
On a read the interrupt signifies data is ready and can be
transferred.
The sys_insw kernel call asks the system task to read the specified I/O
port repeatedly, transferring the data to a virtual address in the data
space of the specified process.
For a write operation the order is reversed.
The sys_outsw call a few lines further down writes a string of data to
the controller, and the interrupt comes from the disk controller when
the transfer to the disk is complete.
In the case of either a read or a write, at_intr_wait is called to
receive the interrupt, for example, following the write operation.
Although the interrupt is expected, this function provides a way to
abort the wait if a malfunction occurs and the interrupt never
arrives.
At_intr_wait also reads the disk controller’s status register and
returns various codes.
On an error when either reading or writing, there is a break which skips
over the section where results are recorded and poiners and counters
adjusted for the next sector, so the next time through the inner loop
will be a retry of the same sector, if another try is allowed.
If the disk controller reports a bad sector w_transfer terminates
immediately.
For other errors a counter is incremented and the function is allowed to
continue if max_errors has not been reached.
The next function we will discuss is com_out, which sends the command
to the disk controller, but before we look at its code let us first look
at the controller as it is seen by the software.
The disk controller is controlled through a set of registers, which
could be memory mapped on some systems, but on an IBM compatible appear
as I/O ports.
We will look at these registers and discuss a few aspects of how they
(and I/O control registers in general) are used.
In MINIX 3 there is the added complication that drivers run in user
space and cannot execute the instructions that read or write
registers.
This will provide an opportunity to look at how kernel calls are used to
work around this restriction.
The registers used by a standard IBM-AT class hard disk controller are shown in Fig. 3-23.
Figure 3-23.
(a) The control registers of an IDE hard disk controller.
The numbers in parentheses are the bits of the logical block address
selected by each register in LBA mode.
(b) The fields of the Select Drive/Head register.
We have mentioned several times reading and writing to I/O ports, but
we tacitly treated them just like memory addresses.
In fact, I/O ports often behave differently from memory addresses.
For one thing, input and output registers that happen to have the same
I/O port address are not the same register.
Thus, the data written to a particular address cannot necessarily be
retrieved by a subsequent read operation.
For example, the last register address shown in Fig. 3-23 shows the
status of the disk controller when read and is used to issue commands to
the controller when written to.
It is also common that the very act of reading or writing an I/O device
register causes an action to occur, independently of the details of the
data transferred.
This is true of the command register on the AT disk controller.
In use, data are written to the lower-numbered registers to select the
disk address to be read from or written to, and then the command
register is written last with an operation code.
The data written to the command register determines what the operation
will be.
The act of writing the operation code into the command register starts
the operation.
It is also the case that the use of some registers or fields in the
registers may vary with different modes of operation.
In the example given in the figure, writing a 0 or a 1 to the LBA bit,
bit 6 of register 6, selects whether CHS (Cylinder-Head-Sector) or LBA
(Logical Block Addressing) mode is used.
The data written to or read from registers 3, 4, and 5, and the low four
bits of register 6 are interpreted differently according to the setting
of the LBA bit.
Now let us take a look at how a command is sent to the controller by
calling com_out.
This function is called after setting up a cmd structure (with
do_transfer, which we saw earlier).
Before changing any registers, the status register is read to determine
that the controller is not busy.
This is done by testing the STATUS_BSY bit.
Speed is important here, and normally the disk controller is ready or
will be ready in a short time, so busy waiting is used.
On line 12960 w_waitfor is called to test STATUS_BSY.
W_waitfor uses a kernel call to ask the system task to read an I/O port
so w_waitfor can test a bit in the status register.
It loops until the bit is ready or until there is a timeout.
The loop is programmed for a quick return when the disk is ready.
Thus the returned value will be true with the minimum possible delay if
the controller is ready, true after a delay if it is temporarily
unavailable, or false if it is not ready after the timeout period.
We will have more to say about the timeout when we discuss w_waitfor
itself.
A controller can handle more than one drive, so once it is determined
that the controller is ready, a byte is written to select the drive,
head, and mode of operation and w_waitfor is called again.
A disk drive sometimes fails to carry out a command or to properly
return an error code—it is, after all, a mechanical device that can
stick, jam, or break internally—and as insurance a sys_setalarm kernel
call is made to have the system task schedule a call to a wakeup
routine.
Following this, the command is issued by first writing all the
parameters to the various registers and finally writing the command code
itself to the command register.
This is done with a sys_voutb kernel call, which sends a vector of
(value, address) pairs to the system task.
The system task writes each value to the I/O port specified by the
address in order.
The vector of data for the sys_voutb call is constructed by use of a
macro, pv_set, which is defined in include/minix/devio.h.
The act of writing the operation code to the command register makes the
operation begin.
When it is complete, an interrupt is generated and a notification
message is sent.
If the command times out the alarm will expire and a synchronous alarm
notification will wake up the disk driver.
The next several functions are short.
W_need_reset is called when timeouts occur while waiting for the disk to
interrupt or become ready.
The action of w_need_reset is just to mark the state variable for every
drive in the wini array to force initialization on the next access.
W_do_close has very little to do for a conventional hard disk.
Additional code is needed to support CD-ROMs.
Com_simple is called to issue controller commands that terminate
immediately without a data transfer phase.
Commands that fall into this category include those that retrieve the
disk identification, setting of some parameters, and
recalibration.
We saw an example of its use in w_identify.
Before it is called the command structure must be correctly
initialized.
Note that immediately after the call to com_out a call to at_intr_wait
is made.
This eventually does a receive which blocks until a notification arrives
signifying that an interrupt has occurred.
We noted that com_out does a sys_setalarm kernel call before asking
the system task to write the registers which set up and execute a
command.
As we mentioned in the overview section, the next receive operation
normally should receive a notification indicating an interrupt.
If an alarm has been set and no interrupt occurs, the next message will
be a SYN_ALARM.
In this case w_timeout line 13046 is called.
What needs to be done depends on the current command in w_command.
The timeout might have been left over from a previous operation, and
w_command may have the value CMD_IDLE, meaning the disk completed its
operation.
In that case there is nothing to do.
If the command does not complete and the operation is a read or write,
it may help to reduce the size of I/O requests.
This is done in two steps, first reducing the maximum number of sectors
that can be requested to 8, and then to 1.
For all timeouts a message is printed and w_need_reset is called to
force re-initialization of all drives on the next attempted access.
When a reset is required, w_reset is called.
This function makes use of a library function, tickdelay, that sets a
watchdog timer and then waits for it to expire.
After an initial delay to give the drive time to recover from previous
operations, a bit in the disk controller’s control register is
strobed—that is, set to a logical 1 level for a definite period, then
returned to the logical 0 level.
Following this operation, w_waitfor is called to give the drive a
reasonable period to signal it is ready.
In case the reset does not succeed, a message is printed and an error
status returned.
Commands to the disk that involve data transfer normally terminate by
generating an interrupt, which sends a message back to the disk
driver.
In fact, an interrupt is generated for each sector read or
written.
The function w_intr_wait calls receive in a loop, and if a SYN_ALARM
message is received w_timeout is called.
The only other message type this function should see is HARD_INT.
When this is received the status register is read and ack_args is called
to reinitialize the interrupt.
W_intr_wait is not called directly; when an interrupt is expected the
function called is the next one, at_intr_wait.
After an interrupt is received by at_intr_wait a quick check is made of
the drive status bits.
All is OK if the bits corresponding to busy, write fault, and error are
all clear.
Otherwise a closer look is taken.
If the register could not be read at all, it is panic time.
If the problem was a bad sector a specific error is returned, any other
problem results in a general error code.
In all cases the STATUS_ADMBSY bit is set, to be reset later by the
caller.
We have seen several places where w_waitfor is called to do busy
waiting on a bit in the disk controller status register.
This is used in situations where it is expected the bit might be clear
on the first test, and a quick test is desirable.
For the sake of speed, a macro that read the I/O port directly was used
in earlier versions of MINIX—this is, of course, not allowable for a
user-space driver in MINIX 3.
The solution here is to use a do …
while loop with a minimum of overhead before the first test is
made.
If the bit being tested is clear there is an immediate return from
within the loop.
To deal with the possibility of failure a timeout is implemented within
the loop by keeping track of clock ticks.
If a timeout does occur w_need_reset is called.
The timeout parameter that is used by the w_waitfor function is
defined by DEF_TIMEOUT_TICKS as 300 ticks, or 5 seconds.
A similar parameter, WAKEUP, used to schedule wakeups from the clock
task, is set to 31 seconds.
These are very long periods of time to spend busy waiting, when you
consider that an ordinary process only gets 100 msec to run before it
will be evicted.
But, these numbers are based upon the published standard for interfacing
disk devices to AT-class computers, which states that up to 31 seconds
must be allowed for a disk to “spin up” to speed.
The fact is, of course, that this is a worst-case specification, and
that on most systems spin up will only occur at power-on time, or
possibly after long periods of inactivity, at least for hard
disks.
For CD-ROMs or other devices which must spin up frequently this may be a
more important issue.
There are a few more functions in at_wini.c.
W_geometry returns the logical maximum cylinder, head, and sector values
of the selected hard disk device.
In this case the numbers are real ones, not made up as they were for the
RAM disk driver.
W_other is a catch-all for unrecognized commands and ioctls.
In fact, it is not used in the current release of MINIX 3, and we should
probably have removed it from the Appendix B listing.
W_hw_int is called when a hardware interrupt is received when it is not
expected.
In the overview we mentioned that this can happen when a timeout expires
before an expected interrupt occurs.
This will satisfy a receive operation that was blocked waiting for the
interrupt, but the interrupt notification may then be found by a
subsequent receive.
The only thing to be done is to reenable the interrupt, which is done by
calling the next function, ack_irqs.
It cycles through all the known drives and uses the sys_irqenable kernel
call to ensure all interrupts are enabled.
Finally, at the end of at_wini.c two strange little functions are found,
strstatus and strerr.
These use macros defined just ahead of them to concatenate error codes
into strings.
These functions are not used in MINIX 3 as described here.
The floppy disk driver is longer and more complicated than the hard
disk driver.
This may seem paradoxical, since floppy disk mechanisms are simpler than
those of hard disks, but the simpler mechanism has a more primitive
controller that requires more attention from the operating system.
Also, the fact that the medium is removable adds complications.
In this section we will describe some of the things an implementer must
consider in dealing with floppy disks.
However, we will not go into the details of the MINIX 3 floppy disk
driver code.
In fact, we have not listed the floppy disk driver in Appendix B.
The most important parts are similar to those for the hard disk.
One of the things we do not have to worry about with the floppy
driver is the multiple types of controller to support that we had to
deal with in the case of the hard disk driver.
Although the high-density floppy disks currently used were not supported
in the design of the original IBM PC, the floppy disk controllers of all
computers in the IBM PC family are supported by a single software
driver.
The contrast with the hard disk situation is probably due to lack of
motivation to increase floppy disk performance.
Floppy disks are rarely used as working storage during operation of a
computer system; their speed and data capacity are too limited compared
to those of hard disks.
Floppy disks at one time were important for distribution of new software
and for backup, but as networks and larger-capacity removable storage
devices have become common, PCs rarely come standard with a floppy disk
drives any more.
The floppy disk driver does not use the SSF or the elevator
algorithm.
It is strictly sequential, accepting a request and carrying it out
before even accepting another request.
In the original design of MINIX it was felt that, since MINIX was
intended for use on personal computers, most of the time there would be
only one process active.
Thus the chance of a disk request arriving while another was being
carried out was small.
There would be little to gain from the considerable increase in software
complexity that would be required for queueing requests.
Complexity is even less worthwhile now, since floppy disks are rarely
used for anything but transferring data into or out of a system with a
hard disk.
That said, the floppy driver, like any other block driver, can handle
a request for scattered I/O.
However, in the case of the floppy driver the array of requests is
smaller than for the hard disk, limited to the maximum number of sectors
per track on a floppy diskette.
The simplicity of the floppy disk hardware is responsible for some of
the complications in floppy disk driver software.
Cheap, slow, low-capacity floppy drives do not justify the sophisticated
integrated controllers that are part of modern hard drives, so the
driver software has to deal explicitly with aspects of disk operation
that are hidden in the operation of a hard drive.
As an example of a complication caused by the simplicity of floppy
drives, consider positioning the read/write head to a particular track
during a SEEK operation.
No hard disk has ever required the driver software to explicitly call
for a SEEK.
For a hard disk the cylinder, head, and sector geometry visible to the
programmer often do not correspond to the physical geometry.
In fact, the physical geometry may be quite complicated.
Typically there are multiple zones (groups of cylinders) with more
sectors per track on outer zones than on inner ones.
This is not visible to the user, however.
Modern hard disks accept Logical Block Addressing (LBA), addressing by
the absolute sector number on the disk, as an alternative to cylinder,
head, and sector addressing.
Even if addressing is done by cylinder, head, and sector, any geometry
that does not address nonexistent sectors may be used, since the
integrated controller on the disk calculates where to move the
read/write heads and does a seek operation when required.
For a floppy disk, however, explicit programming of SEEK operations
is needed.
In case a SEEK fails, it is necessary to provide a routine to perform a
RECALIBRATE operation, which forces the heads to cylinder 0.
This makes it possible for the controller to advance them to a desired
track position by stepping the heads a known number of times.
Similar operations are necessary for the hard drive, of course, but the
controller handles them without detailed guidance from the device driver
software.
Some characteristics of a floppy disk drive that complicate its driver are:
Some hard disk controllers provide for removable media, for instance,
on a CD-ROM drive, but the drive controller is generally able to handle
any complications without support in the device driver software.
With a floppy disk, however, the built-in support is not there, and yet
it is needed more.
Some of the most common uses for floppy disks—installing new software or
backing up files—are likely to require switching of disks in and out of
the drives.
It will cause grief if data intended for one diskette are written onto
another.
The device driver should do what it can to prevent this.
This is not always possible, as not all floppy drive hardware allows
determination of whether the drive door has been opened since the last
access.
Another problem that can be caused by removable media is that a system
can become hung up if an attempt is made to access a floppy drive that
currently has no diskette inserted.
This can be solved if an open door can be detected, but since this is
not always possible some provision must be made for a timeout and an
error return if an operation on a floppy disk does not terminate in a
reasonable time.
Removable media can be replaced with other media, and in the case of
floppy disks there are many different possible formats.
IBM compatible hardware supports both 3.5-inch and 5.25-inch disk drives
and the diskettes can be formatted in a variety of ways to hold from 360
KB up to 1.2 MB (on a 5.25-inch diskette) or 1.44 MB (on a 3.5-inch
diskette).
MINIX 3 supports seven different floppy disk formats.
Two possible solutions are possible for the problem this causes.
One way is to refer to each possible format as a distinct drive and
provide multiple minor devices.
Older versions of MINIX did this.
Fourteen different devices were defined, ranging from /dev/pc0, a 360 KB
5.25-inch diskette in the first drive, to /dev/PS1, a 1.44 MB 3.5-inch
diskette in the second drive.
This was a cumbersome solution.
MINIX 3 uses another method: when the first floppy disk drive is
addressed as /dev/fd0, or the second as /dev/fd1, the floppy disk driver
tests the diskette currently in the drive when it is accessed, in order
to determine the format.
Some formats have more cylinders, and others have more sectors per track
than other formats.
Determination of the format of a diskette is done by attempting to read
the higher numbered sectors and tracks.
By a process of elimination the format can be determined.
This takes time, but on modern computers only 1.44 MB 3.5-inch diskettes
are likely to be found, and this format is probed first.
Another possible problem is that a disk with bad sectors could be
misidentified.
A utility program is available for testing disks; doing so automatically
in the operating system would be too slow.
The final complication of the floppy disk driver is motor
control.
Diskettes cannot be read or written unless they are revolving.
Hard disks are designed to run for thousands of hours on end without
wearing out, but leaving the motors on all the time causes a floppy
drive and diskette to wear out quickly.
If the motor is not already on when a drive is accessed, it is necessary
to issue a command to start the drive and then to wait about a half
second before attempting to read or write data.
Turning the motors on or off is slow, so MINIX 3 leaves a drive motor on
for a few seconds after a drive is used.
If the drive is used again within this interval, the timer is extended
for another few seconds.
If the drive is not used in this interval, the motor is turned off.
For decades, users have communicated with computers using devices
consisting of a keyboard for user input and a display for computer
output.
For many years, these were combined into free-standing devices called
terminals, which were connected to the computer by a wire.
Large mainframes used in the financial and travel industries sometimes
still use these terminals, typically connected to the mainframe via a
modem, especially when they are far from the mainframe.
However, with the emergence of the personal computer, the keyboard and
display have become separate peripherals rather than a single device,
but they are so closely interrelated that we will discuss them together
here under the combined name of “terminal.”
Historically, terminals have come in a variety of forms.
It is up to the terminal driver to hide all these differences, so that
the device-independent part of the operating system and the user
programs do not have to be rewritten for each kind of terminal.
In the following sections we will follow our now-standard approach of
first discussing terminal hardware and software in general, and then
discussing the MINIX 3 software.
From the operating system’s point of view, terminals can be divided
into three broad categories based on how the operating system
communicates with them as well as their actual hardware
characteristics.
The first category consists of memory-mapped terminals, which consist of
a keyboard and a display, both of which are hardwired to the
computer.
This model is used in all personal computers for the keyboard and the
monitor.
The second category consists of terminals that interface via a serial
communication line using the RS-232 standard, most frequently over a
modem.
This model is still used on some mainframes, but PCs also have serial
line interfaces.
The third category consists of terminals that are connected to the
computer via a network.
This taxonomy is shown in Fig. 3-24.
Figure 3-24.
Terminal types.
Memory-Mapped Terminals
The first broad category of terminals named in Fig. 3-24 consists of
memory-mapped terminals.
These are an integral part of the computers themselves, especially
personal computers.
They consist of a display and a keyboard.
Memory-mapped displays are interfaced via a special memory called a
video RAM, which forms part of the computer’s address space and is
addressed by the CPU the same way as the rest of memory (see Fig.
3-25).
Also on the video RAM card is a chip called a video controller.
This chip pulls bytes out of the video RAM and generates the video
signal used to drive the display.
Displays are usually one of two types: CRT monitors or flat panel
displays.
A CRT monitor generates a beam of electrons that scans horizontally
across the screen, painting lines on it.
Typically the screen has 480 to 1200 lines from top to bottom, with 640
to 1920 points per line.
These points are called pixels.
The video controller signal modulates the intensity of the electron
beam, determining whether a given pixel will be light or dark.
Color monitors have three beams, for red, green, and blue, which are
modulated independently.
A flat panel display works very differently internally, but a CRT-compatible flat-panel display accepts the same synchronization and video signals as a CRT and uses these to control a liquid crystal element at each pixel position.
Figure 3-25.
Memory-mapped terminals write directly into video RAM.
A simple monochrome display might fit each character in a box 9
pixels wide by 14 pixels high (including the space between characters),
and have 25 lines of 80 characters.
The display would then have 350 scan lines of 720 pixels each.
Each of these frames is redrawn 45 to 70 times a second.
The video controller could be designed to fetch the first 80 characters
from the video RAM, generate 14 scan lines, fetch the next 80 characters
from the video RAM, generate the following 14 scan lines, and so
on.
In fact, most fetch each character once per scan line to eliminate the
need for buffering in the controller.
The 9-by-14 bit patterns for the characters are kept in a ROM used by
the video controller.
(RAM may also be used to support custom fonts.) The ROM is addressed by
a 12-bit address, 8 bits from the character code and 4 bits to specify a
scan line.
The 8 bits in each byte of the ROM control 8 pixels; the 9th pixel
between characters is always blank.
Thus 14 × 80 = 1120 memory references to the video RAM are needed per
line of text on the screen.
The same number of references are made to the character generator
ROM.
The original IBM PC had several modes for the screen.
In the simplest one, it used a character-mapped display for the
console.
In Fig. 3-26(a) we see a portion of the video RAM.
Each character on the screen of Fig. 3-26(b) occupied two characters in
the RAM.
The low-order character was the ASCII code for the character to be
displayed.
The high-order character was the attribute byte, which was used to
specify the color, reverse video, blinking, and so on.
The full screen of 25 by 80 characters required 4000 bytes of video RAM
in this mode.
All modern displays still support this mode of operation.
Figure 3-26.
(a) A video RAM image for the IBM monochrome display.
The ×s are attribute bytes.
(b) The corresponding screen.
Contemporary bitmap displays use the same principle, except that each
pixel on the screen is individually controlled.
In the simplest configuration, for a monochrome display, each pixel has
a corresponding bit in the video RAM.
At the other extreme, each pixel is represented by a 24-bit number, with
8 bits each for red, green, and blue.
A 768 × 1024 color display with 24 bits per pixel requires 2 MB of RAM
to hold the image.
With a memory-mapped display, the keyboard is completely decoupled
from the screen.
It may be interfaced via a serial or parallel port.
On every key action the CPU is interrupted, and the keyboard driver
extracts the character typed by reading an I/O port.
On a PC, the keyboard contains an embedded microprocessor which
communicates through a specialized serial port with a controller chip on
the main board.
An interrupt is generated whenever a key is struck and also when one is
released.
Furthermore, all that the keyboard hardware provides is the key number,
not the ASCII code.
When the A key is struck, the key code (30) is put in an I/O
register.
It is up to the driver to determine whether it is lower case, upper
case, CTRL-A, ALT-A, CTRL-ALT-A, or some other combination.
Since the driver can tell which keys have been depressed but not yet
released (e.g., shift), it has enough information to do the job.
Although this keyboard interface puts the full burden on the software,
it is extremely flexible.
For example, user programs may be interested in whether a digit just
typed came from the top row of keys or the numeric key pad on the
side.
In principle, the driver can provide this information.
RS-232 Terminals
RS-232 terminals are devices containing a keyboard and a display that
communicate using a serial interface, one bit at a time (see Fig.
3-27).
These terminals use a 9-pin or 25-pin connector, of which one pin is
used for transmitting data, one pin is for receiving data, and one pin
is ground.
The other pins are for various control functions, most of which are not
used.
To send a character to an RS-232 terminal, the computer must transmit it
1 bit at a time, prefixed by a start bit, and followed by 1 or 2 stop
bits to delimit the character.
A parity bit which provides rudimentary error detection may also be
inserted preceding the stop bits, although this is commonly required
only for communication with mainframe systems.
Common transmission rates are 14,400 and 56,000 bits/sec, the former
being for fax and the latter for data.
RS-232 terminals are commonly used to communicate with a remote computer
using a modem and a telephone line.
Figure 3-27.
An RS-232 terminal communicates with a computer over a communication
line, one bit at a time.
The computer and the terminal are completely independent.
Since both computers and terminals work internally with whole
characters but must communicate over a serial line a bit at a time,
chips have been developed to do the character-to-serial and
serial-to-character conversions.
They are called UARTs (Universal Asynchronous Receiver
Transmitters).
UARTs are attached to the computer by plugging RS-232 interface cards
into the bus as illustrated in Fig. 3-27.
On modern computers the UART and RS-232 interface is frequently part of
the parentboard chipset.
It may be possible disable the on-board UART to allow use of a modem
interface card plugged into the bus or two of them may be able to
coexist.
A modem also provides a UART (although it may be integrated with other
functions in a multi-purpose chip), and the communication channel is a
telephone line rather than a serial cable.
However, to the computer the UART looks the same whether the medium is a
dedicated serial cable or a telephone line.
RS-232 terminals are gradually dying off, being replaced by PCs, but
they are still encountered on older mainframe systems, especially in
banking, airline reservation, and similar applications.
Terminal programs that allow a remote computer to simulate a terminal
are still widely used, however.
To print a character, the terminal driver writes the character to the
interface card, where it is buffered and then shifted out over the
serial line one bit at a time by the UART.
Even at 56,000 bps, it takes just over 140 microsec to send a
character.
As a result of this slow transmission rate, the driver generally outputs
a character to the RS-232 card and blocks, waiting for the interrupt
generated by the interface when the character has been transmitted and
the UART is able to accept another character.
The UART can simultaneously send and receive characters, as its name
implies.
An interrupt is also generated when a character is received, and usually
a small number of input characters can be buffered.
The terminal driver must check a register when an interrupt is received
to determine the cause of the interrupt.
Some interface cards have a CPU and memory and can handle multiple
lines, taking over much of the I/O load from the main CPU.
RS-232 terminals can be subdivided into categories, as mentioned
above.
The simplest ones were hardcopy (printing) terminals.
Characters typed on the keyboard were transmitted to the computer.
Characters sent by the computer were typed on the paper.
These terminals are obsolete and rarely seen any more.
Dumb CRT terminals work the same way, only with a screen instead of
paper.
These are frequently called “glass ttys” because they are functionally
the same as hardcopy ttys.
(The term “tty” is an abbreviation for Teletype, ® a former company that
pioneered in the computer terminal business; “tty” has come to mean any
terminal.) Glass ttys are also obsolete.
Intelligent CRT terminals are in fact miniature, specialized
computers.
They have a CPU and memory and contain software, usually in ROM.
From the operating system’s viewpoint, the main difference between a
glass tty and an intelligent terminal is that the latter understands
certain escape sequences.
For example, by sending the ASCII ESC character (033), followed by
various other characters, it may be possible to move the cursor to any
position on the screen, insert text in the middle of the screen, and so
forth.
The keyboard and display are almost independent devices, so we will
treat them separately here.
(They are not quite independent, since typed characters must be
displayed on the screen.) In MINIX 3 the keyboard and screen drivers are
part of the same process; in other systems they may be split into
distinct drivers.
Input Software
The basic job of the keyboard driver is to collect input from the
keyboard and pass it to user programs when they read from the
terminal.
Two possible philosophies can be adopted for the driver.
In the first one, the driver’s job is just to accept input and pass it
upward unmodified.
A program reading from the terminal gets a raw sequence of ASCII
codes.
(Giving user programs the key numbers is too primitive, as well as being
highly machine dependent.)
This philosophy is well suited to the needs of sophisticated screen
editors such as emacs, which allow the user to bind an arbitrary action
to any character or sequence of characters.
It does, however, mean that if the user types dste instead of date and
then corrects the error by typing three backspaces and ate, followed by
a carriage return, the user program will be given all 11 ASCII codes
typed.
Most programs do not want this much detail.
They just want the corrected input, not the exact sequence of how it was
produced.
This observation leads to the second philosophy: the driver handles all
the intraline editing, and just delivers corrected lines to the user
programs.
The first philosophy is character-oriented; the second one is
line-oriented.
Originally they were referred to as raw mode and cooked mode,
respectively.
The POSIX standard uses the less-picturesque term canonical mode to
describe line-oriented mode.
On most systems canonical mode refers to a well-defined
configuration.
Noncanonical mode is equivalent to raw mode, although many details of
terminal behavior can be changed.
POSIX-compatible systems provide several library functions that support
selecting either mode and changing many aspects of terminal
configuration.
In MINIX 3 the ioctl system call supports these functions.
The first task of the keyboard driver is to collect characters.
If every key-stroke causes an interrupt, the driver can acquire the
character during the interrupt.
If interrupts are turned into messages by the low-level software, it is
possible to put the newly acquired character in the message.
Alternatively, it can be put in a small buffer in memory and the message
used to tell the driver that something has arrived.
The latter approach is actually safer if a message can be sent only to a
waiting process and there is some chance that the keyboard driver might
still be busy with the previous character.
Once the driver has received the character, it must begin processing
it.
If the keyboard delivers key numbers rather than the character codes
used by application software, then the driver must convert between the
codes by using a table.
Not all IBM “compatibles” use standard key numbering, so if the driver
wants to support these machines, it must map different keyboards with
different tables.
A simple approach is to compile a table that maps between the codes
provided by the keyboard and ASCII (American Standard Code for
Information Interchange) codes into the keyboard driver, but this is
unsatisfactory for users of languages other than English.
Keyboards are arranged differently in different countries, and the ASCII
character set is not adequate even for the majority of people in the
Western Hemisphere, where speakers of Spanish, Portuguese, and French
need accented characters and punctuation marks not used in
English.
To respond to the need for flexibility of keyboard layouts to provide
for different languages, many operating systems provide for loadable
keymaps or code pages, which make it possible to choose the mapping
between keyboard codes and codes delivered to the application, either
when the system is booted or later.
If the terminal is in canonical (i.e., cooked) mode, characters must
be stored until an entire line has been accumulated, because the user
may subsequently decide to erase part of it.
Even if the terminal is in raw mode, the program may not yet have
requested input, so the characters must be buffered to allow type
ahead.
(System designers who do not allow users to type far ahead ought to be
tarred and feathered, or worse yet, be forced to use their own
system.)
Two approaches to character buffering are common.
In the first one, the driver contains a central pool of buffers, each
buffer holding perhaps 10 characters.
Associated with each terminal is a data structure, which contains, among
other items, a pointer to the chain of buffers for input collected from
that terminal.
As more characters are typed, more buffers are acquired and hung on the
chain.
When the characters are passed to a user program, the buffers are
removed and put back in the central pool.
The other approach is to do the buffering directly in the terminal
data structure itself, with no central pool of buffers.
Since it is common for users to type a command that will take a little
while (say, a compilation) and then type a few lines ahead, to be safe
the driver should allocate something like 200 characters per
terminal.
In a large-scale timesharing system with 100 terminals, allocating 20K
all the time for type ahead is clearly overkill, so a central buffer
pool with space for perhaps 5K is probably enough.
On the other hand, a dedicated buffer per terminal makes the driver
simpler (no linked list management) and is to be preferred on personal
computers with only one or two terminals.
Figure 3-28 shows the difference between these two methods.
Although the keyboard and display are logically separate devices,
many users have grown accustomed to seeing the characters they have just
typed appear on the screen.
Some (older) terminals oblige by automatically displaying (in hardware)
whatever has just been typed, which is not only a nuisance when
passwords are being entered but greatly limits the flexibility of
sophisticated editors and other programs.
Fortunately, PC keyboards display nothing when keys are struck.
It is therefore up to the software to display the input.
This process is called echoing.
Echoing is complicated by the fact that a program may be writing to
the screen while the user is typing.
At the very least, the keyboard driver has to figure out where to put
the new input without it being overwritten by program output.
Echoing also gets complicated when more than 80 characters are typed
on a terminal with 80-character lines.
Depending on the application, wrapping around to the next line may be
appropriate.
Some drivers just truncate lines to 80 characters by throwing away all
characters beyond column 80.
Figure 3-28.
(a) Central buffer pool.
(b) Dedicated buffer for each terminal.
Another problem is tab handling.
All keyboards have a tab key, but displays can handle tab on
output.
It is up to the driver to compute where the cursor is currently located,
taking into account both output from programs and output from echoing,
and compute the proper number of spaces to be echoed.
Now we come to the problem of device equivalence.
Logically, at the end of a line of text, one wants a carriage return, to
move the cursor back to column 1, and a linefeed, to advance to the next
line.
Requiring users to type both at the end of each line would not sell well
(although some old terminals had a key which generated both, with a 50
percent chance of doing so in the order that the software wanted
them).
It was (and still is) up to the driver to convert whatever comes in to
the standard internal format used by the operating system.
If the standard form is just to store a linefeed (the convention in
UNIX and all its descendants), carriage returns should be turned into
linefeeds.
If the internal format is to store both, then the driver should generate
a linefeed when it gets a carriage return and a carriage return when it
gets a linefeed.
No matter what the internal convention, the terminal may require both a
linefeed and a carriage return to be echoed in order to get the screen
updated properly.
Since a large computer may well have a wide variety of different
terminals connected to it, it is up to the keyboard driver to get all
the different carriage return/linefeed combinations converted to the
internal system standard and arrange for all echoing to be done
right.
A related problem is the timing of carriage return and
linefeeds.
On some terminals, it may take longer to display a carriage return or
linefeed than a letter or number.
If the microprocessor inside the terminal actually has to copy a large
block of text to achieve scrolling, then linefeeds may be slow.
If a mechanical print head has to be returned to the left margin of the
paper, carriage returns may be slow.
In both cases it is up to the driver to insert filler characters (dummy
null characters) into the output stream or just stop outputting long
enough for the terminal to catch up.
The amount of time to delay is often related to the terminal speed; for
example, at 4800 bps or slower, no delays may be needed, but at 9600 bps
or higher one filler character might be required.
Terminals with hardware tabs, especially hardcopy ones, may also require
a delay after a tab.
When operating in canonical mode, a number of input characters have
special meanings.
Figure 3-29 shows all of the special characters required by POSIX and
the additional ones recognized by MINIX 3.
The defaults are all control characters that should not conflict with
text input or codes used by programs, but all except the last two can be
changed using the stty command, if desired.
Older versions of UNIX used different defaults for many of these.
Figure 3-29.
Characters that are handled specially in canonical mode.
The ERASE character allows the user to rub out the character just
typed.
In MINIX 3 it is the backspace (CTRL-H).
It is not added to the character queue but instead removes the previous
character from the queue.
It should be echoed as a sequence of three characters, backspace, space,
and backspace, in order to remove the previous character from the
screen.
If the previous character was a tab, erasing it requires keeping track
of where the cursor was prior to the tab.
In most systems, backspacing will only erase characters on the current
line.
It will not erase a carriage return and back up into the previous
line.
When the user notices an error at the start of the line being typed
in, it is often convenient to erase the entire line and start
again.
The KILL character (in MINIX 3 CTRL-U) erases the entire line.
MINIX 3 makes the erased line vanish from the screen, but some systems
echo it plus a carriage return and linefeed because some users like to
see the old line.
Consequently, how to echo KILL is a matter of taste.
As with ERASE it is usually not possible to go further back than the
current line.
When a block of characters is killed, it may or may not be worth the
trouble for the driver to return buffers to the pool, if one is
used.
Sometimes the ERASE or KILL characters must be entered as ordinary
data.
The LNEXT character serves as an escape character.
In MINIX 3 CTRL-V is the default.
As an example, older UNIX systems normally used the @ sign for KILL, but
the Internet mail system uses addresses of the form
linda@cs.washington.edu.
Someone who feels more comfortable with older conventions might redefine
KILL as @, but then need to enter an @ sign literally to address
e-mail.
This can be done by typing CTRL-V @.
The CTRL-V itself can be entered literally by typing CTRL-V
CTRL-V.
After seeing a CTRL-V, the driver sets a flag saying that the next
character is exempt from special processing.
The LNEXT character itself is not entered in the character queue.
To allow users to stop a screen image from scrolling out of view,
control codes are provided to freeze the screen and restart it
later.
In MINIX 3 these are STOP (CTRL-S) and START (CTRL-Q),
respectively.
They are not stored but are used to set and clear a flag in the terminal
data structure.
Whenever output is attempted, the flag is inspected.
If it is set, no output occurs.
Usually, echoing is also suppressed along with program output.
It is often necessary to kill a runaway program being debugged.
The INTR (CTRL-C) and QUIT (CTRL-) characters can be used for this
purpose.
In MINIX 3, CTRL-C sends the SIGINT signal to all the processes started
up from the terminal.
Implementing CTRL-C can be quite tricky.
The hard part is getting the information from the driver to the part of
the system that handles signals, which, after all, has not asked for
this information.
CTRL- is similar to CTRL-C, except that it sends the SIGQUIT signal,
which forces a core dump if not caught or ignored.
When either of these keys is struck, the driver should echo a
carriage return and linefeed and discard all accumulated input to allow
for a fresh start.
Historically, DEL was commonly used as the default value for INTR on
many UNIX systems.
Since many programs use DEL interchangeably with the backspace for
editing, CTRL-C is now preferred.
Another special character is EOF (CTRL-D), which in MINIX 3 causes
any pending read requests for the terminal to be satisfied with whatever
is available in the buffer, even if the buffer is empty.
Typing CTRL-D at the start of a line causes the program to get a read of
0 bytes, which is conventionally interpreted as end-of-file and causes
most programs to act the same way as they would upon seeing end-of-file
on an input file.
Some terminal drivers allow much fancier intraline editing than we
have sketched here.
They have special control characters to erase a word, skip backward or
forward characters or words, go to the beginning or end of the line
being typed, and so forth.
Adding all these functions to the terminal driver makes it much larger
and, furthermore, is wasted when using fancy screen editors that work in
raw mode anyway.
To allow programs to control terminal parameters, POSIX requires that
several functions be available in the standard library, of which the
most important are tcgetattr and tcsetattr.
Tcgetattr retrieves a copy of the structure shown in Fig. 3-30, the
termios structure, which contains all the information needed to change
special characters, set modes, and modify other characteristics of a
terminal.
A program can examine the current settings and modify them as
desired.
Tcsetattr then writes the structure back to the terminal driver.
struct termios {
tcflag_t c_iflag; /* input modes */
tcflag_t c_oflag; /* output modes */
tcflag_t c_cflag; /* control modes */
tcflag_t c_lflag; /* local modes */
speed_t c_ispeed; /* input speed */
speed_t c_ospeed; /* output speed */
cc_t c_cc[NCCS]; /* control characters */
};
Figure 3-30.
The termios structure.
In MINIX 3 tc_flag_t is a short, speed_t is an int, and cc_t is a
char.
The POSIX standard does not specify whether its requirements should
be implemented through library functions or system calls.
MINIX 3 provides a system call, ioctl, called by
ioctl(file_descriptor, request, argp);
that is used to examine and modify the configurations of many I/O
devices.
This call is used to implement the tcgetattr and tcsetattr
functions.
The variable request specifies whether the termios structure is to be
read or written, and in the latter case, whether the request is to take
effect immediately or should be deferred until all currently queued
output is complete.
The variable argp is a pointer to a termios structure in the calling
program.
This particular choice of communication between program and driver was
chosen for its UNIX compatibility, rather than for its inherent
beauty.
A few notes about the termios structure are in order.
The four flag words provide a great deal of flexibility.
The individual bits in c_iflag control various ways input is
handled.
For instance, the ICRNL bit causes CR characters to be converted into NL
on input.
This flag is set by default in MINIX 3.
The c_oflag holds bits that affect output processing.
For instance, the OPOST bit enables output processing.
It and the ONLCR bit, which causes NL characters in the output to be
converted into a CR NL sequence, are also set by default in MINIX
3.
The c_cflag is the control flags word.
The default settings for MINIX 3 enable a line to receive 8-bit
characters and cause a modem to hang up if a user logs out on the
line.
The c_lflag is the local mode flags field.
One bit, ECHO, enables echoing (this can be turned off during a login to
provide security for entering a password).
Its most important bit is the ICANON bit, which enables canonical
mode.
With ICANON off, several possibilities exist.
If all other settings are left at their defaults, a mode identical to
the traditional cbreak mode is entered.
In this mode, characters are passed to the program without waiting for a
full line, but the INTR, QUIT, START, and STOP characters retain their
effects.
All of these can be disabled by resetting bits in the flags, however, to
produce the equivalent of traditional raw mode.
The various special characters that can be changed, including those
which are MINIX 3 extensions, are held in the c_cc array.
This array also holds two parameters which are used in noncanonical
mode.
The quantity MIN, stored in c_cc[VMIN], specifies the minimum number of
characters that must be received to satisfy a read call.
The quantity TIME in c_cc[VTIME] sets a time limit for such calls.
MIN and TIME interact as shown in Fig. 3-31.
A call that asks for N bytes is illustrated.
With TIME = 0 and MIN = 1, the behavior is similar to the traditional
raw mode.
Figure 3-31.
MIN and TIME determine when a call to read returns in noncanonical
mode.
N is the number of bytes requested.
Output Software
Output is simpler than input, but drivers for RS-232 terminals are
radically different from drivers for memory-mapped terminals.
The method that is commonly used for RS-232 terminals is to have output
buffers associated with each terminal.
The buffers can come from the same pool as the input buffers, or be
dedicated, as with input.
When programs write to the terminal, the output is first copied to the
buffers.
Similarly, output from echoing is also copied to the buffers.
After all the output has been copied to the buffers (or the buffers are
full), the first character is output, and the driver goes to
sleep.
When the interrupt comes in, the next character is output, and so
on.
With memory-mapped terminals, a simpler scheme is possible.
Characters to be printed are extracted one at a time from user space and
put directly in the video RAM.
With RS-232 terminals, each character to be output is just put on the
line to the terminal.
With memory mapping, some characters require special treatment, among
them, backspace, carriage return, linefeed, and the audible bell
(CTRL-G).
A driver for a memory-mapped terminal must keep track in software of the
current position in the video RAM, so that printable characters can be
put there and the current position advanced.
Backspace, carriage return, and linefeed all require this position to be
updated appropriately.
Tabs also require special processing.
In particular, when a linefeed is output on the bottom line of the
screen, the screen must be scrolled.
To see how scrolling works, look at Fig. 3-26.
If the video controller always began reading the RAM at 0xB0000, the
only way to scroll the screen when in character mode would be to copy 24
× 80 characters (each character requiring 2 bytes) from 0xB00A0 to
0xB0000, a time-consuming proposition.
In bitmap mode, it would be even worse.
Fortunately, the hardware usually provides some help here.
Most video controllers contain a register that determines where in the
video RAM to begin fetching bytes for the top line on the screen.
By setting this register to point to 0xB00A0 instead of 0xB0000, the
line that was previously number two moves to the top, and the whole
screen scrolls up one line.
The only other thing the driver must do is copy whatever is needed to
the new bottom line.
When the video controller gets to the top of the RAM, it just wraps
around and continues merrily fetching bytes starting at the lowest
address.
Similar hardware assistance is provided in bitmap mode.
Another issue that the driver must deal with on a memory-mapped
terminal is cursor positioning.
Again, the hardware usually provides some assistance in the form of a
register that tells where the cursor is to go.
Finally, there is the problem of the bell.
It is sounded by outputting a sine or square wave to the loudspeaker, a
part of the computer quite separate from the video RAM.
Screen editors and many other sophisticated programs need to be able
to update the screen in more complex ways than just scrolling text onto
the bottom of the display.
To accommodate them, many terminal drivers support a variety of escape
sequences.
Although some terminals support idiosyncratic escape sesequence sets, it
is advantageous to have a standard to facilitate adapting software from
one system to another.
The American National Standards Institute (ANSI) has defined a set of
standard escape sequences, and MINIX 3 supports a subset of the ANSI
sequences, shown in Fig. 3-32, that is adequate for many common
operations.
When the driver sees the character that starts the escape sequences, it
sets a flag and waits until the rest of the escape sequence comes
in.
When everything has arrived, the driver must carry it out in
software.
Inserting and deleting text require moving blocks of characters around
the video RAM.
The hardware is of no help with anything except scrolling and displaying
the cursor.
Figure 3-32.
The ANSI escape sequences accepted by the terminal driver on
output.
ESC denotes the ASCII escape character (0x1B), and n, m, and s are
optional numeric parameters.
The terminal driver is contained in four C files (six if RS-232 and
pseudo terminal support are enabled) and together they far and away
constitute the largest driver in MINIX 3.
The size of the terminal driver is partly explained by the observation
that the driver handles both the keyboard and the display, each of which
is a complicated device in its own right, as well as two other optional
types of terminals.
Still, it comes as a surprise to most people to learn that terminal I/O
requires thirty times as much code as the scheduler.
(This feeling is reinforced by looking at the numerous books on
operating systems that devote thirty times as much space to scheduling
as to all I/O combined.)
The terminal driver accepts more than a dozen message types.
The most important are:
Other message types are used for special purposes such as generating diagnostic displays when function keys are pressed or triggering panic dumps.
The messages used for reading and writing have the same format as
shown in Fig. 3-17, except that no POSITION field is needed.
With a disk, the program has to specify which block it wants to
read.
With a keyboard, there is no choice: the program always gets the next
character typed in.
Keyboards do not support seeks.
The POSIX functions tcgetattr and tcgetattr, used to examine and
modify terminal attributes (properties), are supported by the ioctl
system call.
Good programming practice is to use these functions and others in
include/termios.h and leave it to the C library to convert library calls
to ioctl system calls.
There are, however, some control operations needed by MINIX 3 that are
not provided for in POSIX, for example, loading an alternate keymap, and
for these the programmer must use ioctl explicitly.
The message sent to the driver by an ioctl system call contains a
function request code and a pointer.
For the tcsetattr function, an ioctl call is made with a TCSETS,
TCSETSW, or TCSETSF request type, and a pointer to a termios structure
like the one shown in Fig. 3-30.
All such calls replace the current set of attributes with a new set, the
differences being that a TCSETS request takes effect immediately, a
TCSETSW request does not take effect until all output has been
transmitted, and a TCSETSF waits for output to finish and discards all
input that has not yet been read.
Tcgetattr is translated into an ioctl call with a TCGETS request type
and returns a filled in termios structure to the caller, so the current
state of a device can be examined.
Ioctl calls that do not correspond to functions defined by POSIX, like
the KIOCSMAP request used to load a new keymap, pass pointers to other
kinds of structures, in this case to a keymap_t which is a 1536-byte
structure (16-bit codes for 128 keys × 6 modifiers).
Figure 3-39 summarizes how standard POSIX calls are converted into ioctl
system calls.
The terminal driver uses one main data structure, tty_table, which is
an array of tty structures, one per terminal.
A standard PC has only one keyboard and display, but MINIX 3 can support
up to eight virtual terminals, depending upon the amount of memory on
the display adapter card.
This permits the person at the console to log on multiple times,
switching the display output and keyboard input from one “user” to
another.
With two virtual consoles, pressing ALT-F2 selects the second one and
ALT-F1 returns to the first.
ALT plus the arrow keys also can be used.
In addition, serial lines can support two users at remote locations,
connected by RS-232 cable or modem, and pseudo terminals can support
users connected through a network.
The driver has been written to make it easy to add additional
terminals.
The standard configuration illustrated in the source code in this text
has two virtual consoles, with serial lines and pseudo terminals
disabled.
Each tty structure in tty_table keeps track of both input and
output.
For input, it holds a queue of all characters that have been typed but
not yet read by the program, information about requests to read
characters that have not yet been received, and timeout information, so
input can be requested without the driver blocking permanently if no
character is typed.
For output, it holds the parameters of write requests that are not yet
finished.
Other fields hold various general variables, such as the termios
structure discussed above, which affects many properties of both input
and output.
There is also a field in the tty structure to point to information which
is needed for a particular class of devices but is not needed in the
tty_table entry for every device.
For instance, the hardware-dependent part of the console driver needs
the current position on the screen and in the video RAM, and the current
attribute byte for the display, but this information is not needed to
support an RS-232 line.
The private data structures for each device type are also where the
buffers that receive input from the interrupt service routines are
located.
Slow devices, such as the keyboard, do not need buffers as large as
those needed by fast devices.
Terminal Input
To better understand how the driver works, let us first look at how
characters typed in on the keyboard work their way through the system to
the program that wants them.
Although this section is intended as an overview we will use line number
references to help the reader find each function used.
You may find this a wild ride, getting input exercises code in tty.c,
keyboard.c, and console.c, all of which are large files,
When a user logs in on the system console, a shell is created for him
with /dev/console as standard input, standard output, and standard
error.
The shell starts up and tries to read from standard input by calling the
library procedure read.
This procedure sends a message that contains the file descriptor, buffer
address, and count to the file system.
This message is shown as (1) in Fig. 3-33.
After sending the message, the shell blocks, waiting for the
reply.
(User processes execute only the sendrec primitive, which combines a
send with a receive from the process sent to.)
The file system gets the message and locates the i-node corresponding
to the specified file descriptor.
This i-node is for the character special file /dev/console and contains
the major and minor device numbers for the terminal.
The major device type for terminals is 4; for the console the minor
device number is 0.
The file system indexes into its device map, dmap, to find the number
of the terminal driver, TTY.
Then it sends a message to TTY, shown as (2) in Fig. 3-33.
Normally, the user will not have typed anything yet, so the terminal
driver will be unable to satisfy the request.
It sends a reply back immediately to unblock the file system and report
that no characters are available, shown as (3).
The file system records the fact that a process is waiting for terminal
(i.e., keyboard) input in the console’s structure in tty_table and then
goes off to get the next request for work.
The user’s shell remains blocked until the requested characters arrive,
of course.
When a character is finally typed on the keyboard, it causes two
interrupts, one when the key is depressed and one when it is
released.
An important point is that a PC keyboard does not generate ASCII codes;
each key generates a scan code when pressed, and a different code when
released.
The lower 7 bits of the “press” and “release” codes are identical.
The difference is the most significant bit, which is a 0 when the key is
pressed and a 1 when it is released.
This also applies to modifier keys such as CTRL and SHIFT.
Although ultimately these keys do not cause ASCII codes to be returned
to the user process, they generate scan codes indicating which key was
pressed (the driver can distinguish between the left and right shift
keys if desired), and they still cause two interrupts per key.
Figure 3-33.
Read request from the keyboard when no characters are pending.
FS is the file system.
TTY is the terminal driver.
The TTY receives a message for every keypress and queues scan codes as
they are entered.
Later these are interpreted and assembled into a buffer of ASCII codes
which is copied to the user process.
The keyboard interrupt is IRQ 1.
This interrupt line is not accessible on the system bus, and can not be
shared by any other I/O adapter.
When_hwint01 calls intr_handle there will not be a long list of hooks to
traverse to find that the TTY should be notified.
In Fig. 3-33 we show the system task originating the notification
message (4) because it is generated by generic_handler in
system/do_irqctl.c (not listed), but this routine is called directly by
the low-level interrupt processing routines.
The system task process is not activated.
Upon receiving a HARD_INT message tty_task dispatches to kbd_interrupt
which in turn calls scan_keyboard.
Scan_keyboard makes three kernel calls (5, 6, 7) to cause the system
task to read from and write to several I/O ports, which ultimately
returns the scan code, then is added to a circular buffer.
A tty_events flag is then set to indicate this buffer contains
characters and is not empty.
No message is needed as of this point.
Every time the main loop of tty_task starts another cycle it inspects
the tty_events flag for each terminal device, and, for each device which
has the flag set, calls handle_events.
The tty_events flag can signal various kinds of activity (although input
is the most likely), so handle_events always calls the device-specific
functions for both input and output.
For input from the keyboard this results in a call to kb_read, which
keeps track of keyboard codes that indicate pressing or releasing of the
CTRL, SHIFT, and ALT keys and converts scan codes into ASCII
codes.
Kb_read in turn calls in_process, which processes the ASCII codes,
taking into account special characters and different flags that may be
set, including whether or not canonical mode is in effect.
The effect is normally to add characters to the console’s input queue in
tty_table, although some codes, for instance BACKSPACE, have other
effects.
Normally, also, in_process initiates echoing of the ASCII codes to the
display.
When enough characters have come in, the terminal driver makes
another kernel call (8) to ask the system task to copy the data to the
address requested by the shell.
The copying of the data is not message passing and for that reason is
shown by dashed lines (9) in Fig. 3-33.
More than one such line is shown because there may be more than one such
operation before the user’s request has been completely fulfilled.
When the operation is finally complete, the terminal driver sends a
message to the file system telling it that the work has been done (10),
and the file system reacts to this message by sending a message back to
the shell to unblock it (11).
The definition of when enough characters have come in depends upon
the terminal mode.
In canonical mode a request is complete when a linefeed, end-of-line, or
end-of-file code is received, and, in order for proper input processing
to be done, a line of input cannot exceed the size of the input
queue.
In noncanonical mode a read can request a much larger number of
characters, and in_process may have to transfer characters more than
once before a message is returned to the file system to indicate the
operation is complete.
Note that the system task copies the actual characters directly from
the TTY’s address space to that of the shell.
They do not go through the file system.
With block I/O, data pass through the file system to allow it to
maintain a buffer cache of the most recently used blocks.
If a requested block happens to be in the cache, the request can be
satisfied directly by the file system, without doing any actual disk
I/O.
For keyboard I/O, a cache makes no sense.
Furthermore, a request from the file system to a disk driver can always
be satisfied in at most a few hundred milliseconds, so there is no harm
in having the file system wait.
Keyboard I/O may take hours to complete, or may never be complete (in
canonical mode the terminal driver waits for a complete line, and it may
also wait a long time in noncanonical mode, depending upon the settings
of MIN and TIME).
Thus, it is unacceptable to have the file system block until a terminal
input request is satisfied.
Later on, it may happen that the user has typed ahead, and that
characters are available before they have been requested, from previous
interrupts and event 4.
In that case, events 1, 2, and 5 through 11 all happen in quick
succession after the read request; 3 does not occur at all.
Readers who are familiar with earlier versions of MINIX may remember
that in these versions the TTY driver (and all other drivers) were
compiled together with the kernel.
Each driver had its own interrupt handler in kernel space.
In the case of the keyboard driver, the interrupt handler itself could
buffer a certain number of scan codes, and also do some preliminary
processing (scan codes for most key releases could be dropped, only for
modifier keys like the shift key is it necessary to buffer the release
codes).
The interrupt handler itself did not send messages to the TTY driver,
because the probability was high that the TTY would not be blocked on a
receive and able to receive a message at any given time.
Instead, the clock interrupt handler awakened the TTY driver
periodically.
These techniques were adopted to avoid losing keyboard input.
Earlier we made something of a point of the differences between
handling expected interrupts, such as those generated by a disk
controller, and handling unpredictable interrupts like those from a
keyboard.
But in MINIX 3 nothing special seems to have been done to deal with the
problems of unpredictable interrupts.
How is this possible? One thing to keep in mind is the enormous
difference in performance between the computers for which the earliest
versions of MINIX were written and current designs.
CPU clock speeds have increased, and the number of clock cycles needed
to execute an instruction has decreased.
The minimum processor recommended for use with MINIX 3 is an
80386.
A slow 80386 will execute instructions approximately 20 times as fast as
the original IBM PC.
A 100 MHz Pentium will execute perhaps 25 times as fast as the slow
80386.
So perhaps CPU speed is enough.
Another thing to keep in mind is that keyboard input is very slow by
computer standards.
At 100 words per minute a typist enters fewer than 10 characters per
second.
Even with a fast typist the terminal driver will probably be sent an
interrupt message for each character typed at the keyboard.
However, in the case of other input devices higher data rates are
probable—rates 1000 or more times faster than those of a typist are
possible from a serial port connected to a 56,000-bps modem.
At that speed approximately 120 characters may be received by the modem
between clock ticks, but to allow for data compression on the modem link
the serial port connected to the modem must be able to handle at least
twice as many.
One thing to consider with a serial port, however, is that
characters, not scan codes, are transmitted, so even with an old UART
that does no buffering, there will be only one interrupt per keypress
instead of two.
And newer PCs are equipped with UARTs that typically buffer at least 16,
and perhaps as many 128 characters.
So one interrupt per character is not required.
For instance, a UART with a 16-character buffer might be configured to
interrupt when 14 characters are in the buffer.
Ethernet-based networks can deliver characters at a rate much faster
than a serial line, but ethernet adapters buffer entire packets, and
only one interrupt is necessary per packet.
We will complete our overview of terminal input by summarizing the
events that occur when the terminal driver is first activated by a read
request and when it is reactivated after receipt of keyboard input (see
Fig. 3-34).
In the first case, when a message comes in to the terminal driver
requesting characters from the keyboard, the main procedure, tty_task
calls do_read to handle the request.
Do_read stores the parameters of the call in the keyboard’s entry in
tty_table, in case there are insufficient characters buffered to satisfy
the request.
Figure 3-34.
Input handling in the terminal driver.
The left branch of the tree is taken to process a request to read
characters.
The right branch is taken when a keyboard message is sent to the driver
before a user has requested input.
[figure 3-X to be revised]
Then it calls in_transfer to get any input already waiting, and then
handle_events which in turn calls (via the function pointer
(*tp->tty_devread
)) kb_read and then in_transfer once
again, in order to try to milk the input stream for a few more
characters.
Kb_read calls several other procedures not shown in Fig. 3-34 to
accomplish its work.
The result is that whatever is immediately available is copied to the
user.
If nothing is available then, nothing is copied.
If the read is completed by in_transfer or by handle_events, a message
is sent to the file system when all characters have been transferred, so
the file system can unblock the caller.
If the read was not completed (no characters, or not enough characters)
do_read reports back to the file system, telling it whether it should
suspend the original caller, or, if a nonblocking read was requested,
cancel the read.
The right side of Fig. 3-34 summarizes the events that occur when the
terminal driver is awakened subsequent to an interrupt from the
keyboard.
When a character is typed, the interrupt “handler” kbd_interrupt calls
scan_keyboard which calls the system task to do the I/O.
(We put “handler” in quotes because it is not a real handler called when
an interrupt occurs, it is activated by a message sent to tty_task from
generic_handler in the system task.) Then kbd_interrupt puts the scan
code into the keyboard buffer, ibuf, and sets a flag to identify that
the console device has experienced an event.
When kbd_interrupt returns control to tty_task a continue statement
results in starting another iteration of the main loop.
The event flags of all terminal devices are checked and handle_events is
called for each device with a raised flag.
In the case of the keyboard, handle_events calls kb_read and
in_transfer, just as was done on receipt of the original read
request.
The events shown on the right side of the figure may occur several
times, until enough characters are received to fulfill the request
accepted by do_read after the first message from the FS.
If the FS tries to initiate a request for more characters from the same
device before the first request is complete, an error is returned.
Of course, each device is independent; a read request on behalf of a
user at a remote terminal is processed separately from one for a user at
the console.
The functions not shown in Fig. 3-34 that are called by kb_read
include map_key, which converts the key codes (scan codes) generated by
the hardware into ASCII codes, make_break, which keeps track of the
state of modifier keys such as the SHIFT key, and in_process, which
handles complications such as attempts by the user to backspace over
input entered by mistake, other special characters, and options
available in different input modes.
In_process also calls tty_echo, so the typed characters will be
displayed on the screen.
Terminal Output
In general, console output is simpler than terminal input, because
the operating system is in control and does not need to be concerned
with requests for output arriving at inconvenient times.
Also, because the MINIX 3 console is a memory-mapped display, output to
the console is particularly simple.
No interrupts are needed; the basic operation is to copy data from one
memory region to another.
On the other hand, all the details of managing the display, including
handling escape sequences, must be handled by the driver software.
As we did with keyboard input in the previous section, we will trace
through the steps involved in sending characters to the console
display.
We will assume in this example that the active display is being written;
minor complications caused by virtual consoles will be discussed
later.
When a process wants to print something, it generally calls
printf.
Printf calls write to send a message to the file system.
The message contains a pointer to the characters that are to be printed
(not the characters themselves).
The file system then sends a message to the terminal driver, which
fetches them and copies them to the video RAM.
Figure 3-35 shows the main procedures involved in output.
Figure 3-35.
Major procedures used in terminal output.
The dashed line indicates characters copied directly to ramqueue by
cons_write.
When a message comes in to the terminal driver requesting it to write
on the screen, do_write is called to store the parameters in the
console’s tty struct in the tty_table.
Then handle_events (the same function called whenever the tty_events
flag is found set) is called.
On every call this function calls both the input and output routines for
the device selected in its argument.
In the case of the console display this means that any keyboard input
that is waiting is processed first.
If there is input waiting, characters to be echoed are added to whatever
characters are already awaiting output.
Then a call is made to cons_write, the output procedure for
memory-mapped displays.
This procedure uses phys_copy to copy blocks of characters from the user
process to a local buffer, possibly repeating this and the following
steps a number of times, since the local buffer holds only 64
bytes.
When the local buffer is full, each 8-bit byte is transferred to another
buffer, ramqueue.
This is an array of 16-bit words.
Alternate bytes are filled in with the current value of the screen
attribute byte, which determines foreground and background colors and
other attributes.
When possible, characters are transferred directly into ramqueue, but
certain characters, such as control characters or characters that are
parts of escape sequences, need special handling.
Special handling is also required when a character’s screen position
would exceed the width of the screen, or when ramqueue becomes
full.
In these cases out_char is called to transfer the characters and take
whatever additional action is called for.
For instance, scroll_screen is called when a linefeed character is
received while addressing the last line of the screen, and parse_escape
handles characters during an escape sequence.
Usually out_char calls flush which copies the contents of ramqueue to
the video display memory, using the assembly language routine
mem_vid_copy.
Flush is also called after the last character is transferred into
ramqueue to be sure all output is displayed.
The final result of flush is to command the 6845 video controller chip
to display the cursor in the correct position.
Logically, the bytes fetched from the user process could be written
into the video RAM one per loop iteration.
However, accumulating the characters in ram-queue and then copying the
block with a call to mem_vid_copy are more efficient in the protected
memory environment of Pentium-class processors.
Interestingly, this technique was introduced in early versions of MINIX
3 that ran on older processors without protected memory.
The precursor of mem_vid_copy dealt with a timing problem—with older
video displays the copy into the video memory had to be done when the
screen was blanked during vertical retrace of the CRT beam to avoid
generating visual garbage all over the screen.
MINIX 3 no longer provides this support for obsolete equipment as the
performance penalty is too great.
However, the modern version of MINIX 3 benefits in other ways from
copying ram-queue as a block.
The video RAM available to a console is delimited in the console
structure by the fields c_start and c_limit.
The current cursor position is stored in the c_column and c_row
fields.
The coordinate (0, 0) is in the upper left corner of the screen, which
is where the hardware starts to fill the screen.
Each video scan begins at the address given by c_org and continues for
80 × 25 characters (4000 bytes).
In other words, the 6845 chip pulls the word at offset c_org from the
video RAM and displays the character byte in the upper left-hand corner,
using the attribute byte to control color, blinking, and so forth.
Then it fetches the next word and displays the character at (1,
0).
This process continues until it gets to (79, 0), at which time it begins
the second line on the screen, at coordinate (0, 1).
When the computer is first started, the screen is cleared, output is
written into the video RAM starting at location c_start, and c_org is
assigned the same value as c_start.
Thus the first line appears on the top line of the screen.
When output must go to a new line, either because the first line is full
or because a newline character is detected by out_char, output is
written into the location given by c_start plus 80.
Eventually, all 25 lines are filled, and scrolling of the screen is
required.
Some programs, editors, for example, require scrolling in the downward
direction too, when the cursor is on the top line and further movement
upward within the text is required.
There are two ways scrolling the screen can be managed.
In software scrolling, the character to be displayed at position (0, 0)
is always in the first location in video memory, word 0 relative to the
position pointed to by c_start, and the video controller chip is
commanded to display this location first by keeping the same address in
c_org.
When the screen is to be scrolled, the contents of relative location 80
in the video RAM, the beginning of the second line on the screen, is
copied to relative location 0, word 81 is copied to relative location 1,
and so on.
The scan sequence is unchanged, putting the data at location 0 in the
memory at screen position (0, 0) and the image on the screen appears to
have moved up one line.
The cost is that the CPU has moved 80 × 24 = 1920 words.
In hardware scrolling, the data are not moved in the memory; instead the
video controller chip is instructed to start the display at a different
point, for instance, with the data at word 80.
The bookkeeping is done by adding 80 to the contents of c_org, saving it
for future reference, and writing this value into the correct register
of the video controller chip.
This requires either that the controller be smart enough to wrap around
the video RAM, taking data from the beginning of the RAM (the address in
c_start) when it reaches the end (the address in c_limit), or that the
video RAM have more capacity than just the 80 × 2000 words necessary to
store a single screen of display.
Older display adapters generally have smaller memory but are able to
wrap around and do hardware scrolling.
Newer adapters generally have much more memory than needed to display a
single screen of text, but the controllers are not able to wrap.
Thus an adapter with 32,768 bytes of display memory can hold 204
complete lines of 160 bytes each, and can do hardware scrolling 179
times before the inability to wrap becomes a problem.
But, eventually a memory copy operation will be needed to move the data
for the last 24 lines back to location 0 in the video memory.
Whichever method is used, a row of blanks is copied to the video RAM to
ensure that the new line at the bottom of the screen is empty.
When virtual consoles are enabled, the available memory within a
video adapter is divided equally between the number of consoles desired
by properly initializing the c_start and c_limit fields for each
console.
This has an effect on scrolling.
On any adapter large enough to support virtual consoles, software
scrolling takes place every so often, even though hardware scrolling is
nominally in effect.
The smaller the amount of memory available to each console display, the
more frequently software scrolling must be used.
The limit is reached when the maximum possible number of consoles is
configured.
Then every scroll operation will be a software scroll operation.
The position of the cursor relative to the start of the video RAM can
be derived from c_column and c_row, but it is faster to store it
explicitly (in c_cur).
When a character is to be printed, it is put into the video RAM at
location c_cur, which is then updated, as is c_column.
Figure 3-36 summarizes the fields of the console structure that affect
the current position and the display origin.
Figure 3-36.
Fields of the console structure that relate to the current screen
position.
The characters that affect the cursor position (e.g., linefeed,
backspace) are handled by adjusting the values of c column, c row, and
c cur.
This work is done at the end of flush by a call to set 6845 which sets
the registers in the video controller chip.
The terminal driver supports escape sequences to allow screen editors
and other interactive programs to update the screen in a flexible
way.
The sequences supported are a subset of an ANSI standard and should be
adequate to allow many programs written for other hardware and other
operating systems to be easily ported to MINIX 3.
There are two categories of escape sequences: those that never contain a
variable parameter, and those that may contain parameters.
In the first category the only representative supported by MINIX 3 is
ESC M, which reverse indexes the screen, moving the cursor up one line
and scrolling the screen downward if the cursor is already on the first
line.
The other category can have one or two numeric parameters.
Sequences in this group all begin with ESC [.
The “[” is the control sequence introducer.
A table of escape sequences defined by the ANSI standard and recognized
by MINIX 3 was shown in Fig. 3-32.
Parsing escape sequences is not trivial.
Valid escape sequences in MINIX 3 can be as short as two characters, as
in ESC M, or up to 8 characters long in the case of a sequence that
accepts two numeric parameters that each can have a two-digit values as
in ESC [20;60H, which moves the cursor to line 20, column 60.
In a sequence that accepts a parameter, the parameter may be omitted,
and in a sequence that accepts two parameters either or both of them may
be omitted.
When a parameter is omitted or one that is outside the valid range is
used, a default is substituted.
The default is the lowest valid value.
Consider the following ways a program could construct a sequence to move to the upper-left corner of the screen:
ESC [H is acceptable, because if no parameters are entered the lowest valid parameters are assumed.
ESC [1;1H will correctly send the cursor to row 1 and column 1 (with ANSI, the row and column numbers start at 1).
Both ESC [1;H and ESC [;1H have an omitted parameter, which defaults to 1 as in the first example.
ESC [0;0H will do the same, since each parameter is less than the minimum valid value the minimum is substituted.
These examples are presented not to suggest one should deliberately use invalid parameters but to show that the code that parses such sequences is nontrivial.
MINIX 3 implements a finite state automaton to do this parsing.
The variable c_esc_state in the console structure normally has a value
of 0.
When out_char detects an ESC character, it changes c_esc_state to 1, and
subsequent characters are processed by parse_escape.
If the next character is the control sequence introducer, state 2 is
entered; otherwise the sequence is considered complete, and do_escape is
called.
In state 2, as long as incoming characters are numeric, a parameter is
calculated by multiplying the previous value of the parameter (initially
0) by 10 and adding the numeric value of the current character.
The parameter values are kept in an array and when a semicolon is
detected the processing shifts to the next cell in the array.
(The array in MINIX 3 has only two elements, but the principle is the
same).
When a nonnumeric character that is not a semicolon is encountered the
sequence is considered complete, and again do_escape is called.
The current character on entry to do_escape then is used to select
exactly what action to take and how to interpret the parameters, either
the defaults or those entered in the character stream.
This is illustrated in Fig. 3-44.
Loadable Keymaps
The IBM PC keyboard does not generate ASCII codes directly.
The keys are each identified with a number, starting with the keys that
are located in the upper left of the original PC keyboard—1 for the
“ESC” key, 2 for the “1”, and so on.
Each key is assigned a number, including modifier keys like the left
SHIFT and right SHIFT keys, numbers 42 and 54.
When a key is pressed, MINIX 3 receives the key number as a scan
code.
A scan code is also generated when a key is released, but the code
generated upon release has the most significant bit set (equivalent to
adding 128 to the key number).
Thus a key press and a key release can be distinguished.
By keeping track of which modifier keys have been pressed and not yet
released, a large number of combinations are possible.
For ordinary purposes, of course, two-finger combinations, such as
SHIFT-A or CTRL-D, are most manageable for two-handed typists, but for
special occasions three-key (or more) combinations are possible, for
instance, CTRL-SHIFT-A, or the well-known CTRL-ALT-DEL combination that
PC users recognize as the way to reset and reboot the system.
The complexity of the PC keyboard allows for a great deal of
flexibility in how it used.
A standard keyboard has 47 ordinary character keys defined (26
alphabetic, 10 numeric, and 11 punctuation).
If we are willing to use three-fingered modifier key combinations, such
as CTRL-ALT-SHIFT, we can support a character set of 376 (8 × 47)
members.
This is by no means the limit of what is possible, but for now let us
assume we do not want to distinguish between the left- and right-hand
modifier keys, or use any of the numeric keypad or function keys.
Indeed, we are not limited to using just the CTRL, ALT, and SHIFT keys
as modifiers; we could retire some keys from the set of ordinary keys
and use them as modifiers if we desired to write a driver that supported
such a system.
Operating systems that use such keyboards use a keymap to determine
what character code to pass to a program based upon the key being
pressed and the modifiers in effect.
The MINIX 3 keymap logically is an array of 128 rows, representing
possible scan code values (this size was chosen to accommodate Japanese
keyboards; U.S.
and European keyboards do not have this many keys) and 6 columns.
The columns represent no modifier, the SHIFT key, the Control key, the
left ALT key, the right ALT key, and a combination of either ALT key
plus the SHIFT key.
There are thus 720 ((128 − 6) × 6) character codes that can be generated
by this scheme, given an adequate keyboard.
This requires that each entry in the table be a 16-bit quantity.
For U.S.
keyboards the ALT and ALT2 columns are identical.
ALT2 is named ALTGR on keyboards for other languages, and many of these
keymaps support keys with three symbols by using this key as a
modifier.
A standard keymap (determined by the line
#include keymaps/us-std.src
in keyboard.c
) is compiled into the MINIX 3 kernel at
compilation time, but an
ioctl(0, KIOCSMAP, keymap)
call can be used to load a different map into the kernel at address
keymap.
A full keymap occupies 1536 bytes (128 × 6 × 2).
Extra keymaps are stored in compressed form.
A program called genmap is used to make a new compressed keymap.
When compiled, genmap includes the keymap.src code for a particular
keymap, so the map is compiled within genmap.
Normally, genmap is executed immediately after being compiled, at which
time it outputs the compressed version to a file, and then the genmap
binary is deleted.
The command loadkeys reads a compressed keymap, expands it internally,
and then calls ioctl to transfer the keymap into the kernel
memory.
MINIX 3 can execute loadkeys automatically upon starting, and the
program can also be invoked by the user at any time.
Figure 3-37.
A few entries from a keymap source file.
The source code for a keymap defines a large initialized array, and
in the interest of saving space a keymap file is not printed in Appendix
B.
Figure 3-37 shows in tabular form the contents of a few lines of
src/kernel/keymaps/us-std.src which illustrate several aspects of
keymaps.
There is no key on the IBM-PC keyboard that generates a scan code of
0.
The entry for code 1, the ESC key, shows that the value returned is
unchanged when the SHIFT key or CTRL key are pressed, but that a
different code is returned when an ALT key is pressed simultaneously
with the ESC key.
The values compiled into the various columns are determined by macros
defined in include/minix/keymap.h:
#define C(c) ((c) & 0x1F) /* Map to control code */
#define A(c) ((c) | 0x80) /* Set eight bit (ALT) */
#define CA(c) A(C(c)) /* CTRL-ALT */
#define L(c) ((c) | HASCAPS) /* Add "Caps Lock has effect" attribute */
The first three of these macros manipulate bits in the code for the
quoted character to produce the necessary code to be returned to the
application.
The last one sets the HASCAPS bit in the high byte of the 16-bit
value.
This is a flag that indicates that the state of the capslock variable
has to be checked and the code possibly modified before being
returned.
In the figure, the entries for scan codes 2, 13, and 16 show how typical
numeric, punctuation, and alphabetic keys are handled.
For code 28 a special feature is seen—normally the ENTER key produces a
CR (0x0D) code, represented here as C(‘M’) .
Because the newline character in UNIX files is the LF (0x0A) code, and
it is sometimes necessary to enter this directly, this keyboard map
provides for a CTRL-ENTER combination, which produces this code, C(‘J’)
.
Scan code 29 is one of the modifier codes and must be recognized no
matter what other key is pressed, so the CTRL value is returned
regardless of any other key that may be pressed.
The function keys do not return ordinary ASCII values, and the row for
scan code 59 shows symbolically the values (defined in
include/minix/keymap.h) that are returned for the F1 key in combination
with other modifiers.
These values are F1: 0x0110, SF1: 0x1010, AF1: 0x0810, ASF1: 0x0C10, and
CF1: 0x0210.
The last entry shown in the figure, for scan code 127, is typical of
many entries near the end of the array.
For many keyboards, certainly most of those used in Europe and the
Americas, there are not enough keys to generate all the possible codes,
and these entries in the table are filled with zeroes.
Loadable Fonts
Early PCs had the patterns for generating characters on a video
screen stored only in ROM, but the displays used on modern systems
provide RAM on the video display adapters into which custom character
generator patterns can be loaded.
This is supported by MINIX 3 with a
ioctl(0, TIOCSFON, font)
ioctl operation.
MINIX 3 supports an 80 lines × 25 rows video mode, and font files
contain 4096 bytes.
Each byte represents a line of 8 pixels that are illuminated if the bit
value is 1, and 16 such lines are needed to map each character.
However the video display adapter uses 32 bytes to map each character,
to provide higher resolution in modes not currently supported by MINIX
3.
The loadfont command is provided to convert these files into the
8192-byte font structure referenced by the ioctl call and to use that
call to load the font.
As with the keymaps, a font can be loaded at startup time, or at any
time during normal operation.
However, every video adapter has a standard font built into its ROM that
is available by default.
There is no need to compile a font into MINIX 3 itself, and the only
font support necessary in the kernel is the code to carry out the
TIOCSFON ioctl operation.
In this section we will begin to look at the source code of the
terminal driver in detail.
We saw when we studied the block devices that multiple drivers
supporting several different devices could share a common base of
software.
The case with the terminal devices is similar, but with the difference
that there is one terminal driver that supports several different kinds
of terminal device.
Here we will start with the device-independent code.
In later sections we will look at the device-dependent code for the
keyboard and the memory-mapped console display.
Terminal Driver Data Structures
The file tty.h contains definitions used by the C files which
implement the terminal drivers.
Since this driver supports many different devices, the minor device
numbers must be used to distinguish which device is being supported on a
particular call.
Within tty.h, the definitions of the O_NOCTTY and O_NONBLOCK flags
(which are optional arguments to the open call) are duplicates of
definitions in include/fcntl.h but they are repeated here so as not to
require including another file.
The devfun_t and devfunarg_t types are used to define pointers to
functions, in order to provide for indirect calls using a mechanism
similar to what we saw in the code for the main loop of the disk
drivers.
Many variables declared in this file are identified by the prefix
tty_.
The most important definition in tty.h is the tty structure.
There is one such structure for each terminal device (the console
display and keyboard together count as a single terminal).
The first variable in the tty structure, tty_events, is the flag that is
set when an interrupt causes a change that requires the terminal driver
to attend to the device.
The rest of the tty structure is organized to group together
variables that deal with input, output, status, and information about
incomplete operations.
In the input section, tty_inhead and tty_intail define the queue where
received characters are buffered.
Tty_incount counts the number of characters in this queue, and tty_eotct
counts lines or characters, as explained below.
All device-specific calls are done indirectly, with the exception of the
routines that initialize the terminals, which are called to set up the
pointers used for the indirect calls.
The tty_devread and tty_icancel fields hold pointers to device-specific
code to perform the read and input cancel operations.
Tty_min is used in comparisons with tty_eotct.
When the latter becomes equal to the former, a read operation is
complete.
During canonical input, tty_min is set to 1 and tty_eotct counts lines
entered.
During noncanonical input, tty_eotct counts characters and tty_min is
set from the MIN field of the termios structure.
The comparison of the two variables thus tells when a line is ready or
when the minimum character count is reached, depending upon the
mode.
Tty_tmr is a timer for this tty, used for the TIME field of termios.
Since queueing of output is handled by the device-specific code, the
output section of tty declares no variables and consists entirely of
pointers to device-specific functions that write, echo, send a break
signal, and cancel output.
In the status section the flags tty_reprint, tty_escaped, and
tty_inhibited indicate that the last character seen has a special
meaning; for instance, when a CTRL-V (LNEXT) character is seen,
tty_escaped is set to 1 to indicate that any special meaning of the next
character is to be ignored.
The next part of the structure holds data about DEV_READ, DEV_WRITE,
and DEV_IOCTL operations in progress.
There are two processes involved in each of these operations.
The server managing the system call (normally FS) is identified in
tty_incaller.
The server calls the tty driver on behalf of another process that needs
to do an I/O operation, and this client is identified in
tty_inproc.
As described in Fig. 3-33, during a read, characters are transferred
directly from the terminal driver to a buffer within the memory space of
the original caller.
Tty_inproc and tty_in_vir locate this buffer.
The next two variables, tty_inleft and tty_incum, count the characters
still needed and those already transferred.
Similar sets of variables are needed for a write system call.
For ioctl there may be an immediate transfer of data between the
requesting process and the driver, so a virtual address is needed, but
there is no need for variables to mark the progress of an
operation.
An ioctl request may be postponed, for instance, until current output is
complete, but when the time is right the request is carried out in a
single operation.
Finally, the tty structure includes some variables that fall into no
other category, including pointers to the functions to handle the
DEV_IOCTL and DEV_CLOSE operations at the device level, a POSIX-style
termios structure, and a winsize structure that provides support for
window-oriented screen displays.
The last part of the structure provides storage for the input queue
itself in the array tty_inbuf.
Note that this is an array of u16_t, not of 8-bit char characters.
Although applications and devices use 8-bit codes for characters, the C
language requires the input function getchar to work with a larger data
type so it can return a symbolic EOF value in addition to all 256
possible byte values.
The tty_table, an array of tty structures, is declared as
extern.
There is one array element for each terminal enabled by the NR_CONS,
NR_RS_LINES, and NR_PTYS definitions in include/minix/config.h.
For the configuration discussed in this book, two consoles are enabled,
but MINIX 3 may be recompiled to add additional virtual consoles, one or
two 2 serial lines, and up to 64 pseudo terminals.
There is one other extern definition in tty.h.
Tty_timers is a pointer used by the timer to hold the head of a linked
list of timer_t fields.
The tty.h header file is included in many files and storage for
tty_table and tty_timers is allocated during compilation of tty.c.
Two macros, buflen and bufend, are defined.
These are used frequently in the terminal driver code, which does much
copying of data into and out of buffers.
The Device-Independent Terminal Driver
The main terminal driver and the device-independent supporting
functions are all in tty.c.
Following this there are a number of macro definitions.
If a device is not initialized, the pointers to that device’s
device-specific functions will contain zeroes put there by the C
compiler.
This makes it possible to define the tty_active macro, which returns
FALSE if a null pointer is found.
Of course, the initialization code for a device cannot be accessed
indirectly if part of its job is to initialize the pointers that make
indirect access possible.
There are conditional macro definitions to equate initialization calls
for RS-232 or pseudo terminal devices to calls to a null function when
these devices are not configured.
Do_pty may be similarly disabled in this section.
This makes it possible to omit the code for these devices entirely if it
is not needed.
Since there are so many configurable parameters for each terminal,
and there may be quite a few terminals on a networked system, a
termios_defaults structure is declared and initialized with default
values (all of which are defined in include/termios.h).
This structure is copied into the tty_table entry for a terminal
whenever it is necessary to initialize or reinitialize it.
The defaults for the special characters were shown in Fig. 3-29.
Figure 3-38 shows the default values for the various flags used.
On the following line the winsize_defaults structure is similarly
declared.
It is left to be initialized to all zeroes by the C compiler.
This is the proper default action; it means “window size is unknown, use
/etc/termcap.”
The final set of definitions before executable code begins are the PUBLIC declarations of global variables previously declared as extern in tty.h.
Figure 3-38.
Default termios flag values.
The entry point for the terminal driver is tty_task.
Before entering the main loop, a call is made to tty_init.
Information about the host machine that will be needed to initialize the
keyboard and the console is obtained by a sys_getmachine kernel call,
and then the keyboard hardware is initialized.
The routine called for this is kb_init_once.
It is so named to distinguish it from another initialization routine
which is called as part of initialization of each virtual console later
on.
Finally, a single 0 is printed to exercise the output system and kick
anything that does not get initialized until first use.
The source code shows a call to printf, but this is not the same printf
used by user programs, it is a special version that calls a local
function in the console driver called putk.
The main loop is, in principle, like the main loop of any driver—it
receives a message, executes a switch on the message type to call the
appropriate function, and then generates a return message.
However, there are some complications.
The first one is that since the last interrupt additional characters may
have been read or characters to be written to an output device may be
ready.
Before attempting to receive a message, the main loop always checks the
tp−>tty_events flags for all terminals and handle_events is called as
necessary to take care of unfinished business.
Only when nothing demands immediate attention is a call made to
receive.
The diagram showing message types in the comments near the beginning
of tty.c shows the most often used types.
A number of message types requesting specialized services from the
terminal driver are not shown.
These are not specific to any one device.
The tty_task main loop checks for these and handles them before checking
for device-specific messages.
First a check is made for a SYN_ALARM message, and, if this is the
message type a call is made to expire_timers to cause a watchdog routine
to execute.
Then comes a continue statement.
In fact all of the next few cases we will look at are followed by
continue.
We will say more about this soon.
The next message type tested for is HARD_INT.
This is most likely the result of a key being pressed or released on the
local keyboard.
It could also mean bytes have been received by a serial port, if serial
ports are enabled—in the configuration we are studying they are not, but
we left conditional code in the file here to illustrate how serial port
input would be handled.
A bit field in the message is used to determine the source of the
interrupt.
Next a check is made for SYS_SIG.
System processes (drivers and servers) are expected to block waiting for
messages.
Ordinary signals are received only by active processes, so the standard
UNIX signaling method does not work with system processes.
A SYS_SIG message is used to signal a system process.
A signal to the terminal driver can mean the kernel is shutting down
(SIGKSTOP), the terminal driver is being shut down (SIGTERM), or the
kernel needs to print a message to the console (SIGKMESS), and
appropriate routines are called for these cases.
The last group of non-device-specific messages are PANIC_DUMPS,
DIAGNOSTICS, and FKEY_CONTROL.
We will say more about these when we get to the functions that service
them.
Now, about the continue statements: in the C language, a continue
statement short-circuits a loop, and returns control to the top of the
loop.
So if any one of the message types mentioned so far is detected, as soon
as it is serviced control returns to the top of the main loop, at line
13764, the check for events is repeated, and receive is called again to
await a new message.
Particularly in the case of input it is important to be ready to respond
again as quickly as possible.
Also, if any of the message-type tests in the first part of the loop
succeeded there is no point in making any of the tests that come after
the first switch.
Above we mentioned complications that the terminal driver must deal
with.
The second complication is that this driver services several
devices.
If the interrupt is not a hardware interrupt the TTY_LINE field in the
message is used to determine which device should respond to the
message.
The minor device number is decoded by a series of comparisons, by means
of which tp is pointed to the correct entry in the tty_table.
If the device is a pseudo terminal, do_pty (in pty.c) is called and the
main loop is restarted.
In this case do_pty generates the reply message.
Of course, if pseudo terminals are not enabled, the call to do_pty uses
the dummy macro defined earlier.
One would hope that attempts to access nonexistent devices would not
occur, but it is always easier to add another check than to verify that
there are no errors elsewhere in the system.
In case the device does not exist or is not configured, a reply message
with an ENXIO error message is generated and, again, control returns to
the top of the loop.
The rest of this driver resembles what we have seen in the main loop
of other drivers, a switch on the message type.
The appropriate function for the type of request, do_read, do_write, and
so on, is called.
In each case the called function generates the reply message, rather
than pass the information needed to construct the message back to the
main loop.
A reply message is generated at the end of the main loop only if a valid
message type was not received, in which case an EINVAL error message is
sent.
Because reply messages are sent from many different places within the
terminal driver a common routine, tty_reply, is called to handle the
details of constructing reply messages.
If the message received by tty_task is a valid message type, not the
result of an interrupt, and does not come from a pseudo terminal, the
switch at the end of the main loop will dispatch to one of the functions
do_read, do_write, do_ioctl, do_open, do_close, do_select, or
do_cancel.
The arguments to each of these calls are tp, a pointer to a tty
structure, and the address of the message.
Before looking at each of them in detail, we will mention a few general
considerations.
Since tty_task may service multiple terminal devices, these functions
must return quickly so the main loop can continue.
However, do_read, do_write, and do_ioctl may not be able to complete
all the requested work immediately.
In order to allow FS to service other calls, an immediate reply is
required.
If the request cannot be completed immediately, the SUSPEND code is
returned in the status field of the reply message.
This corresponds to the message marked (3) in Fig. 3-33 and suspends the
process that initiated the call, while unblocking the FS.
Messages corresponding to (10) and (11) in the figure will be sent later
when the operation can be completed.
If the request can be fully satisfied, or an error occurs, either the
count of bytes transferred or the error code is returned in the status
field of the return message to the FS.
In this case a message will be sent immediately from the FS back to the
process that made the original call, to wake it up.
Reading from a terminal is fundamentally different from reading from
a disk device.
The disk driver issues a command to the disk hardware and eventually
data will be returned, barring a mechanical or electrical failure.
The computer can display a prompt upon the screen, but there is no way
for it to force a person sitting at the keyboard to start typing.
For that matter, there is no guarantee that anybody will be sitting
there at all.
In order to make the speedy return that is required, do_read starts by
storing information that will enable the request to be completed later,
when and if input arrives.
There are a few error checks to be made first.
It is an error if the device is still expecting input to fulfill a
previous request, or if the parameters in the message are invalid.
If these tests are passed, information about the request is copied into
the proper fields in the device’s tp−>tty_table entry.
The last step, setting tp−>tty_inleft to the number of characters
requested, is important.
This variable is used to determine when the read request is
satisfied.
In canonical mode tp−>tty_inleft is decremented by one for each
character returned, until an end of line is received, at which point it
is suddenly reduced to zero.
In noncanonical mode it is handled differently, but in any case it is
reset to zero whenever the call is satisfied, whether by a timeout or by
receiving at least the minimum number of bytes requested.
When tp−>tty_inleft reaches zero, a reply message is sent.
As we will see, reply messages can be generated in several places.
It is sometimes necessary to check whether a reading process still
expects a reply; a nonzero value of tp−>tty_inleft serves as a flag
for that purpose.
In canonical mode a terminal device waits for input until either the
number of characters asked for in the call has been received, or the end
of a line or the end of the file is reached.
The ICANON bit in the termios structure is tested to see if canonical
mode is in effect for the terminal.
If it is not set, the termios MIN and TIME values are checked to
determine what action to take.
In Fig. 3-31 we saw how MIN and TIME interact to provide different
ways a read call can behave.
TIME is tested.
A value of zero corresponds to the left-hand column in Fig. 3-31, and in
this case no further tests are needed at this point.
If TIME is nonzero, then MIN is tested.
If it is zero, settimer is called to start the timer that will terminate
the DEV_READ request after a delay, even if no bytes have been
received.
Tp−>tty_min is set to 1 here, so the call will terminate immediately
if one or more bytes are received before the timeout.
At this point no check for possible input has yet been made, so more
than one character may already be waiting to satisfy the request.
In that case, as many characters as are ready, up to the number
specified in the read call, will be returned as soon as the input is
found.
If both TIME and MIN are nonzero, the timer has a different
meaning.
The timer is used as an inter-character timer in this case.
It is started only after the first character is received and is
restarted after each successive character.
Tp−>tty_eotct counts characters in noncanonical mode, and if it is
zero at line 13993, no characters have been received yet and the
inter-byte timer is inhibited.
In any case, at line 14001, in_transfer is called to transfer any
bytes already in the input queue directly to the reading process.
Next there is a call to handle_events, which may put more data into the
input queue and which calls in_transfer again.
This apparent duplication of calls requires some explanation.
Although the discussion so far has been in terms of keyboard input,
do_read is in the device-independent part of the code and also services
input from remote terminals connected by serial lines.
It is possible that previous input has filled the RS-232 input buffer to
the point where input has been inhibited.
The first call to in_transfer does not start the flow again, but the
call to handle_events can have this effect.
The fact that it then causes a second call to in_transfer is just a
bonus.
The important thing is to be sure the remote terminal is allowed to send
again.
Either of these calls may result in satisfaction of the request and
sending of the reply message to the FS.
Tp−>tty_inleft is used as a flag to see if the reply has been sent;
if it is still nonzero at line 14004, do_read generates and sends the
reply message itself.
(We assume here that no use has been made of the select system call, and
therefore there will be no call to select_retry).
If the original request specified a nonblocking read, the FS is told
to pass an EAGAIN error code back to original caller.
If the call is an ordinary blocking read, the FS receives a SUSPEND
code, unblocking it but telling it to leave the original caller
blocked.
In this case the terminal’s tp−>tty_inrepcode field is set to
REVIVE.
When and if the read is later satisfied, this code will be placed in the
reply message to the FS to indicate that the original caller was put to
sleep and needs to be revived.
Do_write is similar to do_read, but simpler, because there are fewer
options to be concerned about in handling a write system call.
Checks similar to those made by do_read are made to see that a previous
write is not still in progress and that the message parameters are
valid, and then the parameters of the request are copied into the tty
structure.
Handle_events is then called, and tp−>tty_outleft is checked to see
if the work was done.
If so, a reply message already has been sent by handle_events and there
is nothing left to do.
If not, a reply message is generated.
with the message parameters depending upon whether or not the original
write call was called in nonblocking mode.
The next function, do_ioctl, is a long one, but not difficult to
understand.
The body of do_ioctl is two switch statements.
The first determines the size of the parameter pointed to by the pointer
in the request message.
If the size is not zero, the parameter’s validity is tested.
The contents cannot be tested here, but what can be tested is whether a
structure of the required size beginning at the specified address fits
within the segment it is specified to be in.
The rest of the function is another switch on the type of ioctl
operation requested.
Unfortunately, supporting the POSIX-required operations with the
ioctl call meant that names for ioctl operations had to be invented that
suggest, but do not duplicate, names required by POSIX.
Figure 3-39 shows the relationship between the POSIX request names and
the names used by the MINIX 3 ioctl call.
A TCGETS operation services a tcgetattr call by the user and simply
returns a copy of the terminal device’s tp−>tty_termios
structure.
The next four request types share code.
The TCSETSW, TCSETSF, and TCSETS request types correspond to user calls
to the POSIX-defined function tcsetattr, and all have the basic action
of copying a new termios structure into a terminal’s tty
structure.
The copying is done immediately for TCSETS calls and may be done for
TCSETSW and TCSETSF calls if output is complete, by a sys vircopy
kernel call to get the data from the user, followed by a call to
setattr.
If tcsetattr was called with a modifier requesting postponement of the
action until completion of current output, the parameters for the
request are placed in the terminal’s tty structure for later processing
if the test of tp−>tty outleft reveals output is not complete.
Tcdrain suspends a program until output is complete and is translated
into an ioctl call of type TCDRAIN.
If output is already complete, it has nothing more to do.
If output is not complete, it also must leave information in the tty
structure.
Figure 3-39.
POSIX calls and IOCTL operations.
The POSIX tcflush function discards unread input and/or unsent output
data, according to its argument, and the ioctl translation is
straightforward, consisting of a call to the tty_icancel function that
services all terminals, and/or the device-specific function pointed to
by tp−>tty_ocancel.
Tcflow is similarly translated in a straightforward way into an ioctl
call.
To suspend or restart output, it sets a TRUE or FALSE value into
tp−>tty_inhibited and then sets the tp−>tty_events flag.
To suspend or restart input, it sends the appropriate STOP (normally
CTRL-S) or START (CTRL-Q) code to the remote terminal, using the
device-specific echo routine pointed to by tp−>tty_echo.
Most of the rest of the operations handled by do_ioctl are handled in
one line of code, by calling an appropriate function.
In the cases of the KIOCSMAP (load keymap) and TIOCSFON (load font)
operations, a test is made to be sure the device really is a console,
since these operations do not apply to other terminals.
If virtual terminals are in use the same keymap and font apply to all
consoles, the hardware does not permit any easy way of doing
otherwise.
The window size operations copy a winsize structure between the user
process and the terminal driver.
Note, however, the comment under the code for the TIOCSWINSZ
operation.
When a process changes its window size, the kernel is expected to send a
SIGWINCH signal to the process group under some versions of UNIX .
The signal is not required by the POSIX standard and is not implemented
in MINIX 3.
However, anyone thinking of using these structures should consider
adding code here to initiate this signal.
The last two cases in do_ioctl support the POSIX required tcgetpgrp
and tcsetpgrp functions.
There is no action associated with these cases, and they always return
an error.
There is nothing wrong with this.
These functions support job control, the ability to suspend and restart
a process from the keyboard.
Job control is not required by POSIX and is not supported by MINIX
3.
However, POSIX requires these functions, even when job control is not
supported, to ensure portability of programs.
Do_open has a simple basic action to perform—it increments the
variable tp−>tty_openct for the device so it can be verified that it
is open.
However, there are some tests to be done first.
POSIX specifies that for ordinary terminals the first process to open a
terminal is the session leader, and when a session leader dies, access
to the terminal is revoked from other processes in its group.
Daemons need to be able to write error messages, and if their error
output is not redirected to a file, it should go to a display that
cannot be closed.
For this purpose a device called /dev/log exists in MINIX 3.
Physically it is the same device as /dev/console, but it is addressed by
a separate minor device number and is treated differently.
It is a write-only device, and thus do_open returns an EACCESS error if
an attempt is made to open it for reading.
The other test done by do_open is to test the O_NOCTTY flag.
If it is not set and the device is not /dev/log, the terminal becomes
the controlling terminal for a process group.
This is done by putting the process number of the caller into the
tp−>tty_pgrp field of the tty_table entry.
Following this, the tp−>tty_openct variable is incremented and the
reply message is sent.
A terminal device may be opened more than once, and the next
function, do_close, has nothing to do except decrement
tp−>tty_openct.
The test foils an attempt to close the device if it happens to be
/dev/log.
If this operation is the last close, input is canceled by calling
tp−>tty_icancel.
Device-specific routines pointed to by tp−>tty_ocancel and
tp−>tty_close are also called.
Then various fields in the tty structure for the device are set back to
their default values and the reply message is sent.
The last message type handler we will consider is do_cancel.
This is invoked when a signal is received for a process that is blocked
trying to read or write.
There are three states that must be checked:
A test is made for each case, and the general tp−>tty_icancel, or
the device-specific routine pointed to by tp−>tty_ocancel, is called
as necessary.
In the last case the only action required is to reset the flag
tp−>tty_ioreq, to indicate the ioctl operation is now complete.
Finally, the tp−>tty_events flag is set and a reply message is
sent.
Terminal Driver Support Code
Now that we have looked at the top-level functions called in the main
loop of tty_task, it is time to look at the code that supports
them.
We will start with handle_events.
As mentioned earlier, on each pass through the main loop of the terminal
driver, the tp−>tty_events flag for each terminal device is checked
and handle_events is called if it shows that attention is required for a
particular terminal.
Do_read and do_write also call handle_events.
This routine must work fast.
It resets the tp−>tty_events flag and then calls device-specific
routines to read and write, using the pointers to the functions
tp−>tty_devread and tp−>tty devwrite.
These functions are called unconditionally, because there is no way
to test whether a read or a write caused the raising of the flag—a
design choice was made here, that checking two flags for each device
would be more expensive than making two calls each time a device was
active.
Also, most of the time a character received from a terminal must be
echoed, so both calls will be necessary.
As noted in the discussion of the handling of tcsetattr calls by
do_ioctl, POSIX may postpone control operations on devices until current
output is complete, so immediately after calling the device-specific
tty_devwrite function is a good time take care of ioctl
operations.
This is done where dev_ioctl is called if there is a pending control
request.
Since the tp−>tty_events flag is raised by interrupts, and
characters may arrive in a rapid stream from a fast device, there is a
chance that by the time the calls to the device-specific read and write
routines and dev_ioctl are completed, another interrupt will have raised
the flag again.
A high priority is placed on getting input moved along from the buffer
where the interrupt routine places it initially.
Thus handle_events repeats the calls to the device-specific routines as
long as the tp−>tty_events flag is found raised at the end of the
loop.
When the flow of input stops (it also could be output, but input is more
likely to make such repeated demands), in_transfer is called to transfer
characters from the input queue to the buffer within the process that
called for the read operation.
In_transfer itself sends a reply message if the transfer completes the
request, either by transferring the maximum number of characters
requested or by reaching the end of a line (in canonical mode).
If it does so, tp−>tty_left will be zero upon the return to
handle_events.
Here a further test is made and a reply message is sent if the number of
characters transferred has reached the minimum number requested.
Testing tp−>tty_inleft prevents sending a duplicate message.
Next we will look at in_transfer, which is responsible for moving
data from the input queue in the driver’s memory space to the buffer of
the user process that requested the input.
However, a straightforward block copy is not possible here.
The input queue is a circular buffer and characters have to be checked
to see that the end of the file has not been reached, or, if canonical
mode is in effect, that the transfer only continues up through the end
of a line.
Also, the input queue is a queue of 16-bit quantities, but the
recipient’s buffer is an array of 8-bit characters.
Thus an intermediate local buffer is used.
Characters are checked one by one as they are placed in the local
buffer, and when it fills up or when the input queue has been emptied,
sys_vircopy is called to move the contents of the local buffer to the
receiving process’ buffer.
Three variables in the tty structure, tp−>tty_inleft,
tp−>tty_eotct, and tp−>tty_min, are used to decide whether
in_transfer has any work to do, and the first two of these control its
main loop.
As mentioned earlier, tp−>tty_inleft is set initially to the number
of characters requested by a read call.
Normally, it is decremented by one whenever a character is transferred
but it may be abruptly decreased to zero when a condition signaling the
end of input is reached.
Whenever it becomes zero, a reply message to the reader is generated, so
it also serves as a flag to indicate whether or not a message has been
sent.
Thus in the test, finding that tp−>tty_inleft is already zero is a
sufficient reason to abort execution of in_transfer without sending a
reply.
In the next part of the test, tp−>tty_eotct and tp−>tty_min are
compared.
In canonical mode both of these variables refer to complete lines of
input, and in noncanonical mode they refer to characters.
Tp−>tty_eotct is incremented whenever a “line break” or a byte is
placed in the input queue and is decremented by in_transfer whenever a
line or byte is removed from the queue.
In other words, it counts the number of lines or bytes that have been
received by the terminal driver but not yet passed on to a reader.
Tp−>tty_min indicates the minimum number of lines (in canonical mode)
or characters (in noncanonical mode) that must be transferred to
complete a read request.
Its value is always 1 in canonical mode and may be any value from 0 up
to MAX_INPUT (255 in MINIX 3) in noncanonical mode.
The second half of the test causes in_transfer to return immediately in
canonical mode if a full line has not yet been received.
The transfer is not done until a line is complete so the queue contents
can be modified if, for instance, an ERASE or KILL character is
subsequently typed in by the user before the ENTER key is pressed.
In noncanonical mode an immediate return occurs if the minimum number of
characters is not yet available.
A few lines later, tp−>tty_inleft and tp−>tty_eotct are used to
control the main loop of in_transfer.
In canonical mode the transfer continues until there is no longer a
complete line left in the queue.
In noncanonical mode tp−>tty_eotct is a count of pending
characters.
Tp−>tty_min controls whether the loop is entered but is not used in
determining when to stop.
Once the loop is entered, either all available characters or the number
of characters requested in the original call will be transferred,
whichever is smaller.
Figure 3-40.
The fields in a character code as it is placed into the input queue.
Characters are 16-bit quantities in the input queue.
The actual character code to be transferred to the user process is in
the low 8 bits.
Fig. 3-40 shows how the high bits are used.
Three are used to flag whether the character is being escaped (by
CTRL-V), whether it signifies end-of-file, or whether it represents one
of several codes that signify a line is complete.
Four bits are used for a count to show how much screen space is used
when the character is echoed.
The test checks whether the IN EOF bit (D in the figure) is set.
This is tested at the top of the inner loop because an end-of-file
(CTRL-D) is not itself transferred to a reader, nor is it counted in the
character count.
As each character is transferred, a mask is applied to zero the upper 8
bits, and only the ASCII value in the low 8 bits is transferred into the
local buffer.
There is more than one way to signal the end of input, but the
device-specific input routine is expected to determine whether a
character received is a linefeed, CTRL-D, or other such character and to
mark each such character.
In transfer only needs to test for this mark, the IN EOT bit (N in
Fig. 3-40).
If this is detected, tp−>tty eotct is decremented.
In noncanonical mode every character is counted this way as it is put
into the input queue, and every character is also marked with the IN
EOT bit at that time, so tp−>tty eotct counts characters not yet
removed from the queue.
The only difference in the operation of the main loop of in transfer in
the two different modes is found.
Here tp−>tty inleft is zeroed in response to finding a character
marked as a line break, but only if canonical mode is in effect.
Thus when control returns to the top of the loop, the loop terminates
properly after a line break in canonical mode, but in noncanonical line
breaks are ignored.
When the loop terminates there is usually a partially full local
buffer to be transferred.
Then a reply message is sent if tp−>tty_inleft has reached
zero.
This is always the case in canonical mode, but if noncanonical mode is
in effect and the number of characters transferred is less than the full
request, the reply is not sent.
This may be puzzling if you have a good enough memory for details to
remember that where we have seen calls to in_transfer (in do_read and
handle_events), the code following the call to in_transfer sends a reply
message if in_transfer returns having transferred more than the amount
specified in tp−>tty_min, which will certainly be the case
here.
The reason why a reply is not made unconditionally from in_transfer will
be seen when we discuss the next function, which calls in_transfer under
a different set of circumstances.
That next function is in_process.
It is called from the device-specific software to handle the common
processing that must be done on all input.
Its parameters are a pointer to the tty structure for the source device,
a pointer to the array of 8-bit characters to be processed, and a
count.
The count is returned to the caller.
In_process is a long function, but its actions are not
complicated.
It adds 16-bit characters to the input queue that is later processed by
in_transfer.
There are several categories of treatment provided by in_transfer.
Normal characters are added to the input queue, extended to 16 bits.
Characters which affect later processing modify flags to signal the effect but are not placed in the queue.
Characters which control echoing are acted upon immediately without being placed in the queue.
Characters with special significance have codes such as the EOT bit added to their high byte as they are placed in the input queue.
Let us look first at a completely normal situation: an ordinary
character, such as “x” (ASCII code 0x78), typed in the middle of a short
line, with no escape sequence in effect, on a terminal that is set up
with the standard MINIX 3 default properties.
As received from the input device this character occupies bits 0 through
7 in Fig. 3-40.
On line 14504 it would have its most significant bit, bit 7, reset to
zero if the ISTRIP bit were set, but the default in MINIX 3 is not to
strip the bit, allowing full 8-bit codes to be entered.
This would not affect our “x” anyway.
The MINIX 3 default is to allow extended processing of input, so the
test of the IEXTEN bit in tp−>tty_termios.c_lflag passes, but the
succeeding tests fail under the conditions we postulate: no character
escape is in effect, this input is not itself the character escape
character, and this input is not the REPRINT character.
Tests on the next several lines find that the input character is not
the special_POSIX_VDISABLE character, nor is it a CR or an NL.
Finally, a positive result:
canonical mode is in effect, this is the normal default.
However our “x” is not the ERASE character, nor is it any of the KILL,
EOF (CTRL-D), NL, or EOL characters, so by line 14576 still nothing will
have happened to it.
Here it is found that the IXON bit is set, by default, allowing use of
the STOP (CTRL-S) and START (CTRL-Q) characters, but in the ensuing
tests for these no match is found.
On line 14597 it is found that the ISIG bit, enabling the use of the
INTR and QUIT characters, is set by default, but again no match is
found.
In fact, the first interesting thing that might happen to an ordinary
character occurs, where a test is made to see if the input queue is
already full.
If this were the case, the character would be discarded at this point,
since canonical mode is in effect, and the user would not see it echoed
on the screen.
(The continue statement discards the character, since it causes the
outer loop to restart).
However, since we postulate completely normal conditions for this
illustration, let us assume the buffer is not full yet.
The next test, to see if special noncanonical mode processing is needed,
fails, causing a jump forward to line 14629.
Here echo is called to display the character to the user, since the ECHO
bit in tp−>tty_termios.c_lflag is set by default.
Finally, the character is disposed of by being put into the input
queue.
At this time tp−>tty_incount is incremented, but since this is an
ordinary character, not marked by the EOT bit, tp−>tty_eotct is not
changed.
The last line in the loop calls in_transfer if the character just
transferred into the queue fills it.
However, under the ordinary conditions we postulate for this example,
in_transfer would do nothing, even if called, since (assuming the queue
has been serviced normally and previous input was accepted when the
previous line of input was complete) tp−>tty_eotct is zero,
tp−>tty_min is one, and the test at the start of in_transfer causes
an immediate return.
Having passed through in_process with an ordinary character under
ordinary conditions, let us now go back to the start of in_process and
look at what happens in less ordinary circumstances.
First, we will look at the character escape, which allows a character
which ordinarily has a special effect to be passed on to the user
process.
If a character escape is in effect, the tp−>tty_escaped flag is set,
and when this is detected the flag is reset immediately and the IN_ESC
bit, bit V in Fig. 3-40, is added to the current character.
This causes special processing when the character is echoed—escaped
control characters are displayed as “ˆ” plus the character to make them
visible.
The IN_ESC bit also prevents the character from being recognized by
tests for special characters.
The next few lines process the escape character itself, the LNEXT
character (CTRL-V by default).
When the LNEXT code is detected the tp−>tty_escaped flag is set, and
rawecho is called twice to output a “ˆ” followed by a backspace.
This reminds the user at the keyboard that an escape is in effect, and
when the following character is echoed, it overwrites the “ˆ”.
The LNEXT character is an example of one that affects later characters
(in this case, only the very next character).
It is not placed in the queue, and the loop restarts after the two calls
to rawecho.
The order of these two tests is important, making it possible to enter
the LNEXT character itself twice in a row, in order to pass the second
copy on to a process as actual data.
The next special character processed by in_process is the REPRINT
character (CTRL-R).
When it is found a call to reprint ensues, causing the current echoed
output to be redisplayed.
The REPRINT itself is then discarded with no effect upon the input
queue.
Going into detail on the handling of every special character would be
tedious, and the source code of in_process is straightforward.
We will mention just a few more points.
One is that the use of special bits in the high byte of the 16-bit value
placed in the input queue makes it easy to identify a class of
characters that have similar effects.
Thus, EOT (CTRL-D), LF, and the alternate EOL character (undefined by
default) are all marked by the EOT bit, bit D in Fig. 3-40, making later
recognition easy.
Finally, we will justify the peculiar behavior of in_transfer noted
earlier.
A reply is not generated each time it terminates, although in the calls
to in_transfer we have seen previously, it seemed that a reply would
always be generated upon return.
Recall that the call to in_transfer made by in_process when the input
queue is full has no effect when canonical mode is in effect.
But if noncanonical processing is desired, every character is marked
with the EOT bit, and thus every character is counted by
tp−>tty_eotct.
In turn, this causes entry into the main loop of in_transfer when it is
called because of a full input queue in noncanonical mode.
On such occasions no message should be sent at the termination of
in_transfer, because there are likely to be more characters read after
returning to in_process.
Indeed, although in canonical mode input to a single read is limited by
the size of the input queue (255 characters in MINIX 3), in noncanonical
mode a read call must be able to deliver the POSIX-required
constant_POSIX_SSIZE_MAX number of characters.
Its value in MINIX 3 is 32767.
The next few functions in tty.c support character input.
Tty_echo treats a few characters in a special way, but most just get
displayed on the output side of the same device being used for
input.
Output from a process may be going to a device at the same time input is
being echoed, which makes things messy if the user at the keyboard tries
to backspace.
To deal with this, the tp−>tty_reprint flag is always set to TRUE by
the device-specific output routines when normal output is produced, so
the function called to handle a backspace can tell that mixed output has
been produced.
Since tty_echo also uses the device-output routines, the current value
of tp−>tty_reprint is preserved while echoing, using the local
variable rp.
However, if a new line of input has just begun, rp is set to FALSE
instead of taking on the old value, thus assuring that
tp−>tty_reprint will be reset when echo terminates.
You may have noticed that tty_echo returns a value, for instance, in the call in in_process:
ch = tty_echo(tp, ch)
The value returned by echo contains the number of spaces used on the
screen for the echo display, which may be up to eight if the character
is a TAB.
This count is placed in the cccc field in Fig. 3-40.
Ordinary characters occupy one space on the screen, but if a control
character (other than TAB, NL, or CR or a DEL (0x7F) is echoed, it is
displayed as a “ˆ” plus a printable ASCII character and occupies two
positions on the screen.
On the other hand an NL or CR occupies zero spaces.
The actual echoing must be done by a device-specific routine, of course,
and whenever a character must be passed to the device, an indirect call
is made using tp−>tty_echo, as, for instance, for ordinary
characters.
The next function, rawecho, is used to bypass the special handling
done by echo.
It checks to see if the ECHO flag is set, and if it is, sends the
character along to the device-specific tp−>tty_echo routine without
any special processing.
A local variable rp is used here to prevent rawecho’s own call to the
output routine from changing the value of tp−>tty_reprint.
When a backspace is found by in_process, the next function,
back_over, is called.
It manipulates the input queue to remove the previous head of the queue
if backing up is possible—if the queue is empty or if the last character
is a line break, then backing up is not possible.
Here the tp−>tty_reprint flag mentioned in the discussions of echo
and rawecho is tested.
If it is TRUE, then reprint is called to put a clean copy of the output
line on the screen.
Then the len field of the last character displayed (the cccc field of
Fig. 3-40) is consulted to find out how many characters have to be
deleted on the display, and for each character a sequence of
backspace-space-backspace characters is sent through rawecho to remove
the unwanted character from the screen and have it replaced by a
space.
Reprint is the next function.
In addition to being called by back_over, it may be invoked by the user
pressing the REPRINT key (CTRL-R).
The loop searches backward through the input queue for the last line
break.
If it is found in the last position filled, there is nothing to do and
reprint returns.
Otherwise, it echos the CTRL-R, which appears on the display as the two
character sequence “ˆR”, and then moves to the next line and redisplays
the queue from the last line break to the end.
Now we have arrived at out_process.
Like in_process, it is called by device-specific output routines, but it
is simpler.
It is called by the RS-232 and pseudo terminal device-specific output
routines, but not by the console routine.
Out_process works upon a circular buffer of bytes but does not remove
them from the buffer.
The only change it makes to the array is to insert a CR character ahead
of an NL character in the buffer if the OPOST (enable output processing)
and ONLCR (map NL to CR-NL) bits in tp−>tty_termios.oflag are both
set.
Both bits are set by default in MINIX 3.
Its job is to keep the tp−>tty_position variable in the device’s tty
structure up to date.
Tabs and backspaces complicate life.
The next routine is dev_ioctl.
It supports do_ioctl in carrying out the tcdrain function and the
tcsetattr function when it is called with either the TCSADRAIN or
TCSAFLUSH options.
In these cases, do_ioctl cannot complete the action immediately if
output is incomplete, so information about the request is stored in the
parts of the tty structure reserved for delayed ioctl operations.
Whenever handle_events runs, it first checks the tp−>tty_ioreq field
after calling the device-specific output routine and calls dev_ioctl if
an operation is pending.
Dev_ioctl tests tp−>tty_outleft to see if output is complete, and if
so, carries out the same actions that do_ioctl would have carried out
immediately if there had been no delay.
To service tcdrain, the only action is to reset the tp−>tty_ioreq
field and send the reply message to the FS, telling it to wake up the
process that made the original call.
The TCSAFLUSH variant of tcsetattr calls tty_icancel to cancel
input.
For both variants of tcsetattr, the termios structure whose address was
passed in the original call to ioctl is copied to the device’s
tp−>tty_termios structure.
Setattr is then called, followed, as with tcdrain, by sending a reply
message to wake up the blocked original caller.
Setattr is the next procedure.
As we have seen, it is called by do_ioctl or dev_ioctl to change the
attributes of a terminal device, and by do_close to reset the attributes
back to the default settings.
Setattr is always called after copying a new termios structure into a
device’s tty structure, because merely copying the parameters is not
enough.
If the device being controlled is now in noncanonical mode, the first
action is to mark all characters currently in the input queue with the
IN_EOT bit, as would have been done when these characters were
originally entered in the queue if noncanonical mode had been in effect
then.
It is easier just to go ahead and do this than to test whether the
characters already have the bit set.
There is no way to know which attributes have just been changed and
which still retain their old values.
The next action is to check the MIN and TIME values.
In canonical mode tp−>tty_min is always 1; that is set.
In noncanonical mode the combination of the two values allows for four
different modes of operation, as we saw in Fig. 3-31.
On lines 14931 to 14933 tp−>tty_min is first set up with the value
passed in tp−>tty_termios.cc[VMIN], which is then modified if it is
zero and tp−>tty_termios.cc[VTIME] is not zero.
Finally, setattr makes sure output is not stopped if XON/XOFF control is disabled, sends a SIGHUP signal if the output speed is set to zero, and makes an indirect call to the device-specific routine pointed to by tp−>tty_ioctl to do what can only be done at the device level.
The next function, tty_reply has been mentioned many times in the
preceding discussion.
Its action is entirely straightforward, constructing a message and
sending it.
If for some reason the reply fails, a panic ensues.
The following functions are equally simple.
Sigchar asks MM to send a signal.
If the NOFLSH flag is not set, queued input is removed—the count of
characters or lines received is zeroed and the pointers to the tail and
head of the queue are equated.
This is the default action.
When a SIGHUP signal is to be caught, NOFLSH can be set, to allow input
and output to resume after catching the signal.
Tty_icancel unconditionally discards pending input in the way described
for sigchar, and in addition calls the device-specific function pointed
to by tp−>tty_icancel, to cancel input that may exist in the device
itself or be buffered in the low-level code.
Tty_init is called when tty_task first starts.
It loops through all possible terminals and sets up defaults.
Initially, a pointer to tty_devnop, a dummy function that does nothing,
is set into the tp−>tty_icancel, tp−>tty_ocancel,
tp−>tty_ioctl, and tp−>tty_close variables.
Tty_init then calls a device-specific initialization functions for the
appropriate category of terminal (console, serial line, or pseudo
terminal).
These set up the real pointers to indirectly called device-specific
functions.
Recall that if there are no devices at all configured in a particular
device category, a macro that returns immediately is created, so no part
of the code for a nonconfigured device need be compiled.
The call to scr_init initializes the console driver and also calls the
initialization routine for the keyboard.
The next three functions support timers.
A watchdog timer is initialized with a pointer to a function to run when
the timer expires.
Tty_timed_out is that function for most timers set by the terminal
task.
It sets the events flag to force processing of input and output.
Expire_timers handles the terminal driver’s timer queue.
Recall that this is the function called from the main loop of tty_task
when a SYN_ALARM message is received.
A library routine, tmrs_exptimers, is used to traverse the linked list
of timers, expiring and calling the watchdog functions of any that have
timed out.
On returning from the library function, if the queue is still active a
sys_setalarm kernel call is made to ask for another SYN_ALARM.
Finally, settimer, sets timers for determining when to return from a
read call in noncanonical mode.
It is called with parameters of tty_ptr, a pointer to a tty structure,
and enable, an integer which represents TRUE or FALSE.
Library functions tmrs_settimer and tmrs_clrtimer are used to enable or
disable a timer as determined by the enable argument.
When a timer is enabled, the watchdog function is always tty_timed_out,
described previously.
A description of tty_devnop is necessarily longer than its executable
code, since it has none.
It is a “no-operation” function to be indirectly addressed where a
device does not require a service.
We have seen tty_devnop used in tty_init as the default value entered
into various function pointers before calling the initialization routine
for a device.
The final item in tty.c needs some explanation.
Select is a system call used when multiple I/O devices may require
service at unpredictable times by a single process.
A classic example is a communications program which needs to pay
attention to a local keyboard and a remote system, perhaps connected by
a modem.
The select call allows opening several device files and monitoring all
of them to see when they can be read from or written to without
blocking.
Without select it is necessary to use two processes to handle two-way
communication, one acting as a master and handling communication in one
direction, the other a slave handling communication in the other
direction.
Select is an example of a feature that is very nice to have, but which
substantially complicates the system.
One of the design goals of MINIX 3 is to be simple enough to be
understood with reasonable effort in a reasonable time, and we have to
set some limits.
For that reason we will not discuss do_select and the support routines
select _try and select_retry here.
Now we turn to the device-dependent code that supports the MINIX 3
console, which consists of an IBM PC keyboard and a memory-mapped
display.
The physical devices that support these are entirely separate: on a
standard desktop system the display uses an adapter card (of which there
are at least a half-dozen basic types) plugged into the backplane, while
the keyboard is supported by circuitry built into the parentboard which
interfaces with an 8-bit single-chip computer inside the keyboard
unit.
The two subdevices require entirely separate software support, which is
found in the files keyboard.c and console.c.
The operating system sees the keyboard and console as parts of the
same device, /dev/console.
If there is enough memory available on the display adapter, virtual
console support may be compiled, and in addition to /dev/console there
may be additional logical devices, /dev/ttyc1, /dev/ttyc2, and so
on.
Output from only one goes to the display at any given time, and there is
only one keyboard to use for input to whichever console is active.
Logically the keyboard is subservient to the console, but this is
manifested in only two relatively minor ways.
First, tty_table contains a tty structure for the console, and where
separate fields are provided for input and output, for instance, the
tty_devread and tty_devwrite fields, pointers to functions in keyboard.c
and console.c are filled in at startup time.
However, there is only one tty_priv field, and this points to the
console’s data structures only.
Second, before entering its main loop, tty_task calls each logical
device once to initialize it.
The routine called for /dev/console is in console.c, and the
initialization code for the keyboard is called from there.
The implied hierarchy could just as well have been reversed,
however.
We have always looked at input before output in dealing with I/O devices
and we will continue that pattern, discussing keyboard.c in this section
and leaving the discussion of console.c for the following section.
Keyboard.c begins, like most source files we have seen, with several
#include statements.
One of these is unusual, however.
The file keymaps/us-std.src is not an ordinary header;
it is a C source file that results in compilation of the default keymap
within keyboard.o as an initialized array.
The keymap source file is not included in Appendix B because of its
size, but some representative entries are illustrated in Fig.
3-37.
Following the #include statements are macros to define various
constants.
The first group are used in low-level interaction with the keyboard
controller.
Many of these are I/O port addresses or bit combinations that have
meaning in these interactions.
The next group includes symbolic names for special keys.
On line 15249 the size of the circular keyboard input buffer is
symbolically defined as KB_IN_BYTES, with a value of 32, and the buffer
itself and variables to manage it are defined next.
Since there is only one of these buffers care must be taken to make sure
all of its contents are processed before virtual consoles are
changed.
The next group of variables are used to hold various states that must
be remembered to properly interpret a key press.
They are used in different ways.
For instance, the value of the caps_down flag is toggled between TRUE
and FALSE each time the Caps Lock key is pressed.
The shift flag is set to TRUE when either Shift key is pressed and to
FALSE when both Shift keys are released.
The esc variable is set when a scan code escape is received.
It is always reset upon receipt of the following character.
Map_key0 is defined as a macro.
It returns the ASCII code that corresponds to a scan code, ignoring
modifiers.
This is equivalent to the first column (unshifted) in the keymap
array.
Its big brother is map_key, which performs the complete mapping of a
scan code to an ASCII code, including accounting for (multiple) modifier
keys that are depressed at the same time as ordinary keys.
The keyboard interrupt service routine is kbd_interrupt, called
whenever a key is pressed or released.
It calls scode to get the scan code from the keyboard controller
chip.
The most significant bit of the scan code is set when a key release
causes the interrupt, such codes could be ignored unless they were one
of the modifier keys.
However, in the interest of doing as little as possible in order to
service an interrupt as quickly as possible, all raw scan codes are
placed in the circular buffer and the tp−>tty_events flag for the
current console is raised.
For purposes of this discussion we will assume, as we did earlier, that
no select calls have been made, and that kbd_interrupt returns
immediately after this.
Figure 3-41 shows scan codes in the buffer for a short line of input
that contains two upper case characters, each preceded by the scan code
for depression of a shift key and followed by the code for the release
of the shift key.
Initially codes for both key presses and releases are stored.
When a HARD_INT from the keyboard is received by tty_task, the
complete main loop is not executed.
A continue statement at line 13795 causes a new iteration of the main
loop to begin immediately, at line 13764.
(This is slightly simplified, we left some conditional code in the
listing to show that if the serial line driver is enabled its user-space
interrupt handler could also be called.) When execution transfers to the
top of the loop the tp−>tty_events flag for the console device is now
found to be set, and kb_read, the device-specific routine, is called
using the pointer in the tp−>tty_devread field of the console’s tty
structure.
Figure 3-41.
Scan codes in the input buffer, with corresponding key actions below,
for a line of text entered at the keyboard.
L and R represent the left and right Shift keys.
+ and - indicate a key press and a key release.
The code for a release is 128 more than the code for a press of the same
key.
Kb read takes scan codes from the keyboard’s circular buffer and
places ASCII codes in its local buffer, which is large enough to hold
the escape sequences that must be generated in response to some scan
codes from the numeric keypad.
Then it calls in process in the hardware-independent code to put the
characters into the input queue.
On line 15379 icount is decremented.
The call to make break returns the ASCII code as an integer.
Special keys, such as keypad and function keys, have values greater than
0xFF at this point.
Codes in the range from HOME to INSRT (0x101 to 0x10C, defined in file
include/minix/keymap.h) result from pressing the numeric keypad, and are
converted into 3-character escape sequences shown in Fig. 3-42 using the
numpad map array.
The sequences are then passed to in process.
Higher codes are not passed on to in process.
Instead, a check is made for the codes for ALT-LEFT-ARROW,
ALT-RIGHT-ARROW, and ALT-F1 through ALT-F12, and if one of these is
found, select console is called to switch virtual consoles.
CTRL-F1 through CTRL-F12 are similarly given special handling.
CTRL-F1 shows the mappings of function keys (more on this later).
CTRL-F3 toggles between hardware scrolling and software scrolling of the
console screen.
CTRL-F7, CTRL-F8, and CTRL-F9 generate signals with the same effects as
CTRL-, CTRL-C, and CTRL-U, respectively, except these cannot be changed
by the stty command.
Make break converts scan codes into ASCII and then updates the
variables that keep track of the state of modifier keys.
First, however, it checks for the magic CTRL-ALT-DEL combination that PC
users all know as the way to force a reboot under MS-DOS.
Note the comment that it would be better to do this at a lower
level.
However, the simplicity of MINIX 3 interrupt handling in kernel space
makes detecting CTRL-ALT-DEL impossible there, when an interrupt
notification is sent the scan code has not yet been read.
An orderly shutdown is desirable, so rather than try to start the PC
BIOS routines, a sys kill kernel call is made to initiate sending a
SIGKILL signal TO init, the parent process of all other processes.
Init is expected to catch this signal and interpret it as a command to
begin an orderly process of shutting down, prior to causing a return to
the boot monitor, from which a full restart of the system or a reboot of
MINIX 3 can be commanded.
Figure 3-42.
Escape codes generated by the numeric keypad.
When scan codes for ordinary keys are translated into ASCII codes the
special keys are assigned “pseudo ASCII” codes with values greater than
0xFF.
this signal and interpret it as a command to begin an orderly process of shutting down, prior to causing a return to the boot monitor, from which a full restart of the system or a reboot of MINIX 3 can be commanded.
Of course, it is not realistic to expect this to work every
time.
Most users understand the dangers of an abrupt shutdown and do not press
CTRL-ALT-DEL until something is really going wrong and normal control of
the system has become impossible.
At this point it is likely that the system may be so disrupted that
signaling another process may be impossible.
This is why there is a static variable CAD count in make break.
Most system crashes leave the interrupt system still functioning, so
keyboard input can still be received and the terminal driver will remain
active.
Here MINIX 3 takes advantage of the expected behavior of computer users,
who are likely to bang on the keys repeatedly when something does not
seem to work correctly (possibly evidence our ancestors really were
apes).
If the attempt to kill init fails and the user presses CTRL-ALT-DEL
twice more, a sys abort kernel call is made, causing a return to the
monitor without going through the call to init.
The main part of make break is not hard to follow.
The variable make records whether the scan code was generated by a key
press or a key release, and then the call to map key returns the ASCII
code to ch.
Next is a switch on ch.
Let us consider two cases, an ordinary key and a special key.
For an ordinary key, none of the cases match, and in the default case,
the key code is returned if make is true.
If somehow an ordinary key code is accepted at key release, a value of
−1 is substituted here, and this is ignored by the caller,
kb_read.
A special key, for example CTRL, is identified at the appropriate place
in the switch.
The corresponding variable, in this case ctrl, records the state of
make, and −1 is substituted for the character code to be returned (and
ignored).
The handling of the ALT, CALOCK, NLOCK, and SLOCK keys is more
complicated, but for all of these special keys the effect is similar: a
variable records either the current state (for keys that are only
effective while pressed) or toggles the previous state (for the lock
keys).
There is one more case to consider, that of the EXTKEY code and the
esc variable.
This is not to be confused with the ESC key on the keyboard, which
returns the ASCII code 0x1B.
There is no way to generate the EXTKEY code alone by pressing any key or
combination of keys; it is the PC keyboard’s extended key prefix, the
first byte of a 2-byte scan code that signifies that a key that was not
part of the original PC’s complement of keys but that has the same scan
code, has been pressed.
In many cases software treats the two keys identically.
For instance, this is almost always the case for the normal “/” key and
the gray “/” key on the numeric keyboard.
In other cases, one would like to distinguish between such keys.
For instance, many keyboard layouts for languages other than English
treat the left and right ALT keys differently, to support keys that must
generate three different character codes.
Both ALT keys generate the same scan code (56), but the EXTKEY code
precedes this when the right-hand ALT is pressed.
When the EXTKEY code is returned, the esc flag is set.
In this case, make_break returns from within the switch, thus bypassing
the last step before a normal return, which sets esc to zero in every
other case.
This has the effect of making the esc effective only for the very next
code received.
If you are familiar with the intricacies of the PC keyboard as it is
ordinarily used, this will be both familiar and yet a little strange,
because the PC BIOS does not allow one to read the scan code for an ALT
key and returns a different value for the extended code than does MINIX
3.
Set_leds turns on and off the lights that indicate whether the Num
Lock, Caps Lock, or Scroll Lock keys on a PC keyboard have been
pressed.
A control byte, LED_CODE, is written to an output port to instruct the
keyboard that the next byte written to that port is for control of the
lights, and the status of the three lights is encoded in 3 bits of that
next byte.
These operations are, of course, carried out by kernel calls which ask
the system task write to the outport ports.
The next two functions support this operation.
Kb_wait is called to determine that the keyboard is ready to receive a
command sequence, and kb_ack is called to verify that the command has
been acknowledged.
Both of these commands use busy waiting, continually reading until a
desired code is seen.
This is not a recommended technique for handling most I/O operations,
but turning lights on and off on the keyboard is not going to be done
very often and doing it inefficiently does not waste much time.
Note also that both kb_wait and kb_ack could fail, and one can determine
from the return code if this happens.
Timeouts are handled by limiting the number of retries by means of a
counter in the loop.
But setting the light on the keyboard is not important enough to merit
checking the value returned by either call, and set_leds just proceeds
blindly.
Since the keyboard is part of the console, its initialization
routine, kb_init, is called from scr_init in console.c, not directly
from tty_init in tty.c.
If virtual consoles are enabled, (i.e., NR_CONS in
include/minix/config.h is greater than 1), kb_init is called once for
each logical console.
The next function, kb_init_once, is called just once, as its name
implies.
It sets the lights on the keyboard, and scans the keyboard to be sure no
leftover keystroke is read.
Then it initializes two arrays, fkey_obs and sfkey_obs which are used to
bind function keys to the processes that must respond to them.
When all is ready, it makes two kernel calls, sys_irqsetpolicy and
sys_irqenable to set up the IRQ for the keyboard and configure it to
automatically reenable, so a notification message will be sent to
tty_task whenever a key is pressed or released.
Although we will soon have more opportunities to discuss how function
keys work, this is a good place to describe the fkey_obs and sfkey_obs
arrays.
Each has twelve elements, since modern PC keyboards have twelve
F-keys.
The first array is for unmodified F-keys, the second is used when a
shifted F-key is detected.
They are composed of elements of type obs_t, which is a structure that
can hold a process number and an integer.
This structure and these arrays are declared in keyboard.c.
Initialization stores a special value, symbolically represented as NONE,
in the proc_nr component of the structure to indicate it is not in
use.
NONE is a value outside the range of valid process numbers.
Note that the process number is not a pid, it identifies a slot in the
process table.
This terminology may be confusing.
But to send a notification a process number rather than a pid is used,
because process numbers are used to index the priv table which
determines whether a process is allowed to receive notifications.
The integer events is also initially set to zero.
It will be used to count events.
The next three functions are all rather simple.
Kbd_loadmap is almost trivial.
It is called by do_ioctl in tty.c to do the copying of a keymap from
user space to overwrite the default keymap.
The default is compiled by the inclusion of a keymap source file at the
start of keyboard.c.
From its first release, MINIX has always provided for dumps of
various kinds of system information or other special actions in response
to pressing the function keys F1, F2, etc., on the system console.
This is not a service generally provided in other operating systems, but
MINIX was always intended to be a teaching tool.
Users are encouraged to tinker with it, which means users may need extra
help for debugging.
In many cases the output produced by pressing one of the F-keys will be
available even when the system has crashed.
Figure 3-43 summarizes these keys and their effects.
These keys fall into two categories.
As noted earlier, the CTRL-F1 through CTRL-F12 key combinations are
detected by kb_read.
These trigger events that can be handled by the terminal driver.
These events are not necessarily display dumps.
In fact, currently only CTRL-F1 provides an information display; it
lists function key bindings.
CTRL-F3 toggles hardware and software scrolling of the console screen,
and the others cause signals.
Figure 3-43.
The function keys detected by func key().
Function keys pressed by themselves or together with the shift key
are used to trigger events that cannot be handled by the terminal
driver.
They may result in notification messages to a server or driver.
Because servers and drivers can be loaded, enabled, and disabled after
MINIX 3 is already running, static binding of these keys at compilation
time is not satisfactory.
To enable run-time changes tty task accepts messages of type FKEY
CONTROL.
Do fkey ctl services such requests.
Request types are FKEY MAP, FKEY UNMAP, or FKEY EVENTS.
The first two register or unregister a process with a key specified in a
bitmap in the message, and the third message type returns a bitmap of
keys belonging to the caller which have been pressed and resets the
events field for these keys.
A server process, the information server, (or IS) initializes the
settings for processes in the boot image and also mediates generating
responses.
But individual drivers can also register to respond to a function
key.
Ethernet drivers typically do this, as a dump that shows packet
statistics can be helpful in solving network problems.
Func_key is called from kb_read to see if a special key meant for
local processing has been pressed.
This is done for every scan code received, prior to any other
processing.
If it is not a function key at most three comparisons are made before
control is returned to kb_read.
If a function key is registered a notification message is sent to the
appropriate process.
If the process is one that has registered only one key the notification
by itself is adequate for the process to know what to do.
If a process is the information server or another that has registered
several keys, a dialogue is required—the process must send an
FKEY_EVENTS request to the terminal driver, to be processed by
do_fkey_ctl which will inform the caller which keys have been
active.
The caller can then dispatch to the routine for each key that has been
pressed.
Scan_keyboard works at the hardware interface level, by reading and
writing bytes from I/O ports.
The keyboard controller is informed that a character has been read by
the sequence, which reads a byte, writes it again with the most
significant bit set to 1, and then rewrites it with the same bit rest to
0.
This prevents the same data from being read on a subsequent read.
There is no status checking in reading the keyboard, but there should be
no problems in any case, since scan_keyboard is only called in response
to an interrupt.
The last function in keyboard.c is do_panic_dumps.
If invoked as a result of a system panic, it provides an opportunity for
the user to use the function keys to display debugging
information.
The loop is another example of busy waiting.
The keyboard is read repeatedly until an ESC is typed.
Certainly no one can claim that a more efficient technique is needed
after a crash, while awaiting a command to reboot.
Within the loop, the rarely-used nonblocking receive operation,
nb_receive, is used to permit alternately accepting messages, if
available, and testing the keyboard for input, which can be expected to
be one of the options suggested in the message
Hit ESC to reboot, DEL to shutdown, F-keys for debug dumps
printed on entering this function.
In the next section we will see the code that implements do_newkmess and
do_diagnostics.
The IBM PC display may be configured as several virtual terminals, if
sufficient memory is available.
We will examine the console’s device-dependent code in this
section.
We will also look at the debug dump routines that use low-level services
of the keyboard and display.
These provide support for limited interaction with the user at the
console, even when other parts of the MINIX 3 system are not functioning
and can provide useful information even following a near-total system
crash.
Hardware-specific support for console output to the PC memory-mapped
screen is in console.c.
The console structure is defined.
In a sense this structure is an extension of the tty structure defined
in tty.c.
At initialization the tp−>tty_priv field of a console’s tty structure
is assigned a pointer to its own console structure.
The first item in the console structure is a pointer back to the
corresponding tty structure.
The components of a console structure are what one would expect for a
video display: variables to record the row and column of the cursor
location, the memory addresses of the start and limit of memory used for
the display, the memory address pointed to by the controller chip’s base
pointer, and the current address of the cursor.
Other variables are used for managing escape sequences.
Since characters are initially received as 8-bit bytes and must be
combined with attribute bytes and transferred as 16-bit words to video
memory, a block to be transferred is built up in c_ramqueue, an array
big enough to hold an entire 80-column row of 16-bit character-attribute
pairs.
Each virtual console needs one console structure, and the storage is
allocated in the array cons_table.
As we have done with the tty and other structures, we will usually refer
to the elements of a console structure using a pointer, for example,
cons−>c_tty.
The function whose address is stored in each console’s
tp−>tty_devwrite entry is cons_write.
It is called from only one place, handle_events in tty.c.
Most of the other functions in console.c exist to support this
function.
When it is called for the first time after a client process makes a
write call, the data to be output are in the client’s buffer, which can
be found using the tp−>tty_outproc and tp−>out_vir fields in the
tty structure.
The tp−>tty_outleft field tells how many characters are to be
transferred, and the tp−>tty_outcum field is initially zero,
indicating none have yet been transferred.
This is the usual situation upon entry to cons_write, because normally,
once called, it transfers all the data requested in the original
call.
However, if the user wants to slow the process in order to review the
data on the screen, he may enter a STOP (CTRL-S) character at the
keyboard, resulting in raising of the tp−>tty_inhibited flag.
Cons_write returns immediately when this flag is raised, even though the
write has not been completed.
In such a case handle_events will continue to call cons_write, and when
tp−>tty_inhibited is finally reset, by the user entering a START
(CTRL-Q) character, cons_write continues with the interrupted
transfer.
Cons_write’s first argument is a pointer to the particular console’s
tty structure, so the first thing that must be done is to initialize
cons, the pointer to this console’s console structure.
Then, because handle_events calls cons_write whenever it runs, the first
action is a test to see if there really is work to be done.
A quick return is made if not.
Following this the main loop is entered.
This loop is similar in structure to the main loop of in_transfer in
tty.c.
A local buffer that can hold 64 characters is filled by using the
sys_vircopy kernel call to get the data from the client’s buffer.
Following this, the pointer to the source and the counts are updated,
and then each character in the local buffer is transferred to the
cons−>c_ramqueue array, along with an attribute byte, for later
transfer to the screen by flush.
The transfer of characters from cons−>c_ramqueue can be done in
more than one way, as we saw in Fig. 3-35.
Out_char can be called to do this for each character, but it is
predictable that none of the special services of out_char will be needed
if the character is a visible character, an escape sequence is not in
progress, the screen width has not been exceeded, and
cons−>c_ramqueue is not full.
If the full service of out_char is not needed, the character is placed
directly into cons−>c_ramqueue, along with the attribute byte (which
is retrieved from cons−>c_attr), and cons−>c_rwords (which is the
index into the queue), cons−>c_column (which keeps track of the
column on the screen), and tbuf, the pointer into the buffer, are all
incremented.
This direct placement of characters into cons−>c_ramqueue corresponds
to the dashed line on the left side of Fig. 3-35.
If needed, out_char is called.
It does all of the bookkeeping, and additionally calls flush, which does
the final transfer to screen memory, when necessary.
The transfer from the user buffer to the local buffer to the queue is
repeated as long as tp−>tty_outleft indicates there are still
characters to be transferred and the flag tp−>tty_inhibited has not
been raised.
When the transfer stops, whether because the write operation is complete
or because tp−>tty_inhibited has been raised, flush is called again
to transfer the last characters in the queue to screen memory.
If the operation is complete (tested by seeing if tp−>tty_outleft is
zero), a reply message is sent by calling tty_reply lines 16096 and
16097).
In addition to calls to cons_write from handle_events, characters to
be displayed are also sent to the console by echo and rawecho in the
hardware-independent part of the terminal driver.
If the console is the current output device, calls via the
tp−>tty_echo pointer are directed to the next function,
cons_echo.
Cons_echo does all of its work by calling out_char and then flush.
Input from the keyboard arrives character by character and the person
doing the typing wants to see the echo with no perceptible delay, so
putting characters into the output queue would be unsatisfactory.
Out_char.
does a test to see if an escape sequence is in progress, calling
parse_escape and then returning immediately if so.
Otherwise, a switch is entered to check for special cases: null,
backspace, the bell character, and so on.
The handling of most of these is easy to follow.
The linefeed and the tab are the most complicated, since they involve
complicated changes to the position of the cursor on the screen and may
require scrolling as well.
The last test is for the ESC code.
If it is found, the cons−>c_esc_state flag is set, and future calls
to out_char are diverted to parse_escape until the sequence is
complete.
At the end, the default is taken for printable characters.
If the screen width has been exceeded, the screen may need to be
scrolled, and flush is called.
Before a character is placed in the output queue a test is made to see
that the queue is not full, and flush is called if it is.
Putting a character into the queue requires the same bookkeeping we saw
earlier in cons_write.
The next function is scroll_screen.
Scroll_screen handles both scrolling up, the normal situation that must
be dealt with whenever the bottom line on the screen is full, and
scrolling down, which occurs when cursor positioning commands attempt to
move the cursor beyond the top line of the screen.
For each direction of scroll there are three possible methods.
These are required to support different kinds of video cards.
We will look at the scrolling up case.
To begin, chars is assigned the size of the screen minus one line.
Softscrolling is accomplished by a single call to vid_vid_copy to move
chars characters lower in memory, the size of the move being the number
of characters in a line.
Vid_vid_copy can wrap, that is, if asked to move a block of memory that
overflows the upper end of the block assigned to the video display, it
fetches the overflow portion from the low end of the memory block and
moves it to an address higher than the part that is moved lower,
treating the entire block as a circular array.
The simplicity of the call hides a fairly slow operation, even though
vid_vid_copy is an assembly language routine (defined in
drivers/tty/vidcopy.s, not listed in Appendix B).
This call requires the CPU to move 3840 bytes, which is a large job even
in assembly language.
The softscroll method is never the default; the operator is supposed
to select it only if hardware scrolling does not work or is not desired
for some reason.
One reason might be a desire to use the screendump command, either to
save the screen memory in a file or to view the main console display
when working from a remote terminal.
When hardware scrolling is in effect, screendump is likely to give
unexpected results, because the start of the screen memory is likely not
to coincide with the start of the visible display.
On line 16226 the wrap variable is tested as the first part of a
compound test.
Wrap is true for older displays that can support hardware scrolling, and
if the test fails, simple hardware scrolling occurs, where the origin
pointer used by the video controller chip, cons−>c_org, is updated to
point to the first character to be displayed at the upper-left corner of
the display.
If wrap is FALSE, the compound test continues with a test of whether the
block to be moved up in the scroll operation overflows the bounds of the
memory block designated for this console.
If this is so, vid_vid_copy is called again to make a wrapped move of
the block to the start of the console’s allocated memory, and the origin
pointer is updated.
If there is no overlap, control passes to the simple hardware scrolling
method always used by older video controllers.
This consists of adjusting cons−>c_org and then putting the new
origin in the correct register of the controller chip.
The call to do this is executed later, as is a call to blank the bottom
line on the screen to achieve the “scrolling” effect.
The code for scrolling down is very similar to that for scrolling
up.
Finally, mem_vid_copy is called to blank out the line at the bottom (or
top) addressed by new_line.
Then set_6845 is called to write the new origin from cons−>c_org into
the appropriate registers, and flush makes sure all changes become
visible on the screen.
We have mentioned flush several times.
It transfers the characters in the queue to the video memory using
mem_vid_copy, updates some variables, and then makes sure the row and
column numbers are reasonable, adjusting them if, for instance, an
escape sequence has tried to move the cursor to a negative column
position.
Finally, a calculation of where the cursor ought to be is made and is
compared with cons−>c_cur.
If they do not agree, and if the video memory that is currently being
handled belongs to the current virtual console, a call to set_6845 is
made to set the correct value in the controller’s cursor register.
Figure 3-44.
Finite state machine for processing escape sequences.
Figure 3-44 shows how escape sequence handling can be represented as
a finite state machine.
This is implemented by parse_escape which is called at the start of
out_char if cons−>c_esc_state is nonzero.
An ESC itself is detected by out_char and makes cons−>c_esc_state
equal to 1.
When the next character is received, parse_escape prepares for further
processing by putting a “\0” in cons−>c_esc_intro, a pointer to the
start of the array of parameters, cons−>c_esc_parmv[0] into
cons−>c_esc_parmp, and zeroes into the parameter array itself.
Then the first character directly following the ESC is examined—valid
values are either “[” or “M”.
In the first case the “[” is copied to cons−>c_esc_intro and the
state is advanced to 2.
In the second case, do_escape is called to carry out the action, and the
escape state is reset to zero.
If the first character after the ESC is not one of the valid ones, it is
ignored and succeeding characters are once again displayed normally.
When an ESC [ sequence has been seen, the next character entered is
processed by the escape state 2 code.
There are three possibilities at this point.
If the character is a numeric character, its value is extracted and
added to 10 times the existing value in the position currently pointed
to by cons−>c_esc_parmp, initially cons−>c_esc_parmv[0] (which was
initialized to zero).
The escape state does not change.
This makes it possible to enter a series of decimal digits and
accumulate a large numeric parameter, although the maximum value
currently recognized by MINIX 3 is 80, used by the sequence that moves
the cursor to an arbitrary position.
If the character is a semicolon there is another parameter, so the
pointer to the parameter string is advanced, allowing succeeding numeric
values to be accumulated in the second parameter.
If MAX_ESC_PARMS were to be changed to allocate a larger array for the
parameters, this code would not have to be altered to accumulate
additional numeric values after entry of additional parameters.
Finally, if the character is neither a numeric digit nor a semicolon,
do_escape is called.
Do_escape is one of the longer functions in the MINIX 3 system source
code, even though MINIX 3’s complement of recognized escape sequences is
relatively modest.
For all its length, however, the code should be easy to follow.
After an initial call to flush to make sure the video display is fully
updated, there is a simple if choice, depending upon whether the
character immediately following the ESC character was a special control
sequence introducer or not.
If not, there is only one valid action, moving the cursor up one line if
the sequence was ESC M.
Note that the test for the “M” is done within a switch with a default
action, as a validity check and in anticipation of addition of other
sequences that do not use the ESC [ format.
The action is typical of many escape sequences: the cons−>c_row
variable is inspected to determine if scrolling is required.
If the cursor is already on row 0, a SCROLL_DOWN call is made to
scroll_screen; otherwise the cursor is moved up one line.
The latter is accomplished just by decrementing cons−>c_row and then
calling flush.
If a control sequence introducer is found, the code following the else
is taken.
A test is made for “[”, the only control sequence introducer currently
recognized by MINIX 3.
If the sequence is valid, the first parameter found in the escape
sequence, or zero if no numeric parameter was entered, is assigned to
value.
If the sequence is invalid, nothing happens except that the large switch
that ensues is skipped and the escape state is reset to zero before
returning from do_escape.
In the more interesting case that the sequence is valid, the switch is
entered.
We will not discuss all the cases; we will just note several that are
representative of the types of actions governed by escape sequences.
The first five sequences are generated, with no numeric arguments, by
the four “arrow” keys and the Home key on the IBM PC keyboard.
The first two, ESC [A and ESC [B, are similar to ESC M, except they can
accept a numeric parameter and move up and down by more than one line,
and they do not scroll the screen if the parameter specifies a move that
exceeds the bounds of the screen.
In such cases, flush catches requests to move out of bounds and limits
the move to the last row or the first row, as appropriate.
The next two sequences, ESC [C and ESC [D, which move the cursor right
and left, are similarly limited by flush.
When generated by the “arrow” keys there is no numeric argument, and
thus the default movement of one line or column occurs.
ESC [H can take two numeric parameters, for instance, ESC
[20;60H.
The parameters specify an absolute position rather than one relative to
the current position and are converted from 1-based numbers to 0-based
numbers for proper interpretation.
The Home key generates the default (no parameters) sequence which moves
the cursor to position (1, 1).
ESC [sJ and ESC [sK clear a part of either the entire screen or the
current line, depending upon the parameter that is entered.
In each case a count of characters is calculated.
For instance, for ESC [1J, count gets the number of characters from the
start of the screen to the cursor position, and the count and a position
parameter, dst, which may be the start of the screen, cons−>c_org, or
the current cursor position, cons−>c_cur, are used as parameters to a
call to mem_vid_copy.
This procedure is called with a parameter that causes it to fill the
specified region with the current background color.
The next four sequences insert and delete lines and spaces at the
cursor position, and their actions do not require detailed
explanation.
The last case, ESC [nm (note the n represents a numeric parameter, but
the “m” is a literal character) has its effect upon cons−>c_attr, the
attribute byte that is interleaved between the character codes when they
are written to video memory.
The next function, set_6845, is used whenever it is necessary to
update the video controller chip.
The 6845 has internal 16-bit registers that are programmed 8 bits at a
time, and writing a single register requires four I/O port write
operations.
These are carried out by setting up an array (vector) of (port, value)
pairs and invoking a sys_voutb kernel call to get the system task to do
the I/O.
Some of the registers of the 6845 video controller chip are shown in
Fig. 3-45
Figure 3-45.
Some of the 6845’s registers.
The next function is get 6845, which returns the values of readable
video controller registers.
It also uses kernel calls to accomplish its job.
It does not appear to be called from anywhere in the current MINIX 3
code, but it may be useful for future enhancements such as adding
graphics support.
The beep function is called when a CTRL-G character must be
output.
It takes advantage of the built-in support provided by the PC for making
sounds by sending a square wave to the speaker.
The sound is initiated by more of the kind of magic manipulation of I/O
ports that only assembly language programmers can love.
The more interesting part of the code is using the ability to set an
alarm to turn off the beep.
As a process with system privileges (i.e., a slot in the priv table),
the terminal driver is allowed to set a timer using the library function
tmrs_settimers.
On line 16655 this is done, with the next function, stop_beep, specified
as the function to run when the timer expires.
This timer is put into the terminal task’s own timer queue.
The sys_setalarm kernel call that follows asks the system task to set a
timer in the kernel.
When that expires, a SYN_ALARM message is detected by the main loop of
the terminal driver, tty_task, which calls expire_timers to deal with
all timers belonging to the terminal driver, one of which is the one set
by beep.
The next routine, stop_beep, is the one whose address is put into the
tmr_func field of the timer initiated by beep.
It stops the beep after the designated time has elapsed and also resets
the beeping flag.
This prevents superfluous calls to the beep routine from having any
effect.
Scr_init is called by tty_init NR_CONS times.
Each time its argument is a pointer to a tty structure, one element of
the tty_table.
On lines 16693 and 16694 line, to be used as the index into the
cons_table array, is calculated, tested for validity, and, if valid,
used to initialize cons, the pointer to the current console table
entry.
At this point the cons−>c_tty field can be initialized with the
pointer to the main tty structure for the device, and, in turn,
tp−>tty_priv can be pointed to this device’s console_t
structure.
Next, kb_init is called to initialize the keyboard, and then the
pointers to device specific routines are set up, tp−>tty_devwrite
pointing to cons_write, tp−>tty_echo pointing to cons_echo, and
tp−>tty_ioctl pointing to cons_ioctl.
The I/O address of the base register of the CRT controller is fetched
from the BIOS, the address and size of the video memory are determined,
and the wrap flag (used to determine how to scroll) is set according to
the class of video controller in use.
On line 16735 the segment descriptor for the video memory is initialized
in the global descriptor table by the system task.
Next comes the initialization of virtual consoles.
Each time scr_init is called, the argument is a different value of tp,
and thus a different line and cons are used to provide each virtual
console with its own share of the available video memory.
Each screen is then blanked, ready to start, and finally console 0 is
selected to be the first active one.
Several routines display output on behalf of the terminal driver
itself, the kernel, or another system component.
The first one, kputc just calls putk, a routine to output text a byte at
a time, to be described below.
This routine is here because the library routine that provides the
printf function used within system components is written to be linked to
a character printing routine with this name, but other functions in the
terminal driver expect one named putk.
Do_new_kmess is used to print messages from the kernel.
Actually, “messages” is not the best word to use here; we do not mean
messages as used for interprocess communication.
This function is for displaying text on the console to report
information, warnings, or errors to the user.
The kernel needs a special mechanism to display information.
It needs to be robust, too, so it can be used during startup, before all
components of MINIX 3 are running, or during a panic, another time when
major parts of the system may be unavailable.
The kernel writes text into a circular character buffer, part of a
structure that also contains pointers to the next byte to write and the
size of the yet-to-be processed text.
The kernel sends a SYS_SIG message to the terminal driver when there is
new text, and do_new_kmess is called when the main loop in tty_task is
running.
When things are not going so smoothly, (i.e., when the system crashes)
the SYS_SIG will be detected by the loop that includes a nonblocking
read operation in do_panic_dumps, which we saw in keyboard.c, and
do_new_kmess will be called from there.
In either case, the kernel call sys_getkmessages retrieves a copy of the
kernel structure, and the bytes are displayed, one by one, by passing
them to putk, followed by a final call to putk with a null byte to force
it to flush the output.
A local static variable is used to keep track of the position in the
buffer between messages.
Do_diagnostics has a function similar to that of do_new_kmess, but
do_diagnostics is used to display messages from system processes, rather
than the kernel.
A DIAGNOSTICS message can be received either by the tty_task main loop
or the loop in do_panic_dumps, and in either case a call is made to
do_diagnostics.
The message contains a pointer to a buffer in the calling process and a
count of the size of the message.
No local buffer is used; instead repeated sys_vircopy kernel calls are
made to get the text one byte at a time.
This protects the terminal driver; if something goes wrong and a process
starts generates an excessive amount of output there is no buffer to
overrun.
The characters are output one by one by calling putk, followed by a null
byte.
Putk can print characters on behalf of any code linked into the
terminal driver, and is used by the functions just described to output
text on behalf of the kernel or other system components.
It just calls out_char for each non-null byte received, and then calls
flush for the null byte at the end of the string.
The remaining routines in console.c are short and simple and we will
review them quickly.
Toggle_scroll does what its name says, it toggles the flag that
determines whether hardware or software scrolling is used.
It also displays a message at the current cursor position to identify
the selected mode.
Cons_stop reinitializes the console to the state that the boot monitor
expects, prior to a shutdown or reboot.
Cons_org0 is used only when a change of scrolling mode is forced by the
F3 key, or when preparing to shut down.
Select_console selects a virtual console.
It is called with the new index and calls set_6845 twice to get the
video controller to display the proper part of the video memory.
The next two routines are highly hardware-specific.
Con_loadfont loads a font into a graphics adapter, in support of the
ioctl TIOCSFON operation.
It calls ga_program to do a series of magical writes to an I/O port that
cause the video adapter’s font memory, which is normally not addressable
by the CPU, to be visible.
Then phys_copy is called to copy the font data to this area of memory,
and another magic sequence is invoked to return the graphics adapter to
its normal mode of operation.
The last function is cons_ioctl.
It performs only one function, setting the screen size, and is called
only by scr_init, which uses values obtained from the BIOS.
If there were a need for a real ioctl call to change the sizeMINIX
3screen code to provide the new dimensions would have to be written.
Input/output is an important topic that is often neglected.
A substantial fraction of any operating system is concerned with
I/O.
But I/O device drivers are often responsible for operating system
problems.
Drivers are often written by programmers working for device
manufacturers.
Conventional operating system designs usually require allowing drivers
to have access to critical resources, such as interrupts, I/O ports, and
memory belonging to other processes.
The design of MINIX 3 isolates drivers as independent processes with
limited privileges, so a bug in a driver cannot crash the entire
system.
We started out by looking at I/O hardware, and the relation of I/O
devices to I/O controllers, which are what the software has to deal
with.
Then we moved on to the four levels of I/O software: the interrupt
routines, the device drivers, the device-independent I/O software, and
the I/O libraries and spoolers that run in user space.
Then we examined the problem of deadlock and how it can be
tackled.
Deadlock occurs when a group of processes each have been granted
exclusive access to some resources, and each one wants yet another
resource that belongs to another process in the group.
All of them are blocked and none will ever run again.
Deadlock can be prevented by structuring the system so it can never
occur, for example, by allowing a process to hold only one resource at
any instant.
It can also be avoided by examining each resource request to see if it
leads to a situation in which deadlock is possible (an unsafe state) and
denying or delaying those that lead to trouble.
Device drivers in MINIX 3 are implemented as independent processes
running in user space.
We have looked at the RAM disk driver, hard disk driver, and terminal
driver.
Each of these drivers has a main loop that gets requests and processes
them, eventually sending back replies to report on what happened.
Source code for the main loops and common functions of the RAM disk,
hard disk, and floppy disk drivers is provided in a common driver
library, but each driver is compiled and linked with its own copy of the
library routines.
Each device driver runs in its own address space.
Several different terminals, using the system console, the serial lines,
and network connections, are all supported by a single terminal driver
process.
Device drivers have varying relationships to the interrupt
system.
Devices which can complete their work rapidly, such as the RAM disk and
the memorymapped display, do not use interrupts at all.
The hard disk driver does most of its work in the driver code itself,
and the interrupt handlers just return status information.
Interrupts are always expected, and a receive can be done to wait for
one.
A keyboard interrupt can happen at any time.
Messages generated by all interrupts for the terminal driver are
received and processed in the main loop of the driver.
When a keyboard interrupt occurs the first stage of processing the input
is done as quickly as possible in order to be ready for subsequent
interrupts.
MINIX 3 drivers have limited privileges, and cannot handle interrupts
or access I/O ports on their own.
Interrupts are handled by the system task, which sends a message to
notify a driver when an interrupt occurs.
Access to I/O ports is similarly mediated by the system task.
Drivers cannot read or write I/O ports directly.