Only one thread may have the mutex locked at any given time.
Threads attempting to lock an already locked mutex will
block until the mutex is later unlocked. When the owning thread
unlocks the mutex, the highest-priority thread waiting to
lock the mutex will unblock and become the new owner of the
mutex. In this way, threads will sequence through a critical
region in priority-order.
Acquisition of the mutex requires only a single opcode
(compare and swap) if the mutex isn't already held by
another thread, and a single opcode to release the mutex.
Entry to the kernel is done only at acquisition time if the
mutex is already held so that the thread can go on a blocked
list; entry is done on exit if other threads are waiting to
be unblocked on that mutex. This allows acquisition and
release of an uncontested critical section or resource to be
very quick, incurring work by the OS only to resolve
contention.
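The uncontested fast path can be sketched with C11 atomics. This is a minimal illustration, not Neutrino's implementation: the `sketch_mutex_t` type, the function names, and the convention that an owner value of 0 means "unlocked" are all assumptions made for the example, and the contended path (entering the kernel to block, or to wake a waiter on release) is deliberately omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 means free, otherwise the owner's thread id. */
typedef struct {
    atomic_int owner;
} sketch_mutex_t;

/* Fast path: a single compare-and-swap, no kernel entry.
 * Returns false when the mutex is already held; a real implementation
 * would then enter the kernel and put the caller on a blocked list. */
bool sketch_try_acquire(sketch_mutex_t *m, int tid)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&m->owner, &expected, tid);
}

/* Release when nobody is waiting: a single atomic store.
 * Waking a blocked waiter would require kernel entry (not shown). */
void sketch_release(sketch_mutex_t *m)
{
    atomic_store(&m->owner, 0);
}
```

A second acquisition attempt fails until the owner releases the lock word, which is the point at which a real mutex would involve the OS.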
In this code sample, the mutex is acquired before the
condition is tested. This ensures that only this thread has
access to the arbitrary condition being examined. While the
condition is true, the code sample will block on the wait
call until some other thread performs a signal or broadcast
on the condvar.
A thread that performs a signal will unblock the
highest-priority thread queued on the condvar, while a
broadcast will unblock all threads queued on the condvar.
The associated mutex is locked atomically by the
highest-priority unblocked thread; the thread must then
unlock the mutex after proceeding through the critical
section.
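The pattern described above (lock the mutex, test the condition in a loop, wait, then unlock) might look like the following with the POSIX pthread calls. This is a sketch under stated assumptions: the variable and function names are hypothetical, and `demo_must_wait` stands in for the arbitrary condition being examined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t demo_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  demo_cv  = PTHREAD_COND_INITIALIZER;
static bool demo_must_wait = true;   /* the "arbitrary condition" */
static int  demo_result = 0;

static void *demo_waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_mtx);            /* acquire before testing */
    while (demo_must_wait)                    /* re-test after every wakeup */
        pthread_cond_wait(&demo_cv, &demo_mtx); /* unlocks while blocked, re-locks on return */
    demo_result = 1;                          /* critical section */
    pthread_mutex_unlock(&demo_mtx);          /* unlock after proceeding */
    return NULL;
}

/* Starts a waiter, clears the condition, signals, and joins.
 * Returns 1 when the waiter got through, -1 if the thread couldn't start. */
int run_condvar_demo(void)
{
    pthread_t t;

    demo_must_wait = true;
    demo_result = 0;
    if (pthread_create(&t, NULL, demo_waiter, NULL) != 0)
        return -1;

    pthread_mutex_lock(&demo_mtx);
    demo_must_wait = false;                   /* change the condition under the mutex */
    pthread_cond_signal(&demo_cv);            /* wake one waiter */
    pthread_mutex_unlock(&demo_mtx);

    pthread_join(t, NULL);
    return demo_result;
}
```

Because the condition is changed while the mutex is held, the sketch behaves the same whether the waiter blocks before or after the signal is issued.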
More formally known as "Multiple readers, single
writer locks," these locks are used when the access
pattern for a data structure consists of many threads
reading the data, and (at most) one thread writing the data.
These locks are more expensive than mutexes, but can be
useful for this data access pattern.
Multiple writing threads can queue (in priority order)
waiting for their chance to write the protected data
structure, and all the blocked writer-threads will get to
run before reading threads are allowed access again. The
priorities of the reading threads are not considered.
Reader/writer locks aren't implemented directly within the
kernel, but are instead built from the mutex and condvar
services provided by the kernel.
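The sharing rules can be observed with the POSIX `pthread_rwlock_*` calls. The function below is a single-threaded sketch (its name and its step-number return values are hypothetical) that uses the non-blocking `try` variants so that an incompatible request reports `EBUSY` instead of blocking.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Returns 0 when the lock behaves as described, otherwise the number
 * of the step that behaved unexpectedly. */
int rwlock_demo(void)
{
    pthread_rwlock_t lock;

    if (pthread_rwlock_init(&lock, NULL) != 0) return 1;

    if (pthread_rwlock_rdlock(&lock) != 0) return 2;        /* first read lock */
    if (pthread_rwlock_tryrdlock(&lock) != 0) return 3;     /* second read lock is granted too */
    if (pthread_rwlock_trywrlock(&lock) != EBUSY) return 4; /* writer is kept out by readers */

    pthread_rwlock_unlock(&lock);                           /* drop both read locks */
    pthread_rwlock_unlock(&lock);

    if (pthread_rwlock_trywrlock(&lock) != 0) return 5;     /* writer gets in once readers leave */
    if (pthread_rwlock_tryrdlock(&lock) != EBUSY) return 6; /* readers are kept out by the writer */

    pthread_rwlock_unlock(&lock);
    pthread_rwlock_destroy(&lock);
    return 0;
}
```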
If you wait on a semaphore that is positive, you will not
block. Waiting on a non-positive semaphore will block until
some other thread executes a post. It is valid to post one
or more times before a wait. This use will allow one or more
threads to execute the wait without blocking.
A significant difference between semaphores and other
synchronization primitives is that semaphores are
"async safe" and can be manipulated by signal
handlers. If the desired effect is to have a signal handler
wake a thread, semaphores are the right choice.
Another useful property of semaphores is that they were
defined to operate between processes. Although QNX/Neutrino
mutexes work between processes, the POSIX thread standard
considers this an optional capability, so such use may not be
portable across systems. For synchronization between threads
in a single process, mutexes will be more efficient than
semaphores.
As a useful variation, a named semaphore service is also
available. It uses a resource manager and as such allows
semaphores to be used between processes on different
machines connected by a network.
Since semaphores, like condition variables, can legally
return a non-zero value because of a false wakeup, correct
usage requires a loop:
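A minimal sketch of such a loop with the POSIX semaphore calls, assuming the false wakeup is reported the POSIX way (`sem_wait()` returning -1 with `errno` set to `EINTR`). The helper names `sem_wait_retry()` and `post_then_wait()` are hypothetical; the second helper also shows that posts issued before any wait let the same number of waits complete without blocking.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>

/* Wait on the semaphore, retrying whenever the wait was interrupted
 * (for example by a signal handler) rather than satisfied by a post. */
int sem_wait_retry(sem_t *sem)
{
    int rc;

    do {
        rc = sem_wait(sem);
    } while (rc == -1 && errno == EINTR);
    return rc;
}

/* Post `count` times, then wait `count` times; none of the waits block.
 * Returns 0 when the semaphore ends up drained back to zero. */
int post_then_wait(unsigned count)
{
    sem_t sem;
    int drained;

    if (sem_init(&sem, 0, 0) == -1)
        return -1;
    for (unsigned i = 0; i < count; i++)
        sem_post(&sem);
    for (unsigned i = 0; i < count; i++) {
        if (sem_wait_retry(&sem) == -1) {
            sem_destroy(&sem);
            return -1;
        }
    }
    /* The count is zero again, so a non-blocking wait must fail with EAGAIN. */
    drained = (sem_trywait(&sem) == -1 && errno == EAGAIN);
    sem_destroy(&sem);
    return drained ? 0 : -1;
}
```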
This "release" can also occur when the thread
blocks as part of requesting the service of another process,
or when a signal occurs. The critical region must therefore
be carefully coded and documented so that later maintenance
of the code doesn't violate this condition.
In addition, higher-priority threads in that (or any other)
process could still preempt these FIFO-scheduled threads.
So, all the threads that could "collide" within
the critical section must be FIFO-scheduled at the
same priority. Having enforced this condition, the
threads can then casually access this shared memory without
having to first make explicit synchronization calls.
IPC plays a fundamental role in the transformation of
QNX/Neutrino from an embedded realtime kernel into a
full-scale POSIX operating system. As various
service-providing processes are added to the Neutrino
microkernel, IPC is the "glue" that connects
those components into a cohesive whole.
Although message passing is the primary form of IPC in
QNX/Neutrino, several other forms are available as well.
Unless otherwise noted, those other forms of IPC are
built over QNX message passing. The strategy is to
create a simple, robust IPC service that can be tuned for
performance through a simplified code path in the
microkernel; more "feature-cluttered" IPC
services can then be implemented from these.
Benchmarks comparing higher-level IPC services (like pipes
and FIFOs implemented over QNX messaging) with their
monolithic kernel counterparts show comparable performance.
QNX/Neutrino offers at least the following forms of IPC:
| Service | Implemented in |
| --- | --- |
| Message-passing | kernel |
| Signals | kernel |
| POSIX message queues | external process |
| Shared memory | kernel |
These services can be selected by the designer on the basis
of bandwidth requirements, the need for queuing, network
transparency, etc. The tradeoff can be complex, but the
flexibility is useful.
As part of the engineering effort that went into defining
the Neutrino microkernel, the focus on message passing as
the fundamental IPC primitive was deliberate. As a form of
IPC, message passing (as implemented in
MsgSendv(), MsgReceivev(), and
MsgReplyv()), is synchronous and copies data.
Let's explore these two attributes in more detail.
A thread that does a MsgSendv() to another thread
(which could be within another process) will be blocked
until the target thread does a MsgReceivev(),
processes the message, and executes a
MsgReplyv(). If a thread executes a
MsgReceivev() without a previously sent message
pending, it will block until another thread executes a
MsgSendv().

A thread undergoing state changes in a typical
send-receive-reply transaction.
This inherent blocking synchronizes the execution of the
sending thread, since the act of requesting that the data
be sent also causes the sending thread to be blocked and the
receiving thread to be scheduled for execution - this
happens without requiring explicit work by the kernel to
determine which thread to run next (as would be the case
with most other forms of IPC). Execution and data move
directly from one context to another.
Data queuing capabilities are omitted from these messaging
primitives because queueing could be implemented when needed
within the receiving thread. The sending thread is often
prepared to wait for a response; queueing is unnecessary
overhead and complexity (i.e. it slows down the non-queued
case). As a result, the sending thread doesn't need to make
a separate, explicit blocking call to wait for a response, as
it would if some other IPC form had been used.
While the send and receive operations are blocking and synchronous,
MsgReplyv() (or MsgError()) doesn't block. Since the client thread is already
blocked waiting for the reply, no additional synchronization is required,
so a blocking MsgReplyv() isn't needed. This allows a server
to reply to a client and continue processing while the kernel and/or
networking code asynchronously passes the reply data to the sending thread
and marks it ready for execution. As most servers will tend to do some
processing to prepare to receive the next request (at which point they
block again), this works out well.
The MsgReplyv() function is used to return zero
or more bytes to the client. MsgError(), on the
other hand, is used to return only a status to the
client. Both functions will unblock the client from its
MsgSendv().
Note: MsgError() is supported in Neutrino 1.1.
Because the Neutrino kernel messaging services copy a message directly
from the address space of one thread to another without intermediate
buffering, the message-delivery performance approaches the memory bandwidth
of the underlying hardware. Neutrino attaches no special meaning to the
content of a message - the data in a message has meaning only as mutually
defined by sender and receiver. However, "well-defined"
message types are also provided so that user-written processes or threads
can augment or substitute for system-supplied services.
The messaging primitives in QNX support multipart transfers,
so that a message delivered from the address space of one
thread to another needn't pre-exist in a single, contiguous
buffer. Instead, both the sending and receiving threads can
specify a vector table that indicates where the sending and
receiving message fragments reside in memory. Note that the
size of the various parts can be different for the sender
and receiver.
Multipart transfers allow messages that have a header block
separate from the data block to be sent without
performance-consuming copying of the data to create a
contiguous message. In addition, if the underlying data
structure is a ring buffer, specifying a three-part message
will allow a header and two disjoint ranges within the ring
buffer to be sent as a single atomic message. A hardware
equivalent of this concept would be that of a scatter/gather
DMA facility.

When sent or received, these parts are treated as one
contiguous sequence of bytes. This is ideal for
scatter/gather buffers and caches.
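The ring-buffer case can be illustrated with the POSIX `writev()` call, which gathers several memory ranges into one contiguous byte stream in much the same spirit. This is an analogy only, not the Neutrino messaging API: the pipe, the 8-byte ring, the 2-byte header, and the function name are all assumptions chosen to keep the sketch small.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends a header plus two disjoint ring-buffer ranges as one 3-part
 * write, then reads the result back as a single contiguous sequence.
 * Returns the number of bytes read into `out`, or -1 on error. */
ssize_t gather_ring_demo(char *out, size_t outlen)
{
    char header[2] = { 'H', ':' };
    /* Live data "ABCDE" starts at index 5 and wraps around to index 0. */
    char ring[8]   = { 'D', 'E', '-', '-', '-', 'A', 'B', 'C' };
    struct iovec iov[3];
    int fds[2];
    ssize_t n;

    iov[0].iov_base = header;   iov[0].iov_len = sizeof header; /* part 1: header   */
    iov[1].iov_base = ring + 5; iov[1].iov_len = 3;             /* part 2: "ABC"    */
    iov[2].iov_base = ring;     iov[2].iov_len = 2;             /* part 3: "DE"     */

    if (pipe(fds) == -1)
        return -1;
    n = writev(fds[1], iov, 3);          /* no intermediate contiguous copy needed */
    if (n == 7)
        n = read(fds[0], out, outlen);   /* receiver sees one contiguous message */
    else
        n = -1;
    close(fds[0]);
    close(fds[1]);
    return n;
}
```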
The multipart transfers are also used extensively by
filesystems. On a read, the data is copied directly from the
filesystem cache into the application using a message with
one part for the reply status and n parts for the
data. Each part points into the cache and compensates for
the fact that cache blocks aren't contiguous in memory with
a read starting or ending within a block.
For example, with a cache block size of 512 bytes, a read of
1454 bytes can be satisfied with a 5-part message:

Scatter/gather of a read of 1454 bytes.
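The part count follows from where the read starts within its first cache block. The sketch below assumes, purely for illustration, a read that begins 200 bytes into a block: it then touches four blocks (312 + 512 + 512 + 118 = 1454 bytes), and one more part for the reply status gives the 5-part message. The function name, the `blocks[]` array of cache-block pointers, and the use of the POSIX `struct iovec` in place of Neutrino's `iov_t` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define CACHE_BLOCK_SIZE 512u

/* Fills iov[] with one entry per cache block touched by a read of `len`
 * bytes that starts `offset` bytes into blocks[0]. The blocks need not be
 * contiguous in memory. Returns the number of data parts, or -1 if more
 * than max_parts entries would be needed. */
int build_read_iov(char *const blocks[], size_t offset, size_t len,
                   struct iovec *iov, int max_parts)
{
    int parts = 0;
    size_t block  = offset / CACHE_BLOCK_SIZE;
    size_t within = offset % CACHE_BLOCK_SIZE;

    while (len > 0) {
        size_t chunk = CACHE_BLOCK_SIZE - within;   /* bytes left in this block */
        if (chunk > len)
            chunk = len;                            /* read ends inside this block */
        if (parts == max_parts)
            return -1;
        iov[parts].iov_base = blocks[block] + within;
        iov[parts].iov_len  = chunk;
        parts++;
        len -= chunk;
        block++;
        within = 0;                                 /* later blocks start at their beginning */
    }
    return parts;
}
```

A read that started exactly on a block boundary would need only three data parts (512 + 512 + 430), so the five-part figure depends on the assumed starting offset.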
Because message data is explicitly copied between address
spaces (rather than by doing page table manipulations),
messages can be easily allocated on the stack instead of
from a special pool of page-aligned memory for MMU
"page flipping." As a result, many of the
library routines that implement the API between client and
server processes can be trivially expressed, without
elaborate IPC-specific memory allocation calls.
For example, the code used by a client thread to request
that the filesystem manager execute lseek on its
behalf is implemented as follows:
#include <unistd.h>
#include <errno.h>
#include <sys/iomsg.h>

off64_t _lseek(int fd, off64_t offset, int whence) {
    /* The request and the reply share the same storage on the stack. */
    union {
        struct _io_lseek        s;
        struct _io_lseek_reply  r;
    } msg;
    iov_t iov[2];

    /* Populate the request from constants and the caller's parameters. */
    msg.s.type = _IO_LSEEK;
    msg.s.combine_len = _IO_NO_COMBINE;
    msg.s.offset = offset;
    msg.s.whence = whence;
    msg.s.zero = 0;

    SETIOV(iov + 0, &msg.s, sizeof msg.s);    /* send part */
    SETIOV(iov + 1, &msg.r, sizeof msg.r);    /* reply part */

    /* Send to the manager associated with fd; block until it replies. */
    if(MsgSendv(fd, iov + 0, 1, iov + 1, 1) == -1) {
        offset.lo = offset.hi = -1;
        return offset;
    }

    /* The reply carries the status of the operation. */
    if(msg.r.status != EOK) {
        errno = msg.r.status;
        offset.lo = offset.hi = -1;
        return offset;
    }
    return msg.r.offset;
}

off_t lseek(int fd, off_t offset, int whence) {
    off64_t off;

    /* Sign-extend the 32-bit offset into the 64-bit structure. */
    off.hi = (offset < 0) ? -1 : 0;
    off.lo = offset;
    off = _lseek(fd, off, whence);
    return off.lo;
}

off_t tell(int fd) {
    return lseek(fd, 0, SEEK_CUR);
}
This code essentially builds a message structure on the
stack, populates it with various constants and passed
parameters from the calling thread, and sends it to the
filesystem manager associated with fd. The reply
indicates the success or failure of the operation.
Note: This implementation doesn't prevent the kernel from
detecting large message transfers and choosing to implement
"page flipping" for those cases. Since most
messages passed are quite tiny, copying messages is often
faster than manipulating MMU page tables. For bulk data
transfer, shared memory between processes (with
message-passing or the other synchronization primitives for
notification) is also a viable option.
In Neutrino, message passing is directed towards channels
and connections, rather than targeted directly from thread
to thread. A thread that wishes to receive messages first
creates a channel, and another thread that wishes to send a
message to that thread must first make a connection to that
channel by "attaching" to the channel.
Channels are required by the message kernel calls and are
used by servers to receive messages with MsgReceivev().
Connections are created by client threads to
"connect" to the channels made available by
servers. Once connections are established, clients can
send messages over them with MsgSendv(). If a number
of threads in a process all attach to the same channel, then
the one connection is shared between all the threads.
Channels and connections are named within a process by a
small integer identifier. Client connections map directly
into file descriptors.
Architecturally, this is a key point. By having client
connections map directly into FDs, we have eliminated yet
another layer of translation. We don't need to "figure
out" where to send a message based on the file
descriptor (e.g. via a read(fd) call). Instead,
we can simply send a message directly to the "file
descriptor" (i.e. connection ID).
| Function | Description |
| --- | --- |
| ChannelCreate() | Create a channel to receive messages on. |
| ChannelDestroy() | Destroy a channel. |
| ConnectAttach() | Create a connection to send messages on. |
| ConnectDetach() | Detach a connection. |

Connections map elegantly into file descriptors (i.e. coid 2 == fd 2).
A process acting as a server would implement an event loop
to receive and process messages as follows:
chid = ChannelCreate(flags);
SETIOV(&iov, &msg, sizeof(msg));
for(;;) {
    rcv_id = MsgReceivev( chid, &iov, parts, &info );
    switch( msg.type ) {
        /* Perform message processing here */
    }
    MsgReplyv( rcv_id, &iov, rparts );
}
This loop allows the thread to receive messages from any thread that has
a connection to the channel.
The channel has three queues associated with it:
- one queue for threads waiting for messages
- one queue for threads that have sent a message that hasn't yet been received
- one queue for threads that have sent a message that has been received, but not yet replied to.
While in any of these queues, the waiting thread is blocked
(i.e. RECEIVE, SEND, or REPLY blocked).

Waiting threads are blocked while in a channel queue.
In addition to the synchronous Send/Receive/Reply
services, Neutrino also supports fixed-size, non-blocking
messages. These are referred to as pulses and carry
a small payload (four bytes of data plus a single byte
code).
Pulses are often used as a notification mechanism
within interrupt handlers. They also allow servers to signal
clients without blocking on them.

Pulses pack a small payload - 8 bits of code and 32 bits of data.
A server process receives messages in priority order. As
the threads within the server receive requests, they then
inherit the priority of the sending thread (but not the
scheduling algorithm). As a result, the relative priorities
of the threads requesting work of the server are preserved,
and the server work will be executed at the appropriate
priority. This message-driven priority inheritance avoids
priority-inversion problems.
The message-passing API consists of the following functions:
| Function | Description |
| --- | --- |
| MsgSendv() | Send a message and block until reply. |
| MsgReceivev() | Wait for a message. |
| MsgReplyv() | Reply to a message. |
| MsgError() | Reply only with an error status. No message bytes are transferred. |
| MsgReadv() | Read additional data from a received message. |
| MsgWritev() | Write additional data to a reply message. |
| MsgInfo() | Obtain info on received message. |
| MsgSendPulse() | Send tiny, non-blocking message (pulse). |
| MsgDeliverEvent() | Deliver an event to a client. |
| MsgKeyData() | Key a message to allow security checks. |
Architecting a QNX application as a team of cooperating
threads and processes via Send/Receive/Reply results in a
system that uses synchronous notification. IPC thus occurs
at specified transitions within the system, rather than
asynchronously.
A significant problem with asynchronous systems is that
event notification requires signal handlers to be run.
Asynchronous IPC can make it difficult to thoroughly test
the operation of the system and make sure that no matter
when the signal handler runs, that processing will continue
as intended. Applications often try to avoid this scenario
by relying on a "window" explicitly opened and
shut, during which signals will be tolerated.
With a synchronous, non-queued system architecture built
around Send/Receive/Reply, robust application architectures
can be very readily implemented and delivered.
Avoiding deadlock situations is another difficult problem
when constructing applications from various combinations of
queued IPC, shared memory, and miscellaneous synchronization
primitives. For example, suppose thread A doesn't release
mutex 1 until thread B releases mutex 2. Unfortunately, if
thread B is in the state of not releasing mutex 2 until
thread A releases mutex 1, a standoff results. Simulation
tools are often invoked in order to ensure that deadlock
won't occur as the system runs.
The Send/Receive/Reply IPC primitives allow the construction
of deadlock-free systems by observing only a
couple of simple rules:
- Never have two threads send to each other.
- Always arrange your threads in a hierarchy, with sends going up the tree.
The first rule is an obvious avoidance of the standoff situation, but
the second rule requires further explanation. The team of cooperating
threads and processes is arranged as follows:

Threads should always send up to higher-level threads.
Here the threads at any given level in the hierarchy
never send to each other, but send only upwards instead.
One example of this might be a client application that sends
to a database server process, which in turn sends to a
filesystem process. Since the sending threads block and wait
for the target thread to reply, and since the target thread
isn't send-blocked on the sending thread, deadlock cannot
result.
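The hierarchy rule lends itself to a simple design-time check. In the sketch below, each thread is assigned a level and every planned send is listed as an edge; if every send goes to a strictly higher level, no chain of sends can loop back on itself, so no circular wait can form. The structure, the level numbering (larger means higher), and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Thread `from` sends to thread `to` (indices into the level table). */
struct send_edge {
    int from;
    int to;
};

/* level[i] is the hierarchy level of thread i; larger means higher.
 * Returns true when every send goes strictly upward. */
bool sends_go_upward(const int level[], const struct send_edge edges[],
                     size_t n_edges)
{
    for (size_t i = 0; i < n_edges; i++) {
        if (level[edges[i].to] <= level[edges[i].from])
            return false;   /* a sideways or downward send could deadlock */
    }
    return true;
}
```

With the client, database server, and filesystem process at levels 0, 1, and 2, the sends client-to-database and database-to-filesystem pass the check, while a pair of threads sending to each other fails it.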
But how does a higher-level thread notify a lower-level
thread that it has the results of a previously requested
operation? (Assume the lower-level thread didn't want to
wait for the replied results when it last sent.)
QNX/Neutrino provides a very flexible architecture with the
MsgDeliverEvent() kernel call to deliver
non-blocking events. All of the common asynchronous services
can be implemented with this. For example, it can provide the
server side of select(), the API that an
application can use to allow a thread to wait for an I/O
event to complete on a set of file descriptors. In addition
to an asynchronous notification mechanism being needed as a
"back channel" for notifications from
higher-level threads to lower-level threads, we can also
build a reliable notification system for timers, hardware
interrupts, and other event sources around this.

A higher-level thread can "send" a pulse event
in order to notify a lower-level thread.
A related issue is the problem of how a higher-level thread
can request work of a lower-level thread without sending to
it and thereby risking deadlock. The lower-level thread is present only
to serve as a "worker thread" for the
higher-level thread, doing work on request. The lower-level
thread would send in order to "report for work,"
but the higher-level thread wouldn't reply then. It would
defer the reply until the higher-level thread had work to be
done, and it would reply (which is a non-blocking operation)
with the data describing the work. In effect, the reply is
being used to initiate work, not the send, which neatly
side-steps rule #1.
A significant advance in the kernel design for Neutrino is
the event-handling subsystem. POSIX and its realtime
extensions define a number of asynchronous notification
methods (e.g. UNIX signals that don't queue or pass data,
POSIX realtime signals that may queue and pass data, etc.).
Neutrino also defines additional, QNX-specific notification
techniques such as pulses. Implementing all of these event
mechanisms could have consumed significant code space, so
our implementation strategy was to build all of these
notification methods over a single, rich, event subsystem.
A benefit of this approach is that capabilities exclusive to
one notification technique can become available to others.
For example, a Neutrino application can apply the same
queueing services of POSIX realtime signals to UNIX signals.
This can simplify the robust implementation of signal
handlers within applications.
The events encountered by an executing thread can come from any of three
sources:
- a MsgDeliverEvent() kernel call invoked by a thread
- an interrupt handler
- the expiry of a timer.
The event itself can be any of a number of different types:
QNX pulses, interrupts, various forms of signals, and forced
"unblock" events. "Unblock" is a
means by which a thread can be released from a deliberately
blocked state without any explicit event actually being
delivered.
Given this multiplicity of event types, and applications
needing the ability to request whichever asynchronous
notification technique best suits their needs, it would be
awkward to require that server processes (the higher-level
threads from the previous section) carry code to support all
these options.
Instead, the client thread can give a data structure, or
"cookie," to the server to hang on to until
later. When the server needs to notify the client thread, it
will invoke MsgDeliverEvent() and the microkernel
will set the event type encoded within the cookie upon the
client thread.

The client sends a sigevent to the server, which
saves the event structure. When conditions are met, the server
delivers the event via MsgDeliverEvent().
The ionotify() function is a means by which a
client thread can request asynchronous event delivery. Many
of the POSIX asynchronous services (e.g.
mq_notify() and the client side of
select()) are built on top of it. When performing
I/O on a file descriptor (fd), the thread may
choose to wait for an I/O event to complete (for the
write() case), or for data to arrive (for the
read() case). Rather than have the thread block on
the resource manager process that's servicing the read/write
request, ionotify() can allow the client thread
to post an event to the resource manager that the client
thread would like to receive when the indicated I/O
condition occurs. Waiting in this manner allows the thread
to continue executing and responding to event sources other
than just the single I/O request.
The select() call is implemented using I/O
notification and allows a thread to block and wait for a mix
of I/O events on multiple fd's while continuing
to respond to other forms of IPC.
Here are the conditions upon which the requested event can
be delivered:
- _NOTIFY_COND_OUTPUT - there's room in the output buffer for more data.
- _NOTIFY_COND_INPUT - a resource-manager-defined amount of data is available to read.
- _NOTIFY_OUT_OF_BAND - resource-manager-defined "out of band" data is available.
Neutrino supports the 32 standard POSIX signals (as in UNIX)
as well as the POSIX realtime signals, both numbered from a
kernel-implemented set of 64 signals with uniform
functionality. While the POSIX standard defines realtime
signals as differing from UNIX-style signals (in that they
may contain four bytes of data and a byte code and may be
queued for delivery), this functionality can be explicitly
selected or deselected on a per-signal basis, allowing this
converged implementation to still be compliant with the
standard.
Incidentally, the UNIX-style signals can select POSIX
realtime signal queuing, should the application desire it.
Neutrino also extends the signal-delivery mechanisms of
POSIX by allowing signals to be targeted at specific
threads, rather than simply at the process containing the
threads. Since signals are an asynchronous event, they're
also implemented with the event-delivery mechanisms within
Neutrino.
| Microkernel call | POSIX call | Description |
| --- | --- | --- |
| SignalKill() | kill(), pthread_kill(), raise(), sigqueue() | Set a signal on a process group, process, or thread. |
| SignalReturn() | N/A | Return from a signal handler. |
| SignalAction() | sigaction() | Define action to take on receipt of a signal. |
| SignalProcmask() | sigprocmask() | Change signal blocked mask of a thread. |
| SignalSuspend() | sigsuspend(), pause() | Block until a signal invokes a signal handler. |
| SignalWaitinfo() | sigwaitinfo() | Wait for signal and return info on it. |
The original POSIX specification defined signal operation
on processes only. In a multi-threaded process, the following
rules are followed:
- The signal actions are maintained at the process level.
  If a thread ignores or catches a signal, it affects
  all threads within the process.
- The signal mask is maintained at the thread level. If a thread blocks a
  signal, it affects only that thread.
- An un-ignored signal targeted at a thread will be
  delivered to that thread alone.
- An un-ignored signal targeted at a process is delivered
  to the first thread that doesn't have the signal blocked. If
  all threads have the signal blocked, the signal will be
  queued on the process until any thread ignores or unblocks
  the signal. If ignored, the signal on the process will be
  removed. If unblocked, the signal will be moved from the
  process to the thread that unblocked it.
When a signal is targeted at a process with a large
number of threads, the thread table must be scanned, looking
for a thread with the signal unblocked. Standard practice
for most multi-threaded processes is to mask the signal in
all threads but one, which is dedicated to handling them. To
increase the efficiency of process-signal delivery, the
kernel will cache the last thread that accepted a signal and
will always attempt to deliver the signal to it first.

Signals delivered to a process are given to the first thread with an
interest or queued on the process until a thread expresses an interest.
The POSIX standard includes the concept of queued realtime
signals (first introduced in 1003.1b). QNX/Neutrino supports
optional queuing of any signal, not just realtime signals.
The queuing can be specified on a signal-by-signal basis
within a process. Each signal can have an associated 8-bit
code and a 32-bit value.
This is very similar to message pulses described earlier.
The kernel takes advantage of this similarity and uses
common code for managing both signals and pulses. The signal
number is mapped to a pulse priority using
_SIGMAX - signo. As a
result, signals are delivered in priority order with
lower signal numbers having higher
priority. This conforms with the POSIX standard, which
states that existing signals (which encompass the first 32)
have priority over the new realtime signals.
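As a worked example of the mapping, the sketch below assumes `_SIGMAX` is 64, matching the 64 signals the kernel implements; the constant `SKETCH_SIGMAX` and the function name are stand-ins invented for the illustration, not Neutrino definitions.

```c
#include <assert.h>

/* Assumption: stands in for _SIGMAX, taken here to be 64. */
#define SKETCH_SIGMAX 64

/* Maps a signal number to the pulse priority used for delivery ordering. */
int sketch_signal_to_priority(int signo)
{
    return SKETCH_SIGMAX - signo;
}
```

Under that assumption, signal 1 maps to priority 63 and the first realtime signal (33) maps to priority 31, so the traditional signals are delivered ahead of the realtime ones.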
As mentioned earlier, Neutrino defines a total of 64
signals. Their range is as follows:
| Signal range | Description |
| --- | --- |
| 1 ... 32 | 32 POSIX 1003.1a signals (including traditional UNIX signals) |
| 33 ... 56 | 24 POSIX 1003.1b realtime signals (SIGRTMIN to SIGRTMAX) |
| 57 ... 64 | 8 special-purpose Neutrino signals (SIGSPECIALMIN to SIGSPECIALMAX) |
The 8 special signals cannot be ignored or caught. An
attempt to call the signal() or
sigaction() functions or the
SignalAction() kernel call to change them will
fail with an error of EINVAL.
In addition, these signals are always blocked and have
signal queuing enabled. An attempt to unblock these signals
via the sigprocmask() function or
SignalProcmask() kernel call will be quietly
ignored.
A regular signal can be programmed to this behavior using
the following standard signal calls. The special signals
save the programmer from writing this code and protect the
signal from accidental changes to this behavior.
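sigset_t set;                          /* an object, not a pointer */
struct sigaction action;

/* Block the signal so it is only ever collected synchronously. */
sigemptyset(&set);
sigaddset(&set, signo);
sigprocmask(SIG_BLOCK, &set, NULL);

/* Default action, with queuing of signal information enabled. */
action.sa_handler = SIG_DFL;
action.sa_flags = SA_SIGINFO;
sigemptyset(&action.sa_mask);
sigaction(signo, &action, NULL);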
This configuration makes these signals suitable for
synchronous notification using the sigwaitinfo()
function or SignalWaitinfo() kernel call. The
following code will block until the 8th special signal is
received:
sigset_t set;                          /* an object, not a pointer */
siginfo_t info;

sigemptyset(&set);
sigaddset(&set, SIGSPECIALMAX);
sigwaitinfo(&set, &info);              /* blocks until the signal arrives */
printf("Received signal %d with code %d and value %d\n",
       info.si_signo,
       info.si_code,
       info.si_value.sival_int);
Since the signals are always blocked, the program cannot be
interrupted or killed if the special signal is delivered
outside of the sigwaitinfo() function. Since
signal queuing is always enabled, signals won't be lost
- they'll be queued for the next
sigwaitinfo() call.
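The blocked-and-queued behavior can be tried on any POSIX system by using a realtime signal in place of a special signal. The sketch below is an approximation under stated assumptions: it uses `SIGRTMIN` because `SIGSPECIALMAX` is Neutrino-specific, it assumes a single-threaded process, and the function name `queue_and_wait()` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Blocks SIGRTMIN, queues it to this process with `value` attached, and
 * then collects it synchronously. Returns the delivered value, or -1. */
int queue_and_wait(int value)
{
    sigset_t set;
    siginfo_t info;
    union sigval sv;

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    if (sigprocmask(SIG_BLOCK, &set, NULL) == -1)   /* never delivered asynchronously */
        return -1;

    sv.sival_int = value;
    if (sigqueue(getpid(), SIGRTMIN, sv) == -1)     /* arrives before anyone is waiting */
        return -1;

    /* The signal stayed queued while blocked, so this returns at once. */
    if (sigwaitinfo(&set, &info) != SIGRTMIN)
        return -1;
    return info.si_value.sival_int;
}
```

The signal is sent before the wait begins, yet it is neither lost nor able to interrupt the program, which is the property the special signals guarantee by construction.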
These signals were designed to solve a common IPC
requirement where a server wishes to notify a client that it
has information available for the client. The server will
use the MsgDeliverEvent() call to notify the
client. There are two reasonable choices for the event
within the notification: pulses or signals.
A pulse is the preferred method for a client that may also
be a server to other clients. In this case, the client will
have created a channel for receiving messages and can also
receive the pulse.
This won't be true for most simple clients. In order to
receive a pulse, a simple client would be forced to create a
channel for this express purpose. A signal can be used in
place of a pulse if the signal is configured to be
synchronous (i.e. the signal is blocked) and queued -
this is exactly how the special signals are configured. The
client would replace the MsgReceivev() call used
to wait for a pulse on a channel with a simple
sigwaitinfo() call to wait for the signal.
This signal mechanism is used by Photon to wait for events
and by the select() function to wait for I/O from
multiple servers. Of the 8 special signals, the first two
have been given special names for this use.
#define SIGSELECT (SIGSPECIALMIN + 0)
#define SIGPHOTON (SIGSPECIALMIN + 1)
| Signal | Description |
| --- | --- |
| SIGABRT | Abnormal termination signal such as issued by the abort() function. |
| SIGALRM | Timeout signal such as issued by the alarm() function. |
| SIGBUS | Indicates a memory parity error (QNX-specific interpretation). Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated. |
| SIGCHLD | Child process terminated. The default action is to ignore the signal. |
| SIGCONT | Continue if HELD. The default action is to ignore the signal if the process isn't HELD. |
| SIGEMT | EMT instruction (emulator trap). |
| SIGFPE | Erroneous arithmetic operation (integer or floating point), such as division by zero or an operation resulting in overflow. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated. |
| SIGHUP | Death of session leader, or hangup detected on controlling terminal. |
| SIGILL | Detection of an invalid hardware instruction. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated. |
| SIGINT | Interactive attention signal (Break). |
| SIGIOT | IOT instruction (not generated on x86 hardware). |
| SIGKILL | Termination signal - should be used only for emergency situations. This signal cannot be caught or ignored. |
| SIGPIPE | Attempt to write on a pipe with no readers. |
| SIGPOLL | Pollable event occurred. |
| SIGQUIT | Interactive termination signal. |
| SIGSEGV | Detection of an invalid memory reference. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated. |
| SIGSTOP | HOLD process signal. The default action is to hold the process. |
| SIGSYS | Bad argument to system call. |
| SIGTERM | Termination signal. |
| SIGTRAP | Unsupported software interrupt. |
| SIGTSTP | Not supported by QNX/Neutrino. |
| SIGTTIN | Not supported by QNX/Neutrino. |
| SIGTTOU | Not supported by QNX/Neutrino. |
| SIGURG | Urgent condition present on socket. |
| SIGUSR1 | Reserved as application-defined signal 1. |
| SIGUSR2 | Reserved as application-defined signal 2. |
| SIGWINCH | Window size changed. |
POSIX defines a set of non-blocking message-passing
facilities known as message queues. Like pipes, message
queues are named objects that operate with
"readers" and "writers." As a
priority queue of discrete messages, a message queue has
more structure than a pipe and offers applications more
control over communications.
POSIX message queues are implemented in QNX/Neutrino via an
optional resource manager (Mqueue).
Unlike QNX/Neutrino's inherent message-passing primitives,
the POSIX message queues reside outside the kernel.
For information about resource managers, see Chapter 4 in
this book.
POSIX message queues provide a familiar interface for many
realtime programmers. They are similar to the
"mailboxes" found in many realtime executives.
There's a fundamental difference between QNX messages and
POSIX message queues. QNX messages block - they copy
their data directly between the address spaces of the
processes sending the messages. POSIX message queues, on
the other hand, implement a store-and-forward design in
which the sender need not block and may have many
outstanding messages queued. POSIX message queues exist
independently of the processes that use them. You would
likely use message queues in a design where a number of
named queues will be operated on by a variety of processes
over time.
For raw performance, POSIX message queues will be
slower than QNX native messages for transferring
data. However, the flexibility of queues may make this small
performance penalty worth the cost.
Message queues resemble files, at least as far as their
interface is concerned. You open a message queue with
mq_open(), close it with mq_close(),
and destroy it with mq_unlink(). And to put data
into ("write") and take it out of
("read") a message queue, you use
mq_send() and mq_receive().
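The open/send/receive/close/unlink sequence described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name mq_roundtrip, the queue limits (4 messages of 64 bytes), and the message contents are assumptions made for the example. It sends two messages at different priorities and reads one back, which should be the higher-priority one.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Send two messages at different priorities through the queue `name`
 * (a hypothetical name such as "/demo_queue"), then receive one into `out`.
 * Returns the priority of the received message, or -1 on any failure.
 * `out_size` must be at least 1. */
int mq_roundtrip(const char *name, char *out, size_t out_size)
{
    struct mq_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.mq_maxmsg = 4;     /* assumed limits for this sketch */
    attr.mq_msgsize = 64;

    mq_unlink(name);        /* discard any stale queue; errors ignored */
    mqd_t q = mq_open(name, O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1)
        return -1;

    char buf[64];           /* must hold at least mq_msgsize bytes */
    unsigned prio = 0;
    int result = -1;

    if (mq_send(q, "routine", strlen("routine") + 1, 1) == 0 &&
        mq_send(q, "urgent", strlen("urgent") + 1, 5) == 0 &&
        mq_receive(q, buf, sizeof buf, &prio) >= 0) {
        strncpy(out, buf, out_size - 1);
        out[out_size - 1] = '\0';
        result = (int)prio;
    }

    mq_close(q);            /* close our descriptor */
    mq_unlink(name);        /* destroy the queue itself */
    return result;
}
```

Because the queue is a priority queue, the message sent second (priority 5) is the one received first.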
For strict POSIX conformance, you should create message
queues that start with a single slash (/) and
contain no other slashes. But note that QNX/Neutrino extends
the POSIX standard by supporting pathnames that may contain
multiple slashes. This allows, for example, a company to
place all its message queues under its company name and
distribute a product with increased confidence that a queue
name will not conflict with that of another
company.
In QNX/Neutrino, all message queues created will appear in
the filename space under the directory /dev/mqueue.
| mq_open() name: | Pathname of message queue: |
| /data | /dev/mqueue/data |
| /acme/data | /dev/mqueue/acme/data |
| /qnx/data | /dev/mqueue/qnx/data |
You can display all message queues in the system using the
ls command as follows:
ls -Rl /dev/mqueue
The size printed will be the number of messages waiting.
POSIX message queues are managed via the following
functions:
| Function | Description |
| mq_open() | Open a message queue. |
| mq_close() | Close a message queue. |
| mq_unlink() | Remove a message queue. |
| mq_send() | Add a message to the message queue. |
| mq_receive() | Receive a message from the message queue. |
| mq_notify() | Tell the calling process that a message is available on a message queue. |
| mq_setattr() | Set message queue attributes. |
| mq_getattr() | Get message queue attributes. |
Shared memory offers the highest bandwidth IPC available.
Once a shared memory object is created, processes with
access to the object can use pointers to directly read and
write into it. This means that access to shared memory is in
itself unsynchronized. If a process is updating an
area of shared memory, care must be taken to prevent another
process from reading or updating the same area. Even in the
simple case of a read, the other process may get information
that is in flux and inconsistent.
To solve these problems, shared memory is often used in
conjunction with one of the synchronization primitives to
make updates atomic between processes. If the granularity of
updates is small, then the synchronization primitives
themselves will limit the inherently high bandwidth of using
shared memory. Shared memory is therefore most efficient
when used for updating large amounts of data as a block.
Both semaphores and mutexes are suitable synchronization
primitives for use with shared memory. Semaphores were
introduced with the POSIX realtime standard for interprocess
synchronization. Mutexes were introduced with the POSIX
threads standard for thread synchronization. Mutexes may
also be used between threads in different processes. POSIX
considers this an optional capability; Neutrino supports it.
In general, mutexes are more efficient than semaphores.
Shared memory and message passing can be combined to provide
IPC that offers:
- very high performance (shared memory)
- synchronization (message passing)
- network transparency (message passing).
Using message passing, a client sends a request to a server
and blocks. The server receives the messages in priority
order from clients, processes them, and replies when it can
satisfy a request. At this point, the client is unblocked and
continues. The very act of sending messages provides natural
synchronization between the client and the server. Rather
than copy all the data through the message pass, the message
can contain a reference to a shared memory region, so the
server could read or write the data directly. This is best
explained with a simple example.
Let's assume a graphics server accepts draw image requests
from clients and renders them into a frame buffer on a
graphics card. Using message passing alone, the client would
send a message containing the image data to the server. This
would result in a copy of the image data from the client's
address space to the server's address space. The server
would then render the image and issue a short reply.
If the client didn't send the image data inline with the
message, but instead sent a reference to a shared memory
region that contained the image data, then the server could
access the client's data directly.
Since the client is blocked on the server as a result of
sending it a message, the server knows that the data in
shared memory is stable and will not change until the server
replies. This combination of message passing and shared
memory achieves natural synchronization and very high
performance.
This model of operation can also be reversed - the
server can generate data and give it to a client. For
example, suppose a client sends a message to a server that
will read video data directly from a CD-ROM into a shared
memory buffer provided by the client. The client will be
blocked on the server while the shared memory is being
changed. When the server replies and the client continues,
the shared memory will be stable for the client to access.
This type of design can be pipelined using more than one
shared memory region.
Simple shared memory can't be used between processes on
different computers connected via a network. Message
passing, on the other hand, is network transparent. A server
could use shared memory for local clients and full message
passing of the data for remote clients. This allows you to
provide a high-performance server that is also network
transparent.
In practice, the message-passing primitives are more than
fast enough for the majority of IPC needs. The added
complexity of a combined approach need only be considered
for special applications with very high bandwidth.
Multiple threads within a process share the memory of that
process. To share memory between processes, you must first
create a shared memory region and then map that region into
your process's address space. Shared memory regions are
created and manipulated using the following calls:
| Function | Description |
| shm_open() | Open (or create) a shared memory region |
| shm_close() | Close a shared memory region |
| mmap() | Map a shared memory region into a process's address space |
| munmap() | Unmap a shared memory region from a process's address space |
| mprotect() | Change protections on a shared memory region |
| shm_unlink() | Remove a shared memory region |
POSIX shared memory is implemented in QNX/Neutrino via the
Process Manager (ProcNto). The above calls are
implemented as messages to ProcNto. For
information about the Process Manager, see Chapter 3 in this
book.
The shm_open() function takes the same arguments
as open() and returns a file descriptor to the
object. As with a regular file, this function lets you
create a new shared memory object or open an existing
shared memory object.
When a new shared memory object is created, the size of the
object is set to zero. To set the size, you use the
ftruncate() function. Note that this is the very
same function used to set the size of a file.
Once you have a file descriptor to a shared memory object,
you use the mmap() function to map the object, or
part of it, into your process's address space. The
mmap() function is the cornerstone of memory
management within Neutrino and deserves a detailed
discussion of its capabilities.
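The create, size, and map steps described so far might look like the following sketch. The function name shm_roundtrip, the object name passed by the caller, and the 4096-byte size are assumptions for illustration. The object is mapped twice to stand in for two processes: data written through one mapping is read through the other.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a shared memory object called `name` (hypothetical), size it,
 * map it twice, write `text` through one mapping and read it through the
 * other. Returns 0 if the second mapping sees the data, -1 otherwise. */
int shm_roundtrip(const char *name, const char *text)
{
    size_t len = 4096;      /* assumed size for this sketch */
    int result = -1;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1)
        return -1;

    /* A new object has size zero, so set its size before mapping. */
    if (ftruncate(fd, (off_t)len) == 0) {
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (a != MAP_FAILED && b != MAP_FAILED) {
            strncpy(a, text, len - 1);
            a[len - 1] = '\0';
            result = (strncmp(b, text, len - 1) == 0) ? 0 : -1;
        }
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
    }

    close(fd);
    shm_unlink(name);       /* remove the object once we're done */
    return result;
}
```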
The mmap() function is defined as follows:
void * mmap(void *where_i_want_it, size_t length, int memory_protections,
int mapping_flags, int fd, off_t offset_within_shared_memory);
In simple terms this says: "Map in length
bytes of shared memory at
offset_within_shared_memory in the shared memory
object associated with fd."
The mmap() function will try to place the memory
at the address where_i_want_it in your address
space. The memory will be given the protections specified by
memory_protections and the mapping will be done
according to the mapping_flags.
The three arguments fd,
offset_within_shared_memory, and
length define a portion of a particular shared
object to be mapped in. It's common to map in an entire
shared object, in which case the offset will be zero and the
length will be the size of the shared object in bytes. On an
Intel processor, the length will be a multiple of the page
size, which is 4096 bytes.

How arguments to the mmap() function refer to the mapped region.
The return value of mmap() will be the address in
your process's address space where the object was mapped.
The argument where_i_want_it is used as a
hint by the system to where you want the object placed. If
possible, the object will be placed at the address
requested. Most applications specify an address of zero,
which gives the system free rein to place the object where
it wishes.
The following protection types may be specified for
memory_protections:
| Manifest | Description |
| PROT_NONE | No access allowed |
| PROT_READ | Memory may be read |
| PROT_WRITE | Memory may be written |
| PROT_EXEC | Memory may be executed |
| PROT_NOCACHE | Memory should not be cached |
The PROT_NOCACHE manifest should be used when
a shared memory region is used to gain access to dual-ported
memory that may be modified by hardware (e.g. a video frame
buffer or a memory-mapped network or communications board).
Without this manifest, the processor may return
"stale" data from a previously cached read.
The mapping_flags determine how the memory is
mapped and are broken down into two parts. The first part is
a type and must be specified as one of the following:
| Map type | Description |
| MAP_SHARED | The mapping is shared by the calling processes. |
| MAP_PRIVATE | The mapping is private to the calling process. It allocates system RAM and makes a copy of the object. |
| MAP_ANON | Similar to MAP_PRIVATE, but the fd parameter isn't used (should be set to NOFD), and the allocated memory is zero-filled. |
The MAP_SHARED type is the one to use for
setting up shared memory between processes. The other types
have more specialized uses. For example,
MAP_ANON can be used as the basis for a
page-level memory allocator.
A number of flags may be ORed into the above type to further
define the mapping. These are described in detail in the
mmap() library reference. A few of the more
interesting flags are:
| Map type modifier | Description |
| MAP_FIXED | Map object to the address specified by where_i_want_it. If a shared memory region contains pointers within it, then you may need to force the region at the same address in all processes that map it. This can be avoided by using offsets within the region in place of direct pointers. |
| MAP_PHYS | This flag indicates that you wish to deal with physical memory. The fd parameter should be set to NOFD. When used with MAP_SHARED, the offset_within_shared_memory specifies the exact physical address to map (e.g. for video frame buffers). If used with MAP_ANON then physically contiguous memory is allocated (e.g. for a DMA buffer). MAP_NOX64K and MAP_BELOW16M are used to further define the MAP_ANON allocated memory and address limitations present in some forms of DMA. |
| MAP_NOX64K | Used with MAP_PHYS | MAP_ANON. The allocated memory area will not cross a 64K boundary. This is required for the old 16-bit PC DMA. |
| MAP_BELOW16M | Used with MAP_PHYS | MAP_ANON. The allocated memory area will reside in physical memory below 16M. This is necessary when using DMA with ISA bus devices. |
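The MAP_FIXED entry mentions using offsets within the region in place of direct pointers. A minimal sketch of that idea follows; the structure layout and all names (node, region_header, sum_list, relocated_sum) are hypothetical. Links are stored as byte offsets from the start of the region, so the list can be walked from whatever base address a given process happens to get. Two heap buffers stand in for two different mappings of the same region.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A list node stored inside the region. `next_off` is a byte offset from
 * the start of the region; 0 means "no next node", since offset 0 is
 * occupied by the header. */
struct node {
    uint32_t next_off;
    int value;
};

/* Header placed at offset 0 of the region. */
struct region_header {
    uint32_t first_off;
};

/* Turn an offset into a pointer valid for this particular base address. */
static struct node *node_at(void *base, uint32_t off)
{
    return off ? (struct node *)((char *)base + off) : NULL;
}

/* Build a two-node list (values 10 and 32) in a region of >= 24 bytes. */
void build_demo_list(void *base)
{
    struct region_header *hdr = base;
    hdr->first_off = 8;
    struct node *a = node_at(base, 8);
    a->value = 10;
    a->next_off = 16;
    struct node *b = node_at(base, 16);
    b->value = 32;
    b->next_off = 0;
}

/* Sum all values in the list, given only the local base address. */
int sum_list(void *base)
{
    struct region_header *hdr = base;
    int total = 0;
    for (struct node *n = node_at(base, hdr->first_off); n != NULL;
         n = node_at(base, n->next_off))
        total += n->value;
    return total;
}

/* Build the list in one buffer, copy the raw bytes to a second buffer at
 * a different address, and walk the copy. Returns the sum seen through
 * the copy, or -1 if allocation fails. */
int relocated_sum(void)
{
    enum { REGION_SIZE = 64 };
    void *first = calloc(1, REGION_SIZE);
    void *second = malloc(REGION_SIZE);
    int total = -1;
    if (first && second) {
        build_demo_list(first);
        memcpy(second, first, REGION_SIZE);
        total = sum_list(second);
    }
    free(first);
    free(second);
    return total;
}
```

Had the nodes stored raw pointers, the copy would still point into the first buffer; with offsets, the copy is self-consistent at its new address.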
Using the mapping flags described above, a process can
easily share memory with other processes:
/* Map in a shared memory region */
fd = shm_open("datapoints", O_RDWR, 0);
addr = mmap(0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Or share memory with hardware such as video memory:
/* Map in VGA display memory */
addr = mmap(0, 65536, PROT_READ|PROT_WRITE, MAP_PHYS|MAP_SHARED, NOFD, 0xa0000);
Or allocate a DMA buffer for a bus-mastering PCI network
card:
/* Allocate a physically contiguous buffer */
addr = mmap(0, 262144, PROT_READ|PROT_WRITE|PROT_NOCACHE, MAP_PHYS|MAP_ANON, NOFD, 0);
You can unmap all or part of a shared memory object from
your address space using munmap(). This primitive
isn't restricted to unmapping shared memory - it can
be used to unmap any region of memory within your process.
When used in conjunction with the MAP_ANON
flag to mmap(), you can easily implement a
private page-level allocator/deallocator.
You can change the protections on a mapped region of memory
using mprotect(). Like munmap(),
mprotect() isn't restricted to shared memory
regions - it can change the protection on any region
of memory within your process.
Clock services are used to maintain the time of day,
which is in turn used by the kernel timer calls to implement
interval timers.
The ClockCycles() function is implemented upon a
64-bit, free-running, high-precision counter. On an Intel
Pentium processor, this is implemented directly with the
RDTSC instruction. For processors that don't support this
opcode, an instruction fault is generated - the kernel
catches and emulates this using the counter timer chip.
The ClockPeriod() function allows a thread to set
the system timer to some multiple of nanoseconds; the OS
kernel will do the best it can to satisfy the precision of
the request with the hardware available to it. On a
PC-architecture machine, the precision of this timer setting
can be as fine as 838 nanoseconds.
The interval selected is always rounded down to an integral
multiple of the precision of the underlying hardware timer. Of
course, setting it to an extremely low value can result in a
significant portion of CPU performance being consumed
servicing timer interrupts.
The ClockTick() call is provided as an entry
point to be used by an external timer interrupt handler. If
the system has custom timer hardware, a thread external to
the kernel can use this call to explicitly indicate the
advance of time to the kernel.
| Microkernel call | POSIX call | Description |
| ClockTime() | clock_gettime(), clock_settime() | Get or set the time of day. |
| ClockAdjust() | N/A | Apply small time adjustments to synchronize clocks. |
| ClockCycles() | N/A | Read a 64-bit free-running high-precision counter. |
| ClockPeriod() | clock_getres() | Get or set the period of the clock. |
| ClockTick() | N/A | Simulate a clock interrupt from an external |
In order to facilitate applying time corrections without
having the system experience abrupt "steps" in
time (or even having time jump backwards), the
ClockAdjust() call provides the option to specify
an interval over which the time correction is to be applied.
This has the effect of speeding or retarding time over a
specified interval until the system has synchronized to the
indicated current time. This service can be used to
implement network-coordinated time averaging between
multiple nodes on a network.
Neutrino directly provides the full set of POSIX timer
functionality. Since these timers are quick to create and
manipulate, they're an inexpensive resource in the kernel.
The POSIX timer model is quite rich, providing the ability
to have the timer expire on:
- an absolute date
- a relative date (i.e. n nanoseconds from now)
- a cyclical basis (i.e. every n nanoseconds).
The cyclical mode is very significant, because the most
common use of timers tends to be as a periodic source of
events to "kick" a thread into life to do some
processing and go back to sleep until the next event. If
the thread had to re-program the timer for every event,
there would be the danger that time would slip unless the
thread was programming an absolute date. Worse, if the
thread doesn't get to run on the timer event because a
higher-priority thread is running, the date next programmed
into the timer could be one that has already elapsed!
The cyclical mode circumvents these problems by requiring
that the thread set the timer once and then simply respond
to the resulting periodic source of events.
Since timers are another source of events in QNX/Neutrino,
they also make use of its event-delivery system. As a
result, the application can request that any of the
Neutrino-supported events be delivered to the application
upon occurrence of a timeout.
An often-needed timeout service provided by Neutrino is the
ability to specify the maximum time the application is
prepared to wait for any given kernel call or request to
complete. A problem with using generic OS timer services in
a preemptive realtime OS is that in the interval between the
specification of the timeout and the request for the
service, a higher-priority process might have been scheduled
to run and preempted long enough that the specified timeout
will have expired before the service is even requested. The
application will then end up requesting the service with an
already lapsed timeout in effect (i.e. no timeout). This
timing window can result in "hung" processes,
inexplicable delays in data transmission protocols, and
other problems.
alarm(...);
:
: <-- Alarm fires here
:
blocking_call();
Neutrino's solution is a form of timeout request atomic to
the service request itself. One approach might have been to
provide an optional timeout parameter on every available
service request, but this would overly complicate service
requests with a passed parameter that would often go unused.
Neutrino provides a TimerTimeout() kernel call
that allows an application to specify a list of blocking
states for which to start a specified timeout. Later, when
the application makes a request of the kernel, the kernel
will atomically enable the previously configured timeout if
the application is about to block on one of the specified
states.
Since Neutrino has a very small number of blocking states,
this mechanism works very concisely. At the conclusion of
either the service request or the timeout, the timer will be
disabled and control will be given back to the application.
TimerTimeout(...);
:
:
:
blocking_call();
: <-- Timer atomically armed within kernel
| Microkernel call | POSIX call | Description |
| TimerAlarm() | alarm() | Set a process alarm. |
| TimerCreate() | timer_create() | Create an interval timer. |
| TimerDestroy() | timer_delete() | Destroy an interval timer. |
| TimerGettime() | timer_gettime() | Get time remaining on an interval timer. |
| TimerGetoverrun() | timer_getoverrun() | Get number of overruns on an interval timer. |
| TimerSettime() | timer_settime() | Start an interval timer. |
| TimerTimeout() | sleep(), nanosleep(), sigtimedwait(), pthread_cond_timedwait(), pthread_mutex_trylock(), intr_timed_wait() | Arm a kernel timeout for any blocking state. |
No matter how much we wish it were so, computers are not
infinitely fast. In a realtime system, it's absolutely
crucial that CPU cycles aren't unnecessarily spent. It's
also crucial to minimize the time from the occurrence of an
external event to the actual execution of code within the
thread responsible for reacting to that event. This time is
referred to as latency.
The two forms of latency that most concern us are interrupt latency and
scheduling latency.
Interrupt latency is the time from the assertion of
a hardware interrupt until the first instruction of the
device driver's interrupt handler is executed. QNX leaves
interrupts fully enabled almost all the time, so that
interrupt latency is typically insignificant. But certain
critical sections of code do require that interrupts be
temporarily disabled. The maximum such disable time usually
defines the worst-case interrupt latency - in QNX this
is very small.
The following diagrams illustrate the case where a hardware interrupt is
processed by an established interrupt handler. The interrupt handler either
will simply return, or it will return and cause an event to be delivered.

Interrupt handler simply terminates.
The interrupt latency (Til) in the above diagram represents the
minimum latency - that which occurs when interrupts were
fully enabled at the time the interrupt occurred. Worst-case interrupt
latency will be this time plus the longest time in which QNX, or
the running QNX process, disables CPU interrupts.
The following table shows typical interrupt-latency times (Til)
for a range of processors:
| Interrupt latency (Til) | Processor |
| 1.38 microsec | 200 MHz Pentium |
| 1.84 microsec | 100 MHz Pentium |
| 7.54 microsec | 33 MHz 486 |
| 14.25 microsec | 33 MHz 386EX |
In some cases, the low-level hardware interrupt handler must schedule a
higher-level thread to run. In this scenario, the interrupt handler will
return and indicate that an event is to be delivered. This introduces a
second form of latency - scheduling latency - which
must be accounted for.
Scheduling latency is the time between the last instruction of the
user's interrupt handler and the execution of the first instruction of a
driver thread. This usually means the time it takes to save the context of
the currently executing thread and restore the context of the required
driver thread. Although larger than interrupt latency, this time is also
kept small in a QNX system.

Interrupt handler terminates, returning an event.
It's important to note that most interrupts terminate without
delivering an event. In a large number of cases, the interrupt handler can
take care of all hardware-related issues. Delivering an event to wake up a
higher-level driver thread occurs only when a significant event
occurs. For example, the interrupt handler for a serial device driver would
feed one byte of data to the hardware upon each received transmit
interrupt, and would trigger the higher-level thread within the
driver (Devc.ser) only when the output buffer is nearly empty.
This table shows typical scheduling-latency times (Tsl) for a
range of processors:
| Scheduling latency (Tsl) | Processor |
| 2.93 microsec | 200 MHz Pentium |
| 4.73 microsec | 100 MHz Pentium |
| 12.57 microsec | 33 MHz 486 |
| 38.55 microsec | 33 MHz 386EX |
Since microcomputer architectures allow hardware interrupts to be given
priorities, higher-priority interrupts can preempt a lower-priority
interrupt.
This mechanism is fully supported by Neutrino. The previous
scenarios describe the simplest - and most common
- situation where only one interrupt occurs. This is
usually the case for the highest-priority interrupt.
Worst-case timing considerations for lower-priority
interrupts must take into account the time for all
higher-priority interrupts to be processed, because a
higher-priority interrupt will preempt a lower-priority
interrupt.

Thread A is running. Interrupt IRQx causes
interrupt handler Intx to run, which is
preempted by IRQy and its handler
Inty. Inty returns an event
causing Thread B to run; Intx returns an event
causing Thread C to run.
Neutrino implements an interrupt-handling API closely
modeled after the POSIX realtime extensions (draft status at
time of printing).
| Microkernel call | POSIX call | Description |
| InterruptAttach() | intr_capture() | Attach a local function to an interrupt vector. |
| InterruptDetach() | intr_release() | Detach an interrupt handler. |
| InterruptWait() | intr_timed_wait() | Wait for an interrupt. |
| InterruptDisable() | N/A | Disable hardware interrupts. |
| InterruptEnable() | N/A | Enable hardware interrupts. |
| InterruptMask() | intr_lock() | Mask a hardware interrupt. |
| InterruptUnmask() | intr_unlock() | Unmask a hardware interrupt. |
Using this API, a suitably privileged user-level thread can
call InterruptAttach(), passing a hardware
interrupt number and the address of a function in the
thread's address space to be called when the interrupt
occurs. Neutrino allows multiple ISRs (Interrupt Service
Routines) to be attached to each hardware interrupt number
- higher-priority interrupts can be serviced during
the execution of lower-priority interrupt handlers.
The following code sample shows how to attach an ISR to the
hardware timer interrupt on the PC (which Neutrino also uses
for the system clock). Since the kernel's timer ISR is
already dealing with clearing the source of the interrupt,
this ISR can simply increment a counter variable in the
thread's data space and return to the kernel:
#include <stdio.h>
#include <stdlib.h>
#include <sys/neutrino.h>

struct sigevent event;
volatile unsigned counter;

// ISR: called by the kernel on every timer interrupt
struct sigevent *handler( void *area ) {
    // Pulse every 100'th interrupt
    if ( ++counter == 100 ) {
        counter = 0;
        return( &event );
    }
    else
        return( NULL );
}

int main( void ) {
    int i;

    // Initialize event structure
    event.sigev_notify = SIGEV_INTR;

    // Attach ISR vector
    InterruptAttach( _NTO_INTR_FIRST, &handler, NULL, 0, 0 );

    for( i = 0; i < 10; ++i ) {
        // Wait for ISR to pulse
        InterruptWait( 0, NULL );
        printf( "100 events\n" );
    }

    // Disconnect the ISR handler
    InterruptDetach( _NTO_INTR_FIRST, &handler );
    exit( 0 );
}
With this approach, appropriately privileged user-level
threads can dynamically attach (and detach) interrupt
handlers to (and from) hardware interrupt vectors at run
time. These threads can be debugged using regular
source-level debug tools; the ISR itself can be debugged by
calling it at the thread level and source-level stepping
through it, or by using the kernel debugger to single-step
the ISR as invoked by the hardware interrupt.
When the hardware interrupt occurs, the processor will enter
the interrupt redirector in the microkernel. This code
pushes the registers for the context of the currently
running thread into the appropriate thread table entry and
sets the processor context such that the ISR has access to
the code and data that are part of the thread the ISR is
contained within. This allows the ISR to use the buffers and
code in the user-level thread to resolve the interrupt and,
if higher-level work by the thread is required, to queue an
event to the thread the ISR is part of, which can then work
on the data the ISR has placed into thread-owned buffers.
Since it runs with the memory-mapping of the thread
containing it, the ISR can directly manipulate devices
mapped into the thread's address space, or directly perform
I/O instructions. As a result, device drivers that
manipulate hardware don't need to be linked into the kernel.
The interrupt redirector code in the microkernel will call
each ISR attached to that hardware interrupt. If the value
returned indicates that a process is to be passed an event
of some sort, the kernel will queue the event. When the last
ISR has been called for that vector, the kernel interrupt
handler will finish manipulating the interrupt control
hardware (the i8259 on a PC) and then "return from
interrupt."
This interrupt return won't necessarily be into the context
of the thread that was interrupted. If the queued event
caused a higher-priority thread to become READY, the
microkernel will then interrupt-return into the context of
the now-READY thread instead.
This approach provides a well-bounded interval from the
occurrence of the interrupt to the execution of the first
instruction of the user-level ISR (measured as interrupt
latency), and from the last instruction of the ISR to
the first instruction of the thread readied by the ISR
(measured as thread or process scheduling latency).
The worst-case interrupt latency is well-bounded, because
Neutrino disables interrupts only for a couple of opcodes in a
few critical regions. Those intervals when interrupts are
disabled have deterministic runtimes, because they're not
data dependent.
The microkernel's interrupt redirector executes only a few
instructions before calling the user's ISR. Since the
microkernel's call interface is implemented via software
interrupts (which work exactly like hardware interrupts),
kernel call processing works essentially the same as
interrupt processing. As a result, process preemption for
hardware interrupts or kernel calls is equally quick and
exercises essentially the same code path.
While the ISR is executing, it has full hardware access
(since it's part of a privileged thread), but can't issue
other kernel calls. The ISR is intended to respond to the
hardware interrupt in as few microseconds as possible, do
the minimum amount of work to satisfy the interrupt (read
the byte from the UART, etc.), and if necessary, cause a
thread to be scheduled at some user-specified priority to do
further work.
Worst-case interrupt latency is directly computable for a
given hardware priority from the kernel-imposed interrupt
latency and the maximum ISR runtime for each interrupt
higher in hardware priority than the ISR in question. Since
hardware interrupt priorities can be reassigned, the most
important interrupt in the system can be made the highest
priority. Also, ISRs can be written to do no work, always
readying the user-level thread to do work. This allows the
priority of hardware-interrupt-generated work to be
performed at OS-scheduled priorities rather than
hardware-defined priorities. Since the interrupt source
won't re-interrupt until serviced, the effect of interrupts
on the runtime of critical code regions for hard-deadline
scheduling can be controlled.
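The computation described at the start of this paragraph can be sketched as a simple sum. Everything here is an illustrative assumption: the structure irq_source, the convention that a larger number means higher hardware priority, and the simplification that each higher-priority interrupt fires once during the window being analyzed.

```c
#include <stddef.h>
#include <stdint.h>

/* One interrupt source in the system (hypothetical description). */
struct irq_source {
    int hw_priority;        /* larger value = higher priority (assumed) */
    uint32_t max_isr_ns;    /* longest runtime of its ISR, in nanoseconds */
};

/* Worst-case latency for an interrupt at `target_priority`: the
 * kernel-imposed latency plus the maximum ISR runtime of every source
 * with a higher hardware priority. */
uint64_t worst_case_latency_ns(uint32_t kernel_latency_ns,
                               const struct irq_source *irqs, size_t count,
                               int target_priority)
{
    uint64_t total = kernel_latency_ns;
    for (size_t i = 0; i < count; ++i)
        if (irqs[i].hw_priority > target_priority)
            total += irqs[i].max_isr_ns;
    return total;
}
```

For example, with a kernel-imposed latency of 1380 ns and two higher-priority ISRs bounded at 5000 ns and 12000 ns, the lowest-priority interrupt would see a worst case of 18380 ns under these assumptions, while the highest-priority interrupt would see only the 1380 ns.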
In addition to hardware interrupts, various
"events" within the microkernel can also be
"hooked" by user processes and threads. When one
of these events occurs, the kernel can upcall into the
indicated function in the user thread to perform some
specific processing for this event. For example, the
processor's non-maskable interrupt (NMI) is available for
system watchdog threads and similar applications. Also,
whenever the idle thread in the system is called, a user
thread can have the kernel upcall into the thread so that
hardware-specific low-power modes can be readily
implemented.
| Upcall | Description |
| _NTO_INTR_NMI | Watchdog timer hardware is often configured to generate NMIs (non-maskable interrupts) whenever the timeout expires. This event would be used by the thread that would deal with these watchdog events. |
| _NTO_INTR_TRACE | Neutrino can be configured to generate trace events representing significant occurrences within the kernel (hardware interrupts, context switches, etc.). Trace events generated by explicit trace calls inserted into applications also end up moving the trace data out through this interface. A thread prepared to log these events for diagnostic purposes would attach to this upcall in order to extract the events. |
| _NTO_INTR_IDLE | When the kernel has no active thread to schedule, it will run the idle thread, which can upcall to a user handler. This handler can perform hardware-specific power-management operations. |