Greg Bergsma
Senior Architect, R&D
QNX Software Systems Ltd.
gbergsma@qnx.com
With the recent introduction of realtime extensions to Windows NT, many realtime developers are starting to consider NT for their next project. It's easy to see why. Rather than connect realtime and desktop applications over a network, it appears that developers can now integrate both into a single system, while using a single API.
But is NT with realtime extensions really a solution for mission-critical realtime applications? Let's look at the capabilities we have come to expect from a realtime operating system (RTOS), such as determinism, reliability, low overhead, and source-code portability, and see how "realtime NT" compares.
By definition, realtime applications are required to respond to external events within predictable time limits. This is especially true of "hard" realtime systems, where missed deadlines can have dire, or even disastrous, consequences.
A realtime system's ability to respond to external events within a specified time is known as determinism. To indicate how well an RTOS can support determinism, most vendors quote at least the following performance metrics (see Figure 1): interrupt latency, scheduling latency, and context switch time.

Figure 1 - To provide an indication of an operating system's realtime determinism, most OS vendors refer to the following metrics: interrupt latency (til), scheduling latency (tsl), and context switch time (tcs).
While these metrics don't provide a full indication of an RTOS's determinism, they can help you assess whether an RTOS can achieve the determinism and performance required for your realtime application. Just as important, they can help you compare the performance of a realtime NT extension to that of a native RTOS.
OS vendors tend to quote these metrics across a range of processors. For example, let's look at the figures for QNX, a realtime operating system used in a wide variety of realtime applications. Times are in microseconds:
Processor | Interrupt latency | Scheduling latency | Context switch
Pentium 200 | 1.4 | 2.9 | 1.2
Pentium 100 | 1.8 | 4.7 | 2.6
486 DX/33 | 7.5 | 12.6 | 8.2
How do NT realtime extensions compare? The numbers vary, but the published figures for some extensions indicate performance numbers 10 to 15 times slower than the numbers in the above table.
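To put these figures in perspective, here is a rough sketch in Python that is not part of the published measurements. It adds the interrupt latency and scheduling latency quoted for the Pentium 200 (1.4 and 2.9 microseconds) and scales the sum by the 10x to 15x range mentioned above. Treating that sum as the delay between a hardware interrupt and the start of the thread that handles it, ignoring the running time of the ISR itself, and scaling both latencies by the same factor are all simplifying assumptions; the function name is hypothetical.

```python
# Figures quoted for the Pentium 200, in microseconds.
INTERRUPT_LATENCY_US = 1.4
SCHEDULING_LATENCY_US = 2.9

def event_to_thread_delay_us(interrupt_latency_us, scheduling_latency_us, slowdown=1.0):
    """Estimate the delay from a hardware interrupt to the handling thread.

    Assumes the ISR itself takes negligible time and that a slower system
    scales both latencies by the same factor.
    """
    return (interrupt_latency_us + scheduling_latency_us) * slowdown

native = event_to_thread_delay_us(INTERRUPT_LATENCY_US, SCHEDULING_LATENCY_US)
slow_low = event_to_thread_delay_us(INTERRUPT_LATENCY_US, SCHEDULING_LATENCY_US, 10)
slow_high = event_to_thread_delay_us(INTERRUPT_LATENCY_US, SCHEDULING_LATENCY_US, 15)

print(f"native RTOS: {native:.1f} us")
print(f"10x to 15x slower: {slow_low:.1f} to {slow_high:.1f} us")
```

Under these assumptions, a delay of about 4.3 microseconds grows to somewhere between 43 and 65 microseconds, which may or may not fit the deadline of a given application.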
Why are these extensions so much slower? One reason is that they repeatedly poll a hardware interrupt in order to give control to the realtime subsystem. While determinism could be improved by increasing the polling rate, this increase uses up CPU cycles that your NT applications may require to achieve acceptable performance. The problem becomes worse in a networked application since network cards not only require access to as many CPU cycles as possible but also impose their own high interrupt rate.
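The trade-off can be made concrete with some simple arithmetic. In the sketch below, an event may wait up to one full polling period before it is noticed, while the share of CPU spent polling grows in proportion to the polling rate. The 5-microsecond cost per poll is an illustrative assumption, not a measured figure for any product, and the function name is hypothetical.

```python
def polling_tradeoff(poll_rate_hz, cost_per_poll_us):
    """Return (worst-case detection delay in us, fraction of CPU spent polling)."""
    period_us = 1_000_000 / poll_rate_hz
    cpu_fraction = poll_rate_hz * cost_per_poll_us / 1_000_000
    return period_us, cpu_fraction

# Compare three polling rates at an assumed cost of 5 us per poll.
for rate_hz in (1_000, 10_000, 50_000):
    delay_us, cpu = polling_tradeoff(rate_hz, cost_per_poll_us=5.0)
    print(f"{rate_hz:>6} Hz: up to {delay_us:.0f} us delay, {cpu:.1%} of CPU")
```

With these assumed numbers, cutting the worst-case delay from 1000 to 20 microseconds raises the polling overhead from 0.5% to 25% of the CPU, and that overhead is paid whether or not any realtime event occurs.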
Determinism is important, but there are additional criteria for measuring a realtime system, such as high availability. Can the system's OS continue to run or at least recover rapidly if a software fault occurs? For that matter, can the OS continue to provide services even if a critical hardware component, such as a hard drive, fails?
Achieving high availability is a complex problem that requires a variety of features in the OS. For example, let's consider how the OS deals with software faults.
No matter how hard we try to write error-free code, a practical reality is that our realtime applications will contain undetected programming errors, such as stray pointers and out-of-bound array indices. Any of these can cause a software fault and, potentially, cause the system to crash. To detect such errors, you need an OS that supports the Memory Management Unit (MMU) found on most of today's 32-bit processors. If a memory-access violation occurs, the MMU will notify the OS, which in turn can abort the errant process at the offending instruction.
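The effect of MMU-based protection can be observed on any memory-protected, POSIX-style OS. The following sketch (Python, chosen only for brevity) starts a child process that deliberately reads from address 0, then reports how the child ended. The helper name run_and_report is hypothetical, and the example assumes a system where an invalid access is reported to the parent as a terminating signal.

```python
import signal
import subprocess
import sys

# Child program: reads from address 0, which is not mapped into the process.
FAULTY_CHILD = "import ctypes; ctypes.string_at(0)"

def run_and_report(source):
    """Run Python source in a separate process; return the name of the
    signal that terminated it, or None if it exited on its own."""
    result = subprocess.run([sys.executable, "-c", source],
                            stderr=subprocess.DEVNULL)
    if result.returncode < 0:  # negative means "terminated by signal N"
        return signal.Signals(-result.returncode).name
    return None

if __name__ == "__main__":
    print("faulty child terminated by:", run_and_report(FAULTY_CHILD))
    print("parent process is still running")
```

Only the errant process is aborted; the parent, and the rest of the system, keeps running and learns exactly which kind of fault occurred.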
Some realtime extension products for NT provide memory protection for realtime processes; some do not. But even if an extension supports memory protection, you still have to ask whether it will let you implement a software watchdog.
What is a software watchdog? It's a process that is informed by the OS whenever a memory violation occurs. This process then makes an intelligent decision on how to recover from the fault.
To understand the importance of a software watchdog, let's look at what many existing systems use to recover from software faults: a hardware watchdog timer attached to the processor reset line. Typically, a component of the system software checks for system integrity, and then strobes the timer hardware to indicate that the system is "sane." If the hardware timer isn't strobed regularly, it expires and forces a processor reset. The good news is that the system recovers from the software or hardware lockup. The bad news is that the system must also completely restart, which defeats our goal of high system availability.
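The strobe-or-reset behavior of a hardware watchdog timer can be modeled in a few lines. The class below is a simulation only: the name WatchdogTimer, the 100 ms timeout, and the explicit now_ms arguments are assumptions made for illustration, and real hardware would assert the processor reset line instead of returning a flag.

```python
class WatchdogTimer:
    """Simulated hardware watchdog: expires unless strobed within timeout_ms."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.last_strobe_ms = now_ms

    def strobe(self, now_ms):
        """Called by system software to signal that the system is sane."""
        self.last_strobe_ms = now_ms

    def expired(self, now_ms):
        """True once the timeout has elapsed since the last strobe; real
        hardware would force a processor reset at this point."""
        return now_ms - self.last_strobe_ms >= self.timeout_ms

timer = WatchdogTimer(timeout_ms=100)
timer.strobe(now_ms=60)            # healthy: strobed in time
print(timer.expired(now_ms=120))   # False, only 60 ms since the last strobe
print(timer.expired(now_ms=160))   # True, the software stopped strobing
```

Note that the timer has a single response to every failure: once it expires, the whole system restarts, no matter which component stopped strobing.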
Compare this behavior to a software watchdog, which can intelligently choose from several, less drastic, recovery methods. Instead of always forcing a full reset, the software watchdog could:
The software watchdog lets you retain programmed control of the system, even though several processes within the control software may have failed. A hardware watchdog timer can still help you recover from hardware "latch-ups," but for software failures you now have much better control. Furthermore, by employing the "partial restart" approach, your system can survive intermittent software failures without experiencing any downtime.
While performing a partial restart, your system can also collect information about the nature of the software failure. For example, if the system contains or has access to mass storage (flash memory, hard drive, a network link to another computer with a hard drive), the software watchdog can generate a chronologically archived sequence of process dump files. These dump files can then give you the information you need to engineer a "fix" before you experience similar failures.
A software watchdog not only decreases costly (or even dangerous) downtime, but also helps you avoid software faults in the future. For these reasons, you should make sure a realtime NT extension has the features required to let you implement a software watchdog.
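As a minimal sketch of the idea, the following Python program plays the role of a software watchdog on a POSIX-style system. It is not how any particular RTOS or NT extension implements the feature. The worker is hypothetical and simulates a memory fault by sending itself SIGSEGV on its first two runs; the watchdog restarts only that worker, records each fault in order, and escalates only if the restart limit (an assumed policy) is exhausted.

```python
import signal
import subprocess
import sys

# Hypothetical worker: simulates a memory fault on its first two runs by
# sending itself SIGSEGV, then runs cleanly.
WORKER = """
import os, signal, sys
if int(sys.argv[1]) < 2:
    os.kill(os.getpid(), signal.SIGSEGV)
"""

def run_worker(attempt):
    """Run the worker once; a negative status means it was killed by a signal."""
    return subprocess.run([sys.executable, "-c", WORKER, str(attempt)],
                          stderr=subprocess.DEVNULL).returncode

def watchdog(max_restarts=3):
    """Restart only the failed worker, recording each fault; give up and
    escalate (for example, to a full reset) after max_restarts restarts."""
    fault_log = []
    for attempt in range(max_restarts + 1):
        status = run_worker(attempt)
        if status == 0:
            return ("recovered" if fault_log else "ok"), fault_log
        fault_log.append(signal.Signals(-status).name if status < 0
                         else f"exit {status}")
    return "escalate", fault_log

if __name__ == "__main__":
    print(watchdog())
```

With the default limit, the worker is restarted twice and the watchdog reports a recovery along with a log of two SIGSEGV faults. A real watchdog would write that log, or full process dumps, to mass storage for later analysis.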
Of course, programming errors don't occur only in application code. To support new hardware or system services, you may need to develop device drivers and other system-level services.
In traditional OS architectures, these components run as part of the kernel in kernel mode (see Figure 2). Code running in kernel mode runs without MMU protection. As a result, errant pointers or array subscripts in device drivers can cause kernel faults, which only a hardware reboot can remedy. The more code built into the kernel, the greater the likelihood of kernel faults. In Windows NT, these faults result in the "blue screen" crash.
In a microkernel OS like QNX, only the kernel (32k of code) and interrupt service routines (ISRs) run in kernel mode, drastically reducing the possibility of kernel faults (see Figure 3).

Figure 2 - Traditional OS architecture

Figure 3 - QNX Microkernel Architecture
All vendors of realtime NT extensions have recognized the need to deal with blue screen crashes. As a result, some of these products can trap a kernel fault so that the realtime subsystem can choose to continue running or to close down gracefully. Still, the ability to continue running is a questionable benefit if you can't interact with the NT components of the system--such as the operator interface!
And there is a greater problem: some realtime extensions to NT can potentially contribute to kernel faults. These extensions are implemented directly in the kernel as an interrupt service routine (ISR) or in the Hardware Abstraction Layer (HAL). As a result, the whole realtime subsystem runs in kernel mode. So what happens if you have a stray pointer in your realtime application? You get a kernel fault: the blue screen crash.
Also, most realtime applications require custom device drivers. Since all NT device drivers reside in the kernel space, this only contributes to the fragility of the system.
The subject of software crashes raises another question: Whom do you call for support when you experience problems? Microsoft, or the vendor of your realtime extension? Before you invest in an extension, you need to determine who will assume the responsibility of providing you with technical support if your system experiences problems.
To provide streamlined access to system resources (e.g. filesystems, devices, communications gateways), traditional RTOSs provide an API that is implemented either by system processes or by the kernel itself. A distributed RTOS, such as QNX, goes a step further and turns a network of computers into a single logical machine. As a result, a process running on any computer can, with appropriate privileges, access all resources on the network, including:
This distributed approach can significantly enhance system availability: If a device fails on one machine, you can automatically restart a process to use a device, or even a filesystem, on another machine.
When evaluating a realtime NT extension, you need to determine whether it will let you access resources from both NT applications and realtime applications. For example, let's say your realtime subsystem requires high-performance access to the NT filesystem. Does the NT extension provide the functionality to let you do this? If so, how does it provide this access? Does it go through the HAL? If it does, you'll end up using the same mechanisms that make NT unsuitable for real time (i.e. you'll lose control over the priority of Deferred Procedure Calls initiated by an ISR). Also, what happens if Microsoft decides to make changes to the HAL? Will your realtime extension stop functioning?
All the above questions also apply to accessing communications gateways.
As I mentioned earlier, some realtime NT extensions implement realtime determinism by means of a high-frequency polling interrupt. This interrupt imposes a processing overhead even when no realtime work is to be done. The result? Fewer CPU cycles for non-realtime applications and increased latency. In comparison, most RTOSs are event-driven, responding to interrupts only as they occur.
As for memory overhead, most extensions simply increase NT's already large memory requirements. Most RTOSs, on the other hand, can fit easily into small, ROM-based embedded systems.
To protect their code investment, many developers strive to create applications that are portable across OS platforms. Industry standards such as the POSIX API have emerged to help developers achieve this goal - even NT offers a POSIX option. Nevertheless, the widespread success of Microsoft operating systems has created an additional, de facto standard: the Win32 API. Consequently, several RTOSs now support both POSIX and Win32.
Unfortunately, some NT realtime extensions support neither POSIX nor Win32. Instead, they use a proprietary API that defeats any goal you may have of achieving platform portability and vendor independence. Other extensions provide only a subset of the Win32 API, and may thus limit the functionality you can implement in your realtime subsystem.
Realtime extensions to NT offer a degree of realtime determinism that NT alone cannot provide. But having a degree of determinism is only a piece of the puzzle. A realtime environment must also be extremely reliable. It must be able to recover quickly from software faults, without downtime, and avoid kernel faults. For most applications, the environment should impose low CPU overhead and minimal memory requirements. And it should offer a portable API.
As we've seen, many realtime extensions to NT can't fulfill these requirements. Most RTOSs, on the other hand, offer established technologies that have been fine-tuned to the demands of the realtime marketplace. As a result, a "loosely coupled" approach still makes the most sense for most realtime applications: use NT for the desktop, an RTOS for the realtime control, and integrate the two systems via the various networking options now offered by RTOS vendors.
© QNX Software Systems Ltd. 1998
QNX, Neutrino, and Photon microGUI are registered trademarks
of QNX Software Systems Ltd.
All other trademarks and registered trademarks belong to their respective owners.