Practical Advice on Running uClinux on Cortex-M3/M4

Linux, in the form of uClinux, runs on platforms like STMicroelectronics' STM32. Vladimir Khusainov, co-founder and Director of Engineering at Emcraft Systems, talks about how this works.

Vladimir Khusainov

Sept. 17, 2012

Linux, in the form of uClinux, can run on 32-bit platforms such as microcontrollers. Here we talk about how this is possible and how it works.

Issue at Hand

Some time ago, I ran into the following post at one of the on-line forums dedicated to embedded software development:

"I see no practical use of uClinux on STM32. This micro was not designed to run such OS. [The Linux OS is ] Big, slow, CM3 [Cortex-M3] is not optimized to run from EMI [External Memory Interface]. I don't even see any practical use of uClinux on any [Cortex-M3] micro. You always need external memory to store it and to run it. Why not [use] real Linux instead? The only part you need to change is micro with MMU. Price for such micro is basically the same to one without MMU."

Essentially, what this comment is saying is: if you want to run Linux, use an MMU-full microprocessor, not a Cortex-M3. It actually sounds like a perfectly valid argument. Indeed, why go to all the trouble of adding external RAM to a Cortex-M3 design in order to run uClinux if there are plenty of MMU-full microprocessors, specifically designed for running a high-end OS such as Linux, available at roughly the same price as Cortex-M3 microcontrollers?

What we at Emcraft Systems actually see happening contradicts this argument, however. Customers come to us from all over, asking whether uClinux may be a viable OS choice for their Cortex-M3 (or Cortex-M4) design. There is very seldom a question about whether they should use Cortex-M3 versus some other technology in their design; as far as this particular trade-off goes, it typically is an already won battle for Cortex-M3. Their big question invariably is the following: "I want to use Cortex-M3 (or M4) in my next product, but I am not sure what to do about firmware/software; can uClinux perhaps be a reasonable OS option for me?"

Now, why people go with Cortex-M3/M4 is a bit of a different story. Performance, low cost, excellent power consumption profiles in both dynamic and static modes, a wide range of I/O interfaces, its perception as the right stepping stone from an 8/16-bit MCU, the overall trend toward ARM - this list can probably go on. Whatever the particular reasoning behind going with Cortex-M3/M4 might be for a specific project, though, it is an undeniable fact of life that Cortex-M3/M4 is widely used as a next microcontroller technology, and these are probably just the early days of its proliferation.

And this is the point where the topic of uClinux on Cortex-M3 first comes up in a serious way. With a hardware design based on Cortex-M3/M4 as a given, the next question for the product architect is what software to run on the microcontroller. Here is how the thinking appears to unfold.

The prevalent profile of an Emcraft customer is a low- to mid-volume project (shipping from 100 to perhaps 5 thousand units yearly) that needs to migrate from a low-end 8- or 16-bit microcontroller to a more capable technology, represented, as discussed above, by Cortex-M3/M4. The main driver behind the migration to Cortex-M3 is the need to provide new, not yet supported, functionality in a next product; the currently used low-end microcontroller is simply not up to the new challenges. Furthermore, more often than not, it is not about just a single feature that is missing in the existing product; the developer faces the task of adding support for several new I/O interfaces and corresponding software stacks. Consider, as a typical example: SD Card with a FAT32 filesystem, WiFi or other wireless link, advanced networking capabilities, USB connectivity in host or device roles (or sometimes even both), GUI on an LCD with a touch screen or push-buttons, ability to play audio files from a file system, CAN connectivity, etc. - and all this must be available from a single device. In other words, the designer faces the challenge of developing or somehow acquiring lots and lots of sophisticated software, all of which must run flawlessly and concurrently on Cortex-M3.

Linux does start looking attractive as an OS choice at this point, given the long list of new requirements. It supports pretty much every functional feature one may desire in a modern embedded application out of the box. What's more, it is a safe bet that whatever new functionality becomes the next trend, it will be supported by a quality implementation in Linux ahead of rival RTOSes. Linux is royalty-free, it is ubiquitous in all domains and knowledgeable developers are relatively easy to find, there are abundant materials available on pretty much every aspect of the OS implementation, there are tons of tools, libraries and applications out there on the Internet ready for immediate download, etc - all this is the reality of today's Linux. It is easy to see how use of Linux could drive the project costs down and reduce time to market in a very serious way.

Still, when customers come to us for the first time, there is always doubt. What are the BOM (Bill-Of-Materials) costs of running uClinux on Cortex-M3? Is uClinux performance sufficient on Cortex-M3? Is a particular feature or tool supported with uClinux on Cortex-M3, as opposed to being available only for "standard" Linux on MMU-full microprocessors? Is uClinux even sufficiently robust and stable in the first place?

This article attempts to provide fair and unbiased answers to the above questions, based on Emcraft's two years of experience developing uClinux for Cortex-M3/M4. During this time, we have sold several hundred of our uClinux Cortex-M3/M4 evaluation kits and actively collaborated with perhaps 30 or 40 projects to help our customers develop and deploy various uClinux-based embedded applications using Cortex-M3/M4.

How Different is uClinux from "Full Linux"?

The defining difference between uClinux and Linux is that uClinux runs on processors with no MMU (Memory Management Unit), making the use of virtual memory (VM) impossible. With VM, each process runs in its own virtual address space and the VM system, with the help of the MMU, takes care of translating virtual addresses to physical locations. When there is no ready translation for a virtual address, the MMU raises an exception to the processor core and the VM system proceeds to handle the exception in software. This makes it possible to implement various powerful abstractions such as: contiguous virtual memory using scattered physical pages, adding memory to an already running process, swapping memory pages to a hard disk or even mapping memory to a file or an I/O device.

With no MMU, these things are impossible in uClinux. Each process must be located in memory where it can run and each memory access goes directly to the physical address used in the instruction being executed. This has a number of ramifications for software developers.

First, memory allocated to a process must be, generally speaking, contiguous; it cannot be arbitrarily mapped onto scattered physical pages. This may cause memory fragmentation problems, especially in configurations where there are many transient programs that start and finish frequently. If the system ends up in a situation where running processes have made many sparse memory allocations, however small, it may be impossible to allocate a contiguous memory region for a new program even if the total amount of free memory is sufficient. There is no good solution for the memory fragmentation problem in uClinux, although, for practical purposes, embedded applications tend to have a static group of processes that start at boot-up and keep running until the next reset or power cycle.
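The fragmentation effect described above can be illustrated with a small model. This is only a sketch: the 16-page pool and the function names are hypothetical and have nothing to do with the actual kernel allocator. It shows how the total amount of free memory can exceed a request while no single contiguous run is large enough to satisfy it.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGES 16   /* hypothetical pool size, for illustration only */

/* Counts all free pages in the map (non-zero = used, 0 = free). */
size_t free_pages(const unsigned char used[POOL_PAGES])
{
    assert(used != NULL);
    size_t n = 0;
    for (size_t i = 0; i < POOL_PAGES; i++)
        if (!used[i])
            n++;
    return n;
}

/* Returns the length of the longest run of consecutive free pages. */
size_t largest_free_run(const unsigned char used[POOL_PAGES])
{
    assert(used != NULL);
    size_t best = 0, run = 0;
    for (size_t i = 0; i < POOL_PAGES; i++) {
        run = used[i] ? 0 : run + 1;
        if (run > best)
            best = run;
    }
    return best;
}

/* Without an MMU, a request succeeds only if one run is long enough. */
int can_allocate_contiguous(const unsigned char used[POOL_PAGES], size_t pages)
{
    return largest_free_run(used) >= pages;
}
```

With a map in which every third page is in use, ten pages are free in total, yet a four-page request fails because the longest free run is only two pages.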

Another implication of the lack of VM is that uClinux has no way of expanding memory for a running process since there may be other processes above and below it. As a consequence, the brk and sbrk system calls are not possible in uClinux. This however is fairly transparent to software developers because uClinux provides an alternative implementation of malloc using allocations from a global memory pool.

What is much more noticeable to developers is the lack of a dynamic application stack. In uClinux, the stack for a program must be allocated at build time and therefore has a fixed size (4 KB by default). This often causes problems with a newly ported application: while the kernel transparently increases the stack size in MMU-full Linux, in uClinux a stack overflow would typically result in corrupted text and/or data and, generally speaking, random crashes. This is one of the most serious pitfalls of uClinux. The solution to this problem is to change the stack size by running a special tool (flthdr) on the application binary at build time. It is advisable to start with a larger stack when porting or developing a complex application in uClinux.
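Besides enlarging the stack, a common defensive habit when porting code is to keep large buffers off the stack altogether. The sketch below is an illustration only (the function name and the 8 KB chunk size are assumptions, not something taken from the Emcraft distribution): the scratch buffer is allocated from the heap, since a local array of that size would by itself exceed the 4 KB default stack.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 8192   /* deliberately larger than a 4 KB stack */

/*
 * Counts newline characters in a stream.
 * The scratch buffer comes from the heap rather than the stack.
 * Returns -1 if the buffer cannot be allocated.
 */
long count_lines(FILE *in)
{
    assert(in != NULL);

    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1;

    long lines = 0;
    size_t got;
    while ((got = fread(buf, 1, CHUNK_SIZE, in)) > 0) {
        for (size_t i = 0; i < got; i++)
            if (buf[i] == '\n')
                lines++;
    }

    free(buf);
    return lines;
}
```

The same reasoning applies to deep recursion and to variable-length arrays, both of which are worth reviewing in code that was originally written for MMU-full Linux.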

A somewhat related issue is that the lack of an MMU, generally speaking, means no memory protection of any kind. It is possible for any application to corrupt any part of other applications, or the kernel memory, or any part of the system, really. Needless to say, this results in bugs that may be very difficult to track and debug. To address this serious problem, the Cortex-M3/M4 architecture provides a memory protection mechanism called the MPU (Memory Protection Unit). Using the MPU, Emcraft Systems has added to the kernel an optional feature (configurable using a CONFIG_MPU build-time option) that implements process-to-process and process-to-kernel protection on par with the memory protection mechanisms implemented in Linux using an MMU. As an important trade-off, CONFIG_MPU, when enabled, adds some performance overhead related to the need to service MPU exceptions in the kernel, so the recommended approach is to use the MPU during development and then turn it off for deployment configurations.

Without VM, swap to a disk is impossible in uClinux. This is hardly an issue though, since the whole concept of swap is rarely used in embedded applications.

The implementation of mmap is very different in uClinux. Unless a call is to a file in a ROMFS file system residing on a read-only device, the kernel will allocate a memory buffer and copy the file data to the memory buffer. For practical purposes, this is largely transparent to the developer, although it needs to be understood that uses of mmap are not as efficient with uClinux as they are with Linux.

Probably the most serious deviation of uClinux from Linux as far as the user-space interfaces are concerned is the lack of the fork system call. The only option under uClinux is to use vfork. Although fork and vfork share many properties, there are some subtle differences that may be important for certain uses of these system calls and may be hard to deal with when porting certain application code. Without going into too much of a technical discussion, suffice it to say that most applications that make use of fork are trivial to port to uClinux, although some code that makes use of advanced capabilities of the system call may require detailed analysis and inventive changes.
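For the common "spawn a helper program" case, the port usually amounts to the pattern sketched below. The function name run_command is hypothetical. The key constraint comes from the standard vfork semantics: the parent is suspended and the child borrows its memory until the child calls exec or _exit, so the child must do nothing else and must never return from the calling function.

```c
#define _DEFAULT_SOURCE   /* makes vfork() visible under strict glibc modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs the program argv[0] with the given arguments.
 * Returns its exit status, or -1 if the child could not be
 * created or did not terminate normally.
 */
int run_command(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: only exec or _exit are safe here. */
        execv(argv[0], argv);
        _exit(127);   /* reached only if exec failed */
    }

    /* Parent: resumes once the child has called exec or _exit. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Code that relies on the child continuing to run in a private copy of the parent's memory (for example, a server that forks a worker per connection without calling exec) does not fit this pattern and is where the more inventive changes tend to be needed.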

Shared libraries, while possible conceptually, are hard to support with uClinux. Many uClinux distributions, Emcraft's distribution for Cortex-M3 included, choose not to implement shared libraries.

Despite all the differences, uClinux is still very much a full-fledged "Linux" in terms of sharing code with the mainline Linux. Device drivers, layered I/O stacks, file systems, user-space interfaces, etc - all this code offers little in the way of differences from Linux. As far as user-space code is concerned, there are many tools and applications that have been already ported to uClinux and porting a new program is rarely more than a trivial exercise.

How Robust is uClinux?

In our experience, uClinux is very robust and reliable on Cortex-M3/M4. On many occasions, we have run various stress tests in customer projects for several weeks, with no crashes, memory leaks or other noticeable issues. Our customers report similar results in their installations of uClinux. The overall feeling is that uClinux, as ported to Cortex-M3, is a very solid software product in terms of reliability.

To be fair, it is easy to see why some people may be prejudiced about the robustness of uClinux. For one thing, there is the no-memory-protection issue discussed in the previous section that is intrinsic to uClinux. Admittedly, crashing a system when developing new code is to be expected and, if one is fair, there are very few embedded RTOS solutions that offer any kind of memory protection. However, compared to conventional Linux, which provides memory protection using an MMU, uClinux may indeed come across as less reliable.

As mentioned above, using a memory protection mechanism based on the Cortex-M3 MPU helps alleviate the issue in the specific context of uClinux for Cortex-M3.

Another reason why uClinux may be perceived as not very reliable by some developers is that, historically, uClinux has been run in configurations that have very little memory to run code from. This comes from the fact that development and evaluation boards for microcontrollers that end up being uClinux targets are often designed with just minimal external RAM. In such contexts, getting up to the shell is often advertised as a "uClinux port" and various features and configurations beyond the very basic Linux tools remain undebugged and even untried.

As with any new software though, getting to a more mature state, where software is able to run reliably, takes time, patience and diligence in testing and resolving uncovered problems. As mentioned above, with uClinux it often also takes a hardware platform that provides sufficient resources, such as system RAM, to even test certain features.

Once a uClinux project is past the early phases and low-level architecture specific code has been validated and polished by running various "real-life" tests and applications, the bulk of the code you are relying upon is the mainline Linux. And mainline Linux is indeed very-very robust and reliable.

What are the BOM Costs of a uClinux Design?

To understand the BOM (Bill-Of-Materials) costs of running uClinux on Cortex-M3/M4, it is important to understand the execution model of uClinux. As is customary with Linux, things can be done in various ways; however, the default execution model can be described by the following bootstrap sequence:

U-Boot firmware runs on the target from the on-chip eNVM / eSRAM (no external memory required) and performs required initialization from power-on / reset, including setting up the external memory controller to allow accessing external RAM and, optionally, external Flash.
U-Boot relocates the uClinux bootable image from a non-volatile storage device to external RAM and passes control to the kernel entry point. The non-volatile storage device can be NOR Flash, NAND Flash, SPI Flash, SD Card, USB memory stick, etc - essentially, any I/O interface that is supported by a particular Cortex-M3/M4 device. As a special case, a bootable image can be loaded from the network (for instance, using TFTP), in which case no dedicated storage device is required on the Cortex-M3/M4 target.
uClinux proceeds to boot up from RAM and mounts a RAM-based file system (initramfs) as a root file system. initramfs is populated with required files and directories at build time and is then simply linked into the kernel as a special section. Alternatively, a root file system can be mounted from a storage device supported by an appropriate Linux device driver and this, again, can be pretty much anything - Flash, SD Card, USB memory, MMC, etc.
(Typically) uClinux keeps persistent data (application data logs, software files and images, etc) in a file system mounted on a non-volatile storage device.

Given the execution model above, uClinux requires external RAM to run from and, optionally, to mount an initramfs root file system in. The on-chip RAM of Cortex-M3/M4 devices is not nearly large enough to satisfy the uClinux requirements for RAM. A minimal uClinux configuration could be run from 4 MB of RAM, although the recommendation we give to our customers is that they should design in at least 16 MB of RAM. Generally speaking, the more RAM, the better. It is always a good idea to have some free RAM handy as a way to scale up functionality for new requirements.

So, what are the BOM implications of having external RAM in a Cortex-M3/M4 design? It depends largely on specific RAM technology supported by a concrete microcontroller device but here are some high-level pointers:

For those MCUs that support only SRAM devices, the most practical choice we are aware of is a 16 MB Micron PSRAM device that can be purchased at ~$4.5 (here and below prices are per unit for 1000 pieces, as negotiated by Emcraft with our suppliers). We are not aware of low-cost SRAM devices with density above 16 MB.
For those MCUs that support SDRAM devices, a 32 MB SDRAM device can be purchased at ~$1.5. Alternative SDRAM devices of the same density with better power consumption characteristics will cost more, ranging up to ~$6.
For those MCUs that support DDR memory, a 64 MB DDR device with excellent power consumption characteristics can be purchased at ~$3.5. Pin-compatible devices of higher density are available at a slightly higher price (or sometimes they can go at the same price or even cheaper - it all depends on what your supplier can find for you).

As can be seen, the costs of external RAM, while not negligible, are in a range where the BOM increment should be affordable for a generic microcontroller project. As a somewhat related consideration, from what we are hearing from our customers, many of them design in external RAM anyhow, even when they plan to run "bare-bones" firmware or an RTOS other than uClinux. The integrated SRAM of Cortex-M3/M4 devices is not very large, and in an application with reasonably complex requirements, fitting in everything that needs to reside in RAM - stack, program data, communication protocol buffers, DMA descriptors, LCD framebuffer, etc - may be very hard or sometimes even impossible.

There is also the non-volatile storage to consider. As explained above, uClinux needs that to load a bootable image from. Here, to allow for an "apples-to-apples" comparison, it is important to remember that pretty much any microcontroller application, regardless of what RTOS it runs, would typically require some external non-volatile device in any case, as a storage for configuration data and run-time logs.

The size of a bootable uClinux image with integrated initramfs, providing the level of functionality that can run from 16 MB of RAM, would be in the 2-3 MB ballpark. Clearly, if your design already provides for an SD Card or a NAND Flash, the topic of the additional BOM costs for non-volatile storage becomes moot.

Nevertheless, as a point of reference, here is the pricing information for some of the external Flash devices Emcraft is using in our Cortex-M3/M4 designs:

16 MB NOR Flash can be purchased for ~$3
128 MB NAND Flash can be purchased for ~$2
16 MB SPI Flash can be purchased for ~$2.25

How is uClinux Performance?

As a high-level assessment, it can be said that uClinux on Cortex-M3/M4 is sufficiently performant to meet generic requirements of an average microcontroller application. Depending on a specific MCU and a uClinux configuration, we are seeing boot-up times of 2 to 5 seconds, from power-up to a point where uClinux is fully functional with networking and is able to execute commands from the interactive shell or a script.

The interactive shell is fast; response times are on par with those of Linux on a PC. There is no sluggishness; for instance, things such as a vi session on a local or NFS-mounted file can be run quite comfortably. Networking is fast enough to move large files over Ethernet or WiFi at reasonable rates; NFS-based development is the norm and is in fact quite comfortable and helps the development cycle a lot. Using the on-chip DMA engines, we have been able to achieve reasonably high throughput rates for various I/O devices such as, for example, SD Card or USB devices.

To qualify performance further, it all depends very much on the detailed architecture of a specific microcontroller and also on the requirements of the embedded application at hand. Here are some considerations and data points that may be of interest to developers:

Some microcontrollers provide on-chip caches, which help the overall uClinux performance in a very serious way. To give an example, the performance of the Freescale Kinetis K70 microcontroller running at 120 MHz, as measured using the popular Dhrystone benchmark, approaches 50% of the performance of Linux running on a 250 MHz PowerPC microprocessor. uClinux on the Kinetis is very sleek and fast overall, including complex things such as, for example, Qt/Embedded GUI interfaces on 24-bit 800x480 LCDs with touchscreen.
For critical kernel and application code, fast on-chip Flash can be used to run code from. For the kernel code, there is a mechanism that allows linking kernel object files of the user's choice into a section that runs from on-chip Flash, while the rest of the kernel continues to run from external RAM. For application code, uClinux can be configured to run applications in XIP (eXecute-In-Place) mode from a ROMFS file system mounted in the on-chip Flash memory. Seeing that some Cortex-M3/M4 microcontrollers provide reasonably large on-chip Flash (for instance, the high-end STMicroelectronics STM32F2/F4 devices provide 1 MB of eNVM), it is possible to develop a uClinux configuration where critical kernel and application code runs from fast internal Flash, while the remaining, less critical code runs from external RAM.
As a recent trend, some Cortex-M3/M4 microcontrollers support running code from Quad-SPI Flash memory. Read and execute times of QSPI Flash are estimated to approach the performance of on-chip Flash memory. QSPI Flash, coupled with the XIP capabilities of Linux, can be used to boost performance of critical code. For instance, uClinux designs based on the NXP LPC1850 and LPC4350 devices, while already providing very decent performance when running uClinux from SDRAM, can be optimized even further by running the kernel and critical user-space code from QSPI Flash.
The concept of multi-core heterogeneous computing can be used to offload critical code to a dedicated processor core, while running uClinux on a separate Cortex-M3/M4 core. As an example of such an architecture, the NXP LPC4350 device combines a 204 MHz Cortex-M4 core, very capable of running uClinux, with a Cortex-M0 core that can be used to offload critical and hard real-time code. As another example of the same concept, the Microsemi SmartFusion system-on-chip combines a Cortex-M3 processor core with a powerful FPGA. In SmartFusion-based uClinux designs, the FPGA is used to implement custom I/O protocols and real-time processing, while uClinux running on the Cortex-M3 core implements high-level application logic and interfacing.

Is Low Power Possible with uClinux?

Being able to maintain low power consumption profiles is a big topic with Cortex-M3. In fact, for many projects low power is exactly the reason why they go with Cortex-M3 in the first place.

The first thing that needs to be looked at in this section is the estimated power consumption of a specific Cortex-M3/M4 design. To give an example, here is the estimate for the run-time (dynamic) power consumption of Emcraft's K70 SOM (System-On-Module), in a configuration where the Ethernet PHY is not installed on the module:

K70 at 120 MHz = 100 mA
64 MB LPDRAM = 71 mA
128 MB NAND Flash = 25 mA

for the total of ~200 mA. This appears to be a conservative estimate since in practical test runs we measure the consumption to be about 80 mA in a test where the UART-based uClinux console is the only active I/O interface. This figure will have to be higher in configurations where additional I/O interfaces of the K70, such as USB, analog, SD Card, etc, are being actively used.

For the idle mode of operation (static time), when the processor does not need to be running and can be put into a sleep mode, the power consumption for the same hardware configuration is estimated as follows:

K70 in stop mode = 0.2 mA
64 MB LPDRAM in self-refresh mode = 0.7 mA
128 MB NAND Flash in standby mode = 0.05 mA

for the total of ~1 mA.

This estimate provides for very reasonable power consumption levels at static times. The big question is, is software running on Cortex-M3/M4 capable of maintaining such low power levels when nothing is going on and the device can be fully idle?

Linux actually does a very decent job of satisfying the above need. The kernel has the concept of the "idle process" that gets invoked when all other processes in the system are blocked waiting for some event to occur. The kernel can be configured to switch the system to an architecture-specific sleep mode whenever the idle process is running. When an interrupt occurs indicating that some I/O event requires attention, the processor wakes up and returns to the normal power mode.

One problem with the above approach is that the kernel maintains a "kernel ticker" timer that triggers an interrupt at a fixed frequency, allowing the Linux scheduler to resume those processes for which timer-related events may have occurred (for instance, a software timeout may have expired for a process). The default kernel ticker rate is 100 Hz, meaning that even when the system can be fully idle, the kernel would still wake up and switch back to dynamic power consumption levels 100 times per second, unnecessarily increasing the overall power consumption.

The solution to the above issue is to use a kernel option called the "tickless kernel". The idea is that in this mode the kernel does not wake up at a 100 Hz rate using the normal kernel ticker but instead explicitly keeps track of all timer-related events and calculates exactly when the system needs to wake up next (outside of the need to service asynchronous I/O interrupts, of course).

Using the tickless kernel operation, it is possible to reduce the number of timer interrupts dramatically, which in turn ensures that the system can remain in a sleep mode when there is no work to be done. At the same time, this architecture guarantees a prompt wake-up as soon as an interrupt occurs indicating that there is some I/O activity that needs to be serviced.
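The difference between the two modes can be shown with a simplified model. This is not the kernel's implementation: the function names are hypothetical, timers are treated as one-shot deadlines in milliseconds, and I/O interrupts are ignored. It only compares how many timer wake-ups occur in a given window with a fixed-rate ticker and with a scheme that sleeps until the earliest pending deadline.

```c
#include <assert.h>
#include <stddef.h>

/* Timer wake-ups caused by a fixed-rate ticker over window_ms. */
unsigned long periodic_wakeups(unsigned long window_ms, unsigned long tick_hz)
{
    return window_ms * tick_hz / 1000UL;
}

/*
 * Returns the earliest deadline strictly after now_ms,
 * or 0 if no deadline is pending.
 */
unsigned long next_wakeup_ms(const unsigned long *deadlines_ms, size_t count,
                             unsigned long now_ms)
{
    assert(deadlines_ms != NULL || count == 0);
    unsigned long next = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned long d = deadlines_ms[i];
        if (d > now_ms && (next == 0 || d < next))
            next = d;
    }
    return next;
}

/* Timer wake-ups in the tickless model: one per distinct deadline in the window. */
unsigned long tickless_wakeups(const unsigned long *deadlines_ms, size_t count,
                               unsigned long window_ms)
{
    unsigned long now = 0, wakeups = 0, next;
    while ((next = next_wakeup_ms(deadlines_ms, count, now)) != 0
           && next <= window_ms) {
        wakeups++;
        now = next;   /* sleep straight through to the next deadline */
    }
    return wakeups;
}
```

With pending deadlines at 250, 250, 900 and 5000 ms, a one-second window costs 100 wake-ups with a 100 Hz ticker but only two in the tickless model, since the two timers expiring at 250 ms share a single wake-up and the 5000 ms deadline falls outside the window.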

Where Can uClinux Be Downloaded From?

The current list of Cortex-M3/M4 microcontrollers supported by Emcraft Systems' uClinux includes:

Freescale Kinetis
STMicroelectronics STM32F2 and STM32F4
NXP LPC178x, LPC185x and LPC435x
Microsemi SmartFusion

We are actively working on adding uClinux support for a number of brand-new Cortex-M3/M4 devices so the above list will continue to grow.

Emcraft Systems hosts the full source trees of U-Boot and the uClinux kernel for Cortex-M3/M4 here:

https://github.com/EmcraftSystems

We are happy to report that, in the spirit of the GPL, many developers contribute their device drivers and additional functionality back to the U-Boot and kernel trees.

In addition to the distributions above, Emcraft sells low-cost uClinux evaluation kits and BSPs (Board Support Packages) for the supported Cortex-M3/M4 processors. An eval kit purchase includes a hardware board using a corresponding Cortex-M3/M4 microcontroller, ready for uClinux operation.

Emcraft specifically emphasizes its System-On-Module (SOM) products, which are designed to make it easy, quick, and cost-effective for embedded developers to start using the Cortex-M3/M4 device and uClinux software in their applications.
