The Universal Serial Bus (USB) standard has been with us for many years, but making USB devices is still a daunting task. The USB specification comprises thousands of pages spread over dozens of documents, and although good books have been written on the subject, they are rarely shorter. In addition, the application programming interface (API) offered for programming USB devices is often complex and intricate. This article describes how to program your own software-based USB devices. It is not limited to standard class devices, but also presents a way to implement any device, whether it complies with a standard class or not.
Table of Contents
- The USB Way Of Thinking
- Specifying And Discovering Device Capabilities
- What To Do With Your Data
- Programming USB Devices
- JTAG Over USB
- Audio Over USB
To understand USB, one has to understand a dozen terms that form the foundation of the USB world. USB separates the host from the device: there is one host, connected to multiple devices. The host initiates all traffic and schedules it on the USB bus.
A device is a physical box at the end of the USB cable that identifies itself to the host by passing it a device descriptor and a configuration descriptor. These descriptors are binary data that describe the capabilities of the USB device. In particular, the configuration descriptor describes one or more interfaces, where each interface is a specific function of the device. A device may have multiple interfaces. For example, a USB device that comprises a keyboard with a built-in speaker will offer an interface for playing audio and an interface for key presses.
Each interface comprises a series of endpoints that are the communication channels between the host and the device. Endpoints are numbered between 0 and 15 and may be IN endpoints or OUT endpoints. These terms are relative to the host: OUT endpoints transport data to the device, and IN endpoints transport data to the host (Fig. 1). There are four types of endpoints:
- Bulk endpoints reliably transport data whenever it is required. Bulk data is acknowledged and therefore fault-tolerant.
- Isochronous endpoints are for transporting real-time data. A fixed bandwidth is allocated to them. The host allocates this bandwidth and will not allow an isochronous endpoint to be created if no bandwidth is available. In contrast, bulk endpoints have no guaranteed bandwidth.
- Interrupt endpoints are polled occasionally by the host and enable a device to report status changes.
- The control endpoint (endpoint 0) is used to perform general operations, such as obtaining descriptors, or performing a control operation such as “change the volume” or “set the baud rate” on any of the interfaces.
USB traffic is organized in frames. Frames are marked by the host sending a start of frame (SOF) every 125 µs (for high-speed USB) or every 1 ms (for full-speed USB). Isochronous endpoints are allocated a transfer in every frame. Interrupt endpoints are polled once every so many frames, and bulk transfers may happen at any time when the bus is not in use.
As an example, the aforementioned keyboard with built-in speaker has at least two endpoints: an isochronous OUT endpoint to transfer audio data to the speaker, and an interrupt IN endpoint to poll the keyboard. Suppose the speaker is a mono-speaker with a 48-kHz sample rate. The host then will send six samples of data every 125 µs (six samples/0.000125 seconds = 48,000 samples/second). If a sample occupies 16 bits, the host will reserve enough bandwidth to send a 96-bit OUT packet in every 125-µs frame. This consumes around 0.5% of the USB bandwidth. The remaining 99.5% is free for other interfaces or other USB devices on the same bus.
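The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the constant names are ours, and the last figure counts raw sample payload against the 480-Mbit/s high-speed signalling rate. It ignores token packets, packet headers, CRCs, and inter-packet gaps, so it is a lower bound on the share of the bus that is actually consumed.

```python
SAMPLE_RATE_HZ = 48_000                   # mono speaker from the example
BITS_PER_SAMPLE = 16
FRAMES_PER_SECOND = 8_000                 # one SOF every 125 us at high speed
HIGH_SPEED_BITS_PER_SECOND = 480_000_000  # raw signalling rate

samples_per_frame = SAMPLE_RATE_HZ // FRAMES_PER_SECOND
payload_bits_per_frame = samples_per_frame * BITS_PER_SAMPLE
payload_share = payload_bits_per_frame * FRAMES_PER_SECOND / HIGH_SPEED_BITS_PER_SECOND

print(samples_per_frame)        # 6
print(payload_bits_per_frame)   # 96
print(f"{payload_share:.2%}")   # 0.16% (payload only, overhead not modelled)
```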
The host initiates all USB traffic. When a device is plugged in, the host first requests the device descriptor. This descriptor comprises two sets of information that inform the host of the basic capabilities of the device: the device class and the vendor ID/product ID (VID/PID).
The class and subclass can be used to specify a device with generic capabilities. A USB speaker advertises itself as class Audio-2.0. A keyboard advertises itself as a HID-class (human interface device) device. The previous example of a device with both a speaker and a keyboard advertises itself as a Composite device class.
Devices that comply with a specific USB class are compatible across vendors and platforms. The USB specification defines many device classes that enable the generic implementation of, for example, Ethernet dongles, mixing desks, or flash disks, and that enable operating systems to provide generic drivers for these classes.
There are cases where the USB device does not fit a specific class or where the class specification is too constrained for a particular device. In that case, the class of the device must be described as vendor-specific. The operating system (OS) shall then use the VID and PID to find a vendor-specific driver.
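To make the idea of a descriptor as binary data concrete, the following sketch packs the 18-byte standard device descriptor for a vendor-specific device. The field layout follows the standard device descriptor of USB 2.0. The function name, the VID/PID values, and the choices of a 64-byte endpoint 0 packet size, no string descriptors, and one configuration are illustrative assumptions, not values taken from any real product.

```python
import struct

USB_DT_DEVICE = 0x01              # descriptor type: device
USB_CLASS_VENDOR_SPECIFIC = 0xFF  # class code for vendor-specific devices

def build_device_descriptor(vid, pid, device_class=USB_CLASS_VENDOR_SPECIFIC):
    """Pack a standard USB device descriptor (little-endian, 18 bytes)."""
    return struct.pack(
        "<BBHBBBBHHHBBBB",
        18,             # bLength
        USB_DT_DEVICE,  # bDescriptorType
        0x0200,         # bcdUSB: USB 2.0
        device_class,   # bDeviceClass
        0,              # bDeviceSubClass
        0,              # bDeviceProtocol
        64,             # bMaxPacketSize0 (assumed)
        vid,            # idVendor
        pid,            # idProduct
        0x0100,         # bcdDevice: device release 1.00 (assumed)
        0, 0, 0,        # iManufacturer, iProduct, iSerialNumber: no strings
        1,              # bNumConfigurations
    )

# Hypothetical VID/PID, for illustration only.
descriptor = build_device_descriptor(vid=0x1234, pid=0x5678)
print(descriptor.hex())
```

A host that finds the class byte set to 0xFF cannot pick a generic class driver, so it falls back to the VID/PID pair at offsets 8 to 11 to select a driver.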
When the device descriptor has been dealt with, the OS assigns the USB device a number and informs the device of that number (the device is being enumerated). It then requests the configuration descriptor, which specifies each interface in detail. In the earlier example, the configuration descriptor will specify two interfaces: one interface of class USB-Audio-2.0 with a single-channel output endpoint running at 48 kHz only, the other interface of class HID that specifies a single keyboard with a specific keymap.
There are cases where the USB device does not have any OS support and it should interact with a user program directly. In that case, a generic driver such as the open-source libusb driver that allows an application program to communicate with any USB device can be used. Typically, the device will be advertised as vendor-specific. Through the libusb interface the user program can detect a device with a VID and PID that it wants to interact with, claim an interface, open an endpoint, and send IN and OUT requests to that endpoint.
The enumeration of the device typically requires static descriptors to be sent to the host. The difficult bit is creating the descriptors. Serving them is simple, as that is the only task required of the device at the time. After enumeration, data may arrive or be requested on all endpoints in quick succession. This requires an interface between the software that deals with the function of the USB device (e.g., playing audio or monitoring keystrokes on the keyboard) and the low-level USB protocol. Prior to designing this interface, let’s look at how to handle data on various types of endpoints.
Bulk endpoints are the easiest to deal with. Since each data transfer is acknowledged, it is possible to send a negative acknowledge (NAK) stating that the device is not yet ready to deal with the endpoint. For example, if software is dealing with some other part of the device, or if data is simply not yet available (for example, a read from flash memory is not yet completed), the low-level USB driver can send a NAK.
However, sending NAKs has a downside. The only sensible option for the host is to retry the request, potentially creating a long series of requests that are aborted by NAKs. This wastes USB bandwidth that could have been used by other endpoints or devices. In addition, the host software is blocked until the device answers. Hence, NAKs should be a last resort. It may be more appropriate to send partial data than to NAK an IN request. In the case of an OUT request, little can be done. If there is no room to accept the data, then a NAK is the only answer. However, it may be more appropriate to introduce a high-level protocol that will not allow an OUT request until there is space.
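The policy described above can be sketched as two small decision functions. This is a simulation rather than a real USB stack: the function names, the string results "DATA", "ACK", and "NAK", and the 512-byte default packet size are assumptions made for illustration.

```python
def answer_bulk_in(pending: bytearray, max_packet: int = 512):
    """Answer an IN request: send whatever is ready, NAK only if nothing is."""
    if not pending:
        return ("NAK", b"")
    payload = bytes(pending[:max_packet])  # partial data is better than a NAK
    del pending[:max_packet]
    return ("DATA", payload)

def answer_bulk_out(buffer: bytearray, capacity: int, data: bytes):
    """Answer an OUT request: accept the data only if it fits in the buffer."""
    if len(buffer) + len(data) > capacity:
        return "NAK"                       # no room: the host must retry
    buffer.extend(data)
    return "ACK"

ready = bytearray(b"abc")
print(answer_bulk_in(ready, max_packet=2))  # ('DATA', b'ab')
print(answer_bulk_in(ready, max_packet=2))  # ('DATA', b'c')
print(answer_bulk_in(ready, max_packet=2))  # ('NAK', b'')
```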
Isochronous endpoints are more difficult to deal with because they are not acknowledged. The transmitter (in either direction) assumes that the data arrives. Since there is no acknowledgement on an isochronous endpoint, there is no possibility to send a NAK. Hence, if the device is not ready, the only course of action is to drop the data from an OUT packet or to send no data for an IN packet.
Although this may seem harsh at first, remember that the purpose of an isochronous endpoint is to transmit real-time data in a guaranteed time slice of the USB bus. If the device does not have room to store the OUT data, data is probably not dealt with in real-time. Dropping is a sensible course of action. If no data is available to answer an IN request, then the device has not collected enough data. A sensible course of action is to transmit whatever data is present, or possibly no data at all.
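A minimal sketch of this drop-or-send-what-you-have policy follows. The class and function names are hypothetical, and dropping a whole OUT packet (rather than keeping the part that fits) is one possible choice among several.

```python
from collections import deque

class IsoOutBuffer:
    """Bounded sample buffer that drops OUT packets it cannot store."""

    def __init__(self, capacity_samples):
        self.capacity = capacity_samples
        self.samples = deque()
        self.dropped = 0

    def on_out_packet(self, packet):
        room = self.capacity - len(self.samples)
        if len(packet) > room:
            self.dropped += len(packet)  # no NAK is possible, so drop the data
            return False
        self.samples.extend(packet)
        return True

def answer_iso_in(collected: deque, wanted: int):
    """Answer an IN request with whatever has been collected, possibly nothing."""
    count = min(wanted, len(collected))
    return [collected.popleft() for _ in range(count)]

out_buffer = IsoOutBuffer(capacity_samples=12)
out_buffer.on_out_packet([0] * 6)
out_buffer.on_out_packet([0] * 6)
out_buffer.on_out_packet([0] * 6)   # consumer fell behind: this packet is dropped
print(len(out_buffer.samples), out_buffer.dropped)  # 12 6
```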
Assuming that the data can be processed or produced in real time, it is easy to compute the buffer requirements for an isochronous endpoint:
- For an OUT endpoint, the worst possible case is that the host posts one OUT request right at the end of a USB frame, and then immediately after the start of frame (SOF) it posts a second OUT request. This means that two OUT requests, carrying 250 µs of data, are received in quick succession. Hence, the buffering scheme must be able to buffer at least 250 µs worth of data. As long as the program does not consume data from this buffer until the SOF following the first packet, the buffer will never empty, providing a continuous data stream from host to device.
- For an IN endpoint, the worst case is similar. The host could perform two IN transfers in short succession just before and immediately after a SOF. This means the IN buffer needs to hold at least 250 µs of data too, and the buffer should contain 125 µs of data at the start of each frame.
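The two-frame rule above translates directly into a buffer size. The helper below is a sketch with a hypothetical name; it rounds the sample count up so that sample rates that do not divide evenly into frames still get enough room.

```python
def iso_buffer_bytes(sample_rate_hz, channels, bytes_per_sample, frame_us=125):
    """Minimum buffer size for two back-to-back frames of isochronous data."""
    worst_case_frames = 2
    numerator = sample_rate_hz * frame_us * worst_case_frames
    samples = -(-numerator // 1_000_000)   # ceiling division on integers
    return samples * channels * bytes_per_sample

print(iso_buffer_bytes(48_000, channels=1, bytes_per_sample=2))   # 24
print(iso_buffer_bytes(96_000, channels=2, bytes_per_sample=3))   # 144
```

For the 48-kHz mono speaker, 250 µs of data is 12 samples, or 24 bytes at 16 bits per sample.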
It is worth comparing bulk and isochronous transfers from a perspective of coping with errors. In bulk transfers, the data itself is critical. The host and device can retry and slow down, as long as the data is transferred correctly, and this transfer must be acknowledged. For an isochronous transfer, the timing is critical. Either side can throw data away, as long as the real-time characteristics of data further along in the stream are adhered to. (Of course, the decision to drop data should not be taken lightly as it will have an impact on the fidelity of, for example, a video or audio stream.)
The data-centric versus time-centric approach has a knock-on effect on the consequences of bit errors. A cyclic redundancy code (CRC) for error detection protects all USB traffic. A corrupted bulk transfer must be retried until the data is transferred without error. In contrast, a corrupted isochronous transfer will simply be dropped. The transmitting side will be unaware that data was dropped. The receiving side may know that the transfer was dropped (if the header with the endpoint was not corrupted), but even then it may not be possible to determine how many bytes the transfer contained. When streaming real-time video or audio this is important, since there will be a gap of unknown length in the stream that has to be filled on a best-effort basis.
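One simple best-effort strategy, sketched below, is to assume that a dropped transfer carried the nominal number of samples and to repeat the last good sample across the gap. This is only one of many possible concealment schemes, and the function name and the use of `None` to mark a dropped transfer are assumptions made for the example.

```python
def rebuild_stream(packets, nominal_samples):
    """Rebuild a sample stream, filling each dropped packet (None) by
    repeating the last good sample for the nominal packet length."""
    stream = []
    last_sample = 0
    for packet in packets:
        if packet is None:
            stream.extend([last_sample] * nominal_samples)
        else:
            stream.extend(packet)
            if packet:
                last_sample = packet[-1]
    return stream

print(rebuild_stream([[1, 2, 3], None, [7, 8, 9]], nominal_samples=3))
# [1, 2, 3, 3, 3, 3, 7, 8, 9]
```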
Interrupt endpoints inquire about current state. This may be data that is not too time-critical (such as a key press), or it may be time-critical data (such as the X and Y location of a mouse or other pointing device). In the first case, a few microseconds of delay between typing the key and reporting it won’t hurt. However, when reporting mouse locations, irregular reporting may lead to unintended results.
Having seen how to deal with different types of endpoints, we can develop a programming model for software-based USB devices. It is helpful to keep in mind how USB operates:
- There are one or more endpoints, for one or more interfaces, where traffic may arrive or depart at any time.
- Transfers on isochronous endpoints are time-critical.
- At most one transfer happens at a time.
The first two points suggest a multi-threaded programming structure, especially if more than a single interface is concerned, or if isochronous endpoints are being used (Fig. 2). The basic software architecture assumes that there is some sort of USB device library and that for each endpoint we implement a thread that deals with USB transfers on that endpoint. Other parts of the system, not directly connected to the USB device library, are implemented using additional threads.
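The thread-per-endpoint structure can be sketched with standard Python threads and queues. Here each queue stands in for the USB device library delivering transfers to one endpoint; the endpoint names and the `None` shutdown marker are assumptions made for this illustration.

```python
import queue
import threading

def endpoint_thread(name, transfers, results):
    """Serve one endpoint: block until a transfer arrives, then handle it."""
    while True:
        data = transfers.get()
        if data is None:                  # shutdown marker
            break
        results.put((name, len(data)))    # stand-in for real processing

endpoints = {"audio_out": queue.Queue(), "keyboard_in": queue.Queue()}
results = queue.Queue()
threads = [
    threading.Thread(target=endpoint_thread, args=(name, q, results))
    for name, q in endpoints.items()
]
for thread in threads:
    thread.start()

# The "USB device library" delivers one transfer to each endpoint.
endpoints["audio_out"].put(bytes(12))
endpoints["keyboard_in"].put(b"\x04")

for q in endpoints.values():
    q.put(None)
for thread in threads:
    thread.join()

handled = sorted(results.get() for _ in range(2))
print(handled)  # [('audio_out', 12), ('keyboard_in', 1)]
```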
Note that one thread per endpoint may not be required and may not be the most elegant method. Given that only one transaction happens at a time (the third point), we can create a version of the system that relies on fewer threads. Suppose that we want to implement a synchronous protocol over two endpoints where the host will always transmit data over a bulk OUT endpoint, prior to receiving data on an associated IN endpoint. This protocol requires only a single thread that handles OUT and IN transactions in order on those endpoints.
This optimization is not without risk. Using a single thread per endpoint naturally caters to the situation where the host program was aborted and restarted between the OUT and IN transaction. In this case, the sequence of transactions seen on the device will be ..., OUT, IN, OUT, IN, OUT, OUT, IN, ..., and the thread dealing with OUT transactions must swallow the extra OUT. When optimized away to a program that sequentially consumes OUT and IN in order, this program must be written so that at any time it may expect the protocol to reset.
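The following sketch shows a single sequential handler that tolerates such a reset. The request/reply protocol (the reply is the upper-cased request) is invented purely for illustration, as are the function name and the tuple representation of transactions.

```python
def run_sync_protocol(transactions):
    """Process ("OUT", data) and ("IN", None) transactions in arrival order.

    A second OUT without an intervening IN is treated as a protocol reset:
    the stale request is discarded and the new one takes its place.
    """
    pending_request = None
    replies = []
    for direction, data in transactions:
        if direction == "OUT":
            pending_request = data          # overwrites any stale request
        elif pending_request is not None:
            replies.append(pending_request.upper())
            pending_request = None
        else:
            replies.append(b"")             # IN with nothing outstanding
    return replies

sequence = [("OUT", b"a"), ("IN", None),
            ("OUT", b"b"), ("OUT", b"c"), ("IN", None)]
print(run_sync_protocol(sequence))  # [b'A', b'C']
```

The request `b"b"` is swallowed because the host restarted before collecting its reply.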
The third point enables a further optimization. A single thread can deal with all bulk traffic on all interfaces, optimizing multiple endpoints into a single thread (Fig. 3). The single thread receives a request (IN or OUT) on any endpoint and deals with that request, whereupon it moves on to the next request, possibly on a different endpoint. If the next request arrives before the last request has been dealt with completely, the USB device library sends NAKs, temporarily holding up the host. This optimization has one disadvantage: the single thread must keep state for each endpoint and is effectively context switching on each request. We will show an example of this later.
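The per-endpoint state that such a single thread must keep can be sketched as a small dispatcher. The class name and the echo behaviour (an IN request returns whatever earlier OUT requests stored for the same endpoint number) are assumptions chosen to keep the example short.

```python
class BulkDispatcher:
    """One handler for all bulk endpoints, with separate state per endpoint."""

    def __init__(self, endpoint_numbers):
        self.pending = {number: bytearray() for number in endpoint_numbers}

    def handle(self, endpoint, direction, data=b""):
        buffer = self.pending[endpoint]   # "context switch" to this endpoint
        if direction == "OUT":
            buffer.extend(data)
            return b""
        reply = bytes(buffer)
        buffer.clear()
        return reply

dispatcher = BulkDispatcher([1, 2])
dispatcher.handle(1, "OUT", b"jtag")
dispatcher.handle(2, "OUT", b"uart")      # interleaved request, other endpoint
print(dispatcher.handle(1, "IN"))         # b'jtag'
print(dispatcher.handle(2, "IN"))         # b'uart'
```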
The same optimization cannot be applied to isochronous endpoints. If we had a single thread dealing with all isochronous data, it would involve FIFOs for each endpoint from which the thread will read data or post data. These FIFOs will increase latency, which is often undesirable.
The rest of this article discusses two examples of the software architecture and optimizations. One example uses vendor-specific drivers and mostly bulk endpoints (JTAG over USB), and the other shows a standard USB class with mostly isochronous endpoints (Audio over USB).
For debugging programs on embedded processors, it is common to use a protocol such as JTAG for accessing the internal state of the processor and to use a program such as gdb to run on a PC to interpret and modify state, set breakpoints, single step, and so on. USB can be used to provide a cross-platform portable transport layer between the PC and JTAG wires.
These devices are often called JTAG keys. In addition to JTAG, they often contain a UART for text I/O from the embedded program. JTAG keys do not follow any standard USB class. Hence, the descriptor labels them as vendor-specific, and it is up to us to define an endpoint structure that is fit for purpose. One endpoint structure would use six endpoints:
- Two endpoints that control the USB device itself (endpoint 0 IN and OUT, required by USB)
- An IN and OUT endpoint for JTAG traffic
- An IN and OUT endpoint for UART traffic
Since there is no USB standard, we can define the protocol for the JTAG traffic and choose a set of commands such as “send a clock with TMS high” or “read the program counter.” On the host side, our program can use libusb (an open-source USB driver library) to search for a device with our VID and PID, claim the interface, and then use the libusb interface to send IN and OUT transactions to both the JTAG and UART endpoints.
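Because the protocol is ours to define, its wire format can be as simple as an opcode, a length, and a payload. The sketch below shows one possible framing; the opcode values, the three-byte header, and the function names are entirely hypothetical and do not describe any existing JTAG key.

```python
import struct

# Hypothetical opcodes for a vendor-specific JTAG protocol.
CMD_CLOCK_TMS_HIGH = 0x01
CMD_READ_PC = 0x02

HEADER = "<BH"   # 1-byte opcode, 2-byte little-endian payload length

def encode_command(opcode, payload=b""):
    """Frame a command for the bulk OUT endpoint."""
    return struct.pack(HEADER, opcode, len(payload)) + payload

def decode_command(packet):
    """Parse a framed command on the device side."""
    opcode, length = struct.unpack_from(HEADER, packet)
    payload = packet[3:3 + length]
    if len(payload) != length:
        raise ValueError("truncated command")
    return opcode, payload

packet = encode_command(CMD_CLOCK_TMS_HIGH, b"\x05")
print(packet.hex())            # 01010005
print(decode_command(packet))  # (1, b'\x05')
```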
Figure 4 shows a suitable software architecture for the device end. Given that all endpoints are for bulk traffic, they can all be mapped onto a single thread, with two separate threads dealing with the state machines for JTAG traffic and UART traffic. Figure 5 shows a sample implementation.
As an example of a standard USB device, let’s discuss Audio over USB. The Audio-2.0 Class standard allows interoperability of devices across platforms: a consumer can buy a USB microphone or USB speakers and plug them into any computer that supports Audio over USB. The number of channels, sampling rate, and sample depth can be varied to support anything from low-channel-count consumer devices to high-quality, high-channel-count professional audio.
Devices that are more complex also are supported. The descriptor has a syntax for describing mixers, volume controls, equalizers, clocks, resampling, MIDI, and many other functions, although not all of those functions are recognized by all operating systems.
On the host side, all USB traffic carrying audio samples is directed to the USB-Audio driver, which interacts through some general kernel sound interface with the program using audio, such as Skype. Other data, such as MIDI, can be handled through a separate interface by a separate driver.
The device is designed to use USB Audio Class 2.0, and the standard specifies the endpoints that we need to use. If the application has to support MIDI, stereo in, and stereo out with a clock controlled by the device, then the standard dictates that there shall be seven endpoints:
- Two endpoints that control the USB device itself (endpoint 0 IN and OUT, required by USB)
- An isochronous IN endpoint for the I2S analog-to-digital converter (ADC)
- An isochronous OUT endpoint for the I2S digital-to-analog converter (DAC)
- An isochronous IN endpoint for feedback on the clock speed
- A bulk IN endpoint and bulk OUT endpoint for MIDI
The endpoints for the ADC and DAC have one IN and OUT transaction every microframe, every 125 µs. Assuming that the DAC and ADC operate with a 96-kHz sample rate, 12 samples are sent in each direction every 125 µs. Note that there are two independent oscillators: the device controls the 96-kHz sample rate, and the host controls the 125-µs microframe rate.
As these clocks are independent, they will drift relative to each other, and there won’t always be 12 samples in each transfer. The vast majority of the transfers will have 12 samples, but sometimes there will be 13 or 11 samples.
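The effect of clock drift can be reproduced with an integer accumulator. In this sketch the device clock is assumed to run slightly fast, so that the host sees 96,010 samples per second of its own time; the function name and the drift figure are illustrative.

```python
def samples_per_transfer(sample_rate_hz, microframes, microframe_hz=8_000):
    """Number of samples in each transfer, as seen on the host's time base."""
    counts = []
    accumulator = 0
    for _ in range(microframes):
        accumulator += sample_rate_hz
        whole_samples, accumulator = divmod(accumulator, microframe_hz)
        counts.append(whole_samples)
    return counts

counts = samples_per_transfer(96_010, microframes=8_000)  # one second of traffic
print(sum(counts))        # 96010
print(counts.count(12))   # 7990
print(counts.count(13))   # 10
```

A device clock that runs slow (for example 95,990 Hz on the host's time base) produces occasional 11-sample transfers in the same way.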
The device uses the third isochronous endpoint to inform the host of the current speed. It is sampled once every few milliseconds and reports the current sample rate in terms of samples per microframe. The MIDI endpoints carry MIDI data as and when available. The standard provides flexibility, allowing us to easily add more audio channels or audio processing.
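As a sketch of what the feedback value might look like on the wire, the function below encodes a samples-per-microframe figure as a 32-bit little-endian number in 16.16 fixed-point format. That format is our assumption for a high-speed feedback endpoint and should be checked against the USB 2.0 and Audio Class specifications; the function name is hypothetical.

```python
def encode_feedback(samples_per_microframe):
    """Encode the feedback value as 16.16 fixed point, little-endian."""
    fixed_point = round(samples_per_microframe * (1 << 16))
    return fixed_point.to_bytes(4, "little")

print(encode_feedback(12.0).hex())    # 00000c00  (exactly 12 samples)
print(encode_feedback(12.5).hex())    # 00800c00  (12.5 samples)
```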
Figure 6 shows the software architecture for this device. Unlike the previous example, there is little that can be optimized. The class specification dictates the endpoint structure. With three isochronous endpoints, it is advisable to have three threads ready to accept and provide data on these endpoints. The only optimization that is feasible is for a single thread to handle endpoint 0 and the MIDI endpoints (Fig. 7).
USB devices comprise many interfaces that run concurrently and endpoints that are either bulk or isochronous. Bulk endpoints are for reliable data transport between host and device, whereas isochronous endpoints are for real-time data transport.
When programming USB device endpoints, it is easiest to see those endpoints as individual software threads. Some of those can be mapped onto a single thread, but the programmer has to understand the consequences. In particular, mapping multiple isochronous endpoints onto a single software thread will introduce an (unpredictable) latency in the real-time stream.
- Audio class specification: http://www.usb.org/developers/devclass_docs/Audio2.0_final.zip
- Libusb documentation: http://libusb.sourceforge.net/api-1.0/