Sunday, August 12, 2018

Streaming from isochronous USB devices using WinUsb


I was struggling to figure out how to do isochronous USB transfers using WinUSB. The documentation is accurate but sparse, and if you’re not already familiar with how the whole process is supposed to work, it can be maddening to figure out what you’re doing wrong. To save other people from going through this pain, I wanted to present a working example, as well as the conceptual understanding that finally made it click for me.

First of all, some background on isochronous USB transfers. Start with the Microsoft documentation on how to transfer USB data: https://docs.microsoft.com/en-us/windows-hardware/drivers/usbcon/transfer-data-to-isochronous-endpoints Although it’s written from the perspective of someone developing a USB device driver using the WDF framework, almost all the concepts apply directly to WinUSB, which seems to basically be a thin wrapper around the WDF APIs, exposing them to user mode.

The next link, from BeyondLogic, is also pretty good: https://www.beyondlogic.org/usbnutshell/usb4.shtml#Isochronous It gives a good high-level overview of why one would use isochronous transfers.

For this example, I’m using the Cypress EZ-USB2 development kit, and the firmware from this application note: http://www.cypress.com/documentation/application-notes/an4053-streaming-data-through-isochronous-or-bulk-endpoints-ez-usb

Unlike the application note, I didn’t want to use Cypress’s proprietary USB driver and sample application, because I didn’t want another thing to distribute, and also because it seems that to use it your application must be written in managed code, which I didn’t want to use.

With that out of the way, here’s the conceptual understanding that made sense to me. Everything below is in the context of receiving isochronous data from a high-speed peripheral, although other types of transfers are similar. Originally, there was just USB full-speed, and it transferred data in units of frames. Each frame is 1 millisecond long, and the sending and receiving of frames is managed by the USB host controller in hardware. It’s real-time and happens every millisecond, without requiring OS assistance.

Every frame (millisecond), the USB host controller asks the peripheral for data. In response, the USB device can transfer up to 1023 bytes (the full-speed isochronous limit). If the device doesn’t have that much data to send, no worries, it can send less every frame.

However, the whole idea behind isochronous transfer is that the OS guarantees bandwidth such that, so long as nothing anomalous happens, the peripheral will always be able to return the amount of data it needs every frame. To do this, the OS ensures that the isochronous devices on the bus never reserve more bandwidth than is available.

If a device truly needs to send 1023 bytes of data every frame, it’s simple. That takes up most of the frame, and there is little room left for other isochronous devices - they’re mostly out of luck.

In practice, many devices will use less than the maximum and could run concurrently with other devices, if the OS knew how much of those 1023 bytes the device was actually going to use. This is the information that the max packet size field of the endpoint descriptor provides. A device could specify 512 bytes, for example, leaving roughly half the bandwidth available for other devices.

That’s USB 1 full-speed. Every millisecond, a frame occurs, and the USB host controller asks all isochronous peripherals for their data. They promise not to send more than they specified in their max packet size field, although they may send less. If they frequently send less, it’s wasteful in that it prevents other devices from using that bandwidth.

Then came USB high-speed with USB 2.0, increasing the link bandwidth 40x (from 12 Mbps to 480 Mbps). For backward compatibility with existing devices, USB 2 keeps the same terminology as USB 1, with frames of data happening every millisecond. To accommodate the increased data rate, it subdivides each frame into 8 microframes, each lasting 125 microseconds. The amount of data per microframe also increased: now a peripheral can send up to 3 times 1024 bytes of data. As before, it may also send less per microframe.

As with USB 1, the peripheral tells the OS the worst-case bandwidth it’s going to need, so that the OS and host controller can ensure they reserve enough time on the bus; however, the mechanism for doing so is a little more complicated due to backward compatibility.

There are now two mechanisms for the peripheral to tell the OS it needs less than the maximum possible bandwidth. The first is the interval field of the descriptor. Unlike USB 1, where the controller asked the peripheral for data every frame, USB 2 lets the peripheral tell the controller to skip microframes and only ask the peripheral for data every microframe, every other microframe, every 4th microframe, or every 8th microframe (by setting the interval field to 1, 2, 3, or 4; the period is 2^(interval-1) microframes).
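To make the interval encoding concrete, here is a minimal sketch of the conversion from the descriptor’s interval value to a polling period in microframes. It assumes the high-speed isochronous rule from the USB 2.0 specification (period = 2^(bInterval-1)); the function name is made up for this post and is not part of WinUsb.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a WinUsb API): converts a high-speed isochronous
// endpoint's bInterval into a polling period measured in microframes.
// Assumes the USB 2.0 rule period = 2^(bInterval - 1), valid for 1..16.
inline std::uint32_t PeriodInMicroframes(std::uint8_t bInterval) {
    assert(bInterval >= 1 && bInterval <= 16);
    return 1u << (bInterval - 1);
}
```

With this rule, an interval of 1 means every microframe, and an interval of 4 means every 8th microframe, i.e. once per 1 ms frame.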

The second way is the wMaxPacketSize field of the endpoint descriptor. The max packet size is 1024 bytes (essentially unchanged from full-speed), but each microframe is now considered capable of carrying up to three transactions (of up to 1024 bytes each). This information is encoded bit-wise into that field, so that bits 0 through 10 are the packet size (same as USB 1), but bits 11-12 are now the number of additional transactions per microframe.

Effectively, the spec authors used the additional bits beyond those needed for the packet size to encode another field indicating whether the peripheral is going to send 1 transaction per microframe ([12:11] = 2'b00), 2 ([12:11] = 2'b01), or 3 ([12:11] = 2'b10). I found this really confusing at first, so hopefully this explanation helps.
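As a sketch of that bit layout, the following splits a raw wMaxPacketSize value into its two parts. The struct and function names are my own invention for illustration; only the bit positions come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration; not a WinUsb structure.
struct IsoPacketInfo {
    std::uint16_t packetSize;                 // bits 10:0 of wMaxPacketSize
    std::uint8_t  transactionsPerMicroframe;  // 1 + bits 12:11
};

// Decodes a high-speed isochronous endpoint's wMaxPacketSize.
// The encoding 2'b11 in bits 12:11 is reserved, so this sketch
// treats it as a programming error.
inline IsoPacketInfo DecodeMaxPacketSize(std::uint16_t wMaxPacketSize) {
    const std::uint8_t additional =
        static_cast<std::uint8_t>((wMaxPacketSize >> 11) & 0x3);
    assert(additional != 0x3);
    IsoPacketInfo info;
    info.packetSize = static_cast<std::uint16_t>(wMaxPacketSize & 0x07FF);
    info.transactionsPerMicroframe = static_cast<std::uint8_t>(additional + 1);
    return info;
}
```

For example, a raw value of 0x1400 decodes to a 1024-byte packet size with 3 transactions per microframe.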

So, if your device needed to transfer the maximum possible data, it would set the interval to 1 (ask for data every microframe), the packet size to 1024, and the number of transactions to 2'b10 (3 packets a microframe). Alternatively, if you took the full-speed peripheral used as an example earlier, which only needed to send 512 bytes of data every frame, and converted it to a high-speed device, you could set the interval to 4 (only ask for data every 8th microframe), the packet size to 512, and the number of packets to 2'b00 (only 1 packet per microframe).
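To put numbers on those two configurations, here is a small sketch that computes the worst-case bandwidth a high-speed isochronous endpoint reserves. The function name is hypothetical, and it assumes the period is given in microframes (1 for every microframe, 8 for every 8th).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: worst-case payload bytes per second reserved by a
// high-speed isochronous endpoint. There are 8000 microframes per second
// (8 per 1 ms frame).
inline std::uint64_t WorstCaseBytesPerSecond(std::uint32_t packetSize,
                                             std::uint32_t transactionsPerMicroframe,
                                             std::uint32_t periodInMicroframes) {
    assert(periodInMicroframes != 0);
    const std::uint64_t kMicroframesPerSecond = 8000;
    return kMicroframesPerSecond * packetSize * transactionsPerMicroframe /
           periodInMicroframes;
}
```

The maximum configuration (1024 bytes, 3 transactions, every microframe) works out to 24,576,000 bytes per second, while the converted 512-byte device (1 transaction, every 8th microframe) reserves 512,000 bytes per second.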

When using less than the maximum, there are often several ways to split up the bandwidth required, depending on how many microframes you skip and how many packets per microframe you use. It’s not clear to me whether it’s better, from a bus utilization perspective, to send a little bit of data every microframe or to skip a bunch of microframes and then send it all at once. If you know, leave a comment. In some cases, the choice is decided for you by the buffer sizes of your peripheral. If like me you’re using the EZ-USB2 device, it only has 4kB total buffer size, so above a certain data transfer rate you can’t skip too many microframes or the buffers will overflow. But assuming you have enough space, I’m not sure which approach is best.

In any case, that’s the summary of how USB 2 high-speed isochronous transfers work at the specification level. Every millisecond there’s a frame. This is divided into 8 microframes. Every 125 microseconds, the controller processes a microframe, and if it’s your peripheral’s turn (as specified by the interval), the controller asks your peripheral for data. Your peripheral can send 1, 2, or 3 transactions, each containing up to 1024 bytes of data.

Now let’s get into the way that the Windows WinUsb framework handles isochronous data.
Use the existing examples to obtain a handle to your device (https://docs.microsoft.com/en-us/windows-hardware/drivers/usbcon/using-winusb-api-to-communicate-with-a-usb-device), and select the alternate setting of the interface which streams isochronous data. (You’ll always need to do this, because the USB specification states that the default setting of an interface cannot include isochronous endpoints with a non-zero payload size. Do it via https://docs.microsoft.com/en-us/windows/desktop/api/winusb/nf-winusb-winusb_setcurrentalternatesetting , or WinUsb won’t transfer data.)


Now you’re ready to set up your isochronous data transfer. The way this works is that in your application, you allocate some memory for a data buffer into which you want WinUsb to transfer data. Tell WinUsb about it with https://docs.microsoft.com/en-us/windows/desktop/api/winusb/nf-winusb-winusb_registerisochbuffer . Unlike reading from a bulk pipe, you use this same buffer for all your data transfers and you tell WinUsb about it before you start. This allows it to tell the memory manager that it’s going to need physical pages for your virtual buffer, and that they need to be accessible from kernel mode, and that it must keep them where they are and not page them out or move them around. By making these guarantees, it can tell the controller to DMA transfer data into those pages of memory without going through these potentially time consuming steps every transfer and potentially missing data from the device.

Next, WinUsb works on the concept of frame numbers. There’s a counter that continuously increments every time the hardware starts processing a frame, and isochronous transfers are always scheduled against one of those absolute frame numbers. In order to avoid dropping isochronous data, you need to ensure a transfer is scheduled to cover every frame in which your device sends data (with an interval of 1 through 4 that is every single frame, since even an interval of 4 means one microframe in every frame).

As an application writer, you *can* manage this yourself by querying the current frame number from the hardware and scheduling your transfer far enough in the future that you don’t miss the frame you want, but soon enough that you don’t lose data already streaming in from your device. I imagine there are scenarios where that housekeeping is required, but I found it easier to use https://docs.microsoft.com/en-us/windows/desktop/api/winusb/nf-winusb-winusb_readisochpipeasap

The way this works is that it schedules the transfer as soon as possible, on whatever frame number is next up in the hardware when you make the call. If you were to just call this routine, wait for the data to return, then call it again, it’s likely enough time has passed in the various software work required that you would have missed the next microframe of data from your device. To avoid this, use overlapped I/O and queue up several transfers in sequence.

When the first transfer finishes, WinUsb can be told to automatically start streaming data into the next queued transfer with no interruption or missing microframes. The way this works is by setting the ContinueStream parameter to TRUE when queuing up transfers after the first one. This in effect uses multiple transfers to set up a circular buffer which the hardware continuously streams data into, and conceptually looks something like this:

// Pseudocode: the real WinUsb calls take more arguments (see the concrete example).
char buf[NUM_TRANSFERS * TRANSFER_SIZE];
bool firstIo = true;
// Queue up the transfers
for (auto i = 0; i < NUM_TRANSFERS; i++) {
    WinUsb_ReadIsochPipeAsap(buf + i*TRANSFER_SIZE, !firstIo /*ContinueStream*/);
    firstIo = false;
}

while (true) {
    for (auto i = 0; i < NUM_TRANSFERS; i++) {
        // Wait for each transfer in sequence to finish.
        WinUsb_GetOverlappedResult(i);
        // Process the data for this transfer.
        // data = buf + i*TRANSFER_SIZE;
        // Add it back to the queue of transfers for hardware to stream into.
        WinUsb_ReadIsochPipeAsap(buf + i*TRANSFER_SIZE, true /*ContinueStream*/);
    }
}

NUM_TRANSFERS will need to be at least 2 so that there’s always a transfer ready to accept data while you’re working on the other one, but can be larger to smooth out jitter in your application data processing at the cost of increased latency. As long as you process the data faster than it's arriving from hardware, this goes on indefinitely with no missing data.

In practice it’s a little more complicated, stemming from the observation at the start that devices can transfer up to their stated maximum, but also maybe less as they see fit. For example, suppose you have a device which lists the maximum possible transfer size of 3x1024 bytes each microframe. WinUsb works on USB frames, so each time you ask it to transfer data you’ll need to ensure you ask for at least 8x3x1024 = 24kB of data (because that’s how much your device said it might transfer each frame, since there’s 8 microframes per frame per the USB high speed specification). One possibility is it transfers this maximum amount of data, and you get it all back in sequential order in your provided buffer.
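The sizing arithmetic above can be sketched as a small helper. It follows the same formula used in the concrete example at the end of this post; the function name is hypothetical, and it assumes the interval is expressed in microframes and divides a frame evenly (1, 2, 4, or 8).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes to request per transfer so that it covers
// framesPerTransfer whole 1 ms frames at the device's worst-case rate.
// maxBytesPerInterval is what the device may send each time it is polled.
inline std::uint32_t IsochTransferSize(std::uint32_t framesPerTransfer,
                                       std::uint32_t maxBytesPerInterval,
                                       std::uint32_t intervalInMicroframes) {
    assert(intervalInMicroframes == 1 || intervalInMicroframes == 2 ||
           intervalInMicroframes == 4 || intervalInMicroframes == 8);
    const std::uint32_t pollsPerFrame = 8 / intervalInMicroframes;
    return framesPerTransfer * maxBytesPerInterval * pollsPerFrame;
}
```

For the device above (3072 bytes every microframe), one frame needs 24576 bytes, and a 10-frame transfer needs 245760 bytes.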

It might also transfer less. Potentially nothing for some transactions or microframes. It would be nice if WinUsb handled this for you, concatenating the bits of data one right after another in your buffer, then just telling you at the end how much data in total was actually transferred. It doesn’t. What actually happens (probably at the hardware layer? let me know), is that it divides your buffer up into smaller chunks of however many bytes your peripheral said it could need per microframe (3072 in our case).

In effect, you can think of your buffer being represented to WinUsb not as a

char buf[NUM_TRANSFERS * TRANSFER_SIZE];
but rather

char buf[NUM_TRANSFERS][MICROFRAMES_PER_TRANSFER][MAX_BYTES_PER_MICROFRAME];

As the application author, you get to choose NUM_TRANSFERS and MICROFRAMES_PER_TRANSFER (so long as it’s big enough to hold an entire frame of data or multiples thereof, so 8*n, n>=1 in our case, but it could be less if your device has a larger interval value); MAX_BYTES_PER_MICROFRAME, however, is dictated by the peripheral’s descriptor.
Each time the controller asks the peripheral for data, it advances to the next entry in the microframe dimension, then fills that entry with as much as the peripheral returned. If there’s extra space at the end of that entry, that’s your problem as the application developer to deal with.
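One way to picture that layout is to compute where a given microframe’s chunk starts in the flat buffer. This is only a sketch of the indexing described above; the function and parameter names are mine, not WinUsb’s.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset of the chunk for (transfer, microframe)
// inside one flat buffer laid out as
// [NUM_TRANSFERS][MICROFRAMES_PER_TRANSFER][MAX_BYTES_PER_MICROFRAME].
inline std::size_t ChunkOffset(std::size_t transfer,
                               std::size_t microframe,
                               std::size_t microframesPerTransfer,
                               std::size_t maxBytesPerMicroframe) {
    assert(microframe < microframesPerTransfer);
    return (transfer * microframesPerTransfer + microframe) * maxBytesPerMicroframe;
}
```

With 8 microframes per transfer and 3072 bytes per microframe, the second transfer starts at byte 24576, regardless of how much data the device actually delivered into the first one.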

In order to make it possible for you to deal with it, WinUsb returns a list of values indicating how full each of those microframe entries actually was, so you know how much to process out of each entry before moving on to the next.
As a result, the pseudo-code above, with its data processing step filled in, actually looks more like:

char buf[NUM_TRANSFERS][MICROFRAMES_PER_TRANSFER][MAX_BYTES_PER_MICROFRAME];
int howMuchData[NUM_TRANSFERS][MICROFRAMES_PER_TRANSFER];
bool firstIo = true;
// Queue up the transfers
for (auto i = 0; i < NUM_TRANSFERS; i++) {
    WinUsb_ReadIsochPipeAsap(buf[i], &howMuchData[i], !firstIo /*ContStreaming*/);
    firstIo = false;
}

while (true) {
    for (auto i = 0; i < NUM_TRANSFERS; i++) {
        // Wait for each transfer in sequence to finish.
        WinUsb_GetOverlappedResult(i);
        // Process the data for this transfer.
        for (auto j = 0; j < MICROFRAMES_PER_TRANSFER; j++) {
            char* data = buf[i][j];
            int dataLen = howMuchData[i][j];
            // Process the dataLen bytes starting at data from this microframe.
        }
        // Add it back to the queue of transfers for hardware to stream into.
        WinUsb_ReadIsochPipeAsap(buf[i], &howMuchData[i], true /*ContStreaming*/);
    }
}

If the rest of your application expects to receive the data as one continuous stream in a nice contiguous buffer, you’ll have to come up with some scheme, such as memcpying the chunks into a separate contiguous buffer.
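A minimal sketch of that memcpy scheme, written so it can be tried without any hardware: PacketDescriptor below is a hypothetical stand-in for the Offset and Length fields of USBD_ISO_PACKET_DESCRIPTOR, and CompactTransfer is a name I made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the two packet descriptor fields used here.
struct PacketDescriptor {
    std::uint32_t Offset;  // where this packet's chunk starts within the transfer
    std::uint32_t Length;  // how many bytes the device actually delivered
};

// Copies the filled part of each chunk into one contiguous buffer,
// dropping the unused space at the end of every chunk.
inline std::vector<unsigned char> CompactTransfer(
        const unsigned char* transfer,
        const std::vector<PacketDescriptor>& packets) {
    std::size_t total = 0;
    for (const auto& p : packets) total += p.Length;

    std::vector<unsigned char> out(total);
    std::size_t dst = 0;
    for (const auto& p : packets) {
        if (p.Length == 0) continue;  // nothing arrived in this microframe
        std::memcpy(out.data() + dst, transfer + p.Offset, p.Length);
        dst += p.Length;
    }
    assert(dst == total);
    return out;
}
```

The extra copy costs some CPU time, but it keeps the rest of the application unaware of the per-microframe chunking.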

That’s basically it as far as how to use WinUsb for isochronous data transfer goes at a conceptual level.

Finally, here’s a concrete example (error checking omitted for brevity):
#include <SDKDDKVer.h>
#include <stdlib.h>
#include <stdio.h>
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <winusb.h>
#include <usb.h>
#include <initguid.h>
#include <cfgmgr32.h> // CM_* APIs
#include <strsafe.h>
DEFINE_GUID(GUID_DEVINTERFACE_USBPer, 0x1ab4ce39, 0x4959, 0x4c97, 0x84, 0x02, 0x85, 0xa2, 0xe5, 0xa2, 0x6b, 0x8f);
#define XFER_SZ 10  // Frames (milliseconds) of data per transfer.
#define NUM_XFERS 8 // Number of transfers kept queued at once.
int main()
{
    // See https://docs.microsoft.com/en-us/windows-hardware/drivers/usbcon/using-winusb-api-to-communicate-with-a-usb-device for opening the device.

    // Get the device path string.
    PWCHAR DeviceInterfaceList = NULL;
    ULONG DeviceInterfaceListLength = 0;
    WCHAR DevicePath[MAX_PATH];
    CM_Get_Device_Interface_List_Size(&DeviceInterfaceListLength, (LPGUID)&GUID_DEVINTERFACE_USBPer, NULL, CM_GET_DEVICE_INTERFACE_LIST_PRESENT);
    DeviceInterfaceList = (PWCHAR)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, DeviceInterfaceListLength * sizeof(WCHAR));
    CM_Get_Device_Interface_List((LPGUID)&GUID_DEVINTERFACE_USBPer, NULL, DeviceInterfaceList, DeviceInterfaceListLength, CM_GET_DEVICE_INTERFACE_LIST_PRESENT);
    StringCchCopy(DevicePath, _countof(DevicePath), DeviceInterfaceList);
    HeapFree(GetProcessHeap(), 0, DeviceInterfaceList);

    // Open a handle to the device.
    HANDLE DeviceHandle = CreateFile(DevicePath, GENERIC_WRITE | GENERIC_READ, FILE_SHARE_WRITE | FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL);
 
    // Initialize WinUsb against the device handle.
    WINUSB_INTERFACE_HANDLE WinusbHandle;
    WinUsb_Initialize(DeviceHandle, &WinusbHandle);

    // ... end example code from link above.

    // Tell WinUsb to use your ISO alt descriptor (1 in our case).
    USB_INTERFACE_DESCRIPTOR UsbAltInterfaceDescriptor;
    WinUsb_QueryInterfaceSettings(WinusbHandle, 1, &UsbAltInterfaceDescriptor);
    WinUsb_SetCurrentAlternateSetting(WinusbHandle, UsbAltInterfaceDescriptor.bAlternateSetting);

    // Get the bandwidth requirements from the peripheral.
    WINUSB_PIPE_INFORMATION pi;
    WinUsb_QueryPipe(WinusbHandle, 1, 0, &pi);
    WINUSB_PIPE_INFORMATION_EX pix;
    WinUsb_QueryPipeEx(WinusbHandle, 1, 0, &pix);

    // In our discussion, MaximumBytesPerInterval is the max bytes per microframe.
    auto TransferSize = XFER_SZ * pix.MaximumBytesPerInterval * (8 / pix.Interval);

    WINUSB_ISOCH_BUFFER_HANDLE handle;
    PUCHAR data = (PUCHAR)malloc(NUM_XFERS * TransferSize);
    WinUsb_RegisterIsochBuffer(WinusbHandle, pi.PipeId, data, NUM_XFERS * TransferSize, &handle);

    // In order to later wait for I/O to finish, need to associate event with completion.
    OVERLAPPED completions[NUM_XFERS];
    for (auto i = 0; i < NUM_XFERS; i++) {
        ZeroMemory(&completions[i], sizeof(completions[i]));
        completions[i].hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    }

    // Total number of packets across all transfers; each transfer uses PacketCount / NUM_XFERS of them (in our discussion, # microframes per transfer).
    auto PacketCount = (NUM_XFERS * TransferSize) / pix.MaximumBytesPerInterval;
    PUSBD_ISO_PACKET_DESCRIPTOR descr = (PUSBD_ISO_PACKET_DESCRIPTOR)malloc(sizeof(USBD_ISO_PACKET_DESCRIPTOR) * PacketCount);
    for (auto i = 0; i < PacketCount; i++) {
        ZeroMemory(&descr[i], sizeof(USBD_ISO_PACKET_DESCRIPTOR));
    }

    // Buffer to hold contiguous data.
    auto bufContiguousBytes = NUM_XFERS * TransferSize;
    PUCHAR bufContigous = (PUCHAR)malloc(NUM_XFERS * TransferSize);

    auto firstio = true;

    // Prime all I/Os for reading.
    for (auto i = 0; i < NUM_XFERS; i++) {
        WinUsb_ReadIsochPipeAsap(handle, i * TransferSize, TransferSize, !firstio, PacketCount / NUM_XFERS, &descr[(PacketCount / NUM_XFERS)*i], &completions[i]);
        firstio = false;
    }

    // As they complete, extract the data and re-queue.
    while (true) {
        auto offsetInDstBuffer = 0;
        for (auto i = 0; i < NUM_XFERS; i++) {
            ULONG NumBytes;
            WinUsb_GetOverlappedResult(WinusbHandle, &completions[i], &NumBytes, TRUE);

            auto startDescriptor = (PacketCount / NUM_XFERS)*i;
            auto endDescriptor = startDescriptor + PacketCount / NUM_XFERS;

            for (auto j = startDescriptor; j < endDescriptor; j++) {
                auto offsetInSrcBuffer = (i * TransferSize) + descr[j].Offset;
                memcpy(bufContigous + offsetInDstBuffer, data + offsetInSrcBuffer, descr[j].Length);
                offsetInDstBuffer += descr[j].Length;
            }

            WinUsb_ReadIsochPipeAsap(handle, i * TransferSize, TransferSize, true, PacketCount / NUM_XFERS, &descr[(PacketCount / NUM_XFERS)*i], &completions[i]);
        }

        bufContiguousBytes = offsetInDstBuffer;
        // Process data in bufContigous.
    }
}
