
:imagesdir: png
:source-highlighter: rouge
:toc:
:toclevels: 5


# Some lwIP Notes

lwIP provides the network interface of YAPicoprobe.  This avoids
an ever-growing number of additional CDC COM ports.

Currently a Segger SysView server is listening on port 19111.
But I guess there is more to come.


## Pitfalls

lwIP had some traps waiting for me.  Despite reading the official
https://www.nongnu.org/lwip/2_1_x/pitfalls.html[Common Pitfalls]
I fell into some of them.


### MAC

Do not use a random MAC!  At least not for the first byte.
I was "lucky" and chose one with an odd first byte.  Unfortunately
bit 0 of the first byte marks a group (multicast) address.  At least Linux rejects such MAC
addresses with a more or less useless error message in the syslog.
It took a while until I found out that the MAC address was the reason
for the missing communication. +
Finally I used 0xfe as the first byte; the remaining five
bytes of the MAC address are copied from the last bytes of the Pico's serial number
(which actually is the serial number of the external flash).
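
The rule above can be sketched in a few lines of C.  This is only an illustration, not the code used in the probe: the helper names `mac_from_serial()` and `mac_is_unicast()` are hypothetical and the serial number is passed in as a plain byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN        6
#define MAC_GROUP_BIT  0x01u   /* bit 0 of first byte: 1 = group (multicast) address */

/* Hypothetical helper: first byte fixed to 0xfe, remaining five bytes
 * are taken from the end of the serial number. */
static void mac_from_serial(const uint8_t *serial, size_t serial_len, uint8_t mac[MAC_LEN])
{
    assert(serial_len >= MAC_LEN - 1);
    mac[0] = 0xfe;   /* bit 0 clear: unicast, bit 1 set: locally administered */
    memcpy(&mac[1], &serial[serial_len - (MAC_LEN - 1)], MAC_LEN - 1);
}

/* Returns non-zero if the address may be used as an interface address. */
static int mac_is_unicast(const uint8_t mac[MAC_LEN])
{
    return (mac[0] & MAC_GROUP_BIT) == 0;
}
```

A check like `mac_is_unicast()` at startup would have caught my odd first byte immediately.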

### OS Mode

*Really* take seriously what the lwIP documentation says about the "raw" APIs
and about calling them only from the TCPIP thread. +
Use `tcpip_callback_with_block(<func>, <void-arg>, 0)` for
non-blocking invocations of a function.  This also saves you
from creating extra simple threads for communication tasks, like
in the TinyUSB/lwIP glue code.  That was my first idea and it took
a while until I found out how bad that idea was. +
The same was true for the thread(!) which stuffed data for SysView into
its server:  bad idea! +
The effects of wrong API handling were random crashes or connection
disruptions.
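
The pattern behind `tcpip_callback_with_block(<func>, <void-arg>, 0)` is a deferred call: any context may post a function, but only the owning thread executes it.  The following is a toy model of that pattern in plain C, not lwIP code.  All names (`post_call()`, `run_pending()`, `add_one()`) are made up for the illustration, and the locking a real multi-threaded queue needs is left out.

```c
#include <stddef.h>

#define QUEUE_LEN 4

typedef void (*callback_fn)(void *arg);

typedef struct {
    callback_fn fn;
    void *arg;
} call_t;

static call_t queue[QUEUE_LEN];
static size_t q_head, q_count;

/* Non-blocking post, corresponds to the "block == 0" case:
 * returns 0 on success, -1 if the queue is full (caller retries later). */
static int post_call(callback_fn fn, void *arg)
{
    if (q_count == QUEUE_LEN) {
        return -1;
    }
    queue[(q_head + q_count) % QUEUE_LEN] = (call_t){ fn, arg };
    q_count++;
    return 0;
}

/* Executed only by the owning thread (the TCPIP thread in the lwIP case).
 * Returns the number of calls that were executed. */
static size_t run_pending(void)
{
    size_t n = 0;
    while (q_count > 0) {
        call_t c = queue[q_head];
        q_head = (q_head + 1) % QUEUE_LEN;
        q_count--;
        c.fn(c.arg);
        n++;
    }
    return n;
}

/* Example work item: increments the int that arg points to. */
static void add_one(void *arg)
{
    (*(int *)arg)++;
}
```

The important point is that `add_one()` never runs in the context of the poster, so it may touch the state of the owning thread without further protection.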


### RNDIS/ECM/NCM

To be honest, I'm very confused about the system behaviour of RNDIS/ECM/NCM.
Sometimes the host gets "disconnected" because the DNS entries in `/etc/resolv.conf`
are changed, sometimes not.  Sometimes the probe needs `dhclient` to get
the TCP/IP connection, sometimes not.  Sometimes the probe has an IPv6 address, sometimes
not.  And this all just on my Linux host.  Interoperation with Windows
makes things even worse. +
And `/etc/network/interfaces` generates error
messages even when the device has `allow-hotplug`.

* *RNDIS*: this was my former favorite, because it is supported by all
  relevant OSs.  Also throughput seemed to be good.
  Unfortunately RNDIS seems to manipulate routing in a way that the
  default route on my Linux wants to go through the probe.  Not
  really what I want...

[NOTE]
====
RNDIS on Win10 works only if RNDIS is the only USB class selected on the probe.
So it is possible to create a special probe firmware which provides only features
like SystemView, but you cannot have a probe which does both SystemView and CMSIS-DAP. +
This is not a fault of lwIP; it is probably a bug in the Win10 driver.
====

* *ECM*: works well, packets are transferred continuously, throughput
  also seems to be ok.  So this is the way to go. +
  Unfortunately there is no driver integrated into Win10, which means
  extra trouble.  And yes, the trouble is real: I could not find a driver
  for Win10.

* *NCM*: is said to be the best choice.  And in fact it is.
  At least after the creation of `ncm_device_simple.c`, a driver which is a
  stripped down version of `ncm_device.c`, which turned out to be very buggy. +
  Now throughput under Linux and Windows is ok.  Operation with SystemView
  works without glitches, but `iperf` tests sometimes crash the probe.
  So consider this driver as beta and work in progress.


## Performance

To measure performance, `iperf` is used (which implies that `OPT_NET_IPERF_SERVER`
must be set at build time).  A good command line for measurements:

  iperf -c 192.168.14.1 -e -i 1 -l 1024

## Testing

Good test cases are the following command lines:

  for MSS in 90 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1459 1460 1500; do
    iperf -c 192.168.14.1 -e -i 1 -l 1024 -M $MSS
    sleep 10
  done

  for LEN in 40000 16000 1400 1024 800 200 80 22 8 2 1; do
    for MSS in 90 93 100 150 200 255 256 300 400 500 511 512 600 700 800 900 1000 1100 1200 1300 1400 1450 1459 1460 1500; do
      iperf -c 192.168.14.1 -e -i 1 -l $LEN -M $MSS
      sleep 2
    done
  done

Monitor performance/errors with Wireshark.


## Some words about...

### net_glue.c

I'm really trying hard to switch context between lwIP and TinyUSB correctly.  This leads
to delayed call chains and makes the code neither nice nor
very maintainable.


### NCM

The TinyUSB NCM driver implementation is more or less buggy, so I'm doing my best
to implement a driver of my own.

Current work consists of debug versions of `ncm_device` and an almost
working `ncm_device_simple`.

Link:

* link:extern/NCM10-20101124-track.pdf[NCM Specification]


#### ncm_device_simple.c

`ncm_device_simple.c` is actually a mixture of `ecm_rndis_device.c` and `ncm_device.c`.
From `ecm_rndis_device.c` the structure has been inherited and from `ncm_device.c` the
functionality. +
The driver can be considered work in progress, because in conjunction with `iperf`
crashes sometimes happen.  But for operation with SystemView quality seems to be good enough.

WARNING: There must be a nasty bug in `ncm_device_simple`.  It shows on startup,
because startup time with `ncm_device_simple` is much longer compared to ECM/RNDIS and even
`ncm_device`.


#### possible bugs in ncm_device.c

This is more or less obsoleted by `ncm_device_simple.c`.  But as a short summary: the original
driver is very buggy.  Perhaps it is working in certain scenarios, but for sure not together with
SystemView.

* not sure, but perhaps it is best to call all functions within ncm_device in the FreeRTOS
  context of TinyUSB
* `wNtbOutMaxDatagrams` must be set to 1 [2023-06-27]
** iperf runs then
** Systemview still has problems
** `wNtbOutMaxDatagrams == 0` generates a lot of retries with iperf
* I guess that the *major problem* lies within `handle_incoming_datagram()`, because it changes values
  on an incoming packet although `tud_network_recv_renew()` is still handling the old one
* is multicore a problem!? (2023-07-14: no!)  I have seen retries with multicore even with
  `wNtbOutMaxDatagrams = 1`
* I think it is assumed that TinyUSB and lwIP are running in the same task (but in my scenario they don't)
* if removing debug messages, then the receive path seems to work better, which
  indicates a race condition somewhere
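
For orientation: `wNtbOutMaxDatagrams` is the last field of the NTB parameter structure which the device returns on a `GET_NTB_PARAMETERS` request.  The sketch below shows the 28 byte layout as I read it from the NCM specification.  The type and function names are hypothetical, the divisor/alignment values are just plausible examples and not necessarily those of the probe, `__attribute__((packed))` is GCC/Clang specific, and a little-endian CPU is assumed.

```c
#include <stdint.h>

/* NTB parameter structure (NCM 1.0), all fields little-endian on the wire. */
typedef struct __attribute__((packed)) {
    uint16_t wLength;                  /* size of this structure: 28 */
    uint16_t bmNtbFormatsSupported;    /* bit 0: 16-bit NTB, bit 1: 32-bit NTB */
    uint32_t dwNtbInMaxSize;           /* device -> host */
    uint16_t wNdpInDivisor;
    uint16_t wNdpInPayloadRemainder;
    uint16_t wNdpInAlignment;
    uint16_t wReserved;
    uint32_t dwNtbOutMaxSize;          /* host -> device */
    uint16_t wNdpOutDivisor;
    uint16_t wNdpOutPayloadRemainder;
    uint16_t wNdpOutAlignment;
    uint16_t wNtbOutMaxDatagrams;      /* 0 = no limit imposed by the device */
} ntb_parameters_t;

_Static_assert(sizeof(ntb_parameters_t) == 28, "NTB parameter structure must be 28 bytes");

/* Hypothetical helper: parameters which allow the host only one datagram per OUT NTB. */
static ntb_parameters_t ntb_parameters_single_datagram(uint32_t ntb_max_size)
{
    ntb_parameters_t p = {
        .wLength                 = sizeof(ntb_parameters_t),
        .bmNtbFormatsSupported   = 0x01,   /* 16-bit NTB only */
        .dwNtbInMaxSize          = ntb_max_size,
        .wNdpInDivisor           = 4,
        .wNdpInPayloadRemainder  = 0,
        .wNdpInAlignment         = 4,
        .wReserved               = 0,
        .dwNtbOutMaxSize         = ntb_max_size,
        .wNdpOutDivisor          = 4,
        .wNdpOutPayloadRemainder = 0,
        .wNdpOutAlignment        = 4,
        .wNtbOutMaxDatagrams     = 1,      /* the setting discussed above */
    };
    return p;
}
```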

There is an open issue in the TinyUSB repo for this: https://github.com/hathach/tinyusb/issues/2068


## TinyUSB Driver API

### TinyUSB -> Driver

The following API is for calls from TinyUSB to the driver.
The calls are all initiated from within the TinyUSB stack.  Thus all are done in the context of TinyUSB.

[%autowidth]
[%header]
|===
|Name | Comment

|netd_init()
|Initialization of the driver on startup.  Called several times.

|netd_reset(rhport)
|Called several times on startup.  `rhport` seems to be zero in all calls.

|netd_open(rhport, *itf_desc, max_len)
|Connects the USB endpoints.  This is called once when the host driver
connects with the device.

|netd_control_xfer_cb(rhport, stage, *request)
|Called after `netd_open()`.  Only `stage==CONTROL_STAGE_SETUP` seems to be
of interest.

|netd_xfer_cb(rhport, ep_addr, result, xferred_bytes)
a|Depending on `ep_addr` the driver is told here, that a

* packet can be fetched from the stack for further processing within the driver
* packet transmission can be started
* notification packet should be transmitted (that's about communication parameters)
|===


### Driver -> TinyUSB

The driver has a whole bunch of available API calls.  The most important are:

[%autowidth]
[%header]
|===
|Name | Comment

|tud_control_status()
|Send STATUS (zero length) packet.  Called in `netd_control_xfer_cb()`.

|tud_control_xfer()
|Carry out Data and Status stage of control transfer.  Called in `netd_control_xfer_cb()`.

|usbd_edpt_busy()
|Check whether an endpoint is busy or ready for the next `usbd_edpt_xfer()`.

|usbd_edpt_open()
|Used during `netd_open()`.

|usbd_edpt_xfer()
|Submit a USB transfer.  For a receive operation, the specified buffer must be empty.
For a transmit operation, the buffer must not be touched until the corresponding
`netd_xfer_cb()` is received.

|usbd_open_edpt_pair()
|Used during `netd_open()`.
|===


### Glue Logic -> Driver

The following API is for calls from the glue logic to the driver.  The glue logic tries hard to issue
the calls in the TinyUSB context as well.  But I'm afraid this is not guaranteed (other developers).

[%autowidth]
[%header]
|===
|Name | Comment

|tud_network_can_xmit(size)
|Check whether the driver buffer has room for another datagram of the specified size.
If the driver tells the glue logic that there is enough space for the datagram, the glue logic
calls `tud_network_xmit()` in the next step. +
Not sure how recovery works if there is no space left.  So at the moment the glue logic
is responsible for retries.

|[.line-through]#tud_network_link_state_cb(state)#
|[.line-through]#seems to be obsolete.  No call found within the stack.  So do not implement.
PR at TinyUSB pending to remove the call.#

|tud_network_recv_renew()
|Called when the glue logic thinks that the driver should check whether it
can enable the receive logic.  The function has to check whether the USB channel
and receive buffer are available.  Another option (for NCM) is that there are
still buffered datagrams which can then be transferred via `tud_network_recv_cb()`.

|tud_network_xmit(*ref, arg)
|The glue logic requests a datagram transfer into the driver.  The driver may then
prepare for the actual copy operation from glue logic which is performed via
`tud_network_xmit_cb(dst, ref, arg)`.  Transmission does not have to take place.  E.g.
the NCM driver should be capable of buffering multiple datagrams into one
big NCM transfer block.
The call must succeed.
|===
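
The transmit handshake from the table can be modelled in a self-contained way.  This is a simplified sketch and not the probe code: the three `tud_network_*` functions carry the names from the table, but their bodies are stand-ins, the parameter types are my assumption, and `frame_t`, `drv_buf`, `drv_xfer_complete()` and `glue_try_send()` are hypothetical.  The driver here buffers exactly one datagram.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DRV_BUF_SIZE 1514            /* hypothetical: room for one Ethernet frame */

/* Stand-in driver state */
static uint8_t  drv_buf[DRV_BUF_SIZE];
static uint16_t drv_len;             /* 0 = driver buffer is free */

/* Hypothetical descriptor which the glue logic passes as 'ref' */
typedef struct {
    const uint8_t *data;
    uint16_t len;
} frame_t;

/* Glue logic: copy the datagram described by 'ref' to 'dst', return its length. */
uint16_t tud_network_xmit_cb(uint8_t *dst, void *ref, uint16_t arg)
{
    const frame_t *f = (const frame_t *)ref;
    (void)arg;
    memcpy(dst, f->data, f->len);
    return f->len;
}

/* Driver: is there room for another datagram of this size? */
bool tud_network_can_xmit(uint16_t size)
{
    return drv_len == 0 && size <= DRV_BUF_SIZE;
}

/* Driver: take over the datagram.  Must succeed after can_xmit() returned true. */
void tud_network_xmit(void *ref, uint16_t arg)
{
    drv_len = tud_network_xmit_cb(drv_buf, ref, arg);
    /* a real driver would start or schedule the USB transfer here */
}

/* Stand-in for the "transfer complete" event of the USB stack */
static void drv_xfer_complete(void)
{
    drv_len = 0;
}

/* Glue logic: one transmit attempt, the caller is responsible for retries. */
static bool glue_try_send(frame_t *f)
{
    if (!tud_network_can_xmit(f->len)) {
        return false;
    }
    tud_network_xmit(f, 0);
    return true;
}
```

A `false` from `glue_try_send()` is the "no space left" case mentioned in the table: the datagram stays with the glue logic, which has to try again later.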


### Driver -> Glue Logic

The glue logic also provides some API which has to be used by the driver.  The driver always
calls the glue logic in the TinyUSB context.

[%autowidth]
[%header]
|===
|Name | Comment

|tud_network_recv_cb(*src, size)
|Transfer a single datagram from the driver to the glue logic.  When the layer above the glue logic (lwIP)
has handled the datagram, it issues a `tud_network_recv_renew()` so the process of datagram reception
does not die.

|tud_network_xmit_cb(*dst, *ref, arg)
|The driver calls this function from `tud_network_xmit()` to perform the actual copy operation
of the datagram from glue logic into the driver.  The two parameters are not changed by
the driver, except that it specifies an additional copy destination.
|===
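
The receive direction works as a ping-pong between `tud_network_recv_cb()` and `tud_network_recv_renew()`.  The following toy model shows the idea under some assumptions: the `bool` return value of `tud_network_recv_cb()` (false = glue logic is busy) and the parameter types are assumed, the function bodies are stand-ins, and `drv_on_usb_rx()` and `glue_frame_done()` are hypothetical helpers representing "USB packet arrived" and "lwIP has consumed the frame".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_MAX 1514

/* Glue logic state: one frame at a time is handed to the layer above */
static uint8_t  glue_frame[FRAME_MAX];
static uint16_t glue_len;                 /* 0 = glue logic can accept a frame */

/* Driver state: one received datagram waiting for delivery */
static const uint8_t *drv_pending;
static uint16_t       drv_pending_len;
static unsigned       delivered;          /* number of datagrams handed over */

/* Glue logic: accept a single datagram, false if still busy with the previous one. */
bool tud_network_recv_cb(const uint8_t *src, uint16_t size)
{
    if (glue_len != 0 || size > FRAME_MAX) {
        return false;
    }
    memcpy(glue_frame, src, size);
    glue_len = size;
    return true;
}

/* Driver: try to hand over a pending datagram. */
void tud_network_recv_renew(void)
{
    if (drv_pending != NULL && tud_network_recv_cb(drv_pending, drv_pending_len)) {
        drv_pending = NULL;
        delivered++;
        /* a real driver would re-arm the OUT endpoint here */
    }
}

/* Stand-in: a datagram arrived via USB */
static void drv_on_usb_rx(const uint8_t *data, uint16_t len)
{
    drv_pending     = data;
    drv_pending_len = len;
    tud_network_recv_renew();
}

/* Stand-in: the layer above has handled the frame, keep reception alive */
static void glue_frame_done(void)
{
    glue_len = 0;
    tud_network_recv_renew();
}
```

If `glue_frame_done()` forgot the `tud_network_recv_renew()` call, a datagram which was refused once would wait forever.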


## The ncm_device_new driver & comparison

The following table holds a comparison between the several network drivers.  The first seven bars are
created with
`for MSS in 100 200 400 800 1200 1450 1500; do iperf -c 192.168.14.1 -e -i 1 -M $MSS -l 8192 -P 1; sleep 2; done`.
After that SystemView is started with almost maximum load (~85000 events/s, 325 KByte/s) and after a while
iperf is started again in parallel.

The images are recorded with Wireshark.  The red bars are "TCP Window Full" if not otherwise noted.

[%header]
|===
|Driver |Wireshark recording

|**ECM** +
The driver shows expected behavior, nothing actually special.
a|image::benchmark-ecm.png[ECM]

|**RNDIS** +
Again nothing special.
a|image::benchmark-rndis.png[RNDIS]

|**Original NCM** +
The original NCM driver is very buggy as said.  The red bars in the graph are not caused by "TCP Window Full".
Obscure messages in Wireshark show that the protocol is more or less garbage.
a|image::benchmark-ncm_device.png[Original NCM]

|**Simple NCM** +
The simple NCM driver behaves much better, but revealed some weaknesses in parallel operation (it also had
some overflows in SystemView without iperf in parallel).  It had some stability issues and it also had bugs
on startup of the probe, which was the actual reason to create `ncm_device_new`.
a|image::benchmark-ncm_device_simple.png[Simple NCM]

|**New NCM** +
`ncm_device_new` clearly shows the best behavior.  Throughput is best and during parallel operation there was
no packet loss when iperf used large packets.  Also no obscure Wireshark messages in parallel operation.
a|image::benchmark-ncm_device_new.png[New NCM]
|===

So obviously `ncm_device_new` is the clear winner: best in performance, best in functionality, best in compatibility.
What else is needed?


## Log

### 2023-05-12

* for unknown reasons the probe is in stutter mode even with ECM.  Don't know
  why, it worked smoothly before.  Transfer rate is bad
* systemview test program (NoOS) on the target:
** it already worked with around 10000 events/s, now the limit is ~3000
** if there is a SysTick ISR then SystemView is completely messed up.
   Checked that locking is included.  Seems to be so.

### 2023-06-26

* after some changes to `rtt_console.c`, `net_sysview.c` and `net_glue.c`
  ECM is working again as expected

### 2023-06-30

* for debugging purposes reimplemented `ncm_device_simple.c` which can hold only
  one Ethernet frame per NTB (NCM Transfer Block).  This unfortunately requires
  that the original `ncm_device.c` is disabled via `#if` at the top of the file.
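
An NTB which holds exactly one Ethernet frame is small enough to write down.  The sketch below builds such a 16-bit NTB (NTH16 header, one NDP16 with a single datagram entry plus terminator, then the frame) as I understand the NCM specification.  It is not the code of `ncm_device_simple.c`: the function name `ntb16_wrap_single()` is hypothetical and the datagram is simply placed directly after the NDP16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NTH16_LEN 12u   /* NTB header */
#define NDP16_LEN 16u   /* NDP header (8) + one datagram entry (4) + terminator (4) */

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Wrap one frame into an NTB.  Returns the NTB length or 0 if it does not fit. */
static size_t ntb16_wrap_single(uint8_t *out, size_t out_size, uint16_t sequence,
                                const uint8_t *frame, uint16_t frame_len)
{
    const size_t dgram_ofs = NTH16_LEN + NDP16_LEN;
    const size_t total     = dgram_ofs + frame_len;

    if (total > out_size || total > 0xffffu) {
        return 0;
    }

    /* NTH16 */
    memcpy(&out[0], "NCMH", 4);              /* dwSignature */
    put_le16(&out[4], NTH16_LEN);            /* wHeaderLength */
    put_le16(&out[6], sequence);             /* wSequence */
    put_le16(&out[8], (uint16_t)total);      /* wBlockLength */
    put_le16(&out[10], NTH16_LEN);           /* wNdpIndex: NDP follows the header */

    /* NDP16 */
    uint8_t *ndp = &out[NTH16_LEN];
    memcpy(&ndp[0], "NCM0", 4);              /* dwSignature, datagrams without CRC */
    put_le16(&ndp[4], NDP16_LEN);            /* wLength */
    put_le16(&ndp[6], 0);                    /* wNextNdpIndex: no further NDP */
    put_le16(&ndp[8], (uint16_t)dgram_ofs);  /* wDatagramIndex */
    put_le16(&ndp[10], frame_len);           /* wDatagramLength */
    put_le16(&ndp[12], 0);                   /* terminating entry */
    put_le16(&ndp[14], 0);

    memcpy(&out[dgram_ofs], frame, frame_len);
    return total;
}
```

With this layout every frame costs 28 bytes of overhead, which is the price for not having to manage several datagrams per transfer block.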

### 2023-07-14

* did some performance tuning with lwIP and TinyUSB
* stripped sources
* BUG: `ncm_device_simple` sometimes crashes with `iperf`

### 2023-08-11

* BUG: with `ncm_device_simple` startup time of the probe is much longer compared
  to ECM/RNDIS or even `ncm_device`.  By startup time I mean the time until there is
  something visible on the probe's debug output.  For ECM/RNDIS/ncm_device this is almost
  instant, with `ncm_device_simple` it takes ~10s! +
  -> reverted to `ncm_device` because SystemView runs without problems with it +
  -> solved with `ncm_device_new`

### 2023-08-16

* new driver: `ncm_device_new`
** works (better than `ncm_device_simple`), but
*** [x] problems, if `wNtbOutMaxDatagrams!=1` -> see comment
*** [x] iperf also shows problems if `-P` is > 1.  I guess this is an iperf problem, because iperf
        and SystemView are running in parallel without such errors
*** [ ] surprisingly `iperf` performance is much better with actual firmware.
        `cmake-create-debugEE` has just half performance
*** [x] but all these problems also exist with `ncm_device`.  Is it in the glue code?
        Possible, because the effect is also with ECM driver.  No, not in the glue code, because
        iperf and SystemView work in parallel
*** [x] packets/s is changing heavily, setting `wNtbOutMaxDatagrams==0` helped to prevent raising
        of packet rate (sometimes there are two datagrams in one NTB even with SystemView)
** how to continue?
*** [x] need a test case where `tud_network_can_xmit()` collects datagrams.  Currently
        there is always only one active xmit datagram, perhaps `iperf` with `-P 4` does it. +
        -> this all happens under load testing
*** [x] check if there is a problem in the glue code for datagram reception.  Glue buffer freed too soon?
        No, I doubt it.  But examples are few.

### 2023-08-18

* the new driver is ready.  Had some optimization loops, but now it seems to work pretty well