Friday, September 27, 2019

Fixing up KA9Q-unix, or "neck deep in 30 year old codebases.."

I'll preface this by saying - yes, I'm still neck deep in FreeBSD's wifi stack and 802.11ac support, but it turns out it's slow work to fix 15 year old locking related issues that worked fine on 11abg cards, kinda worked ok on 11n cards, and are terrible for these 11ac cards. I'll .. get there.

Anyhoo, I've finally been mucking around with AX.25 packet radio. I've been wanting to do this since I was a teenager and found out about its existence, but back in high school and .. well, until a few years ago really .. I didn't have my amateur radio licence. But, now I do, and I've done a bunch of other stuff with a bunch of other radios. The main stumbling block? All my devices are either Apple products or run FreeBSD - and none of them have useful AX.25 stacks. The main stacks of choice these days run on Linux, Windows or are a full hardware TNC.

So yes, I was avoiding hacking on AX.25 stuff because there wasn't a BSD compatible AX.25 stack. I'm 40 now, leave me be.

But! A few weeks ago I found that someone was still running a packet BBS out of San Francisco. And amazingly, his local node ran on FreeBSD! It turns out Jeremy (KK6JJJ) ported both an old copy of KA9Q and N0ARY-BBS to run on FreeBSD! Cool!

I grabbed my 2m radio (which is already cabled up for digital modes), compiled up his KA9Q port, figured out how to get it to speak to Direwolf, and .. ok. Well, it worked. Kinda.

Here's my config:

ax25 mycall CALLSIGN-1
ax25 version 2
ax25 maxframe 7
attach asy 127.0.0.1:8001 kissui tnc0 65535 256 1200

.. and it worked. But it wasn't fast. I mean, sure, it's 1200 bps data, but after digging I found some very bad stack behaviour on both KA9Q and N0ARY. So, off I went to learn about AX.25.

And holy hell, there are some amusing bugs. I'll list the big showstoppers first and then what I think needs to happen next.

Let's look at the stack behaviour first. So, when doing LAPB over AX.25, there's a bunch of frames with sequence numbers that go out, and then the receiver ACKs the sequence numbers four ways:
  • RR - "roger roger" - yes, I ack everything up to N-1
  • RNR - I ack everything up to N-1 but I'm full; please stop sending until I send something back to start transmission up again
  • REJ - I received an invalid or missing sequence number, ACK everything to N-1 and retransmit the rest please
  • I - this is a data frame which includes both the send and receive sequence numbers. Thus, transmitted data can implicitly ACK received data.
I'd see bursts like this:
  • N0ARY would send 7 frames
  • I'd receive them, and send individual RR's for each of them
  • N0ARY would then send another 7 frames
  • I'd receive a few, maybe I'd get a CRC error on one or miss something, and send a REJ, followed by getting what I wanted or not and maybe send an RR
  • N0ARY would then start sending many, many more copies of the same frame window, in a loop
  • I'd individually ACK/REJ each of these appropriately
  • .. and it'd stay like this until things eventually caught up.


So, two things were going wrong here.

Firstly - KA9Q didn't implement the T2 timer in the AX.25 v2.0 spec. T2 is an optional timer which a TNC can use to delay sending responses until it expires, allowing it to batch up sending responses instead of responding (eg RR'ing) each individual frame. Now, since the KISS TNC only sends data and not signaling up to the applications, all the applications can do is queue frames in response to other frames or fire off timers to do things. The KA9Q stack doesn't know that the air is busy receiving more data - only that it received a frame. So, T2 could be used to buffer sending status updates until it expires.

N0ARY-BBS implements T2 for RR/RNR responses, but not for REJ responses.

Then, both KA9Q and N0ARY-BBS don't delay sending LAPB frames upon status notifications. Thus, every RR, RNR and REJ that is received may trigger sending whatever is left in the transmit window. Importantly, receiving a REJ will clear the "unack" (unacknowledged) window and force retransmission of everything. If you get a couple of REJ's in a row then it'll try to send multiple sets of the same window out, over and over. If you get an RR and REJ and RR, it may send more data, then the whole window, then more data. It's crazy.

Finally, there's T1. T1 is the retransmisison timer. Now, transmitting a burst of 7 frames of full length at 1200 baud again takes around 2.2 seconds a frame, so it's around 15.4 seconds for the full burst. If T1 is less than that, then because there's no feedback about when the frames went out - only that you sent them to the TNC - you'll start to try retransmitting things. Now, luckily one can poll the other end using a RR poll frame to ask the other end to respond with its current sequence number - that way each end can re-establish what each others send/receive sequence numbers are. However, these can also be batched up - so whilst you're sending your frames, T1 fires generating another batch of RR's to poke the other side. This in itself isn't such a bad thing, but it does mean the receiver sees a big, long burst of frames followed by a handful of RR polls. Strictly speaking this isn't ideal - you're only supposed to send a single poll and then not poll until you get a response or another timeout.

So what have I done?

I'm doing what JNOS2 (another KA9Q port) is doing - I am using T2 for data transmission, RR, RNR and REJ transmission. It's not a pretty solution, but it at least stops the completely pointless retransmission of a lot of wasted data. I've patched both N0ARY and KA9Q to do this, so once KE6JJJ gets a chance to update his BBS and the N0ARY BBS I am hoping that the bad behaviour stops and the BBS becomes useful again for multiple people.

Ok, so what needs to happen?

Firstly, we've learnt a lot about networking since the 80s and 90s. AX.25 is kinda part TCP, part IP, so you'd think it should be fine as a timer based protocol. But alas no - it's slow, it's mostly half duplex, and overflowing transmit queues or resending data incorrectly has a real cost. It's better to look at it as a reliable wireless protocol like what 802.11 does, and /not/ as TCP/IP. 802.11 has timers, 802.11 has sequence numbers, and 802.11 tries to provide as reliable a stream as it can. But it doesn't guarantee traffic; if traffic takes too long it'll just time it out and let the upper layer (like TCP) handle the actual guarantees. Now, you kind want the AX.25 LAPB sessions to be as reliable as possible, but this gets to the second point.

You need to figure out how to be fair between sessions. The KA9Q stacks right now don't schedule packets based on any kind of fairness between LAPB or AX.25 queues. The LAPB code will try to transmit stuff based only on its local window and I guess they treat retransmits as something that signals they need to back off. That's all fine and dandy in theory but in practice at 1200 bps a 7 packet window at 256 bytes a packet is 7*2.2 seconds, or 15.4 seconds. So after 15.4 seconds if the remote side immediately ACKs and you then send another 7 packet burst, noone is going to really get a chance to come in and talk on the BBS.

So, this needs a couple things.

Firstly, just because you can transmit a maximum window of 7 doesn't mean you should. If you see the air being busy, maybe you want to back that off a bit to let others get in and talk. Yes, it does mean the channel is being used less efficiently in total for a single session, but now you're allowing other sessions to get airtime and actually interact. Bugs aside, I've managed to keep the N0ARY BBS tied up for minutes at a time squeezing me tens of kilobytes of article content. That's all fine and dandy, but I don't mind if it takes a little longer when there are other users trying to also do stuff.

Next, the scheduling for LAPB shouldn't just be timers kicking off packet generation into a queue, and then best effort transmission at the KISS TNC. If the AX.25 stack was told about the data link status transitions - ie, between idle, receiving, transmitting and such - then when the air was free the TNC could actually schedule which LAPB session(s) to service next. I've watched T1 (retransmission) and T2 kick over multiple times during someone else downloading data, and when the air is eventually busy an AX.25 node sends multiple copies of both I data payload and S status frames (RR, RNR, REJ, probes, etc.) It's insane. The only reason it's doing this is because it doesn't know the TNC is busy receiving or transmitting and thus those timers don't need to run and traffic doesn't need to be generated. This is how the MAC and PHY layers in 802.11 interoperate. The MAC doesn't queue lots of packets to be sent out when the PHY is ready - the MAC has the work there, and when the PHY signals the air is free and the contention window timer is expired, the MAC signals to get the air and sends its frame. It sends what it can in its given time window and then it stops.

This means that yes, the KISS TNC popularity is part of the reason AX.25 is inefficient these days. KISS TNCs make it easy to do AX.25 packets, but they make it super easy to do it inefficiently. The direwolf author wrote a paper on this where he compared these techniques to just using the AX.25 stack (and AX.25 2.2 features) which have knowledge of the direwolf physical/radio layer. If these hooks were made available over the KISS TNC interface - and honestly, they'd just be a two byte status notification saying that the TNC is in the { idle, receiving, decoding, transmitting } states - then AX.25 stacks could make much, much smarter decisions about what to transmit and when.

Finally - wow this whole thing needs per packet compression already. AX.25 version 2.2 introduces a way of negotiating parameters with remote TNCs for supported extensions and so one of my medium term KA9Q/N0ARY goals is to introduce enough version 2.2 support to negotiate SREJ (selective rejection/retransmission) and maybe the window size options, but primarily to add compression. I think SREJ + per packet compression would be the biggest benefits over 1200 and 9600 bps links.

If you're interested, software repositories are located below. I encourage people to contribute to the KE6JJJ work; I'm just forked off of it (github username erikarn) and I'll be pushing improvements there.

Oh, and these compile on FreeBSD. KA9Q and direwolf both compile and run on MacOSX but N0ARY-BBS doesn't yet do so. Yes, this does mean you can now do packet radio on FreeBSD and MacOSX.



Wednesday, June 19, 2019

wow, it's been a while

Well, it has been a while. I've been busy with a new job (Facebook/Oculus) which has taken some adjustment to get to. My new baby girl Alice is now 1 and a half years old and a redhead handful of gremlin energy. Nora (who is almost 5 now) is now inquisitive, playful and talkative. And daddy needs some sleep.

But, hey, I have been playing with radios and slowly getting back into FreeBSD wifi hackery. So hopefully I will write a few posts here over the next few months to catch up on what I've been doing.

Field Day 2019 is this weekend. I hope to bring Nora upstairs for a few hours to work some stations with me from my home setup. I'm hoping we can work a handful of modes on different bands during the day.