During the COVID-19 stay-at-home order, I've been blessed to be able to work from home. However, I faced what many people suffered when switching from working in the office to working at home: a poor VPN connection. In my case, my connection would hiccup every 5 minutes.
This problem had shown up before we were forced to stay at home, and I had worked on it with an excellent network administrator at my client. When I was tethered to my phone or on the office guest network, the VPN was rock-solid. It seemed to fail only on my home internet connection. This knowledge, combined with the stay-at-home order, gave me an excellent excuse to tinker with my home network.
The original network was a fairly typical setup:
My first suspect in this setup was the ISP Router. Looking through its settings, I saw that passing the public IP through to another device would bypass most of the network stack in the ISP Router. As I didn't want yet another all-in-one router sitting next to the ISP Router, I looked for a vendor that supported a modular setup that would be easy to expand and maintain. I ended up choosing Ubiquiti Networks because of their "pro-sumer" hardware and their unified dashboard, which I could host myself. A few days later, I brought home an EdgeRouter X (ER-X) and a UniFi AP Lite (AP) from the local Micro Center.
After a few hours of tinkering, I had the following setup:
With this in place, the VPN drops stopped occurring and the latency dropped by a few milliseconds. I was elated, but then I noticed another problem. While the VPN remained connected, random latency spikes would cause pages to time out and video calls to drop. Ping would initially report dropped packets until the latency dropped, at which point the missing return packets would all simultaneously appear. My initial thought was that the packets were leaving the network but not making it back until the latency dropped. Suspecting that the IP passthrough hadn't really solved the problem, I attempted to physically bypass the ISP Router with the ER-X.
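The ping symptom described above (a run of missing replies, then very late replies arriving together) is easier to spot when the output is summarized. Here is a minimal Python sketch, assuming the reply-line format printed by the stock Linux and macOS `ping` (`icmp_seq=N ... time=X ms`). The function name `summarize_ping` and the 200 ms threshold are my own illustrative choices, not something from my actual troubleshooting.

```python
import re

# Matches reply lines such as:
#   64 bytes from 192.0.2.1: icmp_seq=7 ttl=57 time=812.4 ms
# Timeout lines (which have no "time=" field) do not match.
REPLY_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def summarize_ping(lines, spike_ms=200.0):
    """Return (spikes, missing) for an iterable of ping output lines.

    spikes  -- sorted list of (icmp_seq, rtt_ms) with rtt_ms >= spike_ms
    missing -- sequence numbers between the first and last reply seen
               that never produced a reply line
    """
    rtts = {}
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            rtts[int(match.group(1))] = float(match.group(2))

    if not rtts:
        return [], []

    spikes = sorted((seq, rtt) for seq, rtt in rtts.items() if rtt >= spike_ms)
    missing = [seq for seq in range(min(rtts), max(rtts) + 1) if seq not in rtts]
    return spikes, missing
```

Feeding it the saved output of a long-running ping gives a list of slow replies and a list of sequence numbers that never came back, which makes it easier to line the spikes up against other logs.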
The ISP network will refuse service to unauthenticated devices attached to it. This meant that the ER-X needed to somehow authenticate itself in the same way that the ISP Router did. After some more research, I found three ways to accomplish this:
I picked the second option, because bridging meant a slower connection and extracting authentication keys seemed legally dubious. I changed the physical configuration again and began configuring the ER-X.
I was able to get eap_proxy started only after I found an issue on the eap_proxy GitHub project and downgraded the firmware on the ER-X to an older supported version. However, the auth packets from the ISP network still weren't making it to the ISP Router. After tweaking different settings and doing some more research, I found a guide on GitHub which solved my problem. To make sure that only the required settings were in place, I reset the ER-X to its factory settings and then walked through authenticating with the ISP network one last time.
With this final setup, I again tested my VPN connection...no dice. Armed with the additional details now in the Ubiquiti dashboard, my new suspect was the AP, as I saw a jump in Wi-Fi retries when I was working in my study. So I bought an Ethernet cable and ran it to the study to see if that would solve the problem...no dice. I replaced the cable between the ER-X and the FTTH connection...no dice.
I was rapidly running out of options. Mildly frustrated that my hardware fixes hadn't made any difference, I started tcpdump on the router and watched the VPN keepalive packets between the laptop and the VPN server. Then my understanding of the problem flipped. The laptop would pause in sending keepalives and then send a whole bunch at once, matching the latency spike. The laptop itself was the source of the problem. With this new insight, I set out to determine whether it was a hardware or software problem.
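The pause-then-burst pattern can also be picked out of a capture programmatically rather than by eye. Below is a minimal Python sketch, assuming you already have the send times of the keepalive packets as a sorted list of seconds (for example, taken from the timestamps tcpdump prints). The function name `find_stalls`, the expected interval, and both threshold factors are hypothetical values chosen for illustration; I watched the capture manually.

```python
def find_stalls(timestamps, interval_s, stall_factor=3.0, burst_factor=0.25):
    """Find pauses in a packet stream that are followed by a burst.

    timestamps   -- sorted packet times in seconds
    interval_s   -- the expected spacing between keepalives (an assumption
                    the caller supplies; it depends on the VPN configuration)
    stall_factor -- a gap of at least stall_factor * interval_s is a stall
    burst_factor -- packets spaced at most burst_factor * interval_s apart
                    right after a stall are counted as part of the burst

    Returns a list of (last_time_before_stall, stall_length_s, burst_size).
    """
    stalls = []
    i = 1
    while i < len(timestamps):
        gap = timestamps[i] - timestamps[i - 1]
        if gap >= stall_factor * interval_s:
            # The packet that ends the stall is the first one in the burst.
            burst = 1
            j = i + 1
            while (j < len(timestamps)
                   and timestamps[j] - timestamps[j - 1] <= burst_factor * interval_s):
                burst += 1
                j += 1
            stalls.append((timestamps[i - 1], gap, burst))
            i = j  # resume scanning after the burst
        else:
            i += 1
    return stalls
```

A healthy stream returns an empty list. A stream where the sender went quiet and then flushed several packets back-to-back returns one entry per stall, with the burst size showing how many packets were released together.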
To test whether it was a hardware problem, I set up a Linux VM on the laptop and connected to the VPN within the VM using openconnect. No problems. Perhaps the supplied AnyConnect client was the problem? I looked through the Homebrew repository and discovered that an openconnect client existed for macOS. I immediately installed and tested it. No problems.
Reaching this resolution took a couple of weeks, and it has held up for several weeks now. The open-source alternative to the AnyConnect client has been working just fine. At some point, I'll need to reinstall the AnyConnect client to see if there was some old driver or library that was causing the network problem. It's been very nice to be able to get my work done without my connection dropping all the time.