[SOLVED] Debugging a slow connection between local devices in only one direction
[SOLVED]
... well, this is in many ways very unsatisfying, because I have no idea why this worked, but I seem to have fixed it.
Server A has two Ethernet ports, an Intel I219V and a Killer E3100. Several months ago, when trying to debug sporadic btrfs errors (I had my RAM installed incorrectly!), I had disabled some unused devices in BIOS, including the Killer Ethernet port.
Since I had no other ideas, and it seemed like this was somehow specific to this server, I just re-enabled the Killer port and switched the Ethernet cable to that port. I'm now getting 300 Mb/s transfers from my wireless devices to my server, exactly as expected.
I'm gonna like... go for a walk or something. Thank you so much to everyone who helped me rule out all of the very many things this could have been! I love this place, you all are so kind and supportive.
Original:
I'm trying to debug a perplexing networking situation, and I could use some guidance if anyone has any.
Here's my setup:
- UniFi Security Gateway
- UniFi Switch Lite
- Two UAPs
- Two servers, A and B, connected to the USW-Lite with GbE
- Many wireless devices, connected to the UAPs
Here's what I'm experiencing:
- Network transfers from the wireless devices to server A (as measured by iperf3 tests) are very slow. Consistently between 10 and 20 Mb/s.
- Network transfers from server A to all devices are expected speeds. 900-1000 Mb/s to server B, 350-ish Mb/s to wireless devices.
- Network transfers between server B and all devices (in both directions!) are expected speeds.
- Network transfers from the USG to server A also seem slow, which is odd. Only about 60 Mb/s.
- Network transfers from the USG to server B and the wireless devices are about 300 Mb/s.
So, specifically network transfers from any wireless device to server A are slow, and no other connections have any issues that I can see.
Some potentially relevant details:
- Server A is running Unraid
- Server B is running Ubuntu
- Wireless devices include a Fedora laptop, an iPhone, and a Macbook Pro
- UniFi configuration is pretty straightforward. I have a few ports forwarded, a guest WiFi network (that none of these devices are on), a single default VLAN, and two simple "Allow LAN" firewall rules for Wireguard on the USG. No other firewall or routing config that I'm aware of.
If anyone has any thoughts at all on how to continue debugging, I would be immensely grateful! I suppose the next step would be to try to determine whether it's the networking equipment or the server itself that is responsible for the throttling, but I'm not sure how best to do that.
Things I would try:
Try to put things "back like they were" after every test so that you are reducing the number of variables tested each time. Also, make sure the problem still exists before starting the next test. Nothing is more frustrating than accidentally fixing it in the previous test (for example, an intermittent port failure or cable problem might be fixed after you swap the cables and put them back), and then assuming the current test fixed it because the problem is now gone.
There is a lot of overlap in these tests. But my experience is that this can be useful -- if the overlapping tests give the same (expected) results, you have confirmation of your mental model. If you get inconsistent results (like server A to server B is fast, but server A to the laptop is slow when connected to the switch), that could help you reframe your assumptions about the network.
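One way to keep overlapping tests straight is to write the results down as a small table and let a script point out which ones break the pattern. The sketch below is only an illustration: the throughput figures are the approximate ones quoted elsewhere in this thread, while the `EXPECTED` values and the 50% threshold are assumptions, not measurements.

```python
# Approximate results quoted in this thread, in Mb/s:
# (source, destination, link type of the client end, throughput).
RESULTS = [
    ("laptop-wifi", "server-A", "wifi", 11),
    ("laptop-ethernet", "server-A", "wired", 940),
    ("server-B", "server-A", "wired", 900),
    ("server-A", "laptop-wifi", "wifi", 350),
]

# Assumed "healthy" throughput per link type (hypothetical round numbers).
EXPECTED = {"wifi": 300, "wired": 900}


def find_outliers(results, expected, threshold=0.5):
    """Return the tests whose throughput is below threshold * expected."""
    return [
        (src, dst, mbps)
        for src, dst, link, mbps in results
        if mbps < threshold * expected[link]
    ]


for src, dst, mbps in find_outliers(RESULTS, EXPECTED):
    print(f"slow: {src} -> {dst} at {mbps} Mb/s")
```

With these numbers only the WiFi-to-server-A test is flagged, which matches the symptom described in the original post.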
Good luck!
Thanks so much. Some updates:
Gave this a shot, same issues no matter which port I'm using. It basically can't be the cable, because server B can write to server A with no issue!
Sorry for the lack of clarity: yes, I can write from server B to server A with no issue (900 Mb/s). It does seem somehow specific to wireless devices.
This is a good thought; I don't know if I have the necessary dongles for this (just tried daisy chaining a USB-C -> Thunderbolt and Thunderbolt -> Ethernet together to no avail) but I will keep hunting around and give it a shot if I can.
It doesn't matter which UAP the wireless devices are connected to, same issue.
It doesn't seem to matter. Also, I get exactly the expected speeds when wirelessly connecting to server B! So I don't think it's anything about the wireless connection itself.
I think this is ruled out by successful transfers from server B to server A
I really appreciate all of the tips, going to keep trying to find a way to connect the laptops with Ethernet!
Update:
Got my Macbook connected with Ethernet. It can transfer perfectly fine; 940 Mb/s on average. As soon as it's back on WiFi, down to 11 Mb/s!
Any chance you’re handy with wireshark/tcpdump? Seems wise to rule out any packet level shenanigans (eg sizes get negotiated lower between server A and a wireless device vs server A and a wired device).
Ah... I'm not, really, but if you have any tips I would definitely give it a shot.
One relevant thing (@first-must-burn in case this triggers anything for you) is that I just noticed that there are tons of packets dropped on the slow transfers. Here's a sample iperf run:
If there's any way to determine why or where those packets get lost, that seems like it'd be very relevant??
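For comparing runs, it may help to have iperf3 emit JSON and pull out just the throughput and retransmit count. This is a minimal sketch, assuming iperf3 is installed on the client and a TCP test is being run; the hostname in the comment is a placeholder. It only counts retransmits, it does not show where the packets were lost.

```python
import json
import subprocess


def summarize(report):
    """Pull throughput and retransmit count out of an iperf3 JSON report."""
    sent = report["end"]["sum_sent"]
    received = report["end"]["sum_received"]
    return {
        "sent_mbps": sent["bits_per_second"] / 1e6,
        "received_mbps": received["bits_per_second"] / 1e6,
        # "retransmits" is not reported on every platform, so it may be None.
        "retransmits": sent.get("retransmits"),
    }


def run_iperf3(server, seconds=10):
    """Run an iperf3 client test against `server` and return the summary."""
    out = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize(json.loads(out))


# Example (placeholder hostname):
#   print(run_iperf3("server-a.example"))
```

Running it once against server A and once against server B from the same wireless client gives two summaries that can be compared side by side.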
If you SSH into the access points you can run a packet capture on the AP itself with tcpdump and see what is coming in and going out. I have done this before but would have to remember the right command line arguments.
Do you see packets getting lost if you just run a ping for a while or only when the connection is loaded?
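A rough way to answer that is to run ping once while the link is idle and once during an iperf test, then compare the loss percentage from each summary line. The sketch below assumes a ping that accepts `-c` and prints an "N% packet loss" summary (the common Linux and macOS behavior); the host argument is a placeholder.

```python
import re
import subprocess

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output):
    """Return the packet loss percentage from ping's summary line, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_loss(host, count=100):
    """Ping `host` `count` times and return the reported loss percentage."""
    # No check=True: ping exits non-zero when packets are lost.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    )
    return packet_loss_percent(result.stdout)


# Example (placeholder address):
#   print(ping_loss("192.168.1.10"))
```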
Hm, ok. I SSH'ed into the AP and ran:
tcpdump src <laptop> and dst <server A>
Then I ran an iperf test (basically the same results as above). For comparison, I did the same with server B. I don't really see any distinction between the two outputs, but I'm also not totally sure what I should be looking for.
Another thing I did notice is that the "Congestion window" for server A is tiny compared to server B. On the order of 5 KBytes or so, compared to 1.5 MBytes. It's not clear to me why that would be, or whether it's relevant. When I run an iperf from server B to server A, the bitrate is about 950 Mb/s, and the Cwnd is like 340 KBytes.
Sorry, I'm not sure I understand ... may I ask if you've run an iperf test between server A -> AP, and server B -> AP? The congestion window roughly correlates with what TCP heuristically figured your network's throughput is, so narrowing it down to an issue between the server and the access point at least rules out the clients as a problem.
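To put rough numbers on that: a sender can have at most one congestion window of data in flight per round trip, so throughput is bounded by about cwnd / RTT. The sketch below plugs in the two window sizes reported above; the 3 ms round-trip time is an assumed value for a WiFi hop on a LAN, not something measured in this thread.

```python
def window_limited_mbps(cwnd_bytes, rtt_seconds):
    """Approximate TCP throughput ceiling (Mb/s) imposed by the congestion window."""
    return cwnd_bytes * 8 / rtt_seconds / 1e6


ASSUMED_RTT = 0.003  # 3 ms, hypothetical; measure with ping for a real figure

# ~5 KBytes window (the slow transfers to server A)
print(round(window_limited_mbps(5 * 1024, ASSUMED_RTT), 1))            # 13.7
# ~1.5 MBytes window (the healthy transfers to server B)
print(round(window_limited_mbps(1.5 * 1024 * 1024, ASSUMED_RTT), 1))   # 4194.3
```

Under that assumption, a 5 KByte window caps out in the same 10 to 20 Mb/s range as the slow transfers, while the 1.5 MByte window is nowhere near being the bottleneck. The small window is most likely a symptom of the packet loss rather than its cause, since TCP shrinks the window whenever it detects loss.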
If you indeed have an issue between server A <-> AP, and no issue between server B <-> AP, then it'd be worth digging into why packets are being dropped. Digging into server A would be useful in that case, since it might be e.g. setting a lower MTU than is advisable, there could be a driver bug for your NIC which is fumbling something, etc. etc.
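As a starting point for digging into server A, Linux exposes each interface's MTU and error/drop counters under `/sys/class/net`. The sketch below reads them, assuming a Linux host (Unraid is Linux-based); the interface name `eth0` in the usage comment is a guess, and not every driver populates every counter.

```python
from pathlib import Path

COUNTERS = ("rx_dropped", "rx_errors", "rx_crc_errors", "tx_dropped", "tx_errors")


def read_nic_stats(iface, sysfs_root="/sys/class/net"):
    """Read the MTU and error/drop counters Linux exposes for one interface."""
    base = Path(sysfs_root) / iface
    stats = {"mtu": int((base / "mtu").read_text())}
    for name in COUNTERS:
        path = base / "statistics" / name
        if path.exists():
            stats[name] = int(path.read_text())
    return stats


# Example (interface name is an assumption):
#   before = read_nic_stats("eth0")
#   ... run a slow transfer ...
#   after = read_nic_stats("eth0")
#   print({k: after[k] - before[k] for k in after if k != "mtu"})
```

Counters that climb only during the slow transfers would point at the NIC or its driver on server A rather than at the access points.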
I think you're like... right on the money with a driver bug for my NIC, or something in that genre. Literally minutes before you posted this, I switched to the primary NIC for my motherboard, and that fixed it! No idea what the heck was going on, but I think I'm ok with this outcome hahaha
Glad to hear it :) ghosts in the system can be such a bear to deal with.
Thanks for keeping us all updated so well. Glad you got a solution!
What type of storage media is being used on Server A versus Server B?
Could server A have some media that might be limiting write speeds? Or possibly being limited by an oversaturated SATA? Seems odd to me, but maybe worth an investigation. Newer motherboards can cause this with their faster m.2 sockets eating up all the resources from some SATA ports.
The fact you can transfer out at 900+ Mbps is odd though...food for thought?
This occurs when running iperf3 tests and scp'ing to /dev/null, too, so it's not related to the storage medium (that is a good thought though, and I also spent a few hours ruling that out initially!)
Oh ok. I'm no expert in this field, but had a similar issue when diagnosing my mother in law's network for her! Hope you get a solution.
Some old HDD had slow write speed and was killing performance.
My (not confident) guess would be a bug in the access points. You could try a different version of the firmware. Are they on the latest firmware? Have you seen the issue on other firmware versions?
You can SSH into the access points themselves and may be able to see errors in system logs or if they are suffering CPU or memory issues when a transfer from wireless devices to server A occurs.
You could check the WiFi status on the laptops to see if they are connected at the full expected link speed. I'm not sure how to do that on Fedora or Mac, and it's probably not the cause, since transfers in the other direction work fine.
Another thought: are all devices definitely on the same IP subnet and not going through any kind of firewall in the gateway?
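The subnet question can be checked mechanically from each device's address and prefix length. This is a minimal sketch using Python's standard `ipaddress` module; the addresses in the comments are made-up examples, not the ones from this network.

```python
import ipaddress


def same_subnet(*interfaces):
    """True if every address/prefix given falls in the same IP network."""
    networks = {ipaddress.ip_interface(i).network for i in interfaces}
    return len(networks) == 1


# Hypothetical addresses: two hosts on the main LAN, one on a guest network.
print(same_subnet("192.168.1.10/24", "192.168.1.57/24"))   # True
print(same_subnet("192.168.1.10/24", "192.168.2.20/24"))   # False
```

If the check comes back False for a pair of devices, traffic between them is being routed through the gateway instead of switched directly.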
Thank you for hopping in!
This is a longstanding problem, and the firmware was just updated like five days ago, so it would have to be a longstanding firmware bug (which I suppose is possible!). It would be weird if the bug only affected transfers to one specific device, but I suppose it could be a confluence of things.
The laptops can upload files to other wired and wireless devices on the network at the expected speeds.
All of the devices are on the same IP subnet for sure. I'm going to try to disable the guest network just in case, but none of the devices are on that.
It seems like the issue must be a confluence of factors, whatever it is. One reason for suspecting the APs is that I spent years trying to troubleshoot an issue with UniFi APs intermittently not allowing connections from Apple devices. I never managed to pin it down, but I am pretty sure it was a bug with UAP-PRO APs sometimes getting into a problematic state.