15 votes

[SOLVED] Debugging a slow connection between local devices in only one direction

[SOLVED]

... well, this is in many ways very unsatisfying, because I have no idea why this worked, but I seem to have fixed it.

Server A has two Ethernet ports, an Intel I219V and a Killer E3100. Several months ago, when trying to debug sporadic btrfs errors (I had my RAM installed incorrectly!), I had disabled some unused devices in BIOS, including the Killer Ethernet port.

Since I had no other ideas, and it seemed like this was somehow specific to this server, I just re-enabled the Killer port and switched the Ethernet cable to that port. I'm now getting 300 Mb/s transfers from my wireless devices to my server, exactly as expected.

I'm gonna like... go for a walk or something. Thank you so much to everyone who helped me rule out all of the very many things this could have been! I love this place, you all are so kind and supportive.

Original:

I'm trying to debug a perplexing networking situation, and I could use some guidance if anyone has any.

Here's my setup:

  • UniFi Security Gateway
  • UniFi Switch Lite
  • Two UAPs
  • Two servers, A and B, connected to the USW-Lite with GbE
  • Many wireless devices, connected to the UAPs

Here's what I'm experiencing:

  • Network transfers from the wireless devices to server A (as measured by iperf3 tests) are very slow. Consistently between 10 and 20 Mb/s.
  • Network transfers from server A to all devices are expected speeds. 900-1000 Mb/s to server B, 350-ish Mb/s to wireless devices.
  • Network transfers between server B and all devices (in both directions!) are expected speeds.
  • Network transfers from the USG to server A also seem slow, which is odd. Only about 60 Mb/s.
  • Network transfers from the USG to server B and the wireless devices are about 300 Mb/s.

So, specifically network transfers from any wireless device to server A are slow, and no other connections have any issues that I can see.
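For anyone repeating these measurements, here is a rough sketch of how the per-direction tests could be scripted. It assumes `iperf3` is installed on the client and `iperf3 -s` is already running on the server; the function names `summarize` and `run_iperf3` are made up for illustration. It relies on iperf3's `-J` (JSON output) and `-R` (reverse direction) flags.

```python
import json
import subprocess


def summarize(report: dict) -> dict:
    """Pull throughput and TCP retransmits out of a parsed `iperf3 -J` report."""
    end = report["end"]
    return {
        "sent_mbps": end["sum_sent"]["bits_per_second"] / 1e6,
        "received_mbps": end["sum_received"]["bits_per_second"] / 1e6,
        # Retransmit counts are only present for TCP tests on platforms that expose them.
        "retransmits": end["sum_sent"].get("retransmits"),
    }


def run_iperf3(server: str, port: int = 5201, reverse: bool = False, seconds: int = 10) -> dict:
    """Run one iperf3 client test against `server` and return its summary."""
    cmd = ["iperf3", "-c", server, "-p", str(port), "-t", str(seconds), "-J"]
    if reverse:
        cmd.append("-R")  # the server sends and the client receives
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return summarize(json.loads(result.stdout))
```

Calling `run_iperf3(host)` and then `run_iperf3(host, reverse=True)` from each client gives one number per direction, which makes a one-way slowdown like this one easy to spot.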

Some potentially relevant details:

  • Server A is running Unraid
  • Server B is running Ubuntu
  • Wireless devices include a Fedora laptop, an iPhone, and a MacBook Pro
  • UniFi configuration is pretty straightforward. I have a few ports forwarded, a guest WiFi network (that none of these devices are on), a single default VLAN, and two simple "Allow LAN" firewall rules for Wireguard on the USG. No other firewall or routing config that I'm aware of.

If anyone has any thoughts at all on how to continue debugging, I would be immensely grateful! I suppose the next step would be to try to determine whether it's the networking equipment or the server itself that is responsible for the throttling, but I'm not sure how best to do that.

16 comments

  1. [2]
    first-must-burn

    Things I would try:

    • Swap the cables and ports connecting the servers to the switch to see if it's the cable or the port on the switch
    • Test speeds between server B and server A (I wasn't sure if that was included in your "all devices" comment about server A). I would do that to see if the wireless is involved.
    • Take the laptops off the wireless network and plug them into the switch, then rerun your speed tests.
    • Take the UAPs offline one at a time to see if either of them contributes to the problem.
    • Try the wireless speed tests from different physical locations to see if interference is at play.
    • If you have a spare or USB Ethernet adapter, try using it on server A to see if it's a bad port on the server.

    Try to put things "back like they were" after every test so that you are reducing the number of variables tested each time. Also, make sure the problem still exists before starting the next test. Nothing is more frustrating than accidentally fixing the problem in a previous test (for example, an intermittent port failure or cable problem might be fixed after you swap the cables and put them back) and then assuming the current test fixed it because the problem is now gone.

    There is a lot of overlap in these tests, but in my experience that can be useful: if the overlapping tests give the same (expected) results, you have confirmation of your mental model. If you get inconsistent results (like server A to server B is fast, but server A to the laptop is slow when connected to the switch), that could help you reframe your assumptions about the network.

    Good luck!

    7 votes
    1. smores

      Thanks so much. Some updates:

      • Swap the cables and ports connecting the servers to the switch to see if it's the cable or the port on the switch

      Gave this a shot, same issues no matter which port I'm using. It basically can't be the cable, because server B can write to server A with no issue!

      • Test speeds between server B and server A (I wasn't sure if that was included in your "all devices" comment about server A). I would do that to see if the wireless is involved.

      Sorry for the lack of clarity: yes, I can write from server B to server A with no issue (900 Mb/s). It does seem somehow specific to wireless devices.

      • Take the laptops off the wireless network and plug them into the switch, then rerun your speed tests.

      This is a good thought; I don't know if I have the necessary dongles for this (just tried daisy chaining a USB-C -> Thunderbolt and Thunderbolt -> Ethernet together to no avail) but I will keep hunting around and give it a shot if I can.

      • Take the UAPs offline one at a time to see if either of them contributes to the problem.

      It doesn't matter which UAP the wireless devices are connected to, same issue.

      • Try the wireless speed tests from different physical locations to see if interference is at play.

      It doesn't seem to matter. Also, I get exactly the expected speeds when wirelessly connecting to server B! So I don't think it's anything about the wireless connection itself.

      • if you have a spare or USB Ethernet adapter, try using it on server A to see if it's a bad port on the server.

      I think this is ruled out by successful transfers from server B to server A

      I really appreciate all of the tips, going to keep trying to find a way to connect the laptops with Ethernet!

      Update:

      Got my MacBook connected with Ethernet. It can transfer perfectly fine; 940 Mb/s on average. As soon as it's back on WiFi, down to 11 Mb/s!

      2 votes
  2. [8]
    kacey

    Any chance you’re handy with wireshark/tcpdump? Seems wise to rule out any packet level shenanigans (eg sizes get negotiated lower between server A and a wireless device vs server A and a wired device).

    3 votes
    1. [7]
      smores

      Ah... I'm not, really, but if you have any tips I would definitely give it a shot.

      One relevant thing (@first-must-burn in case this triggers anything for you) is that I just noticed that there are tons of packets dropped on the slow transfers. Here's a sample iperf run:

      Connecting to host 192.168.1.73, port 9869
      [  5] local 192.168.1.146 port 53620 connected to 192.168.1.73 port 9869
      [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
      [  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec  124   4.24 KBytes       
      [  5]   1.00-2.00   sec  1.50 MBytes  12.6 Mbits/sec  134   1.41 KBytes       
      [  5]   2.00-3.00   sec  1.62 MBytes  13.6 Mbits/sec  126   5.66 KBytes       
      [  5]   3.00-4.00   sec  1.12 MBytes  9.44 Mbits/sec  138   1.41 KBytes       
      [  5]   4.00-5.00   sec  1.62 MBytes  13.6 Mbits/sec  125   19.8 KBytes       
      [  5]   5.00-6.00   sec  1.25 MBytes  10.5 Mbits/sec  137   4.24 KBytes
      

      If there's any way to determine why or where those packets get lost, that seems like it'd be very relevant??
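To put a number on those retransmits, a small sketch like the following could total the `Retr` column from pasted iperf3 text output. The helper name `summarize_intervals` is hypothetical, and the regex is an assumption that only handles lines reported in Mbits/sec with the window in KBytes, as in the sample above.

```python
import re

# Matches iperf3 sender interval lines such as:
# [  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec  124   4.24 KBytes
INTERVAL = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+\w+\s+"
    r"(?P<rate>[\d.]+)\s+Mbits/sec\s+(?P<retr>\d+)\s+(?P<cwnd>[\d.]+)\s+KBytes"
)


def summarize_intervals(text: str) -> dict:
    """Total the retransmits and average the bitrate across iperf3 interval lines."""
    rows = [m for m in map(INTERVAL.search, text.splitlines()) if m]
    if not rows:
        raise ValueError("no iperf3 interval lines found")
    rates = [float(m["rate"]) for m in rows]
    return {
        "intervals": len(rows),
        "total_retransmits": sum(int(m["retr"]) for m in rows),
        "mean_mbps": sum(rates) / len(rates),
        "min_cwnd_kbytes": min(float(m["cwnd"]) for m in rows),
    }
```

Fed the six interval lines above, this reports 784 retransmits in six seconds at a mean of about 11.7 Mbits/sec. Running it on a capture from a healthy path (for example, the same laptop to server B) gives a baseline to compare against.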

      2 votes
      1. [6]
        AntsInside

        If you SSH into the access points you can run a packet capture on the AP itself with tcpdump and see what is coming in and going out. I have done this before but would have to remember the right command line arguments.

        Do you see packets getting lost if you just run a ping for a while or only when the connection is loaded?

        2 votes
        1. [5]
          smores

          Hm, ok. I SSH'ed into the AP and ran tcpdump src <laptop> and dst <server A>. Then I ran an iperf test (basically the same results as above). For comparison, I did the same with server B. I don't really see any distinction between the two outputs, but I also am not totally sure what I should be looking for.

          Another thing I did notice is that the "Congestion window" for server A is tiny compared to server B. On the order of 5 KBytes or so, compared to 1.5 MBytes. It's not clear to me why that would be, or whether it's relevant. When I run an iperf from server B to server A, the bitrate is about 950 Mb/s, and the Cwnd is like 340 Kbytes.
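As a back-of-the-envelope check on why the window size matters: TCP can have at most one congestion window of data in flight per round trip, so throughput is bounded by roughly cwnd / RTT. The sketch below works that out. The 3 ms round-trip time is purely an assumed figure for illustration (no RTT was measured in this thread), and `cwnd_limited_mbps` is a made-up helper name.

```python
def cwnd_limited_mbps(cwnd_bytes: float, rtt_seconds: float) -> float:
    """Rough upper bound on TCP throughput: one congestion window per round trip."""
    return cwnd_bytes * 8 / rtt_seconds / 1e6


ASSUMED_RTT = 0.003  # 3 ms, an assumption; measure with ping for a real value

# ~5 KBytes window, as seen on the slow path to server A
slow_path = cwnd_limited_mbps(5 * 1024, ASSUMED_RTT)

# ~1.5 MBytes window, as seen on the path to server B
fast_path = cwnd_limited_mbps(1.5 * 1024 * 1024, ASSUMED_RTT)

print(f"slow path bound: {slow_path:.1f} Mb/s")
print(f"fast path bound: {fast_path:.0f} Mb/s")
```

Under that assumed RTT, a 5 KByte window caps out in the low teens of Mb/s, which is in the same range as the slow transfers, while a 1.5 MByte window is nowhere near being the limit. The tiny window is most likely a symptom of the constant retransmits rather than a separate problem.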

          1 vote
          1. [4]
            kacey

            Sorry, I'm not sure I understand ... may I ask if you've run an iperf test between server A -> AP, and server B -> AP? The congestion window roughly correlates with what TCP heuristically figured your network's throughput is, so narrowing it down to an issue between the server and the access point at least rules out the clients as a problem.

            If you indeed have an issue between server A <-> AP, and no issue between server B <-> AP, then it'd be worth digging into why packets are being dropped. Digging into server A would be useful in that case, since it might be e.g. setting a lower MTU than is advisable, there could be a driver bug for your NIC which is fumbling something, etc. etc.
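One way to start digging into server A, sketched under the assumption that it is a Linux host (Unraid is Linux-based) exposing the usual sysfs files under `/sys/class/net/<iface>/`. The interface name passed in and the helper name `read_nic_health` are hypothetical.

```python
from pathlib import Path

# Receive-side counters that tend to climb when a NIC or driver is dropping frames.
COUNTERS = ("rx_dropped", "rx_errors", "rx_crc_errors", "rx_missed_errors")


def read_nic_health(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read the MTU and receive error counters for one interface from sysfs."""
    base = Path(sysfs_root) / iface
    health = {"mtu": int((base / "mtu").read_text())}
    for name in COUNTERS:
        path = base / "statistics" / name
        if path.exists():  # not every driver exposes every counter
            health[name] = int(path.read_text())
    return health
```

Calling `read_nic_health("eth0")` before and after a slow iperf run and comparing the counters would show whether the drops are being counted at server A's NIC. An MTU that differs from the rest of the LAN would also show up here.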

            1 vote
            1. [3]
              smores

              I think you're like... right on the money with a driver bug for my NIC, or something in that genre. Literally minutes before you posted this, I switched to the primary NIC for my motherboard, and that fixed it! No idea what the heck was going on, but I think I'm ok with this outcome hahaha

              3 votes
              1. kacey

                Glad to hear it :) ghosts in the system can be such a bear to deal with.

                3 votes
              2. AntsInside

                Thanks for keeping us all updated so well. Glad you got a solution!

                3 votes
  3. [3]
    Artren

    What type of storage media is being used on Server A versus Server B?

    Could server A have some media that might be limiting write speeds? Or possibly being limited by an oversaturated SATA? Seems odd to me, but maybe worth an investigation. Newer motherboards can cause this with their faster M.2 sockets eating up all the resources from some SATA ports.

    The fact you can transfer out at 900+ Mbps is odd though...food for thought?

    1. [2]
      smores

      This occurs when running iperf3 tests and scp'ing to /dev/null, too, so it's not related to the storage medium (that is a good thought though, and I also spent a few hours ruling that out initially!)

      1 vote
      1. Artren

        Oh ok. I'm no expert in this field, but had a similar issue when diagnosing my mother in law's network for her! Hope you get a solution.

        Some old HDD had slow write speed and was killing performance.

        1 vote
  4. [3]
    AntsInside

    My (not confident) guess would be a bug in the access points. You could try a different version of the firmware. Are they on the latest firmware? Have you seen the issue on other firmware versions?

    You can SSH into the access points themselves and may be able to see errors in system logs or if they are suffering CPU or memory issues when a transfer from wireless devices to server A occurs.

    You could check the wifi status on the laptops to see if they are connected at the full speed profile expected. I'm not sure how to do that on Fedora or Mac, and it's probably not the cause, since transfers work fine in the other direction.

    Another thought: are all devices definitely on the same IP subnet and not going through any kind of firewall in the gateway?

    1. [2]
      smores

      Thank you for hopping in!

      This is a longstanding problem, and the firmware was just updated like five days ago, so it would have to be a longstanding firmware bug (which I suppose is possible!). It would be weird if the bug only affected transfers to one specific device, but I suppose it could be a confluence of things.

      The laptops can upload files to other wired and wireless devices on the network at the expected speeds.

      All of the devices are on the same IP subnet for sure. I'm going to try to disable the guest network just in case, but none of the devices are on that.

      1. AntsInside

        It seems like the issue must be a confluence of factors whatever it is. One reason for suspecting the APs is that I tried to troubleshoot an issue for years with Unifi APs intermittently not allowing connection from Apple devices. I never managed to pin it down, but I am pretty sure it was a bug with UAP-PRO APs sometimes getting into a problematic state.