I think I have a broken AT&T route?
Posting for ideas/advice, if anyone has any, as I'm unsure of where else to turn.
I have a VPS (named "Bucket") that I rent and self-host a few services on, along with a home server (named "Vergil") that lives under my basement stairs and hosts many more. At 2:01 AM today I got a notification from Bucket that my Plex server (hosted on Vergil) was down/unreachable. I'm assuming that's when this issue started.
When investigating, I found that Plex wasn't down, but that Bucket couldn't reach Vergil. Further investigation showed that it wasn't just Bucket: nothing could reach Vergil. At first I thought it was an issue with my router, as I have my gateway set up in IP passthrough mode and manage my network via my third-party router (a UDM-Pro). But after digging through logs looking for any automated blocks from misclassified intrusion attempts, I realized that none of my attempts were even reaching the router. So I checked the route, and that's where I found what I think is the problem.
Running mtr to trace the route from Vergil to Bucket gives full resolution of the route:
mtr -rwzbc 10 45.79.209.169
Start: 2024-12-19T16:49:53-0500
HOST: Vergil.goose.ws Loss% Snt Last Avg Best Wrst StDev
1. AS??? 192.168.2.1 0.0% 10 0.1 0.1 0.1 0.2 0.0
2. AS??? 192.168.99.254 10.0% 10 0.5 0.6 0.4 0.8 0.1
3. AS7018 45-26-156-1.lightspeed.tukrga.sbcglobal.net (45.26.156.1) 0.0% 10 4.4 3.6 2.0 5.9 1.2
4. AS7018 107.212.169.24 0.0% 10 5.2 3.7 1.6 6.1 1.5
5. AS7018 12.242.113.31 0.0% 10 2.2 3.7 2.2 5.3 1.0
6. AS7018 12.247.68.178 0.0% 10 2.8 3.8 2.2 5.8 1.2
7. AS20940 ae6.r21.atl01.mag.netarch.akamai.com (23.192.0.94) 0.0% 10 3.2 4.3 2.3 5.7 1.1
8. AS20940 ae0.r21.atl01.icn.netarch.akamai.com (23.192.0.65) 0.0% 10 3.7 4.1 1.9 6.5 1.5
9. AS20940 ae1.r21.atl01.ien.netarch.akamai.com (23.207.235.35) 0.0% 10 4.2 3.5 1.9 5.6 1.1
10. AS20940 ae22.gw3.atl1.netarch.akamai.com (23.203.144.39) 0.0% 10 5.2 5.0 2.4 8.8 2.0
11. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
12. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
13. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
14. AS63949 bucket.goose.ws (45.79.209.169)
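(For anyone unfamiliar with the flags: -r is report mode, -w is wide output so hostnames aren't truncated, -z looks up AS numbers, -b shows both hostnames and IPs, and -c 10 sends ten probes per hop.)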
However, routing from Bucket to Vergil does not:
mtr -rwzbc 10 99.42.115.109
Start: 2024-12-19T16:49:13-0500
HOST: Bucket.goose.ws Loss% Snt Last Avg Best Wrst StDev
1. AS??? 10.204.3.155 0.0% 10 0.2 0.3 0.1 0.8 0.2
2. AS??? 10.204.35.16 0.0% 10 0.4 0.4 0.3 0.5 0.1
3. AS??? 10.204.32.2 0.0% 10 0.7 9.4 0.4 74.3 23.2
4. AS63949 lo0-0.gw4.atl1.us.linode.com (74.207.239.106) 0.0% 10 0.7 0.5 0.4 0.7 0.1
5. AS20940 ae45.r22.atl01.ien.netarch.akamai.com (23.203.144.36) 0.0% 10 0.4 0.4 0.4 0.6 0.1
6. AS20940 ae4.r22.atl01.mag.netarch.akamai.com (23.192.0.98) 0.0% 10 0.6 0.7 0.6 0.8 0.1
7. AS20940 ae1.r24.atl01.ien.netarch.akamai.com (23.192.0.103) 0.0% 10 0.5 0.4 0.4 0.6 0.0
8. AS7018 12.247.68.177 0.0% 10 1.0 1.0 0.8 1.2 0.1
9. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
10. AS7018 107.212.169.25 0.0% 10 1.4 1.4 1.4 1.5 0.0
11. AS??? ???
Calling the tier 1 number for AT&T residential support was less than helpful. They kept wanting to send a tech out to the house, claiming there's an issue with the line. I kindly thanked them for their efforts but gave up, and instead tried emailing the contact address from the WHOIS record for the AT&T datacenter/core router at the last successful hop of the trace from Bucket to Vergil. I doubt I'll hear anything back, but I'm unsure who else to turn to or what else to try. I've never seen or experienced a route broken in one direction like this, but because of it I'm unable to access any of my devices/services from outside my house. Hoping someone has an idea or suggestion?
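In case it helps anyone else: finding that contact was just a whois lookup against the last responsive AT&T hop (the OrgTech/OrgAbuse fields shown here are ARIN-specific; other registries format contacts differently):
whois 107.212.169.25 | grep -iE 'OrgTechEmail|OrgAbuseEmail'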
Edit:
Well, after about 38 hours of this issue, the power went out at my house. My networking equipment is on a UPS, so it did not go down. But when the power returned, the route began resolving again, and I am reachable again. I don't know if an area power outage rebooted some AT&T equipment nearby; I would imagine their stuff is also on a UPS. But who knows?
For the non-believer who doubted my route was ever complete:
[goose@Bucket: ~ ] $ mtr -rwzbc 10 99.42.115.109
Start: 2024-12-20T15:20:23-0500
HOST: Bucket.goose.ws Loss% Snt Last Avg Best Wrst StDev
1. AS??? 10.204.3.155 0.0% 10 0.1 0.2 0.1 0.2 0.0
2. AS??? 10.204.35.16 0.0% 10 0.2 0.3 0.2 0.4 0.1
3. AS??? 10.204.32.2 0.0% 10 0.6 1.8 0.4 9.9 2.9
4. AS63949 lo0-0.gw4.atl1.us.linode.com (74.207.239.106) 0.0% 10 0.4 2.0 0.3 15.6 4.8
5. AS20940 ae45.r22.atl01.ien.netarch.akamai.com (23.203.144.36) 0.0% 10 0.4 0.4 0.3 0.5 0.1
6. AS20940 ae4.r21.atl01.mag.netarch.akamai.com (23.192.0.90) 0.0% 10 0.8 0.7 0.6 0.9 0.1
7. AS20940 ae0.r24.atl01.ien.netarch.akamai.com (23.192.0.95) 0.0% 10 0.4 0.5 0.4 0.5 0.0
8. AS7018 12.247.68.177 0.0% 10 0.8 0.9 0.8 1.2 0.1
9. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
10. AS7018 107.212.169.25 0.0% 10 1.4 1.5 1.4 1.6 0.1
11. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
12. AS7018 99-42-115-109.lightspeed.tukrga.sbcglobal.net (99.42.115.109) 0.0% 10 3.6 3.2 2.1 4.9 0.9
[goose@Bucket: ~ ] $
Nothing against your personal setup, but with millions of customers served, do you really think the problem is AT&T and not just some failure of your own configuration? More than once now I've personally blamed some "big tech issue" for what ultimately turned out to be my own mistake. To check, I'd first drop back to the least complicated network setup, just the provided router and your smartphone, and try to connect to some test setups.
I've certainly broken my fair share of things in the past by tinkering. I'd be more suspicious of myself if I'd changed anything recently. It's been months since my last configuration change, let alone at 2am when I was asleep. Even my UniFi auto updates are scheduled for 4am, so it wasn't a router update or reboot that changed anything.
The tricky part of testing is that I don't have another WAN connection to plug into my router to test whether it's the AT&T connection or not. But it's not just my VPS; traceroutes from multiple hosts fail to reach my IP. While my Plex monitor alerted me to the issue, the biggest problem so far is that I can no longer access my UniFi gateway remotely, and therefore no longer get doorbell/camera notifications when I'm not home.
I'll try plugging directly into the gateway, bypassing my router, in the morning. We'll see if that gets any different results.
Good luck!
I notice the outgoing route starts with two private IP addresses: 192.168.2.1 and 192.168.99.254. This is not really my area of expertise, but it seems like double NAT could be the problem. Is it possible your AT&T router got reset and is no longer in IP passthrough mode?
If that doesn't help, you could make sure your static IP or dynamic DNS is configured correctly so that the name actually points to your WAN IP address.
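For example (guessing your hostname from the mtr output), compare what the name resolves to against your actual WAN address:
dig +short vergil.goose.ws
curl -4 icanhazip.com
If those two disagree, the record is stale.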
Just remember the old adage:
"It's always DNS."
"It can't be DNS."
"It was DNS."
I recently spent three whole days troubleshooting an error where cert-manager would fail to issue certificates for a Kubernetes cluster I was setting up. Turns out I had an incorrect IPv6 AAAA record set up that caused the ACME resolver to crash when trying to compare addresses. Learned a lot about cert-manager, though it's not much of a silver lining.
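The check that would have saved me those three days is embarrassingly small (hostname hypothetical); any unexpected AAAA record printed here was the whole problem:
dig AAAA +short mycluster.example.com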
(Sorry, just had to vent.)
Yeah, I have to put my router behind the gateway due to authentication limitations. The tech who installed my service and I spent some time trying to find a way to bypass the gateway, since my router can accept the SFP+ connection straight off the ONT, but we weren't able to make it work, so we ended up putting the gateway in passthrough mode. Despite that, I still manage to get a symmetrical 1.2 Gbps from the 1 Gbps service I pay for.
I verified that IP passthrough is still correctly configured, and all firewall and traffic filtering functions on the gateway are disabled. Good thought, though. This service has been so rock solid I was sure at first that it was me, not them. This is the closest thing to an outage I've experienced since beginning service with them.
But I did all my traceroutes via IP to avoid any DNS hangups.
Your router can't plug directly into the ONT because AT&T U-verse runs 802.1X supplicants on their CPE devices. Without the certificate, your router won't be able to authenticate to the router on the other end of the GPON.
You may have configured IP passthrough, but something is wrong, and the poster before me is right: you're likely double-NATed, which would certainly cause you huge problems when trying to port forward. I'd verify that the WAN interface of your router is set to DHCP instead of statically IPed, if you indeed think that IP passthrough on the AT&T CPE router is correctly configured.
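A quick sanity check from any machine behind your router:
curl -4 icanhazip.com
# if your router's WAN address is in 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16
# while icanhazip reports something different, there's a NAT layer in front of you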
Yeah, I tried a handful of methods to imitate the supplicant on the router directly, none successfully.
In any case, the problem is now resolved; update in the main post. But while I am technically behind a double NAT, I've configured IP passthrough and have had no trouble with connectivity or port forwarding over the last year of using this same configuration and this service.
Why not let the tech come over? At the very least it will show that there is no issue with the line which might get you a step further in having someone actually look into the issue?
With AT&T, if the tech determines that the issue wasn’t caused by AT&T infrastructure, they charge you for the visit, and the price is usually steep. I talked with AT&T a lot about technical issues. My parents had their DSL for years, and the lines running to their house were awful and always caused issues. Whenever I would request a tech, they would always tell me the problem probably wasn’t their lines and they would charge me. It was always their lines. And those lines still aren’t fixed. When I moved back in with them 2 years ago, I ended up getting cable internet through Comcast and just paying for it myself. It was about 30x the speed, actually reliable, and strictly cheaper than AT&T, despite both services losing the “bundle discounts”.
Absolutely terrible company. It’s telling that I switched to Comcast, which is widely considered to be the worst company with terrible support, and even then it was cheaper and had significantly better customer support.
They threaten that in the UK too (well, at least OpenReach do, which is the infrastructure company that most ISPs use to get service to your property).
... is what I /would/ have said, had I not done some research just now and found that they scrapped that threat in 2022! Now there is no possibility of a charge for an engineer visit, even if the problem turns out to be on the household side.
That doesn't help others in places where these silly charges do still apply.
Honestly, I think it dated from back when a lot (most?) of houses in the UK had self-installed extension sockets that could well have been sloppily wired. And when ADSL came along, cat-chewed dangly microfilters, missing microfilters, and poor cabling quality could all cause ADSL problems that were not on the supplier side.
I'd be surprised these days if many people are still using hardwired extensions. We have some but they're all defunct now.
That’s good that they removed them in the UK. I actually do understand why they exist. Like you said, homeowners can do all sorts of weird shit to break their internet. I’m not familiar with those UK-specific issues, but I am not surprised. The internet company shouldn’t need to pay a tech from their own pocket when the customer unplugs their modem and complains. But there should be some way to bypass that. Like if Weldawadyathink calls for an issue with account 123, he probably knows what he is talking about, so if he says a tech is needed, it probably is. Or the lines to this address are shit, so just send a tech out at the first sign of a problem.
As @Weldawadyathink says, they want $99 to come out if they aren't convinced it's an issue on their end. And given the difficulty I've had explaining the hops of a traceroute to them, I'm concerned they wouldn't accept fault for the issue.
You're putting an awful lot of faith in that traceroute. Are you certain it's justified? Traceroute is really only useful in a handful of cases and this really isn't one of them.
Fun fact: traceroute isn't actually real, or at least it isn't what people think it is. In this case, at least for showing a local double NAT, it can reasonably be trusted, but I don't know about much beyond that.
I can't find any other reason that no ICMP, TCP, or UDP packets can reach me when they used to be able to. It's been some months, but I know I've successfully traced all the way to my house before.
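To be concrete, these are the kinds of probes I mean, run from Bucket (32400 being Plex's default port; adjust if yours is remapped):
ping -c 3 99.42.115.109            # ICMP
nc -vz -w 3 99.42.115.109 32400    # TCP to the forwarded Plex port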
I would be surprised, bordering on shocked, if you got a TTL-exceeded response from every node along a route, beyond a handful of hops at least. Traceroute has never been a properly defined standard, and few network admins bother to enable TTL-exceeded responses on externally facing routers. The only time it might be a reliable method is when you're working with a complete network map for a network where you know traceroute responses have been enforced at every node (say, a corporate intranet for a company run by an OCD network engineer). That is definitely not what you're trying to use it for here. You're really not getting much from the traceroute that you wouldn't get from a simple ping.
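That said, if you want the probes to better reflect what real traffic sees, mtr can send TCP instead of ICMP, which some routers treat differently:
mtr -rwzbc 10 --tcp -P 32400 99.42.115.109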
Networking has never been a strength of mine though, either technically or socially, so I can't really help much. My first guess would be DNS, but that's about as helpful as a dog trying to help you translate Mandarin.
For what it’s worth, the AT&T techs that I have had on service calls have been universally fantastic. The worst was just okay, and that was a single tech across many dozens of service calls. They all knew AT&T infrastructure was shit at my house, and as soon as they saw that I kinda knew what I was talking about, they would easily listen to what I had to say (they would still verify the issue, as any good tech should). As far as I can tell, the tech has the final say on whether to charge the customer for the visit, and they have a lot of leeway to make the right decision. For the many times I knew it was an AT&T issue, the hardest part was often convincing the phone tech support to actually send a tech. Once the tech was onsite, they immediately could see there was an issue and start diagnosing it.
I am sure this all depends heavily on your area and who gets sent out, but my experience with the actual in person techs has been fantastic. It’s just every other part of the company that is a disaster. It’s kinda telling that AT&T hires them as contractors. I am sure they don’t want their bad name spoiled by employing competent people.
Yeah... I shot off emails to their IP admin and their routing email addresses. Failing that, I'll see if I can use an FCC complaint to get the right department to help me. Failing that, I'll hope a tech can get the right department to help out. I also just hate to bring a tech out to the house for an issue not in the house.
This may be a little too obvious, but it’s exactly the kind of thing I have tripped over (and will again…): have you confirmed the external IP address of Vergil, e.g.
curl icanhazip.com
or some similar tool? I don’t see it explicitly stated that you have a static public IP on the Vergil side, so your IP may have shifted. I know mine will be the same for months at a time and then change suddenly and without warning.
Great question! I've confirmed it. I've set up my shell (zsh) so that it reports the IP on each initiation (from icanhazip, actually!)
I also have a dynamic DNS script running every 5 minutes that updates the DNS record if necessary (the record has a 5-minute TTL). While I don't explicitly pay for a static IP, I haven't had an address change since my service with AT&T began.
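Not my exact script, but it's essentially this shape (the provider API call is the only interesting part omitted):
#!/bin/sh
# run from cron every 5 minutes
CURRENT=$(curl -s -4 icanhazip.com)
RECORD=$(dig +short vergil.goose.ws)
if [ "$CURRENT" != "$RECORD" ]; then
    echo "WAN IP changed from $RECORD to $CURRENT; updating DNS"
    # call your DNS provider's update API here
fi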
Totally off-topic and not helpful, but Vergil is an ODST reference, right?
Nice recognition! Yes, given that this host is my biggest, and hosts the majority of my services, it felt fitting to name it Vergil after the Superintendent from ODST.
Not that you care, but I also have:
My laptop: Cortana
My desktop: River
My router: Woodhouse
My primary DNS RasPi: GLaDOS
My secondary DNS RasPi: Wheatley
My brewing controller RasPi: Wash
My backup host odroid: GuiltySpark
My Steam Deck: Amos
My VPS: Bucket
Each host (for the most part, Amos excluded) has a file containing a number of quotes from the relevant character; upon login/shell initiation, a random one is chosen and printed with the MOTD.
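The mechanism is about as simple as it sounds; something along these lines in each host's shell init (quote file path hypothetical):
# in ~/.zshrc: print a random quote from this host's character file
[ -f ~/.quotes ] && shuf -n 1 ~/.quotes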
Very cool! I know most of those. I love the quotes with the MOTD. We have networks named for authors - our home WiFi is SANDERSON and the little private access point we used to take when traveling was ROTHFUSS.
I figured you’d be on it, but as with the comment about DNS, it always seems to be the littlest issues that take me the longest to figure out :)
Could it be a CGNAT issue? I’m not a network guy per se - which is to say I don’t have a concrete hypothesis to back that up - but it seems consistent with other times I’ve seen that being the problem. Things work for a while, because the mappings at the ISP side coincidentally don’t interfere with the local ones, but at some point they reconfigure as they onboard more customers or replace systems and their new mappings happen to be incompatible with your setup.
To @goryramsy’s point, it would also be something that could change at the carrier side without being visible to you, break your setup, and not technically be a bug on their side either.
[Edit] I’ve found Tailscale to be excellent at punching through issues like this, so that might be a good way to test the issue, and even a viable longer-term workaround.
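The test is pleasantly cheap; roughly, on both hosts:
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
tailscale ip -4   # the 100.x tailnet address to reach this machine from the other one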
Were you definitely able to traceroute in that direction before? Because it's pretty common for routers to just ignore traceroutes, and that might be a red herring.
Yeah, it's been some months since I've done it, but I know I've successfully done full traces in the past. I remember at the time I was interested in how many hops there were between my home and VPS, as I selected the Atlanta datacenter from my VPS provider for proximity.
Do you have a public IPv6 on your home network and your VPS? If so, it might be a way to narrow down whether this is a IPv4-only or CGNAT issue.
Personally I just run my home hosting stuff (mostly *arrs) off IPv6 via Cloudflare DNS-Only records. As long as I have V6 connectivity I can access everything anywhere and I don't have to run split DNS to get local connections at home.
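An easy way to confirm both ends actually have working global v6 before relying on it:
curl -6 -s icanhazip.com   # run on both hosts; should print a global (2000::/3) address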
I do, although I don't route V6 addresses to my devices, just the router. While the addresses were different, the hops (and the failure at hop 11) were the same when the issue was happening.
For what it's worth, I have discovered an intermittent issue with my similar AT&T connection that only started showing up in the last week or weeks.
As best as I can track it down, I will randomly end up under double NAT. That is to say, my router will sometimes receive an IP address in the private 172.16.0.0/12 range on the WAN. The internet still functions as it should, but all of my port forwards break, and due to DDNS all of my domain names get updated to the non-routable IP.
A release/renew has so far fixed the issue, as has waiting several days.
I am also using AT&T fiber with my gateway put into passthrough mode.
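For anyone else hitting this, the release/renew on a Linux-based router is just (WAN interface name hypothetical):
sudo dhclient -r eth1 && sudo dhclient eth1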
This whole escapade has coincidentally led me to this device. For $230 I'm not gonna take a chance on trying it out, but I love the idea of bypassing their equipment and terminating the fiber line directly into my router.
Hmm, that's interesting. I've heard of attempting to pull the certificate out of the AT&T device and using it to authenticate, but never really explored it.
Personally I would jump at the chance to rip out ISP equipment in my home and replace it with my own. Network equipment you don’t manage is always going to cause you headaches. Even if you’re fine with the way it’s being managed, you are not going to like it when it changes without warning.