500W seems insane, even for a top-end card. There has to be an upper limit somewhere, right? If nothing else, at some point it becomes too much of a stretch to put such high-draw big iron into the same product line as "bare minimum" models like the 4050/5050. It's like putting a Threadripper or Epyc CPU in the same product line as a Ryzen 5100/5500.
The problem they are facing, and the reason this is happening, is that they have to produce a stronger card than last year to keep investors happy, but they're hitting the limits of what silicon chips can accomplish without pushing more power through them. The issue is exacerbated by the high heat output as well; there are diminishing returns on how much "bang for the watt" we can achieve now that we've pushed these chips this far and transistors this small.
This is also why they have invested heavily in software like DLSS: they're somewhat cooked on hardware, so they're looking for things they can do to the resulting image that reduce processing cost without the user seeing or feeling it (in quality or latency).
I'm interested to see if they will eventually find a way to reduce their instruction set (like Apple did when they moved away from traditional CISC processors) to reduce the amount of work a GPU has to do without interfering with the capabilities of the card. I'm also interested to see what happens after silicon, like graphene. I'm sure whatever it is will be quite pricey for us consumers for a good long while.
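To put rough numbers on the upscaling point above: the win comes from shading far fewer pixels internally and letting reconstruction fill in the rest. A back-of-the-envelope sketch (the ~67% per-axis render scale is the commonly cited figure for a "Quality"-style preset; treat it as an assumption):

```python
# Rough pixel-count savings from upscaling (illustrative numbers).
native = 3840 * 2160     # 4K output resolution
internal = 2560 * 1440   # ~67% per-axis internal render resolution

savings = 1 - internal / native
print(f"Internal render covers {internal / native:.0%} of the output pixels")
print(f"=> roughly {savings:.0%} less shading work before reconstruction")
```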
Their biggest datacenter cards consume 700W each. It's clear they're in a mode of desperation.
Well, but the data center GPUs are a completely different beast; those aren't "cards." Even if Nvidia tightened up their gamer GPU TDP situation, they would still design the next generation of A100/H100 (or whatever it's called) to maximize performance at the same or similar draw and footprint. The customer calculus is inference per watt per installed square foot.
Also mimicking Apple, they could always take the direction of piling on more silicon, which is expensive but at least helps keep power usage and thermal issues at bay. I'm sure there's a customer segment that would happily pay a premium for cards that are as powerful as current top-end cards, but don't have the crazy PSU and cooling requirements.
but they're hitting the limits of what silicon chips can accomplish without pushing more power through them.
Could you provide any more info on this? Last I remember, this was a talking point before the Pascal series was introduced, which then brought massive efficiency gains. I haven't heard much about TSMC peaking on silicon density. I know we're getting sorta close, which is why companies are investigating 3D designs (though the related thermal issues may be insurmountable).
Edit: silicon, not silicone
I found this video about Apple Silicon a good primer on the topic.
Speed enhancements almost invariably come in these ways, with the following tradeoffs:
Increase clock speeds, generating more heat per transistor. We've been at practical maximums here, around 5GHz, since roughly 2006.
Add more transistors by increasing die size. This adds more heat and consumes more power.
Shrink transistor size with process improvements to reduce power consumption. This lets you add more transistors without increasing die size, but it means more heat in smaller areas, which needs to dissipate faster. We're also extremely close to (or at) theoretical/practical limits here.
Software improvements, via better algorithms and more efficient coding. This is mostly ignored in favor of ease of development, and inventing new algorithms is very difficult.
Dedicated, not general-purpose, hardware designed to optimize the software algorithms.
Apple had their breakthrough with the M1 primarily because they were able to push that last one thanks to a tight ecosystem. However, since doing so, they've been 'stuck with' the same problems as the rest of the silicon industry, and other players are catching up.
So yeah, NVIDIA is essentially at the point where their primary dials are:
Increase power consumption
Improve software
And one of those dials is much, much, much easier to turn. It won't be long before we need dedicated 20A breakers for top-end PCs, or perhaps a transition to 240V.
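For anyone wondering why the "more power" dial gets expensive so quickly: to first order, dynamic power in CMOS scales as P ≈ α·C·V²·f, and hitting higher clocks usually means raising voltage too, so power grows much faster than frequency. A toy sketch with made-up constants, just to show the shape of the curve:

```python
# First-order CMOS dynamic power model: P ~ alpha * C * V^2 * f
# The constants below are illustrative, not measurements of any real chip.
def dynamic_power(voltage, freq_ghz, alpha_c=100.0):
    return alpha_c * voltage**2 * freq_ghz

base = dynamic_power(1.00, 2.5)
pushed = dynamic_power(1.10, 2.9)   # ~16% more clock, but it needs ~10% more voltage

print(f"baseline : {base:.0f} (arbitrary units)")
print(f"pushed   : {pushed:.0f} ({pushed / base - 1:.0%} more power for "
      f"{2.9 / 2.5 - 1:.0%} more clock)")
```

This is also why the undervolting advice further down the thread pays off so well: the voltage term is squared.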
Worth noting that with the M-series, compared to traditional x86 CPUs and their integrated GPUs, Apple has also taken option #2 to a significant degree. Instead of brute forcing performance by pumping more power, they've made their SoCs huge with a crazy number of transistors.
It's one of the reasons why Intel, AMD, and now even Qualcomm are struggling to match the M-series' performance per watt (Intel/AMD on the low end, Qualcomm on the high end): they simply can't afford to put as many transistors into their CPUs since they're mass-market offerings that can't have costs absorbed by a high-margin product built with them.
As the video notes at one point, the M3 is already a step backwards from the M2 in many ways, hitting hard thermal throttling under any sort of sustained load.
They managed to get a jumpstart there, but as the competition catches up they're limited much the same as everyone else, and there's not much room left to expand the die any further or sacrifice performance per watt.
Their 'magic leap' in performance per watt was mostly a one-time affair. We're not going to see the kind of generational die shrinks we saw over the previous three decades that would let them stay ahead of the competition year over year.
Thanks for the reference! I did some additional research as well, and it seems like the current silicon limit is that FinFET transistors have basically hit maximum density. TSMC's next goal is gate-all-around (GAAFET) transistors, but those are a few years away (assuming they pan out).
Edit: fixing autocorrect's mistakes
the current silicone limit
Since you spelled it this way twice, I'll assume it isn't a typo and offer a friendly FYI: "silicone" is a different substance than "silicon". One is good for making transistors, and the other is a good source of double entendres.
Thanks for pointing that out. I'm aware of the difference but consistently mess up the spelling despite having worked with both!
The power connectors support up to 600W, so unless they put two of those on one card, that's the limit.
Quad power connectors aren't unheard of for GPUs, so I wouldn't be surprised at all if they went this route.
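For reference, a rough board-power ceiling under those constraints (assuming the usual 75W from the PCIe slot plus 600W per high-power connector):

```python
# Rough board-power ceiling: PCIe slot (75W) + N high-power connectors (600W each).
SLOT_W, CONNECTOR_W = 75, 600

for n in (1, 2):
    print(f"{n} connector(s): up to {SLOT_W + n * CONNECTOR_W} W board power")
```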
Even when the 40 series came out, I think a lot of people were shocked. Many pointed out that a system with a 4090 and a high-end CPU to match would draw enough power that you'd seriously have to watch what else you put on that breaker. For older houses, you might need to shut things off before firing up a game.
At work we needed to give our AI workstation its own 20A circuit. We put the largest available 120V power supply in it, 1650W, to power 3x L40S cards and a 7950X3D.
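Rough circuit math for anyone curious, using the common 80% continuous-load rule of thumb for North American branch circuits:

```python
# Usable continuous wattage on a North American branch circuit (80% rule of thumb).
def continuous_watts(volts, amps, derate=0.8):
    return volts * amps * derate

for amps in (15, 20):
    print(f"120V / {amps}A circuit: ~{continuous_watts(120, amps):.0f} W continuous")

# 15A -> ~1440 W, 20A -> ~1920 W.
# A 1650W PSU (plus monitors etc.) clearly wants the 20A circuit to itself.
```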
Bit of a tangent, but have you guys noticed any issues with PCIe bottlenecks on that setup?
I’m speccing a similar machine and it’d be great to stick with AM5 rather than Threadripper if possible (aside from the big cost advantage, there’s always something that ends up benefitting from extra single-core performance), but I’m having trouble finding info on how much difference running the cards at x8/x8/x4 makes in reality.
For our needs there's no bottleneck. The cards hover around 100 MiB/s of bandwidth during training, well below the capacity of the slowest slot. Even if there is a bottleneck for a moment, it just delays the run by a few minutes, which is inconsequential across a day-long run.
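For a sense of scale (assuming PCIe 4.0 links, which is what the L40S uses): per-lane throughput is roughly 1.97 GB/s after encoding overhead, so even a narrow x4 slot has enormous headroom over ~100 MiB/s of observed traffic.

```python
# Approximate PCIe 4.0 throughput per link width vs. the observed training traffic.
PCIE4_GBPS_PER_LANE = 1.969   # ~16 GT/s with 128b/130b encoding
observed_mib_s = 100          # figure quoted above

for lanes in (16, 8, 4):
    link_gb_s = PCIE4_GBPS_PER_LANE * lanes
    used = (observed_mib_s * 1024**2) / (link_gb_s * 1e9)
    print(f"x{lanes}: ~{link_gb_s:.1f} GB/s link, observed traffic uses ~{used:.3%}")
```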
Very good to know, thank you! I'm guessing no FSDP or similar in your scenario, in that case? If it's just a case of ruling out the fancier kinds of parallelism, that could definitely work for us.
I'm not one of the ML guys so I'm not familiar enough with terms like FSDP to say if that's in use. We do train across all GPUs sometimes. We also will run different workloads on each card. It's on a machine that the 3 ML guys all SSH into.
We actually swapped over to A6000 cards recently and moved our 3 L40S cards into our new cluster which totals 16 L40S GPUs now.
Got you, and I really appreciate the info - it sounds like pretty much exactly the use case I’m looking at: enough to run a few different dev jobs in parallel, or one somewhat larger job more quickly, with anything significantly bigger going off to the actual servers. Probably means that worrying too much about bus bandwidth would be overkill, which is very much what I was hoping to hear.
It was so much cheaper this way than going with any of the commercial options. For example, Lambda Labs sells a 3x A6000 machine for $44,000! But it's got 96 cores and 512GB of RAM. Ours cost closer to $15,500.
The difference really is mind blowing sometimes! As soon as you edge into “there’s some level of budget justified for this” it’s like the suppliers hear “money no object, add a nice big multiplier”.
Actually we just figured out we have the old cards, the A6000 instead of the RTX 6000 Ada. We were on a call with an Nvidia rep and learned that they have 3 very similarly named cards:
RTX 6000 (very old)
RTX A6000 (old)
RTX 6000 Ada (new)
I only knew of the first two. The newest one has the same appearance and amount of VRAM as the middle one. Super confusing, but the A6000 does seem to be a better bang for our buck right now.
I actually ran into this issue in my apartment. The breaker trips if I don't turn off the AC unit before doing anything demanding on the GPU. Not a huge deal since I live far enough north that summers are around 80°F, but it's still weird that my GPU can draw such a monumental load.
Try undervolting your card with e.g. Afterburner. I was able to reduce power consumption by about 25-30% on my 4070 Super (going from 1.1V to 0.970V) with negligible impact on frame rate. I did the same on my Ryzen 7600 (-50mV offset and an 80W power limit) for another ~25% power cut, and actually gained performance because it no longer thermally throttles and stays boosted indefinitely.
The added benefit is a much cooler (15°C cooler CPU and 5-10°C cooler GPU under max load) and quieter rig.
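Those savings line up with the rough V² relationship for dynamic power: the voltage drop from 1.1V to 0.970V alone predicts a reduction on the order of 20%, with lower boost clocks and reduced leakage plausibly accounting for the rest. A quick sanity check:

```python
# Quadratic voltage scaling as a sanity check on the reported ~25-30% savings.
v_stock, v_undervolt = 1.100, 0.970

predicted = 1 - (v_undervolt / v_stock) ** 2
print(f"Voltage term alone predicts ~{predicted:.0%} less dynamic power")
# Clock and leakage effects push the real-world figure a bit higher.
```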
Undervolt is the new overclock.
Do you happen to have any good guides, particularly for beginners, for undervolting? I have a 3080 and a R7 5800X3D. I'm not so much concerned about performance as I am about power consumption/heat generation.
Nothing directly*, and some of the nuances are chip-generation specific. I would read about overclocking first. But it's pretty straightforward and not as complicated as it sounds if you just stick to the basics. The first thing would be to familiarize yourself with your BIOS and make sure the firmware is updated. You don't want to use any "indirect" OS software for tweaking your memory or CPU. For the GPU I recommend MSI Afterburner (OS software is fine for GPU tuning). There's plenty of support online for every mainstream chip under the sun.
The overall formula is the same: find a starting point that others have found works for your chip, run a stress test for stability, and make minor adjustments from there. Tweak until failure, then back off a bit. Keep in mind every chip is a bit different in terms of stability, i.e. the "silicon lottery". In my case, most people report a -30 mV undervolt for the Ryzen 7600, but I won the lottery with mine (I could push -60 mV, though it wasn't stable under extended loads).
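The "tweak until failure, then back off" loop is simple enough to write down. This is purely a hypothetical sketch of the process; is_stable() stands in for whatever manual stress testing you actually do, and nothing here touches real hardware or any vendor tool:

```python
# Hypothetical sketch of the "tweak until failure, then back off" loop.
# is_stable() is a stand-in for a manual stress test (Cinebench loop, games, etc.).
def find_undervolt(start_mv=-30, step_mv=-10, floor_mv=-100,
                   is_stable=lambda mv: mv >= -60):
    offset = start_mv
    last_good = None
    while offset >= floor_mv:
        if not is_stable(offset):
            break                  # first failure: stop pushing further
        last_good = offset
        offset += step_mv          # go more aggressive (more negative)
    # Back off one step from the last known-good setting for margin.
    return None if last_good is None else last_good - step_mv

print(find_undervolt())            # -50 with the placeholder stability model above
```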
*EDIT: actually overclock.net is an excellent forum for information and users' experiences.
Ryzen 5000 Undervolting with PBO2 – Absolutely Worth Doing
Not sure about undervolting for Nvidia GPUs though.
Note that after undervolting, you'll want to do some stability testing. For a CPU, this would mean letting something like Cinebench run multi-threaded and single-threaded benchmarks on a loop for thirty minutes at least. For your GPU, an extended run of something like Furmark would be fine.
You can't guarantee system stability just through benchmarking software, though. If your computer doesn't crash after testing with benchmarks, you'll just have to use it like you normally do. If your computer ends up crashing while playing games or sitting idle, try dialing back the undervolt and see if it happens again. Stability issues often occur in the weird valley between idle and 100% usage, and that's the most difficult area to test.
People tend to recommend software like OCCT for stress and stability testing, but I have generally found it to be useless and prone to reporting errors when they don't actually exist. Your mileage may vary, but I certainly would not adjust an otherwise stable undervolt just because OCCT reports errors.
I agree with everything you said, but I would caution against Furmark. It's generally considered to put an exceptionally unrealistic heavy load on the GPU and in some cases could cause damage. Something like 3DMark is often sufficient if you just keep an eye out for artifacts, and then actually play a game for an extended period. Afterburner is convenient enough that you can back off the tweak and get back into playing pretty quickly after a graphics crash!
For NVIDIA GPUs (I'm like 90% sure it works the same for AMD), MSI Afterburner (which works great even if your card isn't from MSI) is a good candidate for performance tweaking. You should be able to apply voltage and clock adjustments fairly easily within bounds that are pretty safe for your card. It's been a while since I went through the process, but when I last looked into it for my 1080 there were quite a few videos that walked through undervolting and then overclocking a GPU with Afterburner. It all looked pretty card/manufacturer independent, but I'm not in a spot where I can jump down a rabbit hole to get you links. You could watch a few videos and get a good sense of what to do, though.
Given the 10x price gap for the A and L series, I'm more than happy for them to keep wedging the xx90 cards into the "consumer" lineup regardless of how much sense they make there! I'd been hoping the Titan branding would come back with 36 or 48GB of VRAM, but I guess they decided that would cannibalise A6000 sales too heavily.