I am pulling this from one of their most recent patents:
Using the approaches disclosed herein, a single computing architecture can store the parameters of a large machine learning model in dense but noisy memory while maintaining the performance of the machine learning model by fine-tuning the machine learning model to counteract the impact of those noise sources. The large machine learning model can be first trained on a system with abundant computational resources (e.g., a large server network) and then added to a computing architecture with fewer computational resources (e.g., as a multicore processor). The computing architecture can store the machine learning model in dense memory and refine the machine learning model to offset the performance degradation of the machine learning model caused by memory noise. This fine-tuning requires much less computational effort than initial training. As such, modest systems in terms of their available resources can be used to store extremely large machine learning models and generate accurate inferences therefrom without a network connection to an external computing architecture.
This next section is easier to understand when you know what a 'set' is, which they defined as:
the first memory set includes one or more memories, and the one or more memories include read only memory, the second memory set includes one or more memories, and the one or more memories include random access memory, and the first memory set is denser than the second memory set.
[ .. ]
This can be achieved by storing a given model in the high-density memory of the parts in each set, and each individual part within a given set can use the approaches disclosed herein to refine the machine learning model and counteract the noise sources of the individual part on that given model. In some embodiments, both the storage and refinement of the machine learning model may be performed during back-end of line processing or during final test and assembly of the parts, providing flexibility and manufacturing efficiency.
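To make the "fine-tune to counteract memory noise" idea in the quoted passages concrete, here is a toy sketch. It is not the patent's method: the linear model, the Gaussian noise level, the calibration set, and the names `noisy_w` and `delta` are all assumptions made for illustration. The frozen weights stand in for the dense, noisy memory, and `delta` stands in for a trainable fine-tuning portion held in ordinary RAM.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mse(weights, data):
    return sum((dot(weights, x) - y) ** 2 for x, y in data) / len(data)

random.seed(0)
dim = 8

# Hypothetical "trained" weights from the large server-side training run.
true_w = [random.uniform(-1, 1) for _ in range(dim)]

# Assumption: dense memory perturbs each stored weight with Gaussian noise.
# These noisy weights are frozen (read-only) from here on.
noisy_w = [w + random.gauss(0, 0.2) for w in true_w]

# Small calibration set whose targets come from the clean model.
data = []
for _ in range(64):
    x = [random.uniform(-1, 1) for _ in range(dim)]
    data.append((x, dot(true_w, x)))

# Trainable correction. In this toy it has one entry per weight; a real
# fine-tuning portion would presumably be far smaller than the model.
delta = [0.0] * dim
lr = 0.3

before = mse(noisy_w, data)
for _ in range(300):
    grad = [0.0] * dim
    for x, y in data:
        err = dot(noisy_w, x) + dot(delta, x) - y
        for i in range(dim):
            grad[i] += 2 * err * x[i] / len(data)
    # Plain gradient descent on the correction only; noisy_w never changes.
    delta = [d - lr * g for d, g in zip(delta, grad)]

after = mse([w + d for w, d in zip(noisy_w, delta)], data)
print(f"MSE with noisy weights: {before:.4f}")
print(f"MSE after fine-tuning the correction: {after:.8f}")
```

The only point of the sketch is that tuning a small side component against a fixed, noisy copy of the weights is much cheaper than retraining, which matches the patent's claim that the refinement "requires much less computational effort than initial training".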
It reads to me that they are running a distilled version of the model in question (much easier to do with a smaller model such as Llama 8B). Instead of the standard flow (storing the model in hard memory, moving the workload to the GPU for processing, and then storing the token inserts/workload in RAM for post-processing), they just load the entire model directly into DRAM from the start, which again is much easier to do with a distilled model. Edit: Yes, this is what they are doing, as confirmed by this patent that was assigned to them: Computing architecture with model core and fine-tuning portion.
They also define how they load the model into DRAM in read-only mode and then load small subsets of that model's current workload into DRAM for processing. This removes the 'hard boundary' of transferring models in and out of different forms of memory. They seem to be focused on short-context, fast-response models and I/O for their chips.
This all seems obvious to me in retrospect, but sadly I did not come up with it first (-:
Looking through their careers page and their other patents, it seems like they are fine-tuning a DRAM assembly (software and hardware) onto a modified NVIDIA GPU. Are they just adding DRAM as VRAM to an existing graphics card? That's sort of what this reads to me as: specialized/defined VRAM/DRAM directly on the GPU, with a 'host' carrier board to interface with it. I don't believe they are even designing their own SoC, but you could absolutely transform this into an SoC, or even loosely define it as such.
Edit 2: This guy (Ljubisa Bajic) has filed a lot of patents, and very few (if any?) have been granted to him over the last decade. They were initially sponsored by Tenstorrent, which, surprise, is doing the same thing that this company is trying to do. Either this is a spinout, or an attempt to become independent.
Edit 3:
"Ljubisa Bajic designed video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as a senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent."
Ah, he started Tenstorrent, that makes sense. He left in early 2023 to start this company.
Eh, sort of but not really? Your interpretation might be right, but the patent also might be a roundabout way to talk about the fact that they are using a quantized model.
The Silicon Llama is aggressively quantized, combining 3-bit and 6-bit parameters, which introduces some quality degradations relative to GPU benchmarks.
Our second-generation silicon adopts standard 4-bit floating-point formats, addressing these limitations while maintaining high speed and efficiency.
It still says so little that the most we can do is speculate. So it still gets an "eh" response from me.
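For a sense of scale on the quantization point, here is a back-of-the-envelope weight footprint at the bit widths mentioned in the quotes. The round 8 billion parameter count is an assumption, the mix of 3-bit and 6-bit parameters is not stated anywhere, and the calculation ignores any per-block scales or other metadata a real format would carry.

```python
PARAMS = 8_000_000_000  # assumption: a round 8B parameters

def footprint_gb(bits_per_param, params=PARAMS):
    """Raw weight storage in GB (10^9 bytes), ignoring any metadata."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("16-bit", 16), ("6-bit", 6), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{label:>6}: {footprint_gb(bits):.1f} GB")
```

Whatever the actual 3-bit/6-bit mix is, the raw weights would land somewhere between the 3-bit and 6-bit figures, which is a fraction of the 16-bit footprint.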
This is a very interesting showcase, more details are available at https://taalas.com/the-path-to-ubiquitous-ai/
A potential positive here is that this can be used in specific cases where models are not required to complete complex ambiguous tasks, hopefully alleviating the current pressure on electronic components like RAM and GPUs.
Man, they could have made that much easier to read. The tl;dr seems to be that instead of using general-purpose hardware, their offering is hardware specifically built for specific models. So far the only model they are offering is Llama 3.1 8B.
Which is a relatively light model to run to begin with. But to their credit a quick test does seem to validate their speed claims:
Their claimed speed in chat: Generated in 0.018s • 15,623 tok/s
Running Llama 3.1 8B at Q4 in llama.cpp: 239 tokens 2.2s 108.61 t/s
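A quick sanity check of those two readings, using only the numbers quoted above. The variable names are made up for the sketch, and this is not an apples-to-apples benchmark: the hardware, prompt, and quantization differ between the two runs.

```python
# Figures copied from the two readings above.
taalas_tps = 15_623     # tok/s reported by their chat UI
taalas_secs = 0.018     # "Generated in 0.018s"
local_tokens = 239      # llama.cpp run
local_secs = 2.2
local_tps = 108.61      # t/s reported by llama.cpp

speedup = taalas_tps / local_tps
implied_tokens = taalas_tps * taalas_secs    # tokens in their reply, if both figures hold
recomputed_local = local_tokens / local_secs # differs slightly because 2.2s is rounded

print(f"speedup: {speedup:.0f}x")
print(f"implied tokens in their reply: {implied_tokens:.0f}")
print(f"local t/s recomputed from tokens/time: {recomputed_local:.2f}")
```

That works out to a gap of roughly 144x in reported tokens per second, and the reported figures on both sides are internally consistent to within rounding.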
If this approach scales up to more competent models, it could be interesting, depending on a variety of factors like the actual hardware cost involved. Considering they give absolutely no relevant details about the hardware (close to zero, zip, zilch, nada) other than what basically comes down to "we designed a custom SoC", I suspect there might be some caveats and/or gotchas involved here.
"electronic components like RAM and GPUs."
They still need RAM; they nicely talk around it in their marketing with stuff like this:
Taalas eliminates this boundary. By unifying storage and compute on a single chip, at DRAM-level density, our architecture far surpasses what was previously possible.
But that just seems to be describing an SoC, like Apple silicon. Which, yes, gives speed benefits, but at the same time it still pretty much requires all the other hardware in order to run properly.
In fact, having typed it all out, I feel like they just reinvented the NPU.
tl;dr it remains to be seen what they actually are offering. This is pure marketing aimed at attracting more investors imho.
I'm not going to even try this one, but their planned releases seem promising.
Our second model, still based on Taalas’ first-generation silicon platform (HC1), will be a mid-sized reasoning LLM. It is expected in our labs this spring and will be integrated into our inference service shortly thereafter.
Following this, a frontier LLM will be fabricated using our second-generation silicon platform (HC2). HC2 offers considerably higher density and even faster execution. Deployment is planned for winter.
Edit: Some of their patent applications are iffy on merit, such as "Mask programmable rom using shared connections", which basically tries to patent addressable hardware memory pointers, or "Hardware implemented codebook pointers", which nebulously tries to patent a method of hardware circuitry switching logic.
@creesch this might add a bit more, ahem, reasoning and explanation to their claims that you were talking about.