I've been keeping tabs on RWKV, and it's garnered wider interest after a preprint. It'll take me a while to get through the math though, and I figure there must be a couple of Tildeans who actually enjoy it, the masochists. Jokes aside, RWKV continues to surprise me with how... relatively simple it is.
This is a really cool model, thanks for sharing. I’m going to have to play around with it and see what I can get it to do.
From reading the initial RWKV paper, it seems they already address gradient stability; why is this reformulation necessary?
My guess is that gradient stability in RWKV is a side effect of the original design; I don't think the reformulation necessarily adds any advantage if RWKV's channels work as intended.
I also haven't worked it out, so there is a good chance I'm wrong.
Edit:
You're right, a proof of numerical stability is in the paper.
Upon further review, I think the article is showing that numerical stability is preserved in their second reformulation into log-space, which I haven't gotten to yet.
The first reformulation introduces a scaling factor to show numerical stability with large w (rough sketch of what I mean below). It might be that the original RWKV formulation can vanish for some w? The paper maybe hints at this?
Frankly I don't know.
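In case it helps anyone else, here's a rough per-channel sketch of the kind of scaling-factor trick I have in mind. To be clear, this is my own toy code, not the paper's: I'm assuming an RWKV-4 style WKV recurrence with decay w and bonus u, and the running exponent p is what I've been loosely calling the scaling factor. All the variable names are mine.

```python
import numpy as np

# Toy, per-channel sketch of the "scaling factor" idea: keep the WKV
# accumulators divided by exp(p), where p is a running maximum exponent,
# so no raw exp() ever has to be huge. This mirrors the rescaling trick
# used in RWKV-4 style implementations; it is not the paper's code.

def wkv_naive(w, u, k, v):
    """Plain WKV recurrence for one channel: a decayed softmax-weighted
    average of past v's, with a bonus u on the current token.
    Overflows once the exponents get large."""
    out = np.empty(len(k))
    a = b = 0.0  # raw numerator / denominator sums
    for t in range(len(k)):
        out[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]  # decay, then add token t
        b = np.exp(-w) * b + np.exp(k[t])
    return out

def wkv_stable(w, u, k, v):
    """Same recurrence, but a and b are stored scaled by exp(-p)."""
    out = np.empty(len(k))
    a = b = 0.0
    p = -np.inf  # running exponent, i.e. the scaling factor
    for t in range(len(k)):
        q = max(p, u + k[t])
        e1, e2 = np.exp(p - q), np.exp(u + k[t] - q)
        out[t] = (e1 * a + e2 * v[t]) / (e1 * b + e2)
        q = max(p - w, k[t])
        e1, e2 = np.exp(p - w - q), np.exp(k[t] - q)
        a = e1 * a + e2 * v[t]
        b = e1 * b + e2
        p = q
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, v = rng.normal(0.0, 5.0, size=32), rng.normal(size=32)
    # Both agree while nothing overflows...
    print(np.allclose(wkv_naive(0.5, 0.1, k, v), wkv_stable(0.5, 0.1, k, v)))
    # ...but push k up toward ~700+ and wkv_naive goes to inf/nan, while
    # wkv_stable stays finite since it only ever exponentiates differences.
```

The two versions compute the same thing while nothing overflows; the naive one blows up once the exponents get large, which I'm guessing is the same failure mode the log-space reformulation is guarding against, but I haven't read that far yet.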