Is there a known image norm suitable for textured images?
Suppose I am trying to iteratively produce a completed image from some subset of it using a combination of convolutional/DNN methods. What image norm is best?
The natural (for me) norm to put on an image is to treat the bitmap as a vector and take L2. If the input image is anime or something similarly flat, the uniform coloring makes this very likely to give a good fit in a low dimension - that is, no overfitting.
However: pictures of fur. Given a small square, an AI set to extrapolate more fur from that single image should be expected to get the region right next to the given subimage right, but further away I want it to get the texture right, not the exact pixel values. So if, far from the given patch, the AI shifts the fur left by just the right amount, it could get an incredibly poor score even though the result looks perfectly plausible.
If I were to use the naive L2 norm directly, I would be guaranteed to overfit, and you can see this in some of the demo algorithms for image generation around the web. Now, the answer to this is probably to use a Fourier or wavelet transform and then take an Lp norm over the transformed space instead (correct me if I'm wrong).
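For concreteness, here's a rough sketch (PyTorch; shapes and names are just placeholders, so treat it as an illustration rather than a recommendation) of the difference between pixel-space L2 and an L2 taken over FFT magnitudes, which is indifferent to where a texture sits:

```python
import torch

def pixel_l2(pred, target):
    # Plain per-pixel L2: punishes any spatial misalignment, even when
    # the texture statistics are identical.
    return torch.mean((pred - target) ** 2)

def spectral_l2(pred, target):
    # L2 over 2D FFT magnitudes: the phase (i.e. position) is discarded,
    # so a texture shifted "by just the right amount" is barely penalized.
    pred_mag = torch.abs(torch.fft.fft2(pred))
    target_mag = torch.abs(torch.fft.fft2(target))
    return torch.mean((pred_mag - target_mag) ** 2)

# Same texture, circularly shifted: pixel_l2 is large, spectral_l2 is ~0.
pred = torch.rand(1, 3, 64, 64)
target = torch.roll(pred, shifts=7, dims=-1)
print(pixel_l2(pred, target).item(), spectral_l2(pred, target).item())
```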
However, then we get to the most complex class: images containing several different textures. In this case I have a problem: wavelet-type transforms don't behave well across sharp boundaries, while pixel-by-pixel methods don't do well on the textured parts of images. Is there a good method of determining image similarity for these cases?
More philosophically, what is the mathematical notion of similarity that our eye picks out? Any pointers or suggestions are appreciated. This is the second of two issues I have with a design I built for a sparse NN.
Edit: For those interested, here is an example; notice how the predictions tend to blur details.
I’m not initiated in SOTA computer vision techniques, but I did come across a paper in the past year that looks theoretically well grounded: Implicit Neural Representations with Periodic Activation Functions (SIREN). It seems like § 4.2 may be relevant? Possibly also the related work in § 2, such as Texture Fields: Learning Texture Representations in Function Space.
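From skimming, the core layer in that paper is small enough to sketch; here's my paraphrase of it (take it with a grain of salt, I haven't run their code), mapping (x, y) coordinates to RGB:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    # SIREN's building block: a linear layer followed by sin(omega_0 * (Wx + b)),
    # with the weight initialization the paper prescribes.
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = (6.0 / in_features) ** 0.5 / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A tiny SIREN: fit it to (coordinate, color) pairs of one image, then query
# it at new coordinates to extrapolate/inpaint.
siren = nn.Sequential(
    SineLayer(2, 256, is_first=True),
    SineLayer(256, 256),
    nn.Linear(256, 3),
)
coords = torch.rand(1024, 2) * 2 - 1  # (x, y) in [-1, 1]
rgb = siren(coords)
```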
This is mostly naïve speculation on my part, but I think there are several domains along which humans recognize visual similarity: shape, texture, hue, saturation, luminosity (maybe others?). Humans can recognize similarities along any of these domains independently, but we can also recognize similarities across combinations of them. I think this is why digital artists tend to model these things separately: e.g., H(ue) S(aturation) L(uminosity) color models, or 3D artists creating mesh geometries separately from the surface-level textures that get mapped onto them, and, in 2D texture work, things like bump maps (which I suppose are fine-grained 3D geometry) being kept separate from color information. Or, in the case of physically based rendering, mesh geometry is independent of the material parameters.
I'm not sure about the second source, but the first one looks pretty interesting, though if I read it correctly, their primary innovation is choosing an activation function that ought to be really good at 3D stuff. I've bookmarked it to read in depth; they have some really good quality extrapolations. My one complaint is that the cat pictures are too small to tell how well the fur is preserved.
It looks like the SIREN process relies on overfitting a model to the test data in order to generate more plausible data? It's not an approach I had considered, since I was thinking about using an RNN, similar in spirit to GPT (though I know GPT actually uses a Transformer rather than a true RNN), but I'll definitely look into it.
Your comments on the different domains are true, but as I was walking along a gravel road today, I was thinking that one part of the road looked similar to another in texture. Maybe what you say about material parameters could be used: give a neural network control over generating material parameters for extrapolation instead of pixel colors.
Edit: I looked into comments here and here on the SIREN paper, and it looks like others have echoed some of my thoughts, though I was not aware that sine was considered a richer activation function.
Yeah, whatever the solution is, I think it will have to involve modeling beyond discrete pixels. We only use pixels to represent images because it’s easier for digital cameras/scanners to have discrete image sensor arrays, and a lot of computation becomes easier when software can represent image data as numerical matrices.
Edit: I also think there is maybe a level-of-detail issue at play. That is, non-uniform “textures” like gravel or fur are really just fine-grained geometry, but at the level of the gestalt, humans don’t really pay attention to all that fine detail. Digital artists often exploit this by tessellating or repeating texture information with clever tricks that fool humans into seeing what looks like random detail (even though it isn’t). I wonder if there is a way to compute a loss that is invariant to fine-grained detail, similar to the tolerance that normal human vision allows?
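One crude way to get a bit of that tolerance (just a sketch of the idea, not a claim that it matches human vision): pool away the fine detail before comparing, so only the coarse gestalt has to line up:

```python
import torch
import torch.nn.functional as F

def coarse_l2(pred, target, factor=8):
    # Average-pool both images before the L2, so the loss only cares about
    # coarse structure; fine-grained texture no longer has to match
    # pixel-for-pixel. Expects (batch, channels, height, width) tensors.
    pred_coarse = F.avg_pool2d(pred, kernel_size=factor)
    target_coarse = F.avg_pool2d(target, kernel_size=factor)
    return torch.mean((pred_coarse - target_coarse) ** 2)
```

Of course this throws the fine texture away entirely rather than matching its statistics, so it would need to be paired with something texture-aware.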
Edit 2: Maybe you could train a neural net to learn material embeddings, and then compute L2 within the embedding vector space? I’m not sure how you’d combine it with a loss that would account for the non-texture domains, though. You obviously don’t want to extrapolate leopard spotted fur into calico, or light glossy surfaces into dark glossy surfaces.
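A minimal sketch of that, with a completely hypothetical (untrained) encoder standing in for whatever would actually learn the material embeddings:

```python
import torch
import torch.nn as nn

# Placeholder encoder: in practice it would be trained so that patches of the
# same material land near each other in embedding space (e.g. contrastively).
material_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),
)

def material_l2(pred, target):
    # L2 in embedding space instead of pixel space: two patches of the same
    # fur or gravel should score as close even if the pixels don't line up.
    return torch.mean((material_encoder(pred) - material_encoder(target)) ** 2)
```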
From what I understand, you're trying to find a good loss function for a neural network that predicts an interpolated frame.
I'm just spitballing here, but a Generative Adversarial Network might work. GANs are tricky to train, but the discriminator basically acts as a learned distance function between the ground-truth and the frame that is generated by the generator, and that distance function is naturally minimized by the generator. Perhaps something like Patch-GAN could be adapted.
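Roughly like this (layer sizes are arbitrary placeholders): each output logit only sees a local patch, so the discriminator effectively scores local texture statistics rather than the whole image at once:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # PatchGAN-style discriminator: outputs a grid of real/fake logits, one
    # per local patch, instead of a single score for the whole image.
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x):
        return self.net(x)

scores = PatchDiscriminator()(torch.rand(1, 3, 128, 128))  # grid of per-patch logits
```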
Another approach I can think of, but of which I am less confident that it would work, would be to directly train a distance function using a contrastive loss and then use that instead of the L2-distance. The idea is to train a standalone discriminator that can distinguish between the next frame and other, similar frames. Then you can use its prediction confidence as the (inverse) distance.
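For the contrastive version, one concrete way to phrase it is a triplet setup (f here is some frame encoder, hypothetical, not a specific library component): pull the true next frame toward the context, push a similar-but-wrong frame away, and then use the embedding distance in place of pixel L2:

```python
import torch.nn.functional as F

def distance_net_loss(f, context_frame, true_next, similar_frame, margin=1.0):
    # f maps a frame to an embedding vector; after training, the distance
    # between embeddings acts as the learned (inverse) similarity score.
    return F.triplet_margin_loss(
        f(context_frame), f(true_next), f(similar_frame), margin=margin
    )
```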
There's a lot of prior art on video frame interpolation though, so I'd be surprised if you don't find something there that's better than what I just came up with on the spot.
Philosophically, I don't think there's a single metric you can minimize that adequately captures the breadth of human visual perception (although metrics like SSIM are very interesting), and if you could, it would probably be as complicated as one learnt by a neural network.
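If you want to poke at SSIM, it's a one-liner with scikit-image (grayscale float images assumed here):

```python
import numpy as np
from skimage.metrics import structural_similarity

# SSIM compares local luminance, contrast, and structure statistics instead of
# raw pixel differences; 1.0 means identical, lower means less similar.
a = np.random.rand(128, 128)
b = np.clip(a + 0.05 * np.random.randn(128, 128), 0.0, 1.0)
print(structural_similarity(a, b, data_range=1.0))
```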
Thanks for the suggestion. GANs were an alternative I was considering. The issue I was worried about was the vanishing and exploding gradient problems, where one of the networks gets substantially better or worse than the other, throwing off the numerics of the gradients. A third problem is that these systems can get locked into a rock-paper-scissors situation where they cycle instead of improving.
I think I found a way around that, though. I'm planning on doing a generation given a set of frames, then having the adversary discriminate between the frames. To avoid problems with the gradients, I'm going to put a bottleneck in the middle of the adversary, whose width I slowly increase once the learning curve flattens, while keeping a record of previous iterations to test against, to avoid the rock-paper-scissors problem. I may also try a cellular-network approach in which you deposit a 'repelling' field anywhere you've been to force exploration, though that may be fairly memory intensive, so I'll reserve judgment on it. (I got this from an article on cellular-automaton-based path-finding, which might mean it doesn't generalize to high dimensions; however, the rock-paper-scissors situation takes place in a lower-dimensional space, which could make it effective there?)
About prior work: I've avoided it for the most part because those neural networks typically use a depth-estimator module to enhance results, and I'm more interested in tracking textures directly for .... reasons (scale invariance, abstract graphical representations, etc.).
Thanks for the SSIM recommendation! I hadn't seen that before, though it looks like it suffers from problems similar to the other norms, at least for now. I don't theoretically need an exact match to human perception for a norm, but the ones I was aware of have fairly obvious flaws, and, like you said, GANs are trickier to train, so a nice norm would have streamlined what I'm doing.
Anyways, I think I'll try a GAN for now and see how it works. Thanks for the feedback. And any comments on flaws in my current GAN idea are welcome, as I don't want to spend time on a fundamentally flawed approach.
That was what I was getting at, but I seem to have lost that meaning while copy-editing the comment. My idea was to give the generator two frames and have it generate the frame in-between those two frames, or maybe just a diff, similar to what LAPGAN does.
Using images from previous iterations in GANs is a well-known trick, and in reference to reinforcement learning, is known as Experience Replay.
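The usual implementation is a small image pool, roughly like this (the 50% swap with a buffer of ~50 images is the variant I've seen in CycleGAN-style training code):

```python
import random
import torch

class ImagePool:
    # Keep a pool of previously generated frames; with probability 0.5, show
    # the discriminator an old fake instead of the freshest one, so it can't
    # just chase the generator's latest trick (helps against cycling).
    def __init__(self, max_size=50):
        self.max_size = max_size
        self.images = []

    def query(self, fakes):
        out = []
        for img in fakes:                       # fakes: (batch, C, H, W)
            img = img.unsqueeze(0)
            if len(self.images) < self.max_size:
                self.images.append(img)
                out.append(img)
            elif random.random() < 0.5:
                idx = random.randrange(len(self.images))
                out.append(self.images[idx].clone())
                self.images[idx] = img
            else:
                out.append(img)
        return torch.cat(out, dim=0)
```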
If you want to hobble the discriminator, you might want to look at spectral normalization.
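In PyTorch it's just a wrapper around the layer, e.g.:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization rescales the weight so its largest singular value is 1,
# bounding the layer's Lipschitz constant and keeping discriminator updates tame.
disc_layer = spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1))
```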
Counter-intuitively, many papers also do quite the opposite and train the discriminator more than the generator -- without an accurate distance function, the generator can't do its job.
I think that would have basically the same effect as the experience replay if it works.
As a last tip, I found machinelearningmastery to have very good tutorials on implementing GANs.
I'm probably on the wrong sub, but what are you talking about? This seems like it could be interesting. There are a lot of googlable terms in your post, but there's a lot of noise in those search results, and anyway I'm in bed on my phone. What are you trying to do?
I'm trying to tween images in video - basically, fill in missing details between frames. This is a well-known point of interest, but if you look at the paper in the link, they use the L2 norm I was talking about. The problem with this norm is that it isn't a natural fit for what the human eye sees as similar.
Basically, it says that the human eye picks things out pixel by pixel, which we know it doesn't. If you look at a picture of a dog with matted fur and are asked to recreate it, then, if you're a good artist, you'll make a good fur-like sketch of the dog, but every floof and curl won't match up exactly unless you're a savant. Still, any reasonable person would say "that's pretty close", because we think about things in terms of shape, color, and texture. If you were graded by the L2 norm, it would give you a poor score, despite the drawing looking good to the human eye.
Wavelet transforms allow frequency-based analysis, but I'm not aware of a method of measuring image distance that is a natural fit for what we actually see. Defaulting to L2 seems like setting myself up for failure, either through overfitting or through lack of data. I have a couple of terabytes prepped, but we'll see.
I think I follow you. This is a cool cluster of trailheads, thank you for explaining.
No problem