What programming/technical projects have you been working on?
This is a recurring post to discuss programming or other technical projects that we've been working on. Tell us about one of your recent projects, either at work or personal projects. What's interesting about it? Are you having trouble with anything?
Looks nice! What kinds of projects do your users create most commonly? Planning a renovation, building a website?
A bug report: the two colored boxes, "No registration required" and "Start right now", don't fit on the screen on mobile.
I started reading up on Genetic Programming again. The idea is to use evolutionary search to turn a random population of programs into ones that achieve a certain goal. I have a couple of ideas in mind for what I'd like to use it for, but for now I'm just trying out the basics to get the hang of it. I'll probably end up using it to generate GPU shader programs for processing images eventually. We'll see how it goes!
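To make the loop concrete, here's a dependency-free toy in Rust, a linear-GP variant where a "program" is just a sequence of ops applied to an input. The op set, fitness cases, and the tiny inline LCG are all made up for illustration, not anything I'm actually planning to use:

```rust
// Toy linear genetic programming: evolve a short op sequence that maps
// inputs to target outputs via selection plus point mutation.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Op { Nop, Add1, Sub1, Double, Square }

const OPS: [Op; 5] = [Op::Nop, Op::Add1, Op::Sub1, Op::Double, Op::Square];

/// Run a program: fold the ops over the input value.
fn run(prog: &[Op], input: i64) -> i64 {
    prog.iter().fold(input, |acc, op| match op {
        Op::Nop => acc,
        Op::Add1 => acc + 1,
        Op::Sub1 => acc - 1,
        Op::Double => acc.saturating_mul(2),
        Op::Square => acc.saturating_mul(acc),
    })
}

/// Fitness = total absolute error over all input/output cases (0 is perfect).
fn fitness(prog: &[Op], cases: &[(i64, i64)]) -> i64 {
    cases.iter().map(|&(x, y)| (run(prog, x) - y).abs()).sum()
}

// Minimal LCG so the sketch stays dependency-free.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
    fn below(&mut self, n: usize) -> usize { (self.next() as usize) % n }
}

fn evolve(cases: &[(i64, i64)], rng: &mut Rng) -> Vec<Op> {
    let len = 4;
    // Random initial population of 50 programs.
    let mut pop: Vec<Vec<Op>> = (0..50)
        .map(|_| (0..len).map(|_| OPS[rng.below(OPS.len())]).collect())
        .collect();
    for _ in 0..200 {
        pop.sort_by_key(|p| fitness(p, cases));
        if fitness(&pop[0], cases) == 0 { break; }
        // Keep the best half, refill with point-mutated copies.
        let survivors = pop[..25].to_vec();
        pop = survivors.clone();
        for parent in survivors {
            let mut child = parent.clone();
            child[rng.below(len)] = OPS[rng.below(OPS.len())];
            pop.push(child);
        }
    }
    pop.sort_by_key(|p| fitness(p, cases));
    pop.remove(0)
}

fn main() {
    // Target behaviour: f(x) = (x + 1)^2, given only as input/output pairs.
    let cases = [(1, 4), (2, 9), (3, 16), (5, 36)];
    let mut rng = Rng(42);
    let best = evolve(&cases, &mut rng);
    println!("best program: {:?}, error: {}", best, fitness(&best, &cases));
}
```

Real GP usually evolves expression trees with crossover as well as mutation, but the select/vary/repeat skeleton is the same.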
Started working on a Chip-8 emulator in Rust. I've wanted to learn Rust for a while now, but had trouble picking a project. Emulation hits all the topics I want to learn (GPU programming, threading, networking). Hopefully I'll learn enough to tackle an (S)NES emulator next.
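The core of a Chip-8 interpreter is pleasantly small: every instruction is a big-endian `u16`, and the top nibble selects the instruction family. A sketch of the decode step, covering only a few of the roughly 35 opcodes (the enum and names are mine, not from any particular emulator):

```rust
// Chip-8 opcode decoding sketch: split a u16 into nibble fields and
// match on the top nibble.

/// A few representative Chip-8 instructions.
#[derive(Debug, PartialEq)]
enum Instr {
    ClearScreen,         // 00E0
    Jump(u16),           // 1NNN: jump to address NNN
    SetRegister(u8, u8), // 6XNN: VX = NN
    AddToRegister(u8, u8), // 7XNN: VX += NN
    Unknown(u16),
}

fn decode(opcode: u16) -> Instr {
    let nnn = opcode & 0x0FFF;          // low 12 bits: address
    let x = ((opcode >> 8) & 0xF) as u8; // second nibble: register index
    let nn = (opcode & 0x00FF) as u8;    // low byte: immediate value
    match opcode >> 12 {
        0x0 if opcode == 0x00E0 => Instr::ClearScreen,
        0x1 => Instr::Jump(nnn),
        0x6 => Instr::SetRegister(x, nn),
        0x7 => Instr::AddToRegister(x, nn),
        _ => Instr::Unknown(opcode),
    }
}

fn main() {
    // Two consecutive ROM bytes form one big-endian opcode.
    let (hi, lo) = (0x6A_u8, 0x02_u8);
    let opcode = u16::from(hi) << 8 | u16::from(lo);
    println!("{:?}", decode(opcode)); // SetRegister(10, 2)
}
```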
I've been working on my OCR project for document information extraction again.
The process
I solved steps 1 and 2 using classical computer vision and a neural network, respectively; I'm now tackling step 3.
Background on OCR using NNs
OCR is generally divided into a four-step pipeline: rectification, feature extraction, context modeling, and decoding. Many methods skip one or more of these stages, though.
Rectification
The rectification step modifies the image by zooming in on the characters and straightening curved or slanted text. This is done using a neural network that predicts salient points along the text line and then warps the image using a so-called Thin Plate Spline transform to zoom in on these points. You can think of it as drawing an outline around the text and then discarding everything outside. Sometimes this step is repeated a few times to get the clearest possible image.
Feature Extraction
Feature extraction is also done using a deep neural network, usually a ResNet-{18,34,50}. This gives you a 3D volume of features -- usually 512 planes at 1/8th or 1/16th of the resolution of the original input image.
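The resulting shape is easy to work out by hand; a back-of-envelope sketch, assuming a backbone with a total stride of 8 and the typical 512 planes mentioned above:

```rust
// Shape of the feature volume: a backbone with total stride s turns an
// H x W image into a planes x (H/s) x (W/s) volume.

fn feature_shape(h: usize, w: usize, stride: usize, planes: usize) -> (usize, usize, usize) {
    (planes, h / stride, w / stride)
}

fn main() {
    // e.g. a 32 x 256 text-line crop through a stride-8 backbone:
    let (c, h, w) = feature_shape(32, 256, 8, 512);
    println!("{c} planes of {h} x {w}"); // 512 planes of 4 x 32
}
```

Each of the 32 remaining columns then corresponds to a vertical slice of the original text line, which is what the later stages consume.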
Context Modeling
Since language is very contextual, you want to take context into account. For example, the bigram "qu" is much more likely than "qH" or something. Traditionally this has been done using a bidirectional recurrent neural network that reads the image columns sequentially, forwards and backwards, and then combines the result.
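A toy illustration of why that context is worth modeling: even bigram counts from a tiny corpus make "qu" far more plausible than "qh" (the corpus here is obviously made up):

```rust
// Count character bigrams in a corpus; a stand-in for the statistical
// regularities a recurrent context model learns implicitly.
use std::collections::HashMap;

fn bigram_counts(text: &str) -> HashMap<(char, char), u32> {
    let chars: Vec<char> = text.chars().collect();
    let mut counts = HashMap::new();
    for pair in chars.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let corpus = "the quick brown fox quietly quits the quiz";
    let counts = bigram_counts(corpus);
    let qu = counts.get(&('q', 'u')).copied().unwrap_or(0);
    let qh = counts.get(&('q', 'h')).copied().unwrap_or(0);
    println!("qu: {qu}, qh: {qh}"); // qu: 4, qh: 0
}
```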
Decoding
The decoding stage takes the image columns and the context and gives you the actual character. This is either done using Connectionist Temporal Classification, which is a dynamic programming approach to determining the most probable sequence, or using an Attention mechanism that focuses on a given set of image columns, one at a time, while producing the characters.
CTC is easier to train, but attention mechanisms tend to be slightly more accurate. There are also papers that combine the two.
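The simplest CTC decoder, greedy best-path decoding, fits in a few lines: take the argmax class per time step, collapse repeats, then drop the blank symbol. A sketch of that idea (not a full beam search, and the alphabet indexing convention is my own):

```rust
// Greedy (best-path) CTC decoding over per-column class probabilities.

const BLANK: usize = 0; // class 0 reserved for the CTC blank symbol

/// `probs[t][c]` is the probability of class `c` at image column `t`.
/// Class c > 0 maps to `alphabet[c - 1]`.
fn ctc_greedy_decode(probs: &[Vec<f32>], alphabet: &[char]) -> String {
    let mut prev = usize::MAX;
    let mut out = String::new();
    for frame in probs {
        // argmax over classes for this time step
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        // collapse repeated classes, skip blanks
        if best != prev && best != BLANK {
            out.push(alphabet[best - 1]);
        }
        prev = best;
    }
    out
}

fn main() {
    // alphabet: class 1 = 'h', class 2 = 'i'
    let alphabet = ['h', 'i'];
    // per-column distributions over [blank, 'h', 'i']
    let probs = vec![
        vec![0.1, 0.8, 0.1], // 'h'
        vec![0.1, 0.7, 0.2], // 'h' again -> collapsed
        vec![0.8, 0.1, 0.1], // blank
        vec![0.1, 0.1, 0.8], // 'i'
    ];
    println!("{}", ctc_greedy_decode(&probs, &alphabet)); // "hi"
}
```

The collapse-then-remove-blank rule is also why CTC can emit doubled letters like "ll": the blank between two identical argmaxes keeps them from collapsing into one.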
The elephant in the room
I have a working neural network architecture for OCR, but training neural networks to do OCR requires an incredible amount of training data. In order to get good results, most publications train on 10-20 million synthetic words or real-world text instances, several times over.
Unfortunately I can't use these training datasets, because they tend to be licensed for research-use only.
A way forward?
So how do I get the training data I need?
I can hand-label a couple hundred images, but that won't be enough. However, I think I can use semi-supervised learning to utilize unlabeled images.
There are basically two applicable kinds of semi-supervised learning: pseudo-labeling and consistency regularization methods.
Pseudo-labeling uses an initial network trained on labeled data to recognize images from a much larger, unlabeled set. You then (basically; this is an oversimplification) add the images it recognizes with 85%+ confidence to your training set, and train the next neural network, which is then again used to recognize the unlabeled images, bootstrapping the process.
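Schematically, one pseudo-labeling round is just a confidence filter over the current model's predictions; everything here (the `Prediction` struct, the 0.85 cutoff) is illustrative rather than from any real training framework:

```rust
// One pseudo-labeling round: predictions on unlabeled data that clear a
// confidence threshold get promoted into the next training set.

#[derive(Clone, Debug, PartialEq)]
struct Prediction {
    image_id: u32,
    text: String,    // what the current model read in the image
    confidence: f32, // the model's confidence in that reading
}

fn select_pseudo_labels(preds: &[Prediction], threshold: f32) -> Vec<Prediction> {
    preds
        .iter()
        .filter(|p| p.confidence >= threshold)
        .cloned()
        .collect()
}

fn main() {
    let preds = vec![
        Prediction { image_id: 1, text: "invoice".into(), confidence: 0.97 },
        Prediction { image_id: 2, text: "1nvo1ce".into(), confidence: 0.41 },
        Prediction { image_id: 3, text: "total".into(), confidence: 0.88 },
    ];
    // Confident predictions become labels for the next training round;
    // the rest stay in the unlabeled pool for a later, better model.
    let pseudo = select_pseudo_labels(&preds, 0.85);
    println!("{} new pseudo-labels", pseudo.len()); // 2 new pseudo-labels
}
```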
Consistency regularization methods augment the input image by e.g. rotating or scaling it and then train the neural network to be consistent, so that it makes the same prediction for unaugmented and augmented images. This makes the network robust to noise and perturbations and helps it learn more discriminative features.
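The training objective behind that is simple to state: penalize disagreement between the model's outputs on an image and on its augmented copy. A minimal sketch, with a stand-in closure where the real network would go:

```rust
// Schematic consistency-regularization objective: mean squared
// difference between the model's outputs on an input and on an
// augmented copy of it.

fn consistency_loss(p: &[f32], q: &[f32]) -> f32 {
    assert_eq!(p.len(), q.len());
    let n = p.len() as f32;
    p.iter().zip(q).map(|(a, b)| (a - b).powi(2)).sum::<f32>() / n
}

fn main() {
    // Stand-in "model": any function from pixels to output scores.
    let model = |pixels: &[f32]| -> Vec<f32> {
        vec![pixels.iter().sum::<f32>(), pixels.len() as f32]
    };
    let image = vec![0.2, 0.4, 0.6];
    // A simple augmentation, e.g. a brightness scaling.
    let augmented: Vec<f32> = image.iter().map(|v| v * 1.1).collect();
    // During training this term is added to the supervised loss and
    // pushes the model toward augmentation-invariant predictions.
    let loss = consistency_loss(&model(&image), &model(&augmented));
    println!("consistency loss: {loss:.4}");
}
```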
I've only found one instance of semi-supervised text recognition in the literature, but the results they show seem promising, so I'll give it a try to see if it works. I still may not have enough data, but at least then I will know it doesn't work.
I have been trying to write a simple (in my mind) Rust CLI application and I'm a bit baffled with how to deal with args. I've read through the official book's take on args and error-handling, but it's left me a bit ❓😕❓. Maybe I need to read through it a few more times...
Very nice! I'll definitely look at that in the future. For now, though, I kinda wanna do the mental work of trying to do that on my own. Mostly, I'm hoping to get a better understanding of Rust idioms, especially when it comes to error-handling. My hope is that once I have a good idea of error-handling, I can take an "error first" approach to programs.
I think I just need to do more reading on how to work with `Result<T, E>`.
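For what it's worth, the pattern that made it click for me: keep parsing in a function that returns `Result`, let `?` propagate failures upward, and match once in `main`. A sketch with a made-up `AppError` type, not from any real crate:

```rust
// "Error first" CLI argument handling: a fallible parse function that
// returns Result, with ? doing the early returns.

use std::fmt;

#[derive(Debug, PartialEq)]
enum AppError {
    MissingArg(&'static str),
    NotANumber(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::MissingArg(name) => write!(f, "missing argument: {name}"),
            AppError::NotANumber(s) => write!(f, "not a number: {s}"),
        }
    }
}

/// Parse a count from raw args; ok_or turns the Option from next()
/// into a Result, and ? propagates either error upward.
fn parse_count(mut args: impl Iterator<Item = String>) -> Result<u32, AppError> {
    let raw = args.next().ok_or(AppError::MissingArg("count"))?;
    raw.parse().map_err(|_| AppError::NotANumber(raw))
}

fn main() {
    // Skip argv[0], then hand the rest to the fallible parser.
    match parse_count(std::env::args().skip(1)) {
        Ok(n) => println!("count = {n}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

The nice part is that `main` is the only place that decides what an error *looks like* to the user; everything below it just returns `Result` and lets `?` do the plumbing.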