Are any of you AI gurus?
As per subject really.
I'm creating a project with the CEO at work and it's going to need some serious AI. I'm happy to speak about it here and take advice and tips on direction and resources. I'm definitely going to be hiring real human resources to get this going, though.
The project is a masters library of video. Anyone that has seen me post before might know I run a server of roughly 10k videos, all company IP, of TV shows from over the years. What I'd like to do is point AI at the video library and have it build out a serious database of information, or at least a sidecar JSON of information next to every video. Some things I really don't need AI for and can easily generate, such as video length, type, audio channels, codec, bitrate, etc. All of that can be gleaned with the usual suspect tools such as mediainfo or ffprobe. What I'd like AI to do is scan for faces and identify celebrities by name (if possible), and log 5-second sections of video containing railways, trees, cars, etc., to build out a database of the video we have. It would also need to log time codes of where these clips are and for how long.
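For the non-AI bits, a quick sketch like this is roughly what I have in mind for generating the sidecar files (the library path and file extension are just placeholders):

```python
import json
import subprocess
from pathlib import Path

def probe(video_path: Path) -> dict:
    """Run ffprobe and return the parsed stream/format metadata as a dict."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        str(video_path),
    ]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

def write_sidecar(video_path: Path) -> None:
    """Write <video>.json next to the video with the basic technical metadata."""
    info = probe(video_path)
    sidecar = video_path.with_name(video_path.name + ".json")
    sidecar.write_text(json.dumps(info, indent=2))

if __name__ == "__main__":
    for mxf in Path("/mnt/masters").rglob("*.mxf"):  # hypothetical library root
        write_sidecar(mxf)
```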
I know it sounds like a crazy project, but it will be fun and possibly the start of a new product, which I would open source. Don't tell my boss that, but if we're using open source models and free shit to create this awesome beast, I'd want to give back to the community.
So, ideas on where I would find people interested in and talented with this sort of thing? Any thoughts on what else you think I should target to capture from a massive video library? I will be grabbing the clock card info too, so OCR is a must.
Soooo much to think about. Project plan coming up.
The project sounds interesting, so I'll chime in with what I think may be helpful, though I lack direct experience with some of these. Take it as a springboard for further research.
Analyzing Faces in Videos
Finding (frontal) faces in still images, and hence videos, is relatively simple using Haar cascades (also known as the Viola-Jones method).
Knowing whether an image or video frame contains a face, and how many, may already be valuable on its own. You can use the detected positions to crop the image and then use something more sophisticated (e.g. a neural network) to actually find out who it is.
If you have reference images, you might be able to leverage research (or existing open source projects) that do face verification to check whether the person on screen matches a reference, though I haven't done anything in this space yet.
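For illustration, a minimal sketch of sampling frames and running OpenCV's bundled frontal-face cascade (the sampling interval and path are arbitrary choices):

```python
import cv2

# OpenCV ships the Viola-Jones frontal face cascade with the library.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def faces_in_video(path: str, every_n_seconds: float = 5.0):
    """Yield (timestamp_seconds, face_bounding_boxes) for sampled frames."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = int(fps * every_n_seconds)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(boxes):
                yield idx / fps, boxes  # boxes are (x, y, w, h) you can crop with
        idx += 1
    cap.release()
```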
General Object Detection
There are a bunch of available object-detection methods (YOLO, R-CNN), but I don't know too much about the recent developments here. These tend to be able to identify a lot of different object classes (cars, boats, trees, etc.).
Audio Transcription
You'll probably want to extract what is said in the videos -- I was very impressed with OpenAI's Whisper speech recognition. There is an open source project, whisper.cpp, that you can also use to run the models.
In my testing I've had slight problems with hallucinations on the smaller models, but for the most part, it has been fairly reliable. You can output timestamps and the recognized text; it should certainly be good enough for a search engine or the like.
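A minimal sketch with the reference openai-whisper package (model size and filename are placeholders; whisper.cpp gives you much the same from the command line):

```python
import json
import whisper  # pip install openai-whisper; ffmpeg must be on PATH

model = whisper.load_model("small")           # smaller models are faster but hallucinate more
result = model.transcribe("episode_001.mxf")  # whisper hands the file to ffmpeg to pull the audio

# Each segment carries start/end times in seconds plus the recognized text.
segments = [
    {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
    for s in result["segments"]
]
print(json.dumps(segments[:5], indent=2))
```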
OCR
I'm aware of many older attempts at doing video OCR. They generally try to use adaptive binarization to separate foreground text from the background and then run a stock OCR engine like Tesseract. However, I'm sure there are also newer deep learning-based methods for this.
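As a rough illustration of that classic approach (the threshold parameters are guesses and would need tuning per source):

```python
import cv2
import pytesseract  # requires the tesseract binary to be installed

def ocr_frame(frame_bgr):
    """Binarize a frame and run Tesseract on it; returns the raw recognized text."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Adaptive thresholding helps separate burned-in text from a busy background.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
    )
    return pytesseract.image_to_string(binary)
```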
LLaVA
One interesting thing that has appeared lately is LLaVA, which is a language model you can interrogate about the contents of a specific image; the results range from mildly infuriating to very impressive -- it can do a lot of OCR by itself and recognizes some celebrities, for example.
There is an extension that applies it to video, but I have not looked into that yet.
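A sketch of what interrogating a single frame can look like, assuming the Hugging Face transformers port of LLaVA 1.5 (the model ID and prompt template come from that port and may differ between releases):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # needs accelerate + a decent GPU
)

image = Image.open("frame_00123.png")  # a frame sampled from one of the videos
prompt = "USER: <image>\nDescribe this scene and read out any on-screen text. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```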
Labeling Functions
This may not be useful, so I'm listing it last. However, given the scale of your project, it may be handy to let the computer combine 'weak' labeling signals into a stronger classifier. You could then classify the different scenes with greater ease (e.g. indoor vs. outdoor).
I've worked with Snorkel on toy projects, but I don't see a reason why it wouldn't work for something like your project.
The idea behind Snorkel is that you combine many weak and possibly conflicting signals into a strong one. For example, you could combine a simple detector that checks if the audio transcript contains a word with estimates from a YOLO object detector and/or with the scene description from something like LLaVA or even the opinions of different people who may manually label a subset of the data.
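A toy sketch of that idea with Snorkel (the dataframe columns and labeling rules are made up for illustration):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, INDOOR, OUTDOOR = -1, 0, 1

# Hypothetical per-scene dataframe built from the other pipeline stages,
# with columns like "transcript", "yolo_labels", and "caption".
@labeling_function()
def lf_transcript_mentions_street(row):
    return OUTDOOR if "street" in row.transcript.lower() else ABSTAIN

@labeling_function()
def lf_yolo_saw_a_car(row):
    return OUTDOOR if "car" in row.yolo_labels else ABSTAIN

@labeling_function()
def lf_caption_mentions_studio(row):
    return INDOOR if "studio" in row.caption.lower() else ABSTAIN

def weak_label(df: pd.DataFrame):
    lfs = [lf_transcript_mentions_street, lf_yolo_saw_a_car, lf_caption_mentions_studio]
    L = PandasLFApplier(lfs=lfs).apply(df=df)          # matrix of votes, -1 = abstain
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train=L, n_epochs=500, seed=42)  # learns to weigh and denoise the votes
    return label_model.predict(L)                      # one indoor/outdoor label per scene (-1 where it abstains)
```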
Nice writeup!
YOLOv8 goes strong and is trivial to use on video.
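Something like this is roughly all it takes (model size and filename are placeholders):

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")  # nano model; larger checkpoints trade speed for accuracy

# stream=True iterates results frame by frame instead of loading the whole video.
for frame_idx, result in enumerate(model("some_episode.mp4", stream=True)):
    names = [model.names[int(c)] for c in result.boxes.cls]
    if names:
        print(frame_idx, names)
```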
Thank you for all of this. Lots to digest and research.
OP said his video IP is TV shows. Hopefully, he has subtitle files for them.
Otherwise, it might be worth paying for professional subtitles; you can get those pretty cheap. I'm not sure exactly what YouTube uses for its auto-subtitles, but especially in scenes with music and background sounds, it's still pretty much crap.
For some tasks, AI is not the right tool yet.
Machine transcription is a thing you can get at okay quality (especially as a company), but the types of errors you see in YouTube's automatic subtitles are still an issue even with top-of-the-line systems. Those issues can be mitigated (but not eliminated) through clever postprocessing, but I can't give away company secrets. In a lot of applications (essentially, the more video you have to transcribe and the faster you need the transcription), automatic transcription is simply the only feasible option, since human transcription is inevitably slower and way more expensive. Don't underestimate the cost of professional subtitles for a large library!
Having pre-existing subtitles would definitely be the best-case scenario for a dataset like this, though. The name recognition wouldn't be trivial regardless, but names are a common failure point for machine transcription, so it would definitely be playing the game on hard mode on that front.
A data scientist with experience working with AI image recognition models is what you're looking for. There's a variety of open source models out there that are considered very good; frankly, it doesn't matter a ton which one you choose for this kind of application. Identifying things like trees, cars, railways, etc. should be pretty much out of the box on a large model. How accurate that classification is, however, especially if you're dividing a movie up into 5s increments... well, I suppose the question is how much you want to scour these to ensure they are accurate and how many false positives are acceptable. These are the kinds of questions to explore with your scientist, as models can be tuned and tagging can be turned on only once confidence reaches a certain threshold.
I'm curious, however, as to why you'd want it to identify celebrities via AI? There are databases of movies and actors involved available online and for purchase... why not tap into those? It would be much simpler to leverage existing databases for this kind of thing.
I'm betting it's a giant collection of stuff like cutting-room-floor editing footage, b-roll, or TV archives with live interviews and that sort of thing, rather than publicly documented metadata stuff.
Key phrase from OP:
That's not exactly stuff that's gonna appear in IMDB.
How right you are. I work for a group of television production companies and we have a LOT of video from the past 30 years.
As you guessed, we have master videos (broadcast) and we have a lot of rushes (shot footage that may or may not have made it into the show), which is often referred to as b-roll. That can be resold. We have footage of celebrities doing intimate and candid interviews, along with concert and music events from bands that made it huge but never aired. Having all of this indexed with metadata would be amazing, but humans doing it would take real-time viewing plus pausing and documenting. Thirty years of shows, which is tens of thousands of hours, means we're past humans doing it and need to ask computers instead.
The funny thing is, if you released this treasure trove to the public for free, there are millions of nerdy archivists and film geeks that would pore over this and tag it with more minute detail than any AI, for free, just to see what they could find.
Not to say you (the company, that is) would be wise to do so, I just like what it says about the human character, and how a little curiosity can go a long way.
I get that. However, it's all company IP and protected as much as possible. That said, half of it is on YouTube, and a large portion of Legal's time is spent updating Content ID for copyright claims or takedowns. It's a losing battle (he says, as a massive pirate).
I'm starting to think I could use an LLM for a large portion of this and link it to data from TheTVDB. They have more accurate show data than we do. That would give us a headstart.
The simplest approach would be to treat these as object detection tasks on individual frames from the videos. Sample a frame from each video at 1- or 2-second intervals and then run an object detection model on it. DETR from Facebook AI Research is a good choice; you can grab it from HuggingFace.co.
The outputs of these model inferences will be bounding boxes with a class label and confidence value. Ex: this box is a person with 80% confidence or this box is a tree with 92% confidence. Afterwards you can create timestamps and clips of videos based on the objects and bounding boxes that you detected.
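For reference, a minimal sketch of that flow with the facebook/detr-resnet-50 checkpoint from HuggingFace (the frame filename and threshold are placeholders):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("frame_at_00h12m30s.png")  # a frame sampled every 1-2 seconds
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only detections above an 80% confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.8
)[0]
for label, score, box in zip(detections["labels"], detections["scores"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```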
Identities of celebrities might be trickier to detect. AFAIK there aren't any publicly available celebrity ID models, but you can mine data from Wikipedia, IMDB, and other places and train a model to detect the people that you want.
I should probably have phrased it better. We know the people in our shows, and there would be a limited number of them. It would be great to be able to identify when they appear and what they say in each show. Maybe voice recognition would be more accurate, and Whisper already does transcripts for us...
This is why we're currently scoping the project.
If you have the list of people, and you can get a good sample of headshots, candid shots, etc. for each, you can use one of many classification algorithms. It would be computationally intensive and require quite a lot of storage, but it is actually fairly straightforward using entry-level toolkits.
Take all the sample photos, convert them to the same resolution and to grayscale, and assign them a sequential subject ID that is unique for each guest/subject. E.g., if you had ten pictures of one person, they would all have the same subject ID. Next, vectorize them so that instead of an image file you have a single vector of numerical values representing the grayscale pixels, where the vector is a concatenation of each row of pixels. Finally, append the subject ID to the beginning of the vector and output the vector as a row in a CSV file. There will be a row for each image, where the first column is the person's ID and the rest of the values are the pixel values. The more samples, the more accurate the training.
Now, do the same thing with all of the stills output by the object detection tools which are suspected to be human. This is your testing data.
The CSV from the earlier steps is your training data and training target. The first column is the target, and the rest are the training data. Using a tool like Python's scikit-learn you can easily train a classification model on this training set, and then run it against the testing data. This will classify each still from the videos as one of your guests from the training data.
I would suggest sticking with k-nearest neighbors since it is fast and lets you spot-check the output quickly, but eventually something like a support vector machine will probably get you an extra couple of percent accuracy.
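A minimal sketch along those lines (it skips the CSV intermediate and builds the arrays in memory; the image size and folder layout are just assumptions):

```python
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.neighbors import KNeighborsClassifier

SIZE = (64, 64)  # common resolution for every sample

def image_to_row(path: Path) -> np.ndarray:
    """Grayscale, resize, and flatten an image into one vector of pixel values."""
    img = Image.open(path).convert("L").resize(SIZE)
    return np.asarray(img, dtype=np.float32).ravel()

def load_labeled(root: Path):
    """Assumes training images live in one folder per subject ID, e.g. train/0/*.jpg."""
    X, y = [], []
    for subject_dir in sorted(root.iterdir()):
        for img_path in subject_dir.glob("*.jpg"):
            X.append(image_to_row(img_path))
            y.append(int(subject_dir.name))
    return np.vstack(X), np.array(y)

X_train, y_train = load_labeled(Path("train"))
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify the stills the object detector flagged as people.
stills = sorted(Path("detected_people").glob("*.jpg"))
predictions = clf.predict(np.vstack([image_to_row(p) for p in stills]))
```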
If you look at the SciKit-learn docs you will see their tutorials for working with public data sets, but this doesn't show you how to work with your own data. I'm on my phone, but if you want later I can DM you code to do this, or post it here.
Edit: this approach is supervised learning where you are creating a labeled training data set. There are unsupervised methods that are more complex. If you want to go that route, you will likely need to hire an expert.
This overlaps a bit with a project I'm working on for relatively large scale automated processing and annotation of BJJ instructional videos.
You could use AWS Rekognition's DetectFaces at a cost of about $0.72/hr of footage if you go in 5-second intervals ($0.001 per image). It's also quite easy to add your own faces into the database. DetectLabels will also give you the other stuff you were looking for, e.g. trains, trees, cars, etc.
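A rough sketch of what those calls look like with boto3 (region and filenames are placeholders; SearchFacesByImage against your own face collection is the piece that handles "who is this"):

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")  # region is an assumption

with open("frame_00150.jpg", "rb") as f:  # a frame sampled every 5 seconds
    image_bytes = f.read()

# General object/scene labels (trains, trees, cars, ...).
labels = rekognition.detect_labels(
    Image={"Bytes": image_bytes}, MaxLabels=20, MinConfidence=80
)
print([l["Name"] for l in labels["Labels"]])

# Face detection; feed the boxes into your own collection for identification.
faces = rekognition.detect_faces(Image={"Bytes": image_bytes}, Attributes=["DEFAULT"])
print(len(faces["FaceDetails"]), "face(s) found")
```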
Whisper is pretty good with transcripts, but it does tend to hallucinate a bit when faced with unusual terminology or heavy accents. The most fascinating thing I've seen is it detected a person with a thick Portuguese accent as speaking Portuguese, and transcribed it as the (accurate!) Portuguese translation of what they were saying in English. Once it has gotten a particular word wrong once in a video, it will make that mistake every single time afterwards because of how it uses past context, so I've got a whole queuing system for identifying mistakes and semi-automatically searching and replacing across the rest of the transcripts.
Interesting. However, does this mean I'd have to convert the TBs of video before upload? That in itself is massively CPU/GPU intensive, along with a lot more storage we're going to need.
Oh geez. That's the next thing to think about. Depending on the process used to look at the video, does the AI need a specific format to process, or will it read master formats, which are typically AS-11? Otherwise it'll get more complex, daisy-chaining an FFmpeg read and conversion and piping it through the model.
My brain is slowly churning this over. So much research required.
Touched on this a little in my longer comment below, but on formats specifically: if you're diving into this, you're going to end up getting intimately familiar with pytorch tensors! Big multilayered grids of numbers in memory, basically.
The good news is that decoding video into tensors is something that a lot of researchers need to do, so there's robust tooling and libraries out there for it already - and as far as I remember AS-11 uses H.264-compatible encoding for the video tracks, so you've got the most mature and well-supported codec implementations too. The bad news is that any meaningful length of video in tensor format is fucking massive (literally just raw numeric data per frame, per pixel, and per colour channel, with no interframe compression so that it can be processed efficiently - often in 32-bit float rather than uint8, too), to the extent that you're not realistically going to be storing it anywhere anyway.
You'll generally want an architecture that takes your existing compressed video chunk by chunk, streams that through a decoder, applies whatever analysis to each frame or smallish set of frames while it's sitting there in VRAM, and then saves just the resulting metadata (text transcript, scene boundary, object tags, etc.) to disk before discarding that batch of ~20-100 frames and moving along to the next.
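A rough sketch of that shape using PyAV for the decode side (the analyse() function is just a stand-in for whatever model you actually run per batch):

```python
import av      # pip install av -- ffmpeg-backed decoding
import torch

BATCH = 64  # frames held in memory/VRAM at once

def analyse(batch: torch.Tensor) -> list[dict]:
    """Placeholder for the real model(s); returns a metadata dict per frame."""
    return [{"mean_brightness": float(f.float().mean())} for f in batch]

def process(path: str) -> list[dict]:
    container = av.open(path)
    stream = container.streams.video[0]
    buffer, metadata = [], []
    for frame in container.decode(stream):
        # Decode to an RGB tensor only for as long as this batch is being analysed.
        rgb = torch.from_numpy(frame.to_ndarray(format="rgb24"))
        buffer.append((float(frame.time or 0.0), rgb))
        if len(buffer) == BATCH:
            times, frames = zip(*buffer)
            results = analyse(torch.stack(frames))
            metadata += [{"t": t, **r} for t, r in zip(times, results)]
            buffer.clear()  # discard the raw tensors, keep only the metadata
    if buffer:
        times, frames = zip(*buffer)
        metadata += [{"t": t, **r} for t, r in zip(times, analyse(torch.stack(frames)))]
    container.close()
    return metadata
```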
Hey, I know this one! Well, bits of it anyway - video synthesis is my field nowadays, and prior to that streaming content delivery, so I spend a decent chunk of any given week dealing with pipelines to analyse/label/preprocess large quantities of content.
I think your intuition that the devil will be in the details here is spot on. At a base level "all" you need to do is decode->scene split->audio transcription->feature extraction->text parsing, but if you've got say a billion frames to process (5,000 - 10,000 hours of content depending on frame rate), you're in a world where 5ms to decompress each one and let the system make a call on whether to drop/keep it adds two months of compute overhead. Which is fine, you just run 30 cloud instances for two days and that part's done, but I like it as an example of how even just opening the files becomes a nontrivial job to think about. Things like converting NV12 to planar RGB without tanking overall performance also get interesting when you're locked to hardware decoders (use VALI, not decord or torchaudio - took a month of my life to nail down all the nuances around that one!), and you start needing to consider shifting the tensors over the PCIe bus as a meaningful part of your overhead if you're not going to use them.
At a high level, MERLOT or mPLUG-2 do a lot of what you're looking for out of the box - they're all-in-one multimodal models that operate across the visual, audio, and text content of video; the latter is closer to the cutting edge in terms of capability, but the former does have a fairly nice demo that should give you an idea of what's possible even with slightly older tech.
A more custom workflow could look like TransNet for scene splitting, BLIP2 on samples of frames for overall captions, Segment Anything for individual object extraction, WhisperX on the audio, and then dump the whole lot into Mixtral or FLAN-T5 for parsing into a human-usable search/query system. There are a ton of other decent options for any of those specific jobs, though, so don't take any of that as gospel - just a not-terrible suggestion for a set of tools that could make for a reasonable starting point!
For 1,000 hours of content, pretty much any approach would be fine. For a million hours, you'd need a serious budget, a research team, or ideally both. For the scale it sounds like you're at (large, but "production studio" large rather than "every user on the internet filming their breakfast" large), you can do this in reasonable time with one or two good people and a low five-figure hardware budget.
Very happy to go deeper into this if you've got specific questions, but sadly the one I'm least equipped to answer is where you've got the best shot of finding someone who has both the time to lead tech on a project like this and existing knowledge of the issues to sidestep when doing it at scale.
What's your budget and timeline for the project?
Budget: unknown, this is all scoping and reaching out to find out what the market looks like. Will we need to hire a specialist? A small company? Consultant? This is the main reason for throwing it out on Tildes, home of the bright and intelligent, rather than other platforms where I might be met by many snake oil salesmen.
Timeline: could be all year. There's no specific deadline to get this off the ground at this time. We're compiling the masters library by hand right now because a lot of it was never indexed correctly. It's finally landed with the IT department after going through three others who absolutely made a hash of it. Who creates thousands of videos but doesn't log any information about them and then scatters them all over the place willy-nilly?
So you want to have:
Yes, pretty much. However, if we can get the data out to something sensible, the latter part is easy enough to build as an old-school DB. We could go as simple as PHP/MariaDB at that point with a sprinkling of JS.
Getting the metadata is the fun part.
I have no particular expertise here, but the first thing I'd ask is how good your search engine is and how easily it can be improved. Generating transcripts might be a good way of gathering data for the search engine.