Are any of you AI gurus? - ~comp

[4]

DataWraith

February 12, 2024

Link

The project sounds interesting, so I'll chime in with what I think may be helpful, though I lack direct experience with some of these. Take it as a springboard for further research. Analyzing...

Exemplary

The project sounds interesting, so I'll chime in with what I think may be helpful, though I lack direct experience with some of these. Take it as a springboard for further research.

Analyzing faces in Videos

Finding (frontal) faces in still images and hence videos is relatively simple using Haar Cascades (also known as Viola-Jones method).

Knowing whether an image or video frame contains a face, and how many, may already be valuable on its own. You can use the detected positions to crop the image and then use something more sophisticated (e.g. a neural network) to actually find out who it is.

If you have reference images, you might be able to leverage research (or already available open source projects) that do Face verification to check if the person on screen matches a reference, though I haven't done anything in this space yet.

General Object-Detection

There are a bunch of available object-detection methods (YOLO, RCNN), but I don't know too much about the recent developments here. These tend to be able to identify a lot of different object classes (cars, boats, trees, etc.).

Audio Transcription

You'll probably want to extract what is said in the videos -- I was very impressed with OpenAI's Whisper speech recognition. There is an open source project, Whisper.cpp that you can also use to run the models.

In my testing I've had slight problems with hallucinations on the smaller models, but for the most part, it has been fairly reliable. You can output timestamps and the recognized text; it should certainly be good enough for a search engine or the like.

OCR

I'm aware of many older attempts at doing video OCR. They generally try to use adaptive binarization to separate foreground text from background and then run a stock OCR engine like Tesseract. However, I'm sure there are also newer deep learning-based methods for this.

LLaVA

One interesting thing that has appeared lately is LLaVA, which is a language model you can interrogate about the contents of a specific image; the results range from mildly infuriating to very impressive -- it can do a lot of OCR by itself and recognizes some celebrities, for example.

There is an extension that applies it to video, but I have not looked into that yet.

Labeling Functions

This may not be useful, so I'm listing it last. However, given the scale of your project, it may be handy to let the computer combine 'weak' labeling signals into a stronger classifier. You could then classify the different scenes with greater ease (e.g. indoor vs. outdoor).

I've worked with Snorkel on toy-projects, but I don't see a reason why it wouldn't work for something like your project.

The idea behind Snorkel is that you combine many weak and possibly conflicting signals into a strong one. For example, you could combine a simple detector that checks if the audio transcript contains a word with estimates from a YOLO object detector and/or with the scene description from something like LLaVA or even the opinions of different people who may manually label a subset of the data.

9 votes

Minty
February 12, 2024
Link Parent
Nice writeup! YOLOv8 goes strong and is trivial to use on video.

Nice writeup!

There are a bunch of available object-detection methods (YOLO, RCNN), but I don't know too much about the recent developments here.

YOLOv8 goes strong and is trivial to use on video.

5 votes
[2]
pbmonster
February 13, 2024
Link Parent
OP said his video IP is TV shows. Hopefully, he has subtitle files for them. Otherwise, it might be worth paying for professional subtitles, you can get those pretty cheap. Because I'm not sure...

You'll probably want to extract what is said in the videos

OP said his video IP is TV shows. Hopefully, he has subtitle files for them.

Otherwise, it might be worth paying for professional subtitles, you can get those pretty cheap. Because I'm not sure what exactly youtube is using for its auto-subtitles, but especially in scenes with music and background sounds, its still pretty much crap.

For some tasks, AI is not the right tool yet.
1. sparksbet
  February 13, 2024
  Link Parent
  Machine transcription is a thing you can get at an okay quality (especially as a company), but the types of errors you see in YouTube's automatic subtitles are still an issue even with top of the...
  
  Machine transcription is a thing you can get at an okay quality (especially as a company), but the types of errors you see in YouTube's automatic subtitles are still an issue even with top of the line. Those issues can be mitigated (but not eliminated) through clever postprocessing, but I can't be giving away company secrets. In a lot of applications (essentially the more video you have to transcribe and the faster you need the transcription), automatic transcription is simply the only feasible option, since human transcription is inevitably slower and way more expensive. Don't underestimate the cost of professional subtitles for a large library!
  
  Having previously-existing subtitles would definitely be the best case scenario for a dataset like this though. The name recognition wouldn't be trivial regardless but names are a common failure point for machine transcription so it would definitely be playing the game on hard mode on that front.
  
  1 vote
Comment removed by site admin
Link Parent

[3]

Gaywallet

February 12, 2024

Link

A data scientist with experience working with AI image recognition models is what you're looking for. There's a variety of models out there which are open source and considered very good models,...

A data scientist with experience working with AI image recognition models is what you're looking for. There's a variety of models out there which are open source and considered very good models, frankly it doesn't matter a ton which one you choose for this kind of application. Identifying things like trees, cars, railway, etc. should be pretty much out of the box on a large model. How accurate that classification is, however, especially if you're dividing a movie up into 5s increments... well I suppose the question is how much do you want to scour these to ensure they are accurate and how many false positives are acceptable? These are the kinds of questions to explore with your scientist as models can be tuned and tagging can be only turned on once confidence reaches a certain threshold.

I'm curious, however, as to why you'd want it to identify celebrities via AI? There are databases of movies and actors involved available online and for purchase... why not tap into those? It would be much simpler to leverage existing databases for this kind of thing.

13 votes

[2]
vord
February 12, 2024
Link Parent
I'm betting it's a giant collection of stuff like cutting-room-floor editing footage, b-roll, or TV archives with like live interviews or stuff like that. Rather than publicly documented metadata...

I'm betting it's a giant collection of stuff like cutting-room-floor editing footage, b-roll, or TV archives with like live interviews or stuff like that. Rather than publicly documented metadata stuff.

Key phrase from OP:

I will be grabbing the clock card info too so OCR is a must.

That's not exactly stuff that's gonna appear in IMDB.

9 votes
1. [2]
  
  Comment removed by site admin
  Link Parent
  1. vord
    February 13, 2024
    Link Parent
    The funny thing is, if you released this treasure trove to the public for free, there are millions of nerdy archivists and film geeks that would pour over this and tag it with more minute detail...
    
    The funny thing is, if you released this treasure trove to the public for free, there are millions of nerdy archivists and film geeks that would pour over this and tag it with more minute detail than any AI for free, just to see what they could find.
    
    Not to say you (the company that is) would be wise to do so, I just like what it says about the human character, and how a little curiousity can go a long way.
    
    3 votes
    
    Comment removed by site admin
    Link Parent

[2]

apolz

February 12, 2024

Link

The simplest approach would be to treat these as object detection tasks on individual frames from the videos. Sample a frame from each video at 1 second or 2 second intervals and then run an...

What I'd like AI to do is scan for faces and identify names of celebs (if possible), sections of video at 5 seconds in length containing railway, trees, cars, etc logged to build out a database of video that we have. It would also need to log time codes of where these clips are and for how long.

The simplest approach would be to treat these as object detection tasks on individual frames from the videos. Sample a frame from each video at 1 second or 2 second intervals and then run an object detection model on it. DETR from Facebook AI labs is a good choice, you can grab it from HuggingFace.co.

The outputs of these model inferences will be bounding boxes with a class label and confidence value. Ex: this box is a person with 80% confidence or this box is a tree with 92% confidence. Afterwards you can create timestamps and clips of videos based on the objects and bounding boxes that you detected.

Identities of celebrities might be trickier to detect. AFAIK there aren't any publicly available celebrity ID models, but you can data-mine your own data from Wikipedia, IMDB, and other places and train a model to detect the people that you want.

7 votes

[2]

Comment removed by site admin
Link Parent
1. krellor
  February 13, 2024
  Link Parent
  If you have the list of people, and you can get a good sample of headshots, candid shots, etc for each, you can use one of many classification algorithms. It would be computationally intensive and...
  
  If you have the list of people, and you can get a good sample of headshots, candid shots, etc for each, you can use one of many classification algorithms. It would be computationally intensive and require quite a lot of storage, but it is actually fairly straight forward using entry level tool kits.
  
  Take all the sample photos and convert them all to the same resolution, convert them to grayscale, and assign them a sequential subject ID that is unique for each guest/subject. E.g., if you had ten pictures of one person, they would all have the same subject ID. Next, vectorize them so instead of an image file, you have a single vector of numerical values representing the grayscale pixels, where the vector is a concatenation of each row of pixels. Finally append the subject ID to the beginning of the vector, and output the vector as a row in a CSV file. There will be a row for each image where the first column is the person's ID, and the rest of the values are the pixel values. The more samples, the more accurate the training.
  
  Now, do the same thing with all of the stills output by the object detection tools which are suspected to be human. This is your testing data.
  
  This is your training data and training target. The first column is the target, and the rest are the training data. Using a tool like Python SciKit-learn you can easily train a classification model on this training set, and then run it against the testing data. This will classify each still from the videos as one of your guests from the training data.
  
  I would suggest staying with Kth Nearest Neighbor since it is fast and lets you spot check the output quickly, but eventually using something like a support vector machine model will probably get you an extra couple of % accuracy.
  
  If you look at the SciKit-learn docs you will see their tutorials for working with public data sets, but this doesn't show you how to work with your own data. I'm on my phone, but if you want later I can DM you code to do this, or post it here.
  
  Edit: this approach is supervised learning where you are creating a labeled training data set. There are unsupervised methods that are more complex. If you want to go that route, you will likely need to hire an expert.
  
  1 vote

[2]

unkz

February 12, 2024

Link

This overlaps a bit with a project I'm working on for relatively large scale automated processing and annotation of BJJ instructional videos. You could use AWS Rekognition's DetectFaces at a cost...

This overlaps a bit with a project I'm working on for relatively large scale automated processing and annotation of BJJ instructional videos.

You could use AWS Rekognition's DetectFaces at a cost of about $0.72/hr of footage if you go in 5 second intervals ($0.001 per image). It's also quite easy to add your own faces into the database. DetectLabels will also give you the other stuff you were looking for, eg. trains, trees, cars, etc.

Whisper is pretty good with transcripts, but it does tend to hallucinate a bit when faced with unusual terminology, or with heavy accents. The most fascinating thing I've seen is it detected a person with a thick Portuguese accent as speaking Portuguese, and transcribed it as the (accurate!) Portuguese translation of what they were saying in English. Once it has gotten a particular word wrong once in a video, it will make that mistake every single time afterwards because of how it uses past context, so I've got a whole queuing system for identifying mistakes and semi-automatically search and replacing across the rest of the transcripts.

4 votes

[2]

Comment removed by site admin
Link Parent
1. Greg
  February 13, 2024
  Link Parent
  Touched on this a little in my longer comment below, but on formats specifically: if you're diving into this, you're going to end up getting intimately familiar with pytorch tensors! Big...
  
  Touched on this a little in my longer comment below, but on formats specifically: if you're diving into this, you're going to end up getting intimately familiar with pytorch tensors! Big multilayered grids of numbers in memory, basically.
  
  The good news is that decoding video into tensors is something that a lot of researchers need to do, so there's robust tooling and libraries out there for it already - and as far as I remember AS-11 uses h.264 compatible encoding for the video tracks so you've got the most mature and well supported codec implementations too. The bad news is that any meaningful length of video in tensor format is fucking massive (literally just raw numeric data per frame, per pixel, and per colour channel, with no interframe compression so that it can be processed efficiently - often in a 32 bit float rather than uint8, too), to the extent that you're not realistically going to be storing it anywhere anyway.
  
  You'll generally want an architecture that takes your existing compressed video chunk by chunk, streams that through a decoder, applies whatever analysis to each frame or smallish set of frames while it's sitting there in VRAM, and then saves just the resulting metadata (text transcript, scene boundary, object tags, etc.) to disk before discarding that batch of ~20-100 frames and moving along to the next.
  
  1 vote

Greg

February 13, 2024

Link

Hey, I know this one! Well, bits of it anyway - video synthesis is my field nowadays, and prior to that streaming content delivery, so I spend a decent chunk of any given week dealing with...

Hey, I know this one! Well, bits of it anyway - video synthesis is my field nowadays, and prior to that streaming content delivery, so I spend a decent chunk of any given week dealing with pipelines to analyse/label/preprocess large quantities of content.

I think your intuition that the devil will be in the details here is spot on. At a base level "all" you need to do is decode->scene split->audio transcription->feature extraction->text parsing, but if you've got say a billion frames to process (5,000 - 10,000 hours of content depending on frame rate), you're in a world where 5ms to decompress each one and let the system make a call on whether to drop/keep it adds two months of compute overhead. Which is fine, you just run 30 cloud instances for two days and that part's done, but I like it as an example of how even just opening the files becomes a nontrivial job to think about. Things like converting NV12 to planar RGB without tanking overall performance also get interesting when you're locked to hardware decoders (use VALI, not decord or torchaudio - took a month of my life to nail down all the nuances around that one!), and you start needing to consider shifting the tensors over the PCIe bus as a meaningful part of your overhead if you're not going to use them.

At a high level, MERLOT or mPLUG-2 do a lot of what you're looking for out of the box - they're all-in-one multimodal models that operate across the visual, audio, and text content of video; the latter is closer to the cutting edge in terms of capability, but the former does have a fairly nice demo that should give you an idea of what's possible even with slightly older tech.

A more custom workflow could look like TransNet for scene splitting, BLIP2 on samples of frames for overall captions, Segment Anything for individual object extraction, WhisperX on the audio, and then dump the whole lot into Mixtral or FLAN-T5 for parsing into a human-usable search/query system. There are a ton of other decent options for any of those specific jobs, though, so don't take any of that as gospel - just a not-terrible suggestion for a set of tools that could make for a reasonable starting point!

For 1,000 hours of content, pretty much any approach would be fine. For a million hours, you'd need a serious budget, a research team, or ideally both. For the scale it sounds like you're in: large, but "production studio" large rather than "every user on the internet filming their breakfast" large, you can do this in reasonable time with one or two good people and a low five figure hardware budget.

Very happy to go deeper into this if you've got specific questions, but sadly the one I'm least equipped to answer is where you've got the best shot of finding someone who has both the time to lead tech on a project like this and existing knowledge of the issues to sidestep when doing it at scale.

4 votes

[2]

teaearlgraycold

February 12, 2024

Link

What's your budget and timeline for the project?

3 votes

[2]

Comment removed by site admin
Link Parent
1. teaearlgraycold
  February 13, 2024
  Link Parent
  So you want to have: An ingest process that takes your video library and produces per-asset metadata Some kind of application (web app?) to display everything to an end user?
  
  So you want to have:
  
  An ingest process that takes your video library and produces per-asset metadata
  
  Some kind of application (web app?) to display everything to an end user?
  1. Comment removed by site admin
    Link Parent

skybrian

February 12, 2024

Link

I have no particular expertise here, but the first thing I'd ask is how good your search engine is and how easily it can be improved. Generating transcripts might be a good way of gathering data...

I have no particular expertise here, but the first thing I'd ask is how good your search engine is and how easily it can be improved. Generating transcripts might be a good way of gathering data for the search engine.