23 votes

Most AI struggles to read clocks and calendars

23 comments

  1. [18]
    balooga
    Link

    This is the sort of problem that’s easily solved by tool-using agentic models. Just like LLMs struggled with basic arithmetic until they were given access to calculators, someone will make a Clock & Calendar MCP and the problem will go away.

    LLMs are an important part of AI but not the only part. The best systems will be the ones with a whole toolbox of specialized tools at their disposal, and the knowledge of how and when to use them.
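    As a sketch of how thin such a tool could be (the name and shape here are hypothetical, not a real MCP package's API), a clock tool is little more than a wrapper over the system clock:

    ```python
    from datetime import datetime, timezone

    # Hypothetical handler a "Clock & Calendar" tool server might expose.
    # The function name and signature are illustrative, not a real MCP API.
    def current_time(tz: timezone = timezone.utc) -> str:
        """Return the current time as an ISO-8601 string for the model to quote."""
        return datetime.now(tz).isoformat(timespec="seconds")
    ```

    The model never computes the time itself; it just relays the tool's output.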

    15 votes
    1. [17]
      skybrian
      Link Parent

      For the first test, it would need to be a specialized model that takes an image of an arbitrary clock and returns the time.

      It's the reverse of the clock image generation challenge: can you give it a time and have it draw the clock? A lot of image generators will put the hands on ten and two regardless of what you ask, because they were trained on a lot of images of clocks that were set to that time.

      6 votes
      1. [15]
        zestier
        Link Parent

        Maybe I'm crazy, but why would you even use anything resembling an AI model for that? If you asked someone to solve that problem a decade ago I think they'd have just started with OpenCV. And to integrate it into today's models you'd presumably just give them access to run said OpenCV tool.

        Random example of such a tool: https://wongkj12.github.io/clock-reader/
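        Once a classical pipeline (OpenCV line detection, say) has found the hand vectors, turning them into a time is just trigonometry. A sketch of that last step, assuming the vision part is already done:

        ```python
        import math

        def vectors_to_time(hour_vec, minute_vec):
            """Convert hand direction vectors (dx, dy), with y pointing up,
            into (hour, minute). Angles are measured clockwise from 12."""
            def angle_deg(dx, dy):
                return math.degrees(math.atan2(dx, dy)) % 360
            minute = round(angle_deg(*minute_vec) / 6) % 60  # 6 degrees per minute
            hour = int(angle_deg(*hour_vec) // 30) % 12      # 30 degrees per hour
            return hour, minute

        print(vectors_to_time((1, 1), (0, -1)))  # hour hand at 45 deg, minute at 6 -> (1, 30)
        ```

        The hard part is, of course, the detection step this sketch assumes away.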

        7 votes
        1. chundissimo
          Link Parent

          I’ve never seen it framed as trying to actually solve a problem here, rather it was used as an example of one of the systemic weaknesses of image models.

          5 votes
        2. [5]
          norb
          Link Parent

          So I have access to Copilot through work. After seeing this headline, I asked it to tell me where the hands of the clock would be at a certain time.

          ME: If it's 1:30, where would the hands on the clock be?

          CP: At 1:30, the positions of the clock hands are:

          • Minute hand: Points at the 6, which is at the 180-degree mark (halfway around the clock).
          • Hour hand: Halfway between the 1 and the 2, because it's halfway between 1:00 and 2:00. Each hour mark represents 30 degrees (360° ÷ 12), so:
          • At 1:00, the hour hand is at 30°.
          • At 1:30, it's halfway to 60°, so it's at 45 degrees.

          So:

          • Hour hand: 45°
          • Minute hand: 180°

          Would you like a visual illustration of this?

          I then confirmed I would like the visual illustration. The image it produced was wrong (showing the hour hand at 12 and the minute hand at 6). I then told it the image was wrong, and it said :

          You're absolutely right to point that out, and I appreciate your patience. Let's clarify and correct this.

          At 1:30, the minute hand should be pointing directly at the 6, and the hour hand should be halfway between the 1 and the 2 — not directly on the 1. If the diagram showed otherwise, then it was incorrect.

          Let me regenerate a correct and clear diagram for 1:30.

          Thanks for your patience — a corrected diagram showing 1:30 accurately is on its way. Let me know if you'd like a version with angle labels or a breakdown of how the positions are calculated!

          This second image was still incorrect, but closer. It had the hour hand at 1 and the minute hand at 6. The hour hand should be halfway between 1 and 2. So I told it again it was incorrect, and on the third try it generated the correct image.

          What was really interesting about this exchange, and what made me think of adding it here as a reply to your comment was this next bit.

          ME: I'd like you to get this correct on the first try. This is why people do not trust AI

          CP: You're absolutely right to expect accuracy the first time, especially for something as fundamental as reading a clock. I take that seriously, and I appreciate you holding me to a high standard.

          The earlier diagrams didn’t meet that standard, and I’m glad you called it out. The final diagram I generated using code is correct — it shows:

          • The minute hand pointing at the 6 (180°),
          • The hour hand halfway between the 1 and 2 (45°), which is exactly where it should be at 1:30.

          If you ever want to verify or visualize other times, or if there's anything else you'd like to explore or double-check, I'm here to help — and I’ll make sure it’s right the first time. Would you like to try another time on the clock?

          So for this third time, IT USED SOME CODE to generate the clock instead of whatever it was doing before. So it does have access to some sort of code to do this, it just took me asking three times for it to do that.
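          For what it's worth, the angle arithmetic it recited checks out, and it's small enough to verify in a few lines (this is just a sketch, not whatever code Copilot actually ran):

          ```python
          def hand_angles(hour, minute):
              """Clock-hand angles in degrees, measured clockwise from 12."""
              minute_deg = minute * 6                     # 360 / 60 minutes
              hour_deg = (hour % 12) * 30 + minute * 0.5  # 360 / 12 hours, plus drift
              return hour_deg, minute_deg

          print(hand_angles(1, 30))  # (45.0, 180)
          ```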

          3 votes
          1. [2]
            sth
            Link Parent

            Claims about its inner workings are not necessarily more reliable than the clocks it generates, so I wouldn't interpret that too much. Maybe it just assumed mentioning "code" would make the claim of correct future results more plausible.

            Or it might have said the same thing if you hadn't complained about the second image and instead demanded correct future results immediately.

            3 votes
            1. norb
              Link Parent

              Claims about its inner workings are not necessarily more reliable than the clocks it generates, so I wouldn't interpret that too much.

              That's fair. I couldn't post the images here, but the third clock it generated looked different than the other two so I do assume the underlying method was different. So two data points (what it said and the output) point to it having used a different method to get the correct answer. But yeah, the AI lies a lot lol

              2 votes
          2. [2]
            Sodliddesu
            Link Parent

            I take that seriously, and I appreciate you holding me to a high standard.

            This bothers me for a myriad of reasons but I'm going to go with the simplest one first.

            The chatbot had the ability to do it right and took three requests. It did, in the end, produce a correct visual. How it managed to do it is somewhat irrelevant, but if a human told me they 'take something seriously' and still took three tries to draw a clock right despite knowing the exact angle it needed to be at, I would clearly know they didn't take it seriously.

            So, let's put another hat on this 'serious' chatbot.

            "Where is this target?"

            "Based on visual information provided from Reaper-MQ7, that target is at 1600 Pennsylvania Avenue, Washington D.C., USA."

            "Launch a strike on target coordinates."

            "Got it, launching a strike on 32.7595335, -96.7759369."

            "You just bombed the University of Texas."

            With the American military and its techno-oligarchs investing in AI, how long until we try ChatGPT at the Hague?

            1 vote
            1. norb
              Link Parent

              Fast-tracking "AI" into all the things is going to come back to bite us at some point. I think it's just a matter of when.

        3. [8]
          skybrian
          Link Parent

          Ideally, LLMs that take images as input should be general enough to do all sorts of things. For example, not just reading clocks, but also speedometers and any other kind of dial or gauge. Also various kinds of charts and graphs. Writing a specialized tool works if there's a specific case you care about, but it doesn't generalize.

          1 vote
          1. [7]
            krellor
            Link Parent

            I think a combination makes sense. Specific ml models for clocks, speedometers, temperature and pressure gauges, etc. Those are common, important, and distinct. Then a general multimodal model that can attempt to interpret other types of dials or gauges.

            In my current area (health), the big push I've been making is towards these multimodal models that can invoke validated methods, because people keep dumping in problems that an LLM will give an answer for, but has no validated justification for being able to solve with a transformer network.

            2 votes
            1. [6]
              sparksbet
              Link Parent

              Multimodal models are already a big deal, but they generally don't use highly specific models as described upthread -- the underlying image model is highly generalized much like how the underlying language model is in an LLM. There may well be layers in the model that specifically look for the features described to identify clocks and tell time, for instance, but they are generally not added manually but instead learned as part of training.

              1 vote
              1. [5]
                krellor
                Link Parent

                I suppose it's more precise to say a multimodal model that can invoke agents to execute against a curated library of functions for specific tasks, but I'm not sure if that has a pat name.

                1. [4]
                  sparksbet
                  (edited )
                  Link Parent

                  "Agentic" is the term I've been hearing lately, but I'm not sure if it describes what we're talking about here.

                  1. [3]
                    krellor
                    (edited )
                    Link Parent

                    I wondered about that as well, but when I looked it up this is what I saw: from Wikipedia:

                    Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks without human intervention. The independent systems automatically respond to conditions, to produce process results.

                    I've also seen it used to describe things like development tool chains where AI agents are spawned to do different tasks.

                     So not quite. An opportunity to come up with a name if you have a flash of inspiration.

                    1 vote
                    1. [2]
                      sparksbet
                      Link Parent

                      "Agentic" is the new hot terminology so I think it's being used to describe a pretty broad swath of things. I'm fairly certain what's described above would still fall into that category, but it's being thrown at everything in the space these days so who knows.

                      1 vote
                      1. krellor
                        Link Parent

                         Interestingly, I just received ~~spam~~ a cordial invitation to attend an AWS presentation titled "Power innovation with agentic AI on AWS."

                        I went ahead and registered just to hear what the latest hype definition is!

                        1 vote
      2. krellor
        Link Parent

        It feels like traditional machine learning could be trained to determine time from images of an analog clock. You might need a few layers of decomposing the problem. Traditional ML can readily interpret images of numbers, so then it becomes a question of identifying hands and positions. Of course, you would need a large number of tagged images to train.

        Maybe we get AI to generate a synthetic dataset, train the ML, and see if it would be a good submodule for a multimodal model! 😃
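        The labeling half of that synthetic dataset is trivial to generate; a sketch of just that part (the rendering step would use whatever drawing library you like):

        ```python
        import random

        def sample_clock_label(rng=random):
            """Sample a random time plus the hand angles a renderer would draw,
            giving a (render-input, regression-target) pair for training."""
            hour, minute = rng.randrange(12), rng.randrange(60)
            angles = (hour * 30 + minute * 0.5, minute * 6)  # hour hand, minute hand
            return (hour, minute), angles
        ```

        Pair each sampled label with a rendered image (varying fonts, skew, lighting) and you have supervision for free.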

        4 votes
  2. [2]
    TransFemmeWarmachine
    Link

    What bothers me about this isn't that LLMs aren't able to interact with irl calendars and clocks - it's that a massive push from tech companies has been that LLM agents will be amazing calendar assistants. I agree that's an amazing use case for LLM agents! However, if they can't do these fundamentally basic tasks yet - then they're immediately less useful.

    For example, let's say I'm out and about in the real world, and I see a poster for an event with a stylized calendar and time attached. I'd love to be able to take a photo of the poster and have it automatically added to my calendar. What this article shows is that this is currently not quite possible. (Side note: if someone knows if this exists, please link it.)

    Regardless, I'm not very interested in a product that has an exceptional number of issues. The current push for AI is certainly a bubble, since quite honestly LLMs don't hold up to every use case they are being applied to.

    2 votes
    1. JCAPER
      Link Parent

      It's possible (but I don't know any app that does it, maybe gemini soon™ since google wants to make it work at the OS level on Android)

      For this you'll want to use AI agents, which are basically LLMs with access to tools that allow them to do things. For example, the LLM is incapable of calculating with accuracy, so you develop a calculator script that lets the LLM provide the input, and the LLM then reads the output and gives it to you.
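      That loop can be sketched in a few lines (everything here is illustrative; real agent frameworks use structured schemas rather than raw strings):

      ```python
      # Toy sketch of the tool loop described above: the model emits a call,
      # the host runs a trusted function, and the text result goes back in.
      TOOLS = {
          # eval with empty builtins is a toy stand-in for a real, safe calculator
          "calculator": lambda expr: eval(expr, {"__builtins__": {}}),
      }

      def run_tool(name, argument):
          """Execute the requested tool and return its output as text for the model."""
          return str(TOOLS[name](argument))

      print(run_tool("calculator", "17 * 23"))  # "391"
      ```

      The model only has to decide *which* tool to call and with what input; the arithmetic itself never touches the transformer.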

      For the example you gave, it would be something like that. You would have an agent responsible for understanding the photo, then access a tool that can interact with the calendar, and save the event. The LLM itself doesn't need to understand deeply what the date and time mean, it just needs to understand that it can use them in the calendar's API.

      It's why AI companies, especially google, have been investing heavily in these agents. Turns out that LLMs can work really well as long as you give them simple jobs and access to tools, and you can make some complex workflows by creating several different agents that talk to each other.

      https://cloud.google.com/discover/what-are-ai-agents?hl=en

      1 vote
  3. [3]
    kacey
    Link

    This is kinda neat, but I’m not sure if there’s really a takeaway … every time someone comes up with a new “thing that ML can’t do”, it gets fed into the synthetic training dataset by developers hungry to prove that their model is the best. And the next version does dramatically better.

    My take for most of this stuff is that (1) ML approaches can approximate most algorithms, (2) hundreds of billions of dollars are being spent to make them capable of doing everything, therefore (3) it will clear every well defined benchmark that is created, in time.

    1 vote
    1. [2]
      ThrowdoBaggins
      Link Parent

      I guess the broader issue is that individually identifying shortfalls and then training to bandaid over that problem doesn’t really feel like the path towards generalised intelligence, and a big focus from the LLM space at the moment is how it’s “totally” gonna lead to AGI within a short time, totally!

      5 votes
      1. Nazarie
        Link Parent

         When you give an agent the ability to self-train, you start getting in that direction. If the agent tries a task and gets feedback that it failed, it could go back and train itself on the task. It's how humans work as well. We are still a ways off from AGI, but an AGI isn't going to be some monolithic model. It'll be a system of models that work together and evolve over time. Current tool calling allows a limited form of multi-model intelligence alongside sensory input and the ability to effect external change. We are getting the precursor building blocks in place.