16 votes

Scalable oral exams with an ElevenLabs voice AI agent

12 comments

  1. [6]
    Comment deleted by author
    Link
    1. TonesTones
      Link Parent

      I agree that the language doesn’t account for that, but I do not believe that discounts the methodology.

      Pretty much every academic structure is somehow ableist. Some neurodiversities make strict deadlines too hard, some neurodiversities make flexible deadlines too hard. Pretty much every form of examination is going to discriminate against some group of people, whether it’s timed in-person, untimed take-home, oral, written, etc. Furthermore, every type of exam will have some group saying “this is the way to demonstrate understanding, and if you cannot do that, it’s a personal and moral failing, not an academic one.” The author’s language promotes that view with oral exams, but I’d argue it’s even more common with, e.g., strict deadlines (yes, every job will have deadlines, but some jobs have a lot more real deadlines, while in others they are a bit more fake). My point is that no single structure solves the fundamental problem you take issue with.

      This is why accommodations need to exist. People fundamentally have different strengths, and educators who understand their job know that evaluation isn’t the end-all-be-all, but a tool to spot gaps in the education.

      Oral exams are incredibly effective, for a large proportion of students, at quickly getting signal on whether they know what they are talking about. I don’t agree with everything the author does in their approach, but scalable oral exams are in the future of education. Accommodations will exist for the students whose strengths are not in oral exams, and they may have to jump through more hoops to demonstrate understanding (e.g., supervised written equivalents to the oral exam), but a cheap and scalable general solution makes it easier for the school to build out the exceptional solutions for the students who need them.

      11 votes
    2. scarecrw
      Link Parent

      I'm ambivalent on this. I've thankfully not had anxiety issues with public speaking or exams myself, but knowing a few people who have experienced panic attacks, they're obviously not in a state where they would be able to demonstrate their understanding of a topic.

      On the other hand, I do feel like a very lightly weighted oral exam that was still a hurdle for course grading makes a lot of sense. I don't think it's a substitute for other forms of assessment, but setting a bar of "participate in an informed discussion of the topic you've ostensibly been studying for the past few months" doesn't seem like an unreasonable ask to be awarded a confirmation of completion. Put another way: I wouldn't want to graduate anyone who couldn't hold a discussion about the topic being learned, and it's hard to imagine a way to do that without some form of direct assessment.

      5 votes
    3. [3]
      skybrian
      Link Parent

      Would it help if you could do it at home, and you could try again as many times as you like? It seems sort of like beating a video game.

      4 votes
      1. stu2b50
        Link Parent

        It probably wouldn't help everyone, but that number may also be small enough that you can just handle it on a case-by-case basis with ADA accommodations.

        6 votes
      2. DynamoSunshirt
        Link Parent

        I imagine it's a bit like taking a test on a computer for me. I have next to no nerves during paper tests. But any test on the computer shreds my psyche. It's worse if I have instant feedback on each question (even one mess-up is demoralizing), but there's something different about pressing the submit button compared to turning in paper. Maybe it's the idea that a machine is processing my input, not a human?

        3 votes
  2. skybrian
    Link

    From the article:

    In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing" good.

    So we started cold calling students randomly during class.

    The result was... illuminating. Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all. This gap was too consistent to blame on nerves or bad luck. If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.

    Brian Jabarian has been doing interesting work on this problem, and his results both inspired us and gave us the confidence to try something that would have sounded absurd two years ago: running the final exam with a Voice AI agent.

    [...]

    Total cost for 36 students: 15 USD.

    [...]

    The grading was stricter than my own default. That's not a bug. Students will be evaluated outside the university, and the world is not known for grade inflation.

    The feedback was better than any human would produce. The system generated structured "strengths / weaknesses / actions" summaries with verbatim quotes from the transcript. Sample feedback from the highest scorer:

    [...]

    And here is an underrated benefit of this whole setup: the exam is powered by guidelines, not by secret questions. We can publish exactly how the exam works—the structure, the skills being tested, the types of questions. No surprises. The LLM will pick the specific questions live, and the student will have to handle them.

    [...]

    And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

    Obviously this is just a trial run, but they did test on real students, so it seems promising?
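    The "guidelines, not secret questions" and "structured strengths / weaknesses / actions feedback" ideas from the excerpt could be sketched roughly like this. This is a hypothetical illustration, not the authors' actual setup: the guideline fields, function names, and JSON schema are all invented, and the actual LLM call is left out.

    ```python
    import json

    # Published exam guidelines: fully transparent to students, since the model
    # picks the specific questions live. These fields are invented for the sketch.
    GUIDELINES = {
        "skills": ["problem framing", "metric selection", "trade-off reasoning"],
        "structure": "a few follow-up questions per skill, adapting to answers",
    }


    def build_grading_prompt(guidelines: dict, transcript: str) -> str:
        """Assemble a grading prompt from the public guidelines and the transcript."""
        return (
            "You are grading an oral exam. Skills under test: "
            + ", ".join(guidelines["skills"])
            + ".\nReturn JSON with keys: strengths, weaknesses, actions; "
            "each a list of strings quoting the transcript verbatim.\n\n"
            "Transcript:\n" + transcript
        )


    def parse_feedback(raw: str) -> dict:
        """Validate the model's structured feedback before showing it to a student."""
        feedback = json.loads(raw)
        for key in ("strengths", "weaknesses", "actions"):
            if not isinstance(feedback.get(key), list):
                raise ValueError(f"missing or malformed field: {key}")
        return feedback
    ```

    Because the prompt is built only from published guidelines plus the live transcript, there is nothing secret to leak, which is presumably what lets students practice the exam as many times as they like.
    
    
    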

    9 votes
  3. stu2b50
    Link

    Huh, seems like an interesting idea. I still think that it doesn't fully cover what a written essay covers - it's still orthogonal to some extent. The written essay tests your ability to collate and present an argument with plentiful resources, a situation that isn't uncommon in many practical uses of rhetoric, and it lets you dictate the conversation, which also isn't uncommon. In that respect, more than testing understanding, it's also testing a type of skill relating to information synthesis.

    But I can see where the author is going, and this type of oral argument is also going to be a necessity when doing oral interviews for jobs or oral defenses in academia, so it covers both practical and likely use cases.

    6 votes
  4. [5]
    em-dash
    Link

    So, setting aside the utterly disrespectful absurdity of an LLM calling me on the phone:

    If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.

    I would completely agree with this if not for "live". Instead, I find myself wanting to say many impolite things to this person.

    It is a good thing to take time to think about what you're going to say before you say it. That is a skill more people should have and use. That's why I spend more time hanging out here and other text-based places than talking to people in person. (Yes, even at work. We send emails. It's great, more companies should try it.)

    For example, over the few minutes I took to write and edit this comment, I had an additional realization, which would have come too late to say out loud in a verbal conversation: this is another instance of that thing where we're not sure anymore, as a society, whether the purpose of education is to prepare students for jobs, or to educate them more broadly. I would believe that this more accurately reflects the dystopia we're likely heading toward. I do not believe for a second that it is better at measuring actual in-depth understanding.

    6 votes
    1. [4]
      skybrian
      Link Parent

      Maybe it makes sense to compare this to playing music. If you know how to play, you should be able to play something in front of someone else, right? But now it's a high-stress situation. I don't know how to fix it except by practicing and performing enough that I build up some confidence. And... I do okay sometimes, but I haven't really reached that point yet.

      Similarly, it seems like you should be able to have a conversation with someone who wrote a paper and they should be able to answer questions about it. The question is how to do it without making it high-stress, when how well they do actually matters to them. Hopefully practice helps?

      I did hundreds of job interviews over the years while working at Google and it was the worst part of the job, even though I wasn't the one being tested. I hate judging people. Eventually I stopped doing them. I hope something better comes along.

      3 votes
      1. [3]
        em-dash
        Link Parent

        it seems like you should be able to have a conversation with someone who wrote a paper and they should be able to answer questions about it.

        Why, though? Writing a paper given a bunch of information one can consult as needed, and then editing it into the final result that people read, is an entirely different skill from answering questions off the top of one's head.

        ... in much the same way as playing music, actually, though not in the way you describe: would you expect a great songwriter or composer to automatically be a great performer? I'd argue that writing a paper is analogous to writing a song, and answering live questions to playing arbitrary songs on request.

        In the world of research, where the whole goal is figuring stuff out and writing the results down for future people to reference, I would greatly prefer to directly optimize for that and not for live performance.

        (And yeah, same, re: interviews. I've started explicitly telling my employers that I am irreconcilably bad at it, citing vague handwavey neurospicy reasons, and they should prefer literally anyone else if possible.)

        4 votes
        1. skybrian
          Link Parent

          I don't think there are many jobs where you just read and write papers? Academics are expected to be able to teach, too. Also, most papers are collaborations these days, at least in the sciences.

          I started working before widespread videoconferencing. I valued being able to talk to teammates about technical problems in an informal meeting in front of a whiteboard. I'm under the impression that academics have conversations in front of whiteboards too? There's the old stereotype of mathematicians chatting together while writing equations on a blackboard.

          Google was founded by two graduate students and I think that's the sort of experience that Google's interviews tried to replicate with in-person interviews, however imperfectly: are they someone that I'd want to work with to come up with a design to solve a technical problem?

          On the other hand, since we normally write code in front of a computer, not at a whiteboard, I always thought pair programming together would be a good interview test, but it wasn't done. Nowadays, maybe they'll start testing people on their ability to vibe-code, who knows?

          7 votes
        2. unkz
          Link Parent

          I'd argue that writing a paper is analogous to writing a song, and answering live questions to playing arbitrary songs on request.

          Not really arbitrary songs though. It’s more like, playing songs from a set list that you have supposedly been mastering for the whole semester.

          1 vote