• o3 - wow

  • Dec 21 2024
  • Length: 22 mins
  • Podcast

  • Summary

  • o3 is one of the biggest developments in AI in 2+ years, not because it beats a particular benchmark, but because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, the benchmarks broken, and what comes next. Plus: the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.


    FrontierMath: https://epoch.ai/frontiermath

    https://arxiv.org/pdf/2411.04872

    Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough

    MLC Paper: https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social

    AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

    Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1

    Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614

    Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/

    Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893

    SWE-bench Verified: https://openai.com/index/introducing-swe-bench-verified/

    Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518

    David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638

    OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/

    https://simple-bench.com/

    John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725


    00:00 - Introduction

    01:19 - What is o3?

    03:18 - FrontierMath

    05:15 - o4, o5

    06:03 - GPQA

    06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2

    08:13 - 1st Caveat

    09:03 - Compositionality?

    10:16 - SimpleBench?

    13:11 - ARC-AGI, Chollet



