The promise of large language models (LLMs) in medicine is no longer theoretical. But when it comes to ophthalmology—a field rich in nuance and subspecialty knowledge—can general-purpose models like OpenAI’s o1 truly deliver?
A recent study published in JAMA Ophthalmology (Sept 2025) put this question to the test, benchmarking o1 against five other leading LLMs: GPT-4o, GPT-4, GPT-3.5, Llama 3-8B, and Gemini 1.5 Pro. The results offer a nuanced picture of what “smart” really means in clinical AI.
📊 Metrics That Matter—Simplified
To evaluate the models, researchers used two types of metrics:
1. Accuracy Metrics (Did the model choose the right answer?)
- Accuracy: Percentage of correct answers out of 6,990 ophthalmology MCQs.
- Macro F1 Score: Balances precision and recall across all question types, especially useful when answer distributions are uneven.
2. Reasoning Metrics (How well did the model explain its answer?)
| Metric | What It Measures | Why It Matters |
|---|---|---|
| ROUGE-L | Longest common subsequence overlap | Captures how closely the explanation’s wording tracks the reference |
| BERTScore | Semantic similarity using deep embeddings | Goes beyond word match to assess meaning |
| BARTScore | Bidirectional prediction quality | Measures how well the model and reference explain each other |
| AlignScore | Factual consistency between texts | Checks whether the explanation stays consistent with the reference content |
| METEOR | Precision, recall, and synonym matching | Designed for nuanced semantic comparison |
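
The reasoning metrics above all have widely used open-source implementations. As a minimal Python sketch (not necessarily the exact implementations the study used), here is how a single model explanation could be scored against a reference with ROUGE-L and BERTScore, assuming the `rouge-score` and `bert-score` packages are installed; the example sentences are purely illustrative.

```python
# Minimal sketch: scoring one model explanation against a reference with two
# of the metrics above, using common open-source packages (rouge-score, bert-score).
# These are not necessarily the implementations used in the study.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Acute angle-closure glaucoma is treated with laser peripheral iridotomy."
candidate = "First-line management of acute angle closure is a laser peripheral iridotomy."

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore: semantic similarity from contextual embeddings (returns torch tensors).
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1: {rouge_l:.3f}, BERTScore F1: {f1.item():.3f}")
```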
Each reasoning metric was normalized and averaged to produce an overall score for explanation quality.
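
For the other half of the pipeline, here is a minimal sketch of scoring answer selection (accuracy and macro F1 via scikit-learn) and one plausible way to combine per-metric reasoning scores into a single explanation-quality score. The min-max normalization and simple mean are assumptions for illustration; the study’s exact aggregation may differ, and all values below are made up.

```python
# Minimal sketch: accuracy + macro F1 for MCQ answers, plus one plausible
# way to normalize and average per-metric reasoning scores. Illustrative data only.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# --- Accuracy metrics (answer selection) ---
gold_answers  = ["A", "C", "B", "D", "A"]   # reference answer keys
model_answers = ["A", "C", "D", "D", "B"]   # model's chosen options

accuracy = accuracy_score(gold_answers, model_answers)
macro_f1 = f1_score(gold_answers, model_answers, average="macro")
print(f"Accuracy: {accuracy:.3f}, Macro F1: {macro_f1:.3f}")

# --- Reasoning metrics (explanation quality) ---
# Hypothetical per-question raw scores for each metric.
raw_scores = {
    "rouge_l":    np.array([0.42, 0.55, 0.31]),
    "bertscore":  np.array([0.88, 0.91, 0.85]),
    "bartscore":  np.array([-2.1, -1.8, -2.6]),  # BARTScore is a log-likelihood, hence negative
    "alignscore": np.array([0.63, 0.71, 0.58]),
    "meteor":     np.array([0.37, 0.49, 0.28]),
}

def min_max_normalize(values: np.ndarray) -> np.ndarray:
    """Rescale a metric to [0, 1] so differently scaled metrics can be averaged."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)

normalized = {name: min_max_normalize(vals) for name, vals in raw_scores.items()}
# Overall explanation-quality score: mean of the normalized metrics per question.
overall = np.mean(np.stack(list(normalized.values())), axis=0)
print("Overall reasoning score per question:", np.round(overall, 3))
```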
🥇 What Did the Study Find?
o1’s Strengths
- Top Accuracy: o1 outperformed all other models in selecting correct answers, with an impressive 87.7% accuracy.
- Best Usefulness & Organization: Expert reviewers rated o1’s explanations as more clinically useful and better structured than GPT-4o’s.
- Strong in METEOR & BARTScore: o1 led on these metrics, which reward synonym-aware matching and mutual predictability between model and reference explanations.
o1’s Weaknesses
- Trailing in ROUGE-L, BERTScore, and AlignScore: GPT-4 and GPT-4o consistently outperformed o1 on these metrics, suggesting closer semantic and structural similarity to the reference explanations.
- Verbose Reasoning: o1’s longer, more detailed outputs sometimes diverged from the concise reference answers, lowering its similarity scores.
- Subtopic Variability: o1 led in lens and glaucoma topics but was outperformed by GPT-4o in cornea, retina, and oculoplastics.
🧠 Why the Discrepancy?
The study highlights a key tension in clinical AI:
- Accuracy ≠ Reasoning Quality: A model can choose the right answer but explain it in a way that differs from the reference—especially if the reference is terse or poorly structured.
- General vs Domain-Specific Intelligence: o1’s reasoning enhancements may not fully translate to ophthalmology, where specialized knowledge and terminology matter.
- Evaluation Bias: Text-generation metrics reward similarity, not necessarily clinical insight. Human expert review is essential to capture real-world utility.
🔍 Implications for Educators and Developers
- For Educators: o1’s structured, clinically rich explanations may be ideal for teaching modules, especially when paired with expert curation.
- For Developers: There’s a clear need for ophthalmology-specific fine-tuning and better reference datasets with high-quality rationales.
- For Evaluators: Combining automated metrics with expert review offers a more balanced assessment of model performance.
🧩 Final Thought
OpenAI’s o1 shows that reasoning-optimized LLMs can outperform their predecessors in accuracy and clinical usefulness—but they’re not yet perfect. In ophthalmology, where precision and clarity are paramount, domain-specific refinement remains essential.
