NO. #153a1062
Topic: OTHER
Source: Hacker News · Show HN AI
Published: 2026-04-29 16:01:51
Importance: ★ 5/10 (radar 50)
`Structured Output Benchmark` targets value-level LLM correctness

Schema-valid JSON still breaks workflows when field values drift or are hallucinated. The Structured Output Benchmark (SOB) scores schema compliance, types, and value accuracy across text, image, and audio inputs, which makes it directly useful for choosing models for extraction pipelines.

[ KEY POINTS ]
  1. Existing benchmarks like JSONSchemaBench mostly check schema/type compliance, but miss wrong-yet-plausible fields such as shifted dates or reordered arrays.
  2. SOB pairs each sample with a JSON Schema and human-verified ground truth, then grades failures at the field-value level across three modalities.
  3. Rankings split by modality: GLM-4.7 leads on text, Gemma-4-31B on images, and Gemini-2.5-Flash on audio, so relying on one default model is a weak strategy.
  4. Large models do not dominate value accuracy: Qwen3.5-35B, GLM-4.7, and even Phi-4 beat bigger frontier models on some structured tasks.
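
To make the gap between schema-level and value-level checks concrete, here is a minimal Python sketch. It is not SOB's actual scoring code: the toy schema, the field names, and the helpers `types_ok` and `field_accuracy` are all hypothetical, and exact-match comparison is an assumed grading rule chosen for simplicity.

```python
from typing import Any

# Hypothetical toy schema: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "issued": str, "line_items": list}


def types_ok(output: dict[str, Any], schema: dict[str, type]) -> bool:
    """Schema/type-level check: same keys, each value of the expected type."""
    return output.keys() == schema.keys() and all(
        isinstance(output[key], expected) for key, expected in schema.items()
    )


def field_accuracy(output: dict[str, Any], truth: dict[str, Any]) -> float:
    """Value-level check: fraction of ground-truth fields matched exactly."""
    if not truth:
        return 1.0
    hits = sum(1 for key, value in truth.items()
               if key in output and output[key] == value)
    return hits / len(truth)


# Hypothetical ground truth and a plausible-but-wrong model output:
# the date is shifted by one day and the array is reordered.
truth = {"invoice_id": "A-17", "issued": "2024-03-05",
         "line_items": ["bolt", "nut"]}
model_output = {"invoice_id": "A-17", "issued": "2024-03-06",
                "line_items": ["nut", "bolt"]}

print(types_ok(model_output, SCHEMA))       # True: the type check passes
print(field_accuracy(model_output, truth))  # only one of three fields matches
```

The output passes the type check but matches only one of three fields, which is the kind of failure a compliance-only benchmark would not surface.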
Original: interfaze.ai/blog/introducing-structured-output-benchmark
