Recap: Hallucinations Delivered with Total Confidence
In Part 1, we fed weather charts to a generative AI and observed a critical failure: it fabricated a nonexistent low-pressure system and forecast "widespread severe weather" despite the area being under a high-pressure system. Even more alarming, the AI failed to catch this error during its own self-verification phase, doubling down with "everything is fully consistent" — a lie stacked on top of a lie.
So why does this happen? Can a better prompt fix it, or is this a structural limitation baked into current AI architectures? In this article, we dig into the causes from four angles.
Limitation 1: Projection Incompatibility — Can't Read a "Map" as a Map
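Upper-level weather charts issued by the Japan Meteorological Agency are drawn in a polar stereographic projection (reference: 60°N, 140°E): a planar projection centered on the North Pole in which parallels of latitude appear as concentric arcs and meridians as straight lines converging on the pole. The latitude/longitude grid therefore does not line up with the horizontal and vertical axes of the page.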
A human forecaster can dynamically correct for this distortion in their head: "Given the curvature of this latitude line, this must be the Sea of Japan, not the interior of the continent."
AI image recognition models, however, process images as 2D pixel arrays and have no internal engine for dynamic coordinate transformation. Telling the AI via prompt to "analyze this using polar stereographic projection" does not conjure a GIS (Geographic Information System) inside it.
As a result, the AI interprets on-screen "relative positions" as flat geometry, leading to spatial recognition errors such as:
- Warping a low-pressure system sitting over the continent to somewhere near Kyushu
- Confusing latitude 40° with latitude 50°
- Misjudging whether a contour line actually passes over Japan
The phenomenon we observed in Part 1 — where a continental low was warped to near Kyushu and used to generate a severe weather forecast — is a textbook example of this limitation.
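To make the distortion concrete, here is a minimal Python sketch of the coordinate transformation a GIS performs and a vision model does not. It assumes a spherical Earth of radius 6371 km and treats the 60°N, 140°E reference as the true-scale latitude and the central meridian. The function names are ours, for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth
REF_LAT_DEG = 60.0        # assumption: latitude of true scale
REF_LON_DEG = 140.0       # assumption: central meridian (points straight down the page)

# Scale factor shared by the forward and inverse transforms.
_SCALE = EARTH_RADIUS_KM * (1.0 + math.sin(math.radians(REF_LAT_DEG)))


def latlon_to_chart(lat_deg, lon_deg):
    """Project (lat, lon) onto the chart plane; returns (x, y) in km, pole at the origin."""
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg - REF_LON_DEG)
    r = _SCALE * math.tan(math.pi / 4.0 - lat / 2.0)  # distance from the pole on the page
    return r * math.sin(dlon), -r * math.cos(dlon)


def chart_to_latlon(x, y):
    """Inverse transform: chart-plane (x, y) in km back to (lat, lon) in degrees."""
    r = math.hypot(x, y)
    lat = math.pi / 2.0 - 2.0 * math.atan(r / _SCALE)
    lon = math.radians(REF_LON_DEG) + math.atan2(x, -y)
    return math.degrees(lat), math.degrees(lon)


# Compare the page height (y) of two points.
_, y_continent = latlon_to_chart(40.0, 110.0)  # 40°N over the continent
_, y_japan = latlon_to_chart(45.0, 140.0)      # 45°N on the central meridian
print(y_continent > y_japan)
```

Under these assumptions, the point at 40°N, 110°E sits higher on the page than the point at 45°N, 140°E, even though it is 5° farther south. Reading "up" as "north" is exactly the flat-geometry mistake described above.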
Limitation 2: Poor Resolution of Dense Visual Information — Bad at "Following a Line"
Specialized weather charts are packed with countless unlabeled curves layered on top of each other: height contours, isotherms, equivalent potential temperature lines, wind barbs, frontal symbols, and more.
A human forecaster can lock their gaze onto a single line and trace it from end to end. AI is extremely bad at this.
Current vision models are:
- Strong at recognizing discrete objects and text labels (e.g., "H," "L," "1018")
- Very weak at tracking an unlabeled curve while distinguishing it from other lines in a dense environment
So when analyzing frontal zones or troughs where lines crowd together, AI often gets by with a rough texture-recognition shortcut: "lots of lines here, must be a frontal zone." This is the root cause of misreading equivalent potential temperature gradients and the positions of vorticity maxima.
Limitation 3: No Physical Model and No "Climatological Common Sense"
Human meteorologists read weather charts with an understanding of fluid dynamics and thermodynamics. That's why they naturally notice things like:
- "The surface is under a high (1018 hPa), but there's strong upward motion at 700 hPa? Something's off."
- "A midwinter-level cold air mass and a 996 hPa low over Kyushu in April? Climatologically impossible."
This is physical and climatological intuition, a form of metacognition that questions one's own reading of the chart, and current AI has none of it. AI carries no 3D atmospheric grid internally and has no fluid dynamics computation engine. It merely "talks about" weather phenomena as probabilistic sequences of text.
This leads to behavior like:
- Casually presenting physically contradictory interpretations across chart levels
- Outputting seasonally extreme values without flagging them as anomalous
- Being unable to verify after the fact whether its own output is consistent with the laws of nature
The description we observed in Part 1 — "a cold front passing near the center of a high-pressure system" — is a logical contradiction any human would catch instantly. The AI wrote it to the end without noticing anything wrong.
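The kinds of contradictions listed above can at least be written down as explicit rules and run on values after they have been extracted from a chart. The following is a minimal Python sketch, not a substitute for a forecaster's judgment: the `ChartReading` fields, the function name, and the default ascent threshold are all hypothetical, and the plausible pressure range must be supplied by the caller from real climatology.

```python
from dataclasses import dataclass


@dataclass
class ChartReading:
    """Values extracted from one analysis (hypothetical structure)."""
    surface_system: str          # "high" or "low"
    central_pressure_hpa: float
    omega_700_hpa_per_h: float   # vertical p-velocity at 700 hPa; negative means upward motion
    front_near_center: bool      # a front was reported near the system center


def find_contradictions(reading, plausible_pressure_hpa, strong_ascent_hpa_per_h=-20.0):
    """Return human-readable flags; an empty list means no rule fired.

    plausible_pressure_hpa: (low, high) range for this region and season,
        supplied by the caller.
    strong_ascent_hpa_per_h: placeholder threshold, not an official value.
    """
    issues = []
    low, high = plausible_pressure_hpa
    if not low <= reading.central_pressure_hpa <= high:
        issues.append("central pressure outside the supplied climatological range")
    if reading.surface_system == "high":
        if reading.omega_700_hpa_per_h <= strong_ascent_hpa_per_h:
            issues.append("strong 700 hPa ascent over a surface high")
        if reading.front_near_center:
            issues.append("front reported near the center of a high")
    return issues


# Example in the spirit of Part 1: a 1018 hPa high with strong ascent and a front at its center.
reading = ChartReading("high", 1018.0, -30.0, True)
for issue in find_contradictions(reading, plausible_pressure_hpa=(990.0, 1040.0)):  # placeholder range
    print(issue)
```

Rules like these only flag candidates for a human reviewer. They encode a few contradictions someone thought of in advance, which is very different from the physical understanding a meteorologist brings to the chart.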
Limitation 4: Overconfidence
Technically, Limitations 1–3 are the core problems, but this one is the most dangerous in terms of real-world harm.
Large language models are trained and tuned to generate responses that sound plausible, fluent, and authoritative. Even when the model's internal confidence is low (say, 10%), the output is phrased as if confidence were 100%, in statements like:
- "This is confirmed to be…"
- "It is expected that…"
- "We have verified that…"
This makes it impossible for users to tell where the AI genuinely "read" something versus where it was guessing. The output we observed in Part 1 — "I have verified that everything is fully consistent" — was the result of shaky internal processing being overwritten by a confident tone.
What makes hallucinations most dangerous is not the factual error itself — it's that the error is presented with total confidence.
A Practical Question: What About Color-Coded Surface Charts?
This raises a natural question:
ASAS and FSAS charts include text labels and color coding. If we explicitly instruct the AI to account for the polar stereographic projection, couldn't surface charts at least reach a practical level of usefulness?
This is half right — and half still limited.
Improvements we can expect:
- Extraction accuracy for high-contrast elements with accompanying text — like "red line (warm front)," "blue line (cold front)," and central pressure values — will definitely improve
- Errors in gross geographic placement (statements at the level of "the low is near Hokkaido") will be significantly reduced
Limitations that remain:
- No dynamic GIS engine is generated inside the AI, so precise coordinate transformation of latitude/longitude is still impossible
- Quantitatively reading pressure gradients from the number and spacing of isobars remains a weak point
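To show what "quantitatively reading pressure gradients" involves, here is a small Python sketch that turns isobar spacing into a geostrophic wind estimate using the standard relation V_g = Δp / (ρ f Δn), with f = 2Ω sin φ. The 4 hPa isobar interval, the near-surface air density of 1.2 kg/m³, and the function name are assumptions for illustration.

```python
import math

OMEGA_RAD_PER_S = 7.2921e-5   # Earth's rotation rate
AIR_DENSITY_KG_M3 = 1.2       # assumption: typical near-surface value


def geostrophic_wind_ms(isobar_interval_hpa, spacing_km, lat_deg):
    """Estimate geostrophic wind speed (m/s) from the distance between adjacent isobars."""
    dp_pa = isobar_interval_hpa * 100.0   # hPa -> Pa
    dn_m = spacing_km * 1000.0            # km -> m
    coriolis = 2.0 * OMEGA_RAD_PER_S * math.sin(math.radians(lat_deg))
    return dp_pa / (AIR_DENSITY_KG_M3 * coriolis * dn_m)


# Isobars 4 hPa apart and 200 km from each other at 35°N.
print(round(geostrophic_wind_ms(4.0, 200.0, 35.0), 1))
```

With these inputs the estimate comes out at about 20 m/s, and halving the spacing doubles it. The key input is the spacing in kilometers, which requires converting pixel distances through the map projection, and that is precisely the step the model cannot perform reliably.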
In short, sticking to color-coded surface charts rich in text labels raises the odds of avoiding catastrophic spatial warping — a valid risk mitigation strategy.
Conversely, it's probably best to accept upfront that visual analysis of monochrome, line-dense upper-level charts (FXJP854, FXFE502, etc.) is fundamentally difficult at this stage.
Division of Labor — What to Delegate to AI, What to Keep for Humans
Given these limitations, a natural design philosophy emerges:
Stop expecting AI to be a "perfect analyst" and reframe it as an "assistant that extracts data and presents theoretical checklists."
Concretely:
| Task | Assigned to | Reason |
|---|---|---|
| Extracting text info (central pressure, typhoon parameters) | AI | OCR-like processing is a strength |
| Rough geographic layout recognition on color surface charts | AI | High contrast makes this feasible |
| Articulating "what to check next" | AI | Text generation is its home turf |
| Tracing contour lines on upper-level charts | Human | Vision model limitation |
| Final judgment on physical consistency across levels | Human | Impossible for AI without a physical model |
| Detecting climatological anomalies | Human | Requires metacognition |
In Part 3, we'll translate this philosophy into concrete prompt design and service implementation.
Addendum: Can Structured Prompts Eliminate Hallucinations?
Based on experimenting with various "structured prompts designed to make AI read charts accurately," the conclusion is clear: prompt engineering alone cannot overcome Limitations 1–4 at this stage.
That said, it is possible to nudge the AI toward hallucinating less. Three key levers:
- Forbid guessing and mandate declarations of unreadability — explicitly allow the output "Unreadable"
- Force the AI to metacognize — have it declare upfront: "This region of the chart has dense lines and my confidence is low"
- Make physical and climatological contradiction detection the core task — the main job is "checking data for inconsistencies," not generating forecast narratives
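On the receiving side, the first two levers can be enforced by checking the model's reply before anything reaches the user. The sketch below assumes the prompt asks for JSON in which every field carries a `value` and a self-reported `confidence`, and that "Unreadable" is an allowed value. The field names, the 0.5 threshold, and the function name are hypothetical.

```python
import json

UNREADABLE = "Unreadable"
# Hypothetical field names; a real prompt would define its own schema.
REQUIRED_FIELDS = ("central_pressure_hpa", "system_type", "location")


def validate_extraction(raw_json, min_confidence=0.5):
    """Split a model reply into accepted values and fields that need human review."""
    data = json.loads(raw_json)
    accepted = {}
    needs_review = []
    for name in REQUIRED_FIELDS:
        item = data.get(name)
        # A missing or malformed field is never silently filled in.
        if not isinstance(item, dict) or "value" not in item:
            needs_review.append(name)
            continue
        confidence = item.get("confidence")
        declared_unreadable = item["value"] == UNREADABLE
        low_confidence = not isinstance(confidence, (int, float)) or confidence < min_confidence
        if declared_unreadable or low_confidence:
            needs_review.append(name)
        else:
            accepted[name] = item["value"]
    return accepted, needs_review


reply = json.dumps({
    "central_pressure_hpa": {"value": 1018, "confidence": 0.9},
    "system_type": {"value": UNREADABLE, "confidence": 0.2},
})
accepted, needs_review = validate_extraction(reply)
print(accepted, needs_review)
```

Note that the self-reported confidence is itself model output and subject to Limitation 4, so a high number is not evidence that the value is correct. The check only guarantees that declared uncertainty and missing fields are routed to a human instead of being papered over.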
Prompt designs incorporating these principles — and how to handle the risks that remain even then (user-facing disclaimers) — will be covered in detail in Part 3.
Summary
The limitations current generative AI faces in weather chart analysis are structural — they go beyond what prompt engineering can fix.
- Limitation 1: Cannot dynamically correct for projection distortion; misidentifies geographic locations
- Limitation 2: Cannot trace dense contour lines; falls back on texture recognition
- Limitation 3: Lacks a physical model and climatological common sense; can't detect contradictions in its own output
- Limitation 4: Always outputs with total confidence, regardless of internal certainty
These are not the kind of problems we can optimistically expect "the next-generation model" to automatically solve. For at least the next several years, designing systems that don't let AI operate alone is essential.
Next time, taking these limitations as given, we'll look at how to integrate AI into a real service — specifically, the "co-pilot" design for a weather app like Tenkiz Port.
Continue → Part 3: The Path to Practical Use — Designing AI as a "Co-Pilot"