The models that failed, the bugs that took weeks, and the architecture that survived.
What I Was Trying to Do
I needed to read specific UI panel content from a screen in real-time on macOS with Apple Silicon. Detect panel boundaries from a screenshot, extract text via OCR, accumulate that text across scrolling viewports, and render a transparent overlay on top showing the results. All of it running locally -- no cloud APIs, no network round-trips, no usage limits. A self-contained screen reading and annotation tool that could understand what was on screen and react to it at interactive speed.
Everything That Failed
This section might be the most useful part of this article. Each of these approaches cost days to weeks of effort. If you are building anything similar on macOS, you can skip all of them.
Florence-2
Microsoft's Florence-2 was the first vision model I tried. It supports grounding tasks out of the box -- you give it an image and ask "where is the text panel?" and it returns bounding box coordinates. On paper, perfect for UI panel detection.
In practice, Florence-2 cannot run on macOS with Apple Silicon. The model uses a custom architecture that requires trust_remote_code=True, depends on flash-attention (a CUDA-only library), and cannot be converted to CoreML. There is no MLX port. I spent two days trying different conversion paths before accepting that this model simply does not exist on Apple's platform.
If you are searching for a grounding-capable vision model on macOS, remove Florence-2 from your list immediately.
Ferret-UI
Apple's own UI understanding model seemed like the obvious choice for an Apple Silicon project. Ferret-UI was specifically designed to understand user interfaces -- element detection, widget classification, spatial reasoning about UI layouts.
It was a dead end. Ferret-UI requires CUDA flash-attention, which means it needs an NVIDIA GPU. Apple's own UI understanding model does not run on Apple's own hardware without significant porting effort. Beyond the runtime issue, the model's grounding output was not usable for my task -- I needed precise pixel-coordinate bounding boxes, and the model's output format did not map cleanly to that.
The irony of Apple publishing a UI model that cannot run on macOS was not lost on me.
Qwen2.5-VL-3B (the Small One)
After the first two dead ends, I found that Qwen2.5-VL had an MLX port via the mlx-vlm library. The 3B parameter variant (4-bit quantized) was only 2.9GB, loaded in 1.9 seconds, and ran inference in 3-7 seconds. Fast and light.
But too weak. The 3B model could identify that UI elements existed in an image -- it would say "there is a text panel on the left" -- but the bounding box coordinates it returned were hallucinated. Boxes would be off by hundreds of pixels, overlap incorrectly, or enclose regions that contained nothing. For panel detection where you need to know "the question text lives between pixels (0, 120) and (900, 800)," a model that hallucinates coordinates is worse than no model at all.
The 7B variant turned out to be the sweet spot. More on that in the next section.
Pixel-Edge Detection (No ML)
Before committing to a VLM, I tried the traditional computer vision approach. Each UI panel has a uniform background color. The question panel might be rgb(53, 67, 83), the editor panel rgb(22, 43, 54). In theory, you can find panel boundaries by detecting where the background color changes.
The algorithm worked on test screenshots. Then I tested it on a page where both panels used similar background colors. The panel border was a thin 1-pixel line that blended into the surrounding regions. Same-color-background UIs -- which are increasingly common with modern design trends -- broke the approach entirely.
Pixel-edge detection is fragile because it depends on an assumption (panels have visually distinct backgrounds) that is not guaranteed. A VLM can detect panel boundaries semantically -- it understands "this is a question panel" regardless of what color it is.
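To make the fragility concrete, here is a minimal sketch of the idea (illustrative code, not the original implementation): scan one horizontal row of pixels and report the columns where the background color jumps by more than a tolerance.

```python
def color_distance(a, b):
    """Manhattan distance between two RGB tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def find_vertical_edges(row, tolerance=30):
    """Return x positions where the background color jumps.

    `row` is a list of (r, g, b) tuples for one horizontal scanline.
    """
    edges = []
    for x in range(1, len(row)):
        if color_distance(row[x - 1], row[x]) > tolerance:
            edges.append(x)
    return edges

# Two panels with distinct backgrounds: the boundary is found.
row = [(53, 67, 83)] * 400 + [(22, 43, 54)] * 400
print(find_vertical_edges(row))  # [400]

# Two panels with near-identical backgrounds: no edge is detected,
# which is exactly how this approach breaks on modern flat UIs.
row = [(53, 67, 83)] * 400 + [(55, 66, 84)] * 400
print(find_vertical_edges(row))  # []
```

The tolerance value is the whole problem: raise it and real borders disappear; lower it and anti-aliasing noise produces false edges.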
Accessibility API (AX API)
macOS has a built-in accessibility API that lets you programmatically read UI elements. For a screen reader, this sounds ideal.
The problem is that the Accessibility API cannot see inside web content rendered in Chrome. The browser exposes high-level structural elements -- the window, the tab bar, the content area -- but not individual text lines, panel layouts, or the DOM structure within the page. You get a single "web area" element that says "this is a web view" with no ability to drill into it.
If your target is a native macOS application, the AX API might work. For reading web-based UIs through the browser, it is a dead end.
Spawning a New Python VLM Process Per Inference
My initial integration spawned a new Python process for each VLM inference call. The Python script imported mlx-vlm, loaded the Qwen2.5-VL-7B model (5.3GB of weights), ran inference on one image, printed the result, and exited. The next cycle, 15 seconds later, spawned a new process that loaded the 5.3GB model again.
After three or four cycles, the Mac froze. Each process was loading the full model into unified memory, and the previous processes had not fully released their allocations before the next one started. OOM within minutes.
The fix was a persistent server architecture: load once, serve many.
What Actually Worked -- The Architecture
Here is the system that survived. Each component earned its place by being the last option standing after everything else failed.
Panel Detection: Qwen2.5-VL-7B via MLX
The 7B parameter Qwen2.5-VL model, 4-bit quantized, is the sweet spot for UI panel detection on Apple Silicon. The 3B model hallucinates bounding boxes. Larger models (14B+) are too slow for interactive use. The 7B variant reliably returns accurate panel coordinates when prompted correctly.
Why MLX matters. Apple Silicon's unified memory architecture means the CPU and GPU share the same physical RAM. MLX exploits this -- the model weights live in unified memory once and are accessed by both the CPU (for attention computations) and the GPU (for matrix multiplications) without copying. The 4-bit quantized model shows ~238MB resident memory in Activity Monitor, not the full weight file size, because MLX memory-maps the weights and pages them in on demand.
The prompt that works. After testing dozens of prompt variations, this format reliably produces usable output:
```
Detect the following UI panels in this screenshot and output their
bounding box coordinates in JSON format:
1. The "question" panel (problem description text area)
2. The "editor" panel (code editor area)
Return JSON with format: [{"label": "question", "bbox_2d": [x1,y1,x2,y2]},
{"label": "editor", "bbox_2d": [x1,y1,x2,y2]}]
```
Key details: ask for each panel by name in a numbered list (the model sometimes merges panels into one bbox if you describe them in a single sentence), and specify the exact JSON format you want (the model follows format instructions well).
The initial architecture: a persistent Python server. The model takes ~12 seconds to load. Rather than paying that cost every cycle, I built a Python server process that loads the model at startup and accepts requests over a simple stdin/stdout protocol:
```python
# Server: load model once, serve forever
import sys, json
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("mlx-community/Qwen2.5-VL-7B-Instruct-4bit")
config = load_config("mlx-community/Qwen2.5-VL-7B-Instruct-4bit")

for line in sys.stdin:
    request = json.loads(line)
    prompt = apply_chat_template(processor, config,
                                 request["prompt"],
                                 num_images=1)
    result = generate(model, processor, prompt,
                      image=request["image_path"],
                      max_tokens=512, verbose=False)
    print(json.dumps({"result": result}), flush=True)
```
The Swift host process spawns this server once, sends JSON requests on stdin, and reads JSON responses from stdout. No HTTP server, no sockets, no serialization framework -- just newline-delimited JSON over pipes.
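For illustration, the host side of this protocol can be sketched in Python as well (the real host is Swift; the `VLMClient` name below is hypothetical):

```python
import json
import subprocess

class VLMClient:
    """Hypothetical host-side client for the newline-delimited JSON
    protocol: one JSON request per line on stdin, one JSON response
    per line on stdout."""

    def __init__(self, server_cmd):
        # Spawn the server once; it loads the model at startup.
        self.proc = subprocess.Popen(
            server_cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def infer(self, prompt, image_path):
        request = {"prompt": prompt, "image_path": image_path}
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()  # the server reads line-by-line
        return json.loads(self.proc.stdout.readline())["result"]
```

The flush after each write matters: without it, buffering can leave the server waiting on a request that never arrives.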
Coordinate conversion. The model returns bounding boxes in the coordinate space of the resized image (max 1280px on the longest side, rounded to multiples of 28). To get screen pixels:
```
screen_x = model_x * (original_width / resized_width) / retina_scale
screen_y = model_y * (original_height / resized_height) / retina_scale
```
On a Retina display, retina_scale is 2.0. Forgetting this division is a common source of bounding boxes that are exactly 2x too large.
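As a sketch, the conversion packages into one helper (hypothetical function name; the example assumes a 2880x1800 screenshot resized to 1280x800 for the model):

```python
def model_to_screen(bbox, original_size, resized_size, retina_scale=2.0):
    """Map a model-space bbox [x1, y1, x2, y2] back to screen points.

    The model sees the resized image; the screenshot is in physical
    pixels; screen coordinates are in points, which on a Retina
    display are physical pixels divided by retina_scale.
    """
    ow, oh = original_size
    rw, rh = resized_size
    sx = (ow / rw) / retina_scale
    sy = (oh / rh) / retina_scale
    x1, y1, x2, y2 = bbox
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]

# Model-space bbox from a 2880x1800 screenshot resized to 1280x800:
print(model_to_screen([100, 60, 500, 400], (2880, 1800), (1280, 800)))
# -> [112.5, 67.5, 562.5, 450.0]
```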
From Python Server to Native Swift
The Python persistent server worked. But it had friction: a Python subprocess to manage, a PIL resize helper, stdin/stdout JSON marshaling, and ~50ms of overhead per inference just from process communication. For a real-time pipeline, I wanted everything native.
The mlx-swift-lm library promised exactly this -- a Swift implementation of the MLX model runtime, including Qwen2.5-VL. Load the model in Swift, run inference in Swift, no Python anywhere. In theory, a single-binary solution.
In practice, the Swift implementation had 9 bugs. Finding and fixing them took weeks. But the result was worth it: a fully native Swift binary that runs Qwen2.5-VL-7B with zero Python dependencies, producing identical output to the Python reference.
The 9 Bugs in mlx-swift-lm's Qwen2.5-VL
This section documents those weeks. The bugs collectively made the model produce wrong bounding boxes. Fixing them was the difference between "the model hallucinates" and "the model matches Python output at 0px delta on all 8 bbox edges."
1. MROPE section selection (split-select vs slice-replace). Multi-Resolution Rotary Position Embedding (MROPE) assigns different frequency bands to temporal, height, and width dimensions. The Swift implementation split the frequency tensor into three parts using modulo indexing (i % 3), which interleaves the frequencies. Python's implementation starts with temporal frequencies and overwrites height/width slices in-place: [T_0-15, H_16-39, W_40-63]. The layouts are completely different, and the wrong layout produces subtly wrong attention patterns.
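A toy contrast of the two layouts, using made-up section sizes of 3/4/5 over 12 frequency slots in place of the real 16/24/24 over 64:

```python
# Toy sizes: T=3, H=4, W=5 over 12 slots (the real model uses
# 16/24/24 over 64). Each slot is labeled with the band it serves.
sections = {"T": 3, "H": 4, "W": 5}
n = sum(sections.values())  # 12 slots

# Buggy Swift layout: assign bands by modulo, interleaving T/H/W.
interleaved = ["THW"[i % 3] for i in range(n)]

# Python reference layout: start from temporal frequencies, then
# overwrite the H and W slices in place -> contiguous blocks.
slice_replace = ["T"] * n
slice_replace[3:7] = ["H"] * 4
slice_replace[7:12] = ["W"] * 5

print("".join(interleaved))    # THWTHWTHWTHW
print("".join(slice_replace))  # TTTHHHHWWWWW
```

Every slot disagrees between the two layouts, so the rotary embedding applies the wrong frequency to nearly every position dimension.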
2. Chat template ordering. The Swift message generator placed text before the image token in the content array. The Python implementation puts the image first: <|vision_start|><|image_pad|><|vision_end|>PROMPT. This ordering matters because the model's attention patterns are position-dependent -- putting text before the image means the text tokens attend to positions where image features have not yet been injected.
3. invFreq registered as a Module weight. The invFreq tensor was declared as a property on an Attention class that inherits from Module. MLX's weight-loading mechanism scans all Module properties and tries to load matching weights from the checkpoint. Since invFreq is a computed constant (not a learned weight), the loader either threw keyNotFound errors or silently overwrote it with garbage. The fix was wrapping it in a non-Module class to hide it from reflection.
4. rope_deltas unused during autoregressive generation. After the prefill pass, the code cleared the cached position IDs but never applied rope_deltas during subsequent token generation. The correct computation is positionIds = cache_offset + rope_deltas + arange(seqLen). Without the deltas, position embeddings drifted with each generated token, degrading output quality progressively.
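A toy illustration of that computation (all numbers below are assumed, not from the source):

```python
def position_ids(cache_offset, rope_deltas, seq_len):
    """Correct: shift cached positions by the stored rope deltas."""
    return [cache_offset + rope_deltas + i for i in range(seq_len)]

def position_ids_buggy(cache_offset, seq_len):
    """Buggy: deltas were computed at prefill but never applied."""
    return [cache_offset + i for i in range(seq_len)]

# Assumed values: 40 tokens cached, deltas of -12 from collapsing
# the 2D vision grid, 3 new tokens this decode step.
print(position_ids(40, -12, 3))   # [28, 29, 30]
print(position_ids_buggy(40, 3))  # [40, 41, 42]
```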
5. Image resize using 1800px max instead of 1280px. The Swift code resized input images to a maximum of 1800 pixels on the longest side, producing 2688 visual tokens. The Python reference implementation uses 1280px maximum, producing 1305 visual tokens. The model was trained on the 1280px resolution. Feeding it 1800px images meant the visual token positions were outside the model's training distribution.
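A sketch of the resize constraint as described (the rounding behavior here is an assumption; the real Qwen2.5-VL preprocessor may differ in how it snaps to multiples of 28):

```python
import math

def model_input_size(width, height, max_side=1280, multiple=28):
    """Scale the longest side down to max_side, then snap each side
    down to a multiple of 28 (patch size 14 x spatial merge 2)."""
    scale = min(1.0, max_side / max(width, height))
    w = math.floor(width * scale / multiple) * multiple
    h = math.floor(height * scale / multiple) * multiple
    return max(w, multiple), max(h, multiple)

# A Retina screenshot at 2880x1800 physical pixels:
print(model_input_size(2880, 1800))  # (1260, 784)
# The buggy 1800px cap yields a much larger grid of visual tokens:
print(model_input_size(2880, 1800, max_side=1800))  # (1792, 1120)
```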
6. Prompt format for single bbox output. Using a single sentence asking for both panels caused the model to sometimes return one combined bounding box. Switching to a numbered list with explicit labels ("1. question panel" / "2. editor panel") reliably produced two separate bboxes.
7. maxTokens not set. Without an explicit max_tokens parameter, the model generated tokens until hitting an internal limit or running out of memory. For a task that should return ~100 tokens of JSON, this caused multi-second waits and occasionally produced thousands of tokens of hallucinated output.
8. MROPE state not reset between successive images. The cached position IDs and rope deltas from one image persisted into the next inference call. When processing a new screenshot, the model's position embeddings started from where the previous image left off instead of resetting. This caused progressively worse results on the second, third, and subsequent images.
9. Vision attention mask ignored -- the ROOT CAUSE. This was the single bug most responsible for bounding box inaccuracy. The vision encoder's self-attention uses a mask to implement windowed attention (the model processes the image in patches, and each patch should only attend to patches within its window). The Swift code passed mask: .none to the scaled dot-product attention call instead of mask: .array(floatMask). Without the mask, every patch attended to every other patch globally, destroying the spatial locality that the model relies on for precise coordinate prediction.
```swift
// WRONG -- ignores the attention mask entirely
let attnOutput = scaledDotProductAttention(
    queries: q, keys: k, values: v, scale: scale, mask: .none
)

// CORRECT -- applies the windowed attention mask
let attnOutput = scaledDotProductAttention(
    queries: q, keys: k, values: v, scale: scale,
    mask: .array(floatMask)
)
```
After fixing all 9 bugs, the Swift implementation produced identical bounding box coordinates to the Python reference on the same input image. 0px delta on all 8 edges (x1, y1, x2, y2 for two panels). The model was not hallucinating -- the implementation was broken.
OCR: Apple Vision Framework
Apple's Vision framework provides on-device OCR that runs on the Neural Engine at ~300ms per frame.
Recognition levels are confusing. VNRequestTextRecognitionLevel exposes two levels with raw values 0 and 1. Intuitively, you might assume level 0 is the baseline (fast) and level 1 is the premium (accurate). It is the opposite. Level 0 is .accurate (slower, higher quality); level 1 is .fast (lower quality). I ran with level 1 for weeks thinking I was getting the best results, then discovered I had been using the fast path the entire time.
RecognizeDocumentsRequest vs VNRecognizeTextRequest. Apple's Vision framework has two OCR APIs, and they behave very differently on code content. RecognizeDocumentsRequest (the newer, WWDC25 API) is optimized for documents -- prose, forms, receipts. It silently drops lines that look like code: indented lines with brackets, semicolons, and unusual formatting. For a code editor panel, it would capture 15 out of 20 visible lines, silently losing the rest.
VNRecognizeTextRequest (the older API) captures everything -- every line, regardless of formatting. For reading code from screen, use VNRecognizeTextRequest. I discovered this after weeks of mysterious "missing lines" that turned out to be the newer API being too clever about what constitutes document text.
Bounded OCR. Rather than scanning the entire screen (which picks up menu bars, dock icons, and other noise), the OCR is bounded to the panel regions detected by the VLM. This reduces both processing time and false positives -- you only extract text from the panel you care about.
Scroll Accumulator
Most non-trivial content does not fit in a single viewport. A problem description might be 40 lines long, but only 15 are visible at once. The scroll accumulator solves this by scrolling through the content in steps, OCR-ing each viewport, and stitching the results into a complete transcript.
The stitching problem. Adjacent viewports overlap. When you scroll down by 100 pixels, the bottom 80% of the previous viewport is still visible. Naive concatenation produces massive duplication. The accumulator uses Levenshtein distance to fuzzy-match each incoming OCR line against all accumulated lines.
Threshold tuning. A line is classified as "already seen" if its Levenshtein similarity to any accumulated line exceeds 60%. I tested thresholds from 40% to 80%:
- 40%: too permissive -- novel lines were classified as duplicates and dropped
- 80%: too strict -- lines with minor OCR variations were classified as novel and added twice
- 60%: best F1 score for the duplicate-vs-novel classification task
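A minimal sketch of the dedup logic (the function names are hypothetical; the thresholding matches the 60% rule above):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Levenshtein similarity normalized to [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def accumulate(accumulated, incoming, threshold=0.6):
    """Append incoming OCR lines unless they fuzzy-match a seen line."""
    for line in incoming:
        if not any(similarity(line, seen) > threshold
                   for seen in accumulated):
            accumulated.append(line)
    return accumulated

lines = accumulate([], ["def solve(nums):", "    total = 0"])
# The next viewport re-OCRs an overlapping line with a small error
# ("so1ve"): it is recognized as a duplicate and dropped.
lines = accumulate(lines, ["def so1ve(nums):", "    return total"])
print(lines)
```

The quadratic match against all accumulated lines is fine at transcript scale (tens to hundreds of lines); a real implementation might window the comparison to recent lines.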
Metal GPU Overlay Rendering
The overlay renders detected text and annotations as a transparent window on top of the target application.
```swift
let window = NSWindow(
    contentRect: screenFrame,
    styleMask: .borderless,
    backing: .buffered,
    defer: false
)
window.level = NSWindow.Level(rawValue: 25)
window.isOpaque = false
window.backgroundColor = .clear
window.ignoresMouseEvents = true
window.hasShadow = false
```
Self-exclusion from screen capture. This is critical: the overlay must not appear in its own screenshots. If it does, the next VLM inference cycle sees the overlay text, interprets it as UI content, and the system enters a feedback loop where it reads its own annotations. The fix is captureScreenExcluding(windowID:), which tells ScreenCaptureKit to exclude the overlay window from the captured frame.
The Demo
Here's the system in action -- detecting panels, reading text, and rendering the overlay in real time. (The demo video is embedded in the original post.)
Performance Numbers
| Component | Latency | Resource |
|---|---|---|
| VLM model loading | ~3s | Unified memory (one-time) |
| VLM panel detection | ~18s per inference | GPU (MLX unified memory) |
| OCR per frame | ~300ms | Neural Engine |
| Overlay render | <16ms (60fps) | Metal GPU |
| Full scroll accumulation | ~40s (20 steps) | Combined |
| Model resident memory | ~5.5GB peak | Unified memory |
The VLM inference is the bottleneck at ~18 seconds, but it only needs to run when the panel layout changes (e.g., navigating to a new page). During normal operation, the OCR and overlay run continuously at ~300ms per cycle while the VLM-detected panel bounds remain cached. On an M1 Pro with 16GB, the system runs comfortably alongside Chrome and other applications.
How to Reproduce This
Requirements:
- macOS 14+ on Apple Silicon (M1/M2/M3/M4)
- Xcode 16+
- ~16GB unified memory (8GB minimum, 16GB comfortable)
Model:
- mlx-community/Qwen2.5-VL-7B-Instruct-4bit from Hugging Face (~5.3GB)
Key dependencies:
- mlx-swift-lm (Swift package, for native VLM inference)
- Apple Vision framework (built into macOS)
- Metal (built into macOS)
- ScreenCaptureKit (built into macOS)
Closing
Building this system produced more failure than success. Six major approaches failed before the working architecture emerged, and even the working approach required fixing 9 implementation bugs in a third-party library before it produced correct output. The total development time from "I want to read a panel from the screen" to "this reliably works" was measured in weeks, not days.
The experience of building this real-time screen reader led directly to a testing methodology I call CCSV (Cross-Channel Spatiotemporal Verification) -- the idea that you can verify a UI by reading it through two completely independent channels (DOM and pixels) and comparing what they see. That methodology is described in a companion article.
If you are building something similar -- local VLMs on Apple Silicon, real-time screen understanding, overlay rendering -- I would like to hear what you have tried and what worked. The failure modes are not well documented anywhere, and the community benefits from sharing them.
Niv Dvir is a software developer who builds tools at the intersection of computer vision and UI automation. You can find him on GitHub.
This article was originally published by DEV Community and written by Niv Dvir.