valdemaras.com

Computer vision overview

Tier 0 — Edge & Mobile-Optimized Detectors

Runs on phones, drones, Raspberry Pis, embedded cameras.

They detect a fixed set of classes

  • YOLO-NAS (Tiny) — NAS-discovered architecture with built-in quantization awareness

  • MobileNet-SSD — lightweight CNN backbone designed for mobile inference

  • EfficientDet — compound-scaled detector balancing resolution, depth, and width

  • YOLO11-N / YOLOv8-N — nano-sized YOLO variants for edge deployment

  • RF-DETR Nano — smallest RF-DETR variant, ~100 FPS on NVIDIA T4

Original image
Detection with RF-DETR Nano. It missed a couple of cars and a person on a sidewalk

Tier 1 — Real-Time Specialist Detectors

More accurate, run on GPU's

They detect a fixed set of classes

  • RF-DETR (Base / Medium / Large / 2XL) — transformer-based, first real-time model to exceed 60 AP on COCO, NMS-free

  • YOLOv12 — latest YOLO with area attention and R-ELAN blocks

  • YOLO11 / YOLOv8 — widely adopted anchor-free single-stage CNN detectors

  • D-FINE — real-time transformer detector competitive with RF-DETR at larger sizes

  • LW-DETR — lightweight DETR with plain ViT backbone

  • RT-DETR — first DETR to match YOLO speeds, from Baidu

Detected a couple more cars, but still just one out of the two persons

√Tier 1.5 — Promptable Segmentation Models

don't classify, segments (cuts out objects from picture). Typically paired with a detector to get both detection and pixel-perfect masks.

  • SAM 2 (Segment Anything Model 2) — universal promptable segmentation from Meta, supports images and video

  • SAM (original) — the foundational "segment anything" model

  • HQ-SAM — high-quality variant for fine-grained mask edges

  • FastSAM — CNN-based reimplementation of SAM for faster inference

  • EfficientSAM — distilled SAM variant for resource-constrained environments

Tier 2 — Open-Vocabulary Detection Models

Aaccept text or visual prompts to detect arbitrary objects zero-shot — no retraining needed. Slower than Tier 1, but can find anything you describe.

  • DINO-X (Pro / Edge) — SOTA open-world detector from IDEA Research, 56 AP zero-shot on COCO

  • GroundingDINO 1.5 / 1.6 — text-prompted detection combining DINO with language grounding

  • OWLv2 (Open-World Localization) — Google's open-vocabulary detector built on ViT

  • Detic — vocabulary-expanded detector using image-level supervision

  • YOLO-World — open-vocabulary extension of YOLO architecture

Grounding Dino (older vresion of Dino X). Prompt: 'person, vehicle'. Note the 2 false positives for person.
Grounding dino. Prompt "blue and black cars"
Grounding dino does not understand complex prompts like: "person inside the courtyard: the area enclosed by the garages and the cars parked within it"

Tier 3 — General-Purpose Vision-Language Models (VLMs)

Massive model. Captioning, reasoning, document parsing, and agentic tasks. Slow and expensive.

  • Qwen2.5-VL / Qwen3-VL — Alibaba's open-source VLM family, strong grounding and spatial awareness

  • GPT/Gemini/Claude

  • InternVL 2.5 — open-source VLM with competitive detection grounding

  • Florence-2 — Microsoft's unified vision foundation model supporting detection, captioning, and grounding

qwen/qwen3-vl-235b-a22b-instruct: 100% correct. "Outline the position of people, vehicles and output all the coordinates in JSON format. Use bbox_2d format with [x1, y1, x2, y2] coordinates in a 0-1000 coordinate space where the entire image maps to 1000x1000."
3-vl-235b-a22b-instruct: Find people INSIDE the courtyard: the area enclosed by the garages and the cars parked within it. ....
3-vl-235b-a22b-instruct: Find people and cars INSIDE the courtyard: the area enclosed by the garages and the cars parked within it