# Qwen3-VL and 3.5 Coordinate System

## Overview

Qwen3-VL and 3.5 use a relative coordinate system for bounding box outputs, normalized to a 0–1000 range. This is a notable shift from earlier models such as Qwen2.5-VL, which used absolute pixel coordinates tied to the actual image dimensions.
The key idea is simple: no matter what size your input image is, the model always describes locations as if the image were mapped onto a 1000 × 1000 grid.
## How Normalization Works
All coordinate values are expressed relative to a standardized 1000 × 1000 resolution.
| Coordinate | Meaning |
|---|---|
| (0, 0) | Top-left corner of the image |
| (1000, 1000) | Bottom-right corner of the image |
| (500, 500) | Center of the image |
This means a bounding box of (0, 0, 1000, 1000) covers the entire image, regardless of whether the original image is 640×480, 1920×1080, or any other resolution.
## Converting to Actual Pixel Coordinates
To map the normalized coordinates back to real pixel positions, scale them using the original image dimensions:
```
x_pixel = x_normalized × (actual_width / 1000)
y_pixel = y_normalized × (actual_height / 1000)
```
Example: If the model outputs (250, 400) and your image is 1920 × 1080:
```
x_pixel = 250 × (1920 / 1000) = 480
y_pixel = 400 × (1080 / 1000) = 432
```
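The conversion above, and its inverse for mapping pixel positions back onto the model's grid, can be sketched in Python. The helper names `to_pixel` and `to_normalized` are illustrative, not part of any Qwen API:

```python
def to_pixel(x_norm, y_norm, width, height):
    """Scale normalized (0-1000) coordinates to pixel positions."""
    return x_norm * width / 1000, y_norm * height / 1000

def to_normalized(x_px, y_px, width, height):
    """Inverse: map pixel positions back onto the 1000x1000 grid."""
    return x_px * 1000 / width, y_px * 1000 / height

# The worked example above: (250, 400) on a 1920 x 1080 image
print(to_pixel(250, 400, 1920, 1080))  # → (480.0, 432.0)
```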
## Output Format
Bounding boxes are returned in the following notation:
```
<box>(x1, y1), (x2, y2)</box>
```
| Component | Description |
|---|---|
| (x1, y1) | Top-left corner of the bounding box |
| (x2, y2) | Bottom-right corner of the bounding box |
### Example

```
<box>(120, 230), (560, 780)</box>
```
This describes a region starting at 12% from the left and 23% from the top, extending to 56% across and 78% down the image.
## Why This Matters
- Resolution-agnostic: The model doesn't need to know or care about the input image's actual size. Outputs are always in the same scale, making them easy to work with across different image sizes.
- Consistent post-processing: You always apply the same conversion formula, regardless of the input.
- Optimized for grounding tasks: The model's grounding training uses this normalized coordinate space, making it a natural fit for 2D spatial grounding and object detection.
## Quick Reference: Post-Processing
### Python

```python
def normalize_to_pixel(box, image_width, image_height):
    """Convert Qwen3-VL normalized coordinates to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * image_width / 1000,
        y1 * image_height / 1000,
        x2 * image_width / 1000,
        y2 * image_height / 1000,
    )

# Example usage
bbox_normalized = (120, 230, 560, 780)
bbox_pixels = normalize_to_pixel(bbox_normalized, 1920, 1080)
# → (230.4, 248.4, 1075.2, 842.4)
```
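The raw `<box>` string can also be parsed in Python with a small regex — a minimal sketch; `parse_qwen_box` is an illustrative name, not an official utility:

```python
import re

# Matches <box>(x1, y1), (x2, y2)</box> with optional whitespace after commas
_BOX_RE = re.compile(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>")

def parse_qwen_box(raw):
    """Extract (x1, y1, x2, y2) from Qwen's <box> notation, or None."""
    match = _BOX_RE.search(raw)
    if not match:
        return None
    return tuple(int(v) for v in match.groups())

parse_qwen_box("<box>(120, 230), (560, 780)</box>")
# → (120, 230, 560, 780)
```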
### TypeScript

```typescript
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

function normalizeToPixel(
  box: BoundingBox,
  imageWidth: number,
  imageHeight: number
): BoundingBox {
  return {
    x1: (box.x1 * imageWidth) / 1000,
    y1: (box.y1 * imageHeight) / 1000,
    x2: (box.x2 * imageWidth) / 1000,
    y2: (box.y2 * imageHeight) / 1000,
  };
}

// Example usage
const bboxNormalized: BoundingBox = { x1: 120, y1: 230, x2: 560, y2: 780 };
const bboxPixels = normalizeToPixel(bboxNormalized, 1920, 1080);
// → { x1: 230.4, y1: 248.4, x2: 1075.2, y2: 842.4 }
```
You can also parse the raw `<box>` output from the model:
```typescript
function parseQwenBox(raw: string): BoundingBox | null {
  // Parentheses and the closing tag's slash must be escaped in the pattern
  const match = raw.match(/<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)<\/box>/);
  if (!match) return null;
  return {
    x1: parseInt(match[1], 10),
    y1: parseInt(match[2], 10),
    x2: parseInt(match[3], 10),
    y2: parseInt(match[4], 10),
  };
}

// Example usage
const box = parseQwenBox("<box>(120, 230), (560, 780)</box>");
// → { x1: 120, y1: 230, x2: 560, y2: 780 }
```
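The two steps — parsing the model's string and scaling to pixels — can be combined into one end-to-end helper. This Python sketch also clamps values to the 0–1000 grid as a defensive measure in case an output falls slightly out of range; `box_to_pixels` is an illustrative name, not an official API:

```python
import re

def box_to_pixels(raw, width, height):
    """Parse a <box> string and return pixel coordinates, or None."""
    m = re.search(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>", raw)
    if not m:
        return None
    # Clamp to the 0-1000 grid before scaling, as a defensive step
    coords = [max(0, min(1000, int(v))) for v in m.groups()]
    x1, y1, x2, y2 = coords
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

box_to_pixels("<box>(120, 230), (560, 780)</box>", 1920, 1080)
# → (230.4, 248.4, 1075.2, 842.4)
```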