# Qwen3-VL and 3.5 Coordinate System

## Overview

Qwen3-VL and 3.5 use a relative coordinate system for bounding box outputs, normalized to a 0–1000 range. This is a notable shift from earlier models such as Qwen2.5-VL, which used absolute pixel coordinates tied to the actual image dimensions.
The key idea is simple: no matter what size your input image is, the model always describes locations as if the image were mapped onto a 1000 × 1000 grid.
## How Normalization Works
All coordinate values are expressed relative to a standardized 1000 × 1000 resolution.
| Coordinate | Meaning |
|---|---|
| (0, 0) | Top-left corner of the image |
| (1000, 1000) | Bottom-right corner of the image |
| (500, 500) | Center of the image |
This means a bounding box of (0, 0, 1000, 1000) covers the entire image, regardless of whether the original image is 640×480, 1920×1080, or any other resolution.
## Converting to Actual Pixel Coordinates
To map the normalized coordinates back to real pixel positions, scale them using the original image dimensions:
```
x_pixel = x_normalized × (actual_width / 1000)
y_pixel = y_normalized × (actual_height / 1000)
```
Example: If the model outputs (250, 400) and your image is 1920 × 1080:
```
x_pixel = 250 × (1920 / 1000) = 480
y_pixel = 400 × (1080 / 1000) = 432
```
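The conversion above, and its inverse for mapping pixel positions back onto the model's grid, can be sketched in Python. The helper names `to_pixel` and `to_normalized` are illustrative, not part of any Qwen API:

```python
def to_pixel(x_norm, y_norm, width, height):
    """Scale normalized (0-1000) coordinates to pixel positions."""
    return x_norm * width / 1000, y_norm * height / 1000

def to_normalized(x_px, y_px, width, height):
    """Inverse: map pixel positions back onto the 1000x1000 grid."""
    return x_px * 1000 / width, y_px * 1000 / height

# The worked example above: (250, 400) on a 1920 x 1080 image
print(to_pixel(250, 400, 1920, 1080))  # → (480.0, 432.0)
```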
## Output Format
Bounding boxes are returned in the following notation:
```
<box>(x1, y1), (x2, y2)</box>
```
| Component | Description |
|---|---|
| (x1, y1) | Top-left corner of the bounding box |
| (x2, y2) | Bottom-right corner of the bounding box |
### Example

```
<box>(120, 230), (560, 780)</box>
```
This describes a region starting at 12% from the left and 23% from the top, extending to 56% across and 78% down the image.
## Why This Matters
- Resolution-agnostic: The model doesn't need to know or care about the input image's actual size. Outputs are always in the same scale, making them easy to work with across different image sizes.
- Consistent post-processing: You always apply the same conversion formula, regardless of the input.
- Optimized for grounding tasks: The model's grounding training uses this normalized coordinate space, making it a natural fit for 2D spatial grounding and object detection.
## Quick Reference: Post-Processing
### Python

```python
def normalize_to_pixel(box, image_width, image_height):
    """Convert Qwen3-VL normalized coordinates to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * image_width / 1000,
        y1 * image_height / 1000,
        x2 * image_width / 1000,
        y2 * image_height / 1000,
    )

# Example usage
bbox_normalized = (120, 230, 560, 780)
bbox_pixels = normalize_to_pixel(bbox_normalized, 1920, 1080)
# → (230.4, 248.4, 1075.2, 842.4)
```
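The raw `<box>` string can also be parsed in Python with a small regex — a minimal sketch; `parse_qwen_box` is an illustrative name, not an official utility:

```python
import re

# Matches <box>(x1, y1), (x2, y2)</box> with optional whitespace after commas
_BOX_RE = re.compile(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>")

def parse_qwen_box(raw):
    """Extract (x1, y1, x2, y2) from Qwen's <box> notation, or None."""
    match = _BOX_RE.search(raw)
    if not match:
        return None
    return tuple(int(v) for v in match.groups())

parse_qwen_box("<box>(120, 230), (560, 780)</box>")
# → (120, 230, 560, 780)
```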
### TypeScript

```typescript
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

function normalizeToPixel(
  box: BoundingBox,
  imageWidth: number,
  imageHeight: number
): BoundingBox {
  return {
    x1: (box.x1 * imageWidth) / 1000,
    y1: (box.y1 * imageHeight) / 1000,
    x2: (box.x2 * imageWidth) / 1000,
    y2: (box.y2 * imageHeight) / 1000,
  };
}

// Example usage
const bboxNormalized: BoundingBox = { x1: 120, y1: 230, x2: 560, y2: 780 };
const bboxPixels = normalizeToPixel(bboxNormalized, 1920, 1080);
// → { x1: 230.4, y1: 248.4, x2: 1075.2, y2: 842.4 }
```
You can also parse the raw `<box>` output from the model:
```typescript
function parseQwenBox(raw: string): BoundingBox | null {
  // Parentheses and the closing tag's slash must be escaped in the pattern
  const match = raw.match(/<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)<\/box>/);
  if (!match) return null;
  return {
    x1: parseInt(match[1], 10),
    y1: parseInt(match[2], 10),
    x2: parseInt(match[3], 10),
    y2: parseInt(match[4], 10),
  };
}

// Example usage
const box = parseQwenBox("<box>(120, 230), (560, 780)</box>");
// → { x1: 120, y1: 230, x2: 560, y2: 780 }
```
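The two steps — parsing the model's string and scaling to pixels — can be combined into one end-to-end helper. This Python sketch also clamps values to the 0–1000 grid as a defensive measure in case an output falls slightly out of range; `box_to_pixels` is an illustrative name, not an official API:

```python
import re

def box_to_pixels(raw, width, height):
    """Parse a <box> string and return pixel coordinates, or None."""
    m = re.search(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>", raw)
    if not m:
        return None
    # Clamp to the 0-1000 grid before scaling, as a defensive step
    coords = [max(0, min(1000, int(v))) for v in m.groups()]
    x1, y1, x2, y2 = coords
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

box_to_pixels("<box>(120, 230), (560, 780)</box>", 1920, 1080)
# → (230.4, 248.4, 1075.2, 842.4)
```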