# Qwen3-VL and 3.5 Coordinate System

## Overview

Qwen3-VL and 3.5 use a **relative coordinate system** for bounding box outputs, normalized to a `0–1000` range. This is a notable shift from earlier models like Qwen2.5-VL, which used absolute pixel coordinates tied to the actual image dimensions.

The key idea is simple: no matter what size your input image is, the model always describes locations as if the image were mapped onto a **1000 × 1000 grid**.

---

## How Normalization Works

All coordinate values are expressed relative to a standardized 1000 × 1000 resolution.

| Coordinate | Meaning |
|------------|---------|
| `(0, 0)` | Top-left corner of the image |
| `(1000, 1000)` | Bottom-right corner of the image |
| `(500, 500)` | Center of the image |

This means a bounding box of `(0, 0, 1000, 1000)` covers the **entire image**, regardless of whether the original image is 640×480, 1920×1080, or any other resolution.

### Converting to Actual Pixel Coordinates

To map the normalized coordinates back to real pixel positions, scale them using the original image dimensions:

```
x_pixel = x_normalized × (actual_width / 1000)
y_pixel = y_normalized × (actual_height / 1000)
```

**Example:** If the model outputs `(250, 400)` and your image is `1920 × 1080`:

- `x_pixel = 250 × (1920 / 1000) = 480`
- `y_pixel = 400 × (1080 / 1000) = 432`

---

## Output Format

Bounding boxes are returned in the following notation:

```
(x1, y1), (x2, y2)
```

| Component | Description |
|-----------|-------------|
| `(x1, y1)` | **Top-left** corner of the bounding box |
| `(x2, y2)` | **Bottom-right** corner of the bounding box |

### Example

```
(120, 230), (560, 780)
```

This describes a region starting at 12% from the left and 23% from the top, extending to 56% across and 78% down the image.

---

## Why This Matters

- **Resolution-agnostic:** The model doesn't need to know or care about the input image's actual size.
  Outputs are always in the same scale, making them easy to work with across different image sizes.
- **Consistent post-processing:** You always apply the same conversion formula, regardless of the input.
- **Optimized for grounding tasks:** The model is trained on 1000 × 1000 resolution images, making this coordinate space a natural fit for 2D spatial grounding and object detection.

---

## Quick Reference: Post-Processing

### Python

```python
def normalize_to_pixel(box, image_width, image_height):
    """Convert Qwen3-VL normalized coordinates to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * image_width / 1000,
        y1 * image_height / 1000,
        x2 * image_width / 1000,
        y2 * image_height / 1000,
    )

# Example usage
bbox_normalized = (120, 230, 560, 780)
bbox_pixels = normalize_to_pixel(bbox_normalized, 1920, 1080)
# → (230.4, 248.4, 1075.2, 842.4)
```

### TypeScript

```typescript
interface BoundingBox {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

function normalizeToPixel(
  box: BoundingBox,
  imageWidth: number,
  imageHeight: number
): BoundingBox {
  return {
    x1: (box.x1 * imageWidth) / 1000,
    y1: (box.y1 * imageHeight) / 1000,
    x2: (box.x2 * imageWidth) / 1000,
    y2: (box.y2 * imageHeight) / 1000,
  };
}

// Example usage
const bboxNormalized: BoundingBox = { x1: 120, y1: 230, x2: 560, y2: 780 };
const bboxPixels = normalizeToPixel(bboxNormalized, 1920, 1080);
// → { x1: 230.4, y1: 248.4, x2: 1075.2, y2: 842.4 }
```

You can also parse the raw `(x1, y1), (x2, y2)` output from the model. Note that the literal parentheses in the pattern must be escaped, otherwise they are treated as capture groups and the group indices come out wrong:

```typescript
function parseQwenBox(raw: string): BoundingBox | null {
  const match = raw.match(/\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)/);
  if (!match) return null;
  return {
    x1: parseInt(match[1], 10),
    y1: parseInt(match[2], 10),
    x2: parseInt(match[3], 10),
    y2: parseInt(match[4], 10),
  };
}

// Example usage
const box = parseQwenBox("(120, 230), (560, 780)");
// → { x1: 120, y1: 230, x2: 560, y2: 780 }
```
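For Python pipelines, the same parsing step can be sketched with the standard `re` module. This mirrors the TypeScript parser above; the helper name `parse_qwen_box` is our own, not part of any Qwen SDK:

```python
import re

# Matches the "(x1, y1), (x2, y2)" box notation described above.
# The literal parentheses are escaped so only the digits are captured.
BOX_RE = re.compile(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)")

def parse_qwen_box(raw):
    """Parse a raw box string into an (x1, y1, x2, y2) tuple, or None."""
    match = BOX_RE.search(raw)
    if match is None:
        return None
    return tuple(int(g) for g in match.groups())

# Example usage
print(parse_qwen_box("(120, 230), (560, 780)"))  # → (120, 230, 560, 780)
```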
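Going the other way — for example, to express a pixel-space region in the model's 0–1000 space — you can invert the same scaling formula. This is a minimal sketch for that assumed use case (the helper name `pixel_to_normalized` is ours, and rounding to integers is a design choice, since the model's coordinate space is integer-valued):

```python
def pixel_to_normalized(box, image_width, image_height):
    """Invert the scaling: map pixel coordinates back to the 0-1000 space."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * 1000 / image_width),
        round(y1 * 1000 / image_height),
        round(x2 * 1000 / image_width),
        round(y2 * 1000 / image_height),
    )

# Example usage: round-trips the pixel box from the Python example above
print(pixel_to_normalized((230.4, 248.4, 1075.2, 842.4), 1920, 1080))
# → (120, 230, 560, 780)
```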