olmOCR 2 7B 1025

Type: OCR
Capabilities: vision

Overview

This release of the olmOCR model is fine-tuned from Qwen2.5-VL-7B-Instruct on the olmOCR-mix-1025 dataset. It was then further fine-tuned with GRPO reinforcement learning to improve its performance on math equations, tables, and other difficult OCR cases.

Usage Tips

The model expects a text prompt alongside the image. The default prompt used in the olmocr repo is shown here, followed by the chat message that pairs it with the image:

prompt = "Attached is one page of a document that you must process. Just return the plain text representation of this document as if you were reading it naturally. Convert equations to LateX and tables to HTML.\nIf there are any figures or charts, label them with the following markdown syntax ![Alt text describing the contents of the figure](page_startx_starty_width_height.png)\nReturn your output as markdown, with a front matter section on top specifying values for the primary_language, is_rotation_valid, rotation_correction, is_table, and is_diagram parameters"

# image_base64: the page image, PNG-encoded and then base64-encoded (a str)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]
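
The snippet above leaves `image_base64` undefined. A minimal sketch of how it could be produced from raw PNG bytes is shown here; the helper name `build_messages` is hypothetical and not part of the olmocr repo, and the prompt is passed in as an argument rather than repeated.

```python
import base64


def build_messages(png_bytes: bytes, prompt: str) -> list:
    """Pair a prompt with a PNG page image in the chat-message layout shown above.

    Hypothetical helper: the name and signature are illustrative assumptions.
    """
    # base64.b64encode returns bytes; decode to str so it can go into the data URL.
    image_base64 = base64.b64encode(png_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                },
            ],
        }
    ]
```

The returned list has the same shape as `messages` above, so it can be used wherever that variable is expected.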

Image Processing

The model expects a single document page image as input, rendered so that its longest dimension is 1288 pixels.
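
As a rough illustration of the sizing rule, the sketch below computes target dimensions that put the longest side at 1288 pixels while keeping the aspect ratio. The function name `scaled_size` is hypothetical, and scaling smaller pages up (rather than leaving them as they are) is an assumption about how the rule should be read.

```python
def scaled_size(width: int, height: int, longest: int = 1288) -> tuple:
    """Return (new_width, new_height) with the longest side equal to `longest`.

    Hypothetical helper; assumes pages smaller than `longest` are scaled up too.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width >= height:
        # Landscape or square: width becomes the longest side.
        return longest, max(1, round(height * longest / width))
    # Portrait: height becomes the longest side.
    return max(1, round(width * longest / height)), longest
```

The resulting pair can be handed to whichever renderer or image library produces the page image, for example as the size argument to Pillow's `Image.resize`.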

Pricing

Priority	Input Tokens (per 1M)	Output Tokens (per 1M)
Async	$0.15	$0.15
Batch (24h)	$0.10	$0.10
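
To make the table concrete, here is a small sketch that estimates cost from token counts using the listed per-million prices. The function name `estimate_cost` and the tier keys are hypothetical, and the token counts in the example are invented for illustration, not measurements of this model.

```python
# USD per 1M tokens, copied from the table above: (input, output).
PRICES_PER_MILLION = {
    "async": (0.15, 0.15),
    "batch": (0.10, 0.10),
}


def estimate_cost(input_tokens: int, output_tokens: int, tier: str = "async") -> float:
    """Estimated cost in USD for the given token counts and pricing tier."""
    in_price, out_price = PRICES_PER_MILLION[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 1,000 pages at 1,500 input and 800 output tokens each.
    pages = 1_000
    for tier in PRICES_PER_MILLION:
        cost = estimate_cost(pages * 1_500, pages * 800, tier)
        print(f"{tier}: ${cost:.3f}")
```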
