Inspired by imanoop7/Ollama-OCR.
A powerful OCR (Optical Character Recognition) package that uses state-of-the-art vision language models through Ollama to extract text from images.
- LLaVA: A multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4. (LLaVA can occasionally produce incorrect output.)
- Llama 3.2 Vision: Instruction-tuned models optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
- MiniCPM-V 2.6: A GPT-4V-level MLLM for single-image, multi-image, and video understanding on your phone.
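All of these models are driven through Ollama's local HTTP API. As a rough illustration of that mechanism (not this repository's actual code), the sketch below sends a base64-encoded image to Ollama's `/api/generate` endpoint and returns whatever text the model produces; the model name, prompt, and file path are placeholder choices.

```typescript
// Minimal sketch (not this repo's code): call Ollama's /api/generate endpoint
// directly with a vision model to extract text from an image.
// Assumes Ollama is running locally on its default port (11434).
import { readFile } from "node:fs/promises";

async function extractText(imagePath: string): Promise<string> {
  // Ollama accepts images as base64 strings in the `images` array.
  const image = (await readFile(imagePath)).toString("base64");

  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2-vision:11b",         // any pulled vision model works
      prompt: "Extract all text from this image as plain text.",
      images: [image],
      stream: false,                        // return a single JSON response
    }),
  });

  const data = await res.json();
  return data.response;                     // the model's extracted text
}

extractText("./sample.png").then(console.log).catch(console.error);
```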
- Install Ollama
- Pull the required models:
ollama pull llama3.2-vision:11b
ollama pull llava:13b
ollama pull minicpm-v:8b
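If you want to confirm the pulls succeeded programmatically, Ollama's `/api/tags` endpoint lists the locally available models. A minimal check, assuming Ollama is running on its default port, looks roughly like this; the model tags mirror the pull commands above.

```typescript
// Minimal sketch: verify the required vision models have been pulled
// by querying Ollama's /api/tags endpoint.
const REQUIRED = ["llama3.2-vision:11b", "llava:13b", "minicpm-v:8b"];

async function checkModels(): Promise<void> {
  const res = await fetch("http://localhost:11434/api/tags");
  const { models } = await res.json();      // [{ name, size, ... }, ...]
  const installed = new Set(models.map((m: { name: string }) => m.name));

  for (const name of REQUIRED) {
    console.log(`${installed.has(name) ? "ok" : "MISSING"}  ${name}`);
  }
}

checkModels().catch(console.error);
```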
Then run the following commands:
git clone [email protected]:dwqs/ollama-ocr.git
cd ollama-ocr
yarn # or: npm i
yarn dev # or: npm run dev
You can also run the demo with the Docker image debounce/ollama-ocr.
- Markdown Format: The output is a markdown string containing the extracted text from the image.
- Text Format: The output is a plain text string containing the extracted text from the image.
- JSON Format: The output is a JSON object containing the extracted text from the image.
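How the chosen format is honored ultimately depends on the prompt sent to the model and on how its reply is post-processed. The helper below is a hypothetical sketch of that post-processing step, not this repository's actual implementation: markdown and plain text pass through unchanged, while JSON output is parsed with a fallback for replies that are not valid JSON.

```typescript
// Hypothetical helper (not part of this repo's public API): normalize a raw
// model reply into the selected output format. JSON replies are parsed
// defensively, since models sometimes wrap JSON in surrounding prose.
type OutputFormat = "markdown" | "text" | "json";

function formatResult(raw: string, format: OutputFormat): string | object {
  switch (format) {
    case "markdown":
    case "text":
      return raw.trim();                     // returned as-is, as a string
    case "json": {
      // Grab the first {...} block in case the model added extra text.
      const match = raw.match(/\{[\s\S]*\}/);
      try {
        return JSON.parse(match ? match[0] : raw);
      } catch {
        return { text: raw.trim() };         // fall back to wrapping the text
      }
    }
  }
}
```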
MIT