Bookshelf Scanner

Tags: VLM, tool, personal, Persian, React

Vision-language catalog tool for personal libraries, with bilingual spine recognition

Published May 11, 2026

Problem

Cataloging a personal library of several hundred books is slow when half of them are in Persian. Existing tools assume barcoded, commercially published, English-language books. Most Persian editions have no ISBN in any accessible database, and barcode-scanning apps either skip them entirely or return nothing.

Approach

The constraints were clear from the start: no paid API, no server to maintain, a model that handles non-Latin scripts without fine-tuning. The result is a browser-only React app with no backend.

The pipeline treats cataloging as a three-stage conversation between a vision model, a human reviewer, and an enrichment layer.

Stage 1. Upload and extract. Shelf photos go to Qwen2.5-VL-72B via the HuggingFace Router. The model reads each spine and returns structured data: title and author in the original script, a romanized version for non-Latin text, and a confidence score. Persian/Arabic normalization runs before deduplication, because several Unicode characters render identically but compare as different strings: Persian yeh (U+06CC) versus Arabic yeh (U+064A), Persian keheh (U+06A9) versus Arabic kaf (U+0643), the zero-width non-joiner (ZWNJ, U+200C), and tatweel (U+0640).
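A minimal sketch of what the normalization step might look like. The function name, the mapping table, and the choice to strip ZWNJ outright (rather than replace it with a space) are assumptions for illustration, not the app's actual code. The output is meant as a comparison key for deduplication, not as display text.

```typescript
// Hypothetical mapping: Arabic code points folded to their Persian look-alikes.
const ARABIC_TO_PERSIAN: Record<string, string> = {
  "\u064A": "\u06CC", // Arabic yeh -> Persian yeh
  "\u0643": "\u06A9", // Arabic kaf -> Persian keheh
};

// Builds a comparison key so visually identical titles compare equal.
// Assumption: ZWNJ and tatweel carry no meaning for matching purposes.
function normalizePersian(text: string): string {
  return text
    .normalize("NFC")
    .replace(/[\u064A\u0643]/g, (ch) => ARABIC_TO_PERSIAN[ch] ?? ch)
    .replace(/\u0640/g, "") // strip tatweel (kashida)
    .replace(/\u200C/g, "") // strip zero-width non-joiner
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}
```

Two spellings of the same word, one typed with Arabic kaf and one with Persian keheh, map to the same key after this step, so the deduplication pass can treat them as one book.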

Stage 2. Review and confirm. Results appear in an editable table sorted from lowest to highest confidence, so the rows most likely to need a human eye come first. Confidence renders as three buckets rather than a raw percentage, because the model’s self-reported score is not a calibrated probability. Fuzzy deduplication catches books that appear across multiple photos.
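The review logic could be sketched roughly as follows. The `SpineRow` shape, the bucket thresholds (0.6 and 0.85), and the similarity cut-off are all illustrative assumptions. The source does not say which fuzzy-matching method the app uses, so normalized Levenshtein distance stands in here as one common choice.

```typescript
type Bucket = "low" | "medium" | "high";

// Hypothetical row shape; confidence is the model's self-report in [0, 1].
interface SpineRow {
  title: string;
  author: string;
  confidence: number;
}

// Assumed thresholds: three coarse buckets instead of a raw percentage.
function toBucket(confidence: number): Bucket {
  if (confidence >= 0.85) return "high";
  if (confidence >= 0.6) return "medium";
  return "low";
}

// Lowest confidence first, so uncertain rows lead the review table.
function sortForReview(rows: SpineRow[]): SpineRow[] {
  return [...rows].sort((a, b) => a.confidence - b.confidence);
}

// Levenshtein edit distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the two strings are identical.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

// Keeps one row per near-duplicate group, preferring the higher confidence.
function dedupe(rows: SpineRow[], threshold = 0.85): SpineRow[] {
  const kept: SpineRow[] = [];
  for (const row of rows) {
    const key = `${row.title} ${row.author}`.toLowerCase();
    const idx = kept.findIndex(
      (k) => similarity(key, `${k.title} ${k.author}`.toLowerCase()) >= threshold
    );
    if (idx === -1) kept.push(row);
    else if (row.confidence > kept[idx].confidence) kept[idx] = row;
  }
  return kept;
}
```

In practice the keys would be normalized before comparison, so that script variants do not inflate the edit distance between two readings of the same spine.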

Stage 3. Enrich and browse. Confirmed books are sent to the Open Library API for year, page count, ISBN, and cover image. A second model call fills specific gaps the API leaves, particularly author nationality, which Open Library rarely returns reliably. Genre is normalized to a small controlled vocabulary because Open Library subject arrays include entries like “Accessible book” alongside actual genres. The catalog lives in IndexedDB and renders as a searchable dashboard with breakdowns by country, language, genre, and decade.
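Genre normalization might look something like the sketch below. The vocabulary, the keyword table, and the noise list are assumptions chosen for illustration; only “Accessible book” comes from the text above. The function takes the raw subject strings returned for a book and maps them to the first matching genre.

```typescript
// Hypothetical controlled vocabulary.
type Genre = "Fiction" | "Poetry" | "History" | "Philosophy" | "Science" | "Other";

// Assumed keyword table. Order matters: "science fiction" should resolve
// to Fiction, so Fiction is checked before Science.
const GENRE_KEYWORDS: Array<[Genre, string[]]> = [
  ["Poetry", ["poetry", "poems"]],
  ["Fiction", ["fiction", "novel", "short stories"]],
  ["History", ["history"]],
  ["Philosophy", ["philosophy"]],
  ["Science", ["science", "physics", "biology"]],
];

// Subject entries that describe availability rather than content (assumed list).
const NOISE_SUBJECTS = new Set(["accessible book", "protected daisy", "in library"]);

// Returns the first genre whose keyword appears in a non-noise subject.
function normalizeGenre(subjects: string[]): Genre {
  for (const raw of subjects) {
    const subject = raw.trim().toLowerCase();
    if (NOISE_SUBJECTS.has(subject)) continue;
    for (const [genre, keywords] of GENRE_KEYWORDS) {
      if (keywords.some((k) => subject.includes(k))) return genre;
    }
  }
  return "Other";
}
```

A first-match rule keeps the mapping predictable, at the cost of depending on subject order. A book with no recognizable subject falls through to "Other" rather than taking a noise entry as its genre.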

Why this project

Most of my work is clinical AI, which means the tooling is serious, validated, and slow to build. This is the opposite: a personal project where I could use VLMs, handle messy multilingual text, and build something that exists for its own sake. It also turned out to be the most direct way I have found to understand how vision-language models actually behave on real, imperfect inputs rather than benchmark images.

An earlier version had a FastAPI backend doing the same work. For a single-user local tool, that meant two processes to run, Python as an installation dependency, and logic written twice. Removing the backend and moving persistence to IndexedDB made the code shorter and the setup trivial: clone, npm install, paste a HuggingFace token, done.

Status

Public. The pipeline works for bilingual libraries. Known limitations: HEIC images from iPhone exports are not yet supported, and VLM accuracy on decorative Persian calligraphic spines drops to around 80 to 85 percent, which is why the review step is built into the pipeline rather than offered as an optional extra.
