From Camera to Flashcards: Building AI Smart Scan
How I turned a phone camera into a vocabulary extraction engine using Gemini's vision API and Cloud Functions.
The idea was simple: point your phone camera at a textbook page, and the app automatically extracts vocabulary words and creates flashcards. The implementation was anything but simple. It involved camera capture, image preprocessing, Gemini's vision API, structured output parsing, and graceful error handling for blurry photos.

Multi-language support added significant complexity to the extraction pipeline. The system needed to handle English, Korean, Japanese, and Chinese text, often mixed on the same page. Language detection runs as a preprocessing step, analyzing character distributions to determine the primary language and any secondary languages present. Gemini's multilingual capabilities handled the actual extraction, but prompt templates had to be language-specific to produce natural-sounding definitions and to properly handle linguistic nuances like Korean honorifics or Japanese kanji readings.

The camera integration in Flutter uses platform channels to access native camera APIs.
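The character-distribution analysis mentioned above can be sketched by counting codepoints per Unicode script range. Everything here — the range table, the function names, the 15% secondary-language threshold — is my own illustration, not the app's actual detector:

```python
from collections import Counter

# Simplified Unicode ranges for the supported scripts. A real detector
# would cover extension blocks (CJK Extension A, half-width kana, etc.)
# and weight kana presence to tell Japanese kanji apart from Chinese,
# since both share the CJK unified ideograph range.
SCRIPT_RANGES = {
    "korean":   [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],  # Hangul syllables, jamo
    "japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF)],  # hiragana, katakana
    "chinese":  [(0x4E00, 0x9FFF)],                    # CJK unified ideographs
    "english":  [(0x0041, 0x005A), (0x0061, 0x007A)],  # basic Latin letters
}

def detect_languages(text, secondary_threshold=0.15):
    """Return (primary, secondaries) from the character distribution."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for lang, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[lang] += 1
                break
    total = sum(counts.values()) or 1
    ranked = counts.most_common()
    primary = ranked[0][0] if ranked else "unknown"
    secondaries = [lang for lang, n in ranked[1:]
                   if n / total >= secondary_threshold]
    return primary, secondaries
```

On a mixed page like a Korean textbook with English glosses, this flags Korean as primary and English as secondary, which is enough to pick the right prompt template.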

We needed high-resolution captures with auto-focus confirmation; a blurry image produces garbage vocabulary. The native layer signals when the camera has locked focus, and only then does the capture proceed.

Low-light photography was a persistent user complaint in early versions. Students often study in dimly lit environments, and standard camera captures produced noisy, low-contrast images that degraded OCR accuracy. We implemented adaptive exposure compensation that detects ambient light conditions and adjusts capture parameters accordingly. For extremely low-light scenarios, the app activates a multi-frame capture mode that takes three rapid exposures and computationally merges them for an improved signal-to-noise ratio, similar to night mode in modern smartphone cameras.

Image preprocessing happens client-side to reduce API costs. We resize to a maximum of 2048px on the longest edge, convert to JPEG at 85% quality, and apply contrast enhancement for photographed text. This reduces the payload from 8MB to under 500KB while maintaining text legibility.

Page boundary detection ensures only the relevant content area is processed. Users rarely photograph a perfectly aligned, full-frame textbook page; there's usually desk surface, fingers, or adjacent pages visible. We implemented a lightweight edge detection algorithm running on-device that identifies the page boundaries, applies perspective correction for angled shots, and crops to the content area before sending to the API. This preprocessing step improved extraction accuracy by roughly 15% and further reduced API costs.
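The core of the multi-frame merge is just a frame average: averaging N exposures of the same scene reduces random shot noise by roughly √N. A minimal sketch, assuming the frames are already aligned (real night modes also align frames to compensate for hand shake):

```python
import numpy as np

def merge_exposures(frames):
    """Average N aligned exposures to cut shot noise.

    Assumes uint8 frames of equal shape that are already aligned.
    Averaging in float avoids clipping and rounding artifacts.
    """
    stack = np.stack([f.astype(np.float32) for f in frames])
    merged = stack.mean(axis=0)
    # Round to nearest and clamp back into the uint8 range.
    return np.clip(merged + 0.5, 0, 255).astype(np.uint8)
```

With the three exposures the app captures, noise drops by about √3 ≈ 1.7×, which is often the difference between OCR-able and unusable text in dim light.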
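The resize-and-re-encode step described above can be sketched with Pillow. The 2048px cap and 85% JPEG quality come from the post; the contrast factor of 1.3 and the function name are placeholders of mine, not the app's tuned values:

```python
import io
from PIL import Image, ImageEnhance

MAX_EDGE = 2048    # longest-edge cap from the post
JPEG_QUALITY = 85  # JPEG quality from the post

def preprocess_capture(image, contrast=1.3):
    """Downscale, boost contrast, and re-encode a capture for upload."""
    w, h = image.size
    scale = MAX_EDGE / max(w, h)
    if scale < 1:  # never upscale small captures
        image = image.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    # Mild contrast boost helps photographed (as opposed to scanned) text.
    image = ImageEnhance.Contrast(image.convert("RGB")).enhance(contrast)
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=JPEG_QUALITY)
    return buf.getvalue()
```

Doing this on-device means the network payload and the vision-API billing are both driven by the compressed size, not the raw sensor output.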
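As a heavily simplified stand-in for the boundary detector: the sketch below thresholds image gradients and crops to their bounding box. The production version described above also fits the four page corners and applies a perspective warp, which this omits; the threshold and margin values are arbitrary:

```python
import numpy as np

def crop_to_content(gray, grad_thresh=30.0, margin=4):
    """Crop a grayscale capture to the region containing strong edges."""
    g = gray.astype(np.float32)
    gy, gx = np.gradient(g)
    edges = np.hypot(gx, gy) > grad_thresh  # boolean edge mask
    ys, xs = np.nonzero(edges)
    if len(xs) == 0:
        return gray  # nothing detected; send the full frame
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin + 1, gray.shape[0])
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin + 1, gray.shape[1])
    return gray[y0:y1, x0:x1]
```

Even this crude bounding-box crop discards the desk surface and neighboring pages, which is where most of the accuracy and cost wins come from.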


