The design of Cinestar originates from a commitment to user privacy. The goal was to create a powerful media search tool that does not require cloud uploads. This privacy-first principle dictated the first major technical decision: the exclusive use of local AI models.
The initial prototype was an Electron application focused on local image search. The `ImageJobProcessor` performed two key tasks on the user's machine: generating a descriptive caption for each image with a local vision model, and creating a searchable embedding from that caption. This approach provided a powerful keyword search for photos while ensuring that all data and processing remained on the user's machine.
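A minimal sketch of those two steps, assuming dependency-injected captioning, embedding, and storage modules; the interfaces and method names here are illustrative, not the app's actual class:

```typescript
// Illustrative shape of the image pipeline: caption locally, embed the
// caption, store both. Everything stays on the user's machine.
interface VisionModel {
  caption(imagePath: string): Promise<string>;
}
interface Embedder {
  embed(text: string): Promise<number[]>;
}
interface IndexStore {
  upsert(entry: { path: string; caption: string; embedding: number[] }): Promise<void>;
}

export class ImageJobProcessor {
  constructor(
    private vision: VisionModel,
    private embedder: Embedder,
    private index: IndexStore
  ) {}

  async process(imagePath: string): Promise<void> {
    const caption = await this.vision.caption(imagePath);   // task 1: describe the image
    const embedding = await this.embedder.embed(caption);   // task 2: make it searchable
    await this.index.upsert({ path: imagePath, caption, embedding });
  }
}
```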
Extending the system to video introduced the complexity of `ffmpeg`. The initial approach of shelling out `ffmpeg` commands directly led to severe resource contention and UI freezes; a single large video file could overwhelm the system.
Several approaches were tested before arriving at the current solution:
**Attempt 1: Direct Shell Execution**

```bash
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
```

**Attempt 2: Sequential Processing**

```bash
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav && \
ffmpeg -i video.mp4 -vf "select='eq(pict_type,I)'" -vsync vfr frame_%04d.png
```
**Attempt 3: Resource-Aware Pool (Final Solution)**

The solution was a resource-aware processing pool. This system manages a limited number of concurrent `ffmpeg` instances, queuing jobs and executing them as resources become available. This was a critical step toward a stable and scalable architecture, as sketched below.
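A minimal sketch of the pool, assuming Node.js child processes and a hard cap of two concurrent instances; the `FfmpegPool` class and its method names are illustrative, not the actual implementation:

```typescript
import { spawn } from "node:child_process";

// Illustrative resource-aware pool: at most `maxConcurrent` ffmpeg
// processes run at once; additional jobs wait in a FIFO queue.
class FfmpegPool {
  private running = 0;
  private queue: Array<() => void> = [];

  constructor(private maxConcurrent = 2) {}

  async run(args: string[]): Promise<void> {
    await this.acquire();
    try {
      await new Promise<void>((resolve, reject) => {
        const proc = spawn("ffmpeg", args, { stdio: "ignore" });
        proc.on("error", reject);
        proc.on("close", (code) =>
          code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`))
        );
      });
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.running < this.maxConcurrent) {
      this.running++;
      return Promise.resolve();
    }
    return new Promise((resolve) =>
      this.queue.push(() => {
        this.running++;
        resolve();
      })
    );
  }

  private release(): void {
    this.running--;
    const next = this.queue.shift();
    if (next) next();
  }
}

// Usage: audio extraction and keyframe extraction share the same pool,
// so a single large video can no longer saturate the machine.
const pool = new FfmpegPool(2);
pool.run(["-i", "video.mp4", "-vn", "-acodec", "pcm_s16le", "audio.wav"]);
```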
While the `ffmpeg` resource pool solved stability, it revealed a deeper architectural challenge: the conflict between long-running write operations (indexing) and instantaneous read operations (searching). Mixing these two concerns resulted in a sluggish user experience.
The adoption of a Command Query Responsibility Segregation (CQRS)-inspired model addressed this. A hard line was drawn between the read and write paths:
```mermaid
graph TB
    subgraph "User Interface"
        UI[React Frontend]
    end

    subgraph "Read Path (Fast)"
        UI -->|Search Query| Search[Search API]
        Search -->|Vector Search| VectorDB[(vector.db<br/>sqlite-vec)]
        Search -->|Metadata| MainDB[(main.db<br/>SQLite)]
        VectorDB -->|Results| Search
        MainDB -->|Results| Search
        Search -->|< 100ms| UI
    end

    subgraph "Write Path (Async)"
        UI -->|Add Media| JobQueue[Job Queue]
        JobQueue -->|Process| VideoProc[VideoJobProcessor]
        JobQueue -->|Process| ImageProc[ImageJobProcessor]
        VideoProc -->|FFmpeg Pool| FFmpeg[FFmpeg Instances<br/>Max: 2]
        VideoProc -->|Whisper| Transcribe[Audio Transcription]
        VideoProc -->|Ollama| Vision[Visual Captioning]
        VideoProc -->|Update| VectorDB
        VideoProc -->|Update| MainDB
        ImageProc -->|Update| VectorDB
        ImageProc -->|Update| MainDB
    end

    subgraph "Refinement Scheduler"
        Scheduler[RefinementJobScheduler]
        Scheduler -->|Schedule Pass 2<br/>+5min| JobQueue
        Scheduler -->|Schedule Pass 3<br/>+30min| JobQueue
    end

    VideoProc -.->|Trigger| Scheduler
```
**The Write Path:** A robust, asynchronous job processing system (`JobQueue`, `VideoJobProcessor`, `ImageJobProcessor`) designed for throughput and resilience. Its sole responsibility is to ingest and process media to update the search index.

**The Read Path:** A highly optimized query system designed for speed. It directly queries the search index and has no knowledge of the background processing.
This separation is key to the application’s responsive UI, allowing for seamless user interaction during intensive indexing tasks.
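A minimal sketch of the separation at the Electron IPC boundary; the channel names (`search:query`, `media:add`) and injected interfaces are assumptions, since the post does not show the actual handlers:

```typescript
import { ipcMain } from "electron";

// Stand-in interfaces for the app's actual modules (names are illustrative).
interface SearchIndex {
  search(query: string): Promise<unknown[]>;
}
interface JobQueue {
  enqueue(job: { type: string; filePath: string }): Promise<number>;
}

export function registerIpcHandlers(index: SearchIndex, queue: JobQueue) {
  // Read path: query the index directly; never touches background processing.
  ipcMain.handle("search:query", (_event, query: string) => index.search(query));

  // Write path: enqueue and return immediately; the processors update the
  // index asynchronously, so the UI stays responsive during heavy indexing.
  ipcMain.handle("media:add", async (_event, filePath: string) => {
    const jobId = await queue.enqueue({ type: "ingest", filePath });
    return { jobId, status: "queued" };
  });
}
```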
The CQRS architecture enabled a focused approach to the core video search problem. The goal was to make a video searchable while it is still being processed, and then continuously improve search quality over time. This was achieved through a five-phase processing pipeline with progressive threshold-based refinement.
```mermaid
sequenceDiagram
    participant User
    participant Video as Video File
    participant P0 as Phase 0<br/>(Audio)
    participant P1 as Phase 1<br/>(Visual)
    participant P2 as Phase 2<br/>(T:0.8)
    participant P3 as Phase 3<br/>(T:0.6)
    participant P4 as Phase 4<br/>(T:0.4)
    participant Search as Search Index

    User->>Video: Add Video
    Video->>P0: Start Processing
    Note over P0: Segment into 5min chunks

    loop For each segment
        P0->>P0: Transcribe (Whisper)
        P0->>Search: Index immediately
        Note right of Search: ✓ Searchable!
    end

    P0->>P1: Trigger Phase 1

    loop For each segment
        P1->>P1: Extract keyframes
        P1->>P1: Caption (Llama 3.2 Vision)
        P1->>P1: Scene reconstruction (Llama 3.2)
        P1->>Search: Update with multi-modal data
    end

    P1->>P2: Trigger immediately
    P2->>Search: Coarse segmentation (T:0.8)

    Note over P3: Wait 5 minutes
    P2->>P3: Schedule Pass 2
    P3->>Search: Medium refinement (T:0.6)

    Note over P4: Wait 30 minutes (conditional)
    P3->>P4: Schedule Pass 3
    P4->>Search: Fine refinement (T:0.4)
```
The primary goal of this phase is immediate searchability. The process is as follows:

1. The `BatchProcessor` splits the video into 5-minute segments (300 seconds).
2. Each segment's audio is transcribed locally with Whisper.
3. The transcription and its embedding are written to `video-rag.db` (metadata) and `vector.db` (searchable embeddings).

This allows a user to start searching a long video moments after it has been added.

**Performance:** A 60-minute video becomes searchable in ~40 seconds (12 segments × ~3s each).
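A minimal sketch of the Phase 0 loop, assuming a fixed 300-second segment length and hypothetical `transcribe`/`indexSegment` helpers (the real `BatchProcessor` internals are not shown in this post):

```typescript
const SEGMENT_SECONDS = 300; // 5-minute segments, as in Phase 0

interface Segment {
  index: number;
  startSec: number;
  endSec: number;
}

// Split a video of known duration into fixed 5-minute segments.
function planSegments(durationSec: number): Segment[] {
  const segments: Segment[] = [];
  for (let start = 0, i = 0; start < durationSec; start += SEGMENT_SECONDS, i++) {
    segments.push({
      index: i,
      startSec: start,
      endSec: Math.min(start + SEGMENT_SECONDS, durationSec),
    });
  }
  return segments;
}

// Phase 0: transcribe each segment and index it immediately, so the video
// becomes searchable segment by segment instead of only at the very end.
async function runPhase0(
  videoPath: string,
  durationSec: number,
  transcribe: (path: string, s: Segment) => Promise<string>,
  indexSegment: (path: string, s: Segment, text: string) => Promise<void>
): Promise<void> {
  for (const segment of planSegments(durationSec)) {
    const transcript = await transcribe(videoPath, segment); // Whisper, local
    await indexSegment(videoPath, segment, transcript);      // video-rag.db + vector.db
  }
}
```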
After the initial audio indexing, a deeper, multi-modal analysis begins for each segment:

1. Keyframes are extracted from the segment.
2. Each keyframe is captioned by the local vision model.
3. Llama 3.2 reconstructs the scene, combining the transcription, the captions, and the descriptions of previous segments.
4. The segment's index entry is updated with the multi-modal data.

This RNN-style approach allows the model to understand scene continuity and narrative flow across the video. The enriched segment representation combines all three sources:

```text
[Transcription] + Visual Context: [Captions] + Scene: [Reconstruction]
```

**Performance:** ~10-15 seconds per 5-minute segment (4-8s captioning, 2-3s reconstruction, 0.5s embedding).

**Search Impact:** Enables queries like "romantic scene in dimly lit room" or "action sequence" that would be impossible with audio alone.
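A minimal sketch of building that combined text and embedding it with BGE-large through Ollama's `/api/embeddings` endpoint. The `bge-large` model tag and the single combined-text embedding are assumptions; the shipped pipeline stores three embeddings per segment (transcription, caption, reconstruction), so it may embed each modality separately:

```typescript
// Build the multi-modal text for a segment (format shown above) and embed it.
interface SegmentAnalysis {
  transcription: string;
  captions: string[];
  reconstruction: string;
}

function buildSegmentText(s: SegmentAnalysis): string {
  return `${s.transcription} + Visual Context: ${s.captions.join("; ")} + Scene: ${s.reconstruction}`;
}

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "bge-large", prompt: text }),
  });
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding; // 1024-dimensional vector for BGE-large
}
```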
The scene reconstruction process creates a rich, contextual understanding that goes beyond simple object detection. By combining audio, visual information, and temporal context from previous segments, the system can understand object interactions, spatial relationships, temporal actions, and contextual scenes.

**Narrative Continuity (Temporal Context):** The system maintains a sliding window of previous segment descriptions, enabling it to understand how each new segment continues the ongoing narrative.
**Example Scene Reconstruction Prompt:**

```text
Previous segments:
→ "Person introduces topic in office setting"
→ "Close-up of whiteboard with diagrams"

Current segment:
Time: 120s-180s
Audio: "So as you can see from this example..."
Visual: Person pointing at screen. Audience visible. Presentation slide.
Text (OCR): "Key Findings 2024"

Write a paragraph describing what happens in this scene:
```
This RNN-style temporal context allows the model to understand that this is a continuation of the presentation, not an isolated scene. The system effectively creates a “semantic memory” of the video that captures not just what objects are present, but how they relate to each other, what’s happening, and how the narrative flows over time.
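A minimal sketch of how such a prompt could be assembled from a sliding window of prior descriptions; the window size, types, and function names are assumptions, not the actual implementation:

```typescript
const CONTEXT_WINDOW = 2; // number of previous segment descriptions to include (assumed)

interface SegmentContext {
  startSec: number;
  endSec: number;
  audio: string;   // Whisper transcription
  visual: string;  // joined keyframe captions
  ocrText: string; // on-screen text, if any
}

// Build the scene-reconstruction prompt, prepending the last few segment
// descriptions so the model sees the narrative so far (RNN-style context).
function buildReconstructionPrompt(current: SegmentContext, previousScenes: string[]): string {
  const window = previousScenes.slice(-CONTEXT_WINDOW);
  const previous = window.map((s) => `→ "${s}"`).join("\n");
  return [
    "Previous segments:",
    previous || "→ (start of video)",
    "",
    "Current segment:",
    `Time: ${current.startSec}s-${current.endSec}s`,
    `Audio: "${current.audio}"`,
    `Visual: ${current.visual}`,
    `Text (OCR): ${current.ocrText}`,
    "",
    "Write a paragraph describing what happens in this scene:",
  ].join("\n");
}
```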
After the initial indexing and enrichment, the system enters a continuous improvement loop with three additional refinement passes, each using progressively lower confidence thresholds:
- **Phase 2 - Coarse Segmentation** (Threshold: 0.8)
- **Phase 3 - Medium Refinement** (Threshold: 0.6)
- **Phase 4 - Fine Refinement** (Threshold: 0.4)
The `RefinementJobScheduler` manages this process, scheduling Pass 2 (Phase 3) after a ~5-minute delay and Pass 3 (Phase 4) after a further ~30-minute delay, the latter conditionally, as sketched below.
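A minimal sketch of that scheduling logic, assuming in-process timers and a hypothetical `enqueueRefinement` callback; the real scheduler also applies a condition before Pass 3 and persists its state across restarts, both of which are omitted here:

```typescript
// Delays taken from the refinement schedule described above.
const PASS_2_DELAY_MS = 5 * 60 * 1000;  // ~5 minutes  -> threshold 0.6
const PASS_3_DELAY_MS = 30 * 60 * 1000; // ~30 minutes -> threshold 0.4

type EnqueueRefinement = (videoId: string, threshold: number) => Promise<void>;

// Illustrative scheduler: once the coarse pass (threshold 0.8) finishes,
// queue the medium and fine passes with increasing delays.
class RefinementJobScheduler {
  constructor(private enqueueRefinement: EnqueueRefinement) {}

  onCoarsePassComplete(videoId: string): void {
    setTimeout(() => {
      void this.enqueueRefinement(videoId, 0.6); // Pass 2 / Phase 3
      setTimeout(() => {
        void this.enqueueRefinement(videoId, 0.4); // Pass 3 / Phase 4
      }, PASS_3_DELAY_MS);
    }, PASS_2_DELAY_MS);
  }
}
```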
This five-phase approach ensures videos are searchable almost instantly (Phase 0), enriched with multi-modal understanding (Phase 1), and progressively refined over time (Phases 2-4) for maximum search quality.
The search system is designed to understand what users are looking for, not just match keywords. This is achieved through a sophisticated multi-stage pipeline that analyzes queries, generates appropriate embeddings, and intelligently ranks results.
When a user types a search query, the system first analyzes it to understand the intent and modality:
**Step 1: Query Type Classification**

The system uses Llama 3.2 (3B) to classify each query into one of five types.
**Example Classification:**

```javascript
// Query: "romantic scene in dimly lit room"
classification = {
  type: "spatial",
  confidence: 0.92,
  spatialElements: ["romantic scene", "dimly lit room"],
  audioElements: [],
  actionElements: []
}
```
**Step 2: Multi-Modal Query Transformation**

Based on the classification, the query is transformed and expanded with modality-specific keywords:

```javascript
// Original: "person talking about technology"
transformed = {
  searchKeywords: {
    text: ["person", "technology"],
    visual: ["person", "speaking", "presentation"],
    audio: ["talking", "technology", "speech"],
    action: ["talking", "speaking"],
    temporal: []
  },
  transformed: "person speaking about technology, visual: person presenting, audio: technology discussion"
}
```
This transformation ensures the search understands both what’s being said (audio) and what’s being shown (visual).
The system uses a hybrid search approach that combines two complementary methods:
**1. Vector Similarity Search (70% weight)**

- Uses the `sqlite-vec` extension for efficient vector operations

**2. Full-Text Search / FTS (30% weight)**

- Uses SQLite's FTS5 virtual table (`media_fts`) with BM25 scoring
**Hybrid Scoring Formula:**

```text
final_score = α × vector_similarity + (1 - α) × fts_score
```

Where α = 0.7 (configurable).
**Why Hybrid?** Vector similarity captures semantic meaning even when the exact words differ, while full-text search rewards exact keyword and phrase matches; combining the two covers both kinds of query.
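A minimal sketch of the merge step, assuming both score sets are already normalized to [0, 1]; the real ranker additionally applies the type-aware boosts shown next:

```typescript
const ALPHA = 0.7; // weight of vector similarity vs. full-text score

interface ScoredResult {
  id: string;
  vectorScore?: number; // normalized similarity, if matched by vector search
  ftsScore?: number;    // normalized BM25 score, if matched by FTS
}

// Compute the hybrid score for each candidate and sort best-first.
function hybridRank(results: ScoredResult[]): Array<{ id: string; score: number }> {
  return results
    .map((r) => ({
      id: r.id,
      score: ALPHA * (r.vectorScore ?? 0) + (1 - ALPHA) * (r.ftsScore ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}
```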
Results are further enhanced based on query type:
```javascript
// Temporal queries boost video segments with timestamps
if (queryType === 'temporal' && result.path.includes('#t=')) {
  score *= 1.1; // 10% boost
}

// Spatial queries boost full videos over segments
if (queryType === 'spatial' && result.type === 'video') {
  score *= 1.05; // 5% boost
}
```
This ensures that temporal queries favor timestamped segments, while spatial queries favor full videos over individual segments.
**1. Database Architecture**

- SQLite with the `sqlite-vec` extension for local vector search
- FTS5 virtual table (`media_fts`) for text search

**2. Search Cancellation**

**3. Result Deduplication**

- Parent videos and their matching segments are merged into a single result (e.g., "2 segments match")
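A minimal sketch of the deduplication step, assuming segment results carry a `parentId` pointing at their full video; the field names are illustrative:

```typescript
interface RankedResult {
  id: string;
  parentId?: string; // set on video segments, absent on images and full videos
  score: number;
}

interface DedupedResult {
  id: string;
  score: number;
  matchingSegments: number;
}

// Collapse a parent video and its matching segments into one result,
// keeping the best score and counting how many segments matched.
function deduplicate(results: RankedResult[]): DedupedResult[] {
  const byKey = new Map<string, DedupedResult>();
  for (const r of results) {
    const key = r.parentId ?? r.id;
    const existing = byKey.get(key);
    if (!existing) {
      byKey.set(key, { id: key, score: r.score, matchingSegments: r.parentId ? 1 : 0 });
    } else {
      existing.score = Math.max(existing.score, r.score);
      if (r.parentId) existing.matchingSegments += 1;
    }
  }
  return [...byKey.values()].sort((a, b) => b.score - a.score);
}
```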
```mermaid
sequenceDiagram
    participant User
    participant UI
    participant Classifier as Query Classifier<br/>(Llama 3.2)
    participant Embedder as Embedding Generator<br/>(BGE-large)
    participant Vector as Vector Search<br/>(sqlite-vec)
    participant FTS as Full-Text Search<br/>(FTS5)
    participant Ranker as Hybrid Ranker

    User->>UI: "romantic scene"
    UI->>Classifier: Classify query type
    Classifier->>UI: type="spatial", confidence=0.9
    UI->>Classifier: Transform query
    Classifier->>UI: keywords={visual: ["romantic", "scene", "couple"]}
    UI->>Embedder: Generate embedding
    Embedder->>UI: 1024D vector

    par Parallel Search
        UI->>Vector: Search with embedding
        Vector->>UI: Results with similarity scores
    and
        UI->>FTS: Search with keywords
        FTS->>UI: Results with BM25 scores
    end

    UI->>Ranker: Merge results (α=0.7)
    Ranker->>UI: Hybrid scores + type boosting
    UI->>User: Ranked results
```
This multi-stage approach ensures that searches are both intelligent (understanding intent) and precise (matching relevant content), delivering results that match the user’s mental model of what they’re looking for.
The search system applies different strategies for images vs. videos, reflecting their fundamental differences in content structure:
**Indexing:** Each image is captioned once by the local vision model, and a single embedding is generated from that caption.

**Search:** The query embedding is compared directly against the caption embeddings.

**Example:**

```text
Image: sunset-beach.jpg
Caption: "Golden sunset over ocean with silhouetted palm trees"
Embedding: [1024D vector from caption]

Search: "beach sunset" → Direct similarity match
```
**Indexing:** Each segment stores three embeddings: `transcription_embedding`, `caption_embedding`, and `reconstruction_embedding`.

**Search:**

- Temporal queries (e.g., `"beginning"`, `"first 5 minutes"`) → boost segments with timestamps (1.1×)
- Spatial queries (e.g., `"red car"`, `"mountains"`) → boost full videos over segments (1.05×)
- Audio queries (e.g., `"talking about technology"`) → weight transcription matches higher

**Example:**
```text
Video: presentation.mp4 (20 minutes)

Batch 1 (0-5min):
  - Transcription: "Welcome everyone, today we'll discuss..."
  - Keyframes: ["Speaker at podium", "Title slide", "Audience", "Diagram"]
  - Scene: "Professional presentation begins in conference room with speaker introducing topic to seated audience"
  - Temporal context: [Start of video]
  - Embedding: [Multi-modal 1024D vector]

Batch 2 (5-10min):
  - Transcription: "As you can see from this example..."
  - Keyframes: ["Close-up of screen", "Pointing gesture", "Data chart", "Audience reaction"]
  - Scene: "Continuing the presentation, speaker explains technical concepts using visual aids while audience takes notes"
  - Temporal context: ["Professional presentation begins..."]
  - Embedding: [Enhanced with narrative continuity]

Search: "presentation about technology"
  → Matches both batches with different scores
  → Deduplicates to show parent video with "2 segments match"
  → Clicking opens video player with segment navigation
```
Key Differences:
Aspect | Images | Videos |
---|---|---|
Granularity | 1 item = 1 image | 1 video = N segments/batches |
Embeddings | 1 caption embedding | 3 embeddings per segment (audio, visual, scene) |
Temporal Context | None (static) | RNN-style sliding window |
Query Boosting | Standard scoring | Type-aware (temporal/spatial/audio) |
Result Deduplication | Not needed | Parent + segments merged |
Navigation | Direct image view | Timestamp-based seeking |
Processing Time | ~2-5s per image | ~40s for 60min video (Phase 0) |
This differentiated approach ensures that each media type is indexed and searched in a way that matches how users naturally think about that content—images as single visual moments, videos as temporal narratives with multiple modalities.
**Problem:** Search queries took 10+ seconds during video indexing because Phase 1 captioning and search embeddings competed for the same Ollama instance.

**Solution:** Implemented a dual Ollama architecture: two local Ollama instances behind an nginx load balancer, so search embedding requests no longer queue behind indexing work.

**Result:** Search latency reduced from 10,281ms to 500-1000ms during active indexing.
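A minimal sketch of the idea, assuming the two instances listen on different local ports and that query-time embeddings are pinned to one of them; the ports, model tag, and routing are assumptions, since the post only states that two instances sit behind an nginx load balancer:

```typescript
// Assumed ports for the two local Ollama instances.
const SEARCH_OLLAMA_URL = "http://localhost:11434"; // reserved for query embeddings
const INDEX_OLLAMA_URL = "http://localhost:11435";  // used by Phase 1 captioning, etc.

async function ollamaEmbed(baseUrl: string, model: string, prompt: string): Promise<number[]> {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt }),
  });
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding;
}

// Search-path embeddings never share a queue with indexing work, which is
// what brings query latency back under ~1 second during active indexing.
export const embedForSearch = (q: string) => ollamaEmbed(SEARCH_OLLAMA_URL, "bge-large", q);
export const embedForIndexing = (t: string) => ollamaEmbed(INDEX_OLLAMA_URL, "bge-large", t);
```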
**Problem:** Video segments were stored in `video-rag.db` but not indexed in `vector.db`, causing 0 search results despite successful processing.
**Solution:** Added an immediate vector indexing step in Phase 0, right after segment storage. Both databases are now kept in sync.

**Result:** Videos become searchable immediately after transcription completes.
**Problem:** The UI showed incorrect progress when jobs were resumed after an app restart.

**Solution:** Implemented phase-specific progress tracking with database persistence. Each phase tracks 0-100% independently, and the system can resume from any phase.

**Result:** Accurate progress display and graceful resume capability.
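A minimal sketch of per-phase progress persistence, assuming the better-sqlite3 library and an illustrative table name and schema (not the app's actual schema):

```typescript
import Database from "better-sqlite3";

// Illustrative schema: one row per (job, phase), each tracking 0-100%
// independently so a restarted job can resume from the right phase.
const db = new Database("main.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS job_phase_progress (
    job_id   TEXT NOT NULL,
    phase    INTEGER NOT NULL,              -- 0..4
    progress INTEGER NOT NULL DEFAULT 0,    -- 0..100
    PRIMARY KEY (job_id, phase)
  )
`);

export function updatePhaseProgress(jobId: string, phase: number, progress: number): void {
  db.prepare(
    `INSERT INTO job_phase_progress (job_id, phase, progress) VALUES (?, ?, ?)
     ON CONFLICT(job_id, phase) DO UPDATE SET progress = excluded.progress`
  ).run(jobId, phase, Math.min(100, Math.max(0, progress)));
}

// On resume, continue from the first recorded phase that has not reached
// 100%, falling back to phase 0 when nothing unfinished is recorded.
export function nextUnfinishedPhase(jobId: string): number {
  const row = db
    .prepare(`SELECT MIN(phase) AS phase FROM job_phase_progress WHERE job_id = ? AND progress < 100`)
    .get(jobId) as { phase: number | null };
  return row.phase ?? 0;
}
```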
Phase | Duration | Cumulative | Status |
---|---|---|---|
Phase 0 (Audio) | ~40s | 40s | ✓ Searchable |
Phase 1 (Visual) | ~60s | 100s | ✓ Multi-modal |
Phase 2 (T:0.8) | ~10s | 110s | ✓ Coarse |
Phase 3 (T:0.6) | +5min | +5min | ✓ Medium |
Phase 4 (T:0.4) | +30min | +30min | ✓ Fine |
Component | Technology | Purpose |
---|---|---|
Desktop Framework | Electron | Cross-platform desktop app |
Frontend | React + TypeScript | UI components |
Styling | TailwindCSS + shadcn/ui | Modern, responsive design |
Icons | Lucide | Consistent iconography |
Model | Parameters | Purpose | Runtime |
---|---|---|---|
Whisper | Base | Audio transcription | Local CPU |
BGE-large | 335M | Text embeddings (1024D) | Ollama |
Moondream:v2 | 2B | Visual captioning | Ollama |
Llama 3.2 | 3B | Scene reconstruction & query analysis | Ollama |
Component | Technology | Purpose |
---|---|---|
Main Database | SQLite | Metadata, jobs, configuration |
Vector Database | SQLite + sqlite-vec | Semantic search with embeddings |
Video Database | SQLite (video-rag.db) | Video segments, batches, transcriptions |
Tool | Purpose |
---|---|
FFmpeg | Video/audio extraction, keyframe generation |
FFprobe | Video metadata analysis |
Sharp | Image processing (thumbnails, compression) |
Component | Technology |
---|---|
AI Runtime | Ollama (2 instances + nginx load balancer) |
Process Management | Node.js child processes with semaphore pools |
IPC | Electron IPC (main ↔ renderer) |
Cinestar’s architecture is the result of an iterative engineering process. The core principles of a CQRS-inspired model, phased processing, and a steadfast commitment to privacy through local AI have resulted in a platform that is both powerful and responsive. This demonstrates that a world-class search experience can be delivered without compromising user privacy, with all processing and data storage handled on the local machine.
The five-phase pipeline ensures that users get immediate value (Phase 0) while the system continuously improves search quality in the background (Phases 1-4). The dual-database architecture (metadata + vectors) and dual-Ollama setup (search + indexing) provide both speed and intelligence without sacrificing user experience.
**Key Takeaways:**

- Privacy is a design constraint: all processing and data storage stay on the local machine.
- Separating the read and write paths (CQRS) keeps search responsive during heavy indexing.
- Phased, progressive processing makes videos searchable within seconds and improves quality over time.
- Hybrid vector + full-text search balances semantic understanding with keyword precision.