KnowledgeVault: On-Device Semantic Search with RAG

Nov 28, 2025 · 9 min read

on-device-ai · rag · mobile-systems · semantic-search · privacy-first

Privacy-First AI-Powered Note-Taking Without Cloud Dependencies

This project demonstrates how to implement a complete Retrieval-Augmented Generation (RAG) system entirely on mobile devices, eliminating the need for external APIs, cloud services, or network connectivity while preserving user privacy.

Overview

KnowledgeVault is a production-ready Android application built with React Native that brings enterprise-grade semantic search capabilities to personal note-taking. The system generates vector embeddings, performs hybrid search, builds context, and generates summaries—all within the device’s local storage and compute constraints.

Instead of sending user notes to cloud-based embedding services or LLM APIs, this project runs transformer models directly on the device using ONNX Runtime, storing all data in local SQLite databases. The result is instant semantic search with zero network latency and complete data privacy.

Key Capabilities

On-device embeddings: Generate semantic vectors using ONNX models locally
Hybrid search: Combine keyword matching with vector similarity for accurate retrieval
Offline operation: Complete RAG pipeline works without internet connectivity
Privacy preservation: Notes and embeddings never leave the device
Real-time search: Semantic search updates as you type with sub-second latency
Context aggregation: Intelligently builds context from multiple relevant notes
Local summarization: Generates concise summaries from retrieved content

This approach is particularly valuable for users who need intelligent note organization while maintaining strict data privacy, or for developers building mobile applications in environments with limited or unreliable connectivity.

Video Demonstration

[Video: complete walkthrough showing note creation, semantic search, and the RAG pipeline demonstration]

Motivation

Traditional mobile note-taking applications rely on simple keyword search, which fails to capture semantic relationships between content. Modern approaches using cloud-based AI solve this but introduce new challenges:

Privacy concerns: User notes sent to third-party embedding APIs
Network dependency: Requires stable internet connectivity
API costs: Per-request pricing for embeddings and LLM inference
Latency overhead: Network round-trips add 100-500ms per operation
Data residency: Complex compliance requirements for sensitive information
Platform lock-in: Dependence on specific cloud providers

This project exists to answer a fundamental question: Can we deliver intelligent semantic search entirely on mobile devices while preserving privacy and eliminating operational dependencies?

For personal knowledge management, medical notes, legal documentation, or any scenario requiring data privacy, an on-device RAG system offers compelling advantages over cloud-based alternatives.

Core Architecture Principles

Mobile-First RAG Pipeline

The application implements the complete RAG workflow locally:

Note Creation: User enters text content
Embedding Generation: ONNX model converts text to 384-dimensional vectors
Vector Storage: Embeddings stored in SQLite alongside note content
Hybrid Retrieval: Combines BM25 keyword matching with cosine similarity
Context Building: Aggregates top-K relevant notes into coherent context
Summary Generation: Produces concise summaries from retrieved content

Each stage runs on the device, with no external service calls. Notes and embeddings are persisted in local SQLite storage.

ONNX Runtime Integration

ONNX Runtime provides cross-platform ML inference optimized for mobile hardware. The project uses sentence-transformers models exported to ONNX format. This setup:

Runs efficiently on ARM CPUs without GPU acceleration
Provides consistent inference across Android and iOS
Supports batch processing for improved throughput
Handles memory management automatically
Enables model quantization for reduced size

Tokenization and Preprocessing

Text preprocessing happens locally using JavaScript tokenizers:

Tokenization: Converts text to token IDs matching model vocabulary
Padding: Ensures consistent input dimensions for batch processing
Attention masks: Handles variable-length inputs efficiently
Normalization: Applies same preprocessing as model training

This ensures semantic consistency between training and inference environments.
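The padding and attention-mask steps can be sketched as follows. This is a minimal illustration rather than the project's tokenizer: `padBatch` and `PaddedBatch` are hypothetical names, and a pad ID of 0 is an assumption based on BERT-style vocabularies.

```typescript
// Hypothetical helper: pads tokenized sequences to a common length and builds
// the matching attention masks. padId = 0 is an assumption (BERT-style [PAD]).
interface PaddedBatch {
  inputIds: number[][];
  attentionMask: number[][];
}

function padBatch(sequences: number[][], padId: number = 0): PaddedBatch {
  // The longest sequence determines the shared input dimension.
  const maxLen = sequences.reduce((max, seq) => Math.max(max, seq.length), 0);

  const inputIds: number[][] = [];
  const attentionMask: number[][] = [];
  for (const seq of sequences) {
    const padCount = maxLen - seq.length;
    // Real tokens get mask 1; padded positions get mask 0 so the model ignores them.
    inputIds.push(seq.concat(new Array<number>(padCount).fill(padId)));
    attentionMask.push(
      new Array<number>(seq.length).fill(1).concat(new Array<number>(padCount).fill(0))
    );
  }
  return { inputIds, attentionMask };
}
```

Each row of `inputIds` and `attentionMask` then has the same length, so a batch can be flattened into a single `[batchSize, maxLen]` tensor.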

SQLite Vector Storage

The application uses SQLite as both a document store and vector database:

Normalized schema: Separate tables for notes, embeddings, and metadata
Efficient indexing: B-tree indexes for keyword search
Atomic transactions: Ensures data consistency during concurrent operations
Low memory footprint: Suitable for resource-constrained mobile devices
Cross-platform compatibility: Works identically on Android and iOS

Vector similarity searches use brute-force cosine similarity, which remains performant for personal note collections (up to ~10,000 notes).
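Storing a 384-dimensional vector in SQLite usually means serializing it to a BLOB. The sketch below shows one way to do that. The function names and the table layout in the comment are illustrative assumptions, not taken from the repository, and how the bytes are bound to a query depends on the SQLite driver in use.

```typescript
// Sketch: convert an embedding to raw bytes for a SQLite BLOB column and back.
// Hypothetical table layout (an assumption, not the project's actual schema):
//   CREATE TABLE embeddings (note_id TEXT PRIMARY KEY, vector BLOB NOT NULL);

function embeddingToBytes(embedding: Float32Array): Uint8Array {
  // Copy the underlying bytes so the result does not alias the caller's buffer.
  return new Uint8Array(
    embedding.buffer.slice(embedding.byteOffset, embedding.byteOffset + embedding.byteLength)
  );
}

function bytesToEmbedding(bytes: Uint8Array): Float32Array {
  if (bytes.byteLength % 4 !== 0) {
    throw new Error('Byte length must be a multiple of 4 for float32 data');
  }
  // Copy into a fresh, aligned buffer before reinterpreting the bytes as float32.
  const copy = bytes.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

A 384-dimensional float32 vector serializes to 1,536 bytes, which matches the 1.5KB per-embedding figure quoted later in this post.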

System Design

The architecture follows React Native best practices with clear separation of concerns:

User Interface (React Native)
     ↓
Note Management Screen
     ↓
Search Interface with Real-time Updates
     ↓
Embedding Service (ONNX Runtime)
     ↓
Database Layer (SQLite)
     ↓
RAG Pipeline (Retrieval + Context + Summarization)
     ↓
Results Display with Timing Metrics

Each layer communicates through well-defined interfaces, enabling independent testing and future extensibility.

Technical Implementation

Model Selection and Export

The project uses all-MiniLM-L6-v2, a compact sentence-transformer model optimized for mobile inference:

from optimum.onnxruntime import ORTModelForFeatureExtraction

# Export the pre-trained model to ONNX format (export=True converts it on load)
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    'sentence-transformers/all-MiniLM-L6-v2',
    export=True,
    provider='CPUExecutionProvider'
)

# Write model.onnx and its config into the app's asset directory
ort_model.save_pretrained('./assets/models/')

The model produces 384-dimensional embeddings, balancing accuracy with computational efficiency for mobile devices.

ONNX Runtime Inference

The embedding service manages model lifecycle and batch inference:

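import { InferenceSession, Tensor } from 'onnxruntime-react-native';

// tokenize() and normalize() are class methods omitted from this listing.
class EmbeddingService {
  private session!: InferenceSession;

  async initialize() {
    const modelPath = 'assets/models/model.onnx';
    this.session = await InferenceSession.create(modelPath, {
      executionProviders: ['cpu'],
      graphOptimizationLevel: 'all'
    });
  }

  async generateEmbedding(text: string): Promise<Float32Array> {
    // Tokenize input text into input_ids and attention_mask arrays
    const tokens = this.tokenize(text);
    const seqLen = tokens.input_ids.length;

    // Create input tensors; int64 tensors require BigInt64Array data
    const toInt64 = (values: number[]) => BigInt64Array.from(values.map(v => BigInt(v)));
    const inputIds = new Tensor('int64', toInt64(tokens.input_ids), [1, seqLen]);
    const attentionMask = new Tensor('int64', toInt64(tokens.attention_mask), [1, seqLen]);

    // Run inference. If the exported graph also declares a token_type_ids
    // input, pass an all-zero int64 tensor of the same shape.
    const results = await this.session.run({
      input_ids: inputIds,
      attention_mask: attentionMask
    });

    // last_hidden_state has shape [1, seqLen, 384]: mean-pool over the
    // non-padding tokens to get a single sentence vector
    const hidden = results.last_hidden_state.data as Float32Array;
    const dim = hidden.length / seqLen;
    const pooled = new Float32Array(dim);
    let kept = 0;
    for (let t = 0; t < seqLen; t++) {
      if (tokens.attention_mask[t] === 0) continue;
      kept++;
      for (let d = 0; d < dim; d++) pooled[d] += hidden[t * dim + d];
    }
    for (let d = 0; d < dim; d++) pooled[d] /= Math.max(kept, 1);

    // L2-normalize the pooled embedding
    return this.normalize(pooled);
  }
}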

The service maintains a single session across requests, avoiding model reload overhead.
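The normalization step is called in the listing above but not shown. A minimal sketch of L2 normalization, assuming that is what the service's `normalize` method does, could look like this (`l2Normalize` is a hypothetical name):

```typescript
// Sketch: scale a vector to unit length (L2 norm of 1).
// Assumption: this mirrors the normalize step used by the embedding service.
function l2Normalize(vector: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vector.length; i++) {
    sumSquares += vector[i] * vector[i];
  }
  const norm = Math.sqrt(sumSquares);

  const out = new Float32Array(vector.length);
  // A zero vector has no direction, so return zeros instead of dividing by 0.
  if (norm === 0) return out;

  for (let i = 0; i < vector.length; i++) {
    out[i] = vector[i] / norm;
  }
  return out;
}
```

With unit-length embeddings, cosine similarity between two vectors reduces to their dot product, which keeps the search loop cheap.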

Hybrid Search Implementation

The search system combines two complementary approaches:

Keyword Search (BM25):

function keywordSearch(query: string, notes: Note[]): ScoredNote[] {
  const queryTerms = tokenize(query);
  
  return notes.map(note => {
    const noteTerms = tokenize(note.content);
    const score = calculateBM25(queryTerms, noteTerms, notes);
    return { note, score };
  }).sort((a, b) => b.score - a.score);
}
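The `calculateBM25` helper is not shown in the listing. The following is a sketch of standard Okapi BM25 over pre-tokenized documents, not the project's implementation: the name `bm25Score`, the `string[][]` corpus shape, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions (the latter two are conventional values).

```typescript
// Sketch of Okapi BM25 scoring for one document against a query.
// corpus holds every document as a token array; k1 and b are conventional
// defaults, not values taken from the project.
function bm25Score(
  queryTerms: string[],
  docTerms: string[],
  corpus: string[][],
  k1: number = 1.2,
  b: number = 0.75
): number {
  const docCount = corpus.length;
  if (docCount === 0 || docTerms.length === 0) return 0;

  const avgDocLength = corpus.reduce((sum, doc) => sum + doc.length, 0) / docCount;
  // Longer-than-average documents are penalized; shorter ones are boosted.
  const lengthNorm = 1 - b + b * (docTerms.length / avgDocLength);

  let score = 0;
  for (const term of queryTerms) {
    const termFreq = docTerms.filter(t => t === term).length;
    if (termFreq === 0) continue;

    // Rarer terms across the corpus receive a higher inverse document frequency.
    const docsWithTerm = corpus.filter(doc => doc.indexOf(term) !== -1).length;
    const idf = Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));

    score += (idf * termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
  }
  return score;
}
```

Recomputing document frequencies on every call is wasteful for larger collections; caching them per term would be a natural next step.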

Semantic Search (Cosine Similarity):

async function semanticSearch(query: string, embeddings: Embedding[]): Promise<ScoredNote[]> {
  const queryEmbedding = await embeddingService.generateEmbedding(query);
  
  return embeddings.map(item => {
    const similarity = cosineSimilarity(queryEmbedding, item.embedding);
    return { note: item.note, score: similarity };
  }).sort((a, b) => b.score - a.score);
}
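The `cosineSimilarity` helper used above is not shown. A minimal sketch, assuming both embeddings are `Float32Array` values of equal length:

```typescript
// Sketch: cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; returns 0 if either vector has zero magnitude.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions must match');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

If embeddings are L2-normalized when they are generated, the two norm accumulations can be dropped and the dot product used directly.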

Hybrid Fusion:

function fuseResults(
  keywordResults: ScoredNote[],
  semanticResults: ScoredNote[],
  alpha: number = 0.5,
  topK: number = 10
): Note[] {
  const scores = new Map<string, number>();
  const notesById = new Map<string, Note>();

  // Weighted reciprocal-rank fusion: each list contributes weight / (rank + 1),
  // so raw BM25 and cosine scores never need to share a scale
  keywordResults.forEach((result, rank) => {
    notesById.set(result.note.id, result.note);
    scores.set(result.note.id, (1 - alpha) / (rank + 1));
  });

  semanticResults.forEach((result, rank) => {
    notesById.set(result.note.id, result.note);
    const existing = scores.get(result.note.id) ?? 0;
    scores.set(result.note.id, existing + alpha / (rank + 1));
  });

  // Return the top-K notes by fused score
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => notesById.get(id)!);
}

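Because fusion works on ranks rather than raw scores, BM25 and cosine values never have to be rescaled against each other. The alpha parameter sets the weight of the semantic ranking: 0 uses keyword results only, 1 uses semantic results only. This keeps retrieval robust across both exact-term and paraphrased queries.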

RAG Context Building

The context aggregator combines retrieved notes into a coherent prompt:

function buildContext(query: string, retrievedNotes: Note[], maxTokens: number = 512): string {
  let context = `Based on your notes, here are relevant passages for: "${query}"

`;
  let tokenCount = 0;
  
  for (const note of retrievedNotes) {
    const noteTokens = countTokens(note.content);
    
    if (tokenCount + noteTokens > maxTokens) break;
    
    context += `
[${note.title}]
${note.content}

`;
    tokenCount += noteTokens;
  }
  
  return context;
}

This ensures the summarization stage receives relevant information within model constraints.
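The `countTokens` helper is not shown above. One rough stand-in, assuming an approximate budget is acceptable, is to count whitespace-separated words. This is a simplification: a WordPiece tokenizer splits rare words into several pieces, so the real token count is usually higher.

```typescript
// Sketch: approximate token count by counting whitespace-separated words.
// Assumption: an estimate is good enough for budgeting context size; for exact
// counts, the model's own tokenizer would have to be used instead.
function countTokens(text: string): number {
  const trimmed = text.trim();
  if (trimmed.length === 0) return 0;
  return trimmed.split(/\s+/).length;
}
```

Because this undercounts, leaving some headroom below the model's actual limit when choosing `maxTokens` is a sensible precaution.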

Local Summarization

Summary generation uses extractive techniques suitable for mobile devices:

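function generateSummary(context: string, maxLength: number = 200): string {
  const sentences = splitIntoSentences(context);
  const scored = sentences.map((sentence, index) => ({
    text: sentence,
    index,
    score: scoreImportance(sentence, context)
  }));

  // Keep the three highest-scoring sentences, then restore their original order
  const topSentences = scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .sort((a, b) => a.index - b.index)
    .map(s => s.text);

  // Enforce maxLength, treated here as a character budget
  const summary = topSentences.join(' ');
  return summary.length > maxLength ? summary.slice(0, maxLength).trim() : summary;
}

function scoreImportance(sentence: string, context: string): number {
  // TF-IDF-style scoring: frequency within the sentence, weighted by how
  // rare the term is across the whole context
  const terms = tokenize(sentence);
  const contextTerms = tokenize(context);
  if (terms.length === 0) return 0;

  return terms.reduce((score, term) => {
    const tf = countOccurrences(term, terms) / terms.length;
    // Guard against a zero count if tokenization differs between calls
    const contextCount = Math.max(1, countOccurrences(term, contextTerms));
    const idf = Math.log(contextTerms.length / contextCount);
    return score + (tf * idf);
  }, 0);
}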

For production use, this could be replaced with on-device LLMs using frameworks like llama.cpp or MLC LLM.
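The extractive summarizer depends on a `splitIntoSentences` helper that is not shown. A naive sketch, assuming plain prose input, splits on sentence-ending punctuation and line breaks. It will mis-split abbreviations and decimal numbers such as "3.5", so treat it as a starting point only.

```typescript
// Sketch: naive sentence splitter based on ., !, ? and newlines.
// Known limitation: abbreviations and decimals are split incorrectly.
function splitIntoSentences(text: string): string[] {
  // Each match is a run of non-terminator characters plus any trailing terminators.
  const matches = text.match(/[^.!?\n]+[.!?]*/g);
  if (matches === null) return [];
  return matches
    .map(sentence => sentence.trim())
    .filter(sentence => sentence.length > 0);
}
```

For example, `splitIntoSentences('Solar is cheap. Is it?')` yields the two sentences `'Solar is cheap.'` and `'Is it?'`.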

Performance Characteristics

Embedding Generation Latency

Single note embedding generation:

Average latency: 120ms
├── Tokenization: 15ms
├── ONNX inference: 95ms
└── Normalization: 10ms

Batch embedding (10 notes):

Total latency: 450ms
├── Batch tokenization: 80ms
├── Batched inference: 320ms
└── Batch normalization: 50ms

Ten sequential single-note calls would take roughly 1,200ms (10 × 120ms), so batching provides a ~2.7x throughput improvement for bulk operations.
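For bulk operations such as re-indexing, notes first have to be grouped into fixed-size batches. A small sketch of that grouping step follows; `chunk` is a hypothetical helper, and a batch size of 10 simply mirrors the measurement above.

```typescript
// Sketch: split a list into consecutive batches of at most batchSize items.
function chunk<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    // slice() clamps at the end of the array, so the last batch may be shorter.
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```

Each batch can then be tokenized, padded to a common length, and sent through the model in a single inference call.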

Search Performance

Real-time search latency:

Hybrid search: 85ms
├── Keyword search (BM25): 35ms
├── Semantic search (cosine): 40ms
└── Result fusion: 10ms

Performance remains consistent up to 5,000 notes on mid-range Android devices.

Memory Footprint

Application memory usage:

ONNX Runtime: ~80-120MB (inference engine)
Model weights: ~25MB (quantized MiniLM model)
SQLite database: ~2KB per note + 1.5KB per embedding
React Native overhead: ~50-80MB baseline

Total memory consumption remains under 250MB for typical usage patterns.

Storage Efficiency

Database size scaling:

1,000 notes: ~3.5MB (text + embeddings)
5,000 notes: ~17.5MB
10,000 notes: ~35MB

Storage remains manageable even for extensive personal knowledge bases.
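These figures follow from the per-note numbers quoted earlier: about 2KB of text plus a 384-dimensional float32 embedding. The sketch below reproduces the arithmetic. The constant and function names are illustrative, and the estimate ignores SQLite page and index overhead.

```typescript
// Sketch: back-of-the-envelope database size from the figures quoted in this post.
const EMBEDDING_DIMENSIONS = 384;
const BYTES_PER_FLOAT32 = 4;
const APPROX_NOTE_BYTES = 2 * 1024; // the ~2KB per-note figure

function estimateDatabaseBytes(noteCount: number): number {
  // 384 dimensions * 4 bytes = 1536 bytes, i.e. 1.5KB per embedding.
  const embeddingBytes = EMBEDDING_DIMENSIONS * BYTES_PER_FLOAT32;
  return noteCount * (APPROX_NOTE_BYTES + embeddingBytes);
}
```

At 3,584 bytes per note, 1,000 notes come to about 3.5MB and 10,000 notes to about 35MB, in line with the table above.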

What Makes This Approach Different

Most mobile AI applications follow a client-server architecture:

User data sent to cloud embedding services
LLM inference performed on remote servers
Results streamed back over network
Requires authentication and API keys
Incurs per-request costs

This project demonstrates an alternative philosophy:

Privacy by architecture: Data never leaves the device
Zero operational costs: No API fees or subscription requirements
Offline capability: Works without network connectivity
Instant response: No network latency overhead
User control: Complete ownership of data and models

The approach recognizes that for personal knowledge management, privacy and ownership matter more than access to the largest possible models.

Use Cases and Applications

This architecture is particularly well-suited for:

Personal Knowledge Management

Organizing personal notes, journal entries, research papers, and learning materials with intelligent semantic search. Users maintain complete privacy over sensitive personal information.

Medical and Healthcare Notes

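Doctors, therapists, and healthcare professionals managing patient notes where data privacy is legally mandated. On-device processing keeps patient data off third-party servers, which simplifies HIPAA compliance, though compliance still depends on device-level safeguards such as encryption and access control.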

Legal Documentation

Lawyers managing case notes, contracts, and legal research where client confidentiality is paramount. Semantic search helps quickly find relevant precedents and references.

Field Research

Researchers working in locations with limited connectivity (field studies, archaeological sites, remote locations) who need intelligent note organization without cloud access.

Education and Learning

Students building personal wikis and study notes with semantic connections between concepts. Helps identify knowledge gaps and related topics for deeper learning.

Corporate Knowledge Management

Employees managing work notes in regulated industries where data must remain on corporate devices. Enables productivity without violating security policies.

Design Trade-Offs

Decision	Benefit	Trade-Off
On-device embeddings	Complete privacy, no network latency	Limited to smaller models
ONNX format	Cross-platform compatibility	Some optimization limitations
SQLite storage	Lightweight, reliable	Requires brute-force vector search
Hybrid search	Robust across query types	Higher computational cost
Extractive summarization	Fast, deterministic	Less fluent than generative
Single model architecture	Simple deployment	Can’t easily switch models
CPU-only inference	Universal device support	Slower than GPU acceleration

The key insight is optimizing for privacy, offline capability, and operational simplicity over access to state-of-the-art model performance.

Technical Stack

Core Mobile Framework

React Native: Cross-platform mobile development framework enabling a shared codebase across Android and iOS. Provides native performance with a JavaScript development experience.

TypeScript: Type-safe development reducing runtime errors and improving code maintainability. Essential for complex state management in mobile applications.

ML Inference

ONNX Runtime React Native: High-performance inference engine with bindings for React Native. Supports CPU execution on ARM processors with optimized kernels.

Sentence Transformers (ONNX): Pre-trained transformer models exported to ONNX format. Provides semantic embeddings optimized for similarity search.

Data Layer

SQLite (react-native-sqlite-storage): Embedded relational database for structured storage. Provides ACID transactions and efficient querying on mobile devices.

AsyncStorage: Simple key-value storage for application settings and preferences.

UI Components

React Navigation: Navigation library for screen transitions and routing in React Native applications.

React Native Vector Icons: Icon library providing consistent visual elements across platforms.

Project Structure

The repository follows standard React Native conventions:

KnowledgeVault/
├── src/
│   ├── components/          # Reusable UI components
│   │   ├── NoteCard.tsx
│   │   ├── SearchBar.tsx
│   │   └── RAGDemo.tsx
│   ├── screens/             # Application screens
│   │   ├── HomeScreen.tsx
│   │   ├── NoteDetailScreen.tsx
│   │   └── SearchScreen.tsx
│   ├── services/            # Business logic
│   │   ├── EmbeddingService.ts
│   │   ├── SearchService.ts
│   │   └── RAGService.ts
│   ├── db/                  # Database layer
│   │   ├── schema.ts
│   │   └── queries.ts
│   ├── utils/               # Helper functions
│   │   ├── tokenizer.ts
│   │   ├── vectorMath.ts
│   │   └── types.ts
│   └── embedding/           # ONNX integration
│       ├── modelLoader.ts
│       └── inference.ts
├── assets/
│   └── models/
│       └── model.onnx       # Embedded transformer model
├── android/                 # Android native code
├── ios/                     # iOS native code (future)
├── App.tsx                  # Root component
├── package.json
└── README.md

The structure separates UI concerns from business logic, making the codebase maintainable and testable.

Current Implementation

The application currently demonstrates:

Note creation with automatic embedding generation
Real-time semantic search with hybrid ranking
RAG pipeline demonstration showing all stages
Performance metrics for each operation
SQLite-based persistence
Fully offline operation

The implementation prioritizes educational clarity while remaining production-ready for personal use.

Future Enhancements

Potential extensions for enhanced functionality:

On-device LLM integration: Replace extractive summarization with generative models using llama.cpp
iOS support: Port the Android implementation to iOS using the same codebase
Model quantization: Reduce model size through INT8 quantization
GPU acceleration: Leverage mobile GPUs for faster inference
Vector indexing: Implement approximate nearest neighbor search (FAISS, HNSW)
Multi-model support: Allow users to select different embedding models
Cross-device sync: Optional encrypted synchronization between devices
Export capabilities: Backup notes and embeddings to external storage
Rich text support: Enhanced formatting beyond plain text
Attachment handling: Process PDFs and images for semantic search

Target Audience

This project is designed for:

Mobile developers exploring on-device AI implementations
Privacy-conscious users seeking intelligent note-taking without cloud services
ML engineers deploying models to resource-constrained environments
Students learning practical RAG system implementation
Enterprise developers building privacy-compliant mobile applications
Portfolio reviewers evaluating systems architecture and AI integration skills

The project balances practical utility with technical demonstration of modern mobile AI capabilities.

Repository

Full implementation and APK releases available at:
https://github.com/JashT14/KnowledgeVault

Pre-built Android APK available in Releases section.

License

MIT License—free to study, modify, and extend for commercial or personal use.

Final Thoughts

The most powerful AI system is one that respects user privacy while delivering tangible value.

This project demonstrates that sophisticated AI capabilities - semantic search, retrieval augmentation, and intelligent summarization - can run entirely on mobile devices without compromising user privacy or requiring operational infrastructure.

As edge computing and on-device AI continue to evolve, architectures like this become increasingly relevant. The future of mobile AI isn’t necessarily about connecting to the largest cloud models, but about intelligently deploying appropriately-sized models directly where users need them.

This repository represents a practical approach to mobile AI - one that recognizes privacy as a feature, offline capability as an advantage, and simplicity as a strength. The principles demonstrated here remain valuable regardless of which specific models or frameworks dominate the current landscape.

Sometimes the best AI solution isn’t the one with access to the biggest models, but the one that keeps your data where it belongs: under your control.