KnowledgeVault: On-Device Semantic Search with RAG
Privacy-First AI-Powered Note-Taking Without Cloud Dependencies
This project demonstrates how to implement a complete Retrieval-Augmented Generation (RAG) system entirely on mobile devices, eliminating the need for external APIs, cloud services, or network connectivity while preserving user privacy.
Overview
KnowledgeVault is a production-ready Android application built with React Native that brings enterprise-grade semantic search capabilities to personal note-taking. The system generates vector embeddings, performs hybrid search, builds context, and generates summaries—all within the device’s local storage and compute constraints.
Instead of sending user notes to cloud-based embedding services or LLM APIs, this project runs transformer models directly on the device using ONNX Runtime, storing all data in local SQLite databases. The result is instant semantic search with zero network latency and complete data privacy.
Key Capabilities
- On-device embeddings: Generate semantic vectors using ONNX models locally
- Hybrid search: Combine keyword matching with vector similarity for accurate retrieval
- Offline operation: Complete RAG pipeline works without internet connectivity
- Privacy preservation: Notes and embeddings never leave the device
- Real-time search: Semantic search updates as you type with sub-second latency
- Context aggregation: Intelligently builds context from multiple relevant notes
- Local summarization: Generates concise summaries from retrieved content
This approach is particularly valuable for users who need intelligent note organization while maintaining strict data privacy, or for developers building mobile applications in environments with limited or unreliable connectivity.
Video Demonstration
A video walkthrough covers note creation, semantic search, and the RAG pipeline demonstration.
Motivation
Traditional mobile note-taking applications rely on simple keyword search, which fails to capture semantic relationships between content. Modern approaches using cloud-based AI solve this but introduce new challenges:
- Privacy concerns: User notes sent to third-party embedding APIs
- Network dependency: Requires stable internet connectivity
- API costs: Per-request pricing for embeddings and LLM inference
- Latency overhead: Network round-trips add 100-500ms per operation
- Data residency: Complex compliance requirements for sensitive information
- Platform lock-in: Dependence on specific cloud providers
This project exists to answer a fundamental question: Can we deliver intelligent semantic search entirely on mobile devices while preserving privacy and eliminating operational dependencies?
For personal knowledge management, medical notes, legal documentation, or any scenario requiring data privacy, an on-device RAG system offers compelling advantages over cloud-based alternatives.
Core Architecture Principles
Mobile-First RAG Pipeline
The application implements the complete RAG workflow locally:
- Note Creation: User enters text content
- Embedding Generation: ONNX model converts text to 384-dimensional vectors
- Vector Storage: Embeddings stored in SQLite alongside note content
- Hybrid Retrieval: Combines BM25 keyword matching with cosine similarity
- Context Building: Aggregates top-K relevant notes into coherent context
- Summary Generation: Produces concise summaries from retrieved content
Each stage runs on the device, with no external service calls.
ONNX Runtime Integration
ONNX Runtime provides cross-platform ML inference optimized for mobile hardware. The project runs sentence-transformers models exported to ONNX format on this runtime, a combination that:
- Runs efficiently on ARM CPUs without GPU acceleration
- Provides consistent inference across Android and iOS
- Supports batch processing for improved throughput
- Handles memory management automatically
- Enables model quantization for reduced size
Tokenization and Preprocessing
Text preprocessing happens locally using JavaScript tokenizers:
- Tokenization: Converts text to token IDs matching model vocabulary
- Padding: Ensures consistent input dimensions for batch processing
- Attention masks: Handles variable-length inputs efficiently
- Normalization: Applies same preprocessing as model training
This ensures semantic consistency between training and inference environments.
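As a rough illustration of the padding and attention-mask steps, the sketch below pads a batch of token ID sequences to a common length. The `padBatch` function, the `PaddedBatch` shape, and the example IDs are hypothetical and not taken from the project source; a pad ID of 0 is assumed, which matches BERT-style vocabularies.

```typescript
// Hypothetical result shape for a padded batch (illustrative only).
interface PaddedBatch {
  inputIds: number[][];
  attentionMask: number[][];
}

// Pad (or truncate) each sequence to a common length and build matching
// attention masks: 1 marks a real token, 0 marks padding.
// Note: plain truncation drops a trailing end-of-sequence token; a real
// tokenizer would re-append it.
function padBatch(sequences: number[][], maxLength: number, padId: number = 0): PaddedBatch {
  const longest = Math.max(0, ...sequences.map(s => s.length));
  const targetLength = Math.min(longest, maxLength);
  const inputIds: number[][] = [];
  const attentionMask: number[][] = [];
  for (const seq of sequences) {
    const kept = seq.slice(0, targetLength);
    const padCount = targetLength - kept.length;
    inputIds.push([...kept, ...new Array<number>(padCount).fill(padId)]);
    attentionMask.push([
      ...new Array<number>(kept.length).fill(1),
      ...new Array<number>(padCount).fill(0)
    ]);
  }
  return { inputIds, attentionMask };
}

// Two example sequences of different lengths (IDs are placeholders).
const batch = padBatch([[101, 7592, 102], [101, 7592, 2088, 999, 102]], 8);
```

Both rows come back with the same length, so they can be stacked into a single `[batchSize, sequenceLength]` tensor, while the mask tells the model which positions to ignore.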
SQLite Vector Storage
The application uses SQLite as both a document store and vector database:
- Normalized schema: Separate tables for notes, embeddings, and metadata
- Efficient indexing: B-tree indexes for keyword search
- Atomic transactions: Ensures data consistency during concurrent operations
- Low memory footprint: Suitable for resource-constrained mobile devices
- Cross-platform compatibility: Works identically on Android and iOS
Vector similarity searches use brute-force cosine similarity, which remains performant for personal note collections (up to ~10,000 notes).
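One way to store an embedding next to its note is as raw float32 bytes in a BLOB column. The sketch below shows the encoding and decoding only. The function names are hypothetical, and how the bytes are bound to a column (BLOB, or base64 text) depends on the SQLite binding in use, so that part is not shown.

```typescript
// Encode an embedding as little-endian float32 bytes (4 bytes per dimension).
function embeddingToBytes(embedding: Float32Array): Uint8Array {
  const bytes = new Uint8Array(embedding.length * 4);
  const view = new DataView(bytes.buffer);
  embedding.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return bytes;
}

// Decode bytes read back from storage into a Float32Array.
function bytesToEmbedding(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(bytes.byteLength / 4);
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getFloat32(i * 4, true);
  }
  return out;
}

// A 384-dimensional vector occupies 384 * 4 = 1,536 bytes.
const sampleEmbedding = new Float32Array(384).map((_, i) => i / 384);
const stored = embeddingToBytes(sampleEmbedding);
const restored = bytesToEmbedding(stored);
```

Fixing the byte order explicitly keeps the stored format independent of the device that wrote it.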
System Design
The architecture follows React Native best practices with clear separation of concerns:
User Interface (React Native)
↓
Note Management Screen
↓
Search Interface with Real-time Updates
↓
Embedding Service (ONNX Runtime)
↓
Database Layer (SQLite)
↓
RAG Pipeline (Retrieval + Context + Summarization)
↓
Results Display with Timing Metrics
Each layer communicates through well-defined interfaces, enabling independent testing and future extensibility.
Technical Implementation
Model Selection and Export
The project uses all-MiniLM-L6-v2, a compact sentence-transformer model optimized for mobile inference:
from optimum.onnxruntime import ORTModelForFeatureExtraction
# Load the pre-trained model and export it to ONNX format
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    'sentence-transformers/all-MiniLM-L6-v2',
    export=True,
    provider='CPUExecutionProvider'
)
# Write the exported model into the app's assets directory
ort_model.save_pretrained('./assets/models/')
The model produces 384-dimensional embeddings, balancing accuracy with computational efficiency for mobile devices.
ONNX Runtime Inference
The embedding service manages model lifecycle and batch inference:
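import { InferenceSession, Tensor } from 'onnxruntime-react-native';

class EmbeddingService {
  private session: InferenceSession | null = null;

  async initialize() {
    const modelPath = 'assets/models/model.onnx';
    this.session = await InferenceSession.create(modelPath, {
      executionProviders: ['cpu'],
      graphOptimizationLevel: 'all'
    });
  }

  async generateEmbedding(text: string): Promise<Float32Array> {
    if (!this.session) throw new Error('Call initialize() before generateEmbedding()');
    // Tokenize input text (tokenize() and normalize() are helper methods not shown here;
    // tokenize() is assumed to return number[] arrays)
    const tokens = this.tokenize(text);
    const seqLength = tokens.input_ids.length;
    // Create input tensors; int64 tensors take BigInt64Array data
    const toInt64 = (values: number[]) => BigInt64Array.from(values, v => BigInt(v));
    const inputIds = new Tensor('int64', toInt64(tokens.input_ids), [1, seqLength]);
    const attentionMask = new Tensor('int64', toInt64(tokens.attention_mask), [1, seqLength]);
    // Run inference (some BERT-style exports also require a token_type_ids input of zeros)
    const results = await this.session.run({
      input_ids: inputIds,
      attention_mask: attentionMask
    });
    // last_hidden_state holds one vector per token ([1, seqLength, 384]);
    // mean-pool the unmasked tokens into a single sentence embedding
    const hidden = results.last_hidden_state.data as Float32Array;
    const dim = hidden.length / seqLength;
    const pooled = new Float32Array(dim);
    let tokenCount = 0;
    for (let t = 0; t < seqLength; t++) {
      if (!tokens.attention_mask[t]) continue;
      tokenCount++;
      for (let d = 0; d < dim; d++) pooled[d] += hidden[t * dim + d];
    }
    for (let d = 0; d < dim; d++) pooled[d] /= Math.max(tokenCount, 1);
    // Normalize to unit length
    return this.normalize(pooled);
  }
}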
The service maintains a single session across requests, avoiding model reload overhead.
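The `normalize` helper called in `generateEmbedding` is not shown in the listing. A minimal sketch of what it might look like, assuming it performs L2 (unit-length) normalization, is given here under the hypothetical name `l2Normalize`:

```typescript
// Scale a vector to unit length. A zero vector is returned unchanged
// to avoid dividing by zero.
function l2Normalize(vector: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vector.length; i++) {
    sumSquares += vector[i] * vector[i];
  }
  const norm = Math.sqrt(sumSquares) || 1;
  return vector.map(v => v / norm);
}

const unit = l2Normalize(new Float32Array([3, 4]));
```

Storing unit-length embeddings has a practical benefit: the cosine similarity of two unit vectors equals their dot product, so the search loop can skip recomputing norms.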
Hybrid Search Implementation
The search system combines two complementary approaches:
Keyword Search (BM25):
function keywordSearch(query: string, notes: Note[]): ScoredNote[] {
  const queryTerms = tokenize(query);
  return notes.map(note => {
    const noteTerms = tokenize(note.content);
    const score = calculateBM25(queryTerms, noteTerms, notes);
    return { note, score };
  }).sort((a, b) => b.score - a.score);
}
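The listing above relies on a `calculateBM25` helper that is not shown. As a sketch of the underlying scoring, the hypothetical `bm25Scores` function below computes Okapi BM25 for every document in one pass over pre-tokenized text. Its signature differs from `calculateBM25`, and the `k1` and `b` values are commonly cited defaults rather than the project's settings.

```typescript
// Okapi BM25 over pre-tokenized documents; returns one score per document.
function bm25Scores(
  queryTerms: string[],
  docs: string[][],
  k1: number = 1.2,
  b: number = 0.75
): number[] {
  const docCount = docs.length;
  const avgLength = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docCount, 1);

  // Document frequency: how many documents contain each distinct query term.
  const docFreq = new Map<string, number>();
  queryTerms.forEach(term => {
    if (!docFreq.has(term)) {
      docFreq.set(term, docs.filter(d => d.indexOf(term) !== -1).length);
    }
  });

  return docs.map(doc => {
    let score = 0;
    docFreq.forEach((df, term) => {
      const tf = doc.filter(t => t === term).length;
      if (tf === 0) return;
      // Rare terms get a higher inverse document frequency.
      const idf = Math.log(1 + (docCount - df + 0.5) / (df + 0.5));
      // Term frequency saturates via k1; b controls length normalization.
      const lengthNorm = 1 - b + b * (doc.length / avgLength);
      score += idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    });
    return score;
  });
}

const exampleScores = bm25Scores(
  ['cat'],
  [['cat', 'sat'], ['dog', 'ran'], ['cat', 'cat', 'nap']]
);
```

Computing document frequencies once per query, rather than once per note, keeps the scan linear in the size of the collection.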
Semantic Search (Cosine Similarity):
async function semanticSearch(query: string, embeddings: Embedding[]): Promise<ScoredNote[]> {
  const queryEmbedding = await embeddingService.generateEmbedding(query);
  return embeddings.map(item => {
    const similarity = cosineSimilarity(queryEmbedding, item.embedding);
    return { note: item.note, score: similarity };
  }).sort((a, b) => b.score - a.score);
}
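`cosineSimilarity` is used above without a definition. A straightforward version, written here as an assumption about what the project's `vectorMath.ts` might contain, looks like this:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions differ');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as having no similarity to anything.
  return denominator === 0 ? 0 : dot / denominator;
}

const same = cosineSimilarity(new Float32Array([1, 2, 3]), new Float32Array([2, 4, 6]));
const orthogonal = cosineSimilarity(new Float32Array([1, 0]), new Float32Array([0, 1]));
```

Because the measure ignores vector length, `[1, 2, 3]` and `[2, 4, 6]` count as identical in direction, which is the property that makes it suitable for comparing embeddings.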
Hybrid Fusion:
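function fuseResults(keywordResults: ScoredNote[], semanticResults: ScoredNote[], alpha: number = 0.5, topK: number = 10): Note[] {
  const scores = new Map<string, number>();
  const notesById = new Map<string, Note>();
  // Combine rank-based scores: weighting by 1 / (rank + 1) avoids having to
  // normalize BM25 and cosine scores to a common scale
  keywordResults.forEach((result, rank) => {
    notesById.set(result.note.id, result.note);
    scores.set(result.note.id, (1 - alpha) * (1 / (rank + 1)));
  });
  semanticResults.forEach((result, rank) => {
    notesById.set(result.note.id, result.note);
    const existing = scores.get(result.note.id) || 0;
    scores.set(result.note.id, existing + alpha * (1 / (rank + 1)));
  });
  // Return top-K results (the default of 10 is an illustrative choice)
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => notesById.get(id)!);
}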
The fusion approach provides robust retrieval across different query types.
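To see how rank positions and `alpha` interact, the same weighting can be applied to bare ID lists. The `fuseRankedIds` function and the note IDs are hypothetical, chosen only to make the arithmetic easy to follow.

```typescript
// Weighted reciprocal-rank fusion over two ranked ID lists (best first).
function fuseRankedIds(
  keywordIds: string[],
  semanticIds: string[],
  alpha: number = 0.5
): [string, number][] {
  const scores = new Map<string, number>();
  keywordIds.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + (1 - alpha) / (rank + 1));
  });
  semanticIds.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + alpha / (rank + 1));
  });
  return Array.from(scores.entries()).sort((x, y) => y[1] - x[1]);
}

// Keyword ranking: a, b, c. Semantic ranking: b, c, a. With alpha = 0.5:
//   b: 0.5/2 + 0.5/1 = 0.75
//   a: 0.5/1 + 0.5/3 ≈ 0.667
//   c: 0.5/3 + 0.5/2 ≈ 0.417
const fused = fuseRankedIds(['a', 'b', 'c'], ['b', 'c', 'a']);
```

Note `b` wins because it ranks well in both lists, even though `a` tops the keyword list. Setting `alpha` to 0 reproduces the keyword order, and setting it to 1 reproduces the semantic order.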
RAG Context Building
The context aggregator combines retrieved notes into a coherent prompt:
function buildContext(query: string, retrievedNotes: Note[], maxTokens: number = 512): string {
  let context = `Based on your notes, here are relevant passages for: "${query}"\n`;
  let tokenCount = 0;
  for (const note of retrievedNotes) {
    const noteTokens = countTokens(note.content);
    // Stop once the next note would exceed the token budget
    if (tokenCount + noteTokens > maxTokens) break;
    context += `\n[${note.title}]\n${note.content}\n`;
    tokenCount += noteTokens;
  }
  return context;
}
This ensures the summarization stage receives relevant information within model constraints.
Local Summarization
Summary generation uses extractive techniques suitable for mobile devices:
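function generateSummary(context: string, maxLength: number = 200): string {
  const sentences = splitIntoSentences(context);
  const scored = sentences.map((sentence, index) => ({
    text: sentence,
    index,
    score: scoreImportance(sentence, context)
  }));
  const topSentences = scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    // Restore original order so the summary reads naturally
    .sort((a, b) => a.index - b.index)
    .map(s => s.text);
  // Enforce maxLength (treated here as a character limit)
  return topSentences.join(' ').slice(0, maxLength);
}
function scoreImportance(sentence: string, context: string): number {
  // TF-IDF-style importance scoring: terms that are frequent in the sentence
  // but rare in the overall context score higher
  const terms = tokenize(sentence);
  const contextTerms = tokenize(context);
  return terms.reduce((score, term) => {
    const tf = countOccurrences(term, terms) / terms.length;
    // Guard against a zero count so the ratio cannot become Infinity
    const contextCount = Math.max(1, countOccurrences(term, contextTerms));
    const idf = Math.log(contextTerms.length / contextCount);
    return score + (tf * idf);
  }, 0);
}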
For production use, this could be replaced with an on-device LLM using a framework such as llama.cpp or MLC-LLM.
Performance Characteristics
Embedding Generation Latency
Single note embedding generation:
Average latency: 120ms
├── Tokenization: 15ms
├── ONNX inference: 95ms
└── Normalization: 10ms
Batch embedding (10 notes):
Total latency: 450ms
├── Batch tokenization: 80ms
├── Batched inference: 320ms
└── Batch normalization: 50ms
Batching provides a ~2.7x throughput improvement for bulk operations (450ms for 10 notes, versus roughly 1,200ms when embedding them one at a time).
Search Performance
Real-time search latency:
Hybrid search: 85ms
├── Keyword search (BM25): 35ms
├── Semantic search (cosine): 40ms
└── Result fusion: 10ms
Performance remains consistent up to 5,000 notes on mid-range Android devices.
Memory Footprint
Application memory usage:
- ONNX Runtime: ~80-120MB (inference engine)
- Model weights: ~25MB (quantized MiniLM model)
- SQLite database: ~2KB per note + 1.5KB per embedding
- React Native overhead: ~50-80MB baseline
Total memory consumption remains under 250MB for typical usage patterns.
Storage Efficiency
Database size scaling:
- 1,000 notes: ~3.5MB (text + embeddings)
- 5,000 notes: ~17.5MB
- 10,000 notes: ~35MB
Storage remains manageable even for extensive personal knowledge bases.
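These figures follow from the per-note estimates in the memory footprint section. The sketch below reproduces the arithmetic, assuming embeddings are stored as 384 float32 values (4 bytes each) and that note text averages about 2KB. The constant and function names are illustrative.

```typescript
const EMBEDDING_DIM = 384;
const BYTES_PER_FLOAT32 = 4;
const NOTE_TEXT_KB = 2; // assumed average text size per note

// Estimated database size in KB for a given number of notes.
function estimateDatabaseKB(noteCount: number): number {
  const embeddingKB = (EMBEDDING_DIM * BYTES_PER_FLOAT32) / 1024; // 1,536 bytes = 1.5KB
  return noteCount * (NOTE_TEXT_KB + embeddingKB);
}

const sizeFor1k = estimateDatabaseKB(1000);   // 3,500KB, about 3.5MB
const sizeFor10k = estimateDatabaseKB(10000); // 35,000KB, about 35MB
```

Growth is linear in the number of notes. Index and metadata overhead are not included, so real databases will be somewhat larger.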
What Makes This Approach Different
Most mobile AI applications follow a client-server architecture:
- User data sent to cloud embedding services
- LLM inference performed on remote servers
- Results streamed back over network
- Requires authentication and API keys
- Incurs per-request costs
This project demonstrates an alternative philosophy:
- Privacy by architecture: Data never leaves the device
- Zero operational costs: No API fees or subscription requirements
- Offline capability: Works without network connectivity
- Instant response: No network latency overhead
- User control: Complete ownership of data and models
The approach recognizes that for personal knowledge management, privacy and ownership matter more than access to the largest possible models.
Use Cases and Applications
This architecture is particularly well-suited for:
Personal Knowledge Management
Organizing personal notes, journal entries, research papers, and learning materials with intelligent semantic search. Users maintain complete privacy over sensitive personal information.
Medical and Healthcare Notes
Doctors, therapists, and healthcare professionals managing patient notes where data privacy is legally mandated. On-device processing keeps patient data off third-party servers, which simplifies HIPAA compliance, although it does not guarantee compliance on its own.
Legal Documentation
Lawyers managing case notes, contracts, and legal research where client confidentiality is paramount. Semantic search helps quickly find relevant precedents and references.
Field Research
Researchers working in locations with limited connectivity (field studies, archaeological sites, remote locations) who need intelligent note organization without cloud access.
Education and Learning
Students building personal wikis and study notes with semantic connections between concepts. Helps identify knowledge gaps and related topics for deeper learning.
Corporate Knowledge Management
Employees managing work notes in regulated industries where data must remain on corporate devices. Enables productivity without violating security policies.
Design Trade-Offs
| Decision | Benefit | Trade-Off |
|---|---|---|
| On-device embeddings | Complete privacy, no network latency | Limited to smaller models |
| ONNX format | Cross-platform compatibility | Some optimization limitations |
| SQLite storage | Lightweight, reliable | Requires brute-force vector search |
| Hybrid search | Robust across query types | Higher computational cost |
| Extractive summarization | Fast, deterministic | Less fluent than generative |
| Single model architecture | Simple deployment | Can’t easily switch models |
| CPU-only inference | Universal device support | Slower than GPU acceleration |
The key insight is optimizing for privacy, offline capability, and operational simplicity over access to state-of-the-art model performance.
Technical Stack
Core Mobile Framework
React Native: Cross-platform mobile development framework enabling a shared codebase across Android and iOS. Provides near-native performance with a JavaScript development experience.
TypeScript: Type-safe development reducing runtime errors and improving code maintainability. Essential for complex state management in mobile applications.
ML Inference
ONNX Runtime React Native: High-performance inference engine with bindings for React Native. Supports CPU execution on ARM processors with optimized kernels.
Sentence Transformers (ONNX): Pre-trained transformer models exported to ONNX format. Provides semantic embeddings optimized for similarity search.
Data Layer
SQLite (react-native-sqlite-storage): Embedded relational database for structured storage. Provides ACID transactions and efficient querying on mobile devices.
AsyncStorage: Simple key-value storage for application settings and preferences.
UI Components
React Navigation: Navigation library for screen transitions and routing in React Native applications.
React Native Vector Icons: Icon library providing consistent visual elements across platforms.
Project Structure
The repository follows standard React Native conventions:
KnowledgeVault/
├── src/
│   ├── components/          # Reusable UI components
│   │   ├── NoteCard.tsx
│   │   ├── SearchBar.tsx
│   │   └── RAGDemo.tsx
│   ├── screens/             # Application screens
│   │   ├── HomeScreen.tsx
│   │   ├── NoteDetailScreen.tsx
│   │   └── SearchScreen.tsx
│   ├── services/            # Business logic
│   │   ├── EmbeddingService.ts
│   │   ├── SearchService.ts
│   │   └── RAGService.ts
│   ├── db/                  # Database layer
│   │   ├── schema.ts
│   │   └── queries.ts
│   ├── utils/               # Helper functions
│   │   ├── tokenizer.ts
│   │   ├── vectorMath.ts
│   │   └── types.ts
│   └── embedding/           # ONNX integration
│       ├── modelLoader.ts
│       └── inference.ts
├── assets/
│   └── models/
│       └── model.onnx       # Embedded transformer model
├── android/                 # Android native code
├── ios/                     # iOS native code (future)
├── App.tsx                  # Root component
├── package.json
└── README.md
The structure separates UI concerns from business logic, making the codebase maintainable and testable.
Current Implementation
The application currently demonstrates:
- Note creation with automatic embedding generation
- Real-time semantic search with hybrid ranking
- RAG pipeline demonstration showing all stages
- Performance metrics for each operation
- SQLite-based persistence
- Fully offline operation
The implementation prioritizes educational clarity while remaining production-ready for personal use.
Future Enhancements
Potential extensions for enhanced functionality:
- On-device LLM integration: Replace extractive summarization with generative models using llama.cpp
- iOS support: Port Android implementation to iOS using same codebase
- Model quantization: Reduce model size through INT8 quantization
- GPU acceleration: Leverage mobile GPUs for faster inference
- Vector indexing: Implement approximate nearest neighbor search (FAISS, HNSW)
- Multi-model support: Allow users to select different embedding models
- Cross-device sync: Optional encrypted synchronization between devices
- Export capabilities: Backup notes and embeddings to external storage
- Rich text support: Enhanced formatting beyond plain text
- Attachment handling: Process PDFs and images for semantic search
Target Audience
This project is designed for:
- Mobile developers exploring on-device AI implementations
- Privacy-conscious users seeking intelligent note-taking without cloud services
- ML engineers deploying models to resource-constrained environments
- Students learning practical RAG system implementation
- Enterprise developers building privacy-compliant mobile applications
- Portfolio reviewers evaluating systems architecture and AI integration skills
The project balances practical utility with technical demonstration of modern mobile AI capabilities.
Repository
Full implementation and APK releases available at:
https://github.com/JashT14/KnowledgeVault
Pre-built Android APK available in Releases section.
License
MIT License—free to study, modify, and extend for commercial or personal use.
Final Thoughts
The most powerful AI system is one that respects user privacy while delivering tangible value.
This project demonstrates that sophisticated AI capabilities - semantic search, retrieval augmentation, and intelligent summarization - can run entirely on mobile devices without compromising user privacy or requiring operational infrastructure.
As edge computing and on-device AI continue to evolve, architectures like this become increasingly relevant. The future of mobile AI isn’t necessarily about connecting to the largest cloud models, but about intelligently deploying appropriately sized models directly where users need them.
This repository represents a practical approach to mobile AI - one that recognizes privacy as a feature, offline capability as an advantage, and simplicity as a strength. The principles demonstrated here remain valuable regardless of which specific models or frameworks dominate the current landscape.
Sometimes the best AI solution isn’t the one with access to the biggest models, but the one that keeps your data where it belongs: under your control.