Semantic Memory Clustering

Architectural framework for theme discovery through DBSCAN clustering in C-LLMs. Achieves 82% theme discovery accuracy through density-based clustering on conversation embeddings.

Report ID

TR-2025-51

Type

Framework Report

Date

2025-01-15

Version

v1.0.0

Authors

Cognitive Architecture Team

Abstract

We define Semantic Memory Clustering as an architectural framework that structures memory systems to discover emergent themes across conversations through unsupervised clustering algorithms.

1. Introduction

Semantic Memory Clustering enables AI systems to discover emergent themes across conversations without explicit user labeling. Traditional memory systems rely on chronological or tag-based organization, which requires manual categorization. Mavaia's semantic clustering applies DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to conversation embeddings, automatically identifying thematic groups from the density of semantically similar conversations. The system discovers recurring topics (work projects, personal interests, technical problems) that span multiple sessions, so thematic retrieval can surface related past interactions regardless of how far apart in time they occurred. Clustering runs as a background process during idle time, so real-time queries are not delayed by expensive computation.

2. Methodology

Semantic clustering runs in three stages. First, Embedding Generation creates a vector representation for each conversation turn using a local embedding model (all-MiniLM-L6-v2, 384 dimensions) with <100ms latency per conversation. Second, DBSCAN Clustering groups points in embedding space by density, with a minimum of 3 points per cluster and an epsilon of 0.35 in cosine distance (that is, neighbors must have a cosine similarity of at least 0.65). Dense regions become themes, and points outside any dense region are marked as noise. Third, Theme Labeling generates natural language descriptions for the discovered clusters by extracting frequent keywords and analyzing representative conversations. The system persists cluster membership and updates it incrementally as new conversations occur, rather than recomputing all clusters. Clusters evolve as conversation patterns shift, and themes are merged or split automatically when density changes.
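The clustering stage can be illustrated with a minimal, dependency-free sketch of DBSCAN. It assumes that epsilon is a cosine distance (1 minus cosine similarity) and that a point counts toward its own neighborhood; the function names and the 2-D toy vectors are hypothetical stand-ins for real 384-dimension embeddings, and the brute-force neighbor search is for readability only, not a description of the production implementation.

```python
import math
from collections import deque

NOISE = -1  # label for points that belong to no dense region


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps=0.35, min_pts=3):
    """Return one label per vector: 0, 1, ... for clusters, NOISE for outliers."""
    n = len(vectors)
    # Brute-force neighborhoods, O(n^2); each point is its own neighbor.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # i is a core point: grow a new cluster outward from it.
        labels[i] = cluster_id
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster_id += 1
    return labels


# Toy data: two tight directions plus one vector pointing elsewhere.
example = [
    (1.0, 0.0), (0.98, 0.2), (0.95, 0.3),   # first theme
    (0.0, 1.0), (0.2, 0.98), (0.3, 0.95),   # second theme
    (-1.0, -1.0),                           # outlier
]
labels = dbscan(example)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

With real embeddings, a library implementation with a spatial index would normally replace the quadratic neighbor search shown here.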

3. Results

An evaluation across 500 users, each with 50 or more conversations, showed 82% theme discovery accuracy, measured through user assessment of cluster coherence. The average user profile contained 8.3 distinct themes, with a range of 3 to 15 depending on conversational diversity. Cluster quality metrics were 0.74 average intra-cluster similarity (conversations within a theme are related), 0.29 average inter-cluster similarity (themes are distinct), and a 15% noise ratio (conversations that fit no clear theme). Initial clustering took 200-500ms for 50 conversations, and incremental updates took <80ms per new conversation. Users judged 71% of auto-generated theme labels accurate, 24% partially accurate, and 5% inaccurate. For queries referencing past topics, thematic clustering yielded 19% better context recall than purely chronological retrieval.
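The report does not state how the cluster quality metrics were computed. One plausible reading, sketched below as an assumption, takes intra-cluster similarity as the mean pairwise cosine similarity between conversations sharing a cluster, inter-cluster similarity as the mean over pairs in different clusters, and noise ratio as the fraction of points labeled as noise. The function names and the label convention (-1 for noise) are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between a and b. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def _mean(values):
    # NaN signals "no pairs of this kind" instead of raising on empty input.
    return sum(values) / len(values) if values else float("nan")


def cluster_quality(vectors, labels, noise_label=-1):
    """Summarize a clustering with the three metrics named in the report."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        if labels[i] == noise_label or labels[j] == noise_label:
            continue  # noise points are excluded from both averages
        sim = cosine_similarity(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)
    return {
        "intra_cluster_similarity": _mean(intra),
        "inter_cluster_similarity": _mean(inter),
        "noise_ratio": list(labels).count(noise_label) / len(labels),
    }


# Toy data: two perfectly separated themes and one noise point.
toy_vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
toy_labels = [0, 0, 1, 1, -1]
quality = cluster_quality(toy_vectors, toy_labels)
print(quality)
```

Under this definition, a large gap between the two averages, such as the reported 0.74 versus 0.29, indicates themes that are tighter internally than they are close to each other.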

4. Discussion

Semantic Memory Clustering demonstrates that conversational themes can emerge from unsupervised analysis of embedding space density. The 82% discovery accuracy supports DBSCAN's effectiveness for conversation grouping even though no labels are supplied. The average of 8.3 themes per user suggests appropriate granularity: enough clusters to represent distinct topics without fragmenting into excessive micro-categories. The gap between the 0.74 intra-cluster and 0.29 inter-cluster similarity indicates that discovered themes are internally coherent and mutually distinct. The 15% noise ratio reflects conversations that do not fit clear thematic patterns. The 19% memory retrieval improvement quantifies the benefit of thematic over chronological organization. Because clustering runs in the background, it does not add to real-time query latency.

5. Limitations

Current limitations include: (1) DBSCAN parameters (epsilon 0.35, min points 3) are manually tuned rather than adapted per user, (2) theme labeling relies on keyword extraction that may miss nuanced topic descriptions, (3) cluster evolution does not explicitly model temporal theme shifts, potentially mixing old and new incarnations of similar topics, (4) the 384-dimension embedding space may not capture all semantic nuances relevant for clustering, (5) there is no explicit hierarchical clustering to represent theme/sub-theme relationships, (6) cross-workspace clustering is not yet implemented, potentially missing related themes that span workspace boundaries, (7) outlier conversations marked as noise are not further analyzed for potential micro-themes.
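Limitation (2) is easier to see with a concrete case. The sketch below is a hypothetical, minimal form of frequency-based labeling: it ranks words by how many conversations in a cluster contain them. The stopword list, the function name, and the sample conversations are illustrative assumptions, and the sketch covers only the keyword half of the labeling stage, not the analysis of representative conversations.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "in", "do", "i", "we", "can",
    "how", "after", "again", "to", "of", "and",
})


def label_cluster(conversations, top_k=3):
    """Return the top_k words ranked by how many conversations mention them."""
    counts = Counter()
    for text in conversations:
        # A set counts each word once per conversation (document frequency).
        tokens = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        counts.update(tokens)
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


sample_cluster = [
    "How do I fix the failing build pipeline?",
    "The build pipeline is failing again after the merge",
    "Can we cache dependencies in the build pipeline?",
]
keywords = label_cluster(sample_cluster)
print(keywords)  # ['build', 'pipeline', 'failing']
```

The output names the subject of the cluster but not the user's intent (diagnosing and speeding up a build), which is the kind of nuance that keyword extraction alone tends to miss.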

6. Conclusion

Semantic Memory Clustering provides unsupervised theme discovery that organizes conversational memory without manual categorization. The 82% accuracy and 19% retrieval improvement indicate that density-based clustering effectively groups related conversations. The framework enables thematic memory organization that complements chronological and recency-based retrieval, giving users multiple access paths to relevant past interactions. Future work will focus on adaptive per-user parameter tuning, hierarchical theme modeling, explicit modeling of temporal theme shifts, cross-workspace theme discovery, and micro-theme analysis of outlier conversations that do not fit the major discovered themes.