Microsoft’s Memora aims to fix AI agent memory with up to 98% less token usage

AI agents are increasingly expected to remember conversations, preferences, and decisions over weeks or months, rather than just within a single chat session. To address the fragmentation and inefficiency of current memory systems, Microsoft Research has developed Memora, a new architecture designed to provide scalable and reliable long-term recall.

According to Microsoft, Memora addresses common memory bottlenecks by decoupling what an AI remembers from how it retrieves that information. This approach reportedly reduces context token usage by up to 98% while matching or exceeding the accuracy of full-context inference.

The problem with current AI memory

Modern large language models (LLMs) are powerful reasoners, but they typically start every session from scratch. When conversations grow long, models must repeatedly re-read entire histories, which is inefficient. New information is often stored as raw text or compressed into summaries, leading to lost details or fragmented data.

Existing solutions tend toward one of two extremes. Content-fragmentation systems like RAG (Retrieval-Augmented Generation) and Mem0 extract atomic facts or text fragments. While this preserves detail, it creates brittle, isolated entries that lose narrative coherence. Conversely, coarse-abstraction systems compress experiences into compact summaries, stripping away the specific constraints, edge cases, and numeric details that make memory useful.

Graph-based systems like Zep and GraphRAG add structure but still rely on the content itself for retrieval and often require rigid ontologies that don’t generalize well across different domains.

How Memora works

Memora addresses these limitations by separating storage from retrieval. Each memory entry consists of two components:

Primary Abstraction: A short phrase (6–8 words) that captures the fundamental topic of the memory.
Memory Value: The rich, detailed content itself.

This separation allows new information about an evolving topic to be merged into an existing entry under the same primary abstraction, preventing fragmentation into partial duplicates. Additionally, cue anchors—short, context-aware tags extracted from the memory value—provide alternative access paths, functioning as flexible, organically generated metadata.
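To make the entry structure concrete, here is a minimal Python sketch. It is not Memora's code: the names `MemoryEntry`, `MemoryStore`, and `add` are hypothetical, and merging uses exact matching on a normalized primary abstraction. That is a simplifying assumption; a real system would presumably need a fuzzier, model-driven notion of "same topic".

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    # Hypothetical shape. Short topic phrase (the article cites 6-8 words),
    # used here as the stable key for merging.
    primary_abstraction: str
    # The rich, detailed content; new facts on the same topic are appended.
    value: list[str] = field(default_factory=list)
    # Short tags drawn from the value that offer extra access paths.
    cue_anchors: set[str] = field(default_factory=set)


class MemoryStore:
    """Hypothetical in-memory store keyed by primary abstraction."""

    def __init__(self) -> None:
        self.entries: dict[str, MemoryEntry] = {}
        # Maps each cue anchor to the keys of entries that carry it.
        self.anchor_index: dict[str, set[str]] = {}

    def add(self, abstraction: str, detail: str, anchors: set[str]) -> MemoryEntry:
        # Assumption: "same topic" means the same normalized phrase.
        key = abstraction.strip().lower()
        entry = self.entries.get(key)
        if entry is None:
            entry = MemoryEntry(primary_abstraction=abstraction)
            self.entries[key] = entry
        # Merge into the existing entry instead of creating a partial duplicate.
        entry.value.append(detail)
        entry.cue_anchors |= anchors
        for anchor in anchors:
            self.anchor_index.setdefault(anchor, set()).add(key)
        return entry


store = MemoryStore()
store.add("User's upcoming trip to Lisbon in June",
          "Flights booked for June 3.", {"lisbon", "flights"})
store.add("User's upcoming trip to Lisbon in June",
          "Prefers a hotel near Alfama.", {"lisbon", "hotel"})
print(len(store.entries))  # 1
```

Both facts land in a single entry whose anchors are `lisbon`, `flights`, and `hotel`, so a later update to the trip extends one memory rather than adding a second, partial one.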

Instead of returning a fixed number of semantically similar items, Memora uses a policy-guided retriever. This system iteratively refines its query, expands through cue anchors to surface related memories, and decides autonomously when to stop searching.
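The retrieval loop can be sketched in the same spirit. The toy below assumes a hand-written policy: it matches query words against cue anchors, expands the query with the anchors of each hit, and stops when a round surfaces nothing new or a round limit is reached. The article does not describe how Memora's policy refines queries or decides to stop, so every name and rule here (`Entry`, `policy_retrieve`, `max_rounds`, word overlap as a stand-in for semantic matching) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    abstraction: str
    value: str
    anchors: frozenset[str]


def words(text: str) -> set[str]:
    # Crude tokenizer standing in for a semantic matcher.
    return {w.strip(".,?!").lower() for w in text.split()}


def policy_retrieve(query: str, memory: list[Entry], max_rounds: int = 3) -> list[Entry]:
    """Hypothetical policy-guided loop; a toy, not Memora's retriever."""
    terms = words(query)
    found: dict[str, Entry] = {}
    for _ in range(max_rounds):
        # Collect unseen entries whose cue anchors overlap the current terms.
        new = [e for e in memory
               if e.abstraction not in found and terms & e.anchors]
        # Hand-written stopping rule: quit once a round adds nothing.
        if not new:
            break
        for entry in new:
            found[entry.abstraction] = entry
            # Refine the query by expanding through this entry's anchors.
            terms |= entry.anchors
    return list(found.values())


memory = [
    Entry("User's trip to Lisbon in June", "Flights booked for June 3.",
          frozenset({"lisbon", "flights"})),
    Entry("User's hotel preferences when travelling",
          "Prefers a quiet hotel near Alfama in Lisbon.",
          frozenset({"hotel", "lisbon"})),
    Entry("Weekly team meeting schedule", "Standup every Monday at 9:00.",
          frozenset({"meetings", "work"})),
]

results = policy_retrieve("When do my flights leave?", memory)
for entry in results:
    print(entry.abstraction)
```

The question mentions only flights, yet the hotel memory is also returned because it shares the `lisbon` anchor with the first hit. The meeting schedule shares no anchor, so the loop stops without surfacing it.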

Benchmarking results

Microsoft evaluated Memora on two long-context benchmarks: LoCoMo, which features dialogues averaging 600 turns, and LongMemEval, which uses 115,000-token contexts. Memora achieved:

86.3% LLM-judge accuracy on LoCoMo.
87.4% accuracy on LongMemEval.

In both tests, Memora outperformed RAG, Mem0, Nemori, Zep, LangMem, and even full-context inference. It also stored roughly half as many memory entries per conversation as Mem0 (344 versus 651), while reducing token consumption by up to 98% compared with full-context inference.
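The storage comparison is easy to check, and the same arithmetic gives a sense of scale for the token claim. The entry counts come from the article; the token calculation is purely illustrative, applying the "up to 98%" figure to a LongMemEval-sized context, and is not a result Microsoft reported.

```python
# Entry counts as reported in the article.
memora_entries = 344
mem0_entries = 651

ratio = memora_entries / mem0_entries
print(f"Memora stores {ratio:.1%} as many entries as Mem0")  # 52.8%

# Illustrative only: an up-to-98% cut applied to a context of the size
# LongMemEval uses (about 115,000 tokens). Not a reported measurement.
full_context_tokens = 115_000
max_reduction = 0.98
remaining = full_context_tokens * (1 - max_reduction)
print(f"About {remaining:,.0f} tokens would remain")  # About 2,300 tokens would remain
```

At the best-case reduction, a prompt that would otherwise carry the whole 115,000-token history would shrink to a few thousand tokens.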

What this means for you

For everyday Windows users, this development signals a shift toward more capable AI assistants that can maintain context over long periods without requiring massive computational resources. However, enterprise experts caution that lower token consumption does not automatically translate to lower infrastructure costs.

Sanchit Vir Gogia, chief analyst at Greyhound Research, noted that real costs include memory construction, indexing, storage, and audit logging. He also pointed out that Memora’s strongest retrieval mode is its slowest, taking between five and six seconds per query compared to under a second for simpler semantic modes. This means the “memory crunch” moves from prompt length to retrieval latency and extra inference steps.

Currently, Memora is an active Microsoft Research project with code available on GitHub. While developers can experiment with the architecture, IT leaders are advised to view it as a promising architectural model rather than production-ready software. Organizations will still need robust governance policies to ensure AI memories are managed securely and remain auditable under regulations like the EU’s AI Act.

Source: Computerworld

Over to you: Do you think AI assistants need better long-term memory, or is forgetting context actually a privacy benefit?