- 2026.01.11 🔥 We released the RealMem paper on arXiv.
- 2026.01.11 🔥 We open-sourced RealMem, a robust multi-agent framework designed to simulate realistic user-assistant interactions with sophisticated memory management.
The ultimate goal of dialogue systems is to maintain long-term consistency and memory across multiple sessions, mimicking human-like interaction.
🚀 Our Solution: To address this gap, we introduce RealMem. Our framework employs a multi-agent architecture in which specialized agents (User, Assistant, Evaluator, Memory Manager) collaborate to generate coherent, multi-session dialogues. By strictly controlling the User's temporal perception and the Assistant's memory retrieval, RealMem produces high-quality datasets for training and evaluating long-context LLMs.
*Overview of the RealMem Framework.*
RealMem operates as a modular pipeline, transforming high-level project outlines into granular, multi-turn dialogues. It consists of four core components: a User Agent that simulates user behavior with strict temporal constraints, an Assistant Agent that provides professional responses, a Goal Evaluator that assesses task completion in real-time, and a Memory Manager that handles the extraction and deduplication of structured memory points.
- 🤖 Multi-Agent Architecture: Collaborative agents simulate authentic interactions.
- 🧠 Intelligent Memory Management: Automated extraction, storage, and deduplication of memory points.
- 🎯 Long-Term Task Orientation: Dialogues are driven by explicit goals with automatic success evaluation.
- ⏰ Temporal Logic Control: Strict enforcement of time constraints to prevent information leakage from future events.
- 🔄 Context Continuity: Maintains logical consistency across multiple sessions via memory retrieval.
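To make the deduplication idea concrete, here is a minimal, illustrative sketch of a memory store that skips duplicate memory points via content fingerprinting. This is our own toy example, not RealMem's actual Memory Manager; the `MemoryStore` class and its whitespace/case normalization rule are assumptions for illustration.

```python
import hashlib


class MemoryStore:
    """Toy memory store: keeps structured memory points, skipping duplicates."""

    def __init__(self):
        self._seen = set()
        self.points = []

    @staticmethod
    def _fingerprint(content: str) -> str:
        # Normalize whitespace and case so trivially rephrased duplicates collide
        normalized = " ".join(content.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add(self, point: dict) -> bool:
        """Add a memory point; return False if it duplicates an earlier one."""
        fp = self._fingerprint(point["content"])
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self.points.append(point)
        return True


store = MemoryStore()
store.add({"type": "Dynamic", "content": "User prefers morning sessions."})
store.add({"type": "Dynamic", "content": "user prefers  morning sessions."})  # duplicate
print(len(store.points))  # 1
```

A production deduplicator would likely also need to catch semantic near-duplicates (e.g., via embedding similarity), which exact hashing alone cannot.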
The dataset consists of multiple JSON files located in `dataset/datasets/`, each corresponding to a distinct user persona (e.g., `Adeleke_Okonjo_dialogues_256k.json`). These files contain the full multi-session interaction history.
Within each file, the structure is organized as follows:
- `_metadata`: Contains global information including `person_name`, `total_sessions`, and `total_tokens`.
- `dialogues`: A list of dialogue sessions. Each session object contains the following fields:
  - `session_identifier`: The unique identifier for the session (e.g., `Knowledge_Learning_1:S1_01`).
  - `session_uuid`: The UUID for the session.
  - `current_time`: The simulated date and time of the session.
  - `extracted_memory`: A list of structured memory points extracted from the session. Each item contains:
    - `index`: Memory index (e.g., `Travel_Planning_2-DM-S1_01-01`).
    - `type`: Memory type (e.g., `Dynamic`).
    - `content`: The textual content of the memory.
    - `source_turn`: The turn index from which this memory was extracted.
    - `source_content_snapshot`: A snapshot of the source content.
    - `source_role_snapshot`: The role of the speaker in the source snapshot.
    - `session_uuid`: The UUID of the session in which the memory was created.
  - `dialogue_turns`: A list of dialogue turns. Each turn is a dictionary with the following fields:
    - `speaker`: The role of the speaker (`User` or `Assistant`).
    - `content`: The text content of the message.
    - `is_query`: `true` if the turn represents a memory retrieval query, `false` otherwise.
    - `query_id`: The unique ID for the query (if `is_query` is true).
    - `memory_used`: The memory points retrieved and used by the assistant to generate this specific response (a list of objects containing `session_uuid` and `content`).
    - `memory_session_uuids`: A list of session UUIDs corresponding to the memories used.
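The schema above can be walked with a few lines of Python. The snippet below builds a toy record mirroring the documented fields (the values are invented for illustration; real files live under `dataset/datasets/`) and collects the retrieval queries:

```python
import json

# A toy record mirroring the documented schema (values invented for illustration)
raw = json.dumps({
    "_metadata": {"person_name": "Adeleke Okonjo", "total_sessions": 1, "total_tokens": 1234},
    "dialogues": [{
        "session_identifier": "Knowledge_Learning_1:S1_01",
        "session_uuid": "uuid-1",
        "current_time": "2026-01-11 10:00",
        "extracted_memory": [{
            "index": "Travel_Planning_2-DM-S1_01-01",
            "type": "Dynamic",
            "content": "User prefers morning sessions.",
            "source_turn": 2,
            "source_content_snapshot": "...",
            "source_role_snapshot": "User",
            "session_uuid": "uuid-1",
        }],
        "dialogue_turns": [{
            "speaker": "User",
            "content": "Remind me of my scheduling preference.",
            "is_query": True,
            "query_id": "q1",
            "memory_used": [{"session_uuid": "uuid-1",
                             "content": "User prefers morning sessions."}],
            "memory_session_uuids": ["uuid-1"],
        }],
    }],
})

data = json.loads(raw)
# Collect every turn flagged as a memory retrieval query
queries = [turn
           for session in data["dialogues"]
           for turn in session["dialogue_turns"]
           if turn.get("is_query")]
print(len(queries))  # 1
print(queries[0]["memory_used"][0]["session_uuid"])  # uuid-1
```

To run the same traversal over real data, replace the toy record with `json.load()` on an actual persona file.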
```
RealMem/
├── dataset/                                # 💾 Generated Dialogues (e.g., Lin_Wanyu_dialogues_256k.json)
│   └── all_persona_topic/                  # Persona & Topic Definitions
├── pipeline/                               # 🚀 Core Processing Pipeline
│   ├── base_processor.py                   # Base Interface
│   ├── project_outline_processor.py        # Project Blueprint Generation
│   ├── event_processor.py                  # Event Sequence Generation
│   ├── summary_processor.py                # Session Summary Generation
│   └── multi_agent_dialogue_processor.py   # Multi-Agent Core
├── utils/                                  # 🔧 Utility Toolkit
│   ├── llm_client.py                       # LLM Client (w/ Retry)
│   ├── error_handler.py                    # Error Handling & JSON Parsing
│   ├── data_validator.py                   # Data Validation
│   ├── dialogue_validator.py               # Dialogue Logic Verification
│   └── dialogue_postprocessor.py           # Post-processing & Cleaning
├── eval/                                   # 📊 Evaluation Metrics
│   ├── run_generation.py                   # Evaluation Generation Runner
│   ├── compute_auto_metrics_for_realmem.py # Automated Metrics
│   └── compute_llm_metrics_for_realmem.py  # LLM-based Metrics
├── prompts/                                # 📝 Prompt Templates
│   ├── project_outline.txt                 # Project Outline Prompt
│   ├── event.txt                           # Event Generation Prompt
│   ├── summary.txt                         # Session Summary Prompt
│   └── refine.txt                          # Dialogue Refinement Prompt
├── figs/                                   # 🖼️ Figures & Assets
├── main.py                                 # 🚀 Main Entry Point
└── requirements.txt                        # 📦 Dependencies
```
First, clone the repository and create a suitable environment:

```shell
# Install dependencies
pip install -r requirements.txt
```

Then, configure your environment variables:

```shell
# Copy example configuration
cp .env.example .env
# ⚠️ Edit .env to add your API Keys (e.g., OpenAI API Key)
```

Ensure the following base data files exist (for Persona and Topic generation):

- `dataset/all_persona_topic/person&goal.json`
- `dataset/all_persona_topic/persona_all.json`
- **Standard Generation (Recommended)**
To start the full pipeline generation using the main Python entry point:
```shell
python main.py --names "Lin Wanyu" --smart-recovery
```

🔧 Options:

- `--names <names>`: (Recommended) Specify the target persona name. See `dataset/all_persona_topic/persona_all.json` for available names (e.g., "Ethan Hunt", "Sarah Miller", "Kenta Tanaka"). Default: process all.
- `--projects <num>`: Number of projects (dialogue topics) to generate per person. Default: 3.
- `--max-turns <num>`: Maximum number of turns per dialogue session. Default: 24.
- `--output <dir>`: Output directory path. Default: `output`.
- `--smart-recovery`: Enable smart interrupt recovery (resume from the previous state). Default: false.
- `--log`: Enable verbose logging for debugging. Default: false.
🤖 Model Configuration:

- `--blueprint-model <model>`: Model for generating project outlines.
- `--event-model <model>`: Model for generating event sequences.
- `--summary-model <model>`: Model for generating session summaries.
- `--dialogue-model <model>`: Model for generating the actual dialogue.
- `--memory-model <model>`: Model for memory extraction.
RealMem provides a comprehensive evaluation suite in the `eval/` directory.
The evaluation pipeline follows a strict temporal sequence, processing dialogues session by session. We iterate through the sessions to update the memory state. When a query is detected within a session, we trigger retrieval and generation based on the history accumulated from previous sessions:
```python
for session in dialogue_sessions:
    # 1. Evaluate queries within the session
    for i, turn in enumerate(session['turns']):
        if turn.get('is_query', False):
            question = turn.get('content', '')
            # Generate keywords & retrieve context (from all historical sessions)
            keywords = self.generate_query_llm(question)
            memories = self.retrieve_memories(question, keywords, k=10)
            # Generate an answer from the retrieved memories
            generated_answer = self.generate_answer(question, memories)
    # 2. Update memory with session content (for future sessions)
    self.memory_system.add_session_content(session)
```

Generate responses using the retrieved memory context to simulate the model's ability to utilize long-term information. This step produces the model outputs that will be evaluated in the next phase.
```shell
python eval/run_generation.py \
    --process_retrieval_results \
    --retrieval_result_dir eval/retrieval_result
```

🔧 Options:

- `--process_retrieval_results`: Enable batch processing mode to iterate through retrieval results in a directory.
- `--retrieval_result_dir <dir>`: Directory containing retrieval results (default: `eval/retrieval_result`).
- `--model_name <model>`: Model used for generation (default: `gpt-4o-mini`).
- `--top_k <num>`: Number of top retrieved memories to use (default: 5).
We support both automated metrics (Recall, NDCG) and LLM-based qualitative metrics.
Automated Metrics:

```shell
python eval/compute_auto_metrics_for_realmem.py \
    --process_retrieval_results \
    --retrieval_result_dir eval/retrieval_result \
    --input_data_dir dataset/
```

🔧 Options:

- `--process_retrieval_results`: Enable batch processing mode.
- `--retrieval_result_dir <dir>`: Directory containing retrieval results (default: `eval/retrieval_result`).
- `--input_data_dir <dir>`: Directory containing ground truth dialogue files (default: `datasets`).
- `--in_file <file>`: Input file path for single-file evaluation.
- `--dialogues_file <file>`: Ground truth file path for single-file evaluation.
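As a reference for what the automated metrics measure, here is a minimal, illustrative computation of Recall@k and NDCG@k over binary relevance labels. This is our own sketch, not the repository's implementation; the function names and the binary-relevance assumption are ours.

```python
import math


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant memory points found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for rid in retrieved_ids[:k] if rid in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """NDCG@k with binary relevance: gain 1 for a relevant item, 0 otherwise."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, rid in enumerate(retrieved_ids[:k]) if rid in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Example: two of three gold memories appear in the top 5
retrieved = ["m1", "m7", "m2", "m9", "m4"]
gold = {"m1", "m2", "m3"}
print(recall_at_k(retrieved, gold, 5))  # 2/3
```

Unlike recall, NDCG also rewards ranking the relevant memories near the top, which matters when the assistant only reads the first few retrieved items.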
LLM-based Metrics:
```shell
python eval/compute_llm_metrics_for_realmem.py \
    --process_retrieval_results \
    --retrieval_result_dir eval/retrieval_result \
    --input_data_dir dataset
```

🔧 Options:

- `--process_retrieval_results`: Enable batch processing mode.
- `--retrieval_result_dir <dir>`: Directory containing generation results (default: `eval/retrieval_result`).
- `--input_data_dir <dir>`: Directory containing ground truth dialogue files (default: `datasets`).
- `--model_name <model>`: Model used as the evaluator/judge (default: `gpt-4o`).
*Performance Comparison of Different Memory Methods on RealMem.*
We welcome community contributions! Please feel free to open issues or submit pull requests.
- Fork the repository.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
If you use RealMem in your research, please cite our work:
```bibtex
@misc{bian2026realmembenchmarkingllmsrealworld,
      title={RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction},
      author={Haonan Bian and Zhiyuan Yao and Sen Hu and Zishan Xu and Shaolei Zhang and Yifu Guo and Ziliang Yang and Xueran Han and Huacan Wang and Ronghao Chen},
      year={2026},
      eprint={2601.06966},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.06966},
}
```

⭐ If RealMem helps you, please give us a star!