Recently, I started working on a multi-persona chat application as a hobby project. You can find it here, along with complete documentation on how to set it up and what it does. The interactions between characters and their environment are generated by an LLM, and throughout the development process, I learned several valuable lessons that I want to share in this post.
From prompt engineering to model selection and API usage, I ran into various challenges and had to refine my approach multiple times. Some things worked as expected, others didn’t, and a few surprised me entirely. This blog outlines the key takeaways—what worked, what didn’t, and what I would do differently next time.
Prompting for code editing
Separation of Code and Prompts
When the code and prompts live in the same file and you update that file using, for example, ChatGPT or GitHub Copilot, the LLM will likely also make (usually unwanted) changes to the prompts. If you do not notice these changes, they can alter your application's behavior. To avoid this, keep the prompts clearly separated from the code, preferably in separate files.
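A minimal sketch of that separation, assuming the prompts live as plain text files in their own directory (the paths, file names, and placeholder fields are illustrative):

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")   # prompts live in their own directory, away from the code

def load_prompt(name: str) -> str:
    """Read a prompt template from its own file, so code edits never touch the prompt text."""
    return (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")

# e.g. prompts/generate_dialogue.txt containing "...{character}..." (illustrative)
dialogue_prompt = load_prompt("generate_dialogue").format(character="Alice")
```

With this in place, a code-editing request only ever includes the Python files, and the prompt files stay out of the LLM's reach.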
High-Quality Inquiries
I found that when making changes to code, it usually works better to one-shot the change instead of letting the LLM generate an answer and then fine-tuning it with follow-up questions. The better models typically have quotas that limit their use, so asking fewer but higher-quality questions makes more efficient use of that quota.
Incremental Changes
Usually when editing code with an LLM, I want to make several changes at once and avoid making too many separate calls. I used to have prompts like the following, which worked fine on small codebases:
Update the following code and return the complete updated functions which need to be changed to achieve the requested functionality. no pseudocode, no suggestions for improvements, no partial implementations but complete working robust updated functions. Files or functions which do not need to be changed do not need to be mentioned.
INSTRUCTIONS
CHANGE 1: do this and that to the backend
CHANGE 2: fix this bug
CHANGE 3: update the frontend
Below is my complete codebase:
CODE
code...
When the codebase grew larger, however, making many changes at once started to cause issues. It was hard to debug, and the model usually did not manage to make all the changes in one go. This led me to supply a single change request at a time, which gave better results.
Model selection
Many of the things relating to model choice and usage have been discussed here previously. Some new models have become available in the meantime. I've mainly used o1 and o3-mini for this application. I do not have paid subscriptions for Claude, Gemini, or DeepSeek, and my experience with their free versions was worse. Claude and DeepSeek had difficulty making solid code changes in large codebases (they did fine with small ones), and Gemini did not have enough output tokens to be useful (at the time of writing, it was limited to 8K output tokens in Google AI Studio). Reasoning models did better than non-reasoning models for coding. This is personal experience rather than a rigorous benchmark.
Splitting the model's role during development into an architect and an editor, and choosing specific models for each task, can yield even better results (see here). I did not go that far yet, but I can imagine this being beneficial for larger-scale projects.
Prompting for the multi-persona chat application
Focused context
At first I sent a single request containing all the information required to generate an interaction. The inputs were a lot of context: previous thoughts, previous visible actions from other characters, and previous summaries. As output I wanted plan changes, appearance changes, dialogue, actions, and the reasons for every change. The prompt itself was very long and complicated, with many instructions for the interactions: prioritize responding to the latest message, align actions with milestones in order to reach the goal, when to leave fields empty, and keep everything within the characters' behavioral patterns (which are of course described in great detail to steer behavior).
The results were not good. There was little progress in the interactions and the story. The characters tended to forget who they were when responding and would mix up responses, and the replies were often weak across the various output fields.
I decided to split my flow functionally. I limited the interaction itself to just action and dialogue, and handled location, appearance, and plan changes separately: first create or update the steps and the goal, then generate the action/dialogue based on the plan and the context, and finally determine whether the action/dialogue leads to appearance or location changes.
This greatly reduced the context I needed for each request to the LLM. It also simplified my prompts considerably; they became much more focused. The drawback is more LLM requests, with similar context being sent multiple times.
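In outline, the split looks roughly like this; all the names are illustrative placeholders for the actual focused LLM calls, not the real implementation:

```python
# Placeholder stubs for the three focused LLM calls; each wraps its own short prompt.
def update_plan(character, context):                   # step 1: create or update goal + next steps
    return {"goal": "...", "next_step": "..."}

def generate_interaction(character, plan, context):    # step 2: action and dialogue only
    return {"action": "...", "dialogue": "..."}

def derive_state_changes(character, interaction):      # step 3: appearance/location changes, if any
    return {"appearance": None, "location": None}

def advance_character(character, context):
    """The split flow: three small, focused requests instead of one huge one."""
    plan = update_plan(character, context)
    interaction = generate_interaction(character, plan, context)
    changes = derive_state_changes(character, interaction)
    return plan, interaction, changes
```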
Systematic Prompt Structuring
What you put in the system prompt versus the user prompt matters. System prompts should be mostly static and guide the general behavior of the model; user prompts are usually dynamic and specific. Some models, like DeepSeek R1, do not support a separate system prompt; in that case you should concatenate the system prompt and user prompt and provide them as a single prompt. A certain amount of redundancy helps guide the model's answers, but more redundancy means the other content gets less attention because it becomes a smaller part of the overall context.
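As a sketch, the message construction can branch on whether the model accepts a separate system role; the role/content dict format shown is the common chat style, and whether a given model needs the fallback is something to check per provider:

```python
def build_messages(system_prompt: str, user_prompt: str, supports_system_role: bool) -> list[dict]:
    """Build chat messages, falling back to a single concatenated prompt for models
    (such as DeepSeek R1) that do not take a separate system prompt."""
    if supports_system_role:
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    # Concatenate and send everything as one user message.
    return [{"role": "user", "content": f"{system_prompt}\n\n{user_prompt}"}]
```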
Prompts can be model specific: they can work well on one model and not at all on another, for many reasons. Clear, unambiguous prompts are the most likely to work across models.
Structured Outputs
I would say using structured output in code is essential, and it is not difficult to implement (see for example here and here). It allows you to supply an output JSON schema which the model output must conform to, and most recent models support this. It works a lot better than asking the model to produce JSON and parsing the response afterwards. It still helps to provide a sample output JSON in the request, to keep the output JSON simple, and of course to implement retries when fields that should not be empty come back empty, or when relations between output fields can be checked. When retrying, I supply the previous request, the previous response, and an instruction describing what is wrong so the LLM can fix it. I use Pydantic models for structured output; there are several libraries available which allow more elaborate output checking and make the retry cycle easier.
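A minimal sketch of the Pydantic approach, using the OpenAI Python client's parse helper; the exact call differs per provider, and the model name and schema fields here are illustrative:

```python
from pydantic import BaseModel
from openai import OpenAI

class Interaction(BaseModel):
    # Illustrative fields; the real schema mirrors whatever the step needs.
    action: str
    dialogue: str
    reason: str

client = OpenAI()
prompt = "Generate Alice's next action and dialogue, with a short reason."  # illustrative

# The parse helper validates the response against the Pydantic model,
# so the result either matches the schema or raises an error you can retry on.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",                       # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    response_format=Interaction,
)
interaction = completion.choices[0].message.parsed
```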
API Selection
When choosing an API, there are generally two options:
- Chat API – Designed for multi-turn conversations, incorporating roles like “user” and “assistant.”
- Generate API – Optimized for single-turn requests, requiring all necessary context upfront.
Initially, I assumed the Chat API would be a good fit for my multi-persona chat application. However, I quickly realized its limitations—it works well when the use case aligns perfectly, but lacks flexibility for more complex requirements.
The Generate API, on the other hand, presents a different challenge: since it processes a single request at a time, you must provide a fully constructed context for the model to make accurate decisions. This can become a lot of data, so my application summarizes messages after a certain threshold and even condenses those summaries when they accumulate.
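A rough sketch of that rolling summarization; the thresholds and the summarize placeholder are assumptions, not the actual implementation:

```python
MAX_MESSAGES = 20    # summarize once the raw history grows past this (illustrative value)
MAX_SUMMARIES = 5    # condense summaries once there are this many (illustrative value)

def summarize(texts: list[str]) -> str:
    """Placeholder for the LLM call that condenses a list of texts into one summary."""
    return " ".join(texts)[:200]

def compact_history(messages: list[str], summaries: list[str]) -> tuple[list[str], list[str]]:
    """Fold old messages into summaries, and old summaries into one condensed summary,
    so the context sent with each generate request stays manageable."""
    if len(messages) > MAX_MESSAGES:
        keep = MAX_MESSAGES // 2
        summaries = summaries + [summarize(messages[:-keep])]
        messages = messages[-keep:]
    if len(summaries) > MAX_SUMMARIES:
        summaries = [summarize(summaries)]
    return messages, summaries
```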
Additionally, each character in my chat system has unique thoughts, emotions, and plans that influence their perspective. These internal states aren’t visible to others, leading different characters to generate distinct summaries of the same events. This approach ensures more nuanced and context-aware interactions, but requires careful prompt structuring and context management.
Reasoning
Reasoning models are becoming more popular: OpenAI o1 and o3, DeepSeek R1, Gemini 2 Thinking. Are they suitable for a use case like a chat application? In my experience, no. I had already implemented reasoning capabilities myself in several ways:
- Characters make plans, and the next step in their plan plus their goal serve as context for creating interactions
- In several prompts I ask the model to explain why it does certain things, which helps guide it towards more structured output
Reasoning models are not really something new or special. What they do is output so-called 'think tokens' (also called reasoning tokens or planning tokens) first in the response. A model creates its output based on the input plus whatever it has already produced; in the case of reasoning models, the reasoning comes first and is therefore used when generating whatever comes after it. This has several consequences:
- The reasoning can be looked at and this can be used to debug/improve your request. Great for coding.
- The usable output context is a lot smaller. The thinking tokens are usually not directly usable in the application (a small sketch for stripping them follows this list).
- Usable output is usually more concise and less verbose than with non-reasoning models. When generating a story, I prefer more prose and more verbosity.
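If you do run a reasoning model whose raw output interleaves the think tokens with the answer, they have to be stripped before use. A small sketch, assuming the reasoning is wrapped in <think>...</think> tags as DeepSeek R1 does when served as plain text; other models and APIs expose reasoning differently:

```python
import re

def strip_think_tokens(raw_output: str) -> tuple[str, str]:
    """Split a reasoning model's raw output into the reasoning and the usable answer.
    Assumes the reasoning is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return reasoning, answer
```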
I noticed that for this use case it helps to use a model which is good at following complex instructions and does not reason, since the reasoning is generated in separate requests and supplied as context. For me, those models were, at the time of writing (this can change quickly), Llama 3.3 70B and Mixtral 8x22B. Mixtral is faster and Llama 3.3 is smarter.
Ensuring Text Diversity with Dual Similarity Measures
I’ve used a two-way approach to keep the generated interactions—from a character’s actions to their dialogue—engaging and diverse. To avoid the pitfall of repetitive or overly similar outputs, I rely on two complementary methods:
Embedding and Cosine Similarity
An embedding model transforms text into high-dimensional vectors that encapsulate its semantic meaning. By computing the cosine similarity—the measure of the angle between two vectors—we can gauge how semantically close two pieces of text are. If, for example, a generated dialogue is nearly identical in meaning to the corresponding action or to recent messages from the same character, the cosine similarity score will be high (close to 1), triggering a flag for revision.
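A minimal sketch of this check, assuming a sentence-transformers embedding model (the model name is just an example):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Embed both texts and return the cosine similarity of their vectors (1.0 = semantically identical)."""
    vec_a, vec_b = embedder.encode([text_a, text_b])
    return float(cos_sim(vec_a, vec_b))
```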
Jaccard Similarity
Complementing the embedding approach, I also use a token-based Jaccard similarity metric. This method compares the sets of words in the texts by calculating the ratio of their intersection over their union. Essentially, it provides a surface-level check that highlights when texts share many of the same words or phrases—even if their overall meanings differ somewhat. This extra layer helps catch literal overlaps that the embedding might overlook.
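The Jaccard check itself is a few lines of plain Python:

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-level overlap: size of the shared word set over the size of the combined word set."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```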
Combined Evaluation
Both similarity measures are applied to each candidate text against previous messages and across different fields (like comparing a character’s action with their dialogue). If either the cosine or the Jaccard similarity exceeds a predefined threshold (typically around 0.8), the system deems the output too repetitive and automatically prompts the model to regenerate the text with refined instructions. This dual strategy ensures that every interaction remains fresh and varied, steering clear of excessive repetition or verbatim copying from earlier dialogue.
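Putting the two helpers above together, the check reads roughly as follows; the 0.8 threshold matches the one mentioned, while the regeneration helper in the usage comment is hypothetical:

```python
SIMILARITY_THRESHOLD = 0.8

def is_too_similar(candidate: str, references: list[str]) -> bool:
    """Flag the candidate if either measure exceeds the threshold against any reference text."""
    return any(
        cosine_similarity(candidate, ref) > SIMILARITY_THRESHOLD
        or jaccard_similarity(candidate, ref) > SIMILARITY_THRESHOLD
        for ref in references
    )

# Usage: regenerate until the output is sufficiently different from recent messages.
# while is_too_similar(candidate, recent_messages):
#     candidate = regenerate_with_feedback(candidate)   # hypothetical retry helper
```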
Finally
Developing the AutoChatManager application has been an insightful journey, reinforcing several key lessons in prompt engineering and application design.
Prompting for Code Editing
- Separation of Code and Prompts: Keeping code and prompts in separate files prevented unintended modifications during updates, ensuring application stability.
- High-Quality Inquiries: Formulating concise, well-thought-out questions led to more effective interactions with language models, optimizing resource usage.
- Incremental Changes: Implementing one change at a time facilitated easier debugging and maintained code integrity.
- Model Selection: Choosing the appropriate model is crucial. Reasoning models are great for coding!
Prompting for the Multi-Persona Chat Application
- Focused Context: Limiting the context provided to the language model improved the relevance and coherence of generated interactions.
- Systematic Prompt Structuring: Clearly defining system and user prompts enhanced the model’s understanding and response accuracy.
- Structured Outputs: Utilizing structured output formats ensured consistency and facilitated easier parsing of model responses.
- API Selection: Choosing the appropriate API, such as the Generate API over the Chat API, provided greater control over context management and interaction flow.
- Reasoning Model Use: Instead of relying on reasoning models to generate interactions, reasoning is handled separately in structured steps before being passed to the model. Models like Llama 3.3 70B and Mixtral 8x22B performed better for multi-character interactions than pure reasoning models like o1, o3, or DeepSeek R1.
Ensuring Diversity in Responses
I use a two-way approach to ensure every character interaction stays engaging and diverse. First, an embedding model computes cosine similarity to capture the semantic closeness of texts, flagging any dialogue that mirrors actions or recent messages too closely. Then, a token-based Jaccard similarity check catches surface-level word overlaps. By applying both methods and regenerating outputs when either score exceeds a threshold (typically around 0.8), the system effectively prevents repetition and keeps the conversation dynamic.
Building a Dynamic, Context-Aware Chat System
The above strategies were instrumental in creating a dynamic, context-aware conversational application. By meticulously structuring prompts and outputs, and making informed design choices, the AutoChatManager delivers coherent and engaging multi-persona interactions. I hope these insights are beneficial to others embarking on similar projects. For a closer look at the application, you can find it here.