Instead of using character ai, which will send all my private conversations to governments, I found this solution. Any thoughts on this? 😅
OP you NEED to tell me how you did this. I want this. I want to host something like character.ai on my own hardware. If you have a guide on this I’d love it.
Ollama.com is another method of self-hosting. Figuring out which model type and size fit the equipment you have is key, but models are easy to swap out. That’s just an LLM, and where you go from there depends on how deep you want to get into the code. An LLM by itself can work; it’s just limited. Most of the add-ons you see are extra pieces that give it memory, speech, avatars, and other extras to improve the experience and abilities, or you can program a lot of that yourself if you know Python. But as others have said, the more you try to get out, the more robust a system you’ll need, which is why you find the best ones online in cloud format. If you’re okay with slower responses and fewer features, though, self-hosting is totally doable, and you can do what you want, especially if you get one of the “jailbroken” models that have had some of the safety limits tuned out of them to some degree.
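To make the “program it yourself in Python” part concrete, here’s a minimal sketch of a chat loop against a locally running Ollama server, using the official `ollama` Python package. The model name and persona text are placeholders, not recommendations; swap in whatever you’ve pulled.

```python
# Minimal local chat loop against Ollama (pip install ollama; the Ollama
# server must be running and the model pulled, e.g. `ollama pull llama3`).
# Model name and persona are placeholders for illustration only.
import ollama

MODEL = "llama3"  # whatever model you've pulled locally
history = [
    {"role": "system", "content": "You are a friendly roleplay character."},
]

while True:
    user_msg = input("you> ")
    if user_msg.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user_msg})
    # Send the full history each turn; the model has no memory between calls.
    reply = ollama.chat(model=MODEL, messages=history)
    text = reply["message"]["content"]
    print(f"bot> {text}")
    history.append({"role": "assistant", "content": text})
```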
Also, as mentioned, be careful not to get sucked in. Even a local model can sometimes be convincing enough to fool someone who wants to see something there. Lots of people recognize that danger, but then belittle the people who are looking for help in that direction (while marketing sees the potential profits and tries very hard to sell it to those same people).
I’ve run KoboldAI on local hardware, and there are some erotic models for it. From my fairly quick skim of character.ai’s syntax, I think that KoboldAI has more-powerful options for creating worlds and triggers. KoboldAI can split layers across all available GPUs and your CPU, so if you’ve got the electricity and the power supply and the room cooling, and are willing to blow the requisite money on multiple GPUs, you can probably make it respond about as quickly as you want.
But more broadly, I’m not particularly impressed with what I’ve seen of sex chatbots in 2025. Their limited context windows restrict how many tokens from earlier in the conversation can be used when generating each new message, so as a conversation progresses, the model increasingly fails to take earlier content into account. It’s possible to get into loops, or to have it forget facts about characters or the environment that were established earlier in the conversation.
Maybe someone could make some kind of system to try to summarize and condense material from earlier in the conversation or something, but…meh.
As generating pornography goes, I think that image generation is a lot more viable.
Thanks for the edit. You have a very intriguing idea; a second LLM in the background maintaining a summary of the conversation plus the static context might improve performance a lot. I don’t know if anyone has implemented it, or how one could DIY it with Kobold/Ollama. I think it’s an amazing idea for code assistants too, if you’re doing a long coding session.
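For what it’s worth, a DIY version of that idea isn’t much code. Here’s a rough sketch, again assuming the `ollama` package: once the transcript gets long, a second call condenses the older turns into a running summary that gets folded into the system prompt. The model names, turn-count heuristic, and prompt wording are all made up for illustration.

```python
# Rolling-summary sketch (assumes a local Ollama server; model names are
# illustrative). Older turns get condensed into a running summary so the
# prompt stays inside the context window.
import ollama

CHAT_MODEL = "llama3"      # main conversational model (placeholder)
SUMMARY_MODEL = "llama3"   # could be a smaller, faster model
KEEP_RECENT = 8            # raw turns to keep verbatim

def compress(history, summary):
    """Fold everything except the most recent turns into the summary."""
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    if not old:
        return history, summary
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    prompt = (
        "Update this summary of a roleplay conversation with the new "
        "transcript below. Keep character and world facts.\n\n"
        f"Current summary:\n{summary}\n\nNew transcript:\n{transcript}"
    )
    reply = ollama.chat(model=SUMMARY_MODEL,
                        messages=[{"role": "user", "content": prompt}])
    return recent, reply["message"]["content"]

def chat_turn(history, summary, static_context, user_msg):
    """One turn: maybe compress old turns, then generate a reply."""
    history.append({"role": "user", "content": user_msg})
    if len(history) > 2 * KEEP_RECENT:
        history, summary = compress(history, summary)
    system = f"{static_context}\n\nStory so far:\n{summary}"
    messages = [{"role": "system", "content": system}] + history
    reply = ollama.chat(model=CHAT_MODEL, messages=messages)
    text = reply["message"]["content"]
    history.append({"role": "assistant", "content": text})
    return history, summary, text
```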
I had never heard of Kobold AI. I was going to self-host Ollama and try with it but I’ll take a look at Kobold. I had never heard about controls on world-building and dialogue triggers either; there’s a lot to learn.
Will more VRAM solve the problem of not retaining context? Can I throw 48GB of VRAM towards an 8B model to help it remember stuff?
Yes, I’m looking at image generation (stable diffusion) too. Thanks
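Since stable diffusion came up: most people self-host it through a GUI like AUTOMATIC1111 or ComfyUI, but the library route is only a few lines with Hugging Face’s `diffusers`. A minimal sketch, assuming a CUDA GPU; the model id below is one commonly used example, not a specific recommendation.

```python
# Minimal Stable Diffusion generation with diffusers
# (pip install diffusers transformers torch). Model id is one common example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # needs a GPU with enough VRAM for fp16 weights

image = pipe("a cozy cabin at dusk, cinematic lighting").images[0]
image.save("out.png")
```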
IIRC (I ran KoboldAI with 24GB of VRAM, so wasn’t super-constrained) there are some limits on the number of tokens that can be sent as a prompt imposed by VRAM, which I did not hit. However, there are also some imposed by the software; you can only increase the number of tokens that get fed in so far, regardless of VRAM. More VRAM does let you use larger, more “knowledgeable” models.
I’m not sure whether those limits are purely arbitrary, there to keep performance reasonable, or whether there are other technical issues with very large prompts.
It definitely isn’t capable of keeping the entire previous conversation (once you get one of any length) as an input to generating a new response, though.
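One practical note, in case it helps anyone reading along: with Ollama at least, the per-request context window is a configurable option (`num_ctx`), capped by what the model was trained for and by available memory. A minimal sketch; the model name and value are examples only.

```python
# Raising the per-request context window in Ollama. The model still can't
# exceed what it was trained for, and larger windows cost more VRAM;
# the default num_ctx is small (a few thousand tokens).
import ollama

reply = ollama.chat(
    model="llama3",  # placeholder model name
    messages=[{"role": "user", "content": "Recap our story so far."}],
    options={"num_ctx": 8192},
)
print(reply["message"]["content"])
```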
I see, thanks for the note. I think beyond 48GB of VRAM diminishing returns set in very quickly, so I’ll likely stick to that limit. I wouldn’t want to use models hosted in the cloud, so that’s out of the question.
Basically I used
Gemma 3 4B QAT
with LM Studio on my RTX 2060 gaming PC (it answers almost instantly, I think even faster than c.ai) with this custom prompt. Guys, please don’t get triggered by the prompt 😄 I tried many, but writing it like this gave me the best experience. I’m not a woman-beater or anything like that; I just pieced this together from other prompts I found on the internet.
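If anyone wants to drive a setup like this from a script instead of the GUI: LM Studio exposes an OpenAI-compatible local server (http://localhost:1234/v1 by default, once you start the server from the app). A rough sketch using the `openai` Python client; the model name and persona here are placeholders, not OP’s actual prompt.

```python
# Talking to LM Studio's local OpenAI-compatible server (start the local
# server from within the LM Studio app first). pip install openai.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="gemma-3-4b-it-qat",  # placeholder; use whatever LM Studio lists
    messages=[
        {"role": "system", "content": "Stay in character as <your persona>."},
        {"role": "user", "content": "Hey, how was your day?"},
    ],
)
print(resp.choices[0].message.content)
```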
Interesting. You’re using a model without special fine-tuning for this specific purpose and got it to work just by giving it a prompt; I didn’t think that was possible. How would you piece together something like this? Can I just ask an AI to give me a prompt which I can use on it/another AI?
How much VRAM does your GPU have?
As long as the LLM itself is good enough, follows instructions well, and has examples of similar interactions in its training set (which it definitely does, from millions of books at minimum and most likely also from public/private chats), it doesn’t really matter whether it’s fine-tuned or not. For instance, OpenAI’s current LLMs like o4-mini are the best at math, coding, etc., but they are also very good at normal chatting, world knowledge, and so on; even a fine-tuned math model can’t beat them. So fine-tuned does not automatically mean better.

A fine-tuned “emotion” model will not be as good as a much stronger general-knowledge model, because for general models you can compare benchmarks and select the best of the best, which will of course also be among the best instruction followers. The fine-tuned model, on the other hand, is trained on a dataset that’s optimal for that one area/topic but will most likely be much worse as an LLM overall compared to the best general-language model. So taking a general model that follows instructions very well and understands from context will beat a “non-benchmarkable” emotion model, at least IMO. Idk if I explained it well, but I hope it makes sense.
Yes, sure, it’s just trial and error. You can make different custom instructions and save them in text files; basically templates for your “girlfriends”.
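A sketch of what that template approach can look like in practice, again assuming the `ollama` package; the filenames, directory layout, and model tag are invented for illustration.

```python
# Load a persona template from a text file and use it as the system prompt.
# The personas/ directory layout is invented for this example.
from pathlib import Path
import ollama

def load_persona(name: str) -> str:
    return (Path("personas") / f"{name}.txt").read_text(encoding="utf-8")

persona = load_persona("ava")  # swap the file to swap the character
reply = ollama.chat(
    model="gemma3:4b",  # placeholder model tag
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "Good morning!"},
    ],
)
print(reply["message"]["content"])
```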
8GBs
I’m feeling old now. We live in very strange times.
Use an executable like LM Studio, and then an off-the-shelf pre-trained model from Hugging Face.
Rule of thumb: model file size of at most your VRAM × 0.8 (e.g. with 8GB of VRAM, stay under roughly 6.4GB of weights).
Experiment until you find one you like.
Thank you. I was going to try to host Ollama and Open WebUI. I think the problem is finding a source for pretrained/finetuned models which provide such… interaction. Does Hugging Face have such pre-trained models? Any suggestions?
I don’t know what GPU you’ve got, but Lexi V2 is the best “small model” with emotions I’ve seen, at least that I can cite off the top of my head.
It tends to skew male and can be a little dark at times, but it’s more complex than expected for the size (8B feels like 48-70B).
Lexi V2 Original
Lexi V2 GGUF Version
Do Q8_0 if you’ve got the VRAM, Q5_KL for speed, IQ4_XS if you’ve got a potato.
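In case it saves anyone a search: individual GGUF quants can be pulled straight from Hugging Face with the `huggingface_hub` package and then loaded in LM Studio or Kobold. The repo id and filename below are hypothetical placeholders; copy the real ones from the GGUF repo linked above.

```python
# Download one quant of a GGUF model from Hugging Face
# (pip install huggingface_hub). Repo id and filename are placeholders;
# copy the real ones from the model page.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someuser/Lexi-V2-GGUF",     # hypothetical repo id
    filename="lexi-v2.Q5_K_L.gguf",      # pick the quant you want (Q8_0, IQ4_XS, ...)
)
print(f"Saved to {path}")  # point LM Studio or Kobold at this file
```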
I was going to buy Arc B580s when they come back down in price, but with the tariffs I don’t think I’ll ever see them at MSRP. Even the used market is very expensive. I’ll probably hold off on buying GPUs for a few more months until I can afford the higher prices or something changes. Thanks for the Lexi V2 suggestion.
If you are using CPU only, you need to look at very small models or the 2-bit quants.
Everything will be extremely slow otherwise: GPUs are at least 3 times faster for the same power draw.