Inspiration

I stumbled across a Donald Trump tweet generator, which used Markov chains to generate tweets in the style of (then presidential candidate) Donald Trump. The project page has a really good explanation of Markov chains so I won't repeat it here. My idea was simple: what if I could use the same principle to generate a message for anyone in my Discord server? And thus, Trutho: Impersonation was born.

The idea for the Trutho name came from one of my friends online. It was originally used for an earlier silly project we had which, if I recall correctly, was themed around some sort of propaganda battle, or something ridiculous like that. Hence the name Truth-o. I used the name for a few other projects including the one I am writing about today.

Difficulties I had with implementation

I needed to rely on the Discord API for retrieving messages before I could generate the 'impersonations'. But in doing so, I encountered a major issue: I would not be able to retrieve all the messages sent by a user in one go. Trying to do so would involve numerous requests, and the bot would almost certainly get rate limited, so impersonations would take far too long to generate. My solution? Whenever a user sent a message to a Discord server (Discord refers to them as guilds in the API so I will use that term from now on), I would log that message into a database. As I will discuss later, I should not have done this. But now, whenever I wanted to generate an impersonation, I would just have to query the database to get all the messages previously sent by a user, which I would then use to generate the impersonation. This was also useful for moderation purposes: if a problematic user were to send messages and then delete them, I still had a copy.

I could not find a library in C# that did Markov chains (that I could understand) so I just programmed this manually. Looking back on the function I made, the code quality is not great. The function itself is way too big, and it should have been split into several smaller functions.

In order to have a Markov chain, you must start with one component (what I called the 'origin word'). The challenge was: how would I select this origin word? To do this, I gathered together all the words a user had said, and then imposed a criterion: the origin word had to account for at least 10% of all the words the user had used. The program would pick a random word the user had used, and then check it against the criterion. But if no single word accounted for 10% of what the user had said, this would of course result in an infinite loop. To remedy this, after every 100 attempts in which none of the candidate words met the criterion, the threshold is lowered by one percentage point. On reflection, it probably would've been better to just filter for all the words that met the criterion and, if the resulting list was empty, lower the threshold then.
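
The filter-based alternative mentioned above could look something like the following. This is only a sketch under assumptions: the class and method names (`OriginWordFilter`, `FindCandidates`, `PickOriginWord`) are made up for illustration and are not part of the original bot, and it skips probable mentions in the same way the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not taken from the original project.
public static class OriginWordFilter
{
    // Returns every distinct word whose share of all words is at least `threshold`
    // (a fraction, so 0.10 means 10%). If no word qualifies, the threshold is
    // lowered by one percentage point and the filter is applied again.
    public static List<string> FindCandidates(IEnumerable<string> messages, double threshold = 0.10)
    {
        List<string> words = messages
            .SelectMany(m => m.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .Where(w => !w.Contains('@')) // skip words that are probably mentions
            .ToList();
        if (words.Count == 0)
            return new List<string>();

        // Count each word once instead of rescanning the whole list per attempt
        Dictionary<string, int> counts = words
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

        while (true)
        {
            List<string> candidates = counts
                .Where(kv => (double)kv.Value / words.Count >= threshold)
                .Select(kv => kv.Key)
                .ToList();
            if (candidates.Count > 0)
                return candidates;
            // Every word has a share above zero, so this loop always ends
            threshold -= 0.01;
        }
    }

    // Picks one candidate at random, or returns null when there are no usable words.
    public static string PickOriginWord(IEnumerable<string> messages, Random random)
    {
        List<string> candidates = FindCandidates(messages);
        return candidates.Count == 0 ? null : candidates[random.Next(candidates.Count)];
    }
}
```

One difference worth noting: this picks uniformly among the words that pass the filter, whereas drawing a random word from the full word list favours the words used most often.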

As you might expect, when I first built the program, impersonations were normally quite quick to generate as there were not many messages in the database to generate them from. However, this changed as the database grew, and eventually generation was taking too long. In some instances, the time between a user requesting an impersonation and it being generated and sent to the chat was so long that the connection to Discord just timed out. My solution was to generate impersonations ahead of time, before they were requested. For each user, I would keep a bank of impersonations, and my program would periodically check how many had been generated for that user, and top the bank up where necessary. This meant that even if a user requested several impersonations in a short amount of time, the bot would still be able to respond (although it would return an error if the bank had run out, telling the user to try again later). This seemed to solve the problem quite nicely, though looking back on it, I probably could've made a good few optimisations to the code. I had not really learnt how to optimise algorithms when I originally wrote this program, so I didn't have the knowledge to apply that I do now.
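
To make the bank idea concrete, here is a rough sketch of how it could be structured. It is not the original implementation: `ImpersonationBank`, `TopUp`, `TryTake` and the target size of 5 are all assumptions for illustration, and the generator is passed in as a delegate so the sketch does not depend on any particular Discord library or database.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of a per-user bank of pre-generated impersonations.
public class ImpersonationBank
{
    private readonly ConcurrentDictionary<ulong, ConcurrentQueue<string>> banks =
        new ConcurrentDictionary<ulong, ConcurrentQueue<string>>();
    private readonly Func<ulong, string> generate;
    private readonly int targetSize;

    // `generate` produces one impersonation for a user ID; `targetSize` is how
    // many impersonations each bank should hold after a top-up (assumed value).
    public ImpersonationBank(Func<ulong, string> generate, int targetSize = 5)
    {
        this.generate = generate;
        this.targetSize = targetSize;
    }

    // Meant to be called periodically (for example from a timer) so the slow
    // generation work happens before anyone asks for a result.
    public void TopUp(ulong userId)
    {
        ConcurrentQueue<string> bank = banks.GetOrAdd(userId, _ => new ConcurrentQueue<string>());
        while (bank.Count < targetSize)
            bank.Enqueue(generate(userId));
    }

    // Called when an impersonation is requested. Returns false when the bank is
    // empty so the caller can tell the user to try again later.
    public bool TryTake(ulong userId, out string impersonation)
    {
        impersonation = null;
        return banks.TryGetValue(userId, out ConcurrentQueue<string> bank)
            && bank.TryDequeue(out impersonation);
    }
}
```

With this shape, a request only ever dequeues a string, so the response time no longer depends on how many messages are stored for the user.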

The library that I was using for interacting with Discord was quite notorious for breaking a lot. Sometimes, my bot would crash randomly, and I assumed that was because of the library I was using. While this was sometimes correct (others had reported issues with the library so it was not just me), I actually later discovered that some impersonations were running into an infinite loop, and this was causing the bot to lock up, and be unable to respond to any more requests. This happened very frequently, and required me to manually intervene to restart the bot, and therefore it gained a reputation for frequently breaking. I did eventually work out what the problem was, and after fixing it the bot was then so much more stable.

Storing data

As mentioned earlier, I had to store messages that users sent in a database. I thought I wouldn't have to follow data protection legislation because the bot was operating on a public server, so I didn't think I was storing private data. On reflection, I do not think I was correct. I did create the ability for users to opt out of the system. When a user had opted out, their messages were not logged to the database, and it was not possible for another user to request an impersonation of that user. Again, I made a mistake here: this really should have been opt in, not opt out. My concern was that if I made it opt in, not as many people would use it, but looking back, concerns about data privacy should have taken precedence.
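
For comparison, an opt-in version of the logging could be sketched as below. Everything here is hypothetical: the `ConsentedMessageLog` class, its in-memory storage, and the choice to delete stored messages when a user opts out again are assumptions made for the sake of the example, not a description of what the bot did.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory sketch of opt-in message logging.
public class ConsentedMessageLog
{
    private readonly HashSet<ulong> optedIn = new HashSet<ulong>();
    private readonly Dictionary<ulong, List<string>> messages = new Dictionary<ulong, List<string>>();

    public void OptIn(ulong userId)
    {
        optedIn.Add(userId);
    }

    // Assumption: opting out also deletes anything already stored for the user.
    public void OptOut(ulong userId)
    {
        optedIn.Remove(userId);
        messages.Remove(userId);
    }

    // Stores the message only if its author has opted in.
    // Returns whether the message was stored.
    public bool TryLog(ulong userId, string content)
    {
        if (!optedIn.Contains(userId))
            return false;
        if (!messages.TryGetValue(userId, out List<string> log))
        {
            log = new List<string>();
            messages[userId] = log;
        }
        log.Add(content);
        return true;
    }

    // Returns the stored messages, or an empty array for anyone who has not
    // opted in, so no impersonation can be generated for them.
    public string[] GetMessages(ulong userId)
    {
        return messages.TryGetValue(userId, out List<string> log)
            ? log.ToArray()
            : new string[0];
    }
}
```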

As for the database itself, I used Google Datastore. This decision was based on cost rather than practicality. Datastore was part of Google Cloud Platform, and I believe they charged for the amount of data you stored and the number of edits you made to the database, with a certain amount being free. This made it easier than having to host a separate database, which I would've had to pay for (although I guess SQLite would've also worked).

Trutho Facts

I also used the messages I was logging to the database to generate certain facts about both users and guilds. These were mainly restricted to calculating how often a user would swear, and how often they would use capital letters compared to lower case letters. It seems fairly boring looking back on it. I had made it fairly easy to add new calculations though; I think the real limitation was that I ran out of ideas quickly.
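
As an illustration of the kind of calculation involved, the two facts could be computed roughly like this. The names (`FactSketch`, `UpperCaseRatio`, `WordListRate`) are invented for this example, the watch list is supplied by the caller, and measuring capitals as a share of all letters is an assumption about what "compared to lower case" meant.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of two per-user statistics.
public static class FactSketch
{
    // Share of letters that are upper case, between 0 and 1.
    // Returns 0 when the messages contain no letters at all.
    public static double UpperCaseRatio(IEnumerable<string> messages)
    {
        int letters = 0;
        int upper = 0;
        foreach (string message in messages)
        {
            foreach (char c in message)
            {
                if (!char.IsLetter(c)) continue;
                letters++;
                if (char.IsUpper(c)) upper++;
            }
        }
        return letters == 0 ? 0 : (double)upper / letters;
    }

    // Share of words that appear in `watchList`, ignoring case and any
    // punctuation attached to the start or end of a word.
    public static double WordListRate(IEnumerable<string> messages, IEnumerable<string> watchList)
    {
        var watched = new HashSet<string>(watchList, StringComparer.OrdinalIgnoreCase);
        int total = 0;
        int hits = 0;
        foreach (string message in messages)
        {
            foreach (string word in message.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                total++;
                if (watched.Contains(word.Trim('.', ',', '!', '?'))) hits++;
            }
        }
        return total == 0 ? 0 : (double)hits / total;
    }
}
```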

Evaluation

As I've noted, there were many inefficiencies in the impersonation algorithm as a result of my inexperience and lack of education at the time. Still, looking back on this project, I'm quite proud of it.

As may be expected, the quality of the impersonations got better when there was more data on the user in question. The messages that came out were often quite humorous, and I eventually created the ability for users to share impersonations via Trutho Web (a topic for another time). Because a Markov chain has no memory beyond its current state, messages were quite frequently nonsensical, but for this use case that was alright because it just added to the humour.
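
The "no memory" property is easy to see in a small sketch of a first-order chain. This is not the code the bot used: `MarkovSketch`, the empty-string end marker and the `maxWords` cap are assumptions, and unlike the original it builds the table of followers once up front and does not skip mentions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a first-order, word-level Markov chain.
public static class MarkovSketch
{
    // Marks the end of a message in the transition table (assumed marker).
    public const string End = "";

    // Maps each word to every word that followed it, duplicates included, so
    // picking a uniformly random entry is already weighted by frequency.
    public static Dictionary<string, List<string>> BuildTransitions(IEnumerable<string> messages)
    {
        var transitions = new Dictionary<string, List<string>>();
        foreach (string message in messages)
        {
            string[] words = message.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < words.Length; i++)
            {
                string next = i + 1 < words.Length ? words[i + 1] : End;
                if (!transitions.TryGetValue(words[i], out List<string> followers))
                {
                    followers = new List<string>();
                    transitions[words[i]] = followers;
                }
                followers.Add(next);
            }
        }
        return transitions;
    }

    // Walks the chain from originWord. Each step looks only at the current
    // word, which is why the output can drift into nonsense. maxWords stops
    // a cycle such as "ha ha ha" from running forever.
    public static string Generate(Dictionary<string, List<string>> transitions,
        string originWord, Random random, int maxWords = 50)
    {
        var result = new List<string> { originWord };
        string current = originWord;
        while (result.Count < maxWords
            && transitions.TryGetValue(current, out List<string> followers))
        {
            string next = followers[random.Next(followers.Count)];
            if (next == End) break;
            result.Add(next);
            current = next;
        }
        return string.Join(" ", result);
    }
}
```

Building the table once means each generated word costs a dictionary lookup rather than a pass over every stored message.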

Appendix

The Discord bot used to be a standalone program. However, when I created Trutho Web, I put all of the code into that project so that both the Discord bot, and web server (using ASP.NET) would be in the same executable. In order to not complicate things, I have just taken out the code that I mentioned in this write up, and put it into a tarball so you can just download it here. But below, I've included the code for generating impersonations.

DISCLAIMER: I wrote this code 4 years ago. The coding practices here do not reflect my current coding practices.

Code for generating impersonations

using System;
using System.Linq;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Trutho.Web
{
    public static class ImpersonationGenerator
    {
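        //Picks the word that starts the chain: a random word that accounts for at least
        //'criteria' (a fraction, so 0.1f is 10%) of everything the user has said
        private static string GetOriginWord(string[] previousMessages)
        {
            List<string> allWordsUsed = new List<string>();
            foreach (string message in previousMessages)
            {
                allWordsUsed.AddRange(message.Split(' '));
            }
            //Words that are possibly mentions can never be origin words. Removing them
            //up front stops a history made up only of mentions from looping forever
            allWordsUsed.RemoveAll(w => w.Contains('@'));
            if (allWordsUsed.Count == 0)
                throw new InvalidOperationException("No usable words to pick an origin word from.");
            float usage = 0;
            Random random = new Random();
            string originWord = null;
            int amountOfRuns = 1;
            float criteria = 0.1f;
            while (usage < criteria)
            {
                originWord = allWordsUsed[random.Next(0, allWordsUsed.Count)];
                int timesUsed = allWordsUsed.Count(w => w == originWord);
                //Divide as floats: integer division would round every usage down to 0
                usage = (float)timesUsed / allWordsUsed.Count;
                amountOfRuns++;
                //Relax the criteria by one percentage point roughly every 100 attempts
                if ((amountOfRuns % 100) == 0) criteria -= 0.01f;
            }
            return originWord;
        }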
        public static string GenerateMessage(string[] previousMessages)
        {
            Random random = new Random();
            string originWord = GetOriginWord(previousMessages);
            //Origin word found. Generate message
            string nextWordInMessage = null;
            bool endMessage = false;
            string finalMessage = originWord;
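            //Guard: Discord messages are normally capped at 2000 characters, and without a limit
            //a word that only ever leads back to itself (e.g. "ha ha") would loop forever
            while (!endMessage && finalMessage.Length < 2000)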
            {
                Dictionary<string, int> nextWordProb = new Dictionary<string, int>();
                foreach (string message in previousMessages)
                {
                    string[] wordsUsedInMessage = message.Split(' ');
                    int pos = Array.IndexOf(wordsUsedInMessage, nextWordInMessage ?? originWord);
                    if (pos >= 0) //If word exists in the current message
                    {
                        string nextWord;
                        if ((wordsUsedInMessage.Length - 1) == pos)
                        {
                            nextWord = string.Empty;
                        }
                        else
                        {
                            nextWord = wordsUsedInMessage[pos + 1];
                            //Check if next word is probably a mention
                            if (nextWord.Contains('@'))
                                continue;
                        }
                        if (nextWordProb.ContainsKey(nextWord))
                            nextWordProb[nextWord]++;
                        else
                            nextWordProb.Add(nextWord, 1);
                    }
                }
                //Pick the next word at random, weighted by how often each candidate followed the current word
                ulong maxPoss = (ulong)nextWordProb.Sum(p => p.Value);
                ulong wordToUsePos = (ulong)random.Next((int)maxPoss);
                ulong onProb = 0;
                bool foundWord = false;
                foreach (KeyValuePair<string, int> prob in nextWordProb)
                {
                    onProb += (ulong)prob.Value;
                    if (onProb > wordToUsePos)
                    {
                        //If the next word is actually the last word
                        if (prob.Key == string.Empty)
                        {
                            endMessage = true;
                            break;
                        }
                        nextWordInMessage = prob.Key;
                        foundWord = true;
                        break;
                    }
                }
                if (!foundWord)
                    endMessage = true;
                else
                {
                    finalMessage += $" {nextWordInMessage}";
                }
            }
            return finalMessage;
        }
        public static Task<string> GenerateMessageAsync(string[] previousMessages)
        {
            return Task.Run(() => GenerateMessage(previousMessages));
        }
    }
}

Getting this to run

Good luck.

The Discord API has probably changed since I first wrote this so you'll have that to mess with. But also, remember what I said about how I was storing data. This probably wasn't a very ethical (or indeed legal) way of doing it so if you wanted to get it running, that's also something you really ought to consider.

Later, I'll try to create a newer version of the Markov chain with better practices. I'll probably wind up doing this in Clojure, or Scheme so look out for that.

James Crake-Merani

Trutho: Impersonation