Testing deep learning chatbots

I’m lucky to have got involved in a project testing chatbots. It’s an emerging technology, so I’ll do a series of posts on different aspects of testing chatbots.

The problem with simple chatbots...

Chatbots already exist, and have done so for decades. But the classical bots from “ye olden days” have their limits.

They rely on a database of thousands of pre-made answers to pre-thought-up conversation scenarios. Every time a user writes something, the bot looks through its database for an answer. If a user’s question deviates even a little from the bot’s script, it’s followed up by a standard “I don’t understand the question” reply. It’s a sure-fire way to find out if that sweet girl you’re talking to on Tinder is actually a human.
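To make that concrete, here is a minimal sketch of such a scripted bot. The questions, answers and the `scripted_reply` name are made up for illustration; a real bot of this kind would have thousands of entries, but the weakness is the same.

```python
# Hypothetical script: every key is a question the author anticipated.
SCRIPT = {
    "what are your opening hours?": "We are open 9 to 17 on weekdays.",
    "how do i reset my password?": "Click 'Forgot password' on the login page.",
}

FALLBACK = "I don't understand the question."


def scripted_reply(message: str) -> str:
    """Return the pre-made answer for a message, or the fallback reply."""
    key = message.strip().lower()
    return SCRIPT.get(key, FALLBACK)


# The anticipated wording gets an answer...
print(scripted_reply("What are your opening hours?"))
# ...but the same question in other words falls through to the fallback.
print(scripted_reply("When do you guys open?"))
```

The second question means the same as the first, but because it isn’t in the script word for word, the bot gives up.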

The person behind the typical dating-site bot has programmed it to follow a best-scenario conversation. If the user deviates from the script, the bot is quickly exposed.

It’s nice that it’s so easy to deviate from a bot’s script if you’re trying to decide whether you’re chatting with a human or not. If you are a customer trying to get support for a problem, however, it can be pretty darn frustrating.

Enter deep learning chatbots

“Deep learning” is a method that lets computers learn from patterns in data, in this case conversation data. It’s a branch of machine learning, and it’s actually not far from how humans perceive the world.

Imagine that you had to start all over every time you perceived something. You had to consider every little detail about that dirty coffee cup Peter left in the sink. What is it? What’s its shape? Is it a threat? What’s it for? There’s something in it, what is that? What are those letters on it? How do I use it? It looks like I can grab it, is it a weapon?

… No one would get anything done.

Luckily, the human brain is exceptional at making assumptions and at remembering how things worked the first time it met them. So instead, we go “Oh, that scoundrel Peter left his dirty mug in the sink again, just like yesterday”.


The same thing applies to a deep learning chatbot. It doesn’t have to start from square one every time a user writes a message. It learns from its interactions by using “deep neural networks”, a set of algorithms inspired by the human brain. The deep neural networks are designed to interpret sensory data (text, images, sounds, etc.) through a kind of machine perception. Basically, they make the computer able to recognize patterns by labeling or clustering conversation input.

Deep learning is not a new idea. In the 80s and 90s a lot of work was done in the area of machine learning. The difference from then to now is that we have computers with way better processing power, and easier access to much larger amounts of data for the bot to learn from.

Getting a deep learning chatbot is like getting a baby

Just like humans, a deep learning bot has to start somewhere. Without data, the bot is just a bundle of algorithms. A recently released beta chatbot with limited capabilities is also called a cold-started chatbot.

Generally speaking, the chatbot can learn in two different ways: supervised or unsupervised learning.

Supervised learning

Supervised learning is when a human teaches the bot how it should classify data.

It’s close to when I tell my daughter about the world. I show her a tomato, and tell her that it’s a “tomato” and that she can eat it.

Chatbots under supervised learning are trained to classify input in a certain way by the people who made the bot (or sometimes even by its users).

If you’ve chatted with a bot that asked you if some information was the thing you were looking for, you’ve encountered supervised learning in the making. The bot actually asks whether its classifications are right or wrong. If the user implies “yes”, the chatbot will remember and suggest the same content the next time a user asks it a similar question.
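The feedback loop itself can be sketched in a few lines. This is only the bookkeeping side of it, under my own assumptions: the `FeedbackMemory` class and the topic and article names are hypothetical, and a real deep learning bot would feed the yes/no answers back into its model as training labels rather than into a counter.

```python
from collections import defaultdict


class FeedbackMemory:
    """Remembers which answer users confirmed for which kind of question."""

    def __init__(self):
        # topic -> answer -> (number of "yes") minus (number of "no")
        self.scores = defaultdict(lambda: defaultdict(int))

    def record(self, topic: str, answer: str, helpful: bool) -> None:
        """Store one user verdict on an answer the bot suggested."""
        self.scores[topic][answer] += 1 if helpful else -1

    def suggest(self, topic: str):
        """Return the best confirmed answer for a topic, or None if there is none."""
        answers = self.scores.get(topic)
        if not answers:
            return None
        best, score = max(answers.items(), key=lambda item: item[1])
        return best if score > 0 else None


memory = FeedbackMemory()
memory.record("reset password", "Article A", helpful=True)
memory.record("reset password", "Article B", helpful=False)
print(memory.suggest("reset password"))  # the answer a user confirmed
```

Every “yes” or “no” from a user is a label supplied by a human, which is exactly what makes this supervised.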

Unsupervised learning

Unsupervised learning is when the computer is left on its own, free to find its own connections and patterns in the input it receives.

If my daughter were to learn unsupervised, I’d tell her where the fridge is, and let her figure out how the things in it should be categorized and used. Maybe she’d put tomatoes in the same category as oranges, because both things are round. She could also group tomatoes with chili, since both are red.

It would be quicker for my daughter to categorize the contents of our fridge this way. She wouldn’t have to consult me. However, her categorization of things in the fridge could just as well be useless to me. If I needed tomatoes for a pasta dish, I wouldn’t agree that tomatoes and oranges are mostly the same because they are both round.
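The fridge example is easy to play through in code. Note that this is not a real clustering algorithm: I pick the feature to group on by hand, whereas an unsupervised method would settle on one itself. The point is only that, without labels, both groupings are equally “right”. The `FRIDGE` data and the `group_by` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical fridge contents with two observable features each.
FRIDGE = [
    {"name": "tomato", "shape": "round", "color": "red"},
    {"name": "orange", "shape": "round", "color": "orange"},
    {"name": "chili", "shape": "long", "color": "red"},
]


def group_by(items, feature):
    """Group item names by the value of one feature."""
    groups = defaultdict(list)
    for item in items:
        groups[item[feature]].append(item["name"])
    return dict(groups)


# Grouping on shape puts the tomato with the orange...
print(group_by(FRIDGE, "shape"))
# ...while grouping on color puts it with the chili.
print(group_by(FRIDGE, "color"))
```

Neither grouping knows anything about pasta, which is why the result can be perfectly consistent and still useless to the person who needs it.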

A good example of unsupervised deep learning is Neuralconvo, a chatbot that learned all its conversation patterns by analyzing movie scripts. The result? A bot that knows ALL the movie one-liners ever, but outputs absolute gibberish when you try to have a decent conversation with it.

Neuralconvo is a Deep Learning chatbot that makes very little sense.

Unsupervised learning is definitely interesting, and it’s extremely efficient because nobody has to label the data first. Most companies would, however, want some sort of control over how their bot categorizes data and finds information for the customers.

But what about testing deep learning chatbots?!

Some testers are used to testing against some sort of end result. They’ve read the requirements and have sketched out all possible test scenarios before testing.

Forget about doing that.

When testing deep learning bots, you don’t really have any pre-decided output for different scenarios to lean on. The bot can answer a customer’s question in various ways, and since the bot is always learning new things, the answers will vary from day to day. 

It’s still possible to test it though. You can always make assumptions about how you expect the system to act. You can also follow the development of the bot from (cold) start to finish. Below are three different, but very important, stages in chatbot development where testing matters.


The proofreading stage

Before releasing your bot anywhere, test that it’s doing what it was created for.

Run a number of tests where you give it expected input, and see that it returns a fitting output. You should also test any integrations with other systems and platforms.

Since the data that forms the basis of the bot’s learning in this stage is still a rather small sample and under your control, correcting unwanted behavior is possible and easy.
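Since the exact wording of a reply can change from run to run, one way to check for “a fitting output” is to assert on properties of the reply instead of on the full text. Below is a minimal sketch of that idea. The `fake_bot` function is a hypothetical stand-in for the real bot, and checking for a keyword is just one possible property; you’d swap in whatever counts as fitting for your bot.

```python
import random


def fake_bot(message: str) -> str:
    """Hypothetical stand-in for the real bot; the wording varies per call."""
    if "password" in message.lower():
        return random.choice([
            "You can reset your password from the login page.",
            "No problem! Use 'Forgot password' on the login page.",
        ])
    return "Sorry, I'm not sure I can help with that."


def check_reply(bot, message, must_mention, runs=20):
    """Ask the same question several times and collect replies that miss the topic."""
    failures = []
    for _ in range(runs):
        reply = bot(message)
        if not any(word in reply.lower() for word in must_mention):
            failures.append(reply)
    return failures


# An empty list means every reply stayed on topic, whatever its wording.
print(check_reply(fake_bot, "How do I reset my password?", ["password"]))
```

Asking the same question many times is deliberate: a single passing reply says very little about a system whose answers vary.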


The beta stage

Release the bot to a small sample of real-life users and let them beta-test. Every beta tester is valuable as each individual will treat the bot in a unique way. Many users will also find it interesting to explore the boundaries of the bot, and provide it with completely irrelevant inputs.

In the beta test stage the bot will be subjected to unexpected inputs that you never thought of, such as slang, jargon, and colloquial language. And auto-correct. You will quickly find out just how frustrating auto-correct is for your data.
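Once beta testers have shown you what real input looks like, you can feed that kind of noise back into your own tests. Here is a small sketch that turns one clean question into a few messier variants. The `SLANG` and `AUTOCORRECT` tables are invented examples; in practice you would fill them with whatever your beta logs actually contain.

```python
# Hypothetical substitutions; replace with patterns seen in your own beta logs.
SLANG = {"you": "u", "please": "pls", "thanks": "thx"}
AUTOCORRECT = {"password": "passport"}


def variants(message: str):
    """Return the lowercased message plus one variant per substitution table."""
    base = message.lower()
    words = base.split()
    out = {base}
    for table in (SLANG, AUTOCORRECT):
        out.add(" ".join(table.get(word, word) for word in words))
    return sorted(out)


for variant in variants("Please reset my password"):
    print(variant)
```

Each variant can then be sent to the bot to see whether it still lands on the right answer.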


The surveillance stage

When the bot has been released into the wild, give users the opportunity to report the problems they encounter. This will give you even more input.

Users should be able to report problems right away, preferably inside the chat. Make your chatbot work for you. Let it gather problems users encountered, and send them directly to you. 

Remember to thank your users every time they submit a problem. Every reported problem and idea makes your bot a little better.
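Reporting inside the chat could look something like the sketch below. Everything in it is an assumption on my part: the `/report` command, the `ReportInbox` class and the fields that get stored are illustrative, and a real bot would hook this into its own message handling and send the reports on to the team.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProblemReport:
    user_message: str  # what the user asked just before reporting
    bot_reply: str     # what the bot answered
    comment: str       # the user's own description of the problem
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ReportInbox:
    """Collects problem reports that users type directly into the chat."""

    PREFIX = "/report"  # hypothetical chat command

    def __init__(self):
        self.reports = []

    def handle(self, message, last_user_message, last_bot_reply):
        """Return a thank-you reply if the message is a report, else None."""
        if not message.startswith(self.PREFIX):
            return None
        comment = message[len(self.PREFIX):].strip()
        self.reports.append(
            ProblemReport(last_user_message, last_bot_reply, comment)
        )
        return "Thanks for reporting this, it helps make the bot better."


inbox = ReportInbox()
print(inbox.handle("/report that article is about billing",
                   "How do I reset my password?",
                   "Here is an article about invoices."))
print(len(inbox.reports))
```

Storing the last question and reply together with the comment means you get the context for free, without asking the user to retype it.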

A few last words for deep learning testers

When testing deep learning bots, you need to let go of the urge to know every scenario of the system. Testing chatbots is about exploring and experimenting to discover and learn about unexpected data patterns and classifications.

Instead of trying to give your customer a checklist of what works and what doesn’t, give them your professional account of how the system you’re testing behaves. Maybe

