With the recent news that Regal and Cineworld are closing their cinemas, it appears the 100-year-old cultural tradition, and associated business model, of feature-length storytelling in the cinema is reaching the conclusion analysts have been forecasting for a decade. Audiences will always go to the cinema, but the days of the feature film being the best method, creatively and fiscally, for creators to craft narratives are over.
Conversation around the future of entertainment generally centres on user-generated content, SVOD platforms, cross-platform play, free-to-play models and the intrigue of the Metaverse.
In what ways is arguably the most transformative technology of our time, deep learning, going to change the way we play and experience stories?
The broadest answer to this question is personalization. Deep learning based technologies will help to personalize both the stories that are told to us and the ways we experience them.
After learning about and experimenting with these technologies over the past couple of years as part of the founding team of Transforms.ai, I want to outline what I have learned from technical, creative and business-opportunity standpoints.
In this expository essay, I discuss the potential implications for storytelling, practicalities and impracticalities, as well as a very early and idealistic framework for how I see some of these technologies functionally monetizing and fitting into the entertainment economy. I have included some semi-technical descriptions in italics, so feel free to read past these if you are not interested.
While AI has been present in video games since Nim was released in 1952, its most common use in storytelling has been the non-player character, or NPC. NPCs rose to prominence in the 1990s with a model called the Finite State Machine, or FSM. FSMs script an NPC's reactions based on the state of the player. They have since evolved to include dialogue, and the looping structure of classic FSMs has grown into larger, though less flexible, behaviour trees.
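The hand-scripted nature of an FSM is easy to see in code. Below is a minimal sketch of a guard NPC; the state names and distance thresholds are hypothetical, and real game FSMs are far larger, but the structure is the same:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    CHASE = auto()
    ATTACK = auto()

class GuardNPC:
    """Hand-authored transitions: the guard reacts, but never learns."""
    def __init__(self):
        self.state = State.IDLE

    def update(self, player_distance: float) -> State:
        # Each state has its own scripted transition rules.
        if self.state == State.IDLE and player_distance < 10:
            self.state = State.CHASE
        elif self.state == State.CHASE:
            if player_distance < 2:
                self.state = State.ATTACK
            elif player_distance > 12:      # hysteresis: give up the chase
                self.state = State.IDLE
        elif self.state == State.ATTACK and player_distance >= 2:
            self.state = State.CHASE
        return self.state

guard = GuardNPC()
print(guard.update(8))    # State.CHASE  -- player came close
print(guard.update(1))    # State.ATTACK -- player within striking range
print(guard.update(20))   # State.CHASE  -- attack ends, chase resumes
```

Every behaviour here was typed in by a human; no amount of play changes the rules, which is exactly the limitation the next section turns to.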
FSMs, behaviour trees and bots are all examples of game AI. What separates game AI from the machine learning and deep learning techniques in use today is that game AI cannot learn. But what happens when NPCs become more sophisticated, able to learn and to craft narrative from user input?
‘Virtual Beings’ has picked up steam in the investment space, with startups in the category having raised more than $320 million to date. The list includes Brud (Lil Miquela), The AI Foundation (Digital Deepak Chopra), Soul Machines and Pinscreen. The majority of these use cases are corporate training, personalized chatbots and marketing extensions. So what potential are people seeing for Virtual Beings in content?
We’ve all seen Lil Miquela, her Instagram presence and her burgeoning music career. Brud has done an incredible job monetizing and marketing her. However, there is no evidence that Brud has used any deep learning or machine learning techniques for Lil Miquela’s Instagram captions, interview dialogue or song lyrics and production. For now, she is closer to an Americanized Hatsune Miku than to an evolution of the virtual being.
If you ask Ed Saatchi, formerly of Oculus Story Studio and now head of Fable, Virtual Beings are the next operating system. Fable’s experience, Wolves in the Walls, features Lucy, a character who supposedly has a mind of her own. Players can interact with Lucy with their voice, and she responds to each user, giving them an intimate moment. The vision is to have characters imbued with personality, able to integrate a user into a scene in an immersive way that is novel for each viewer. The character “guides” each audience member through a unique story.
As the demo video shows, Wolves in the Walls’ Lucy is somewhat disappointing. I found her to be more of a demo of a one-to-one connection with a character than a character who can offer up unique dialogue and narrative beats based on my reactions. There seemed to be a soul to Lucy, but no intelligence yet.
Replika is a company based on a Black Mirror episode, which feels like some kind of threshold we are beginning to cross. While Replika does not fit into a traditional story like Lucy does, there is an argument that this may be a form of user-generated content and a new way to tell a story, based on exactly what the user wants to hear. Replika feeds user text input into its recurrent neural network to generate dialogue, delivered through a 3D character model. The user then trains their Replika by upvoting and downvoting certain utterances, until it says the types of things they want to hear. A creepy concept indeed, but one that has already had some remarkable effects on those who are elderly, lonely, or socially isolated.
The discussion around how a Virtual Being in a story can be monetized is an interesting one. If we really do get to a point where a Virtual Being can lead a user through evergreen content, it would likely require more compute than a single executable, binary or update could handle within the game itself. My thought is to borrow the strategy behind Minecraft: Realms and apply it to Virtual Being-infused experiences. Minecraft: Realms is a subscription package for Minecraft that lets players create personal servers for themselves and their friends. On every $7.99 monthly subscription, Microsoft makes a margin on servers, and also gains flexibility to create new content, such as skins and environments. By pairing a F2P model with a SaaS-style monthly fee, Virtual Beings experiences could cover the ongoing cost of compute while allowing developers to create new content and continue to engineer and maintain the underlying neural networks. One can see a future where this model is commonplace for Virtual Beings, but until we see Virtual Beings that really do guide the way through evergreen content, the price point will likely outweigh the benefits relative to the competition.
So do these Her-like machinations have any merit?
They do, but as of this moment, the progress has been incremental. Lucy in Wolves in the Walls is promising, but seems “on rails” and can be fooled easily by even somewhat difficult or counter-contextual questions. With Miquela and other Brud projects, we are seeing the beginnings of how users can feel empathy with and connection to virtual characters, which promises to steer us towards the vision of intelligent virtual beings that personalize stories. But for now, those characters are purely a visual delight, with what’s under the hood mainly crafted by humans. Natural language processing could very well be the key to unlocking this.
Natural Language Processing/Generation
While Virtual Beings are the visual piece of the puzzle, natural language processing (NLP) is a prerequisite to a virtual character generating text that makes sense in context and adds to a story.
NLP is defined as a subset of artificial intelligence concerned with the interaction between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
OpenAI’s GPT-2 and GPT-3 are the most advanced language models to date, and we are starting to see some interesting use cases in entertainment. These models are trained on a sizeable scrape of the internet, which gives them a wealth of knowledge and context as well as a grasp of grammar and sentence structure. They then use deep learning to predict the best way to put a sentence together, running a fresh calculation after each word to predict the next one.
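That word-by-word prediction step can be illustrated with a toy example. The vocabulary and scores ("logits") below are hand-set stand-ins; in GPT-2 or GPT-3 they would be produced by billions of learned weights:

```python
import numpy as np

# Imagined scores for candidate next words after the context "the wolf".
vocab = ["howled", "slept", "banana", "ran"]
logits = np.array([3.2, 1.1, -2.0, 2.5])

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Softmax turns raw scores into a probability distribution over the vocabulary.
probs = softmax(logits)
next_word = vocab[int(np.argmax(probs))]
print(next_word)              # "howled" -- the highest-scoring continuation
```

A real model repeats this loop, appending each chosen word to the context and predicting again, which is how a single prompt unrolls into paragraphs of text.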
AI Dungeon was one of the first projects to use GPT-2 to tell a story. Created by engineer Nick Walton and available through a Patreon subscription, AI Dungeon emulates the traditional role of a dungeon master in Dungeons and Dragons, who creates a story within certain constraints for the other D&D players. In AI Dungeon, players first choose a genre, such as fantasy, mystery or cyberpunk, and then choose other options to begin to shape their story. GPT-2 then writes an introduction. Players can interact with the model in 3 ways:
“Say” — The player comes up with a character’s dialogue
“Do” — The player starts with a verb and inputs some kind of action
“Story” — The player writes a couple of sentences to drive the narrative forward
After any of these commands is input, GPT-2 bases its response on it and continues the story. The experience is amazing, but one that most users will try a few times and then leave alone. For the cost of the compute needed, it is fun but not especially useful. Walton has since upgraded to GPT-3 (at a much higher subscription price). Future uses of this approach include script writing: input a large sample of scripts, feed the model certain parameters, and let the network output dialogue.
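The three commands above amount to a small templating layer between the player and the model. The sketch below is an assumption for illustration only; AI Dungeon's actual prompt templates and model calls are not public:

```python
def format_turn(command: str, text: str) -> str:
    """Turn a player command into prompt text, mimicking the three verbs.
    The exact wording is a guess at how such a layer could work."""
    if command == "say":
        return f'You say "{text}"'
    if command == "do":
        return f"You {text}"      # player supplies a verb phrase
    if command == "story":
        return text               # free-form narration passes straight through
    raise ValueError(f"unknown command: {command}")

story = ["You wake in a torchlit dungeon."]
for cmd, text in [("do", "search the room"), ("say", "Is anyone there?")]:
    story.append(format_turn(cmd, text))
    # story.append(model.generate("\n".join(story)))  # a language model would continue here
print("\n".join(story))
```

The accumulated `story` string is what would be handed back to the model as context, so every past turn shapes the next generation.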
Ross Goodwin has probably come closest to creating scripts using deep learning. A graduate of MIT & NYU and former speechwriter for Barack Obama, Goodwin gained access to NYU’s supercomputer for his first AI project, Sunspring (2016). To create the script for Sunspring, Goodwin fed a neural network (named Benjamin) film scripts, let it write one, and then filmed exactly what the neural network output, including all dialogue and actions.
He then followed up Sunspring with Zone Out, where he took things a step further: he used AI for the script once again, but also for the entire production pipeline. The film uses deepfakes and Jukedeck for a robo-score, and lets Benjamin decide on dubbing and colouring. The results are interesting, but it is tough to see an audience spend their monthly content subscription on content like this.
In 2018, Goodwin fed a neural network three corpora containing 20 million letters of text from fiction: one with poetry, one with science fiction, and one with “bleak” writing, and took a computer on the road from New York to New Orleans, the same journey as in Kerouac’s “On the Road”. He also fed the network Foursquare data based on his current GPS location. The results? “Choppy”, according to Goodwin. The final novel, “1 the Road”, is available on Amazon. Goodwin’s work may one day be looked back upon as pioneering a new wave of content, but for now, it stays in the realm of the weird and interesting.
The holy grail of natural language processing in games, film & TV is to have personalized experiences based around the inputs given to a machine, as well as infinite content generated for literature and scripts. Did you think that the ability to make narrative decisions in Black Mirror: Bandersnatch was cool? While Bandersnatch was an innovation to the format of interactive film, it will always be limited by the amount of narrative content that must be written and filmed by humans. NLP, and OpenAI’s GPT series in particular, may be the way we enable truly infinite narratives, and by proxy, Virtual Beings in story.
Unfortunately, high level NLP, and specifically GPT-3, is computationally very expensive, and outputs text that is contextually relevant, informative, and often funny, but is inconsistent and has a tendency for total randomness.
Take the example of GPT-3 writing for a content farm, which many argue is the simplest storytelling use case for GPT-3 in its current form, and a likely precursor to believable dialogue and scriptwriting. Content farms generate high quantities of written content that they then leverage for ad revenue and SEO. What if GPT-3 were fed keyword-heavy, clickbait-y headlines, and then told to output coherent content given certain parameters around length, keyword frequency, and subject?
Most agree that long form GPT-3 text takes around 6 attempts to create something valuable, and even then, many of the articles we have seen “generated” by GPT-3, using the exact same query, were either the best parts of a number of different outputs, or edited by a human.
On top of all of this, OpenAI’s recently announced pricing takes GPT-3 away from hobbyists and into the hands of those with large-scale enterprise use cases. Running the model on OpenAI’s “Create” tier, before any additional costs, works out to a minimum outlay of around $87k for 1.5 million words.
With the above assumption that 6 samples are needed per useful result, that 1.5 million words is effectively more like 250k usable words. Add the cost of an editor going through the full output to distinguish what does and does not make sense, and then editing or cobbling together the best of the results, and GPT-3 still does not likely pay for itself. Not to mention, the ramifications of Microsoft’s deal to exclusively license GPT-3 are still unknown, and it may become even tougher for storytellers to use it. On top of all of this, the machine learning talent needed to make these models work is expensive and possibly outside the budget of the entertainment and creator spaces. The fixed cost to license and run GPT-3 is likely too high for creators to justify at this point, and into the foreseeable future. Creators can still experiment with these technologies, but if costs fall along the lines of Moore’s law, we will likely be waiting at least five to ten years until the computational cost becomes reasonable. There are lightweight versions of GPT-3-style models, and these may continue to get lighter.
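The back-of-envelope arithmetic here, using the essay's own (unverified) figures for tier volume and attempts per keeper:

```python
# Figures taken from the text above, not independently verified.
words_per_tier = 1_500_000     # words generated on the "Create" tier
attempts_per_keeper = 6        # ~6 attempts to get one usable long-form output

usable_words = words_per_tier // attempts_per_keeper
print(usable_words)            # 250_000 usable words, matching the estimate above
```

Any editing cost then applies to the full 1.5 million generated words, not just the 250k that survive, which is what makes the economics so unattractive.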
From a subjective and functional perspective, GPT-3’s ability to write dialogue and fiction that is consistently high quality, culturally relevant and workable for a large-scale production is still a ways away. It does appear likely to happen eventually, but a neural network writing a traditional human comedy or drama at the current level of innovation, as incredible as that level is, would be both unlikely to be a hit and unnecessary. The logical thinker in me suspects that NLP tools like GPT-3 will, for now, be used for inspiration rather than as a full creative process. I am also optimistic that the truly creative and innovative minds out there will use the ability to generate infinite text to create entirely new art forms.
Synthetic Media — GANs, Deepfakes & Style Transfer
What if you could purchase a piece of art that never stopped changing? What if you could animate your favourite live action film or see your favourite TikTok trend in the style of Van Gogh’s paintings?
Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means.
Generative adversarial networks, or GANs, are a newer class of neural network (introduced in 2014 by Ian Goodfellow) with some very interesting properties. You may have seen GANs on Twitter this week, with the news that NVIDIA is using them to compress video and keep video conferencing usable even over an unreliable internet connection.
The basic idea of GANs is that two networks, a “generator” and a “discriminator”, go to intellectual war with one another in a zero-sum game. The generator produces samples by mapping points in a “latent space”, typically random vectors in a learned, compressed representation of a large dataset (often images), to generated images. The internal workings of both the latent space and the generator of a GAN are not yet well understood.
The samples produced by the generator are then judged by the discriminator, which must decide whether each one is real or fake, using a specific dataset or asset as its “north star”. Data scientists call this a “min-max game”: the generator is always trying to maximize the probability of its fake data being declared real, while the discriminator is trying to minimize that probability. In short, the generator produces samples and the discriminator critiques them, over and over, until what the generator creates is indistinguishable from the original data. Along the way, before that point, what the generator creates often falls into the realm of the absurd.
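The min-max game can be made concrete with toy numbers. The discriminator outputs below are imagined, but the two opposing loss terms are the standard GAN formulation:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy, the loss behind the GAN min-max game."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)   # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Imagined discriminator outputs (probability "real") on one batch.
d_on_real = np.array([0.9, 0.8, 0.95])     # discriminator's verdicts on real samples
d_on_fake = np.array([0.1, 0.3, 0.2])      # its verdicts on the generator's samples

# The discriminator wants real -> 1 and fake -> 0 ...
d_loss = bce(d_on_real, np.ones(3)) + bce(d_on_fake, np.zeros(3))
# ... while the generator wants those SAME fake outputs pushed toward 1.
g_loss = bce(d_on_fake, np.ones(3))

print(d_loss < g_loss)   # True: early in training the discriminator is winning
```

Training alternates gradient steps on these two losses; the generator only "wins" when its samples pull `d_on_fake` up toward 1, i.e. when the discriminator can no longer tell real from fake.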
This is all well and good, but what can GANs do for storytelling? To start, an artwork created by a GAN recently sold for $432,500 USD at Christie’s, a sign that the art world may be in for disruption. StyleGAN, created by Nvidia, has also produced convincing photorealistic images of humans. It is only a matter of time until facial rigs and voice prosody catch up to the realism of StyleGAN’s still images. You can get creative with what this means for live action film and TV.
Deepfakes are the best-known relatives of GANs. The most widely known examples take the face of a reference and make it seamlessly take the place of a target’s face in video content.
Classic deepfake pipelines use two autoencoders: one tasked with reconstructing the target, the other with reconstructing the reference. Each autoencoder consists of an encoder and a decoder. The encoder’s role is to take a dataset of videos of either the target or the reference and learn to compress each input image into visual dimensions, such as the character’s pose and the shape of their features, recognizing patterns such as mouth movement. The decoder’s role is to reconstruct that representation, pattern details included, back into the original image. A deepfake is generated when the encoded representation of the target is fed into the decoder of the reference.
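The swap at the heart of this can be sketched with toy linear "networks". Real pipelines train deep convolutional autoencoders per identity; this sketch only shows the architectural trick, not actual face synthesis, and every matrix here is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a shared encoder and one decoder per identity.
ENC = rng.normal(size=(16, 64))         # image (64 values) -> latent (16 values)
DEC_TARGET = rng.normal(size=(64, 16))  # latent -> target's face
DEC_REFERENCE = rng.normal(size=(64, 16))

def encode(image):
    return ENC @ image                  # pose/expression representation

def decode(latent, decoder):
    return decoder @ latent             # rebuild a face from the latent

target_frame = rng.normal(size=64)

# Normal training path: reconstruct the target with the target's own decoder.
reconstruction = decode(encode(target_frame), DEC_TARGET)

# The deepfake: the target's latent fed through the REFERENCE decoder,
# i.e. the reference's face performing the target's pose and expression.
deepfake_frame = decode(encode(target_frame), DEC_REFERENCE)
print(deepfake_frame.shape)             # (64,) -- same shape, different identity
```

Because the encoder is shared, the latent carries identity-agnostic information (pose, expression), and choosing which decoder to apply chooses whose face appears.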
Synthesia, a startup based in the UK, burst onto the scene when they released a charity video of David Beckham speaking in 9 different languages.
The implications of this are fantastical: imagine being able to make Tom Cruise convincingly speak Mandarin. In theory, this could mitigate what Bong Joon-ho dubbed the one-inch barrier of subtitles, bringing international stories to mass audiences. However, after raising their $3.1 million seed round in April 2019, Synthesia seemingly pivoted to a video platform where users can produce live action video without leaving their office, using deepfake actors. That shift casts doubt on their technology’s readiness for major live action productions. The implication is still fascinating: even more so than realistic dubbing, what if a studio could cast Guan Xiaotong, Zendaya and Alia Bhatt in the exact same role in a film and ‘port’ them for different audiences? This could change the way films are distributed and marketed, and globalize content production in a way we have never seen before. For all of the scary potential of deepfakes used for evil, a future where deep learning allows nations to share stories and cultural artifacts across borders is very encouraging. Also, someone tipped the South Park creators off to deepfakes, and they came up with Sassy Justice, which validates that when these tools get to the right creators, they will find ways to create content no one has seen before.
Arguably the most interesting potential for synthetic media is style transfer. Neural style transfer is an optimization technique used to take two images — a content image or video and a style reference image (such as an artwork by a famous painter) — and blend them together so the output image looks like the content image, but “painted” in the style of the style reference image.
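Under the hood, the usual "style" statistic in this technique is the Gram matrix of feature-map channels, as in the original Gatys et al. formulation. A minimal sketch on random stand-in features (a real implementation would pull these features from a pretrained convolutional network):

```python
import numpy as np

def gram_matrix(features):
    """Correlations between feature channels: the standard 'style' statistic."""
    c, hw = features.shape          # channels x (height*width)
    return features @ features.T / hw

def style_loss(feats_a, feats_b):
    """Mean squared difference between the two Gram matrices."""
    g_a, g_b = gram_matrix(feats_a), gram_matrix(feats_b)
    return float(np.mean((g_a - g_b) ** 2))

rng = np.random.default_rng(1)
content = rng.normal(size=(8, 100))   # 8 feature channels over 100 pixels
style = rng.normal(size=(8, 100))

print(style_loss(content, content))   # 0.0 -- identical style statistics
print(style_loss(content, style) > 0) # True -- mismatched styles are penalized
```

Style transfer then optimizes the output image to keep the content features close to the photo while driving this style loss toward the painting, which is what produces the "painted" look.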
Right now, this seems like the type of technology that may never make it past Snap-filter status, or may be used for gimmicks by studios constantly looking to squeeze library value out of their IP (“who wants to see an animated Top Gun?”). For me, the most interesting potential of neural style transfer is turning live action into high-fidelity, convincing animation, and vice versa. This alone could open the floodgates for the prosumer generation to create incredible art.
If a YouTube creator on limited hardware could one day make their videos look like Pixar’s by running footage shot on a regular DSLR through style transfer software, the quality and variety of content that could be produced in a bedroom would be limitless. Setting aside, for a moment, the hardware and performance cost of converting live action video frame by frame into 3D-style art, the way prosumers tell their stories and engage with their audiences could be transformed.
Right now, the closest thing we have seen to a standalone business model is Runway ML, which has curated an app store of sorts for the top visual machine learning models, meant to be used by filmmakers, creators and designers. It is unclear whether and how the authors of each model are paid, but Runway ML sells users ‘tokens’ that buy a certain amount of compute. Essentially, Runway earns a premium on selling remote compute power, which most of the models in its app store require for any significant results. Runway is a pioneer in the creative machine learning space, and this type of business model is likely how this type of work will make its way into the public consciousness.
While interesting, the knock on these forms of synthetic media is that they emulate creativity rather than being creative themselves. However, despite all of the dangers posed by synthetic media, these applications of deep learning have the potential to become tools and techniques that both augment creativity and customize visuals while bringing stories to larger audiences.
Of all of the methods and technologies discussed in this post, synthetic media seems to be the closest to being used in commercially exploited productions or other forms of storytelling. The various deep learning models and techniques on Runway ML are widely available, and researchers are constantly putting out open-source papers and Github repos containing models with a lower than average knowledge barrier to entry for engineers who want to use them. These new art forms could soon find their way into content for the masses.
Agent-Based Learning: Story Structure
Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. A human being customizes these rewards, creates parameters in an environment, and then allows the software agents to learn based on repetition.
In other words, consider this definition from Towards Data Science:
Imagine a baby is given a TV remote control at your home (environment). In simple terms, the baby (agent) will first observe and construct his/her own representation of the environment (state). Then the curious baby will take certain actions like hitting the remote control (action) and observe how would the TV response (next state). As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take less actions that will lead to such a result(updating the policy) and vice versa. The baby will repeat the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) rewards).
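The baby-and-remote analogy maps directly onto tabular Q-learning, one of the simplest RL algorithms. A runnable toy version, with hand-set rewards standing in for the baby's delight:

```python
import random

random.seed(0)

# States and actions from the baby-and-remote analogy above.
states = ["tv_off", "tv_on"]
actions = ["press_button", "ignore"]
Q = {(s, a): 0.0 for s in states for a in actions}   # the agent's value table

def step(state, action):
    """Hand-written environment: pressing the button toggles the TV."""
    if action == "press_button":
        next_state = "tv_on" if state == "tv_off" else "tv_off"
    else:
        next_state = state
    reward = 1.0 if next_state == "tv_on" else -0.1  # a responding TV delights the baby
    return next_state, reward

alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
state = "tv_off"
for _ in range(2000):
    # Epsilon-greedy: mostly exploit the table, occasionally explore.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in actions)
    # The Q-learning update: nudge the value toward reward + discounted future value.
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

# The learned policy: press the button when the TV is off, then leave it alone.
print(max(actions, key=lambda a: Q[("tv_off", a)]))
print(max(actions, key=lambda a: Q[("tv_on", a)]))
```

No one told the agent what to do in either state; the policy emerges entirely from the reward signal, which is the property the rest of this section builds on.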
What if humans could interfere with both the agents and the environment on the fly, either disturbing it or creating new ways for the agents to take control of it? What if instead of controlling the baby in the above example, a human could control the actions, states and rewards, thus leading to a reshaping of the outcome?
Agence, a new experience that calls itself a “dynamic film”, uses reinforcement learning to give its characters personalities and have them react to environmental shifts caused by user interaction. Agence was produced by myself and my team at Transforms.ai, and we consider it the “silent film” of this new era of AI art. It is too new and different to qualify as a film or game, and the characters (aptly named “Agents”) do not speak, but rather emote. The Agents live a peaceful existence, keeping a planet (their environment) in perfect balance, which the user interrupts, leading to an infinite number of outcomes. We trained the Agents by setting up an environment for them, giving them limited game mechanics (actions), and then laying out rewards based on dramatic events in the narrative. We trained them over time using Unity’s ml-agents to survive in this environment given the planet’s physics, and then gave them additional parameters and goals that we thought would lead to interesting behaviours.
With each set of parameters, we ran the Agents through the simulation many, many times (read: millions), and gauged training success by tying variables in graphs to dramatic beats. We dubbed the best sets of rewards and goals from each simulation “brains”, and put them back into the experience one by one. Each of these brains exhibits certain behaviours and personality traits, and we are continuously making new brains to add more variability to the characters and the resulting stories. For example, certain brains we have trained are more aggressive than others, some are more curious, and some are more helpful and sensitive. The differences may not be noticeable unless you are paying attention, but every outcome of Agence is different because of the different brains. Agence is an interesting first example of using RL software agents in a story, letting the narrative be shaped by the environment, the user’s interactions, and reinforcement learning “brains”.
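To give a flavour of what a reward-shaping "brain" could look like: Agence's actual training code (built on Unity's ml-agents) is not public, so every event name and weight below is an illustrative assumption rather than our real reward table.

```python
# Hypothetical rewards tied to dramatic beats, in the spirit described above.
BASE_REWARDS = {
    "kept_planet_balanced": 1.0,    # the agents' peaceful default
    "approached_stranger": 0.25,    # mild reward for curiosity
    "pushed_agent_off": -0.5,       # aggression is usually punished...
    "helped_agent_up": 0.5,
}

def dramatic_reward(event: str, personality: dict) -> float:
    """A 'brain' as a reweighting of the same dramatic events."""
    return BASE_REWARDS.get(event, 0.0) * personality.get(event, 1.0)

curious_brain = {"approached_stranger": 2.0}
aggressive_brain = {"pushed_agent_off": -2.0}   # ...but this brain flips the sign

print(dramatic_reward("approached_stranger", curious_brain))   # 0.5
print(dramatic_reward("pushed_agent_off", aggressive_brain))   # 1.0
```

Training the same mechanics under different multiplier tables is one plausible way distinct personalities could emerge from a single underlying simulation.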
We see Agence as a progression towards giving users more autonomy in choosing environments, rewards and actions, leading to very interesting mechanics and outcomes within an environment, and eventually, to a feature-length interactive film or game that a user does not just experience, but has a hand in forming, with unpredictable outcomes.
Beyond that, the dream is to turn the qualitative elements of a story, say the hero’s journey, into variables that can be quantitatively controlled, so that agents in an environment can be trained to learn story. This way, RL agents can learn to follow a story structure and compose stories that are unique on each viewing.
One can also see reinforcement learning as a useful tool in creating the Metaverse (please read Matthew Ball’s essay on the Metaverse if you are not familiar with the concept). In Neal Stephenson’s Snow Crash, the same novel where the term Metaverse was coined, the Metaverse is inhabited by both humans in avatar form and pieces of software that serve certain functions, such as serving drinks, policing spaces, and generally keeping things running smoothly. The modern interpretation of these pieces of software is akin to what we see as agents in reinforcement learning.
Reinforcement learning simulations will be useful in creating learning agents instead of NPCs in the Metaverse, for more realistic interactions and for automating the flow of different environments. Moderators, those tasked with keeping virtual-physical communities in line, could change the parameters of the learning agents to ensure communities run as needed.
Currently, the best use cases for reinforcement learning in content are testing games for bugs and simulating levels and scenarios to give game designers an understanding of how users would react. Modl.ai, a Copenhagen-based startup with a team of specialists ranging from PhD machine learning researchers to psychologists to game designers, is doing exactly that using reinforcement learning. Their business model is service-oriented and revolves around tools for cheat detection and game balancing, called modl:assure and modl:play, respectively. They are also working on modl:create, which in theory would allow game developers to generate content for their games, creating better experiences faster and personalizing existing content. This business model makes sense in the near term, as these simulations will likely require specialized services to operate. Eventually, I hope to see software packages and a marketplace, similar to Unity and Unity’s Asset Store, emerge specifically for reinforcement learning simulation environments and scripts.
Tools for Game & Film Production
Much of the innovation for deep learning in storytelling has come in the form of production techniques and tools, specifically using data to automate production processes and optimizing existing processes to make games run smoother. For example, La Forge, Ubisoft’s R&D lab, created Learned Motion Matching, a way to make Motion Matching, a key technique in game animation, run more efficiently. Reinforcement learning is also commonplace at AAA game studios, with both EA and Ubisoft developing systems that use agent-based learning to test their games and find bugs.
We have also seen the advent of tools that make the video editing process much simpler, such as those found in Adobe Premiere. These tools use deep learning to label and organize footage, match them to a script, and then give instructions on the style in which videos should be edited.
While innovations like this in the production process are interesting and will likely be the most prominent use of deep learning in storytelling in the near term, I am more interested in the implications on story itself.
Through my time working at Transforms.ai, a question has nagged at me: do we need deep learning in our content and stories? We are in the Golden Age of content, with most media consumers facing decision paralysis over the content at their fingertips, between the 2 or 3 SVOD subscriptions they hold, the 1 or 2 game consoles sitting in their living room, and enough YouTube, TikTok & Twitch creators to start an army. Do we need new ways to tell stories, or just better stories?
Trevor McFedries, co-founder of Brud and father of Lil Miquela, was recently featured on Reid Hoffman’s podcast Masters of Scale, and gave a quote that stuck with me.
“I think the biggest lessons for us thus far have been the sexy technology stuff doesn’t matter as much as the narrative…It comes back to this story always, it comes back to this character always.”
Is deep learning a prerequisite to the evolution of story, or is good story still the domain of humans? None of us yet knows the answer, but I hope I have at least managed to get your right-brained wheels spinning on the exciting possibilities.
I believe that “deep learning in story” still has a ways to go to get to the Peak of Inflated Expectations on the Gartner Hype Cycle, and the slide towards the Trough of Disillusionment may be long and tough. It will be a long time until computational creativity is advanced enough for machines to understand and create stories without human intervention. I hope that the models and use cases I have outlined are closer to the Plateau of Productivity, rather than many of the weird but interesting pieces of content we are bound to see in the coming period.
The possibilities for storytelling with deep learning are endless, but it will take some time for these technologies and other deep learning rooted models to find their way into story. Even then, knowing the current costs associated, these generally seem like large expenditures for unknown gains, not considering any brand new forms of content that may be created. Computational creativity may end up solely as a tool to supplement human creativity rather than a way for creativity to be generated. Whether a machine can actually be creative is a philosophical conversation for another day.
What I find most interesting about using these technologies for story is that they are all rooted in developing new ways to personalize the stories we are told. Reinforcement learning can allow users to tweak the parameters of the environment, actions and character motivations of the story they are watching. NLP and virtual beings may let audiences generate an infinite supply of quality stories. Synthetic media could give audiences the ability to take any piece of content and watch it in any language, in any art style, or with whoever they would like playing the lead role. As we enter a period where content is becoming shorter, recommended based on data and, most importantly, easily accessible, these systems all suggest different ways for one piece of content to be different for everyone who experiences it.
For now, I hope that deep learning and all of the associated tools, techniques and open source treasure will slowly start to find their way into commercial stories, and inspire people to use them in whatever way they are available. There is true innovation and new forms of media that cannot even be imagined yet that will come to fruition.
This is my first long-form piece, I hope you enjoyed! If you are interested in more content like this, please follow me on Twitter.