From the perspective of a developer, how we structure software seems to change constantly. However, if you cut through the noise and think a bit more like an economist, you’ll notice that there have really been only a few big shifts.
Architectural shifts happen when a previously expensive (slow, costly, difficult) operation becomes cheap, allowing alternative architectures not just to improve on existing ones, but to unlock previously impossible products and practices.
Now we find ourselves at the start of a new paradigm, triggered by the falling cost of prediction.
Before Cheap Prediction
Until now, we developed software by giving computers strict rules, using layers of logic and human-friendly abstractions. New features required new rules, with significant process to ensure those rules didn't break existing functionality (tests/QA) or the host (virtualization/sandboxing). Updating a poorly designed system sometimes required more rules than building a new one. Creating rules required humans, and was therefore expensive.
As an industry, we adapted by designing standards and protocols to improve interoperability and by automating rule generation. We leveraged our foundation of cheap hardware, vast networks, and portable sensors to safely distribute and execute our rules for an ever-growing number of inputs. We designed systems and technologies intended to be maintained by humans.
Enter Cheap Prediction
Machine learning enables cheap prediction by creating models whose rules are not written (or understood) by humans, but derived from vast sets of data. Once trained on a corpus of inputs and outputs, these models can predict future outputs within the scope of their training data, no human-written rules required. Many data-rich applications, like fraud detection and facial recognition, have models that are more than 99% accurate, even when facing situations they have never seen before.
Following this approach with language from the internet as the corpus, Large Language Models (LLMs) predict probable words following an input sequence. At sufficient scale (on the order of 70 billion parameters), what emerges can produce text that passes the Turing Test, accurately translate between languages (including code), and appear intelligent. In seconds, these models can write code, paint a picture, or draft an essay given minimal direction.
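"Predicting probable words" means assigning a probability to every candidate next token and choosing among them. A toy sketch with hard-coded probabilities standing in for a real model:

```python
import random

# Toy next-token prediction: a real model derives these probabilities from
# billions of learned parameters; here they are made up to illustrate the
# idea of choosing among probable continuations.
next_token_probs = {"Paris": 0.92, "Lyon": 0.04, "France": 0.03, "the": 0.01}

def predict_next(probs: dict[str, float]) -> str:
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print("The capital of France is", predict_next(next_token_probs))
```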
Strict Rules Vs. Flexible Probabilities
Unfortunately, benefiting from these models by integrating them into our rules-based systems is problematic for a few reasons:
- Model output is probabilistic, unlike today's deterministic programs
- Unexpected output is hard to detect, and even harder to fix
To use LLMs at scale, we'll need to both understand these difficulties and design systems that employ them strategically.
LLMs are Probabilistic
The first major issue is that model output is not deterministic: we cannot guarantee the model will produce the specified output format for every input, and the problem compounds as the situation grows more complicated. Training on similar formats helps, but in practice the output of every LLM invocation should be validated and sanitized as if it were human input.
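A minimal sketch of that discipline, assuming a hypothetical `call_llm()` helper (stubbed here) that is asked for JSON: parse the response, validate it against what we expect, and fall back safely when it does not conform.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; any LLM client could be swapped in."""
    return '{"sentiment": "positive", "confidence": 0.93}'

REQUIRED_KEYS = {"sentiment", "confidence"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def classify(text: str, retries: int = 2) -> dict:
    prompt = f"Classify the sentiment of this text as JSON: {text!r}"
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)            # the reply may not even be JSON
        except json.JSONDecodeError:
            continue                          # re-prompt and try again
        if (isinstance(data, dict)
                and REQUIRED_KEYS <= data.keys()
                and data["sentiment"] in ALLOWED_SENTIMENTS
                and isinstance(data["confidence"], (int, float))
                and 0.0 <= data["confidence"] <= 1.0):
            return data                       # passed validation
    return {"sentiment": "neutral", "confidence": 0.0}   # safe fallback

print(classify("I love this lamp"))
```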
A less obvious misalignment is that today's systems are designed and optimized to be used by humans. Humans are forgetful, impatient, slow, and emotional, so websites are usually broken up into many small, inconsistent steps, with unique inputs and layouts designed for use with fingers. These steps add significant complexity for an LLM trying to navigate efficiently or avoid irreversible actions.
LLMs are Inscrutable
The second problem is that detecting and fixing "unexpected" outputs cannot be done through typical debugging techniques. For example, rather than emit an error when an input cannot be handled, an LLM may predict a seemingly valid output, called a hallucination. This is particularly problematic because a traditional program always produces consistent output, and a developer can trace line by line to understand why any specific output was produced. Complicating matters, a model's rules - its parameters - number in the billions, with no human-friendly structure to understand, let alone fix manually. What we can do, however, is test a model with as many predictions as we like, before or after the fact.
Aside from human oversight, the best methods of detecting hallucinations involve using another LLM to review the output, or analyzing the result's perplexity, which measures how uncertain the model was about its output relative to the alternatives. Consider a prediction like "He turned the lamp ____", where "on" and "off" are both highly likely and alternatives like "upside down" are far less so. With no other context, the model will pick "on" or "off", but will log a high perplexity value. This can be used as a signal, not a rule, since models often emit correct output even at high perplexity.
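Per-token log-probabilities, which most inference APIs can return alongside the generated text, are enough to compute this signal. A minimal sketch, with made-up numbers standing in for real model output:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a generated sequence from its per-token log-probabilities."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Made-up log-probabilities for "He turned the lamp ___".
# In the uncertain case the final token was a near toss-up between
# "on" and "off" (log 0.5 is about -0.69), so perplexity rises.
confident = [-0.05, -0.10, -0.08, -0.04, -0.02]
uncertain = [-0.05, -0.10, -0.08, -0.04, -0.69]

print(perplexity(confident))  # ~1.06: low perplexity, model was confident
print(perplexity(uncertain))  # ~1.21: higher perplexity, flag for review
```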
LLM Safe Architecture
To mitigate these risks, we must design systems that defend against the unexpected and limit the complexity of any single inference. This approach may not be as exciting as training a bigger model, but it gives us much of the flexibility LLMs provide while limiting the blast radius of an unexpected prediction.
Many simple inferences > one complex inference
Designing systems around rules-based control flows that use LLM inferences as inputs is a great way to leverage LLMs. This keeps each LLM's output structure simple, and allows smaller or specially trained models to handle specific parts of a process when needed, without limiting the collective benefit. Humans can also observe and debug such a system, and more easily pinpoint where a problem began. As with any distributed system, we will have to invest in how we deploy and observe LLMs at scale.
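A minimal sketch of the pattern, with stubbed helpers (`classify_intent`, `draft_reply`) standing in for small, single-purpose LLM calls:

```python
# Rules-based control flow around narrow inferences: the model supplies
# inputs, the code decides what happens next and validates each output.

ALLOWED_INTENTS = {"refund", "shipping", "other"}

def classify_intent(message: str) -> str:
    # In practice: one small inference returning a single label.
    return "refund" if "refund" in message.lower() else "other"

def draft_reply(message: str, intent: str) -> str:
    # In practice: another small inference with a narrow, well-specified prompt.
    return f"Re: your {intent} request - a draft reply goes here."

def escalate_to_human(message: str) -> str:
    return "Routed to a human agent."

def handle_ticket(message: str) -> str:
    intent = classify_intent(message)
    if intent not in ALLOWED_INTENTS:       # validate each output before use
        intent = "other"
    if intent == "other":                   # the rules decide the flow,
        return escalate_to_human(message)   # not the model
    return draft_reply(message, intent)

print(handle_ticket("I'd like a refund for my order"))
```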
LLM Friendly Services
To get the most value out of LLM automation, we need to let models discover and invoke services without custom integration and testing. This will require new platform standards, and different business models that enable transactional usage and on-the-fly integration by an LLM. We should look to standards like OIDC and OAuth2, which standardized authentication using well-known configuration endpoints and extensible JWTs to support almost any usage. By exposing a structured specification of endpoints at a well-known location, and perhaps an index or two for searching services, we could avoid many incorrect predictions when finding relevant tools.
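As a sketch of what that could look like, the discovery path and document shape below are hypothetical, mirroring OIDC's well-known configuration endpoint:

```python
# OIDC-style discovery for LLM-callable services. The
# /.well-known/llm-services.json path and the document shape are assumptions,
# modeled on OIDC's /.well-known/openid-configuration convention.
import json
import urllib.request

def discover(base_url: str) -> dict:
    """Fetch a machine-readable description of a service's endpoints."""
    url = f"{base_url}/.well-known/llm-services.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# spec = discover("https://api.example.com")
# Expected shape (hypothetical):
# {
#   "endpoints": [
#     {"name": "create_refund", "method": "POST", "path": "/v1/refunds",
#      "input_schema": {...}, "reversible": false}
#   ]
# }
# The control flow (or the LLM itself) can match a request against these
# descriptions instead of guessing URLs or scraping human-oriented pages.
```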
Intent-oriented workflows
Today's workflows mirror the small, sequential steps humans need, but an LLM does not need a multi-page checkout flow to make a purchase. Services could instead accept a declared intent - the desired end state and its constraints - and handle the intermediate steps themselves with deterministic rules, confirming with a human before any irreversible action. The model's job stays simple: state what is wanted, and leave the how to rules that can be tested.
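One way to picture this is a single declarative payload (the shape here is hypothetical) replacing a sequence of human-oriented UI steps:

```python
# A hypothetical intent payload: the model declares what should happen and
# under what constraints; the service owns the how, and a human confirms
# anything irreversible.
intent = {
    "intent": "book_flight",
    "constraints": {
        "origin": "YVR",
        "destination": "NRT",
        "depart_after": "2025-03-01",
        "max_price_usd": 900,
    },
    "confirm_before": ["payment"],   # irreversible steps need human approval
}
```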