Real World AI: No Success Without Safety
Keeping things on the rails when there aren't any. If natural language interfaces finally take off, a new set of trust and safety tooling is required.
When you can type anything into a computer, it needs to be ready for everything. Although ChatGPT is very popular, we’re still in the early innings of this technology. By applying lessons from prior iterations of AI product design and the social media era, it becomes clear that if natural language succeeds in becoming a main way to interact with computers, many new trust and safety issues will arise.
During the 2016 chatbot boom (Lessons from the ‘first chatbot wars’), my big learning was that error-first product design is a defining characteristic of AI products. AI, by definition, makes complex, automatic decisions, and while large language models have made some product design aspects easier, new challenges are emerging.
If natural language succeeds, both foundation model providers and user-facing application developers will need help in managing trust and safety risks. New tools for testing and auditing will be required, as well as industry-wide standards for responsible behavior that can be audited for compliance so users and customers can have confidence in their interactions.
Tech is new, but history is instructive
AI systems are already making critical decisions at huge scale. For instance, millions of posts on platforms like Facebook and Twitter potentially violate rules against spam, violence, misinformation, human exploitation, and more. No purely human system can manage this influx of content. The humans involved have famously burned out from looking at disturbing content for days on end; have been attacked politically; and caused significant financial and reputational costs for these companies.
As an example, Meta uses sophisticated AI systems to help identify and remove harmful content from its platforms. In a December 2021 post, they touted a 3x reduction in ‘hate speech prevalence’ driven by technical advances in AI. Implicitly, this suggests that their systems had previously misidentified the majority of violating content.
In other contexts, companies like Lyft, Uber, and Airbnb use AI language systems sift through customer complaints to identify urgent safety-related issues like damages, car accidents, and criminal activities. When systems misclassify a benign complaint, the consequences are minor; when a report of violence is misclassified, it’s a real issue. AI/machine learning have long faced these issues, including:
Accuracy: Was that hate speech correctly identified? Non-hate speech incorrectly flagged?
Bias: When the model is wrong, does it disproportionately affect one human subgroup over another? When Google Photos was released in 2015, it identified a Black couple as ‘gorillas’. The AI models repeat the patterns they’re trained to recognize, and unforeseen or imbalanced data will show up.
Explainability: Deep learning models cannot explain their actions. This sounds foreboding, but it’s more annoying in practice — it can’t point easily to a certain piece of training data that would have affected its verdict. Simpler models can point directly to the “weights” placed on different pieces of data.
Natural language and large models introduce many new risks
Leaning heavily on natural language inputs and outputs opens up a huge number of ways for things to go wrong, both attacks and accidents. Some may happen at the ‘core model’ layer — the OpenAIs, Anthropics, and Googles — and some at the application layer, where companies without their deep AI expertise are trying to provide services.
Core model providers have the most risk, but also the most resources and expertise. This will likely change as open source models proliferate. OpenAI, Anthropic, and Google all publicly and significantly invest in significant safety work before releasing models. However, it’s easy to imagine a not-fully-aligned open source model deployed for a virtual therapist service also accidentally being able to offer instructions on how to make explosives out of household materials.
The potential harms often cross multiple categories, but here’s some to get started:
Topical drift and brand safety: Nike deploys a customer service bot to help customers find the right shoe. If it gets asked about Adidas, what does it say? What if it gets asked for the best way to rob a bank? What if a malcontent starts inputting prompts to get Nike to say racist things?
Language safety: Your Nike bot is happily helping customers buy shoes when a customer says they are feeling sad that day. Leave the bombmaking example aside — what happens when your virtual therapist doesn’t articulate its advice properly and its patient worsens?
Hallucinations: Just providing the wrong answer, often in a hard-to-catch way.
Unexpected actions: Foundational models can do many things. What happens if the right set of prompts cause Nike Bot to write a spam email?
Training data provenance and copyright: Does the model’s training data create liability for your company because the copyright was unclear?
Privacy: Can the bot reveal what other customers are doing because it doesn’t know not to?
Safety will be success
If natural language interfaces are going to become popular — possible, but not a sure thing — then there will be an entire industry springing up around meeting these needs, just like there is for information security and software development.
Internal checks: The first layer would be internal checks that developers can perform automatically as part of deployment processes to check for basic risks. NVIDIA announced NeMo Guardrails as an open-source toolkit compatible with popular and similarly open-source conversational management platform LangChain. Guardrails provides the ability to add a set of programmable constraints or rules to LangChain conversations, across what they deem “topical”, “safety”, and “security”, covering some but not all of the risks above.
Imagine a set of uniformly disallowed prompts (hate speech and other disallowed queries) as well as company-specific behavior checks (brand safety, hallucination and accuracy) as a new test suite run as part of deployment.
External trust: The highest degree of trust will be gained from third-party audit. Given the vast surface area of these systems, software will be required. A potential parallel here is like Vanta automating the collection of SOC 2, HIPAA, and other compliance architectures to enable quicker third-party audit.
Third-party audits could encompass both the topic and safety questions above as well as system security tests, training data provenance, and more. As governments around the world create AI regulation, providers will need tools to help them achieve and display their compliance.
Making it easy to deploy natural language interfaces for business is more than just connecting OpenAI’s APIs — they will need to mitigate safety, security, and risk factors that are inherent in this new world, and this an interesting area to keep exploring — there will not be success in this market without safety.