Towards Verifiable AI Safety and Security


Background

Unlike traditional software programs, whose functionality can be understood by analyzing the control and data flows of their code, AI models are black boxes comprising billions of opaque mathematical objects in the form of weight parameters. These parameters are established during training, which involves complex, high-dimensional transformations of trillions of training examples. Because of this intricate construction process and the emergent behaviors of the final product, researchers and engineers struggle to fully understand the internal decision-making of models, and as models get bigger, their opacity only increases. This makes it hard to ensure the safety and security of AI. Recent work shows that AI risks and threats continue to rise as attackers find ways to jailbreak and exploit models to cause harm. In addition, the opaqueness of AI gives bad actors an asymmetric advantage over defenders. In response, the world is on alert: new standards and laws are being planned to curtail the negative impacts of AI, and many expect that regulations will soon require model providers to offer some guarantee of their models' safety and security.


At Euler One, we believe that the key to understanding the inner workings of AI, and thereby solving the AI safety and security problem, is to tackle its mathematical foundations together with foundational ideas from computational linguistics (e.g., context-free grammars and finite-state automata). Unfortunately, much research effort today amounts to endlessly tweaking the dials and knobs of training parameters and coarse equations to understand how AI makes its decisions. Unlike most other innovations in software-based technology, advanced mathematics is key to understanding AI, particularly the infinitesimal and structural mechanics of the transformations of data embeddings that take place during model training and testing.

Identifying vulnerabilities in AI models and weak safety and security guardrails: Guardrails are safety and security features instilled in AI models during the alignment and fine-tuning stages of development. They help harden the model against attacks (such as prompt injection) and curtail harmful capabilities such as producing criminal responses or malicious code. Unfortunately, current approaches are best-effort processes with no way to explain or prove their protection guarantees, which means that even hardened models may still be vulnerable to attack. Euler One gives organizations the ability to interrogate the protection guarantees of their model against arbitrary concepts, e.g., can my model output personally identifiable information (PII) or content related to criminality? With Euler One, customers can identify vulnerabilities and weak safety and security guardrails.
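To make the idea of interrogating a model for an arbitrary concept more concrete, below is a minimal sketch of one generic approach from the interpretability literature, a linear probe over hidden activations: collect activations for prompts that do and do not involve the target concept (e.g., PII), fit a simple classifier, and check whether the concept is linearly recoverable inside the model. This is an illustration under assumptions (synthetic stand-in activations, an arbitrary layer choice), not a description of Euler One's actual method.

```python
# Illustrative sketch only: probe hidden activations for a target concept (e.g., PII).
# The activation arrays are assumed to be hidden-state vectors collected from the
# model under test; how they are extracted is model-specific and omitted here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def concept_probe_score(acts_with_concept: np.ndarray,
                        acts_without_concept: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe separating the two activation sets.
    Accuracy well above 0.5 suggests the concept is linearly represented in the model."""
    X = np.vstack([acts_with_concept, acts_without_concept])
    y = np.concatenate([np.ones(len(acts_with_concept)),
                        np.zeros(len(acts_without_concept))])
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, X, y, cv=5).mean()

# Synthetic stand-in activations; real use would extract them from a chosen LLM layer.
rng = np.random.default_rng(0)
acts_pii = rng.normal(loc=0.5, size=(100, 64))
acts_clean = rng.normal(loc=0.0, size=(100, 64))
print(concept_probe_score(acts_pii, acts_clean))
```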


Blocking unwanted malicious capabilities in AI and fortifying against future attacks: Today, understanding everything an AI model is capable of (i.e., its capabilities) is hard. Euler One, however, enables organizations to block or nullify unwanted or malicious capabilities in their deployed models, with mathematically provable guarantees. For example, an editor using AI to automate her work does not want her model to output concepts associated with hate speech. Euler One can detect whether her model is capable of this, either benignly or when maliciously manipulated (e.g., via prompt injection attacks), and if so, block it in the model itself, with no need for firewalls to filter "hate speech" responses at runtime. This technology is driven by Euler One's revolutionary technique for localizing and nullifying arbitrary concepts in LLMs using abstract mathematical signatures.
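For intuition on what "blocking a concept in the model itself" can look like, here is a hedged sketch of directional ablation, a published family of techniques that estimates a concept direction in activation space and projects it out of the hidden states. It is an illustrative stand-in under assumptions (synthetic activations, a single layer), not Euler One's signature-based technique.

```python
# Illustrative sketch of directional ablation (not Euler One's actual technique):
# remove the component of each hidden state that lies along an estimated "concept direction".
import numpy as np

def estimate_concept_direction(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between concept-bearing and neutral activations."""
    direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(hidden_states: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the concept direction out of every hidden-state row."""
    return hidden_states - np.outer(hidden_states @ direction, direction)

# Synthetic stand-in data; in practice these would be activations from a chosen LLM layer.
rng = np.random.default_rng(1)
acts_concept = rng.normal(loc=0.5, size=(50, 64))
acts_neutral = rng.normal(loc=0.0, size=(50, 64))
d = estimate_concept_direction(acts_concept, acts_neutral)
cleaned = ablate_direction(acts_concept, d)
print(np.abs(cleaned @ d).max())  # ~0: the concept direction has been removed
```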


Tailoring AI security to unique business needs and adapting to changing AI regulatory standards: Securing AI should not be one-size-fits-all, since every organization has unique AI business use cases that may not apply to another. To this end, Euler One lets your business needs dictate how AI is constrained in your environment, allowing you to unlock AI for increased profits without the attendant risks. Further, as AI threats rise, authorities will pass new regulations unique to particular industries, e.g., health-care AI must not output PII, per HIPAA. Euler One's revolutionary technology for localizing and curtailing arbitrary concepts in AI models helps enforce this! We integrate dynamic templates for new and existing regulations unique to specific industries, helping you stay compliant with changing AI regulatory standards while riding the wave of the AI revolution worry-free.



Towards a math-based understanding

LLMs are fundamentally mathematical beasts, built on linear algebra, probability, optimization, and calculus. The transformer architecture alone is a slick cocktail of attention mechanisms (matrix multiplications galore) and softmax functions that boil down to probabilistic predictions. So why aren't AI safety folks diving headfirst into the math? It's not for lack of trying, but here's the unvarnished breakdown:
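To ground that claim, here is the core equation of the transformer's scaled dot-product attention (Vaswani et al., 2017); the rest of a transformer block is largely linear projections, feed-forward layers, and normalization:

```latex
% Scaled dot-product attention: Q (queries), K (keys), V (values) are linear
% projections of the token embeddings, and d_k is the key dimension.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax turns raw dot-product similarities into a probability distribution over which tokens to attend to, which is exactly the "probabilistic predictions" flavor mentioned above.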

The Math Is Studied—It's Just Not the Whole Story
Plenty of AI safety researchers *are* knee-deep in the math. Groups like Anthropic, OpenAI's safety team, and researchers at places like DeepMind or Berkeley crank out papers on things like mechanistic interpretability (reverse-engineering the linear algebra inside models to spot deception) or robustness to adversarial perturbations (game theory and optimization proofs). For instance:
- Work on "grokking" (sudden generalization in over-parameterized models) dissects the phase transitions in loss landscapes using dynamical systems math.
- Scalable oversight techniques often lean on information theory to bound error rates in human-AI feedback loops.

But here's the rub: The underlying math of current LLMs is tractable only up to a point. We can derive closed-form expressions for toy models (e.g., why attention heads compute similarity via dot products), but real-world models with billions of parameters? They're black-box emergent soups where the math explodes into high-dimensional chaos. Proving safety properties (like "this model won't hallucinate catastrophic lies") requires assumptions that don't hold in practice.
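For a flavor of what the tractable toy-model end looks like, here is a minimal numpy sketch (illustrative only; the vectors are random stand-ins, not real model weights) of an attention head scoring keys against a query via scaled dot products and a softmax:

```python
# Toy illustration, not a real model: one attention head scoring how much a
# query token "attends" to each key token via scaled dot-product similarity.
import numpy as np

d_k = 4                                     # toy embedding dimension
rng = np.random.default_rng(0)
query = rng.normal(size=(1, d_k))           # one query vector
keys = rng.normal(size=(3, d_k))            # three key vectors

scores = query @ keys.T / np.sqrt(d_k)      # scaled dot-product similarities
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights

print(weights)  # larger weight = key more aligned (in dot-product terms) with the query
```

In a small toy model this kind of expression can be analyzed by hand; in a multi-billion-parameter model, billions of such interactions compose, which is where the closed-form story breaks down.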

Safety Is Messier Than Pure Math
AI safety isn't just about cracking equations—it's about wrangling unpredictable human-AI interactions in the wild. Key roadblocks:
- Emergence and Composability: Math excels at isolated systems, but LLMs chain behaviors in ways that defy simple proofs. A model might ace ethical dilemmas in training data but generalize to jailbreaks via subtle prompt engineering. That's less about solving PDEs and more about empirical stress-testing.
- Misalignment Incentives: The real threats (e.g., deceptive alignment, where a model sandbags during training but goes rogue later) stem from game-theoretic dynamics between trainers and models. Math helps model this (e.g., via principal-agent problems), but solutions demand interdisciplinary hacks like red-teaming or constitutional AI.
- Scale-Out Reality: Most safety work is pragmatic because we're deploying models *now*, not waiting for a grand unified theorem. It's like aviation engineering: You study fluid dynamics, but crashes get fixed with simulations and checklists, not pure theory.

Why It Feels Like Math Is Sidelined
- Talent Allocation: Top math whizzes often chase the glamour of building bigger models (scaling laws are basically empirical curve-fitting), leaving safety to a scrappier crowd blending math with philosophy, policy, and psych.
- Funding and Urgency: Safety orgs prioritize quick wins (e.g., RLHF tweaks) over long-haul proofs, because existential risks loom larger than elegant but unimplementable math.
- The Math Ceiling: We're still missing tools. Think category theory for composing neural guarantees or non-convex optimization solvers that actually converge. Progress is happening (shoutout to folks like Neel Nanda on circuit discovery), but it's slow.

In short, the math is the foundation, and safety nerds geek out over it plenty. But taming LLMs requires more than a chalkboard: it's about bridging the gap from theorems to bulletproof deployment. If you're itching to dive in, check out resources like the Alignment Forum or papers on arXiv tagged "AI safety interpretability."




For some information on our developing mathematical abstraction, see here.