FORTRESS AI
Overview
The goal of FORTRESS is to prevent the injection of hidden backdoors into a model. In other words, FORTRESS reduces the degree to which a model is susceptible to backdoor injection. FORTRESS is powered by the mathematical abstractions we outlined in the core Euler One technique: concepts, concept spaces, concept space axes, and meaning hyperplanes.
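As a rough illustration only, the sketch below shows one way these four abstractions could be represented numerically. The class names, the embedding-vector representation, and the method signatures are assumptions made for illustration and are not the actual Euler One formulation.

    from dataclasses import dataclass
    import numpy as np


    @dataclass
    class Concept:
        """A concept, represented here as a point in a shared embedding space."""
        name: str
        embedding: np.ndarray  # shape (d,)


    @dataclass
    class ConceptSpace:
        """A concept space spanned by its concept space axes."""
        axes: np.ndarray  # shape (k, d); one unit-norm axis per row

        def project(self, concept: Concept) -> np.ndarray:
            # Coordinates of the concept along this space's axes.
            return self.axes @ concept.embedding


    @dataclass
    class MeaningHyperplane:
        """A linear boundary separating regions of meaning in embedding space."""
        normal: np.ndarray  # shape (d,)
        offset: float

        def signed_distance(self, concept: Concept) -> float:
            # Signed distance from the hyperplane; the sign indicates which
            # side of the boundary the concept falls on.
            return float((self.normal @ concept.embedding + self.offset)
                         / np.linalg.norm(self.normal))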
An attacker can inject a backdoor into a model during fine-tuning of an existing model, or via a complete re-training of the model weights. By a "hidden" backdoor, we mean a backdoor injected in such a way that it does not affect the business functionality of the model and therefore does not attract the customer's attention (i.e., the model's usage raises no flag of malfunction, and the customer continues to use the model without noticing any degradation of function). If the business functionality were affected by the backdoor injection, then before long the user would notice and might order a new model from the vendor, thwarting the attacker's goal.
Further, since FORTRESS aims to fortify a known-good model, models in which backdoors have already been injected (e.g., via the supply chain or during the original training of the model) are outside the scope of FORTRESS.
Process
FORTRESS works on a model that is assumed to be benign.
To make a model impregnable to hidden backdoors (we call this backdoor-resistant), FORTRESS first explores all concept spaces in the LLM to identify sparse concept spaces with no dominant concept axes. These are spaces into which fresh concepts could be selectively injected without affecting the business requirements of the original model. Following this identification, FORTRESS performs targeted fine-tuning to patch these areas. This is similar to filling the open holes before the attacker can, preventing the injection of future backdoors.
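A minimal sketch of the identification step follows. It assumes that concept spaces correspond to per-layer hidden-activation subspaces collected on benign, business-relevant prompts, and that axis dominance corresponds to the variance explained by each principal direction. The layer naming, the threshold value, and the use of SVD are illustrative assumptions rather than the actual FORTRESS implementation.

    import numpy as np


    def find_sparse_concept_spaces(activations_by_layer: dict[str, np.ndarray],
                                   dominance_threshold: float = 0.05) -> list[str]:
        """Return layers whose activation subspace has no dominant axis.

        activations_by_layer maps a layer name to an (n_samples, d) matrix of
        hidden activations recorded on benign, business-relevant prompts.
        """
        sparse_layers = []
        for layer_name, acts in activations_by_layer.items():
            centered = acts - acts.mean(axis=0, keepdims=True)
            # Singular values of the centered activations measure how much
            # variance each candidate concept axis carries.
            singular_values = np.linalg.svd(centered, compute_uv=False)
            explained = singular_values**2 / (np.sum(singular_values**2) + 1e-12)
            # If no single axis dominates, the space is sparsely used and could
            # host a fresh concept without disturbing business behaviour, so it
            # is flagged for the patching fine-tuning pass.
            if explained.max() < dominance_threshold:
                sparse_layers.append(layer_name)
        return sparse_layers

The flagged layers would then be the targets of the patching fine-tune, occupying the unused capacity before an attacker can.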
FORTRESS is under testing and development.
Some Considerations
We are working on an extension of FORTRESS that can explore and search the concept spaces of an LLM for potentially hidden backdoors. This is a hard problem, but we are working with collaborators to close the gap on this challenge.
Fine-tuning a model modifies the distribution and values of the parameter weights and biases of the original model. Recent work shows that fine-tuning a model may remove certain desired features of the base model, such as safety and security functionality.
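As a hedged illustration of this point, the snippet below compares the weights of a base model and its fine-tuned counterpart using plain PyTorch state dicts. The file paths are placeholders and the relative-L2 drift metric is only one of many possible choices.

    import torch


    def weight_drift(base_state: dict[str, torch.Tensor],
                     tuned_state: dict[str, torch.Tensor]) -> dict[str, float]:
        """Relative L2 drift of each parameter tensor after fine-tuning."""
        drift = {}
        for name, base_param in base_state.items():
            if name not in tuned_state:
                continue  # parameter added or removed by the fine-tune
            delta = tuned_state[name].float() - base_param.float()
            drift[name] = (delta.norm() / (base_param.float().norm() + 1e-12)).item()
        return drift


    # Usage (paths are placeholders):
    # base = torch.load("base_model.pt", map_location="cpu")
    # tuned = torch.load("fine_tuned_model.pt", map_location="cpu")
    # largest = sorted(weight_drift(base, tuned).items(), key=lambda kv: kv[1], reverse=True)
    # for name, d in largest[:10]:
    #     print(f"{name}: relative drift {d:.3f}")

Large drift concentrated in layers associated with safety behaviour could be one signal that the fine-tune has eroded those features.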