FORTRESS AI
Overview
The goal of FORTRESS is to prevent the injection of hidden backdoors into a model. In other words, FORTRESS reduces the degree to which a model is susceptible to backdoor injection. FORTRESS is powered by the mathematical abstractions we outlined in the core Euler One technique: concepts, concept spaces, concept space axes, and meaning hyperplanes.
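As a rough illustration only, the sketch below shows one way these four abstractions could be represented numerically. The class names, the embedding-vector representation, and the method signatures are assumptions made for illustration and are not the actual Euler One formulation.

    from dataclasses import dataclass
    import numpy as np


    @dataclass
    class Concept:
        """A concept, represented here as a point in a shared embedding space."""
        name: str
        embedding: np.ndarray  # shape (d,)


    @dataclass
    class ConceptSpace:
        """A concept space spanned by its concept space axes."""
        axes: np.ndarray  # shape (k, d); one unit-norm axis per row

        def project(self, concept: Concept) -> np.ndarray:
            # Coordinates of the concept along this space's axes.
            return self.axes @ concept.embedding


    @dataclass
    class MeaningHyperplane:
        """A linear boundary separating regions of meaning in embedding space."""
        normal: np.ndarray  # shape (d,)
        offset: float

        def signed_distance(self, concept: Concept) -> float:
            # Signed distance from the hyperplane; the sign indicates which
            # side of the boundary the concept falls on.
            return float((self.normal @ concept.embedding + self.offset)
                         / np.linalg.norm(self.normal))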
An attacker can inject a backdoor into a model during fine-tuning of an existing model, or via a complete re-training of the model weights. By a "hidden" backdoor, we mean a backdoor injected in such a way that it does not affect the business functionality of the model and therefore does not attract the customer's attention (i.e., the model's usage raises no flag of malfunction, and the customer continues to use the model without noticing any degradation of function). If the business functionality were affected by the backdoor injection, then before long the user would notice and might order a new model from the vendor, thwarting the attacker's goal.
Further, since FORTRESS aims to fortify a known-good model, models in which backdoors have already been injected (e.g., via the supply chain or during the original training of the model) are outside the scope of FORTRESS.
Process
FORTRESS works on a model that is assumed to be benign.
To make a model impregnable to hidden backdoors (we call this backdoor-resistant), FORTRESS first explores all concept spaces in the LLM to identify sparse concept spaces with no dominant concept axes. These are spaces into which fresh concepts could be selectively injected without affecting the business requirements of the original model. Following this identification, FORTRESS performs targeted fine-tuning to patch these areas. This is similar to filling the open holes before the attacker can, preventing the injection of future backdoors.
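A minimal sketch of the identification step follows. It assumes that concept spaces correspond to per-layer hidden-activation subspaces collected on benign, business-relevant prompts, and that axis dominance corresponds to the variance explained by each principal direction. The layer naming, the threshold value, and the use of SVD are illustrative assumptions rather than the actual FORTRESS implementation.

    import numpy as np


    def find_sparse_concept_spaces(activations_by_layer: dict[str, np.ndarray],
                                   dominance_threshold: float = 0.05) -> list[str]:
        """Return layers whose activation subspace has no dominant axis.

        activations_by_layer maps a layer name to an (n_samples, d) matrix of
        hidden activations recorded on benign, business-relevant prompts.
        """
        sparse_layers = []
        for layer_name, acts in activations_by_layer.items():
            centered = acts - acts.mean(axis=0, keepdims=True)
            # Singular values of the centered activations measure how much
            # variance each candidate concept axis carries.
            singular_values = np.linalg.svd(centered, compute_uv=False)
            explained = singular_values**2 / (np.sum(singular_values**2) + 1e-12)
            # If no single axis dominates, the space is sparsely used and could
            # host a fresh concept without disturbing business behaviour, so it
            # is flagged for the patching fine-tuning pass.
            if explained.max() < dominance_threshold:
                sparse_layers.append(layer_name)
        return sparse_layers

The flagged layers would then be the targets of the patching fine-tune, occupying the unused capacity before an attacker can.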
FORTRESS is under testing and development.
Some Considerations
We are working on an extension of FORTRESS that can explore and search the concept spaces of an LLM for potentially hidden backdoors. This is a hard problem, but we are working with collaborators to close the gap on this challenge.
Fine-tuning a model modifies the distribution and values of the parameter weights and biases of the original model. Recent work shows that fine-tuning a model may remove certain desired features of the base model, such as safety and security functionality.
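As a hedged illustration of this point, the snippet below compares the weights of a base model and its fine-tuned counterpart using plain PyTorch state dicts. The file paths are placeholders and the relative-L2 drift metric is only one of many possible choices.

    import torch


    def weight_drift(base_state: dict[str, torch.Tensor],
                     tuned_state: dict[str, torch.Tensor]) -> dict[str, float]:
        """Relative L2 drift of each parameter tensor after fine-tuning."""
        drift = {}
        for name, base_param in base_state.items():
            if name not in tuned_state:
                continue  # parameter added or removed by the fine-tune
            delta = tuned_state[name].float() - base_param.float()
            drift[name] = (delta.norm() / (base_param.float().norm() + 1e-12)).item()
        return drift


    # Usage (paths are placeholders):
    # base = torch.load("base_model.pt", map_location="cpu")
    # tuned = torch.load("fine_tuned_model.pt", map_location="cpu")
    # largest = sorted(weight_drift(base, tuned).items(), key=lambda kv: kv[1], reverse=True)
    # for name, d in largest[:10]:
    #     print(f"{name}: relative drift {d:.3f}")

Large drift concentrated in layers associated with safety behaviour could be one signal that the fine-tune has eroded those features.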