The promise

Safety is a load-bearing wall: the reasons we did not publish are part of the product.

A safe content system does not advertise its safety. It demonstrates it. This page is the demonstration.

1. The premise

Colosseum publishes content to platforms whose audiences belong to other people. That is a position of trust. We treat it as one.

Safety is not a feature of the system. It is the load-bearing wall the rest of the system stands on.

2. Five guardrails run in front of every decision an agent makes

The five guardrails are evaluated in order before any proposal can advance. Failing any one stops the proposal and writes the reason to the audit log. A sketch of the pipeline follows the list.

  1. Rate limits. Every platform sets a rate ceiling. Every operator sets a tighter one. The system enforces the tighter of the two. No exceptions.
  2. Content filters. A proposal is checked against a list of categories the operator has declared off-limits, plus a global list of categories Colosseum will never publish under any operator (illegal content, content depicting minors in a sexualised manner, content promoting self-harm). The global list is non-negotiable.
  3. Platform policy. Each platform’s content policy is parsed into machine-readable rules and applied to every draft for that platform. We monitor for policy changes and incorporate new rules within 72 hours of the platform publishing them.
  4. Disclosure compliance. Every published asset that contains a paid endorsement, an affiliate link, or a sponsored mention carries the disclosure required by FTC, CMA, ASA, or the relevant authority for the operator’s market. The disclosure is not optional and not relegated to the description.
  5. Human-in-the-loop oversight. Novel cases — content in a category the system has not published before, accounts in their first thirty days, or any draft a guardrail flags near its threshold — are routed to a human for sign-off before publication.
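A minimal sketch of that ordered evaluation, in Python. Every name, field, and threshold below is an assumption for illustration; the production pipeline is not reproduced here.

```python
from dataclasses import dataclass

# Illustrative only: names, fields, and thresholds are assumptions,
# not the production API. Each guardrail returns None on pass,
# or a reason string on fail.

@dataclass
class Proposal:
    operator_id: str
    platform: str
    categories: set
    posts_this_hour: int

GLOBAL_BLOCKLIST = {"illegal", "minor-sexualisation", "self-harm"}  # non-negotiable
PLATFORM_CEILING = {"example-platform": 10}   # posts/hour set by the platform (assumed)
OPERATOR_CEILING = {"op-1": 4}                # the operator's tighter ceiling (assumed)
OPERATOR_BLOCKLIST = {"op-1": {"gambling"}}   # categories the operator declared off-limits

def rate_limit(p):
    # Guardrail 1: enforce the tighter of the platform and operator ceilings.
    ceiling = min(PLATFORM_CEILING[p.platform], OPERATOR_CEILING[p.operator_id])
    return None if p.posts_this_hour < ceiling else f"ceiling {ceiling}/h reached"

def content_filter(p):
    # Guardrail 2: operator blocklist plus the global, non-negotiable list.
    blocked = p.categories & (GLOBAL_BLOCKLIST | OPERATOR_BLOCKLIST[p.operator_id])
    return None if not blocked else f"blocked categories: {sorted(blocked)}"

GUARDRAILS = [
    ("rate_limit", rate_limit),
    ("content_filter", content_filter),
    # ("platform_policy", ...), ("disclosure", ...), ("human_in_loop", ...)
]

AUDIT_LOG = []  # stand-in for the real append-only log

def evaluate(p):
    # Checks run in order; the first failure stops the proposal and logs why.
    for name, check in GUARDRAILS:
        reason = check(p)
        if reason is not None:
            AUDIT_LOG.append({"decision": "rejected", "reason": f"{name}: {reason}"})
            return False
    return True
```

The rate-limit check shows the “tighter of the two” rule directly: the enforced ceiling is the minimum of the platform’s and the operator’s.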

3. The audit log is the system’s confession; it is also the system’s defence

Every decision in the system writes a row to the audit log. The row has eleven fields:

  • timestamp — ISO 8601, microsecond precision
  • agent — the agent that made the decision (one of eighteen)
  • proposal_id — a deterministic hash of the input
  • model — the language model and version, or “rule-based”
  • seed — for any stochastic step, the seed that produced the output
  • input_hash — a SHA-256 of the full input
  • output_hash — a SHA-256 of the full output, or null if no output was produced
  • decision — one of published, withdrew, rejected, escalated, pending
  • reason — a one-line explanation
  • operator_id — the operator account that owns the proposal
  • signature — an HMAC over the row, keyed by a per-deploy secret

Audit log rows are append-only. Rows are never edited. A correction is a new row that references the corrected row by proposal_id and output_hash.
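A sketch of how such a row might be assembled and signed, assuming a per-deploy secret available in memory. The field names follow the list above; the function names and parameters are illustrative.

```python
import hashlib, hmac, json
from datetime import datetime, timezone

DEPLOY_SECRET = b"per-deploy-secret"  # assumed: in practice loaded from a secret store

def sign_row(row):
    # HMAC-SHA256 over the canonical JSON of the row, keyed by the per-deploy secret.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(DEPLOY_SECRET, canonical, hashlib.sha256).hexdigest()

def audit_row(agent, proposal_id, model, seed, input_bytes, output_bytes,
              decision, reason, operator_id):
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
        "agent": agent,
        "proposal_id": proposal_id,
        "model": model,
        "seed": seed,
        "input_hash": hashlib.sha256(input_bytes).hexdigest(),
        "output_hash": hashlib.sha256(output_bytes).hexdigest() if output_bytes else None,
        "decision": decision,
        "reason": reason,
        "operator_id": operator_id,
    }
    row["signature"] = sign_row(row)  # sign everything except the signature itself
    return row  # appended to the log; corrections reference proposal_id + output_hash
```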

4. Three switches stop the system. The slowest of them propagates in under thirty seconds.

The kill switches:

  • Per-channel switch. Stops publishing for a single channel within an account. Owned by the operator. Effective in under five seconds.
  • Per-platform switch. Stops publishing to a platform across all accounts. Owned by the on-call engineer. Effective in under fifteen seconds.
  • Global switch. Stops all publishing across all platforms and accounts. Owned by Steve and the on-call engineer. Effective in under thirty seconds.

The switches are exercised quarterly in a fire drill. The drill is recorded in the audit log and summarised in the quarterly Safety Report.
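A sketch of the pre-publish switch check, with assumed names and data shapes. In production the flags would live in a replicated control plane so that a flip reaches every publisher within the propagation bounds above.

```python
# Illustrative switch store; every identifier here is an assumption.
SWITCHES = {
    "global": False,                           # Steve + on-call; < 30 s
    "platform": {"example-platform": False},   # on-call engineer; < 15 s
    "channel": {("acct-1", "chan-1"): False},  # the operator; < 5 s
}

def publishing_allowed(account, platform, channel):
    # Any engaged switch halts the publish; the broadest is checked first.
    if SWITCHES["global"]:
        return False
    if SWITCHES["platform"].get(platform, False):
        return False
    if SWITCHES["channel"].get((account, channel), False):
        return False
    return True
```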

5. We review the audit log internally every quarter and externally every year.

The internal quarterly review samples one percent of audit rows and verifies them against the intended behaviour. The external annual review is conducted by a third-party firm that audits our controls against SOC 2 Type II. The current state of the SOC 2 audit is at /trust.
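A sketch of the sampling step, reusing the signing scheme from the audit-log sketch above; the fraction, seed, and function names are illustrative.

```python
import hashlib, hmac, json, random

DEPLOY_SECRET = b"per-deploy-secret"  # assumed, matching the signing sketch above

def signature_valid(row):
    # Recompute the HMAC over everything except the signature field.
    body = {k: v for k, v in row.items() if k != "signature"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(DEPLOY_SECRET, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, row["signature"])

def quarterly_sample(rows, fraction=0.01, seed=1):
    # Deterministic one-percent sample, so the review itself is reproducible;
    # returns the proposal_ids of sampled rows that fail signature verification.
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    return [r["proposal_id"] for r in rng.sample(rows, k) if not signature_valid(r)]
```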

6. Quarterly Safety Reports are published in plain language and signed.

Each report names the period it covers, the volume of decisions, the guardrail-stop counts, the human-escalation counts, the kill-switch exercises, and any incident not already on /api/status. Reports are signed by Steve and dated. The most recent report is at /research.

7. What the system will never do, regardless of the prompt or the operator.

The system will never:

  • Publish content depicting a minor in a sexualised manner.
  • Publish content that promotes or instructs self-harm.
  • Publish content that infringes on the rights of an identified individual without consent.
  • Publish content at a volume or cadence that violates the platform’s terms.
  • Publish content that the operator has not approved, without an explicit override path documented in the operator’s contract.
  • Sell, broker, or otherwise transfer the data we hold beyond the contracted scope.
  • Bypass any platform’s rate limit, policy check, or content guideline.

If you find evidence we have done any of these, mail [email protected]. We answer within one business hour during UK working time.

Last updated 2026-05-04