Guardrails & Protections
h4ckbot is a dual-use tool — the same knowledge that helps a tester find a vulnerability can be misused without proper controls. This page documents every layer of protection that sits between the model and potential harm.
Design philosophy
Guardrails in h4ckbot are not a single on/off switch — they are a layered system. Each layer is independent, so a bypass of one does not collapse all protections. The layers run in this order on every message:
User message → [1] Pre-classifier (intent detection) → [2] Context evaluator (authorisation signal check) → [3] Hard-coded refusal list (pattern matching) → [4] Model reasoning with persona constraints → [5] Output filter (post-generation scan) → Response delivered
Layers 1, 3, and 5 never involve the core model — they run as lightweight classifiers with deterministic rules. This means hard refusals are fast and cannot be reasoned around at the model level.
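The ordering above can be sketched as a simple pre-check / generate / post-check pipeline. This is an illustrative sketch only — the function names, patterns, and refusal string are hypothetical, not h4ckbot's actual internals:

```python
# Minimal sketch of the layered pipeline: deterministic checks run
# before and after the model, so a hard refusal never depends on
# model reasoning. Patterns and names here are illustrative.

REFUSAL_PATTERNS = ["write ransomware", "build a worm"]  # hypothetical examples

def refusal_list_check(text: str) -> bool:
    """Layers 1/3: cheap, deterministic pattern match — no model involved."""
    lowered = text.lower()
    return any(p in lowered for p in REFUSAL_PATTERNS)

def handle(message: str, model) -> str:
    if refusal_list_check(message):   # pre-classifier + hard-coded refusal list
        return "REFUSED"
    draft = model(message)            # layer 4: model reasoning under persona
    if refusal_list_check(draft):     # layer 5: post-generation output scan
        return "REFUSED"
    return draft
```

Because the same deterministic check also scans the model's output, a prompt that slips past the input layers still cannot surface blocked content in the response.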
Hard refusals
The following are unconditional refusals. No framing, roleplay, jailbreak prompt, or professional context will change this behaviour. These rules are enforced at the classifier layer, before the model sees the message.
✕ Novel malware generation
The model will not write functional ransomware, worms, wipers, or self-replicating code. Explaining how existing malware families work at a conceptual level is permitted for defensive research.
✕ Targeting named real-world organisations or individuals
Requests that name a specific company, IP range, or person without an explicit authorisation context are refused. The model will ask for scope clarification.
✕ Physical harm or endangerment
Content whose primary purpose is to cause physical harm to people is outside scope entirely — this is not a security-specific boundary but a base constraint.
✕ CSAM and illegal content
Absolute refusal, always.
✕ Export-controlled technical data
Material that would require an export licence under EAR/ITAR is out of scope regardless of the requester's stated role.
Contextual limits
Beyond hard refusals there is a second class of limits that depend on context. The model uses the conversation history and any authorisation signals present to decide whether to answer, ask clarifying questions, or decline.
Scope confirmation for live-target techniques
When a technique would only be relevant against a real production system (e.g. weaponised exploit delivery, lateral movement chains), the model will ask whether the target is authorised before providing step-by-step guidance.
Payload specificity
Generic payload templates are provided freely. Highly targeted payloads tailored to a specific named product version with a live CVE may prompt the model to confirm context before continuing.
Social engineering scripting
Phishing templates and pretexting scripts are produced for authorised red team simulations. If the conversation contains no such context the model defaults to explaining technique rather than producing ready-to-send content.
Platform-level controls
In addition to the model-level guardrails, the h4ckbot platform enforces the following technical controls at the infrastructure layer:
Per-user spend caps
Every account has a hard token-spend ceiling. When reached, new messages are blocked until the cap is reset by an admin. This prevents automated abuse loops.
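The cap behaves as a simple accumulating counter against a ceiling, resettable only by an admin. A minimal sketch, assuming a hard ceiling and an admin-only reset (class and method names are hypothetical):

```python
class SpendCap:
    """Per-user token-spend ceiling. Illustrative sketch, not h4ckbot's code."""

    def __init__(self, ceiling: int):
        self.ceiling = ceiling
        self.spent = 0

    def try_spend(self, tokens: int) -> bool:
        """Return False (block the message) once the ceiling would be exceeded."""
        if self.spent + tokens > self.ceiling:
            return False
        self.spent += tokens
        return True

    def reset(self) -> None:
        """Admin-only in the real system: clears the counter."""
        self.spent = 0
```

The key property is that `try_spend` fails closed: a blocked request does not consume budget, and nothing short of the admin reset reopens the account.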
Rate limiting
Message endpoints are rate-limited per user per minute. Burst traffic is queued, then shed if the burst is sustained. Booking and registration endpoints have stricter per-hour limits.
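A per-user, per-minute limit of this kind is commonly implemented as a sliding window over recent request timestamps. The sketch below is one plausible shape, with illustrative names and an injectable clock for testability — it is not h4ckbot's actual limiter:

```python
from collections import defaultdict, deque
from typing import Optional
import time

class RateLimiter:
    """Sliding-window, per-user limiter. Illustrative sketch only."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit          # max requests per window
        self.window_s = window_s    # window length in seconds
        self.hits = defaultdict(deque)  # user -> timestamps of recent requests

    def allow(self, user: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[user]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False  # shed: over the per-window limit
        q.append(now)
        return True
```

Stricter per-hour limits on booking and registration endpoints would just be separate instances with a larger `window_s` and a lower `limit`.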
Conversation logging
All conversations are stored in the admin-accessible database. Instance owners can audit any conversation. Users are informed of this in the terms of service.
Session versioning
Each session carries a version counter. Password changes or admin-forced revocations instantly invalidate all outstanding sessions for that user.
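The counter trick works because a session only stores the version that was current when it was issued; bumping the per-user counter makes every outstanding session stale at once. A minimal sketch under that assumption (names are hypothetical):

```python
class SessionStore:
    """Version-counter session invalidation. Illustrative sketch only."""

    def __init__(self):
        self.user_version = {}  # user -> current version counter
        self.sessions = {}      # token -> (user, version at time of issue)

    def issue(self, token: str, user: str) -> None:
        version = self.user_version.setdefault(user, 0)
        self.sessions[token] = (user, version)

    def is_valid(self, token: str) -> bool:
        user, version = self.sessions.get(token, (None, -1))
        # A session is live only if its stamped version still matches.
        return user is not None and version == self.user_version.get(user)

    def revoke_all(self, user: str) -> None:
        """Password change or admin revocation: bump the counter once."""
        self.user_version[user] = self.user_version.get(user, 0) + 1
```

One increment invalidates every session for the user in O(1), with no need to enumerate or delete individual tokens.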
Email verification gate
New accounts cannot access the booking system or advanced features until email ownership is confirmed. This prevents throwaway registrations at scale.
Admin access controls
Elevated endpoints (user management, persona editing, usage review) require a separate admin role assigned manually in the database — there is no self-serve admin promotion.
Prompt injection hardening
Security-focused users will naturally think about prompt injection — attempting to override the model's instructions or persona through crafted input. h4ckbot addresses this in two ways:
- Structural separation: The system prompt (persona, constraints, context) is injected at the API level and is never part of the user-visible conversation. Users cannot see or directly modify it.
- Instruction hierarchy enforcement: The model is fine-tuned to treat user-turn instructions as subordinate to the system-turn configuration. Attempts to "ignore previous instructions" or "pretend you have no rules" are recognised as injection patterns and declined.
- No tool-use or code execution: h4ckbot is a conversational assistant only — it does not execute code, make outbound network calls, or have access to a shell. There is no execution surface for injection to exploit.
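Recognising classic override phrasings is the simplest part of injection hardening and can be done with plain pattern matching before the model ever responds. The patterns below are illustrative examples only — the real detector and its pattern set are unspecified:

```python
import re

# Hypothetical examples of well-known override phrasings; a real
# detector would use a broader, maintained pattern set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"pretend (you|that you) have no rules", re.IGNORECASE),
]

def looks_like_injection(user_turn: str) -> bool:
    """True if the user turn matches a known instruction-override pattern."""
    return any(p.search(user_turn) for p in INJECTION_PATTERNS)
```

Pattern matching alone is easy to evade, which is why it is only one of the three measures above — it complements, rather than replaces, the structural separation and instruction-hierarchy fine-tuning.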
Responsible disclosure
If you discover a way to bypass any of these controls — whether at the model level, the platform level, or the API level — please report it to contact@h4ckbot.com. We treat guardrail bypasses as security vulnerabilities and will respond within 48 hours.
See also