Model Overview
h4ckbot is powered by a purpose-built intelligence layer trained and aligned specifically for offensive security work. This page explains the architecture decisions, training corpus, and knowledge scope behind the model.
Architecture
h4ckbot is built on a transformer-based large language model architecture. The base model provides general reasoning, language understanding, and structured output capabilities. On top of this foundation we applied a multi-stage specialisation pipeline:
1. Corpus curation: A large corpus of public offensive security knowledge was assembled, cleaned, and de-duplicated. Sources are described in detail below.
2. Domain fine-tuning: The base model was fine-tuned on the security corpus to shift its default distribution toward security reasoning, tool syntax, and methodology rather than general web content.
3. Instruction alignment: A second fine-tuning pass used human-annotated preference data from practising pentesters to align the model's tone, depth, and output format to what professionals actually find useful in the field.
4. Guardrail integration: A classifier layer and a hard-coded refusal set were applied on top of the aligned model to enforce responsible-use boundaries. These run before any user prompt reaches the core model.
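The ordering in the guardrail step (refusal set and classifier first, core model second) can be illustrated with a minimal sketch. Everything named here is hypothetical: `REFUSAL_PHRASES`, `gate_prompt`, `answer`, the 0.5 threshold, and the idea that the classifier returns a risk score between 0 and 1 are assumptions made for illustration, not a description of h4ckbot's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal set; the real entries are not documented on this page.
REFUSAL_PHRASES = frozenset({"example blocked phrase"})


@dataclass
class GateResult:
    allowed: bool
    reason: str


def gate_prompt(
    prompt: str,
    classifier: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, chosen only for the example
) -> GateResult:
    """Check the refusal set, then the classifier, before the model sees the prompt."""
    lowered = prompt.lower()
    for phrase in REFUSAL_PHRASES:
        if phrase in lowered:
            return GateResult(False, "refusal_set")
    # The classifier is assumed to return a risk score in [0, 1].
    if classifier(prompt) >= threshold:
        return GateResult(False, "classifier")
    return GateResult(True, "ok")


def answer(
    prompt: str,
    classifier: Callable[[str], float],
    core_model: Callable[[str], str],
) -> str:
    """Forward the prompt to the core model only if the gate allows it."""
    result = gate_prompt(prompt, classifier)
    if not result.allowed:
        return f"Request declined ({result.reason})."
    return core_model(prompt)
```

The point of the sketch is the control flow: a blocked prompt returns early, so `core_model` is never called with it.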
Training Corpus
The model was trained exclusively on publicly available information. No proprietary, classified, or restricted material was included. Sources fall into five categories:
Vulnerability databases
- NIST National Vulnerability Database (NVD) — full CVE corpus with CVSS scoring and affected-product metadata
- Exploit-DB — public proof-of-concept exploits with associated CVEs and platform tags
- MITRE CWE list — weakness enumeration used for root-cause analysis framing
- VulnHub and HackTheBox public write-ups — community-authored walkthroughs of retired machines
Methodology & standards
- OWASP Testing Guide (v4) — web application security test cases and decision trees
- Penetration Testing Execution Standard (PTES) — engagement lifecycle and scoping guidance
- Open Source Security Testing Methodology Manual (OSSTMM) — measurement-based security testing framework
- NIST SP 800-115 — technical guide to information security testing and assessment
- MITRE ATT&CK Enterprise, Mobile, and ICS matrices — adversary TTP taxonomy
- MITRE D3FEND — defensive countermeasure mappings (used for guardrail reasoning)
Community knowledge
- HackTricks — community-maintained offensive technique reference (Git snapshots)
- PayloadsAllTheThings — curated payload collections for common vulnerability classes
- The Hacker Recipes — Active Directory and Windows attack paths
- GTFOBins and LOLBAS — Unix and Windows living-off-the-land binary references
- Public DEF CON and Black Hat presentation slides and white papers (2015 – present)
Tool documentation
- Official documentation for Nmap, Burp Suite, Metasploit Framework, Impacket, BloodHound, CrackMapExec, Nuclei, ffuf, and ~40 additional open-source tools
- Man pages and help output for standard Unix recon and post-exploitation utilities
Academic research
- Peer-reviewed papers from IEEE S&P, USENIX Security, CCS, and NDSS (publicly available PDFs, 2010 – present)
- Vendor security research blogs published under open licences (Google Project Zero, NCC Group, Synacktiv, CrowdStrike Intelligence public reports)
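The corpus curation stage says the sources were cleaned and de-duplicated, without stating how. As a rough illustration only, the sketch below removes exact duplicates after whitespace and case normalisation by hashing each document. The function names `normalise` and `deduplicate` are hypothetical, and the real pipeline may well use a different or more sophisticated approach, such as near-duplicate detection.

```python
import hashlib
import re
from typing import Iterable, List


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalised content."""
    seen = set()
    unique = []
    for doc in documents:
        cleaned = normalise(doc)
        if not cleaned:
            continue  # drop documents that are empty after cleaning
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalised text rather than storing it keeps memory use bounded by the number of documents, not their total size, which matters once a corpus combines many overlapping sources such as mirrored write-ups or repeated Git snapshots.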
Knowledge Scope
The model has strong coverage across the following domains:
- Web application: OWASP Top 10, API security, auth bypasses, injection classes, SSRF, deserialization
- Network & infrastructure: Service enumeration, protocol-level attacks, firewall evasion, pivot techniques
- Active Directory / Windows: Kerberos attacks, ACL abuse, credential relay, lateral movement, persistence
- Linux post-exploitation: SUID/SGID abuse, cron-based privesc, container escapes, credential harvesting
- Mobile (Android / iOS): APK reverse engineering, certificate pinning bypass, IPC vulnerabilities
- Cloud (AWS, Azure, GCP): IAM misconfigurations, metadata endpoint abuse, storage bucket exposure, service privilege escalation
- Social engineering context: Phishing methodology, pretexting frameworks (for authorised red team simulation only)
- Reporting & remediation: Finding classification, CVSS scoring rationale, remediation advice, executive summary writing
What the model does not know
- Details of any specific organisation's internal systems, networks, or configurations
- Non-public zero-day research, classified vulnerability intelligence, or private exploit kits
- Personal data about individuals — it was not trained on social media, email archives, or private communications
- Information published after the training corpus cutoff date
See also