NeurIPS 2026 · Workshop Proposal

Hidden Information in AI Outputs

What information hides in model outputs? How does it get there? How do we tell attack from defense? Provenance meets oversight.

About

Models may hide information in their reasoning traces and outputs in ways that evade monitors and oversight pipelines. Hidden information also includes what models inadvertently disclose: memorized training data recoverable from outputs and proprietary model internals extractable through query access.

Watermarks embed a recoverable signal for provenance, while steganographic collusion and monitor evasion embed signals to escape oversight. These are formally the same object viewed from opposite sides — a defense for one is an attack on the other — yet the communities working on them rarely speak to one another.

This workshop brings the provenance side (watermarking) and the oversight side (steganography detection, monitoring, control, and interpretability) together, alongside the cryptography, privacy, and infrastructure communities that share the problem.

Topics of Interest

Taxonomization. Proactively mapping the space of hidden reasoning. Existing maps cover encoded reasoning and multi-agent collusion channels, but none yet spans the provenance side.
Defense, detection, and mitigation. Detecting whether generated text contains a covert signal, distorting output to strip such signals, and conversely making benign hidden payloads resilient to adversarial removal.
Mechanistic interpretability. Structural interventions that incentivize or ablate information-hiding behavior. Probes and sparse autoencoders (SAEs) that localize encoding and decoding features in later-layer activations can already recover hidden secrets and objectives.
Emergence. Whether covert encoding strategies emerge under RL optimization pressure. Sub-tracks include reward-hacking case studies; multi-agent collaboration, both supervision-aware and supervision-unaware; comparisons of emergent encodings against catalogued ones; and adjacent covert behaviors such as sandbagging under monitoring.
Theory. Information-theoretic guarantees that can guide future monitors and defense strategies. The steganographic-security literature is now being connected directly to LLM monitoring.
Infrastructure. Integrating AI control, watermarking, and monitors into real-world deployments. Open topics include trusted execution environments, the monitoring "tax", auditability of reasoning traces, and open tooling for control and watermark verification.

Schedule

08:00–08:15	Opening remarks
08:15–09:00	Keynote 1 — Tim G. J. Rudner. Trustworthy Agents and Collusion?
09:00–09:15	Coffee break
09:15–10:00	Keynote 2 — Usman Anwar. Chain-of-thought monitoring and its information-theoretic limits
10:00–10:30	Speed networking across communities
10:30–11:30	Oral lightning presentations (2 parallel tracks, 4×15 min each)
11:30–12:15	Panel: progress on taxonomy
12:15–13:45	Lunch and poster session 1
13:45–14:30	Keynote 3 — Mia Hopman. Covert behavior in deployed agents
14:30–14:45	Coffee break
14:45–15:30	Keynote 4 — Hua Shen. Aligning humans to AI: how people evaluate and oversee what models surface
15:30–16:30	Poster session 2 and coffee
16:30–17:00	Closing remarks

Invited Speakers

Tim G. J. Rudner · Keen

Assistant Professor · University of Toronto / Vector / Vijil

"Trustworthy Agents and Collusion?"

Usman Anwar · Keen

PhD Researcher · University of Cambridge

"Chain-of-thought monitoring and its information-theoretic limits"

Mia Hopman · Keen

Member of Technical Staff · Apollo Research

"Covert behavior in deployed agents"

Hua Shen · Confirmed

Assistant Professor · NYU Shanghai / NYU

"Aligning humans to AI: how people evaluate and oversee what models surface"

Panel

Topic: Progress on taxonomy.

Chair: Chhavi Yadav (CMU / UC Berkeley).

Panelists:

Chirag Agarwal — University of Virginia (model internals)
Robert McCarthy — UCL (encoded text and steganography)
Ivaxi Sheth — CISPA (multi-agent and emergent channels)
Pierre-Luc St-Charles — LawZero (theory and verifiability)

Organizers

Iván Arcuschin Moreno

Lead Research Scientist, Poseidon Research

CS PhD, University of Buenos Aires. Two-time MATS scholar (mentored by Adrià Garriga-Alonso, then Neel Nanda and Arthur Conmy). Lead author on Chain-of-Thought Reasoning In The Wild Is Not Always Faithful (ICML 2026). Co-founded AI Safety Argentina (AISAR).

Andre Shportko

Incoming Fellow (2026), Poseidon Research · Northwestern University

Previously at the Center for Human-Compatible AI. Vice Events Chair at the Northwestern University AI Safety and Governance Group. Research interests: mechanistic interpretability, AI control under limited oversight, and safe deployment of LLMs across low-resource languages.

Matthew Lee

Chief Strategy and Development Officer, Poseidon Research

Organizes NYC AI safety and security events with up to a hundred attendees. Serves on the advisory board of Collider, a NYC AI safety co-working space. Contributes to Poseidon Research's work on steganography in LLMs.

Veronika Kitsul

Incoming PhD Student, Computer Networking, University of Michigan

BS Electrical and Computer Engineering, Princeton (2026). Previously at Microsoft Azure Networking. Founded Princeton's OrangeHat Collective cybersecurity club. Research interests: profiling public LLM deployments from a network and systems perspective, systems for ML, programmable networks, and SmartNIC offload.

Rob Krzyzanowski

Executive Director and Head of Research, Poseidon Research

Formerly in ML and engineering leadership at Citadel, Avant, and Spring Labs. At Citadel, held four head-of-function roles across global equities engineering, core data engineering, portfolio management and risk, and research and modeling. Research in AI interpretability, control, and steganography.

Call for Papers

Submissions open upon acceptance of the workshop proposal. Details on submission length, format, dual-submission policy, and key dates will be posted here.