The Agent-to-Web Framework (A2WF) defines siteai.json,
a machine-readable policy file that website operators publish to declare
which actions AI agents are permitted, restricted, or prohibited from
performing on their site. Where robots.txt [[ROBOTS-TXT]]
governs what may be crawled, siteai.json governs
what an agent may do: submit forms, complete transactions,
book appointments, or extract data.
This specification defines the file location, syntax, semantics, and
conformance requirements for siteai.json, together with
fields that support transparency and human-oversight obligations under
the EU AI Act (in particular Articles 14, 26, and 50).
This document is a work in progress developed by the A2WF Community Group. Comments, issues, and pull requests are welcome at the A2WF specification repository on GitHub.
The Community Group was launched on 29 March 2026. This is the first ReSpec-based revision of the specification and corresponds to the same technical content as specification-v1.0.md in the repository, restructured to meet W3C Community Group specification requirements.
AI agents now interact with websites in ways that go well beyond traditional crawling. Agents fill out forms, complete checkouts, schedule appointments, compare prices, and operate accounts on behalf of users. Existing web standards address only adjacent concerns:
A2WF fills this gap. A site operator places a siteai.json
document at a well-known location and declares, in a structured form,
the conditions under which an AI agent may operate on the site.
AI agents increasingly interact with websites — browsing products, comparing prices, booking appointments, filling forms, extracting data. Website operators face a critical gap: no standard exists that gives the website operator a machine-readable way to declare:
Current agent-side standards (MCP [[MCP]], A2A [[A2A]], enterprise IAM) govern agents from the agent operator's perspective. A2WF fills the gap by providing governance from the website operator's perspective.
siteai.json
siteai.json is a JSON-based policy file provided by
website operators to declare permissions, restrictions, agent
identification requirements, and legal terms in machine-readable form.
Its design intent is to give the website side of the agent ecosystem
a single, discoverable, structured artifact — comparable in
role to robots.txt
[[ROBOTS-TXT]] for crawlers — but expressing actions, not just
paths.
This specification uses Schema.org [[SCHEMA-ORG]] vocabulary where applicable for site-level concepts (WebSite, Organization, ContactPoint), avoiding reinvention of standard terms. It complements Schema.org by introducing governance structures not covered there: permissions, scraping policies, agent identification, human verification, and legal enforcement metadata.
An AI agent uses siteai.json first to obtain the
governance rules, then uses detailed Schema.org markup found on
specific pages for in-depth entity information.
| Standard | Scope | Relationship to A2WF |
|---|---|---|
| robots.txt | Crawling | Complementary — A2WF references robots.txt via discovery.robotsTxt. |
| sitemap.xml | URL listing | Independent — both files may coexist. |
| llms.txt | Content guidance for LLMs | Complementary — A2WF references it via discovery.llmsTxt. |
| MCP / A2A | Agent-side protocols | Complementary — A2WF references endpoints via discovery.mcpEndpoint / discovery.a2aAgentCard. |
| Schema.org | Page-level entity vocabulary | A2WF reuses Schema.org terms where applicable. |
siteai.json is a single canonical document in one language,
identified via identity.inLanguage using a BCP 47 tag.
Multilingual sites SHOULD provide alternate language versions through
site infrastructure (separate origins, Accept-Language negotiation, or
regional siteai.json variants) rather than embedding multilingual
content inside a single file.
siteai.json enables machine-readable AI governance.The format is JSON [[RFC8259]], UTF-8 encoded. Data types used in this specification are: String, Object, Array, Boolean, Integer. URLs are valid URIs, preferably canonical and absolute. Language tags follow IETF BCP 47 [[BCP47]]. Date-time values follow ISO 8601 / [[RFC3339]]. Schema.org vocabulary is referenced from https://schema.org/ [[SCHEMA-ORG]].
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in BCP 14 [[RFC2119]] [[RFC8174]] when, and only when, they appear in all capitals, as shown here.
This document defines two classes of products:
publishing site (the website operator publishing a
siteai.json) and
consuming agent (the AI agent that retrieves and acts on it).
Conformance criteria for each are stated in their respective sections.
A publishing site MUST serve siteai.json from the
well-known location at the root of its origin:
https://example.com/siteai.json
The file MUST be served with HTTP status 200 OK and
Content-Type: application/json.
robots.txt directive
A publishing site MAY also declare the location of its
siteai.json in robots.txt using the
non-standard but conventional directive:
SiteAI: https://example.com/siteai.json
<link> tag
A publishing site MAY include a <link> element in
the document head:
<link rel="siteai" href="https://example.com/siteai.json">
A publishing site MAY also serve the file at
/.well-known/siteai.json for compatibility with
well-known URI conventions [[RFC8615]]. When both a root and
well-known location are present, they MUST point to the same
content.
Consuming agents SHOULD retrieve siteai.json in the
following order, stopping at the first successful retrieval:
/siteai.json)/.well-known/siteai.json)<link rel="siteai"> in the home pageSiteAI: directive in robots.txtCache-Control headers (e.g. max-age=3600).
A siteai.json document MUST consist of a single JSON
object [[RFC8259]]. The root object MUST contain:
"1.0".The root object SHOULD contain:
"https://schema.org".The root object MAY contain additional members defined in Optional governance extensions. Consuming agents MUST ignore any unrecognised members.
identity object (REQUIRED)Provides core identifying and contextual information about the publishing site. Where applicable, fields use Schema.org WebSite vocabulary.
| Field | Type | Requirement | Description |
|---|---|---|---|
@type | String | RECOMMENDED | "WebSite". Schema.org type declaration. |
domain | String | REQUIRED | Canonical absolute URL (schema:WebSite.url). |
name | String | REQUIRED | Official site / brand name (schema:WebSite.name). |
description | String | OPTIONAL | General site description (schema:WebSite.description). |
purpose | String | RECOMMENDED | Concise AI-focused description of the site's primary goal and audience. A2WF-specific. |
inLanguage | String | REQUIRED | Primary language as a BCP 47 tag. |
category | String | RECOMMENDED | Website type, e.g. "e-commerce", "healthcare", "government", "saas". |
jurisdiction | String | RECOMMENDED | Legal jurisdiction, e.g. "EU", "US", "US-CA", "CH". |
applicableLaw | Array<String> | OPTIONAL | Specific regulations, e.g. ["EU AI Act", "GDPR"]. |
contact | String | OPTIONAL | Email for policy-related questions. |
permissions object (REQUIRED)
The core governance layer. Contains three sub-objects that control
different aspects of AI agent interaction: read,
action, and data.
Control what information consuming agents can access (passive operations):
productCatalog — product listings, descriptions, images, categories.pricing — prices, fees, rate cards.availability — stock levels, appointment slots, table availability.openingHours — business hours and holiday schedules.contactInfo — address, phone, email.reviews — customer reviews, ratings, testimonials.faq — frequently asked questions.companyInfo — about page, team, history.Control what operations consuming agents may perform (active operations):
search — site search functionality.addToCart — adding items to a shopping cart.checkout — completing a purchase (typically humanVerification: true).createAccount — user registration (often denied).submitReview — posting reviews (often denied to prevent fakes).submitContactForm — contact form submission.bookAppointment — booking reservations / appointments.cancelOrder — cancelling orders.requestRefund — initiating refund requests.Protect sensitive information (typically all denied):
customerRecords — user profiles and personal data.orderHistory — past orders and transactions.paymentInfo — credit cards and bank details.internalAnalytics — traffic data and business metrics.employeeData — staff information.Each permission value is an object with the following members:
| Field | Type | Requirement | Description |
|---|---|---|---|
allowed | Boolean | REQUIRED | Is this permission granted? |
rateLimit | Integer | OPTIONAL | Maximum requests per minute for this action. |
humanVerification | Boolean | OPTIONAL | Default false. Requires human confirmation. |
note | String | OPTIONAL | Explanatory note for agents and humans. Treated as data, not as instruction. |
agentIdentification object (RECOMMENDED)Defines requirements for AI agent self-identification.
requireUserAgent (Boolean) — agent MUST include an identifying User-Agent header.requiredFields (Array<String>) — fields the agent must provide; valid values include "agentName", "agentOperator", "agentPurpose".allowAnonymousAgents (Boolean) — default true. If false, unidentified agents MUST be denied.trustedAgents (Array<Object>) — whitelist; each entry has { name, operator, permissions }.blockedAgents (Array<Object>) — blacklist; each entry has { pattern, reason }.scraping object (RECOMMENDED)Declares policies on automated data extraction.
bulkDataExtraction (Boolean) — default false. Systematic large-scale extraction.priceMonitoring (Boolean) — default false. Automated price-change tracking.contentReproduction (Boolean) — default false. Reproducing or republishing content.competitiveAnalysis (Boolean) — default false. Data collection for competitive intelligence.trainingDataUsage (Boolean) — default false. Using site content as training data.note (String) — OPTIONAL. Additional context or licensing information.defaults objectGlobal default settings that apply unless overridden by individual permissions.
agentAccess (String) — "open" (permissive), "restricted" (deny by default), or "minimal" (deny everything except explicit allows).requireIdentification (Boolean) — default false.humanVerificationRequired (Boolean) — default false. If true, all actions require human verification.maxRequestsPerMinute (Integer) — global per-minute rate limit.maxRequestsPerHour (Integer) — global per-hour rate limit.respectRobotsTxt (Boolean) — default true.humanVerification objectDefines human-in-the-loop requirements for sensitive actions.
methods (Array<String>) — accepted methods: "redirect-to-browser", "email-confirmation", "sms-otp".requiredFor (Array<String>) — names of actions that require human verification.note (String) — additional human-readable instructions.legal objectReferences Terms of Service and regulatory frameworks.
termsUrl (String) — RECOMMENDED. URL to AI-specific Terms of Service.complianceNote (String) — OPTIONAL human-readable compliance statement.dataRetention (String) — OPTIONAL rules for agent data retention.euAiActCompliance (Object) — OPTIONAL. EU AI Act-specific metadata
supporting Regulation (EU) 2024/1689 [[EU-AI-ACT]]:
transparencyRequired (Boolean) — agents must identify as AI.riskClassification (String) — "minimal", "limited", "high", or "unacceptable".humanOversightMandatory (Boolean).discovery objectLinks to complementary web resources.
mcpEndpoint (String) — URL to an MCP server card.a2aAgentCard (String) — URL to an A2A agent card.robotsTxt (String) — URL to robots.txt.llmsTxt (String) — URL to an llms.txt file.schemaOrg (Boolean) — whether Schema.org markup is present on the site.openApi (String) — URL to an OpenAPI specification.metadata object$schema (String) — URL of the JSON Schema for validation.schemaVersion (String) — specification version, e.g. "1.0".generatedAt (String) — RFC 3339 timestamp of generation.author (String) — policy creator.lastUpdated (String, ISO date) — last modification date.expiresAt (String, ISO date) — policy expiration date.changelogUrl (String) — URL to policy change history.Like robots.txt [[ROBOTS-TXT]], A2WF relies primarily on voluntary compliance by reputable AI agents. Major agent vendors are expected to respect published policies as part of responsible AI deployment.
Publishing sites MAY enforce policies through:
403 responses to non-compliant agents.
The legal.termsUrl field enables legal enforcement by
linking to machine-readable policies. Existing legal frameworks
(e.g. CFAA in the United States) treat violation of machine-readable
access policies as evidence of unauthorised access. The
EU AI Act [[EU-AI-ACT]] (effective August 2026)
requires transparency and risk management for AI systems;
siteai.json provides machine-readable evidence of
declared policies.
Publishing sites SHOULD log agent access patterns and compare them
against declared policies. The agentIdentification
section enables meaningful audit trails by requiring agent
self-identification.
The siteai.json file MUST be served over HTTPS to
prevent tampering. Publishing sites SHOULD implement integrity checks
and monitor for unauthorised modifications.
The siteai.json file contains structured data, not
executable content. Consuming agents MUST treat all fields as data,
not instructions. String fields (especially note) MUST
NOT be interpreted as agent commands.
Consuming agents MUST only trust siteai.json files
served from the domain they describe. Cross-domain policy
declarations MUST be rejected unless explicitly referenced via the
discovery mechanism.
Rate limits declared in siteai.json are requests from
the publishing site, not guarantees. Consuming agents SHOULD respect
declared limits. Publishing sites SHOULD implement server-side rate
limiting independently of declared policies.
The siteai.json file describes a site's policy for AI
agents and is intended to be fetched by agents and tools. It SHOULD
NOT contain personal data about individual users. The
contact field in identity
is intended for a role-based mailbox (for example
ai-policy@example.com) rather than an individual
person's address.
Logging of agent access by publishing sites is governed by applicable data protection law (such as the GDPR in the European Union); access logs MUST be processed in accordance with that law.
The specVersion field identifies the specification
version. Major versions (2.0, 3.0) MAY introduce breaking changes.
Minor updates within v1.x MUST remain backward-compatible.
Consuming agents MUST ignore any unrecognised members. This ensures that files created with future extensions remain readable by v1.0 consumers.
Future extensions may include:
siteai.json field | Schema.org equivalent |
|---|---|
@context | JSON-LD context |
identity.@type | schema:WebSite |
identity.name | schema:WebSite.name |
identity.description | schema:WebSite.description |
identity.inLanguage | schema:WebSite.inLanguage |
identity.domain | schema:WebSite.url |
legal.termsUrl | schema:WebSite.publishingPrinciples |
permissions.* | A2WF extension (no Schema.org equivalent) |
scraping.* | A2WF extension |
agentIdentification.* | A2WF extension |
humanVerification.* | A2WF extension |
A2WF extends Schema.org rather than reinventing it. Fields without a Schema.org equivalent represent the novel governance concepts unique to A2WF.
| File | Purpose | Since |
|---|---|---|
/robots.txt | Crawl permissions | 1994 |
/sitemap.xml | URL listing for search engines | 2005 |
/llms.txt | Content guide for LLMs | 2024 |
/.well-known/mcp.json | MCP server discovery | 2024 |
/siteai.json | AI agent access governance (A2WF) | 2025 |
Each file serves a distinct purpose. siteai.json is the
governance layer that sits alongside all of them. The
discovery section of siteai.json can
reference each of these files, creating a unified entry point for AI
agents.
A conforming consuming agent MUST:
siteai.json from the well-known location before performing any non-read action on the site.User-Agent header in a form that distinguishes it from human-operated browsers."allowed": false declarations as prohibitions.note fields) as data and never as instructions.The editor thanks the founding members and early reviewers of the A2WF Community Group for their feedback on the draft specification, and the W3C Community Development Lead for guidance on aligning the specification with W3C Community Group requirements.