MCP Server — Security Design & Implementation Plan • selmaR

Overview

This document describes the design for an MCP (Model Context Protocol) server that exposes selmaR data to Claude.

The server is a JSON-RPC 2.0 stdio process — it reads JSON requests from stdin, dispatches them to tool handlers, and writes JSON responses to stdout. All internal diagnostics go to stderr to avoid corrupting the protocol stream.

Six defence layers protect PII in responses: code inspection, input allowlisting, configurable field policy, hybrid data access controls, output scanning, and audit logging.

Deployment model: single user, local, same trust boundary as an R console.

Files to Create

File	Purpose
`inst/mcp/server.R`	Main MCP server (~1000–1200 lines)
`inst/mcp/field_policy.yml`	Default PII field policy
`inst/mcp/test_server.R`	Test suite for security layers

Welcome Message on Initialisation

When the MCP client sends the notifications/initialized message, the server sends a notifications/message back with level: "info" containing a user-facing welcome message. This is displayed directly to the end user in Claude Code or Claude Desktop.

The welcome message includes:

Server name and version
Connection status (authenticated or not)
Field policy status (which policy file is loaded, how many entities are covered)
A brief summary of what the server can do

Default mode (IDs pseudonymised):

selmaR MCP v0.1.0 connected to myorg.selma.co.nz.
PII redaction active (15 entities configured). IDs are pseudonymised.
Tools: list_entities, describe_entity, get_entity_summary, get_efts_report, execute_r.

When expose_real_ids: true is set in config.yml (warning level):

WARNING: Real student IDs and NSNs are exposed in this session.
You are responsible for the handling and protection of this PII.
This setting is logged in the audit trail.

Security Architecture — 7 Defence Layers

Layer 1: ID Pseudonymisation (Default On)

Student IDs and NSNs are high-value PII — they can be used to identify individuals across systems. By default, the MCP server pseudonymises all ID columns so the real values never cross the wire.

How it works

On server startup, a random seed is generated:

.mcp_seed <- as.character(sample.int(1e9, 1))

A deterministic hash function transforms IDs:

pseudonymise_id <- function(id, prefix = "S") {
  if (is.na(id) || id == "") return(id)
  hash <- substr(digest::digest(paste0(.mcp_seed, id), algo = "md5"), 1, 8)
  paste0(prefix, "-", hash)
}

This is applied to ID columns after the field policy filter:

Column pattern	Prefix	Example
`id` (entity-aware — see below)	per entity	`S-a3f2c1b9`
`student_id`, `studentid`	`S`	`S-a3f2c1b9`
`nsn`	`N`	`N-7e4d2f1a`
`contact_id`, `contactid`	`C`	`C-b8e3a5d2`
`enrolment_id`, `enrolid`, `enrol_id`	`E`	`E-1c9f4b7e`
`intake_id`, `intakeid`	`I`	`I-d2a8c3f6`
`compenrid`	`CE`	`CE-5a8b3c2d`
`compid`	`CP`	`CP-9d1e4f7a`
`progid`, `prog_id`, `interested_progid`	`P`	`P-2b6c8d1e`
`campus_id`, `campusid`	`CA`	`CA-4e7f1a3b`
`strand_id`	`ST`	`ST-6a9c2d4e`
`orgid`	`O`	`O-8b1d3e5f`
`parentid`	`E`	`E-3c5d7e9a`
`interested_intakeid`	`I`	`I-7a2b4c6d`
`addressid`	`A`	`A-1e3f5a7b`
`noteid`	`NT`	`NT-9c2d4e6f`
`enrolmentid`	`E`	`E-5a7b9c1d`

The generic id column is ambiguous — it appears in students, enrolments, contacts, and other entities. The server uses an ENTITY_ID_PREFIX lookup to assign the correct prefix based on which entity the data came from:

Entity	`id` prefix
students	`S`
enrolments	`E`
contacts	`C`
intakes	`I`
programmes	`P`
components	`CE`
organisations	`O`
classes	`CL`
campuses	`CA`
addresses	`A`
notes	`NT`

Key property: the hash is deterministic within a session. The same real student ID always produces the same pseudo-ID, so joins between entities still work correctly. The seed changes on each server restart, so pseudo-IDs are not stable across sessions.

Opting in to real IDs

To expose real IDs, add an mcp section to config.yml:

default:
  selma:
    base_url: "https://myorg.selma.co.nz/"
    email: "api@selma.co.nz"
    password: "your_password"
  mcp:
    expose_real_ids: true

When expose_real_ids: true is set, the welcome message changes to a warning-level notification requiring acknowledgement:

send_notification("notifications/message", list(
  level = "warning",
  data = paste0(
    "WARNING: Real student IDs and NSNs are exposed in this session. ",
    "You are responsible for the handling and protection of this PII. ",
    "This setting is logged in the audit trail."
  )
))

The opt-in is recorded in the audit log:

{
  "timestamp": "2026-03-14T10:00:01+1300",
  "event": "real_ids_enabled",
  "config_source": "config.yml"
}

Layer 2: Code Inspection (`execute_r` AST Guard)

Before any R code is evaluated, the server parses it into an Abstract Syntax Tree (AST) using parse() and walks the tree to detect blocked constructs. If a match is found, the code is rejected without execution and a warning is returned to the client.

AST analysis is strictly better than string matching — it ignores comments and string literals (no false positives on # don't use selmaR::) and catches obfuscation attempts like get("selma_request").

Blocked namespace packages (via :: and :::):

Package	Reason
`selmaR`	Bypasses policy-wrapped workspace functions
`httr`	Direct HTTP access bypasses all controls
`httr2`	Direct HTTP access bypasses all controls
`curl`	Direct HTTP access bypasses all controls
`jsonlite`	Could serialise/deserialise data outside controls
`config`	Could read credentials from config.yml

Blocked function calls:

Function	Reason
`eval`, `evalq`	Metaprogramming — could construct and run blocked code
`do.call`	Indirect function dispatch — bypass route
`get`, `mget`, `exists`	Object lookup by string name — bypass route
`getExportedValue`, `loadNamespace`, `requireNamespace`	Namespace access — bypass route
`Sys.getenv`, `Sys.setenv`	Could read `SELMA_PASSWORD` and other secrets
`system`, `system2`, `shell`	Shell execution — escape the R sandbox
`readLines`, `scan`, `file`, `readRDS`, `readr::read_csv`	File I/O — could read config.yml or cached data
`writeLines`, `write.csv`, `saveRDS`	File I/O — could exfiltrate data to disk
`download.file`, `url`, `socketConnection`	Network access — could exfiltrate data
`match.fun`	Function lookup by string — bypass route

Design rationale: none of these functions are needed for MCP analysis sessions. The workspace provides dplyr, tidyr, ggplot2, lubridate, scales, and the selma helper functions. All legitimate analysis patterns (filtering, grouping, summarising, joining, charting) are covered by tidyverse methods without needing base metaprogramming or I/O functions.

BLOCKED_PACKAGES <- c("selmaR", "httr", "httr2", "curl", "jsonlite", "config")

BLOCKED_FUNCTIONS <- c(
  # Metaprogramming / indirect dispatch
  "eval", "evalq", "do.call", "get", "mget", "exists",
  "match.fun", "getExportedValue", "loadNamespace", "requireNamespace",
  # Environment / credential access
  "Sys.getenv", "Sys.setenv",
  # Shell execution
  "system", "system2", "shell",
  # File I/O
  "readLines", "scan", "file", "readRDS", "writeLines",
  "write.csv", "write.csv2", "saveRDS",
  # Network
  "download.file", "url", "socketConnection"
)

check_code_safety <- function(code) {
  expr <- tryCatch(parse(text = code), error = function(e) NULL)
  if (is.null(expr)) return(list(safe = TRUE))  # syntax error — will fail at eval

  blocked <- character(0)

  walk <- function(node) {
    if (is.call(node)) {
      fn <- node[[1]]
      # Detect :: and ::: with blocked packages
      if (is.call(fn) && as.character(fn[[1]]) %in% c("::", ":::")) {
        pkg <- as.character(fn[[2]])
        if (pkg %in% BLOCKED_PACKAGES) {
          blocked <<- c(blocked, paste0(pkg, "::", as.character(fn[[3]])))
        }
      }
      # Detect blocked function calls
      fn_name <- if (is.symbol(fn)) as.character(fn) else ""
      if (fn_name %in% BLOCKED_FUNCTIONS) {
        blocked <<- c(blocked, fn_name)
      }
    }
    if (is.recursive(node)) {
      for (child in as.list(node)) walk(child)
    }
  }

  for (e in as.list(expr)) walk(e)

  if (length(blocked) > 0) {
    list(safe = FALSE, blocked = blocked)
  } else {
    list(safe = TRUE)
  }
}

When code is blocked, the server returns an MCP error result:

[BLOCKED] Your code was rejected before execution.
Blocked constructs found: selmaR::selma_students, get
Reason: These functions bypass PII controls or access restricted resources.
Use the workspace functions instead (e.g. selma_students(), selma_get_entity()).

The blocked attempt is recorded in the audit log with the full list of detected constructs.

Layer 3: Input Allowlisting (Tool Restrictions)

Seven tools are exposed. No raw selma_get() or selma_request() access is available to Claude.

Tool	Purpose	Returns
`auth_status`	Connection & policy status	No data
`list_entities`	Available entity types	Names + descriptions only
`search_entities`	Keyword search on entities	Names + descriptions only
`describe_entity`	Schema + column stats	Aggregates only (counts, uniques, distributions)
`get_entity_summary`	Filtered/grouped stats	Aggregates only (counts, means)
`get_efts_report`	EFTS funding breakdown	Aggregate funding totals (no PII)
`execute_r`	Custom R code in sandbox	Policy-filtered, code-inspected, output-scanned

What is NOT exposed in the workspace:

selma_connect() / selma_disconnect() — auth managed by server
selma_get() / selma_request() — raw API access bypasses policy
selma_get_one() — single-record access bypasses aggregation
The connection object itself

Layer 4: PII Field Allowlist (Configurable YAML)

A field_policy.yml defines per-entity field visibility. The file is resolved from (first match wins):

SELMAR_FIELD_POLICY environment variable
{cwd}/field_policy.yml (project-specific override)
{pkg_root}/inst/mcp/field_policy.yml (package default)

Three modes per entity:

mode: allow — only listed fields pass through (whitelist)
mode: redact — listed fields replaced with [REDACTED]
mode: all — no restrictions (for lookup tables with no PII)

Default policy

The full default policy is in inst/mcp/field_policy.yml (the canonical source of truth). Below is the current content — if it drifts, the YAML file is authoritative.

students:
  mode: allow
  fields:
    - id
    - status
    - title
    - gender
    - international
    - citizenship
    - residency
    - residentialstatus
    - ethnicity1
    - ethnicity2
    - ethnicity3
    - ethnicity
    - countryofbirth
    - visatype
    - feesfree
    - prestudyactivity
    - highpostschoolqual
    - third_party_id
    - third_party_id2
    - organisation
    - interested_progid
    - interested_intakeid
  # Hidden: surname, forename, preferredname, middlename, initials,
  #         email1, email2, mobilephone, secondaryphone, dob, nsn,
  #         passportnumber, passportexpiry, visanumber, visaexpiry,
  #         medicalcondition, medicallist, medinsurance, ird_number,
  #         bank_acc, driver_licence, disability, disabilitydetails

enrolments:
  mode: allow
  fields:
    - id
    - student_id
    - intake_id
    - enrstatus
    - enrstartdate
    - enrenddate
    - enrstatusdate
    - enrreturntype
    - enrattendance
    - completedsuccessfully
    - fundingsource
    - enrcompletiondate
    - enrcompletiongrade
    - enrqualcode
    - enrwithdrawalreason
    - enrwithdrawaldate
    - enrfeesfree
    - strand_id
    - third_party_id
    - percent_attendance
    - percent_in_programme
    - percent_progress

components:
  mode: allow
  fields:
    - compenrid
    - compid
    - parentid
    - enrolid
    - studentid
    - compenrstartdate
    - compenrenddate
    - compenrduedate
    - compenrstatus
    - compenrreturntype
    - compenrsource
    - compenrefts
    - compenrfeetf
    - compenrfundingcategory
    - compenrresidency
    - compenrattendance
    - compenrgrade
    - compenrcompletioncode
    - compenrcompletiondate
    - compenrextensiondate
    - compenrwithdrawalreason
    - compenrwithdrawaldate
    - compenrfeepayingstatus
    - comp_title
    - comp_credit_fw
    - comp_level_fw
    - comp_code
    - comp_version
    - comp_type
    - milestone
    - createddate
    - updateddate
  # Hidden: compenrassessmentnote, userfieldchar*, userfielddate*,
  #         userfieldbit*, userfieldint*, userfielddecimal*,
  #         createdby, updatedby (usernames)

intakes:
  mode: allow
  fields:
    - intakeid
    - intakecode
    - intakestatus
    - progid
    - campus_id
    - intake_name
    - intakestartdate
    - intakeenddate
    - available_spaces
    - prog_ie_min_places
    - prog_ie_max_places
    - strand_id
    - funding_source
    - fees
    - createddate
    - updateddate
  # Hidden: createdby, updatedby (usernames)

programmes:
  mode: allow
  fields:
    - progid
    - progcode
    - progversion
    - progtitle
    - progdescription
    - progcrediteq
    - progleveleq
    - progefts
    - progstatus
    - proglength
    - proglengthunits
    - ygtype
    - nzqacode
    - progress_type
    - createddate
    - updateddate
  # Hidden: createdby, updatedby (usernames)

contacts:
  mode: allow
  fields:
    - id
    - type_2
    - gender
    - status
    - createddate
  # Hidden: surname, forename, preferredname, middlename, initials,
  #         email1, mobilephone, secondaryphone, workphone, dob, otherid

addresses:
  mode: allow
  fields:
    - addressid
    - addresstype
    - city
    - postcode
    - region
    - country
    - validfrom
    - validto
    - studentid
    - orgid
    - contactid
    - createddate
    - updateddate
  # Hidden: street, suburb (too specific for PII)

notes:
  mode: allow
  fields:
    - noteid
    - student_id
    - enrolmentid
    - notetype
    - notearea
    - confidential
    - createddate
  # Hidden: note1 (free text may contain PII), createdby (username)

organisations:
  mode: allow
  fields:
    - id
    - name
    - legalname
    - orgtype
    - country
    - registration_number
  # Note: currently returns 0 rows in production

classes:
  mode: allow
  fields:
    - id
    - class_name
    - capacity
    - enrolment_count
    - startdate
    - enddate
    - campusid
    - active
    - createddate
  # Hidden: description (may contain PII), otherid, createdby

# Lookup entities -- no PII
campuses:
  mode: all
ethnicities:
  mode: all
countries:
  mode: all
genders:
  mode: all
titles:
  mode: all

Implementation

apply_field_policy <- function(df, entity_name) {
  policy <- .mcp_policy[[entity_name]]
  if (is.null(policy) || identical(policy$mode, "all")) return(df)

  if (identical(policy$mode, "allow")) {
    return(df[, intersect(names(df), policy$fields), drop = FALSE])
  }

  if (identical(policy$mode, "redact")) {
    for (col in intersect(names(df), policy$fields)) df[[col]] <- "[REDACTED]"
    return(df)
  }

  df
}

Unknown fields are secure by default. With mode: allow, any new field returned by the API that is not in the allowlist is automatically excluded.

Layer 5: Hybrid Data Access

describe_entity and get_entity_summary return only aggregates — never individual rows.
execute_r provides selma_get_entity() which fetches, applies the field policy, and caches. Claude never sees raw unfiltered data.
Entity aliases (selma_students() etc.) in the workspace shadow the real package functions, always applying the field policy first.
Join helpers are exposed — they operate on already-filtered data.

Layer 6: Output Scanning

Before any MCP response is returned, all text content is scanned for PII using two complementary methods: regex pattern matching and a session-specific PII dictionary. Matches are replaced with [REDACTED:type].

6a. Regex Pattern Matching

Catches structurally recognisable PII regardless of source:

PII_PATTERNS <- list(
  email    = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
  nz_phone = "\\+64[0-9]{7,10}",
  au_phone = "\\+61[0-9]{8,9}",
  nz_mobile = "(?:^|\\s)02[0-9]{7,9}",
  dob_iso  = paste0(
    "\\b(?:19[5-9][0-9]|200[0-9]|201[0-9])",
    "-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])\\b"
  )
)

Design decision: The NSN (National Student Number) pattern — 9-digit numbers — is excluded from the default regex set because it produces too many false positives. NSNs are instead handled by the PII dictionary (below).

6b. PII Dictionary (Session-Specific)

The server has access to the full unfiltered data before the field policy is applied. When an entity with PII fields is first fetched, the server extracts the unique values from known PII columns and stores them in an internal dictionary. This dictionary is never exposed to the workspace.

# Columns to harvest PII values from (per entity)
PII_DICTIONARY_SOURCES <- list(
  students = c("surname", "forename", "preferredname",
               "email1", "email2", "mobilephone", "homephone",
               "workphone", "nsn"),
  contacts = c("surname", "forename", "email",
               "mobilephone", "workphone")
)

# Built during entity fetch, before field policy is applied
build_pii_dictionary <- function(df, entity_name) {
  cols <- PII_DICTIONARY_SOURCES[[entity_name]]
  if (is.null(cols)) return(invisible(NULL))

  for (col in intersect(cols, names(df))) {
    values <- unique(na.omit(as.character(df[[col]])))
    # Filter out short/common values to avoid false positives
    values <- values[nchar(values) >= 3]
    .pii_dictionary$values <- unique(c(.pii_dictionary$values, values))
  }
}

The dictionary scan uses case-insensitive whole-word matching to minimise false positives:

scan_for_dictionary_pii <- function(text) {
  redacted <- text
  detections <- 0L

  for (val in .pii_dictionary$values) {
    pattern <- paste0("\\b", escape_regex(val), "\\b")
    if (grepl(pattern, redacted, ignore.case = TRUE)) {
      redacted <- gsub(pattern, "[REDACTED:pii]", redacted, ignore.case = TRUE)
      detections <- detections + 1L
    }
  }

  list(text = redacted, n_dictionary_hits = detections)
}

How it catches what regex can’t: if a student named “Kowalczyk” appears in a free-text note field or a computed output string, no regex pattern would match — but the dictionary scan catches it because “Kowalczyk” was harvested from the surname column.

False positive mitigation:

Minimum 3-character length threshold — filters out values like “Jo”, “Li”
Whole-word boundary matching — “Lee” won’t match “employee” or “Leeds”
Case-insensitive — catches “SMITH” even if the source was “Smith”
Dictionary is built lazily (only when entities are fetched) so it doesn’t slow down startup

All redactions (both regex and dictionary) are logged to the audit trail.

Layer 7: Audit Log

Every tool call is logged to {output_dir}/selma_mcp_audit.jsonl in JSON Lines format:

{
  "timestamp": "2026-03-14T10:15:30+1300",
  "tool": "execute_r",
  "arguments": {"code_length": 142},
  "response_bytes": 3847,
  "redactions_applied": ["email: 2 match(es)"],
  "code_blocked": false,
  "is_error": false
}

Blocked code attempts are also logged:

{
  "timestamp": "2026-03-14T10:16:05+1300",
  "tool": "execute_r",
  "arguments": {"code_length": 89},
  "response_bytes": 256,
  "code_blocked": true,
  "blocked_pattern": "selmaR::",
  "blocked_reason": "Direct namespace access bypasses PII policy",
  "is_error": true
}

The raw JSONL log can be parsed with jsonlite::stream_in() for programmatic analysis.

Session Summary Report (generated on exit)

When the MCP server shuts down (stdin EOF / client disconnect), it automatically generates a human-readable HTML summary at {output_dir}/selma_mcp_session_{timestamp}.html and writes the path to stderr. This ensures a readable audit artifact is always available, even if the user forgets to check.

The report contains:

Section	Content
Session	Start time, end time, duration, server version, SELMA domain
Security	Field policy file used, ID pseudonymisation on/off, real-ID opt-in warning if applicable
Tool Calls	Table of every call: timestamp, tool name, response size, duration
Entities Accessed	Which entities were fetched, row counts, how many times
Blocked Attempts	Any `execute_r` calls rejected by code inspection, with the blocked pattern and reason
PII Redactions	Count and type of output-scanner redactions (e.g. “3 emails, 1 NZ phone”)
Warnings	Any large-dataset warnings, timeouts, or errors

Example shutdown log:

[selmaR] Session summary written to: /path/to/output/selma_mcp_session_20260314_101530.html
[selmaR] 12 tool calls | 4 entities accessed | 0 blocked | 2 redactions

Implementation:

generate_session_report <- function() {
  log_entries <- jsonlite::stream_in(file(.audit$log_path), verbose = FALSE)

  # Compute summary stats
  n_calls <- nrow(log_entries)
  n_blocked <- sum(log_entries$code_blocked %in% TRUE)
  n_errors <- sum(log_entries$is_error %in% TRUE)
  entities_accessed <- unique(log_entries$entity[!is.na(log_entries$entity)])
  all_redactions <- unlist(log_entries$redactions_applied)
  duration <- difftime(Sys.time(), .audit$session_start, units = "mins")

  # Build HTML report
  html <- build_session_html(
    session_start = .audit$session_start,
    duration = duration,
    log_entries = log_entries,
    n_calls = n_calls,
    n_blocked = n_blocked,
    n_errors = n_errors,
    entities_accessed = entities_accessed,
    redactions = all_redactions,
    real_ids_enabled = .mcp_config$expose_real_ids
  )

  report_path <- file.path(
    .audit$output_dir,
    paste0("selma_mcp_session_",
           format(.audit$session_start, "%Y%m%d_%H%M%S"),
           ".html")
  )
  writeLines(html, report_path)
  mcp_log("Session summary written to: ", report_path)
  mcp_log(n_calls, " tool calls | ",
          length(entities_accessed), " entities accessed | ",
          n_blocked, " blocked | ",
          length(all_redactions), " redactions")
}

The main loop calls this on exit:

on.exit(generate_session_report(), add = TRUE)

repeat {
  line <- readLines(stdin_con, n = 1)
  if (length(line) == 0) break
  # ... dispatch ...
}

Workspace Setup

The execute_r tool runs code in a persistent .mcp_workspace environment.

Pre-loaded packages

dplyr, tidyr, ggplot2, lubridate, scales

Policy-wrapped fetch functions

# Primary interface
selma_get_entity("students")   # fetch -> apply_field_policy -> cache -> return

# Convenience aliases (shadow the real package functions)
selma_students()    # => selma_get_entity("students")
selma_enrolments()  # => selma_get_entity("enrolments")
selma_intakes()     # => selma_get_entity("intakes")
selma_components()  # => selma_get_entity("components")
selma_programmes()  # => selma_get_entity("programmes")

Join helpers

selma_join_students(enrolments, students)
selma_join_intakes(enrolments, intakes)
selma_student_pipeline(enrolments, students, intakes)
selma_component_pipeline(components, enrolments, students, intakes)

Constants

All SELMA_STATUS_*, SELMA_FUNDED_STATUSES, and SELMA_FUNDING_* constants are injected into the workspace.

What is NOT in the workspace

selma_connect(), selma_disconnect() — auth managed by server
selma_get(), selma_request() — raw API access
selma_get_one() — single-record access
The connection object

Runtime Guards

Guard	Detail
60-second compute timeout	`setTimeLimit(elapsed = 60)` on `execute_r` computation (data loading via `selma_get_entity()` runs outside the timeout)
800KB response cap	Truncate oversized responses with `[TRUNCATED]` warning
Row-count reporting	Workspace wrapper emits `[entity: N rows x M cols]` messages
Large dataset warnings	Entities >50K rows trigger `[WARNING: filter early]`
Audience annotations	Diagnostic messages sent to `["assistant"]` only
HTML escaping	`html_esc()` on all user data in plot HTML output
Filename sanitisation	Slugified filenames for chart output (no directory traversal)
stderr diagnostics	All internal logging goes to stderr, never stdout

Messages

Welcome Message (to the end user)

Sent via notifications/message with level: "info" when the client sends notifications/initialized. Displayed directly to the user in Claude Code or Claude Desktop:

selmaR MCP v{version} connected to {domain}.
PII redaction active ({n} entities configured).
Tools: list_entities, describe_entity, get_entity_summary, get_efts_report, execute_r.

Server Instructions (to Claude)

The MCP initialize response includes an instructions field. This is sent to the assistant (Claude) context — not shown to the end user — and front-loads domain knowledge to minimise tool call round-trips.

The full SERVER_INSTRUCTIONS string is defined in inst/mcp/server.R (search for SERVER_INSTRUCTIONS <-). It is maintained there as the single source of truth to avoid dual-maintenance drift.

The instructions cover these sections:

Section	Purpose
TOOL DISCOVERY	Which tools to call and in what order
SECURITY	PII policy rules, blocked patterns
WORKFLOW	Recommended 5-step tool progression
SELMA DATA MODEL	Entity relationships and join keys
JOIN KEYS	Exact column names for cross-entity joins
ENTITY FIELDS	Per-entity column listing (policy-allowed only)
KEY DATE FIELDS	Important date columns per entity
ENROLMENT STATUS CODES	`SELMA_STATUS_*` constants
STATUS GROUPS	`SELMA_FUNDED_STATUSES`, `SELMA_ALL_FUNDED_STATUSES`
FUNDING SOURCES	`SELMA_FUNDING_*` constants
WORKSPACE FUNCTIONS	`selma_*()` fetchers, join helpers, pipelines
COMMON PATTERNS	Example dplyr pipelines (e.g. `count(progtitle, enrstatus)`)
CHARTING	Chart.js HTML via `save_chart()`, ggplot2 fallback
PRE-LOADED PACKAGES	dplyr, tidyr, ggplot2, lubridate, scales

Configuration

Claude Desktop / Claude Code config

{
  "mcpServers": {
    "selmaR": {
      "command": "Rscript",
      "args": ["/path/to/selmaR/inst/mcp/server.R"],
      "cwd": "/path/to/directory/containing/config.yml"
    }
  }
}

SELMA credentials (`config.yml` in `cwd`)

default:
  selma:
    base_url: "https://myorg.selma.co.nz/"
    email: "api@selma.co.nz"
    password: "your_password"

Environment variable overrides

Variable	Purpose
`SELMAR_PKG_DIR`	Path to selmaR package root
`SELMAR_OUTPUT_DIR`	Output directory for charts/reports
`SELMAR_FIELD_POLICY`	Path to field policy YAML
`SELMA_BASE_URL`	Credential override
`SELMA_EMAIL`	Credential override
`SELMA_PASSWORD`	Credential override

Prompt Injection Defence

The combination of layers provides defence-in-depth against prompt injection attempting to exfiltrate data:

ID pseudonymisation — Student IDs and NSNs are hashed with a session-scoped seed before any data reaches Claude. Even if all other layers fail, real IDs are never exposed unless explicitly opted in via config.yml.
Code inspection (AST) — Before execute_r evaluates anything, the code is parsed into an AST and walked. Blocked namespace access (selmaR::, httr::, etc.), metaprogramming (eval, do.call, get), file I/O, network access, and shell commands are all detected and rejected. The code never runs.
Input allowlisting — Claude cannot call arbitrary API endpoints; only the 7 defined tools are available.
Field policy — Even if Claude writes selma_students() in execute_r, the workspace wrapper strips PII fields before the data reaches Claude.
Output scanning — If PII somehow enters the response (e.g. via a free-text note field that wasn’t policy-filtered), two scanners catch it: regex patterns for emails, phones, and DOBs, plus a session-specific PII dictionary built from actual student names, NSNs, and contact details.
Audit log — All tool calls, redactions, blocked attempts, and real-ID opt-ins are logged, so injection attempts are detectable in post-hoc review.
Server instructions — Tell Claude never to return raw unfiltered data and never to use :: namespace access, reducing the surface area for social engineering.

Known limitations:

The AST guard blocks eval, do.call, get, and other metaprogramming functions, which closes the main bypass routes. However, R is a dynamic language — sufficiently creative code could theoretically construct function references via mechanisms not yet blocked (e.g. environment manipulation). The output scanner and field policy provide additional defence lines. For multi-user deployment, process-level isolation would be needed (out of scope).
The output scanner uses regex heuristics — it will not catch all possible PII (e.g. free-text names that aren’t in a recognisable pattern). The field policy is the primary defence; the scanner is a safety net.

Build Sequence

Server skeleton: preamble, package loading (jsonlite, stringr, yaml, digest), mcp_log(), output directory
Security engine: pseudonymise_id(), check_code_safety(), load_field_policy(), apply_field_policy(), scan_output_for_pii(), audit_log()
Entity registry and cache: ENTITY_REGISTRY, get_cached_entity()
Column summariser: summarize_column()
Tool definitions: 7 tool schemas in mcp_tools list
Tool handlers: list/describe/search/summary/efts/auth/execute_r
Workspace setup: .mcp_workspace with policy-wrapped functions
Server infrastructure: SERVER_INSTRUCTIONS, welcome message, handle_request(), auth-on-startup, main loop
field_policy.yml: default policy for all entities
test_server.R: security-focused test suite
Documentation: this article

Testing Strategy

Category	What is tested
ID pseudonymisation	Same ID produces same hash within session, different seed = different hash, joins still work, opt-in disables hashing
AST code inspection	`::` with blocked packages, blocked function calls, `:::`, metaprogramming (`eval`/`do.call`/`get`), file I/O, network, shell — all rejected. Clean dplyr/ggplot2 code passes. Comments and strings containing blocked names do not trigger. Audit log records blocks.
Field policy loading	All three modes (allow/redact/all), missing policy, unknown entity
Output scanning (regex)	Email, NZ/AU phone, DOB patterns, clean text passthrough
Output scanning (dictionary)	Known surnames/names caught in free text, short values (<3 chars) skipped, whole-word matching avoids partial hits
apply_field_policy	Mock tibbles with PII columns, verify filtering
Audit logging	Write entry, verify JSONL structure
Entity registry	All entries map to real selmaR functions
Integration (skip without creds)	End-to-end fetch + filter + scan

Verification

Rscript inst/mcp/test_server.R — all security tests pass
Manual test: pipe JSON-RPC requests to stdin, verify responses
Configure in Claude Code, test full workflow
Verify welcome message appears on connection
Verify blocked code patterns are rejected with clear warnings
Verify audit log is created and populated (including blocked attempts)
Verify PII fields are absent from describe_entity output
Verify output scanning catches emails/phones in execute_r results

MCP Server — Security Design & Implementation Plan