Skip to contents

Overview

This document describes the design for an MCP (Model Context Protocol) server that exposes selmaR data to Claude.

The server is a JSON-RPC 2.0 stdio process — it reads JSON requests from stdin, dispatches them to tool handlers, and writes JSON responses to stdout. All internal diagnostics go to stderr to avoid corrupting the protocol stream.

Six defence layers protect PII in responses: code inspection, input allowlisting, configurable field policy, hybrid data access controls, output scanning, and audit logging.

Deployment model: single user, local, same trust boundary as an R console.

Files to Create

File Purpose
inst/mcp/server.R Main MCP server (~1000–1200 lines)
inst/mcp/field_policy.yml Default PII field policy
inst/mcp/test_server.R Test suite for security layers

Welcome Message on Initialisation

When the MCP client sends the notifications/initialized message, the server sends a notifications/message back with level: "info" containing a user-facing welcome message. This is displayed directly to the end user in Claude Code or Claude Desktop.

The welcome message includes:

  • Server name and version
  • Connection status (authenticated or not)
  • Field policy status (which policy file is loaded, how many entities are covered)
  • A brief summary of what the server can do

Default mode (IDs pseudonymised):

selmaR MCP v0.1.0 connected to myorg.selma.co.nz.
PII redaction active (15 entities configured). IDs are pseudonymised.
Tools: list_entities, describe_entity, get_entity_summary, get_efts_report, execute_r.

When expose_real_ids: true is set in config.yml (warning level):

WARNING: Real student IDs and NSNs are exposed in this session.
You are responsible for the handling and protection of this PII.
This setting is logged in the audit trail.

Security Architecture — 7 Defence Layers

Layer 1: ID Pseudonymisation (Default On)

Student IDs and NSNs are high-value PII — they can be used to identify individuals across systems. By default, the MCP server pseudonymises all ID columns so the real values never cross the wire.

How it works

On server startup, a random seed is generated:

.mcp_seed <- as.character(sample.int(1e9, 1))

A deterministic hash function transforms IDs:

pseudonymise_id <- function(id, prefix = "S") {
  if (is.na(id) || id == "") return(id)
  hash <- substr(digest::digest(paste0(.mcp_seed, id), algo = "md5"), 1, 8)
  paste0(prefix, "-", hash)
}

This is applied to ID columns after the field policy filter:

Column pattern Prefix Example
id (entity-aware — see below) per entity S-a3f2c1b9
student_id, studentid S S-a3f2c1b9
nsn N N-7e4d2f1a
contact_id, contactid C C-b8e3a5d2
enrolment_id, enrolid, enrol_id E E-1c9f4b7e
intake_id, intakeid I I-d2a8c3f6
compenrid CE CE-5a8b3c2d
compid CP CP-9d1e4f7a
progid, prog_id, interested_progid P P-2b6c8d1e
campus_id, campusid CA CA-4e7f1a3b
strand_id ST ST-6a9c2d4e
orgid O O-8b1d3e5f
parentid E E-3c5d7e9a
interested_intakeid I I-7a2b4c6d
addressid A A-1e3f5a7b
noteid NT NT-9c2d4e6f
enrolmentid E E-5a7b9c1d

The generic id column is ambiguous — it appears in students, enrolments, contacts, and other entities. The server uses an ENTITY_ID_PREFIX lookup to assign the correct prefix based on which entity the data came from:

Entity id prefix
students S
enrolments E
contacts C
intakes I
programmes P
components CE
organisations O
classes CL
campuses CA
addresses A
notes NT

Key property: the hash is deterministic within a session. The same real student ID always produces the same pseudo-ID, so joins between entities still work correctly. The seed changes on each server restart, so pseudo-IDs are not stable across sessions.

Opting in to real IDs

To expose real IDs, add an mcp section to config.yml:

default:
  selma:
    base_url: "https://myorg.selma.co.nz/"
    email: "api@selma.co.nz"
    password: "your_password"
  mcp:
    expose_real_ids: true

When expose_real_ids: true is set, the welcome message changes to a warning-level notification requiring acknowledgement:

send_notification("notifications/message", list(
  level = "warning",
  data = paste0(
    "WARNING: Real student IDs and NSNs are exposed in this session. ",
    "You are responsible for the handling and protection of this PII. ",
    "This setting is logged in the audit trail."
  )
))

The opt-in is recorded in the audit log:

{
  "timestamp": "2026-03-14T10:00:01+1300",
  "event": "real_ids_enabled",
  "config_source": "config.yml"
}

Layer 2: Code Inspection (execute_r AST Guard)

Before any R code is evaluated, the server parses it into an Abstract Syntax Tree (AST) using parse() and walks the tree to detect blocked constructs. If a match is found, the code is rejected without execution and a warning is returned to the client.

AST analysis is strictly better than string matching — it ignores comments and string literals (no false positives on # don't use selmaR::) and catches obfuscation attempts like get("selma_request").

Blocked namespace packages (via :: and :::):

Package Reason
selmaR Bypasses policy-wrapped workspace functions
httr Direct HTTP access bypasses all controls
httr2 Direct HTTP access bypasses all controls
curl Direct HTTP access bypasses all controls
jsonlite Could serialise/deserialise data outside controls
config Could read credentials from config.yml

Blocked function calls:

Function Reason
eval, evalq Metaprogramming — could construct and run blocked code
do.call Indirect function dispatch — bypass route
get, mget, exists Object lookup by string name — bypass route
getExportedValue, loadNamespace, requireNamespace Namespace access — bypass route
Sys.getenv, Sys.setenv Could read SELMA_PASSWORD and other secrets
system, system2, shell Shell execution — escape the R sandbox
readLines, scan, file, readRDS, readr::read_csv File I/O — could read config.yml or cached data
writeLines, write.csv, saveRDS File I/O — could exfiltrate data to disk
download.file, url, socketConnection Network access — could exfiltrate data
match.fun Function lookup by string — bypass route

Design rationale: none of these functions are needed for MCP analysis sessions. The workspace provides dplyr, tidyr, ggplot2, lubridate, scales, and the selma helper functions. All legitimate analysis patterns (filtering, grouping, summarising, joining, charting) are covered by tidyverse methods without needing base metaprogramming or I/O functions.

BLOCKED_PACKAGES <- c("selmaR", "httr", "httr2", "curl", "jsonlite", "config")

BLOCKED_FUNCTIONS <- c(
  # Metaprogramming / indirect dispatch
  "eval", "evalq", "do.call", "get", "mget", "exists",
  "match.fun", "getExportedValue", "loadNamespace", "requireNamespace",
  # Environment / credential access
  "Sys.getenv", "Sys.setenv",
  # Shell execution
  "system", "system2", "shell",
  # File I/O
  "readLines", "scan", "file", "readRDS", "writeLines",
  "write.csv", "write.csv2", "saveRDS",
  # Network
  "download.file", "url", "socketConnection"
)

check_code_safety <- function(code) {
  expr <- tryCatch(parse(text = code), error = function(e) NULL)
  if (is.null(expr)) return(list(safe = TRUE))  # syntax error — will fail at eval

  blocked <- character(0)

  walk <- function(node) {
    if (is.call(node)) {
      fn <- node[[1]]
      # Detect :: and ::: with blocked packages
      if (is.call(fn) && as.character(fn[[1]]) %in% c("::", ":::")) {
        pkg <- as.character(fn[[2]])
        if (pkg %in% BLOCKED_PACKAGES) {
          blocked <<- c(blocked, paste0(pkg, "::", as.character(fn[[3]])))
        }
      }
      # Detect blocked function calls
      fn_name <- if (is.symbol(fn)) as.character(fn) else ""
      if (fn_name %in% BLOCKED_FUNCTIONS) {
        blocked <<- c(blocked, fn_name)
      }
    }
    if (is.recursive(node)) {
      for (child in as.list(node)) walk(child)
    }
  }

  for (e in as.list(expr)) walk(e)

  if (length(blocked) > 0) {
    list(safe = FALSE, blocked = blocked)
  } else {
    list(safe = TRUE)
  }
}

When code is blocked, the server returns an MCP error result:

[BLOCKED] Your code was rejected before execution.
Blocked constructs found: selmaR::selma_students, get
Reason: These functions bypass PII controls or access restricted resources.
Use the workspace functions instead (e.g. selma_students(), selma_get_entity()).

The blocked attempt is recorded in the audit log with the full list of detected constructs.

Layer 3: Input Allowlisting (Tool Restrictions)

Seven tools are exposed. No raw selma_get() or selma_request() access is available to Claude.

Tool Purpose Returns
auth_status Connection & policy status No data
list_entities Available entity types Names + descriptions only
search_entities Keyword search on entities Names + descriptions only
describe_entity Schema + column stats Aggregates only (counts, uniques, distributions)
get_entity_summary Filtered/grouped stats Aggregates only (counts, means)
get_efts_report EFTS funding breakdown Aggregate funding totals (no PII)
execute_r Custom R code in sandbox Policy-filtered, code-inspected, output-scanned

What is NOT exposed in the workspace:

Layer 4: PII Field Allowlist (Configurable YAML)

A field_policy.yml defines per-entity field visibility. The file is resolved from (first match wins):

  1. SELMAR_FIELD_POLICY environment variable
  2. {cwd}/field_policy.yml (project-specific override)
  3. {pkg_root}/inst/mcp/field_policy.yml (package default)

Three modes per entity:

  • mode: allow — only listed fields pass through (whitelist)
  • mode: redact — listed fields replaced with [REDACTED]
  • mode: all — no restrictions (for lookup tables with no PII)

Default policy

The full default policy is in inst/mcp/field_policy.yml (the canonical source of truth). Below is the current content — if it drifts, the YAML file is authoritative.

students:
  mode: allow
  fields:
    - id
    - status
    - title
    - gender
    - international
    - citizenship
    - residency
    - residentialstatus
    - ethnicity1
    - ethnicity2
    - ethnicity3
    - ethnicity
    - countryofbirth
    - visatype
    - feesfree
    - prestudyactivity
    - highpostschoolqual
    - third_party_id
    - third_party_id2
    - organisation
    - interested_progid
    - interested_intakeid
  # Hidden: surname, forename, preferredname, middlename, initials,
  #         email1, email2, mobilephone, secondaryphone, dob, nsn,
  #         passportnumber, passportexpiry, visanumber, visaexpiry,
  #         medicalcondition, medicallist, medinsurance, ird_number,
  #         bank_acc, driver_licence, disability, disabilitydetails

enrolments:
  mode: allow
  fields:
    - id
    - student_id
    - intake_id
    - enrstatus
    - enrstartdate
    - enrenddate
    - enrstatusdate
    - enrreturntype
    - enrattendance
    - completedsuccessfully
    - fundingsource
    - enrcompletiondate
    - enrcompletiongrade
    - enrqualcode
    - enrwithdrawalreason
    - enrwithdrawaldate
    - enrfeesfree
    - strand_id
    - third_party_id
    - percent_attendance
    - percent_in_programme
    - percent_progress

components:
  mode: allow
  fields:
    - compenrid
    - compid
    - parentid
    - enrolid
    - studentid
    - compenrstartdate
    - compenrenddate
    - compenrduedate
    - compenrstatus
    - compenrreturntype
    - compenrsource
    - compenrefts
    - compenrfeetf
    - compenrfundingcategory
    - compenrresidency
    - compenrattendance
    - compenrgrade
    - compenrcompletioncode
    - compenrcompletiondate
    - compenrextensiondate
    - compenrwithdrawalreason
    - compenrwithdrawaldate
    - compenrfeepayingstatus
    - comp_title
    - comp_credit_fw
    - comp_level_fw
    - comp_code
    - comp_version
    - comp_type
    - milestone
    - createddate
    - updateddate
  # Hidden: compenrassessmentnote, userfieldchar*, userfielddate*,
  #         userfieldbit*, userfieldint*, userfielddecimal*,
  #         createdby, updatedby (usernames)

intakes:
  mode: allow
  fields:
    - intakeid
    - intakecode
    - intakestatus
    - progid
    - campus_id
    - intake_name
    - intakestartdate
    - intakeenddate
    - available_spaces
    - prog_ie_min_places
    - prog_ie_max_places
    - strand_id
    - funding_source
    - fees
    - createddate
    - updateddate
  # Hidden: createdby, updatedby (usernames)

programmes:
  mode: allow
  fields:
    - progid
    - progcode
    - progversion
    - progtitle
    - progdescription
    - progcrediteq
    - progleveleq
    - progefts
    - progstatus
    - proglength
    - proglengthunits
    - ygtype
    - nzqacode
    - progress_type
    - createddate
    - updateddate
  # Hidden: createdby, updatedby (usernames)

contacts:
  mode: allow
  fields:
    - id
    - type_2
    - gender
    - status
    - createddate
  # Hidden: surname, forename, preferredname, middlename, initials,
  #         email1, mobilephone, secondaryphone, workphone, dob, otherid

addresses:
  mode: allow
  fields:
    - addressid
    - addresstype
    - city
    - postcode
    - region
    - country
    - validfrom
    - validto
    - studentid
    - orgid
    - contactid
    - createddate
    - updateddate
  # Hidden: street, suburb (too specific for PII)

notes:
  mode: allow
  fields:
    - noteid
    - student_id
    - enrolmentid
    - notetype
    - notearea
    - confidential
    - createddate
  # Hidden: note1 (free text may contain PII), createdby (username)

organisations:
  mode: allow
  fields:
    - id
    - name
    - legalname
    - orgtype
    - country
    - registration_number
  # Note: currently returns 0 rows in production

classes:
  mode: allow
  fields:
    - id
    - class_name
    - capacity
    - enrolment_count
    - startdate
    - enddate
    - campusid
    - active
    - createddate
  # Hidden: description (may contain PII), otherid, createdby

# Lookup entities -- no PII
campuses:
  mode: all
ethnicities:
  mode: all
countries:
  mode: all
genders:
  mode: all
titles:
  mode: all

Implementation

apply_field_policy <- function(df, entity_name) {
  policy <- .mcp_policy[[entity_name]]
  if (is.null(policy) || identical(policy$mode, "all")) return(df)

  if (identical(policy$mode, "allow")) {
    return(df[, intersect(names(df), policy$fields), drop = FALSE])
  }

  if (identical(policy$mode, "redact")) {
    for (col in intersect(names(df), policy$fields)) df[[col]] <- "[REDACTED]"
    return(df)
  }

  df
}

Unknown fields are secure by default. With mode: allow, any new field returned by the API that is not in the allowlist is automatically excluded.

Layer 5: Hybrid Data Access

  • describe_entity and get_entity_summary return only aggregates — never individual rows.
  • execute_r provides selma_get_entity() which fetches, applies the field policy, and caches. Claude never sees raw unfiltered data.
  • Entity aliases (selma_students() etc.) in the workspace shadow the real package functions, always applying the field policy first.
  • Join helpers are exposed — they operate on already-filtered data.

Layer 6: Output Scanning

Before any MCP response is returned, all text content is scanned for PII using two complementary methods: regex pattern matching and a session-specific PII dictionary. Matches are replaced with [REDACTED:type].

6a. Regex Pattern Matching

Catches structurally recognisable PII regardless of source:

PII_PATTERNS <- list(
  email    = "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
  nz_phone = "\\+64[0-9]{7,10}",
  au_phone = "\\+61[0-9]{8,9}",
  nz_mobile = "(?:^|\\s)02[0-9]{7,9}",
  dob_iso  = paste0(
    "\\b(?:19[5-9][0-9]|200[0-9]|201[0-9])",
    "-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])\\b"
  )
)

Design decision: The NSN (National Student Number) pattern — 9-digit numbers — is excluded from the default regex set because it produces too many false positives. NSNs are instead handled by the PII dictionary (below).

6b. PII Dictionary (Session-Specific)

The server has access to the full unfiltered data before the field policy is applied. When an entity with PII fields is first fetched, the server extracts the unique values from known PII columns and stores them in an internal dictionary. This dictionary is never exposed to the workspace.

# Columns to harvest PII values from (per entity)
PII_DICTIONARY_SOURCES <- list(
  students = c("surname", "forename", "preferredname",
               "email1", "email2", "mobilephone", "homephone",
               "workphone", "nsn"),
  contacts = c("surname", "forename", "email",
               "mobilephone", "workphone")
)

# Built during entity fetch, before field policy is applied
build_pii_dictionary <- function(df, entity_name) {
  cols <- PII_DICTIONARY_SOURCES[[entity_name]]
  if (is.null(cols)) return(invisible(NULL))

  for (col in intersect(cols, names(df))) {
    values <- unique(na.omit(as.character(df[[col]])))
    # Filter out short/common values to avoid false positives
    values <- values[nchar(values) >= 3]
    .pii_dictionary$values <- unique(c(.pii_dictionary$values, values))
  }
}

The dictionary scan uses case-insensitive whole-word matching to minimise false positives:

scan_for_dictionary_pii <- function(text) {
  redacted <- text
  detections <- 0L

  for (val in .pii_dictionary$values) {
    pattern <- paste0("\\b", escape_regex(val), "\\b")
    if (grepl(pattern, redacted, ignore.case = TRUE)) {
      redacted <- gsub(pattern, "[REDACTED:pii]", redacted, ignore.case = TRUE)
      detections <- detections + 1L
    }
  }

  list(text = redacted, n_dictionary_hits = detections)
}

How it catches what regex can’t: if a student named “Kowalczyk” appears in a free-text note field or a computed output string, no regex pattern would match — but the dictionary scan catches it because “Kowalczyk” was harvested from the surname column.

False positive mitigation:

  • Minimum 3-character length threshold — filters out values like “Jo”, “Li”
  • Whole-word boundary matching — “Lee” won’t match “employee” or “Leeds”
  • Case-insensitive — catches “SMITH” even if the source was “Smith”
  • Dictionary is built lazily (only when entities are fetched) so it doesn’t slow down startup

All redactions (both regex and dictionary) are logged to the audit trail.

Layer 7: Audit Log

Every tool call is logged to {output_dir}/selma_mcp_audit.jsonl in JSON Lines format:

{
  "timestamp": "2026-03-14T10:15:30+1300",
  "tool": "execute_r",
  "arguments": {"code_length": 142},
  "response_bytes": 3847,
  "redactions_applied": ["email: 2 match(es)"],
  "code_blocked": false,
  "is_error": false
}

Blocked code attempts are also logged:

{
  "timestamp": "2026-03-14T10:16:05+1300",
  "tool": "execute_r",
  "arguments": {"code_length": 89},
  "response_bytes": 256,
  "code_blocked": true,
  "blocked_pattern": "selmaR::",
  "blocked_reason": "Direct namespace access bypasses PII policy",
  "is_error": true
}

The raw JSONL log can be parsed with jsonlite::stream_in() for programmatic analysis.

Session Summary Report (generated on exit)

When the MCP server shuts down (stdin EOF / client disconnect), it automatically generates a human-readable HTML summary at {output_dir}/selma_mcp_session_{timestamp}.html and writes the path to stderr. This ensures a readable audit artifact is always available, even if the user forgets to check.

The report contains:

Section Content
Session Start time, end time, duration, server version, SELMA domain
Security Field policy file used, ID pseudonymisation on/off, real-ID opt-in warning if applicable
Tool Calls Table of every call: timestamp, tool name, response size, duration
Entities Accessed Which entities were fetched, row counts, how many times
Blocked Attempts Any execute_r calls rejected by code inspection, with the blocked pattern and reason
PII Redactions Count and type of output-scanner redactions (e.g. “3 emails, 1 NZ phone”)
Warnings Any large-dataset warnings, timeouts, or errors

Example shutdown log:

[selmaR] Session summary written to: /path/to/output/selma_mcp_session_20260314_101530.html
[selmaR] 12 tool calls | 4 entities accessed | 0 blocked | 2 redactions

Implementation:

generate_session_report <- function() {
  log_entries <- jsonlite::stream_in(file(.audit$log_path), verbose = FALSE)

  # Compute summary stats
  n_calls <- nrow(log_entries)
  n_blocked <- sum(log_entries$code_blocked %in% TRUE)
  n_errors <- sum(log_entries$is_error %in% TRUE)
  entities_accessed <- unique(log_entries$entity[!is.na(log_entries$entity)])
  all_redactions <- unlist(log_entries$redactions_applied)
  duration <- difftime(Sys.time(), .audit$session_start, units = "mins")

  # Build HTML report
  html <- build_session_html(
    session_start = .audit$session_start,
    duration = duration,
    log_entries = log_entries,
    n_calls = n_calls,
    n_blocked = n_blocked,
    n_errors = n_errors,
    entities_accessed = entities_accessed,
    redactions = all_redactions,
    real_ids_enabled = .mcp_config$expose_real_ids
  )

  report_path <- file.path(
    .audit$output_dir,
    paste0("selma_mcp_session_",
           format(.audit$session_start, "%Y%m%d_%H%M%S"),
           ".html")
  )
  writeLines(html, report_path)
  mcp_log("Session summary written to: ", report_path)
  mcp_log(n_calls, " tool calls | ",
          length(entities_accessed), " entities accessed | ",
          n_blocked, " blocked | ",
          length(all_redactions), " redactions")
}

The main loop calls this on exit:

on.exit(generate_session_report(), add = TRUE)

repeat {
  line <- readLines(stdin_con, n = 1)
  if (length(line) == 0) break
  # ... dispatch ...
}

Workspace Setup

The execute_r tool runs code in a persistent .mcp_workspace environment.

Pre-loaded packages

dplyr, tidyr, ggplot2, lubridate, scales

Policy-wrapped fetch functions

# Primary interface
selma_get_entity("students")   # fetch -> apply_field_policy -> cache -> return

# Convenience aliases (shadow the real package functions)
selma_students()    # => selma_get_entity("students")
selma_enrolments()  # => selma_get_entity("enrolments")
selma_intakes()     # => selma_get_entity("intakes")
selma_components()  # => selma_get_entity("components")
selma_programmes()  # => selma_get_entity("programmes")

Join helpers

selma_join_students(enrolments, students)
selma_join_intakes(enrolments, intakes)
selma_student_pipeline(enrolments, students, intakes)
selma_component_pipeline(components, enrolments, students, intakes)

Constants

All SELMA_STATUS_*, SELMA_FUNDED_STATUSES, and SELMA_FUNDING_* constants are injected into the workspace.

What is NOT in the workspace

Runtime Guards

Guard Detail
60-second compute timeout setTimeLimit(elapsed = 60) on execute_r computation (data loading via selma_get_entity() runs outside the timeout)
800KB response cap Truncate oversized responses with [TRUNCATED] warning
Row-count reporting Workspace wrapper emits [entity: N rows x M cols] messages
Large dataset warnings Entities >50K rows trigger [WARNING: filter early]
Audience annotations Diagnostic messages sent to ["assistant"] only
HTML escaping html_esc() on all user data in plot HTML output
Filename sanitisation Slugified filenames for chart output (no directory traversal)
stderr diagnostics All internal logging goes to stderr, never stdout

Messages

Welcome Message (to the end user)

Sent via notifications/message with level: "info" when the client sends notifications/initialized. Displayed directly to the user in Claude Code or Claude Desktop:

selmaR MCP v{version} connected to {domain}.
PII redaction active ({n} entities configured).
Tools: list_entities, describe_entity, get_entity_summary, get_efts_report, execute_r.

Server Instructions (to Claude)

The MCP initialize response includes an instructions field. This is sent to the assistant (Claude) context — not shown to the end user — and front-loads domain knowledge to minimise tool call round-trips.

The full SERVER_INSTRUCTIONS string is defined in inst/mcp/server.R (search for SERVER_INSTRUCTIONS <-). It is maintained there as the single source of truth to avoid dual-maintenance drift.

The instructions cover these sections:

Section Purpose
TOOL DISCOVERY Which tools to call and in what order
SECURITY PII policy rules, blocked patterns
WORKFLOW Recommended 5-step tool progression
SELMA DATA MODEL Entity relationships and join keys
JOIN KEYS Exact column names for cross-entity joins
ENTITY FIELDS Per-entity column listing (policy-allowed only)
KEY DATE FIELDS Important date columns per entity
ENROLMENT STATUS CODES SELMA_STATUS_* constants
STATUS GROUPS SELMA_FUNDED_STATUSES, SELMA_ALL_FUNDED_STATUSES
FUNDING SOURCES SELMA_FUNDING_* constants
WORKSPACE FUNCTIONS selma_*() fetchers, join helpers, pipelines
COMMON PATTERNS Example dplyr pipelines (e.g. count(progtitle, enrstatus))
CHARTING Chart.js HTML via save_chart(), ggplot2 fallback
PRE-LOADED PACKAGES dplyr, tidyr, ggplot2, lubridate, scales

Configuration

Claude Desktop / Claude Code config

{
  "mcpServers": {
    "selmaR": {
      "command": "Rscript",
      "args": ["/path/to/selmaR/inst/mcp/server.R"],
      "cwd": "/path/to/directory/containing/config.yml"
    }
  }
}

SELMA credentials (config.yml in cwd)

default:
  selma:
    base_url: "https://myorg.selma.co.nz/"
    email: "api@selma.co.nz"
    password: "your_password"

Environment variable overrides

Variable Purpose
SELMAR_PKG_DIR Path to selmaR package root
SELMAR_OUTPUT_DIR Output directory for charts/reports
SELMAR_FIELD_POLICY Path to field policy YAML
SELMA_BASE_URL Credential override
SELMA_EMAIL Credential override
SELMA_PASSWORD Credential override

Prompt Injection Defence

The combination of layers provides defence-in-depth against prompt injection attempting to exfiltrate data:

  1. ID pseudonymisation — Student IDs and NSNs are hashed with a session-scoped seed before any data reaches Claude. Even if all other layers fail, real IDs are never exposed unless explicitly opted in via config.yml.
  2. Code inspection (AST) — Before execute_r evaluates anything, the code is parsed into an AST and walked. Blocked namespace access (selmaR::, httr::, etc.), metaprogramming (eval, do.call, get), file I/O, network access, and shell commands are all detected and rejected. The code never runs.
  3. Input allowlisting — Claude cannot call arbitrary API endpoints; only the 7 defined tools are available.
  4. Field policy — Even if Claude writes selma_students() in execute_r, the workspace wrapper strips PII fields before the data reaches Claude.
  5. Output scanning — If PII somehow enters the response (e.g. via a free-text note field that wasn’t policy-filtered), two scanners catch it: regex patterns for emails, phones, and DOBs, plus a session-specific PII dictionary built from actual student names, NSNs, and contact details.
  6. Audit log — All tool calls, redactions, blocked attempts, and real-ID opt-ins are logged, so injection attempts are detectable in post-hoc review.
  7. Server instructions — Tell Claude never to return raw unfiltered data and never to use :: namespace access, reducing the surface area for social engineering.

Known limitations:

  • The AST guard blocks eval, do.call, get, and other metaprogramming functions, which closes the main bypass routes. However, R is a dynamic language — sufficiently creative code could theoretically construct function references via mechanisms not yet blocked (e.g. environment manipulation). The output scanner and field policy provide additional defence lines. For multi-user deployment, process-level isolation would be needed (out of scope).
  • The output scanner uses regex heuristics — it will not catch all possible PII (e.g. free-text names that aren’t in a recognisable pattern). The field policy is the primary defence; the scanner is a safety net.

Build Sequence

  1. Server skeleton: preamble, package loading (jsonlite, stringr, yaml, digest), mcp_log(), output directory
  2. Security engine: pseudonymise_id(), check_code_safety(), load_field_policy(), apply_field_policy(), scan_output_for_pii(), audit_log()
  3. Entity registry and cache: ENTITY_REGISTRY, get_cached_entity()
  4. Column summariser: summarize_column()
  5. Tool definitions: 7 tool schemas in mcp_tools list
  6. Tool handlers: list/describe/search/summary/efts/auth/execute_r
  7. Workspace setup: .mcp_workspace with policy-wrapped functions
  8. Server infrastructure: SERVER_INSTRUCTIONS, welcome message, handle_request(), auth-on-startup, main loop
  9. field_policy.yml: default policy for all entities
  10. test_server.R: security-focused test suite
  11. Documentation: this article

Testing Strategy

Category What is tested
ID pseudonymisation Same ID produces same hash within session, different seed = different hash, joins still work, opt-in disables hashing
AST code inspection :: with blocked packages, blocked function calls, :::, metaprogramming (eval/do.call/get), file I/O, network, shell — all rejected. Clean dplyr/ggplot2 code passes. Comments and strings containing blocked names do not trigger. Audit log records blocks.
Field policy loading All three modes (allow/redact/all), missing policy, unknown entity
Output scanning (regex) Email, NZ/AU phone, DOB patterns, clean text passthrough
Output scanning (dictionary) Known surnames/names caught in free text, short values (<3 chars) skipped, whole-word matching avoids partial hits
apply_field_policy Mock tibbles with PII columns, verify filtering
Audit logging Write entry, verify JSONL structure
Entity registry All entries map to real selmaR functions
Integration (skip without creds) End-to-end fetch + filter + scan

Verification

  1. Rscript inst/mcp/test_server.R — all security tests pass
  2. Manual test: pipe JSON-RPC requests to stdin, verify responses
  3. Configure in Claude Code, test full workflow
  4. Verify welcome message appears on connection
  5. Verify blocked code patterns are rejected with clear warnings
  6. Verify audit log is created and populated (including blocked attempts)
  7. Verify PII fields are absent from describe_entity output
  8. Verify output scanning catches emails/phones in execute_r results