
[!CAUTION] Alpha software. This package is part of a broader effort by Ian Flores Siaca to develop proper AI infrastructure for the R ecosystem. It is under active development and should not be used in production until an official release is published. APIs may change without notice.

Benchmarking framework for guardrail accuracy in R LLM agent workflows. Evaluate guardrails against labeled datasets, compute precision/recall/F1 metrics, generate confusion matrices, compare results across iterations, and export guardrails as vitals-compatible scorers.

Why securebench?

When you build guardrails, you need to know whether they actually work. securebench gives you precision, recall, and F1 metrics for any guardrail, so you can measure how well your prompt-injection detector catches attacks without blocking legitimate queries, and compare different guardrail configurations side by side.

Features

Function                 Description
guardrail_eval()         Evaluate a guardrail against a labeled data frame
guardrail_metrics()      Compute precision, recall, F1, and accuracy
guardrail_confusion()    Generate a 2x2 confusion matrix
guardrail_compare()      Compare two guardrails with delta metrics and per-case diffs
guardrail_report()       Print a formatted report or return results as a data frame
benchmark_guardrail()    Quick-start: benchmark from positive/negative case vectors
benchmark_pipeline()     Evaluate a full secureguard pipeline end-to-end
as_vitals_scorer()       Convert any guardrail to a vitals-compatible scorer function

Part of the secure-r-dev Ecosystem

securebench is part of a 7-package ecosystem for building governed AI agents in R:

                    ┌─────────────┐
                    │   securer   │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │ securetools │ │ secureguard │ │securecontext│
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           └───────────────┼───────────────┘
                    ┌──────▼──────┐
                    │  orchestr   │
                    └──────┬──────┘
           ┌───────────────┴───────────────┐
           │                               │
    ┌──────▼──────┐                 ┌──────▼──────┐
    │ securetrace │                 │ securebench │
    └─────────────┘                 └─────────────┘

securebench sits at the bottom of the stack alongside securetrace. It benchmarks guardrail accuracy by evaluating secureguard guardrails (or any boolean classifier) against labeled datasets, producing precision/recall/F1 metrics and confusion matrices.
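The metrics themselves are the standard binary-classification formulas. The sketch below reproduces them in plain base R, independent of securebench (the variable names and the convention that `FALSE` means "block" are illustrative assumptions, not the package's API):

```r
# Hypothetical labeled results: TRUE = "allow", FALSE = "block".
expected  <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE, TRUE)

# Treat "block" (FALSE) as the positive class, since catching
# attacks is what we want to measure.
tp <- sum(!predicted & !expected)  # attacks correctly blocked
fp <- sum(!predicted &  expected)  # benign inputs wrongly blocked
fn <- sum( predicted & !expected)  # attacks that slipped through

precision <- tp / (tp + fp)  # of everything blocked, how much was an attack
recall    <- tp / (tp + fn)  # of all attacks, how many were blocked
f1        <- 2 * precision * recall / (precision + recall)
```

With these toy vectors, precision, recall, and F1 all come out to 2/3: one benign input was wrongly blocked and one attack slipped through.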

Package        Role
securer        Sandboxed R execution with tool-call IPC
securetools    Pre-built security-hardened tool definitions
secureguard    Input/code/output guardrails (injection, PII, secrets)
orchestr       Graph-based agent orchestration
securecontext  Document chunking, embeddings, RAG retrieval
securetrace    Structured tracing, token/cost accounting, JSONL export
securebench    Guardrail benchmarking with precision/recall/F1 metrics

Installation

# install.packages("pak")
pak::pak("ian-flores/securebench")

Quick Start

library(securebench)

# Benchmark a guardrail with known positive/negative cases
my_guardrail <- function(text) !grepl("DROP TABLE", text, fixed = TRUE)

metrics <- benchmark_guardrail(
  my_guardrail,
  positive_cases = c("DROP TABLE users", "SELECT 1; DROP TABLE x"),
  negative_cases = c("SELECT * FROM users", "Hello world")
)
metrics$precision
metrics$recall
metrics$f1

Data Frame API

data <- data.frame(
  input = c("normal text", "DROP TABLE users"),
  expected = c(TRUE, FALSE),
  label = c("benign", "injection")
)

result <- guardrail_eval(my_guardrail, data)
m <- guardrail_metrics(result)
cm <- guardrail_confusion(result)
guardrail_report(result)
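A 2x2 confusion matrix is just a cross-tabulation of expected versus predicted outcomes. The same structure can be illustrated with base R's `table()` (a sketch with made-up data; these variable names are not the securebench API):

```r
# Hypothetical expected labels and guardrail verdicts (TRUE = allow).
expected  <- c(TRUE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE)

# Rows are expected outcomes, columns are predicted outcomes.
cm <- table(expected = expected, predicted = predicted)
cm
```

Here the off-diagonal `expected = TRUE, predicted = FALSE` cell counts benign inputs the guardrail wrongly blocked.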

Vitals Interop

scorer <- as_vitals_scorer(my_guardrail)
scorer("safe query", TRUE)    # 1 (correct)
scorer("DROP TABLE x", FALSE) # 1 (correct)
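One plausible way such a wrapper could work, consistent with the 1-for-correct comments above, is to score whether the guardrail's verdict matches the expected label. This is only a base-R sketch of the idea, not the actual securebench implementation:

```r
# Sketch of a vitals-style scorer wrapper (hypothetical, not the
# securebench internals): returns 1 when the guardrail's boolean
# verdict matches the expected label, 0 otherwise.
as_scorer_sketch <- function(guardrail) {
  function(input, expected) {
    as.integer(identical(guardrail(input), expected))
  }
}

my_guardrail <- function(text) !grepl("DROP TABLE", text, fixed = TRUE)
scorer <- as_scorer_sketch(my_guardrail)
scorer("safe query", TRUE)     # 1: guardrail allows, as expected
scorer("DROP TABLE x", FALSE)  # 1: guardrail blocks, as expected
```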

Comparing Guardrails

# Define test data
data <- data.frame(
  input = c("hello", "how are you?", "DROP TABLE users", "'; DELETE FROM accounts"),
  expected = c(TRUE, TRUE, FALSE, FALSE),
  label = c("benign", "benign", "injection", "injection")
)

# Two guardrail versions to compare
guard_v1 <- function(text) !grepl("DROP", text, fixed = TRUE)
guard_v2 <- function(text) !grepl("DROP|DELETE", text)

# Evaluate both against the same dataset
result_v1 <- guardrail_eval(guard_v1, data)
result_v2 <- guardrail_eval(guard_v2, data)

# Compare: see which improved, which regressed
diff <- guardrail_compare(result_v1, result_v2)
diff$delta_f1       # positive = v2 is better
diff$improved       # cases v2 got right that v1 missed
diff$regressed      # cases v2 got wrong that v1 had right
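The improved/regressed semantics can be reproduced by hand against the same data, which makes them concrete (a base-R sketch under the assumption that "improved" means cases v2 classifies correctly where v1 did not, and vice versa for "regressed"):

```r
# Same toy dataset as above (TRUE = benign/allow).
data <- data.frame(
  input = c("hello", "how are you?", "DROP TABLE users", "'; DELETE FROM accounts"),
  expected = c(TRUE, TRUE, FALSE, FALSE)
)
guard_v1 <- function(text) !grepl("DROP", text, fixed = TRUE)
guard_v2 <- function(text) !grepl("DROP|DELETE", text)

# Apply each guardrail to every input.
pred_v1 <- vapply(data$input, guard_v1, logical(1))
pred_v2 <- vapply(data$input, guard_v2, logical(1))

correct_v1 <- pred_v1 == data$expected
correct_v2 <- pred_v2 == data$expected

improved  <- data$input[correct_v2 & !correct_v1]  # v2 fixed these cases
regressed <- data$input[correct_v1 & !correct_v2]  # v2 broke these cases
```

With these inputs, v2 catches the DELETE injection that v1 misses, so `improved` contains `"'; DELETE FROM accounts"` and `regressed` is empty.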

Documentation

securebench ships with two vignettes:

  • Getting Started with securebench – walkthrough of the core evaluation workflow
  • Guardrail Testing Patterns – strategies for building labeled datasets and iterating on guardrail accuracy

Browse the full documentation at https://ian-flores.github.io/securebench/.

Contributing

Contributions are welcome! Please file issues on GitHub and submit pull requests.

License

MIT