Operations Runbook: Accepted Risk Monitoring & Response (Community, Pro)

This runbook covers two accepted risks that require ongoing operational awareness: DataGuard connection termination and AI prompt injection residual risk. Both are intentional design decisions with known trade-offs.


DataGuard monitors per-query response size and rolling-window transfer volume. When either limit is exceeded, the connection is blocked for the remainder of its lifetime, and the client must reconnect to continue. The client receives:

FATAL: Querycop: data guard: response size NNN bytes exceeds limit NNN bytes (reconnect to continue)

This is not a bug. It is a security-first design to prevent data exfiltration via repeated large queries within a single connection.
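A client can recognize this condition by matching the error text. The following is a minimal sketch, assuming the message keeps the shape shown above; the function names are hypothetical and the pattern covers only the per-query response-size message, since the wording of a rolling-window violation is not shown here.

```python
import re
from typing import Optional, Tuple

# Pattern modeled on the error text quoted in this runbook.
# Assumption: the real message keeps this wording and uses decimal byte counts.
_DATA_GUARD_RE = re.compile(
    r"data guard: response size (\d+) bytes exceeds limit (\d+) bytes"
)


def parse_data_guard_error(message: str) -> Optional[Tuple[int, int]]:
    """Return (response_bytes, limit_bytes) for a DataGuard violation, else None."""
    match = _DATA_GUARD_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_data_guard_violation(message: str) -> bool:
    """True when the connection was terminated by DataGuard and must be reopened."""
    return parse_data_guard_error(message) is not None
```

An application that detects this error should open a new connection but avoid blindly replaying the same query, since an immediate retry would hit the limit again and contribute to a reconnect storm.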

Metrics to watch:

  • Audit log events with event_type = "data_guard_violation"
  • WebSocket events of type query.data_guard_violation
  • Sudden increase in connection count (reconnect storms)

Alerting thresholds (suggested):

  • 5 violations per hour from same db_user -> investigate
  • 20 violations per hour across all users -> possible misconfiguration
  • Reconnect rate > 10x normal -> possible reconnect loop

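The suggested thresholds above can be checked mechanically. This is a sketch under stated assumptions: the input is the list of db_user values from the last hour of data_guard_violation events, the reconnect rates are supplied by whatever connection metrics are available, and counts are compared with "at least". The function and constant names are hypothetical, not part of Querycop.

```python
from collections import Counter
from typing import Iterable, List

PER_USER_LIMIT = 5      # violations per hour from the same db_user
GLOBAL_LIMIT = 20       # violations per hour across all users
RECONNECT_FACTOR = 10   # reconnect rate relative to the normal baseline


def evaluate_data_guard_alerts(
    violating_users: Iterable[str],
    reconnect_rate: float,
    baseline_reconnect_rate: float,
) -> List[str]:
    """Return human-readable alerts for one hour of violation events."""
    users = list(violating_users)
    alerts: List[str] = []

    # Per-user threshold: one entry in `users` per violation event.
    for user, count in sorted(Counter(users).items()):
        if count >= PER_USER_LIMIT:
            alerts.append(f"investigate db_user {user}: {count} violations in the last hour")

    # Global threshold across all users.
    if len(users) >= GLOBAL_LIMIT:
        alerts.append(f"possible misconfiguration: {len(users)} violations in the last hour")

    # Reconnect storm: only meaningful when a baseline is known.
    if baseline_reconnect_rate > 0 and reconnect_rate > RECONNECT_FACTOR * baseline_reconnect_rate:
        alerts.append("possible reconnect loop: reconnect rate exceeds 10x baseline")

    return alerts
```
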
  1. Check the violating user and query

    GET /audit?type=data_guard_violation&limit=10
  2. Determine if it is legitimate usage or exfiltration

    • Legitimate: analytics export, large JOIN, reporting query
    • Suspicious: SELECT * without WHERE, bulk dump pattern, unfamiliar user
  3. If legitimate:

    • Increase GATEKEEPER_MAX_RESPONSE_MB for the specific workload
    • Consider per-role DataGuard overrides (future feature)
    • Advise the application to use pagination (LIMIT/OFFSET)
  4. If suspicious:

    • Do NOT increase limits
    • Check if the user should have access to this data
    • Review RBAC policy for the user’s role
    • Consider temporary Break-Glass revocation or policy tightening
    • Escalate to security team if bulk data access is confirmed
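For the pagination advice in step 3, a minimal LIMIT/OFFSET sketch follows. It uses an in-memory SQLite database only so the example is self-contained; the events table and the fetch_in_pages helper are hypothetical. Pagination keeps each individual response small, but every page presumably still counts toward the rolling-window limit.

```python
import sqlite3
from typing import Iterator, List, Tuple


def fetch_in_pages(conn: sqlite3.Connection, page_size: int) -> Iterator[List[Tuple]]:
    """Yield rows from the hypothetical `events` table one page at a time."""
    offset = 0
    while True:
        # A stable ORDER BY is required for OFFSET pagination to be consistent.
        rows = conn.execute(
            "SELECT id, payload FROM events ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += len(rows)
```

A caller would iterate with `for page in fetch_in_pages(conn, 1000): ...` and size the page so that one page stays well below GATEKEEPER_MAX_RESPONSE_MB.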
| Variable | Default | Description |
| --- | --- | --- |
| GATEKEEPER_MAX_RESPONSE_MB | 100 | Max single query response (MB) |
| GATEKEEPER_MAX_WINDOW_MB | 500 | Max transfer per 60-second window (MB) |

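To make the interaction of the two limits concrete, here is a rough model of the behavior described in this runbook. It is not Querycop's implementation: the class name is hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and the window is assumed to slide rather than reset at fixed intervals.

```python
from collections import deque
from typing import Deque, Tuple

MB = 1024 * 1024  # assumption: binary megabytes


class DataGuardSketch:
    """Toy model: per-query limit plus a sliding transfer window per connection."""

    def __init__(self, max_response_mb: int = 100, max_window_mb: int = 500,
                 window_seconds: float = 60.0) -> None:
        self.max_response_bytes = max_response_mb * MB
        self.max_window_bytes = max_window_mb * MB
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, bytes)
        self._window_total = 0
        self.blocked = False

    def record(self, now: float, response_bytes: int) -> bool:
        """Return True if the response is allowed, False if the connection is blocked."""
        if self.blocked:
            return False  # once blocked, the connection stays blocked
        # Drop transfers that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, size = self._events.popleft()
            self._window_total -= size
        if response_bytes > self.max_response_bytes:
            self.blocked = True
            return False
        if self._window_total + response_bytes > self.max_window_bytes:
            self.blocked = True
            return False
        self._events.append((now, response_bytes))
        self._window_total += response_bytes
        return True
```

With the defaults, five 100 MB responses inside one minute fit the window exactly, and any further transfer in that minute would block the connection in this model.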
  • If violation count exceeds alerting threshold: page on-call DBA
  • If suspected exfiltration: escalate to security incident process
  • If legitimate workload consistently hits limits: file capacity planning ticket

SQL queries are sent to an LLM for risk scoring. An attacker who controls SQL content (e.g., via application-level SQL injection) may attempt to manipulate the LLM’s analysis. Querycop has multiple defense layers, but prompt injection is inherently unsolvable at the LLM level.

| Layer | Mechanism | What it prevents |
| --- | --- | --- |
| Comment stripping | SanitizeSQLForAnalysis | -- Ignore instructions |
| System prompt | Anti-injection instructions | {"score":0} embedded in SQL |
| Server-side override | Destructive keyword check | DELETE/DROP with score < 10 -> force score 50 |
| Threshold enforcement | ShouldAutoApprove | Server decides, not AI text |
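The layers above can be illustrated with a small sketch. It is modeled on the table, not on Querycop's source: the function names are hypothetical Python stand-ins for SanitizeSQLForAnalysis and ShouldAutoApprove, the keyword list is limited to DELETE and DROP as listed in the table, and a strict "score below threshold" comparison is assumed so that a threshold of 0 never auto-approves.

```python
import re
from typing import Tuple

DESTRUCTIVE_RE = re.compile(r"\b(DELETE|DROP)\b", re.IGNORECASE)
OVERRIDE_MARKER = "[score overridden by safety check]"


def strip_sql_comments(sql: str) -> str:
    """Naive comment stripping; does not handle comment markers inside string literals."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)  # block comments
    return re.sub(r"--[^\n]*", " ", sql)                   # line comments


def apply_safety_override(sql: str, ai_score: int, ai_reason: str) -> Tuple[int, str]:
    """Force score 50 when the AI scores a destructive statement below 10."""
    if DESTRUCTIVE_RE.search(strip_sql_comments(sql)) and ai_score < 10:
        return 50, f"{ai_reason} {OVERRIDE_MARKER}"
    return ai_score, ai_reason


def should_auto_approve(score: int, threshold: int) -> bool:
    """The server decides from the numeric score only, never from AI text."""
    return score < threshold
```

The point of the design is that injected text can at most lower the AI's score; the keyword check and the threshold comparison run on the server and do not read the model's prose.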

Metrics to watch:

  • AI score override events: search audit log for [score overridden by safety check] in risk_reason
  • Suspiciously low scores for destructive queries (< 10 for DELETE/UPDATE/DDL)
  • AI error rate (provider timeouts, parse failures)
  • Unusual risk_reason text patterns (very long, contains JSON-like content, contains English instructions)
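The last item can be approximated with simple text heuristics. The sketch below is illustrative only: the length cutoff, the regular expressions, and the flag names are hypothetical and would need tuning against real audit data.

```python
import re
from typing import List

MAX_REASON_LENGTH = 500  # hypothetical cutoff for "very long"
JSON_LIKE_RE = re.compile(r"\{[^{}]*:[^{}]*\}")
INSTRUCTION_RE = re.compile(
    r"\b(ignore|disregard|override)\b.*\b(instructions?|rules?|prompt)\b",
    re.IGNORECASE | re.DOTALL,
)


def risk_reason_flags(reason: str) -> List[str]:
    """Return heuristic flags for a risk_reason string; an empty list means nothing unusual."""
    flags: List[str] = []
    if len(reason) > MAX_REASON_LENGTH:
        flags.append("too_long")
    if JSON_LIKE_RE.search(reason):
        flags.append("json_like")
    if INSTRUCTION_RE.search(reason):
        flags.append("instruction_like")
    return flags
```

A flag is a prompt for a human to look at the underlying SQL, not evidence of an attack on its own.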

Alerting thresholds (suggested):

  • Score override > 3 per hour -> investigate SQL content
  • AI error rate > 10% -> check provider status
  • Same query pattern receiving wildly different scores -> possible adversarial probing

  1. Check recent AI analysis results

    GET /audit?limit=20

    Look for entries with risk_score and risk_reason.

  2. If score override is firing frequently:

    • The override means the AI returned a low score for a destructive query
    • This could be prompt injection or just an AI misjudgment
    • Review the actual SQL queries that triggered the override
    • If queries contain natural language text mixed with SQL: likely injection attempt
  3. If AI provider is returning errors:

    • Check provider status page (OpenAI, Anthropic, etc.)
    • Queries will pass through without AI scoring when AI is unavailable
    • Consider temporarily setting auto_approve_threshold: 0 to require human approval for all destructive queries
  4. If you suspect active adversarial probing:

    • Lower auto_approve_threshold to 0 (all destructive queries require human)
    • Review Slack/webhook notifications for unusual patterns
    • Check if the SQL source application has a SQL injection vulnerability
    • The attacker may be exploiting the application, not Querycop directly

The risk_reason field from AI analysis is untrusted text. It appears in:

  • Slack notifications (escaped via escapeSlackMrkdwn)
  • Dashboard UI (escaped via escH)
  • Audit log (stored as-is)

Operators should treat risk_reason as advisory context, not as a reliable classification. The authoritative signal is the numeric risk_score after server-side override.
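Because the audit log stores risk_reason as-is, any tool that renders it elsewhere needs its own escaping. As an illustration of the kind of escaping involved, Slack's message formatting requires the three control characters &, <, and > to be replaced with HTML entities. The helper below is a sketch with a hypothetical name; it is not the escapeSlackMrkdwn function itself, which may escape more than this.

```python
def escape_slack_text(text: str) -> str:
    """Escape Slack control characters so untrusted text cannot form mentions or links."""
    # The ampersand must be replaced first, otherwise the entities
    # produced by the later replacements would be escaped a second time.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )
```

For example, an injected `<!channel>` in risk_reason becomes `&lt;!channel&gt;` and is displayed literally instead of notifying the channel.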

Fallback: disable AI and require human approval

If AI scoring becomes unreliable:

# Remove AI provider (disables AI scoring entirely)
unset AI_API_KEY
# Set all destructive queries to require human approval
# (via policy: set auto_approve_threshold to 0 for all roles)

Without AI, Querycop still blocks destructive queries and requires human approval. AI scoring is an additional signal, not the sole gate.

  • AI scoring consistently wrong: file issue with AI provider and adjust thresholds
  • Active adversarial probing confirmed: escalate to security incident
  • Provider outage > 1 hour: consider switching to backup provider or disabling AI

| Date | Change |
| --- | --- |
| 2026-04-01 | Initial runbook creation |