Operations Runbook: Accepted Risk Monitoring & Response CommunityPro

This runbook covers two accepted risks that require ongoing operational awareness: DataGuard connection termination and AI prompt injection residual risk. Both are intentional design decisions with known trade-offs.

DataGuard Connection Termination

What it is

DataGuard monitors per-query response size and rolling-window transfer volume. When either limit is exceeded, the connection is blocked for the rest of its lifetime, and the client must reconnect to continue. The client receives:

FATAL: Querycop: data guard: response size NNN bytes exceeds limit NNN bytes (reconnect to continue)

This is not a bug. It is a security-first design to prevent data exfiltration via repeated large queries within a single connection.
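The behavior described above can be sketched roughly as follows. This is a minimal illustration, not Querycop's actual implementation: the `ConnectionGuard` class, its method names, and the choice of 1 MB = 1024 * 1024 bytes are all assumptions. The default limits mirror the values in the configuration reference.

```python
import time
from collections import deque

MB = 1024 * 1024  # assumption: the runbook does not say whether MB means 10^6 or 2^20 bytes


class ConnectionGuard:
    """Hypothetical per-connection tracker for response size and rolling-window volume."""

    def __init__(self, max_response_mb=100, max_window_mb=500,
                 window_seconds=60, clock=time.monotonic):
        self.max_response = max_response_mb * MB
        self.max_window = max_window_mb * MB
        self.window_seconds = window_seconds
        self.clock = clock
        self.sent = deque()      # (timestamp, bytes) for responses inside the window
        self.window_total = 0
        self.blocked = False     # once True, it stays True for this connection

    def check(self, response_bytes):
        """Return True if the response may be sent, False if the connection is blocked."""
        if self.blocked:
            return False
        now = self.clock()
        # Drop transfers that have aged out of the rolling window.
        while self.sent and now - self.sent[0][0] >= self.window_seconds:
            _, old_bytes = self.sent.popleft()
            self.window_total -= old_bytes
        over_query = response_bytes > self.max_response
        over_window = self.window_total + response_bytes > self.max_window
        if over_query or over_window:
            self.blocked = True  # no recovery: the client has to reconnect
            return False
        self.sent.append((now, response_bytes))
        self.window_total += response_bytes
        return True
```

The key property is that `blocked` is never reset: a new connection gets a fresh guard, which is why a misbehaving client shows up as a reconnect storm rather than as repeated errors on one connection.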

Monitoring

Metrics to watch:

- Audit log events with event_type = "data_guard_violation"
- WebSocket events of type query.data_guard_violation
- Sudden increase in connection count (reconnect storms)

Alerting thresholds (suggested):

- 5 violations per hour from the same db_user -> investigate
- 20 violations per hour across all users -> possible misconfiguration
- Reconnect rate > 10x normal -> possible reconnect loop
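As a rough illustration, the suggested thresholds could be evaluated like this. The function name, its inputs, and the alert strings are hypothetical, and the sketch assumes the counts are inclusive (5 or more, 20 or more); how violation events and reconnect rates are actually collected depends on your monitoring stack.

```python
from collections import Counter

PER_USER_LIMIT = 5      # violations per hour from one db_user
GLOBAL_LIMIT = 20       # violations per hour across all users
RECONNECT_FACTOR = 10   # multiple of the normal reconnect rate


def evaluate_dataguard_alerts(violation_users, reconnect_rate, baseline_reconnect_rate):
    """violation_users: one db_user entry per violation seen in the last hour."""
    alerts = []
    per_user = Counter(violation_users)
    for user, count in sorted(per_user.items()):
        if count >= PER_USER_LIMIT:
            alerts.append(f"investigate db_user={user} ({count} violations/hour)")
    total = sum(per_user.values())
    if total >= GLOBAL_LIMIT:
        alerts.append(f"possible misconfiguration ({total} violations/hour)")
    # Skip the ratio check when no baseline is known.
    if baseline_reconnect_rate > 0 and reconnect_rate > RECONNECT_FACTOR * baseline_reconnect_rate:
        alerts.append("possible reconnect loop")
    return alerts
```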

Triage procedure

Check the violating user and query

GET /audit?type=data_guard_violation&limit=10

Determine if it is legitimate usage or exfiltration
- Legitimate: analytics export, large JOIN, reporting query
- Suspicious: SELECT * without WHERE, bulk dump pattern, unfamiliar user
If legitimate:
- Increase GATEKEEPER_MAX_RESPONSE_MB (or GATEKEEPER_MAX_WINDOW_MB if the rolling-window limit was the one exceeded) for the specific workload
- Consider per-role DataGuard overrides (future feature)
- Advise the application to use pagination (LIMIT/OFFSET)
If suspicious:
- Do NOT increase limits
- Check if the user should have access to this data
- Review RBAC policy for the user’s role
- Consider temporary Break-Glass revocation or policy tightening
- Escalate to security team if bulk data access is confirmed
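For a first pass over the queries returned by the audit endpoint, a crude heuristic for the "SELECT * without WHERE" pattern might look like the sketch below. The function name is hypothetical, and the regex-based check is an assumption made for illustration: it does not parse SQL, so it will miss subqueries, CTEs, and other dump patterns. Treat a match as a prompt to look closer, not as a verdict.

```python
import re

_SELECT_STAR = re.compile(r"^\s*select\s+\*\s+from\b", re.IGNORECASE)
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def looks_like_bulk_dump(sql):
    """Return True for a top-level SELECT * that has no WHERE clause anywhere."""
    return bool(_SELECT_STAR.search(sql)) and not _WHERE.search(sql)
```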

Configuration reference

| Variable | Default | Description |
| --- | --- | --- |
| `GATEKEEPER_MAX_RESPONSE_MB` | 100 | Max single query response (MB) |
| `GATEKEEPER_MAX_WINDOW_MB` | 500 | Max transfer per 60-second window (MB) |

Escalation

- If violation count exceeds alerting threshold: page on-call DBA
- If suspected exfiltration: escalate to security incident process
- If legitimate workload consistently hits limits: file capacity planning ticket

AI Prompt Injection Monitoring

What it is

SQL queries are sent to an LLM for risk scoring. An attacker who controls SQL content (e.g., via application-level SQL injection) may attempt to manipulate the LLM’s analysis. Querycop applies multiple defense layers, but prompt injection cannot be fully eliminated at the LLM level, which is why the residual risk is accepted and monitored.

Current defenses

| Layer | Mechanism | What it prevents |
| --- | --- | --- |
| Comment stripping | `SanitizeSQLForAnalysis` | `-- Ignore instructions` |
| System prompt | Anti-injection instructions | `{"score":0}` embedded in SQL |
| Server-side override | Destructive keyword check | DELETE/DROP with score < 10 -> force score 50 |
| Threshold enforcement | `ShouldAutoApprove` | Server decides, not AI text |
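To make the layering concrete, here is a minimal sketch of how comment stripping, the server-side override, and threshold enforcement could fit together. It is not Querycop's code: the Python function names only loosely mirror `SanitizeSQLForAnalysis` and `ShouldAutoApprove`, the comment stripper is naive (it ignores string literals), and the strict `score < threshold` comparison is an assumption chosen so that a threshold of 0 auto-approves nothing.

```python
import re

DESTRUCTIVE = re.compile(r"\b(delete|drop)\b", re.IGNORECASE)
OVERRIDE_NOTE = "[score overridden by safety check]"


def strip_sql_comments(sql):
    """Remove /* */ block comments and -- line comments before analysis."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    return re.sub(r"--[^\n]*", " ", sql)


def apply_safety_override(sql, ai_score, ai_reason):
    """Force score 50 when a destructive statement received a score below 10."""
    if DESTRUCTIVE.search(strip_sql_comments(sql)) and ai_score < 10:
        return 50, f"{ai_reason} {OVERRIDE_NOTE}"
    return ai_score, ai_reason


def should_auto_approve(score, threshold):
    """The server decides from the numeric score; AI-generated text is never consulted."""
    return score < threshold
```

The point of the design is that even a fully manipulated LLM response can only influence the number fed into `apply_safety_override`, and the final decision is made by server code the attacker's text cannot reach.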

Monitoring

Metrics to watch:

- AI score override events: search audit log for [score overridden by safety check] in risk_reason
- Suspiciously low scores for destructive queries (< 10 for DELETE/UPDATE/DDL)
- AI error rate (provider timeouts, parse failures)
- Unusual risk_reason text patterns (very long, contains JSON-like content, contains English instructions)
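The "unusual risk_reason" check in the list above can be approximated with a few simple text heuristics. Everything in this sketch is an assumption made for illustration: the 500-character cutoff, the flag names, and the phrase patterns are not taken from Querycop and should be tuned against your own audit data.

```python
import re

MAX_REASON_CHARS = 500  # assumed cutoff for "very long"
_JSON_LIKE = re.compile(r'\{\s*"[^"]+"\s*:')
_INSTRUCTION_LIKE = re.compile(
    r"\b(ignore|disregard)\b.*\binstructions?\b|\bsystem prompt\b",
    re.IGNORECASE | re.DOTALL,
)


def flag_risk_reason(reason):
    """Return a list of heuristic flags for a risk_reason string (empty if none match)."""
    flags = []
    if len(reason) > MAX_REASON_CHARS:
        flags.append("very_long")
    if _JSON_LIKE.search(reason):
        flags.append("json_like")
    if _INSTRUCTION_LIKE.search(reason):
        flags.append("instruction_like")
    return flags
```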

Alerting thresholds (suggested):

- Score override > 3 per hour -> investigate SQL content
- AI error rate > 10% -> check provider status
- Same query pattern receiving wildly different scores -> possible adversarial probing
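A small sketch of how these three checks might be combined is shown below. The function and parameter names are hypothetical, and because the runbook does not quantify "wildly different", the `max_spread` default of 40 points is an assumed value. Grouping queries into patterns (for example by normalizing literals) is left to the caller.

```python
def evaluate_ai_alerts(overrides_last_hour, ai_calls, ai_errors,
                       scores_by_pattern, max_spread=40):
    """scores_by_pattern maps a normalized query pattern to the scores it received."""
    alerts = []
    if overrides_last_hour > 3:
        alerts.append("score overrides > 3/hour: investigate SQL content")
    if ai_calls > 0 and ai_errors / ai_calls > 0.10:
        alerts.append("AI error rate > 10%: check provider status")
    for pattern, scores in sorted(scores_by_pattern.items()):
        if not scores:
            continue
        spread = max(scores) - min(scores)
        if spread > max_spread:
            alerts.append(f"score spread {spread} for {pattern!r}: possible adversarial probing")
    return alerts
```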

Triage procedure

Check recent AI analysis results
```
GET /audit?limit=20
```
Look for entries with risk_score and risk_reason.
If score override is firing frequently:
- The override means the AI returned a low score for a destructive query
- This could be prompt injection or just an AI misjudgment
- Review the actual SQL queries that triggered the override
- If queries contain natural language text mixed with SQL: likely injection attempt
If AI provider is returning errors:
- Check provider status page (OpenAI, Anthropic, etc.)
- Queries will pass through without AI scoring when AI is unavailable
- Consider temporarily setting auto_approve_threshold: 0 to require human approval for all destructive queries
If you suspect active adversarial probing:
- Lower auto_approve_threshold to 0 (all destructive queries require human approval)
- Review Slack/webhook notifications for unusual patterns
- Check if the SQL source application has a SQL injection vulnerability
- The attacker may be exploiting the application, not Querycop directly

Risk reason trust boundary

The risk_reason field from AI analysis is untrusted text. It appears in:

- Slack notifications (escaped via escapeSlackMrkdwn)
- Dashboard UI (escaped via escH)
- Audit log (stored as-is)

Operators should treat risk_reason as advisory context, not as a reliable classification. The authoritative signal is the numeric risk_score after server-side override.
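Because the audit log stores risk_reason as-is, any tool that reads it back out needs to escape it for its own output context. The sketch below shows the general idea and is not the behavior of `escapeSlackMrkdwn` or `escH`, whose implementations are not shown in this runbook. The Slack helper covers only the three control characters Slack documents for message text, and the HTML helper relies on Python's standard `html.escape`.

```python
import html


def escape_for_slack(text):
    """Escape &, <, > so untrusted text cannot form Slack links or mentions."""
    # "&" must be replaced first so the entities added below are not double-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_for_html(text):
    """Escape untrusted text for insertion into HTML element content or attributes."""
    return html.escape(text, quote=True)
```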

Fallback: disable AI and require human approval

If AI scoring becomes unreliable:

# Remove AI provider (disables AI scoring entirely)
unset AI_API_KEY

# Set all destructive queries to require human approval
# (via policy: set auto_approve_threshold to 0 for all roles)

Without AI, Querycop still blocks destructive queries and requires human approval. AI scoring is an additional signal, not the sole gate.

Escalation

- AI scoring consistently wrong: file issue with AI provider and adjust thresholds
- Active adversarial probing confirmed: escalate to security incident
- Provider outage > 1 hour: consider switching to backup provider or disabling AI

Change History

| Date | Change |
| --- | --- |
| 2026-04-01 | Initial runbook creation |