ai-code-battle/notes/bf-5mkq.md
jedarden 7d38196302 Bug fix bf-5mkq: Document enrichment pipeline investigation
Investigated why all matches have enriched: false. Root cause is
corrupted R2 credentials in OpenBao that prevent the acb-enrichment
service from uploading AI commentary.

Key findings:
- R2 credentials at secret/rs-manager/ai-code-battle/r2 are corrupted
- endpoint/secret-key values are swapped
- Enrichment service cannot upload to R2
- Fix script exists but requires cluster access

Blocker: Expired kubeconfig (bf-5nap) prevents cluster access
and execution of the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 18:31:39 -04:00


Bug Fix bf-5mkq: Enrichment Pipeline Not Running - Investigation Report

Summary

All 1000 matches in production have enriched: false. The acb-enrichment service should generate AI commentary for completed matches so that the index builder can mark them enriched: true, but no match has been enriched.

Problem Analysis

Root Cause

The enrichment pipeline is not functioning due to corrupted R2 credentials in OpenBao, which prevents the acb-enrichment service from uploading AI commentary to R2.

Evidence

  1. Match index shows all matches unenriched - The data/matches/index.json file has enriched: false for all matches
  2. R2 credentials are corrupted - According to IAD-ACB-R2-CREDENTIALS-FIX.md:
    • The endpoint property contains a SHA256 hash instead of the R2 endpoint URL
    • The secret-key property contains the actual endpoint URL instead of the secret key
    • The access-key property contains a hash instead of the R2 access key ID

How Enrichment Works

  1. acb-enrichment service (Deployment) runs on a 30-minute cycle
  2. Selector finds completed matches without commentary (commentary_json IS NULL)
  3. Generator downloads replays from B2, generates AI commentary via LLM
  4. Storage client uploads commentary to R2 at commentary/{match_id}.json
  5. Index builder checks R2 for commentary files and sets enriched: true in match index

Why It's Failing

The acb-enrichment service cannot upload commentary to R2 because:

  1. Service tries to use R2 credentials from cloudflare-pages-secret Secret
  2. This Secret is synced from OpenBao via ExternalSecret
  3. The OpenBao values at secret/rs-manager/ai-code-battle/r2 are corrupted
  4. Upload fails with authentication/endpoint errors
  5. No commentary files are created in R2
  6. Index builder sees no commentary files, sets enriched: false for all matches
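The last two steps of this chain can be illustrated with a small sketch. It is not the index builder's actual code: the function name and the newline-separated key list are hypothetical, and it only mirrors the rule described above (a match is enriched when commentary/{match_id}.json exists).

```shell
# Sketch of the index builder rule: enriched == commentary object exists.
# $1: newline-separated object keys found under commentary/ (hypothetical input)
# remaining args: match IDs to evaluate
# Prints "<match_id> <true|false>" per match.
mark_enriched() {
  existing_keys=$1
  shift
  for match_id in "$@"; do
    key="commentary/${match_id}.json"
    if printf '%s\n' "$existing_keys" | grep -Fxq "$key"; then
      echo "$match_id true"
    else
      echo "$match_id false"
    fi
  done
}
```

With an empty key list, which is the situation when every upload fails, every match comes out as false regardless of how many matches are passed in.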

Diagnostic Steps

Step 1: Check acb-enrichment Deployment Status

# Requires valid kubeconfig at /home/coding/.kube/iad-acb.kubeconfig
export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig

# Check deployment
kubectl get deployment acb-enrichment -n ai-code-battle

# Check pods
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment

# Check logs
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment --tail=100

Expected findings:

  • Pod may be running but failing to upload to R2
  • Logs may show "Custom endpoint was not a valid URI" or authentication errors
  • Service may be skipping matches due to storage check failures
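To turn the log check into a quick yes/no signal, something like the following could be piped after kubectl logs. The first pattern is the message quoted above; the remaining ones are standard S3-style error codes and are only an assumption about what the service actually logs.

```shell
# Count log lines that point at an R2 problem. Reads log text on stdin.
# Prints the count; "|| true" keeps the exit status 0 when nothing matches.
count_r2_errors() {
  grep -Ec 'Custom endpoint was not a valid URI|SignatureDoesNotMatch|InvalidAccessKeyId|AccessDenied' || true
}

# Usage (requires cluster access):
# kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment --tail=100 | count_r2_errors
```

A non-zero count supports the credentials theory. A zero count does not rule it out, since the real error text may differ from these patterns.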

Step 2: Verify R2 Credentials

# Check secret values
kubectl get secret acb-r2-credentials -n ai-code-battle -o json | jq -r '.data | map_values(@base64d)'

# Check enrichment service's secret (cloudflare-pages-secret)
kubectl get secret cloudflare-pages-secret -n ai-code-battle -o json | jq -r '.data | map_values(@base64d)'

Expected findings:

  • Values will be corrupted (see IAD-ACB-R2-CREDENTIALS-FIX.md for details)
  • endpoint will be a hash instead of https://e26f015c7ba47a6ad6219385e77072b7.r2.cloudflarestorage.com
  • secret-key will be the endpoint URL instead of the actual secret key
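The commands above print the decoded secret values to the terminal. If only the shape of each value matters, a helper like this hypothetical one reports what a value looks like without echoing it:

```shell
# Describe the shape of a credential value without printing the value itself.
# Prints one of: url, hex64, hex32, other(<length>)
value_shape() {
  v=$1
  case "$v" in
    http://*|https://*) echo "url"; return 0 ;;
  esac
  if printf '%s\n' "$v" | grep -Eq '^[0-9a-fA-F]{64}$'; then
    echo "hex64"
  elif printf '%s\n' "$v" | grep -Eq '^[0-9a-fA-F]{32}$'; then
    echo "hex32"
  else
    echo "other(${#v})"
  fi
}
```

Under the corruption described in IAD-ACB-R2-CREDENTIALS-FIX.md, endpoint would report hex64 (a SHA256 hash is 64 hex characters) and secret-key would report url. A healthy endpoint reports url.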

Step 3: Check R2 for Commentary Files

# Check if any commentary files exist
curl -s "https://r2.aicodebattle.com/commentary/" | head -20

# Try to fetch a specific commentary file
curl -I "https://r2.aicodebattle.com/commentary/m_XXXXXX.json"

Expected findings:

  • No commentary files exist in R2
  • The commentary/ prefix may contain no objects yet (R2 is an object store, so there is no directory to create)
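The curl -I check can be reduced to a status code and a one-word diagnosis. This is a sketch: the function names are hypothetical, and the mapping assumes the public hostname returns plain HTTP status codes for missing or forbidden objects.

```shell
# Map an HTTP status code to a diagnosis.
interpret_status() {
  case "$1" in
    200) echo "present" ;;
    404) echo "missing" ;;
    401|403) echo "access-denied" ;;
    *) echo "unknown($1)" ;;
  esac
}

# Fetch only the status code for one commentary object (needs network access).
# $1: match ID, e.g. m_XXXXXX
commentary_status() {
  code=$(curl -s -o /dev/null -w '%{http_code}' -I "https://r2.aicodebattle.com/commentary/$1.json")
  interpret_status "$code"
}
```

A result of missing is consistent with failed uploads, but on its own it cannot distinguish "never uploaded" from "wrong match ID".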

Fix Required

Option 1: Fix Credentials in OpenBao

Follow the steps in IAD-ACB-R2-CREDENTIALS-FIX.md:

  1. Access OpenBao on rs-manager
  2. Update the secret at secret/rs-manager/ai-code-battle/r2 with correct values
  3. Force ESO to re-sync:
    kubectl annotate externalsecret acb-r2-credentials -n ai-code-battle force-sync=$(date +%s) --overwrite
    

Option 2: Fix Enrichment Service Secret Directly

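The enrichment service reads its R2 credentials from cloudflare-pages-secret, so the Secret can be patched directly. Because this Secret is synced from OpenBao via ExternalSecret, ESO may overwrite the patch on its next refresh until the OpenBao values are corrected: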

# Get correct R2 credentials from Cloudflare Dashboard
# R2 > acb-data > Settings > R2 API

# Update the secret
kubectl create secret generic cloudflare-pages-secret -n ai-code-battle \
  --from-literal=r2-endpoint="https://e26f015c7ba47a6ad6219385e77072b7.r2.cloudflarestorage.com" \
  --from-literal=r2-bucket="acb-data" \
  --from-literal=r2-access-key="<R2_ACCESS_KEY_ID>" \
  --from-literal=r2-secret-key="<R2_SECRET_ACCESS_KEY>" \
  --dry-run=client -o yaml | \
  kubectl apply -f -

# Restart enrichment service to pick up new credentials
kubectl rollout restart deployment/acb-enrichment -n ai-code-battle

Option 3: Run Fix Script

/home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh

Post-Fix Verification

1. Verify R2 Credentials

kubectl get secret cloudflare-pages-secret -n ai-code-battle -o json | jq -r '.data | map_values(@base64d)'

Expected values:

  • r2-endpoint: https://e26f015c7ba47a6ad6219385e77072b7.r2.cloudflarestorage.com
  • r2-bucket: acb-data
  • r2-access-key: 32-character access key ID
  • r2-secret-key: 64-character secret access key
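The expected values above can be checked mechanically. The sketch below is a hypothetical helper, not part of the fix script. It tests only the shapes listed here (endpoint URL form, bucket name, key lengths) and prints field names rather than values.

```shell
# Return 0 if the four values have the expected shape, non-zero otherwise.
# Prints the name of each failing field, never the value.
# Args: r2-endpoint r2-bucket r2-access-key r2-secret-key
validate_r2_secret() {
  endpoint=$1 bucket=$2 access_key=$3 secret_key=$4
  rc=0
  printf '%s\n' "$endpoint" | grep -Eq '^https://[0-9a-f]{32}\.r2\.cloudflarestorage\.com$' \
    || { echo "bad: r2-endpoint"; rc=1; }
  [ "$bucket" = "acb-data" ] || { echo "bad: r2-bucket"; rc=1; }
  [ "${#access_key}" -eq 32 ] || { echo "bad: r2-access-key"; rc=1; }
  [ "${#secret_key}" -eq 64 ] || { echo "bad: r2-secret-key"; rc=1; }
  return "$rc"
}
```

A shape check cannot confirm that the keys are valid for the bucket. Only a successful upload does that.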

2. Verify Enrichment Service

# Check pod is running
kubectl get pods -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment

# Check logs for successful enrichment
kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment --tail=50

# Look for:
# - "Enriched replay" messages
# - "commentary/{match_id}.json" upload confirmations
# - No R2 authentication errors

3. Verify Commentary Files in R2

# After next enrichment cycle (30 minutes)
curl -s "https://r2.aicodebattle.com/commentary/index.json"

# Should show entries like:
# {
#   "updated_at": "2026-05-13T...",
#   "entries": [
#     {"match_id": "m_XXXXXX", "criteria": ["upset_250", "back_and_forth"]}
#   ]
# }

4. Verify Match Index Updates

# Check data/matches/index.json for enriched: true
curl -s "https://aicodebattle.com/data/matches/index.json" | jq '.matches[] | select(.enriched == true)'

# After index builder runs (every 5 minutes), some matches should show enriched: true

5. Test Enrichment Endpoint

# Test the manual enrichment request endpoint
curl -X POST "https://api.aicodebattle.com/api/request-enrichment" \
  -H "Content-Type: application/json" \
  -d '{"match_id":"m_XXXXXX","shared_secret":"<bot_secret>"}'

# Should return:
# {
#   "status": "pending",
#   "request_id": "req_XXXXXX",
#   "match_id": "m_XXXXXX",
#   "estimated_wait_s": 300
# }

Expected Timeline

  1. Immediate (after fix):

    • Enrichment service can connect to R2
    • Commentary files start being uploaded
  2. After 30 minutes (next enrichment cycle):

    • First batch of matches enriched (up to 20/hour)
    • Commentary files appear in R2
  3. After 35 minutes (next index builder cycle):

    • Match index updated with enriched: true for enriched matches
    • Frontend shows "AI Commentary Available" badge
  4. After several hours:

    • Historical matches gradually enriched (up to 20/hour)
    • Newest completed matches enriched first
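For a rough upper bound on "several hours", divide the backlog by the rate limit. The sketch assumes every one of the 1000 matches qualifies for enrichment, which overstates the work because only matches meeting the enrichment criteria are processed.

```shell
# Hours needed to work through a backlog at a fixed hourly rate (ceiling division).
# Args: number of matches, enrichments per hour
backlog_hours() {
  matches=$1 rate=$2
  echo $(( (matches + rate - 1) / rate ))
}

backlog_hours 1000 20   # prints 50
```

So the full backlog would take at most about 50 hours at 20 enrichments per hour.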

Configuration

Enrichment Service Settings

From manifests/acb-enrichment-deployment.yml:

  • Cycle interval: 30 minutes
  • Rate limit: 20 enrichments per hour
  • Max concurrent: 3 enrichment requests
  • Min turns: 100 (matches must have 100+ turns)
  • Min crossings: 3 (win probability must cross 0.5 at least three times)
  • Upset threshold: 150 rating points
  • LLM model: gpt-4o-mini
  • Storage: R2 (preferred), B2 (fallback)

Enrichment Criteria

Matches are selected for enrichment based on:

  1. Back-and-forth: Win prob crosses 0.5 at least 3 times
  2. Upset: Lower-rated bot wins despite a rating deficit of more than 150 points
  3. Close finish: Final score difference ≤2
  4. High interest score: Composite score ≥5.0
  5. Evolution milestone: Evolved bot's first top-10 appearance
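The criteria can be read as a single predicate. The sketch below makes two assumptions not stated in this note: the criteria are alternatives (any one is enough), and the 100-turn minimum from the service settings is a gate applied first. The argument names are hypothetical.

```shell
# Return 0 if a match qualifies for enrichment, 1 otherwise.
# Args: turns crossings upset_gap score_diff interest evolved_first_top10
#   upset_gap: rating gap when the lower-rated bot won, else 0
#   score_diff: absolute final score difference
#   evolved_first_top10: 1 if an evolved bot's first top-10 appearance, else 0
qualifies() {
  awk -v turns="$1" -v crossings="$2" -v gap="$3" -v diff="$4" \
      -v interest="$5" -v evo="$6" 'BEGIN {
    if (turns < 100)     exit 1   # minimum-turns gate (assumed to apply first)
    if (crossings >= 3)  exit 0   # back-and-forth
    if (gap > 150)       exit 0   # upset
    if (diff <= 2)       exit 0   # close finish
    if (interest >= 5.0) exit 0   # high interest score
    if (evo == 1)        exit 0   # evolution milestone
    exit 1
  }'
}

# Usage: qualifies 120 3 0 10 1.0 0 && echo "enrich"
```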

Related Issues

  1. R2 Credentials Corruption (IAD-ACB-R2-CREDENTIALS-FIX.md)

    • Status: KNOWN, requires fix
    • Impact: All R2 operations fail
  2. Expired Kubeconfig (notes/bf-5nap.md)

    • Status: KNOWN, requires renewal
    • Impact: Cannot access cluster to diagnose

Files Modified

  • Created: /home/coding/ai-code-battle/notes/bf-5mkq.md (this file)

Current Status (2026-05-13)

Blocker

Expired iad-acb kubeconfig (see notes/bf-5nap.md) prevents access to the production cluster. Without cluster access, we cannot:

  • Run the fix script (fix-iad-acb-r2-credentials.sh)
  • Update OpenBao secrets
  • Restart the enrichment service
  • Verify the fix

Environment Verification

  • Local machine: No kubeconfig at ~/.kube/iad-acb.kubeconfig
  • API endpoint: api.aicodebattle.com not reachable from local environment
  • Fix script: Exists at /home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh
  • Fix documentation: Complete in IAD-ACB-R2-CREDENTIALS-FIX.md

Action Plan (when cluster access is restored)

  1. Restore cluster access (prerequisite):

    # On ex44 server
    export KUBECONFIG=/home/coding/.kube/iad-acb.kubeconfig
    kubectl cluster-info  # Verify access
    
  2. Fix R2 credentials (choose one):

    • Option A - Run fix script: /home/coding/ai-code-battle/fix-iad-acb-r2-credentials.sh
    • Option B - Manual OpenBao update: See IAD-ACB-R2-CREDENTIALS-FIX.md
    • Option C - Create SealedSecret: Bypass ESO with SealedSecret
  3. Restart enrichment service:

    kubectl rollout restart deployment/acb-enrichment -n ai-code-battle
    
  4. Verify enrichment resumes:

    • Check logs: kubectl logs -n ai-code-battle -l app.kubernetes.io/name=acb-enrichment
    • Monitor R2 for new commentary files
    • Verify enriched: true appears in match index

Expected Timeline After Fix

  • Immediate: Service can connect to R2
  • 30 minutes: First enrichment cycle runs, up to 20 matches enriched
  • 35 minutes: Index builder updates match index with enriched: true
  • Hours: Historical matches gradually enriched (20/hour rate limit)

Next Steps

This bead is blocked by the expired kubeconfig. Complete bf-5nap first to restore cluster access, then:

  1. Fix R2 credentials using the fix script
  2. Restart acb-enrichment deployment
  3. Monitor logs for successful enrichments
  4. Verify commentary files appear in R2
  5. Confirm match index updates with enriched: true
  6. Close bead with retrospective

Prevention

To prevent future enrichment pipeline failures:

  1. Monitor R2 credentials health - Alert when uploads fail
  2. Track enrichment rate - Alert if <10 enrichments/hour for 2+ hours
  3. Verify commentary directory - Check R2 for new files every hour
  4. Test enrichment endpoint - Periodic health check of /api/request-enrichment
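The rate alert in item 2 could be evaluated as follows. This is a sketch only: where the hourly counts come from (logs, metrics, the commentary index) is left open, and the function name is hypothetical.

```shell
# Given hourly enrichment counts (oldest first) as arguments, return 0 (alert)
# when each of the most recent 2 hours is below 10 enrichments.
should_alert() {
  threshold=10
  window=2
  [ "$#" -ge "$window" ] || return 1      # not enough data yet
  while [ "$#" -gt "$window" ]; do shift; done   # keep only the last $window counts
  for count in "$@"; do
    [ "$count" -lt "$threshold" ] || return 1
  done
  return 0
}

# Usage: should_alert 20 9 3 && echo "enrichment rate low"
```

The rate also drops below the threshold when few matches meet the enrichment criteria, so this alert is best paired with the upload-failure alert in item 1 rather than used alone.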