Docker and Logs Review - Gitinext-Golang (1-Month Analysis)
Date: 2025-01-31
Scope: Docker Swarm stack, service logs, and codebase error patterns.
1. What Was Reviewed
- Docker: Dockerfile, docker-compose.swarm.yml, stack services and env vars.
- Logs: No live log files in the repo; this doc explains how to fetch them and what to look for.
- Code: Startup panic/Fatal patterns, error logging, known issues from CRITICAL_ISSUES_REPORT.md, monitoring alerts, config mismatches.
2. How to Fetch Logs (Docker Swarm)
Stack name: gitinext-golang.
docker service ls | grep gitinext-golang
make aux-logs SVC=gateway
make aux-logs SVC=wallet
make aux-logs SVC=payment
make aux-logs SVC=watcher-ton
make aux-logs SVC=watcher-tron
make aux-logs SVC=telegrambot
docker service logs --tail 500 gitinext-golang_gateway
docker service logs --tail 2000 gitinext-golang_gateway 2>&1 | grep -iE 'error|panic|fatal|failed'
docker service logs --tail 2000 gitinext-golang_watcher-ton 2>&1 | grep -iE 'error|circuit|RPC|402|429|422'
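The per-service commands above can be wrapped in a small loop. This is a minimal sketch, not part of the repo: the function names are hypothetical, and it assumes every Swarm service follows the gitinext-golang_<svc> naming shown above.

```shell
#!/usr/bin/env bash
# Count lines on stdin that match the error pattern used in the commands above.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit code clean.
count_error_lines() {
  grep -ciE 'error|panic|fatal|failed' || true
}

# Print "<service> <count>" for each service, based on the last 2000 log lines.
# Assumption: service names are gitinext-golang_<svc>.
scan_stack_errors() {
  local stack="gitinext-golang" svc
  for svc in gateway wallet payment watcher-ton watcher-tron telegrambot; do
    printf '%s %s\n' "$svc" \
      "$(docker service logs --tail 2000 "${stack}_${svc}" 2>&1 | count_error_lines)"
  done
}
```

Running scan_stack_errors on a manager node gives a quick ranking of which service to read first; the count is only a hint, since health-check noise and benign "failed" messages also match.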
3. Known Issues
3.1 RPC / Watchers (see docs/CRITICAL_ISSUES_REPORT.md)
TON/TRON watchers: GetBlock 402, NowNodes 422/404, public 429/405. Circuit breakers open; no deposit detection. Fix: renew GetBlock/NowNodes, add TonCenter/TronGrid fallbacks, restart watchers.
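To triage the status codes listed above, a small helper can map each code to a likely next step. This is a sketch under assumptions: the function names are hypothetical, the mapping follows generic HTTP semantics rather than any provider's documentation, and the probe sends a plain GET, which a JSON-RPC endpoint may itself reject with 405.

```shell
#!/usr/bin/env bash
# Map an HTTP status code from an RPC provider to a suggested next step.
classify_rpc_status() {
  case "$1" in
    200)         echo "ok" ;;
    402)         echo "payment required: renew the provider plan" ;;
    429)         echo "rate limited: slow down or add a fallback provider" ;;
    404|405|422) echo "request rejected: check endpoint URL, method, and API key" ;;
    *)           echo "unexpected status $1" ;;
  esac
}

# Probe one endpoint and classify the result.
# The URL is supplied by the caller; nothing here is taken from the repo.
probe_rpc() {
  local url="$1" code
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")"
  printf '%s %s: %s\n' "$url" "$code" "$(classify_rpc_status "$code")"
}
```

If curl cannot connect at all it reports code 000, which falls through to the "unexpected status" branch.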
3.2 RabbitMQ URL - Telegrambot (fixed in repo)
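Telegrambot default was unencoded (gitinext@rabbitmq). In an AMQP URL, an @ inside the username or password must be percent-encoded (%40); otherwise it is read as the separator between credentials and host. The ! is percent-encoded here as well (%21), which is always safe. Default updated to amqp://gitinext%40rabbitmq:GitiNext%40Rabbit2025%21Secure@rabbitmq:5672/. Check telegrambot logs for RabbitMQ/connection/auth errors.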
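The encoding step can be sketched in pure bash. The helper names urlencode and amqp_url are hypothetical, and the password in the usage note is a placeholder, not the real credential.

```shell
#!/usr/bin/env bash
# Percent-encode every character except RFC 3986 unreserved ones (A-Z a-z 0-9 - . _ ~).
urlencode() {
  local LC_ALL=C s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;   # "'x" yields the character code of x
    esac
  done
  printf '%s\n' "$out"
}

# Build an AMQP URL with encoded credentials: amqp_url USER PASS HOST PORT
amqp_url() {
  printf 'amqp://%s:%s@%s:%s/\n' "$(urlencode "$1")" "$(urlencode "$2")" "$3" "$4"
}
```

For example, amqp_url 'gitinext@rabbitmq' 'p@ss!' rabbitmq 5672 prints amqp://gitinext%40rabbitmq:p%40ss%21@rabbitmq:5672/, which keeps exactly one unencoded @ in front of the host.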
3.3 PostgreSQL 18 Volume
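Compose mounts postgres_data:/var/lib/postgresql/18/docker. The official postgres image for 18+ defaults PGDATA to /var/lib/postgresql/18/docker and declares its volume at /var/lib/postgresql (the /18/main layout is the Debian package convention, not the image default). If you see Postgres init errors, try mounting postgres_data:/var/lib/postgresql instead. Changing the mount point of an existing volume changes where its files appear inside the container, so back up the data first.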
3.4 Startup Failures
Gateway, account, wallet, ledger, docs, storage, payment, telegrambot, voucher, withdrawal use panic/log.Fatal/os.Exit(1) on missing config or DB/Redis/proxy failure. Ensure .env and Postgres/Redis/RabbitMQ are correct. Check docker service ps <svc> --no-trunc for Exit 1.
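The docker service ps check can be narrowed to failed or rejected tasks. A minimal sketch, assuming the gitinext-golang_<svc> naming; the function names are hypothetical, while the --format placeholders are the standard ones for docker service ps.

```shell
#!/usr/bin/env bash
# Keep only task rows whose current state is Failed or Rejected.
# Input: tab-separated "name<TAB>state<TAB>error" rows on stdin.
failed_tasks() {
  awk -F '\t' '$2 ~ /^(Failed|Rejected)/'
}

# List failed or rejected tasks for one service of the stack.
# Usage: service_failures gateway
service_failures() {
  docker service ps --no-trunc \
    --format '{{.Name}}\t{{.CurrentState}}\t{{.Error}}' \
    "gitinext-golang_$1" | failed_tasks
}
```

The third column carries the task error, such as a non-zero exit code, which is where an exit 1 from a startup panic or log.Fatal shows up.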
3.5 Payment
See monitoring/alerts/payment_service_alerts.yml. Logs: PayStar, API key, refresh, verification, deposit creation.
3.6 Wallet
TWC: ENABLE_TWC_PLUGIN=true with missing library -> panic. Sweeps: TWC returned EMPTY signed tx, broadcast failed. Vault: VAULT_ENABLED=true and unreachable/wrong secrets -> startup or decrypt failures.
3.7 Withdrawal
Missing/invalid hot wallet encryption key -> exit. Exonyx/gRPC errors in logs.
4. Config Checklist
POSTGRES*, REDIS*, RABBITMQ_URL (percent-encoded), JWT_SECRET, ACCOUNT_JWT_SECRET, TELEGRAM_BOT_TOKEN, WALLET_ENC_KEY, Vault secrets if VAULT_ENABLED=true. For external APIs (QuickNode, NowNodes, TonCenter, PayStar, etc.) see docs/EXTERNAL_APIS_AUDIT.md and set valid keys in .env. Run ./scripts/check-external-apis.sh to test endpoints.
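Part of this checklist can be automated before deploying. The sketch below only checks that variables are non-empty; missing_vars is a hypothetical helper, and sourcing .env assumes the file uses plain KEY=value shell syntax. The POSTGRES* and REDIS* names are omitted because their exact names are not listed here.

```shell
#!/usr/bin/env bash
# Print the name of every listed variable that is unset or empty.
missing_vars() {
  local name
  for name in "$@"; do
    [ -n "${!name:-}" ] || printf '%s\n' "$name"
  done
}

# Example usage (assumes .env is shell-compatible):
#   set -a; . ./.env; set +a
#   missing_vars RABBITMQ_URL JWT_SECRET ACCOUNT_JWT_SECRET \
#     TELEGRAM_BOT_TOKEN WALLET_ENC_KEY
```

An empty result means every listed variable has a value; it does not confirm that the value is valid, so ./scripts/check-external-apis.sh is still needed for the external keys.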
5. One-Month Log Review Checklist
- Fetch errors per service: grep error/panic/fatal/failed on last 2k lines.
- docker service ps for restarts and Failed state.
- Watchers: circuit, 402, 429, 422, non-200, RPC.
- Payment: PayStar, refresh, verification.
- RabbitMQ: payment, watchers, telegrambot, gateway.
- DB/Redis: gateway, account, wallet, ledger, withdrawal.
- Prometheus: PaymentServiceUnavailable, PayStarAPIHighErrorRate, PaymentServiceAPIKeyRefreshFailure.
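For the watcher item in this checklist, a per-status breakdown is often more useful than a raw grep. This sketch assumes the status code appears as a standalone number in each log line; the actual watcher log format is not shown in this document, and rpc_status_counts is a hypothetical name.

```shell
#!/usr/bin/env bash
# Count occurrences of the RPC failure codes from the checklist.
# Output: one "<code> <count>" line per code seen, sorted by code.
rpc_status_counts() {
  grep -owE '402|404|405|422|429' | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage:
#   docker service logs --tail 2000 gitinext-golang_watcher-ton 2>&1 | rpc_status_counts
```

Unrelated numbers that happen to equal one of these codes, such as a byte count of 404, are also counted, so treat the output as a pointer to the lines worth reading.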
6. Files Touched
- docs/DOCKER_AND_LOGS_REVIEW.md: New.
- docker-compose.swarm.yml: Telegrambot RABBITMQ_URL default set to percent-encoded value.
7. Mentor Tips
- Understood: Full Docker + logs review after ~1 month; find errors and issues.
- If unclear: Specify environment or priority services.
- Files considered: Dockerfile, docker-compose.swarm.yml, Makefile aux-logs, CRITICAL_ISSUES_REPORT.md, payment_service_alerts.yml, main.go error paths, packages/rpc/provider.go.
- Changes: New DOCKER_AND_LOGS_REVIEW.md; fixed telegrambot RABBITMQ_URL in docker-compose.swarm.yml.
- Per-file: DOCKER_AND_LOGS_REVIEW.md new; docker-compose.swarm.yml one-line RABBITMQ_URL fix.
8. Service-by-Service Log Review (Run 2025-01-31)
Logs and task status were reviewed one by one starting from gateway. Summary below.
8.1 Gateway
- Replicas: 2/2 Running.
- Task history: Past failures ~13 days ago (exit 2, exit 255). Current replicas stable.
- Recent logs: Mostly GET /healthz 200 (Traefik health checks). No ERROR/WARN in sampled tail.
8.2 Wallet
- Replicas: 1/2; one replica is not running.
- Task history: wallet.1 has been Pending for 4 weeks with “no suitable node (insufficient resources on 1 node)”. The stack has only one node, and the wallet resource reservations leave no room for a second replica.
- Action: Add a worker node, or reduce wallet resource requests, or run with 1 replica by design.
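Two of these options can be expressed as Swarm commands. The sketch below prints the commands by default (DRY_RUN=1) so they can be reviewed first; the helper names are hypothetical and the reservation values are placeholders, not measured requirements for the wallet service.

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Option A: accept a single wallet replica by design.
wallet_single_replica() {
  run docker service scale gitinext-golang_wallet=1
}

# Option B: lower the reservations so two replicas fit on the one node.
# Usage: wallet_lower_reservations [CPUS] [MEMORY]
wallet_lower_reservations() {
  local cpu="${1:-0.25}" mem="${2:-256M}"
  run docker service update --reserve-cpu "$cpu" --reserve-memory "$mem" \
    gitinext-golang_wallet
}
```

Setting DRY_RUN=0 executes the command instead of printing it. If the reservations are defined in docker-compose.swarm.yml, change them there as well, or the next stack deploy will restore the old values.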
8.3 Payment
- Replicas: 1/1 Running.
- Task history: Multiple Failed (exit 1) ~12 days ago. Now stable.
- Log errors (current): PayStar statement API error, status 400, message (Persian): “توکن احراز هویت اشتباه است” (authentication token is wrong). Deposit sync fails every cycle.
- Action: Fix PayStar credentials (PAYSTAR_APPLICATION_ID, PAYSTAR_ACCESS_PASSWORD, PAYSTAR_REFRESH_TOKEN). Refresh token in PayStar dashboard and update env.
8.4 Watcher-TON
- Replicas: 1/1 Running.
- Task history: Past Failed (exit 1, 255) ~13 days ago. Now running.
- Log errors (current): Circuit breaker open for provider toncenter; RPC status 429 (rate limited). “All RPC providers failed” for getBlockTransactions, so TON deposit detection is not working.
- Action: Add/rotate TON RPC providers; reduce rate or add paid tier so circuit breakers can close.
8.5 Watcher-TRON
- Replicas: 1/1 Running.
- Task history: Past Failed (exit 1, 255); once Rejected 6 weeks ago: “No such image: registry.nextgiti.cloud/watcher:latest”. Resolved.
8.6 Telegrambot
- Replicas: 1/1 Running.
- Task history: Past Failed (exit 1, 255) ~13 days ago and 2–7 weeks ago. Now stable.
8.7 Account
- Replicas: 2/2 Running.
- Task history: Past Failed (exit 2, 255). Now stable.
8.8 Withdrawal
- Replicas: 2/2 Running.
- Task history: Past Failed (exit 1, 255) ~13 days ago. Now stable.
8.9 Ledger
- Replicas: 0/2; service not running.
- Task history: Rejected with “No such image: registry.nextgiti.cloud/ledger:latest”. Image missing from registry.
- Action: Build and push ledger image, then update service or redeploy stack.
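The build, push, and update steps might look like the sketch below. The image reference and service name come from the task error above; the build context (default ".") and the absence of build arguments are assumptions about the repo layout, and the helper names are hypothetical. Commands are printed rather than executed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Build, push, and roll out the ledger image.
# Usage: redeploy_ledger [BUILD_CONTEXT]
redeploy_ledger() {
  local image="registry.nextgiti.cloud/ledger:latest"
  local context="${1:-.}"   # assumed location of the Dockerfile that builds ledger
  run docker build -t "$image" "$context" &&
  run docker push "$image" &&
  run docker service update --with-registry-auth --force \
    --image "$image" gitinext-golang_ledger
}
```

The --force flag makes Swarm create new tasks even if the service spec looks unchanged, and --with-registry-auth passes the registry credentials to the node that pulls the image.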
8.10 Priority Summary
| Priority | Issue | Service | Action |
|---|---|---|---|
| P0 | Ledger image missing | ledger | Build, push, update service |
| P0 | PayStar token invalid | payment | Update PAYSTAR_* credentials; refresh token |
| P1 | TON deposits broken | watcher-ton | Fix RPC 429; add/rotate providers |
| P1 | Wallet 1/2 replicas | wallet | Add node or reduce resource request |
| P2 | Historical exit 1/2/255 | multiple | Monitor; already recovered |