Skip to content

Refactor to modular architecture, fix bugs, and improve test infrastructure#217

Merged
bartcode merged 26 commits into
mainfrom
cleanup-and-bugfixes
May 10, 2026
Merged

Refactor to modular architecture, fix bugs, and improve test infrastructure#217
bartcode merged 26 commits into
mainfrom
cleanup-and-bugfixes

Conversation

@bartcode
Copy link
Copy Markdown
Collaborator

Summary

Full-stack overhaul of Fluxbase internals: module-based architecture, 67 codebase findings fixed, frontend cleanup across 3 packages, and self-sufficient E2E test infrastructure.

Architecture: Module System

Replaced the monolithic server_init.go (1,211 lines) with a dependency-injection module system:

  • Module interface + ServiceRegistry — modules register outputs via registry.Register() and read inputs via GetServiceT
  • 20 modules in dependency order: Email → Secrets → Storage → Logging → Auth → Webhook → Extensions → Tenancy → Settings → Schema → Functions → Jobs → RPC → AI → Realtime → Branching → GraphQL → MCP → Metrics → BackgroundServices
  • 11 dead Server struct fields removed, server_init.go split into server_middlewares.go + server_routes.go
  • cmd/fluxbase/main.go reduced from ~530 lines of inline wiring to clean orchestration
    Bug Fixes (6 items)
  • Auth impersonation rate limiting — added per-user rate limiter (golang.org/x/time/rate) to prevent abuse
  • Realtime subscription unbounded growth — capped at 100 subscriptions per connection
  • Realtime tenant isolation — BroadcastToChannel and BroadcastGlobal now filter by TenantID
  • Job retry nil pointer — RequeueJob/RequeueFailedJob nil-safe after Close()
  • Idempotency key duplicate handling — returns cached response on conflict instead of 500
  • Leader election double-start — guarded by started flag

Codebase Sweep

  • Admin UI: API client refactor (260 lines), auth retry consolidation, error extraction, route guards, component cleanup
  • TypeScript SDK: Dead code removal, type exports
  • SDK-React: Hook cleanup
    E2E Test Infrastructure
  • Self-sufficient tests — TestMain resets fluxbase_go_e2e database, runs bootstrap + declarative schema via Go (same code path as production), then executes 274 tests with 0 failures
  • Fixed trigger function normalization — auth.set_tenant_id_from_client_key_or_context() used EXECUTE format('%I.%I', 'auth', 'client_keys') to prevent pgschema from stripping schema qualifiers in SECURITY DEFINER functions
  • Fixed connection pool exhaustion — respected embedded_worker_count=0, reduced pool sizes, added drain delay
  • Simplified CI — from 6 database setup steps to 3 (create DB + users + run tests)
    Files Changed
    173 files, +4,997 / -4,354 lines
    Test Results
  • 274 E2E tests passing (was 262 pass / 12 fail)
  • Unit tests added for critical/high fixes (401 lines across 5 files)
  • go build, go vet pass clean
    Breaking Changes
    None — all changes are backward compatible. Existing installations upgrade smoothly via declarative schema re-application on startup.
    Not Included (Deferred)
  • Encryption key string→[]byte (API-breaking, needs major version)
  • Pub/sub + webhook backpressure (needs design decision)
  • Error response consistency across 350+ sites (incremental)
  • Coverage improvements (auth, jobs, functions)
  • globalResolver elimination in tenant_config.go

bartcode added 26 commits May 10, 2026 08:35
- Extract tenant_id from context in security events (was logging empty)
- Remove broken testTx/NewServerWithTx/DB, add TestTransactionMiddleware
- BroadcastRequest accepts both 'message' and 'event'+'payload' formats
- DDL handler invalidates GraphQL schema alongside REST cache
- Edge functions guide uses FLUXBASE_SERVICE_TOKEN (was FLUXBASE_API_KEY)
- Clarify comments that keys are for external client use
Tenancy & FDW:
- Rename TenantStore→TenantRepository, Storage→Repository across all call sites
- Remove dead setupFDWLegacy (~80 lines), simplify SetupFDW to 3 params
- Add 10 platform tables (without tenant_id) to FDW exclude list
- Fix MFA service direct pool bypass (m.db.Pool()→m.db.Exec, 14 sites)

Supabase removal:
- Remove supabase/supabase-demo issuers from JWT validation
- Clean ~30 Supabase references from comments, tests, and docs

Extractions:
- Extract ConnectWithRetry, EnsureDefaultTenantAndKeys, BackfillTenantIDToDefault
  from main.go (988→529 lines) into dedicated packages
- Split Execute (390-line function) into orchestrator + execute_helpers.go

Other:
- Add exponential backoff to job retry (5s·2^n capped at ~5min)
- Remove deprecated SignInAnonymous (handler, service, tests)
- Stop clearing error_message on job retry
- Document JWT scope bypass rationale and MCP/REST scope separation
… registry

- Remove POST /api/v1/realtime/broadcast endpoint, handler, type, OpenAPI spec,
  and ScopeRealtimeBroadcast scope (clients should use WebSocket publish)
- Replace 8-way method switch with router.Add()/router.All() in route registry
- Cache RequireAuthOrServiceKey/OptionalAuthOrServiceKey middleware instances
  (was creating 24 duplicate handlers, now 2 cached on Server struct)
- Remove testTx/DB() from server.go, update test helpers accordingly
Critical:
- Fix job automatic retries: RequeueJob now handles running->pending
  transition (was requiring failed status, making backoff dead code)
- Fix realtime cross-tenant message leakage: broadcasts, presence,
  and sync events now scoped by TenantID
- Apply SSRF BlockedDomains to function/job/MCP executions (was only
  in test code; metadata endpoints were accessible)
- Fix leader election: use dedicated connection for advisory locks
  (session-level locks were released immediately with pgxpool)

High:
- Fix race condition on Connection.Claims/Role reads (add GetRoleAndClaims)
- Fix Prometheus metric cardinality: normalize slug/name path segments
- Fix hasPendingMigrations always returning false
- Fix service_role rate limiter creating fresh instance per request
- Fix job function file operations bypassing tenant isolation (RLS)
- Restrict --allow-read/--allow-write to /tmp (was full filesystem)

Medium/Low:
- Replace fmt.Sprintf JSON construction with json.Marshal (2 sites)
- Remove dead code: execContext, SubscriptionFilter, NewConnectionFromPool
- Pre-compile regexes in SanitizeSQL (was compiling on every call)
- Fix panic on short refresh tokens in rate limiter
- Fix stub handler returning 200 instead of 501
- Fix OOM detection for functions (was only for jobs)
- Job retry: verify RequeueJob (running->pending) and RequeueFailedJob
  (failed->pending) method signatures exist and are distinct
- Realtime tenant isolation: verify BroadcastToChannel filters by
  tenantID, only matching-tenant connections receive messages,
  and BroadcastGlobal propagates tenantID through pubsub
- SSRF protection: verify DefaultFunctionPermissions and
  DefaultJobPermissions include BlockedDomains with known SSRF
  targets (metadata endpoints, localhost)
- Connection race safety: verify GetRoleAndClaims is concurrent-safe
  with concurrent UpdateAuth calls (no panics under -race)
- Leader election: verify Start is idempotent, Stop is safe with nil
  dedicatedConn, and leadership state is properly cleared on stop
Critical:
- Fix cookie JSON.parse crash in auth-store (try/catch + cleanup)
- Consolidate triplicated auth refresh logic into single helper
- Remove duplicate FluxbaseResponse type (response.ts deleted)

Data fetching:
- Branch selector rewritten with TanStack Query (was useEffect)
- useTenant/useTenants rewritten with TanStack Query
- Tenant ID added to React Query cache keys (prevents stale cross-tenant data)

Error handling:
- Extract getErrorMessage() helper (replaced 5 inline patterns)
- Fix SDK fetch error wrapping to preserve original error message

Architecture:
- Document Axios vs SDK usage in client.ts JSDoc
- Fix impersonation token race (read from store, not localStorage)
- Add public getters to FluxbaseFetch (replaces bracket notation access)
- Add auth loading spinner during async check

Low priority:
- Document sidebar/route coupling in sidebar-data.ts
- Document dashboardAuthAPI raw axios usage
- Add React hooks: useInvokeFunction, useSubmitJob, useJobStatus, useBranches
- Remove deprecated SupabaseResponse/SupabaseAuthResponse aliases
- Remove Supabase alternative wording
- Document JWT issuer (fluxbase-only), scope bypass, security events
- Add new extracted files (retry.go, bootstrap_keys.go, backfill.go,
  execute_helpers.go)
- Document SSRF protection (BlockedDomains applied to all executions)
- Document realtime tenant isolation (broadcast + presence filtering)
- Document leader election (dedicated connection, idempotent start)
- Update admin UI auth (safe cookie parsing, consolidated refresh)
- Update Prometheus normalizePath (handles slug-like segments)
HIGH:
- Add per-connection WebSocket rate limiting (10 msgs/sec, burst 20)
  via golang.org/x/time/rate token bucket
- Add subscription limit per WebSocket connection (max 100)
- Fix rlsCache.cleanup() goroutine leak (accept context, add stop())

MEDIUM:
- Fix FDW role name collision (use full UUID instead of 8 chars)
- Fix partial FDW setup returning nil on failure (track failed schemas)
- Fix bootstrap failure leaking orphaned database (DROP on failure)
- Add array size limits on bulk operations (max 1000 items)
- Fix scheduler namespace collisions (use namespace/name as cron key)
- Fix path traversal in chunked upload (validate uploadID regex)

Dead code removed:
- Executor.ValidateMigration (never called)
- normalizeSlug in tenant_loader.go (unused)
- SyncFromPostgres in extensions/service.go (collected nothing)
- @fluxbase:timeout/@fluxbase:memory annotation parsing (never applied)
MEDIUM:
- M1: GetAppliedMigrations only swallows missing-table errors
- M2: DDL wrapped in transactions in declarative schema
- M3: Declarative apply uses shared pool (no connection exhaustion)
- M6: BroadcastToChannel O(n) documented (tenant filtering limits scan)
- M8: TenantExecutor uses advisory locks for concurrent migration safety
- M12: Cron functions scheduled immediately on create/update
- M13: Scheduled functions now receive settings secrets
- M16: Tenant config env var parse errors logged as warnings
- M18: Dead workers removed from slice (no unbounded growth)
- M19: Restart count resets after 5 minutes of stability

LOW:
- L1: err.Error() string compare → errors.Is(err, pgx.ErrNoRows)
- L2: bcrypt.DefaultCost → DefaultBcryptCost
- L12: Secret names no longer logged via fmt.Printf
- L15: Rate limiter panics → error returns
- L17: err.Error() removed from API responses (generic messages)
- L22: FailJob errors now logged
- L23: Redundant inline role checks removed from monitoring
- L24: context.Background() → request context in handlers
- L31: Tracing version uses overridable var (ldflags)
- L32: Custom itoa → strconv.Itoa
- L33: Custom contains → strings.Contains
Goroutine leaks:
- L19: slowQueryTracker cleanupLoop stopped on Connection.Close()
- L20: endpointRateLimiter cleanup stopped on TriggerService.Stop()

Data safety:
- L21: perKeyLimiters map evicts idle entries (>1h) every 10m
- L25: Idempotency check uses INSERT-first pattern (eliminates TOCTOU race)
- L26: PresenceInfo.State map deep-copied on store

Dead code:
- L6: Removed ComputeDeduplicationKey, EnqueueJobWithDedup,
  ProgressToJSON, JSONToProgress from jobs/storage.go
- L18: Extracted shared enrichJobWithETA to listener_helpers.go,
  documented duplication between Listener/ListenerPool

Error responses:
- M15: Migrated server.go to structured error helpers,
  removed remaining err.Error() leak in schema refresh handler
Foundation for modular server initialization:
- Add Module interface with Name() and Init() methods
- Add ServiceRegistry for type-keyed service lookup with GetService[T]
- Add InitModules for sequential module initialization with error propagation
- Extract 4 simplest modules: Email, Secrets, Webhook, Extensions
- Modules coexist with existing init methods (no breaking changes)
- Server.registry field holds the ServiceRegistry for future module use

This is the first step of Phase 3. The old init methods for email,
secrets, webhook, and extensions are replaced by module-based init
while all other modules continue using the existing pattern.
…methods (Phase 3b)

Moves initStorage, initLogging, initAuth into Module implementations.
Also deletes initEmail, initWebhook, initSecrets (previously extracted).
LoggingModule receives pubSub via struct field to work around
GetService[T] limitation with interface types. Unused imports cleaned
from server_init.go.
…e 3c)

Extracts TenancyModule, SettingsModule, SchemaModule, FunctionsModule,
JobsModule, RPCModule, AIModule, RealtimeModule, BranchingModule,
GraphQLModule from server_init.go into dedicated module files.

Adds PubSub field to ServiceRegistry (interface types can't be looked
up via reflect-based GetService[T]). Deletes 579 lines from
server_init.go. All modules now use dependency injection via the
ServiceRegistry. Remaining: setupMCPServer, initMetrics,
initBackgroundServices, setupMiddlewares, setupRoutes.
… 3d)

Extracts MCPModule (setupMCPServer) and MetricsModule (initMetrics)
into dedicated module files. Moves branchTenantResolver and
branchFDWRepairer helper types to module_branching.go.

Adds PubSub field to ServiceRegistry for interface-typed access.
Registers AI chat handler and KB storage for cross-module lookup.

server_init.go reduced from 632 to 369 lines. Remaining:
initCore, initBackgroundServices, setupMiddlewares, setupRoutes
(lifecycle/infrastructure, not business logic modules).
The cross-init dependency fields (emailManager, emailService,
storageManager, storageService, loggingService, authService,
captchaService, systemSettingsService, userMgmtService,
invitationService, secretsStorage) were only written during init
wiring but never read from the Server struct. Modules now pass
dependencies directly to handlers. Removes 26 lines of dead code.
Add module system description to Internal Modules section.
Update directory structure with module files. Document the
19-module dependency injection architecture.
… services, split files

- Remove dead pubsub.GetGlobalPubSub/SetGlobalPubSub (defined but never called)
- Inject rate limiter into functions.Handler via SetRateLimiter() instead
  of global ratelimit.GetGlobalStore()
- Add RateLimiter field to ServiceRegistry (same pattern as PubSub)
- Extract BackgroundServicesModule (starts realtime listener, schedulers,
  jobs manager, RPC scheduler via leader election)
- Extract leaderElect() helper as standalone function
- Split server_init.go into server_middlewares.go and server_routes.go
- server_init.go reduced to 76 lines (only initCore remains)

Net: -366 lines, 2 globals eliminated, 20 modules total.
- Reduce test pool sizes (MaxConns=3, MinConns=0, shorter idle/lifetime)
- Add 50ms drain delay after Server.Shutdown() for connection release
- Respect embedded_worker_count=0 to disable job workers in tests
- Fix auth.set_tenant_id_from_client_key_or_context schema (already fixed in
  auth.sql but trigger function in DB was stale)
- Increase devcontainer PostgreSQL max_connections to 400
…s during schema apply

The declarative schema system sets search_path to include all schemas before
applying SQL files. When a SECURITY DEFINER function with SET search_path=public
references auth.client_keys, PostgreSQL normalizes it to just client_keys because
auth is in the session search_path during CREATE FUNCTION. At runtime with
search_path=public, the unqualified reference fails because public.client_keys
doesn't exist.

Fix: exclude the target schema from the session search_path during applySchemaDirect.
This prevents normalization while still allowing cross-schema references.
Go E2E tests now reset and bootstrap their own database on every run,
matching the Playwright UI test pattern. No external setup step needed.

Changes:
- Add resetTestDatabase() + bootstrapTestDatabase() to setup_test.go
- resetTestDatabase drops all schemas, ensures users exist (as postgres)
- bootstrapTestDatabase runs EnsureBootstrap() + declarative schema via Go
- grantRLSTestPermissions adds missing schemas (ai, rpc, logging, branching)
  and roles (tenant_service, tenant_migration_role) using loop-based SQL
- Default database changed from fluxbase_test to fluxbase_go_e2e
- CI simplified from 6 steps to 3 (create DB + users + run tests)
- Makefile E2E targets no longer depend on test-setup-db
- go fmt on 15 files (gofumpt formatting)
- branch-selector: derive error state from query instead of setState in effect
- SDK storage tests: add getBaseUrl() to MockFetch
- SDK fetch tests: update 'Unknown error occurred' expectations to 'string error'
@bartcode bartcode changed the title Refactor to more modular system, and code cleanup Refactor to modular architecture, fix bugs, and improve test infrastructure May 10, 2026
@bartcode bartcode merged commit 04478d6 into main May 10, 2026
14 checks passed
@bartcode bartcode deleted the cleanup-and-bugfixes branch May 10, 2026 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant