Universal Data Adapter for backend-core
Context and Problem Statement
backend-core today talks to exactly one datastore — ArcadeDB (multi-model graph/document/vector), reached over its HTTP/JSON API. We need a system that lets us connect to many backend types from one place: register a connection, see its settings and health in a human-facing UI, use its type-specific features, and move data in and out through a pipeline. ArcadeDB is first and essential; Google BigQuery is next; more (Postgres, etc.) will follow. Bulk data movement should ultimately route through the dlt (dlthub) pipeline. We prefer mature packages over bespoke code, must keep secrets out of source and the app database, and are open to a different implementation language wherever it makes the system more powerful.
How should we structure a "universal data adapter" that is extensible to new backend types without rewrites, secure by default, and contract-first?
Decision Drivers
- Extensibility — adding a new backend type should not require touching routing/core code.
- Heterogeneous capabilities — relational, graph, vector, and warehouse "job" semantics differ; the abstraction must not assume SQL.
- Pipeline reuse — lean on `dlt` for ELT (sources/destinations, incremental state, schema evolution) rather than reinventing it.
- Security — credentials referenced by pointer, resolved at runtime; never persisted in plaintext (consistent with the ArcadeDB secret-file hardening, PR #11).
- Contract-first — managed via the OpenAPI contract; the Svelte UI is generated from it.
- Powerful, not dogmatic, language choice — willing to use Rust/C# where it wins, but pragmatic about ecosystem gravity.
- Buy over build where a mature package fits; build the thin glue ourselves.
Considered Options
- Python (FastAPI) connector core on `dlt` + per-type drivers (recommended)
- Buy an embedded connector platform — PyAirbyte / Singer+Meltano / Trino / Steampipe
- Fully custom connector + pipeline framework (no dlt)
- Rust/C# connector core that drives `dlt` as a subprocess/sidecar
Decision Outcome
Chosen: Option 1 — a Python connector core on dlt, with per-type drivers, a
connection registry in ArcadeDB, a capability-based connector abstraction, and a
contract-first API + Svelte UI. Rust (rust-api-v2) is retained for hot serving
paths, not the connector core.
Architecture
- Connection registry (ArcadeDB documents):
  - `connection_types` — catalogue: name, capability flags, a settings JSON Schema, and a `dlt` slug.
  - `connection_instances` — `display_name`, `environment`, `settings` (validated against the type schema; no secrets), `secret_ref` (opaque pointer), `health_state` + `last_checked_at` + `last_error`, derived `capabilities`.
- Connector capability interface — a `ConnectorBase` (`test_connection`, `introspect_schema`, `read`, `write`) plus opt-in mixins declared per type: `GraphCapable`, `VectorCapable`, `JobCapable` (e.g. BigQuery jobs). Connectors are registered via a decorator (`@register_connector("arcadedb")`); a new type = a small package + one catalogue row, no routing changes.
- dlt orchestration layer — a thin module that materializes a `dlt` source/destination from a registry record (resolving the `secret_ref`) and runs/schedules the pipeline. Native ops (graph traversal, vector search, ad-hoc reads) bypass dlt and use the type's client directly.
- Secrets — the registry stores only a `secret_ref` (Docker secret path / Key Vault name / env key). Resolution happens at pipeline instantiation via `dlt`'s `VaultProvider` tier (or a Docker secret file). A CI/semgrep rule blocks any code path that persists a resolved credential.
- Contract-first API (FastAPI → OpenAPI → Svelte client):
  - `GET /v1/connection-types` — list the type catalogue.
  - `GET /v1/connections`, `POST /v1/connections` — list / create connections.
  - `GET /v1/connections/{id}`, `PATCH /v1/connections/{id}`, `DELETE /v1/connections/{id}` — read / update / delete one connection.
  - `POST /v1/connections/{id}/test` — run a connectivity test.
  - `GET /v1/connections/{id}/schema` — introspect schema.
  - `POST /v1/connections/{id}/pipelines` — start a pipeline run (returns `202` + a polling `Location`); `GET /v1/connections/{id}/pipelines` — list runs; `GET /v1/connections/{id}/pipelines/{jobId}` — poll one run.
  - `POST /v1/connections/{id}/query` — native ad-hoc read.

  RFC 9457 Problem Details for errors; async pipeline endpoints return a polling `Location` header.
- UI gates capability-specific actions on the instance's capability flags, not the type name.
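The capability interface and decorator registration described above could look roughly like the following. This is a minimal sketch, not the project's implementation: the names `ConnectorBase`, `GraphCapable`, `VectorCapable`, `JobCapable`, and `register_connector` come from this ADR, while `get_connector`, `capabilities_of`, the flag spellings (`"graph"`, `"vector"`, `"job"`), the method signatures, and the stubbed ArcadeDB connector are assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, Iterator


class ConnectorBase(ABC):
    """Operations every connector supports, whatever the backend type."""

    def __init__(self, settings: dict[str, Any]) -> None:
        self.settings = settings  # validated, secret-free settings

    @abstractmethod
    def test_connection(self) -> bool: ...

    @abstractmethod
    def introspect_schema(self) -> dict[str, Any]: ...

    @abstractmethod
    def read(self, query: str) -> Iterator[dict[str, Any]]: ...

    @abstractmethod
    def write(self, rows: Iterable[dict[str, Any]]) -> int: ...


class GraphCapable:
    """Opt-in marker mixin: the backend supports graph operations."""


class VectorCapable:
    """Opt-in marker mixin: the backend supports vector search."""


class JobCapable:
    """Opt-in marker mixin: the backend runs asynchronous jobs."""


# Flag spellings are assumptions; the ADR does not fix them.
CAPABILITY_MIXINS: dict[str, type] = {
    "graph": GraphCapable,
    "vector": VectorCapable,
    "job": JobCapable,
}

_REGISTRY: dict[str, type[ConnectorBase]] = {}


def register_connector(
    type_name: str,
) -> Callable[[type[ConnectorBase]], type[ConnectorBase]]:
    """Class decorator that files a connector under its type name."""

    def decorator(cls: type[ConnectorBase]) -> type[ConnectorBase]:
        if type_name in _REGISTRY:
            raise ValueError(f"connector type already registered: {type_name}")
        _REGISTRY[type_name] = cls
        return cls

    return decorator


def get_connector(type_name: str, settings: dict[str, Any]) -> ConnectorBase:
    """Look up a connector by type name; routing code never names a class."""
    return _REGISTRY[type_name](settings)


def capabilities_of(connector: ConnectorBase) -> set[str]:
    """Derive capability flags from the mixins the connector inherits."""
    return {
        flag
        for flag, mixin in CAPABILITY_MIXINS.items()
        if isinstance(connector, mixin)
    }


@register_connector("arcadedb")
class ArcadeDBConnector(ConnectorBase, GraphCapable, VectorCapable):
    """Stub only; a real connector would call ArcadeDB's HTTP/JSON API."""

    def test_connection(self) -> bool:
        return True

    def introspect_schema(self) -> dict[str, Any]:
        return {"types": []}

    def read(self, query: str) -> Iterator[dict[str, Any]]:
        return iter(())

    def write(self, rows: Iterable[dict[str, Any]]) -> int:
        return sum(1 for _ in rows)
```

Under these assumptions, `capabilities_of(get_connector("arcadedb", {...}))` yields the flags a UI could gate on, so a new backend type only adds a decorated class and never touches the lookup code.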
Packages (adopt)
| Package | Role |
|---|---|
| dlt (Python) | ELT core; native BigQuery; runtime-dynamic pipelines; VaultProvider secrets; custom @dlt.destination for ArcadeDB |
| arcadedb-python | ArcadeDB over HTTP (the Postgres-wire path is fragile — avoid for ArcadeDB) |
| ADBC (+ sqlalchemy-bigquery) | Arrow-native warehouse extract; SQLAlchemy dialects for SQL-standard sources |
| Vault / Azure Key Vault (+ External Secrets Operator) | runtime secret resolution; never plaintext in app DB |
Avoid (this iteration): PyAirbyte / Singer / Meltano (heavy, subprocess-bound), Steampipe / Trino (daemon/cluster, read-only), connectorx / Ibis (too narrow / query-only).
Language
Python owns the connector core. dlt is Python-only and the connector ecosystem
(arcadedb-python, ADBC-python, SQLAlchemy) is richest there; an FFI/sidecar to a
Python dlt worker on every operation would be fragile for no gain. Rust
(rust-api-v2) owns serving — streaming Arrow batches, vector result serving, edge
rate-limiting. This is "Python where it is strongest, Rust where it is strongest,"
not "everything in Python."
To avoid a serialization bottleneck where Python and Rust do exchange data (Arrow batches), the Python↔Rust transfer uses a zero-copy IPC mechanism — Apache Arrow Flight (gRPC-based, the default) or a shared-memory buffer for co-located processes — rather than re-serializing through JSON. The chosen mechanism is recorded when the serving path is implemented (phase 3).
Consequences
Good:
- New backend types are additive (package + catalogue row); no core rewrites.
- Leverages dlt's incremental/state/schema machinery instead of rebuilding it.
- Secret posture matches PR #11 (pointer + runtime resolution); nothing plaintext at rest.
- Contract-first keeps the UI and API in lockstep.
Bad / risks:
- The connector core is Python-centric despite a preference to spread languages.
- ArcadeDB needs a custom dlt destination (medium effort; type mapping + idempotency are ours).
- A common schema vocabulary (CDM) for introspect_schema must be defined before BigQuery, or its type system forces a breaking change. To contain this risk the CDM adopts an existing standard rather than a bespoke vocabulary: Apache Arrow schema metadata as the canonical type system for introspect_schema, with per-type adapters mapping native types into it.
- Dynamic pipeline parameters (cursors, partitions) vary per type. To handle this, the pipelines request body uses a per-type `extensions` object validated against the connection type's settings JSON Schema (already defined in the `connection_types` registry), so the Svelte UI can render the correct inputs per connector. Without this, operators would bypass the API with raw dlt scripts.
- dlt is not yet a dependency of backend-core (it lives in the AgentArmy hub today) — adding it expands the dependency surface.
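The per-type adapter idea behind the CDM risk above can be sketched as a small mapping from native column types into Arrow type names. Everything here is illustrative: the function `to_cdm_fields`, the field layout, and the string spellings of the Arrow types are assumptions (a real adapter would more likely build `pyarrow` types directly), and the table covers only a handful of BigQuery types.

```python
from typing import Any

# Illustrative subset: BigQuery column type -> Arrow type name (as a string).
# The string spellings are an assumption made for this sketch.
BIGQUERY_TO_ARROW: dict[str, str] = {
    "STRING": "string",
    "BYTES": "binary",
    "INT64": "int64",
    "FLOAT64": "float64",
    "BOOL": "bool",
    "DATE": "date32",
    "TIMESTAMP": "timestamp[us, tz=UTC]",  # BigQuery TIMESTAMP is microsecond UTC
    "NUMERIC": "decimal128(38, 9)",        # BigQuery NUMERIC: precision 38, scale 9
}


def to_cdm_fields(columns: list[tuple[str, str, bool]]) -> list[dict[str, Any]]:
    """Map (name, native_type, nullable) columns into CDM field records.

    Unknown native types raise instead of being guessed, so a gap in the
    mapping surfaces during introspection rather than as bad data later.
    """
    fields = []
    for name, native_type, nullable in columns:
        key = native_type.upper()
        if key not in BIGQUERY_TO_ARROW:
            raise ValueError(f"no Arrow mapping for BigQuery type {native_type!r}")
        fields.append(
            {"name": name, "type": BIGQUERY_TO_ARROW[key], "nullable": nullable}
        )
    return fields
```

Failing loudly on an unmapped type is a deliberate choice in this sketch: it keeps the canonical vocabulary closed, which is the property the risk item is trying to protect.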
Acceptance criteria
This ADR is accepted (ratified 2026-05-25); the following are the conditions each
implementation increment must satisfy as Phase 1 is built out:
- The ArcadeDB connector must pass the existing `agentarmy-doctor` checks and a contract conformance test.
- A semgrep rule must assert that no `secret_ref` is ever resolved into a persisted field.
- The OpenAPI contract drift gate must stay green and the Svelte client must regenerate cleanly.
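As a runtime complement to the semgrep criterion, the pointer-then-resolve pattern can be sketched as below. This is a sketch under stated assumptions: the `file:` / `env:` prefix convention for `secret_ref`, and the helpers `resolve_secret_ref` and `find_leaks`, are hypothetical and not defined by this ADR; the Key Vault tier is omitted.

```python
import os
from pathlib import Path
from typing import Any


def resolve_secret_ref(secret_ref: str) -> str:
    """Resolve a pointer to its secret value at call time.

    Assumed (hypothetical) pointer forms:
      file:<path>  - read a Docker-style secret file
      env:<NAME>   - read an environment variable
    """
    scheme, _, target = secret_ref.partition(":")
    if scheme == "file":
        return Path(target).read_text(encoding="utf-8").strip()
    if scheme == "env":
        return os.environ[target]
    raise ValueError(f"unsupported secret_ref scheme: {scheme!r}")


def find_leaks(record: Any, secret: str, path: str = "") -> list[str]:
    """Return the paths in a record whose string value contains the secret."""
    if not secret:
        return []  # an empty secret would match every string
    if isinstance(record, dict):
        leaks: list[str] = []
        for key, value in record.items():
            child = f"{path}.{key}" if path else str(key)
            leaks += find_leaks(value, secret, child)
        return leaks
    if isinstance(record, list):
        leaks = []
        for index, value in enumerate(record):
            leaks += find_leaks(value, secret, f"{path}[{index}]")
        return leaks
    if isinstance(record, str) and secret in record:
        return [path]
    return []
```

A test could resolve the secret, build the record that is about to be written to the registry, and assert that `find_leaks(record, secret)` is empty. Static analysis stays the primary gate; a check like this only catches leaks on the code paths the test exercises.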
Phased Plan
- ArcadeDB — registry schema, `ConnectorBase`, ArcadeDB connector (graph+vector mixins), `secret_ref` resolution via the Docker secret file (PR #11), the API endpoints, Svelte connection CRUD + health badge.
- BigQuery — `BigQueryConnector` + `JobCapable`, `dlt` BigQuery destination, pipeline run/poll UI; define the CDM schema vocabulary here.
- Universal — settings-schema validation at create time, connector auto-discovery via entry points, multi-environment credential namespacing, optional ADBC fast-path reads.
More Information
- Confirmed state: backend-core already ships an `app/` package, and `dlt` is not currently in `requirements.txt` — adopting it (phase 1) adds a new dependency, as noted in the risks above.
- Related: PR #11 (ArcadeDB secret-file hardening) establishes the `secret_ref`/secret-file precedent this design extends.
- Research basis: parallel agent research (dlt/ADBC/connectorx/SQLAlchemy landscape; build-vs-buy; architecture), 2026-05-24.