$ cat post/what-an-algorithmic-trading-platform-taught-me-about-internal-platforms.md
What an algorithmic trading platform taught me about internal platforms
There is a version of platform engineering that lives entirely in diagrams. The golden path exists as a Confluence page. The internal developer platform is a JIRA board with a new label. The shared library is a Slack thread from eighteen months ago that nobody can find.
And then there is what actually gets built when someone sits down and does the work.
Mimicus is an algorithmic trading platform — twenty async Python microservices wired together through Redis pub/sub, InfluxDB time-series data, and a Kubernetes cluster running GitOps via ArgoCD. It ingests real-time market data from Alpaca WebSocket feeds, routes it through analysis pipelines and ML models, validates trade signals through sequential risk gates, and executes orders with full observability across the stack.
It is also one of the clearest examples I have seen of what a real internal developer platform looks like when every decision is made with intention.
The first thing that stands out is the shared utilities library at svc/_utils/.
Every service in the platform imports from it. Redis clients, InfluxDB connections, OpenTelemetry setup, Alpaca API wrappers, cached market bar fetching, session auth for dashboards — none of it gets reimplemented per service. If you are writing a new trading strategy, you reach into _utils for your infrastructure primitives. You do not write your own database connection. You do not configure your own metrics export. You do not decide how logging should work.
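To make that concrete, here is a hedged sketch of what "reach into _utils for your infrastructure primitives" might look like. The function names (`get_redis_client`, `setup_observability`) are assumptions based on the post, stubbed locally so the sketch runs on its own — they are not Mimicus's actual API.

```python
# Illustrative sketch: a new strategy service inherits its plumbing.
# These stand-ins represent what svc/_utils/ would provide.

def get_redis_client(url: str = "redis://localhost:6379"):
    """Stand-in for a shared Redis client factory in _utils."""
    return {"url": url}

def setup_observability(service: str):
    """Stand-in for _utils' one-call metrics/logging/tracing setup."""
    return {"service": service}

def new_strategy_service(name: str):
    # The service author writes none of this plumbing; they inherit it.
    telemetry = setup_observability(name)
    redis = get_redis_client()
    return telemetry, redis

telemetry, redis = new_strategy_service("mean-reversion-v2")
print(telemetry["service"], redis["url"])
```

The point of the shape, not the stubs: the strategy author's file contains strategy logic and two imports, nothing else.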
That sounds simple. It almost never is.
Most platforms that talk about shared libraries end up with three competing versions of the same wrapper, a set of copy-pasted patterns that diverged over eighteen months, and a quiet understanding that the “right way” and the “way it actually works” are different documents. The reason Mimicus avoids this is not magic — it is a rule that anything used by two or more services lives in _utils, and the rule is actually enforced. New service, new pattern, same library. The shared layer grows with the platform instead of getting outpaced by it.
The second thing that stands out is how the platform handles documentation.
Every service in Mimicus has a README. Every README has an accurate configuration table listing env vars, routes, and upstream dependencies. The accuracy is not a coincidence and it is not the result of disciplined humans keeping things up to date. It is the result of generate_docs.py, a GitHub Actions workflow that AST-inspects each service’s source code and rewrites the auto-generated sections of each README on every push.
os.getenv() calls become the configuration table. @app.route() decorators become the routes section. Import statements and explicit dependency maps become the upstream/downstream graph.
The human-written prose is preserved. Only the sections that can be derived from code are touched. The result is that the docs cannot drift from the implementation — not because someone enforced a discipline, but because the pipeline enforces it automatically.
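The `os.getenv()`-to-config-table step is the easiest piece to sketch. This is a minimal, self-contained version of the idea using Python's standard `ast` module — not generate_docs.py itself, just the technique it is described as using:

```python
import ast

# A toy service source to inspect, standing in for a real service file.
SOURCE = '''
import os
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
API_KEY = os.getenv("ALPACA_API_KEY")
'''

def find_env_vars(source):
    """Collect (name, default) pairs from os.getenv(...) calls."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "getenv"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "os"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            default = None
            if len(node.args) > 1 and isinstance(node.args[1], ast.Constant):
                default = node.args[1].value
            found.append((node.args[0].value, default))
    return found

rows = find_env_vars(SOURCE)
# Render as a markdown config table, the kind a docs pipeline rewrites.
table = "| Variable | Default |\n|---|---|\n" + "\n".join(
    f"| `{name}` | `{default}` |" for name, default in rows
)
print(table)
```

Because the table is derived from the parse tree rather than from anyone's memory, renaming an env var in code changes the README on the next push, with no human in the loop.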
This is what good platform thinking looks like. You identify the places where documentation rot is structurally inevitable — config tables, route listings, dependency inventories — and you remove the human from the maintenance loop. You stop hoping engineers will update the docs and start making it physically difficult for them not to.
The same principle extends to service onboarding. Adding a new trading strategy to Mimicus is a template exercise. There is a documented golden path in docs/docs/development/new-service.md. The Kustomize base manifests handle the Kubernetes deployment scaffolding. The _utils library handles the runtime infrastructure. The CI pipeline handles the build, image tag, and manifest update automatically. A new service goes from concept to running in a live namespace in under an hour, not because the infrastructure is simple but because the complexity has been absorbed into the platform.
That is the actual job. Not reducing complexity. Absorbing it.
One of the more interesting pieces of Mimicus is Watchdog — the automated incident response service.
When Grafana detects an anomaly and fires an alert, Watchdog receives the webhook, queries Loki for error logs in the affected time window, checks Kubernetes for pod state and restart count, checks ArgoCD for the current app revision and sync status, and then decides what to do. CrashLoopBackOff triggers a pod restart. OOMKilled suggests a memory leak and triggers investigation. ImagePullBackOff suggests a registry problem. If remediation succeeds, it logs the action to Loki and moves on. If it cannot fix the issue, it opens a GitHub issue with the enriched incident context already attached.
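The triage logic described above can be sketched as a small decision function. The failure-mode strings are real Kubernetes container states; the specific action names and the restart-count threshold are illustrative assumptions, not Watchdog's actual code:

```python
# Sketch of Watchdog-style triage: map a pod's failure mode to a
# remediation decision. Action names are assumptions for illustration.

def triage(pod_state: str, restart_count: int) -> str:
    """Decide what to do with an unhealthy pod."""
    if pod_state == "CrashLoopBackOff":
        # Deterministic failure loop: a restart is a reasonable first move.
        return "restart_pod"
    if pod_state == "OOMKilled":
        # Likely a memory leak; restarting only delays the next kill.
        return "open_investigation"
    if pod_state == "ImagePullBackOff":
        # Registry or tag problem; there is no in-cluster fix.
        return "open_github_issue"
    if restart_count > 5:
        # Flapping without a clear signature: escalate with context.
        return "open_github_issue"
    return "log_and_monitor"

print(triage("CrashLoopBackOff", 1))  # restart_pod
print(triage("OOMKilled", 0))         # open_investigation
```

The interesting part is not the branching, which any on-call engineer carries in their head; it is that the branching runs before the engineer's phone does.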
Most teams do not have this. Most teams have an engineer’s phone going off at 2am so they can open a terminal, run the same diagnostic commands, conclude it is a crash loop, and restart a pod.

Watchdog is not a novel idea. Automated remediation has been a platform engineering aspiration for years. What makes it interesting in Mimicus is that it is not a side project bolted on later — it is a first-class service in the platform with the same observability, the same shared library dependencies, and the same GitOps deployment path as every other service. The platform’s incident handling is part of the platform.
The trade execution layer works the same way.
Before any order reaches Alpaca, it passes through a sequential set of gates inside Slanger — the central order gateway. Market hours. Position limits. Volatility checks. VWAP alignment. Supertrend confirmation. Volume clusters. Each gate runs in order, each gate writes its pass/fail result and reasoning to InfluxDB, and the entire chain is queryable in Grafana. If a trade was rejected, you can pull up the gate decision audit trail and see exactly which condition failed and why.
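A sequential, fully audited gate chain is a simple pattern to sketch. The gate names below come from the post; the gate implementations are placeholders, and the real system would write each `GateResult` to InfluxDB rather than collect it in a list:

```python
# Minimal sketch of a sequential gate chain in the spirit of Slanger.

from dataclasses import dataclass

@dataclass
class GateResult:
    gate: str
    passed: bool
    reason: str

def run_gates(order, gates):
    """Run gates in order; record every result, stop at first failure."""
    audit = []
    for name, check in gates:
        passed, reason = check(order)
        audit.append(GateResult(name, passed, reason))
        # In the real platform this result would be written to InfluxDB
        # here, making the whole chain queryable in Grafana.
        if not passed:
            break
    return audit

# Two placeholder gates for illustration.
def market_hours(order):
    open_ = order.get("market_open", False)
    return open_, "market is open" if open_ else "market closed"

def position_limit(order):
    ok = order.get("qty", 0) <= 100
    return ok, "within limit" if ok else "exceeds position limit"

audit = run_gates(
    {"market_open": True, "qty": 500},
    [("market_hours", market_hours), ("position_limit", position_limit)],
)
for r in audit:
    print(r.gate, r.passed, r.reason)
```

The design choice worth noting is that every gate records its reasoning even when it passes, so the audit trail answers "why was this allowed?" as readily as "why was this rejected?".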
This is the platform engineering version of a compiler. You do not let untrusted code into production. You do not let untrusted orders reach the market. You encode the rules, you run everything through the validator, and you make the rejection reasons readable.
The observability setup reflects the same philosophy. Every service in Mimicus exposes Prometheus metrics. Every service writes structured logs to Loki. OpenTelemetry traces run through Tempo. Pyroscope handles continuous profiling. None of this is optional and none of it requires the service author to make decisions — the _utils/observability.py module sets everything up with a single call and the platform carries the rest.
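The "single call" claim is the part worth sketching. The real `_utils/observability.py` wires Prometheus, Loki, Tempo, and Pyroscope; this stub only configures structured logging, but it shows the shape of the API a service author would see. The function name and signature are assumptions:

```python
# Sketch of the one-call observability pattern. Only logging is wired
# here; the real module would also register metrics, traces, profiling.

import logging

def setup_observability(service_name: str) -> logging.Logger:
    """Configure structured JSON logging with a single call."""
    logger = logging.getLogger(service_name)
    if not logger.handlers:  # idempotent: safe to call more than once
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            '{"service": "%(name)s", "level": "%(levelname)s", '
            '"msg": "%(message)s"}'
        ))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = setup_observability("my-strategy")
log.info("service started")  # emits one structured JSON log line
```

One call, no decisions: the service author cannot configure logging wrong, because they never configure it at all.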
When something goes wrong, the question is never “does this service have telemetry?” It is always “which telemetry do I need right now?” That is a different problem to have.
Secrets management is the same story. Vault Agent runs as an init container. It injects secrets as environment variables before the main container starts. Services never touch credentials directly. API keys, database tokens, and passwords never appear in source code, never appear in Docker images, and never appear in Git. Gitleaks scans every commit as a pre-commit hook to catch anything that slips through. The platform’s security posture is not a policy document — it is structural. It is harder to do it wrong than to do it right.
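From the service's side, this pattern reduces to one discipline: read the injected environment variable and fail fast if it is missing. The variable name below is illustrative, and the `os.environ` assignment is a stand-in for what Vault Agent would have done before the process started:

```python
# If Vault Agent injects secrets as env vars before the main container
# starts, the service's only job is to read them and fail fast.

import os

def require_env(name: str) -> str:
    """Read a required secret from the environment."""
    value = os.environ.get(name)
    if value is None:
        # Crash at startup rather than limp along without credentials;
        # Kubernetes surfaces the failure immediately.
        raise RuntimeError(f"missing required secret: {name}")
    return value

os.environ["ALPACA_API_KEY"] = "demo"  # stand-in for Vault injection
print(require_env("ALPACA_API_KEY"))
```

The code has no Vault client, no token handling, no secret fetching. That absence is the security posture.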
What connects all of this is a consistent underlying pattern: the platform takes an opinion, encodes it into automation or tooling or template, and removes the decision from the individual service author.
How should a service connect to Redis? _utils/redis.py. How should metrics be written to InfluxDB? _utils/influx.py with auto-bucketing. How should Kubernetes manifests be structured? The Kustomize base. How should an order be validated? Through Slanger’s gates. How should a secret reach a running container? Through Vault injection.
Every one of these is a platform opinion made default. The service author does not choose. They inherit the standard, and the standard handles the complexity.
This is worth dwelling on because it is the actual leverage that platform engineering is supposed to produce. Not saving a few hours on a deployment. Not writing a nice README. Not setting up a Grafana dashboard. The leverage is the compounding effect of dozens of engineers not solving the same infrastructure problems independently, because the platform already solved them and made the solution the obvious path.
Mimicus is a fintech trading platform, but the lessons it encodes are not specific to trading. Shared cross-cutting concerns belong in one place. Documentation that can be generated should be generated. Onboarding should be a template, not a process. Risk controls should be explicit, sequential, and auditable. Observability should be structural, not optional. Secrets should never be a decision the application makes.
These are not sophisticated ideas. They are the same ideas that every platform engineering team talks about. The difference is that Mimicus actually built them, wired them together, and made them the default experience for every service in the system.
That gap — between the diagram and the thing that actually runs — is where most of the work lives.