Objectives
- Build predictive models (risk scoring, price forecasting, MEV detection) with reliable ground truth.
- Feed agentic systems (Copilots, MCP agents) with curated datasets instead of noisy RPC scraping.
- Maintain reproducibility by tying every feature back to `_tracing_id` and verification metadata.
Data Building Blocks
| Feature Type | Dataset(s) | Usage |
|---|---|---|
| Ledger signals | blocks, transactions, logs | Extract temporal patterns, gas spikes, contract interactions. |
| Entity features | erc20 tokens, erc721 tokens, contracts | Enrich models with token metadata, compliance flags, contract types. |
| Market context | token-to-token prices & pricing analytics | Derive volatility, spreads, liquidity-adjusted price moves (see the sketch below the table). |
| Provenance labels | Lineage datasets & Verification suite | Build training labels that prove whether data survived integrity checks. |
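For example, the market-context row maps to a straightforward feature job: derive log returns and rolling volatility per token pair from the token-to-token price export. The sketch below is illustrative only; the column names (`token_pair`, `price`, `block_timestamp`) and file path are assumptions, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical local export of the token-to-token price dataset.
prices = pd.read_parquet("blockdb/token_to_token_prices.parquet")
prices["block_timestamp"] = pd.to_datetime(prices["block_timestamp"])

# Last observed price per pair per hour.
prices["hour"] = prices["block_timestamp"].dt.floor("h")
hourly = (
    prices.sort_values("block_timestamp")
          .groupby(["token_pair", "hour"], as_index=False)
          .last()
)

# Log returns and a rolling 24-hour volatility feature per pair.
hourly["log_return"] = (
    hourly.groupby("token_pair")["price"].transform(lambda p: np.log(p).diff())
)
hourly["volatility_24h"] = (
    hourly.groupby("token_pair")["log_return"]
          .transform(lambda r: r.rolling(24, min_periods=12).std())
)
```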
Pipeline Pattern
- Historical load through Archive Bulk Delivery to create feature stores (Snowflake, Databricks, BigQuery).
- Feature engineering with dbt or Spark, ensuring `_tracing_id` is preserved for explainability (see the Spark sketch after this list).
- Model training/deployment in your ML stack (Vertex AI, SageMaker, Databricks ML) referencing BlockDB artifacts.
- Incremental refresh via REST pollers or WebSocket feeds to keep inference features current (a minimal poller sketch also follows).
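A minimal PySpark sketch of the feature-engineering step, assuming the bulk-delivered transactions and ERC-20 token datasets have been landed as Parquet and that columns such as `gas_used`, `to_address`, and `block_timestamp` match your export (all illustrative, not the canonical schema). The key point is that `_tracing_id` travels with every aggregated feature row:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blockdb-features").getOrCreate()

# Ledger signals: per-transaction gas usage and timestamps (hypothetical paths/schema).
tx = spark.read.parquet("s3://warehouse/blockdb/transactions/")
# Entity features: ERC-20 token metadata such as decimals and compliance flags.
tokens = spark.read.parquet("s3://warehouse/blockdb/erc20_tokens/")

# Hourly gas-spike features per token, keeping _tracing_id for provenance.
features = (
    tx.join(tokens, tx.to_address == tokens.token_address, "left")
      .withColumn("hour", F.date_trunc("hour", F.col("block_timestamp")))
      .groupBy("token_address", "hour")
      .agg(
          F.avg("gas_used").alias("avg_gas_used"),
          F.max("gas_used").alias("max_gas_used"),
          F.count("*").alias("tx_count"),
          F.collect_set("_tracing_id").alias("tracing_ids"),  # provenance labels
      )
)

features.write.mode("overwrite").parquet("s3://warehouse/features/token_gas_hourly/")
```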
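And a sketch of an incremental-refresh poller that advances a block-number cursor. The base URL, query parameters, and response fields here are placeholder assumptions, not the documented REST API:

```python
import time
import requests

# Endpoint path, parameters, and response shape are illustrative assumptions.
BASE_URL = "https://blockdb.example.com/v1"
API_KEY = "YOUR_API_KEY"


def poll_new_blocks(after_block: int, limit: int = 100) -> list[dict]:
    """Fetch blocks newer than the last one already materialized as features."""
    resp = requests.get(
        f"{BASE_URL}/evm/blocks",
        params={"from_block": after_block + 1, "limit": limit},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


cursor = 19_000_000  # highest block number already in the feature store
while True:
    for block in poll_new_blocks(cursor):
        # Recompute only the feature rows affected by the new block,
        # carrying _tracing_id through so inference stays explainable.
        print(block["number"], block.get("_tracing_id"))
        cursor = max(cursor, block["number"])
    time.sleep(5)
```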
Tips for AI Agents
- Use the Model Context Protocol (MCP) for tool-augmented agents that need curated responses with built-in guardrails.
- Cache deterministic function results (`/evm/function-results`) to avoid recomputing expensive call traces (see the caching sketch after this list).
- When generating synthetic data, log `_tracing_id` pairs so auditors can recreate the same prompt/response context.
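A sketch of a deterministic-result cache around `/evm/function-results` (the path is from this section; the host, auth header, and query parameters are assumptions). Because results for a fixed block never change, an in-process LRU cache is usually enough for an agent loop:

```python
from functools import lru_cache

import requests

BASE_URL = "https://blockdb.example.com/v1"  # placeholder host
API_KEY = "YOUR_API_KEY"


@lru_cache(maxsize=4096)
def function_result(contract: str, selector: str, block_number: int) -> dict:
    """Deterministic call results never change for a fixed block, so cache them."""
    resp = requests.get(
        f"{BASE_URL}/evm/function-results",
        params={
            "contract": contract,        # query parameter names are assumptions
            "selector": selector,
            "block_number": block_number,
        },
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

An agent tool wrapper can call `function_result(...)` directly and log the response's `_tracing_id` next to each generated prompt/response pair, which covers the synthetic-data auditing point above.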
Governance
- Track schema drift via Schema Governance webhooks and update feature pipelines accordingly (a webhook-handler sketch follows below).
- Store verification hashes adjacent to your feature store to prove the lineage of each training example (see the hashing sketch below).
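A hedged sketch of a Schema Governance webhook receiver, assuming a JSON payload with `dataset`, `added_columns`, and `removed_columns` fields; the real event shape may differ:

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/hooks/schema-governance", methods=["POST"])
def schema_change():
    # Payload fields below are assumptions, not the documented webhook contract.
    event = request.get_json(force=True)
    dataset = event.get("dataset")
    added = event.get("added_columns", [])
    removed = event.get("removed_columns", [])

    if removed:
        # A dropped column can silently break downstream feature jobs:
        # pause the affected pipeline and alert the owning team.
        print(f"BLOCKING: {dataset} removed columns {removed}")
    elif added:
        # Additive drift is usually safe; record it for the next retraining cycle.
        print(f"INFO: {dataset} added columns {added}")
    return {"status": "received"}, 200


if __name__ == "__main__":
    app.run(port=8080)
```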
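And one way to keep verification material adjacent to the feature store: hash the raw source records behind each training example and write the digest next to the feature row. Column names and record layout are illustrative:

```python
import hashlib
import json

import pandas as pd


def verification_hash(source_records: list[dict]) -> str:
    """Stable SHA-256 over the raw records a training example was derived from."""
    canonical = json.dumps(source_records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative feature rows; tracing_ids point back at the source ledger records.
features = pd.DataFrame(
    {
        "token_address": ["0x0000000000000000000000000000000000000001"],
        "avg_gas_used": [123456.0],
        "tracing_ids": [["trace-a", "trace-b"]],
    }
)

# Store the hash in an adjacent column so each example's lineage can be re-proven.
features["verification_hash"] = [
    verification_hash([{"_tracing_id": tid} for tid in ids])
    for ids in features["tracing_ids"]
]
features.to_parquet("features/token_gas_hourly_with_lineage.parquet")
```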