Objectives

  • Build predictive models (risk scoring, price forecasting, MEV detection) with reliable ground truth.
  • Feed agentic systems (Copilots, MCP agents) with curated datasets instead of noisy RPC scraping.
  • Maintain reproducibility by tying every feature back to _tracing_id and verification metadata.

Data Building Blocks

| Feature Type | Dataset(s) | Usage |
| --- | --- | --- |
| Ledger signals | blocks, transactions, logs | Extract temporal patterns, gas spikes, contract interactions. |
| Entity features | erc20 tokens, erc721 tokens, contracts | Enrich models with token metadata, compliance flags, contract types. |
| Market context | token-to-token prices & pricing analytics | Derive volatility, spreads, liquidity-adjusted price moves. |
| Provenance labels | Lineage datasets & Verification suite | Build training labels that prove whether data survived integrity checks. |
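
As an example, the ledger and entity datasets above can be combined into model features with a few lines of pandas. This is only a sketch: the file names and column names (gas_price, to_address, contract_address, _tracing_id, and so on) are assumptions to be checked against the actual dataset schemas.

```python
import pandas as pd

# Hypothetical exports of the `transactions` and `erc20 tokens` datasets;
# adjust paths and column names to your delivered schemas.
transactions = pd.read_parquet("transactions.parquet")   # ledger signals
erc20_tokens = pd.read_parquet("erc20_tokens.parquet")   # entity features

# Temporal/gas features per block: average gas price and a spike ratio.
gas = (
    transactions
    .groupby("block_number", as_index=False)
    .agg(avg_gas_price=("gas_price", "mean"),
         max_gas_price=("gas_price", "max"))
)
gas["gas_spike_ratio"] = gas["max_gas_price"] / gas["avg_gas_price"]

# Enrich transaction rows with token metadata (symbol, decimals).
features = transactions.merge(
    erc20_tokens[["contract_address", "symbol", "decimals"]],
    how="left",
    left_on="to_address",
    right_on="contract_address",
)

# Keep _tracing_id so every engineered row stays traceable to its source record.
print(features[["_tracing_id", "block_number", "symbol", "decimals"]].head())
```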

Pipeline Pattern

  1. Historical load through Archive Bulk Delivery to create feature stores (Snowflake, Databricks, BigQuery).
  2. Feature engineering with dbt or Spark, ensuring _tracing_id is preserved for explainability (see the Spark sketch after this list).
  3. Model training/deployment in your ML stack (Vertex, SageMaker, Databricks ML) referencing BlockDB artifacts.
  4. Incremental refresh via REST pollers or WebSocket feeds to keep inference features current (a minimal poller sketch follows below).
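
Step 2 can be as simple as the following PySpark sketch, which keeps _tracing_id attached to every aggregate so model explanations can point back to the source records. The storage paths and column names are illustrative, not the canonical schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blockdb-features").getOrCreate()

# Hypothetical path to an Archive Bulk Delivery export of `transactions`.
tx = spark.read.parquet("s3://my-feature-store/raw/transactions/")

# Per-address activity features; _tracing_id values are collected so each
# aggregate can be traced back to the rows it was derived from.
features = (
    tx.groupBy("from_address")
      .agg(
          F.count("*").alias("tx_count"),
          F.avg("gas_price").alias("avg_gas_price"),
          F.collect_set("_tracing_id").alias("source_tracing_ids"),
      )
)

features.write.mode("overwrite").parquet(
    "s3://my-feature-store/features/address_activity/"
)
```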
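
For step 4, a minimal REST poller might look like the sketch below. The base URL, /blocks route, cursor parameter, and auth header are assumptions; consult the API reference for the exact pagination contract.

```python
import time

import requests

# Placeholder base URL and credentials -- substitute your real endpoint/key.
BASE_URL = "https://api.example.com/v1"
API_KEY = "..."

def poll_new_blocks(cursor=None, interval_s=15):
    """Poll for new block rows and yield them for upsert into the feature store."""
    while True:
        params = {"cursor": cursor} if cursor else {}
        resp = requests.get(
            f"{BASE_URL}/blocks",
            params=params,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        for row in payload.get("data", []):
            yield row                      # each row carries its _tracing_id
        cursor = payload.get("next_cursor", cursor)
        time.sleep(interval_s)
```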

Tips for AI Agents

  • Use the Model Context Protocol (MCP) for tool-augmented agents that need curated responses with built-in guardrails.
  • Cache deterministic function results (/evm/function-results) to avoid recomputing expensive call traces (see the caching sketch after this list).
  • When generating synthetic data, log _tracing_id pairs so auditors can recreate the same prompt/response context.
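
A minimal cache around the /evm/function-results endpoint mentioned above could look like this sketch. The base URL, query parameters, and response fields are assumptions to be checked against the API reference; only the endpoint path comes from the docs.

```python
import hashlib
import json
from functools import lru_cache

import requests

BASE_URL = "https://api.example.com"   # placeholder base URL
API_KEY = "..."

@lru_cache(maxsize=4096)
def function_result(contract: str, selector: str, block_number: int):
    """Fetch a deterministic function result once per (contract, selector, block)."""
    resp = requests.get(
        f"{BASE_URL}/evm/function-results",
        params={"contract": contract, "selector": selector, "block": block_number},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    # Log the _tracing_id next to the cache key so auditors can recreate
    # the exact prompt/response context later.
    print(json.dumps({
        "_tracing_id": body.get("_tracing_id"),
        "cache_key": hashlib.sha256(
            f"{contract}:{selector}:{block_number}".encode()
        ).hexdigest(),
    }))
    return body.get("result")
```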

Governance

  • Track schema drift via Schema Governance webhooks and update feature pipelines accordingly.
  • Store verification hashes adjacent to your feature store to prove the lineage of each training example (a minimal sketch follows below).
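
One lightweight way to keep verification hashes next to the feature store is a small lineage table keyed by _tracing_id, as in this sketch. SQLite and the table layout are illustrative, not a prescribed schema; swap in your warehouse of choice.

```python
import hashlib
import json
import sqlite3

# Lineage table stored alongside the feature rows it describes.
conn = sqlite3.connect("feature_store.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS feature_lineage (
        tracing_id        TEXT PRIMARY KEY,
        verification_hash TEXT NOT NULL,
        source_dataset    TEXT NOT NULL
    )
""")

def record_lineage(row: dict, source_dataset: str) -> None:
    """Hash the raw source record and store it beside its _tracing_id."""
    digest = hashlib.sha256(
        json.dumps(row, sort_keys=True).encode("utf-8")
    ).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO feature_lineage VALUES (?, ?, ?)",
        (row["_tracing_id"], digest, source_dataset),
    )
    conn.commit()

# Example: register one raw transaction row used to build a training example.
record_lineage(
    {"_tracing_id": "abc-123", "block_number": 19000000, "gas_price": 42},
    "transactions",
)
```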