Soldev

Fetch historical data

Solana focuses on high-speed state transitions (400 ms blocks). The trade-off is that historical state is not retained; it has to be re-created. There is no built-in mechanism to index transactions or track state changes at the per-slot level.

The Solana ledger grows by 100TB every year. Worse yet, any historical state reconstruction requires processing account updates at every slot. You also need to deal with ZK accounts that can reach megabytes in size, and with data emitted to logs, which RPC nodes love to truncate at a 1KB limit.

RPC nodes typically store only between 4 hours and 3 days worth of data.

The most complete record of Solana's data lives in the Google Bigtable instance that Solana adopted at the start of its development, largely out of convenience. Since then, there have been initiatives to improve on it.

Managing many account subscriptions at once is tough: you have to fill gaps left by dropped connections, and tracking someone's portfolio means monitoring multiple token accounts rather than a single wallet address.
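To make the last point concrete, here is a minimal sketch of aggregating a portfolio across many token accounts. The `TokenAccount` shape and the `portfolioByMint` helper are hypothetical; they loosely mirror what an RPC call like `getParsedTokenAccountsByOwner` returns, but in practice you would fetch (and subscribe to) each of these accounts rather than construct them locally.

```typescript
// Hypothetical shape resembling a parsed SPL token account balance.
interface TokenAccount {
  mint: string;     // the token's mint address
  uiAmount: number; // human-readable balance
}

// One wallet owns many token accounts, often several per mint,
// so a "portfolio" is an aggregation, not a single account read.
function portfolioByMint(accounts: TokenAccount[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const acc of accounts) {
    totals.set(acc.mint, (totals.get(acc.mint) ?? 0) + acc.uiAmount);
  }
  return totals;
}

// Example: two accounts for the same mint plus one wrapped-SOL account.
const accounts: TokenAccount[] = [
  { mint: "USDC", uiAmount: 100 },
  { mint: "USDC", uiAmount: 25 },
  { mint: "wSOL", uiAmount: 2 },
];
console.log(portfolioByMint(accounts).get("USDC")); // 125
```

The mint labels here are placeholders; real mints are base58 addresses, and every one of these accounts is a separate subscription to keep alive.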

Old Faithful

This is a project by Triton One that uses Content Addressable aRchive (CAR) files to reduce the storage cost of querying historical data.

The purpose of Old Faithful is to copy the entire Solana archive into CAR files that can be accessed by the community.

The content addressable part just means that each epoch, block, transaction, and shred is uniquely identified by a content hash, a CID.
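The core idea behind content addressing can be sketched in a few lines: the identifier of a piece of data is a hash of its bytes, so identical content always yields the same ID and any change produces a different one. Real CIDs wrap the digest with multihash/multicodec metadata; this sketch shows only the underlying concept, not the actual CID format.

```typescript
import { createHash } from "node:crypto";

// Conceptual content ID: a hash of the raw bytes.
// Actual CIDs add a multihash/multicodec prefix on top of such a digest.
function contentId(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

const enc = new TextEncoder();
const block = enc.encode("transaction bytes for slot 12345");

// Same bytes, same ID; different bytes, different ID.
console.log(contentId(block) === contentId(enc.encode("transaction bytes for slot 12345"))); // true
console.log(contentId(block) === contentId(enc.encode("different bytes"))); // false
```

This property is what makes the archives verifiable: anyone holding the bytes can recompute the ID and confirm they match.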

All these CIDs are stored in the following indexes:

slot-to-cid: Look up a CID based on a slot number
tx-to-cid: Look up a CID based on a transaction signature
cid-to-offset-and-size: Index for a specific CAR file, used by the local RPC server to find CIDs in that CAR file
sig-exists: Check whether a specific signature exists in an epoch
gsfa: Maps Solana addresses to a list of transaction signatures

At the end of each epoch, a snapshot is generated using the default solana-ledger-tool, which Solana warehouse nodes use to record the full epoch into a single archive.

Snapshots are captured and written at a fixed, infrequent interval. The entire collection of snapshots going back to genesis is called "the warehouse", which is stored in Google Cloud Storage (GCS).

Interestingly, they found 500 slots in which no data exists anywhere.

Content Addressable Archive Format

A CAR file is a sequence of bytes described by the InterPlanetary Linked Data (IPLD) data model.

You can check out the data schema to learn the specifics about the kind of data stored in these archives.

IPLD is like an Abstract Syntax Tree (AST) for data, but without the "S". It looks and feels roughly like JSON. Codecs provide serialization/deserialization between bytes and a specific DataModel.
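The codec idea can be illustrated with a tiny sketch. IPLD ships real codecs such as DAG-CBOR and DAG-JSON; here a plain JSON codec stands in to show the shape of the abstraction, nothing more.

```typescript
// A codec turns an in-memory data model node into bytes and back.
interface Codec<T> {
  encode(value: T): Uint8Array;
  decode(bytes: Uint8Array): T;
}

// Stand-in codec using plain JSON; real IPLD codecs (DAG-CBOR, DAG-JSON)
// follow the same encode/decode contract over the same data model.
const jsonCodec: Codec<unknown> = {
  encode: (v) => new TextEncoder().encode(JSON.stringify(v)),
  decode: (b) => JSON.parse(new TextDecoder().decode(b)),
};

// A JSON-like data model node, round-tripped through the codec.
const node = { slot: 12345, txs: ["sig1", "sig2"] };
const roundTripped = jsonCodec.decode(jsonCodec.encode(node));
console.log(JSON.stringify(roundTripped) === JSON.stringify(node)); // true
```

Swapping `jsonCodec` for a CBOR-based one changes the bytes on disk, not the data model the rest of the code sees.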

The CAR format is a serialized representation of any IPLD DAG (graph) that is made up of:

  1. Header block
  2. One or more IPLD blocks, concatenated together

You can think of it like a simple .tar file, but for IPLD DAGs.
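The tar analogy can be made concrete with a sketch of walking a CAR byte stream, assuming the CARv1 layout: the header comes first, and every section (header and blocks alike) is prefixed with an unsigned varint giving its length in bytes. Decoding the header CBOR or the CIDs inside each block section is out of scope here.

```typescript
// Read an unsigned LEB128 varint starting at `offset`;
// returns the value and the offset just past it.
function readVarint(buf: Uint8Array, offset: number): [number, number] {
  let value = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    const byte = buf[pos++];
    value |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) return [value, pos];
    shift += 7;
  }
}

// Split a CAR stream into its length-prefixed sections:
// sections[0] is the header, the rest are CID + block data.
function splitSections(car: Uint8Array): Uint8Array[] {
  const sections: Uint8Array[] = [];
  let offset = 0;
  while (offset < car.length) {
    const [len, next] = readVarint(car, offset);
    sections.push(car.slice(next, next + len));
    offset = next + len;
  }
  return sections;
}

// Synthetic stream: a 3-byte "header" section and a 2-byte "block" section.
const stream = new Uint8Array([3, 0xaa, 0xbb, 0xcc, 2, 0x01, 0x02]);
const sections = splitSections(stream);
console.log(sections.length); // 2
```

Because every section carries its own length, a reader can skip straight through a multi-gigabyte archive without parsing block contents, which is what makes the `cid-to-offset-and-size` style of index practical.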

Avoid events

Eventually you might be lured by Solana's events. There are two kinds:

  1. Events emitted to the logs
  2. Events emitted via CPI

Parsing logs makes for a terrible indexing experience so it's mostly a dead-end unless you enjoy pain.

It is possible to parse the kind of events emitted via CPI, and they tend to be more durable than logs, which providers try to avoid storing for very long.

The overall advice is that you rarely need to depend on events. The more durable and easier way to get everything you need is usually through the instructions.

Lean into instructions

The most reliable source of historical information is to filter for transactions containing the instructions you are interested in.

Once you have a transaction, you can use the instruction's account layout and the accounts in the transaction to recover all the information an event would have given you.

This works both with real-time listening through websockets and with catch-up mechanisms like crawling past transactions via an RPC.
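The filtering step can be sketched as a pure function over a transaction message. The shapes below loosely mirror what a `getTransaction` RPC response contains (account keys plus compiled instructions whose `programIdIndex` points into them), but all the values here are fabricated for illustration; in a real indexer you would fetch transactions via `getSignaturesForAddress` / `getTransaction` and then decode `data` with the program's layout.

```typescript
// Simplified compiled instruction, as found in a transaction message.
interface CompiledInstruction {
  programIdIndex: number; // index into the transaction's account keys
  accounts: number[];     // indices of the accounts this instruction touches
  data: string;           // program-specific instruction data
}

interface TxMessage {
  accountKeys: string[];
  instructions: CompiledInstruction[];
}

// Keep only the instructions addressed to the program we care about.
function instructionsForProgram(msg: TxMessage, programId: string): CompiledInstruction[] {
  return msg.instructions.filter(
    (ix) => msg.accountKeys[ix.programIdIndex] === programId
  );
}

// Fabricated example: a system instruction plus one for a hypothetical program.
const msg: TxMessage = {
  accountKeys: ["Wallet111", "Recipient1", "11111111111111111111111111111111", "MyProgram1"],
  instructions: [
    { programIdIndex: 2, accounts: [0, 1], data: "transfer" },
    { programIdIndex: 3, accounts: [0], data: "custom" },
  ],
};
console.log(instructionsForProgram(msg, "MyProgram1").length); // 1
```

The same filter works whether `msg` arrived over a websocket subscription or from a historical crawl, which is exactly why instructions make a better indexing backbone than events.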