Fetch historical data
Solana is optimized for high-speed state transitions (400 ms blocks). The trade-off is that historical state is not really retained; it has to be re-created. There is no built-in mechanism to index transactions or track state changes at the per-slot level.
The Solana ledger grows by roughly 100 TB every year. Worse yet, any historical state reconstruction requires processing account updates at every slot. You also need to deal with ZK accounts that can reach megabytes in size, and with data emitted to logs, which RPC nodes love to truncate at a 1KB limit.
RPC nodes typically retain only between 4 hours and 3 days worth of data.
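You can ask a node directly how far back it goes. A minimal sketch with @solana/web3.js, assuming the public mainnet endpoint (run as an ES module):

```ts
import { Connection } from "@solana/web3.js";

const connection = new Connection("https://api.mainnet-beta.solana.com");

// The lowest slot this node still has ledger data for; anything
// older has to come from an archive.
const firstSlot = await connection.getFirstAvailableBlock();
const currentSlot = await connection.getSlot();

console.log(`node retains slots ${firstSlot}..${currentSlot}`);
```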
The most complete record of Solana's history lives in the Google Bigtable instance that Solana adopted at the start of its development. This was largely a matter of convenience, and there have since been initiatives to improve on it.
It is also tough to manage many account subscriptions at once, and you have to fill in the gaps whenever a connection drops. If you want to track someone's portfolio, you have to monitor multiple token accounts instead of a single wallet address.
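To illustrate the fan-out, here is a minimal sketch that enumerates a wallet's SPL token accounts and subscribes to each one individually (the wallet address is a placeholder):

```ts
import { Connection, PublicKey } from "@solana/web3.js";
import { TOKEN_PROGRAM_ID } from "@solana/spl-token";

const connection = new Connection("https://api.mainnet-beta.solana.com");
const owner = new PublicKey("<wallet address>"); // placeholder

// One wallet fans out into many token accounts...
const { value: tokenAccounts } = await connection.getTokenAccountsByOwner(
  owner,
  { programId: TOKEN_PROGRAM_ID },
);

// ...and each one needs its own subscription.
for (const { pubkey } of tokenAccounts) {
  connection.onAccountChange(pubkey, (_info, ctx) => {
    console.log(`${pubkey.toBase58()} changed at slot ${ctx.slot}`);
  });
}
```

Note that this only covers token accounts that existed at scan time; accounts created afterwards are exactly the kind of gap you end up having to fill by hand.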
Old Faithful
This is a project by Triton One that uses Content Addressable aRchive (CAR) files to cut the storage costs of keeping historical data queryable.
The purpose of Old Faithful is to copy the entire Solana archive into CAR files that can be accessed by the community.
The content-addressable part just means that each epoch, block, transaction and shred is uniquely identified by a content hash, a CID.
For each CID:
- Any difference in the content will produce a different CID
- The same content added to two different IPFS nodes using the same settings will produce the same CID.
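Both properties follow from how a CID is built: hash the encoded content, then wrap the digest in codec and version metadata. A minimal sketch with the JavaScript multiformats package (the raw codec is used purely for illustration; Old Faithful defines its own IPLD schema):

```ts
import { CID } from "multiformats/cid";
import * as raw from "multiformats/codecs/raw";
import { sha256 } from "multiformats/hashes/sha2";

async function cidOf(data: Uint8Array): Promise<CID> {
  // Hash the encoded content, then wrap the digest in a CIDv1.
  const digest = await sha256.digest(raw.encode(data));
  return CID.create(1, raw.code, digest);
}

const a = await cidOf(new TextEncoder().encode("hello"));
const b = await cidOf(new TextEncoder().encode("hello!"));

console.log(a.toString()); // identical content always yields this CID
console.log(b.toString()); // one changed byte yields an unrelated CID
```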
All these CIDs are stored in one of five indexes:
Name | Description |
---|---|
slot-to-cid | Look up a CID based on a slot number |
tx-to-cid | Look up a CID based on a transaction signature |
cid-to-offset-and-size | Per-CAR-file index, used by the local RPC server to find CIDs inside a CAR file |
sig-exists | Check whether a specific signature exists in an epoch |
gsfa | Maps Solana addresses to a list of transaction signatures |
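The gsfa index takes its shape from the standard getSignaturesForAddress RPC call. For reference, this is what that query looks like against an ordinary RPC node with @solana/web3.js (the address here is just the SPL Token program, picked as an example):

```ts
import { Connection, PublicKey } from "@solana/web3.js";

const connection = new Connection("https://api.mainnet-beta.solana.com");
const address = new PublicKey("TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA");

// The gsfa index answers exactly this kind of query:
// address -> list of transaction signatures, newest first.
const signatures = await connection.getSignaturesForAddress(address, {
  limit: 10,
});

for (const info of signatures) {
  console.log(info.signature, info.slot);
}
```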
At the end of each epoch a snapshot is generated using the default solana-ledger-tool, which Solana warehouse nodes use to record the full epoch into a single archive. Snapshots are therefore captured and written at a fixed, infrequent interval. The entire collection of snapshots going back to genesis is called "the warehouse" and is stored in Google Cloud Storage (GCS).
Interestingly, they found 500 slots in which no data exists anywhere.
Content Addressable Archive Format
A CAR file is a sequence of bytes described by an InterPlanetary Linked Data (IPLD) data model.
You can check out the data schema to learn the specifics about the kind of data stored in these archives.
IPLD is an Abstract Syntax Tree (AST) for data, but without the "S". It looks and feels roughly like JSON. Codecs provide serialization/deserialization to a specific DataModel.
The CAR format is a serialized representation of any IPLD DAG (graph) and is made up of:
- Header block
- One or more IPLD blocks, concatenated together
You can think of it like a simple .tar file, but for IPLD DAGs.
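To make that concrete, here is a hedged sketch that walks the blocks of a CAR file using the JavaScript @ipld/car package (the file name is a placeholder; an Old Faithful epoch archive would be read the same way):

```ts
import fs from "node:fs";
import { CarReader } from "@ipld/car";

// A CAR file streams as a header (carrying the root CIDs) followed
// by concatenated (CID, block bytes) pairs.
const reader = await CarReader.fromIterable(
  fs.createReadStream("epoch-0.car"),
);

console.log("roots:", await reader.getRoots());

for await (const { cid, bytes } of reader.blocks()) {
  console.log(cid.toString(), `${bytes.length} bytes`);
}
```

The header carries the root CIDs; everything after it is just concatenated blocks, which is what keeps the format cheap to produce and stream.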
Avoid events
Eventually you might be lured by Solana's events. There are two kinds:
- Events emitted to the logs
- Events emitted via CPI
Parsing logs makes for a terrible indexing experience, so it's mostly a dead end unless you enjoy pain.
Events emitted via CPI, on the other hand, can be parsed, and they tend to be more durable than log events, since the data is part of the transaction itself rather than sitting in logs that providers try to avoid storing for very long.
The overall advice is that you rarely need to depend on events at all. The more durable and easier-to-parse source of everything you need is usually the instructions.
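For completeness, CPI events surface as inner instructions on a fetched transaction, which is why they outlive the logs. A minimal sketch with @solana/web3.js (the signature is a placeholder, and decoding the payload depends on the emitting program):

```ts
import { Connection } from "@solana/web3.js";

const connection = new Connection("https://api.mainnet-beta.solana.com");

const tx = await connection.getTransaction("<signature>", {
  maxSupportedTransactionVersion: 0,
});

// CPI events ride along as inner instructions, so they are part of
// the transaction record itself rather than the truncatable logs.
for (const inner of tx?.meta?.innerInstructions ?? []) {
  for (const ix of inner.instructions) {
    console.log(`program index ${ix.programIdIndex}, data: ${ix.data}`);
  }
}
```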
Lean into instructions
The most reliable source of historical information is to filter for transactions containing the instructions you are interested in.
Once you have the transaction, you can use the instruction's account layout together with the accounts in the transaction to recover all the information an event would have given you.
This works both with real-time listening through websockets and with catch-up mechanisms like crawling previous transactions over RPC.
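A minimal sketch of the catch-up side, assuming a hypothetical program ID: crawl the program's signatures, fetch each transaction, and pick out the instructions it owns (decoding the data and account layout is program-specific):

```ts
import { Connection, PublicKey } from "@solana/web3.js";

const connection = new Connection("https://api.mainnet-beta.solana.com");
const programId = new PublicKey("<your program id>"); // hypothetical

const signatures = await connection.getSignaturesForAddress(programId, {
  limit: 100,
});

for (const { signature } of signatures) {
  const tx = await connection.getTransaction(signature, {
    maxSupportedTransactionVersion: 0,
  });
  if (!tx) continue;

  const keys = tx.transaction.message.staticAccountKeys;
  for (const ix of tx.transaction.message.compiledInstructions) {
    if (!keys[ix.programIdIndex].equals(programId)) continue;
    // ix.accountKeyIndexes plus the program's account layout recover
    // everything an emitted event would have; ix.data holds the args.
    console.log(signature, Buffer.from(ix.data).toString("base64"));
  }
}
```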