hi! i've been working on hydrant for quite a bit now and i wanted to write up something small to explain why it should exist alongside tap, and what some of its benefits are. if you don't know about it yet, tap is a sync utility; you can read about it here, otherwise this won't make much sense.
so what is hydrant? basically, it's an indexer: it indexes records while also handling sync. tap is not an indexer because it doesn't store any records; it just gives you a channel to consume events from so you can build your own index, and the events aren't persisted. hydrant on the other hand was built to persist these (and possibly more, we'll get to that). but why should you care about this? doesn't it make more sense to just build whatever you want to build so you can fit your index to your needs? most people will probably just do this, and that's completely fine really (hydrant even has an ephemeral mode to only keep a window of events!).
so why use it?
...why would you want to keep an index with hydrant then? a shared index can help you bootstrap other apps, it can help you bootstrap multiple instances of the same app, it can help other people use your app's data so they don't have to waste time indexing things. and in my opinion it's great for people to have something they can easily use to host indexes of things, convenience should lead to more decentralization.
of course it also has some technical consequences, eg. you get a cursor-based event stream as opposed to tap's acking model; this simplifies event handling IMO, as you don't have to think about how long to hold an event, or if it's crash-safe / durable after you ack, as you can just replay events. you don't have to worry about data going missing / corrupted or incorrectly handling some data because you can just ask hydrant instead of implementing your own code for this in every app. it lets you use standard XRPC queries (com.atproto.*) to query data, and can even index backlinks (blue.microcosm.links.*) and have other custom XRPCs (for example, getting the amount of records a repo has for a given collection).
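to make the cursor point a bit more concrete, here's a minimal sketch of what cursor-based consumption looks like from the consumer's side. none of this is hydrant's actual API: `Event`, `EventLog` and `Consumer` are hypothetical stand-ins, and the log is just a `Vec` in memory. the only thing it's meant to show is that the consumer remembers a single number, and after a crash it simply asks for everything after that number again.

```rust
// hypothetical stand-ins for illustration, not hydrant's real types
#[derive(Clone, Debug, PartialEq)]
struct Event {
    id: u64,         // monotonically increasing position in the log
    payload: String, // stand-in for the actual record data
}

// a persisted, replayable log: events stay around after being read
struct EventLog {
    events: Vec<Event>,
}

impl EventLog {
    // every event with an id strictly greater than `cursor`
    fn since(&self, cursor: u64) -> impl Iterator<Item = &Event> + '_ {
        self.events.iter().filter(move |e| e.id > cursor)
    }
}

struct Consumer {
    cursor: u64,       // id of the last event that was fully handled
    seen: Vec<String>, // stand-in for whatever the app builds from events
}

impl Consumer {
    fn new(cursor: u64) -> Self {
        Consumer { cursor, seen: Vec::new() }
    }

    // handle everything after the cursor, advancing it as we go
    fn catch_up(&mut self, log: &EventLog) {
        for ev in log.since(self.cursor) {
            self.seen.push(ev.payload.clone());
            // in a real app this would be stored durably together with
            // the write above, so a restart resumes from the right place
            self.cursor = ev.id;
        }
    }
}
```

a consumer that crashed after handling event 3 would be restarted with `Consumer::new(3)` and would only see events 4 and up; there is nothing to ack and nothing to time out.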
hydrant also implements runtime management of filters and components (crawler, firehose, backfills / sync). you can configure collection signals (there can be multiple), which collections to store and which repos to exclude, and you can pause / resume components and so on. tap doesn't let you do this (yet, anyway, it shouldn't be too hard to add i imagine ^^). this also extends to managing crawler sources and firehose sources of course. on that topic, hydrant lets you subscribe to multiple firehoses, so you can mix and match relays (partial relays anyone?) or even connect to PDSes directly. it supports crawling via listReposByCollection, but also via listRepos + describeRepo if you don't want to depend on that / want to build your own index (or both at the same time). overall it has more features, but i'll do you one better:
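the difference between signals and stored collections can be a bit abstract, so here's a rough model of it. this is an assumption about the semantics based on the description above, not hydrant's real filter code: the `Filter` struct and its methods are made up for illustration, and the real thing may treat edge cases (like an empty collection list) differently.

```rust
use std::collections::HashSet;

// hypothetical model of the filter, not hydrant's real types
struct Filter {
    signals: HashSet<String>,     // collections that make a repo worth tracking
    collections: HashSet<String>, // collections whose records get stored
    excluded: HashSet<String>,    // DIDs of repos to ignore entirely
}

impl Filter {
    // a repo is tracked if it isn't excluded and publishes at least
    // one of the signal collections
    fn tracks_repo(&self, did: &str, repo_collections: &[&str]) -> bool {
        !self.excluded.contains(did)
            && repo_collections.iter().any(|c| self.signals.contains(*c))
    }

    // a record from a tracked repo is only stored if its collection is listed
    fn stores(&self, collection: &str) -> bool {
        self.collections.contains(collection)
    }
}
```

with `xyz.statusphere.status` as both the only signal and the only stored collection, a repo that has statusphere records and bluesky posts would be tracked, but only its status records would end up in the index.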
hydrant-as-library
you can use hydrant as a library in Rust (in fact the HTTP API just uses this), meaning you don't necessarily even have to run a binary. see below for a snippet from the statusphere example:

let cfg = Config::from_env()?;
let hydrant = Hydrant::new(cfg).await?;

// discover only repos that publish xyz.statusphere.status records,
// and only store that collection (all other record types are dropped).
hydrant
    .filter
    .set_mode(FilterMode::Filter)
    .set_signals([COLLECTION])
    .set_collections([COLLECTION])
    .apply()
    .await?;

// replay all persisted events from the start to rebuild the in-memory index,
// then switch to live tail. since the index is in-memory, we always need the
// full replay on startup.
let stream = hydrant.subscribe(Some(0));
let index = Arc::new(StatusIndex::new());

tokio::select! {
    // this finally starts hydrant, so it will start crawling and backfilling etc.
    r = hydrant.run()? => r,
    _ = run_ticker(index.clone()) => Ok(()),
    _ = handle_stream(index.clone(), hydrant.repos.clone(), stream) => Ok(()),
}

this starts hydrant and configures it, acquires an event stream and then we can start processing things! in that same example you can see the code calling repos.info(did) to get information about a repo and extract the handle for example. but you can do much more with it obviously (that's a todo for me to write a more comprehensive example ^^;)!
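for a feel of what the in-memory index on the receiving end of that stream could look like, here's a small sketch. to be clear, this is not the `StatusIndex` from the statusphere example and `RecordEvent` is not hydrant's event type; both are simplified, hypothetical versions that only keep the most recent status per DID. it works with a full replay because events are applied in order, so later ones overwrite earlier ones.

```rust
use std::collections::HashMap;

// hypothetical event shape; hydrant's real events carry more than this
enum RecordEvent {
    Upsert { did: String, rkey: String, status: String },
    Delete { did: String, rkey: String },
}

#[derive(Default)]
struct StatusIndex {
    // did -> (rkey of the record that set the status, status text)
    latest: HashMap<String, (String, String)>,
}

impl StatusIndex {
    // apply one event; replaying events in order rebuilds the index
    fn apply(&mut self, ev: RecordEvent) {
        match ev {
            RecordEvent::Upsert { did, rkey, status } => {
                self.latest.insert(did, (rkey, status));
            }
            RecordEvent::Delete { did, rkey } => {
                // only clear the entry if the deleted record is the one shown
                let is_current = self
                    .latest
                    .get(&did)
                    .map(|(r, _)| r == &rkey)
                    .unwrap_or(false);
                if is_current {
                    self.latest.remove(&did);
                }
            }
        }
    }

    fn status_of(&self, did: &str) -> Option<&str> {
        self.latest.get(did).map(|(_, s)| s.as_str())
    }
}
```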
in my opinion tap should also support being used like this, that was something major i thought it lacked so i added it here. but it makes more sense for hydrant to be used like this at least, you don't necessarily even have to use your own database etc. as hydrant's API might just be enough if you mostly only need to fetch records or backlinks and so on!
benchmarking!
you might be wondering about the space efficiency and resource usage of hydrant, so i ran a couple of benchmarks a few times (full hour runs, and 30 minute runs) with hydrant and tap (with sqlite), both in full network mode. these benchmarks were run on a server with a ryzen 9600x, 16GB of memory + 8GB swap:
tap ran with resync concurrency set to 10, outbox workers set to 4 and with no acks. this achieves around 22k records/sec. the p50 memory usage is around 3GB with 0.07GB swap and p99 is around 5GB with 0.47GB swap. CPU usage was around 12% for p50 and 13% for p99.
tap ran with resync concurrency set to 20, outbox workers set to 4 and with no acks. this achieves around 34k records/sec. the p50 memory usage is around 6.8GB with 1.3GB swap and p99 is around 11.6GB with 2.5GB swap. CPU usage was around 23% for p50 and 25.6% for p99.
hydrant ran with its default full network settings (which have a resync concurrency of 64, i'll get to why these aren't the same later). this achieves around 60k records/sec. the p50 memory usage is around 4.9GB and p99 is around 5.1GB. CPU usage was around 13.8% for p50 and 17.3% for p99.
so as you can see, hydrant is quite a bit more performant than tap, achieving almost three times the throughput of the run with 10 concurrency (i'm using that one as the comparison because the memory / CPU usage is similar). hydrant is actually network bound here: it consistently maxed out the uplink (around 96Mbit/s, as this server has a 100Mbit/s uplink).
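if you want to check the ratios yourself, the arithmetic is small enough to write down. the numbers below are just the p50 figures quoted above; the `Run` struct and helper functions are only there for this illustration.

```rust
// p50 figures copied from the benchmark runs above
#[derive(Clone, Copy)]
struct Run {
    records_per_sec: f64,
    cpu_p50_pct: f64,
    mem_p50_gb: f64,
}

const TAP_10: Run = Run { records_per_sec: 22_000.0, cpu_p50_pct: 12.0, mem_p50_gb: 3.0 };
const TAP_20: Run = Run { records_per_sec: 34_000.0, cpu_p50_pct: 23.0, mem_p50_gb: 6.8 };
const HYDRANT: Run = Run { records_per_sec: 60_000.0, cpu_p50_pct: 13.8, mem_p50_gb: 4.9 };

// how many times more records per second `a` handles than `b`
fn speedup(a: Run, b: Run) -> f64 {
    a.records_per_sec / b.records_per_sec
}

// records per second for each percent of CPU used at p50
fn records_per_cpu_pct(r: Run) -> f64 {
    r.records_per_sec / r.cpu_p50_pct
}

// records per second for each GB of memory used at p50
fn records_per_mem_gb(r: Run) -> f64 {
    r.records_per_sec / r.mem_p50_gb
}
```

`speedup(HYDRANT, TAP_10)` comes out to about 2.7 and `speedup(HYDRANT, TAP_20)` to about 1.8, which is where "almost three times" comes from.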
as for why tap doesn't use 64 concurrency: you can see that increasing to 20 concurrency basically doubles the resource usage. with 64 it OOMs 7-10 minutes in every time, using up all the memory and swap (13-14GB memory and 5-6GB swap). i'm not sure why that happens, but the throughput is still similar to the 10 concurrency run (the same also happens with postgres, so it's something with tap itself).
space efficiency wise, hydrant uses zstd compression by default (mostly level 3, with level 5 being used for deeper LSM tree levels). for a 30 minute run, hydrant uses ~24GB (~32GB before major compaction), 15GB of that being the blocks, which hydrant stores as CBOR. the events and records keyspaces (which mainly consist of DID, collection, record key, CID) use ~4.8GB and ~4.2GB respectively. using dictionaries results in an 8% saving for the blocks and a 4% saving for events (the training of these can be done through the API). i want to experiment with this more, but it already helps!
i'm not comparing the space efficiency to tap because it's not exactly apples to apples: tap doesn't store blocks or events (other than the outbox). but with the 10 concurrency run, tap accumulates about 11GB in its db, of which 8.2GB belongs to the repo_records table (5.2GB row data + 3GB primary key index) and the rest is the repos table with repo state info. with zstd:9 compression on the filesystem (btrfs) though, the records table comes out to 2.1GB, which at least means it can achieve a somewhat close "bytes per record" (being 1.4 times bigger than hydrant's entries). the difference here is purely encoding, since tap stores everything as text, while hydrant stores them as binary (including PLC DIDs and TID rkeys).
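to show why binary encoding helps, here's a sketch of how a TID rkey and a PLC DID can be packed into bytes. this follows the public atproto formats (a TID is 13 characters of base32-sortable encoding a 64-bit integer, and the part after `did:plc:` is 24 characters of lowercase base32), but it is an assumption about the general approach, not hydrant's actual on-disk layout; the function names are made up for this example.

```rust
// base32-sortable alphabet used by TIDs
const S32: &[u8; 32] = b"234567abcdefghijklmnopqrstuvwxyz";
// lowercase RFC 4648 base32 alphabet used by did:plc identifiers
const B32: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

// 13 text bytes -> one u64 (8 bytes)
fn tid_to_u64(tid: &str) -> Option<u64> {
    if tid.len() != 13 {
        return None;
    }
    let mut acc: u64 = 0;
    for (i, c) in tid.bytes().enumerate() {
        let v = S32.iter().position(|&b| b == c)? as u64;
        // 13 * 5 = 65 bits, so the first character may only carry 4 bits
        if i == 0 && v >= 16 {
            return None;
        }
        acc = (acc << 5) | v;
    }
    Some(acc)
}

// inverse of tid_to_u64
fn u64_to_tid(mut v: u64) -> String {
    let mut buf = [0u8; 13];
    for slot in buf.iter_mut().rev() {
        *slot = S32[(v & 31) as usize];
        v >>= 5;
    }
    buf.iter().map(|&b| b as char).collect()
}

// 32 text bytes ("did:plc:" + 24 chars) -> 15 bytes (24 * 5 = 120 bits)
fn plc_to_bytes(did: &str) -> Option<[u8; 15]> {
    let id = did.strip_prefix("did:plc:")?;
    if id.len() != 24 {
        return None;
    }
    let mut out = [0u8; 15];
    let mut acc: u32 = 0; // bit buffer
    let mut bits: u32 = 0; // number of valid bits in the buffer
    let mut i = 0;
    for c in id.bytes() {
        let v = B32.iter().position(|&b| b == c)? as u32;
        acc = (acc << 5) | v;
        bits += 5;
        if bits >= 8 {
            // emit the top 8 bits and keep the remainder in the buffer
            bits -= 8;
            out[i] = (acc >> bits) as u8;
            i += 1;
            acc &= (1 << bits) - 1;
        }
    }
    Some(out)
}
```

so a key made of a PLC DID and a TID rkey goes from 45 bytes of text to 23 bytes of binary before any compression, and that adds up quickly over millions of entries.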
there is more to explore here
that's all for now! you can check it out yourself at the tangled repo. but i do have bigger ideas for hydrant:
a plugin system for adding additional indexing logic. this way you could just write logic for hydrant to use and not have to worry about a database if you need more indexes. (though for now, we could also expose the underlying fjall database, but maybe make it somewhat maintainable through an abstracted API).
relay mode. this is kind of complicated due to the way events work in hydrant, but it would be a good addition i think, especially considering we already support listening to PDSes.
in the same vein, jetstream API. this would be especially nice considering many apps use jetstream already anyway.
...and just more XRPCs being implemented, polishing, etc.
thank you for reading! ^^