Hello,
I am currently developing a PIR server using the pir-server-example repository.
We are anticipating a total of 10 million URLs for our dataset. In this context, what would be the optimal shard size (number of records per shard) to balance computational latency and communication overhead?
Any advice or best practices for handling a dataset of this scale would be greatly appreciated.
Thank you.
Any advice or best practices for handling a dataset of this scale would be greatly appreciated.
So, first and foremost, please file a bug asking us to provide clearer sharding guidance and then post the bug number back here. I've provided some very basic guidance below, but ultimately, this is something that we need to provide more complete documentation about.
We are anticipating a total of 10 million URLs for our dataset. In this context, what would be the optimal shard size (number of records per shard) to balance computational latency and communication overhead?
So, the primary factors to be aware of here are:
- The server receives the shard number of every request, so the level of privacy protection is directly tied to the shard count. So, fewer shards provide better privacy.
- For cryptographic reasons I can't easily explain, the shard count should be a power of two.
- Performance is roughly tied to shard size, so more (smaller) shards mean better performance.
So, the basic answer here is that you should use the smallest shard count that is a power of two and still provides good performance.
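As a rough illustration of that rule, here is a minimal Python sketch that picks the smallest power-of-two shard count for a given dataset size. It is not part of pir-server-example or any PIR library. The function name and the `max_records_per_shard` threshold are hypothetical, and the 100,000 figure in the usage lines is a placeholder, not a recommendation. The real threshold has to come from measuring latency on your own hardware and data.

```python
def smallest_power_of_two_shard_count(total_records: int,
                                      max_records_per_shard: int) -> int:
    """Return the smallest power-of-two shard count such that no shard
    holds more than max_records_per_shard records.

    max_records_per_shard is a hypothetical, measured threshold: the
    largest shard size that still meets your latency target.
    """
    if total_records <= 0 or max_records_per_shard <= 0:
        raise ValueError("both arguments must be positive")

    shard_count = 1
    # Ceiling division: records in the largest shard if the dataset is
    # split evenly across shard_count shards.
    while -(-total_records // shard_count) > max_records_per_shard:
        shard_count *= 2  # stay on powers of two
    return shard_count


# Example: 10 million URLs with a placeholder limit of 100,000 records per shard.
count = smallest_power_of_two_shard_count(10_000_000, 100_000)
records_per_shard = -(-10_000_000 // count)
print(count, records_per_shard)
```

Because fewer shards mean better privacy, the loop starts at one shard and only doubles while the shards are still too large, so it stops at the first power of two that meets the threshold.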
Finally, keep in mind that it's critical that all shards use the same configuration, as described here.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware