What is the optimal number of records per shard?

Hello,

I am currently developing a PIR server using the pir-server-example repository.

We are anticipating a total of 10 million URLs for our dataset. In this context, what would be the optimal shard size (number of records per shard) to balance computational latency and communication overhead?

Any advice or best practices for handling a dataset of this scale would be greatly appreciated.

Thank you.

Answered by DTS Engineer in 884834022

> Any advice or best practices for handling a dataset of this scale would be greatly appreciated.

First and foremost, please file a bug asking us to provide clearer sharding guidance, then post the bug number back here. I've provided some very basic guidance below, but ultimately this is something we need to document more completely.

> We are anticipating a total of 10 million URLs for our dataset. In this context, what would be the optimal shard size (number of records per shard) to balance computational latency and communication overhead?

So, the primary factors to be aware of here are:

  1. The server receives the shard number of every request, so the level of privacy protection is directly tied to the shard count: fewer shards provide better privacy.

  2. For cryptographic reasons I can't easily explain, the shard count should be a power of two.

  3. Performance is roughly tied to shard size, so more (smaller) shards mean better performance.

So, the basic answer here is that you should use the smallest shard count that is a power of two and still provides acceptable performance.
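To make the trade-off concrete, here is a minimal Python sketch that lists power-of-two shard counts for a 10 million record dataset alongside the resulting average shard size. The function name `shard_options`, the 256-shard cap, and the assumption that records spread evenly across shards are illustrative only; none of them come from pir-server-example or its tooling. The "bits" column is simply log2 of the shard count, which is the size of the shard number the server sees with each request.

```python
def shard_options(total_records, max_shards=256):
    """Return (shard_count, avg_records_per_shard, shard_index_bits) tuples
    for every power-of-two shard count up to max_shards.

    Assumes records are spread evenly across shards (hypothetical; real
    shard sizes depend on how the dataset is actually partitioned).
    """
    options = []
    count = 1
    while count <= max_shards:
        # Ceiling division: average records per shard, rounded up.
        per_shard = -(-total_records // count)
        # A power-of-two count needs log2(count) bits to name one shard.
        index_bits = count.bit_length() - 1
        options.append((count, per_shard, index_bits))
        count *= 2
    return options


for count, per_shard, bits in shard_options(10_000_000):
    print(f"{count:>4} shards  ~{per_shard:>10,} records/shard  "
          f"shard number is {bits} bits")
```

Reading the output from the top, you would benchmark each row in turn and stop at the first shard count whose latency is acceptable, since every further doubling shrinks the shards but exposes one more bit of shard number per request.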

Finally, keep in mind that it's critical that all shards use the same configuration, as described here.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks for taking the time to share your question here. Unfortunately, it hasn't received an answer yet. Here are a few suggestions that might help it attract more attention:

  • Provide more details: Expanding on your post to include any error messages, code snippets, steps you've already taken to troubleshoot, and the expected/actual outcomes would be very helpful.
  • Be specific about your technology stack: Clearly state the programming languages, frameworks, or tools you are using.
  • Check for duplicates: Before posting, make sure your question hasn't been asked before. You can use the search bar to find similar threads.

I'm sure someone in the community will be able to help once you have a chance to update your post.

Albert
  Worldwide Developer Relations.
