
chore(bench): improve heuristic to run throughput benchmarks #1868

Open · soonum wants to merge 1 commit into main from dt/bench/throughput_heuristic

Conversation

@soonum (Contributor) commented Dec 13, 2024

This changes the way we define the number of elements to load in the throughput pipeline.
Light operations need more elements to saturate a backend; conversely, a heavy operation requires fewer elements to run in a reasonable time.
This improvement dramatically decreases the total benchmark duration.

This has been tested with the following operations:

  • bitand (light operation)
  • add (mid operation)
  • div_rem (heavy operation)

Saturation has been tested and confirmed on the following backends:

  • CPU
  • GPU


@cla-bot cla-bot bot added the cla-signed label Dec 13, 2024
@soonum soonum self-assigned this Dec 13, 2024
@soonum (Contributor, Author) commented Dec 13, 2024

I need a review of the design before exporting it to all other benchmark functions handling throughput variants.

@IceTDrinker (Member) commented:

What's the idea of the heuristic?

Load depending on how many PBSes are in an operation?

@soonum (Contributor, Author) commented Dec 13, 2024

> What's the idea of the heuristic?
>
> Load depending on how many PBSes are in an operation?

Yes, this is the idea. The load is computed based on the number of threads available divided by the number of PBSes needed for one operation.
This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations.
That way, the number of elements to process during the throughput benchmark is generated dynamically.
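
For illustration, a minimal sketch of the coefficient described above (the function, parameter names, and the scaling base are illustrative assumptions, not the PR's actual code):

```rust
// Sketch: derive how many elements to load for a throughput benchmark from the
// machine's thread count and the operation's PBS count (names are illustrative).
fn throughput_num_elements(num_threads: usize, pbs_per_op: usize, base_elements: usize) -> usize {
    // Light ops (few PBSes) on big machines give a coefficient > 1.0,
    // heavy ops (many PBSes, e.g. division) give a coefficient < 1.0.
    let operation_loading = num_threads as f64 / pbs_per_op as f64;
    // Scale the element count used by the previous implementation.
    ((base_elements as f64) * operation_loading).ceil() as usize
}
```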

@IceTDrinker (Member) commented:

> Yes, this is the idea. The load is computed based on the number of threads available divided by the number of PBSes needed for one operation. This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations. That way, the number of elements to process during the throughput benchmark is generated dynamically.

Did you check that the measured throughput with the new loading factor is similar to the old one?

@soonum (Contributor, Author) commented Dec 13, 2024

Yes I've just checked and they are the same 🎉

@soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 1fb02c8 to d3593b0 on December 20, 2024 at 09:48
@soonum (Contributor, Author) commented Dec 20, 2024

get_pbs_count() does not increment the atomic counter on GPU.
Thus I cannot test this new implementation on GPU for now.
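
For context, a minimal sketch of how the per-operation PBS count can be measured on CPU, assuming the pbs-stats feature also exposes a reset_pbs_count() counterpart to the get_pbs_count() mentioned above (a sketch only, not the PR's actual helper):

```rust
// Sketch: run an operation once and read how many PBSes it triggered.
// Assumes tfhe's pbs-stats feature provides reset_pbs_count()/get_pbs_count().
#[cfg(feature = "pbs-stats")]
fn count_pbs_for<F: FnOnce()>(op: F) -> u64 {
    tfhe::reset_pbs_count(); // zero the global PBS counter
    op();                    // execute the operation under measurement
    tfhe::get_pbs_count()    // read the number of PBSes it performed
}
```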

@IceTDrinker (Member) commented:

> get_pbs_count() does not increment the atomic counter on GPU. Thus I cannot test this new implementation on GPU for now.

use the CPU stats, they should be similar I would think

@soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 603cd8b to dc59088 on January 6, 2025 at 14:52
@soonum soonum marked this pull request as ready for review January 6, 2025 15:19
@soonum (Contributor, Author) commented Jan 6, 2025

The PBS count for GPU is fixed. Ready for review now.

@agnesLeroy (Contributor) left a comment
Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

Review threads (all resolved or outdated):
  • tfhe/benches/integer/bench.rs (4)
  • tfhe/benches/integer/signed_bench.rs (6)
@soonum (Contributor, Author) commented Jan 7, 2025

> Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

Yes, for the Cuda benchmarks we're now down to 46 minutes for de-duplicated operations on 64 bits.
I'll launch a full-precision run on the default ops today.

@soonum force-pushed the dt/bench/throughput_heuristic branch 4 times, most recently from 4c0162f to edb6501 on January 9, 2025 at 10:30
@soonum soonum requested a review from agnesLeroy January 9, 2025 11:11
@IceTDrinker (Member) left a comment

I saw I was still supposed to review something here?

Btw @soonum, I just thought about something: enabling the pbs stats can have an impact on CPU performance, since we update a single counter from many threads. So I guess there is a need (only for CPU) to do a first pass measuring PBSes for all ops and precisions, then relaunch the throughput benchmarks loading those data from a file, to avoid the stats having an adverse effect on the measurements.

We must not launch the latency benchmarks with the pbs-stats feature enabled.

@soonum (Contributor, Author) commented Jan 17, 2025

Yes, you're 100% right. I was bothered about using the pbs-stats feature in the benchmarks.
So, how do you see this? Performing a measurement dynamically, recording a temporary file on disk, and then triggering the actual benchmarks in another workflow step?

@IceTDrinker (Member) commented:

> Yes, you're 100% right. I was bothered about using the pbs-stats feature in the benchmarks. So, how do you see this? Performing a measurement dynamically, recording a temporary file on disk, and then triggering the actual benchmarks in another workflow step?

Not a different workflow step; it can be the same step, reading the file written by the previous command.

something like:

    run: |
      make measure_pbs_stats
      make bench_integer

@soonum force-pushed the dt/bench/throughput_heuristic branch from edb6501 to 155b339 on January 20, 2025 at 09:58
@soonum (Contributor, Author) commented Jan 20, 2025

I just ran some benchmarks. It turns out the pbs-stats feature has no significant impact on performance, for either latency or throughput.
I guess we can leave the implementation as is.

This is done to load the backend with enough elements to saturate it,
while avoiding long execution times for heavy operations like
multiplication or division.
@soonum force-pushed the dt/bench/throughput_heuristic branch from 155b339 to a3bc1a9 on January 20, 2025 at 14:21
@soonum soonum requested a review from IceTDrinker January 20, 2025 14:53
@IceTDrinker (Member) left a comment

A question; I'm trying Reviewable at the same time, so some things may not work exactly as I expect.

Reviewed 4 of 7 files at r1, all commit messages.
Reviewable status: 4 of 7 files reviewed, 11 unresolved discussions (waiting on @agnesLeroy and @soonum)


tfhe/benches/utilities.rs line 401 at r1 (raw file):

        // Some operations with a high count of PBS (e.g. division) would yield an operation
        // loading value so low that the number of elements in the end wouldn't be meaningful.
        let minimum_loading = if num_block < 64 { 0.2 } else { 0.1 };

This is a heuristic, right? Could it explain the lower throughput at large precisions?
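
For illustration, a hedged sketch of how the floor shown above might combine with the dynamic coefficient (the surrounding names and structure are assumptions, not the PR's actual code):

```rust
// Sketch: clamp the operation loading coefficient so heavy operations (very high
// PBS counts) still benchmark a meaningful number of elements.
fn clamped_operation_loading(num_threads: u64, pbs_count: u64, num_block: usize) -> f64 {
    // Raw coefficient: > 1.0 for light ops on big machines, close to 0 for heavy ops.
    let raw_loading = num_threads as f64 / pbs_count as f64;
    // Floor from the excerpt above: a larger floor for small precisions (fewer blocks).
    let minimum_loading = if num_block < 64 { 0.2 } else { 0.1 };
    raw_loading.max(minimum_loading)
}
```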
