
chore(bench): improve heuristic to run throughput benchmarks #1868

Open · soonum wants to merge 1 commit into main from dt/bench/throughput_heuristic

Conversation

@soonum (Contributor) commented Dec 13, 2024

This changes the way we define the number of elements to load in the throughput pipeline.
Light operations need more elements to saturate a backend; conversely, a heavy operation requires fewer elements to run in a reasonable time.
This improvement dramatically decreases the total benchmark duration.

This has been tested with the following operations:

  • bitand (light operation)
  • add (mid operation)
  • div_rem (heavy operation)

Saturation has been tested and confirmed on the following backends:

  • CPU
  • GPU


@cla-bot cla-bot bot added the cla-signed label Dec 13, 2024
@soonum soonum self-assigned this Dec 13, 2024
@soonum (Contributor, Author) commented Dec 13, 2024

I need a review of the design before exporting it to all other benchmark functions handling throughput variants.

@IceTDrinker (Member) commented:

What's the idea of the heuristic?

Load depending on how many PBSes are in an operation?

@soonum (Contributor, Author) commented Dec 13, 2024

> What's the idea of the heuristic?
>
> Load depending on how many PBSes are in an operation?

Yes, this is the idea. The load is computed based on the number of threads available divided by the number of PBSes needed for one operation.
This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations.
That way, the number of elements to process during the throughput benchmark is generated dynamically.
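
For illustration, a minimal sketch of the coefficient described above (the function, parameter names, and the scaling base are illustrative assumptions, not the PR's actual code):

```rust
// Sketch: derive how many elements to load for a throughput benchmark from the
// machine's thread count and the operation's PBS count (names are illustrative).
fn throughput_num_elements(num_threads: usize, pbs_per_op: usize, base_elements: usize) -> usize {
    // Light ops (few PBSes) on big machines give a coefficient > 1.0,
    // heavy ops (many PBSes, e.g. division) give a coefficient < 1.0.
    let operation_loading = num_threads as f64 / pbs_per_op as f64;
    // Scale the element count used by the previous implementation.
    ((base_elements as f64) * operation_loading).ceil() as usize
}
```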

@IceTDrinker (Member) commented:

> Yes, this is the idea. The load is computed based on the number of threads available divided by the number of PBSes needed for one operation. This value is then used as a coefficient within the previous implementation. Typically the coefficient is greater than 1.0 for quick operations, if the machine is big enough, and below 1.0 for slow operations. That way, the number of elements to process during the throughput benchmark is generated dynamically.

Did you check that the measured throughput with the new loading factor is similar to the old one?

@soonum (Contributor, Author) commented Dec 13, 2024

Yes I've just checked and they are the same 🎉

@soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 1fb02c8 to d3593b0 on December 20, 2024 at 09:48
@soonum (Contributor, Author) commented Dec 20, 2024

get_pbs_count() does not increment the atomic counter on GPU.
Thus I cannot test this new implementation on GPU for now.
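
For context, a minimal sketch of how the per-operation PBS count can be measured on CPU, assuming the pbs-stats feature also exposes a reset_pbs_count() counterpart to the get_pbs_count() mentioned above (a sketch only, not the PR's actual helper):

```rust
// Sketch: run an operation once and read how many PBSes it triggered.
// Assumes tfhe's pbs-stats feature provides reset_pbs_count()/get_pbs_count().
#[cfg(feature = "pbs-stats")]
fn count_pbs_for<F: FnOnce()>(op: F) -> u64 {
    tfhe::reset_pbs_count(); // zero the global PBS counter
    op();                    // execute the operation under measurement
    tfhe::get_pbs_count()    // read the number of PBSes it performed
}
```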

@IceTDrinker (Member) commented:

> get_pbs_count() does not increment the atomic counter on GPU. Thus I cannot test this new implementation on GPU for now.

use the CPU stats, they should be similar I would think

@soonum force-pushed the dt/bench/throughput_heuristic branch 8 times, most recently from 603cd8b to dc59088 on January 6, 2025 at 14:52
@soonum soonum marked this pull request as ready for review January 6, 2025 15:19
@soonum (Contributor, Author) commented Jan 6, 2025

The PBS count for GPU is fixed. Ready for review now.

@agnesLeroy (Contributor) left a comment
Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

Review threads (all resolved or outdated):
  • tfhe/benches/integer/bench.rs (4)
  • tfhe/benches/integer/signed_bench.rs (6)
@soonum (Contributor, Author) commented Jan 7, 2025

> Hey! Thanks a lot @soonum! I only have some minor fixes 🙂 do you know how long the throughput benches take now?

Yes, for the Cuda benchmarks we're now down to 46 minutes for de-duplicated operations on 64 bits.
I'll launch a full-precision run on the default ops today.

@soonum force-pushed the dt/bench/throughput_heuristic branch 4 times, most recently from 4c0162f to edb6501 on January 9, 2025 at 10:30
@soonum soonum requested a review from agnesLeroy January 9, 2025 11:11
@IceTDrinker (Member) left a comment

I saw I was still supposed to review something here?

Btw @soonum, I just thought about something: enabling the pbs stats can have an impact on CPU performance, since we update a single counter from many threads. So I guess there is a need (only for CPU) to do a first pass measuring PBSes for all ops and precisions, then relaunch the throughput benchmarks loading those data from a file, to avoid the stats having an adverse effect on the measurements.

We must not launch the latency benchmarks with the pbs-stats feature enabled.

@soonum (Contributor, Author) commented Jan 17, 2025

Yes, you're 100% right. I was bothered about using the pbs-stats feature in the benchmarks.
So, how do you see this? Performing a measurement dynamically, recording a temporary file on disk, and then triggering the actual benchmarks in another workflow step?

@IceTDrinker (Member) commented:

> Yes, you're 100% right. I was bothered about using the pbs-stats feature in the benchmarks. So, how do you see this? Performing a measurement dynamically, recording a temporary file on disk, and then triggering the actual benchmarks in another workflow step?

Not a different workflow step; it can be the same step, reading the file written by the previous command.

something like:

    run: |
      make measure_pbs_stats
      make bench_integer

@soonum force-pushed the dt/bench/throughput_heuristic branch from edb6501 to 155b339 on January 20, 2025 at 09:58
@soonum (Contributor, Author) commented Jan 20, 2025

I just ran some benchmarks. It turns out the pbs-stats feature has no significant impact on performance, for either latency or throughput.
I guess we can leave the implementation as is.

This is done to load the backend with enough elements to saturate it,
while avoiding long execution times for heavy operations like
multiplication or division.
@soonum force-pushed the dt/bench/throughput_heuristic branch from 155b339 to a3bc1a9 on January 20, 2025 at 14:21
@soonum soonum requested a review from IceTDrinker January 20, 2025 14:53
@IceTDrinker (Member) left a comment

A question; I'm trying Reviewable at the same time, so some things may not work exactly as I expect.

Reviewed 4 of 7 files at r1, all commit messages.
Reviewable status: 4 of 7 files reviewed, 11 unresolved discussions (waiting on @agnesLeroy and @soonum)


tfhe/benches/utilities.rs line 401 at r1 (raw file):

        // Some operations with a high count of PBS (e.g. division) would yield an operation
        // loading value so low that the number of elements in the end wouldn't be meaningful.
        let minimum_loading = if num_block < 64 { 0.2 } else { 0.1 };

This is a heuristic, right? Could it explain the lower throughput at large precisions?
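
For illustration, a hedged sketch of how the floor shown above might combine with the dynamic coefficient (the surrounding names and structure are assumptions, not the PR's actual code):

```rust
// Sketch: clamp the operation loading coefficient so heavy operations (very high
// PBS counts) still benchmark a meaningful number of elements.
fn clamped_operation_loading(num_threads: u64, pbs_count: u64, num_block: usize) -> f64 {
    // Raw coefficient: > 1.0 for light ops on big machines, close to 0 for heavy ops.
    let raw_loading = num_threads as f64 / pbs_count as f64;
    // Floor from the excerpt above: a larger floor for small precisions (fewer blocks).
    let minimum_loading = if num_block < 64 { 0.2 } else { 0.1 };
    raw_loading.max(minimum_loading)
}
```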
