Measuring the effectiveness of a scheduler

I’m currently writing a new scheduler, credit2, which I mean to be the general-purpose scheduler for the next generation of Xen, lasting at least several years before needing a significant overhaul.

But there are some problems with writing a scheduler:

  • They’re used in every workload that uses a CPU — in other words,
    every workload.
  • Schedulers really only matter when there’s competition. Simply
    running a single workload in isolation won’t tell you much about the
    effectiveness of the algorithm.
  • Schedulers may work perfectly well for one combination of competing
    workloads, but poorly for others.
  • Small changes can have unpredictable effects.

I’ve learned a lot so far about algorithms: what didn’t work for
credit1, and what I think will work for credit2. But one of the most
important things I’ve learned is not to rely on my intuition, but to
make sure to do actual testing.

So what I’m working on now is an automated test framework, which will
measure the effectiveness of the scheduler. This is actually a bit
trickier than one might expect. You can’t simply run a workload by
itself and measure its performance; for that use case, you don’t
actually need a scheduler. You need to run each workload in
competition with an array of other workloads, and at various levels of
CPU “pressure”. The tested workload will get less CPU time, and the
performance will degrade. So we need to have an idea how much
performance degradation is expected and acceptable.

So what do we want from a scheduler? I’ve identified three high-level
goals:

  • Scheduler overhead should be low; about as low as the old scheduler.
  • The amount of actual CPU time a VM receives should be close to its
    “fair share”.
  • The impact on workload performance should be close to its “fair
    share”.

“Fair share” is defined on a per-scheduler basis. For the credit1
scheduler, the ideal was that CPU time was divided according to
weight; if some VMs were using less than their “fair share”, that time
was divided by weight among other VMs that wished to use it. I won’t
go into calculating exact fair share here, but it’s essentially doing
what the interface says it will do.
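As a rough illustration of that credit1-style division, the sketch below redistributes unused time by weight until every VM is either satisfied or capped. This is my own simplified model of the behavior described above, not the actual Xen accounting code; VM names, demands, and the water-filling loop are all assumptions.

```python
def fair_shares(weights, demands, capacity):
    """Divide CPU capacity among VMs in proportion to weight;
    time a VM leaves unused is redistributed, again by weight,
    among the VMs that still want more."""
    shares = {vm: 0.0 for vm in weights}
    active = set(weights)          # VMs that still want more CPU
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[vm] for vm in active)
        still_hungry = set()
        spent = 0.0
        for vm in active:
            slice_ = remaining * weights[vm] / total_w
            want = demands[vm] - shares[vm]
            if want <= slice_:
                # VM is satisfied; its leftover slice is
                # redistributed on the next pass.
                shares[vm] += want
                spent += want
            else:
                shares[vm] += slice_
                spent += slice_
                still_hungry.add(vm)
        active = still_hungry
        remaining -= spent
    return shares
```

For example, with two equal-weight VMs on one CPU where one VM only wants 20% of a CPU, the other ends up with the remaining 80% rather than being held to 50%.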

So how do we define this precisely? First, we need to define several
variables:

  • p_o: Performance of the workload running on the old scheduler on an
    otherwise empty system.
  • p_b: “Baseline” performance of the workload on the new scheduler (on
    an otherwise empty system)
  • t_b: The “baseline” amount of CPU time used by the target VM (on an
    otherwise empty system)
  • f: “Fair share” for the target VM, given the load of the system
  • p_f: The performance of the workload, given fair share f
  • t_f: The amount of CPU consumed given fair share f

Using these variables, and introducing some specific targets, we can
quantify the above goals:

  • Overhead within 1% of the old scheduler:
    • p_b >= 0.99 * p_o
  • CPU consumed within 10% of “fair”:
    • t_f >= 0.9 * f
  • Performance fairness
    • p_f >= p_b * (f / t_b) * F
        Where F is:

      • Ideal: 1
      • Good enough: 0.9 (within 10% of ideal)
      • Cut-off: 0.7 (within 30% of ideal)
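
The three criteria above translate directly into a small pass/fail check. This is just the inequalities from the list restated as code; the function name and the sample numbers in the usage note are my own illustration.

```python
def evaluate(p_o, p_b, t_b, f, p_f, t_f, F=0.9):
    """Check the three scheduler criteria.

    p_o: performance on the old scheduler, empty system
    p_b: baseline performance on the new scheduler, empty system
    t_b: baseline CPU time used by the target VM
    f:   fair share of CPU for the target VM under this load
    p_f: measured performance under this load
    t_f: measured CPU time consumed under this load
    F:   fairness factor (1 ideal, 0.9 good enough, 0.7 cut-off)
    """
    return {
        # Overhead within 1% of the old scheduler
        "overhead_ok": p_b >= 0.99 * p_o,
        # CPU consumed within 10% of fair share
        "cpu_ok": t_f >= 0.9 * f,
        # Performance degrades no worse than fair share, within F
        "perf_ok": p_f >= p_b * (f / t_b) * F,
    }
```

For instance, a VM that scored 100 on the old scheduler and 99.5 at baseline, then received 0.46 of a CPU against a fair share of 0.5 and scored 47, passes all three checks at F = 0.9.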

These constraints are a starting point; I’m sure that at some point
they will run up against reality and need to be adjusted. But the
point is to have specific, measurable objectives, so that we know when
we have work to do, and when we can say, “This is good enough.”
