Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monotonic check of procfs fails in WSL2 #1092

Open
davidkopp opened this issue Mar 11, 2025 · 7 comments
Open

Monotonic check of procfs fails in WSL2 #1092

davidkopp opened this issue Mar 11, 2025 · 7 comments
Assignees

Comments

@davidkopp
Copy link
Contributor

The check, if the metrics of procfs are monotonic increasing fails in my environment every time:

Error: Base exception occured in runner.py

Exception (ValueError): Time from metric provider cpu_utilization_procfs_system is not monotonic increasing

I guess I can blame WSL2 for that. For testing usage scenarios I'm using Windows 11 at the moment with WSL2. My CPU is AMD Ryzen 7 PRO 7840U.

There are 3 workarounds:

  • disable the procfs metric provider
  • using the runner.py switch --dev-no-metrics
  • commenting out the following line:
    self._check_monotonic(df) # check must be made before data frame is potentially sorted in _parse_metrics

I'm thinking about if it makes sense to add another switch to runner.py (e.g. --skip-metric-checks) that allows to disable the metric checks. Not sure if it's worth it.

@ribalba
Copy link
Member

ribalba commented Mar 11, 2025

That is interesting as this only checks if the timestamps are in the right oder. Why would the procfs metric provider not report the right time in the right order?

Could you please only run the procfs metric provider with the command that is shown when you run the GMT.

Could you also please send us the procfs metrics provider output file in /tmp/green-metrics-tool/ Is there something weird with that file that you can spot right away?

@davidkopp
Copy link
Contributor Author

Running the command

stdbuf -o0 taskset -c 0 /home/david/green-metrics-tool/metric_providers/cpu/utilization/procfs/system/metric-provider-binary -i 99 > /tmp/green-metrics-tool/cpu_utilization_procfs_system.log

gives me the following result (excerpt):

 1741703676076656 0
 1741703676179662 0
 1741703676279094 0
 1741703676380353 62
 1741703676480978 62
 1741703676581397 0
 1741703676681052 0
 1741703676780350 253
 1741703674264049 0
 1741703674367692 118
 1741703674472097 0
 1741703674577483 58
 1741703674683285 0
 1741703674785687 0
 1741703674890337 0
 1741703674995398 0
 1741703675096040 569

You can see that there is a jump into the past from 1741703676780350 to 1741703674264049.
I know from a colleague, that he has the same issue when doing a measurement run in WSL2.

With WSL2 I don't want to (and can't) do energy measurements. But it can be nice to see some metrics when doing test runs. So for me it would be sufficient to be able to disable the monotonic check.

@ribalba
Copy link
Member

ribalba commented Mar 12, 2025

So after getting the GMT installed on WSL 🤢 I could reproduce the error. The problem is that gettimeofday is not monotonic on WSL and very slow. [0]. So we were getting wrong times. The solution is to use clock_gettime(CLOCK_MONOTONIC which should be used anyhow if we want timing stuff as the system time could always change.

@davidkopp can you please try the patch and see if that fixes your issue?
patch.patch

@ArneTR we should discuss if we want to change this everywhere [1]. As the system time could change for long runs. Very unlikely but possible. Remember the leap second discussion :) We could also just warn if the user uses a provider under windows with anything less than 500ms. That should be enough to get monotonic times even under WSL. As you can't really use the data anyhow that could be the easiest. (Could be a simple check routine)

[0] microsoft/WSL#77
[1] https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html

@davidkopp
Copy link
Contributor Author

The patch fixes the issue for CpuUtilizationProcfsSystemProvider!

The same issue happens also for the Cgroup providers, like CpuUtilizationCgroupContainerProvider and MemoryUsedCgroupContainerProvider. I can't test NetworkIoCgroupContainerProvider and DiskIoCgroupContainerProvider, because they don't work in my WSL2 environment due to other issues (not important for me right now).

@ArneTR
Copy link
Member

ArneTR commented Mar 13, 2025

Luckily it is the clock source and not the writing to the file that makes the output non monotonic 😅

Looking at the linked source under [0] it seems like it got way better in WSL2.

Nonetheless I guess it makes total sense to move to the monotonic clock. However @ribalba what I read is that gettimeofday is actually more performant than clock_monotonic

The best way would be to use the TSC, which however only works if it is invariant. I will make some checks and look further into this.

@ArneTR
Copy link
Member

ArneTR commented Mar 13, 2025

I have done some digging here.

  1. Current analysis on macOS
root@ff0f2bf8510f:/home/ubuntu/clockperf# ./clockperf
clockperf v2.1.0-2-g603ced3

== Reported Clock Frequencies ==

cntvct                 24MHz
realtime               1000Hz
realtime_crs           1000Hz
monotonic              1000Hz
monotonic_crs          1000Hz
monotonic_raw          1000Hz
boottime               1000Hz
process                1000MHz
thread                 1000MHz
clock                  1000KHz
time                   1Hz


== Clock Behavior Tests ==

Name                Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
cntvct                  4.80    2.79%    24MHz    No     0     0   994     0
gettimeofday           31.37    0.19%  1000KHz    No     0     0   999     0
realtime               29.66    0.65%    24MHz    No     0     0   608     0
realtime_crs            5.92    0.50%   1000Hz    No   998     0     1     0
                        5.84   12.14%
monotonic              30.04    1.34%    24MHz    No     0     0   612     0
monotonic_crs           5.94    0.82%   1000Hz    No   998     0     1     0
                        5.84   12.14%
monotonic_raw          29.73    0.98%    24MHz    No     0     0   593     0
boottime               29.60    0.61%    24MHz    No     0     0   609     0
process               261.60    0.19%     ----   Yes     0     0     0     0
thread                253.70    0.19%     ----   Yes     0     0     0     0
clock                 266.89    0.10%  1000KHz    No     0     0   999     0
getrusage             293.97    0.15%  1000KHz    No     0     0   999     0
ftime                  34.39    0.27%   1000Hz    No   993     0     6     0
time                    7.45    0.19%      1Hz    No  1000     0     0     0
  1. Current analysis on linux
root@6c963f5d8c46:/tmp/clockperf# ./clockperf
clockperf v2.1.0

== Reported Clock Frequencies ==

tsc                    2112MHz
realtime               1000MHz
realtime_crs           1000Hz
monotonic              1000MHz
monotonic_crs          1000Hz
monotonic_raw          1000MHz
boottime               1000MHz
process                1000MHz
thread                 1000MHz
clock                  1000KHz
time                   1Hz


== Clock Behavior Tests ==

Name                Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc                    14.05    0.45%     ----   Yes     0     0     0     0
gettimeofday           35.36    0.03%  1000KHz    No     0     0   999     0
realtime               33.93    0.37%     ----   Yes     0     0     0     0
realtime_crs           14.49    0.03%   1000Hz    No   997     0     2     0
monotonic              34.05    0.40%     ----   Yes     0     0     0     0
monotonic_crs          14.49    0.02%   1000Hz    No   997     0     2     0
monotonic_raw          34.67    0.34%     ----   Yes     0     0     0     0
boottime               33.75    0.36%     ----   Yes     0     0     0     0
process              1822.48    0.03%     ----   Yes     0     0     0     0
thread               1781.52    0.01%     ----   Yes     0     0     0     0
clock                1837.68    0.02%     ----   Yes     0     0     0     0
getrusage            1908.04    0.01%     ----   Yes     0     0     0     0
ftime                  38.90    0.03%   1000Hz    No   992     0     7     0
time                    8.45    0.05%      1Hz    No  1000     0     0     0

Intermediate Summary

  • Times on Linux do not really differ between monotonic and gettimeofday. So we can easily switch
  • On macOS the timers seem to be similar but exhibit some stalls. Should not be an issue though in our polling range of 1 ms
  • Going for TSC is still an increase, but not necessarily always available. macOS for instance had no TSC and unsure how comparable cntvct is.
    • Still: Our 5 year old CPU model already had an invariant TSC!

Perfomance considerations for applying Didis Patch

No Patch

Performance counter stats for './metric-provider-binary -i 100' (10 runs):

          1.57 msec task-clock                       #    0.002 CPUs utilized               ( +-  0.14% )
       2586977      instructions                                                            ( +-  0.06% )

    1.00134014 +- 0.00000371 seconds time elapsed  ( +-  0.00% )

Performance counter stats for './metric-provider-binary -i 10' (10 runs):

         10.16 msec task-clock                       #    0.010 CPUs utilized               ( +-  0.20% )
      17188040      instructions                                                            ( +-  0.13% )

     1.0012857 +- 0.0000116 seconds time elapsed  ( +-  0.00% )

With Patch

Performance counter stats for './metric-provider-binary -i 100' (10 runs):

          1.57 msec task-clock                       #    0.002 CPUs utilized               ( +-  0.28% )
       2588862      instructions                                                            ( +-  0.11% )

    1.00134098 +- 0.00000503 seconds time elapsed  ( +-  0.00% )

Performance counter stats for './metric-provider-binary -i 10' (10 runs):

         10.16 msec task-clock                       #    0.010 CPUs utilized               ( +-  0.14% )
      17193397      instructions                                                            ( +-  0.14% )

     1.0012761 +- 0.0000114 seconds time elapsed  ( +-  0.00% )                

Summary

I can barely see any differences between the two implementations. It would be more performant to make the re-keying in Python but in this case I argue for not bringing in the complexity and keeping the re-keying in C

@ribalba: Can you make a PR here changing this in all providers please?

@ArneTR
Copy link
Member

ArneTR commented Mar 13, 2025

Also: Please use CLOCK_MONOTONIC_RAW as this has no NTP adjustments (according to ChatGPT. Please double check with docs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants