Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework winapi's QueryPerformance* functions to match XDK #663

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

GXTX
Copy link
Contributor

@GXTX GXTX commented Oct 16, 2023

We aren't following XDK with the current implementation.

QueryPerformanceFrequency is a hardcoded value of 733MHz
In later XDK revisions they used rdtsc for QueryPerformanceCounter

I don't agree with hard coded values, but didn't want to force a ~100mS sleep on developers wishing to use the function so I locked dynamic frequency behind a define.

@GXTX GXTX force-pushed the fix/query_counters branch from 4f44c63 to 3faa575 Compare October 16, 2023 11:33
avg.QuadPart += s_rdtsc.QuadPart;

// If we call rdtsc too fast we'll end up with div by 0
Sleep(10);
Copy link
Member

@JayFoxRox JayFoxRox Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is.. insane. Hard no from me.

(This function is often used in hot-code path and multiple times per frame; slowing it down like this is crazy. I can't imagine the actual XDK did this?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct it doesn't, that's added by me. It only ever replies 733mhz. I'd like to hear ideas of how to get this dynamically. Perhaps a run once and global var?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with jfr, code with this kind of performance issue isn't suitable for this type of functionality.
The loop doesn't make sense to me either - doing it in multiple small steps just accumulates rounding and measurement errors, exacerbated by using a low-precision timing source such as the tick counter.

We should calibrate the performance counter frequency value only once, so this function won't have to do anything more than just return the content of a global static variable.
The variable can be initialized via a function marked as a constructor to avoid having to do any manual initialization.
The timing source for the calibration should be KeQueryPerformanceCounter, which should give us better precision given the ACPI timer's tick rate of 3.579545 MHz. It may only support a 24 bit counter, so the code needs to handle a potential overflow during calibration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We should prime, and thanks for the hint at constructor attribute. See newest commit for changes.

@GXTX
Copy link
Contributor Author

GXTX commented Oct 17, 2023

235dde3 produces a number that is only 0.000454501% off the static value on hardware. With a slightly longer sleep we can get another digit of accuracy but I'm happy where it is.


avg.QuadPart = 0;
Sleep(500);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

@GXTX GXTX Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Values generated soon after the system is booted seem to give less precise results, as to why; I'm not sure yet. Runs with a 700ms initial wait and 200ms loop wait give us 733332000, and calling it again even moments later bring us to 733333000.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's good enough? What kind of precision do you expect? With your multiplication by 1000 at the end, this means that there's a single tick difference (1.36425648 nanoseconds).

That's a 0.00013636369% error.. how is that not negligible?

What are you using this function for that you require such precision?

Copy link
Contributor Author

@GXTX GXTX Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you using this function for that you require such precision?

I don't need such precision. It's an attempt to placate you and other maintainers. I don't want a hard value, you do. So I've made it so it's as close as I think is reasonable (possible?). What percentage ± would you accept, how long of a wait?

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say a sleep of less than a millisecond would be good in constructors. Maybe 10ms if necessary.
There's valid cases of sleeping in the network, USB and GPU driver - even up to 100ms wouldn't be too problematic (especially async).

Not sure about the percentage. But I'd expect the integer to be exact (that's the "correct" solution after all) on a stock Xbox. If the integer is off by 1 I wouldn't care too much, although I could already see problems with different functions making assumptions about the fixed frequency (and there you'd likely use the hardcoded value). Maybe using the int -> float32 error would be a good measure.

But in this instance, there's a perfect solution which has no wait.
It outperforms your solution in every single way on a stock Xbox.

The only time it doesn't work is with CPU mods / under- or overclocking.
But that's not a case nxdk should support unless it's supportable with reasonable effort.
Hint: it likely isn't, because it becomes a problem of defining the limit of what is considered an Xbox. If I overclock the audio chip but that's used in some nxdk code to calibrate the overclocked CPU clock, then how could this ever work without telling nxdk what you did to your system?
I can see perfect reasons for overclocking either of them, but absolutely no way how to reasonably support it in nxdk.

We could support the more common case of overclocking the CPU, but not if it has major drawbacks for most users which run a stock Xbox (which is the majority).

ULARGE_INTEGER s_rdtsc;
ULONG s_ticks;
// If we call rdtsc too fast we'll end up with div by 0
Sleep(200);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why sleep at the start of the loop? To me, it looks like this should only do measure(); wait(); measure() (= a single wait).
Although I'm not sure about waiting altogether - I'd expect the CPU to know its frequency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I'm not sure about waiting altogether - I'd expect the CPU to know its frequency.

I guess I'm confused about this. Is there a cpuid flag for current frequency? There's a hardware register that sets the FSB, but this can't be guaranteed accurate for various reasons, including bad caps.

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this can't be guaranteed accurate for various reasons, including bad caps.

If the caps are so much out of tolerance that you are that far away from 733MHz that it even drifts a second per minute then I'd consider your Xbox broken.

You don't use QueryPerformanceCounter to implement a date/time system (wall-clock) but for relative measurements. If it's off by 0.1%, heck make it 1%, that shouldn't matter for applications.

Hence, we can safely assume the reference clock speed. There shouldn't be a need to read the FSB or cpuid - we know what CPU the Xbox platform has. If it doesn't, then it's not an Xbox and you should probably keep these things in local patches to nxdk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider your Xbox broken

As the Xbox continues to age this will become more and more relevant. I recently had a box which would report a 729MHz CPU frequency as the filter capacitors had blown, now, games still worked, and most wouldn't notice such an issue. Why should be not account for this? Why should we continue to make bad assumptions about the platform.

No patch I've ever introduced has attempted to move away from Xbox as a platform, to support different AGP cards, add-in PCI cards, FooBar SMBus device, I've only attempted to remove these hard coded values and addresses in an attempt to support more Xboxes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should be not account for this? Why should we continue to make bad assumptions about the platform.

Because this will likely affect the entire system.
The entire system will run faster or slower, but it doesn't matter because all components are expected to run equally different.

So even if you calibrate against another clock, the result will still be very close to 733.333MHz measured.

This is only a problem if the clock on a single component varies. However, most clocks are affected by a huge number of components, so drift on any of them can't cause the result to deviate much (likely there's an opposing force too.. and clock generators are also build to combat this).

But even if there is a problem with a single clock, then you don't know which one, you can't calibrate against another clock source because that might also have deteriorated. At best, you can average out the error of all clock sources.
But at that point, your measurement error or rounding errors due to your calculations will be more significant than the initial error.

You are also introducing a very high cost of synchronizing / measuring clocks and then continue to have a high cost due to a bunch of optimizations you can't do anymore due to lack of known clock frequencies at compile time (which also means you run more code to get the same result = the errors in clock affect you even more).

And even if you managed to get accurate measurements, you couldn't account for drift at runtime which would likely exist (in the case of electronic components due to external factors like heat).
Even by design the clock will not be a constant 733.333MHz - instead, it cycles a bit around (no idea how much or at what speed) the reference clock because otherwise it would cause some annoying humming as it would hit resonance frequencies (fail FCC tests etc.).

You can't compare a clock in a computer system to a wall-clock (unless specifically designed to be that.. which are RTCs.. which usually run at lower precision). It's not important that the frequency is perfect. But you still want it to be a very very good average if you don't know how people will use it.

Especially on a console like the Xbox, there'll be plenty of application code (out of our control) that will assume rdtsc to run with 733.333 MHz, and you can also assume that someone will compare results from QueryPerformanceCounter to rdtsc even if that's bad practice.

If you hardcode the expected frequency you get around all of these issues (and a couple of other benefits), because if one counter runs slower, they likely all will. What is important is that everyone agrees on what these clocks should be running at.

All that said, even a bad and drifting counter that's not an RTC or misconfigured is still perfectly valid for relative measurements or measurements which a human couldn't possible notice:

  • Would you notice a 0.1% speed difference when playing Halo? Probably not.
  • Would you care about a 0.1% speed difference when doing micro-optimizations on a function that's called a couple of million times per second: probably (but the error will cancel out during measurement).

I've only attempted to remove these hard coded values and addresses in an attempt to support more Xboxes.

And that's a noble goal.

The issue is that it's hard to get away from constant clock frequencies in the code for the reason mentioned above: they can't be measured (without depending on another clock.. which can also have errors).

I'm all for removing stupid checks or getting instead of setting clocks (= cases where we depend on hw setup by some XDK code instead of being able to do it ourselves).

What we should attempt is to mimic how the XDK does it.
If the XDK hardcodes a clock, then we can probably get away with it too.
If the XDK code stops working, it would imply the kernel wouldn't run and the intro animation would also fail.
The dashboard would crash and so would every game.
That wouldn't be a sign of nxdk needing code fixes, but a sign of an Xbox needing repair.

The goal should be: if XDK code works, then equivalent nxdk code should work. If the XDK hardcodes a value, then we should do that too.

Copy link
Member

@thrimbor thrimbor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 1kHz tick rate with a 200ms sleep means that even the most optimistic error calculation results in a ±0.5% approximation error. Combined with the numerically disadvantageous division that is several orders of magnitude worse precision than what we had before.

Comment on lines 21 to 25
Sleep(500);

for (int i = 0; i < AVG_SET; i++) {
ULARGE_INTEGER s_rdtsc;
ULONG s_ticks;
// If we call rdtsc too fast we'll end up with div by 0
Sleep(200);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost a full second startup delay is excessive. I'd say 100ms is the upper bound of anything that runs as a constructor, and I imagine it's ample time when removing that counter-productive loop and using a higher precision timer like what I mentioned above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ofc we should prefer not having to wait at all, which should be possible when reading the CPU clockspeed from wherever it is stored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ofc we should prefer not having to wait at all, which should be possible when reading the CPU clockspeed from wherever it is stored.

We can't guarantee this is accurate. The old method of changing the FSB on CPU replaced Xbox' were to mess with the clock generator output, not the hardware register used for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue "CPU replaced Xbox" is no longer an Xbox and out of scope for nxdk.
In theory, one could also make a "GPU replaced Xbox" etc. - one could change any clock on the system.

It's impossible for us to support all of these variations in a useful way.

nxdk is open-source, so if people want to support non-stock clocks, then they can keep around a patched nxdk to make special builts.
I guess we could mark timing related variables (in code or in binary for hex editing.. at potential performance cost) in the nxdk libs, but even then any application might make its own assumptions when running on Xbox, so you'd probably still have to bring your own hex edits or patches for closed-source apps.

There's no good solution for supporting Xbox-like platforms in upstream nxdk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible for us to support all of these variations in a useful way.

Whether or not it's possible to support all variations is moot. I'm attempting to support this specific thing, and I hate how much I have to try to justify my changes to you and others. Is there such a risk providing an accurate to now clock rate instead of a hard coded value?

In theory, one could also make a "GPU replaced Xbox" etc.

These have been done while you've been gone. Replacing the GPU from 1.6 to say 1.0, etc.

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing the GPU from 1.6 to say 1.0, etc.

It's still the same GPU. I'm talking about plugging some Voodoo 3dfx chip inside an Xbox.

Is there such a risk providing an accurate to now clock rate instead of a hard coded value?

No. But:

  • There's a huge risk in startup delay. (application breaking! fixable)
  • The result you get is, despite the huge delay, more than 100x worse. (application breaking! fixable)
  • This not being a constant makes it impossible to do constant propagation at build-time. (potentially application breaking. not fixable)
  • Slightly increased application size (not fixable)

... and all of this for a niche use case that's not obtainable with a "stock" Xbox.. regardless of how badly it's damaged.

All of this matters a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly increased application size

100 bytes? Are we on an 8086? If this were a org standard we'd have inline assembly all over the place.

The result you get is, despite the huge delay, more than 100x worse.

To quote you...

That's a 0.00013636369% error.. how is that not negligible?

This not being a constant makes it impossible to do constant propagation at build-time. (potentially application breaking. not fixable)

Application shouldn't assume about the hardware. We're providing a tool (local box variable timing) to avoid, bad assumptions.

There's a huge risk in startup delay. (application breaking! fixable)

I don't disagree. How long are you willing to Sleep on startup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 bytes? Are we on an 8086? If this were a org standard we'd have inline assembly all over the place.

I got this impression it's also relevant to you here: #662 and XboxDev/nxdk-pdclib#56

It's not just 100 bytes. The constant propagation can bite you badly. This could be the difference between evaluating a bunch of code at runtime vs doing it at build-time.

Also, 100 bytes can be relevant to me when working on an DXT. Some commercial games are very short on memory and if any of those 100 bytes reach into another page, then that is not just 100 bytes, but likely a multiple of pages (4096 bytes).

The result you get is, despite the huge delay, more than 100x worse.

Your error is at least 100x worse than what we would be getting for free by fixing the math.
It's not just about being 100x worse, but being 100x worse willingly and adding a lot of additional issues (like the startup delay) - you could make it 100x better while also fixing a bunch of other issues (or just keeping the existing implementation which has none of these issues).

Application shouldn't assume about the hardware. We're providing a tool (local box variable timing) to avoid, bad assumptions.

It's a gaming console with a fixed spec. That's one of the main benefits of gaming consoles (and other embedded devices) over PCs.
It has been constructed with these frequencies (these hardcoded values are literally part of the hardware) and they are built into the hardware. We can almost as safely assume these frequencies (or a multiple thereof across the entire system) as much as we can assume that the opcode for NOP is 0x90.

It's possible to approximate these constants by measuring. However you'll be adding measuring and rounding errors (also see my other comment about measuring an untrusted clock by using another untrusted clock).
Also because most clocks are linked, you'll just get the same factor that's wired into the hardware.
If you want to know the actual drift, you'll need to measure externally or exploit a more stable timer at lower precision to derive the drift of the high precision timer.. but that'd be a slow and complicated application itself, not some implementation of timer functionality in a platform API to be used in generic applications.

I don't disagree. How long are you willing to Sleep on startup?

See my other comment about this. But this wait is avoidable (by better measuring code / math) or by hardcoding the frequency.

BOOL QueryPerformanceFrequency (LARGE_INTEGER *lpFrequency)
{
assert(lpFrequency != NULL);

#ifdef USE_RDTSC_FOR_FREQ
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't hide correct behavior behind a compile-time flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XDK only replies with 733MHz, so that would be "correct". As stated, I don't like hard numbers, but don't want to force wasted time on developers if they don't require it.

Copy link
Member

@JayFoxRox JayFoxRox Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't hide correct behavior behind a compile-time flag.

Is it correct though? I'd even consider rejecting this with "move it to a fork":

  • Added complexity
  • We can safely assume every single official Xbox will be running at 733MHz (and even if they don't, then it's up to the person who over-/underclocked to fix applications, because none of the MS code will work anymore either)

XDK only replies with 733MHz, so that would be "correct".

Agreed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can safely assume every single official Xbox will be running at 733MHz

It's assumptions like this that stop nxdk with pbkit running from functioning on my Xbox. There are applications now due to MIT licensing that I can't readily modify the source of, which means I need to hack the xbe.

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's assumptions like this that stop nxdk with pbkit running from functioning on my Xbox

Note that this is code in the winapi - so it's probably not mission critical.

If your Xbox drifts too far from 733.333MHz to crash applications then it's probably broken - it's not a fault of nxdk.
If you changed the clocks or the CPU (which I suspect), then it's no longer an Xbox and breaks many assumptions that both, MS XDK and the nxdk have (also the kernel probably.. which you probably had to patch then to even get your Xbox to boot).

It's assumptions like this ...

... which make the Xbox a gaming console instead of a generic PC.
nxdk winapi isn't a generic winapi and it's not a PC OS / environment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not a fault of nxdk

It's entirely nxdk's fault for killing pbkit if the frequency is not default. https://github.com/XboxDev/nxdk/blob/master/lib/pbkit/pbkit.c#L2812

Again bad assumptions about the box we're running on. I've tried multiple times to introduce similar things (NV base address being one of them) and at every turn I seem to hit this review bomb or ignore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's entirely nxdk's fault for killing pbkit if the frequency is not default

Yes: That particular check is bad and inprecise - it can easily cause issues and is totally avoidable.
And yes.. and that should be addressed.. but that's not what this PR is doing.

I've tried multiple times to introduce similar things (NV base address being one of them)

But modifying the NV base address is also not really reasonable. I also needed that for some of my projects.
However, I can see how it's not practical to do such invasive changes in nxdk.

I think one of the things we could do is to expose constants in some header (probably as macro, so they can be changed dynamically, potentially with a source marker so you know where the constant gets used from).
People could just reconfigure their nxdk if necessary then - however, that doesn't help with closed source apps.

However, I don't think it's practical to keep some of these hacks for niche system configurations or even niche software applications (like my attempt to emulate a virtual NIC from a DXT / xbe-loader) in the upstream nxdk codebase.
A lot of these changes quickly become incompatible with much more important (and required) changes.
Testing such niche applications is also a lot more difficult and therefore we'd be constantly introducing new bugs for those niche cases.

We'd also lose the major benefit of a gaming console (at least in the sense up to the 7th console generation or so): fixed hardware which allows massive simplifications and assumptions (which brings performance gains).
I think this is also a huge draw for some of the developers - Xbox being a fixed target makes it an interesting coding challenge and having a fixed spec brings some comfort.

xbox-linux and some of OpenXDK tried to turn Xbox into a PC.. that's also a noble goal, but I think that's a non-goal for nxdk. And even then, the xbox-linux Xbox platform likely has to make a bunch of assumptions.

and at every turn I seem to hit this review bomb or ignore.

I feel like this PR has gotten a lot more attention than it deserves.

There aren't many active maintainers (or outside reviewers) in XboxDev, but good changes still gets merged eventually.. occasionally I see merges in my GH notifications.

I'd prefer if some things got merged sooner than later, but this is how it is.
How many PRs did you review today? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes.. and that should be addressed..

It should, and I tried. Both an frequency accurate timer version, and a dumb "remove the kill code" version. Both were left stale and nitpicked.

I feel like this PR has gotten a lot more attention than it deserves.

We're all just giving our free time to see the change in the world that we envision. If you feel you're giving it more attention than you should, then I'm confused why you are.

How many PRs did you review today?

None. I don't do software development for work. Hopefully you had a good day at work today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're all just giving our free time to see the change in the world that we envision. If you feel you're giving it more attention than you should, then I'm confused why you are.

There's an assumption of good faith: One would expect to gain more than to lose on doing a review.

Even if there's a bad PR we still have to take the time to review and close it.
That's the cost of doing FOSS.

In turn, we hope that the author improves their code so that it might be eventually mergeable (hopefully giving us a return of invest). Or we hope that some author shows up with perfectly mergeable code (which is an instant win).

This PR in particular looked to become mergeable eventually:

  • it started out with good arguments
  • it had fatal design flaws, but it looked like you are willing to fix it (and learn)

However, then it stagnated quickly, leading to many discussions, rather than progress on the code.
The initial arguments faded into the background as some weird arguments came up like "bad caps" or "unstable clocks at XBE launch" which made little sense.. unless you aren't running a stock Xbox.

Both, @thrimbor and I tried to assist or nudge you in the right direction, but you don't seem to care for our guidance (regarding our comments about numerical stability for example, starting at surface level and then even taking time to explain it with examples - still in good faith.. either that this PR is merged, or that others learn from it).

By now, I've lost faith in this PR. I'll do a last pass responding to the remaining open discussions.
It might still be merged eventually, but I don't think it's going to happen because of comments like "Wont fix".

lib/winapi/profiling.c Outdated Show resolved Hide resolved
lib/winapi/profiling.c Outdated Show resolved Hide resolved
@GXTX
Copy link
Contributor Author

GXTX commented Oct 17, 2023

The 1kHz tick rate with a 200ms sleep means that even the most optimistic error calculation results in a ±0.5% approximation error. Combined with the numerically disadvantageous division that is several orders of magnitude worse precision than what we had before.

It's 0.0004% off on hardware with an extra digit of accuracy by waiting 250mS longer.

PXL_20231017_162706810

lib/winapi/profiling.c Outdated Show resolved Hide resolved
lib/winapi/profiling.c Outdated Show resolved Hide resolved
@thrimbor
Copy link
Member

That's 0.0004% coming from only the last step in your calculation, and arguing that you can get lucky with the measurements doesn't change that this code has ±0.5% approximation error, is numerically badly conditioned and orders of magnitude less precise than what we already have.

@GXTX GXTX force-pushed the fix/query_counters branch from 487f296 to 9a603f7 Compare October 17, 2023 22:33
@GXTX
Copy link
Contributor Author

GXTX commented Oct 17, 2023

Latest commit brings us down from 0.0004% to 0.0001%, with an initial run value of 733332000. Subsequent calls on prime bring this to 733333000.

@GXTX
Copy link
Contributor Author

GXTX commented Oct 18, 2023

I do want to leave this as well. My motivations for this are Microsoft specifically moved away from using KeQueryPerformance* for these functions directly after the initial launch of the Xbox. 3824 uses it, 4361 does not.

I have to imagine there's some reason for this.

s_ticks = KeTickCount;

s_rdtsc.QuadPart -= f_rdtsc.QuadPart;
s_rdtsc.QuadPart /= s_ticks - f_ticks;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb question, but what are s_ and f_ for? Why not start/end or before/after or begin/finish or a/b or ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a real big nit and probably shouldn't have been commented. I'll change it if it's bothering. It's private variables in a function that isn't meant to be called by a developer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8ea50ce

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's private variables in a function that isn't meant to be called by a developer.

❓ ❗ ❓

It's public code in an open-source project that needs to be maintained and should be easily readable and extensible.
It should also follow the established code-convention (of surrounding code).

Copy link
Contributor Author

@GXTX GXTX Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f_ first, we can tell this by looking at the placement when setting the variables value. s_ second, again, we can tell this by the placement within the function. It's been addressed. Not sure the need for the additional comment.

s_rdtsc.QuadPart /= s_ticks - f_ticks;

frequency.QuadPart = s_rdtsc.QuadPart;
frequency.QuadPart *= 1000LL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why multiply here? wouldn't it be more stable to do s_rdtsc *= 1000LL above, before the division?

Copy link
Contributor Author

@GXTX GXTX Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting more digits, but not stable, no. Each run is slightly more variable. Wont fix.

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment wasn't about clock stability, but https://en.wikipedia.org/wiki/Numerical_stability

You are dividing away up to 3 decimal digits (leading to ...000), but at some other point complain about a 1 digit loss in accuracy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but at some other point complain about a 1 digit loss in accuracy

Let me be clear, as this goes into another comment I made. I don't care about this amount of accuracy. I am only doing it to make you happy. If you do not care, then great. Hopefully other maintainers feel the same.

Copy link
Member

@JayFoxRox JayFoxRox Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't care, then why is there still a 700ms sleep at startup and a 200ms sleep for the measurement?

The review mostly focused on getting that time away because it's absolutely not tolerable to delay startup.
It will negatively affect many applications and make many use cases entirely impossible.

You can probably get away with much much less measurement time and get a more accurate result by fixing your math.

You want to calculate something like 2/3*1000 = 666.666... with integer math:

z = round(2/3)*1000 = 0 (yours)
but
z = round(2*1000)/3 = 666 (my proposal)
or even
tmp = round(2*(1000 << 1))/3; z = (x >> 1) + (x & 1) = 667 (even better)

The current method (hardcoded value) is even more accurate than this.. and it will be more exact than the measurement regardless how broken your caps are (unless you overclock/underclock/swapped your CPU).

Ideally you wouldn't have to wait at all (and there's probably many ways to get there).


We care about these changes because they matter. You are making it impossible to get this merged by saying "Wont fix".
You claim you do these changes to support more Xboxes (and our definition of what an Xbox is seems to differ) but effectively you are breaking nxdk for a lot of applications: things like multi-XBEs booting from one XBE into another now taking 1 second longer and timers running a lot less accurate than before, leading to more than 100x times worse measurements - which is large enough to matter in real-world applications; there's a high risk some code (GPU driver, USB, networking, ..) can't handle the 1 second hand-over you aim to introduce.

Copy link
Contributor Author

@GXTX GXTX Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The review mostly focused on getting that time away

Hard disagree. The review has been maintainers attempting to get me to justify every single line of this PR. What percentage would you be happy with? We know @thrimbor isn't happy with 0.5%, so we need to be lower than that. How long are you willing to allow for startup delay? (2nd time I've asked)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are breaking nxdk for a lot of applications

Is this an assumption or do you have real world examples of this? Have you ran this code?

timers running a lot less accurate than before

0.0001% less accurate on a stock box. Which I was under the impression you don't care about.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an assumption or do you have real world examples of this? Have you ran this code?

I don't need to run this code. I have enough coding experience to know that almost 1 second of additional startup time is bad.

I have a couple of applications in mind which would be broken by this (xbe-loader being one of them).
Generally many network protocols if you tried to forward the network connection between XBEs.
I can also think of things like gamepads. Even without the XID specs I can see that > 500 ms is a huge timeout for a protocol.
Disconnecting a gamepad for almost 1 second likely could cause hardware resets (flickering of LEDs or worse - think of force-feedback).

I'm also familiar enough with these APIs and now how they are being used that the loss in precision matters.

0.0001% less accurate on a stock box. Which I was under the impression you don't care about.

Yours is 0.1% as accurate as the existing solution (~1000x worse? Don't quote me on that.. too many numbers being thrown around).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The review has been maintainers attempting to get me to justify every single line of this PR.

Not justify, but modify those lines with errors (or which can be significantly improved with little effort).
That might be close to every line, but that's not an issue with the review - it's an issue with the code.

You should be grateful that people are trying to guide you to better code.

The real issue with this particular review is that valid concerns by two maintainers about the same thing are being ignored (like the numerical stability, which then requires excessive startup delay). Initially we reported some problem at one location, but then the next iteration of the same code had the same issue (the loop was replaced by a single wait which was still too long). And when that wasn't addressed (like the numerical stability), I (we?) sometimes pointed out new problems which arose because the first thing wasn't fixed (such as the 700ms wait with the bad comment).

Pair that with initial skepticism if more than 50% PR is beneficial.
Doing it like the XDK (rdtsc) can be a good idea and it's the initial argument which made this worth reviewing.
Adding the dynamic measurement is likely a bad idea and the way it's done here is poorly executed - especially on closer inspection.

The XDK / rdtsc thing with hardcoded value is mergeable in isolation if the later XDKs also do it (although this might just be a case of function inlining / LTO across the kernel barrier).
It should have been a separate PR and it likely would have been merged immediately.

The dynamic measurement - with some changes - could become acceptable, but it's still worse than the hardcoding and doesn't address the initial use-case you envision (the reasoning / description of which doesn't make sense).. the only use case it does help is with overclocking/underclocking/CPU swaps (which is fine, but not if it has severe drawbacks for everyone else).

KeEnterCriticalRegion();

// The values generated after launching aren't accurate, give it time to increment...
Sleep(700);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values generated after launching aren't accurate

This can't be true for rdtsc. I'm not sure about the clock source for the KeTickCount, but I assume it's some other low level timer.

Also, why would the ticks be incorrect during early boot?! Especially in an XBE I'd expect a soft-reset.

By the point the XBE loads, a bunch of kernel code has ran already. The rdtsc is a CPU hw feature and directly tied to the clock in some older x86 CPUs (including Pentium 3 I assume?), I've never heard of it being unstable (at least not for 700ms) - if it was, then your CPU would likely hang.

If your reference clock source (KeTickCount) is unstable, then find one which is. However, I assume if your CPU clock is off, then you can't really trust the other clocks either. I also don't think KeTickCount has such a bad startup time either.

All of this hackery, to me, looks like it's only meant to support overclocking the CPU (so you attempt to measure a constant factor of your CPU clock to some reference clock which runs at the stock clock).

@thrimbor
Copy link
Member

thrimbor commented Oct 18, 2023

I feel like this PR wasted far too much reviewer time already and it's time for a "final word". To summarize, the issues are:

  • excessive start-up delay
  • imprecise timing source (tick counter) which introduces a ±0.5% error when measuring with a 200ms interval
  • numerical issues due to how the calculation is done
  • compile-time switch

Doubling down on the decisions behind these flaws, or, as you call it, "justifying", will not change any of this. If the issues above are not an issue to you than that is fine and you can keep the code around in your own fork, but with them present this PR will not get merged. If you are unwilling to fix them - or want to continue to ignore the parts of the review that made you aware of them - then I suggest you close this PR and let it serve as a source of information for someone who is willing to put his effort into a solution that doesn't have these problems.

Here are some suggestions of things that could get merged:

  • A PR with just the rdtsc change. Not necessarily because it's the way the XDK does it, but because it's an objectively better timing source than what we have now (should be around ~209 times more precise)
  • A PR that calculates the timer frequency from FSB frequency and multipliers and stores it in a global variable - if it matches the hardcoded value we have on a stock Xbox (or is maybe even more precise). This will support configurations that are outside the Xbox's spec, so this is not a necessity nor a priority for nxdk, but rather something that is nice to have. I don't expect such a change to have the negative performance impact that a runtime-modifiable NV base address would have.

Whether something with a ~100ms or less start-up delay should get merged is still debatable, especially since its only benefit would be supporting systems that - intentionally or not - run outside of the specs of an Xbox.

The main focus of nxdk is supporting retail Xbox systems, like the XDK did. If we can extend that support to things the XDK did not support, without any major drawbacks (that includes maintainability!), then why not. But adding a hack that benefits only people that run their Xbox outside of what the XDK supported and has very obvious problems is not an option for nxdk.

@GXTX
Copy link
Contributor Author

GXTX commented Oct 24, 2023

Better?

@GXTX GXTX force-pushed the fix/query_counters branch from 2f6d75b to ef0e000 Compare October 24, 2023 10:11
@GXTX
Copy link
Contributor Author

GXTX commented Nov 24, 2023

Status?

@GXTX
Copy link
Contributor Author

GXTX commented Jan 26, 2024

Status?

1 similar comment
@GXTX
Copy link
Contributor Author

GXTX commented Jun 23, 2024

Status?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants