Support for non-x86 architectures #17

Open
nemequ opened this issue Jul 31, 2017 · 5 comments

Comments

@nemequ
Collaborator

nemequ commented Jul 31, 2017

Seems like this would be pretty difficult, but I'd love to have something like this working on other architectures, especially ARM.

@travisdowns
Owner

Indeed, it's "on the roadmap" so to speak. The idea is that most of the code should be as portable as reasonable, with only the required amount of assembly. I'm also planning "C" benchmarks for various things: some things are only reasonable in assembly, but many can be done in C too, making them automatic on other archs.

I do have an Android phone, so probably I can help do this as well. Right now I'm working on the x86 performance counter support, however, and this is almost done.

@nemequ
Collaborator Author

nemequ commented Aug 1, 2017

> I do have an Android phone, so probably I can help do this as well. Right now I'm working on the x86 performance counter support, however, and this is almost done.

If you want, I can give you an SSH account on a Raspberry Pi 3 I have sitting around for development, running Fedora 26.

@travisdowns
Owner

OK, I may take you up on that offer if it's still open when I get to this!

@nemequ
Collaborator Author

nemequ commented Mar 9, 2018

I just came across __builtin_readcyclecounter which I didn't know existed, though maybe you did. I know uarch-bench does a lot more than just a rdtsc so I'm not sure if it's usable or not, but I thought I'd mention it in case it is.

Apparently it's been around for a while (at least since clang 3.4, didn't bother checking past that), though it doesn't work everywhere. According to a SO answer it does work on AArch64…
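
For anyone who wants to try the builtin, here is a minimal sketch of a guarded wrapper. It is not uarch-bench code: `read_ticks`, `ticks_for`, `HAVE_CYCLE_BUILTIN` and the opt-in macro `USE_READCYCLECOUNTER` are hypothetical names. It assumes the builtin should only be enabled automatically on x86, because the Clang documentation notes that its use may depend on run-time privilege or other OS-controlled state on some targets; everywhere else it falls back to `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Older compilers lack __has_builtin; treat every builtin as absent there.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Assumption: enable the builtin automatically only on x86. On other targets,
// opt in explicitly with the hypothetical USE_READCYCLECOUNTER macro.
#if __has_builtin(__builtin_readcyclecounter) && \
    (defined(__x86_64__) || defined(__i386__) || defined(USE_READCYCLECOUNTER))
#define HAVE_CYCLE_BUILTIN 1
#else
#define HAVE_CYCLE_BUILTIN 0
#endif

// Hypothetical helper: returns a tick count from the builtin when enabled,
// otherwise nanoseconds from a monotonic clock. The two units differ, so
// only compare values produced by the same build.
inline std::uint64_t read_ticks() {
#if HAVE_CYCLE_BUILTIN
    return __builtin_readcyclecounter();
#else
    const auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
#endif
}

// Ticks elapsed while running `work` once.
template <typename F>
std::uint64_t ticks_for(F&& work) {
    const std::uint64_t start = read_ticks();
    work();
    const std::uint64_t end = read_ticks();
    // Both sources are expected to be non-decreasing within one thread.
    assert(end >= start);
    return end - start;
}
```

Calling `ticks_for([&] { /* code under test */ })` gives a raw tick delta; whether those ticks are cycles, TSC reference ticks, or nanoseconds depends on which branch was compiled in.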

@travisdowns
Owner

travisdowns commented Mar 9, 2018

@nemequ - thanks for the note, I didn't know about it! Some thoughts (probably most of this is not news to you, but it's helpful for me to write it anyways):

On x86 it uses rdtsc, which makes the name a little bit wrong: it's counting wall-clock time, not cycle time[0]. I don't actually use rdtsc directly at all in uarch-bench at the moment: if you use the default timer, it just uses std::chrono::high_resolution_clock::now(), which pretty much directly calls clock_gettime() on Linux, which in turn is implemented in the VDSO as a usermode call to rdtsc plus some adjustment. So I am kind of using rdtsc, but in an indirect way (AFAIK the overhead is perhaps 2x a raw rdtsc call, with the benefit that I'm using a portable C++ implementation). The way the tests and scaffolding are written, we usually do several loops, and also try to subtract out the clock overhead, so the absolute overhead itself isn't a problem: stability is more important (that said, a raw rdtsc call would no doubt be more stable as well, with far fewer sources of variance).
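
To make the "several loops, subtract the clock overhead" idea concrete, here is a rough sketch. It is not the actual uarch-bench scaffolding: `clock_overhead_ns` and `ns_per_iteration` are hypothetical names, and taking the minimum over repeats is an assumption made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

using Clock = std::chrono::high_resolution_clock;

// Nanoseconds between two time points.
inline std::int64_t elapsed_ns(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
}

// Estimate the cost of a now()/now() pair by timing back-to-back calls and
// keeping the smallest observed gap.
inline std::int64_t clock_overhead_ns(int samples = 1000) {
    assert(samples > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < samples; ++i) {
        const auto t0 = Clock::now();
        const auto t1 = Clock::now();
        best = std::min(best, elapsed_ns(t0, t1));
    }
    return best;
}

// Time `iters` calls of `body`, repeat that `repeats` times, keep the fastest
// run, subtract the clock overhead, and report nanoseconds per call.
template <typename F>
double ns_per_iteration(F&& body, std::int64_t iters, int repeats = 10) {
    assert(iters > 0 && repeats > 0);
    const std::int64_t overhead = clock_overhead_ns();
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = Clock::now();
        for (std::int64_t i = 0; i < iters; ++i) body();
        const auto t1 = Clock::now();
        best = std::min(best, elapsed_ns(t0, t1));
    }
    // Clamp at zero in case the overhead estimate exceeds a very short run.
    const std::int64_t adjusted = std::max<std::int64_t>(best - overhead, 0);
    return static_cast<double>(adjusted) / static_cast<double>(iters);
}
```

With a large `iters`, the fixed clock cost is amortized across the loop, so what remains is mostly run-to-run variance, which is why stability matters more than the absolute overhead.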

Unfortunately, godbolt doesn't seem to have any clang-ARM targets (it does have gcc-ARM, but gcc doesn't support this builtin).

All that to say that if the compiler in question on ARM implements high_resolution_clock::now() in a similarly efficient way, then the default timer should more or less just already work with reasonable performance[1]. Still, on both x86 and ARM it's probably worth adding a mode that reads the counter directly (rdtsc on x86, via this builtin or inline asm) to reduce the variance.

Now the more interesting timer is the --timer=libpfc one, which gives you access to the PMU and is what I usually use. Not only does it often give you cycle-accurate measurements (at least in some modes and some types of benchmark), but you can add other interesting events and have them displayed alongside the cycle results. To get that to work on ARM we'd need a library like libpfc that gives access to the performance counters. I know they exist on ARM, but I don't know, for example, if there is a "user mode" instruction to read them. The existence of that on x86 (note: the OS needs to give you permission to use it) is what makes the cycle-accurate timings possible.
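
As a point of comparison, here is a sketch that counts cycles through the Linux `perf_event_open` system call instead of a user-mode counter read. This is a different technique from libpfc, swapped in only because it is architecture-neutral: every read goes through the kernel, so it carries more overhead than reading the counter directly. `count_cycles` is a hypothetical helper, and the call can fail (restrictive `perf_event_paranoid` setting, no PMU exposed in a VM, non-Linux OS), in which case it returns false.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__linux__)
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#endif

// Hypothetical helper: runs `work` `iterations` times and stores the number
// of user-mode CPU cycles in `cycles_out`. Returns false, leaving
// `cycles_out` untouched, if the counter cannot be opened or read.
template <typename F>
bool count_cycles(F&& work, int iterations, std::uint64_t& cycles_out) {
    assert(iterations > 0);
#if defined(__linux__)
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user-mode cycles only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure the calling thread on whichever CPU it runs.
    const int fd = static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) return false;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < iterations; ++i) work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    // With the default read_format, read() yields a single 64-bit count.
    std::uint64_t cycles = 0;
    const bool ok =
        read(fd, &cycles, sizeof(cycles)) == static_cast<ssize_t>(sizeof(cycles));
    close(fd);
    if (!ok) return false;
    cycles_out = cycles;
    return true;
#else
    (void)work;
    (void)cycles_out;
    return false;
#endif
}
```

A caller would check the return value and fall back to a clock-based timer when it is false, for example `std::uint64_t c = 0; if (count_cycles(fn, 1000, c)) { /* use c */ }`.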


[0] Except on a small slice of decade-old CPUs around the time frequency scaling was becoming popular where rdtsc briefly counted in cycles even though the frequency could vary. That made it suck for implementing time APIs, which are much more popular than cycle APIs, so Intel changed it. Notably, it's still implemented under the covers as a true cycle (not time) counter, with adjustment logic to scale the count based on the current frequency, which makes it slower than a raw cycle counter. Unfortunately, there is no instruction to access the raw cycle counter, even though it exists!

[1] Of course there is the small problem that nearly all of the benchmarks themselves are written not in C/C++ but in x86 asm, so those naturally won't work on ARM. Still, it would be easy to port most simple benchmarks over, and the idea is to have C++ versions of the ones that can be expressed without asm.
