In case you wonder what Nuitka is, look here. Over the 0.3.x release cycle, I have mostly looked at its performance with "pystone". I merely wanted to have a target to look at and enjoy the progress we have made there.
In the context of the Windows port, Khalid Abu Bakr ran pybench on Windows, and that got me interested. It's a nice collection of micro benchmarks, quite obviously aimed at comparing CPython implementations only. As such, it's quite good for checking where Nuitka is already good and where it can still be improved for the milestone 2 stuff.
pybench refused to accept that Nuitka could use so little time on some tests, so I needed to hack it to allow that.
Then it raised "ZeroDivisionError" exceptions, because Nuitka may not run fully predictable code at all, giving a time of 0ms, which makes for interesting factors.
Also, there are many results and we are going to care about regressions only, so there is now an option to output only the tests with negative values.
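
The last two hacks can be sketched roughly like this. This is not the actual pybench patch. The function names and the sample numbers are made up for illustration, and I am assuming the "diff" is the CPython time relative to the Nuitka time, so that negative values mean Nuitka is slower:

```python
import math


def diff_percent(cpython_ms, nuitka_ms):
    """Relative difference in percent (assumed convention: positive = Nuitka faster)."""
    if nuitka_ms == 0:
        # A fully predictable test can take no measurable time at all,
        # so report "infinity" instead of raising ZeroDivisionError.
        return math.inf
    return (cpython_ms / nuitka_ms - 1.0) * 100.0


def regressions_only(results):
    """Keep only the tests with a negative diff, i.e. where Nuitka is slower."""
    diffs = {name: diff_percent(c, n) for name, (c, n) in results.items()}
    return {name: d for name, d in diffs.items() if d < 0}


# Hypothetical timings in ms: (CPython, Nuitka). Not real measurements.
sample = {
    "example_faster": (100.0, 50.0),
    "example_slower": (100.0, 200.0),
    "example_folded": (100.0, 0.0),
}

print(regressions_only(sample))
```

With these made-up inputs, only "example_slower" survives the filter, and "example_folded" gets an infinite diff instead of crashing the report.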
Nuitka currently has some areas where optimizations are already so effective as to render the whole benchmark pointless. Long term, most of pybench will not be looked at anymore: where the factor becomes "infinity", there is little point in looking at it. We will likely just use it as a test that optimizations didn't suddenly regress. Publishing the numbers will not be as interesting.
Then there are slowdowns. These I take seriously, because of course I expect Nuitka to only ever be faster than CPython. Sometimes Nuitka's implementation of rarely used features is subpar though. I color coded these in red in the table below.
ComplexPythonFunctionCalls: These are twice as slow, which is a tribute to the fact that the code in this domain is only as good as it needs to be. Of course function calls are very important, and this needs to be addressed.
TryRaiseExcept: This is much slower because of the cost of the raise statement, which is extremely high currently. For every raise, a frame object with a specific code object is created, so the traceback will point to the correct location. This is very inefficient and wasteful. We need to be able to create code objects that can be used for all lines needed, so that we can re-use them and have only one frame object per function, which can then be re-used itself. There is already some work for that in current git (0.3.9 pre 2), but it's not yet complete at all.
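
To get a feeling for what this kind of test measures, here is a rough sketch in the spirit of TryRaiseExcept. It is not the pybench code itself: the function names, the exception class, and the round count are my own choices for illustration. It compares a loop that raises and catches an exception with a loop that only enters the try block:

```python
import timeit


class BenchError(Exception):
    """Hypothetical exception type used only for this sketch."""


def try_raise_except(rounds):
    """Raise and catch one exception per iteration."""
    caught = 0
    for _ in range(rounds):
        try:
            raise BenchError
        except BenchError:
            caught += 1
    return caught


def try_only(rounds):
    """Same loop shape, but the try block never raises."""
    done = 0
    for _ in range(rounds):
        try:
            done += 1
        except BenchError:
            pass
    return done


if __name__ == "__main__":
    rounds = 100000
    # Take the minimum of several runs, as pybench does with its "min" column.
    t_raise = min(timeit.repeat(lambda: try_raise_except(rounds), number=1, repeat=5))
    t_plain = min(timeit.repeat(lambda: try_only(rounds), number=1, repeat=5))
    print("raise: %.1f ms, no raise: %.1f ms" % (t_raise * 1000, t_plain * 1000))
```

Running the same file under CPython and as a compiled binary gives a quick view of how much of the time goes into the raise itself.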
WithRaiseExcept: Same problem as TryRaiseExcept: exception raising is too expensive.
Note also that -90% is in fact much worse than +90%: the "diff" numbers from pybench make improvements look much better than regressions do. You can also check out the comparison on the new benchmark pages that I am just creating. They are based on codespeed, which I will blog about separately.
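
To see why -90% is so much worse than +90%, it helps to turn the "diff" back into a factor. The sketch below assumes that the diff is the CPython time divided by the Nuitka time, minus one, in percent. That is my reading of the numbers, not something taken from the pybench source, and the function name is made up:

```python
def speed_factor(diff_percent):
    """Convert a diff in percent into CPython time / Nuitka time.

    Assumed convention: diff = (cpython / nuitka - 1) * 100.
    A factor above 1 means Nuitka is faster, below 1 means it is slower.
    """
    return 1.0 + diff_percent / 100.0


for diff in (90.0, -90.0):
    factor = speed_factor(diff)
    if factor >= 1.0:
        print("%+.0f%%: Nuitka is %.1f times faster" % (diff, factor))
    else:
        print("%+.0f%%: Nuitka is %.1f times slower" % (diff, 1.0 / factor))
```

Under that assumption, +90% means Nuitka is 1.9 times faster, while -90% means it is 10 times slower. Regressions can never go below -100%, but improvements have no upper limit.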
Look at this table of results as produced by pybench:
|**Test Name**||**min CPython**||**min Nuitka**||**diff**|