I didn't find Casey's arguments to be that persuasive.
Modern processors are insanely fast at adding or multiplying two numbers together.
Wrap the math in a subroutine call and loop over it a hundred trillion times so you can actually measure it, and of course the call overhead shows up as a significant percentage of the work done.
Make that a vtable-based method dispatch (horrors) and it's blindingly obvious that you're measuring the dispatch machinery more than the addition itself.
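Here's a rough sketch of the kind of measurement I mean (not Casey's actual benchmark; the names and iteration count are made up). The "work" is a single add behind a virtual call, so the loop spends a large share of its time on call and dispatch overhead rather than arithmetic:

```cpp
#include <chrono>
#include <cstdio>

struct Adder {
    virtual ~Adder() = default;
    virtual long add(long a, long b) const = 0;  // vtable-based dispatch
};

struct PlainAdder : Adder {
    long add(long a, long b) const override { return a + b; }
};

int main() {
    PlainAdder impl;
    const Adder& adder = impl;  // dynamic dispatch; a real benchmark would hide
                                // the concrete type so the compiler can't devirtualize
    constexpr long iterations = 100'000'000;

    auto start = std::chrono::steady_clock::now();
    long sum = 0;
    for (long i = 0; i < iterations; ++i) {
        sum += adder.add(i, 1);  // one trivial add, one virtual call, per iteration
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    std::printf("sum=%ld  ns/iteration=%.2f\n", sum,
                std::chrono::duration<double, std::nano>(elapsed).count() / iterations);
}
```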
Jump instead to a function that loops through and processes a list of a thousand items, and now the cost of the function call or dispatch is dwarfed by the amount of actual work being done.
Have that function do a disk read or (heavens) make a network call, and the call overhead is about as significant as a single grain of sand on a beach.
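For contrast, a sketch of that second shape of call, with a hypothetical process_batch function standing in for real code: the call is paid once, the work is paid per item, and the latency figures in the comments are rough orders of magnitude rather than measurements.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical batch function: the call happens once, the work happens per item.
long process_batch(const std::vector<long>& items) {
    long total = 0;
    for (long v : items) {
        total += v * v;  // stand-in for real per-item work
    }
    return total;
}

int main() {
    std::vector<long> items(1000);
    std::iota(items.begin(), items.end(), 0L);

    // One call covers a thousand items of work, so the few nanoseconds of call
    // (or virtual dispatch) overhead are amortized across all of them. Put a
    // disk read (tens of microseconds) or a network round trip (milliseconds)
    // inside the function and the call overhead is effectively unmeasurable.
    long result = process_batch(items);
    std::printf("result=%ld\n", result);
}
```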