Optimization Techniques by Benchmark Winners

By Juanito Fatas

This post is based on Jeremy Evans‘s Optimization Techniques Used by the Benchmark Winners presentation. Slides | YouTube at Ruby Kaigi 2019. It‘s better to watch his fabulous presentation, but you can come back here for written references. These techniques coming from maintaining Sequel and Roda. These two gems have always been 0 issues. It‘s probably the best maintained gems. Sequel and Roda are the leader for Ruby in TechEmpower benchmark results. Let‘s see what lessons can we learn to optimize Ruby.

These techniques may not be the best to write at application code level but good when building libraries, optimize critical path of your application. The ideas and principles are universally applicable.

Avoid processing


No code is faster than no code
-- Merb Motto

Fastest code does the least. Execute less code. Run code once instead of many times, onces is better than many. In big-o terms: O(1) > O(n).

Take initialize model instance as example, the Sequel is definitely faster than Active Record because it does much less things.

Active Record:

# https://github.com/rails/rails/blob/8ae8e19a33f4749e529b1649f4f0ace1dbb37285/activerecord/lib/active_record/core.rb#L357-L369
def init_with_attributes(attributes, new_record = false) # :nodoc: @new_record = new_record @attributes = attributes init_internals yield self if block_given? _run_find_callbacks _run_initialize_callbacks self


# https://github.com/jeremyevans/sequel/blob/4a146796f273ce14e121954ad4e7f3134209386d/lib/sequel/model/base.rb#L221
def call(values) o = allocate o.instance_variable_set(:@values, values) o

Delay Computations

For web app, it means to do at initialization time is better than doing at Runtime. See Hound applies this for putting environment variables into constants.

Example: Active Record & Sequel Callbacks

Active Record runs find and initialize callbacks during intialization. Even when there is no callbacks defined on the model, so it slows down:

# lib/active_record/core.rb
def init_with_attributes(attributes, new_record = false) ... _run_find_callbacks _run_initialize_callbacks ...

Sequel does this by having a Plugin System. In Sequel, if you need callbacks, add plugin :after_initialize to your model. Nothing gets run if you‘re not using callbacks.

Inheritance gives every subclass things that they may not need. Prefer Composition over Inheritance. Plugins give all subclass a chance to only take what they need. Sequel and Roda use plugins for most features so they are faster.

Reduce Object Allocations


The init_internals in Active Record set few instance variables to nil:

def init_internals @primary_key = self.class.primary_key @readonly = false @destroyed = false @marked_for_destruction = false @destroyed_by_association = nil @_start_transaction_state = nil @transaction_state = nil self.class.define_attribute_methods

Jeremy found avoid @instance_variable = nil resulted in 150% faster in Sequel and Roda. This is controversial because only in verbose mode, using instance variable without define first will emit warnings. Emit warnings makes it slower. He made a patch to remove such warnings but got rejected for backward compatability.

You can find another example of delay allocation of instance variables in this commit from rails/rails. Delay setting instance variable to hash by setting it to nil first.

Another common place to reduce object allocations is string. Before Ruby 2.3, we need to call freeze on strings. Since 2.3, we can add a frozen string literal in a file and by default all strings in that file are frozen.

Reduce String Allocations

Before frozen string literal (Ruby < 2.3), introduce a constant for common places of strings SELECT = 'SELECT'.freeze was okay:

SELECT = 'SELECT'.freeze
SPACE = ' '.freeze
FROM = 'FROM'.freeze def select_sql sql = String.new sql << SELECT << SPACE sql << literal(columns) sql << SPACE << FROM << SPACE sql << literal(table)

Since Ruby 2.3:

# frozen-string-literal: true def select_sql sql = String.new sql << "SELECT " sql << literal(columns) sql << " FROM " sql << literal(table)

It‘s simpler and easier to understand. Avoids string allocations in the meantime with less operations (<<), hence faster.

Reduce Hash allocations

Avoid default hash in method definition:

class Sequel::Dataset def union(dataset, opts={}) compound_clone(:union, dataset, opts) end

Introduce a empty frozen hash constant:

class Sequel::Dataset OPTS = {}.freeze # 100% faster than `opts={}`. def union(dataset, opts=OPTS) compound_clone(:union, dataset, opts) end # Keyword argument is slower than Optimal hash constant. def union(dataset, **opts) compound_clone(:union, dataset, opts) end

Right now double splat keyword arguments are slower (2.7) than above empty frozen hash constant technique. Keyword arguments also have maintenance barriers the method caller and calle all need to change when keyword changes.

Reduce Proc Allocation


For proc does not depend on something else or any runtime state:

def indifferent_params Hash.new { |h, k| h[k.to_s] if k.is_a?(Symbol) }

can extract into a constant:

IND = proc { |h, k| h[k.to_s] if k.is_a?(Symbol) }
def indifferent_params2 Hash.new(&IND)

About 300%+ faster.

Benchmark.ips do |x| x.report("inline proc") { indifferent_params } x.report("constant proc") { indifferent_params2 } x.compare!
end Comparison: constant proc: 5645873.6 i/s inline proc: 1469374.7 i/s - 3.84x slower

Reduce Array Allocation

This is not from the presentation but I think it‘s great to also include an example here.

Use transform_values when modifying hashes. See this commit from rails/rails:

 def escape(params)
- Hash[params.map { |k, v| <ruby>k, Rack::Utils.escape<rp>(</rp><rt>v</rt><rp>)</rp></ruby> }]
+ params.transform_values { |v| Rack::Utils.escape(v) }

Minimize Indirection

Say a common operation in our app is to convert a string into Integer. We could have a lambda, a kernel method, a dedicated object with call to do so, or an object that aliased call to Integer():

string = "42" using_lambda = lambda { |str| Integer(str) }
using_kernel = Kernel.method(:Integer)
using_object = Object.new
def using_object.call(str); Integer(str); end
no_indirection_object = Object.new
class << no_indirection_object alias call Integer public :call
end Benchmark.ips do |x| x.report("lambda") { using_lambda.call(string) } x.report("Kernel") { using_kernel.call(string) } x.report("Object") { using_object.call(string) } x.report("Object v2") { no_indirection_object.call(string) } x.compare!
Comparison: Object v2: 6258404.2 i/s Object: 5674514.7 i/s - same-ish: difference falls within error lambda: 5199998.6 i/s - 1.20x slower Kernel: 5130274.6 i/s - 1.22x slower

We see using Kernel.method is slowest here. A lambda is slightly faster than that. Define a method on an object is faster than calling a lambda. Avoid indirection (by alias call to Integer) further makes it ~10% faster. The benchmark example above aligns with Jeremy‘s observations on benchmarking Sequel from the presentation.

Defining Methods


def foo 1
end # 50% slower than `def foo; 1; end`
define_method(:foo) do 1

So we always want to define method with def. Sometimes we want to define at runtime. Take Sequel as example, it needs to dynamically define model columns‘s setters and getters, can achieve with class_eval + def.

# def_column
columns.each do |column| class_eval("def #{column}; @values[:#{column}]; end") class_eval("def #{column}=(v); @values[:#{column}] = v; end")

This works for column name without spaces "name", with space "employee name" it does not work. We need to use define_method:

columns.each do |column| column = column.to_sym define_method(column) do @values[column] end define_method(:"#{column}=") do |v| @values[column] = v end

which is slower than using class_eval + def.

These columns with "bad" names are not the majority of the case but changing the implementation to this hurts performance. Here comes another general optimization principle.

Separate Common from Uncommon


We usually have a fast approach for the simple cases but it fails in more complex cases. Assume the simple cases is more common, we can speed up by separating the two cases. Sequel separate the common from the uncommon, to have the best of both worlds. Below are simplified from Sequel‘s column definitions:

def def_column_accessor(*columns) columns, bad_columns = columns.partition { |x| /\A[A-Za-z_][A-Za-z0-9_]*\z/.match(x.to_s) } bad_columns.each{|x| def_bad_column_accessor(x)} # define_method columns.each do |column| # faster with `def` end

In general def > define_method, but there is one case where define_method is preferred for performance. Let‘s see an example:

class Def def self.def_numbers(first, last) class_eval("def numbers; (#{first}..#{last}).to_a.freeze; end") end def self.def_numbers2(first, last) array = (first..last).to_a.freeze define_method(:numbers2) do array end end
Benchmark.ips do |x| x.report("def") { Def.def_numbers(1, 10) } x.report("define_method") { Def.def_numbers2(1, 10) } x.compare!
end Comparison: define_method: 542551.0 i/s def: 62085.7 i/s - 8.74x slower

So prefer def over define_method unless you can access local variables from surrounding scope to avoid computations, only then define_method > def.

Related to this. If we need to accept blocks and store blocks, and use instance_exec to execute these blocks. Let‘s implement a before hook to demostrate this idea.

def self.before(&block) before_hooks << block
end def before # execute all blocks passed to self.before self.class.before_hooks.each { |block| instance_exec(&block) }

This works but instance_exec creates singelton class for the instance, it‘s faster to use methods. We can define before hooks methods by naming them based on their position in the array, then we pass to define_method:

def self.before(&block) meth = :"_before_hook_#{before_hooks.length}" define_method(meth, &block) before_hooks << meth
end def before self.class.before_hooks.each { |m| send(m) }

Execute these before hooks could be a send which is faster. This is good but we can do better. Since we know which methods will be executed, we can define the before to execute these before hook methods by:

def self.before(&block) # ... class_eval("def before; #{before_hooks.join(';')}; end")

This is faster because it avoids the need to call each on before hooks array and each method invoked directly instead of one indirection send. But we can still do better.

def self.before(&block) # ... if before_hooks.length > 1 class_eval("def before; #{before_hooks.join(';')}; end") else class_eval("alias before #{before_hooks.first}") end

If there is only a single before hook, we can alias to that method hence minimized indirection.

This approach to implement before hooks compare to using define_method, is about 200% faster because it avoids a lot of internal indirections. One caveat of this approach is that it does not work with before { |x| }. But we can fix this by:

unless block.arity == 0 || block.arity == -1 b = block block = lambda { instance_exec(&b) }

Optimize Inner Loops


Another best place to start optimizing is inside any inner loops. Let‘s see an example from Sequel‘s Anywhere adapter:

# https://github.com/jeremyevans/sequel/blob/ae50b5e98efb3ad902fb37f370cbf237644a12d3/lib/sequel/adapters/sqlanywhere.rb#L157-L187
while db.api.sqlany_fetch_next(rs) == 1 max_cols = db.api.sqlany_num_cols(rs) h2 = {} max_cols.times do |cols| h2[col_map[db.api.sqlany_get_column_info(rs, cols)[2]]||db.api.sqlany_get_column_info(rs, cols)[2]] = cps[db.api.sqlany_get_column_info(rs, cols)[4]].nil? ? db.api.sqlany_get_column(rs, cols)[1] : db.api.sqlany_get_column_info(rs, cols)[4] != 500 ? cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) : convert ? cps[db.api.sqlany_get_column_info(rs, cols)[4]].call(db.api.sqlany_get_column(rs, cols)[1]) : db.api.sqlany_get_column(rs, cols)[1] end yield h2
end unless rs.nil?

And there are some hidden ternary operators without parentheses...So this is complex, but not the reason why it‘s slow. It‘s slow because:

db.api.sqlany_get_column_info(rs, cols) repeated 5 times with same arguments in the inner loop. Another one is db.api.sqlany_get_column(rs, cols)[1] repeated 4 times. After a serious refactoring, now we arrive at:

def fetch_rows(sql) db = @db cps = db.conversion_procs api = db.api execute(sql) do |rs| convert = convert_smallint_to_bool col_infos = [] api.sqlany_num_cols(rs).times do |i| _, _, name, _, type = api.sqlany_get_column_info(rs, i) cp = if type == 500 cps[500] if convert else cps[type] end col_infos << [output_identifier(name), cp] end self.columns = col_infos.map(&:first) max = col_infos.length if rs while api.sqlany_fetch_next(rs) == 1 i = -1 h = {} while (i+=1) < max name, cp = col_infos[i] v = api.sqlany_get_column(rs, i)[1] h[name] = cp && v ? cp.call(v) : v end yield h end end end self

The inner loop now become much simpler:

while (i+=1) < max name, cp = col_infos[i] v = api.sqlany_get_column(rs, i)[1] h[name] = cp && v ? cp.call(v) : v

All operations inside are stored on local variables. This seems not a big deal but if you are retrieving 100,000 rows with each row of 10 columns, this makes a big difference. By define local variables, this saves millions of method calls. This is another general optimization principle.

Prefer Local Variables


Prefer local variables whenever possible especially in inner loop as demostrated above.

Local variables > Instance variables
Local variables > Constant
Local variables > Method Calls

local variables are faster because it minimize the amount of indirections.

while v.s. each

Let‘s go back to above refactoring example. The deliberate use of while in the inner loop:

while (i+=1) < max name, cp = col_infos[i] v = api.sqlany_get_column(rs, i)[1] h[name] = cp && v ? cp.call(v) : v

Inner loop like this is one of the few places that make sense to use while instead of each because using each for inner loops can hurt performance, because it requires a seperate a stackframe to push and pop for each iteration.

Choose Faster Algorithms


Sinatra routes (by pattern) v.s. Roda‘s routing tree

Given thoundsands of routes. Sinatra will be busy matching against all routes. O(N) Roda‘s routing tree would take route segments. Filter out the segment to the right tree. O(log(N))

Roda.route do |r| r.on "foo" do # /foo ... end r.on "bar" do # /bar ... end

But the worst case is still O(N) and Jeremy does not accept the wrost case being O(n)... Here comes multi route:

Roda.plugin :multi_route Roda.route("foo") { |r| # /foo... }
Roda.route("bar") { |r| # /bar... } Roda.route { |r| r.multi_route }

Multi route builds a regexp to match initialial segment (/foo, /bar, ...) then dispatch:

Roda.route { |r| r.multi_route }

which brings the performance back to O(log(n)). But he wants O(1)...

Roda.plugin :static_routing Roda.static_get("foo") { |r| # /foo... }
Roda.static_get("bar") { |r| # /bar... } Roda.route { |r| }

But O(1) gives up many advantages. In the end, he created hash_routes plugin to have the best of both worlds:

Roda.plugin :hash_routes Roda.hash_routes do on("foo") do |r| r.hash_routes end is("foo/bar") do |r| end Roda.route do |r| r.hash_routes

Cache when Possible


Introduce cache is the highest ratio of following formula:

 % Increase in Performance
----------------------------- Lines of Code Changed

Maximum performance increase over minimum lines of code changed.

Use case: Sequel Symbol Literalization


# => '"column_name"'
def self.split_symbol(sym) v = case s = sym.to_s when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/ [$1.freeze, $2.freeze, $3.freeze].freeze when /\A((?:(?!___).)+)___(.+)\z/ [nil, $1.freeze, $2.freeze].freeze when /\A((?:(?!__).)+)__(.+)\z/ [$1.freeze, $2.freeze, nil].freeze else [nil, s.freeze, nil].freeze end v

Introduce symbol cache:

SPLIT_SYMBOL_CACHE = {} def self.split_symbol(sym) unless v = Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] } v = case s = sym.to_s when /\A((?:(?!__).)+)__((?:(?!___).)+)___(.+)\z/ [$1.freeze, $2.freeze, $3.freeze].freeze when /\A((?:(?!___).)+)___(.+)\z/ [nil, $1.freeze, $2.freeze].freeze when /\A((?:(?!__).)+)__(.+)\z/ [$1.freeze, $2.freeze, nil].freeze else [nil, s.freeze, nil].freeze end Sequel.synchronize { SPLIT_SYMBOL_CACHE[sym] = v } end v

This is over 1000% faster after introduced this cache.

Globally Frozen, Locally Mutable


Freeze global state like classes, objects that is accessed by multiple threads. Local objects instantiated per request remain mutable for ease of use (cache). This approach is mainly to improve reliablity (thread safe). This also improves performance because frozen object means it is easy to cache them.

Note that Frozen is not the same as Immutable. Frozen means Immutable State + Mutable Cache.

# https://github.com/jeremyevans/sequel/blob/cc5c1612fb4c42e7cbf6db6b0572001fd5bf6b58/lib/sequel/dataset/misc.rb#L25-L30
OPTS = {}
def initialize(db) @db = db @opts = OPTS # state in frozen hash cache = {} # <-- mutable cache freeze # <-- freeze this object

Make sure access to mutable cache is protected by Mutex for thread safety.

Optimization Through Metaprogramming


class Album < Sequel::Model dataset_module do def by_name order(:name) end def released where(released: true) end end
end # Album.where(released: true).order(:name).first
# => Album.released.by_name.first

This is good to DRY up code but does not improve performance. Jeremy introduced metaprogramming for users to define like following and speed up by intrdouced dataset cache to these metaprogramming generated methods:

class Album < Sequel::Model dataset_module do order :by_name, :name # def by_name # cache { order(:name) } # end where :released, released: true # def released # cache { where(released: true) } # end end

Example: Roda String Matching


Roda was forked from Cuba. The matching of routing segment was:

def match_string(str) consume(Regexp.escape(str))
end # https://github.com/soveran/cuba/blob/1dce54d0926c5d6e5daab033a8e9685c6faa4227/lib/cuba.rb#L214-L226
def consume(pattern) matchdata = env[Rack::PATH_INFO].match(/\A\/(#{pattern})(\/|\z)/) return false unless matchdata path, *vars = matchdata.captures env[Rack::SCRIPT_NAME] += "/#{path}" env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}" captures.push(*vars)
private :consume

This code looks perfectly fine. But let‘s see how Jeremy optimized it. First is to avoid allocations. First change was to include the proceeding slash in the first capture to avoid extra array allocation for the captures:

- /\A\/(#{pattern})(\/|\z)/
+ /\A(\/(#{pattern}))(\/|\z)/

... - path, *vars = matchdata.captures
+ vars = matchdata.captures

- env[Rack::SCRIPT_NAME] += "/#{path}"
+ env[Rack::SCRIPT_NAME] += vars.shift

Next is to use a regular expression feature, positive lookahead assertion instead of a capture:

- /\A(\/(#{pattern}))(\/|\z)/
+ /\A(\/(#{pattern}))(?=\/|\z)/

So we no longer need to pop the last element of capture variables (vars):

- env[Rack::PATH_INFO] = "#{vars.pop}#{matchdata.post_match}"
+ env[Rack::PATH_INFO] = matchdata.post_match

Also avoid the allocation of additional string by using post match directly.

He then profile and found generating a new regular expression every time consume was called take a large portion of time. Then he introduced cached matcher:

 def match_string(str)
- consume(Regexp.escape(str))
+ consume(self.class.cached_matcher(str) { Regexp.escape(str) })

This change is possible because consume was a private method. Another optimizations technique is to keep most methods private.

Keep Most Methods Private

Only makes a method public if it needs to be public. When method is private, you are free to change its API to improve performance. Public methods have limit options to optimize.

Prefer String operations over Regexp operations

In Roda 3.0, the matching of string segment changed to use string operations instead of regular expression matching:

def match_string(str) rp = @remaining_path if rp.start_with?("/#{str}") # <---- last = str.length + 1 case = rp[last] # <---- when "/" @remaining_path = rp[last, rp.length] # <---- when nil @remaining_path = "" end end

But still, there are two String allocations: rp.start_with?("/#{str}") and rp[last].

Prefer Integer operations over String operations

Iterating by index and checking if first character of string is 47 (47 is ascii code of /):

def _match_string(str) rp = @remaining_path length = str.length match = case rp.rindex(str, length) # <------ when nil return when 1 rp.getbyte(0) == 47 # <--- start_with?("/") else length == 0 && rp.getbyte(0) == 47 # <--- start_with?("/") end if match length += 1 case rp.getbyte(length) when 47 @remaining_path = rp[length, 100000000] when nil @remaining_path = "" end end

The most common failure case when nil also moved up because is‘s more common. Another trick is to use a large number that is larger than any reasonable path length: rp[length, 100000000].


Remember that optimizations comes last. First make it work, then make it correct. After you make it fun, then you can make it fast.

—Jeremy Evans

Profile to find where to optimize. Then Benchmark your change.

That‘s all from Jeremy‘s presentation.

There is no silver bullet. It depends on the context. We must try ourselves. Verify it‘s faster than the other instead of directly quoting anything here.

The faults are probably mine. I‘m happily to fix if you would let me know. You can reach out to him on twitter for more questions: @jeremyevans0 or GitHub to see more of his work.

Thanks for reading!