Speeding up Ruby with Shared Strings


It’s not often I am able to write a patch that not only reduces memory usage, but increases speed as well. Usually I find myself trading memory for speed, so it’s a real treat when I can improve both in one patch. Today I want to talk about the patch I submitted to Ruby in this ticket. It decreases “after boot” memory usage of a Rails application by 4% and speeds up require by about 35%.

When I was writing this patch, I was actually focusing on trying to reduce memory usage. It just happens that reducing memory usage also resulted in faster runtime. So really I wanted to title this post “Reducing Memory Usage in Ruby”, but I already made a post with that title.

As I mentioned in previous posts, Ruby objects are limited to 40 bytes. But a string can be much longer than 40 bytes, so how are they stored? If we look at the struct that represents strings, we’ll find there is a char * pointer:

struct RString { struct RBasic basic; union { struct { long len; char *ptr; union { long capa; VALUE shared; } aux; } heap; char ary[RSTRING_EMBED_LEN_MAX + 1]; } as;
};

The ptr field in the string struct points to a byte array which is our string. So the actual memory usage of a string is approximately 40 bytes for the object, plus however long the string is. If we were to visualize the layout, it would look something like this:

RString pointing to char array

In this case, there are really two allocations: the RString object and the “hello world” character array. The RString object is the 40 byte Ruby object allocated using the GC, and the character array was allocated using the system’s malloc implementation.

Side note: There is another optimization called “embedding”. Without getting too far off track, “embedding” is just keeping strings that are “small enough” stored directly inside the RString structure. We can talk about that in a different post, but today pretend there are always two distinct allocations.

We can take advantage of this character array and represent substrings by just pointing at a different location. For example, we can have two Ruby objects, one representing the string “hello world” and the other representing the string “world” and only allocate one character array buffer:

RStrings sharing a char array

This example only has 3 allocations: 2 from the GC for the Ruby string objects, and one malloc for the character array. Using ObjectSpace, we can actually observe this optimization by measuring memory size of the objects after slicing them:

>> require 'objspace'
=> true
>> str = "x" * 9000; nil
=> nil
>> ObjectSpace.memsize_of str
=> 9041
>> substr = str[30, str.length - 30]; nil
=> nil
>> str.length
=> 9000
>> substr.length
=> 8970
>> ObjectSpace.memsize_of substr
=> 40

The example above first allocates a string that is 9000 characters. Next we measure the memory size of the string. The total size is 9000 for the characters, plus some overhead for the Ruby object for a total of 9041. Next we take a substring, slicing off the first 30 characters of the original. As expected, the original string is 9000 characters, and the substring is 8970. However, if we measure the size of the substring it is only 40 bytes! This is because the new string only requires a new Ruby object to be allocated, and the new object just points at a different location in the original string’s character buffer, just like the graph above showed.

This optimization isn’t limited to just strings, we can use it with arrays too:

>> list = ["x"] * 9000; nil
=> nil
>> ObjectSpace.memsize_of(list)
=> 72040
>> list2 = list[30, list.length - 30]; nil
=> nil
>> ObjectSpace.memsize_of(list2)
=> 40

In fact, functional languages where data structures are immutable can take great advantage of this optimization. In languages that allow mutations, we have to deal with the case that the original string might be mutated, where languages with immutable data structures can be even more aggressive about optimization.

This shared string optimization isn’t without limits though. To take advantage of this optimization, we have to always go to the end of the string. In other words, we can’t take a slice from the middle of the string and get the optimization. Lets take our sample string and slice 15 characters off each side and see what the memsize is:

>> str = "x" * 9000; nil
=> nil
>> str.length
=> 9000
>> substr = str[15, str.length - 30]; nil
=> nil
>> substr.length
=> 8970
>> ObjectSpace.memsize_of(substr)
=> 9011

We can see in the above example that the memsize of the substring is much larger than in the first example. That is because Ruby had to create a new buffer to store the substring. So our lesson here is: if you have to slice strings, start from the left and go all the way to the end.

Here is an interesting thing to think about. At the end of the following program, what is the memsize of substr? How much memory is this program actually consuming? Is the str object still alive, and how can we find out?

require 'objspace' str = "x" * 9000
substr = str[30, str.length - 30]
str = nil
GC.start 

The optimization I explained above works exactly the same way for strings in C as it does in Ruby. We will use this optimization to reduce memory usage and speed up require in Ruby.

Reducing Memory Usage and Speeding Up require

I’ve already described the technique we’re going to use to speed up require, so lets take a look at the problem. After that, we’ll apply the shared string optimization to improve performance of require.

Every time a program requires a file, Ruby has to check to see if that file has already been required. The global variable $LOADED_FEATURES is a list of all the files that have been required so far. Of course, searching through a list for a file would be quite slow and get slower as the list grows, so Ruby keeps a hash to look up entries in the $LOADED_FEATURES list. This hash is called the loaded_features_index, and it’s stored on the virtual machine structure here.

The keys of this hash are strings that could be passed to require to require a particular file, and the value is the index in the $LOADED_FEATURES array of the file that actually got required. So, for example if you have a file on your system: /a/b/c.rb, the keys to the hash will be:

  • “/a/b/c.rb”
  • “a/b/c.rb”
  • “b/c.rb”
  • “c.rb”
  • “/a/b/c”
  • “a/b/c”
  • “b/c”
  • “c”

Given a well crafted load path, any of the strings above could be used to load the /a/b/c.rb file, so the index needs to keep all of them. For example, you could do ruby -I / -e"require 'a/b/c'", or ruby -I /a -e"require 'b/c'"', etc, and they all point to the same file.

The loaded_features_index hash is built in the features_index_add function. Lets pick apart this function a little.

static void
features_index_add(VALUE feature, VALUE offset)
{ VALUE short_feature; const char *feature_str, *feature_end, *ext, *p; feature_str = StringValuePtr(feature); feature_end = feature_str + RSTRING_LEN(feature); for (ext = feature_end; ext > feature_str; ext--) if (*ext == '.' || *ext == '/') break; if (*ext != '.') ext = NULL; 

This function takes a feature and an offset as parameters. The feature is the full name of the file that was required, extension and everything. offset is the index in the loaded features list where this string is. The first part of this function starts at the end of the string and scans backwards looking for a period or a forward slash. If it finds a period, we know the file has an extension (it is possible to require a Ruby file without an extension!), if it finds a forward slash, it gives up and assumes there is no extension.

 while (1) { long beg; p--; while (p >= feature_str && *p != '/') p--; if (p < feature_str) break; beg = p + 1 - feature_str; short_feature = rb_str_subseq(feature, beg, feature_end - p - 1); features_index_add_single(short_feature, offset); if (ext) { short_feature = rb_str_subseq(feature, beg, ext - p - 1); features_index_add_single(short_feature, offset); } }

Next we scan backwards in the string looking for forward slashes. Every time it finds a forward slash, it uses rb_str_subseq to get a substring and then calls features_index_add_single to register that substring. rb_str_subseq gets substrings in the same way we were doing above in Ruby, and applies the same optimizations.

The if (ext) conditional deals with files that have an extension, and this is really where our problems begin. This conditional gets a substring of feature, but it doesn’t go all the way to the end of the string. It must exclude the file extension. This means it will copy the underlying string. So these two calls to rb_str_subseq do 3 allocations total: 2 Ruby objects (the function returns a Ruby object) and one malloc to copy the string for the “no extension substring” case.

This function calls features_index_add_single to add the substring to the index. I want to call out one excerpt from the features_index_add_single function:

 features_index = get_loaded_features_index_raw(); st_lookup(features_index, (st_data_t)short_feature_cstr, (st_data_t *)&this_feature_index); if (NIL_P(this_feature_index)) { st_insert(features_index, (st_data_t)ruby_strdup(short_feature_cstr), (st_data_t)offset); }

This code looks up the string in the index, and if the string isn’t in the index, it adds it to the index. The caller allocated a new Ruby string, and that string could get garbage collected, so this function calls ruby_strdup to copy the string for the hash key. It’s important to note that the keys to this hash aren’t Ruby objects, but char * pointers that came from Ruby objects (the char *ptr field that we were looking at earlier).

Lets count the allocations. So far, we have 2 Ruby objects: one with a file extension and one without, 1 malloc for the non-sharable substring, then 2 more mallocs to copy the string in to the hash. So each iteration of the while loop in features_index_add will do 5 allocations: 2 Ruby objects, and 3 mallocs.

In cases like this, a picture might help explain better. Below is a diagram of the allocated memory and how they relate to each other.

Allocations on Trunk

This diagram shows what the memory layout looks like when adding the path /a/b/c.rb to the index, resulting in 8 hash entries.

Blue nodes are allocations that were alive before the call to add the path to the index. Red nodes are intermediate allocations done while populating the index, and will be freed at some point. Black nodes are allocations made while adding the path to the index but live after we’ve finished adding the path to the index. Solid arrows represent actual references, where dotted lines indicate a relationship but not actually a reference (like one string was ruby_strdup‘d from another).

The graph has lots of nodes and is very complicated, but we will clean it up!

I’ve translated the C code to Ruby code so that we can more easily see the optimization at work:

$features_index = {} def features_index_add(feature, index) ext = feature.index('.') p = ext ? ext : feature.length loop do p -= 1 while p > 0 && feature[p] != '/' p -= 1 end break if p == 0 short_feature = feature[p + 1, feature.length - p - 1] features_index_add_single(short_feature, index) if ext short_feature = feature[p + 1, ext - p - 1] features_index_add_single(short_feature, index) end end
end def features_index_add_single(str, index) return if $features_index.key?(str) $features_index[str.dup] = index end features_index_add "/a/b/c.rb", 1

As we already learned, the shared string optimization only works when the substrings include the end of the shared string. That is, we can only take substrings from the left side of the string.

The first change we can make is to split the strings in to two cases: one with an extension, and one without. Since the “no extension” if statement does not scan to the end of the string, it always allocates a new string. If we make a new string that doesn’t contain the extension, then we can eliminate one of the malloc cases:

$features_index = {} def features_index_add(feature, index) no_ext_feature = nil p = feature.length ext = feature.index('.') if ext p = ext no_ext_feature = feature[0, ext] end loop do p -= 1 while p > 0 && feature[p] != '/' p -= 1 end break if p == 0 short_feature = feature[p + 1, feature.length - p - 1] features_index_add_single(short_feature, index) if ext len = no_ext_feature.length short_feature = no_ext_feature[p + 1, len - p - 1] features_index_add_single(short_feature, index) end end
end def features_index_add_single(str, index) return if $features_index.key?(str) $features_index[str.dup] = index end features_index_add "/a/b/c.rb", 1

This changes the function to allocate one new string, but always scan to the end of both strings. Now we have two strings that we can use to “scan from the left”, we’re able to avoid new substring mallocs in the loop. You can see this change, where I allocate a new string without an extension here.

Below is a graph of what the memory layout and relationships look like after pulling up one slice, then sharing the string:

Allocations after shared slice

You can see from this graph that we were able to eliminate string buffers by allocating the “extensionless” substring first, then taking slices from it.

There are two more optimizations I applied in this patch. Unfortunately they are specific to the C language and not easy to explain using Ruby.

Eliminating Ruby Object Allocations

The existing code uses Ruby to slice strings. This allocates a new Ruby object. Now that we have two strings, we can always take substrings from the left, and that means we can use pointers in C to “create” substrings. Rather than asking Ruby APIs to slice the string for us, we simply use a pointer in C to point at where we want the substring to start. The hash table that maintains the index uses C strings as keys, so instead of passing Ruby objects around, we’ll just pass a pointer in to the string:

- short_feature = rb_str_subseq(feature, beg, feature_end - p - 1);
- features_index_add_single(short_feature, offset);
+ features_index_add_single(feature_str + beg, offset); if (ext) {
- short_feature = rb_str_subseq(feature, beg, ext - p - 1);
- features_index_add_single(short_feature, offset);
+ features_index_add_single(feature_no_ext_str + beg, offset); } }
- features_index_add_single(feature, offset);
+ features_index_add_single(feature_str, offset); if (ext) {
- short_feature = rb_str_subseq(feature, 0, ext - feature_str);
- features_index_add_single(short_feature, offset);
+ features_index_add_single(feature_no_ext_str, offset);

In this case, using a pointer in to the string simplifies our code. feature_str is a pointer to the head of the string that has a file extension, and feature_no_ext_str is a pointer to the head of the string that doesn’t have a file extension. beg is the number of characters from the head of the string where we want to slice. All we have to do now is just add beg to the head of each pointer and pass that to features_index_add_single.

In this graph you can see we no longer need the intermediate Ruby objects because the “add single” function directly accesses the underlying char * pointer:

Allocations after pointer substrings

Eliminating malloc Calls

Finally, lets eliminate the ruby_strdup calls. As we covered earlier, new Ruby strings could get allocated. These Ruby strings would get free’d by the garbage collector, so we had to call ruby_strdup to keep a copy around inside the index hash. The feature string passed in is also stored in the $LOADED_FEATURES global array, so there is no need to copy that string as the array will prevent the GC from collecting it. However, we created a new string that does not have an extension, and that object could get collected. If we can prevent the GC from collecting those strings, then we don’t need to copy anything.

To keep these new strings alive, I added an array to the virtual machine (the virtual machine lives for the life of the process):

 vm->loaded_features = rb_ary_new(); vm->loaded_features_snapshot = rb_ary_tmp_new(0); vm->loaded_features_index = st_init_strtable();
+ vm->loaded_features_index_pool = rb_ary_new();

Then I add the new string to the array via rb_ary_push right after allocation:

+ short_feature_no_ext = rb_fstring(rb_str_freeze(rb_str_subseq(feature, 0, ext - feature_str)));
+ feature_no_ext_str = StringValuePtr(short_feature_no_ext);
+ rb_ary_push(get_loaded_features_index_pool_raw(), short_feature_no_ext);

Now all strings in the index hash are shared and kept alive. This means we can safely remove the ruby_strdup without any strings getting free’d by the GC:

 if (NIL_P(this_feature_index)) {
- st_insert(features_index, (st_data_t)ruby_strdup(short_feature_cstr), (st_data_t)offset);
+ st_insert(features_index, (st_data_t)short_feature_cstr, (st_data_t)offset); }

After this change, we don’t need to copy any memory because the hash keys can point directly in to the underlying character array inside the Ruby string object:

Use string indexes for keys

This new algorithm does 2 allocations: one to create a “no extension” copy of the original string, and one RString object to wrap it. The “loaded features index pool” array keeps the newly created string from being garbage collected, and now we can point directly in to the string arrays without needing to copy the strings.

For any file added to the “loaded features” array, we changed it from requiring O(N) allocations (where N is the number of slashes in a string) to always requiring only two allocations regardless of the number of slashes in the string.

END

By using shared strings I was able to eliminate over 76000 system calls during the Rails boot process on a basic app, reduce the memory footprint by 4%, and speed up require by 35%. Next week I will try to get some statistics from a large application and see how well it performs there!

Thanks for reading!