Improving mod_perl Driven Site's Performance -- Part V: Sharing Memory

Sharing Memory
Preloading Registry Scripts at Server Startup
Modules Initializing at Server Startup

Initializing DBI.pm
Initializing CGI.pm

Increasing Shared Memory With mergemem
References

Written by Stas Bekman

Sharing Memory

As we learned in the previous article, sharing memory helps us save memory with mod_perl, which gives us a huge speedup but for which we pay the price of a big memory footprint. I presented a few techniques to save memory by trying to share more of it. In this article we will see other techniques that allow you to save even more memory.

Preloading Registry Scripts at Server Startup

What happens if you find yourself stuck with Perl CGI scripts and you cannot or don't want to move most of the code into modules to benefit from module preloading, so that the code will be shared by the children? Luckily, you can preload scripts as well. This time the Apache::RegistryLoader module comes to our aid. Apache::RegistryLoader compiles Apache::Registry scripts at server startup.

For example, to preload the script /perl/test.pl, which is in fact the file /home/httpd/perl/test.pl, you would do the following:

  use Apache::RegistryLoader ();
  Apache::RegistryLoader->new->handler("/perl/test.pl",
                            "/home/httpd/perl/test.pl");

You should put this code either into <Perl> sections or into a startup script.

But what if you have a bunch of scripts located under the same directory and you don't want to list them one by one? Take advantage of Perl modules and put them to good use. The File::Find module will do most of the work for you.

The following code walks the directory tree under which all Apache::Registry scripts are located. For each encountered file with extension .pl, it calls the Apache::RegistryLoader::handler() method to preload the script in the parent server, before pre-forking the child processes:

  use File::Find qw(finddepth);
  use Apache::RegistryLoader ();
  {
    my $scripts_root_dir = "/home/httpd/perl/";
    my $rl = Apache::RegistryLoader->new;
    finddepth
      (
       sub {
         return unless /\.pl$/;
         # map the filename back to the URI served through Alias /perl/
         my $url = "$File::Find::dir/$_";
         $url =~ s|^\Q$scripts_root_dir\E/?|/perl/|;
         warn "pre-loading $url\n";
         # preload $url in the parent process
         my $status = $rl->handler($url);
         unless($status == 200) {
           warn "pre-load of `$url' failed, status=$status\n";
         }
       },
       $scripts_root_dir);
  }

Note that, unlike in the first example, I didn't pass the second argument to handler() here. To make the loader smarter about URI-to-filename translation, you might need to provide a trans() function. URI-to-filename translation normally doesn't happen until HTTP request time, so the module is forced to roll its own translation. If the filename is omitted and no trans() function is defined, the loader will try using the URI relative to ServerRoot.

A simple trans() function can be something like this:

  sub mytrans {
    my $uri = shift;
    $uri =~ s|^/perl/|/home/httpd/perl/|;
    return $uri;
  }

You can easily derive the right translation by looking at the Alias directive. The above mytrans() function matches our Alias:

  Alias /perl/ /home/httpd/perl/

After defining the URI to filename translation function you should pass it during the creation of the Apache::RegistryLoader object:

  my $rl = Apache::RegistryLoader->new(trans => \&mytrans);
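
If you have more than one Alias, writing a separate translation function for each gets tedious. The following is a minimal sketch of a small factory that builds such a callback from a URI prefix and a directory prefix. The make_trans() name is hypothetical and is not part of Apache::RegistryLoader; the sketch only assumes that the loader accepts any code reference as its trans argument, as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Apache::RegistryLoader): build a
# trans() callback from one Alias-style mapping.
sub make_trans {
    my ($uri_prefix, $dir_prefix) = @_;
    return sub {
        my $uri = shift;
        # rewrite only a leading prefix; other URIs pass through unchanged
        $uri =~ s{\A\Q$uri_prefix\E}{$dir_prefix};
        return $uri;
    };
}

# Mirrors: Alias /perl/ /home/httpd/perl/
my $mytrans = make_trans('/perl/', '/home/httpd/perl/');
print $mytrans->('/perl/test.pl'), "\n";   # /home/httpd/perl/test.pl

# Under mod_perl the callback would be handed to the loader like this:
#   my $rl = Apache::RegistryLoader->new(trans => $mytrans);
```

Because the prefix is quoted with \Q...\E and anchored at the start of the string, a URI that merely contains /perl/ somewhere in the middle is left alone.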

I won't show any benchmarks here, since the effect is absolutely the same as with preloading modules.
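
Before wiring a directory walk into startup.pl, it can be handy to check which URIs the walk would produce without touching the server. The sketch below does only that: it collects the .pl files under a directory and maps each to the URI it would get under a given prefix. The registry_script_uris() helper and its two arguments are hypothetical names chosen for illustration; nothing here calls Apache::RegistryLoader.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper: map every *.pl file under $root_dir to the URI it
# would be served from under $uri_prefix. Returns { filename => URI }.
sub registry_script_uris {
    my ($root_dir, $uri_prefix) = @_;
    $root_dir   =~ s{/+\z}{};    # "/home/httpd/perl/" -> "/home/httpd/perl"
    $uri_prefix =~ s{/+\z}{};    # "/perl/"            -> "/perl"

    my %uri_for;
    File::Find::find(
        {
            no_chdir => 1,       # keep $_ equal to the full path
            wanted   => sub {
                return unless -f $_ && /\.pl\z/;
                my $file = $_;
                (my $uri = $file) =~ s{\A\Q$root_dir\E}{$uri_prefix};
                $uri_for{$file} = $uri;
            },
        },
        $root_dir,
    );
    return \%uri_for;
}

# Dry run: perl list_scripts.pl /home/httpd/perl/ /perl/
if (@ARGV == 2) {
    my $uri_for = registry_script_uris(@ARGV);
    print "$uri_for->{$_} <= $_\n" for sort keys %$uri_for;
}
```

The script name list_scripts.pl in the comment is also just an example. If the printed URIs match what your Alias directive serves, the same mapping can be used when calling handler().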

Modules Initializing at Server Startup

We have just learned that it's important to preload modules and scripts at server startup. It turns out that this is not enough for some modules, and you have to prerun their initialization code to get more memory pages shared. Generally, you will find information about specific modules in their respective manpages. I will present a few examples of widely used modules whose code can be initialized.

Initializing DBI.pm

The first example is the DBI module. As you know, DBI works with many database drivers from the DBD:: category, e.g. DBD::mysql. It's not enough to preload DBI; if you want to minimize memory use after forking the child processes, you should also initialize DBI with the driver(s) that you are going to use (usually a single driver). Note that this is worth doing only under mod_perl and in other environments where shared memory is very important. Otherwise you shouldn't initialize drivers.

You probably know already that under mod_perl you should use the Apache::DBI module to get connection persistence, unless you open a separate connection for each user, in which case you should not use this module. Apache::DBI automatically loads DBI and overrides some of its methods, so you should continue coding as if there were only a DBI module.

Just as with module preloading, our goal is to find the startup environment that leads to the smallest ``difference'' between the shared and normal memory reported, and therefore to a smaller total memory usage.

And again, to make the measurement easy I will use only one child process, with this setting in httpd.conf:

  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

I'm going to run memory benchmarks on five different versions of the startup.pl file. I always preload these modules:

  use GTop ();
  use Apache::DBI (); # preloads DBI as well

option 1

Leave the file unmodified.

option 2

Install MySQL driver (I will use MySQL RDBMS for our test):

  DBI->install_driver("mysql");

It's safe to use this method, since just like with use(), if it can't be installed it'll die().

option 3

Preload MySQL driver module:

  use DBD::mysql;

option 4

Tell Apache::DBI to connect to the database when the child process starts (ChildInitHandler). No driver is preloaded before the child gets spawned!

  Apache::DBI->connect_on_init('DBI:mysql:test::localhost',
                             "",
                             "",
                             {
                              PrintError => 1, # warn() on errors
                              RaiseError => 0, # don't die on error
                              AutoCommit => 1, # commit executes
                              # immediately
                             }
                            )
  or die "Cannot connect to database: $DBI::errstr";

option 5

Combine options 2 and 4: install the MySQL driver with install_driver() and also call connect_on_init(). This is the version labeled install_driver & connect_on_init in the results below.

Here is the Apache::Registry test script that I have used:

  preload_dbi.pl
  --------------
  use strict;
  use GTop ();
  use DBI ();
    
  my $dbh = DBI->connect("DBI:mysql:test::localhost",
                         "",
                         "",
                         {
                          PrintError => 1, # warn() on errors
                          RaiseError => 0, # don't die on error
                          AutoCommit => 1, # commit executes
                                           # immediately
                         }
                        )
    or die "Cannot connect to database: $DBI::errstr";
  
  my $r = shift;
  $r->send_http_header('text/plain');
  
  my $do_sql = "show tables";
  my $sth = $dbh->prepare($do_sql);
  $sth->execute();
  my @data = ();
  while (my @row = $sth->fetchrow_array){
    push @data, @row;
  }
  print "Data: @data\n";
  $dbh->disconnect(); # NOP under Apache::DBI
  
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

The script opens a connection to the database 'test' and issues a query to learn what tables the database has. Once the data is collected and printed, the connection would normally be closed, but Apache::DBI overrides disconnect() with an empty method. After the data is processed, the already familiar code that prints the memory usage follows.
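
Since the same reporting code appears in every test script, it can be factored into a small function. Here is a minimal sketch that takes the two byte counts as plain numbers, so it can be tried without GTop installed. The mem_report() name is hypothetical; the sample numbers are taken from the first row of the first results table below.

```perl
use strict;
use warnings;

# Hypothetical helper: format the Size/Shared/Diff report from two byte
# counts, using the same layout as the test script.
sub mem_report {
    my ($size, $share) = @_;
    my $diff = $size - $share;
    return sprintf("%8s %8s %8s\n", qw(Size Shared Diff))
         . sprintf("%8d %8d %8d (bytes)\n", $size, $share, $diff);
}

# In a registry script the numbers would come from GTop:
#   my $proc_mem = GTop->new->proc_mem($$);
#   print mem_report($proc_mem->size, $proc_mem->share);
print mem_report(3465216, 2621440);
```

Keeping the arithmetic in one place makes it harder for the Diff column to drift out of sync with the other two when the script is copied around.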

The server was restarted before each new test.

So here are the results of the five tests that were conducted, sorted by the Diff column:

After the first request:

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3465216  2621440   843776  install_driver
        2  3461120  2609152   851968  install_driver & connect_on_init
        3  3465216  2605056   860160  preload driver
        4  3461120  2494464   966656  nothing added
        5  3461120  2482176   978944  connect_on_init

After the second request (all the subsequent requests showed the same results):

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3469312  2609152   860160  install_driver
        2  3481600  2605056   876544  install_driver & connect_on_init
        3  3469312  2588672   880640  preload driver
        4  3477504  2482176   995328  nothing added
        5  3481600  2469888  1011712  connect_on_init

Now, what do we conclude from looking at these numbers? First, we see that only after the second request do we get the final memory footprint for the specific request in question (if you pass different arguments, the memory usage may well be different).

Both tables show the same pattern of memory usage. We can clearly see that the real winner is the version of the startup.pl file where the MySQL driver was installed (1). Since we want to have a connection ready for the first request made to the freshly spawned child process, we generally use the second version (2), which uses somewhat more memory but has almost the same number of shared memory pages. The third version only preloads the driver, which results in less shared memory. The last two versions have nothing initialized (4) or use only the connect_on_init() method (5). The former is a little better than the latter, but both are significantly worse than the first two versions.

To remind you why we look for the smallest value in the Diff column, recall the real memory usage formula:

  RAM_dedicated_to_mod_perl = diff * number_of_processes
                            + the_processes_with_largest_shared_memory

Notice that the smaller the diff is, the more processes you can have using the same amount of RAM. Therefore every 100K difference counts when you multiply it by the number of processes. If we take the numbers from version (1) vs. version (5) and assume that we have 256M of memory dedicated to mod_perl processes, we get the following numbers using a formula derived from the one above:

               RAM - largest_shared_size
  N_of Procs = -------------------------
                        Diff

                268435456 - 2609152
  (ver 1)  N =  ------------------- = 309
                      860160

                268435456 - 2469888
  (ver 5)  N =  ------------------- = 262
                     1011712

So the difference is substantial: the first version allows roughly 18% more child processes (309 vs. 262).
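
The calculation above is easy to repeat for your own measurements. The following sketch wraps the formula in a function and feeds it the numbers from the second table; the max_procs() name and its argument order are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of child processes that fit into $ram bytes,
# i.e. (RAM - largest_shared_size) / Diff, rounded down.
sub max_procs {
    my ($ram, $largest_shared, $diff) = @_;
    return int( ($ram - $largest_shared) / $diff );
}

my $ram = 256 * 1024 * 1024;                    # 268435456 bytes
my $v1  = max_procs($ram, 2609152,  860160);    # install_driver
my $v5  = max_procs($ram, 2469888, 1011712);    # connect_on_init only

printf "ver 1: %d, ver 5: %d, gain: %.1f%%\n",
    $v1, $v5, 100 * ($v1 - $v5) / $v5;
```

The result is rounded down because a fraction of a process is of no use; the gain is expressed relative to the smaller of the two process counts.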

Initializing CGI.pm

CGI.pm is a big module that by default postpones the compilation of its methods until they are actually needed, thus making it possible to use it under a slow mod_cgi handler without adding a big overhead. That's not what we want under mod_perl and if you use CGI.pm you should precompile the methods that you are going to use at the server startup in addition to preloading the module. Use the compile method for that:

  use CGI;
  CGI->compile(':all');

where you should replace the tag group :all with the actual tags and tag groups that you are going to use, if you want to optimize the memory usage.

I'm going to compare the shared memory footprint using a script that is backward compatible with mod_cgi. You will see that you can improve the performance of this kind of script as well, but if you really want fast code, think about porting it to use Apache::Request for the CGI interface and some other module for HTML generation.

So here is the Apache::Registry script that I'm going to use to make the comparison:

  preload_cgi_pm.pl
  -----------------
  use strict;
  use CGI ();
  use GTop ();

  my $q = CGI->new;
  print $q->header('text/plain');
  print join "\n", map {"$_ => ".$q->param($_) } $q->param;
  print "\n";
  
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

The script initializes the CGI object, sends the HTTP header and then prints all the arguments and values that were passed to the script, if any. At the end, as usual, I print the memory usage.

As usual I am going to use a single child process, with this setting in httpd.conf:

  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

I'm going to run memory benchmarks on three different versions of the startup.pl file. I always preload this module:

  use GTop ();

option 1

Leave the file unmodified.

option 2

Preload CGI.pm:

  use CGI ();

option 3

Preload CGI.pm and pre-compile the methods that I'm going to use in the script:

  use CGI ();
  CGI->compile(qw(header param));

The server was restarted before each new test.

So here are the results of the three tests that were conducted, sorted by the Diff column:

After the first request:

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3321856  2146304  1175552  not preloaded
        2  3321856  2326528   995328  preloaded
        3  3244032  2465792   778240  preloaded & methods precompiled

After the second request (all the subsequent requests showed the same results):

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3325952  2134016  1191936  not preloaded
        2  3325952  2314240  1011712  preloaded
        3  3248128  2445312   802816  preloaded & methods precompiled

The first version shows the results of the script execution when CGI.pm wasn't preloaded, the second when the module was preloaded, and the third when it was both preloaded and the methods that are going to be used were precompiled at server startup.

Comparing versions one and two of the second table, we can conclude that preloading alone adds about 176K of shared memory (2314240-2134016 = 180224 bytes). As I mentioned at the beginning of this section, CGI.pm postpones the compilation of its methods to reduce the load overhead, so preloading the module by itself leaves the methods uncompiled and therefore unshared. If we compare the second and the third versions, we see a further significant difference of about 204K (1011712-802816 = 208896 bytes), even though I have used only a few methods (the header method loads a few more methods transparently for the user). Imagine how much memory I would save by precompiling all the methods used in other scripts that rely on CGI.pm and do a little bit more than the test script.
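
As a check on the arithmetic, the following sketch recomputes the Diff value for each row of the second table and the saving between consecutive versions. The @versions structure and the diff_bytes() helper are illustrative names, not part of any module; the byte counts are copied from the table.

```perl
use strict;
use warnings;

# Size and Shared (bytes) after the second request, copied from the table.
my @versions = (
    [ 'not preloaded',                   3325952, 2134016 ],
    [ 'preloaded',                       3325952, 2314240 ],
    [ 'preloaded & methods precompiled', 3248128, 2445312 ],
);

# Hypothetical helper: unshared memory (Diff) for one table row.
sub diff_bytes {
    my ($row) = @_;
    return $row->[1] - $row->[2];
}

# Print each version's Diff and how much it saves over the previous one.
my $prev;
for my $row (@versions) {
    my $diff = diff_bytes($row);
    my $note = defined $prev
        ? sprintf(" (%d bytes, %.0fK less)", $prev - $diff, ($prev - $diff) / 1024)
        : "";
    printf "%-32s %8d%s\n", $row->[0], $diff, $note;
    $prev = $diff;
}
```

Here K means 1024 bytes; dividing by 1000 instead gives slightly larger figures for the same byte counts.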

But even in our very simple case using the same formula, what do we see? (assuming that I have 256MB dedicated for mod_perl)

               RAM - largest_shared_size
  N_of Procs = -------------------------
                        Diff

                268435456 - 2134016
  (ver 1)  N =  ------------------- = 223
                      1191936

                268435456 - 2445312
  (ver 3)  N =  ------------------- = 331
                     802816

If I preload CGI.pm and precompile a few methods that I use in the test script, I can have almost 50% more child processes (331 vs. 223) than when I neither preload the module nor precompile the methods that I am going to use.

I've heard that the 3.x generation of CGI.pm will be less bloated, but it's in a beta state as of this writing.

Increasing Shared Memory With mergemem

mergemem is an experimental utility for Linux that looks very interesting for us mod_perl users: http://www.complang.tuwien.ac.at/ulrich/mergemem/

It looks like it could be run periodically on your server to find and merge duplicate pages. It won't halt your httpds during the merge. This aspect was taken into consideration during the design of mergemem: merging is not performed with one big system call. Instead, most of the work is done in userspace, making a lot of small system calls.

Therefore blocking of the system should not happen. And if it really does turn out to take too much time, you can reduce the priority of the process.

The worst case that can happen is this: mergemem merges two pages and immediately afterwards they are split again. The split costs about the same as the time consumed by merging.

This software comes with a utility called memcmp to tell you how much you might save.