Parallel::ForkManager

celogeek

Feb 17, 2012

Slides made for a "Mongueurs de Paris" meeting

 

Instructions

  • Press space to see the plan, and space again to go to the slide.
  • Use the arrow keys to move between slides (left, right, down, up).
  • Click on the code area to zoom in, click again to zoom out.

What is Parallel::ForkManager?

It's a simple parallel processing fork manager

dLux (Szabó, Balázs)

What can you do with Parallel::ForkManager?

  • Process a large number of simple jobs faster
  • Use the full resources of your server
  • Reduce your total processing time

How to do that efficiently?

I will show you some use cases that could be useful in everyday work.

Case N°1

Turn a simple single-process loop into a multi-process one with Parallel::ForkManager

This is your code:


    for my $job (@jobs) {
        compute_this_job($job);
    }
    

This is with Parallel::ForkManager:


use Parallel::ForkManager;
my $MAX_PROCESS = 4; # the number of your CPUs, for example
my $pfm = Parallel::ForkManager->new($MAX_PROCESS);
for my $job (@jobs) {
    $pfm->start and next; # parent: fork a child, move on to the next job
    compute_this_job($job);
    $pfm->finish;         # child: exit
}
$pfm->wait_all_children;
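Under the hood, `start` is essentially a `fork` whose return value tells parent and child apart; here is a minimal sketch of the `start and next` idiom using plain `fork` (no Parallel::ForkManager, no process throttling, and a hypothetical job list):

```perl
use strict; use warnings;

# fork returns the child's pid to the parent (true, so "and next" fires)
# and 0 to the child (false, so the child falls through to the job).
my @jobs = (1 .. 4);
my @pids;
for my $job (@jobs) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid) {               # parent: remember the child, take the next job
        push @pids, $pid;
        next;
    }
    # child: compute the job here, then exit (the role of $pfm->finish)
    exit 0;
}
waitpid $_, 0 for @pids;      # the equivalent of wait_all_children
print scalar(@pids), " children reaped\n";
```

Parallel::ForkManager adds the part this sketch skips: blocking in `start` when the number of running children reaches the limit.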

Simple? Yeah, let's do a bit more.

Case N°2

Compute your jobs and return each result to the parent for further processing

run_on_finish method:


my $pfm = p_fork; # returns a Parallel::ForkManager instance
my $result = {};
$pfm->run_on_finish(sub {
    my ($pid, $exit, $id, $exit_signal, $core_dump, $data) = @_;
    croak "Failed to process one job, stop here!"
        if $exit || $exit_signal;
    $result->{$id} = $data;
});
for my $job (@jobs) {
    $pfm->start($job->{id}) and next;
    my $job_result = compute_this_job($job);
    my $job_error = ref $job_result eq 'HASH' ? 0 : 1;
    $pfm->finish($job_error, $job_result);
}
$pfm->wait_all_children;
do_more_stuff_with($result);
do_more_stuff_with($result);

Great! I want a database connection now!

Case N°3

Issues with DBI and how to handle them!

What kind of problems can I have with DBI and fork?

  • If you use the parent's connection in a child, the SQL server will close the parent's connection.
  • If you forget to remove the parent's connection in the children, Perl's garbage collection will call DBI's DESTROY, which closes the parent's connection.
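The second bullet is plain Perl destructor behaviour, nothing DBI-specific. A minimal sketch with a stand-in class (a hypothetical `FakeHandle`, playing the role of a DBI handle whose DESTROY would disconnect) shows that a child's exit destroys its copy of the handle too:

```perl
use strict; use warnings;

{
    # FakeHandle stands in for a DBI handle: its DESTROY runs in every
    # process that holds a copy, the way DBI's DESTROY would disconnect.
    package FakeHandle;
    sub new     { bless {}, shift }
    sub DESTROY { print "DESTROY ran in pid $$\n" }
}

my $h   = FakeHandle->new;
my $pid = fork;
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    exit 0;  # child exits: global destruction runs DESTROY on its copy of $h
}
my $reaped = waitpid $pid, 0;
print "parent reaped $reaped\n";
# $h's DESTROY runs a second time here, in the parent, at program end
```

With a real DBI handle, the child's DESTROY sends a disconnect to the server, and the parent's connection dies with it.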

How can I solve this?

  • Remove the parent's connection right after the fork.
  • Give the child a fresh connection if it needs one.

Let's see an example (the children don't need a connection):


# p_fork returns a Parallel::ForkManager instance
my ($pfm, $MAX_PROCESS) = p_fork;
my $result = {};
# here the run_on_finish stuff
my $conn = get_my_dbi_connection();
my $sth = $conn->prepare("SQL to fetch stuff");
$sth->execute;
while (my $job = $sth->fetchrow_hashref) {
    $pfm->start($job->{id}) and next;
    my $job_result = compute_this_job($job);
    my $job_error = ref $job_result eq 'HASH' ? 0 : 1;
    $pfm->finish($job_error, $job_result);
}
$pfm->wait_all_children;
$sth->finish;
do_more_stuff_with($result);

You will lose your parent connection.

Let's see how to properly remove the parent connection:


#...
$pfm->start($job->{id}) and next;
if ($MAX_PROCESS) { # if we have forked
    $conn->{InactiveDestroy} = 1;
    $conn = undef;
}
# The connection is useless in the child,
# so DBI's DESTROY is called here.
# InactiveDestroy is made for fork:
# it leaves the parent's connection intact.

If you need a connection, just do this:


#...
$pfm->start($job->{id}) and next;
if ($MAX_PROCESS) { # if we have forked
    $conn->{InactiveDestroy} = 1;
    $conn = get_my_new_dbi_connection();
}
# DBI's DESTROY is called for the previous connection

Great! Now, how do I handle my resources properly?

Case N°4

Compute MAX_PROCESS properly

Why ?

  • You want to use all your CPUs
  • You want to avoid swapping, which could lead to an out-of-memory error or slow down your processes

 

Let's see how to do this.

The p_fork method:


use strict; use warnings;
use Parallel::ForkManager;
use Sys::Info; use Sys::Statistics::Linux::MemStats;
use List::Util qw(min);
use 5.010; # for the // operator
sub p_fork {
    # minimum memory required by one of your processes
    my ($min_mem) = @_;
    $min_mem //= 1024 ** 2; # default 1 GB, expressed in KB
    # get the number of CPUs on the machine
    my $cpu_info = Sys::Info->new;
    my $cpu = $cpu_info->device('CPU');
    my $MAX_PROCESSES_FOR_CPU = $cpu->count || 1;
    # get the real free memory, in KB
    my $freemem = Sys::Statistics::Linux::MemStats->new->get->{realfree};

    # how many processes of $min_mem fit in free memory
    my $MAX_PROCESSES_FOR_MEM = int($freemem / $min_mem);
    # take the min of the CPU and memory limits;
    # 0 means no fork at all, because there is not enough memory
    my $MAX_PROCESSES =
          min($MAX_PROCESSES_FOR_CPU, $MAX_PROCESSES_FOR_MEM);
    # return the fork manager, ready to use
    my $pm = Parallel::ForkManager->new($MAX_PROCESSES);
    return wantarray ? ($pm, $MAX_PROCESSES) : $pm;
}

Well, we have almost finished. Now you must be careful.

Be careful

Don't fork if your job is too short:

  • Fork is time and memory consuming.
  • If each job takes less than a second, group many jobs together and fork to compute each pack.
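The grouping arithmetic alone can be sketched in plain Perl (no forking here; the minimum of 50 jobs per pack is the figure suggested above, and `make_packs` is a hypothetical helper):

```perl
use strict; use warnings;
use List::Util qw(min max);

# Split a job list into at most $max_process packs,
# but never smaller than 50 jobs, so each fork has enough work.
sub make_packs {
    my ($jobs, $max_process) = @_;
    my $step = max(50, int(@$jobs / ($max_process || 1)) + 1);
    my @packs;
    for (my $start = 0; $start < @$jobs; $start += $step) {
        my $end = min($#$jobs, $start + $step - 1);
        push @packs, [ @{$jobs}[$start .. $end] ];
    }
    return @packs;
}

my @packs = make_packs([1 .. 120], 4);
print scalar(@packs), " packs of sizes ",
    join(",", map { scalar @$_ } @packs), "\n";
# prints "3 packs of sizes 50,50,20"
```

Each pack then becomes one `start`/`finish` unit instead of one fork per job.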

Let's see an example:


# p_fork returns a Parallel::ForkManager instance
use List::Util qw(min max);
my ($pfm, $MAX_PROCESS) = p_fork;
my $result = {};
# here the run_on_finish stuff
my $conn = get_my_dbi_connection();
my $sth = $conn->prepare("SQL to fetch stuff");
$sth->execute;
my @jobs = @{ $sth->fetchall_arrayref({}) };
$sth->finish;
my $step = max(50, int(@jobs / ($MAX_PROCESS || 1)) + 1); # min 50 jobs per pack
for (my $job_start = 0; $job_start < @jobs; $job_start += $step) {
    $pfm->start($job_start) and next;
    if ($MAX_PROCESS) { # if we have forked
        $conn->{InactiveDestroy} = 1;
        $conn = undef;
    }

    my $job_result = {};
    for (my $job_current = $job_start;
            $job_current <= min($#jobs, $job_start + $step - 1);
            $job_current++) {
        my $job = $jobs[$job_current];
        $job_result->{$job->{id}} = compute_this_job($job)
            or croak "Error!";
    }
    $pfm->finish(0, $job_result);
}
$pfm->wait_all_children;
do_more_stuff_with($result);

We have finished!

Thank you!

celogeek

Feb 17, 2012