Updates to my 4chan Image Downloader Script

I made a few changes to my 4chan Image Downloader Script.

Two of the changes were cosmetic:

  • Added copyright boilerplate text to the code, which should have been done long ago.
  • Made minor changes to formatting of comments.

The other two changes were more substantial and were suggested by Callum Gare (see his comments on the 4chan Image Downloader Script page):

  • Changed shebang from #!/usr/bin/perl to #!/usr/bin/env perl to accommodate users’ personal Perl install preferences.
  • Added .webm files to the list of file types that will be downloaded.

Thanks for the excellent suggestions, Callum! I hope you get around to creating a downloader script that uses 4chan’s API. That approach seems like an improvement over my web-scraping program.

Posted in programming, scripts, tools

Final 2014 NFL Pick’em League Standings

This year I finished in fourth place out of 17 participants, but I needed my week 17 win to jump from sixth place to fourth. Last year I was consistently in the top three throughout the season and finished in third place, so this year feels like a bit of a step back. Interestingly, my percentage of points correct this year was 71.4%, compared to 70.9% last year.

I plan on trying some new modeling methodologies next season. An extra 3 points per week this season (a lift of 2.8% in model performance) would have been enough to grab first place. This sounds like a small and easily achievable improvement in model performance, but I anticipate it will be challenging to improve my current model performance by this much with any other modeling technique.

[Chart: league_standings_2014_week_17]

See my past post “NFL Pick ’em League Standings Through 9 Weeks” for more background on the picks league and the creation of the chart.

Posted in graphics, NFL model

Scraping ESPN.com Website to Get 2014 NFL Schedule Data

Another NFL season is about to start, and once again I’ll be using a statistical model to predict winners for the three office pools I’m in.

The first step is to create a text file containing schedule data for all 17 weeks of the regular season. The text file for 2014 looks something like this:

 week away home
----------------   
    1 sd    ari  
    1 ten   kc   
    1 sf    dal  
    1 cle   pit  
   ...
   17 ari   sf   
   17 ind   ten  
   17 jac   hou  
   17 nyj   mia

I then load this text file into R to create the R dataframe that is used to generate my predictions throughout the season. I’m WAY too lazy to enter the data for all 256 regular season games by hand.
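I load the file into R, but the parsing step is easy to sketch in any language. Here's a Python version (the function name and sample lines are my own, made up to mirror the file format above) that turns the text file into a list of (week, away, home) tuples:

```python
# Sketch: parse the whitespace-delimited schedule file into tuples.
# The function name and sample lines are illustrative, not from the R code.

def load_schedule(lines):
    """Return a list of (week, away, home) tuples, skipping header rows."""
    games = []
    for line in lines:
        fields = line.split()
        # Skip the header row and the dashed separator line.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        week, away, home = fields
        games.append((int(week), away, home))
    return games

sample = [
    " week away home",
    "----------------",
    "    1 sd    ari",
    "    1 ten   kc",
    "   17 nyj   mia",
]
print(load_schedule(sample))
# -> [(1, 'sd', 'ari'), (1, 'ten', 'kc'), (17, 'nyj', 'mia')]
```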

I’ve written and posted code for scraping ESPN schedule pages in past years (see Parsing ESPN.com Website to Create 2012 NFL Schedule Data for a previous example). This year I located a page on ESPN.com containing a very compact “grid schedule”, which made the scraping job much easier than in past years.

The grid schedule is a simple HTML table, which means I could have easily used a more purpose-built HTML table scraping tool.  However, the HTML::Parser code was quite short and straightforward as well.  A large chunk of the scraping code was for logging and QA purposes.

To create the 2014 schedule data:

  1. Go to ESPN.com’s 2014 Grid Schedule Page, and save the page source code. I used the filename nfl_schedule_2014.htm.
  2. Run the script below, changing input and output file names as necessary in the configuration section.
  3. Standard output from the script is a bunch of logging and QA information, so redirect the output of the script to a file if you want to create a log file to QA the script.

Now you have a text file with the full 2014 NFL schedule for about 2 minutes of work!

Here’s the script:

#!/cygdrive/c/Perl/bin/perl -w
################################################################################
## parses 2014 schedule webpage and creates
## dataset of week, road, and home for all 256 games
##
## source data:  ESPN.com website, NFL schedule page (grid schedule) as of
##               08/14/2014
## author:  Nicholas Montpetit
################################################################################

use strict;
use HTML::Parser;
use Data::Dumper;

#-------------------------------------------------------------------------------
# configuration
#-------------------------------------------------------------------------------
my $input_filename  = 'nfl_schedule_2014.htm';
my $output_filename = 'schedule_2014.txt';

my %team_mapping = ('wsh' => 'was', 'jax' => 'jac');

my %team_game_count;

my @score_data;
my ($this_team_ref, $in_table, $in_data) = (undef, 0, 0); # initialize flags to avoid undef warnings under -w
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
# handler functions for HTML::Parser
#-------------------------------------------------------------------------------
sub tag_handler {
    my ($tagname, $attr) = @_;
    
    if (    $tagname eq 'tr' && defined($attr->{'class'}) && 
            $attr->{'class'} eq 'colhead'){
        $in_table = 1;
    }
    elsif ( $tagname eq 'tr' && $in_table == 1){
        push @score_data, ($this_team_ref = []);
    }
    elsif ( $tagname eq 'td' && $in_table == 1){
        $in_data = 1;
    }
}

sub end_handler {
    my ($self, $tagname) = @_;
    if ($tagname eq 'td'){
        $in_data = 0;
    }
}

sub text_handler {
    my ($text) = @_;
    if ($in_data == 1){
        push @$this_team_ref, $text;
    }
}
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
# run parser to create data structure
# data structure is an array of array references, each array reference is
#   to a row in the ESPN score grid table
#-------------------------------------------------------------------------------
my $header_parser = HTML::Parser->new(  
                        api_version     => 3,
                        start_h         => [\&tag_handler,  "tagname, attr"],
                        end_h           => [\&end_handler,  "self, tagname"],
                        text_h          => [\&text_handler, "dtext"],
                        unbroken_text   => 1,
                        attr_encoded    => 1,
                        );
                                        
$header_parser->parse_file($input_filename);

print "Score data\n";
print Dumper(\@score_data);
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
# create score_data and team_count data structures
#   score_data contains home and road teams for each week
#   team_count contains count of weeks for each team for QA purposes
#      (count must be 17 for each team)
#-------------------------------------------------------------------------------
my %score_data;
my %team_count;
foreach my $team_aref (@score_data){
    my $team = shift @$team_aref;
    $team_count{$team}++;
    foreach my $index (0..16){
        my $opp = $team_aref->[$index];
        next if lc $opp eq 'bye';
        if (index($opp, '@') > -1){
            $opp =~ s/@//;
            $score_data{$index + 1}->{$opp}{$team} = 1;
            $team_count{$opp}++;
        }
        else {
            $score_data{$index + 1}->{$team}{$opp} = 1;
            $team_count{$opp}++;
        }
    }
}

print Dumper(\%score_data);
print Dumper(\%team_count);
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
# write file and QA data -- make sure each team has 8 home and 8 road games ##
#-------------------------------------------------------------------------------
open my $OP, '>', $output_filename or die "cannot open $output_filename: $!";

my %team_sched;
foreach my $week (sort {$a <=> $b} keys %score_data){
    foreach my $home_team (keys %{$score_data{$week}}){
        foreach my $away_team (keys %{$score_data{$week}->{$home_team}}){
            $team_sched{$away_team}->{'away'}++;
            $team_sched{$home_team}->{'home'}++;
            printf $OP "%5s %-5s %-5s\n",
                $week,
                $team_mapping{lc $away_team} || lc $away_team,
                $team_mapping{lc $home_team} || lc $home_team
        }
    }
}

print Dumper(\%team_sched);
#-------------------------------------------------------------------------------
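For comparison (this is not part of my workflow, just an illustration), the same handler-based approach translates almost line-for-line to Python's built-in html.parser. The HTML snippet below is a made-up miniature of the grid table, not real ESPN markup:

```python
# Sketch of the same handler-based table scrape using Python's html.parser.
# The HTML string is a tiny invented stand-in for ESPN's grid table.
from html.parser import HTMLParser

class GridParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row = [], None
        self.in_table = self.in_data = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr' and attrs.get('class') == 'colhead':
            self.in_table = True          # the header row marks the start of the grid
        elif tag == 'tr' and self.in_table:
            self.row = []
            self.rows.append(self.row)
        elif tag == 'td' and self.in_table:
            self.in_data = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_data = False

    def handle_data(self, data):
        # Ignore text in the header row itself (self.row is still None there).
        if self.in_data and self.row is not None:
            self.row.append(data)

html = ('<table><tr class="colhead"><td>TEAM</td></tr>'
        '<tr><td>ari</td><td>@sd</td><td>BYE</td></tr></table>')
p = GridParser()
p.feed(html)
print(p.rows)   # -> [['ari', '@sd', 'BYE']]
```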
Posted in NFL model, programming, scripts

All In on Python!

Perl, R, and SAS have been my workhorse scripting and statistical languages for the past 15 years. I’ve been using Python for about 10 years, but almost always for projects strictly requiring Python and/or projects without tight deadlines.

Perl has always been my go-to scripting language for getting things done quickly, primarily due to (1) my familiarity with the language, (2) the seamless integration of regular expressions into the language (which seem to find their way into every project I touch),  and (3) my confidence in Perl to provide everything I need (either in the standard library or in CPAN) for solving any programming or data manipulation problem imaginable.

As I continue to increase my usage of Python and learn more about the language, it has become clear to me that Python is likely to be a better long-term scripting tool than Perl for my needs, and (more surprisingly) I think Python will also largely replace R as my primary data analysis and statistical language.

My reasons for choosing Python over Perl for general-purpose programming and R for data analysis could easily fill at least one lengthy blog post, and maybe I’ll cover that in the future. Here I’ll focus on my approach to replacing Perl with Python as my day-to-day scripting language.

Geeking out on Python

My typical approach to learning and using a new programming language is to jump feet-first into a project, picking up the needed knowledge about the language and its tools along the way. I’m going to take a different approach with Python over the next couple months: my plan is to systematically and thoroughly learn Python’s fundamental software development and data analysis tools.  That is, I’ll be thoroughly reading official module documentation and seeking out the best tutorials and books for the various tools.

My reasons for wanting to focus on Python’s software development tools have a lot to do with the Python language, as well as my changing attitudes and beliefs on best practices for software development:

  • I’m starting to see the value in using debuggers, unit testing, and (gasp!) documenting code in a formal way, even for ad hoc or one-off projects.   In the past, I’ve almost always taken a “quick-and-dirty” approach to programming:  writing sparsely (or not at all) documented code and using print statements (or their equivalent) to test and debug code.   As an example of my previous attitude, I viewed debuggers as a “crutch”, a tool that would make me “lazy” and take away my ability to code in a quick-and-dirty fashion when there’s a tight deadline.  After giving debuggers a try, I’ve realized that they can be a huge time saver and productivity tool.  Just as importantly, I realized that using print statements for debugging is really the lazy approach, and thus using debuggers is not going to affect my ability to take the quick-and-dirty approach to programming and debugging when that’s needed.
  • Almost all of Python’s software development tools are tightly integrated into the language and part of the standard library.  In Perl, the software development tools (and even the object-oriented model, for that matter) seem to be “bolted on” to the core language, and thus their usage doesn’t always feel “natural” within a Perl script.  In contrast, the Python development tools that I’ll be focusing on are seamlessly incorporated into the language, and using them is as natural and “Pythonic” as using any other core part of the language.

Here is the list of software development tools I’m going to focus on, roughly in priority order, though I’ll tackle the items on the list somewhat concurrently:

  • IPython — An interactive Python shell.

I want to do more of my development work interactively from a Python shell, rather than following my standard process of writing text files and then running from the command line.  Running programs from the command line is great for the end user, but using an interactive Python shell offers several benefits over this approach during development.

I could never get into IDLE (the standard shell provided with Python) because it doesn’t seem to have a set of features sufficient to justify the “hassle” of running my code in an interactive mode.  IPython, on the other hand,  provides a complete software development environment within a Python shell, including numerous productivity tools and shortcuts, as well as the ability to customize the environment to a large extent.  Beyond that, IPython (and in particular the IPython Notebook) is visually stunning, rivaling the aesthetic appeal of the notebook-style offerings of Mathematica and Matlab.

One of the shortcomings of Perl is that it does not have a readily available interactive shell that I know of (please let me know if I’m wrong about that, but mentioning Perl’s debugger does not count).  It almost seems criminal for a scripting language not to have a decent interactive shell.

  • pdb — the Python Debugger

The biggest problem with my standard approach of using print statements for debugging my programs is that it’s a pain to find and remove these statements when they’re no longer needed.  Furthermore, this approach to debugging often leads to large volumes of output when only a couple of examples are needed to understand what’s going on.   pdb allows me to examine program behavior as the program is running, and then I can extract only the needed information about the program’s state in exactly the desired context.  Thus, pdb eliminates the need to use print for debugging and all of the hassles that come with it.
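As a tiny sketch of the workflow (the function and data are invented), dropping a single breakpoint into a suspect function replaces a pile of print statements:

```python
# Sketch: a pdb breakpoint instead of scattered prints.
# The breakpoint line is commented out so the script runs unattended.
import pdb

def score_week(picks, results):
    """Count how many picks matched the actual results (illustrative only)."""
    correct = sum(1 for p, r in zip(picks, results) if p == r)
    # pdb.set_trace()   # uncomment to inspect picks, results, correct live
    return correct

print(score_week(['kc', 'sf', 'dal'], ['kc', 'sd', 'dal']))  # -> 2
```

You can also run a whole script under the debugger with `python -m pdb script.py`, which stops before the first line executes.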

  • Docstring comments

Documenting code is about as unpleasant as taking bad-tasting medicine, especially when working on a project with a tight deadline.  But just like using the right medicine, good documentation makes things better in the long run.  We all know that documentation makes our programs better, and I’m slowly starting to increase the amount of documentation in my code.

Documentation is obviously useful when examining source code, but Python goes one step further, multiplying the usefulness of well-placed documentation with no extra effort from the programmer.   Appropriately-placed code documentation is automatically attached to associated objects and is available at runtime (and during interactive shell sessions) for inspection.   What’s more, there are many tools available that automatically convert Python code documentation into easy-to-read reports and documents.   In particular, you can automatically generate user-friendly documentation of your module’s API from the code documentation, which is a major benefit for anyone using your module.
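A tiny illustration of that runtime attachment (the function is made up): the docstring lives on the function object itself, so help(), `obj.__doc__`, and IPython's `func?` can all read it, and a `>>>` example inside the docstring can even be run by the doctest module:

```python
# Sketch: a docstring is attached to the function object and is
# inspectable at runtime, with no effort beyond writing it.

def normalize_team(name):
    """Return the lowercase team abbreviation used in the schedule file.

    >>> normalize_team('WSH')
    'wsh'
    """
    return name.strip().lower()

print(normalize_team('  KC '))                    # -> kc
print(normalize_team.__doc__.splitlines()[0])     # first line of the docstring
```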

  • Logging via the logging module

The logging module provides a framework for outputting data and results in response to events in your program, which means it can be used for debugging when things go wrong, reporting when things are running fine, and warnings when things might be somewhere in between.

Why not just use print statements for logging?  In short,  it’s a real pain to add print statements, change them as necessary, and then remove them when they’re no longer needed.

Now imagine writing a module for which you want to turn logging on or off depending on how the module is being used.  For example, you might want to turn on logging when debugging and developing your module, but you might not want the end user of your module — which could be a human or another module — to see the logging output.  Repetitively adding print statements while developing your module and then removing them when shipping the code is a tedious, time-consuming, and error-prone process.   In this example, the logging module makes it easy — with a single line of code — to turn logging on and off.

However, the potential uses of the logging module go far beyond using it as a way to easily turn the generation of output on and off.    The logging module makes it easy to control what gets logged and how it gets logged.  Furthermore, logging behavior can be easily changed by programs using your module, without any need to modify your module code.
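Here's a minimal sketch of that on/off pattern (the logger name and function are invented). The NullHandler keeps a library module silent by default; a single basicConfig() line in the calling program turns everything on:

```python
# Sketch: module-level logging the caller can switch on with one line,
# instead of adding and removing print statements.
import logging

log = logging.getLogger('nfl_model')   # hypothetical module name
log.addHandler(logging.NullHandler())  # silent unless the caller configures logging

def pick_winner(home, away, spread):
    log.debug('comparing %s vs %s (spread %.1f)', home, away, spread)
    return home if spread >= 0 else away

# End user: nothing is logged by default.
print(pick_winner('gb', 'chi', 3.5))   # -> gb

# Developer: one line enables full logging (records go to stderr).
logging.basicConfig(level=logging.DEBUG)
print(pick_winner('gb', 'chi', -1.0))  # -> chi
```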

  • Unit testing
    • mocking and patching objects with the mock library
    • unittest module

Testing code is such a pain that I typically don’t do it, outside of a few print statements.  Good testing requires verifying that program behavior exactly matches the programmer’s expectations, and this level of testing cannot be reliably or efficiently achieved with print statements alone.  Luckily, the unittest module that’s part of Python’s standard library helps with automation of the testing process.  I can’t claim that unittest makes my testing completely fun or quick (although it apparently does for some people), but it does make testing as efficient and painless as it can get.
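Here's a small sketch of the unittest-plus-mock combination (the function and URL are invented; note that mock has shipped in the standard library as unittest.mock since Python 3.3):

```python
# Sketch: a minimal unittest with a mocked dependency.
# fetch_schedule and the URL are made up for illustration.
import unittest
from unittest import mock

def fetch_schedule(downloader):
    """Download raw schedule text and split it into lines."""
    return downloader('http://example.com/schedule').splitlines()

class FetchScheduleTest(unittest.TestCase):
    def test_splits_lines(self):
        # The mock stands in for a real HTTP download function.
        fake = mock.Mock(return_value='1 sd ari\n1 ten kc')
        self.assertEqual(fetch_schedule(fake), ['1 sd ari', '1 ten kc'])
        fake.assert_called_once_with('http://example.com/schedule')

# Run the test case programmatically and report the outcome.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(FetchScheduleTest))
print(result.wasSuccessful())   # -> True
```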

My goal is to learn these tools thoroughly and then use each of them on every project, even when using the tools seems like overkill for the given project.  At some point, using the tools will become second-nature, which means they won’t slow me down on projects for which they are truly required.

Along the way I plan on sharing my learnings and thoughts, in particular the gotchas, pitfalls, and other aspects of the tools that aren’t obvious and/or not covered in the documentation and tutorials.  I already have a list of such learnings.

I’d love to hear your thoughts on my approach and suggestions for other Python software development tools I should be using!

Posted in programming, software development

Another Update to my 4chan Image Downloader Script

In the last couple weeks 4chan changed their HTML in a way that broke my 4chan image downloader script.

According to their website:

“The ‘src/’ and ‘thumb/’ directories will be removed from images/thumbs.4chan.org and i/t.4cdn.org. Files will live at the board root on those subdomains.”

Luckily, fixing my script required only changing my regular expression pattern for matching image URLs. I replaced this regular expression to match image URLs:

/"(\/\/[^\/]*\/[A-Z]{1,4}\/src\/\d+\.(?:jpg|png|gif))"/gsim

with the following:

/href\s*=\s*"(\/\/[^\/]*\/[A-Z]{1,4}\/\d+\.(?:jpg|png|gif))"/gsim

Note that adding the href\s*=\s*" portion to the beginning of the new regular expression has nothing to do with the change 4chan made; rather it just adds a little more specificity to my pattern for matching image URLs.
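For anyone following along in another language, the new pattern transcribes almost directly into Python's re module (the sample href below is made up; `[A-Z]{1,4}` plus Perl's /i flag becomes a character class with re.IGNORECASE):

```python
# Sketch: the updated pattern (no /src/ directory) transcribed into Python.
import re

pattern = re.compile(r'href\s*=\s*"(//[^/]*/[a-z]{1,4}/\d+\.(?:jpg|png|gif))"',
                     re.IGNORECASE)

html = '<a href="//i.4cdn.org/wg/1234567890123.jpg">anon image</a>'
print(pattern.findall(html))   # -> ['//i.4cdn.org/wg/1234567890123.jpg']
```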

Yes, I know I should probably use a more purpose-built Perl module such as HTML::Parser to extract the image URLs, so maybe I’ll update my script in the future.

See the page 4chan image downloader script for the complete and updated script.

Posted in programming, scripts

Final 2013 NFL Pick’em League Standings

This was my best overall finish in this league, though I’ve won more weeks in every past year.  This is consistent with the changes I made to my prediction model.  This season I used a model that produced more accurate, lower variance predictions, which is great for overall long-run results.

However, if the goal is to maximize the number of weeks with the highest score in the pool, you are better off using a model that produces predictions with a high variance.   Such a model may not produce the best long-run cumulative results (as measured by the overall percentage of points corresponding to correct picks as shown in the chart below), but it gives you a better chance to have the top score in any given week.  In other words, if your good weeks are really good and your bad weeks are really bad (as is the case in a high-variance model), you have a higher chance to win a given week than someone who consistently picks well enough to finish in the top 5 but never has the really good week.  Or another way to put it is that you can win a week only if you take some chances (especially when there are many people in the pool), but taking chances will lead to sub-optimal correct pick rates in the long run.
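A quick Monte Carlo sketch makes the point concrete. Every number below is invented for illustration (it is not my model): a noisy picker with a lower average score still wins far more individual weeks against a 15-person field than a steadier picker with a better average:

```python
# Quick simulation: a high-variance picker with a *lower* mean score
# wins more individual weeks than a steadier, better picker.
# All distribution parameters are invented for illustration.
import random

random.seed(17)
WEEKS, FIELD = 5000, 15

steady_wins = wild_wins = 0
for _ in range(WEEKS):
    field_best = max(random.gauss(70, 5) for _ in range(FIELD))
    steady = random.gauss(71, 2)    # better mean, low variance
    wild = random.gauss(68, 12)     # worse mean, high variance
    if steady > field_best:
        steady_wins += 1
    if wild > field_best:
        wild_wins += 1

print(steady_wins, wild_wins)   # the wild picker wins many more weeks
```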

[Chart: league_standings_2013_week_17]

See my past post “NFL Pick ’em League Standings Through 9 Weeks” for more background on the picks league and the creation of the chart.

Here are my results for the three pools I’m in (excluding participants who had a zero week):

Pool   Rank   Participants   Percentile
   1      1             14        92.9%
   2      3             18        83.3%
   3      2             11        81.8%

Posted in graphics, NFL model

Update to my 4chan Image Downloader Script

Occasionally 4chan changes their HTML in a way that causes my 4chan image downloader script to break.  I am not complaining in any way, as they have good reasons for their changes as far as I can tell.

In the latest case they changed their static domains from 4chan.org to 4cdn.org, and the image URLs changed accordingly.  To address this change, I replaced this line

$image_link =~ /(\/\/images.4chan.org\/[A-Z]{1,4}\/src\/(\d+\.(jpg|png|gif)))/i

with the following:

$image_link =~ /(\/\/[^\/]*\/[A-Z]{1,4}\/src\/(\d+\.(jpg|png|gif)))/i

This change should also make the script less likely to miss image URLs if 4chan changes their image URLs again in the future.
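To see why, here are both patterns transcribed into Python's re (the sample URL is made up, and I've escaped the literal dots in the hostname, which the original Perl pattern left unescaped): the old pattern is pinned to the 4chan.org hostname, while the new one matches any host:

```python
# Sketch: the host-agnostic pattern matches old and new image domains.
import re

old = re.compile(r'(//images\.4chan\.org/[a-z]{1,4}/src/(\d+\.(?:jpg|png|gif)))',
                 re.IGNORECASE)
new = re.compile(r'(//[^/]*/[a-z]{1,4}/src/(\d+\.(?:jpg|png|gif)))',
                 re.IGNORECASE)

url = '//images.4cdn.org/wg/src/1234567890.png'
print(bool(old.search(url)), bool(new.search(url)))   # -> False True
```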

See the page 4chan image downloader script for the complete and updated script.

Posted in scripts

SAS Tip: Applying Labels From one SAS Variable to Another

Summary

What if you want to apply (or copy) the label of one SAS variable to another SAS variable? Doing so is not as straightforward as you might guess; it requires a trick that is not obvious. Well, it wasn’t obvious to me, anyway.

Here’s the most basic example of this trick in action:

data my_data;
 set my_data end = eof;

 income_2010_copy = income_2010;

 if eof then call execute('data &syslast; set &syslast;');
 if eof then call execute('label income_2010_copy = "copy of ' || 
                            strip(vlabel(income_2010)) || '";');
 if eof then call execute('run;');
run;

As you can see from the PROC CONTENTS output, the newly created variable income_2010_copy does indeed take the label information from income_2010:

#    Variable            Type    Len    Label

1    income_2010         Num       8    wages and investment income in 2010        
2    income_2010_copy    Num       8    copy of wages and investment income in 2010

That was a quick start guide, and I’ll get into more details below.

The Problem

Creating a new variable from an existing variable is a common operation in SAS. If the original variable has a label, you might want to create a label for the new variable that carries over all or some of the information contained in the original variable’s label. For example, you might have a variable called income_2010 with the label wages and investment income in 2010. If you create a new variable called income_2010_sqrt that is the square root of income_2010, you probably want to label the variable as such. One approach is to create a brand new label for income_2010_sqrt using the label statement, most likely after copying and pasting the original label for income_2010:

label income_2010_sqrt = "square root of wages and investment income in 2010";

As you can imagine, cutting and pasting label text can get tedious and error-prone if you have to deal with dozens, hundreds, or even thousands of variables. What if you could grab the label for a variable and then use it to populate the label of another variable? It can be done in SAS, but it’s not as easy or straightforward as I think it should be.

We have the vlabel() function in SAS, which returns a variable’s label. Great — just what we need! Is there a SAS function that allows us to perform the inverse of this operation, setting the label for a variable? Of course not! I’ll put that omission in my list of SAS annoyances; pretty much any other language I can think of would provide both the “getter” and the “setter”, but SAS only provides the “getter” for variable labels.

So we’re stuck with using the label statement, which requires literal text for the label values. This is not conducive to automation. So is it possible to use macros and macro variables to generate a label statement that is part of the current data step? Maybe, but it seems really tricky (or probably impossible) due to the occasionally mind-bending interrelationship between macro resolution and DATA step compilation and execution. To sum up the issue (probably somewhat incorrectly and incompletely), all of the macro references and statements within a DATA step need to be resolved for the DATA step to execute, but the DATA step needs to execute to get the variable labels using vlabel() to generate the label code using macros.

Fortunately, there is a trick that allows us to automate applying labels from one variable to another. It’s not obvious or elegant, but it’s easy to understand and flexible enough to handle almost any automatic variable labeling problem. The trick is to use vlabel() inside of the current DATA step to get variable labels, and then generate new label statements for the variables you want to label. These generated label statements will be executed as part of a subsequent DATA step.

We can create and execute the variable labeling code with the CALL EXECUTE call routine, which takes its text arguments and executes them as SAS statements after the current DATA step finishes. Thus, we will use a DATA step (with vlabel() and CALL EXECUTE) to get variable labels and generate SAS DATA step code that creates the new variable labels. This variable labeling DATA step will execute automatically after the current DATA step.
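The underlying pattern — build statements as text from runtime metadata, then execute them in a second pass — isn't unique to SAS. Here is a deliberately simplified Python analogue (the variable names mirror the SAS example; this is an illustration of the pattern, not a SAS interface):

```python
# Sketch: a loose Python analogue of vlabel() + CALL EXECUTE --
# generate label-copying statements from runtime metadata, then
# execute them after the "current step" finishes.
labels = {'income_2010': 'wages and investment income in 2010'}

generated = []
for var, label in labels.items():
    # "Generate" a statement as text, just as CALL EXECUTE queues SAS code.
    generated.append(f'labels["{var}_sqrt"] = "square root of {label}"')

for stmt in generated:          # "execute" the queued statements afterward
    exec(stmt)

print(labels['income_2010_sqrt'])
# -> square root of wages and investment income in 2010
```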

An Example

Assume the original SAS dataset has the following variables and labels:

#    Variable       Type    Len    Label

1    income_2010    Num       8    wages and investment income in 2010
2    income_2011    Num       8    wages and investment income in 2011
3    income_2012    Num       8    wages and investment income in 2012

Here’s the code to do some variable creation and automatic variable labeling:

data my_data;
 set my_data end = eof;

 income_2010_sqrt = sqrt(income_2010);
 income_2011_sqrt = sqrt(income_2011);
 income_2012_sqrt = sqrt(income_2012);

 if eof then call execute('data &syslast; set &syslast;');

 if eof then call execute('label income_2010_sqrt = "square root of ' || 
                            strip(vlabel(income_2010)) || '";');

 if eof then call execute('label income_2011_sqrt = "square root of ' || 
                            strip(vlabel(income_2011)) || '";');

 if eof then call execute('label income_2012_sqrt = "square root of ' || 
                            strip(vlabel(income_2012)) || '";');

 if eof then call execute('run;');
run;

Looking at the SAS log, you can see that the CALL EXECUTE calls above generated the necessary variable labeling code, which was executed by SAS:

NOTE: CALL EXECUTE generated line.
1   + data WORK.MY_DATA                         ; set WORK.MY_DATA                        
 ;
2   + label income_2010_sqrt = "square root of wages and investment income in 2010";
3   + label income_2011_sqrt = "square root of wages and investment income in 2011";
4   + label income_2012_sqrt = "square root of wages and investment income in 2012";
5   + run;

And the PROC CONTENTS output shows the new variable labels:

#    Variable            Type    Len    Label

1    income_2010         Num       8    wages and investment income in 2010               
2    income_2011         Num       8    wages and investment income in 2011               
3    income_2012         Num       8    wages and investment income in 2012               
4    income_2010_sqrt    Num       8    square root of wages and investment income in 2010
5    income_2011_sqrt    Num       8    square root of wages and investment income in 2011
6    income_2012_sqrt    Num       8    square root of wages and investment income in 2012

This is just one simple example of the general method for applying labels from one variable to another one. This example and methodology can be easily extended to other situations. For example, you could easily write a macro that uses the techniques shown above to automatically generate labels for hundreds of variables at once.

Note the usage of the macro variable &syslast in the code above. It holds the location of the most recent SAS dataset, so using it can often be “safer” than hard-coding a dataset name.

I want to give a big thanks to John Ladds. His paper “So Many Variables; Too Many Labels, Moving Labels from One Variable to Another” from the 2010 SAS Global Forum is where I first learned about this trick, and I would recommend that you look at his paper to get many more details and examples than I’ve provided in this post.

Hopefully you found this information helpful. I don’t do too many “Tips and Tricks” posts, because most of the tricks I come up with are obvious, not widely useful, and/or already covered better by somebody else. I thought this tip was worth passing on because it took a lot of web searching to find this non-obvious solution to a common problem.

Posted in programming, tips and tricks

Cutting Wood Efficiently, Part II: Triumphant Return to Home Depot

As described in my earlier post “Cutting Wood Efficiently”, I wrote a quick Perl script to give me the optimal (or close to optimal) cutting plan for a given set of needed board lengths.

These were the lengths I needed:

  • 8 2x4s of length 42 inches
  • 4 2x4s each of length 12, 23, 27, and 30 inches

The 2x4 board lengths available at Home Depot are 96, 104, 144, 168, and 192 inches.

Inputting the above information into my script gave me the following cutting plan, in Perlish notation:

[ [192, [42, 42, 42, 30, 30]],
  [192, [27, 27, 27, 23, 23, 23, 23, 12]],
  [144, [42, 42, 42, 12]],
  [104, [42, 42, 12]],
  [104, [30, 30, 27, 12]]
]

For example, I want to cut boards of length 42, 42, 42, 30, and 30 inches out of a 192-inch 2x4.

[Photo: 2x4s for my workbench. Waste wood is to the right.]

And the picture shows the results.  The usable wood is to the left and the scrap wood is to the right.  I ended up with 704 inches of usable wood and 30 inches of waste wood.  Because the algorithm requires any solution to have at least 4 inches of leftover wood from each source board (for reasons explained in my previous post), the real waste is 30 – 5*4 = 10 inches.  So the waste rate is 10/(704 – 20) = 1.5%.  I thought that was pretty good.

The guy at Home Depot seemed unimpressed with my writing a script to determine the optimal cutting plan.  His reaction was not surprising to me initially, because I assumed the pro desk had a similar program available to customers.  So I asked him, “So the pro desk must have a program like mine to assist customers?”  His reply did surprise me.  “No, the pro desk doesn’t have anything like that.  It’s not needed, because customers just figure out their own cutting plan.”

This seems like an opportunity.  Without a program it’s not easy to come up with the optimal cutting plan for any project that isn’t trivial.  And in the time span I was at Home Depot, there were at least two other people who had similar board cutting requests.

Next post in the series will show the completed workbench, which isn’t too far off!

Posted in programming, scripts | Tagged , | Leave a comment

Cutting Wood Efficiently

I’m building a workbench out of plywood and 2x4s for my man cave, so I figured I could just give Home Depot a list of the needed 2x4 lengths, and the guy at the saw would take it from there. As you can imagine, for a nontrivial project it’s not easy to come up with the optimal cutting plan that minimizes the amount of leftover wood. I assumed that Home Depot would have a program for determining the optimal cutting plan given a list of needed lengths and the source board lengths available, or that the guy at the saw would instinctively know how to cut up the boards in a way that was “close enough”.  And close enough would have been good enough for me.

I was so wrong! The guy at Home Depot looked at my list of lengths and sort of freaked out. He insisted that either I come up with a detailed cutting plan, or go to the “pro desk” for assistance. This response surprised me, considering I did not mind paying for an extra few 2x4s, which I made clear to him.

This incident provided the spark I needed (I never need much) to write a program to help me out.

Using the excellent book Higher Order Perl by Mark Jason Dominus as my inspiration and reference (more on that book in a later post), I wrote some code to help me come up with optimal (or close to optimal) cutting plans.

The source code is at the bottom of this post.  Here’s an example of usage:

my $iterator = make_lumber_partitioner([96, 104, 144, 168, 192],
                                       [(42) x 8, (30, 27, 23, 12) x 4]);
my $solution_1 = $iterator->();
my $solution_2 = $iterator->();

The first argument to make_lumber_partitioner() is an array reference to the 2x4 lengths available at the store, and the second argument is an array reference to the needed lengths. So in this case I need 8 42-inch boards, 4 30-inch boards, 4 27-inch boards, and so on.

make_lumber_partitioner() returns an iterator that returns a different solution each time you “kick” the iterator by calling $iterator->(). Each solution is an array reference that lists the source boards, giving the length of the source board and the lengths to be cut from it. So here is one solution (in YAML format):

-
  - 104
  -
    - 42
    - 42
    - 12
-
  - 144
  -
    - 42
    - 42
    - 42
    - 12
-
  - 192
  -
    - 42
    - 42
    - 42
    - 30
    - 30
-
  - 104
  -
    - 30
    - 30
    - 27
    - 12
-
  - 192
  -
    - 27
    - 27
    - 27
    - 23
    - 23
    - 23
    - 23
    - 12

So we want one 104-inch 2x4 cut into lengths of 42, 42, and 12 inches, one 144-inch 2x4 cut into lengths of 42, 42, 42, and 12, and so on.

I’ve written it as an iterator because there are literally millions of different solutions for this example, and attempting to generate all of them would (and, in one run, did) exhaust my computer’s memory and take close to forever.  Instead of getting all solutions, I can kick the iterator a million times or so, and then choose the best solution.  We’re not guaranteed to get the best solution, but this is quick and should get us close.
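A driver loop for that approach might look like the sketch below. For illustration the iterator here is a toy over three canned solutions so the example is self-contained; in practice the iterator would come from make_lumber_partitioner() below, and “best” here simply means the least total length of source board purchased:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Toy iterator over three canned solutions (shift returns undef when
# the list is exhausted, just as the real iterator does).
my @canned = (
    [ [192, [42, 42, 42]], [104, [42, 30]] ],   # 296 total inches
    [ [168, [42, 42, 42, 30]] ],                # 168 total inches
    [ [192, [42, 42, 42, 30]] ],                # 192 total inches
);
my $iterator = sub { shift @canned };

# Kick the iterator until it runs dry, keeping the cheapest solution.
my ($best, $best_total);
while (defined(my $solution = $iterator->())) {
    # judge a solution by the total inches of 2x4 it requires buying
    my $total = sum(map { $_->[0] } @$solution);
    if (!defined $best_total or $total < $best_total) {
        ($best, $best_total) = ($solution, $total);
    }
}
print "best plan buys $best_total inches of 2x4\n";   # 168
```

With the real iterator you would cap the loop at some number of kicks (say a million) rather than waiting for exhaustion.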

A few interesting things about the problem and code:

  • The saw blade makes a certain amount of the wood disappear each time it makes a cut; this is known as the kerf.  The guy at Home Depot said the kerf is about 1/4 inch.  This program accounts for the kerf with each cut.
  • The guy at Home Depot also mentioned there has to be at least 4 inches of board remaining after a cut, so shaving an inch off the end of a board is not allowed.  Something about smaller pieces coming off the saw like projectiles.  Thus this program makes sure at least 4 inches is left over from each board.  It’s probably not a bad idea to leave some padding anyway to account for boards that are a little shorter than advertised.
  • The program uses a typical queue (or agenda) based approach, whereby each potential (but incomplete) solution is placed into the queue for further investigation.  Partial solutions are taken from the queue and resolved further.  If a partial solution resolves into a complete solution, it is returned.  If not, it is placed back on the queue for subsequent investigation.
  • When removing a partial solution from the queue, resolving it further, and then placing it back on the queue, one cannot merely copy the reference to the original partial solution coming off the queue.  Rather, it is necessary to make a “deep copy” of the partial solution before changing it and placing it back on the queue, so that its state is not changed in the current or future iterations.  The dclone() subroutine from the Storable module works perfectly for this.
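The deep-copy point in the last bullet is easy to demonstrate: a shallow copy of a partial solution shares the inner piece lists, so mutating the copy also corrupts the original still sitting on the queue:

```perl
use strict;
use warnings;
use Storable qw(dclone);

# A partial solution: one 104-inch board with two pieces allocated.
my $partial = [ [104, [42, 42]] ];

my $shallow = [ @$partial ];          # copies only the outer list
push @{ $shallow->[0][1] }, 12;       # ...so this also changes $partial!
print scalar @{ $partial->[0][1] }, "\n";   # 3 -- original was mutated

my $original = [ [104, [42, 42]] ];
my $deep = dclone($original);         # recursively copies every level
push @{ $deep->[0][1] }, 12;
print scalar @{ $original->[0][1] }, "\n";  # 2 -- original untouched
```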

I threw this program together quickly so I could keep my workbench project moving, so there are almost certainly bugs and edge cases that are not handled correctly.  The results I got are good enough for my purposes, but there are many improvements I could (and maybe someday will) make to the program:

  • The current program assumes an infinite number of boards available from the store for each length specified.  That assumption is fine for my purposes, but the program would be more generally useful — say, if you were using boards available at home — if you could specify how many boards are available for each length, with infinite boards being one option.
  • Removing partial solutions from and placing them on the queue can be improved so that the iterator tends to provide better solutions earlier.
  • The iterator could be made to filter out solutions that are known not to be optimal compared to what’s been returned already, so that each returned solution is guaranteed to be at least as good as what’s been returned previously.  This would tend to drastically cut down on solutions returned, so this is a good approach only if you are very clear on what makes one solution better than another. Typically solutions are very similar to each other, and there are often many ways to assess the quality of a solution (some of which might not have been considered when writing the program), so this approach might be too restrictive.
  • I need to read the “From Recursion to Iterators” chapter of my favorite programming book Higher Order Perl more closely, because I’m sure it will give lots of ideas for making my code better and more efficient.
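The filtering idea in the third bullet can be sketched as a wrapper around any solution iterator. The inner iterator below is a canned toy for illustration, and “better” again means less total source-board length:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Wrap an iterator so it yields only solutions strictly better than
# everything seen so far.  A sketch, not tied to the program below.
sub make_improving_iterator {
    my $inner = shift;
    my $best_total;
    return sub {
        while (defined(my $solution = $inner->())) {
            my $total = sum(map { $_->[0] } @$solution);
            next if defined $best_total and $total >= $best_total;
            $best_total = $total;
            return $solution;
        }
        return undef;   # inner iterator exhausted
    };
}

# Toy inner iterator whose solutions total 296, 192, 168, 192 inches
my @canned = (
    [ [192, []], [104, []] ],
    [ [192, []] ],
    [ [168, []] ],
    [ [192, []] ],
);
my $filtered = make_improving_iterator(sub { shift @canned });

my @yielded;
while (defined(my $s = $filtered->())) {
    push @yielded, sum(map { $_->[0] } @$s);
}
print "@yielded\n";   # 296 192 168 -- the final 192 is filtered out
```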

Look for a follow-up post to find out whether my programmatically generated cutting plan leads to a more successful trip to Home Depot!

#!/usr/bin/env perl

use warnings;
use strict;
use Storable qw(dclone);

sub mk_sum {
    my $aref = shift;
    my $sum = 0;
    $sum += $_ foreach (@$aref);
    return $sum;
}

sub make_lumber_partitioner {
    my $sizes_available_aref = shift;
    my $pieces_aref = shift;
    my $kerf = shift || 0.25;
    my $padding = shift || 4;

    $sizes_available_aref   = [sort {$a <=> $b} @$sizes_available_aref];   # numeric sort: shortest boards first
    $pieces_aref            = [sort {$b <=> $a} @$pieces_aref];            # longest pieces first

    my @todo = [$pieces_aref, []];   # queue of [remaining pieces, partial solution]

    return sub {   # the iterator: each call returns the next solution, or undef
        while (@todo){
            my $current = pop @todo;
            my ($pieces_aref, $solution_aref) = @$current;

            if (scalar(@$pieces_aref) == 0) {return $solution_aref}   # all pieces placed: a complete solution

            my ($new_piece, @remaining_pieces) = @$pieces_aref;

            my $added_board = 0;
            foreach my $solution_idx (0 .. scalar(@$solution_aref)-1){

                my $this_board_aref = $solution_aref->[$solution_idx];
                my ($board_length, $allocated_pieces_aref) = @$this_board_aref;
                # does the new piece fit on this board, once kerf and the
                # minimum leftover (padding) are accounted for?
                if ($board_length >=    mk_sum($allocated_pieces_aref) +
                                        $new_piece +
                                        $kerf*(scalar @$allocated_pieces_aref) +
                                        $padding){
                    $added_board = 1;
                    my $new_solution_aref = dclone($solution_aref);
                    push @{$new_solution_aref->[$solution_idx][1]}, $new_piece;
                    push @todo, [[@remaining_pieces], $new_solution_aref]; 
                }
            }

            if (not($added_board)){
                foreach my $size (@$sizes_available_aref){
                    if ($size >= $new_piece + $padding){
                        my $new_solution_aref = dclone($solution_aref);
                        push @$new_solution_aref, [$size, [$new_piece]];
                        unshift @todo, [[@remaining_pieces], $new_solution_aref];
                    }
                }
            }
        }
        return undef;
    }
}
Posted in programming, scripts | Tagged , , | 1 Comment