• New version of HTML::Lint validates HTML entities

    I've released a beta of the new version of HTML::Lint, version 2.11_01. (At the time of this writing, the 2.11_01 release has not yet reached its search.cpan.org page.) This version adds HTML entity checking to the tag checking that HTML::Lint has done since the dawn of time. If you're already using HTML::Lint, please help test this beta version!

    Entity checking can be a messy business, but it can be invaluable for finding little mistakes, especially in static HTML pages sent to you from other sources. For example, given this HTML file, filled with HTML entities and ampersands and all sorts of potential problems, HTML::Lint sniffs out the problems and reports them:

    <html>
    <head>
    <title>Ace of &spades;: A tribute to Mot&oumlrhead. ® &metalhorns;</title>
    <script>
    function foo() {
        if ( 6 == 9 && 25 == 6 ) {
            x = 14;
        }
    }
    </script>
    </head>
    <body bgcolor="white">
    <p>
    Thanks for visiting Ace of &#9824;
    <!-- Numeric version of &spades; -->
    <p>
    Ace of ♠ is your single source for
    everything related to Mot&ouml;rhead.
    <p>
    Here's an icon of my girlfriend Jenny: &#8675309;
    <!-- invalid because we cap at 65536 -->
    <p>
    And here's an icon of a deceased cow: &#xdeadbeef;
    <!-- Invalid because we cap at xFFFF -->
    <p>
    Another <i>deceased cow: &xdeadbeef;
    <!-- Not a valid hex entity, but unknown to our lookup tables -->
    <p>
    Here's an awesome link to
    <!-- here comes the ampersand in the YouTube URL! -->
    <a href="http://www.youtube.com/watch?v=8yLhA0ROGi4&feature=related">"You Better Swim"</a>
    from the SpongeBob movie.
    <!--
    Here in the safety of comments, we can put whatever &invalid; and &malformed entities we want, &
    nobody can stop us.  Except maybe Cheech & Chong.
    -->
    </body>
    </html>

    $ weblint motorhead.html
    motorhead.html (3:9) Entity &ouml; is missing its closing semicolon
    motorhead.html (3:9) Entity &oumlrhead. ® is unknown
    motorhead.html (3:9) Entity &metalhorns; is unknown
    motorhead.html (17:9) Entity &#8675309; is invalid
    motorhead.html (19:9) Entity &#xdeadbeef; is invalid
    motorhead.html (22:17) Entity &xdeadbeef; is unknown
    motorhead.html (31:5) <i> at (22:17) is never closed
    

    That last error about the unclosed <i> tag has always been part of HTML::Lint, but all the others are new with this version of HTML::Lint.
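
    To give a feel for what entity checking involves, here is a minimal sketch in plain Perl. It is not HTML::Lint's actual implementation. The check_entities function and the tiny %known table are hypothetical. The sketch looks at a plain string rather than parsing HTML, so it makes no attempt to skip ampersands in scripts, attributes, or comments. The 0xFFFF cap mirrors the limit mentioned in the comments of the sample file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical, deliberately tiny table of known entity names.
    my %known = map { $_ => 1 } qw( amp lt gt quot spades ouml reg );

    # Return a list of complaints about the entity references in $text.
    sub check_entities {
        my ($text) = @_;
        my @problems;

        # "&" followed by a hex, decimal, or named reference, then an
        # optional closing semicolon.
        while ( $text =~ /&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*)(;?)/gi ) {
            my ( $ref, $semi ) = ( $1, $2 );

            if ( $ref =~ /^#x(.+)/i ) {
                push @problems, "Entity &$ref; is invalid" if hex($1) > 0xFFFF;
            }
            elsif ( $ref =~ /^#(.+)/ ) {
                push @problems, "Entity &$ref; is invalid" if $1 > 0xFFFF;
            }
            elsif ( !$known{$ref} ) {
                push @problems, "Entity &$ref; is unknown";
            }
            elsif ( !$semi ) {
                push @problems, "Entity &$ref; is missing its closing semicolon";
            }
        }
        return @problems;
    }

    print "$_\n" for check_entities(
        'Mot&ouml rhead &metalhorns; &#8675309; &#xdeadbeef; &spades;'
    );
    ```

    The real module does more than this. In the report above, for instance, it flags &oumlrhead both as a known entity missing its semicolon and as an unknown one.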

    The HTML-Lint distribution includes the HTML::Lint module, which is object based for easy handling, and also includes Test::HTML::Lint so that you can add HTML validation to your test suites.

    my $html = $app->generate_home_page();
    html_ok( $html, 'Home page is valid HTML' );
    

    If you're not doing any validation of your HTML in your apps, I suggest you give HTML::Lint a try.
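
    If you'd rather call the module directly than go through Test::HTML::Lint, the object interface looks roughly like this. It's a minimal sketch based on the module's documented new, parse, eof and errors methods. The lint_messages wrapper and the sample HTML string are my own inventions for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::Lint;

    # Hypothetical helper: lint a string of HTML and return the error
    # messages as plain strings.
    sub lint_messages {
        my ($html) = @_;

        my $lint = HTML::Lint->new;
        $lint->parse($html);
        $lint->eof;    # tell the parser there is no more input

        return map { $_->as_string } $lint->errors;
    }

    # Made-up sample with an <i> tag that is never closed.
    my $html = '<html><head><title>Test</title></head>'
             . '<body><p>Some <i>unclosed italics</body></html>';

    print "$_\n" for lint_messages($html);
    ```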

  • Parrot tickets now converted to GitHub

    The Parrot project is now using GitHub's issue tracking system. Parrot has used GitHub's source code control for months now, but we still had hundreds of tickets in Trac. Over the past few weeks, I've been working with Rick from GitHub to migrate those tickets out of Trac and into GitHub's issue system.

    Like most data conversion projects, the challenges were less about the coding and more about deciding how to massage the data between two similar systems. For example, Trac has fields for Severity and Priority of tickets, but GitHub only has free-form tagging, so I had to create GitHub tags that correspond to Severity and Priority in Trac. GitHub's tracking system doesn't handle file attachments, so my conversion code had to turn the file attachments into inline comments.

    Most time-consuming of all was the conversion of users from Trac to GitHub. We needed the issue history to have accurate user IDs on it, so I needed a big lookup table to do the job. While users like "coke" and "chromatic" have the same user IDs on both the Trac instance and GitHub, Trac user "jonathan" is "jnthn" on GitHub, and so on. Anyone I couldn't find a match for became generic user "Parrot".
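
    The lookup itself is the easy part. Here's a minimal sketch of the idea, not the conversion code itself. The %github_user_for hash and the github_user function are hypothetical names, and the table shows only the three users mentioned above.

    ```perl
    use strict;
    use warnings;

    # Excerpt of a Trac-to-GitHub user table (hypothetical variable name).
    my %github_user_for = (
        coke      => 'coke',
        chromatic => 'chromatic',
        jonathan  => 'jnthn',
    );

    # Map a Trac user ID to a GitHub user ID, falling back to the
    # generic "Parrot" user when there is no match.
    sub github_user {
        my ($trac_user) = @_;
        return 'Parrot' unless defined $trac_user;
        return $github_user_for{$trac_user} // 'Parrot';
    }

    print github_user($_), "\n" for qw( coke jonathan somebody_else );
    ```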

    The actual code to do all this is only about 200 lines of Perl code, which should be no surprise for someone who has the CPAN at his disposal. I used Net::Trac to read from the Trac instance, and the JSON module to write out JSON files in the GitHub API format. The bulk of the code is project-specific conversions that make little data tweaks, like changing severity to tags, and that make the output code a little friendlier in Markdown.
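
    To illustrate the kind of tweak involved, here's a rough sketch of turning one ticket into a JSON document with Severity and Priority folded into tags. Everything specific in it is an assumption: the ticket_to_issue function, the field names on both sides, and the "severity: high" tag format are made up for illustration and are not the real Trac fields or GitHub's import format. It also uses the core JSON::PP module in place of the JSON module so that it runs without anything from CPAN.

    ```perl
    use strict;
    use warnings;
    use JSON::PP;

    # Convert a ticket hashref into an issue hashref (hypothetical fields).
    sub ticket_to_issue {
        my ($ticket) = @_;

        # Severity and Priority have no fields of their own on the
        # GitHub side, so each becomes a free-form tag.
        my @labels;
        for my $field (qw( severity priority )) {
            push @labels, "$field: $ticket->{$field}"
                if defined $ticket->{$field};
        }

        return {
            title  => $ticket->{summary},
            body   => $ticket->{description},
            labels => \@labels,
        };
    }

    my $json = JSON::PP->new->canonical->pretty;
    print $json->encode( ticket_to_issue( {
        summary     => 'Example ticket',
        description => 'Made-up ticket text.',
        severity    => 'high',
        priority    => 'normal',
    } ) );
    ```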

    I have to specifically thank Rick at GitHub for helping us through this project.  I used a lot of his time with questions about how GitHub would handle my import format, and we had two test imports for us to see real results, so that I could adjust my conversion process. The final results are beautiful, and the Parrot team is excited to see this move made.

    I've long been a fan of GitHub and how they help out the community, and this just adds to it. This sort of aid to open source projects should stand as an example to other companies that work with open source. Many companies give back to the communities of the projects on which their businesses are based. It's fantastic to have a company willing to put its people to work on a project in which it has no direct involvement. In helping us, GitHub gains nothing but the gratitude of the Parrot project.

  • Perl::Critic finds annoying little bugs in your code

    My work colleague Mike O'Regan created a policy for the latest version of Perl::Critic.

    Now if you have a line of code like this:

    
    my $n += somefunc();
    # Should be my $n = somefunc();
    

    Perl::Critic will tell you:

    Augmented assignment operator '+=' used in declaration at line X, column Y. Use simple assignment when initializing variables.

    If you haven't let Perl::Critic loose on your code yet, now's a great time to try.
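
    If you want to try just this one check from a script, something like the following should do it. It's a sketch under an assumption: the post above doesn't name the policy, and I'm taking it to be the one that ships in the Perl::Critic distribution as Variables::ProhibitAugmentedAssignmentInDeclaration. The violations_for helper is a made-up name.

    ```perl
    use strict;
    use warnings;
    use Perl::Critic;

    # Run a single policy against a string of source code and return
    # the violations. The policy name is an assumption (see above).
    sub violations_for {
        my ($source) = @_;

        my $critic = Perl::Critic->new(
            '-single-policy' => 'ProhibitAugmentedAssignmentInDeclaration',
        );

        # critique() accepts a reference to a string of code.
        return $critic->critique( \$source );
    }

    print "$_" for violations_for('my $n += somefunc();');
    ```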

    To the loyal Perl::Critic users, what's the nastiest bug Perl::Critic found for you? Let me know in the comments.

  • Finding a lost dog's owner with Perl and WWW::Mechanize

    It's not every day you get to save a dog with Perl, but Perlbuzz reader Adam Gotch did just that the other day.

    Adam tells me "I'm a telecommute Perl/Python contract programmer at O'Reilly Media. I live in Springboro, OH. I've been coding in Perl for about 10 years and love it."

    On Saturday, Adam found a dog wandering the highway about a mile from his home. The local shelters didn't open until Monday, so he took it upon himself to try to find the owner.

    Adam explains:

    I located the Warren County dog registration website and discovered a simple web form that allowed you to look up an owner if you had the dog license # and registration year. Not having a clue what a license # looked like, I entered '1' with year '2011' and got a result. Dog license #'s were simple integers. Using binary search, I quickly discovered that there were 24996 registration records for 2011. The web form's search result provided a dog's owner's name, address and phone as well as the dog's breed, color and sex. With this knowledge I decided it was feasible to write a script to pull back all the records and filter for a female brown lab.

    The dog registration website was ASP.NET with __VIEWSTATE and __EVENTVALIDATION post variables so a simple LWP script was going to be a pain. I had worked with WWW::Mechanize before so I checked the CPAN docs to see if it was going to work. It seemed to have everything I needed so I began coding. I wrote a quick test to see if I could pull back one record, but no luck. I ran wireshark captures of both a manual post in Chrome and my test script. Comparison of the captures revealed that the submit button name/value was not being sent by my script. Looking at the WWW::Mechanize docs, I found the button parameter to the submit_form() method for simulating a submit button click. It worked. I finished the script, looping over all 24996 records and soon I was pulling down all the Warren County dog registration records for 2011.
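
    The binary search Adam mentions is worth a quick sketch. This is not his code. The highest_valid function is a hypothetical name, and the $exists callback stands in for "submit the form and see whether a record comes back". It assumes license numbers run from 1 up to some maximum with no gaps, and that you pick an upper bound you know is too high.

    ```perl
    use strict;
    use warnings;

    # Find the largest $n in [$low, $high] for which $exists->($n) is
    # true, assuming every number up to that point exists and none
    # after it does. Returns 0 if even $low does not exist.
    sub highest_valid {
        my ( $exists, $low, $high ) = @_;
        return 0 unless $exists->($low);

        while ( $low < $high ) {
            # Round up so the range always shrinks.
            my $mid = int( ( $low + $high + 1 ) / 2 );
            if ( $exists->($mid) ) {
                $low = $mid;         # $mid exists, so the answer is >= $mid
            }
            else {
                $high = $mid - 1;    # $mid is past the end
            }
        }
        return $low;
    }

    # Stand-in for the web lookup: pretend 24996 records exist.
    my $lookups = 0;
    my $exists  = sub { $lookups++; return $_[0] <= 24996 };

    my $max = highest_valid( $exists, 1, 100_000 );
    print "Highest license number: $max ($lookups lookups)\n";
    ```

    Each probe halves the remaining range, so a couple of dozen form submissions are enough even with a generous upper bound.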

    Here's the program Adam wrote:

    use WWW::Mechanize;
    use HTML::TreeBuilder::XPath;
    use strict;
    my $m = WWW::Mechanize->new();
    $m->get('http://www.co.warren.oh.us/auditor/licensing/dog_search/');
    my @info = ();
    $| = 1;    # unbuffered output, so records show up as they arrive
    for (my $i = 1; $i < 24997; $i++) {
        my $response;
        eval {
            $response = $m->submit_form(
                form_number => 1,
                fields      => {
                    'ctl00$ContentPlaceHolder1$txtlicense' => "$i",
                    'ctl00$ContentPlaceHolder1$txtyear'    => '2011'
                },
                # Send the submit button's name, as a browser would
                button => 'ctl00$ContentPlaceHolder1$btnSubmit'
            );
        };
        if (!$@ && $response->is_success) {
            my $tree = HTML::TreeBuilder::XPath->new;
            $tree->parse($response->decoded_content);
            # Use XPath selectors to find fields in the table
            my $owner_info = $tree->findvalue('//div/fieldset[1]/p');
            my $dog_info = $tree->findvalue('//div/fieldset[2]/p');
            push @info, [$owner_info, $dog_info, $i];
            print "$owner_info|$dog_info|$i\n";
        }
        else {
            warn "WARNING: POST FAILED";
        }
        $m->back();
    }
    

    After that, it was some simple calls to grep to filter the results:

    cat warren_county_dogs.txt |
    grep -i springboro |
    grep -i lab |
    grep -i brown |
    grep -i female > brown_labs.txt
    

    This narrowed down the 25,000 records to 39. That made it easy to visually scan the list and find the addresses that were closest to where the dog was found. That narrowed it down to three. Adam Googled the phone numbers, found that one was a cell, and texted it.

    I texted the first number, explaining I had found this dog on the highway and sure enough, it was the owner! He promptly drove to my house to pick up "Izzy". When he arrived he was very glad to have his dog back but also confused as to how I found his phone number. I told him I "scraped" the dog registration site and left it at that (yeah it's a bit unnerving how easy it is to find information on people).

    Note that if Adam had been using a system that didn't have grep or ack, he could have done the string matching in the Perl program before writing out to the file:

    next unless $owner_info =~ /springboro/i;
    next unless $dog_info =~ /lab/i && $dog_info =~ /brown/i
        && $dog_info =~ /female/i;
    

    He could probably have done the matching with XPath as well, but I am very green on XPath. Such a modification is left as an exercise for the reader.

    Thanks for the story, Adam!

  • Mark Jason Dominus on giving fish

    By Mark Jason Dominus, from a talk in 2003, reprinted here with permission. Sadly, it's still relevant today.

    The #perl IRC channel has a big problem. People come in asking questions, say, "How do I remove the first character from a string?" And the answer they get from the regulars on the channel is something like "perldoc perlre".

    This isn't particularly helpful, since perlre is a very large reference manual, and even I have trouble reading it. It's sort of like telling someone to read the Camel book when what they want to know is how to get the integer part of a number. Sure, the answer is in there somewhere, but it might take you a year to find it.

    The channel regulars have this idiotic saying about how if you give a man a fish he can eat for one day, but if you teach him to fish, he can eat for his whole life. Apparently "perldoc perlre" is what passes for "teaching a man to fish" in this channel.

    I'm more likely to just answer the question (you use $string =~ s/.//s) and someone once asked me why. I had to think about that a while. Two easy reasons are that it's helpful and kind, and if you're not in the channel to be helpful and kind, then what's the point of answering questions at all? It's also easy to give the answer, so why not? I've seen people write long treatises on why the querent should be looking in the manual instead of asking on-channel, when it would have been a lot shorter to just answer the question. That's a puzzle all right.

    The channel regulars say that answering people's questions will make them dependent on you for assistance, which I think is bullshit. Apparently they're worried that the same people will come back and ask more and more and more questions. They seem to have forgotten that if that did happen (and I don't think it does) they could stop answering; problem solved.

    The channel regulars also have this fantasy that saying perldoc perlre is somehow more helpful than simply answering the question, which I also think is bullshit. Something they apparently haven't figured out is that if you really want someone to look in the manual, saying perldoc perlre is not the way to do it. A much more effective way to get them to look in the manual is to answer the question first, and then, after they thank you, say "You could have found the answer to that in the such-and-so section of the manual." People are a lot more willing to take your advice once you have established that you are a helpful person. Saying perldoc perlre seems to me to be most effective as a way to get people to decide that Perl programmers are assholes and to quit Perl for some other language.

    After I wrote the slides for this talk I found an old Usenet discussion in which I expressed many of the same views. One of the Usenet regulars went so far as to say that he didn't answer people's questions because he didn't want to insult their intelligence by suggesting that they would be unable to look in the documentation, and that if he came into a newsgroup with a question and received a straightforward answer to it, he would be offended. I told him that I thought if he really believed that he needed a vacation, because it was totally warped.

    Mark Jason Dominus has been doing Perl forever. He is the author of Higher-Order Perl, which belongs on the shelf of every Perl programmer. Follow him on Twitter at @mjdominus.