Rethinking the interface to CPAN


I've started a group, rethinking-cpan, for discussing the ideas I've posted here. -- Andy

Every few months, someone comes up with a modest proposal to improve CPAN and its public face. Usually it'll be about "how to make CPAN easier to search". It may be about adding reviews to search.cpan.org, or reorganizing the categories, or any number of relatively easy-to-implement tasks. It'll be a good idea, but it's focused too tightly.

We don't want to "make CPAN easier to search." What we're really trying to do is help with the selection process. We want to help the user find and select the best tool for the job.

It might involve showing the user the bug queue; or a list of reviews; or an average star rating. But ultimately, the goal is to let any person with a given problem find and select a solution.

"I want to parse XML, what should I use?" is a common question. XML::Parser? XML::Simple? XML::Twig? If "parse XML" really means "find a single tag out of a big order file my boss gave me", the answer might well be a regex, no? Perl's mighty CPAN is both blessing and curse. We have 14,966 distributions as I write this, but people say "I can't find what I want." Searching for "XML" is barely a useful exercise.

Success in the real world

Let's take a look at an example outside of the programming world. In my day job, I work for Follett Library Resources and Book Wholesalers, Inc. We are basically the Amazon.com for the school & public library markets, respectively. The key feature of the website is not ordering, but helping librarians decide what books they should buy for their libraries. Imagine you have an elementary school library, and $10,000 in book budget for the year. What books do you buy? Our website is geared toward answering that question.

Part of this is technical solutions. We have effective keyword searching, so you can search for "horses" and get books about horses. Part of it is filtering, like "I want books for this grade level, and that have been positively reviewed in at least two journals," in addition to plain ol' keyword searching. Part of it is showing book covers, and reprinting reviews from journals. (If anyone's interested in specifics, let me know and I can probably get you some screenshots and/or guest access.)

BWI takes it even farther. There's an entire department called Collection Development where librarians select books, CDs & DVDs to recommend to the librarians. The recommendations could be based on choices made by the CollDev staff directly. They could be compiled from awards lists (Caldecott, Newbery) or state lists (the Texas Bluebonnet Awards, for example). Whatever the source, they help solve the customer's problem of "I need to buy some books, what's good?"

This is no small part of the business. The websites for the two companies are key differentiators in the marketplace. Specifically, they raise the company's level of service from simply providing an item to purchase to actually helping the customer do her/his job. There's no point in providing access to hundreds of thousands of books, CDs and DVDs if the librarian can't decide what to buy. FLR is the #1 vendor in the market, in large part because of the effectiveness of the website.

Relentless focus on finding the right thing

Take a look at the front of the FLR website. As I write this, the first thing a user sees on the page is "Looking for lists of top titles?" That link leads to a page of lists for users to browse. Award lists, popular series grouped by grade level, top video choices, a list called "Too good to miss," and so on. The entire focus that the user sees is "How can I help you find what you want?"

Compare that with the front page of search.cpan.org. Twenty-six links to the categories that link to modules in the archaic Module List. Go on, tell me what's in "Control Flow Utilities," I dare you. Where do I find my XML modules? Seriously, read through all 26 categories without laughing and/or crying. Where would someone find Template Toolkit? Catalyst? ack? Class::Accessor? That one module that I heard about somewhere that lets me access my Lloyd's bank account programmatically?

Even if you can navigate the categories, it hardly matters. Clicking through to the category list leads to a one-line description like "Another way of exporting symbols." Plus, the majority of modules on CPAN are not registered in the Module List. The Module List is an artifact a decade old that has far outlived its original usefulness.

What can we do?

There have been attempts, some implemented, some not, to do many of these things that FLR & BWI do very effectively. We have CPAN ratings and keyword searching, for example. BWI selects lists of top books, and Shlomi Fish has recently suggested having reviews of categories of modules, which sounds like a great idea. I made a very tentative start on this on perl101.org. But it's not enough.

We need to stop thinking tactical ("Let's have reviews") and start thinking strategic ("How do we get the proper modules/solutions in the hands of the users that want them?") Nothing short of a complete overhaul of the front end of the CPAN will make a dent in this problem. We need a revolution, not evolution, to solve the problem.

17 Comments

First, I should apologize about a bit of a rambling, brainstorming, kind of comment.

The reviews idea sounds like a good one and I can see that the category review idea would be very helpful. However, after following the perl-xml list for years, I suspect that many people won't read the reviews either.

If we want to get far with the reviews idea, I would think we would need to be able to deal with reviews either as pointers to other sites or CPAN-local reviews. There's probably also some benefit in both category-level reviews and module-specific reviews.

The category-level reviews would require a lot of work on the part of the reviewers. Using multiple modules in different ways to develop an opinion of strengths and weaknesses. Even summarizing the opinions of a mailing list like perl-xml would take quite a bit of effort.

Maybe a review system like Amazon supplies added to CPAN would encourage module-specific reviews.

In addition to the reviews, I wouldn't want to give up on the search system. Maybe a module ranking system added to the current search might be handy. This approach is what Google used to become the standard search engine.

Could we begin by generating metrics on modules that would help us improve searching?

  1. Kwalitee
  2. Other modules that depend on it
  3. References to the modules on Perl-specific sites
  4. Number of modules by the same author
  5. Keywords extracted from the POD
  6. Reviews on the review list (once we have some...)

This would give us a semi-automated way to improve search.
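To make the idea above a little more concrete, here is a minimal sketch in Python of how such metrics might be folded into a single search ranking. Everything in it is an assumption made for illustration: the `ModuleMetrics` fields, the weights, and the module names are hypothetical, and nothing here reflects how search.cpan.org actually works.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ModuleMetrics:
    """Hypothetical per-module metrics, one field per item in the list above."""
    name: str
    kwalitee: float = 0.0        # assumed to be normalized to 0.0-1.0
    dependents: int = 0          # other modules that depend on this one
    site_mentions: int = 0       # references on Perl-specific sites
    author_modules: int = 0      # modules by the same author
    avg_review: float = 0.0      # average star rating, 0.0 if unreviewed
    keywords: set = field(default_factory=set)  # lowercase keywords from the POD


# Illustrative weights only; real values would need tuning against real data.
WEIGHTS = {
    "kwalitee": 2.0,
    "dependents": 3.0,
    "site_mentions": 1.0,
    "author_modules": 0.5,
    "avg_review": 1.0,
}


def score(m):
    """Weighted sum of the metrics; counts are log-damped so that one
    very large number does not swamp everything else."""
    return (
        WEIGHTS["kwalitee"] * m.kwalitee
        + WEIGHTS["dependents"] * math.log1p(m.dependents)
        + WEIGHTS["site_mentions"] * math.log1p(m.site_mentions)
        + WEIGHTS["author_modules"] * math.log1p(m.author_modules)
        + WEIGHTS["avg_review"] * m.avg_review
    )


def search(modules, term):
    """Return names of modules whose keywords contain the term, best score first."""
    term = term.lower()
    hits = [m for m in modules if term in m.keywords]
    return [m.name for m in sorted(hits, key=score, reverse=True)]
```

The keyword match decides which modules are candidates at all, and the score only orders them, so a popular but irrelevant module never outranks a relevant one.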

I know none of these is quite the revolution you were talking about, but your article really set me thinking.

The only problem with reviews is that as software changes they quickly go out of date. This is less of a problem with books, because they are slow to have new revisions.



For example, there is a review of one of my modules that gave it a low rating because they thought my api needed work. The next version of the module included updates to the api to address the comments in the review. There is no easy way for me to have that person update their review for the new version. I am stuck with a low rating even though it is for a previous version of my module and I have addressed the concerns of the reviewer.

I think the intention of CPANHQ was to be revolutionary as far as interface and organization goes. I'm not sure how it's coming along, but it seemed to at least be well-architected, code-wise.

-Max

Some of my comments. Each one in its own post.

Regarding "revolution" - I approve as long as we don't lose all the important knowledge we've accumulated so far, and that we won't implement it all at once. It would be easier to build the CPAN Module-Rank<tm> system into the kobesearch source code (and hope that it is implemented in search.cpan.org or finally get the latter's source released), than to try to design something better, bigger and badder from scratch.

I prefer to think of such revolutions as "paradigm shifts" because they completely revolve the way I work with computers or with whatever, but do not interfere with the rest of my life. (In an "invention is the mother of necessity" aspect.) As Andy noted, we need something less incremental, but we shouldn't do something completely different and unfamiliar.

For the record, search.cpan.org is good enough for most things I'm trying to do as it is, and I have its search as a keyword in every browser I'm using on a given system at a given time. We just need to optimise the minority of the cases, where people resort to asking someone for a recommendation, giving up, writing something themselves, or using something else instead of Perl, rather than finding what they're looking for. But most of the basic and useful information that s.c.o/kobesearch provide should stay where it is.

Maybe a wiki could work as a convenient method to gather information about all modules in a 'category': http://www.perlfoundation.org/perl5/index.cgi?form_processing ? In a way the popular consensus on what is good is more meaningful than an 'objective' assessment by an authority of module quality. Popularity is a guarantee for usability in diverse circumstances and also for community feedback and probably more active development. So a wiki with the more 'community' approach should work here quite well.

I apologize for posting offtopic, but I didn't get an answer to the email I sent to the editors at perlbuzz.com almost 3 weeks ago.

I can't read articles with Opera 9.27, all I see is a big yellow page.

Now I've validated the source code and after removing most of the 86(!) errors the W3C validator reported it worked in Opera 9.27.

The important fix was probably adding a closing tag for the <xMTIf> starting tag.

It would be really nice if you could fix the HTML so that people with browsers like Opera can read the page. Especially if it's because the HTML is broken and not the browser (there are a lot of doubled <p> tags and missing </li> tags). Is it Movable Type that generates such poor source code? Even URLs are not correctly escaped (ampersands in URLs need to be written as &amp;).

Again, sorry for being offtopic, but I didn't get an answer to my email.

While reading this I kept thinking about cookbooks...which I think do just what you are talking about (help people find answers to specific problems).

Perhaps building a wiki-style cookbook that included the statistics, reviews, and ratings about each package in a side bar or something (and even include 'related' packages list)?

The top level pages would just be references to more specific questions...so "I want to parse XML" would be a page that really just broke down into 'better' questions..."I want to get a single tag out of an XML file"...and those 'better' questions would lead to a cookbook-like page like I mentioned (or to another page of questions until the user gets to a level worthy of a cookbook-like response)...

This way search would still be the core (I think everyone sort of expects that)...but the community would then be helping you walk through your selection process (via question refinement links)...and in the end, you would also get (hopefully) a decent explanation of at least a (simple) related solution to your problem...along with lots of other details that will hopefully help you decide if it's the right 'thing' for your situation.
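The question-refinement structure described above could be sketched as a small tree of pages. This is only an illustration in Python: the `Page` class and `paths_to_recipes` helper are hypothetical names, and the example recipe text is borrowed from the article's remark that the answer to "find a single tag" might well be a regex.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One wiki page: either a broad question with refinements,
    or a specific question that carries a cookbook-style recipe."""
    title: str
    refinements: list = field(default_factory=list)  # narrower question pages
    recipe: str = ""                                  # answer text, if any


def paths_to_recipes(page, trail=()):
    """Walk the tree and yield (trail of questions, recipe) for every
    page that has a recipe, so each answer keeps its refinement path."""
    trail = trail + (page.title,)
    if page.recipe:
        yield trail, page.recipe
    for child in page.refinements:
        yield from paths_to_recipes(child, trail)


# Hypothetical example content based on the questions quoted above.
root = Page(
    "I want to parse XML",
    refinements=[
        Page(
            "I want to get a single tag out of an XML file",
            recipe="A regex may well be enough for this.",
        ),
    ],
)
```

Keeping the trail alongside each recipe means the sidebar statistics, reviews, and ratings could be attached at the leaf while the user can still see which questions led there.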

Shlomi Fish -> Kevin Marshall : Andy Lester (the original poster of this post) suggested a similar expert system (though one that I believe would require much more AI), and I dare say your suggestion for implementation is clean and simple. We can easily do it using any half-decent wiki engine while re-using an existing wiki instance. I'm not sure it's a panacea for the problem, but it is doable and it is easy to get started and contribute to.

Could become a bit out of date, or vandalised, but since it's a wiki then anyone would be able to update it or correct vandalism.

Good article, and very similar to the thoughts I've been having apropos CPAN and the rest of Perl's web presences.

A new build, making use of new ideas, new conventions and conveniences would be great for Perl.

Any way I can help, just let me know.

I just want to echo G. Wade's suggestion for compiling and posting metrics on modules, especially on his #2:

The number of modules that depend on it.

To me, this would be a simplistic equivalent to Google's PageRank. But it would be so helpful in determining which of a half-dozen roughly equivalent modules I should use.
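As a rough sketch of that "depended-upon" count, assuming we already have a mapping from each module to the modules it lists as prerequisites (the mapping and the module names in the usage note are made up for illustration), the counting itself is short in Python:

```python
from collections import Counter


def count_dependents(dependencies):
    """dependencies maps a module name to the modules it depends on.
    Returns a Counter of how many distinct modules depend on each one."""
    counts = Counter({name: 0 for name in dependencies})
    for module, deps in dependencies.items():
        for dep in set(deps):      # count each dependent at most once
            if dep != module:      # ignore a module listing itself
                counts[dep] += 1
    return counts
```

For example, with `{"App::A": ["Lib::X", "Lib::Y"], "App::B": ["Lib::X"], "Lib::X": [], "Lib::Y": []}`, `Lib::X` gets a count of 2 and `Lib::Y` a count of 1. Unlike real PageRank, this treats every dependent as equally important, which is what makes it "simplistic".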

"To me, this would be a simplistic equivalent to Google's PageRank. But it would be so helpful in determining which of a half-dozen roughly equivalent modules I should use."

Why would that be better than a human recommendation?

As Hammer put it, Don't Automate, Obliterate.

A noble start for making things better for sure.

"Compare that with the front page of search.cpan.org. Twenty-six links to the categories that link to modules in the archaic Module List."



Since I usually know the name of the module I want, I just type it in to the search pane. Which means that it's been years since I looked at the 26 categories and realized how uninformative a display it is. Thanks for jogging my attention to that.

I said:

To me, this would be a simplistic equivalent to Google's PageRank. But it would be so helpful in determining which of a half-dozen roughly equivalent modules I should use.

Andy asked:

Why would that be better than a human recommendation?

I didn't say it would be better, I said it would be really helpful. :-)

I'd prefer to have recommendations and usage metrics. I think the two are complementary.

However, here's the short case for metrics, particularly information on how many other modules/projects depend on a certain module:

1) The more people who rely on a piece of code, the more likely it is that that module will have its bugs found and fixed, its code refactored for speed and its capabilities expanded. That's a generalization that doesn't hold for all cases, obviously, but I think it holds a valuable amount of water.

2) There's no better vote for the usefulness of a module than somebody relying on it for their own published code.

3) Not all modules will get reviews. All modules have usage metrics.


That said, metrics obviously don't tell the whole story. A depended-upon metric has the following pitfalls:

1) Favors older modules over newer ones. Your brand new xyxfg.pm may be light years beyond vvgmk.pm, but if the latter's been out for 5 years and yours has only been out 3 months, your metrics will stink, comparatively.

2) Doesn't tell you if the module actually does what you need it to. Just because a bunch of other folks use it, doesn't mean it has the exact capabilities you need.

That's why I'd vote for metrics and reviews. Combined, they offer a much more complete picture than either does alone.


So I was thinking about Google's "top secret" algorithm the other day, when I pondered the comments for hit number 3 on the query "gogopuffs". It occurred to me that in addition to hit count, you could use "last hit" to further refine the popularity of an item.

In this case, I did not click on hit number 3, but did click on hit number 1. So hit 1 deserves extra priority since it met my needs. Now if I had clicked on hit number 2 also, then hit 1 really wasn't cutting it for me. So should hit 2 get a popularity boost because of my clicking order?

The algorithm would be something like this:
Track session with cookie or some creative get/post variable. When the user queries something like 'XML', you could assume that the last one they click on was more useful, because they stopped clicking on search results.


Or maybe you do a reverse logic where you don't improve the last hit's popularity score so much as reduce the non-last hits' popularity, since the user kept clicking.
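A minimal sketch of that "last hit" idea in Python, assuming the session tracking has already produced, for each search, the query and the results clicked in order. The function name, the `boost` and `penalty` parameters, and their default values are hypothetical; this is one way to read the comment above, not a description of any real search engine.

```python
from collections import defaultdict


def score_clicks(sessions, boost=1.0, penalty=0.0):
    """sessions is a list of (query, [results in the order clicked]).
    The last result clicked for a query earns `boost`; every earlier
    click loses `penalty`, on the theory that the user kept looking."""
    scores = defaultdict(float)
    for query, clicks in sessions:
        if not clicks:
            continue                      # nothing clicked, nothing learned
        *earlier, last = clicks
        scores[(query, last)] += boost
        for result in earlier:
            if result != last:            # a repeat visit to the last hit is not penalized
                scores[(query, result)] -= penalty
    return dict(scores)
```

Setting `penalty=0.0` gives the first variant (only reward the last hit), and setting `boost=0.0` with a positive `penalty` gives the reverse logic (only demote the hits that did not end the search).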

Another idea would be to outsource. I know the temptation is to do it in house, but any chance we could leverage Google's engine to build some sort of smarter query process?

Hi,


I've commented on my own blog about this, http://www.simplicidade.org/notes/archives/2008/04/rethinking_cpan.html.

Executive summary: start with an iusethis-style site for modules. Let people create their own module list, the one they use on a regular basis.

Best regards,

PS: I could not add a trackback, HTTP 403 error.

A setup similar to macupdate.com might be a good way to go, but with more search-list ordering options. Finding things on that site is a fairly enjoyable and informative experience.
User reviews are easily read and tend to be very helpful, with the current build listed with the review, so it's clear what build they are commenting on. Combine that with the quick 'what's new' update notes by the developers and it's very useful.
