Going from comic to comic is more tedious than it should be, especially with a browser groaning under the load of the ad networks that support the comics (more relevant in 2003 than today, of course.) Also, why impose expectations of schedule on the author, even unvoiced, when the computer can tell me when there's an actual update?
Eventually piperka came along, and it does a good job at this, shared among many users (so it saves much more time in aggregate than any single-user personal effort ever could.) Some of the code is documented here.
Most recently, switched to conkeror+firefox, and now hit C-c C-n in the web browser to get the next comic. The same queue is fed by a one-liner in ~/.snownews/browser:
curl -Hurl:\ %s -d "" http://localhost:3383/push_url
so urls from that are queued up the same way (see ../firefox.)
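For reference, a minimal sketch of what the queue receiver behind that one-liner might look like, assuming it just appends the url header from each POST to a queue file; the real service on port 3383 isn't shown in these notes, and the handler and file names here are made up:

    # Sketch of a queue receiver like the one the curl one-liner talks to.
    # Assumptions: the "url" request header and the queue-file path are
    # illustrative only, not the real service's interface.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    QUEUE_FILE = "comic_queue.txt"  # hypothetical storage for queued urls

    class PushURLHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/push_url":
                self.send_error(404)
                return
            url = self.headers.get("url")
            if url:
                with open(QUEUE_FILE, "a") as f:
                    f.write(url + "\n")
            self.send_response(204)  # nothing useful to send back
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 3383), PushURLHandler).serve_forever()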
Still should do some kind of auto-training; the simplest algorithm is to fetch the whole page, then fetch it again some small T (an hour? 5m?) later, see what changed between the two fetches, and ignore those parts from then on. Keep a record (vector?) of the whittling, and use that instead of check_maybe. Possibly pick a threshold of change and display it to me - and let me vote on same vs. different...
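A minimal sketch of that idea, comparing line sets between two close fetches; the real thing would likely want to diff parsed HTML rather than raw lines, and the function names here are stand-ins rather than the script's actual API:

    # Sketch of the auto-training idea: fetch the same page twice a short
    # interval apart, treat lines that differ between the two fetches as noise
    # (ads, counters), and fingerprint only the stable lines afterwards.
    import time
    import urllib.request

    def fetch_lines(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", "replace").splitlines()

    def train_noise(url, delay=300):
        """Return the set of lines that churned across two close fetches."""
        first = fetch_lines(url)
        time.sleep(delay)
        second = fetch_lines(url)
        return set(first) ^ set(second)   # symmetric difference = churn

    def stable_fingerprint(url, noise):
        """Fingerprint of the page with the known-noisy lines removed."""
        return hash(tuple(l for l in fetch_lines(url) if l not in noise))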
Inspired by a discussion (on raw?) it occurred to me that running a single "common" aggregator might actually be useful for me and be sharable, saving polling bandwidth.
User interface: a "next comic" bookmarkable link. You hit it, it authenticates you and finds your set, and redirects you to the next available comic. You might have different links (cgi arguments) for subsets or priorities, maybe even for triggers (or at the very least, the "no more" page can have some features, like add, rearrange, "unread", force another scan, etc.)
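A rough sketch of that redirect as a tiny WSGI app, with the auth and the queue store faked; everything here is a placeholder, not the eventual implementation:

    # Sketch of the "next comic" bookmarkable link: identify the user, pop
    # their next unread comic (optionally from a subset passed as a cgi
    # argument), and redirect there; an empty queue falls through to the
    # "no more" page that could carry the add / rearrange / rescan controls.
    from urllib.parse import parse_qs
    from wsgiref.simple_server import make_server

    UNREAD = {("alice", "all"): ["http://example.com/comic1",
                                 "http://example.com/comic2"]}

    def next_comic(environ, start_response):
        user = environ.get("HTTP_X_USER", "alice")   # stand-in for real auth
        subset = parse_qs(environ.get("QUERY_STRING", "")).get("subset", ["all"])[0]
        queue = UNREAD.get((user, subset), [])
        if queue:
            start_response("302 Found", [("Location", queue.pop(0))])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"no more comics - add, rearrange, mark unread, force another scan\n"]

    if __name__ == "__main__":
        make_server("localhost", 8080, next_comic).serve_forever()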
Then the question is - should similar processing be used as an rss inflow mechanism... it'd be a good place to put the categorizer...
Other scraping:
curl http://news.bbc.co.uk/2/hi/uk_news/magazine/default.stm | sgrep 'attvalue("*") in attribute("HREF") in (elements parenting "10 things")'
extracts a URL I'd like to fetch when it changes, which should be weekly.
Some fixes from kcr that deal with running this for the first time; some new comics; some quoting horror that I should work around (or get the upstream site to fix; '' are not valid attribute quotes...)
Switching to FancyURLopener had the unexpected side effect of not raising an exception for 304 anymore; worked around that, and also treat etag-not-changed as a query, just like any other not-matched case.
Implemented Jarno Virtanen's etag/last-modified example, since that causes even less load on the target server, if it supports them. Switched the db entries to hashes - even though we have to explicitly read and write them, that fits the commit model well enough, and lets us add new verbs later.
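A minimal sketch of that conditional GET, assuming the per-comic hash carries etag and last-modified slots (the real db entries and verbs aren't spelled out in these notes):

    # Sketch of the etag / last-modified handling: send back whatever
    # validators the server gave us last time, and treat a 304 as "no change".
    # The entry dict stands in for the per-comic hash in the db.
    import urllib.request
    import urllib.error

    def conditional_fetch(url, entry):
        """entry may carry 'etag' and 'last-modified' from the previous fetch."""
        req = urllib.request.Request(url)
        if entry.get("etag"):
            req.add_header("If-None-Match", entry["etag"])
        if entry.get("last-modified"):
            req.add_header("If-Modified-Since", entry["last-modified"])
        try:
            with urllib.request.urlopen(req) as resp:
                entry["etag"] = resp.headers.get("ETag")
                entry["last-modified"] = resp.headers.get("Last-Modified")
                return resp.read()      # changed, or server ignores validators
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None             # not modified - minimal load on the server
            raise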
Also implemented a "summary" verb, which tells us that we have 57 entries, 26 have etags, 28 have last-modified times - so this change makes a difference on more than half of the comics sites...
added KeyboardInterrupt test, to cleanly abort the right way.
Upgraded to NNWLite 1.0.5fc1, but it still seems to only notice one
change per batch.
added lastBuildDate, since NNWLite is still not noticing changes.
add an explicit IOError case, since most get_content errors are of that form. add a check for the-gadgeteer.com changes, based on a regexp.
looks like having multiple identical links with different titles isn't enough for nnw to call them different - so now we actually parse the old items, filter them against themselves (as a migration step, but no reason not to keep it), and then have write_item nuke anything it is supplying. This automatically bounds the size of the feed, too.
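Roughly, the dedup looks like this; the item structure and names are illustrative, not the feed code's real API:

    # Sketch of the dedup step: drop duplicates from the old items, then let
    # the new batch replace any old item with the same link, which also keeps
    # the feed bounded.
    def merge_items(old_items, new_items, key=lambda item: item["link"]):
        seen = set()
        deduped_old = []
        for item in old_items:             # filter old items against themselves
            if key(item) not in seen:
                seen.add(key(item))
                deduped_old.append(item)
        new_keys = {key(item) for item in new_items}
        kept_old = [i for i in deduped_old if key(i) not in new_keys]
        return new_items + kept_old        # new entries first, old survivors after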
51 out of 56 - but a typo ate the rss, so now it needs to use a temp file. (Turns out to be a trivial change to the rssfile constructor and destructor.) Also looks like we need separate times for "last changed" and "last seen". This leads to implementing a "fix" command to update the database format. After commenting out a couple of pages that don't actually lead to comics anymore, we're now up to 53/53, 36/53 going through check_maybe and the rest being (currently) singletons. Good enough for now...
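The temp-file change could look something like this, assuming rssfile just needs to write somewhere safe and swap it into place on close (the real rssfile interface isn't shown here):

    # Sketch of writing the feed via a temp file so a crash mid-write can't
    # eat the rss: write to <name>.tmp, then atomically rename over the real
    # file only once the write finished cleanly.  The class is a stand-in.
    import os

    class RssFile:
        def __init__(self, path):
            self.path = path
            self.tmp_path = path + ".tmp"
            self.fh = open(self.tmp_path, "w")

        def write(self, text):
            self.fh.write(text)

        def close(self):
            self.fh.close()
            os.replace(self.tmp_path, self.path)   # atomic on POSIX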
Dropped the rest-time to 6 hours. Implemented insert-at-start mode of dealing with an existing RSS file. Implemented a few more handlers, also just repointed a few start-urls to help urlbase work (47/57 now, and the remaining ones are kind of odd.)
4.5 hours later, it handles 38/55 comic sites, including one with frames (helen.) It may pick up a few more if keenspace gets unhosed tomorrow. The ad-hoc ones could probably be unified a bit more, a few of them could probably use a "this year" match string. pillars turned out to be more easily handled just by checking the title, and I suspect some others fall to that as well.
It should probably do an insert-and-prune on the rss file, so that I can run it at will.
comics page to rss feed - not to scrape the comic itself, just to handle notification of changes.
given a comic page url, pick a regexp to find the "this comic" link. If that match changes, generate an rss "item" pointing to the comic top level, or maybe the archive page if we have a way to do that.
have generic "process_keenspace" functionality.