Rants, Sermons, Received Wisdom. I don't drink enough to insist on telling these over a beer, but they Need To Be Told...
C privilege
Previously I went on at length about the flaws in the language-based package manager model; coincidentally I got to see some of the background/sausage-making1 of a Debian attempt to reconcile certain pip vs. dpkg issues. While I have a lot of respect for the people involved and the effort they're putting into the problem, I stood (quietly) by my belief that the whole space is wrong... and then a digression about C led someone2 to state that
it could be argued that distros packaging is the C package manager
which collided with a few other ideas floating around in my head:
career_limiting_memcpy
relatime (the relatime story specifically, as an instance of a solution that implies a class of approaches)
All of this clicked into shape as a pattern of privilege - not to co-opt the language of the very real problems the industry faces, but just to look at the "anti-pattern" of the conflict between distro and language package management from a perspective that makes some sense of why language package managers refuse to die off - it's not just that the people who work on them are persistent and/or stubborn. Certainly it turns the question around: given all the other language-specific package managers, where is the C one? And, as with social privilege, once you start looking for it, you realize that maybe "you're soaking in it" - and that Unix has been, since the 1970s, an extended support system for the C language, so of course any "distro packaging system" is going to be heavily C-program biased.
Now is that a helpful epiphany? It has the initial smell of "reduced to a previously unsolved problem", but it does suggest that attempting to credit C with its proper share of the packaging burden might be productive; that leads to the idea that instead of "why would a language have anything intelligent to say about where files get installed", in fact the language may have one set of things to say, and the distro has a different set of things to say, and while we need them to mesh, they are more likely (today) to grind. It also suggests the fun mental exercise of considering a world, perhaps not without C but one where you assume your language of choice (yes, Python) as a starting point - maybe not as deeply special as the Lisp Machine, but it could still lead to some implementable insight.
"If you like law, or sausage, best not watch them being made" - not actually said by Otto von Bismarck, according to wikiquote. ↩
Pretty sure it was @schmichael but I could be mistaken. ↩
Language-based Package Managers are Wrong
I have some obvious biases (having started as a Debian Developer back when that meant "you've done some interesting work so Bruce added you to the mailing list"1) but I'd like to make it clear up front that fundamentally, packaging is a good thing - there really is a very important abstraction of "This named piece of software is composed of these files with these properties" with patterns of interaction like "how to upgrade", "how to remove", "how to get automatically started at boot time" and beyond.
Unix got there early - tarballs were never enough, but Solaris (among others) had "System V packages", and the early days of Linux included a distribution called BoGuS that didn't include the software at all - just a snippet of text that had a download command, a build command, and any renaming necessary to "fix" the output of make install - baby steps, but you probably recognize the immediate descendant of BoGuS, called RPM...
When Java came along, it was clumsily cross-platform - "write once run everywhere" in the marketing, "write once run screaming" to developers - the JVM served as a lowest-common-denominator interface to the operating system and didn't really fit Windows or Unix very well (what were you thinking, Sun? Couldn't even do Unix domain sockets until a decade later?), which meant that a number of things that were traditionally part of the operating-system-as-platform got reinvented (usually at lower fidelity) inside the JVM-as-platform. Of course, there was a lot of torque (and money) behind this, and "jar" files are just zip files with a little metadata, the namespace was already there, and entire infrastructures got built around deploying those jar files and connecting them up with things. (None of this improved the interaction with the operating system underneath; to this day manually tuning JVM memory usage is still a thing, and the "bootable JVM" model fell by the wayside.)
On an entirely different path, one of the secondary pillars of perl's success was CPAN - because it provided search and taxonomy and hierarchy, so that you could find code without too much work, and you were guided somewhat naturally into publishing code in namespaces where people would discover it (search by itself is not enough, you still need to be able to guess that the solution you're looking for actually exists and enough about what shape it is to come up with a useful query.) Having all of this in one place made it easier to propagate consistent practices for building and installing your code (the namespace alone helped a lot, Makefile.PL helped the automation along.) There wasn't a lot of pressure for deep integration here; most CPAN consumers were solving problems, not building systems, so being able to grab a particular package and drop it in place was enough. Debian picked up some automation around turning CPAN metadata into Debian metadata fairly early - in the form of tools that produced a "first draft" package layout that you could refine into something that was good enough, because inherently, CPAN wasn't going to have information on a number of distribution-specific concerns; that was information supplied by the installing sysadmin - or, in the case of a decent packaging job, supplied by Debian itself, often in the form of distribution-wide tools used by packages of all forms (uid management, documentation management, cron and inetd management, etc.)
Note that at this point, packaging was a feature of technical-quality operating systems2... Solaris, Debian, Redhat... while desktop operating systems like Windows3 and Mac OS4 left it out entirely, instead having third-party "installer" models with poorly managed central "registry" metadata and no real standards for installation-level cooperation among tools (see DLL Hell as one of the results of this approach.) Eventually Visual Studio started including installer-builder tools, which improved the overall story but didn't help with combining separate software installs that worked together. Add to this the desire in the various language communities to include beginners on Windows, and the path towards "doing it yourself" becomes quite justifiable - which led to Ruby "gems" and subsequently to Python "eggs". While "eggs" helped enable dropping a copy of a Python module into each of the various Python installations a Windows box would end up with, it interfered with packaging on Debian for quite a while.
Even if the existing language managers improved enough to actually be good at the job they've set out to do (and I know good people putting serious work towards this goal) they would still be "wrong" - because the idea that an advanced software project would be entirely served by a single language has continued to fall over in the real world. Even the vaunted Lisp Machine had as a selling point5 that it could do cross-language debugging... defense projects that are locked in to Ada95 still bring in tcl or python for tool-wrangling... even the 3D printing world has data formats that are complex enough that they themselves can reasonably be considered languages.
While it is both entertaining and educational to recast your entire toolchain in one language (I've done big chunks of this in perl and later python, as have many others; I suspect joeyh is already halfway there in Haskell), any programming language is expressive in a specific way - if your problem is at all challenging, there may be parts of it that are better expressed in other languages. The products I've worked on over the last decade went from perl/shell/C++ to Python/C++ to Java/Python/C++ to Java/R/Python with a little Scala - in an environment with a lot of pressure to reduce complexity and to especially avoid engineer retraining or replacement. I suspect that most projects that survive in a single language simply aren't done yet, or don't have the kind of delivery or performance pressure that leads you to recognize the presence of domains of expressiveness in which an alternative language can be vastly more effective. Given that a single language really can't solve your entire interesting problem, how could a single-language packaging system ever be enough to deploy it?
While this is expressed as a disclosure of bias, it's also a declaration that I really have been part of the problem for That Long, and a slight justification of the lack of cited references - this is a blog, not a paper, and this is far more about uncorking some accumulated experience to make a point than it is about proving one. (That said, I was around for all of this...) ↩
Not mentioning the BSDs because they spent this period trailing about 5 years behind Debian in the packaging space; as such they didn't really influence the trajectory of history in this area. ↩
Only in 2014 is Windows finally getting "modern" packaging systems, with OneGet and Chocolatey NuGet being integrated into PowerShell. ↩
Apple basically followed a slightly more sophisticated "installer" path, keeping around manifests that were reminiscent of Solaris packages, using disk images (DMGs) instead of zip files or tarballs, and providing a central tool to manage permissions and locations; there was a flurry of third party installers as well, for a number of years - if they hadn't died off naturally, the up and coming App Store model would have killed them for good. ↩
Literally a selling point - there were sites that bought Lisp Machines specifically to run C code on them, because the Lisp Machine C environment was so strict about otherwise undefined behaviour that getting it to compile at all resulted in fixing bugs you didn't (yet) know you had, far beyond what lint could do at the time, and easily as effective as valgrind is today. ↩
Correctness is a Constraint, Performance is a Goal
The recent brouhaha over the change in glibc memcpy behaviour (which broke some notable bits of non-open-source code, like Adobe Flash) suggests that while it might have made some loose sense "back in the day" (note that "the day" was before memmove was introduced at all; my recollection is that the first memcpy that wasn't overlap-safe was an early SunOS for SPARC version - which caused lots of pain when porting 68k-SunOS code in the early 1990s) the thing that should have been done (and we should do now) is to make that "undefined" behaviour simply be abort().
In the days of SPECINT92, the raw speed of very short memcpy calls might have mattered enough to strip off the tests - but we're talking two compares (which should have very good branch prediction properties, and which compilers that are already inlining should be able to resolve in many cases anyhow.) I do appreciate having memcpy be efficient per byte moved, after having it not scramble data.
For those who believe the speed matters - well, we could leave behind a career_limiting_memcpy that doesn't have the checks, and see who's willing to justify using it...
Exercise for the reader: take Linus' sample mymemcpy from bug 638477 and add the abort-on-overlap test to it - then run chrome, firefox, or another large project of your choice. Note that every time it aborts, you've probably found a reportable bug -- but it's a lot cheaper than running it under fullscale valgrind.
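To make the exercise concrete, here's roughly the shape such a checked memcpy could take - a sketch only, not Linus' actual code from the bug report; the byte loop stands in for whatever optimized copy you'd really use, and all the names here are mine:

    // overlapcheck.cpp - an overlap-aborting memcpy, suitable for LD_PRELOAD.
    // Sketch only: the naive byte loop stands in for a real optimized copy.
    #include <cstddef>
    #include <cstdlib>

    extern "C" void *memcpy(void *dest, const void *src, std::size_t n)
    {
        const char *s = static_cast<const char *>(src);
        char *d = static_cast<char *>(dest);

        // The two compares: do [src, src+n) and [dest, dest+n) overlap?
        // (Exact self-copies are tolerated here just to cut down on noise.)
        if (n != 0 && d != s && d < s + n && s < d + n)
            std::abort();

        for (std::size_t i = 0; i < n; ++i)   // naive forward copy
            d[i] = s[i];
        return dest;
    }

Build it as a shared object (g++ -shared -fPIC) and point LD_PRELOAD at it before launching the victim process; keep in mind that compilers inline many small fixed-size copies, so only calls that actually reach the libc symbol get checked.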
Premature Optimization and Object-Oriented Blame
The year is 1989. A small project team, working on a C++ project (at a time when "5 years of C++ experience" meant your name was Bjarne or Andrew): two adventurous types who'd used the language on toy projects, one experienced but not terribly detail-oriented developer, and an enthusiastic newbie. Rush-rush application, starting from a throwaway prototype in smalltalk that showed great UI but 1% of the required performance; thus, sufficiently neophilic management to buy into the idea of doing rapid development with brand new tools - or maybe they felt that if we were wrong, we'd at least fail quickly too :-)
About two months in, the UI is going well (one of the things the C++ object model was actually quite well suited for is graphics classes - something like the hierarchy sketched below, with maybe another layer of input handling broken down along the same geometric hierarchy; oh, and replacing drawable with implementations in SunView, Domain, and X11 in about a day made us very popular) but we're starting to load enough data to see slow-to-reproduce, difficult-to-narrow-down bugs in the underlying data structures.
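The shape of it was something like this (a reconstruction with invented names - drawable, widget, and friends are illustrative here, not the project's actual classes):

    // Illustrative reconstruction - the names are invented, not the real ones.
    // A geometric widget hierarchy painting onto an abstract surface, so that
    // swapping SunView/Domain/X11 means swapping one drawable implementation.
    class drawable {                        // abstract drawing surface
    public:
        virtual ~drawable() {}
        virtual void line(int x0, int y0, int x1, int y1) = 0;
        virtual void text(int x, int y, const char *s) = 0;
    };

    class x11_drawable : public drawable {  // one concrete backend per window system
    public:
        void line(int, int, int, int) { /* Xlib calls */ }
        void text(int, int, const char *) { /* Xlib calls */ }
    };

    class widget {                          // geometric hierarchy of UI pieces
    public:
        virtual ~widget() {}
        virtual void paint(drawable &d) = 0;   // every widget draws itself
    };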
After some in-fighting over whose classes were the problem (the UI side or the data-wrangling side, different developers) we come upon our first significant application of the true power of C++: object-oriented blame. Since the UI classes and the data classes talked to each other over a very narrow interface, it was easy to stub out one or the other with a "stupid but obviously correct" implementation for testing... UI classes that just printed text instead of drawing anything (see the sketch below), data classes with hard-coded values, that sort of thing. Not only does this technique work for finding bugs - in 2006, we call it Mock Object Testing - but it has some psychological benefit for the programmers involved, in terms of keeping the mistake-finding more reality-based.
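In code, the stub side of that arrangement might look like this (building on the drawable sketch above; again, illustrative rather than the real thing):

    #include <cstdio>

    // A "stupid but obviously correct" drawable: prints instead of drawing.
    // Putting this under the UI classes (or hard-coded values under the data
    // classes) pins down which side of the narrow interface is misbehaving.
    class logging_drawable : public drawable {
    public:
        void line(int x0, int y0, int x1, int y1) {
            std::printf("line (%d,%d) -> (%d,%d)\n", x0, y0, x1, y1);
        }
        void text(int x, int y, const char *s) {
            std::printf("text at (%d,%d): %s\n", x, y, s);
        }
    };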
In this story, the suspected data structure was a doubly linked ordered list, with a cached "recent" pointer to (attempt to) speed up insertion. (Remember, this was 5 years before the SGI STL release...) That's kind of complex, and the performance aspects were entirely speculative. Since a test harness would have been on the "wrong" side of the argument (that is, the arguments that it "wasn't the data structure" would have been wielded against the tests too; yes, we know better today) the alternative was a testing-only implementation of a simple vector based on realloc and memcpy. Completely naive implementation (on the principle of making it easy to test): no buffer gap or indirection, every insert was a realloc and a data move, every reference was a linear search - something like the sketch below.
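Roughly this shape, as a sketch (record and its fields are invented for the example, error handling is omitted, and since the in-place slide overlaps itself, the sketch uses memmove for that step while realloc does the growing):

    #include <cstdlib>
    #include <cstring>

    // Sketch of the "obviously correct" replacement: a flat, ordered array
    // with no spare capacity and no cleverness.
    struct record { int key; double value; };

    struct record_vec {
        record      *data;   // single realloc'd block
        std::size_t  count;
    };

    // Every insert grows the block and slides the tail over to keep order.
    // (The slide overlaps itself, so it has to be memmove, not memcpy.)
    void insert_sorted(record_vec &v, const record &r)
    {
        v.data = static_cast<record *>(
            std::realloc(v.data, (v.count + 1) * sizeof(record)));
        std::size_t i = 0;
        while (i < v.count && v.data[i].key < r.key)   // linear search for the slot
            ++i;
        std::memmove(v.data + i + 1, v.data + i, (v.count - i) * sizeof(record));
        v.data[i] = r;
        ++v.count;
    }

    // Every lookup is a linear scan - trivially correct, and (as it turned
    // out) plenty fast in practice.
    record *find(record_vec &v, int key)
    {
        for (std::size_t i = 0; i < v.count; ++i)
            if (v.data[i].key == key)
                return &v.data[i];
        return 0;
    }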
As might be expected from the tone of the story, dropping in the naive implementation caused the bugs to vanish, and pretty effectively ended the argument... but it turned out to be a bigger hammer than expected: after finally fixing the bugs in the original data structure, some "large scale" data load tests could finally get run. They took... rather a long time. Swapping in the naive implementation solved that problem too - graphing the two showed an apparent difference in asymptotic order! Empirically, the memcpy-based implementation took time linear in the quantity of input data, while the list was at least quadratic. This was probably due to the underlying realloc doing bucketing for us, and having most of its overhead be per-object rather than per-byte, so the single-block technique won out over many smaller allocations.
At the time, the idea that "this application is all about manipulating this set of data in this way - therefore that needs something sophisticated enough to perform well" made sense. Perhaps we can be forgiven for this - after all, the popularization (by Knuth) of Hoare's dictum about premature optimization also dates to 1989. But clearly the smarter approach would have been to take our knowledge about the structure of the problem and conclude that "therefore the data performance should be measured more closely."
This is a story I've told a number of times in person, and it has held its relevance for over 15 years.
There Are No Mysteries
I believe (and argue) that there are no mysteries in computing. If "something strange" is happening - especially to a system you've built - it is always possible to identify the problem. To put it another way, claiming that a problem is mysterious is simply an admission of personal failure to be curious enough. (I distinguish that from "not having enough resources" because when people say something is mysterious, they don't mean that they're choosing not to look, they mean that looking is pointless - and, of course, they're wrong :-)
I got my start in computing with a book on 8085 Hex Machine Code, in an era when Radio Shack still carried chips and people still built things with them. When the TRS-80 came out, it was made of chips like that, and you could get easily comprehensible schematics. I learned to solder and use a voltmeter and an oscilloscope before I learned to code.
My first "real" job began with porting a device driver for a 24M hard drive from 8085 to Z80, on an upgraded IMSAI. Serial terminals, a Diablo daisy wheel printing terminal, the front panel later made popular in WarGames. We wired a lot of our own cables. One particular memory is of having wired a serial cable through the wall to the Diablo in the other room, trying to print from WordStar, and having the whole system hang.
Today, the response would probably be to pound on the keyboard a bit and reboot something. Then, it was more arcane, but vastly more effective:
That was just a local patch - I suspect we fixed the cable after that. Note that we weren't experienced old hands - we were two high school kids in a summer job - but we knew what was going on at a very low level, and thought this was entirely reasonable. (Note that we weren't only bit-banging hackers; we also wrote dBase apps, sold software and did telecom planning...)
One lesson you could take from this is "wow, glad that's someone else's problem this century." It is obviously useful to be able to sweep all of that under the rug of Abstraction - right up to the point where it fails to work... I think a more powerful lesson is that all of the interesting debugging is down a layer from where you're actually trying to work - abstractions are great when things work, and have to be the first thing out the window when they don't (this is also a reason that proprietary layers hurt, always.)
Another lesson is that we can find out what's going on, even in the face of obvious complexity. There are more abstractions (more layers and more breadth), but they can still be pried open. Thus, There Are No Mysteries.
(There are certainly cases where you've lost/discarded too much information to find out what happened. And this is limited in scope to computing, though I personally believe it's much more broadly applicable - as an approach, a philosophy, a mental model - "mystery means you haven't bothered to look hard enough". I think it's worthwhile to take that attitude.)