The limits of my abilities…

Posted by Elf Sternberg as chat

In the cellar was a tunnel scarcely ten yards long, that had taken him a week to dig. I could have dug that much in a day, and I suddenly had my first inkling of the gulf between his dreams and his powers. — H.G. Wells, The War of the Worlds

The past week, work has been slow so I’ve had a little time to work on updating my git pre-commit hook. For a while, I actually had a “hack” wrapped around it to run it from the command line, so I could see what was failing before I tried to commit it. I realized as I was working that I was basically writing a lint hook, and have since changed the project’s name to reflect that: git-lint.

The problem is that working on projects like this and Polyloader has taught me that the gulf between my dreams and my powers is enormous. It took me a week to refactor pre-commit into something with actual command line arguments, an external configuration file, and policies to implement, as well as adding the ‘dry run’ and ‘sort order’ capabilities, things the pre-commit version doesn’t really need. Obviously, I’d like to know and do a lot more with my non-professional development life. But finding the time is hard, and frankly, when I’m done working on code at work I really don’t have the brains left to write, draw, or code at home.

I remain committed to a few basic ideas: that there’s too much code in the world; that 99% of what we do is translating from formats that are human-comfortable to those that are machine-ready and back; that we can and should make as much of that work declarative; and that even interpreted languages should invest heavily in pre-processors to remove new scopes where none are needed, inline where possible, and exploit the CPU to the best of any human ability.

I know, I’m not helping by writing more of it.

I need to get rich enough to stay home and hack all day. That’s the answer. And I do; just ask my long-suffering wife, who bemoans my willingness to spend all Saturday in front of the computer, geeking out.

University computer science libraries have fallen into a sad and tragic state. I went by the University of Washington Engineering Library. When I was at Isilon, that library was a kind of miracle; if we needed to know anything interesting going on in the world of data management, you could go to the library and find a raftload of interesting papers, digest them over the weekend, and be ready with some new trick for Distributed Dynamic RAID by Monday. It was a thrilling place to enter.

And you could make photocopies of the really interesting stuff.

I went in there recently and discovered that the stacks haven’t changed. I mean that literally. Most of the interesting CS journals have moved entirely on-line, and there wasn’t a single collected journal available dated after 2007. There was a wall labeled “Computer Manuals” that had books covering the initial industrial release of SQL. There was a general damp, fusty smell to everything.

There was one lone machine against the wall where you could survey and, if you had the time, read all the papers the world had waiting, tens of thousands of articles, conference submissions, books, precis, even patents. But you couldn’t print anything out and, since I’m no longer a student, I couldn’t mail copies to myself.

There’s more interesting stuff happening in the world now than there was a decade ago, but the academic CS journals are working ever harder to lock it up and “protect” it from the prying eyes of industry. And that’s a damned shame.

I can’t remember where I found it, but there was a brilliant explanation of how functional code maps values. Remember, in a functional program, the basic notation is x → y; that is, every function maps a value x to another value y. Things like map() map an array to another array, while reduce() maps a single thing (an array) to another single thing (a value). How does functional programming encode other things?

Well, there’s

x → y
x is mapped to y
x → y∪E
x is mapped to y or Error (Maybe)
x → P(y)
x is mapped to all possible values of y (Random Number Generators)
x → (S → y × S)
x is mapped to a function that takes a state and returns a value and a new state (State)
x → Σy
x is mapped to the set of all real-world consequences (IO)

The other day I realized that there’s one missing from this list:

x → ♢y
x is mapped to y eventually (Promises)

I’m not sure what to do with this knowledge, but it’s fun to realize I actually knew one more thing than my teacher.  Note that the first case, x → y, really does cover all sum (union) and product (struct) types, which tells me that the ML-style languages’ internal type discrimination features are orthogonal to their encapsulation of non-linear mappings.
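For readers who think better in code, here is a rough Python sketch of that list. Every function name is invented for illustration, None stands in for the Error case, and a set stands in for "all possible values"; it is one way to picture the shapes, not a faithful encoding of the original explanation.

```python
import asyncio
import math
from typing import Callable, Optional, Set, Tuple


def double(x: int) -> int:
    """x → y: a plain mapping."""
    return x * 2


def reciprocal(x: float) -> Optional[float]:
    """x → y ∪ E: a value, or None standing in for the error case."""
    return None if x == 0 else 1 / x


def integer_roots(x: int) -> Set[int]:
    """x → P(y): every integer whose square is x, all at once."""
    if x < 0:
        return set()
    root = math.isqrt(x)
    return {root, -root} if root * root == x else set()


def labeller(prefix: str) -> Callable[[int], Tuple[str, int]]:
    """x → (S → y × S): returns a function from state to (value, new state)."""
    def run(counter: int) -> Tuple[str, int]:
        return f"{prefix}-{counter}", counter + 1
    return run


def announce(message: str) -> Callable[[], None]:
    """x → Σy: returns an action whose effect lands in the outside world."""
    def action() -> None:
        print(message)
    return action


async def double_later(x: int) -> int:
    """x → ♢y: the value arrives eventually."""
    await asyncio.sleep(0)  # hand control to the event loop once
    return x * 2


# The state case threads the counter through by hand.
label, next_counter = labeller("tag")(7)

# The promise case needs an event loop before a value appears.
promised = asyncio.run(double_later(21))
```

Notice that only the plain case can be called and used without any ceremony; each of the others makes the caller unwrap, pick, thread, run, or await something.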

The really weird thing is to realize that the last four are all order-dependent.  They’re all about making sure things happen in the correct sorted order (and temporal order, if that matters).  That leads me to think more about compiler design…


Menders, Makers, Mentors

Posted by Elf Sternberg as chat

Andrea Goulet is giving me an existential crisis. The CEO of a software development consultation shop, she recently wrote an article called Menders vs. Makers, and something happened this week that makes me think, maybe I’m in the wrong line of work. I’m starting to suspect I’m a mender in a business that only values makers.

This week, I was working on a code base that provided a hierarchical tag editor for an inventory system. I had recently added a new feature that made it possible to see individual elements of the tag system on the Collection page; you no longer had to go visit a single object to see if it had, for example, a location tag; you could just say on the Collection page, “Show me all the objects that have a location tag, and add a new column, location.”

Now that we were able to see the tags, a new problem was found: it wasn’t possible to delete tags. Odd nobody had noticed that before. Since I was the last person in that code base, it was my duty to fix it. Down into the legacy code I went.

The tagging code was, well, intermingled. Validating the tags, determining the changes between the version on the client and the version on the server, writing those changes back, were all in a single gigantic Backbone sync method involving empty arrays, for loops, and concat methods. I spent about four hours, during which I:

  • Replaced all for loops with map / reduce / filter
  • Separated the model validation into its own method
  • Used Underscore’s intersection / union / difference functions to create instruction sets for deleting and adding to the tag system
  • Used Backbone’s set([_], (void 0), {unset: true}) method to delete the tags, rather than hammer the event bus with a series of change events in an each loop.

I struggled a lot to make sure I was using names that explained what each thing did.

In short, I did with my code what I did with my writing: try to make every line a pleasure to read, something that told a story about what was happening and what was going to happen next. I hope when someone sees overlappingTags = _.intersection(newTags, restrictedTagNames), it’s obvious what’s happening, and it should create anticipation that soon there will be a line that checks to see if overlappingTags has anything in it and, if it does, reports an error with the offending tags.
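The set-arithmetic idea translates to almost any language. Here is a minimal Python analogue using built-in sets rather than Underscore; the function name, the shape of the returned plan, and the sample tags are all assumptions made for illustration, not the code from that project.

```python
def plan_tag_changes(server_tags, client_tags, restricted_tag_names):
    """Work out which tags to add and delete so the server matches the client."""
    server = set(server_tags)
    client = set(client_tags)

    # Validation first: refuse any tag the user is not allowed to set.
    overlapping_tags = client & set(restricted_tag_names)
    if overlapping_tags:
        raise ValueError("restricted tags: " + ", ".join(sorted(overlapping_tags)))

    return {
        "add": sorted(client - server),     # on the client, not yet on the server
        "delete": sorted(server - client),  # on the server, dropped by the client
        "keep": sorted(client & server),    # unchanged
    }


plan = plan_tag_changes(
    server_tags=["location", "owner"],
    client_tags=["location", "color"],
    restricted_tag_names=["admin"],
)
```

Each line names the set it produces, so the reader can guess what the next line will do with it.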

I’ve always had fun doing stuff like that, turning unreadable mash into clarity. Even my recent bragging project, Polyloader, is actually a fix for the “All Python on the filesystem ends in .py” bug that sorta firewalls Python syntax from the rest of the language universe.

I’ve found this industry doesn’t really like menders. Code editors, people who go in after the fact and apply measures both aesthetic and qualitative to the code they see, are often seen as nothing but agency overhead by managers.

On the other hand, I’ve yet to meet another developer who resented menders. They like menders; they want to learn from menders how to make code better. Menders tend to be older, tend to know more, tend to be broadly learned and strongly opinionated. Nothing “just gets thrown there.” It has to be fixed, it has to work, it has to be right. And I’ve yet to meet a software developer who didn’t want to get it right. Often, they just don’t know how, or nobody’s ever told them how.

Let’s show them how.


Programmers need a class in aesthetics.

Posted by Elf Sternberg as chat

Sometimes it’s a little hilarious to read the back-and-forth of academics. My favorite is this exchange from Roman R. Redziejowski and Brian Ford over packrat parsing. Redziejowski writes

PEG is not good as a language specification tool. The most basic property of a specification is that one can clearly see what it specifies. And this is, unfortunately, not true for PEG.

To which Ford responds,

Such permissiveness can create unexpected syntactic subtleties, of course, and caution and good taste are in order: a powerful syntax description paradigm also means more rope for the careless language designer to hang himself with.

No points for complaining that Ford ends his sentence with a preposition.

This exchange highlights an issue in the programming language community that stands out for me. There’s a debate raging between two camps, with Google Go at one pole and Haskell at the other. Google Go is fundamentally an ugly language, one the designers admit up front is meant to make mediocre programmers productive, to constrain them from hurting themselves while making them capable of producing working code. And while it’s fine for that, consider the Microsoft “wizards” of the mid-1990s that pumped out huge blocks of C++ that nobody, not even the template designers, could understand; when it comes to Go, that’s where we’re headed. On the other hand, Haskell is fundamentally a beautiful language that’s really, really hard to understand; you have to immerse yourself in decisions where you, yourself describe the constraints with precision, with care, with taste.

Ira Glass has a speech, On Storytelling, in which he says, about being creative,

We get into it because we have good taste, but there’s like a gap.

The first couple years that you’re making stuff, what you’re making isn’t so good. It’s trying to be good, it has ambition to be good, but it’s not quite that good.

But your taste, the thing that got you into the game, your taste is still killer. And your taste is so good that you can tell that what you’re making is kind of a disappointment to you, you know what I mean?

The thing is, this is true of storytelling, of drawing, of any creative endeavor. A lot of programmers don’t get into programming because they view it as a creative endeavor. They view it as puzzle solving. They view it as engineering. They view it as a way to make money fast.

They have no taste.

Often, they don’t want to have taste. They want to get the job done and get paid. “Taste” slows them down and gets in the way. Aesthetic decisions about code layout and arrangement, they believe, are irrelevant to getting the job done.

This isn’t true, of course; tasteless Go is still as unmaintainable as tasteless C++. It’s possible to write aesthetically horrifying Haskell. Let’s not even talk about Perl.

I believe this is the fundamental dividing line between Go, C, and C++ on the one side, and Rust, Clojure, and Haskell on the other. The whole point of Go is to make programmers with no interest in taste or aesthetics write programs that work. Maintainability is secondary.

Which goes back to my tweet above. Java and Go programmers want to write the first kind. Haskell and Lisp programmers and their descendants love to write the second type. But my experience with reading and writing in a variety of languages convinces me we frequently end up at the third with no help for it.

The solution is to teach aesthetics. To teach people that readability and maintainability matter more than just getting the job done.  That if it doesn’t make you feel good the day after you wrote it, re-write it.

After all, sometimes your code will live much longer than you expect.

This feels like something that deserves clarification. It’s not that I fear any and all of my projects becoming popular. I would love for some of them to become very popular. Polyloader would be awesome, as would Tumble. But I draw a distinction between tools, products, and examples. Catalogia is a product, and I don’t want to be tech support. There’s a huge difference between getting something right, and teaching the average user about the cupholder that came with his desktop machine. Tumble and Polyloader are tools: I want them to reach the widest possible audience and make that audience, my fellow developers, smarter and happier and more effective. The Backbone Store is an example, but examples are just examples. If they’re inadequate to the state of the art, it’s my duty to revise or remove them, or at least comment on their deprecation, but I’m not going to help individual users understand what’s going on.


What comes next?

Posted by Elf Sternberg as chat, programming

After publishing The Semantics of Python Import and explicating on the history and internals of how Python turns source code into running operations, I thought I had a pretty clear idea of what to do next. I extended Hy such that it was now possible to write an entire Django application in a Lisp dialect, which was cool, and started on Catalogia, a program that would help me index, search, clean up and organize my music collection. The idea behind Catalogia was to demonstrate that writing an entire Django application in Hy was possible. I’ve done that, even going so far as to demonstrate that it’s possible to replace the boilerplate of generic views with a Lisp macro to generate the boilerplate automatically at compile-time.

The problem is that Catalogia isn’t done, but I’m already bored with it. This is a classic problem in side projects, I know, but I’m trying to figure out what to do with it.

What I’d really like to do, now that I’ve got a viable Lisp running on and auto-transpiling to Python, is write even more tools to extend Hy even further. I don’t like the existing suite of PEGs for Python; I’m completely spoiled by David Majda’s PEG.js, which marries PEG to Javascript with the absolute minimal amount of boilerplate possible; I’d like to port something that succinct to Python. Lexer/Parser technology is one of those spaces that’s assumed to be “solved,” but there are still places where it could be better, especially in the UX of the development process, and there’s also an entirely new Lexer/Parser theory called the Derivatives of Deterministic Finite Automata that has one implementation (in Racket, natch) that I’d like to see happening in a popular language like Python.

What I’d also like to do is strike out on my own and build on the experience Hy gave me to build a language research platform for Python. Something like GardenSnake, but complete, on top of which it would be easy, even trivial, to add new tiles that construct whole new operators in Python. I’d like to be able to pipe and compose Python instructions in a point-free syntax; how cool would that be without having to run through a transpiler? Just write in this “extended” Python, call Python, run Python, and have it work. (Psst: Polyloader is a key component of this idea.) Something that could be rolled back simply, providing plug-and-play additions to the Python grammar/compiler, just by adding a single call in your script to polyloader.install()?

Meanwhile, all the other desires are piling up. I want to run through this class, and this class. (I already have the textbooks.) I want to move this blog off effin’ WordPress onto something sane, and then Dockerize the sanity. I want to finish my basic editor for my stories, with all the front-end stuff that’s been missing for so long, and do a visual refresh, and all the other critical things that happen when the Web Guy’s Website Doesn’t Get Revamped.

I really should commit to Catalogia. I’m just afraid of it becoming popular.

Module Iterators, as defined in pkgutil.py, aren’t really part of the mess that has been imposed on us by PEP-302 and its follow-on attempts to rationalize the loading process, but they’re used by so many different libraries that when we talk about creating a new general class of importers, we have to talk about iterators.

Iterators, after all, are why I started down this project in the first place. It was Django’s inability to find heterogeneously defined modules that I set out to fix.

Iterators are defined in the pkgutil module; their entire purpose is, given some kind of reference to an archive, to be able to list the contents of that archive, and to recursively descend into that archive if it happens to be a tree-like structure.

When you call pkgutil.iter_modules(path, prefix), you get back an iterator over all the modules within the given paths or, if no path is supplied, all the top-level modules on sys.path. As I pointed out in my last post, the paths in sys.path aren’t necessarily paths on the filesystem or, if they are, they’re not necessarily directory paths. All that matters is that for each path, a path_hook exists that can return a Finder, and that Finder has a method for listing the contents of the path found.
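A small usage sketch, assuming Python 3.6 or later, where each item produced is a ModuleInfo named tuple of (module_finder, name, ispkg). The json package is used only because it ships with the standard library; any package with a __path__ would do.

```python
import json
import pkgutil

# No path argument: walk every entry on sys.path and collect top-level names.
top_level = {info.name for info in pkgutil.iter_modules()}

# An explicit list of paths plus a prefix: list what lives inside one package.
json_children = {
    info.name
    for info in pkgutil.iter_modules(json.__path__, prefix="json.")
}

# Each ModuleInfo also records which Finder produced the entry.
finders_used = {
    type(info.module_finder).__name__
    for info in pkgutil.iter_modules(json.__path__)
}
```

On a typical CPython install finders_used contains just FileFinder, but that depends on how the standard library is packaged on the machine in question.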

In Python 2, pkgutil depends upon Finders (those things we said were attached to meta_path and path_hooks) to have a special function called iter_modules; if it does, that function is used to list the contents of the “path”.

In Python 3, the functools.singledispatch tool is used to differentiate between different Finders; once a Finder has been identified by path_hooks, singledispatch is used to find a corresponding resource iterator for that Finder. It doesn’t necessarily have to be a method on the Finder, although the default falls back to calling the Finder’s own iter_modules method if it has one.
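To make the dispatch mechanics concrete, here is a self-contained sketch that mimics the pattern with functools.singledispatch. DirectoryFinder, PolyFinder, and iter_finder_modules are hypothetical stand-ins invented for this example; they are not the real pkgutil or importlib classes.

```python
import os
from functools import singledispatch


class DirectoryFinder:
    """Stand-in for a filesystem Finder; it just holds the filenames it can see."""

    def __init__(self, filenames):
        self.filenames = list(filenames)


class PolyFinder(DirectoryFinder):
    """Stand-in for a more specific Finder that also understands .hy files."""


def _names_with(finder, suffixes, prefix):
    """Return prefixed module names for files whose extension is in suffixes."""
    found = []
    for filename in finder.filenames:
        stem, ext = os.path.splitext(filename)
        if ext in suffixes:
            found.append(prefix + stem)
    return found


@singledispatch
def iter_finder_modules(finder, prefix=""):
    """Fallback: ask the Finder itself, if it offers an iter_modules method."""
    if not hasattr(finder, "iter_modules"):
        return []
    return list(finder.iter_modules(prefix))


@iter_finder_modules.register(DirectoryFinder)
def _iter_directory(finder, prefix=""):
    return _names_with(finder, {".py"}, prefix)


@iter_finder_modules.register(PolyFinder)
def _iter_poly(finder, prefix=""):
    # The most specific registered class wins, so this runs for PolyFinder
    # even though PolyFinder is also a DirectoryFinder.
    return _names_with(finder, {".py", ".hy"}, prefix)


files = ["models.py", "views.hy", "README.txt"]
plain = iter_finder_modules(DirectoryFinder(files))
poly = iter_finder_modules(PolyFinder(files), prefix="app.")
```

Because single dispatch picks the closest match in the class hierarchy, a subclass with its own registration shadows the iterator registered for its parent.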

An iterator is pretty straightforward; once you know the “path” (resource identifier) and the Finder for that path, you can call a function that checks for the presence of modules. In the case of FileFinder, that function is a combination of listdir, isfile, and isdir/isfile to check for dir/__init__ pairs indicating a submodule.
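A simplified sketch of that kind of directory scan follows. The list_modules function and its suffixes parameter are assumptions for illustration; the real pkgutil code handles more edge cases (duplicate names, unreadable directories, and so on).

```python
import os
import tempfile


def list_modules(path, suffixes=(".py",)):
    """Yield (name, is_package) for each importable-looking entry in path."""
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            # A directory counts only if it holds an __init__ file.
            has_init = any(
                os.path.isfile(os.path.join(full, "__init__" + suffix))
                for suffix in suffixes
            )
            if has_init:
                yield entry, True
        elif os.path.isfile(full):
            stem, ext = os.path.splitext(entry)
            if ext in suffixes and stem != "__init__":
                yield stem, False


# Build a throwaway directory to scan.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.py"), "w").close()
    open(os.path.join(root, "notes.txt"), "w").close()
    os.mkdir(os.path.join(root, "pkg"))
    open(os.path.join(root, "pkg", "__init__.py"), "w").close()
    os.mkdir(os.path.join(root, "not_a_package"))

    found = list(list_modules(root))
    found_with_txt = list(list_modules(root, suffixes=(".py", ".txt")))
```

Passing the suffixes in as a parameter is the whole trick: once the list is no longer hard-coded, the same scan can notice files that are not Python source.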

For our purposes, of course, we had to provide a path_hook that eclipses the existing path_hook, and we had to provide a Finder that was more precisely ours than the inherited base FileFinder, so that single dispatch would find ours before it found FileFinder’s and still work correctly.

There is one other module I have to worry about: modulefinder. It’s not used often, it’s not used by Django or any of the other major tools that I usually use, and it’s never been covered by Python Module of the Week. That doesn’t mean that its hard-coding of the ‘.py’ suffix isn’t problematic. I’m just not sure what to do about it at this point.

It’s time to come around to a point that’s been bugging me for a long time: why is the Python import routine so, well, so darned convoluted? The answer is “history,” basically the history of Python and the attempt to turn import foo.bar.baz into a tool that’s incredibly easy to use and understand for the common programmer, yet flexible enough to give the advanced programmer the power to redefine it into whatever else it has to mean.

We’ve talked about how the system has two different loading systems: the sys.meta_path and the sys.path_hooks, and how the latter is just as arbitrary as the former: the last path_hook is for the filesystem, so it runs os.path.isdir() on every item in sys.path and only offers to handle the ones that return true, and it only runs after everything else has been run, so:

  • If a meta_path interpreted an import fullname with respect to a path that’s a directory, the default won’t get it,
  • If a path_hook said it could handle it, the default won’t get it,

… and so on. The whole point of using first-one-wins priority pathing is to leave the responsibility for not failing up to the developer. The default really is the fallback position, and it uses only a subset of sys.path. The formal type of a sys.path entry is… no type at all. It could be a string, a filesystem directory iterator, an object that interacts with a path_hook. It could be anything at all. The only consideration is that, if it can’t be coerced into a string that os.path.isdir() can reject, you had better handle it before it falls through to the default.
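The first-one-wins rule can be sketched in a few lines. This is a simulation, not the interpreter's code: pick_finder and the three hooks are hypothetical, and the only convention borrowed from the real sys.path_hooks protocol is that a hook raises ImportError to decline a path entry.

```python
import os
import tempfile


def pick_finder(path_entry, hooks):
    """Return whatever the first willing hook produces, or None if all decline."""
    for hook in hooks:
        try:
            return hook(path_entry)
        except ImportError:
            continue  # this hook declined; try the next one
    return None


def database_hook(path_entry):
    """Hypothetical hook that only claims entries shaped like 'db://...'."""
    if not (isinstance(path_entry, str) and path_entry.startswith("db://")):
        raise ImportError("not a database entry")
    return ("database", path_entry)


def directory_hook(path_entry):
    """Stand-in for the default: claims only real filesystem directories."""
    if not (isinstance(path_entry, str) and os.path.isdir(path_entry)):
        raise ImportError("not a directory")
    return ("directory", path_entry)


def greedy_hook(path_entry):
    """Claims everything, so nothing after it is ever consulted."""
    return ("greedy", path_entry)


hooks = [database_hook, directory_hook]
real_dir = tempfile.gettempdir()

from_db = pick_finder("db://inventory", hooks)
from_dir = pick_finder(real_dir, hooks)
unclaimed = pick_finder("no/such/place/anywhere", hooks)
shadowed = pick_finder(real_dir, [greedy_hook] + hooks)
```

The last line is the cautionary one: a hook placed early that accepts too much silently takes directories away from the default.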

It’s really time to call it like it is: sys.path and sys.path_hooks are a special case for loading. They’re the original special case, but that’s what they are. They lead to weird results like one finder saying it can handle foo.bar.baz and another foo.bar.quux, turning the leading elements of the fullname into arbitrary and meaningless tokens.

I wish I could call for a more rational import system, one in which we talked only about resource managers which had the ability to access resource archives, iterate through the contents, identify corresponding resources, load the contents of that resource, and compilers that could identify the text that had just been accessed (via whatever metadata was available) and turn it into a Python module.

But we can’t. Python is too well-established to put up with such rationalizing shenanigans, and too many people are dependent upon the existing behavior to help make it so. Python was born when NFS was the thing, when there were no real open-source databases, no object stores. Python was released two years before the Mosaic web browser! It would be far too disruptive. So we’ll keep getting PEPs forever trying to rationalize the irrational.

That’s okay. It gives me something to get paid for.

But, it does point out one major flaw: because Finders and Loaders are so intimately linked, even if we manage to rationalize FileFinder and SourceFileLoader, that’s only with respect to the Filesystem. We’ll have to make equivalent loader/finders for any other sort of accessor, be it Zipfiles or any of the other wacky resource pools that people have come up with.

Unfortunately, I don’t have a good plan for those. Fortunately, filesystems are still the most common way of storing and loading libraries, so concentrating on those gets us 99% of the way there.

The Semantics of Python Import, Part 3: Loaders

In the last post we discussed Finders. The whole point of a Finder is to find a resource stored somewhere (usually a file on a filesystem, but it could be anything– a row in a database, a webpage, a range in a zip file) and supply the appropriate loader for it.

More accurately, there is a “FinderFinder” mechanism by which sys.meta_path and sys.path are searched to find the best Finder to run against a resource, and then the Finder is invoked to find the loader to load the resource. This lets Python differentiate between the archive (resource type– folder, database, zipfile, etc), the resource itself (file, row/column, zipfile index), and the type of that resource: source code (.py), compiled Python bytecode (.pyc or .pyo), or a compiled binary (.so or .dll) file that conforms to the Python ABI.

The point of the Loader is to take what the Finder has found and convert that resource into a stream of characters, which it then turns into Python executable code. Compared to the Finder, the Loader is pretty simple.

Typically, the Loader does whatever work is necessary to read in and convert (for example, to uncompress) the resource, compile it, attach the resulting compiled code as the executable to a new Module object, decorate the object with metadata, and then attach that new module object to the calling context, as well as caching a copy in sys.modules.
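Stripped of error handling and metadata details, those steps look roughly like this. load_source is a hypothetical helper written for this example, and the module name is made up; a real Loader sets more attributes (such as __spec__ and __loader__) and the import statement itself does the final name binding.

```python
import sys
import types


def load_source(fullname, source, origin="<string>"):
    """Compile source text and install the result as a module named fullname."""
    code = compile(source, origin, "exec")   # turn text into executable code
    module = types.ModuleType(fullname)      # a new, empty module object
    module.__file__ = origin                 # decorate it with some metadata
    sys.modules[fullname] = module           # cache it before running the code
    try:
        exec(code, module.__dict__)          # run the code inside the module
    except BaseException:
        del sys.modules[fullname]            # do not leave a broken module cached
        raise
    return module


demo = load_source("loader_sketch_demo", "answer = 6 * 7\n")

# Because the module is cached, an ordinary import now finds it.
import loader_sketch_demo
```

Caching before executing matters for circular imports: a module that is imported again while it is still running gets the partially built object instead of starting over.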

That’s more or less it.

Python 3.4 introduces the idea of a ModuleSpec, which describes the relationship between a module and its loader, in much the same way that the ModuleType describes a relationship between a module and the modules that import it.
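A quick way to look at a ModuleSpec is to ask importlib for one. This sketch only inspects the spec for the standard library's json package, chosen arbitrarily; the attribute values will differ from machine to machine.

```python
import importlib.util

# find_spec runs the Finder machinery and returns the spec it produced.
spec = importlib.util.find_spec("json")

summary = {
    "name": spec.name,                        # the fully qualified module name
    "loader": type(spec.loader).__name__,     # which Loader will be used
    "origin": spec.origin,                    # where the Finder located it
    "is_package": spec.submodule_search_locations is not None,
}
```

The spec records the result of the search, the pairing of one resource with one Loader, which is exactly why it says nothing about how to browse the archive the resource came from.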

Unfortunately for my needs, ModuleSpec doesn’t address several critical issues that we care about for the Heterogeneous Python project. It doesn’t really address the disconnect between Finders, Loaders, and the navigation of archives; Finders and Loaders are still very much related to each other with respect to the way a resource is identified and incorporated into the Python running instance.

Typical import tutorials focus on one of two different issues: loading Python source out of alternative resource types (like databases or websites), or loading alternative source code that cannot ever be confused with or treated as Python source. An example of the latter would be to have a path hook early in sys.path_hooks that says, “That path there belongs to me, and it contains CSV files, and when you import from it, the end result is an array of processed CSV rows.” By putting it before all other path hooks, that prevents Python from Finding inside that path and rejecting its contents for not having any .py files.

Our goals are different: A directory in sys.path should be able to hold mixed code: CSV files, Hy (lisp) files, regular Python files, and byte-compiled Python files, and the loader/finder pair should be able to understand and interpret all of them correctly.

To do that, the loader has to be able to find the right compiler at load time. But there’s a problem: Python 2 hard-codes what suffixes (filename extensions) it recognizes and compiles as Python in the imp builtin module; in Python 3 these suffixes are constants defined in a private section of importlib; in either case, they are unavailable for modification. This lack of access to the extensions list prevents the discovery of heterogeneous source code packages.

We have to get in front of Python’s native handlers and supply our own Finder, one that recognizes all our code-like suffixes and provides a source code loader that applies our compilers to our own suffixes and falls back on Python’s native loader behavior when we encounter native suffixes.
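To show the general shape of the idea, here is a sketch that uses only the standard library: a single importlib.machinery.FileFinder is handed two (loader, suffixes) pairs, so one directory can serve both a CSV file and an ordinary Python file. CsvLoader and load_from are hypothetical names invented for this example, the finder is used directly rather than installed into sys.path_hooks, and none of this is Polyloader's actual implementation.

```python
import csv
import importlib.machinery as machinery
import importlib.util
import os
import tempfile


class CsvLoader:
    """Hypothetical loader: exposes a .csv file as a module with a rows list."""

    def __init__(self, fullname, path):
        # FileFinder constructs loaders as loader_class(fullname, path).
        self.fullname = fullname
        self.path = path

    def create_module(self, spec):
        return None  # accept the default module object

    def exec_module(self, module):
        with open(self.path, newline="") as handle:
            module.rows = list(csv.reader(handle))


def load_from(finder, name):
    """Find name with the given finder and execute it, bypassing sys.modules."""
    spec = finder.find_spec(name)
    if spec is None:
        raise ImportError(name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "people.csv"), "w", newline="") as handle:
        handle.write("name,role\nAda,engineer\n")
    with open(os.path.join(root, "helper.py"), "w") as handle:
        handle.write("GREETING = 'hello'\n")

    # One finder, two kinds of source: our suffix first, then native Python.
    finder = machinery.FileFinder(
        root,
        (CsvLoader, [".csv"]),
        (machinery.SourceFileLoader, machinery.SOURCE_SUFFIXES),
    )
    people = load_from(finder, "people")
    helper = load_from(finder, "helper")
```

Keeping the native SourceFileLoader in the list is what provides the fallback: files with Python's own suffixes are still handled exactly as Python would handle them.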

I can now announce that Polyloader accomplishes this.  After you import polyloader, you call polyloader.install(compiler, [extensions]) for files that compiler can handle, and it… works.

It works well with Hy. And it works performantly and without breakage on a modern Django application, allowing you to write Django models, views, urls, management commands, even manage.hy and settings.hy, in Hy.

There are three more posts in this series: Python Package Iterators, the resource-vs-compiler problem, and a really crazy idea that may break Python, or may finally get around all the other code that hard-codes “.py” problematically (I’m looking at you, django.db.migrations.loader, and you, modulefinder).

