04Dec

If you write Python, you should use MyPy.

Posted by Elf Sternberg as Uncategorized

I’ve discovered MyPy at work, and it works with Python 2.7 so amazingly well that I’m here to say, if you haven’t added it to your workflow, you need to immediately.

The last project I worked on at my day job, an automated inventory and services manager, was 2200 lines in a single class. When I was done, it was 1800 lines and 145 functions split over ten files, with separate classes for services and inventory, intake and output, local and network transformation, authorization, and utility. Its primary purpose was to take in CSV (comma-separated values, the text format for spreadsheets) and output JSON. We wanted to add other input formats, we wanted the output to be able to handle different formats as well, and we wanted to move the network transformation phase to an asynchronous handler with recovery-on-failure.

Doing that took a lot of heavy lifting: figuring out what the functions were for, what the dataflow looked like, what each step of the transformation required, and how to handle conflicts with objects already in the inventory. Like most shops, we have rules about how to use docstrings that look a lot like jDoc, but by complete luck I found PEP-484 and MyPy, the program that actually does Python type-hint checking, and immediately started using it aggressively.

The internals of the original, a one-off written for a single customer, had very strong assumptions about CSV inbound and JSON outbound. Functions were literally named with csv_ and json_ prefixes, even though internally everything was just Python dictionaries and lists!

I named things by their purpose: classes called InventoryItem and TrackingService. I named collections of things. I named containers. And I added MyPy type hints to everything, like this (the code is from mp_suggest, an early MP3 inventory manager I wrote a while ago as a personal project):

from typing import Dict, Text  # MyPy needs these names in scope for the type comment

def make_album_deriver(opts, found, likely):
    # type: (Dict[Text, Text], Text, Text) -> Text
    if u'album' in opts:
        return opts[u'album']

    if u'usedir' in opts:
        return found

    # ascii_or_nothing and sfix are helpers defined elsewhere in mp_suggest.
    return ascii_or_nothing(likely) or sfix(found)

That little one-line comment there guarantees that anyone calling this function must send it what’s listed in the type line, and nothing else. MyPy will track back and make sure that the variables passed by the caller are actually created as a dictionary, a string, and a string! It will track forward and make sure that the receiver always uses the results as a string! It will ensure that the dictionary is created with strings as keys and values— no lists, no sets, no nested dictionaries are legal values. It will make sure that ascii_or_nothing and sfix take strings and return strings.
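For example (a made-up call, not from mp_suggest), MyPy rejects this before the program ever runs, because the dictionary's values are lists rather than strings:

from typing import Dict, List, Text

bad_opts = {u'album': [u'Abbey Road']}  # type: Dict[Text, List[Text]]
make_album_deriver(bad_opts, u'somedir', u'Some Album')
# mypy reports an incompatible argument type here: it wanted
# Dict[Text, Text] and got Dict[Text, List[Text]].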

Programmers in dynamic, duck-typed languages like Python (and Javascript, Perl, Ruby, even Lisp) tend to assume we know what we’re doing. Given a problem, we come up with a solution, and then we start coding toward the solution, as if building and assembling a jigsaw puzzle along the way. On a very big project, we can miss things. Often, what we miss is error handling and corner cases. We can write a function and twenty minutes later call it, all the while assuming we can correctly remember what the calling protocol was. Often, we are wrong, and learning we’re wrong is sometimes a painstaking effort in type management. I believe half my debugging time is spent being a human typechecker: “Wait, it’s supposed to be passing back class X, but I’m getting a list. Why?”

PEP-484 eliminates all of that, and that makes PEP-484 the biggest leap forward in high-quality Python development since PEP-8. By eliminating the single most common class of runtime errors in all of Python, you can double your productivity. If you are not using MyPy, you are at a tactical disadvantage, and you will remain so until you adopt it.

Infoworld has an article by Peter Wayner entitled “The seven most vexing problems in programming.” The first two are “multithreading” and “closures.”

I call bullshit. Neither of those are hard problems. Both of those are problems of programmer laziness. (Not compiler laziness, of which I’ve recently become a fan.) It’s really hard to believe closures are a problem when we’ve had object oriented programming for over thirty years now and classes are nothing more than a specialized syntax for describing a closure.
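If that sounds abstract, here is a quick sketch of my own showing the equivalence in Python: a counter written as a closure and as a class. They are the same machinery wearing different syntax.

def make_counter():
    # The closed-over list is the "instance state."
    count = [0]
    def increment():
        count[0] += 1
        return count[0]
    return increment

class Counter(object):
    # The same state and behavior, with class syntax.
    def __init__(self):
        self.count = 0
    def increment(self):
        self.count += 1
        return self.count

tick = make_counter()
assert tick() == 1 and tick() == 2
c = Counter()
assert c.increment() == 1 and c.increment() == 2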

You know what’s really broken about these two issues? Mutation. Objects (in the abstract, categorical sense, rather than the OO sense) changing and being changed in the code in such a way that it’s impossible to reason about the code in front of you. It’s impossible to say “This is what the object is.” It’s impossible to say “This is what this function does.”

There are times and places where mutation is appropriate, but they’re exceptionally rare. Mutation can be justified when performance is the only thing that matters; but if performance is the only thing that matters, then locks, semaphores, and monitors are the real performance drags, and your code shouldn’t be involved with them.

Wayner’s complaint is wrong. These issues aren’t hard. These issues exist because programmers have been taught poorly. They’ve been taught that in-place mutation is the right way to do these things. They’ve been taught that lock-free data structures are “too hard” (but apparently debugging multithreaded code is just fine). They’ve been taught to save unnecessary microcycles at the expense of their employers’ time and money.

The other day I debugged someone’s code where he looped through a very large list of objects. There was a bug where some of the first items didn’t get saved to the database with all the data the specification required. I discovered that he was pulling new data from the database every iteration, and saving some of it in a mutating array. Eventually it had enough data to do the job right, but not at the very beginning. When I asked him why he did it that way he said, “I’d have to iterate through the data twice, once to get all the keys, then to perform the updates. This was more efficient.”

It was also broken. “You mean, you accumulate mutations as you go along?” I said.

“Yes.” And then he broke out into a grin. “Isn’t that called evolution?”

“No,” I said. “99.999% of the time, that’s called cancer.”

And so it is with most programmers these days. They think cancer is preferable to thinking harder about the problem. I blame Python and Ruby, where mutation is easy, cheap, and often “gets the job done,” but makes it much harder to understand what the Hell is going on underneath the code.

05Nov

“Huh. So UML is actually useful?”

Posted by Elf Sternberg as chat, Design

Since I’m generally opposed to giving Hacker News oxygen, I refuse to comment there on a recent question, Do You Still Use UML?

As I’ve made clear, I have both anxiety and confidence about being an older developer in this high-tech, high-speed, high-burnout, “only twentysomethings can do anything” startup world. Bloomberg the other day pointed out that almost all productivity gains over the past four years are in the hands of people like me: folks in our forties or later, folks who know stuff, folks with experience.

Anyway, to answer the question: Yes. When I was at CompuServe, they were kind and crazy enough to go through a full-on UML phase. It was a huge drag and it produced terrible software, mostly because the powers that be thought they could buy Rational Rose and jump straight into Class Diagrams and Sequence Charts without, you know, actually caring if anyone on the team knew about SOLID, DRY, KISS, YAGNI, etc. (I heard a great one the other day: TYKD, pronounced “Ticked:” Test, YAGNI, KISS, DRY. An acronym of acronyms.) The promised productivity never showed, and we ditched it after a year or so.

This cycle I have the responsibility for developing a fairly complicated piece of data analysis software, and while trying to describe my ideas for it, other members of the team weren’t getting it. Desperate to make myself clear, I actually plunked down $70 for a piece of UML software (StarUML, and pretty good, actually) and designed a sequence diagram to show how I expected the system to work with failures and fallbacks clearly described.

It was a rough diagram, and it had some ad-hockery because UML was designed before asynchronous and reactive programming became a thing. But when I showed it to my review committee, they were like, “Wow. That really does explain what’s happening here. That’s amazing. What is this?”

I explained what it was, and the reaction was, “Huh. So UML is useful?”

They approved my project quickly, and things are underway. And so far, it really is looking much like what I drew, which is fabulous. But I was able to get approval because I had bothered to learn UML.

Because I know things.

29Oct

Will there ever be a Flying Shuttle for Software?

Posted by Elf Sternberg as chat

Fred Brooks once famously wrote that there was no silver bullet for software development. Once a project and its tooling were agreed upon, software development took the time that it took. Brooks’s insight was that software developers spend their days more or less at the limits of their intellectual capacity, and no amount of tooling or management could expand that limit. Since a developer is usually fully engaged with any given task, adding a second developer to that task will actually slow production down: the two developers now have the overhead of communicating the nature of the task to each other, and each may be blocked by the other on a significant subtask.

The term for this is “No Silver Bullet,” and the essay by that title, collected into The Mythical Man-Month, has gone on to be a perennial best-seller, usually purchased by harried software developers and left anonymously on the desks of clueless project managers.

Treadle-based handlooms for weaving cloth have existed for almost two thousand years, and their current form is essentially the one that emerged in Germany and Italy a thousand years ago. A loom holds a frame of vertical threads in two alternating sets; each press of the foot pedal pulls one set apart from the other, and the user strings a spool of thread called a “shuttle” across this gap horizontally, then cycles the foot pedal to close the lifted threads down and lift the other set up, giving the weaver the alternating over-under sequence that holds cloth together. Skilled users were known to “throw” the shuttle across the loom field, but cloth could only be woven as wide as one person could reasonably reach.

In 1733, a man named John Kay built a narrow wooden track, put the shuttle into the track, and with a piece of string and a handle, “jerked” the shuttle back and forth across the loom field. A few improvements later, he had made a loom three times as wide as existing looms, and one that could be operated twice as fast. Kay’s “flying shuttle” replaced the thrown shuttle almost instantly, and the industrial revolution had its first major component.

The more I write software, the more I have this sensation that something is very wrong with the way we write software. I listen to developers describe their projects and my overwhelming thought is, “Didn’t someone do that already?” I can’t begin to count the number of times I have implemented the same small algorithms over and over. Consider Stack Overflow, where a question like “How do I find a substring?” has several different code snippets that dozens of people will now simply cut and paste into their own code. Programmers get paid like princes these days mostly to know how to know these things, glue them together in the right order for our corporate masters, and make them work profitably.

If there’s No Silver Bullet, could there be a Flying Shuttle waiting for us? I don’t have an answer, but I fear that there is: there’s an answer that glues everything from the hardware to the UI together into a meaningful whole, that answers the questions people want answered, and does the things people want done. I suspect it’s already here; we just haven’t identified it for what it is yet.

So, Spectrum IEEE has a “The sky is falling! The sky is falling!” article claiming that 2016’s tech layoffs have been nasty and that 2017 is going to be even nastier. This is one of many articles on this theme, but it’s a little disheartening to see it in Spectrum. Worse, none of the articles I’ve read on this theme list which skills are going to be out of date. Which skills? What disciplines?

In 2008, I was laid off after 8 years at a large company, and I’d been using the same tools for those 8 years. I’d been a front-end developer for dev-ops shops, and my skills were woefully out-of-date: we’d been using Sencha (JS) and Webware (PY), with some Python 2 Python-to-C libraries. I knew nothing about what the cool kids were doing. I sat down and in a few days taught myself Django and jQuery; I rebooted my SQL knowledge from my 90s-era experience with Oracle and taught myself the ins and outs of Postgresql.

And then, in the bottom of the recession, I took shit contracts that paid very little (in one mistake, nothing) but promised to teach me something. I worked for a Netflix clone startup; I traded my knowledge of video transcoding for the promise of learning AWS. I worked for a genetic engineering startup, trading my knowledge of C++ for the promise of learning Node, Backbone, SMS messaging, and credit card processing; a textbook startup, trading my knowledge of LaTeX for the promise of learning Java; an advertising startup trading my basic Django skills to learn modern unit testing; a security training startup, trading my knowledge of assembly language in order to learn Websockets.

The market improved. I never stopped learning. I gave speeches at Javascript and Python meet-ups. Recruiters sought me out. I’ve been at another big company for four years now.

Will things go to hell in March? I don’t care. I have the one skill that matters.

14Oct

PacMan doesn’t need AI

Posted by Elf Sternberg as Design, programming

The other day, I was reading through the course syllabus for a second-year AI class, as one does, when I noticed that the assignment for the sixth week was to turn in a working version of PacMan. Which is kind of weird, because the actual algorithm for PacMan involves more or less zero AI. It involves something else, and one of my favorite words: stigmergy.

Alright, so, here’s the algorithm in a nutshell: PacMan is played on a 29-by-26 square grid of cells. Everything else is special effects. There is a clock cycle: every cycle, the characters move from one square of the grid to another. If PacMan and a ghost share the same cell in a cycle, PacMan loses a life. There’s an animation engine running to make it look smoother than it is, but that’s the basic game.

The grid is actually three different grids layered together: One grid constrains movement by providing the walls. One grid tracks the dots that have been eaten. (The actual end-of-round tracking is done with a counter.)

The last grid is the stigmergy grid: every clock cycle, PacMan moves forward in a direction. The cell he just left is given a number: 255. Every clock cycle, the stigmergy grid is scanned for these numbers, and they’re reduced according to some formula until they reach zero. A ghost wandering the maze has a few rules: when it reaches a cell that has more than one open neighbor, it chooses a direction based on a formula, and part of that formula includes adding in the stigmergy numbers of the neighboring cells. Blue ghosts use a reverse strategy; “dead” ghosts use a simple vector-weight strategy to go back to the center room.

In short, the ghosts are following PacMan’s scent, in much the same way ants follow a trail laid down by other ants.
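Here’s a sketch of that scent mechanic in Python, with invented decay numbers (the arcade’s actual reduction formula differs):

DECAY_PER_CYCLE = 16  # invented constant; the real formula differs

def mark_vacated(scent, row, col):
    scent[row][col] = 255  # the cell PacMan just left gets the full mark

def fade(scent):
    # Every clock cycle, all scent values drift back toward zero.
    for cells in scent:
        for i, value in enumerate(cells):
            cells[i] = max(0, value - DECAY_PER_CYCLE)

def choose_direction(scent, open_neighbors):
    # At a junction, weight the choice toward the strongest scent;
    # a blue (frightened) ghost would take the minimum instead.
    return max(open_neighbors, key=lambda rc: scent[rc[0]][rc[1]])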

There’s also a clock-cycle counter that causes the ghosts to reverse themselves from time to time, but that’s the basic gist of it. Unfortunately, the random number generator is seeded with the same number every level, so it became possible to master the game and play infinitely long. As smooth as the game looks, you actually have half a second of leeway time between moves, which is well within the average video gamer’s skill to master. Ms. PacMan fixed the seeding issue, and the game is significantly harder to play for a long time.

That’s it. You could implement PacMan in a few hundred lines of Javascript and HTML. Some animated CSS using the FLIP trick would be awesome. There’s no magic, and certainly no AI about it.

My latest contribution to the world is Git Lint, a plug-in for git that allows you to pre-configure your linters and syntax checkers, and then run them all at once on only the things you’ve changed. About half of us still live in the command line, and I like being able to set-and-forget tools that make me a better developer.

Here are a few things I’ve learned along the way about Python projects.

1. Use A Project Template

Project templates provide a means to magically produce a lot of the boilerplate you’re going to be producing anyway. I’m fond of Cookiecutter. While Git Lint started life as a single Bash script (later, a Hy script, and now a Python module), at some point I needed much more than just that: I needed documentation and testing. Up-to-date templates provide you with up-to-date tools: Tox, Travis, Sphinx, PyTest, Flake8, Twine, and a Makefile come pre-packaged with Cookiecutter’s base template, and that’s more than enough to launch most projects.
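As an aside, Cookiecutter can also be driven from Python rather than the shell; something like this is all it takes (the template URL is Cookiecutter’s own pypackage starter):

from cookiecutter.main import cookiecutter

# Generate a fresh project skeleton, answering the template's prompts.
cookiecutter('https://github.com/audreyr/cookiecutter-pypackage.git')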

2. Setup.py is a beast

Getting setup.py to conform to my needs was a serious pain in the neck. It still doesn’t work correctly when installing to Mac OSX because the Python libraries and the manual (man pages) are in two different locations. If I’m building a command line tool, I always try to provide man pages. It’s usually the first place I look.

The manual pages also didn’t show up reliably in the build process; I had to force them in by adding them explicitly to the manifest, even though it supposedly included the docs tree by default.

Setup.py and man pages are NOT friends.

Getting the build to include man pages, which I require for any command-line utility, was a genuine pain in the neck, and now every upload to PyPI involves a manual step where I figure out whether I have a man page to deliver or not.
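The nearest thing I have to a recipe is the data_files argument to setup(), and even this is only a sketch, since the install prefix is exactly what varies between platforms (the project name and paths here are illustrative):

from setuptools import setup

setup(
    name='git-linter',              # illustrative metadata
    version='0.0.2',
    packages=['git_lint'],
    # Paths in data_files are relative to the install prefix, which is
    # precisely why OSX and Linux disagree about where the man page lands.
    data_files=[('share/man/man1', ['docs/git-lint.1'])],
)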

3. Sphinx is a pain in the neck — but it’s worth it for Github Pages

Sphinx, the documentation tool for Python, uses RST (reStructuredText), which has just about the worst syntax for external links I’ve ever wrestled with. Its inconsistencies about mixing links and styles drove me out of my mind.

On the other hand, I now have what I consider to be a solid idiom for generating Github Pages (gh-pages) from Sphinx documentation. A branch named “gh-pages” that contains your documentation will automatically be converted into a documentation tree on Github Pages (github.io), and you can see the results for Git Lint. This tree looks completely different from your development trees, so don’t get them confused, and don’t merge or rebase them!

Simple generation of gh-pages

If you check out the Makefile, you’ll see the idiom clearly: it checks out a complete copy of itself into a subdirectory, builds the documentation, copies it back to the parent directory, and then fixes all the links (because Github Pages really doesn’t like underscores in file and directory names). There’s irony in that it uses Perl to do the fixing; it’s just what I knew, it was fast, and I always have both Python and Perl installed.

This, by the way, points to another issue: always use the same virtual environment wherever you work. My Macbook and my Linux box had different versions of Sphinx on them, and the resulting generated pages were different on both boxes, making git report “everything’s changed!” when I went to fix a single typo in a link somewhere (I told you I hated those links).

It might be worth it to package Sphinx as a Docker image, or to ensure that the version is locked down in your virtual environment.
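Locking it down is one line in a pip requirements file (the version number here is only an example):

# requirements-docs.txt
sphinx==1.4.8  # pin the doc toolchain so every machine renders identical pages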

4. Tox is amazing

Working with Tox reassured me that my code ran correctly every time, the first time. It did not catch other critical issues with installation, like the man-page problem mentioned above, which was painful to manage, but it did everything else.

5. Git Porcelain Zero is ridiculous

If you’re not familiar with git --porcelain, it’s an argument that many of the status-oriented git commands have that changes the output to a stable, machine-readable form meant to be consumed by other tools. Git Lint uses it a lot.

But --porcelain doesn’t make any other guarantees: it doesn’t guarantee filename sanity or Unicode compatibility. For that, there’s --porcelain -z, which produces a report in which everything is null-terminated, so weird filenames can be consumed. This would be fine if the output were columnar, but it isn’t always. The most egregious example I found was git status --porcelain -z, which is usually three columns, but if there’s an ‘R’ in the first column, it’s four columns: ‘R’ means the operation is a rename, and the fourth column is the original name.

Since the -z argument makes both the cell and the line terminators null, you have to parse positionally. And if you’re parsing positionally and the number of positions can change, well, that’s context-sensitive parsing. And it’s ridiculous to have to put a context-sensitive parser into a small project like this. There was only one exceptional case here, so it’s a small issue, but inconsistencies like this really bother me.
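Here’s roughly what that positional parse looks like in practice, simplified from what Git Lint actually does:

import subprocess

def parse_status_z(raw):
    # Tokens are NUL-separated. Most entries are "XY PATH"; an entry
    # whose status starts with 'R' consumes one extra token, the
    # other half of the rename.
    tokens = iter(raw.split('\0'))
    entries = []
    for token in tokens:
        if not token:
            continue
        status, path = token[:2], token[3:]
        other = next(tokens) if status.startswith('R') else None
        entries.append((status, path, other))
    return entries

changed = parse_status_z(
    subprocess.check_output(['git', 'status', '--porcelain', '-z']))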

6. Git lint is amazing

Now that I’ve actually used my little beastie, I can’t tell you how happy I am with it. As a full-stack developer with Python, C++, XML, HTML, CSS, Javascript, and some in-house stuff I can’t discuss, being able to check the entire toolchain without caring about what I’m checking, just set and forget, makes me extremely happy.

All in all, this was one of those projects where I learned a lot about everything: git, python, unit testing, documentation, github, jekyll, reStructuredText, Cookiecutter, PyPI. All this knowledge poured into one small project.

There’s been a bit of chatter on the topic of being an old geek. As most people here know, code quality in the small is one of my favorite topics, and I realized after reading an article this morning that the two topics are actually significantly linked.

Kent Dodds’s Why Users Care About How You Write Code hammered home something that took me a long time to learn. A lifetime, so to speak. There’s a mantra that I’ve heard inside every enterprise I’ve ever been in: “Customers don’t care what language you use or your company’s code style guides or your build system. They care about the experience.” But Dodds’s observation is this: if your system has poor abstractions, or is abstracted in the wrong way, then certain requests for future adaptations of your code are going to be difficult. Dodds’s case is that “the experience of the user” is more than just what happens when they sit down with your software: it’s about how fast you can innovate without risking the stability, security, or reliability of the product. It’s about having a relationship with the user that spans months or years, all the while your software is growing, adapting to new missions, and improving.

Knowing that these pitfalls exist is something that only comes with experience. Recognizing technical debt before it grows into a monster that eats up your development cycles is something that comes from doing this job for a long time.

Every startup that says greybeard geeks aren’t a “cultural fit” is buying into the idea that it can outrun technical debt. That it doesn’t need to mind its long-term, multi-release relationship with its users. Or that technical debt is something to be managed, something middle-management is going to deal with, and you can hire young developers and burn them out, and it’ll all be fine.

It won’t be fine. Experience matters. And more to the point, experience matters most to your customers, because without it, the experience your customers will have with your software will be as inconsistent and callow as the developers you hire.

26Sep

The limits of my abilities…

Posted by Elf Sternberg as chat

In the cellar was a tunnel scarcely ten yards long, that had taken him a week to dig. I could have dug that much in a day, and I suddenly had my first inkling of the gulf between his dreams and his powers. — H.G. Wells, The War of the Worlds

The past week, work has been slow so I’ve had a little time to work on updating my git pre-commit hook. For a while, I actually had a “hack” wrapped around it to run it from the command line, so I could see what was failing before I tried to commit it. I realized as I was working that I was basically writing a lint hook, and have since changed the project’s name to reflect that: git-lint.

The problem is that working on projects like this one and polyloader has taught me that the gulf between my dreams and my powers is enormous. It took me a week to refactor pre-commit into something with actual command-line arguments, an external configuration file, and policies to implement, as well as adding the ‘dry run’ and ‘sort order’ capabilities: things the pre-commit version doesn’t really need. Obviously, I’d like to know and do a lot more with my non-professional development life. But finding the time is hard, and frankly, when I’m done working on code at work I really don’t have the brains left to write, draw, or code at home.

I remain committed to a few basic ideas: that there’s too much code in the world; that 99% of what we do is translating from formats that are human-comfortable to those that are machine-ready and back; that we can and should make as much of that work declarative; and that even interpreted languages should invest heavily in pre-processors to remove new scopes where none are needed, inline where possible, and exploit the CPU to the best of any human ability.

I know, I’m not helping by writing more of it.

I need to get rich enough to stay home and hack all day. That’s the answer. And I do; just ask my long-suffering wife, who bemoans my willingness to spend all Saturday in front of the computer, geeking out.

University computer science libraries have fallen into a sad and tragic state. I went by the University of Washington Engineering Library. When I was at Isilon, that library was a kind of miracle: if we needed to know about anything interesting going on in the world of data management, we could go to the library, find a raftload of interesting papers, digest them over the weekend, and be ready with some new trick for Distributed Dynamic RAID by Monday. It was a thrilling place to enter.

And you could make photocopies of the really interesting stuff.

I went in there recently and discovered that the stacks haven’t changed. I mean that literally. Most of the interesting CS journals have moved entirely on-line, and there wasn’t a single collected journal available dated after 2007. There was a wall labeled “Computer Manuals” that had books covering the initial industrial release of SQL. There was a general damp, fusty smell to everything.

There was one lone machine against the wall where you could survey and, if you had the time, read all the papers the world had waiting, tens of thousands of articles, conference submissions, books, precis, even patents. But you couldn’t print anything out and, since I’m no longer a student, I couldn’t mail copies to myself.

There’s more interesting stuff happening in the world now than there was a decade ago, but the academic CS journals are working ever harder to lock it up and “protect” it from the prying eyes of industry. And that’s a damned shame.
