/ code

Sorting in Factor

So, mrjbq7's post with words to find the maximum and minimum of a sequence of tuples got me thinking, and playing, and reading the Factor docs: all good things! I started with his test data:

    TUPLE: person name age ;
    CONSTANT: PEOPLE {
        T{ person f "Jim" 30 }
        T{ person f "Sally" 27 }
        T{ person f "Rebecca" 32 }
        T{ person f "James" 28 }
        T{ person f "Benjamin" 22 }
    }

So, the words from his vocab let me say:

( scratchpad ) PEOPLE [ age>> ] maximum .
T{ person f "Rebecca" 32 }

( scratchpad ) PEOPLE [ name>> length ] minimum .
T{ person f "Jim" 30 }

My first idea was, "hey, couldn't I represent this operation as a sequence?" since, if it was a kind of sequence then I could use sequence operations like sum and map on the result. Looking back, that really wasn't a good idea, for reasons I'll discuss below, but I made it work anyway:

    ( scratchpad ) PEOPLE [ name>> length ] seq-by
    T{ seq-by f ~array~ ~quotation~ }
    ( scratchpad ) infimum
    { 3 T{ person { name "Jim" } { age 30 } } }

This uses an underlying assumption about comparisons between sequences: they are compared by their first non-matching elements or, if all elements up to the length of the shortest sequence were equal, then they are compared by their lengths. How do I know? By looking at the help for generic word <=> as implemented by the sequence class:

    USING: kernel math.order sequences sequences.private ;
    M: sequence <=>
        [ mismatch ] 2keep pick
        [ 2nth-unsafe <=> ] [ [ length ] compare nip ] if ;

mismatch does the first part, comparing the elements of the sequences, outputting the first non-matching index or f if the end of either sequence is reached. The if is a little bit tricky, but if mismatch outputs a non-false value, then 2nth-unsafe grabs the nth element from each sequence (without bounds checking, hence -unsafe) and then calls <=> on those elements.

This is the power of the Factor's online help system and a glance at the underlying power of the language: the definitions of all words are not only known by the system, but available in a human-readable format, and the definitions are all hyperlinked to other definitions. Nothing I've yet used can touch this.

Back to the topic at hand: we just made up a virtual sequence called seq-by that applies a quotation to each element of the underlying sequence when asked. Here's seq-by's implementation of the nth generic word:

TUPLE: seq-by seq quot ;
INSTANCE: seq-by sequence

M: seq-by nth ( n seq -- elt )
    [ seq>> nth ]
    [ seq>> nth ]
    [ nip quot>> ] 2tri
    call( elt -- elt' ) swap 2array ;

seq-by wraps around another sequence, stored in the seq slot. It also stores a quotation that should be applied to each element in the quot slot. The nth word for seq-by says: grab the nth element of the underlying sequence, put it on the stack twice, then call quot on the top value of the stack, then store the whole result into a two-element array.

This is kind of cool, as the values in the sequence get re-calculated each time nth is called. But really, what have I done here? I've re-implemented map as an object that follows the sequence protocol which, while interesting for me, is probably not all that useful or clear. When am I going to want to sum up the ages of everyone in the group? Or when am I going to want to go person-by-person in order of age? I conceivably might want to do one of those things, but then if I do, I think the next approach is a lot clearer.

In fact, there's some built-in fun in Factor's sorting.slots vocabulary that handles just what we're trying to accomplish here: sorting our person objects by the values of their slots.

    ( scratchpad ) PEOPLE { { age>> <=> } } sort-by
    --- Data stack:
    { ~person~ ~person~ ~person~ ~person~ ~person~ }
    ( scratchpad ) .
    {
        T{ person { name "Benjamin" } { age 22 } }
        T{ person { name "Sally" } { age 27 } }
        T{ person { name "James" } { age 28 } }
        T{ person { name "Jim" } { age 30 } }
        T{ person { name "Rebecca" } { age 32 } }
    }

sort-by is cool in that it takes a sequence of { slot-name comparator } pairs, so you can have a cascading sort across multiple columns of the objects. In the case above, I just want to sort by age, in natural order (1, 2, 3...). After sorting, the first element is guaranteed to be the one with the lowest age and the last value is guaranteed to be the one with the highest.

Now, back to another example, how about finding person with the shortest or longest name? First, we're going to need a name comparator.

    :: length<=> ( obj1 obj2 -- <=> ) obj1 length obj2 length <=> ;

I used the locals vocab just to make this a little more clear, and because I'm feeling lazy at the moment: get the length of the first thing, then the second thing, then call <=>. If passed two strings, or two sequences, this will compare them based on their length alone instead the usual behavior of <=> for sequences mentioned above.

Now, to call sort-by:

    ( scratchpad ) PEOPLE { { name>> length<=> } } sort-by
    --- Data stack:
    { ~person~ ~person~ ~person~ ~person~ ~person~ }
    ( scratchpad ) .
    {
        T{ person { name "Jim" } { age 30 } }
        T{ person { name "Sally" } { age 27 } }
        T{ person { name "James" } { age 28 } }
        T{ person { name "Rebecca" } { age 32 } }
        T{ person { name "Benjamin" } { age 22 } }
    }

Not bad! Only one new word. A brief discussion of efficiency: sort is a merge sort, meaning it's O(N log N) at worst. Finding the minimum or maximum of an unsorted sequence is O(N). In practice, though, I don't give a fig about order of complexity, nor about any performance penalties incurred by the underlying language or machine model.

What matters to me, then? Well, in this case, a couple of things: using the tools that I already have in the standard library, and writing code that's readable and meaningful — both now and when I have to maintain this sucker in a few weeks, months, or years. Let's say, extending the above example, that our program talks about the 'oldest' of a group of people often, in that case it makes sense to make a new word:

: oldest ( seq -- elt )
    #! ....

Then we can say things like, "get me the name of the oldest person:"

PEOPLE oldest name>>

Or, in idiomatic English, "Of those people, who is the oldest one, and can you tell me their name?"

The beauty of Factor and other concatenative languages is that it doesn't matter if oldest is implemented in terms of maximum, minimum, sort-by, or some other method. Instead, we have a single word that clearly expresses the intent of what we want to do: find the oldest person in a group.

This kind of thing is usually written as an expression or series of statements in imperative languages:

    var oldest_persons_name = people.sort().pop().name; // Find the oldest person, get their name

    // OK, we wrote a helper method, this is a little better:
    var oldest_persons_name = this.oldest_person().name;

Already, the code is getting messy, and the intent is getting buried in so much useless noise and repetition. This kind of thinking also leads to rampant copy-and-pasting of common idioms, or elaborate preprocessor macros, or template programming and other bastard OOP paradigms replete with heinous dependency graphs.

What happens if I need to replace PEOPLE with a result cursor from a database? Or if I now want to sort lists of pets? In Factor, it's all the same, oldest will work on either of those things, presuming they are sequences of objects. In the languages of the day, any of those changes will require changing a lot of code, or maybe adjusting my class hierarchy, all things that are gonna derail me from the task at hand. Not only that, but the code will likely get harder to read, and even end up less flexible than before.

Philosophical question for the night: why does this happen so much in programs? Is it something about the programmers or something about the languages or something about the nature of programming itself?

I can't produce a qualified answer to those questions, but I know this: once you go concatenative, you never go back. :-)