Apex: My Editor Project

Dec 19 2010 Published by under Programming

Lots of people were intrigued by my reference to my editor project. So I'm sharing the current language design with you folks. I'm calling it Apex, as a homage to the brilliant Acme, which is the editor that comes closest to what I'd like to be able to use.

So. The language for Apex is sort of a combination of TECO and Sam. Once the basic system is working, I'm planning on a UI that's modeled on Acme. For now, I'm going to focus on the language.

Goals

It's always good to be clear about what you're trying to do. So for the Apex command language, my goals are:

  1. Conciseness: since I'm planning on using this for all of my everyday programming, it's really important for it to be concise. It doesn't matter if it's easy to read if I need to type something like forall match in regexp.match("foo") do match.replace("bar") end. It's just too damned much typing for an everyday task. In what I describe below, a global search and replace is g/foo/,{r'bar'}.
  2. Consistency: everything works in roughly the same way. Everything can succeed or fail, and all of the semantics are based around that idea. Everything that takes parameters takes parameters in the same way. If something works in one context, it'll work in another.
  3. Clarity: if you look at the code fragments below, this one takes a bit of explanation. The conciseness of the syntax means that to someone who isn't familiar with the language, it's going to be absolutely impossible to read. But the way that things work is straightforward, and so once you understand the basic ideas of the syntax, you can easily read a program. It's not like TECO where you need to know the specific command in order to have a clue of what it does. And the parser can look at an entire program, and tell you before executing any of it whether it's got a syntax error.

Syntax

The syntax for commands is:

stmt: sub | command

sub: 'sub'  sub_params FUN_IDENT sub_params '{' command '}'

sub_params : '(' ( VAR_IDENT ( ',' VAR_IDENT )*  )?  ')'

command : choice_command ( '?' simple_command ':' simple_command )?

choice_command : seq_command ('^' seq_command)*

seq_command : simple_command ( '&' simple_command )*

simple_command : atomic_command
               | '[' command ']'


atomic_command: ( params )? command_name ( post_params )?
	      |  params '!' VAR_IDENT

post_params: post_param (',' post_param)*

post_param: QUOTED_STRING
          | PATTERN
          | block
          | '$(' expr ')'

params	: NUMERIC_LITERAL
        | '(' ( expr ( ',' expr )* )? ')'

quoted_param: QUOTE_CHAR  ( NON_QUOTE_CHAR )* QUOTE_CHAR
            | '(' expr ')'


expr : NUMERIC_LITERAL
     | QUOTED_STRING
     | funcall
     | block
     | command
     | VAR_IDENT


funcall: params FUN_IDENT ( quoted_param )?

block : '{' ( '|'  VAR_IDENT (',' VAR_IDENT)*  '|' )?
            command '}'

FUN_IDENT = '@' [A-Za-z_+*/=!%^&><-]+

VAR_IDENT = '$' [A-Za-z_]+

Commands

This is a language focused on text editing, so the core of it is built around buffers. All of the language constructs implicitly work on a buffer. Within the buffer, you have a focus. The focus is the current location of the cursor. The interesting bit, though, is that the cursor isn't necessarily between two characters. It can span over a range of text, all of which is under the cursor. In other words, the currently selected range of text and the cursor are the same thing.

Commands all work in terms of either moving the cursor, or modifying the contents of the cursor. Most commands have a long name, and a short abbreviated name.

Cursor Motion
Pattern Search: s+/pattern/
Moves the cursor so that it covers the next instance of the pattern in the current buffer. Returns the start position of the match. There's also a "s-" version, which looks for the previous instance of the match.
move: number m unit
Moves the cursor by a specified distance. The units are c (for characters), l (for lines), or p (for pages). So 3ml means "move the cursor forward three lines". Returns the start position of the cursor after the move.
jump: number j unit
Jumps the cursor to a specific position. The units are the same as for the m command, where "character" units specify column numbers. Returns the
extend: e motion-command
Extend cursor. The cursor is extended by the effect of the following command. So, for example, since 3mc is a command that means "move the cursor forward three characters", 3emc is a command that means "extend the cursor forward by three characters": it moves the end-point of the cursor forward by three, without changing the start. -3eml adds the previous three lines to the cursor. es+/foo/ extends the cursor to include the next match for "foo".
pick: (expr, expr)p
Selects a range of text as the current cursor. Each expression is interpreted as a location. (3jl,4jp)p covers the range from the beginning of the third line, to the end of the fourth page. (s+/foo/, s+/bar/)p covers the range from the beginning of the first match of "foo" to the end of the first match of "bar".
selectall: *
Makes the current cursor cover the entire buffer.
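
To make the "cursor is a range" idea concrete, here's a minimal Go sketch of the character-unit versions of move and extend. Everything in it is an assumption made for illustration: the names Buffer, Cursor, MoveChars, and ExtendChars are hypothetical, offsets are bytes, and I've assumed that a plain move collapses the cursor to a point and that a command fails when it would run off the buffer. It is not the actual Apex buffer code.

```go
package main

import "fmt"

// Cursor is a half-open range [Start, End) of byte offsets.
// When Start == End the cursor sits between two characters.
type Cursor struct {
	Start, End int
}

// Buffer is a hypothetical stand-in for an edit buffer.
type Buffer struct {
	Text   string
	Cursor Cursor
}

// MoveChars models "n mc". Assumption: the cursor collapses to a point
// n characters from its old start. It returns the new start position,
// and reports failure if the move would leave the buffer.
func (b *Buffer) MoveChars(n int) (int, bool) {
	pos := b.Cursor.Start + n
	if pos < 0 || pos > len(b.Text) {
		return 0, false
	}
	b.Cursor = Cursor{Start: pos, End: pos}
	return pos, true
}

// ExtendChars models "n emc": a positive n pushes the end-point forward
// without touching the start; a negative n pulls the start backward.
func (b *Buffer) ExtendChars(n int) (int, bool) {
	c := b.Cursor
	if n >= 0 {
		c.End += n
	} else {
		c.Start += n
	}
	if c.Start < 0 || c.End > len(b.Text) {
		return 0, false
	}
	b.Cursor = c
	return c.Start, true
}

// Selected returns the text currently under the cursor.
func (b *Buffer) Selected() string {
	return b.Text[b.Cursor.Start:b.Cursor.End]
}

func main() {
	b := &Buffer{Text: "hello world"}
	b.MoveChars(6)   // like 6mc
	b.ExtendChars(5) // like 5emc
	fmt.Printf("%q\n", b.Selected()) // "world"
	b.ExtendChars(-3) // like -3emc
	fmt.Printf("%q\n", b.Selected()) // "lo world"
}
```

Returning a position together with an ok flag mirrors the rule that every command both returns a value and either succeeds or fails.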
Edits
delete: d
Delete the contents of the cursor. If it's followed by a variable name, then the deleted text is inserted into that variable.
copy: c$var
Copy the contents of the selection into a variable.
insert: i'text'
Inserts text before the cursor. The quote character can actually be any character: the first character after an i is the delimiter, and the insert string runs to the next instance of that delimiter.
append: a'text'
Appends text after the cursor. Quotes work just like i.
replace: r'text'
Replaces the current contents of the cursor with the new text.
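
The flexible quoting used by i, a, and r is easy to parse. Here's a small Go sketch of the rule "the first character is the delimiter, and the string runs to the next instance of it". The function name parseDelimited is made up for this example, and it assumes there is no escape mechanism inside the string.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// parseDelimited reads a quoted argument from the front of src.
// The first character of src is taken as the delimiter; the argument
// runs up to the next occurrence of that same character.
// It returns the argument text, the unparsed remainder, and whether a
// closing delimiter was found.
func parseDelimited(src string) (text, rest string, ok bool) {
	if src == "" {
		return "", "", false
	}
	delim, size := utf8.DecodeRuneInString(src)
	end := strings.IndexRune(src[size:], delim)
	if end < 0 {
		return "", src, false // no closing delimiter
	}
	closeAt := size + end
	return src[size:closeAt], src[closeAt+size:], true
}

func main() {
	// The argument of r'bar' followed by the rest of the program.
	text, rest, ok := parseDelimited("'bar'}")
	fmt.Println(text, rest, ok) // bar } true

	// Picking "|" as the quote lets the text contain an apostrophe.
	text, _, _ = parseDelimited("|it's|")
	fmt.Println(text) // it's
}
```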
Control Flow
global: g/pattern/,block
A simple loop construct. For each match of the pattern within the current cursor, execute the block. So, for example, to do a global search and replace of foo with bar, you'd write * g/foo/,{r'bar'}.
stmt ^ stmt
Choice/logical or statement: any statement can either succeed or fail. ^ allows you to combine statements so that the second one only executes if the first one fails. The statement as a whole succeeds if either the first or second statement succeeds. Returns the value of the statement that succeeds.
stmt & stmt
Sequencing/logical and. The second statement will only be executed if the first one succeeds, and the entire statement succeeds only if both succeed. Returns the value of the second statement.
( stmt )
Should be obvious, eh?
stmt1 ? stmt2 : stmt3
If-then-else. A simple if-then without an else is just an & sequence. You can get an if-then-else effect without this, but it's tricky enough to justify adding this.
loop: l{block}
A general loop. Executes the block over and over as long as it succeeds.
execute: x block
Executes the block on the current cursor. The contents of the current cursor becomes the target buffer of the body of the block, and the cursor is set to position 0 of that target buffer.
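
As a rough illustration of what g does, here's a Go sketch that runs a block on every match of a pattern inside the current cursor and leaves the text outside the cursor alone. The function name global and its signature are hypothetical, the block is reduced to a function from matched text to replacement text (which is all that r'bar' needs), and the pattern is a Go regexp, not whatever pattern syntax Apex ends up with.

```go
package main

import (
	"fmt"
	"regexp"
)

// global applies block to each match of pattern found inside the
// cursor range [start, end) of text. Text outside the cursor is kept
// unchanged. The block receives the matched text and returns the text
// that should replace it.
func global(text string, start, end int, pattern string, block func(match string) string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return text, err
	}
	edited := re.ReplaceAllStringFunc(text[start:end], block)
	return text[:start] + edited + text[end:], nil
}

func main() {
	text := "foo foo foo"

	// Cursor covers only the first seven characters ("foo foo").
	out, _ := global(text, 0, 7, "foo", func(string) string { return "bar" })
	fmt.Println(out) // bar bar foo

	// With the cursor covering the whole buffer, this is * g/foo/,{r'bar'}.
	out, _ = global(text, 0, len(text), "foo", func(string) string { return "bar" })
	fmt.Println(out) // bar bar bar
}
```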
Variables
$ident
Any symbol starting with a $ is a variable. In an expression, a variable name evaluates to its value.
set!: expr!$ident
Assign the result of executing the preceding expression to a variable. If the variable is already defined in this scope, or in any enclosing scope, update it; otherwise, create a new local variable.
External Interaction
<'shellcommand'
Execute shellcommand in an external shell, and insert the standard out from the command into the position at the start of the current cursor; then set the cursor to cover the inserted text.
<<'shellcommand'
Same as the < command, except that it also inserts the contents of stderr from the shell command.
|'shellcommand'
Execute shellcommand, with the current cursor as its standard input, and replace the contents of the cursor with the standard output.
||'shellcommand'
Same as |, except that it also inserts the contents of stderr.
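
For the pipe commands, something along these lines would do the job in Go using the standard os/exec package. This is only a sketch: the function name pipeThrough is made up, it assumes a POSIX sh is available on the path, and it hands back the new cursor contents instead of editing a buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// pipeThrough runs shellCommand with the cursor contents on its
// standard input and returns what should replace the cursor.
// With withStderr set it behaves like "||" and also captures stderr;
// otherwise it behaves like "|" and captures only stdout.
func pipeThrough(selection, shellCommand string, withStderr bool) (string, error) {
	cmd := exec.Command("sh", "-c", shellCommand) // assumes a POSIX shell
	cmd.Stdin = strings.NewReader(selection)
	var out bytes.Buffer
	cmd.Stdout = &out
	if withStderr {
		cmd.Stderr = &out
	}
	err := cmd.Run()
	return out.String(), err
}

func main() {
	// Roughly what |'tr a-z A-Z' would do to a cursor covering "hello".
	out, err := pipeThrough("hello\n", "tr a-z A-Z", false)
	fmt.Printf("%q %v\n", out, err)
}
```

A non-nil error from the command maps naturally onto the statement failing.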
I/O
write: w
Write the current buffer out to a file. If no filename is specified, then use the buffer's associated filename. If a filename is specified, then write it to that file, and update the buffer's filename to match the written name.
open: o'filename'
Open a file in a new buffer.
revert: v
Discard all changes to this buffer.

Expressions

In general, any command is also usable as an expression. Every command returns a value: motion commands return the new cursor position; edit commands return any deleted text, or the size of the change.

Control statements don't depend on true and false values; instead, they're defined in terms of success and failure. Any statement can succeed or fail.
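
Here's a minimal Go sketch of how success/failure semantics could drive &, ^, and ?: in an interpreter. The Stmt type and the helper names are hypothetical, and values are plain ints to keep it short; real Apex values would also need to cover text.

```go
package main

import "fmt"

// Stmt is a hypothetical executable statement: it yields a value and
// reports whether it succeeded.
type Stmt func() (value int, ok bool)

// Seq models "a & b": b runs only if a succeeds, and the result is b's.
func Seq(a, b Stmt) Stmt {
	return func() (int, bool) {
		if _, ok := a(); !ok {
			return 0, false
		}
		return b()
	}
}

// Choice models "a ^ b": b runs only if a fails, and the result is the
// value of whichever statement succeeded.
func Choice(a, b Stmt) Stmt {
	return func() (int, bool) {
		if v, ok := a(); ok {
			return v, true
		}
		return b()
	}
}

// IfElse models "cond ? then : otherwise", branching on whether cond
// succeeded, not on a boolean value.
func IfElse(cond, then, otherwise Stmt) Stmt {
	return func() (int, bool) {
		if _, ok := cond(); ok {
			return then()
		}
		return otherwise()
	}
}

// Succeed builds a statement that always succeeds with value v.
func Succeed(v int) Stmt { return func() (int, bool) { return v, true } }

// Fail is a statement that always fails.
func Fail() (int, bool) { return 0, false }

func main() {
	v, ok := Choice(Fail, Succeed(2))()
	fmt.Println(v, ok) // 2 true
	v, ok = Seq(Succeed(1), Fail)()
	fmt.Println(v, ok) // 0 false
	v, ok = IfElse(Fail, Succeed(1), Succeed(3))()
	fmt.Println(v, ok) // 3 true
}
```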

Arithmetic is done using built-in functions.

Blocks

A lot of statements take block parameters. A block is an executable code fragment. Blocks are enclosed in braces. They always implicitly take the current cursor as a parameter. In the case of the "x" and "g" commands, the block is executed using the current selection as if it were the entire buffer. In addition to the selection, a block can take additional parameters. They're written by enclosing them in "|"s at the beginning of the block. For example, you could define a block that returned the sum of its parameters by writing:

  {|$x, $y| ($x,$y)@+ }

Parameters for a block precede its call. So to invoke the block above, you could use: (3,2)x{|$x,$y| ($x,$y)@+}, which would then return 5.

Blocks are lexically scoped; a block declared inside of another block can access variables from that enclosing block.
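
Lexical scoping, combined with the assignment rule for ! (update the variable if it exists in this or any enclosing scope, otherwise create a local), boils down to a chain of symbol tables. A minimal Go sketch, with a hypothetical Scope type and int-only values as simplifying assumptions:

```go
package main

import "fmt"

// Scope is one level of variable bindings, linked to the scope of the
// block that encloses it.
type Scope struct {
	vars   map[string]int
	parent *Scope
}

// NewScope creates a scope nested inside parent (nil for the outermost).
func NewScope(parent *Scope) *Scope {
	return &Scope{vars: map[string]int{}, parent: parent}
}

// Get looks a variable up in this scope, then in each enclosing scope.
func (s *Scope) Get(name string) (int, bool) {
	for sc := s; sc != nil; sc = sc.parent {
		if v, ok := sc.vars[name]; ok {
			return v, true
		}
	}
	return 0, false
}

// Set updates the nearest existing definition of name; if there is
// none in this or any enclosing scope, it creates a new local.
func (s *Scope) Set(name string, v int) {
	for sc := s; sc != nil; sc = sc.parent {
		if _, ok := sc.vars[name]; ok {
			sc.vars[name] = v
			return
		}
	}
	s.vars[name] = v
}

func main() {
	outer := NewScope(nil)
	outer.Set("$x", 1)

	inner := NewScope(outer) // a block declared inside another block
	inner.Set("$x", 2)       // updates the enclosing $x
	inner.Set("$y", 3)       // creates a local $y

	fmt.Println(outer.Get("$x")) // 2 true
	fmt.Println(outer.Get("$y")) // 0 false
	fmt.Println(inner.Get("$y")) // 3 true
}
```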

You can declare named subroutines. A named subroutine is mostly syntactic sugar for a block. The main difference is that if you go to the trouble of creating a named subroutine, then you can declare both prefix and postfix parameters. The names of named subroutines always start with an "@" symbol. A named subroutine just associates a global name with a block.

  fun ($x) @fact {($x,0)@= ? {1} : { ($x, ($x,1)@-@fact)@* }

When calling a block, the parameters precede it. So to get the factorial of 10, you'd write 10@fact.

For numeric arguments to commands, you can just put the expression before the command instead of a number. For example, to move to line fact(4), you'd write: 4@fact jl. For string parameters that appear in quoted positions, if you use an "$()" instead of a quote character, then the contents are evaluated as an expression, and the result is used as the string parameter value. So to insert the string "5@fact", you could write i'5@fact'. To insert the result of evaluating it, you'd write "i$(5@fact)".


  • Once you design a programming language like this, how much time and effort and difficulty is it to create the interpreter that implements it?

    • MarkCC says:

      It really depends greatly on the language. I've implemented some simple interpreters in as little as two days; and in my days at IBM, I worked on a C++ compiler that totaled about 75 person-years.

      For this, it's already mostly implemented. I started off by implementing a basic edit buffer, and all of the operations it would need. Pretty much everything here has at least a basic, primitive implementation. The regular expression implementation is pretty horrible, so it probably needs a complete rewrite. (It's somewhere around 2 orders of magnitude slower than most off-the-shelf regex packages.)

      Completing the basic interpreter, if I had the time to work on it uninterrupted, would probably take me a week. Given my time constraints, that will expand to at least a month.

      Then the regular expressions are going to be a major problem. I'm debating between implementing a sam-style structural regex engine, and scrapping regular expressions entirely in favor of something like Icon scan expressions. If I go the sam-like route, there are a ton of off-the-shelf regex implementations that I could adapt to fit into apex. But my inclination is to build something more Icon-like. And that translates into something pretty hairy.

  • slabounty says:

    What language are you planning on implementing the interpreter in?

    • MarkCC says:

      The current code is all in Go.

      • slabounty says:

        Go makes sense given the Google connection. Anything interesting in Go that makes this easier/harder than it would be in a more mainstream language? I had thought that concurrency was one of the big selling points of Go, but that doesn't seem to be very relevant here. It'd be fun to see some different implementations of this as it does seem to be a pretty cool idea.

        • MarkCC says:

          There are a lot of reasons for using Go:

          • I really like the language, in particular the type system. It's very well suited towards this kind of programming.
          • It generates fast code. Not the fastest, but very respectable.
          • It's compact. Go executables aren't bloated monstrosities.
          • Go has excellent lightweight libraries that provide a lot of the facilities that I need.
          • Concurrency is actually relevant. This isn't just a standalone language, but the back-end of an editor. The ultimate editor will be, basically, client-server. The editor will talk to the interpreter through a socket connection. Each edit buffer will have its own socket connection. So having really convenient, fast concurrency will be a big help.
          • Tools. Go's tools are great, and blazingly fast. Compiling my current code with a C++ compiler would take 30 seconds. With Go, it's instant. Testing in Go is simple. It's just all around a very lightweight, pleasant language to use.
  • GavinB says:

    Does Go integrate with any windowing toolkits (cross-platform) ?
    A quick glance at Go's 2D graphics api shows that it is still rudimentary.

    • MarkCC says:

      Not particularly well at the moment. There's an incomplete implementation of an interface for Gtk. And there's an in-progress implementation of SWIG bindings.

      For my purposes, it really doesn't matter. The inner architecture of Apex is very sam-like. The GUI is actually a separate program, which talks to the editor back-end through a socket connection. So the front-end can be implemented in pretty-much anything, so long as it can talk to sockets. The current prototype is wxPython; I'll probably end up using the native wxWidgets C++ for the real thing.

  • AJS says:

    The quote character can actually be any character: the first character after an i is the delimiter, and the insert string runs to the next instance of that delimiter.

    Ooooh ..... Interesting!

    Perl has something similar, in that you can use any character as a string delimiter; but if there are a pair of similar characters that are obviously "beginning" and "ending", and you use the "beginning" character for an "any character quote", then it expects the appropriate "ending" character for the corresponding closing quote.

    For instance, qq/wibble/ (here using forward slashes as the delimiters; and qq means behave like "double speech marks" as opposed to 'single speech marks' -- they behave differently) is the same as "wibble". But you could also write qq(wibble) -- the Perl interpreter knows that ) is the opposite of ( and looks for that as the delimiter.

    I think this is neat, but I'll freely admit I'm biased because I like Perl (and not just for the way it uses distinct operators for the distinct operations of numeric addition and string concatenation; though every time JavaScript concatenates a one onto the end of a number instead of incrementing it, I throw up a little in my mouth).

    Are you going to have matched pairs of any-quotes in Apex, or is that a piece of eye candy too far?

    • No, I'm not planning on doing anything like Perl's automatic bracket pairing for quoting.

      First: I hate perl. I really hate perl. It's pretty much a case study in how to create a cryptic, sloppy, incomprehensible muddle of a programming language.

      Second: the point of flexible quoting is to allow you to pick a quote character that works in a particular context. Anything that tries to be clever is going to be prone to breaking that.

      Third: I don't think that there's really a lot of benefit to things like automatic bracket pairing. I can't think of any realistic examples where the confusion factor of overloading the meaning of "()" is outweighed by the clarity of using "()" for quoting, instead of some other character.

      • AJS says:

        I can see your point on the automatic bracket-pairing. At least if they ever introduce a brand new unicode variant containing a brand new pair of bracket-ish characters, it won't suddenly break. And there are plenty of other characters to choose from anyway.

  • Chris says:

    Are you aware that a code language called Apex already exists? It's kind of a hybrid of Java, JavaScript, and SQL query code used to write triggers, queries, and the like in database systems such as Salesforce.

    • MarkCC says:

      I didn't know, but I don't particularly care. Almost any good name that you can come up with has been used *somewhere*. I don't think that there's any great likelihood that my little editor language is exactly going to take over the world; and even if it were to become popular, it's not like there's going to be any confusion between a TECO-ish text editing language and a database trigger scripting language.

      • James Sweet says:

        If you're not worried about getting a cease-and-desist, why not call it "Editing R Us"?

        (I'm just being snarky because none of the blogs I follow on Scientopia are being updated... heh)

  • Kevin C. says:

    Looks pretty nice and simple to understand! (and this from someone who tends to prefer WIMP editors over vi or emacs)

    Isn't your factorial function missing a closing brace though?

    How do you jump to the end of a line? -1jc?

  • Paul C. Anagnostopoulos says:

    What is the point of allowing more than one string quoting option? I've never programmed in a language that allowed this and I've never missed it. Most languages don't have a rich enough set of escape sequences, so perhaps that's the reason?

    ~~ Paul

    • MarkCC says:

      It's a feature that I'm copying from TECO, which I think is pretty important.

      When it comes to programming languages, context is everything. Putting a flexible quoting option into, say, C++, would be a nightmare. In that context, there's really not much reason for it, and it would just make everything more complicated.

      But this is a language that's all about processing text. And so making dealing with all kinds of text simple is important.

      What if you're running a program from the command-line, a la sed? Then you can't use standard quotes without escaping. And you rapidly get into the hell of multiple levels of escaping... quick, how many backslashes do you need to put before a literal double-quote inside of a quoted string in an awk program executed from the command line?

      No matter what quoting character you pick, it's entirely possible to come up with a scenario for text processing where it's wrong. Escape syntax is a solution to that, but it's not a good one. No one wants to be typing a ton of backslashes, or whatever you use for escapes.

      For an example, think of vi. There's a reason why you usually use "/" as a separator in the "g" command; it's the normal quote for a regular expression. But if you're trying to search for references to a complex pathname... do you really want to type a command like ":g/foo/bar/baz/biz/" when you could type ":g|foo/bar/baz/biz|"?

  • Paul C. Anagnostopoulos says:

    You make a good point, although nowadays I never write programs in command languages. Nor do I use vi. This may be because I spent many years writing large programs in VMS DCL (google ) and had my fill of command languages with whacky syntax. (I have no excuse for the fact that I've also written large programs in TeX.)

    I don't like programming languages that are designed to minimize keystrokes. Perl is the classic example, and apparently you're no fan of it, either. If I am putting a file spec in a program, I don't mind having to escape some characters. Of course, slash works just fine as the path separator, eliminating the backslashes.

    In my personal language (we all have one, apparently), you can use a pragma to select the string escape character, so if you are going to use many backslashes in strings you can choose something else for the escape character. I do this when writing programs that deal with TeX.

    My language also has a powerful macro facility, but that is another debate.

    ~~ Paul

  • ix says:

    First of all, happy new year (I'm a bit late with this post).

    Do you plan on supporting the kind of stuff vim does for programmers, i.e. syntax highlighting, indenting, basic refactoring, etc? I'm wondering on how you see that fitting in (or not at all). Proper indenting in particular is something I really miss in vim, and which seems tied to its rudimentary understanding of what it's editing.

    • The answer to that is sort-of complicated - both yes and no, depending on how you look at things.

      My basic belief about editors is that we're dumping far too much crap into some basic tools, and that that's a bad thing.

      Back when Go first came out, before anyone had written an emacs mode for it, I thought I'd give it a shot. The emacs lisp manual is over a thousand pages long. That's not the complete emacs manual: that's just the manual for the emacs extension language. The manual for the CC-mode, the skeleton for modes for c-like languages is over a hundred pages long! That's just ridiculous.

      Eclipse makes it look downright easy to write something in Elisp.

      For comparison, the entire source code lengths of several different powerful text editors:
      - aoeui: 108 pages
      - Sam: 114 pages
      - Acme: 239 pages, including its own window manager!

      The manual for CC-mode is roughly the same size as a complete text editor. That's insane.

      For some reason, when it comes to text editors, we've decided that the way to build a good editor is to throw everything including the kitchen sink all into the editor. The editor is virtually a complete monolithic operating system.

      I'm trying to take a very different approach - making it as easy as possible to integrate the use of external tools. I do plan to support syntax highlighting - but the way it will work is by talking to an external lexer. I'll support code formatting - but it'll be done through an external code formatter. If you want automatic indentation, that's no problem: but it'll be done through an external tool. There's absolutely no good reason that you should need to write a complete code formatter inside of Eclipse when an external tool can do it for you. Why waste your effort writing C++ or Go semi-parsers in Elisp in order to provide code formatting, when there are perfectly good, fast, configurable tools like gofmt and indent that already do that?

      Refactoring tools are a similar story. What are you really doing in a typical refactoring operation? You're picking a code element, and then specifying an operation to perform on it. So why does that need to be part of the editor? When you say: "Here's a code block: extract it into a function", why does that need to be built-in to the editor? Just pass it to an external program "c++-function-extract", with the code to be extracted passed as a location range, and the name of the new function as a string? Then the external program either gives you a new version of the buffer, or a string of edit commands to do the operation.

      • ix says:

        That's close to what I was thinking, but I've never tried to write an editor. 🙂 The problem in my editor of choice (vim) is that the minute you start using external tools it seems to become more cumbersome than it's worth.

        Consider tags (ctags, jtags, ...). You want those updated pretty much every time you touch up a class or a type. It would be nice if the editor knew when you did that and updated its index (eclipse does something like that, by constantly compiling and maintaining a central index). In an editor like vim, I use exuberant-tags, an external tool. Usually I only update the index manually now and then (because I know where I am), but ideally I would like to not even have to think about it. You could mess around a bit with buffer write events or whatever and have it run automatically every time you make an update. Likely not ideal, but Good Enough. The trouble with that is that this is only one of the many things that you would want done by external programs, and I'm way too lazy to write the kind of .vimrc's that are longer than the vim source code.

        Plus it does tend to take a bunch of figuring out. Soon enough, you're no longer customizing but constructing your own development tool kit (the border is fuzzy, I admit). Anyway, it always interests me to see where people are going with this. A big monolithic piece of software is cumbersome, often slow and crash-prone (Eclipse, depending on your plugin set), but it definitely feels like using an external tool set has some missing links too. It's hard to quantify, but when using either I feel like all this time I've put into getting to know my editor should be paying off right now, instead of constantly leading to more config pain.

  • Pages 31 - 32 of Royden's Real Analysis has a more complete listing of the field axioms. One important field axiom is the trichotomy axiom that says: given any two real numbers x and y one and only one of the following holds: x y. Brouwer gave a counterexample to it in, Benacerraf, P. and Putnam, H. Philosophy of Mathematics, Cambridge University Press: Cambridge, 1985, pp. 61 - 62. I have my version of the counterexample in the paper, The new real number system and discrete computation and calculus, Neural, Parallel and Scientific Computations, 2009, 17, pp. 59 – 84. Another important field axiom is the completeness axiom, a variant of the axiom of choice that leads to the contradiction in R^3 known as the Banach-Tarski Paradox.

    Cheers,

    E. E. Escultura

  • Sorry, the statement of the trichotomy axiom is incorrect. The full statement should be: Given two real numbers, x, y, one and only one of the following holds: x < y, x = y, x > y.

    Cheers,

    E. E. Escultura
