It's been a while since I've written about any data structures. But it just so happens that I'm actually really working on implementing a really interesting and broadly useful data structure now, called a Rope.
A bit of background, to lead in. I've got this love-hate relationship with some of the development tools that Rob Pike has built. (Rob is one of the Unix guys from Bell labs, and was one of the principal people involved in both the Plan9 and Inferno operating systems.) Rob has implemented some amazing development tools. The two that fascinate me were called Sam and Acme. The best and worst features of both are a sort of extreme elegant minimalism. There's no bloat in Rob's tools, no eye-candy, no redundancy. They're built to do a job, and do it well - but not to do any more than their intended job. (This can be contrasted against Emacs, which is a text editor that's grown into a virtual operating system.) The positive side of this is that they're incredibly effective, and they demonstrate just how simple a programmers text editor should be. I've never used another tool that is more effective than Acme or Sam. In all seriousness, I can do more of my work more easily in Sam than I can in Emacs (which is my everyday editor). But on the other hand, that extreme minimalist aesthetic has the effect of strictly eliminating any overlaps: there's one way to do things, and if you don't like it, tough. In the case of Acme and Sam, that meant that you used the mouse for damn-near everything. You couldn't even use the up and down arrows to move the cursor!
This is a non-starter for me. As I've mentioned once or twice, I've got terrible RSI in my wrists. I can't use the mouse that much. I like to keep my hands on my keyboard. I don't mind using the mouse when it's appropriate, but moving my hand from the keyboard to the mouse every time I want to move the cursor?. No. No damned way. Just writing this much of this post, I would have had to go back and forth between the keyboard and mouse over 50 times. (I was counting, but gave up when I it 50.) A full day of that, and I'd be in serious pain.
I recently got reminded of Acme, because my new project at work involves using a programming language developed by Rob Pike. And Acme would really be incredibly useful for my new project. But I can't use it. So I decided to bite the bullet, and use my free time to put together an Acme-like tool. (Most of the pieces that you need for a prototype of a tool like that are available as open-source components, so it's just a matter of assembling them. Still a very non-trivial task, but a possible one.)
Which finally, leads us back to today's data structure. The fundamental piece of a text editor is the data structure that you use to represent the text that you're editing. For simplicity, I'm going to use Emacs terminology, and refer to the editable contents of a file as a Buffer.
How do you represent a buffer?
As usual with data structures, you start by asking: What do I need it to do? What performance characteristics are important?
In the case of a text buffer, you can get by with a fairly small set of basic operations:
- Fast concatenation: concatenating blocks of text needs to be really fast.
- Fast insert: given a point in a block of text, you need to be able to quickly insert text at that point.
- Fast delete: given two points in a block of text, you need to be able to quickly remove the text between those points.
- Reasonably fast traversal: Lots of algorithms, ranging from printing out the text to searching it are based on linear traversals of the contents. This doesn't have to be incredibly fast - it is an intrinsically linear process, and it's usually done in the context of something with a non-trivial cost (I/O, regular-expression scanning). But you can't afford to make it expensive.
- Size: you need to be able to store effectively unlimited amounts of text, without significant performance degradation in the operations described above.
Continue Reading »