Storing Documents

Documents are CouchDB’s central data structure. To best understand and use CouchDB, you need to think in documents. This chapter walks you though the lifecycle of designing and saving a document. We’ll follow up by reading documents and aggregating and querying them with views. In the next section, you’ll see how CouchDB can also transform documents into other formats.

Documents are self-contained units of data. You might have heard the term record to describe something similar. Your data is usually made up of small native types such as integers and strings. Documents are the first level of abstraction over these native types. They provide some structure and logically group the primitive data. The height of a person might be encoded as an integer (176), but this integer is usually part of a larger structure that contains a label ("height": 176) and related data ({"name":"Chris", "height": 176}).

How many data items you put into your documents depends on your application and a bit on how you want to use views (later), but generally, a document roughly corresponds to an object instance in your programming language. Are you running an online shop? You will have items and sales and comments for your items. They all make good candidates for objects and, subsequently, documents.

Documents differ subtly from garden-variety objects in that they usually have authors and CRUD operations (create, read, update, delete). Document-based software (like the word processors and spreadsheets of yore) builds its storage model around saving documents so that authors get back what they created. Similarly, in a CouchDB application you may find yourself giving greater leeway to the presentation layer. If, instead of adding timestamps to your data in a controller, you allow the user to control them, you get draft status and the ability to publish articles in the future for free (by viewing published documents using an endkey of now).

Validation functions are available so that you don’t have to worry about bad data causing errors in your system. Often in document-based software, the client application edits and manipulates the data, saving it back. As long as you give the user the document she asked you to save, she’ll be happy.

Say your users can comment on the item (“lovely book”); you have the option to store the comments as an array, on the item document. This makes it trivial to find the item’s comments, but, as they say, “it doesn’t scale.” A popular item could have tens of comments, or even hundreds or more.

Instead of storing a list on the item document, in this case it may be better to model comments into a collection of documents. There are patterns for accessing collections, which CouchDB makes easy. You likely want to show only 10 or 20 at a time and provide previous and next links. By handling comments as individual entities, you can group them with views. A group could be the entire collection or slices of 10 or 20, sorted by the item they apply to so that it’s easy to grab the set you need.

A rule of thumb: break up into documents everything that you will be handling separately in your application. Items are single, and comments are single, but you don’t need to break them into smaller pieces. Views are a convenient way to group your documents in meaningful ways.

Let’s go through building our example application to show you in practice how to work with documents.

JSON Document Format

The first step in designing any application (once you know what the program is for and have the user interaction nailed down) is deciding on the format it will use to represent and store data. Our example blog is written in JavaScript. A few lines back we said documents roughly represent your data objects. In this case, there is a an exact correspondence. CouchDB borrowed the JSON data format from JavaScript; this allows us to use documents directly as native objects when programming. This is really convenient and leads to fewer problems down the road (if you ever worked with an ORM system, you might know what we are hinting at).

Let’s draft a JSON format for blog posts. We know we’ll need each post to have an author, a title, and a body. We know we’d like to use document IDs to find documents so that URLs are search engine–friendly, and we’d also like to list them by creation date.

It should be pretty straightforward to see how JSON works. Curly braces ({}) wrap objects, and objects are key/value lists. Keys are strings that are wrapped in double quotes (""). Finally, a value is a string, an integer, an object, or an array ([]). Keys and values are separated by a colon (:), and multiple keys and values by comma (,). That’s it. For a complete description of the JSON format, see Appendix E, JSON Primer.

Figure 1, “The JSON post format” shows a document that meets our requirements. The cool thing is we just made it up on the spot. We didn’t go and define a schema, and we didn’t define how things should look. We just created a document with whatever we needed. Now, requirements for objects change all the time during the development of an application. Coming up with a different document that meets new, evolved needs is just as easy.

Let’s examine the document in a little more detail. The first two members (_id and _rev) are for CouchDB’s housekeeping and act as identification for a particular instance of a document. _id is easy: if I store something in CouchDB, it creates the _id and returns it to me. I can use the _id to build the URL where I can get my something back.

Your document’s _id defines the URL the document can be found under. Say you have a database movies. All documents can be found somewhere under the URL /movies, but where exactly?

If you store a document with the _id Jabberwocky ({"_id":"Jabberwocky"}) into your movies database, it will be available under the URL /movies/Jabberwocky. So if you send a GET request to /movies/Jabberwocky, you will get back the JSON that makes up your document ({"_id":"Jabberwocky"}).

The _rev (or revision ID) describes a version of a document. Each change creates a new document version (that again is self-contained) and updates the _rev. This becomes useful because, when saving a document, you must provide an up-to-date _rev so that CouchDB knows you’ve been working against the latest document version.

We touched on this in Chapter 2, Eventual Consistency. The revision ID acts as a gatekeeper for writes to a document in CouchDB’s MVCC system. A document is a shared resource; many clients can read and write them at the same time. To make sure two writing clients don’t step on each other’s feet, each client must provide what it believes is the latest revision ID of a document along with the proposed changes. If the on-disk revision ID matches the provided _rev, CouchDB will accept the change. If it doesn’t, the update will be rejected. The client should read the latest version, integrate the changes, and try saving again.

This mechanism ensures two things: a client can only overwrite a version it knows, and it can’t trip over changes made by other clients. This works without CouchDB having to manage explicit locks on any document. This ensures that no client has to wait for another client to complete any work. Updates are serialized, so CouchDB will never attempt to write documents faster than your disk can spin, and it also means that two mutually conflicting writes can’t be written at the same time.

Beyond _id and _rev: Your Document Data

Now that you thoroughly understand the role of _id and _rev on a document, let’s look at everything else we’re storing.

The first thing is the type of the document. Note that this is an application-level parameter, not anything particular to CouchDB. The type is just an arbitrarily named key/value pair as far as CouchDB is concerned. For us, as we’re adding blog posts to Sofa, it has a little deeper meaning. Sofa uses the type field to determine which validations to apply. It can then rely on documents of that type being valid in the views and the user interface. This removes the need to check for every field and nested JSON value before using it. This is purely by convention, and you can make up your own or infer the type of a document by its structure (“has an array with three elements”—a.k.a. duck typing). We just thought this was easy to follow and hope you agree.

The author and title fields are set when the post is created. The title field can be changed, but the author field is locked by the validation function for security. Only the author may edit the post.

Sofa’s tag system just stores them as an array on the document. This kind of denormalization is a particularly good fit for CouchDB.

Blog posts are composed in the Markdown HTML format to make them easy to author. The Markdown format as typed by the user is stored in the body field. Before the blog post is saved, Sofa converts it to HTML in the client’s browser. There is an interface for previewing the Markdown conversion, so you can be sure it will display as you like.

The created_at field is used to order blog posts in the Atom feed and on the HTML index page.

The Edit Page

The first page we need to build in order to get one of these blog entries into our post is the interface for creating and editing posts.

Editing is more complex than just rendering posts for visitors to read, but that means once you’ve read this chapter, you’ll have seen most of the techniques we touch on in the other chapters.

The first thing to look at is the show function used to render the HTML page. If you haven’t already, read Chapter 8, Show Functions to learn about the details of the API. We’ll just look at this code in the context of Sofa, so you can see how it all fits together.

Sofa’s edit page show function is very straightforward. In the previous section, we showed the important templates and libraries we’ll use. The important line is the !json macro, which loads the edit.html template from the templates directory. These macros are run by CouchApp, as Sofa is being deployed to CouchDB. For more information about the macros, see Chapter 13, Showing Documents in Custom Formats.

The rest of the function is simple. We’re just rendering the HTML template with data culled from the document. In the case where the document does not yet exist, we make sure to set the docid to null. This allows us to use the same template both for creating new blog posts as well as editing existing ones.

The HTML Scaffold

The only missing piece of this puzzle is the HTML that it takes to save a document like this.

In your browser, visit http://127.0.0.1:5984/sofa/_design/sofa/_show/edit and, using your text editor, open the source file templates/edit.html (or view source in your browser). Everything is ready to go; all we have to do is wire up CouchDB using in-page JavaScript. See Figure 2, “HTML listing for edit.html”.

Just like any web application, the important part of the HTML is the form for accepting edits. The edit form captures a few basic data items: the post title, the body (in Markdown format), and any tags the author would like to apply.

We start with just a raw HTML document, containing a normal HTML form. We use JavaScript to convert user input into a JSON document and save it to CouchDB. In the spirit of focusing on CouchDB, we won’t dwell on the JavaScript here. It’s a combination of Sofa-specific application code, CouchApp’s JavaScript helpers, and jQuery for interface elements. The basic story is that it watches for the user to click “Save,” and then applies some callbacks to the document before sending it to CouchDB.

Saving a Document

The JavaScript that drives blog post creation and editing centers around the HTML form from Figure 2, “HTML listing for edit.html”. The CouchApp jQuery plug-in provides some abstraction, so we don’t have to concern ourselves with the details of how the form is converted to a JSON document when the user hits the submit button. $.CouchApp also ensures that the user is logged in and makes her information available to the application. See Figure 3, “JavaScript callbacks for edit.html”.

The first thing we do is ask the CouchApp library to make sure the user is logged in. Assuming the answer is yes, we’ll proceed to set up the page as an editor. This means we apply a JavaScript event handler to the form and specify callbacks we’d like to run on the document, both when it is loaded and when it saved.

Now that we know the user is logged in, we can render his username at the top of the page. The variable B is just a shortcut to some of the Sofa-specific blog rendering code. It contains methods for converting blog post bodies from Markdown to HTML, as well as a few other odds and ends. We pulled these functions into blog.js so we could keep them out of the way of main code.

CouchApp’s app.docForm() helper is a function to set up and maintain a correspondence between a CouchDB document and an HTML form. Let’s look at the first three arguments passed to it by Sofa. The id argument tells docForm() where to save the document. This can be null in the case of a new document. We set fields to an array of form elements that will correspond directly to JSON fields in the CouchDB document. Finally, the template argument is given a JavaScript object that will be used as the starting point, in the case of a new document. In this case, we ensure that the document has a type equal to “post,” and that the default format is Markdown. We also set the author to be the login name of the current user.

The onLoad callback is run when the document is loaded from CouchDB. It is useful for decorating the document before passing it to the form, or for setting up other user interface elements. In this case, we check to see if the document already has an ID. If it does, that means it’s been saved, so we create a button for deleting it and set up the callback to the delete function. It may look like a lot of code, but it’s pretty standard for Ajax applications. If there is one criticism to make of this section, it’s that the logic for creating the delete button could be moved to the blog.js file so we can keep more user-interface details out of the main flow.

The beforeSave() callback to docForm is run after the user clicks the submit button. In Sofa’s case, it manages setting the blog post’s timestamp, transforming the title into an acceptable document ID (for prettier URLs), and processing the document tags from a string into an array. It also runs the Markdown-to-HTML conversion in the browser so that once the document is saved, the rest of the application has direct access to the HTML.

The last callback we use in Sofa is the success callback. It is fired when the document is successfully saved. In our case, we use it to flash a message to the user that lets her know she’s succeeded, as well as to add a link to the blog post so that when you create a blog post for the first time, you can click through to see its permalink page.

Sofa has a function to preview blog posts before saving them. Since this doesn’t affect how the document is saved, the code that watches for events from the “preview” button is not applied within the docForm() callbacks.

The last bit of code here is triggered when the user is not logged in. All it does is redirect him to the account page so that he can log in and try editing again.

Validation

Hopefully, you can see how the previous code will send a JSON document to CouchDB when the user clicks save. That’s great for creating a user interface, but it does nothing to protect the database from unwanted updates. This is where validation functions come into play. With a proper validation function, even a determined hacker cannot get unwanted documents into your database. Let’s look at how Sofa’s works. For more on validation functions, see Chapter 7, Validation Functions.

This line imports a library from Sofa that makes the rest of the function much more readable. It is just a wrapper around the basic ability to mark requests as either forbidden or unauthorized. In this chapter, we’ve concentrated on the business logic of the validation function. Just be aware that unless you use Sofa’s validate.js, you’ll need to work with the more primitive logic that the library abstracts.

These lines do just what they say. If the document’s type, author, or created_at fields are changed, they throw an error saying the update is forbidden. Note that these lines make no assumptions about the content of these fields. They merely state that updates must not change the content from one revision of the document to the next.

The dateFormat helper makes sure that the date (if one is provided) is in the format that Sofa’s views expect.

If the person saving the document is an admin, let the edit proceed. Otherwise, make certain that the author and the person saving the document are the same. This ensures that authors may edit only their own posts.

The next block of code will check the validity of various types of documents. However, deletions will normally not be valid according to those specifications, because their content is just _deleted: true, so we short-circut the validation function here.

Finally, we have the validation for the actual post document itself. Here we require the fields that are particular to the post document. Because we’ve validated that they are present, we can count on them in views and user interface code.

Save Your First Post

Let’s see how this all works together! Fill out the form with some practice data, and hit “save” to see a success response.

Figure 4, “JSON over HTTP to save the blog post” shows how JavaScript has used HTTP to PUT the document to a URL constructed of the database name plus the document ID. It also shows how the document is just sent as a JSON string in the body of the PUT request. If you were to GET the document URL, you’d see the same set of JSON data, with the addition of the _rev parameter as applied by CouchDB.

To see the JSON version of the document you’ve saved, you can also browse to it in Futon. Visit http://127.0.0.1:5984/_utils/database.html?blog/_all_docs and you should see a document with an ID corresponding to the one you just saved. Click it to see what Sofa is sending to CouchDB.

Wrapping Up

We’ve covered how to design JSON formats for your application, how to enforce those designs with validation functions, and the basics of how documents are saved. In the next chapter, we’ll show how to load documents from CouchDB and display them in the browser.