A post you wish to read before considering using MongoDB for your next app

This post details lessons learned from eight months of working with MongoDB. Although I’m writing this article alone, my experience with MongoDB was in the context of teamwork, so I’ll be using “us” often in this article.

In this blog post, I’ll be sharing the kind of experience with MongoDB that you can only get after working with it for quite some time. It also contains notes on upcoming MongoDB improvements, as well as general thoughts on data modeling.

I’ll be mostly focusing on downsides, but keep in mind that I’m only sharing my own experience, which may not apply to your case. Also, I haven’t been in the situation of working with all the features of MongoDB. In particular, I have never used sharding and all other scalability-related features, so I have no opinion whatsoever on these.

Quick intro to MongoDB

Core concepts

In MongoDB, a database is composed of collections (roughly the equivalent of tables, for those who are used to SQL). A collection contains documents, which are JSON objects (with some additions). This means that a document can represent objects in the naive JavaScript sense (mappings between strings and values) and even arrays, but also nested objects (when values are themselves objects or arrays).

Among the document values, you have the classic strings, numbers and booleans, but also ObjectIDs. Each document in a collection is uniquely identified by an ObjectID, and you can store a reference to a document as a value using its ObjectID, which is nothing more than an object wrapper around a unique string.

That’s about it for the core concepts. In a nutshell, it means that when you model your data, you have to think in terms of tree-shaped entities that you’ll put in collections. Cross-collection references are possible using ObjectIDs (there is also a DBRef, which isn’t strictly necessary and is just a built-in structure bundling an ObjectID and a collection name).
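To make this concrete, here is a sketch in the mongo shell (the collection and field names are made up for illustration):

```js
// In the mongo shell: a document in a hypothetical "authors" collection...
db.authors.insert({ name: "Ada Lovelace", links: { twitter: "@ada" } });
var ada = db.authors.findOne({ name: "Ada Lovelace" });

// ...and a document in a "posts" collection that references it by ObjectID.
db.posts.insert({
  title: "On the Analytical Engine",
  tags: ["math", "computing"],   // arrays are first-class values
  author: ada._id                // cross-collection reference via ObjectID
});
```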

Queries

MongoDB has its own query system. It does the job for the most part, with a couple of limitations. One good point is that you can query deeply nested data and retrieve just the nested parts you need, without having to pull the entire document.

One minor limitation is very specific (it occurs when you have nested arrays) and is planned to be fixed. Another minor limitation is that since the query language relies on dots, dots can’t be used in object keys. This is usually not a problem, except when you want to use URLs as keys, as was my case. It has to be noted that, apparently, when using the update primitives (like $push), dot-keyed objects can be inserted anyway, putting your database in a quite inconsistent state (that’s how I discovered the dot restriction :-s and needless to say, the error message was rather cryptic).
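For illustration, querying the hypothetical documents above with dot notation might look like this (a sketch):

```js
// Dots reach into nested objects, both for matching and for projecting
// just the part of the document you need.
db.authors.find(
  { "links.twitter": "@ada" },  // match on a nested field
  { "links.twitter": 1 }        // return only that path, not the whole document
);

// This path syntax is also why dots are forbidden in keys: a key like
// "example.com" would be ambiguous with "a field com nested under example".
```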

The second limitation is more fundamental: any query you make is performed on a single collection. In a nutshell, there is no equivalent to the SQL JOIN. This has a very important consequence for any application written on top: whenever you need data from different collections, using the ID values in one to reach objects in another, you have to do several round trips between your application and your database. That’s fine-ish if they’re on the same server; it’s an outrageous constraint when they are on different servers, because it means network latency is imposed by a database design choice.
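Concretely, “joining” the hypothetical posts and authors above has to happen at the application level, one query per collection (sketch):

```js
// Round trip 1: fetch the post.
var post = db.posts.findOne({ title: "On the Analytical Engine" });
// Round trip 2: dereference the ObjectID to fetch its author.
var author = db.authors.findOne({ _id: post.author });
// In SQL, a single JOIN query would have returned both in one round trip.
```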

MongoDB’s justification for the lack of joins is that data are denormalized in MongoDB. I have to disagree with that. First off, data modeling is hard and people make mistakes. Sometimes, what should have been one collection is two. Changing a data model is a perilous task, and having to pay a systematic performance cost for it is unacceptable. Then, from experience, not every application’s data model can be represented as trees. Some are graphs, true graphs. And when that’s the case, I don’t see why Mongo should impose a performance cost (either in round trips or in data duplication); that’s just ridiculous.

That reason alone is why I don’t recommend Mongo. Although sold as flexible, the inability to make multi-collection queries makes Mongo quite inflexible. This got me thinking that, from now on, I will be looking at databases that can answer any query in one round trip.

The no-schema fantasy and reality

Data in the database

MongoDB, as one champion of the NoSQL movement, claims that you don’t need a schema for your data. It is true, from a technical standpoint, that MongoDB doesn’t enforce a model. In reality, use and maintainability mean that within the same collection, you tend to structure objects roughly the same way. Mongo allows optional fields where SQL doesn’t, but that’s pretty much where the difference stops in practice.

I have to share that in one case, there was one field on which we had decided to do no validation, because we knew that the data to be stored there would change over time. The change did happen, and having the freedom to store whatever we needed in this field at “no cost” really was a life-saving feature in our context.

Experience shows that storing objects from code without a model quickly leads to a maintenance issue: since you don’t have a model, you start wondering “do we already have a field X in collection Y?”, or “how is field Z structured, by the way?”, which inevitably forces you to write schema documentation to compensate for the lack of an enforced schema.

Once you have acknowledged that you do need a form of schema anyhow, an interesting question is where to set the bar of strictness. Imposing a schema at the database level is a development burden that devs using SQL experience the hard way. The reality is that the data model of an application changes, maybe because the needs change, maybe because the current design is imperfect. The data model does change, and the cost of this reality is too high in SQL. Yet having no schema, or only a documentation schema, puts too high a cognitive burden on the developer.

The middle ground we found was to use Mongoose, which lets you programmatically define a model in a fairly readable and somewhat declarative way. Documentation can easily be added as comments to the schema declaration, and checks on data integrity can be performed at runtime. I have a good share of rants against Mongoose specifically, but for the most part, it does the 80% of the job that you need to move on safely and conveniently with your communication with the database.
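A minimal sketch of what this looks like (assuming Mongoose 3.x; the model and field names are made up):

```js
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/blog');

// The schema doubles as documentation: field names, types and
// integrity checks all live in one readable place.
var PostSchema = new mongoose.Schema({
  title:  { type: String, required: true },
  tags:   [String],                 // an array of strings
  visits: { type: Number, min: 0 }, // checked at runtime on save
  author: { type: mongoose.Schema.Types.ObjectId, ref: 'Author' }
});

var Post = mongoose.model('Post', PostSchema);
```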

When the world around the database changes

So you have an application and a database. Your application necessarily relies on assumptions about your database: collection names, field names, value types, value ranges, the shape of nested structures, etc. At some point, the application you write on top of the database will need the database model to change. In SQL, that systematically means an “ALTER TABLE” or equivalent operation. Interestingly, in Mongo, some classes of assumption changes require no change to the database whatsoever. For example, if you add a field to new documents in a collection, you don’t need to add it to all existing documents. Depending on the case, you can ignore the missing field or deal with it at the application level.
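For example, tolerating a field that only newer documents have can be a one-liner at the application level (a sketch; the field name is hypothetical):

```js
// Old documents were written before "tags" existed; treat absence as empty.
var tags = post.tags || [];
```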

But of course, for other classes of changes, you’ll need to write a script to reorganise the data. To the best of my knowledge, there is no way to send a script to the database so that it reorganises itself in situ. You need to pull the data, remodel it and re-push it. Needless to say, it feels really dirty. I hope I have missed something.
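A migration then looks something like this shell sketch (pull, remodel, re-push; field names hypothetical):

```js
// In the mongo shell: split a legacy "name" field into first/last
// on every document, one round trip per document.
db.authors.find({ name: { $exists: true } }).forEach(function (doc) {
  var parts = doc.name.split(' ');
  doc.first = parts[0];
  doc.last  = parts.slice(1).join(' ');
  delete doc.name;
  db.authors.save(doc); // re-push the remodeled document
});
```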

Regardless, the problem seems somewhat equivalent in SQL. Data model change is part of the development flow, and I wish it were treated as a first-class use case in database designs. I’m not an expert in that field; if I’m plain wrong, feel free to share your knowledge in the comments.

The Map-Reduce Eldorado

Map-reduce promises huge performance gains, on the condition that you write your code in a certain way (I’m giving a practical definition here, not a formal one). Be aware that it’s not a silver bullet applicable to any problem.
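For concreteness, here is a minimal sketch of a mapReduce call (collection and field names are hypothetical; API as of Mongo 2.x):

```js
// Count visits per author across the whole collection.
db.posts.mapReduce(
  function () { emit(this.author, this.visits || 0); },  // map: one emit per document
  function (key, values) { return Array.sum(values); },  // reduce: must be associative
  {
    finalize: function (key, total) { return { visits: total }; },
    out: { inline: 1 } // return results directly rather than writing to a collection
  }
);
```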

I have found the API more complicated than necessary, but once you understand it, you’re good to go; I’ll let you judge for yourself. At least the map, reduce and finalize functions can be written in JS… I mean… sort of… which brings me to:

Third-world JavaScript

The JS engine is SpiderMonkey 1.7 (the one shipped with Firefox 2, for the curious). It means you have no Object.keys, only Array#forEach among the array extras, etc. It feels like writing grandma’s JS when you actually know the language and the goodness added in ES5.
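In practice, that means re-implementing basics by hand; for instance, a stand-in for Object.keys (a sketch):

```js
// SpiderMonkey 1.7 has no Object.keys; re-implement it for map/reduce code.
function keys(obj) {
  var result = [];
  for (var k in obj) {
    if (Object.prototype.hasOwnProperty.call(obj, k)) {
      result.push(k);
    }
  }
  return result;
}
```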

In theory, all map and reduce calls could be run independently in parallel; that’s the reason you need to write your code with some discipline. In practice, on a single MongoDB install, all operations occur in sequence; I have read somewhere that this is because of SpiderMonkey. That, and SpiderMonkey 1.7’s raw performance, of course…

Good news: there is a plan to switch to V8, and that should solve all the JS-related problems. Bad news: it’s unclear when the switch will be effective.

Debugging

The debugging experience is very painful. There is no proper debugger and no console.log. The best tool you have is conditional throwing, because the only thing you can get back from a MapReduce operation is the final data or the first uncaught error (which stops the operation). That’s annoying when your map, reduce or finalize function gets above 30 lines. I came to the point of writing a small browser-based emulator and pulling data out of the database to test my MapReduce code in an environment where I could debug it (I probably should open-source that, actually…).
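The conditional-throw trick looks something like this (a sketch; the _id value is a made-up example, and tojson is available in the server’s JS environment):

```js
function map() {
  // No debugger, no console.log: smuggle state out through the only
  // channel available, the first uncaught error, and only for the
  // document you care about.
  if (this._id.toString() === "50b59cd75bed76f46522c34e") {
    throw new Error("inspect: " + tojson(this));
  }
  emit(this.author, 1);
}
```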

… and a couple of noobish design mistakes

The points here are not major drawbacks; they are just some WTFs I came across.

Can’t rename a database

Self-explanatory title. No need to explain why this feels like such a basic feature that it should have been implemented as part of Mongo 0.0.1. Are we really at Mongo 2.2 and still can’t rename a database? There is a ticket to track the bug.

Global lock

Up to and including Mongo 2.0, there is a global MongoDB lock. It means that write (or write+read) operations on completely unrelated databases on the same install can’t be performed concurrently. I’m sorry, but this is plain stupid.

As of Mongo 2.2, there is a per-database lock, which means that operations on independent collections within the same database can’t be done concurrently. Same issue as before, at a different level (the one that matters when you have only one database). They are working on locks at finer-grained entities. I have no information as to when this will happen, but they seem committed to fixing the problem.

Safe off by default

I have read in some blog posts that safe mode is off by default. I haven’t experienced data loss, so I have no clue what the danger of being in “unsafe mode” is. It can be changed anyway, but it really doesn’t seem like a good default value.
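In shell terms, “safe mode” boils down to checking getLastError after each write, which the driver does for you when safe mode is on (a sketch, assuming the 2.x shell helpers):

```js
// A write is fire-and-forget by default; "safe mode" amounts to the driver
// issuing this check after every write and surfacing the error.
db.users.insert({ name: "Ada" });
printjson(db.getLastErrorObj()); // err is null when the write went through
```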

Additional readings

Some insightful readings. The information in these posts partially overlaps with one another and with what I have said already.

Conclusion

MongoDB has some major design mistakes in it. Its developer experience is overall much better than the SQL one, but I still feel dissatisfied by both and wouldn’t recommend either for now. Sadly, I don’t know of any database software I would recommend. I’ll likely be interested in exploring Neo4j for a next experience. I’m open to suggestions, of course.


10 thoughts on “A post you wish to read before considering using MongoDB for your next app”

  1. As I am currently obliged to work with MongoDB, and furthermore to perform in-depth queries on embedded data, I found your blog a most interesting read. Lucky me for not having had to suffer through some of your scenarios, I suppose.
    I’ll point out that, for the time being (speaking in terms of stable releases), what is tearing me apart the most is that without JOINs, and thus without virtual collections, extracting only one field from embedded data, filtered by value, is nearly impossible with just a passing knowledge of Java, when it comes to semi-automation of the aforementioned process.

  2. MongoDB is a NoSQL type of DB. The core concepts of NoSQL usually are:
    * Not using SQL queries.
    * Not supporting the JOIN operation (to allow data to be partitioned among different machines).
    * NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema.

    MongoDB is a NoSQL database, so it is expected to have these attributes. This is not a shortcoming; it is a design choice that allows for other technical features. You should list this in the core concepts section.

    The possibility to add fields to some documents, but not all, can also be seen as a strength of NoSQL. It depends on your requirements.

    • I agree with what you’re saying. I didn’t mean this post to be an intro to NoSQL or MongoDB; that’s why I didn’t expand on that.
      I wanted to share a developer-oriented experience, not necessarily a perfect overview of MongoDB or NoSQL. This is a really biased post, based on my experience.

      I also warn very early in the article that “I have never used sharding and all other scalability-related features, so I have no opinion whatsoever on these.”

      I learned only long after this experience that if you don’t have huge amounts of data requiring easy scalability, MongoDB may actually be an awful choice. You only learn by making mistakes, I guess.

      In our case, we needed the schema flexibility, but could have traded the scalability to get JOINs back. Is there a database that enables that?

  3. Thank you for that article.
    I have tried MongoDB for a month now, but I think I will switch back to MySQL.
    The lack of JOINs is my biggest problem, because the application gets slower and slower the more data I add :(

  4. Interesting read! If you’re considering Neo4j, you may want to have a look at Structr. It’s an open-source application framework on top of Neo4j. Its back-end provides bi-directional mapping between JSON documents and (sub-)graphs, based on a strong schema definition which either lives in a schema graph (with a UI editor for ad-hoc schema editing) or in Java beans (if you want full control and 100% flexibility). And there’s a supplemental UI which provides CMS functionality for rapid development of HTML5 web apps.

    http://structr.org

  5. Two comments in a row recommending graph databases; was my blog post relayed in these communities recently? ;-)

    More seriously, since I wrote this blog post, I’ve been a bit turned off by Neo4j because of their licensing mess. As far as I’m concerned, licensing matters A LOT, and Neo4j’s interpretation of the AGPL licence is pretty scary.
    Licensing matters a lot: the data in the database belongs to the user who puts it in (through the applications I write). The vast majority of the code I write is open source (MIT), but in the rare cases where I come to the conclusion that I want to keep it private, I expect the right to do so. Any licence that’s at all ambiguous on this topic is a deal-breaker for me. Any company that plays smart with subtly different licences for pretty much the same product shows this ambiguity.

    As a web app developer, I’m particularly interested in the Blueprints ecosystem https://github.com/tinkerpop/blueprints/wiki/The-Benefits-of-Blueprints and a bit less in the specific underlying database implementation. Among other things, Blueprints guarantees that I can change the underlying database and keep my code unchanged. This matters a lot, especially now, with graph databases at various levels of maturity.
    The Property Graph Model https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model also makes a lot of sense to me. Much more so than the Neo4j model which, last I looked at it, didn’t really allow attaching structured data to the edges.

    I haven’t really had a big application to build on top of a graph database (lots of people don’t think they’re ready, so they default to wanting SQL), but when I last looked at all of this a few months back, I thought that if I had an app to build, I’d probably use OrientDB+Blueprints on top… until Titan gets stable. Titan looks very promising, but it’s at a very early stage. (To be honest, I don’t really know that much, so my opinions are really based on impressions; don’t give them too much credit.)

    • Hi David,

      Yes, your post has reached the graph DB community, as graph databases are typically what people evaluate when they find document databases not flexible enough.

      A few words on Neo4j:

      a. It’s not AGPL anymore (since 2011), but GPL, exactly like MySQL, for example. And for commercial use, there’s a very interesting Startup License (since 2013). http://nosql.mypopescu.com/post/4775061058/emil-eifrem-about-neo4j-1-3-and-the-neo4j-gpl-community
      b. Blueprints is a 1:1 copy of the original Neo4j API. One of the authors is Peter Neubauer, a co-founder of Neo. There’s an interesting forum thread: https://groups.google.com/forum/#!topic/neo4j/TWISaqitivo

      From what I understand, what you really want (and many others do, too) is the best of both worlds: the ease of use of a document database (store/retrieve JSON documents) and the power of a graph database when it comes to (complex, fast) queries.

      If I understand you right, what you describe as ‘JOIN’ is the process of aggregating data in your database based on (ad-hoc) queries, to form JSON documents suited to your application, without the need to store the data in the exact structure in which you retrieve them. This requires a schema (as you wrote), and the question is where this schema lives, how flexible it is, and how simply it can be changed. I think we have some answers in Structr, but of course there’s always more than one way to get things done.

      Axel

  6. “Yes, your post has reached the graph DB community, as graph databases are typically what people evaluate when they find document databases not flexible enough.”
    => Ahah :-) I had never thought of it that way, but it makes so much sense.

    “[Neo4J]’s not AGPL anymore (since 2011), but GPL, exactly like MySQL”
    => Ok, good to know.

    I had missed that Neo4j was a property graph (I knew it could have data on the nodes, but I had missed that it was possible on the edges too. Is that “recent”?)

    “From what I understand, what you really want”
    => Overall, what I want is a natural way to model data (and graphs resonate a lot with how I think about the world) and a database that doesn’t get in the way when I make a query (like MongoDB does by forcing several concrete queries for a single conceptual query). JSON documents aren’t important in themselves, but the key-value model tends to be very useful indeed.
    I’ll take a look at Structr; it looks interesting :-)
