Quick Intro

  • Rob Pocklington
  • Full-stack developer (jack of all trades)
  • Started with the usual SQL DBs (8 - 10 years)
  • Tinkered with a Graph DB (Neo4j) for 1 year

  • Been working with MongoDB in production for 2+ years





  • What does NoSQL typically mean?

  • Web-scalable
  • Fault tolerance
  • Scalable architecture
  • Different QLs
  • K,V or JSON-based
  • Eventual (or tunable) consistency
  • What's out there? (K,V)


  • DynamoDB
  • Cassandra
  • RocksDB
  • Redis
  • Neo4j - GraphDB
  • What's out there? (NoSQL)


  • RethinkDB
  • Couchbase
  • Riak
  • MongoRocks
  • Elasticsearch
  • FoundationDB
  • Introduction - what is MongoDB?

  • A Document DB
  • Horizontally scalable
  • Designed for high performance
  • Designed for flexibility
  • Framework SDKs and drivers in most common languages
    (Ruby, Java, .NET, JavaScript etc.)
  • MongoDB - History

  • Built in 2007 by 10gen to support their PaaS
  • Suffered from some early bad press
    (optimistic defaults = loss of data)
  • Used by Foursquare, Forbes, Disney, Cisco, GitHub, Bitly, eBay, LinkedIn, Craigslist, Adobe etc.
  • Now the fourth most popular DB and the most popular NoSQL DB
  • It is open source!
  • Features


  • Strong Data Types (Dates, Booleans, Arrays ...)
  • Extensive query support
  • Large file storage (GridFS)
  • Indexing / Load Balancing
  • Capped Collections
  • Map reduce
  • Aggregation pipelines

  • Features (cont...)

  • No join tables
  • No multi-document transactions as of v3.2 (tunable consistency)
  • Atomic at a document-level
  • Geo-query support (simple and complex)
  • Full text searching (not as good as ES)

  • Features (cont...)

  • Command Line Interface (REPL)
  • Sharding and Replication (for scaling and redundancy)
  • Full set of tooling (backup and monitoring)
  • Custom Validations (think: constraints)
  • BI (v3.2)
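
    A minimal sketch of what "custom validations" can look like, assuming a
    hypothetical "accounts" collection. The field names and rules are invented
    for illustration; the validator is an ordinary query document that the
    server checks on writes.

```javascript
// Hypothetical "accounts" collection: names and rules are illustrative only.
const accountValidator = {
  owner: { $type: "string" },          // must be a string
  balance: { $gte: 0 },                // never negative
  status: { $in: ["open", "closed"] }  // small enumerated set
};

const createOptions = {
  validator: accountValidator,
  validationLevel: "strict",  // check every insert and update
  validationAction: "error"   // reject invalid writes rather than just warn
};

// In the mongo shell this would be passed as:
//   db.createCollection("accounts", createOptions)
console.log(JSON.stringify(createOptions, null, 2));
```

    Setting validationAction to "warn" instead only logs violations, which can
    be a gentler way to introduce rules on an existing collection.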
  • Software Vendors

  • https://mongolab.com
  • https://scalegrid.io
  • https://cloud.mongodb.com



  • Common Reasons for Use

  • Flexible - good for relational, denormalised or graph data structures
  • Cuts down on time to market and speeds development

  • natural choice for any JSON structured data
  • popular choice for IoT (real time metrics)
  • Considerations

  • Max size of document (16MB)
  • Big Files / Binaries >> GridFS
  • No enforced schema != no schema (you still have to design one)
  • Schema design (referencing vs. embedding)

  • Begin with the end in mind
  • CAP Theorem

  • MongoDB Architecture

  • Optimised for 64-bit systems, written in C++ (some Go for newer tools)
  • Tunable C.A.P (Consistency, Availability, Partition Tolerance)
  • Pluggable Storage engines (3.0)
  • MMAPv1 (original)
  • WiredTiger (with compression)
  • MongoRocks (extra library)
  • MongoDB Design

  • GUID-like ObjectIds for _id (by default)

  • Has an equivalent to foreign key relationships (referencing)
  • 1 -> M, M -> 1, M -> N

  • or you can embed (more later)
  • Schema Considerations

  • Design is the difference between loving and hating MongoDB
  • Schema reviews are important
  • DB Migrations are easy (but still necessary)
  • Design for use (fast write / fast read)

  • Denormalisation is not evil
  • Mongo Commands

  • Commands in MongoDB (vs. SQL)
  • Let's CRUD in Mongo!
  • Other queries (Regex, etc.)
  • Atomic updates

  • 3T - MongoChef Demo
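
    The CRUD, regex and atomic-update bullets above can be sketched roughly as
    follows. The "users" collection and its fields are assumptions made up for
    this example; the shell calls are shown as comments because they need a
    running server.

```javascript
// Hypothetical "users" collection; field names are made up for illustration.

// Create: SQL INSERT INTO users (name, visits) VALUES (...)
const newUser = { name: "Rob Pocklington", visits: 0, tags: ["mongodb"] };

// Read: SQL SELECT * FROM users WHERE name LIKE 'rob%' (case-insensitive)
const byNamePrefix = { name: { $regex: "^rob", $options: "i" } };

// Update: SQL UPDATE users SET visits = visits + 1, lastSeen = ... WHERE ...
// $inc and $set are applied atomically to the one matched document.
const recordVisit = { $inc: { visits: 1 }, $set: { lastSeen: new Date() } };

// Delete: SQL DELETE FROM users WHERE visits = 0
const neverVisited = { visits: 0 };

// Shell equivalents:
//   db.users.insertOne(newUser)
//   db.users.find(byNamePrefix)
//   db.users.updateOne(byNamePrefix, recordVisit)
//   db.users.deleteMany(neverVisited)

// Local sanity check of the regex filter using plain JavaScript.
const pattern = new RegExp(byNamePrefix.name.$regex, byNamePrefix.name.$options);
console.log(pattern.test(newUser.name)); // true
```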
  • Referenced Documents (foreign keys)

  • Embedded Documents (nested)
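
    To make the two shapes concrete, here is a small sketch with invented
    customer and order documents. A plain array stands in for the customers
    collection, so this only illustrates the document layout and the extra
    lookup that a reference implies.

```javascript
// Referencing: the order stores only the customer's _id (like a foreign key).
const customer = { _id: "cust-1", name: "Ada" };
const orderReferenced = { _id: "order-1", customerId: customer._id, total: 42 };

// Embedding: line items live inside the order document itself.
const orderEmbedded = {
  _id: "order-2",
  customer: { name: "Ada" },  // denormalised copy of the customer name
  items: [
    { sku: "A1", qty: 2, price: 10 },
    { sku: "B7", qty: 1, price: 22 }
  ]
};

// Resolving a reference takes a second lookup (done here against an array
// that stands in for the customers collection).
const customers = [customer];
const resolved = customers.find(c => c._id === orderReferenced.customerId);

// An embedded order is complete after a single read.
const embeddedTotal = orderEmbedded.items.reduce(
  (sum, item) => sum + item.qty * item.price, 0);

console.log(resolved.name, embeddedTotal); // Ada 42
```

    Embedding favours fast reads of data that is used together; referencing
    avoids duplicating data that changes often or is shared widely.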

  • Banking scenario

  • You can't use non-transaction dbs for banking, right?
  • Well, Stripe does.
  • Just get it right!
  • To be atomic in MongoDB you must execute an atomic operation on a single document.
  • Banking scenario (cont...)

  • There is no `BEGIN TRANSACTION` - time to do it another way.
  • db.account.update({ _id: ..., balance: { $gte: amount }},
    { $inc: { balance: -amount }});
  • More complex strategies can use MVCC or 2-phase commits if required
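
    A sketch of the guarded debit idea: the balance check lives in the filter
    and the subtraction in the update, so both happen in one single-document
    operation. The helper names and the in-memory applyDebit stand-in are
    hypothetical and only mimic what the server does.

```javascript
// Build the filter and update for a guarded debit on one account document.
// "balance" and the id values are illustrative names.
function buildDebit(accountId, amount) {
  if (!(amount > 0)) throw new RangeError("amount must be positive");
  return {
    filter: { _id: accountId, balance: { $gte: amount } }, // match only if funds suffice
    update: { $inc: { balance: -amount } }                 // subtract in the same operation
  };
}

// Tiny in-memory stand-in for what the server does atomically per document.
function applyDebit(account, amount) {
  const { filter, update } = buildDebit(account._id, amount);
  if (account.balance < filter.balance.$gte) return false; // filter did not match
  account.balance += update.$inc.balance;
  return true;
}

const account = { _id: "acc-1", balance: 100 };
console.log(applyDebit(account, 30), account.balance);  // true 70
console.log(applyDebit(account, 500), account.balance); // false 70

// Shell form: const d = buildDebit(id, amount);
//             db.account.update(d.filter, d.update)
// then check nModified in the returned WriteResult to see whether it applied.
```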
  • Security

  • Access Control (roles / permissions)
  • Limiting network access (ports)
  • Certificates (SSL)
  • Encryption (file system)
  • Trust between boxes
  • Durability

  • Durability is a question of how much data would be lost in a crash.
  • Ultimately, you can define how consistent / available you want to be in MongoDB.
  • Write concern
  • 0, 1, majority
  • Read concern
  • local, majority
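
    One way to picture the write concern levels is as options attached to each
    write. The writeConcernFor helper and its level names are hypothetical; the
    w, j and wtimeout fields are the actual write concern options.

```javascript
// Hypothetical helper: maps a durability preference to write concern options.
function writeConcernFor(level) {
  switch (level) {
    case "fire-and-forget":
      return { w: 0 };                                   // no acknowledgement
    case "acknowledged":
      return { w: 1 };                                   // primary has applied it
    case "majority":
      return { w: "majority", j: true, wtimeout: 5000 }; // most nodes, journaled
    default:
      throw new Error("unknown level: " + level);
  }
}

const options = { writeConcern: writeConcernFor("majority") };
// Shell usage:      db.account.insertOne(doc, options)
// Read side (3.2+): db.account.find(query).readConcern("majority")
console.log(JSON.stringify(options));
```

    Higher levels trade write latency for a smaller window of possible data
    loss, which is the tuning knob the bullets above describe.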
  • Journalling

  • Journalling is MongoDB's way to make pending operations durable (per node).
  • By default it writes the journal every 50ms (configurable down to 2ms)

  • In practice, replication and good backup processes are more important than absolute durability.
  • Replication (Replica Sets)

  • Creates additional copies of the data and allows for automatic failover to another node.
  • Requires heart-beat / time synchronisation
  • Can improve read performance (unless reads must come from the primary)

  • Think: RAID 1 - mirroring aka duplication (for redundancy)
  • Replication (cont...)

  • Can use hidden and delayed replicas for analytics / monitoring
  • Replicate locally (separate disk for example) just to sort out configuration first.
  • Careful creating and adding a replica - don't do it in peak traffic!
  • Consider restoring a primary backup to a replica then adding (less delta)
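
    A rough sketch of a replica set configuration with a hidden, delayed member
    for analytics. The host names and the replicaSetConfig helper are
    placeholders; the delay field is called slaveDelay in the 3.x series and
    secondaryDelaySecs in newer releases, so check the version in use.

```javascript
// Build a replica set configuration document. Host names are placeholders.
function replicaSetConfig(name, hosts, analyticsHost) {
  const members = hosts.map((host, i) => ({ _id: i, host }));
  if (analyticsHost) {
    members.push({
      _id: members.length,
      host: analyticsHost,
      priority: 0,      // can never become primary
      votes: 0,         // keeps the number of voting members odd
      hidden: true,     // invisible to normal client reads
      slaveDelay: 3600  // stays one hour behind (3.x field name)
    });
  }
  return { _id: name, members };
}

const cfg = replicaSetConfig(
  "rs0",
  ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
  "analytics.example.net:27017"
);
// Shell usage: rs.initiate(cfg)
console.log(cfg.members.length); // 4
```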
  • Sharding

  • Allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key.
  • It's important to choose a good shard key.
  • Think RAID 0 - striping aka splitting (for performance).
  • NOTE: Don't do sharding without replication first.
  • Sharding (cont...)

  • Sharding is done per collection
  • Choosing a shard key (e.g. region / country)
  • Shard locally before sharding over network (work out issues before adding latency)
  • Avoid sharding unless you've explored all other scaling options.
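
    Why the shard key matters can be shown with a toy model of range
    partitioning. Everything here (shard bounds, documents, helper names) is
    invented; real chunk splitting and balancing are more involved. The point
    is only that how evenly a key spreads documents depends on the data.

```javascript
// Toy model: each shard owns the keys below its upper bound.
const shards = [
  { name: "shard0", upTo: "h" },  // keys < "h"
  { name: "shard1", upTo: "q" },  // "h" <= keys < "q"
  { name: "shard2", upTo: null }  // everything else
];

function shardFor(key) {
  const k = String(key).toLowerCase();
  return shards.find(s => s.upTo === null || k < s.upTo).name;
}

// Count how many documents land on each shard for a candidate key field.
function distribution(docs, keyField) {
  const counts = {};
  for (const doc of docs) {
    const shard = shardFor(doc[keyField]);
    counts[shard] = (counts[shard] || 0) + 1;
  }
  return counts;
}

const docs = [
  { country: "AU", user: "alice" }, { country: "AU", user: "kim" },
  { country: "AU", user: "rob" },   { country: "AU", user: "zed" },
  { country: "NZ", user: "hana" },  { country: "US", user: "bo" }
];

console.log(distribution(docs, "country")); // { shard0: 4, shard1: 1, shard2: 1 }
console.log(distribution(docs, "user"));    // { shard0: 2, shard1: 2, shard2: 2 }

// Shell form for a real cluster (database and collection names assumed):
//   sh.enableSharding("app")
//   sh.shardCollection("app.users", { country: 1, user: 1 })
```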
  • IoT (Internet of Things)

  • The next big Thing™ (sorry Cloud)
  • efficient logging of RT metrics
  • aggregate metrics to minute-level (in an array)
  • store in per-hour documents
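
    A sketch of the per-hour document idea, assuming each hourly document
    already exists with a 60-element "minutes" array of { sum, count } slots.
    The id format, field names and helper are assumptions for illustration.

```javascript
// Build an update that adds one reading to the right minute slot of the
// per-hour document for a sensor.
function buildMetricUpdate(sensorId, timestamp, value) {
  const hour = new Date(timestamp);
  hour.setUTCMinutes(0, 0, 0);  // truncate to the start of the hour
  const minute = new Date(timestamp).getUTCMinutes();
  return {
    filter: { _id: sensorId + ":" + hour.toISOString() },
    update: {
      $inc: {
        ["minutes." + minute + ".sum"]: value,  // dot notation targets one slot
        ["minutes." + minute + ".count"]: 1
      }
    }
  };
}

const op = buildMetricUpdate("sensor-7", "2016-03-01T10:17:42Z", 21.5);
console.log(op.filter._id);               // sensor-7:2016-03-01T10:00:00.000Z
console.log(Object.keys(op.update.$inc)); // [ 'minutes.17.sum', 'minutes.17.count' ]

// Shell form: db.metrics.updateOne(op.filter, op.update)
```

    Keeping sum and count per minute lets averages be derived at read time
    without storing every raw reading.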
  • IoT (cont...)

  • pre-pad document to avoid fragmentation
  • pre-allocate 60 seconds for 1 minute of data
  • also for rolling aggregate metrics (i.e. last hour, last day, last week)
  • use $slice to keep it the same size (less disk fragmentation)
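
    Pre-padding and the fixed-size rolling window might look like the sketch
    below. Names and the zero filler are assumptions; pre-allocation is most
    relevant to the MMAPv1 storage engine, where growing a document can force
    it to move on disk.

```javascript
// Pre-allocated document for one minute of per-second readings (60 slots).
// 0 is used as the "no reading yet" filler.
function newMinuteDoc(sensorId, minuteStart) {
  return {
    _id: sensorId + ":" + minuteStart,
    seconds: new Array(60).fill(0)
  };
}

// Overwrite a single slot in place; the array length does not change.
function setSecond(second, value) {
  return { $set: { ["seconds." + second]: value } };
}

// Rolling window: append and trim in one update so the array never grows
// past `size` entries. A negative $slice keeps the newest items.
function pushRolling(field, value, size) {
  return { $push: { [field]: { $each: [value], $slice: -size } } };
}

const doc = newMinuteDoc("sensor-7", "2016-03-01T10:17");
console.log(doc.seconds.length); // 60
console.log(JSON.stringify(setSecond(42, 21.5)));
console.log(JSON.stringify(pushRolling("lastHour", 21.5, 60)));
```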
  • Metrics - Map-Reduce (old way)

  • Original form of real-time data processing
  • Outputs results to another collection
  • Superseded by Aggregation Pipelines
  • Can be useful for discovering what data you ultimately want
  • Executed in JavaScript (vs. C++ for Aggregations)
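
    The shape of a map-reduce job can be tried locally before running it on a
    server. The map and reduce functions below follow the shell conventions
    (map sees the document as `this` and calls emit); the emit buffer,
    runLocally harness and sample readings are hypothetical stand-ins.

```javascript
// Local stand-in for the server: collects emitted pairs, then reduces per key.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

// map runs once per document, with `this` bound to the document.
function map() {
  emit(this.sensor, this.value);
}

// reduce folds all values emitted for one key into a single value.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

function runLocally(docs) {
  emitted.clear();
  docs.forEach(doc => map.call(doc));
  const out = {};
  for (const [key, values] of emitted) out[key] = reduce(key, values);
  return out;
}

const readings = [
  { sensor: "a", value: 2 },
  { sensor: "b", value: 5 },
  { sensor: "a", value: 3 }
];
console.log(runLocally(readings)); // { a: 5, b: 5 }

// Shell form: db.readings.mapReduce(map, reduce, { out: "totals_by_sensor" })
```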
  • Metrics - Aggregation Pipelines

  • Faster than superman
  • Good for daily, weekly, monthly data

  • Example
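
    As an illustration, a pipeline for daily statistics per sensor could look
    like this. The "readings" collection and its { sensor, ts, value } shape
    are assumptions; the stages and operators are standard aggregation ones.

```javascript
// Daily average, maximum and sample count per sensor.
// Assumes documents like { sensor: "a", ts: <Date>, value: <Number> }.
function dailyStatsPipeline(from, to) {
  return [
    { $match: { ts: { $gte: from, $lt: to } } },  // narrow the range first
    { $group: {
        _id: {
          sensor: "$sensor",
          day: { $dateToString: { format: "%Y-%m-%d", date: "$ts" } }
        },
        avg: { $avg: "$value" },
        max: { $max: "$value" },
        samples: { $sum: 1 }
    } },
    { $sort: { "_id.day": 1, "_id.sensor": 1 } }
  ];
}

const pipeline = dailyStatsPipeline(new Date("2016-03-01"), new Date("2016-04-01"));
// Shell usage: db.readings.aggregate(pipeline)
console.log(pipeline.map(stage => Object.keys(stage)[0]).join(" -> "));
// $match -> $group -> $sort
```

    Putting $match first lets the server use an index on ts before grouping,
    which usually matters more than any other tuning of the pipeline.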
  • Backup and Restoring

  • Simpler than most:
  • mongodump and mongorestore
  • Backup admin db if you want to keep roles / permissions
  • Ensure you restore the admin db as well

  • Demo
  • Backup and Restoring (cont...)

  • Can back up the volume (e.g. snapshot on EC2)
  • Vendor Solutions
  • Ops Manager (Enterprise $)
  • MongoDB Cloud Manager ($)
  • Performance Tuning

  • Monitor the usual suspects (memory, disk and CPU)
  • mongotop, mongostat, htop

  • Monitor page faults
  • Monitor index misses (tune your queries)
  • Monitor DB queue length (is the node saturated / hammered?)
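
    For the "tune your queries" point, the executionStats section of an
    explain() result is a useful signal. The scanReport helper, its threshold
    of 10 and the sample numbers are all invented for illustration; only the
    three executionStats field names come from the real output.

```javascript
// Summarise the executionStats section of an explain() result.
// The threshold of 10 is an arbitrary illustration, not a recommendation.
function scanReport(explainResult) {
  const s = explainResult.executionStats;
  const ratio = s.nReturned === 0
    ? s.totalDocsExamined
    : s.totalDocsExamined / s.nReturned;
  return {
    usedIndex: s.totalKeysExamined > 0,
    docsPerResult: ratio,
    needsAttention: s.totalKeysExamined === 0 || ratio > 10
  };
}

// Hand-written sample shaped like:
//   db.users.find({ ... }).explain("executionStats")
const sample = {
  executionStats: { nReturned: 5, totalKeysExamined: 0, totalDocsExamined: 20000 }
};
console.log(scanReport(sample));
```

    A query that examines thousands of documents to return a handful is
    usually the first candidate for a new or better index.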
  • Finishing

  • Q & A
  • Thanks!

    Presentation is available at:
    http://rp.js.org/mongodb-pres