Introduction to MongoDB

Quick Intro

Rob Pocklington

Full-stack developer (jack of all trades)

Started with the usual SQL DBs (8 - 10 years)

Tinkered with Graph DB (Neo4J) for 1 year

Been working with MongoDB in production for 2+ years

What does NoSQL typically mean?

Web-scalable

Fault tolerance

Scalable architecture

Different QLs

K,V or JSON-based

Eventual (or tunable) consistency

What's out there? (K,V)

DynamoDB

Cassandra

RocksDB

Redis

Neo4J - GraphDB

What's out there? (NoSQL)

RethinkDB

Couchbase

Riak

MongoRocks

Elasticsearch

~~FoundationDB~~

Introduction - what is MongoDB?

A Document DB

Horizontally scalable

Designed for high performance

Designed for flexibility

Framework SDKs and drivers in most common languages
(Ruby, Java, .NET, JavaScript etc.)

MongoDB - History

Built in 2007 by 10gen to support their PaaS

Suffered from some early bad press
(optimistic defaults = loss of data)

Used by Foursquare, Forbes, Disney, Cisco, GitHub, Bitly, eBay, LinkedIn, Craigslist, Adobe etc.

Now the fourth most popular DB and the most popular NoSQL DB

It is open source!

Features

Strong Data Types (Dates, Booleans, Arrays ...)

Extensive query support

Large file storage (GridFS)

Indexing / Load Balancing

Capped Collections

Map reduce

Aggregation pipelines

Features (cont...)

No join tables

No multi-document transactions (tunable consistency)

Atomic at a document-level

Geo-query support (simple and complex)

Full text searching (not as good as Elasticsearch)

Features (cont...)

Command Line Interface (REPL)

Sharding and Replication (for scaling and redundancy)

Full set of tooling (backup and monitoring)

Custom Validations (think: constraints)

BI Connector (v3.2)

Software Vendors

https://mongolab.com

https://scalegrid.io

https://cloud.mongodb.com

Common Reasons for Use

Flexible - good for relational, denormalised or graph data structures

Cuts down on time to market and speeds development

Natural choice for any JSON-structured data

Popular choice for IoT (real-time metrics)

Considerations

Max size of document (16MB)

Big Files / Binaries >> GridFS

No schema != schemaless

Schema design (referencing vs. embedding)

Begin with the end in mind

CAP Theorem

MongoDB Architecture

Optimised for 64-bit systems, written in C++ (some Go for newer tools)

Tunable C.A.P (Consistency, Availability, Partition tolerance)

Pluggable Storage engines (3.0)

MMAPv1 (original)

WiredTiger (with compression)

MongoRocks (extra library)

MongoDB Design

ObjectIds (GUID-like) for _id

Has an equivalent to foreign key relationships (referencing)

1 -> M, M -> 1, M -> N

or you can embed (more later)
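
As a rough sketch of the two options above, the same kind of relationship can be modelled by reference or by embedding. The document shapes and field names (`authorId`, `comments`, `resolveAuthor`) are hypothetical, and plain string ids stand in for ObjectIds so the snippet runs without a database.

```javascript
// Referencing: the post stores the author's _id, much like a foreign key.
// Reading a post together with its author takes a second query.
const author = { _id: "author-1", name: "Ada" };
const referencedPost = {
  _id: "post-1",
  title: "Hello MongoDB",
  authorId: author._id, // 1 -> M: many posts can point at one author
};

// Embedding: comments live inside the post document itself.
// One read returns everything, and a single-document update is atomic.
const embeddedPost = {
  _id: "post-2",
  title: "Embedding example",
  comments: [
    { user: "bob", text: "Nice post" },
    { user: "eve", text: "Thanks for sharing" },
  ],
};

// Resolving a reference by hand, as an application would after two finds.
function resolveAuthor(post, authors) {
  return authors.find((a) => a._id === post.authorId) || null;
}
```

Referencing keeps shared data in one place; embedding trades some duplication for fewer round trips.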

Schema Considerations

Design is the difference between loving and hating MongoDB

Schema reviews are important

DB Migrations are easy (but still necessary)

Design for use (fast write / fast read)

Denormalisation is not evil

Mongo Commands

Commands in MongoDB (vs. SQL)

Let's CRUD in Mongo!

Other queries (Regex, etc.)

Atomic updates
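
The shell calls behind these bullets all take plain JSON-like documents as arguments. A minimal sketch, assuming a hypothetical `users` collection; the documents are built as ordinary JavaScript objects so the snippet runs on its own, with the matching shell call and a rough SQL equivalent noted in the comments.

```javascript
// Create: db.users.insert(newUser)
//   SQL: INSERT INTO users (name, age) VALUES ('Ada', 36)
const newUser = { name: "Ada", age: 36, tags: ["admin"] };

// Read: db.users.find(adults)
//   SQL: SELECT * FROM users WHERE age >= 18
const adults = { age: { $gte: 18 } };

// Read with a regex: db.users.find(nameStartsWithA)
//   SQL: SELECT * FROM users WHERE name LIKE 'A%'
const nameStartsWithA = { name: { $regex: "^A" } };

// Update: db.users.update(byName, birthday)
//   SQL: UPDATE users SET age = age + 1 WHERE name = 'Ada'
// $inc is applied atomically to the matched document. Unlike SQL,
// update() changes only the first match unless { multi: true } is passed.
const byName = { name: "Ada" };
const birthday = { $inc: { age: 1 } };

// Delete: db.users.remove(byName)
//   SQL: DELETE FROM users WHERE name = 'Ada'
```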

3T - MongoChef Demo

Referenced Documents (foreign keys)

Embedded Documents (nested)

Banking scenario

You can't use non-transaction dbs for banking, right?

Well, Stripe does.

Just get it right!

To be atomic in MongoDB you must execute an atomic operation on a single document.

Banking scenario (cont...)

There is no `BEGIN TRANSACTION` - time to do it another way.

// Withdraw `amount` only if the balance covers it; the match and the change happen atomically.
db.account.update({ _id: ..., balance: { $gte: amount } },
                  { $inc: { balance: -amount } });

More complex strategies can use MVCC or 2-phase commits if required
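
To illustrate a guarded withdrawal, here is a small in-memory model of what the server does for a single-document conditional `$inc`. It is only a sketch: `buildDebit` and `applyDebit` are hypothetical helpers, and `applyDebit` stands in for the database so the behaviour can be seen without one.

```javascript
// Build the filter and update documents for a guarded withdrawal.
function buildDebit(accountId, amount) {
  return {
    filter: { _id: accountId, balance: { $gte: amount } },
    update: { $inc: { balance: -amount } },
  };
}

// Toy stand-in for the server: match and modify one document in one step.
// Returns the number of documents modified (0 or 1).
function applyDebit(accounts, { filter, update }) {
  const account = accounts.find(
    (a) => a._id === filter._id && a.balance >= filter.balance.$gte
  );
  if (!account) return 0; // no match: unknown account or insufficient funds
  account.balance += update.$inc.balance;
  return 1;
}

const accounts = [{ _id: "acc-1", balance: 100 }];
const first = applyDebit(accounts, buildDebit("acc-1", 80)); // enough funds
const second = applyDebit(accounts, buildDebit("acc-1", 80)); // only 20 left
```

Against a real server the application would check how many documents the update reports as modified; zero means the withdrawal was refused and nothing changed.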

Security

Access Control (roles / permissions)

Limiting network access (ports)

Certificates (SSL)

Encryption (file system)

Trust between boxes

Durability

Durability is a question of how much data would be lost in a crash.

Ultimately, you can define how consistent / available you want to be in MongoDB.

Write concern

0, 1, majority

Read concern

local, majority
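
A minimal sketch of how these levels appear in practice, assuming a hypothetical `writeConcernFor` helper that maps a durability preference to a write concern document. The `w`, `j` and `wtimeout` fields are standard write concern options; the preference names, the `orders` collection and the 5000 ms timeout are illustrative only.

```javascript
// Map an application-level durability preference to a write concern.
function writeConcernFor(level) {
  switch (level) {
    case "fire-and-forget":
      return { w: 0 }; // do not wait for any acknowledgement
    case "acknowledged":
      return { w: 1 }; // wait for the primary only
    case "durable":
      // wait for a majority of the replica set and for the journal
      return { w: "majority", j: true, wtimeout: 5000 };
    default:
      throw new Error("unknown durability level: " + level);
  }
}

// Shell usage (assumed collection name):
//   db.orders.insert(doc, { writeConcern: writeConcernFor("durable") })
// Read concern is chosen per query, e.g.
//   db.orders.find(query).readConcern("majority")
```

Stronger settings cost latency, so it is common to pick the level per operation rather than globally.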

Journalling

Journalling is MongoDB's way to make pending operations durable (per node).

By default it writes the journal every 50ms (configurable down to 2ms)

In practice, replication and good backup processes are more important than absolute durability.

Replication (Replica Sets)

Creates additional copies of the data and allows for automatic failover to another node.

Requires heart-beat / time synchronisation

Can improve read performance (unless reading from the primary is required)

Think: RAID 1 - mirroring aka duplication (for redundancy)

Replication (cont...)

Can use hidden and delayed replicas for analytics / monitoring

Replicate locally (separate disk for example) just to sort out configuration first.

Be careful when creating and adding a replica: don't do it during peak traffic!

Consider restoring a primary backup to a replica then adding (less delta)

Sharding

Allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key.

It's important to choose a good shard key.

Think RAID 0 - striping aka splitting (for performance).

NOTE: Don't do sharding without replication first.

Sharding (cont...)

Sharding is done per collection

Choosing shard key (eg. region / country)

Shard locally before sharding over network (work out issues before adding latency)

Avoid sharding unless you've explored all other scaling options.
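
To make the shard key point concrete, here is a toy model of range-based routing. It is not MongoDB's actual balancer: `routeToShard`, `countPerShard` and the chunk boundaries are hypothetical. The point is only that a low-cardinality key such as region can send most writes to one shard.

```javascript
// Each chunk owns the key range [min, max) and lives on one shard.
const chunks = [
  { min: "", max: "g", shard: "shard-a" },
  { min: "g", max: "n", shard: "shard-b" },
  { min: "n", max: "\uffff", shard: "shard-c" },
];

// Find the shard whose chunk range contains the document's shard key value.
function routeToShard(doc, shardKeyField) {
  const key = doc[shardKeyField];
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}

// Count how many documents land on each shard for a given shard key.
function countPerShard(docs, shardKeyField) {
  const counts = {};
  for (const doc of docs) {
    const shard = routeToShard(doc, shardKeyField);
    counts[shard] = (counts[shard] || 0) + 1;
  }
  return counts;
}

const docs = [
  { region: "au", userId: "alice" },
  { region: "au", userId: "mike" },
  { region: "au", userId: "zoe" },
  { region: "us", userId: "bob" },
];
const byRegion = countPerShard(docs, "region"); // skewed towards one shard
const byUser = countPerShard(docs, "userId"); // spread more evenly
```

In the shell, sharding a collection is typically set up with `sh.enableSharding(...)` followed by `sh.shardCollection(...)`, passing the chosen key.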

IoT (Internet of Things)

The next big Thing™ (sorry Cloud)

Efficient logging of real-time metrics

Aggregate metrics to minute-level (in an array)

Store in per-hour documents

IoT (cont...)

Pre-pad documents to avoid fragmentation

Pre-allocate 60 seconds for 1 minute of data

Also works for rolling aggregate metrics (i.e. last hour, last day, last week)

Use $slice to keep the array the same size (less disk fragmentation)
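
The bucketing pattern above might look like the following sketch. All names (`sensorId`, `minutes`, `recent`, `newHourBucket`, `recordReading`, `pushRolling`) are hypothetical, and the update documents are built as plain objects so the snippet runs without a database.

```javascript
// Pre-allocate one document per sensor per hour, with 60 minute slots,
// so later updates overwrite in place instead of growing the document.
function newHourBucket(sensorId, hourStart) {
  return {
    sensorId,
    hour: hourStart,
    minutes: Array.from({ length: 60 }, () => ({ sum: 0, count: 0 })),
  };
}

// Update document for one reading: bump the slot for the reading's minute.
function recordReading(minute, value) {
  return {
    $inc: {
      [`minutes.${minute}.sum`]: value,
      [`minutes.${minute}.count`]: 1,
    },
  };
}

// Rolling window: push the newest value and keep only the last `size` items.
// $slice is a modifier of $push and must be combined with $each.
function pushRolling(value, size) {
  return { $push: { recent: { $each: [value], $slice: -size } } };
}
```

The bucket would be inserted once at the start of the hour; each reading then becomes a small in-place `$inc` on that document.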

Metrics - Map-Reduce (old way)

Original form of real-time data processing

Outputs results to another collection

Superseded by Aggregation Pipelines

Can be useful for discovering what data you ultimately want

Executed in JavaScript (vs. C++ for Aggregations)

Metrics - Aggregation Pipelines

Faster than Superman

Good for daily, weekly, monthly data

Example
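
As an illustration of the kind of pipeline meant here, the following hypothetical example computes daily totals and averages for one sensor. The `metrics` collection and the field names (`sensorId`, `timestamp`, `value`) are assumptions, not taken from the talk.

```javascript
const dailyMetrics = [
  // Stage 1: keep only one sensor's readings from 2016 onwards.
  {
    $match: {
      sensorId: "sensor-1",
      timestamp: { $gte: new Date("2016-01-01T00:00:00Z") },
    },
  },
  // Stage 2: group by calendar day and compute aggregates per day.
  {
    $group: {
      _id: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } },
      total: { $sum: "$value" },
      average: { $avg: "$value" },
      readings: { $sum: 1 },
    },
  },
  // Stage 3: oldest day first.
  { $sort: { _id: 1 } },
];

// Shell usage (assumed collection name): db.metrics.aggregate(dailyMetrics)
```

Each stage receives the output of the previous one, so filtering early with `$match` keeps the later stages cheap.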

Backup and Restoring

Simpler than most:

mongodump and mongorestore

Backup admin db if you want to keep roles / permissions

Ensure you restore the admin db as well

Demo

Backup and Restoring (cont...)

Can back up the volume (i.e. a snapshot on EC2)

Vendor Solutions

Ops Manager (Enterprise $)

MongoDB Cloud Manager ($)

Performance Tuning

Monitor the usual suspects (memory, disk and CPU)

mongotop, mongostat, htop

Monitor page faults

Monitor index misses (tune your queries)

Monitor DB queue length (is the node saturated / hammered?)

Finishing

Q & A

Thanks!

Presentation is available at:
http://rp.js.org/mongodb-pres