Quick Intro
Rob Pocklington
Full-stack developer (jack of all trades)
Started with the usual SQL DBs (8 - 10 years)
Tinkered with Graph DB (Neo4J) for 1 year
Been working with MongoDB in production for 2+ years
What does NoSQL typically mean?
Web-scalable
Fault tolerance
Scalable architecture
Different QLs
K,V or JSON-based
Eventual (or tunable) consistency
What's out there? (K,V)
DynamoDB
Cassandra
RocksDB
Redis
Neo4J - GraphDB
What's out there? (NoSQL)
RethinkDB
Couchbase
Riak
MongoRocks
Elasticsearch
FoundationDB
Introduction - what is MongoDB?
A Document DB
Horizontally scalable
Designed for high performance
Designed for flexibility
Framework SDKs and drivers in all common languages
(Ruby, Java, .NET, JavaScript etc.)
MongoDB - History
Built in 2007 by 10gen to support their PaaS
Suffered from some early bad press
(optimistic defaults = loss of data)
Used by Foursquare, Forbes, Disney, Cisco, GitHub, Bitly, eBay, LinkedIn, Craigslist, Adobe etc.
Now the fourth most popular DB and the most popular NoSQL DB
It is open source!
Features
Strong Data Types (Dates, Booleans, Arrays ...)
Extensive query support
Large file storage (GridFS)
Indexing / Load Balancing
Capped Collections
Map reduce
Aggregation pipelines
Features (cont...)
No join tables
No transactions (tunable consistency)
Atomic at a document-level
Geo-query support (simple and complex)
Full text searching (not as good as ES)
Features (cont...)
Command Line Interface (REPL)
Sharding and Replication (for scaling and redundancy)
Full set of tooling (backup and monitoring)
Custom Validations (think: constraints)
BI Connector (v3.2)
Software Vendors
https://mongolab.com
https://scalegrid.io
https://cloud.mongodb.com
Common Reasons for Use
Flexible - good for relational, denormalised or graph data structures
Cuts down on time to market and speeds development
natural choice for any JSON structured data
popular choice for IoT (real time metrics)
Considerations
Max size of document (16MB)
Big Files / Binaries >> GridFS
No schema != schemaless
Schema design (referencing vs. embedding)
Begin with the end in mind
CAP Theorem
MongoDB Architecture
Optimised for 64-bit systems, written in C++ (some Go for newer tools)
Tunable C.A.P (Consistency, Availability, Partition Tolerance)
Pluggable Storage engines (3.0)
MMAPv1 (original)
WiredTiger (with compression)
MongoRocks (extra library)
MongoDB Design
ObjectIds (GUID-like unique values) for _id
Has an equivalent to foreign key relationships (referencing)
1 -> M, M -> 1, M -> N
or you can embed (more later)
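To make the referencing vs. embedding choice concrete, here is a minimal sketch in plain JavaScript. The blog-style data, field names and string ids are all hypothetical, chosen only for illustration (MongoDB would normally generate the `_id` values).

```javascript
// Hypothetical blog data; all names and ids are illustrative.

// Referencing (1 -> M): each comment stores the _id of its post,
// much like a foreign key. Reading a post with its comments takes two queries.
const post = { _id: "post-1", title: "Hello MongoDB" };
const comments = [
  { _id: "c-1", postId: post._id, text: "Nice intro" },
  { _id: "c-2", postId: post._id, text: "Thanks!" }
];

// Embedding: the comments live inside the post document,
// so one read returns everything (subject to the 16MB document limit).
const embeddedPost = {
  _id: "post-1",
  title: "Hello MongoDB",
  comments: comments.map(c => ({ text: c.text }))
};

// Resolving a reference by hand: the application does the "join".
function commentsFor(postId, allComments) {
  return allComments.filter(c => c.postId === postId);
}
```

Embedding tends to suit data that is always read together; referencing tends to suit data that grows without bound or is shared between parents.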
Schema Considerations
Design is the difference between loving and hating MongoDB
Schema reviews are important
DB Migrations are easy (but still necessary)
Design for use (fast write / fast read)
Denormalisation is not evil
Mongo Commands
Commands in MongoDB (vs. SQL)
Let's CRUD in Mongo!
Other queries (Regex, etc.)
Atomic updates
3T - MongoChef Demo
Referenced Documents (foreign keys)
Embedded Documents (nested)
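A rough sketch of the CRUD, regex and atomic-update steps listed above, assuming the mongo shell (3.2 or later), where these collection methods run synchronously. The `users` handle and its fields are hypothetical; pass in something like `db.users`.

```javascript
// Sketch for the mongo shell (3.2+). `users` is a hypothetical
// collection handle such as db.users; field names are illustrative.
function crudDemo(users) {
  // Create
  users.insertOne({ name: "Rob", role: "developer", logins: 0 });

  // Read: an exact match, then a case-insensitive regex match
  const developers = users.find({ role: "developer" }).toArray();
  const startingWithRo = users.find({ name: /^ro/i }).toArray();

  // Update: $inc changes the counter atomically on the server,
  // with no read-modify-write round trip in the application
  users.updateOne({ name: "Rob" }, { $inc: { logins: 1 } });

  // Delete
  users.deleteOne({ name: "Rob" });

  return { developers: developers, startingWithRo: startingWithRo };
}
```

In the shell this would be called as `crudDemo(db.users)`. The Node.js driver exposes the same method names but returns promises, so the calls would need `await` there.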
Banking scenario
You can't use non-transaction dbs for banking, right?
Well, Stripe does.
Just get it right!
To be atomic in MongoDB you must execute an atomic operation on a single document.
Banking scenario (cont...)
There is no `BEGIN TRANSACTION` - time to do it another way.
db.account.update({ _id: ..., balance: { $gte: amount }},
{ $inc: { balance: -amount }});
More complex strategies can use MVCC or 2-phase commits if required
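The single-document debit above can be wrapped in two small helpers. This is a sketch, not a full payment flow: `buildDebit` and `debitSucceeded` are hypothetical names, and the result check assumes the `nModified` field of the shell's `WriteResult`.

```javascript
// Build the filter and update for an atomic debit on one account document.
// The balance check and the decrement happen in a single operation,
// so two concurrent debits cannot both spend the same funds.
function buildDebit(accountId, amount) {
  if (!(amount > 0)) throw new Error("amount must be positive");
  return {
    filter: { _id: accountId, balance: { $gte: amount } },
    update: { $inc: { balance: -amount } }
  };
}

// Interpret the shell's WriteResult: nothing modified means the account
// was missing or did not have sufficient funds.
function debitSucceeded(writeResult) {
  return writeResult.nModified === 1;
}
```

In the shell: `var op = buildDebit(id, 25); debitSucceeded(db.account.update(op.filter, op.update));`. A transfer between two accounts touches two documents, which is where the 2-phase commit approach comes in.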
Security
Access Control (roles / permissions)
Limiting network access (ports)
Certificates (SSL)
Encryption (file system)
Trust between boxes
Durability
Durability is a question of how much data would be lost in a crash.
Ultimately, you can define how consistent / available you want to be in MongoDB.
Write concern
0, 1, majority
Read concern
local, majority
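As an illustration of the write concern levels above, the sketch below builds the options object that shell write methods such as `insertOne` accept. The level names, the `writeOptions` helper and the 5 second timeout are assumptions made for the example.

```javascript
// The three write concern levels from the slide, as option objects.
// The wtimeout value is illustrative.
const WRITE_CONCERNS = {
  fireAndForget: { w: 0 },                      // no acknowledgement at all
  acknowledged: { w: 1 },                       // primary has applied the write
  majority: { w: "majority", wtimeout: 5000 }   // most replica set members have it
};

// Hypothetical helper: look up the options for a named level.
function writeOptions(level) {
  const wc = WRITE_CONCERNS[level];
  if (!wc) throw new Error("unknown write concern level: " + level);
  return { writeConcern: wc };
}
```

For example, `db.orders.insertOne(order, writeOptions("majority"))` trades latency for safety on that one write. Reads can be tuned in a similar way with `db.orders.find(query).readConcern("majority")`, which may need to be enabled on the server depending on the version and storage engine.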
Journalling
Journalling is MongoDB's way to make pending operations durable (per node).
By default it writes the journal every 50ms (configurable down to 2ms)
In practice, replication and good backup processes are more important than absolute durability.
Replication (Replica Sets)
Creates additional copies of the data and allows for automatic failover to another node.
Requires heart-beat / time synchronisation
Can improve read performance (unless reads must come from the primary)
Think: RAID 1 - mirroring aka duplication (for redundancy)
Replication (cont...)
Can use hidden and delayed replicas for analytics / monitoring
Replicate locally (separate disk for example) just to sort out configuration first.
Careful creating and adding a replica - don't do it in peak traffic!
Consider restoring a primary backup to a replica then adding (less delta)
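A sketch of what a replica set configuration with hidden and delayed members might look like. The set name and host names are hypothetical, and the field names assume the 3.x configuration format (`slaveDelay` was renamed `secondaryDelaySecs` in MongoDB 5.0).

```javascript
// Hypothetical hosts and set name; 3.x-era field names.
const replicaSetConfig = {
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.com:27017" },
    { _id: 1, host: "db2.example.com:27017" },
    { _id: 2, host: "db3.example.com:27017" },
    // Hidden member for analytics: never primary, invisible to clients
    { _id: 3, host: "analytics.example.com:27017", priority: 0, hidden: true },
    // Delayed member: stays one hour behind as a guard against mistakes
    { _id: 4, host: "delayed.example.com:27017", priority: 0, hidden: true, slaveDelay: 3600 }
  ]
};

// Hidden and delayed members must have priority 0; list any that do not.
function invalidMembers(config) {
  return config.members.filter(function (m) {
    return (m.hidden || m.slaveDelay > 0) && m.priority !== 0;
  });
}
```

In the shell the configuration would be applied with `rs.initiate(replicaSetConfig)` on a fresh set; checking `invalidMembers(replicaSetConfig)` first catches the most common misconfiguration.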
Sharding
Allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key.
It's important to choose a good shard key.
Think RAID 0 - striping aka splitting (for performance).
NOTE: Don't do sharding without replication first.
Sharding (cont...)
Sharding is done per collection
Choosing shard key (eg. region / country)
Shard locally before sharding over network (work out issues before adding latency)
Avoid sharding unless you've explored all other scaling options.
IoT (Internet of Things)
The next big Thing™ (sorry Cloud)
efficient logging of real-time metrics
aggregate metrics to minute-level (in an array)
store in per-hour documents
IoT (cont...)
pre-pad document to avoid fragmentation
pre-allocate 60 seconds for 1 minute of data
also for rolling aggregate metrics (i.e. last hour, last day, last week)
use $slice to keep the array the same size (less disk fragmentation)
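One possible reading of the pre-allocation and `$slice` ideas above, sketched as plain JavaScript that builds the documents and update operators. The document layout (one slot per minute inside a per-hour document) and all helper and field names are assumptions for illustration.

```javascript
// Pre-allocate one document per sensor per hour, with a zero-filled slot
// for each minute, so later updates write in place instead of growing it.
function newHourDocument(sensorId, hourStart) {
  return {
    _id: sensorId + ":" + hourStart.toISOString(),
    sensorId: sensorId,
    hour: hourStart,
    minutes: new Array(60).fill(0)
  };
}

// Update operator that adds a reading to the slot for its minute (0-59).
function recordReading(minute, value) {
  const inc = {};
  inc["minutes." + minute] = value;
  return { $inc: inc };
}

// Rolling window: push the newest value and keep only the last `size`.
function pushRolling(field, value, size) {
  const push = {};
  push[field] = { $each: [value], $slice: -size };
  return { $push: push };
}
```

In the shell these would be used as `db.metrics.insertOne(newHourDocument(...))` followed by `db.metrics.updateOne({ _id: ... }, recordReading(17, 3))`, where `metrics` is a hypothetical collection name.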
Metrics - Map-Reduce (old way)
Original form of real-time data processing
Outputs results to another collection
Superseded by Aggregation Pipelines
Can be useful for discovering what data you ultimately want
Executed in Javascript (vs. C++ for Aggregations)
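For comparison, a minimal map-reduce sketch. The `readings` collection and its `{ sensorId, value }` shape are hypothetical; inside the shell, `this` is the current document and `emit` is supplied by the server.

```javascript
// Map: called once per document, with `this` bound to the document.
// emit() is provided by MongoDB when the function runs on the server.
function mapReading() {
  emit(this.sensorId, { sum: this.value, count: 1 });
}

// Reduce must return the same shape it receives, because MongoDB may
// call it repeatedly on partial results for the same key.
function reduceReadings(key, values) {
  return values.reduce(function (acc, v) {
    return { sum: acc.sum + v.sum, count: acc.count + v.count };
  }, { sum: 0, count: 0 });
}
```

In the shell: `db.readings.mapReduce(mapReading, reduceReadings, { out: "reading_totals" })`, which writes the totals to another collection (the output name is illustrative).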
Metrics - Aggregation Pipelines
Faster than superman
Good for daily, weekly, monthly data
Example
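A sketch of the kind of pipeline that suits daily rollups, assuming a hypothetical `readings` collection with `{ sensorId, value, at }` documents where `at` is a date. The stage operators are standard; the field and function names are made up for the example.

```javascript
// Build a pipeline that averages readings per sensor per day.
function dailyAveragePipeline(since) {
  return [
    // Keep only recent readings (this stage can use an index on `at`)
    { $match: { at: { $gte: since } } },
    // Group by sensor and calendar day
    { $group: {
        _id: {
          sensorId: "$sensorId",
          day: { $dateToString: { format: "%Y-%m-%d", date: "$at" } }
        },
        average: { $avg: "$value" },
        samples: { $sum: 1 }
    } },
    // Oldest day first
    { $sort: { "_id.day": 1 } }
  ];
}
```

In the shell: `db.readings.aggregate(dailyAveragePipeline(new Date("2016-01-01")))`. Changing the `format` string to `"%Y-%m"` would give monthly figures instead.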
Backup and Restoring
Simpler than most:
mongodump and mongorestore
Back up the admin db if you want to keep roles / permissions
Ensure you restore the admin db as well
Demo
Backup and Restoring (cont...)
Can back up the volume (i.e. snapshot on EC2)
Vendor Solutions
Ops Manager (Enterprise $)
MongoDB Cloud Manager ($)
Performance Tuning
Monitor the usual suspects (memory, disk and CPU)
mongotop, mongostat, htop
Monitor page faults
Monitor Index misses (tune your queries) and
DB Queue length (is the node saturated / hammered?)
Finishing
Q & A
Thanks!
Presentation is available at:
http://rp.js.org/mongodb-pres