Thursday, September 2, 2010

MySQL (or Other Relational Databases), Mongo, or Hadoop?

I was reading through the digests of the mongodb-user group when I came across a very good take on whether you should use MongoDB, Hadoop, both, or MySQL (or another relational database).

Thanks to GVP of the mongodb-user group.
OK, so I have an ad-network background and can probably speak to these
points (and I'm talking about experience with >500M records/day).

The MySQL method (#1) will suck for a variety of reasons. The biggest
issue you're going to face is that MySQL doesn't really scale
horizontally. You can't put 500M records on 10 servers and then get a
reasonable roll-up. What you end up having to do is bulk inserts and
roll-ups on each server. You then pass those roll-ups on to another
server, which then summarizes the data. (sound familiar?)
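
Concretely, that hand-rolled pipeline tends to look something like this minimal Python sketch (the events table, column names, hosts, and credentials here are all invented for illustration):

    import MySQLdb

    SHARD_HOSTS = ["db1.example.com", "db2.example.com"]  # one MySQL per shard

    def rollup_shard(host):
        # The per-server roll-up: aggregate the raw events locally.
        conn = MySQLdb.connect(host=host, user="stats", passwd="secret", db="ads")
        cur = conn.cursor()
        cur.execute(
            "SELECT campaign_id, DATE(created_at), COUNT(*), SUM(revenue) "
            "FROM events GROUP BY campaign_id, DATE(created_at)"
        )
        rows = cur.fetchall()
        conn.close()
        return rows

    def summarize(all_rows):
        # The central "summarize" step: merge the per-shard roll-ups.
        totals = {}
        for campaign_id, day, clicks, revenue in all_rows:
            key = (campaign_id, day)
            c, r = totals.get(key, (0, 0.0))
            totals[key] = (c + clicks, r + float(revenue))
        return totals

    partials = []
    for host in SHARD_HOSTS:
        partials.extend(rollup_shard(host))  # fan out to each server
    print(summarize(partials))

Squint at it and you can already see a "map" phase and a "reduce" phase in there.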

Of course, once you've designed this whole process of "bulk insert ->
summarize -> export -> bulk insert -> summarize", you come to this
great realization... you've built a map-reduce engine on top of MySQL
(yay!).

So now you get options #2 & #3. Hadoop will definitely work for this
problem. I know of working installations that not only do stats, but
also do targeting using Hadoop. So this is indeed a known usage
for Hadoop. Of course, if you want to query the result, you'll have to
dump those output files somewhere (typically into an RDBMS).
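
For the curious, a Hadoop Streaming job for this kind of roll-up can be two small Python scripts; the field positions and names below are assumptions, not from the original post:

    # mapper.py -- emit "campaign_id<TAB>1" for every raw log line
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print("%s\t%d" % (fields[0], 1))  # assumes campaign_id is column 0

    # reducer.py -- Streaming sorts by key, so we can count in one pass
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

You'd kick this off with the hadoop-streaming jar (-mapper mapper.py -reducer reducer.py, plus -input and -output paths), and then, as noted above, load the output files into an RDBMS if you want to query them.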

Where does Mongo fall in this?

That's a little more complex. Well, with the release of auto-sharding,
it's definitely looking like MongoDB can handle this type of
functionality. The fact that map-reduce is "single-threaded" can be
limiting in terms of performance (or you can just use more, smaller
servers). However, the big win that Mongo gives you is data
flexibility.
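
Before getting to flexibility, here's roughly what a Mongo map-reduce looks like when driven from Python. This is a sketch only: exact pymongo signatures vary a little between versions, and the collection and field names are invented.

    from pymongo import Connection
    from bson.code import Code

    db = Connection("localhost", 27017).ads  # Connection was the pre-MongoClient API

    mapper = Code("function () { emit(this.campaign_id, this.revenue || 0); }")
    reducer = Code("""
    function (key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++) { total += values[i]; }
        return total;
    }
    """)

    # The output lands in a real collection, so you can keep querying it.
    out = db.events.map_reduce(mapper, reducer, "revenue_by_campaign")
    for doc in out.find().limit(10):
        print(doc)

Now, about that data flexibility: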

You say that you have 50 fields, but are they always present?
(probably not)
Do they change? (probably with every client)
Do you query against all of them all of the time? (probably not)
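
In Mongo that's a non-issue, because documents in one collection simply don't have to share fields. A tiny demo (again, the names are made up):

    from pymongo import Connection

    events = Connection("localhost", 27017).ads.events

    # Two events, different shapes, same collection -- no ALTER TABLE needed.
    events.insert({"campaign_id": 7, "country": "US", "revenue": 0.12})
    events.insert({"campaign_id": 9, "referrer": "example.com"})

    # Query on whatever fields you care about; absent fields just don't match.
    print(events.find({"campaign_id": 7}).count())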

So the benefit Mongo has is that it can accept all of your data, run
your map-reduces, and the output is actually a database that you
can query some more (holy smokes!). Or you can mongoexport the reduced
data to CSV and quickly drop it into MySQL for the finishing touches
on your reports. Mongo doesn't need a "data dictionary" or all kinds
of metadata to parse your input files.
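
If you go the export route, scripting it is one mongoexport call. The database, collection, and field names below are assumptions; the flags are the standard mongoexport ones of that era:

    import subprocess

    subprocess.check_call([
        "mongoexport",
        "-d", "ads",                   # database
        "-c", "revenue_by_campaign",   # the map-reduce output collection
        "--csv",
        "-f", "_id,value",             # fields to put in the CSV
        "-o", "revenue_by_campaign.csv",
    ])
    # From there: LOAD DATA INFILE 'revenue_by_campaign.csv' ... in MySQL.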

Oh and it's relatively easy to configure.

Picking MongoDB over other similar tools is really about learning
curve and ramp-up time. MongoDB is relatively simple compared to many
of the other tools out there (like Hadoop). Give me sudo, ssh and 10
boxes and we can have an auto-partitioned cluster available in
probably 30 minutes. Mongo can clock in at tens of thousands of writes
per second (especially across shards, even on "commodity hardware").
So you're talking about an hour or two to insert 50M records and start
running sample roll-ups.
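
For flavor, once the mongod and mongos processes are up, wiring the shards together is a handful of admin commands. Something like this sketch, where the host names and shard key are assumptions:

    from pymongo import Connection

    admin = Connection("mongos-host", 27017).admin  # talk to the mongos router

    for shard in ["shard1.example.com:27018", "shard2.example.com:27018"]:
        admin.command("addshard", shard)

    admin.command("enablesharding", "ads")
    admin.command("shardcollection", "ads.events", key={"campaign_id": 1})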

So for the cost of about a day, you can actually test if MongoDB is
right for you. That means a lot to me :)
