Thanks to GVP of the mongodb-user group.
OK, so I have an ad-network background and can probably hit these points here (and I'm talking about experience with > 500M records/day).

The MySQL method (#1) will suck for a variety of reasons. The biggest issue you're going to face is that MySQL doesn't really scale horizontally. You can't put 500M records on 10 servers and then get a reasonable roll-up. What you end up having to do is bulk inserts and roll-ups on each server. You then pass those roll-ups on to another server, which summarizes the data (sound familiar?). Of course, once you've designed this whole process of "bulk insert -> summarize -> export -> bulk insert -> summarize", you come to this great realization... you've built a map-reduce engine on top of MySQL (yay!).

So now you get options #2 & #3. Hadoop will definitely work for this problem. I know of working installations that not only do stats, but also do targeting using Hadoop. So this is indeed a known use case for Hadoop. Of course, if you want to query the result, you'll have to dump those output files somewhere (typically into an RDBMS).

Where does Mongo fall in this? That's a little more complex. With the release of auto-sharding, it's definitely looking like MongoDB can handle this type of workload. The fact that map-reduce is "single-threaded" can be limiting in terms of performance (or you can just use more, smaller servers). However, the big win Mongo gives you is data flexibility. You say that you have 50 fields, but are they always present? (probably not) Do they change? (probably with every client) Do you query against all of them all of the time? (probably not) So the benefit Mongo has is that it can accept all of your data, run your map-reduces, and the output is itself a database that you can query some more (holy smokes!). Or you can mongoexport the reduced data to CSV and quickly drop it into MySQL for the finishing touches on your reports. Mongo doesn't need a "data dictionary" or all kinds of metadata to parse your input files. Oh, and it's relatively easy to configure.

Picking MongoDB over other similar tools is really about learning curve and ramp-up time. MongoDB is relatively simple compared to many of the other tools out there (like Hadoop). Give me sudo, ssh and 10 boxes and we can have an auto-partitioned cluster available in probably 30 minutes. Mongo can clock in at tens of thousands of writes/sec (especially across shards, even on "commodity hardware"). So you're talking about an hour or two to insert 50M records and start running sample roll-ups. For the cost of about a day, you can actually test whether MongoDB is right for you (rough sketches of the roll-up, export, and sharding steps follow below). That means a lot to me :)
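To make the "run your map-reduces and query the output" point concrete, here is a minimal sketch in the mongo shell. The collection names (events, daily_rollup) and fields (campaign, clicks) are made up for illustration; substitute whatever your own schema looks like.

    // Roll up raw events by campaign. Collection and field names are illustrative.
    var mapFn = function () {
        // emit one (key, value) pair per raw event
        emit(this.campaign, { events: 1, clicks: this.clicks || 0 });
    };

    var reduceFn = function (key, values) {
        // sum the partial counts for one campaign; output shape matches the emitted shape
        var out = { events: 0, clicks: 0 };
        values.forEach(function (v) {
            out.events += v.events;
            out.clicks += v.clicks;
        });
        return out;
    };

    // The result lands in an ordinary collection, so you can keep querying it.
    db.events.mapReduce(mapFn, reduceFn, { out: "daily_rollup" });
    db.daily_rollup.find({ "value.clicks": { $gt: 1000 } });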
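The mongoexport-to-CSV step mentioned above is a one-liner along these lines (the exact flag spelling varies by version: older builds take --csv, newer ones --type=csv; the database name here is a placeholder):

    mongoexport --db mydb --collection daily_rollup --csv --fields _id,value.events,value.clicks --out rollup.csv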
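As for the "10 boxes, auto-partitioned cluster in 30 minutes" claim: once the mongod and mongos processes are running, the sharding side is just a couple of admin commands issued from a mongos. The database, collection, and shard-key fields below are placeholders, so pick a key that matches how you actually insert and query.

    // Run from a mongos. Names are placeholders.
    db.adminCommand({ enableSharding: "mydb" });
    // A non-empty collection needs an index on the shard key before this succeeds.
    db.adminCommand({ shardCollection: "mydb.events", key: { campaign: 1, ts: 1 } });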