mongodb - Aggregation with condtion in embedded documents in mongo db -
i'm stuck aggregation in mongodb. premise have data particular ads within time range.
so suppose query ads within range of 22nd april 24th april, here should get, summation of spend source2, , revenue, session, bounces etc source1.
[{ "_id" : objectid("560bbd5dfabc614611000e95"), "spend": 470, "revenue": 440, "sessions": 3 },.... ] here query, attempting gives me correct data takes long - 24seconds 22k entires.
db.getcollection('tests').aggregate([{ $match: { ad_account_id: 40 } }, { "$unwind": "$source1" }, { "$unwind": "$source2" }, { "$group": { "_id": "$internal_id", "transactionrevenue": { "$sum": { "$cond": [{ "$and": [{ "$gte": [ "$source1.created_at", isodate("2015-04-22t00:00:00.000z") ] }, { "$lte": [ "$source1.created_at", isodate("2015-04-25t00:00:00.000z") ] }] }, "$source1.transactionrevenue", 0] } }, "sessions": { "$sum": { "$cond": [{ "$and": [{ "$gte": [ "$source1.created_at", isodate("2015-04-22t00:00:00.000z") ] }, { "$lte": [ "$source1.created_at", isodate("2015-04-25t00:00:00.000z") ] }] }, "$source1.sessions", 0] } }, "spend": { "$sum": { "$cond": [{ "$and": [{ "$gte": [ "$source2.created_at", isodate("2015-04-22t00:00:00.000z") ] }, { "$lte": [ "$source2.created_at", isodate("2015-04-25t00:00:00.000z") ] }] }, "$source2.spend", 0] } } }, }]); problems how unwind multiple times, how summation of multiple things in source1 , not having aggregation again , again? takes 24seconds, 22 entries....please suggest on should index (i have none), , if document size of average 4mb suggests there wrong schema?
would map reduce better though aggregation considered faster in mongodb?
if think document design wrong, i'm ears, we're working on migration. better correct things now, rather later.
here sample document
{ "_id" : objectid("560bbd5dfabc614611000e95"), "internal_id": 1, "created_at" : isodate("2015-04-21t00:50:02.593z"), "updated_at" : isodate("2015-09-15t12:20:39.154z"), "name" : "lookalikeusapr21_06h19m", "ad_account_id" : 40, "targeting" : { "age_max" : 44, "age_min" : 35, "genders" : [ 1 ], "page_types" : [ "desktopfeed" ] }, "auto_optimization" : false, "source1" : [ { "id" : 119560952, "created_at" : isodate("2015-04-23t12:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "transactionrevenue" : 320, "sessions" : 1, "bounces" : 1 }, { "id" : 119560955, "created_at" : isodate("2015-05-01t12:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "transactionrevenue" : 230, "sessions" : 10, "bounces" : 1 }, { "id" : 119560954, "created_at" : isodate("2015-04-23t10:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "transactionrevenue" : 120, "sessions" : 2, "bounces" : 1 }, { "id" : 119560953, "created_at" : isodate("2015-04-25t12:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "transactionrevenue" : 100, "sessions" : 3, "bounces" : 2 } ], "source2" : [ { "id" : 219560952, "created_at" : isodate("2015-04-22t12:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "spend" : 300 }, { "id" : 219560955, "created_at" : isodate("2015-04-23t12:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "spend" : 170 }, { "id" : 219560954, "created_at" : isodate("2015-04-25t10:35:09.467z"), "updated_at" : isodate("2015-05-19t05:20:58.374z"), "spend" : 450 } ] }
the first thing should doing adding index both source1 , source2 arrays "created_at" field. reduce lot of possible results , improve speed querying these possible matches being present in documents select.
the next main improvements combine arrays , filter one, , notably before process $unwind. going save lot of cycles , document expansion in arrays.
moreover, it's going give correct totals. when $unwind 2 arrays, 1 array's details repeated number of items in second array. gives incorrect results array content "unwound" first. can each separately, it's far better merge them one:
db.getcollection('tests').aggregate([ { "$match": { "ad_account_id": 40, "$or": [ { "source1": { "$elemmatch": { "created_at": { "$gte": new date("2015-04-22"), "$lte": new date("2015-04-25") } } } }, { "source2": { "$elemmatch": { "created_at": { "$gte": new date("2015-04-22"), "$lte": new date("2015-04-25") } } } } ] }}, { "$project": { "_id": 0, "internal_id": 1, "source": { "$setdifference": [ { "$map": { "input": { "$setunion": [ "$source1", "$source2" ] }, "as": "source", "in": { "$cond": [ { "$and": [ { "$gte": [ "$$source.created_at", new date("2015-04-22") ] }, { "$lte": [ "$$source.created_at", new date("2015-04-25") ] } ]}, "$$source", false ] } }}, [false] ] } }}, { "$unwind": "$source"}, { "$group": { "_id": "$internal_id", "transactionrevenue": { "$sum": { "$ifnull": [ "$source.transactionrevenue", 0 ] } }, "sessions": { "$sum": { "$ifnull": [ "$source.sessions", 0 ] } }, "spend": { "$sum": { "$ifnull": [ "$source.spend", 0 ] } } }} ]) which going give result on sample:
{ "_id" : 1, "transactionrevenue" : 440, "sessions" : 3, "spend" : 470 } so great big architecture hint in being done here it wise to combine arrays single array in general application usage. can add field "type" if must discern between 2 different types of items, processing should benefit singular array.
the main lesson query aside that, $match first filter out content possible. whilst initial $match stage cannot of course remove items arrays not meet conditions, can importantly "match documents". because not want process documents don't have information @ all. adds time.
the second part other combined array want filter out content before unwinding array possible same reasons, don't want processing items don't need to.
short lesson, filter first reduce processing. conditional sums fine, should used selection of content , not raw filtering. it's getting rid of undesired data first rather ignoring it. process less , faster.
Comments
Post a Comment