Social Data and Log Analysis
      Using MongoDB
      2011/03/01(Tue) #mongotokyo
              doryokujin
Self-Introduction

• doryokujin (Takahiro Inoue), Age: 25
• Education: Keio University
  • Master of Mathematics March 2011 ( Maybe... )
  • Major: Randomized Algorithms and Probabilistic Analysis

• Company: Geisha Tokyo Entertainment (GTE)
  • Data Mining Engineer (only me, part-time)

• Organized Community:
  • MongoDB JP, Tokyo Web Mining
My Job

• I’m a Fledgling Data Scientist
  • Development of analytical systems for social data
  • Development of recommendation systems for social data
• My Interest: Big Data Analysis
  • How to collect logs scattered across many servers
  • How to store and access the data
  • How to analyze and visualize billions of records
Agenda
• My Company’s Analytic Architecture
• How to Handle Access Logs
• How to Handle User Trace Logs
• How to Collaborate with Front Analytic Tools
• My Future Analytic Architecture
Agenda
• My Company’s Analytic Architecture
• How to Handle Access Logs: Hadoop, Mongo Map Reduce
• How to Handle User Trace Logs: Hadoop, Schema Free
• How to Collaborate with Front Analytic Tools: REST Interface, JSON
• My Future Analytic Architecture: Capped Collection, Modifier Operation

Of Course Everything With MongoDB
My Company’s
Analytic Architecture
Social Game (Mobile): Omiseyasan




• Enjoy arranging their own shop (and avatar)
• Communicate with other users by shopping, part-time jobs, ...
• Buy seeds of items to display in their own shop
Data Flow

Access logs flow from the game servers into the back-end below.

Back-end Architecture
• Pretreatment (trimming, validation, filtering, ...) with Dumbo (Hadoop Streaming)
• MongoDB as a central data server, accessed via PyMongo
• Raw logs backed up to S3
Front-end Architecture
• sleepy.mongoose (REST Interface) → Web UI
• PyMongo → Social Data Analysis / Data Analysis tools
Environment
• MongoDB: 1.6.4
  • PyMongo: 1.9
• Hadoop: CDH2 ( soon update to CDH3 )
  • Dumbo: Simple Python Module for Hadoop Streaming
• Cassandra: 0.6.11
   • R, Neo4j, jQuery, Munin, ...
• [Data Size (a rough estimate)]
  • Access Log 15GB / day ( gzip ) - 2,000M PV
  • User Trace Log 5GB / day ( gzip )
How to Handle
 Access Logs
How to Handle Access Logs
• Pretreatment: trimming, validation, filtering, ...
• MongoDB as a data server
• Raw logs backed up to S3
Access Data Flow (Caution: needs MongoDB >= 1.7.4)
• Pretreatment → user_access
• 1st Map Reduce (group by) → user_pageview, agent_pageview, hourly_pageview
• 2nd Map Reduce → daily_pageview
Hadoop

• Using Hadoop: Pretreatment of Raw Records
• [Map / Reduce]
    • Read all records
    • Split each record by '\s' (whitespace)
    • Filter unnecessary records (such as *.swf)
    • Check whether each record is correct
    • Insert (save) records into MongoDB
    ※ write operations won’t yet fully utilize all cores
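The Map/Reduce steps above can be sketched as a Dumbo-style mapper and reducer in plain Python. This is a hypothetical illustration: the field index and the whitespace split are assumptions, and with Dumbo these functions would be handed to `dumbo.run(mapper, reducer)`.

```python
# Hypothetical sketch of the pretreatment mapper/reducer; with Dumbo
# (Hadoop Streaming) these would be passed to dumbo.run(mapper, reducer).

def mapper(key, value):
    fields = value.split()              # split each record by whitespace
    if len(fields) < 10:
        return                          # validation: drop malformed records
    resource = fields[6]                # request path in a combined-log line
    if resource.endswith(".swf"):
        return                          # filter unnecessary records (*.swf)
    yield resource, 1

def reducer(key, values):
    yield key, sum(values)              # per-path counts, saved to MongoDB
```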
Access Logs

110.44.178.25 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/
BattleSelectAssetPage.html;jsessionid=9587B0309581914AB7438A34B1E51125-n15.at3?collec
    tion=12&opensocial_app_id=00000&opensocial_owner_id=00000 HTTP/1.0" 200 6773 "-"
"DoCoMo/2.0 ***"


110.44.178.26 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/shopping/battle/
ShoppingBattleTopPage.html;jsessionid=D901918E3CAE46E6B928A316D1938C3A-n11.a
    p1?opensocial_app_id=00000&opensocial_owner_id=11111 HTTP/1.0" 200 15254 "-"
"DoCoMo/2.0 ***"


110.44.178.27 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/
BattleSelectAssetDetailPage;jsessionid=202571F97B444370ECB495C2BCC6A1D5-n14.at11?asse
    t=53&collection=9&opensocial_app_id=00000&opensocial_owner_id=22222 HTTP/1.0" 200
11616 "-" "SoftBank/***"


...(many records)
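For illustration, a parser turning lines like the above into the document shape shown on the next slide might look like this. It is a sketch only: the regex, the jsessionid/query stripping, and extracting the userId from `opensocial_owner_id` are assumptions, not the deck's actual code.

```python
import re

# Hypothetical parser for Apache-style access-log lines; field names
# follow the user_trace document shown below.
LOG_RE = re.compile(
    r'(?P<ipaddr>\S+) \S+ \S+ \[(?P<requestTimeStr>[^\]]+)\] '
    r'"\S+ (?P<resource>\S+) [^"]*" '
    r'(?P<statusCode>\d{3}) (?P<responseBodySize>\d+) "[^"]*" "(?P<userAgent>[^"]*)"'
)

def parse_line(line):
    """Return a MongoDB-ready dict, or None for malformed lines."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    doc = m.groupdict()
    doc["responseBodySize"] = int(doc["responseBodySize"])
    # splittedPath: the resource minus jsessionid and query string
    doc["splittedPath"] = re.split(r"[;?]", doc["resource"])[0]
    owner = re.search(r"opensocial_owner_id=(\d+)", doc["resource"])
    doc["userId"] = owner.group(1) if owner else None
    return doc
```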
Collection: user_trace
> db.user_trace.find({user: "7777", date: "2011-02-12"}).limit(0)
    .forEach(printjson)
{
        "_id" : "2011-02-12+05:39:31+7777+18343+Access",
        "lastUpdate" : "2011-02-19",
        "ipaddr" : "202.32.107.166",
        "requestTimeStr" : "12/Feb/2011:05:39:31 +0900",
        "date" : "2011-02-12",
        "time" : "05:39:31",
        "responseBodySize" : 18343,
        "userAgent" : "DoCoMo/2.0 SH07A3(c500;TB;W24H14)",
        "statusCode" : "200",
        "splittedPath" : "/avatar2-gree/MyPage,
        "userId" : "7777",
        "resource" : "/avatar2-gree/MyPage;jsessionid=...?
battlecardfreegacha=1&feed=...&opensocial_app_id=...&opensocial_viewer_id=...&
opensocial_owner_id=..."
}
1st Map Reduce

• [Aggregation]
   • Group by url, date, userId
   • Group by url, date, userAgent
   • Group by url, date, time
   • Group by url, date, statusCode
• Map Reduce operations run in parallel on all shards
1st Map Reduce with PyMongo
map = Code("""
   function(){
        // the key can also group on this.userAgent,
        // this.timeRange, or this.statusCode
        emit({
              path: this.splittedPath,
              userId: this.userId,
              date: this.date
        }, 1)}
""")
reduce = Code("""
   function(key, values){
        var count = 0;
        values.forEach(function(v) {
              count += 1;
        });
        return {"count": count, "lastUpdate": today};
   }
""")
# ( mongodb >= 1.7.4 )
result = db.user_access.map_reduce(map,
                                   reduce,
                                   merge_output="user_pageview",
                                   full_response=True,
                                   query={"date": date})


• About the output collection, there are 4 options (MongoDB >= 1.7.4):
  • out : overwrite the collection if it already exists
  • merge_output : merge new data into the old output collection
  • reduce_output : a reduce operation will be performed on the two values
    (the same key in the new result and the old collection) and the result
    will be written to the output collection
  • full_response (=False) : if True, return the full stats of the operation;
    with inline output, no collection is created, the whole map-reduce
    happens in RAM, and the result set must fit within the 8MB/doc limit
    (16MB/doc in 1.8?)
Map Reduce (>=1.7.4):
              out option in JavaScript
• "collectionName" : If you pass a string indicating the name of a collection, then
  the output will replace any existing output collection with the same name.
• { merge : "collectionName" } : This option will merge new data into the old
  output collection. In other words, if the same key exists in both the result set and
  the old collection, the new key will overwrite the old one.
• { reduce : "collectionName" } : If documents exist for a given key in the result
  set and in the old collection, then a reduce operation (using the specified reduce
  function) will be performed on the two values and the result will be written to
  the output collection. If a finalize function was provided, this will be run after
  the reduce as well.
• { inline : 1} : With this option, no collection will be created, and the whole map-
  reduce operation will happen in RAM. Also, the results of the map-reduce will
  be returned within the result object. Note that this option is possible only when
  the result set fits within the 8MB limit.
                                                http://www.mongodb.org/display/DOCS/MapReduce
Collection: user_pageview
> db.user_pageview.find({
          "_id.userId": "7777",
          "_id.path": /.*MyPage$/,            // regular expression
          "_id.date": {$lte: "2011-02-12"}    // <, >, <=, >=
    }).limit(1).forEach(printjson)
#####
{
          "_id" : {
                  "date" : "2011-02-12",
                  "path" : "/avatar2-gree/MyPage",
                  "userId" : "7777",
          },
          "value" : {
                  "count" : 10,
                  "lastUpdate" : "2011-02-19"
          }
}
2nd Map Reduce with PyMongo
map = Code("""
       function(){
           emit({
                  "path" : this._id.path,
                  "date":   this._id.date,
           },{
                  "pv": this.value.count,
                  "uu": 1
           });
       }
""")
reduce = Code("""
       function(key, values){
           // all values must have the same keys
           // ({"pv": NaN} if not)
           var pv = 0;
           var uu = 0;
           values.forEach(function(v){
                 pv += v.pv;
                 uu += v.uu;
           });
           return {"pv": pv, "uu": uu};
       }
""")
# ( mongodb >= 1.7.4 )
result = db.user_pageview.map_reduce(map,
                  reduce,
                  merge_output="daily_pageview",
                  full_response=True,
                  query={"date": date})
Collection: daily_pageview

> db.daily_pageview.find({
        "_id.date": "2011-02-12",
        "_id.path": /.*MyPage$/
    }).limit(1).forEach(printjson)
{
        "_id" : {
                "date" : "2011-02-12",
                "path" : "/avatar2-gree/MyPage",
        },
        "value" : {
                "uu" : 53536,
                "pv" : 539467
        }
}
Current Map Reduce is Imperfect
  • [Single Thread per node]
    • Doesn’t scale map-reduce across multiple threads

  • [Overwrites the Output Collection]
    • Overwrites the old collection ( no other options like “merge” or
      “reduce” )

# mapreduce code to merge output (MongoDB < 1.7.4)
result = db.user_access.map_reduce(map,
                   reduce,
                   full_response=True,
                   out="temp_collection",
                   query={"date": date})
[db.user_pageview.save(doc) for doc in db.temp_collection.find()]
Useful Reference: Map Reduce

• http://www.mongodb.org/display/DOCS/MapReduce
• A Look At MongoDB 1.8's MapReduce Changes
• Map Reduce and Getting Under the Hood with Commands
• Map/reduce runs in parallel/distributed?
• Map/Reduce parallelism with Master/Slave
• mapReduce locks the whole server
• mapreduce vs find
How to Handle
User Trace Logs
How to Handle User Trace Logs
• Pretreatment: trimming, validation, filtering, ...
• MongoDB as a data server
• Raw logs backed up to S3
User Trace / Charge Data Flow
• Pretreatment → user_trace, user_charge
• user_trace → daily_trace
• user_charge → daily_charge
User Trace Log
Hadoop
• Using Hadoop: Pretreatment of Raw Records
• [Map / Reduce]
    • Split each record by '\s' (whitespace)
    • Filter unnecessary records
    • Check whether each user behaves dishonestly
    • Unify the format so records can be summed up ( raw records are
      written in a free format )
    • Sum up records grouped by “userId” and “actionType”
    • Insert (save) records into MongoDB
    ※ write operations won’t yet fully utilize all cores
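The "sum up by userId and actionType" step can be sketched in plain Python. The record shape and helper name here are assumptions for illustration; the output dicts mirror the user_trace documents shown later.

```python
from collections import defaultdict

# Hypothetical sketch: group trace records by (userId, actionType)
# and count each actionDetail string.
def sum_up(records):
    """records: iterable of (userId, actionType, actionDetail) tuples."""
    grouped = defaultdict(lambda: defaultdict(int))
    for user_id, action_type, detail in records:
        grouped[(user_id, action_type)][detail] += 1
    # shape each group like the user_trace documents
    return [
        {"userId": u, "actionType": a, "actionDetail": dict(details)}
        for (u, a), details in grouped.items()
    ]
```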
An Example of User Trace Log

     UserId   ActionType   ActionDetail

-----Change-----
ActionLogger    a{ChangeP}          (Point,1371,1383)
ActionLogger    a{ChangeP}          (Point,2373,2423)

-----Get-----
ActionLogger    a{GetMaterial}   (syouhinnomoto,0,-1)
ActionLogger    a{GetMaterial}   usesyouhinnomoto
ActionLogger    a{GetMaterial}   (omotyanomotoPRO,1,6)
※ The value of “actionDetail” must be in a unified format
-----Trade-----
ActionLogger    a{Trade} buy 3 itigoke-kis from gree.jp:00000 #

-----Make-----
ActionLogger     a{Make}            make item kuronekono_n
ActionLogger     a{MakeSelect}      make item syouhinnomoto
ActionLogger     a{MakeSelect}      (syouhinnomoto,0,1)

-----PutOn/Off-----
ActionLogger    a{PutOff}            put off 1 ksuteras
ActionLogger    a{PutOn}             put 1 burokkus @2500

-----Clear/Clean-----
ActionLogger    a{ClearLuckyStar}       Clear LuckyItem_1     4 times

-----Gacha-----
ActionLogger     a{Gacha} Play gacha with first free play:
ActionLogger     a{Gacha} Play gacha:
Collection: user_trace
> db.user_trace.find({date: "2011-02-12",
                      actionType: "a{Make}",
                      userId: "7777"}).forEach(printjson)
{
    "_id" : "2011-02-12+7777+a{Make}",
    "date" : "2011-02-12",
    "lastUpdate" : "2011-02-19",
    "userId" : "7777",
    "actionType" : "a{Make}",           // sum up values grouped by
    "actionDetail" : {                  // “userId” and “actionType”
        "make item ksutera" : 3,
        "make item makaron" : 1,
        "make item huwahuwamimiate" : 1,
        …
    }
}
Collection: daily_trace
> db.daily_trace.find({
                       date: {$gte: "2011-02-12", $lte: "2011-02-19"},
                       actionType: "a{Make}"}).forEach(printjson)
{
       "_id" : "2011-02-12+group+a{Make}",
       "date" : "2011-02-12",
       "lastUpdate" : "2011-02-19",
       "actionType" : "a{Make}",
       "actionDetail" : {
             "make item kinnokarakuridokei" : 615,
             "make item banjo-" : 377,
             "make item itigoke-ki" : 135904,
             ...
       },
       ...
}...
User Charge Log
Collection: user_charge
// TOP10 users at 2011-02-12 by charge amount
> db.user_charge.find({date:"2011-02-12"})
                 .sort({totalCharge:-1}).limit(10).forEach(printjson)
{
     "_id" : "2011-02-12+7777+Charge",
     "date" : "2011-02-12",
     "lastUpdate" : "2011-02-19",
     "totalCharge" : 10000,
     "userId" : "7777",
     "actionType" : "Charge",            // sum up values grouped by
     "boughtItem" : {                    // “userId” and “actionType”
         "        EX" : 13,
         "    +6000" : 3,
         "        PRO" : 20
     }
}
{…
{…
Collection: daily_charge
> db.daily_charge.find({date:"2011-02-12",T:"all"})
                                  .limit(10).forEach(printjson)
{
    "_id" : "2011-02-12+group+Charge+all+all",
    "date" : "2011-02-12",
    "total" : 100000,
    "UU" : 2000,
    "group" : {
         "              " : 1000000,

         "   " : 1000000, ...

    },
    "boughtItemNum" : {
         "        EX" : 8,

         "         " : 730, ...

    },
    "boughtItem" : {
         "        EX" : 10000,

         "         " : 100000, ...

    }
}
Categorize Users
• user_trace, user_charge, user_savedata, user_pageview → attribution →
  user_registration, user_category
• [Categorize Users]
   • by play term
   • by total amount of charge
   • by registration date
• [Take a Snapshot of Each Category’s Stats per Week]
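The categorization by play term and total charge can be sketched as two small functions. The thresholds follow the z/s/m/l/ll labels of the cross table shown below; the function names and exact cut-offs are my assumptions.

```python
# Hypothetical categorization helpers; labels follow the cross table
# (z / s / m / l / ll).

def term_category(play_term_days):
    if play_term_days <= 1:
        return "z"       # ~1 day
    if play_term_days <= 7:
        return "s"       # ~1 week
    if play_term_days <= 31:
        return "m"       # ~1 month
    if play_term_days <= 93:
        return "l"       # ~3 months
    return "ll"          # longer than 3 months

def charge_category(total_charge_yen):
    if total_charge_yen == 0:
        return "z"
    if total_charge_yen <= 1000:
        return "s"       # up to ¥1k
    if total_charge_yen <= 10000:
        return "m"       # up to ¥10k
    return "l"           # above ¥10k
```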
Collection: user_registration
> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
    "_id" : "2010-06-29+7777+Registration",
    "userId" : "7777",
    "actionType" : "Registration",      // tagging users
    "category" : {
         "R1" : "True",
         "T" : "ll",
         …
    },
    "firstCharge" : "2010-07-07",
    "lastLogin" : "2010-09-30",
    "playTerm" : 94,
    "totalCumlativeCharge" : 50000,
    "totalMonthCharge" : 10000,
    …
}
Collection: user_category

> var cross = new Cross()    // user-defined function
> MCResign = cross.calc("2011-02-12", "MC", 1)
// each value is the number of users
// Charge(yen) / Term(day)
                 0(z)     ~¥1k(s)    ~¥10k(m)   ¥100k~(l)    total
~1day(z)        50000          10          5        0        50015
~1week(s)       50000         100         50        3        50153
~1month(m)     100000         200        100        1       100301
~3month(l)     100000         300         50        6       100356
month~(ll)          0           0          0        0            0
How to Collaborate With
 Front Analytic Tools
Front-end Architecture
• sleepy.mongoose (REST Interface) → Web UI
• PyMongo → Social Data Analysis / Data Analysis tools
Web UI and Mongo
Data Table: jQuery.DataTables
[ Data Table ]
• Want to share a daily summary
• Want to see data from many viewpoints
• Want to implement easily

1. Variable length pagination
2. On-the-fly filtering
3. Multi-column sorting with data type detection
4. Smart handling of column widths
5. Scrolling options for table viewport
6. ...
Graph: jQuery.HighCharts
[ Graph ]
• Want to visualize data
• Handle mainly time-series data
• Want to implement easily

1. Numerous Chart Types
2. Simple Configuration Syntax
3. Multiple Axes
4. Tooltip Labels
5. Zooming
6. ...
sleepy.mongoose

• [REST Interface + Mongo]
   • Get Data by HTTP GET/POST Request
   • sleepy.mongoose
      ‣ request as “/db_name/collection_name/_command”
      ‣ made by a 10gen engineer: @kchodorow
      ‣ Sleepy.Mongoose: A MongoDB REST Interface
sleepy.mongoose

//start server
> python httpd.py
…listening for connections on http://localhost:27080

//connect to MongoDB
> curl --data server=localhost:27017 'http://localhost:27080/_connect'

//request example
> http://localhost:27080/playshop/daily_charge/_find?criteria={}&limit=10&batch_size=10

{"ok": 1, "results": [{"_id": "…", "date": … }, {"_id": …}], "id": 0}
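The same _find request can be issued from Python. This is a sketch assuming sleepy.mongoose's "/db/collection/_command" endpoint layout as shown above; the helper name is mine.

```python
import json
from urllib.parse import quote

# Hypothetical helper that builds a sleepy.mongoose _find URL.
def find_url(base, db, collection, criteria, limit=10, batch_size=10):
    q = quote(json.dumps(criteria, separators=(",", ":")))
    return ("%s/%s/%s/_find?criteria=%s&limit=%d&batch_size=%d"
            % (base, db, collection, q, limit, batch_size))

url = find_url("http://localhost:27080", "playshop", "daily_charge", {})
# the response would then be fetched with urllib.request.urlopen(url)
# and decoded with json.load(); the "results" field holds the documents
```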
JSON: Mongo <---> Ajax

sleepy.mongoose (REST Interface) → GET → JSON

• The jQuery library and MongoDB are compatible (both speak JSON)
• There is no need to describe HTML tags (such as <table>) by hand
Example: Web UI
R and Mongo
Collection: user_registration
> db.user_registration.find({userId: "7777"}).forEach(printjson)
{
    "_id" : "2010-06-29+7777+Registration",    // want to know the relation
    "userId" : "7777",                         // between user attributions
    "actionType" : "Registration",
    "category" : {
         "R1" : "True",
         "T" : "ll",
         …
    },
    "firstCharge" : "2010-07-07",
    "lastLogin" : "2010-09-30",
    "playTerm" : 94,
    "totalCumlativeCharge" : 50000,
    "totalMonthCharge" : 10000,
    …
}
R Code: Access MongoDB
       Using sleepy.mongoose
##### LOAD LIBRARY #####
library(RCurl)
library(rjson)
##### CONF #####
today.str    <-    format(Sys.time(), "%Y-%m-%d")
url.base     <-    "http://localhost:27080"
mongo.db     <-    "playshop"
mongo.col    <-    "user_registration"
mongo.base   <-    paste(url.base, mongo.db, mongo.col, sep="/")
mongo.sort   <-    ""
mongo.limit <-     "limit=100000"
mongo.batch <-     "batch_size=100000"
R Code: Access MongoDB
             Using sleepy.mongoose
##### FUNCTION #####
find <- function(url){
    mongo <- fromJSON(getURL(url))
    docs <- mongo$results
    makeTable(docs) # My Function
}
# Example
# Using sleepy.mongoose https://github.com/kchodorow/sleepy.mongoose
mongo.criteria <- '_find?criteria={"totalCumlativeCharge":{"$gt":0,"$lte":1000}}'
mongo.query <- paste(mongo.criteria, mongo.sort,
     mongo.limit, mongo.batch, sep="&")
url <- paste(mongo.base, mongo.query, sep="/")
user.charge.low <- find(url)
The Result
# Result: 10th Document

[[10]]
[[10]]$playTerm
[1] 31

[[10]]$lastUpdate
[1] "2011-02-24"

[[10]]$userId
[1] "7777"

[[10]]$totalCumlativeCharge
[1] 10000

[[10]]$lastLogin
[1] "2011-02-21"

[[10]]$date
[1] "2011-01-22"

[[10]]$`_id`
[1] "2011-02-12+18790376+Registration"

...
Make a Data Table from The Result

# Result: Translate Document to Table

        playTerm totalWinRate totalCumlativeCharge totalCommitNum totalWinNum
 [1,]         56           42                 1000            533         224
 [2,]         57           33                 1000            127          42
 [3,]         57           35                 1000            654         229
 [4,]         18           31                 1000             49          15
 [5,]         77           35                 1000            982         345
 [6,]         77           45                 1000            339         153
 [7,]         31           44                 1000             70          31
 [8,]         76           39                 1000            229          89
 [9,]         40           21                 1000            430          92
[10,]         26           40                 1000             25          10
...
Scatter Plot / Matrix

(scatter-plot matrix, colored by each category / user attribution)

# Run as a batch command
$ R --vanilla --quiet < mongo2R.R
Munin and MongoDB
Monitoring DB Stats




Munin configuration examples - MongoDB

https://github.com/erh/mongo-munin

https://github.com/osinka/mongo-rs-munin
My Future
Analytic Architecture
Realtime Analysis with MongoDB
• Access Logs / User Trace Logs → Flume → capped collections (per hour),
  in realtime (hourly)
• capped collection → trimming / filtering / sum up → user_access
  → MapReduce + modifier sum up → daily/hourly_access
• capped collection → user_trace → daily/hourly_trace
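The "Modifier Sum Up" step above relies on Mongo's $inc update modifier with upsert. Below is a plain-Python simulation of that pattern, using a dict in place of a collection; with PyMongo the real call would be along the lines of `db.hourly_access.update(spec, {"$inc": {...}}, upsert=True)`. The function name and key format are assumptions.

```python
# Hypothetical simulation of an upsert with the $inc modifier.
def inc_upsert(collection, key, field, amount=1):
    """collection: dict of _id -> doc, standing in for a Mongo collection."""
    doc = collection.setdefault(key, {"_id": key})
    doc[field] = doc.get(field, 0) + amount

# incrementally sum hourly page views as log lines arrive
hourly = {}
inc_upsert(hourly, "2011-02-12+05+/MyPage", "pv")
inc_upsert(hourly, "2011-02-12+05+/MyPage", "pv")
```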
Flume
• Servers A–F emit access logs and user trace logs
• A Collector gathers them hourly / in realtime
• The Flume Plugin writes them into MongoDB
An Output From
                 Mongo-Flume Plugin
> db.flume_capped_21.find().limit(1).forEach(printjson)
{
        "_id" : ObjectId("4d658187de9bd9f24323e1b6"),
        "timestamp" : "Wed Feb 23 2011 21:52:06 GMT+0000 (UTC)",
        "nanoseconds" : NumberLong("562387389278959"),
        "hostname" : "ip-10-131-27-115.ap-southeast-1.compute.internal",
        "priority" : "INFO",
        "message" : "202.32.107.42 - - [14/Feb/2011:04:30:32 +0900] "GET /
avatar2-gree.4d537100/res/swf/avatar/18051727/5/useravatar1582476746.swf?
opensocial_app_id=472&opensocial_viewer_id=36858644&o
pensocial_owner_id=36858644 HTTP/1.1" 200 33640 "-" "DoCoMo/2.0 SH01C
(c500;TB;W24H16)"",
        "metadata" : {}
}



Mongo Flume Plugin: https://github.com/mongodb/mongo-hadoop/tree/master/flume_plugin
Summary
• Almighty as an Analytic Data Server
  • schema-free: social game data are changeable
  • rich queries: important for analysis from many points of view
  • powerful aggregation: map reduce
  • mongo shell: analysis from the mongo shell is speedy and handy

• More...
  • Scalability: Replication and Sharding are very easy to use
  • Node.js: enables server-side scripting with Mongo
My Presentations (in Japanese)
• MongoDB UI: http://www.slideshare.net/doryokujin/mongodb-uimongodb
• MongoDB / Ajax / GraphDB: http://www.slideshare.net/doryokujin/mongodbajaxgraphdb-5774546
• Hadoop / MongoDB: http://www.slideshare.net/doryokujin/hadoopmongodb
• GraphDB: http://www.slideshare.net/doryokujin/graphdbgraphdb
I ♥ MongoDB JP

• continue to be an organizer of MongoDB JP
• continue to propose many use cases of MongoDB
  • ex: Social Data, Log Data, Medical Data, ...

• support MongoDB users
  • by document translation, user-group, IRC, blog, book,
    twitter,...

• boosting services and products using MongoDB
Thank you for coming to
       Mongo Tokyo!!

[Contact me]
twitter: doryokujin
skype: doryokujin
mail: mr.stoicman@gmail.com
blog: http://d.hatena.ne.jp/doryokujin/
MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja

PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Approach and Philosophy of On baking technology
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Machine learning based COVID-19 study performance prediction
PDF
Empathic Computing: Creating Shared Understanding
PDF
Encapsulation_ Review paper, used for researhc scholars
Cloud computing and distributed systems.
Programs and apps: productivity, graphics, security and other tools
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
20250228 LYD VKU AI Blended-Learning.pptx
Assigned Numbers - 2025 - Bluetooth® Document
Teaching material agriculture food technology
Agricultural_Statistics_at_a_Glance_2022_0.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Machine Learning_overview_presentation.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Approach and Philosophy of On baking technology
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Reach Out and Touch Someone: Haptics and Empathic Computing
Network Security Unit 5.pdf for BCA BBA.
Dropbox Q2 2025 Financial Results & Investor Presentation
Machine learning based COVID-19 study performance prediction
Empathic Computing: Creating Shared Understanding
Encapsulation_ Review paper, used for researhc scholars

Social Data and Log Analysis Using MongoDB

  • 1. Social Data and Log Analysis Using MongoDB 2011/03/01(Tue) #mongotokyo doryokujin
  • 2. Self-Introduction • doryokujin (Takahiro Inoue), Age: 25 • Education: University of Keio • Master of Mathematics March 2011 ( Maybe... ) • Major: Randomized Algorithms and Probabilistic Analysis • Company: Geisha Tokyo Entertainment (GTE) • Data Mining Engineer (only me, part-time) • Organized Community: • MongoDB JP, Tokyo Web Mining
  • 3. My Job • I’m a Fledgling Data Scientist • Development of analytical systems for social data • Development of recommendation systems for social data • My Interest: Big Data Analysis • How to collect logs scattered across many servers • How to store and access the data • How to analyze and visualize billions of records
  • 4. Agenda • My Company’s Analytic Architecture • How to Handle Access Logs • How to Handle User Trace Logs • How to Collaborate with Front Analytic Tools • My Future Analytic Architecture
  • 5. Agenda • My Company’s Analytic Architecture • How to Handle Access Logs ( Hadoop, Mongo Map Reduce ) • How to Handle User Trace Logs ( Hadoop, Schema Free ) • How to Collaborate with Front Analytic Tools ( REST Interface, JSON ) • My Future Analytic Architecture ( Capped Collection, Modifier Operation ) • Of Course Everything With MongoDB
  • 7. Social Game (Mobile): Omiseyasan • Enjoy arranging their own shop (and avatar) • Communicate with other users by shopping, part-time, ... • Buy seeds of items to display their own shop
  • 9. Back-end Architecture • Pretreatment: Trimming, Validation, Filtering, ... ( Dumbo / Hadoop Streaming ) • As a Central Data Server ( PyMongo ) • Back Up To S3
  • 10. Front-end Architecture sleepy.mongoose (REST Interface) PyMongo Web UI Social Data Analysis Data Analysis
  • 11. Environment • MongoDB: 1.6.4 • PyMongo: 1.9 • Hadoop: CDH2 ( soon update to CDH3 ) • Dumbo: Simple Python Module for Hadoop Streaming • Cassandra: 0.6.11 • R, Neo4j, jQuery, Munin, ... • [Data Size (a rough estimate)] • Access Log 15GB / day ( gzip ) - 2,000M PV • User Trace Log 5GB / day ( gzip )
  • 12. How to Handle Access Logs
  • 13. How to Handle Access Logs • Pretreatment: Trimming, Validation, Filtering, ... • As a Data Server • Back Up To S3
  • 14. Access Data Flow ( Caution: needs MongoDB >= 1.7.4 ) • Pretreatment → user_access • 1st Map Reduce ( group by ) → user_pageview, agent_pageview, hourly_pageview • 2nd Map Reduce → daily_pageview
  • 15. Hadoop • Using Hadoop: Pretreatment of Raw Records • [Map / Reduce] • Read all records • Split each record by ‘\s’ • Filter unnecessary records (such as *.swf) • Check whether each record is correct • Insert (save) records into MongoDB ※ write operations won’t yet fully utilize all cores
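The pretreatment step above can be sketched in plain Python as a Hadoop-Streaming-style mapper. This is a minimal sketch under assumptions: the regex targets the combined log format of the sample records on the next slide, the field names mirror the documents shown later, and the real Dumbo job is not in the deck.

```python
import re

# Combined-log-format pattern: host, timestamp, method, resource, status, size
LOG_RE = re.compile(
    r'^(?P<ipaddr>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def mapper(line):
    """Parse one raw access-log line; return a doc, or None if filtered out."""
    m = LOG_RE.match(line)
    if m is None:
        return None                               # drop malformed records
    resource = m.group('resource')
    path = resource.split(';')[0].split('?')[0]   # strip jsessionid and query
    if path.endswith('.swf'):                     # filter unnecessary records
        return None
    return {
        'ipaddr': m.group('ipaddr'),
        'requestTimeStr': m.group('ts'),
        'splittedPath': path,
        'statusCode': m.group('status'),
        'responseBodySize': 0 if m.group('size') == '-' else int(m.group('size')),
        'resource': resource,
    }
```

Each surviving doc would then be saved to MongoDB (e.g. `collection.save(doc)` with PyMongo) as the final step of the job.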
  • 16. Access Logs 110.44.178.25 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetPage.html;jsessionid=9587B0309581914AB7438A34B1E51125-n15.at3?collection=12&opensocial_app_id=00000&opensocial_owner_id=00000 HTTP/1.0" 200 6773 "-" "DoCoMo/2.0 ***" 110.44.178.26 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/shopping/battle/ShoppingBattleTopPage.html;jsessionid=D901918E3CAE46E6B928A316D1938C3A-n11.ap1?opensocial_app_id=00000&opensocial_owner_id=11111 HTTP/1.0" 200 15254 "-" "DoCoMo/2.0 ***" 110.44.178.27 - - [19/Nov/2010:04:40:40 +0900] "GET /playshop.4ce13800/battle/BattleSelectAssetDetailPage;jsessionid=202571F97B444370ECB495C2BCC6A1D5-n14.at11?asset=53&collection=9&opensocial_app_id=00000&opensocial_owner_id=22222 HTTP/1.0" 200 11616 "-" "SoftBank/***" ...(many records)
  • 17. Collection: user_trace > db.user_trace.find({userId: "7777", date: "2011-02-12"}).limit(0) .forEach(printjson) { "_id" : "2011-02-12+05:39:31+7777+18343+Access", "lastUpdate" : "2011-02-19", "ipaddr" : "202.32.107.166", "requestTimeStr" : "12/Feb/2011:05:39:31 +0900", "date" : "2011-02-12", "time" : "05:39:31", "responseBodySize" : 18343, "userAgent" : "DoCoMo/2.0 SH07A3(c500;TB;W24H14)", "statusCode" : "200", "splittedPath" : "/avatar2-gree/MyPage", "userId" : "7777", "resource" : "/avatar2-gree/MyPage;jsessionid=...? battlecardfreegacha=1&feed=...&opensocial_app_id=...&opensocial_viewer_id=...& opensocial_owner_id=..." }
  • 18. 1st Map Reduce • [Aggregation] • Group by url, date, userId • Group by url, date, userAgent • Group by url, date, time • Group by url, date, statusCode • Map Reduce operations run in parallel on all shards
  • 19. 1st Map Reduce with PyMongo ( group keys: this.userId / this.userAgent / this.timeRange / this.statusCode ) map = Code(""" function(){ emit({ path:this.splittedPath, userId:this.userId, date:this.date },1)} """) reduce = Code(""" function(key, values){ var count = 0; values.forEach(function(v) { count += 1; }); return {"count": count, "lastUpdate": today}; } """)
  • 20. # ( mongodb >= 1.7.4 ) result = db.user_access.map_reduce(map, reduce, out="user_pageview", merge_output=True, full_response=True, query={"date": date}) • About the output collection, there are 4 options (MongoDB >= 1.7.4): • out : overwrite the collection if it already exists • merge_output : merge new data into the old output collection • reduce_output : a reduce operation will be performed on the two values (the same key in the new result and the old collection) and the result will be written to the output collection • full_response (=False) : if True, return stats on the operation • inline : no collection is created and the whole map-reduce operation happens in RAM; the result set must fit within the 8MB/doc limit (16MB/doc in 1.8?)
  • 21. Map Reduce (>=1.7.4): out option in JavaScript • "collectionName" : If you pass a string indicating the name of a collection, then the output will replace any existing output collection with the same name. • { merge : "collectionName" } : This option will merge new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one. • { reduce : "collectionName" } : If documents exist for a given key in the result set and in the old collection, then a reduce operation (using the specified reduce function) will be performed on the two values and the result will be written to the output collection. If a finalize function was provided, this will be run after the reduce as well. • { inline : 1 } : With this option, no collection will be created, and the whole map-reduce operation will happen in RAM. Also, the results of the map-reduce will be returned within the result object. Note that this option is possible only when the result set fits within the 8MB limit. http://www.mongodb.org/display/DOCS/MapReduce
  • 22. Collection: user_pageview ( query features: regular expressions; <, >, <=, >= ) > db.user_pageview.find({ "_id.userId": "7777", "_id.path": /.*MyPage$/, "_id.date": {$lte: "2011-02-12"} }).limit(1).forEach(printjson) ##### { "_id" : { "date" : "2011-02-12", "path" : "/avatar2-gree/MyPage", "userId" : "7777", }, "value" : { "count" : 10, "lastUpdate" : "2011-02-19" } }
  • 23. 2nd Map Reduce with PyMongo map = Code(""" function(){ emit({ "path" : this._id.path, "date": this._id.date, },{ "pv": this.value.count, "uu": 1 }); } """) reduce = Code(""" function(key, values){ var pv = 0; var uu = 0; values.forEach(function(v){ pv += v.pv; uu += v.uu; }); return {"pv": pv, "uu": uu}; } """)
  • 24. 2nd Map Reduce with PyMongo map = Code(""" function(){ emit({ "path" : this._id.path, "date": this._id.date, },{ "pv": this.value.count, "uu": 1 }); } """) reduce = Code(""" function(key, values){ var pv = 0; var uu = 0; values.forEach(function(v){ pv += v.pv; uu += v.uu; }); return {"pv": pv, "uu": uu}; } """) ← every emitted value must have the same keys ( {“pv”: NaN} if not )
  • 25. # ( mongodb >= 1.7.4 ) result = db.user_pageview.map_reduce(map, reduce, out="daily_pageview", merge_output=True, full_response=True, query={"date": date})
  • 26. Collection: daily_pageview > db.daily_pageview.find({ "_id.date": "2011-02-12", "_id.path": /.*MyPage$/ }).limit(1).forEach(printjson) { "_id" : { "date" : "2011-02-12", "path" : "/avatar2-gree/MyPage", }, "value" : { "uu" : 53536, "pv" : 539467 } }
  • 27. Current Map Reduce is Imperfect • [Single Thread per Node] • Doesn't scale map-reduce across multiple threads • [Overwrites the Output Collection] • Overwrites the old collection ( no other options like “merge” or “reduce” ) # mapreduce code to merge output (MongoDB < 1.7.4) result = db.user_access.map_reduce(map, reduce, full_response=True, out="temp_collection", query={"date": date}) [db.user_pageview.save(doc) for doc in db.temp_collection.find()]
  • 28. Useful Reference: Map Reduce • http://www.mongodb.org/display/DOCS/MapReduce • A Look At MongoDB 1.8's MapReduce Changes • Map Reduce and Getting Under the Hood with Commands • Map/reduce runs in parallel/distributed? • Map/Reduce parallelism with Master/Slave • mapReduce locks the whole server • mapreduce vs find
  • 29. How to Handle User Trace Logs
  • 30. How to Handle User Trace Logs • Pretreatment: Trimming, Validation, Filtering, ... • As a Data Server • Back Up To S3
  • 31. User Trace / Charge Data Flow • Pretreatment → user_charge → daily_charge • Pretreatment → user_trace → daily_trace
  • 33. Hadoop • Using Hadoop: Pretreatment of Raw Records • [Map / Reduce] • Split each record by ‘\s’ • Filter unnecessary records • Check whether users behave dishonestly • Unify the format so records can be summed up ( because raw records are written in a free format ) • Sum up records grouped by “userId” and “actionType” • Insert (save) records into MongoDB ※ write operations won’t yet fully utilize all cores
  • 34. An Example of User Trace Log UserId ActionType ActionDetail
  • 35. An Example of User Trace Log ( the value of “actionDetail” must be in a unified format ) -----Change------ ActionLogger a{ChangeP} (Point,1371,1383) ActionLogger a{ChangeP} (Point,2373,2423) ------Get------ ActionLogger a{GetMaterial} (syouhinnomoto,0,-1) ActionLogger a{GetMaterial} usesyouhinnomoto ActionLogger a{GetMaterial} (omotyanomotoPRO,1,6) -----Trade----- ActionLogger a{Trade} buy 3 itigoke-kis from gree.jp:00000 # -----Make----- ActionLogger a{Make} make item kuronekono_n ActionLogger a{MakeSelect} make item syouhinnomoto ActionLogger a{MakeSelect} (syouhinnomoto,0,1) -----PutOn/Off----- ActionLogger a{PutOff} put off 1 ksuteras ActionLogger a{PutOn} put 1 burokkus @2500 -----Clear/Clean----- ActionLogger a{ClearLuckyStar} Clear LuckyItem_1 4 times -----Gatcha----- ActionLogger a{Gacha} Play gacha with first free play: ActionLogger a{Gacha} Play gacha:
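Because the raw lines are free-format, the pretreatment job first pulls a uniform (actionType, actionDetail) pair out of each line. A minimal parsing sketch (the helper names here are assumptions, not the author's actual code):

```python
import re

# 'ActionLogger a{ChangeP} (Point,1371,1383)' -> actionType and detail text
ACTION_RE = re.compile(r'^ActionLogger a\{(?P<actionType>\w+)\} (?P<detail>.+)$')

def parse_trace(line):
    """Split one ActionLogger line into (actionType, actionDetail)."""
    m = ACTION_RE.match(line.strip())
    return None if m is None else (m.group('actionType'), m.group('detail'))

def parse_tuple(detail):
    """Unify '(name,n1,n2,...)' details into a ('name', n1, n2, ...) tuple."""
    if detail.startswith('(') and detail.endswith(')'):
        name, *nums = detail[1:-1].split(',')
        return (name, *(int(n) for n in nums))
    return None  # free-text details ('buy 3 ...', 'make item ...') stay as strings
```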
  • 36. Collection: user_trace ( values are summed up grouped by “userId” and “actionType” ) > db.user_trace.find({date:"2011-02-12", actionType:"a{Make}", userId:"7777"}).forEach(printjson) { "_id" : "2011-02-12+7777+a{Make}", "date" : "2011-02-12" "lastUpdate" : "2011-02-19", "userId" : "7777", "actionType" : "a{Make}", "actionDetail" : { "make item ksutera" : 3, "make item makaron" : 1, "make item huwahuwamimiate" : 1, … } }
  • 37. Collection: daily_trace > db.daily_trace.find({ date:{$gte:"2011-02-12”,$lte:”2011-02-19”}, actionType:"a{Make}"}).forEach(printjson) { "_id" : "2011-02-12+group+a{Make}", "date" : "2011-02-12", "lastUpdate" : "2011-02-19", "actionType" : "a{Make}", "actionDetail" : { "make item kinnokarakuridokei" : 615, "make item banjo-" : 377, "make item itigoke-ki" : 135904, ... }, ... }...
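Counters like the actionDetail fields above also map naturally onto MongoDB's modifier operations: one upsert with $inc per parsed record lets the server do the summing. A sketch that only builds the (spec, update) pair, with the _id scheme following the documents above; with a live connection it would be applied as `db.user_trace.update(spec, update, upsert=True)` in PyMongo:

```python
def inc_update(date, user_id, action_type, item, n=1):
    """Build an upsert (spec, update) pair for one parsed user-trace record."""
    spec = {'_id': '%s+%s+%s' % (date, user_id, action_type)}
    update = {
        '$set': {'date': date, 'userId': user_id, 'actionType': action_type},
        # dotted path: $inc creates/increments exactly one counter sub-field
        '$inc': {'actionDetail.%s' % item: n},
    }
    return spec, update
```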
  • 39. Collection: user_charge ( values are summed up grouped by “userId” and “actionType” ) // TOP10 Users at 2011-02-12 by Charge Amount > db.user_charge.find({date:"2011-02-12"}) .sort({totalCharge:-1}).limit(10).forEach(printjson) { "_id" : "2011-02-12+7777+Charge", "date" : "2011-02-12", "lastUpdate" : "2011-02-19", "totalCharge" : 10000, "userId" : "7777", "actionType" : "Charge", "boughtItem" : { " EX" : 13, " +6000" : 3, " PRO" : 20 } } {…
  • 40. Collection: daily_charge > db.daily_charge.find({date:"2011-02-12",T:"all"}) .limit(10).forEach(printjson) { "_id" : "2011-02-12+group+Charge+all+all", "date" : "2011-02-12", "total" : 100000, "UU" : 2000, "group" : { " " : 1000000, " " : 1000000, ... }, "boughtItemNum" : { " EX" : 8, " " : 730, ... }, "boughtItem" : { " EX" : 10000, " " : 100000, ... } }
  • 42. Categorize Users • [Categorize Users] • by play term • by total amount of charge • by registration date • [Take a Snapshot of Each Category’s Stats per Week] ( attributions from user_trace, user_registration, user_charge, user_savedata, and user_pageview feed into user_category )
  • 43. Collection: user_registration ( tagging users with categories ) > db.user_registration.find({userId:"7777"}).forEach(printjson) { "_id" : "2010-06-29+7777+Registration", "userId" : "7777" "actionType" : "Registration", "category" : { “R1” : “True”, # “T” : “ll” # … }, “firstCharge” : “2010-07-07”, # “lastLogin” : “2010-09-30”, # “playTerm” : 94, “totalCumlativeCharge” : 50000, # “totalMonthCharge” : 10000, # … }
  • 44. Collection: user_category > var cross = new Cross() # User Defined Function > MCResign = cross.calc(“2011-02-12”,“MC”,1) # each value is the number of users # Charge(yen)/Term(day) 0(z) ~¥1k(s) ~¥10k(m) ¥100k~(l) total ~1day(z) 50000 10 5 0 50015 ~1week(s) 50000 100 50 3 50153 ~1month(m) 100000 200 100 1 100301 ~3month(l) 100000 300 50 6 100356 month~(ll) 0 0 0 0 0
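Cross() above is the author's own user-defined helper and its source is not in the deck; the following is a pure-Python sketch of the same charge x play-term cross tabulation. The bucket boundaries are read off the table headers and are assumptions:

```python
def charge_bucket(yen):
    """0(z) / ~1k(s) / ~10k(m) / more(l) -- boundaries assumed from the table."""
    if yen == 0:
        return 'z'
    if yen <= 1000:
        return 's'
    if yen <= 10000:
        return 'm'
    return 'l'

def term_bucket(days):
    """~1day(z) / ~1week(s) / ~1month(m) / ~3month(l) / longer(ll)."""
    if days <= 1:
        return 'z'
    if days <= 7:
        return 's'
    if days <= 30:
        return 'm'
    if days <= 90:
        return 'l'
    return 'll'

def cross(users):
    """Count users per (term, charge) cell; docs shaped like user_registration."""
    table = {}
    for u in users:
        key = (term_bucket(u['playTerm']), charge_bucket(u['totalCumlativeCharge']))
        table[key] = table.get(key, 0) + 1
    return table
```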
  • 45. How to Collaborate With Front Analytic Tools
  • 46. Front-end Architecture sleepy.mongoose (REST Interface) PyMongo Web UI Social Data Analysis Data Analysis
  • 47. Web UI and Mongo
  • 48. Data Table: jQuery.DataTables [ Data Table ] • 1 Variable length pagination 2 On-the-fly filtering 3 Multi-column sorting with data type detection 4 Smart handling of column widths 5 Scrolling options for table viewport 6 ... • Want to Share a Daily Summary • Want to See Data from Many Viewpoints • Want to Implement Easily
  • 49. Graph: jQuery.HighCharts [ Graph ] • 1. Numerous Chart Types 2. Simple Configuration Syntax 3. Multiple Axes 4. Tooltip Labels 5. Zooming 6. ... • Want to Visualize Data • Handle Mainly Time Series Data • Want to Implement Easily
  • 50. sleepy.mongoose • [REST Interface + Mongo] • Get Data by HTTP GET/POST Request • sleepy.mongoose ‣ request as “/db_name/collection_name/_command” ‣ made by a 10gen engineer: @kchodorow ‣ Sleepy.Mongoose: A MongoDB REST Interface
  • 51. sleepy.mongoose //start server > python httpd.py …listening for connections on http://localhost:27080 //connect to MongoDB > curl --data server=localhost:27017 'http://localhost:27080/ _connect’ //request example > http://localhost:27080/playshop/daily_charge/_find?criteria={} &limit=10&batch_size=10 {"ok": 1, "results": [{“_id": “…”, ”date":… },{“_id”:…}], "id": 0}}
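The same _find request can be issued from Python with nothing but the standard library. A sketch: the host, port, and collection names follow the curl example above, and the canned response string stands in for a live sleepy.mongoose server:

```python
import json
from urllib.parse import urlencode

def build_find_url(base, db, collection, criteria, limit=10):
    """Build a sleepy.mongoose /_find URL like the curl example."""
    qs = urlencode({'criteria': json.dumps(criteria),
                    'limit': limit, 'batch_size': limit})
    return '%s/%s/%s/_find?%s' % (base, db, collection, qs)

url = build_find_url('http://localhost:27080', 'playshop', 'daily_charge', {})

# Against a running server the fetch would be:
#   from urllib.request import urlopen
#   docs = json.loads(urlopen(url).read())['results']
response = '{"ok": 1, "results": [{"date": "2011-02-12"}], "id": 0}'
docs = json.loads(response)['results']    # list of plain dicts, ready for jQuery/R
```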
  • 52. JSON: Mongo <---> Ajax sleepy.mongoose (REST Interface) Get JSON • jQuery libraries and MongoDB are compatible • There is no need to write HTML tags (such as <table>) by hand
  • 57. Collection: user_registration ( want to know the relation between user attributions ) > db.user_registration.find({userId:"7777"}).forEach(printjson) { "_id" : "2010-06-29+7777+Registration", "userId" : "7777" "actionType" : "Registration", "category" : { “R1” : “True”, # “T” : “ll” # … }, “firstCharge” : “2010-07-07”, # “lastLogin” : “2010-09-30”, # “playTerm” : 94, “totalCumlativeCharge” : 50000, # “totalMonthCharge” : 10000, # … }
  • 58. R Code: Access MongoDB Using sleepy.mongoose ##### LOAD LIBRARY ##### library(RCurl) library(rjson) ##### CONF ##### today.str <- format(Sys.time(), "%Y-%m-%d") url.base <- "http://localhost:27080" mongo.db <- "playshop" mongo.col <- "user_registration" mongo.base <- paste(url.base, mongo.db, mongo.col, sep="/") mongo.sort <- "" mongo.limit <- "limit=100000" mongo.batch <- "batch_size=100000"
  • 59. R Code: Access MongoDB Using sleepy.mongoose ##### FUNCTION ##### find <- function(url){ mongo <- fromJSON(getURL(url)) docs <- mongo$result makeTable(docs) # My Function } # Example # Using sleepy.mongoose https://github.com/kchodorow/sleepy.mongoose mongo.criteria <- "_find?criteria={ ¥ "totalCumlativeCharge":{"$gt":0,"$lte":1000}}" mongo.query <- paste(mongo.criteria, mongo.sort, ¥ mongo.limit, mongo.batch, sep="&") url <- paste(mongo.base, mongo.query, sep="/") user.charge.low <- find(url)
  • 60. The Result # Result: 10th Document [[10]] [[10]]$playTerm [1] 31 [[10]]$lastUpdate [1] "2011-02-24" [[10]]$userId [1] "7777" [[10]]$totalCumlativeCharge [1] 10000 [[10]]$lastLogin [1] "2011-02-21" [[10]]$date [1] "2011-01-22" [[10]]$`_id` [1] "2011-02-12+18790376+Registration" ...
  • 61. Make a Data Table from The Result # Result: Translate Document to Table playTerm totalWinRate totalCumlativeCharge totalCommitNum totalWinNum [1,] 56 42 1000 533 224 [2,] 57 33 1000 127 42 [3,] 57 35 1000 654 229 [4,] 18 31 1000 49 15 [5,] 77 35 1000 982 345 [6,] 77 45 1000 339 153 [7,] 31 44 1000 70 31 [8,] 76 39 1000 229 89 [9,] 40 21 1000 430 92 [10,] 26 40 1000 25 10 ...
  • 62. Scatter Plot / Matrix Each Category (User Attribution) # Run as a batch command $ R --vanilla --quiet < mongo2R.R
  • 64. Monitoring DB Stats Munin configuration examples - MongoDB https://github.com/erh/mongo-munin https://github.com/osinka/mongo-rs-munin
  • 66. Realtime Analysis ( Flume with MongoDB ) • Access Logs → Flume → capped collection _access ( realtime, per hour ) → Trimming / Filtering → user_access → MapReduce / Modifier sum up → daily/hourly_access • User Trace Logs → Flume → capped collection _trace ( realtime, per hour ) → Trimming / Filtering → user_trace → MapReduce / Modifier sum up → daily/hourly_trace
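In this realtime flow a consumer would tail the capped collection (a tailable cursor in PyMongo) and fold each record into hourly counters with a modifier update. A sketch of just the bucketing step; the collection and field names are assumptions matching the diagram, and the timezone offset is dropped for brevity:

```python
from datetime import datetime

def hourly_update(request_time, path):
    """Turn one tailed access record into an hourly $inc update document."""
    ts = datetime.strptime(request_time, '%d/%b/%Y:%H:%M:%S')
    hour = ts.strftime('%Y-%m-%d %H:00')
    spec = {'_id': '%s+%s' % (hour, path)}   # one doc per (hour, path) bucket
    update = {'$inc': {'pv': 1}}             # modifier op: server-side counter
    return spec, update

# With a live server this would be applied as an upsert:
#   db.hourly_access.update(spec, update, upsert=True)
```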
  • 67. Flume • Servers A–F emit Access Logs and User Trace Logs • a Flume plugin ships them ( hourly / realtime ) to the Mongo collector DB
  • 68. An Output From Mongo-Flume Plugin > db.flume_capped_21.find().limit(1).forEach(printjson) { "_id" : ObjectId("4d658187de9bd9f24323e1b6"), "timestamp" : "Wed Feb 23 2011 21:52:06 GMT+0000 (UTC)", "nanoseconds" : NumberLong("562387389278959"), "hostname" : "ip-10-131-27-115.ap-southeast-1.compute.internal", "priority" : "INFO", "message" : "202.32.107.42 - - [14/Feb/2011:04:30:32 +0900] "GET /avatar2-gree.4d537100/res/swf/avatar/18051727/5/useravatar1582476746.swf?opensocial_app_id=472&opensocial_viewer_id=36858644&opensocial_owner_id=36858644 HTTP/1.1" 200 33640 "-" "DoCoMo/2.0 SH01C (c500;TB;W24H16)"", "metadata" : {} } Mongo Flume Plugin: https://github.com/mongodb/mongo-hadoop/tree/master/flume_plugin
  • 70. Summary • Almighty as an Analytic Data Server • schema-free: social game data are changeable • rich queries: important for analyzing from many points of view • powerful aggregation: map reduce • mongo shell: analysis from the mongo shell is speedy and handy • More... • Scalability: setting up Replication and Sharding is very easy • Node.js: enables server-side scripting with Mongo
  • 71. My Presentation MongoDB UI MongoDB : http://www.slideshare.net/doryokujin/mongodb-uimongodb MongoDB Ajax GraphDB : http://www.slideshare.net/doryokujin/mongodbajaxgraphdb-5774546 Hadoop MongoDB : http://www.slideshare.net/doryokujin/hadoopmongodb GraphDB GraphDB : http://www.slideshare.net/doryokujin/graphdbgraphdb
  • 72. I ♥ MongoDB JP • continue to be an organizer of MongoDB JP • continue to propose many use cases of MongoDB • ex: Social Data, Log Data, Medical Data, ... • support MongoDB users • by document translation, user-group, IRC, blog, book, twitter, ... • boost services and products using MongoDB
  • 73. Thank you for coming to Mongo Tokyo!! [Contact me] twitter: doryokujin skype: doryokujin mail: [email protected] blog: http://d.hatena.ne.jp/doryokujin/ MongoDB JP: https://groups.google.com/group/mongodb-jp?hl=ja