Daniel Sikar: Hadoop MapReduce - 06/09/2010

1. QUICK AND DIRTY PARALLEL PROCESSING ON THE CLOUD Daniel Sikar

2. EC2 S3

4. Tools AWS Command line tools

5. Elastic MapReduce Ruby library

6. Hadoop

7. s3cmd

8. Hadoop MapReduce Job Tracker + Task Tracker + Slaves HDFS – Distributed file system

9. Hadoop MapReduce usage: data crunching in general (clicks, statistics, etc.)

10. Hadoop Project Mgmt Committee

11. MapReduce ?

12. MapReduce: key/value pairs <key, value>

13. MapReduce

14. HTTP Logs
    Log file A:
    (...) FreeTouchScreenNokia5230 (...)
    (...) GetRidofAllSpeedCameras (...)
    (...) USManWinsLottery (...)
    (...) BNPToLaunchElectionManifesto (...)
    Log file B:
    (...) FreeTouchScreenNokia5230 (...)
    (...) BodyLanguageTellsAll (...)

15. MapReduce <FreeTouchScreenNokia5230, 1> + <FreeTouchScreenNokia5230, 1> = <FreeTouchScreenNokia5230, 2>
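The sum on this slide can be traced through the three stages of a MapReduce job. The following is a minimal in-memory sketch, not the framework itself: the function names map_phase, shuffle and reduce_phase are made up for illustration, and the input is just the page titles from the two sample log files, with the elided "(...)" fields left out.

```python
from collections import defaultdict

# Page titles taken from the two sample log files on the HTTP Logs slide.
# The elided "(...)" fields are not reproduced here.
LOG_A = [
    "FreeTouchScreenNokia5230",
    "GetRidofAllSpeedCameras",
    "USManWinsLottery",
    "BNPToLaunchElectionManifesto",
]
LOG_B = ["FreeTouchScreenNokia5230", "BodyLanguageTellsAll"]


def map_phase(titles):
    """Emit one <title, 1> pair per log entry."""
    return [(title, 1) for title in titles]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each log file is mapped independently; the pairs are then merged and reduced.
pairs = map_phase(LOG_A) + map_phase(LOG_B)
totals = reduce_phase(shuffle(pairs))
print(totals["FreeTouchScreenNokia5230"])  # 2
```

Only the title that appears in both log files ends up with a count of 2. Every other title keeps a count of 1.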

16. Hadoop Streaming
    Running MapReduce jobs with .exe files and scripts
    $ <list> | mapper | reducer

17. Hadoop Streaming
    Running MapReduce jobs with .exe files and scripts
    $ <list> | mapper | reducer
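The pipeline on the Hadoop Streaming slides can be tried locally before any cluster is involved. The sketch below assumes the usual streaming convention of one tab-separated key and value per line. The mapper, reducer and run_pipeline names are illustrative and are not the scripts used in the talk. Hadoop sorts mapper output by key before it reaches the reducer, so the local stand-in adds an explicit sort between the two stages.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for every non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_pipeline(text):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    mapped = mapper(io.StringIO(text))
    return list(reducer(sorted(mapped)))


sample = (
    "FreeTouchScreenNokia5230\n"
    "BodyLanguageTellsAll\n"
    "FreeTouchScreenNokia5230\n"
)
result = run_pipeline(sample)
print(result)
```

In an actual streaming job each function would live in its own script, reading from standard input and writing to standard output.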

18. Real life example of Hadoop Streaming usage

19. Wikipedia Page Access Logs

20. Wine Grape Varieties

21. Wikipedia WGV Page Access Stats

22. Business Decisions

23. Launching a virtual Hadoop Cluster
    $ elastic-mapreduce --create --name "Wiki log crunch" --alive --num-instances 20 --instance-type c1.medium
    Created job flow <job flow id>
    $ ec2din
    (...)

25. Hadoop Standalone Operation

26. Pseudo-Distributed Operation

27. Fully-Distributed Operation

28. NameNode

29. JobTracker

30. DataNode + TaskTracker

31. Hadoop Standalone Operation

32. Pseudo-Distributed Operation

33. Fully-Distributed Operation

34. NameNode

35. JobTracker

36. DataNode + TaskTracker

37. Add a step
    $ elastic-mapreduce --jobflow <jfid> --stream \
        --step-name "Wiki log crunch" \
        --input s3n://dsikar-wikilogs-2009/dec/ \
        --output s3n://dsikar-wikilogs-output/21 \
        --mapper s3n://dsikar-wiki-scripts/wikidictionarymap.pl \
        --reducer s3n://dsikar-wiki-scripts/wikireduce.pl
    http://<instance public dns>:9100
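The contents of wikidictionarymap.pl are not shown in the slides. As a rough idea of what a dictionary-filtering mapper for this step might look like, here is a sketch that assumes the Wikipedia pagecounts layout of four space-separated fields per line: project, page title, request count, bytes. The GRAPE_VARIETIES set, the dictionary_mapper function and the sample lines are all hypothetical.

```python
# Hypothetical dictionary; the word list used in the talk is not shown.
GRAPE_VARIETIES = {"Merlot", "Pinot_noir", "Chardonnay"}


def dictionary_mapper(lines, dictionary=GRAPE_VARIETIES, project="en"):
    """Emit 'title<TAB>requests' for lines whose title is in the dictionary.

    Assumes four fields per line: project, page title, request count, bytes.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip malformed lines
        proj, title, requests, _size = fields
        if proj == project and title in dictionary and requests.isdigit():
            yield f"{title}\t{requests}"


# Made-up sample lines in the assumed pagecounts layout.
sample = [
    "en Merlot 12 204800",
    "en Main_Page 900 1234567",
    "de Merlot 3 51200",
    "en Pinot_noir 7 102400",
    "broken line",
]
emitted = list(dictionary_mapper(sample))
print(emitted)
```

Filtering in the mapper keeps the data passed to the reducer small, since only the titles of interest survive the map stage.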

38. s3cmd
    # make bucket
    $ s3cmd mb s3://dsikar-wikilogs
    # put log files
    $ s3cmd put pagecounts-200912*.gz s3://dsikar-wikilogs/dec
    $ s3cmd put pagecounts-201004*.gz s3://dsikar-wikilogs/apr
    # list log files
    $ s3cmd ls s3://dsikar-wikilogs/
    # put scripts
    $ s3cmd put *.pl s3://dsikar-wiki-scripts/
    # delete log files
    $ s3cmd del --recursive --force s3://dsikar-wikilogs/
    # remove bucket
    $ s3cmd rb s3://dsikar-wikilogs/

39. Elastic MapReduce --create --list --jobflow --describe --stream --terminate

40. Output files part-00000 part-00001 part-00002 (...)

41. Further aggregation
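The slides do not say what the further aggregation consisted of. One plausible reading is combining the part-NNNNN files after downloading them. The sketch below assumes each line holds a tab-separated key and count. The merge_parts function and the demo data are hypothetical. Within a single job each key lands in exactly one part file, so the summing matters mainly when the outputs of several runs (for example the dec and apr logs) are placed together.

```python
import glob
import os
import tempfile
from collections import Counter


def merge_parts(directory):
    """Sum counts per key across all part-* files in a directory."""
    totals = Counter()
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                key, sep, count = line.rstrip("\n").partition("\t")
                if sep and count.strip().isdigit():
                    totals[key] += int(count)
    return totals


# Demo with made-up data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-00000"), "w", encoding="utf-8") as f:
        f.write("Merlot\t12\nChardonnay\t5\n")
    with open(os.path.join(tmp, "part-00001"), "w", encoding="utf-8") as f:
        f.write("Merlot\t3\n")
    merged = merge_parts(tmp)

# Print keys ranked by total count, highest first.
for key, count in merged.most_common():
    print(key, count)
```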

42. Conclusion: Hadoop MapReduce provides out-of-the-box, ready-to-go distributed computing.
