Monthly Archives: January 2015

Walk Through of Mahout LDA 0.9 againt Tweets on CDH 5.2.+

For the time being, only key points are listed. Best Tutorial Ever Follow Tom’s book on develop MapReduce Programs in a regression way, unit test -> local file -> hdfs file -> multiple hdfs files Package dependencies with maven shade plugin Flume generates too many small files 1170+ files, on my single CDH machine MR takes longer time 1 hour+ to process  (about 23k tweets) MR produces one 2.2 MB sequence file with key tweetID and value text. Create vectors using mahout seq2sparse (stopwords can be filtered out by the maxDFPercentage and using TF, ?? unknown). Create rowid for mahout LDA to work (convert into matrix ?? unknown).Run mahout cvb. Beware that if you don’t want to hit hdfs, remember to set sth. to MAHOUT_LOCAL so that mahout will run only in local mode. Applying seqdumper to read intermediate sequence file and using vectordump to display LDA output (topic term is the one displaying terms for a topic, whereas topic document is for document and topic matching) Yarn Related:

  • unset HADOOP_CLASSPATH (make sure Yarn will see the exact path as you are using. Your local Hadoop Classpath will not be able to transfer to Yarn)
  • sudo -u hdfs hadoop jar thePackagedJar.jar DriverClass
  • -fs hdfs://cdh-manager.localdomain (file:/// means local file system with respect to the current directory or absolute path)
  • -jt cdh-manager.localdomain:8032 (local means using local runner; this url pointing to Yarn’s resource manager)
  • /user/flume/tweets/2014/12/22/{13,14,15}” (input arg[0] taking a list of directories as set. For full list of file path pattern supported by hadoop, search “hadoop fs path pattern”)
  • /out-seq (output directory with respect to hdfs:/// root)

CDH may not be able to run your application for above mahout, because mahout may contains several application in a pipeline within the mahout scripts. Cases may happen when job is stuck at the point of requiring resources. My intake is that browsing the configuration of Yarn, and set all the memory and vcore related configuration to the default (without worrying about the physical limitations).

  • mahout seqdumper -i <input file name, address depending on MAHOUT_HOME local or hdfs>
  • -o <a local file>
  • sudo -u hdfs mahout vectordump
  • -i /tweet-lda
  • -o /tmp/topic_terms
  • -p true (Print out the key as well, delimited by tab (or the value if useKey is true)
  • -d /seqdir-sparse-lda/dictionary.file-* (the dictionary used for interpreting term)
  • -dt sequencefile (The dictionary file type (text|seqfile) is sequencefile)
  • -sort /tweet-lda (Sort output key/value pairs of the vector entries in abs magnitude of ?? unknown)
  • -vs 10 (each topic to display the top 10 terms associated with this topic)
  • -ni 1000000 (max number of items, in this case the max number of topics)

Mahout driver classes use a lot of shorthand. Check what they mean by going to file driver.classes.default.props located under conf/  

1. sudo -u hdfs mahout seq2sparse -i /demo-out-seq -o /demo-seq2sparse -ow --maxDFPercent 85 --namedVector -wt TF 2. sudo -u hdfs mahout rowid -i /demo-seq2sparse/tf-vectors -o /demo-rowid-matrix 3. sudo -u hdfs mahout cvb -i /demo-rowid-matrix/matrix -o /demo-tweet-cvb -k 10 -ow -x 5 -dict /demo-seq2sparse/dictionary.file-0 -dt /demo-cvb-topics -mt /demo-cvb-model 4. sudo -u hdfs mahout vectordump -i /demo-tweet-cvb -o /tmp/demo-topic_terms -p true -d /demo-seq2sparse/dictionary.file-* -dt sequencefile -sort /demo-tweet-cvb -vs 10 -ni 1000000

When writing a MapReduce Application, following these steps.

  1. Unit test your Mapper and Reducer class. First create unit test on the individual class methods. Second use mrunit to write drive test on mapper and reducer. Driver’s configuration can be set to test various situations.
  2. Write your application driver test. Of course you gonna need your driver class first by extending Configured and implementing Tool. In your test, invoke driver’s run method to actually run your applications. It is somehow like the integration test, since it is going to hit some file system (either local or hdfs). The purpose here is to test out a smaller real data set to see if your application is working.
  3. The last testing is really the cluster deployment testing. You can involve a relatively larger dataset and actually submit your driver application to the cluster.

Following above steps allow you to have confidence when debugging. Messing up the sequence will give you no idea where the possible error is. Passing external files (runtime) to Map Reduce. It is common that map reduce application may depend on some external configurations submitted during job submission time. There are several approaches to achieve it.

  • Map Reduce configuration is a channel for job to communicate with mapper and reducer tasks. You can specify a key in mr tasks, and the task at setup time simply do a checking on the keys. Let’s say if “external.filename” has a filename pointing to it, the task knows it should check out some file. But how to let the file accessible? We can use console’s -files options which would add any local files to the task’s classpath. At runtime the file name can be specified through console option -D property=value.

sudo -u hdfs hadoop jar target/your.jar DriverClass -fs hdfs://cdh-manager.localdomain -jt cdh-manager.localdomain:8032 -files target/classes/ /user/flume/tweets/2015/01/09/11 /out-seq

  • Another method is related to doing some work on the application driver class, by reading the argument list. If a relevant argument is detected, call MapReduce Job API to “addFileToClassPath” or  “addCacheFile”. But it may require a hdfs path visible (i.e. the file has been uploaded to hdfs or a URI is available for hadoop to fetch the file). I am still investigating and the useful link would be latter part of this tutorial.

Quick Run Spark with Docker in Mac OS

For the time being, I am just listing a brief guide.

1. install boot2docker, which is a linux core required to run docker.

2. install spark docker by (just pull docker image). The docker image has a hadoop yarn inside as well.

3. in MacOs, start boot2docker first, note down and export the $DOCKER_HOST environment. In MacOS, any docker client command requires this environment to connect to boot2docker’s host.

4. after export the docker environment in MacOS, investigate docker containers by `docker ps` with -a or -l(astest), or `docker inspect CONTAINER_HASH

5. now in Mac OS, it is safe to run

docker run  -p 4040:4040 -p 8030:8030 -p 49707:49707 -p 50020:50020 -p 8042:8042 -p 50070:50070 -p 8033:8033 -p 8032:8032 -p 50075:50075 -p 22:22 -p 8031:8031 -p 8040:8040 -p 50010:50010 -p 50090:50090 -p 8088:8088 -i -t -h sand    box sequenceiq/spark:1.2.0 /etc/ -bash

I open the ports so that MacOS can access directly.

6. you can exam the ports opened in boot2docker by `sudo iptables -t nat -L -n` from boot2docker. Enter boot2docker by `boot2docker ssh`, and make hostname `sandbox` to be reflected in Mac OS by adding it in /etc/hosts

Despite the learning curve in docker, I have to say so far this way is the most convenient for deploying/testing a spark application and less computational power consuming.

Downside: It is not a cluster, but one node only.