Walk Through of Mahout LDA 0.9 against Tweets on CDH 5.2+

For the time being, only key points are listed.

  • Best tutorial ever: follow Tom's book and develop MapReduce programs in a regression fashion: unit test -> local file -> HDFS file -> multiple HDFS files.
  • Package dependencies with the Maven Shade plugin.
  • Flume generates too many small files: 1170+ files on my single CDH machine, so the MapReduce job takes over an hour to process about 23k tweets. It produces one 2.2 MB sequence file with the tweet ID as key and the tweet text as value.
  • Create vectors using mahout seq2sparse (stopwords can be filtered out by maxDFPercent, and using TF, ?? unknown).
  • Create row IDs with mahout rowid for LDA to work (converts the vectors into a matrix, ?? unknown).
  • Run mahout cvb. Beware: if you don't want to hit HDFS, remember to set MAHOUT_LOCAL to a non-empty value so that mahout runs only in local mode.
  • Apply seqdumper to read intermediate sequence files, and use vectordump to display the LDA output (the topic-term dump lists the terms for each topic, whereas the document-topic dump matches documents to topics).

YARN related:

  • unset HADOOP_CLASSPATH (make sure YARN sees exactly the classpath you intend; your local Hadoop classpath will not transfer to YARN)
  • sudo -u hdfs hadoop jar thePackagedJar.jar DriverClass
  • -fs hdfs://cdh-manager.localdomain (file:/// means the local file system, relative to the current directory or an absolute path)
  • -jt cdh-manager.localdomain:8032 (local means using the local runner; this URL points to YARN's resource manager)
  • /user/flume/tweets/2014/12/22/{13,14,15} (input args[0] takes a list of directories as a set; for the full list of file path patterns supported by Hadoop, search "hadoop fs path pattern")
  • /out-seq (output directory relative to the hdfs:/// root)

CDH may not be able to run the above mahout application out of the box, because the mahout script chains several MapReduce jobs into a pipeline. A common failure mode is that a job gets stuck while waiting for resources. My take is to browse the YARN configuration and reset all memory- and vcore-related settings to their defaults (without worrying about the physical limitations).

  • mahout seqdumper -i <input file; local or HDFS depending on whether MAHOUT_LOCAL is set>
  • -o <a local file>
  • sudo -u hdfs mahout vectordump
  • -i /tweet-lda
  • -o /tmp/topic_terms
  • -p true (Print out the key as well, delimited by tab (or the value if useKey is true))
  • -d /seqdir-sparse-lda/dictionary.file-* (the dictionary used for interpreting term)
  • -dt sequencefile (The dictionary file type (text|seqfile) is sequencefile)
  • -sort /tweet-lda (Sort output key/value pairs of the vector entries in abs magnitude descending order)
  • -vs 10 (each topic to display the top 10 terms associated with this topic)
  • -ni 1000000 (max number of items, in this case the max number of topics)

Mahout driver classes use a lot of shorthand names. Check what they map to in driver.classes.default.props, located under conf/.
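For example, that file maps each shorthand to its driver class. A few entries relevant to this walkthrough, recalled from the Mahout 0.9 distribution (exact descriptions may differ slightly by version):

```
# conf/driver.classes.default.props (excerpt)
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles = seq2sparse : Sparse Vector generation from Text sequence files
org.apache.mahout.utils.vectors.RowIdJob = rowid : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
org.apache.mahout.clustering.lda.cvb.CVB0Driver = cvb : LDA via Collapsed Variation Bayes (0th deriv. approx)
org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
```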

  1. sudo -u hdfs mahout seq2sparse -i /demo-out-seq -o /demo-seq2sparse -ow --maxDFPercent 85 --namedVector -wt TF
  2. sudo -u hdfs mahout rowid -i /demo-seq2sparse/tf-vectors -o /demo-rowid-matrix
  3. sudo -u hdfs mahout cvb -i /demo-rowid-matrix/matrix -o /demo-tweet-cvb -k 10 -ow -x 5 -dict /demo-seq2sparse/dictionary.file-0 -dt /demo-cvb-topics -mt /demo-cvb-model
  4. sudo -u hdfs mahout vectordump -i /demo-tweet-cvb -o /tmp/demo-topic_terms -p true -d /demo-seq2sparse/dictionary.file-* -dt sequencefile -sort /demo-tweet-cvb -vs 10 -ni 1000000

When writing a MapReduce application, follow these steps.

  1. Unit test your Mapper and Reducer classes. First, write unit tests on the individual class methods. Second, use MRUnit to write driver tests for the mapper and reducer; the driver's configuration can be set to exercise various situations.
  2. Write your application driver test. Of course you need the driver class first, created by extending Configured and implementing Tool. In the test, invoke the driver's run method to actually run the application. This is closer to an integration test, since it hits a file system (either local or HDFS). The purpose here is to run a smaller, real data set to see whether the application works.
  3. The last stage is really cluster deployment testing. Use a relatively larger dataset and actually submit your driver application to the cluster.
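Step 1 is easiest when the per-record logic is factored into a pure function, so the first round of tests needs no Hadoop classes at all; MRUnit's MapDriver/ReduceDriver then cover the Mapper/Reducer wiring on top. A minimal sketch of that idea (the class and tokenization rules are hypothetical, not from the original application):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of the per-record logic a tweet Mapper could delegate to.
// Because it is a pure static function, step 1 can test it without any cluster.
public class TweetTokenizer {

    // Lower-case the tweet and keep only alphabetic tokens of length >= 2.
    public static List<String> tokenize(String tweet) {
        return Arrays.stream(tweet.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                .filter(t -> t.length() >= 2)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mahout LDA on CDH 5.2, #tweets!"));
        // prints: [mahout, lda, on, cdh, tweets]
    }
}
```

The Mapper.map method then reduces to calling tokenize and emitting the tokens, which is the thin layer MRUnit verifies in the second round.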

Following the above steps gives you confidence when debugging; messing up the sequence leaves you with no idea where a possible error is.

Passing external files to MapReduce at runtime. It is common for a MapReduce application to depend on some external configuration supplied at job submission time. There are several approaches to achieve this.

  • The MapReduce Configuration is a channel for the job to communicate with mapper and reducer tasks. You can agree on a key, and the task simply checks that key at setup time. Say the property "external.filename" holds a filename; the task then knows it should load that file. But how is the file made accessible? Use the console's -files option, which ships local files to each task (adding them to the task's classpath). The filename itself can be specified at runtime through the console option -D property=value.

sudo -u hdfs hadoop jar target/your.jar DriverClass -fs hdfs://cdh-manager.localdomain -jt cdh-manager.localdomain:8032 -files target/classes/stopwords.data -Dcustom.stopwords.filename=stopwords.data /user/flume/tweets/2015/01/09/11 /out-seq

  • Another approach is to do some work in the application driver class by reading the argument list: if a relevant argument is detected, call the MapReduce Job API's addFileToClassPath or addCacheFile. This requires an HDFS-visible path (i.e., the file has already been uploaded to HDFS, or a URI is available from which Hadoop can fetch the file). I am still investigating; a useful link would be the latter part of this tutorial.
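Either way, the task ends up reading the file at setup time. With -files, Hadoop localizes stopwords.data into each task's working directory, so a Mapper.setup can open it by the name passed in -Dcustom.stopwords.filename and hand it to a plain helper. A hedged sketch of that helper (the class is hypothetical; taking a Reader keeps it testable without any file system):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the -files approach above. A Mapper.setup() could call:
//   Set<String> stop = StopwordLoader.load(
//       new java.io.FileReader(conf.get("custom.stopwords.filename")));
// because -files places the file in the task's working directory.
public class StopwordLoader {

    // Read one stopword per line, trimmed and lower-cased; blank lines are skipped.
    public static Set<String> load(Reader source) throws IOException {
        Set<String> stopwords = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase();
                if (!word.isEmpty()) {
                    stopwords.add(word);
                }
            }
        }
        return stopwords;
    }

    public static void main(String[] args) throws IOException {
        Set<String> stop = load(new java.io.StringReader("the\nA\n\nand\n"));
        System.out.println(stop.contains("a") && stop.contains("and")); // prints: true
    }
}
```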
