Install Apache Hadoop Cluster on one physical machine with Puppet and Vagrant

A Few Words First

In this post, I walk through the installation and configuration of my development environment, which includes Apache Hadoop 1.2.1, Hive 0.13 and Presto 0.69, with the support of automated deployment and configuration management. The whole walkthrough was inspired by this post and this post, but with more slaves.

A Short Note on Vagrant

I chose Vagrant over Docker simply because Vagrant gives me the feel of working directly on machines rather than on an application environment, which is what I do every day. Vagrant is heavier than Docker: it takes time to build a full guest OS on top of the host OS, while Docker packages an application and its dependencies into containers that share the host kernel.

Useful Vagrant commands include:

  • vagrant init: gives me an initial Vagrantfile to work on.
  • vagrant up --provision: builds the guest OS and runs provisioners such as Puppet. It takes a while the first time, and the Guest Additions may need to be rebuilt as well to support the shared folders that Puppet needs later.
  • vagrant reload --provision: once I change some configuration and want Vagrant to re-apply my settings to the guest OS, this saves me a lot of time by avoiding the first-time overhead.
  • vagrant halt: gracefully shuts down my virtual machines before I close my laptop.
  • vagrant destroy: cleans up all the messy guest OSes, but be aware that you will pay the first-time overhead again afterwards.

Get to Know Puppet Basics

I am new to Puppet, so I first wrote a test script to get a quick understanding of it. Even before that, this beginner tutorial is a must-read before writing any script.

Puppet Structure for vagrant

At the Vagrant root folder (the one with the Vagrantfile), create the folders “modules” and “manifests”. The manifests folder contains the entry point where Puppet starts. The modules folder contains multiple modules, each a set of files for installing one piece of software on the guest OS (such as Java or MySQL). A module is included in the overall Puppet run with “include [a module name]” in the entry point file, such as “test_vm.pp”.
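
The layout described above can be sketched with a few shell commands. This is only a minimal sketch: the function name scaffold_puppet_layout and the module name java are hypothetical, chosen for illustration, and I would run it in a throwaway directory. The only facts it relies on are the two folder names, the include line, and Puppet's convention that a module's main class lives in modules/<name>/manifests/init.pp.

```shell
#!/usr/bin/env bash
# Sketch: create the folder layout that the Puppet provisioner is pointed at.
# The function name and the module name passed in are hypothetical.
scaffold_puppet_layout() {
  local root="$1"
  local module="$2"
  mkdir -p "$root/manifests" "$root/modules/$module/manifests"

  # A module's main class lives in modules/<name>/manifests/init.pp.
  cat > "$root/modules/$module/manifests/init.pp" <<EOF
class $module {
  notify { "$module module applied": }
}
EOF

  # The entry point pulls the module in with "include".
  cat > "$root/manifests/test_vm.pp" <<EOF
include $module
EOF
}
```

Calling scaffold_puppet_layout /tmp/vagrant-demo java would produce a manifests/test_vm.pp containing the single line include java, next to a placeholder java module.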

So here it is: 1) a Puppet script that creates a test file (manifests/test_vm.pp, no modules)


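# Define a class that creates a test file and then prints a message.
class first {
  file { 'testfile':
    ensure  => present,
    path    => '/tmp/testfile',
    mode    => '0640',
    content => "I'm a test file by guopeng",
  }
  # The notify resource is applied only after the file has been created.
  notify { "so that is it….":
    require => File['testfile'],
  }
}

# Declare the class; defining it alone does not apply it.
include first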

2) the updated Vagrantfile, which defines a private network IP and uses the VM provider settings to customize the memory size.


VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.define :test_vm do |test_vm|
    test_vm.vm.box = "ubuntu13.10_64"
    # Private network with a fixed IP for the guest.
    test_vm.vm.network :private_network, ip: "192.168.32.2"
    # VirtualBox setting: 1024 MB of memory (set on config, so it applies to every machine in this file).
    config.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--memory", "1024"]
    end
    # Provision with Puppet, using manifests/test_vm.pp as the entry point.
    test_vm.vm.provision :puppet do |puppet|
      puppet.module_path = "modules"
      puppet.manifests_path = "manifests"
      puppet.manifest_file = "test_vm.pp"
    end
  end
end

Note: the syntax for the private network and for changing the memory size is specific to recent Vagrant versions (Vagrantfile configuration version 2).

Problems even with this simple Puppet test

  • Guest Additions version mismatch: solution
  • Warning: Could not retrieve fact fqdn: this means the hostname is not a fully qualified domain name. Either add “domain my.com” to /etc/resolv.conf, run hostname my.com for the current session, or specify a full vm.hostname in the Vagrantfile.
  • Warning: Config file /etc/puppet/hiera.yaml not found, using Hiera defaults: solutions. Note: look for the file in the VM, not on the host machine, since this is a problem in the guest OS, not the host.
  • Check Puppet syntax with puppet parser validate, or check style with puppet-lint; this saves the time of actually deploying a script just to find its errors.
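
Two of the checks above are easy to script before running vagrant up. The following is a rough sketch, not a complete validator: is_fqdn and validate_manifests are hypothetical helper names, and the FQDN test only looks for a dot with non-empty text on both sides, which is enough to catch a bare hostname like test-vm.

```shell
#!/usr/bin/env bash
# Succeeds when the name contains a dot with non-empty labels around it,
# e.g. "slave1.gp.net"; a bare "test-vm" fails. A simplified check only.
is_fqdn() {
  case "$1" in
    .*|*.|*..*) return 1 ;;
    *.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run "puppet parser validate" on every .pp file under a directory.
# If puppet is not installed, the check is skipped rather than failed.
validate_manifests() {
  if ! command -v puppet >/dev/null 2>&1; then
    echo "puppet not found, skipping validation" >&2
    return 0
  fi
  find "$1" -name '*.pp' -exec puppet parser validate {} +
}
```

For example, is_fqdn "$(hostname)" || echo "hostname is not fully qualified" gives an early hint about the fqdn warning, and validate_manifests manifests catches syntax errors before provisioning.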

By now, Vagrant should be able to provide a nice VM with the specified test file, and we are ready to move on to more meaningful tasks, like installing Java.

Install Java with Puppet

This post is a great reference for me.

  • A Puppet file resource with content => garbled my export PATH lines: in my case it was a whitespace problem; simply make sure tabs are converted to spaces before debugging further.

Install Hadoop with Puppet

Test run and issues

  • Create tmp.dir and grant access to it.
  • Create masters and slaves.xml, using 0 secondary namenodes and 2 additional slave nodes.
  • Use hadoop-env.sh.erb with $JAVA_HOME.
  • Host key verification failed: add the hosts to known_hosts, or specify the Hadoop SSH options `UserKnownHostsFile=/dev/null` and `StrictHostKeyChecking=no`.
  • org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: slave1.gp.net. The short answer is that Hadoop performs reverse hostname lookups even if you specify IP addresses in your configuration files. In your environment, in order for you to make Hadoop work, SSP-SANDBOX-1.mysite.com must resolve to the IP address of that machine, and the reverse lookup for that IP address must resolve to SSP-SANDBOX-1.mysite.com.
  • Format the DFS on the master to enable the future hive/warehouse directory. Before doing that, I needed to use Puppet stages; note the difference between defining a class and using (declaring) a class for stages to work.
  • err: /Stage[final]/Hdfsrun/Exec[format hdfs]/returns: change from notrun to 0 failed: /usr/bin/env: bash: No such file or directory. It was caused by: 1. no path variable set, or a command not given with its full path (for the hadoop script I use the full path; setting path did not work for me. Why?); 2. hadoop namenode -format includes an interactive yes/no prompt; use the -force option to bypass it. However, this should be done only the first time, otherwise all previous data will be lost; 3. specify user => vagrant, otherwise the default of root will cause an error.
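
Since -force must only run the first time, one way to make the format step safe to repeat is to wrap it in a small guard script and let the Puppet exec call that script. This is a sketch under assumptions: the function names are hypothetical, the default /usr/local/hadoop/bin/hadoop path is a guess at an install location, and the name directory you pass in must match dfs.name.dir in your hdfs-site.xml. It relies on a formatted Hadoop 1.x name directory containing current/VERSION.

```shell
#!/usr/bin/env bash
# Sketch: format the namenode only if it has never been formatted.
# Function names and the default hadoop path are assumptions.

namenode_is_formatted() {
  # A formatted Hadoop 1.x name directory contains current/VERSION.
  [ -f "$1/current/VERSION" ]
}

format_namenode_once() {
  local name_dir="$1"
  local hadoop_bin="${2:-/usr/local/hadoop/bin/hadoop}"
  if namenode_is_formatted "$name_dir"; then
    echo "namenode already formatted, skipping"
    return 0
  fi
  # -force skips the interactive yes/no prompt; the full path to the
  # hadoop script avoids depending on PATH inside the Puppet exec.
  "$hadoop_bin" namenode -format -force
}
```

Run as the vagrant user, format_namenode_once /path/to/dfs/name formats on the first provision and becomes a no-op on every later vagrant reload --provision.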

Install Hive with Puppet

Install MySQL with Puppet modules

I struggled to figure this out: the Vagrantfile has a field that specifies the module path on my host OS. I had set it to “modules”, which is relative to the folder containing my Vagrantfile. In order to use the puppetlabs modules, which handle dependencies and configuration nicely, I need to 1) install the Puppet modules on my host OS with `puppet module install puppetlabs-mysql`, which downloads the Puppet scripts under my home directory (of course --modulepath overrides this); 2) point the module path field in my Vagrantfile to that location; 3) let vagrant up copy all the necessary modules from my host to a tmp folder in the guest each time, and basically run puppet apply in the guest OS. Therefore, there is no need to run puppet module install in the guest OS at all.
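
Step 1 can be sketched as a small host-side helper. The function name install_host_modules and the target directory are hypothetical; the only real commands used are puppet module install and its --modulepath option mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: install Puppet Forge modules on the host into one directory
# that the Vagrantfile's module path can point at.
# The function name and the directory you pass in are assumptions.
install_host_modules() {
  local module_dir="$1"
  shift
  mkdir -p "$module_dir"
  local module
  for module in "$@"; do
    # --modulepath overrides the default location under the home directory.
    puppet module install "$module" --modulepath "$module_dir" || return 1
  done
}
```

After something like install_host_modules ~/puppet-modules puppetlabs-mysql, the module path field in the Vagrantfile would point at that same directory.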

  • Note: the same user can be declared only once in the Puppet MySQL script; otherwise a duplicate declaration error for that user occurs.
  • Problem: Hive throws an exception when an old version of MySQL is used as the Hive metastore.
    Solution: set latin1 as the charset for the metastore database:
    mysql> alter database metastore_db character set latin1;

Test run and issues

  • Add MySQL to rc.d, or use ensure => running.
  • Hive: get HDFS running first, then type hive on the command line to try it. Hive has different metastore modes: embedded Derby, local and remote. Presto only works with remote, so I created a remote Hive server. To understand Hive, see http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html, but don’t follow it 100% for hive-site.xml. Combined references: http://rajnishbigdata.wordpress.com/mysql-with-hive/ ; http://www.thecloudavenue.com/2013/11/differentWaysOfConfiguringHiveMetastore.html. NOTE: in MySQL, create user hive for localhost and create user hive for remote hosts, and grant privileges to the local or remote user as needed. To test the Hive server, DFS must be running; then start hive --service metastore. After that, the metastore URI with thrift can be called by others like hiveserver2, Impala or Presto. Must I run hive --service hiveserver -p 10000 for Presto? No: Presto does not require hiveserver, only the metastore server.
  • Download the MovieLens data and unzip it. To start Hive for database management (not queries): 1. start-dfs, 2. MySQL, 3. the metastore on port 9083, 4. hiveserver2 if connecting from a program, otherwise only the CLI. Then create tables at the hive> prompt. For Hive queries to work, MapReduce must be started with start-mapred; then select count(*) works.
  • To configure HDFS for Hive (creating the Hive folders and granting access), I prefer a shell script over a Puppet script, since it needs a lot of condition checking in loops, such as waiting for the name node to leave safe mode and checking whether the Hive folders exist. However, since Puppet runs commands in a limited shell that does not support regex etc., I use command => bash -c to source .bashrc before running the bash scripts.
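
The kind of shell script I mean in the last point might look like the sketch below. It is not my exact script: the function names, the retry count and the sleep interval are assumptions, and /tmp and /user/hive/warehouse are simply Hive's conventional default directories. It expects the hadoop command to be on the PATH, which is why .bashrc has to be sourced first when it is run from Puppet.

```shell
#!/usr/bin/env bash
# Sketch: prepare HDFS for Hive once the namenode has left safe mode.
# Function names, retry count and delay are assumptions.

wait_for_safemode_off() {
  local tries="${1:-30}"
  local delay="${2:-5}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # Hadoop 1.x prints "Safe mode is OFF" once the namenode is ready.
    if hadoop dfsadmin -safemode get 2>/dev/null | grep -q 'Safe mode is OFF'; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "namenode still in safe mode after $tries checks" >&2
  return 1
}

ensure_hdfs_dir() {
  local dir="$1"
  # Create the directory only if it is missing, then allow group writes.
  hadoop fs -test -d "$dir" || hadoop fs -mkdir "$dir"
  hadoop fs -chmod g+w "$dir"
}

prepare_hive_dirs() {
  wait_for_safemode_off "$@" || return 1
  ensure_hdfs_dir /tmp
  ensure_hdfs_dir /user/hive/warehouse
}
```

Calling prepare_hive_dirs at the end of the script (optionally with a retry count and delay, e.g. prepare_hive_dirs 60 2) keeps the step safe to repeat on every provision, because existing folders are left alone.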

Install Presto with Puppet

Test run: a single Presto node

  • Start presto-server with the launcher and start the CLI. At first, show tables did not work: in v0.69, a single Presto node requires the scheduler to be set on.
  • Presto logs go under /var/presto/data; set the Hive log to /var/hive/hive.log.
  • Set up the discovery service and the worker nodes.

Test run: multiple Presto nodes

How to check that the workers have been discovered? https://groups.google.com/forum/#!searchin/presto-users/discovery/presto-users/Q6gmcI1Uo1s/EfeILTq1y5YJ

Make your VMs accessible to you

  • Run apt-get update, install vim, add my personal .vimrc, and set the system time to SG.

Writing a JDBC Client to Hive

  • Connection refused: if the hive.server2 host is set to localhost, a client program on the host OS cannot connect through 10.17.3.10:10000. Set the hive.server2 host to the FQDN that matches /etc/hosts; then the host OS client can connect.
  • Cannot execute select query: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. The Hive server stderr reads org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=WRITE, … The issue was caused by the JDBC client call DriverManager.getConnection("jdbc:hive2://10.17.3.10:10000/default", "", ""), which provides an anonymous user name, and according to DFS that user has no write access. The solution is simply to pass the name of a user who has access on HDFS. But the password is still left empty, which is very insecure (unsolved).

Deploy to the Real World

NOTE: apt-get install puppet-master only, not puppet passenger; otherwise the conflict causes the certificate to be revoked at every verification.

 

bamboo-dep:

0. vi /etc/hosts
10.110.254.128  citymodel-001.hpls.local citymodel-001
1. install ruby
sudo apt-get update
curl -L https://get.rvm.io | bash -s stable
source ~/.rvm/scripts/rvm
rvm install ruby
rvm rubygems current
2. uninstall puppet
sudo apt-get --purge remove puppet
sudo apt-get autoremove
rm -rf /var/lib/puppet
rm -rf /etc/puppet
3. install
sudo apt-get install puppet
sudo puppet resource service puppet ensure=running enable=true
4. vi /etc/puppet/puppet.conf
[main]
server=citymodel-001.hpls.local
environment=try
5. on master puppet.conf
Set environmentpath = $confdir/environments in the puppet master’s puppet.conf
dns_alt_names=citymodel-001,citymodel-001.hpls.local
run:
sudo puppet master --verbose --no-daemonize
----------------
sudo puppet agent -t
6. master sign
sudo puppet cert sign "bamboo-dep"
7. on agent
sudo puppet agent -t
Error: Could not request certificate: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed: [certificate revoked for /CN=citymodel-001.hpls.local]
make sure agent is running.
ps aux| grep puppet
8.
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class first for bamboo-dep on node bamboo-dep
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
# puppet agent --test --noop --debug



Play using BoneCP failed to connect to HiveServer2: "cannot connect database [datasource name]"
Debugging with hive.log:


2014-06-16 06:20:28,153 INFO [pool-2-thread-1]: thrift.ThriftCLIService (ThriftCLIService.java:OpenSession(188)) - Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2014-06-16 06:20:28,167 INFO [pool-2-thread-1]: hive.metastore (HiveMetaStoreClient.java:open(297)) - Trying to connect to metastore with URI thrift://10.17.3.10:9083
2014-06-16 06:20:28,168 INFO [pool-2-thread-1]: hive.metastore (HiveMetaStoreClient.java:open(385)) - Connected to metastore.
2014-06-16 06:20:28,169 INFO [pool-2-thread-1]: session.SessionState (SessionState.java:start(360)) - No Tez session required at this point. hive.execution.engine=mr.
2014-06-16 06:20:28,211 INFO [pool-2-thread-1]: session.SessionState (SessionState.java:start(360)) - No Tez session required at this point. hive.execution.engine=mr.
2014-06-16 06:20:28,276 INFO [pool-2-thread-2]: thrift.ThriftCLIService (ThriftCLIService.java:OpenSession(188)) - Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2014-06-16 06:20:28,279 INFO [pool-2-thread-2]: hive.metastore (HiveMetaStoreClient.java:open(297)) - Trying to connect to metastore with URI thrift://10.17.3.10:9083

This sequence repeated three times.


Play's BoneCP cannot establish reliable connections with Hive. After debugging, I found that the cause is on the Hive side: java.sql.SQLException: enabling autocommit is not supported
