A Few Words First
In this post, I walk through the installation and configuration of my development environment, including Apache Hadoop 1.2.1, Hive 0.13, and Presto 0.69, with support for automated deployment and configuration management. The whole walkthrough was inspired by this post and this post, with more slaves.
A Short Note on Vagrant
I chose Vagrant over Docker simply because Vagrant gives me the feel of working on machines rather than in an application environment. Vagrant is heavier than Docker, since it takes time to build a guest OS on top of the host OS, while Docker simulates environments by shipping dependencies into containers. Vagrant gives me the feeling of working directly on virtual machines, as I do every day.
Useful Vagrant commands include
- vagrant init: gives me an initial Vagrantfile to work on
- vagrant up --provision: builds the guest OS and runs provisioning such as Puppet; it takes a while the first time, and the Guest Additions may need to be rebuilt as well for the shared folders that Puppet needs later.
- vagrant reload --provision: once I change some configuration and want Vagrant to re-apply my settings to the guest OS, this saves me a lot of time by avoiding the first-run overhead.
- vagrant halt: lets me gracefully shut down my virtual machines before closing my laptop.
- vagrant destroy: before doing this, know that you are going to pay the first-run overhead again. But it cleans up all the messy guest OSes.
Get to Know Puppet Basics
I am new to Puppet, so I first wrote a test script to get a quick understanding of it. Even before that, this beginner tutorial was a must-read for me before writing the script.
Puppet Structure for Vagrant
In the Vagrant root folder (the one with the Vagrantfile), create the folders "modules" and "manifests". The manifests folder contains the entry point from which Puppet starts. The modules folder contains multiple modules, i.e. sets of files for installing a library (like Java or MySQL) on the guest OS. A module is included in the overall Puppet run by writing "include [a module name]" in the entry-point file, e.g. "test_vm.pp".
So here it is: 1) a Puppet script that creates a test file (manifests/test_vm.pp, no modules), and 2) the updated Vagrantfile, which defines a private network IP and uses the VM provider to customize the memory size.
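A minimal test manifest might look like the following sketch; the file path and content are my own assumptions for illustration, not the original script.

```puppet
# Hypothetical manifests/test_vm.pp: create a marker file to verify
# that Vagrant's Puppet provisioning works end to end.
file { '/tmp/hello_puppet.txt':
  ensure  => file,
  content => "provisioned by puppet\n",
  owner   => 'vagrant',
  mode    => '0644',
}
```

After vagrant up --provision, the file's presence inside the guest confirms the Puppet wiring.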
Note: the syntax for the private network and for changing the memory size is specific to the latest Vagrant version.
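A sketch of the updated Vagrantfile, assuming the latest (Vagrant 1.x) syntax; the box name, IP, and memory size are placeholders:

```ruby
# Hypothetical Vagrantfile: private network IP plus a VirtualBox
# provider block to customize memory, with Puppet provisioning.
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"

  # New-style private network syntax (latest Vagrant only).
  config.vm.network "private_network", ip: "10.17.3.10"

  # Customize guest memory via the VirtualBox provider.
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--memory", "1024"]
  end

  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "manifests"
    puppet.manifest_file  = "test_vm.pp"
  end
end
```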
Problems with even this simple test Puppet setup
- Guest Additions version mismatch issue: solution
- Warning: Could not retrieve fact fqdn: this means the hostname is not a fully qualified domain name. Either add "domain my.com" to /etc/resolv.conf, run hostname my.com for the current session, or specify vm.hostname in the Vagrantfile in full.
- Warning: Config file /etc/puppet/hiera.yaml not found, using Hiera defaults: solutions. Note: look for the file in the VM, not on the host machine, since this is a problem for the guest OS, not the host.
- Do a Puppet syntax check with puppet parser validate, or a style check with puppet-lint; this saves time compared to debugging by actually deploying the script.
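For example, both checks can be run on the host against the entry-point manifest before every vagrant reload (the file path follows this walkthrough's layout; puppet and puppet-lint must be installed on the host):

```shell
# Catch syntax errors without a full deploy cycle.
puppet parser validate manifests/test_vm.pp

# Catch style issues (puppet-lint is a separate gem: gem install puppet-lint).
puppet-lint manifests/test_vm.pp
```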
By now, Vagrant should be able to provision a nice VM with the specified test file, and we are ready to move on to more meaningful tasks, like installing Java.
Install Java with Puppet
This post is a great reference to me.
- A Puppet file resource with content => causing jitters when exporting PATH variables: in my case it was a whitespace problem; simply make sure tabs are converted to spaces before debugging further.
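A minimal sketch of such a Java module, assuming OpenJDK 7 on Ubuntu; the package name and JAVA_HOME path are assumptions for illustration:

```puppet
# Hypothetical modules/java/manifests/init.pp: install OpenJDK and
# export JAVA_HOME/PATH via a profile script.
class java {
  package { 'openjdk-7-jdk':
    ensure => installed,
  }

  # Use spaces, not tabs, inside the content => string; stray tabs
  # caused the PATH-export jitters described above.
  file { '/etc/profile.d/java.sh':
    ensure  => file,
    content => "export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64\nexport PATH=\$PATH:\$JAVA_HOME/bin\n",
    mode    => '0644',
    require => Package['openjdk-7-jdk'],
  }
}
```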
Install Hadoop with Puppet
test run and issues
- create and grant access to tmp.dir
- create masters and slaves.xml, using 0 secondary namenode, 2 additional slave nodes.
- hadoop-env.sh.erb with $JAVA_HOME
- Host key verification failed: add the key to known_hosts, or specify the Hadoop SSH options `UserKnownHostsFile=/dev/null` and `StrictHostKeyChecking=no`
- org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: slave1.gp.net. The short answer is that Hadoop performs reverse hostname lookups even if you specify IP addresses in your configuration files. In this environment, for Hadoop to work, slave1.gp.net must resolve to the IP address of that machine, and the reverse lookup for that IP address must resolve to slave1.gp.net.
- Do the DFS format on the master to enable the future hive/warehouse directory. Before doing that, I needed to use Puppet run stages; a note: the distinction between defining a class and declaring a class matters for stages to work.
- err: /Stage[final]/Hdfsrun/Exec[format hdfs]/returns: change from notrun to 0 failed: /usr/bin/env: bash: No such file or directory. This is caused by: 1) no PATH variable being set, or not using the full path (for the hadoop script, use the full path; setting PATH did not work for me. Why?); 2) the hadoop namenode -format script including an interactive yes/no prompt; use the -force option to bypass it. However, this should only happen the first time, otherwise all previous data is lost; 3) not specifying user => vagrant; running as root by default causes an error.
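Putting the last two bullets together, a hedged sketch of the stage-based format step; the Hadoop install path and the name-directory used for idempotence are assumptions:

```puppet
# Hypothetical sketch: run the one-time HDFS format in a late stage.
# Note: stages only take effect with a resource-like class declaration
# (class { ... stage => ... }), not with "include".
stage { 'final':
  require => Stage['main'],
}

class hdfsrun {
  exec { 'format hdfs':
    # Full path to the hadoop binary; -force skips the yes/no prompt.
    command => '/usr/local/hadoop/bin/hadoop namenode -format -force',
    # Run as vagrant, not root, to avoid permission errors.
    user    => 'vagrant',
    # Assumed name dir: skip the format if it already exists,
    # so re-provisioning does not wipe existing data.
    creates => '/tmp/hadoop-vagrant/dfs/name',
  }
}

class { 'hdfsrun':
  stage => 'final',
}
```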
Install Hive with Puppet
Install MySQL with Puppet modules
I struggled to figure this out: the Vagrantfile has a field to specify the module path on my host OS, which I had set to "modules", relative to my Vagrant starting point. In order to use the puppetlabs modules, which handle dependencies and configuration nicely, I needed to 1) install the Puppet modules on my host OS with `puppet module install puppetlabs-mysql`, which downloads the Puppet scripts under my home directory (of course --modulepath overrides this); 2) point the Vagrantfile's module path field to that location; 3) let each vagrant up copy all the necessary modules from the host to the guest tmp folder and essentially run puppet apply in the guest OS. Therefore, there is no need to run puppet module install in the guest OS at all.
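The provisioner block then needs both module locations; a sketch, assuming puppet module install placed the puppetlabs modules under ~/.puppet/modules on the host:

```ruby
# Hypothetical Vagrantfile fragment: search the project-local modules
# folder first, then the puppetlabs modules installed on the host.
# Vagrant copies both into the guest before running puppet apply.
config.vm.provision "puppet" do |puppet|
  puppet.manifests_path = "manifests"
  puppet.manifest_file  = "test_vm.pp"
  puppet.module_path    = ["modules", "#{ENV['HOME']}/.puppet/modules"]
end
```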
- Note: the same user can appear only once in the Puppet MySQL script; otherwise a duplicate declaration of the same user occurs.
- Problem: Hive throws an exception when an old version of MySQL is used as the Hive metastore.
Solution: Set Latin1 as the charset for metastore
mysql> alter database metastore_db character set latin1;
test run and issues
- add mysql to rc.d, or use ensure => running
- hive: get HDFS running first; type hive at the command line to try it. Hive has different metastore modes: embedded Derby, local, and remote. Presto works only with remote, so I set up a remote Hive metastore. To understand Hive: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html, but don't follow it 100% for hive-site.xml. Combined references: http://rajnishbigdata.wordpress.com/mysql-with-hive/ and http://www.thecloudavenue.com/2013/11/differentWaysOfConfiguringHiveMetastore.html. NOTE: in MySQL, create the hive user for localhost and again for remote hosts, and grant privileges locally or remotely as needed. To test the Hive server, DFS must be running; then start hive --service metastore. The metastore URI over Thrift can then be called by others such as HiveServer2, Impala, or Presto. Must I run hive --service hiveserver -p 10000 for Presto? (No: Presto does not require HiveServer, only the metastore server.)
- Download the MovieLens data and unzip it. To start Hive database management (not querying): 1) start-dfs, 2) mysql, 3) metastore on 9083, 4) hiveserver2 if accessing from a program, otherwise the CLI alone suffices. Then hive> create tables. For Hive queries to work, MapReduce must be started with start-mapred; then select count(*) works.
- To configure HDFS for Hive (creating the Hive folder and granting access), I prefer to write a shell script rather than a Puppet script, since it involves a lot of conditional checking with loops, such as waiting for the namenode to leave safe mode and checking the existence of the Hive folders. However, since Puppet uses a limited shell, I use command => bash -c to source .bashrc before running the scripts, because the limited shell does not support regexes, etc.
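Such a setup script might look like the following sketch; the hadoop command being on the PATH after sourcing .bashrc, and the warehouse paths, are assumptions:

```shell
#!/bin/bash
# Hypothetical HDFS-for-Hive setup script, invoked from Puppet via
# command => bash -c '...' so that .bashrc is sourced first.
source ~/.bashrc

# Wait until the namenode leaves safe mode before touching HDFS.
until hadoop dfsadmin -safemode get | grep -q OFF; do
  echo "waiting for namenode to leave safe mode..."
  sleep 5
done

# Create the Hive warehouse directories once, and make them group-writable.
if ! hadoop fs -test -d /user/hive/warehouse; then
  hadoop fs -mkdir /tmp /user/hive/warehouse
  hadoop fs -chmod g+w /tmp /user/hive/warehouse
fi
```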
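For the remote-metastore bullet above, the relevant hive-site.xml fragment might look like this sketch; the MySQL credentials, database name, and metastore host/port are assumptions matching this walkthrough's IPs:

```xml
<!-- Hypothetical hive-site.xml fragment for a remote MySQL metastore. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <!-- Clients (HiveServer2, Presto, Impala) reach the metastore here. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://10.17.3.10:9083</value>
  </property>
</configuration>
```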
Install Presto with Puppet
test run: a single Presto node
- Start presto-server via the launcher and start the CLI. show tables did not work at first: a single Presto node on v0.69 requires the scheduler to be enabled on the coordinator.
- Presto log: /var/presto/data; set the Hive log to /var/hive/hive.log
- set up the discovery service and the worker nodes
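The scheduler and discovery settings above live in etc/config.properties; a sketch for a single node that is both coordinator and worker (host, port, and memory value are assumptions):

```properties
# Hypothetical etc/config.properties for a single Presto 0.69 node.
coordinator=true
# "Scheduler on": let the coordinator schedule work on itself,
# which a single-node setup requires.
node-scheduler.include-coordinator=true
http-server.http.port=8080
task.max-memory=1GB
# Run the discovery service embedded in the coordinator;
# workers point their discovery.uri at this address.
discovery-server.enabled=true
discovery.uri=http://10.17.3.10:8080
```

For the multi-node run, workers use the same file with coordinator=false, discovery-server.enabled removed, and the same discovery.uri.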
test run: multiple Presto nodes
How to check that workers have been discovered? https://groups.google.com/forum/#!searchin/presto-users/discovery/presto-users/Q6gmcI1Uo1s/EfeILTq1y5YJ
Make your VMs accessible to you
- apt-get update; install vim; add my personal .vimrc; set the system time zone to SG
Writing a JDBC Client to Hive
- Connection refused: if the hive.server2 host is set to localhost, then a client program on the host OS cannot connect through 10.17.3.10:10000. Set the hive.server2 host to the FQDN, which matches /etc/hosts; then the host-OS client can connect.
- Cannot execute a select query: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. The Hive server stderr reads org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=WRITE, … The issue was caused by the JDBC client: DriverManager.getConnection("jdbc:hive2://10.17.3.10:10000/default", "", "") previously provided an anonymous user name, which according to DFS has no write access. The solution is simply to pass the name of a user who has access on HDFS. But the password is left empty, which is very insecure. Unsolved
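A sketch of such a JDBC client; the "vagrant" user and the table name are assumptions from this setup, and the empty password reflects the unsolved insecurity noted above:

```java
// Hypothetical Hive JDBC client against the HiveServer2 started earlier.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Pass a user with HDFS write access; an empty user name makes
        // Hive run as "anonymous" and fail with AccessControlException.
        // Password left empty -- insecure, as noted above.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://10.17.3.10:10000/default", "vagrant", "");
        Statement stmt = con.createStatement();
        // "ratings" is an assumed table from the MovieLens data load.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM ratings");
        while (rs.next()) {
            System.out.println(rs.getLong(1));
        }
        con.close();
    }
}
```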
Deploy to the Real World
NOTE!! apt-get install puppet-master only, not Puppet with Passenger; otherwise conflicts cause the cert verification to be revoked each time.
Set environmentpath = $confdir/environments in the Puppet master's puppet.conf.
Test the agent with # puppet agent --test --noop --debug.

Play using BoneCP to connect to HiveServer2 failed with "cannot connect database [datasource name]". Debugging into hive.log:

2014-06-16 06:20:28,153 INFO [pool-2-thread-1]: thrift.ThriftCLIService (ThriftCLIService.java:OpenSession(188)) - Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2014-06-16 06:20:28,167 INFO [pool-2-thread-1]: hive.metastore (HiveMetaStoreClient.java:open(297)) - Trying to connect to metastore with URI thrift://10.17.3.10:9083
2014-06-16 06:20:28,168 INFO [pool-2-thread-1]: hive.metastore (HiveMetaStoreClient.java:open(385)) - Connected to metastore.
2014-06-16 06:20:28,169 INFO [pool-2-thread-1]: session.SessionState (SessionState.java:start(360)) - No Tez session required at this point. hive.execution.engine=mr.
2014-06-16 06:20:28,211 INFO [pool-2-thread-1]: session.SessionState (SessionState.java:start(360)) - No Tez session required at this point. hive.execution.engine=mr.
2014-06-16 06:20:28,276 INFO [pool-2-thread-2]: thrift.ThriftCLIService (ThriftCLIService.java:OpenSession(188)) - Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2014-06-16 06:20:28,279 INFO [pool-2-thread-2]: hive.metastore (HiveMetaStoreClient.java:open(297)) - Trying to connect to metastore with URI thrift://10.17.3.10:9083

This repeats three times: Play's BoneCP cannot establish reliable connections with Hive. After debugging, I found the cause is on Hive's side: java.sql.SQLException: enabling autocommit is not supported.