I followed the official guide for the automated installation. While it seems easy at first glance, I had a rough time with it.
Before going on: CDH is considered a fairly heavy Hadoop distribution. I use CDH for development, and so far my setups are (1) VMware Workstation on a Windows host with 16G RAM, with CDH taking up to 8G for a single node, or (2) a Mac host with 8G RAM, with a Vagrant-managed VirtualBox VM taking up to 4G and 2 vcores for a single node (more than one node renders CDH non-functional).
Check whether you are behind a proxy, and if so enable the proxy as stated in the guide. In addition, once you reach the parcel-download phase, remember to configure the parcel download proxy setting through the web browser. You can confirm a parcel download issue by checking ‘
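For reference, the OS-level proxy setup on my nodes looked roughly like this (a sketch; `proxy.example.com:3128` is a placeholder for your own proxy host and port, and the apt.conf.d file name is arbitrary):

```shell
# Shell-level proxy for tools that honor the environment variables
export http_proxy=http://proxy.example.com:3128
export https_proxy=$http_proxy

# apt does not read those variables in all setups; give it its own config
echo 'Acquire::http::Proxy "http://proxy.example.com:3128";' | \
  sudo tee /etc/apt/apt.conf.d/95proxy
```

Note this only covers the OS side; the parcel proxy in the Cloudera Manager web UI is a separate setting.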
Don’t need to deal with java and just let it go with default oracle-j2sdk.
Disable IPv6 by following the blog guide.
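The blog guide I followed boils down to something like the following sysctl settings (a common approach on Ubuntu; assumes a stock sysctl-based setup):

```shell
# Disable IPv6 immediately on all interfaces
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1

# Persist the settings across reboots
printf '%s\n' \
  'net.ipv6.conf.all.disable_ipv6 = 1' \
  'net.ipv6.conf.default.disable_ipv6 = 1' \
  'net.ipv6.conf.lo.disable_ipv6 = 1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```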
Find the IP address of the machine node and create an FQDN as suggested in the comments. Cloudera requires even more:
The hosts in a Cloudera Manager deployment must satisfy the following networking and security requirements:
- Cluster hosts must have a working network name resolution system and correctly formatted /etc/hosts file. All cluster hosts must have properly configured forward and reverse host resolution through DNS. The /etc/hosts files must
- Contain consistent information about hostnames and IP addresses across all hosts
- Not contain uppercase hostnames
- Not contain duplicate IP addresses
A properly formatted /etc/hosts file should be similar to the following example:
127.0.0.1 localhost.localdomain localhost
192.168.1.1 cluster-01.example.com cluster-01
192.168.1.2 cluster-02.example.com cluster-02
192.168.1.3 cluster-03.example.com cluster-03
My take is: 1) avoid using 127.0.0.1 on the loopback interface; use the IPs assigned to the eth0 interface instead. 2) make the hostname the FQDN as well, by using `sudo hostname <FQDN>` and saving the name in `/etc/hostname` so it survives a reboot.
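Concretely, for the first node in the example hosts file above, that amounts to something like this (`cluster-01.example.com` is the placeholder FQDN; substitute your own):

```shell
# Set the hostname to the FQDN for the current session
sudo hostname cluster-01.example.com

# Persist it so the name survives a reboot
echo cluster-01.example.com | sudo tee /etc/hostname

# Verify: this should print the FQDN
hostname -f
```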
Allow hosts to be resolvable from the local /etc/hosts files, as follows:
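On Ubuntu this usually means making sure `files` appears (before `dns`) on the `hosts:` line of /etc/nsswitch.conf, so lookups consult /etc/hosts first. That is the stock setting, shown here only for completeness:

```shell
# The hosts line should list "files" ahead of "dns"
grep '^hosts:' /etc/nsswitch.conf
```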
Disable the firewall and iptables as in the guide.
CDH requires root access using a password or a private key. My take is that a password for the root user is easier. Do the following:
Make sure openssh-server is installed and started:
sudo apt-get install openssh-server
Give root a password and enable SSH root access (on newer Ubuntu releases you may also need `PermitRootLogin yes` in /etc/ssh/sshd_config):
sudo passwd root
sudo service ssh restart
Test SSH access as root before moving on.
With the above configuration, installing Cloudera Manager and the cluster should work.
During cluster installation, if you need to retry from the web browser, you may need to manually remove the lock first:
sudo rm /tmp/.scm_prepare_node.lock
If you encounter any problem, you can always uninstall and get back to a clean state by following the uninstallation guide.
Note: during cluster installation, if the web browser does not show a progress bar, something is wrong. Check the root access configuration listed above.
======= after a running CDH, configurations to be continued =======
CDH calculates the settings (like memory allocation) for each host, but sometimes the resulting configuration is not checked against the minimum requirements of the installed components.
For example, the test installation that estimates Pi does not work unless the following memory settings in YARN are increased (weirdly enough, the YARN logs do not point to anything useful):
– Set the Container Memory (yarn.nodemanager.resource.memory-mb) to 4GB
– Set the Java Heap Size of ResourceManager to 1GB
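With those two settings bumped, the Pi estimation job ran for me. For reference, it can be kicked off like this (the parcel path below is the usual CDH location, but it may differ on your install):

```shell
# Run as a user that has a home directory in HDFS
hadoop jar \
  /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  pi 10 100
```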
The most useful way is to check the non-default settings by switching to the new view.
======= running mahout example ======
When trying to execute `mahout seq2sparse -i reuters-out-seqdir/ -o reuters-out-seqdir-lda -ow --maxDFPercent 85 --namedVector` with `MAHOUT_LOCAL` set to "something not null" (meaning it runs locally), a Guava library version mismatch shows up:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
Mahout requires Guava 16.0, while Hadoop v2 uses Guava 11.0.
The solution is quite weird. I was simply going to turn on logging: reading CDH 5's mahout script, I saw it points the Mahout conf directory to /etc/mahout/conf.dist, so I put a simple log4j properties file in that directory. Surprisingly, the Guava problem was gone.
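The file I dropped in was nothing special, along these lines (a minimal console-logging config for log4j 1.x, saved as `log4j.properties` in /etc/mahout/conf.dist; the pattern layout is just my preference):

```properties
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
```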