Monday, December 23, 2013

My first experience with Ansible


I needed a nice, simple and repeatable way of installing complex systems such as Hadoop and Oozie. Ansible has been getting a lot of mentions recently, so I gave it a try.

Ansible looks very attractive at the beginning - it lets you easily run commands on your systems, e.g.:

ansible clusterx -a "ifconfig"

runs ifconfig on each machine of clusterx and lists the results one after another on stdout in a nice green. I was hoping this simplicity would somehow magically remain even with more complex tasks. It is not so easy though. There are many gotchas with Ansible, especially if no module exists for what you need. Unfortunately, the documentation is also lacking in this regard.

For example, you cannot specify multiple actions per task. This may make sense if each task is a complex operation executed using an existing Ansible module but it gets in the way if you just need to run a couple of commands via SSH yourself.

Developing new playbooks is made difficult by the lack of feedback on long-running tasks. Again, this probably makes sense once everything works and you are running your playbooks on hundreds of machines, but it makes development unpleasant. The main reason is that a long-running task is hard to distinguish from a task that is stuck, for example because it is waiting for user input (which should never happen, but sometimes does while you are developing).

Because there are no satisfactory ready-made modules or playbooks for my goal, installing Hadoop and Oozie, I had to rely heavily on the command, shell, and script modules. Again, they are relatively easy to use for simple things. But you will hit a problem as soon as you want to split your script into several files (for example to source a common part). The script module lets you run a script but will not help you run a script that sources another script. I guess you have to scp them to the target machine yourself before running them.

Another problem I came across is not directly related to Ansible, but I want to mention it in case someone else runs into it. At first, I didn't realize that the environment is not set up the same way when you run a command over SSH as when you start an interactive shell. That caused me some trouble for a minute because, for example, JAVA_HOME was not set correctly. This is easily corrected but it's good to remind yourself of it once in a while. More people have this problem.
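
The difference can be reproduced locally without SSH. The following Python sketch is only a model of the situation, not of SSH itself: it assumes bash is installed, uses a throwaway HOME directory with a hypothetical .profile that exports JAVA_HOME=/opt/java, and compares a plain `bash -c` (roughly what a remote SSH command gets) with a login shell started by `bash -lc`.

```python
import os
import subprocess
import tempfile


def java_home_seen_by(bash_flags, home):
    """Run bash with the given flags and report the JAVA_HOME it sees."""
    # Minimal environment: JAVA_HOME is deliberately not passed in.
    env = {"HOME": home, "PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    result = subprocess.run(
        ["bash", *bash_flags, 'echo "${JAVA_HOME:-unset}"'],
        env=env, capture_output=True, text=True, check=True,
    )
    # Use the last line in case a system-wide profile prints something first.
    return result.stdout.strip().splitlines()[-1]


def compare_shells():
    with tempfile.TemporaryDirectory() as home:
        # Hypothetical profile; the path /opt/java is made up for illustration.
        with open(os.path.join(home, ".profile"), "w") as profile:
            profile.write("export JAVA_HOME=/opt/java\n")
        plain = java_home_seen_by(["-c"], home)   # non-login shell
        login = java_home_seen_by(["-lc"], home)  # login shell reads ~/.profile
    return plain, login


if __name__ == "__main__":
    print(compare_shells())
```

If a remote command depends on variables from the profile, one option is to run it through a login shell explicitly or to set the variables in the command itself.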

Back to Ansible. You can set environment variables for an Ansible script task by creating a YAML dictionary variable. Unfortunately I found no easy and nice way of combining and augmenting existing environments. For example I would like to have a separate environment directory_layout grouping directory variables and networking_params for variables around networking. Or just using the directory_layout environment and adding one more variable to it. This cannot be done as far as I know. I came up only with workarounds such as either creating a new environment and populating it using the original environment (thus avoiding repeating literal values) or using YAML ids and references. Both ways work but are not very elegant.

One particularly irritating thing is that Ansible mixes its own YAML syntax with a Python-based template language (Jinja2). This is very important but it is not stressed enough in the Ansible documentation. It means that you can, and unfortunately often have to, run pieces of Python code on the template variables. For example:

- name: Fetch SSH key
  fetch: src=/home/admin/.ssh/id_rsa.pub dest=fetched/{{ansible_hostname}}
- name: Admin user has passwordless localhost SSH
  authorized_key: user=admin key="{{ lookup('file', 'fetched/'+ansible_hostname+'/id_rsa.pub') }}"

The quotes around {{ }} are necessary because of the YAML parser. Ansible is quite helpful in this regard, suggesting fixes, so I can't complain too much, but here and there I experienced more friction with YAML than necessary (it's been a while since I last used it). I can see why YAML is convenient for the developers but I'm not so sure about its advantages for users like myself (although I found, for example, the YAML references feature useful).

Another example of using Python is related to the poorly documented registering of variables. You can run a remote script and register its output as a variable, only to find out later that the variable holds not the output of the script but a whole JSON structure describing the execution of the script. It has stdout as one of its fields. So you can use it as follows:

- name: Check if Java is installed
  script: checkJava.sh
  register: javaCheck

- name: Install Java
  apt: pkg={{ item }} state=present
  with_items:
    - oracle-java7-installer
    - oracle-java7-set-default
  when: "'not_installed' in javaCheck.stdout"

Now, "'not_installed' in javaCheck.stdout" is actually Python code, and that is not easy to figure out by reading the documentation, nor is how to use the "when:" construct in general. Matters are complicated by the fact that earlier versions of Ansible used now-deprecated variations of it such as when_string, and there are a lot of references to them on the web, which only adds to the confusion.
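
The condition reads like an ordinary Python membership test on a dictionary field. A minimal sketch in plain Python, assuming a registered result shaped like a dict with a "stdout" key; the concrete values below are made up for illustration and are not real Ansible output:

```python
# Hypothetical registered result; only the overall shape (a dict that has
# a "stdout" field next to other execution details) matters here.
java_check = {
    "changed": True,
    "rc": 0,
    "stdout": "not_installed\n",
    "stderr": "",
}


def should_install_java(result):
    """Mirror of the playbook condition: 'not_installed' in javaCheck.stdout"""
    return "not_installed" in result["stdout"]


if __name__ == "__main__":
    print(should_install_java(java_check))
    print(should_install_java({"stdout": "installed\n"}))
```

The point is that the test is applied to the stdout field, not to the registered variable as a whole.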

Even though this post may seem very critical, I am still hopeful about Ansible. First, it is still a young project and second, maybe I need more getting used to it. I definitely need to find out more about writing Ansible modules - I anticipate that they would solve some of my problems.

P.S. I'll probably update this post soon, I just wanted to get it out now, maybe it will save someone else a bit of time.

Wednesday, December 4, 2013

How to install Oozie

The Oozie Quick start guide is hard to follow to the point that I would almost say that it is intentional. The installation itself is not difficult once you figure out what you need to do. The following will hopefully help you if you are stuck.

# download and extract Oozie 3.3.2 (4.0.0 doesn't build for me)
export BUILD_DIR=/home/jakub/oozie
mkdir -p ${BUILD_DIR}
cd ${BUILD_DIR}
wget http://www.whoishostingthis.com/mirrors/apache/oozie/3.3.2/oozie-3.3.2.tar.gz
tar xzvf oozie-3.3.2.tar.gz
cd oozie-3.3.2
bin/mkdistro.sh -DskipTests
cp ${BUILD_DIR}/oozie-3.3.2/distro/target/oozie-3.3.2-distro.tar.gz /opt/
cd /opt
tar xzvf oozie-3.3.2-distro.tar.gz
cd oozie-3.3.2

mkdir libext
cp ${BUILD_DIR}/oozie-3.3.2/hadooplibs/hadooplib-1.1.1.oozie-3.3.2/* libext/

# if the above doesn't work for you for any reason you can try:
# cp /opt/hadoop/*.jar libext/
# cp /opt/hadoop/lib/*.jar libext/
# Beware of log4j version clashes. 
# Oozie requires log4j >= 1.2.16 (because of a bug in 1.2.15)

cd libext
wget http://extjs.com/deploy/ext-2.2.zip
# go back to /opt/oozie-3.3.2 so that bin/ resolves
cd ..
bin/oozie-setup.sh prepare-war
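
Edit /opt/hadoop/conf/core-site.xml and add the proxy-user properties, e.g.:

  <property>
    <name>hadoop.proxyuser.jakub.hosts</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hadoop.proxyuser.jakub.groups</name>
    <value>jakub</value>
  </property>

At this point you should install and configure MySQL (skipped).
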
# Initialize a MySQL database
bin/ooziedb.sh create -sqlfile oozie.sql -run

# Put required libraries to HDFS
bin/oozie-setup.sh sharelib create -fs hdfs://localhost:9000 ${BUILD_DIR}/oozie-3.3.2/sharelib/target/oozie-sharelib-3.3.2.tar.gz

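# Install client (the tarball is produced in the build directory)
cp ${BUILD_DIR}/oozie-3.3.2/client/target/oozie-client-3.3.2-client.tar.gz /opt
cd /opt
tar xzvf oozie-client-3.3.2-client.tar.gz
export PATH=$PATH:/opt/oozie-client-3.3.2/bin
export OOZIE_URL="http://localhost:11000/oozie"

# Run Oozie ("run" stays in the foreground; check the status from a second terminal)
cd /opt/oozie-3.3.2
bin/oozied.sh run
bin/oozie admin -oozie http://localhost:11000/oozie -status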

# Prepare and run examples
cd ${BUILD_DIR}/oozie-3.3.2/examples/target
tar xzvf oozie-examples-3.3.2-examples.tar.gz
hadoop fs -put examples examples

# Edit examples/apps/map-reduce/job.properties to set the correct job tracker and namenode URIs

oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run

Thursday, July 5, 2012

Java virtual method invocation and access modifiers

Java virtual method invocation presents unexpected intricacies as shown by a recent post of Jan Vrany in a Czech Java forum. Consider the following two classes:

package x.a;

public class A {
 String method1() {
  return "A.method1()";
 }

 public static String callMethod1(A obj) {
  return obj.method1();
 }
}

package x.b;

import x.a.A;

public class B extends A {
 public String method1() {
  return "B.method1()";
 }
}

The question then is: what does the following code print?

System.out.println(
  A.callMethod1(new B())
);

The answer follows below.








The answer is quite surprisingly "A.method1()".  Notice that A.method1 has no access modifier while B.method1 is public and B is in a different package than A. See also the Visibility table in Controlling Access to Members of a Class (docs.oracle.com).

If A.method1 were public or at least protected then the output would be "B.method1()" as one would expect from a normal virtual method invocation.

This problem seems to be closely related to the following stackoverflow question:
Why does Java's invokevirtual need to resolve the called method's compile-time class?

The answer to the question (that it is just a performance optimization) seems to be wrong and the counterexample is above.

The relevant parts of the JVM spec are 5.4.3.3 Method Resolution and 7.7 Invoking Methods.

What remains open is the reason for this design: whether it is intentional, accidental, or for some reason difficult to avoid.

Thursday, June 9, 2011

ThinkPad T420 thinkfan settings

Update Dec 16 2011: After upgrading the BIOS to 1.34 in October, the issue went away - the fan was spinning at about 3500 RPM which produces quite a bearable noise and an OK temperature. Unfortunately, the issue returned yesterday after installing the newest Ubuntu updates. So I'm back to using ThinkFan for now. I'm using new settings that seem to be closer to Lenovo's intention (based on observation during the time I used only the BIOS 1.34 and no ThinkFan):


(0, 0, 35)
(1, 33, 38)
(2, 36, 45)
(3, 39, 49)
(4, 46, 58)
(5, 50, 62)
(7, 56, 32767)

Also note that the sensors in thinkfan.conf now should be (Ubuntu 11.10):

sensor /sys/devices/platform/coretemp.0/temp1_input (0)
sensor /sys/devices/platform/coretemp.0/temp2_input (0)
sensor /sys/devices/platform/coretemp.0/temp3_input (0)
sensor /sys/devices/virtual/hwmon/hwmon0/temp1_input (0)

(thanks for the comments)

Perhaps I should add that I'm using an external LCD which is attached to the laptop most of the time. Someone on Lenovo Forums points out that the expected temperature and thus fan speed is higher in such a case.



The new ThinkPad T420 series unfortunately suffers from loud fan problems. With the default settings it spins almost all the time and produces an uncomfortable high-pitched hair-dryer-like noise.

The Ubuntu forums provide a how-to for setting up the thinkfan utility that helps to manage a laptop's fans.  However, the default thinkfan temperature settings are too low for a T420, and even an alternative adjusted to a T420 that I found is a bit too low because of the loud fan problem. Given that the operating temperature of SandyBridge CPUs is up to 100C (85C is considered high) and given that SSDs work up to ~70C (at least the OCZ Vertex 3 Max IOPS), I decided to use the following settings:


(0, 0, 52)
(1, 46, 59)
(2, 54, 65)
(3, 58, 69)
(4, 62, 72)
(5, 65, 74)
(7, 68, 32767)


The file is /etc/thinkfan.conf. The syntax is (Level, Low, High). The rationale is that the highest temperature will always be at the CPU (I don't have a discrete graphics card), so the SSD (and an HDD in the UltraBay) will be cooler, and it is OK if the fan kicks in at a CPU temperature slightly above the highest operating temperature of the SSD. Unfortunately, the OCZ Vertex apparently has no temperature sensor (hddtemp /dev/sda doesn't report a reasonable temperature), and I did not find a way to tell thinkfan to read the temperature of the HDD from its SMART values.
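
To get a feel for how the overlapping Low/High ranges behave, here is a small Python sketch. It is a simplified model, not thinkfan's actual implementation: it assumes the fan steps up when the temperature reaches the current row's High and steps down when it falls below the current row's Low, and it ignores the bias correction entirely. The function names are made up for illustration.

```python
# (level, low, high) rows copied from the settings above.
LEVELS = [
    (0, 0, 52),
    (1, 46, 59),
    (2, 54, 65),
    (3, 58, 69),
    (4, 62, 72),
    (5, 65, 74),
    (7, 68, 32767),
]


def next_index(index, temp):
    """Return the row index to use after reading temperature `temp` (deg C)."""
    # Step up while the temperature is at or above the current row's High.
    while index < len(LEVELS) - 1 and temp >= LEVELS[index][2]:
        index += 1
    # Step down while the temperature is below the current row's Low.
    while index > 0 and temp < LEVELS[index][1]:
        index -= 1
    return index


def simulate(temps, index=0):
    """Return the fan level chosen after each temperature reading."""
    levels = []
    for temp in temps:
        index = next_index(index, temp)
        levels.append(LEVELS[index][0])
    return levels


if __name__ == "__main__":
    print(simulate([40, 53, 50, 45, 70, 60]))
```

In this model the reading of 50 keeps the fan at level 1 even though 50 is below the 52 that triggered the step up; that overlap between rows is what prevents the fan from flapping between two levels.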

Another problem is that thinkfan makes the fan jump to too high a level if the temperature rises suddenly, which seems to be the usual (and normal) case with the T420. Here the thinkfan "bias" value is to blame. It is a multiplier in a formula that thinkfan uses to correct for the delay with which it reads temperatures and to account for the fact that the temperature is likely still rising after the read. The default bias of 5 causes sudden jumps, even to the fan's maximum RPM level, although nothing is really going on in the system. Setting it to 2 works better for me in normal use. You can set the bias in /etc/default/thinkfan:

# Additional startup parameters
DAEMON_ARGS="-q -b 2"

Use these settings at your own risk; I'm no expert in these matters.

(updated with a bit more conservative temperature settings)