Pages

Saturday, March 14, 2015

Can all books be found somewhere within the number Pi?

Probably yes. But you would have to look for a very very long time to find even a short one.



What does the question even mean? Well, the decimal representation of pi is infinite. This infinite sequence of digits can be trivially (and in many different ways) converted to a sequence of characters. For example, the first 100 digits after the decimal point:
1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679
can be converted to the string of 50 characters:
oponjlbgmuarmgbycgttrnpxkgutxshxhdmgckivicdwzivrgb 
The question then is: is the sequence of characters of any given book a subsequence of PI's sequence of characters? A version of this question was asked at Math StackExchange.  The answer is "probably yes" because PI is believed to be a normal number and therefore all possible finite sequences of characters appear equally often in it.

It is however not very likely to find a longer given sequence of characters "soon". The probability of finding a sequence of 11 characters in the first 100 million digits of pi is only 0.09995% and the probability decreases very quickly with the length of the search sequence.

Looking for Poems in PI

A Dream Within a Dream


Let's put it into perspective and let's take a short poem instead of a book. Let's say A Dream Within a Dream by Edgar Allan Poe. Let's assume that our PI character sequence contains only letters A-Z. The length of the poem (including the title and the author's name) with all spaces, new lines, and punctuation characters removed is 519. The probability of finding this sequence within the first 2 billion PI characters is: 1 - 1/(e^(2*10^9 * 0.1^519)) = 1.9 * 10^-510; in other words, it is astronomically small. Why 2 billion characters? Because the longest readily available PI sequence is 4 billion digits, and we use each two consecutive digits to encode one character in the following.

To give ourselves a fighting chance, let's simplify the task by looking only for all the words that appear in the poem individually. Let's also stem the words first (using the standard English analyzer from Lucene) to increase our chances even further:
thu,through,been,torment,allan,none,kiss,upon,hope,while,dai,hold,brow,tighter,me,golden,few,pitiless,from,let,roar,grain,you,all,surf,night,creep,them,less,my,gone,take,vision,sand,dream,therefor,much,who,deep,poe,part,deem,save,we,weep,wave,seem,flown,clasp,how,can,see,grasp,now,have,stand,hand,on,finger,within,edgar,shore,yet,i,amid,wrong,o,avow,awai,ha,god
The probability of finding a sequence of length 1-5 is 100% in a 100m sequence so let's remove at least one and two letter words which are uninteresting anyway. We get the following 64 words (sorted):
all,allan,amid,avow,awai,been,brow,can,clasp,creep,dai,deem,deep,dream,edgar,few,finger,flown,from,god,golden,gone,grain,grasp,hand,have,hold,hope,how,kiss,less,let,much,night,none,now,part,pitiless,poe,roar,sand,save,see,seem,shore,stand,surf,take,them,therefor,through,thu,tighter,torment,upon,vision,wave,weep,while,who,within,wrong,yet,you
Only 59 out of these 64 words are present within the first 4 billion digits (2 billion characters) of PI. It's pretty easy to guess which ones are missing: pitiless,therefor,through,tighter,torment.

Even when looking only for individual stemmed words we are out of luck and have to humbly conclude that we found some evidence for only ~92% of Edgar Allan Poe's poem...

Let's at least check how often the words appear:

            dai 67355 ############################################################ 
            can 67293 ############################################################ 
            poe 67250 ############################################################ 
            let 67169 ############################################################ 
            all 66890 ############################################################ 
            see 66882 ############################################################ 
            thu 66795 ############################################################ 
            god 66777 ############################################################ 
            yet 50571 ############################################## 
            how 50567 ############################################## 
            you 50358 ############################################# 
            few 50317 ############################################# 
            who 50065 ############################################# 
            now 49969 ############################################# 
           upon  2766 ### 
           hope  2751 ### 
           deep  2734 ### 
           deem  2730 ### 
           been  2723 ### 
           take  2718 ### 
           sand  2718 ### 
           much  2709 ### 
           surf  2695 ### 
           have  2689 ### 
           hold  2675 ### 
           gone  2664 ### 
           roar  2659 ### 
           kiss  2657 ### 
           seem  2654 ### 
           save  2653 ### 
           part  2641 ### 
           amid  2640 ### 
           less  2634 ### 
           none  2631 ### 
           hand  2620 ### 
           from  2592 ### 
           them  2579 ### 
           brow  2049 ## 
           wave  2044 ## 
           avow  2028 ## 
           awai  1948 ## 
           weep  1946 ## 
          clasp   139 # 
          night   122 # 
          grasp   114 # 
          edgar   112 # 
          allan   109 # 
          shore   108 # 
          grain   108 # 
          stand   104 # 
          dream   101 # 
          creep   101 # 
          wrong    83 # 
          while    77 # 
          flown    68 # 
         within     5 # 
         vision     4 # 
         golden     3 # 
         finger     1 # 

We can see that words of similar length really tend to appear with similar frequency and words of length already 6 characters (12 digits) are unlikely to appear.

The four longest words appear in the following positions:
finger   306 078 594
golden  601 607 569,   616 761 705,   1 215 536 913
vision   2 104 964 085,   2 606 042 920,   2 706 090 319,   3 107 599 098
within  623 377 779,   3 011 529 776,   3 308 455 606,   3 721 207 493,   4 001 065 796

And The Moon And The Stars And The World


Let's try again this time with an even shorter poem: And The Moon And The Stars And The World by Charles Bukowski. The stemmed sorted words of the poem are as follows:
beer,bukowski,charl,fight,good,housew,husband,long,madden,moon,night,off,peek,soul,star,tire,try,walk,watch,what,window,world
 Unfortunately, even this time we are out of luck. Only 20 out of the 22 words (~91%) appear within the first 2 billion characters. The two missing are the longest ones again: bukowski and husband.

The word frequencies are unsurprising again:

            off 68019 ############################################################ 
            try 50213 ############################################# 
           long  2782 ### 
           tire  2713 ### 
           moon  2711 ### 
           beer  2711 ### 
           star  2706 ### 
           soul  2695 ### 
           peek  2673 ### 
           good  2636 ### 
           walk  2056 ## 
           what  2049 ## 
          night   122 # 
          fight   110 # 
          charl    95 # 
          watch    85 # 
          world    78 # 
         madden     5 # 
         housew     5 # 
         window     1 #

The word "window" appears in position 807 090 731. The first appearance of "madden" is in position 103 680 410, the first appearance of "housew" is in position 1 009 425 988.

Daffodils


When we are at it already, let's try with two longer poems. First, Daffodils by William Wordsworth.

We look for the following 78 stemmed individual words:
all,along,bai,beneath,besid,bliss,breez,brought,cloud,compani,continu,couch,could,crowd,daffodil,danc,did,end,fill,flash,float,flutter,gai,gaze,glanc,glee,golden,had,head,heart,high,hill,host,inward,jocund,lake,leav,lie,line,littl,lone,margin,milki,mood,never,oft,onc,out,pensiv,pleasur,poet,saw,shine,show,solitud,sparkl,sprightli,star,stretch,ten,them,thought,thousand,toss,tree,twinkl,upon,vacant,vale,wai,wander,wave,wealth,what,when,which,william,wordsworth
Only 66 (84%) of the words appear in the first 2 billions of characters. The following 12 are missing:
beneath,brought,compani,continu,daffodil,flutter,pleasur,sprightli,stretch,thought,thousand,wordsworth
The word frequencies follow the same pattern as before.

Auguries of Innocence


Let's try one last time with Auguries of Innocence by William Blake (361 distinct stemmed words at least 3 characters long):
adorn,affright,afric,air,all,appear,armd,armour,art,artist,auguri,babe,bad,bag,band,bar,bark,bat,beam,beat,becom,bee,befor,beggar,believ,bellow,belovd,beneath,blake,bleat,blood,boi,born,bow,bower,brace,brain,breast,breed,bright,bui,build,butcher,butterfli,caesar,cage,call,came,can,care,cat,catterpil,caught,ceas,chafer,cherubim,child,clipd,close,cloth,cock,cricket,crown,cry,curs,dai,danc,dead,death,deer,deform,delight,displai,divin,doe,dog,doth,doubt,dove,draweth,dwell,each,eagl,emmet,endless,england,enmiti,envi,etern,ever,everi,faith,farmer,farth,fat,fate,feed,feel,femal,fibr,fight,filld,fine,fit,flit,flower,flutter,fly,foot,forgiv,form,fright,from,fruit,gambler,game,gate,gem,get,gnat,god,gold,good,grain,grave,grief,grow,gun,hand,hare,harlot,hears,heaven,hell,here,high,hold,honei,hors,hour,hous,how,howl,human,hunt,hurt,immedi,inch,infant,infin,innoc,intent,invent,iron,jealousi,joi,judgment,keep,kill,knife,know,knowledg,known,labrer,lamb,lame,land,last,laurel,leaf,led,left,licencd,lie,light,like,lion,littl,loser,lovd,made,mai,make,man,master,men,mile,miser,miseri,misusd,mite,mock,mockd,moon,more,morn,moth,mother,movd,nation,neer,never,newt,nigh,night,nor,nought,old,out,outcri,over,owl,own,palm,palsi,pass,passion,peac,perish,philosophi,pigeon,pine,pleas,plow,poison,polar,poor,predict,princ,protect,public,put,question,race,rag,rage,rais,realm,reason,red,region,repeat,repli,respect,returnd,reveng,riddl,right,rightli,rise,road,roar,robe,robin,rod,rot,ruin,run,safe,sand,season,see,sell,shall,sheet,shore,should,shout,shudder,silken,sing,sit,skylark,slander,slept,sly,smile,snake,soldier,some,song,soul,speak,spider,sprite,starvd,state,street,strife,strike,strongest,summer,sun,swadl,sweat,sweet,sword,teach,tear,than,that,thee,them,theyd,those,thou,thr,thro,throughout,thy,toadstool,toi,told,tongu,tool,torment,train,triumph,truth,twine,two,unbeliev,under,understand,upon,wandr,wanton,war,wave,weav,weep,were,what,when,which,who,whole,whore,widow,wild,william,wilt,wind,wing,winner,woe,wolf,woman,wont,word,world,worth,wound,woven,wrath,wren,write,wrung,yet,you,your
327 of these words (90%) appear within the first two billion characters. That's still roughly the same proportion as for the other poems because the lengths of words we look for are, not surprisingly, similar in all cases. However, this poem in its original form is much less likely to appear in in pi "soon" simply due to being much longer.

All English Words


What proportion of all the English words can be found? Out of the 348 600 words of the Moby word list which are at least 4 characters long (not stemmed), only 55 390 words (~16%) can be found within the first two billions of characters of PI.

The longest found words seem a bit strange to me:
astonying, curtseyed, disshadow, eluviates, homoeotel, lambiness, lectotype, mendicant, octasemic, postnasal, romeldale, seaplanes, sheetlike, smokejack, suability
but I wouldn't draw any conclusions from it:)

The 20 most frequent words of length at least four are:

           bdft  2880 ############################################################ 
           jura  2871 ############################################################ 
           gats  2864 ############################################################ 
           nook  2863 ############################################################ 
           raob  2856 ############################################################ 
           soum  2852 ############################################################ 
           sill  2851 ############################################################ 
           peri  2851 ############################################################ 
           glar  2850 ############################################################ 
           ooid  2848 ############################################################ 
           cuir  2848 ############################################################ 
           kmet  2846 ############################################################ 
           pate  2844 ############################################################ 
           roop  2840 ############################################################ 
           perv  2835 ############################################################ 
           brev  2834 ############################################################ 
           spat  2833 ############################################################ 
           fend  2833 ############################################################ 
           jove  2830 ########################################################### 
           gite  2830 ########################################################### 

Well, this word list is not exactly perfect.

All Wiktionary Words


What about all the Wiktionary words? The latest enwiktionary dump contains 4 270 557 page titles. That resulted into 3 029 041 distinct words of length at least 4 after I removed titles with non-alphabetic characters, transliterated accented characters to unaccented ones and removed the words for which transliteration failed. Looking for all occurrences of more than 3 millions words in a sequence of 2 billions of characters already sounds like a challenge;)

Out of the 3m words, 304 739 can be found within the first two billions of characters of PI. The top 20 most frequent words are:

           fegt  2894 ############################################################ 
           suif  2879 ############################################################ 
           pepe  2877 ############################################################ 
           jura  2871 ############################################################ 
           dusj  2871 ############################################################ 
           gobe  2870 ############################################################ 
           itsu  2869 ############################################################ 
           quen  2867 ############################################################ 
           duin  2866 ############################################################ 
           smol  2864 ############################################################ 
           gats  2864 ############################################################ 
           avul  2864 ############################################################ 
           nook  2863 ############################################################ 
           vkus  2862 ############################################################ 
           muno  2861 ############################################################ 
           geno  2861 ############################################################ 
           jodu  2860 ############################################################ 
           ruro  2855 ############################################################ 
           hapi  2855 ############################################################ 
           soum  2852 ############################################################

I recognize only one of them: "vkus" which is Czech for "taste" as in having a good taste in something.

The longest found words are:
mendicante, protrahite, surruantur
They first appear in the following positions:
mendicante: 2 500 805 327
protrahite:    3 419 212 585
surruantur:         8 110 868
That makes "surruantur" (third-person plural present passive subjunctive of surruĊ: undermine, overthrow, demolish) the longest word appearing the earliest in PI.

Technical Details


Maybe you are wondering how long it takes to find all occurrences of 3 million words in a sequence of 2 billions characters. It took roughly 17 minutes on my i5 @ 2.5GHz with an SSD using an unoptimized implementation of a search algorithm I wrote myself.

We need to find many words at the same time in a long string of characters. That's a perfect use case for the Aho-Corasick algorithm. Why implement it myself? To play a bit with new Java 8 features and also because it's one of the first algorithms I learned many years ago at Matfyz and I was wondering what is it like to implement it.

The trie-like graph for the 349k words of the Moby word list had 977 889 nodes, 877 178 outputs and it took roughly 8 seconds to build.

The graph for the 3m Wiktionary words had 5 713 390 nodes, 10 909 184 outputs, and it took roughly 70 to 80 seconds to construct.

I processed the sequence in chunks of 100m digits. I didn't try to match words on boundaries so in fact the above numbers are likely slightly incorrect (they are pretty accurate lower bounds).

A Fulltext Search Detour


The search problem can be formulated as looking for a wildcard pattern "*word*" in the sequence for each of the search words. Searching with a prefix "word*" is not a problem. Searching only for the suffix "*word" would also be ok because I could just reverse the word and the sequence and turn it into a prefix search again. That's an easy task for a standard fulltext search engine such as Solr or Elasticsearch.

I knew that double star search "*word*" is a problem for fulltext search but I wanted to get a better idea how bad it is. So I indexed ~1700m characters as ~39m overlapping chunks in Elasticsearch (i.e. only a 850m character sequence is really indexed). The resulting index has 5.4GB (1700m digits takes up ~816MB space) and a single double star search takes cca 9.5s on my computer. Looking up all the 3m Wiktionary words would take 333 days. So yes, it is indeed bad. However, it's good to have an idea about the magnitude of the problem. Of course, this says nothing about the qualities of Elasticsearch, Solr, or other inverted-index-based full-text solutions as they were not designed to solve this particular problem.

Monday, December 23, 2013

My first experience with Ansible


I needed a nice, simple and repeatable way of installing complex systems such as Hadoop and Oozie. I noticed that recently people mention Ansible so I gave it a try.

Ansible looks very attractive at the beginning - it lets you easily run commands on your systems, e.g.:

ansible clusterx -a "ifconfig"

runs ifconfig on each machine of clusterx and lists the results one after another on stdout in a nice green. I was hoping this simplicity would somehow magically remain even with more complex tasks. It is not so easy though. There are many gotchas with Ansible, especially if no module exists for what you need. Unfortunately, the documentation is also lacking in this regard.

For example, you cannot specify multiple actions per task. This may make sense if each task is a complex operation executed using an existing Ansible module but it gets in the way if you just need to run a couple of commands via SSH yourself.

Development of new playbooks is made difficult by the fact that you have no feedback on long running tasks. Again, this probably makes sense, once everything is working and you're running your playbooks on hundreds of machines but it makes development unpleasant. The main reason is that a long running task becomes not easily distinguishable form a task that is stuck for example because it requires user input (which should never happen but sometimes it does when you develop things).

Because for my goal, installing Hadoop and Oozie, there are no satisfactory ready made modules or playbooks, I had to rely a lot on using modules command, shell, and script. Again, they are relatively easy to use for simple things. But you will hit a problem for example as soon as you would like to split your script into modules (for example to source a common part). The script module lets you run a script but will not help you with running a script that sources another script - I guess you will have to scp them to the target machine yourself first before running them. 

Another problem I came across is not directly related to Ansible but I want to mention it in case someone runs into it too. At first, I didn't realize that environment is not set the same way when you run an SSH command and when you start an interactive shell. That caused me some trouble for a minute because for example JAVA_HOME was not set correctly. This is easily corrected but it's good to remind yourself of it once in a while. More people have this problem.

Back to Ansible. You can set environment variables for an Ansible script task by creating a YAML dictionary variable. Unfortunately I found no easy and nice way of combining and augmenting existing environments. For example I would like to have a separate environment directory_layout grouping directory variables and networking_params for variables around networking. Or just using the directory_layout environment and adding one more variable to it. This cannot be done as far as I know. I came up only with workarounds such as either creating a new environment and populating it using the original environment (thus avoiding repeating literal values) or using YAML ids and references. Both ways work but are not very elegant.

One particularly irritating thing is that Ansible mixes its own YAML syntax with a Python-based template language. This is very important but it is not stressed enough in the Ansible documentation. It means that you can, and often unfortunately have to, run pieces of Python code on the template variables. For example:

- name: Fetch SSH key
 fetch: src=/home/admin/.ssh/id_rsa.pub dest=fetched/{{ansible_hostname}}
- name: Admin user has passwordless localhost SSH
 authorized_key: user=admin key="{{ lookup('file', 'fetched/'+ansible_hostname+'/id_rsa.pub') }}"

The quotes around {{ }} are necessary because of the YAML parser. Ansible is quite helpful in this regard suggesting fixes so I can't complain too much but here and there I experienced more friction with YAML than necessary (it's been a while since I used it the last time). I can see why YAML is convenient for the developers but I'm not so sure about its advantages for users as myself (although I found for example the YAML references feature useful).

Another example of using Python is related to the not so well documented registering of variables. You can run a remote script and register its output as a variable. Well, only to later find out that its not the output of the script in the variable but a whole JSON structure describing execution of the script. It has stdout as one of its fields. So you can use it as follows:

- name: Check is Java installed
script: checkJava.sh
register: javaCheck

- name: Install Java
  apt: pkg='$item' state=present
  with_items:
      - oracle-java7-installer
      - oracle-java7-set-default
  when: "'not_installed' in javaCheck.stdout"

Now, the "'not_installed' in javaCheck.stdout" is actually Pyton code and it is something not easy to figure out by reading documentation. Especially, how to use the "when:" construct in general. It is a bit complicated by the fact that earlier versions of Ansible used some, now deprecated, variations of it such as when_string and there are a lot of references to it on the web which makes it only more confusing. 

Even though this post may seem very critical, I am still hopeful about Ansible. First, it is still a young project and second, maybe I need more getting used to it. I definitely need to find out more about writing Ansible modules - I anticipate that they would solve some of my problems.

P.S. I'll probably update this post soon, I just wanted to get it out now, maybe it will save someone else a bit of time.

Wednesday, December 4, 2013

How to install Oozie

The Oozie Quick start guide is hard to follow to the point that I would almost say that it is intentional. The installation itself is not difficult once you figure out what you need to do. The following will hopefully help you if you are stuck.

# download and extract Oozie 3.3.2 (4.0.0 doesn't build for me)
export BUILD_DIR=/home/jakub/oozie
wget http://www.whoishostingthis.com/mirrors/apache/oozie/3.3.2/oozie-3.3.2.tar.gz
tar xzvf oozie-3.3.2.tar.gz
cd oozie-3.3.2
bin/mkdistro.sh -DskipTests
cp ${BUILD_DIR}/oozie-3.3.2/distro/target/oozie-3.3.2-distro.tar.gz /opt/
cd /opt
tar xzvf oozie-3.3.2-distro.tar.gz
cd oozie-3.3.2

mkdir libext
cp ${BUILD_DIR}/oozie-3.3.2/hadooplibs/hadooplib-1.1.1.oozie-3.3.2/* libext/

# if the above doesn't work for you for any reason you can try:
# cp /opt/hadoop/*.jar libext/
# cp /opt/hadoop/lib/*.jar libext/
# Beware of log4j version clashes. 
# Oozie requires log4j >= 1.2.16 (because of a bug in 1.2.15)

cd libext
wget http://extjs.com/deploy/ext-2.2.zip
bin/oozie-setup.sh prepare-war
Edit /opt/hadoop/conf/core-site.xml and modify it e.g. like:
  
     
        hadoop.proxyuser.jakub.hosts
        localhost
     
     
         hadoop.proxyuser.jakub.groups
         jakub
     
At this point you should install and configure MySQL (skipped).
# Initialize a MySQL database
bin/ooziedb.sh create -sqlfile oozie.sql -run

# Put required libraries to HDFS
bin/oozie-setup.sh sharelib create -fs hdfs://localhost:9000 ${BUILD_DIR}/oozie-3.3.2/sharelib/target/oozie-sharelib-3.3.2.tar.gz

# Install client
cp client/target/oozie-client-3.3.2-client.tar.gz /opt
cd  /opt 
tar xzvf oozie-client-3.3.2-client.tar.gz
export PATH=$PATH:/opt/oozie-client-3.3.2/bin
export OOZIE_URL="http://localhost:11000/oozie"

# Run Oozie
bin/oozied.sh run
bin/oozie admin -oozie http://localhost:11000/oozie -status

# Prepare and run examples
cd ${BUILD_DIR}/oozie-3.3.2/examples/target
tar xzvf oozie-examples-3.3.2-examples.tar.gz
hdfs -put examples examples
cd ${BUILD_DIR}/oozie-3.3.2/examples/target

# Edit examples/apps/map-reduce/job.properties to set the correct job tracker and namenode URIs

oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run

Thursday, July 5, 2012

Java virtual method invocation and access modifiers

Java virtual method invocation presents unexpected intricacies as shown by a recent post of Jan Vrany in a Czech Java forum. Consider the following two classes:

package x.a;

public class A {
 String method1() {
  return "A.method1()";
 }

 public static String callMethod1(A obj) {
  return obj.method1();
 }
}

package x.b;

import x.a.A;

public class B extends A {
 public String method1() {
  return "B.method1()";
 }
}
The question then is what does the following code print?
System.out.println(
  A.callMethod1(new B())
);

The answer follows below.








The answer is quite surprisingly "A.method1()".  Notice that A.method1 has no access modifier while B.method1 is public and B is in a different package than A. See also the Visibility table in Controlling Access to Members of a Class (docs.oracle.com).

If A.method1 were public or at least protected then the output would be "B.method1()" as one would expect from a normal virtual method invocation.

This problem seems to be closely related to the following stackoverflow question:
Why does Java's invokevirtual need to resolve the called method's compile-time class?

The answer to the question (that it is just a performance optimization) seems to be wrong and the counterexample is above.

The relevant parts of the JVM spec are 5.4.3.3 Method Resolution and 7.7 Invoking Methods.

The question remains what is the reason for this design or whether it is accidental or for some reason difficult to avoid.