Essential Information Theory

John Goldsmith

Fall 2000

 

1 A motivating example

 

Suppose we have been given a large set of data from a previously unanalyzed language, and four different analyses of the verbal system are being offered by four different linguists. Each has an account of the verbal morphology using rules that are (individually) of equal complexity. There are 100 verb stems. Verbs in each group use the same rules; verbs in different groups use entirely different rules.

 

Linguist 1 found that he had to divide the verbs into 10 groups with 10 verbs in each group.

Linguist 2 found that she had to divide the verbs into 10 groups, with 50 in the first group, 30 in the second group, 6 in the third group, and 2 in each of 7 small groups.

Linguist 3 found that he had just one group of verbs, with a set of rules that worked for all of them.

Linguist 4 found that she had to divide the verbs into 50 groups, each with 2 stems in it.

 

Rank these four analyses according to how good you think they are -- sight unseen.

 

Hopefully you ranked them this way:

Best:     Linguist 3
          Linguist 2
          Linguist 1
Worst:    Linguist 4

 

And why? Because the entropy of the sets that they created goes in that order. That's not a coincidence -- entropy measures our intuition of the degree of organization of information.

 

The entropy of a set is H = -\sum_i p_i \log_2 p_i, where we sum over the probabilities p_i of the subsets making up the whole -- and where the log is the base-2 log.
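
Here is a minimal sketch, in Perl (the language of the programs mentioned below), of this computation applied to the four analyses above. The group proportions come straight from the description of the 100 verb stems; the printed values are rounded.

#!/usr/bin/perl
use strict;
use warnings;

# Entropy (in bits) of a probability distribution: H = -sum p * log2(p).
sub entropy {
    my @p = @_;
    my $h = 0;
    foreach my $p (@p) {
        $h -= $p * log($p) / log(2) if $p > 0;
    }
    return $h;
}

# Group proportions for the four analyses of the 100 verb stems.
my %analysis = (
    'Linguist 1' => [ (0.10) x 10 ],                   # 10 groups of 10
    'Linguist 2' => [ 0.50, 0.30, 0.06, (0.02) x 7 ],  # 50, 30, 6, and seven groups of 2
    'Linguist 3' => [ 1.00 ],                          # one group of 100
    'Linguist 4' => [ (0.02) x 50 ],                   # 50 groups of 2
);

foreach my $name (sort keys %analysis) {
    printf "%s: %.2f bits\n", $name, entropy( @{ $analysis{$name} } );
}

This prints (approximately) 3.32, 2.05, 0.00, and 5.64 bits for Linguists 1 through 4: the lower the entropy, the better the analysis ranked above.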

 


2 Computing some entropies

 


We can write a little Perl program to compute the entropy of the letter distribution (or upgrade WordFreq9.prl to use the lower case l option). It's actually easier to do it more accurately by allowing any character, not just the 26 letters of the alphabet. If we do so, we find the following results from the Brown Corpus (but I've collapsed the upper case letters with the lower case letters): these are raw counts followed by (proportional) frequency:

 

$          581      0.000118208215735487

%         147      2.99081027764485e-005

&         165      3.35703194429524e-005

'           10717  0.00218044311194013

(           2411    0.00049053357683005

)           2437    0.000495823445348334

*          4828    0.000982287892548935

+          1          2.03456481472439e-007

,           58587  0.0119199048800258

-           11611  0.00236233320637649

.           55478  0.011287358679128

/           205      4.170857870185e-005

0          4334    0.000881780390701551

1          5129    0.00104352829347214

2          2611    0.000531224873124538

3          1725    0.000350962430539957

4          1457    0.000296436093505344

5          2112    0.000429700088869791

6          1443    0.000293587702764729

7          1058    0.00021525695739784

8          1254    0.000255134427766439

9          2120    0.000431327740721571

:           1912    0.000389008792575303

;           2769    0.000563370997197184

?          2342    0.000476495079608452

]           798      0.000162358272215006

a          380786            0.0774733797539642

b          72527              0.0147560882317516

c          146693            0.0298456416366365

d          187861            0.0382215380658939

e          592228            0.12049262510946

f           110407            0.0224630197499276

g          92357              0.01879063025935

h          256924            0.0522728530458249

i           345213            0.0702358223385451

j           9377                0.00190781142676706

k          31133              0.00633421063768144

l           195802            0.0398371859852665

m         120281            0.0244719490479864

n          336134            0.0683886409432568

o          359641            0.0731712924532294

p          95570              0.019444335934321

q          5048                0.00102704831847287

r           290398            0.0590833553066333

s           310114            0.0630947032953439

t           438188            0.0891521887034451

u          128563            0.0261569756275412

v          47092              0.0095811726255001

w         89001              0.0181078303075285

x          9381                0.00190862525269295

y          81607              0.0166034730835213

z           4498                0.000915147253663031

 

 Entropy: 4.35499988129783
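
Here is a sketch of the kind of program that produces counts like these. It is not WordFreq9.prl itself, just a plausible reconstruction: read a text, collapse upper and lower case, count every non-whitespace character, and compute the entropy of the resulting distribution.

#!/usr/bin/perl
use strict;
use warnings;

# Count characters in a text and compute the entropy of the distribution.
my %count;
my $total = 0;
while (my $line = <>) {
    foreach my $ch (split //, lc $line) {     # collapse case
        next if $ch =~ /\s/;                  # assumption: the table above excludes whitespace
        $count{$ch}++;
        $total++;
    }
}

my $entropy = 0;
foreach my $ch (sort keys %count) {
    my $p = $count{$ch} / $total;
    $entropy -= $p * log($p) / log(2);        # base-2 log
    printf "%s\t%d\t%s\n", $ch, $count{$ch}, $p;
}
printf " Entropy: %s\n", $entropy;

You would run it as, say, perl letter_entropy.prl brown.txt (the script and file names here are just examples).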

 

 

Now, what's the entropy of all the letters that follow 'a'? This means counting over the set of 380786 letters that follow those a's. Let's see. Here are the counts:

 

$          36        9.45412909088044e-005

'           198      0.000519977099998424

(           53        0.000139185789393518

)           82        0.000215344051514499

*          112      0.000294128460605169

,           834      0.00219020657272064

-           172      0.00045169727878651

.           895      0.00235040153787167

1          68        0.000178577993938853

2          50        0.000131307348484451

3          15        3.93922045453352e-005

4          17        4.46444984847132e-005

5          19        4.98967924240912e-005

6          21        5.51490863634692e-005

7          10        2.62614696968901e-005

9          4          1.0504587878756e-005

:           18        4.72706454544022e-005

;           38        9.97935848481824e-005

?          24        6.30275272725363e-005

]           21        5.51490863634692e-005

a          583      0.00153104368332869

b          9530    0.0250271806211363

c          19085  0.0501200149165148

d          17599  0.0462175605195569

e          373      0.000979552819694001

f           4410    0.0115813081363285

g          8759    0.0230024213075061

h          1451    0.00381053925301876

i           13498  0.0354477317968623

j           528      0.0013866055999958

k          4537    0.011914828801479

l           38993  0.102401348789084

m         12624  0.0331524793453541

n          74028  0.194408407872138

o          531      0.00139448404090487

p          8799    0.0231074671862936

q          221      0.000580378480301272

r           41158  0.10808695697846

s           38367  0.100757380786058

t           54907  0.144193851664715

u          4353    0.0114316177590563

v          8446    0.0221804373059934

w         3893    0.0102235901529993

x          805      0.00211404831059965

y          9979    0.0262063206105266

z           642      0.00168598635454035

 Entropy: 3.76230403829144 of letters following 'a'.
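
A sketch of the conditional version (again, a reconstruction rather than the original script): look at each occurrence of 'a', count the character that immediately follows it, and compute the entropy of that conditional distribution.

#!/usr/bin/perl
use strict;
use warnings;

# Entropy of the distribution of characters that follow 'a'.
# (Bigrams across line breaks are ignored in this sketch.)
my %count;
my $total = 0;
while (my $line = <>) {
    my @chars = split //, lc $line;
    for my $i (0 .. $#chars - 1) {
        next unless $chars[$i] eq 'a';
        my $next = $chars[ $i + 1 ];
        next if $next =~ /\s/;          # same assumption as before: skip whitespace
        $count{$next}++;
        $total++;
    }
}

my $entropy = 0;
foreach my $ch (sort keys %count) {
    my $p = $count{$ch} / $total;
    $entropy -= $p * log($p) / log(2);
}
printf " Entropy: %s of letters following 'a'.\n", $entropy;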

 

 

What do we find? The entropy has gone down! The entropy for single letters is 4.355, but the entropy of letters following 'a' is only 3.762. And why is that? Because what follows an 'a' is not just any old letter -- the distribution is different from the distribution of all letters in all contexts.

So if we wish to ask the question more generally -- is there a connection between a letter and the next letter, on the whole? -- we just compute the conditional entropy: the entropy of letters given the previous letter. We do this for each of the 26 letters (etc.), and weight each result by the frequency of the preceding letter.
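
In symbols (my notation, not the handout's), writing X for the preceding letter and Y for the following letter, the quantity being computed is the conditional entropy

H(Y | X) = \sum_x p(x) H(Y | X = x) = -\sum_{x,y} p(x,y) \log_2 p(y | x)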

 

The result we get (try WordFreq9.prl with the l option) is 3.755. That's way down from 4.355.

 


Now think about this: the difference between the entropy of letters here (which we calculated to be about 4.355) and the conditional entropy is the mutual information between successive letters in English. And that equals 4.355 - 3.755 = 0.6 bits.
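
Written out (again in my notation), with X the preceding letter and Y the following letter:

I(X; Y) = H(Y) - H(Y | X) \approx 4.355 - 3.755 = 0.6 bits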

 

Would mutual information of this sort be larger or smaller in a Polynesian language, where we find only V's after C's and only C's after V's? Explain.