TAL Informatique Section

ILPGA, Université Paris 3

Parcours TAL: step 7

Textual statistics

Processing Prem96 with Lexico


Processing Prem96-Bébé with Lexico

  1. Build a corpus suited to Lexico, partitioned by premature infant.
  2. Apply the processing available in Lexico to this corpus: segmentation and specificities.

Processing Prem96-Infirmière with Lexico

  1. Build a corpus suited to Lexico, partitioned by nurse.
  2. Apply the processing available in Lexico to this corpus: segmentation and specificities.

Perl programs to test

  • WordFreq1.pl: Reads a file and prints each line to the screen. (Don't forget, when a Perl script's output rolls past blindingly fast on your screen, that you can pause and un-pause the display by hitting Ctrl-S.)
  • WordFreq2.pl: Same as WordFreq1, but also prints to a file.
  • WordFreq3.pl: Same as WordFreq2, but takes the input and output file names from the command line.
  • WordFreq4.pl: Tests whether there are enough command-line arguments.
  • WordFreq5.pl: Breaks lines up into words and displays the first word.
  • WordFreq6.pl: Puts all the words into a hash and counts their frequencies (a sketch of this technique appears right after this list).
  • WordFreq7.pl: Adds command-line options (flags), permitting sorting by frequency.
  • WordFreq8.pl: More command-line options, such as ignoring punctuation, applying a frequency threshold, and filtering stop words.
  • WordFreq9.pl: Lots more options -- look at the code.
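
For reference, here is a minimal sketch of the core technique the series builds up to: counting words in a hash, then printing a frequency-sorted list. It is written for this page and is not one of the WordFreq*.pl scripts; the tokenization (runs of word characters, case-folded) is a simplifying assumption.

    #!/usr/bin/perl
    # Minimal word-frequency counter -- a sketch, not the course's WordFreq6.pl.
    # Usage: perl wordfreq_sketch.pl corpus.txt
    use strict;
    use warnings;

    my %freq;
    while (my $line = <>) {                      # read the file(s) named on the command line
        for my $word (map { lc } $line =~ /\w+/g) {  # crude tokenization, case-folded
            $freq{$word}++;
        }
    }

    # Frequency-sorted list, alphabetical among ties.
    for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
        print "$freq{$word}\t$word\n";
    }

An alphabetically sorted list with frequencies (exercise 1 below) only changes the sort: sort keys %freq.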

Exercises and Projects (reading C. Manning's book is essential)

  1. Pick a corpus (in English or any language using Latin characters). Find the frequencies of the words, and produce a frequency-sorted list and an alphabetically sorted list with frequencies.
  2. Pick a corpus, and produce a new file which contains one sentence per line. (Here is a Perl script to do this. Here is a Perl script to come up with a tentative list of abbreviations.) Discuss what conceptual and practical problems you encounter. Discuss how well your attempt worked. Suggest improvements, and implement them if you can. (A rough sentence-splitting sketch appears after this list.)
  3. Determine the Zipf constant for at least three corpora -- in one language, or in different languages. Do the values vary? What accounts for the differences? Can you tell whether the variation across languages is greater or smaller than the variation across corpora within one language (e.g., English)? (An estimation sketch appears after this list.)
  4. Take a corpus (preferably in a language other than English -- it's more interesting) and run Zellig Harris' algorithm for finding morphemes (Zellig.pl). Describe what it does well and what it does poorly. How could you improve it? How sensitive is the quality of the result to the size of the corpus on which the algorithm is run? (A sketch of the underlying successor-count idea appears after this list.)
  5. Do the same thing for Automorphology (running in Linguistica).
  6. Take the corpus from question 1, and run Ngrams.pl to find the ngrams of the language, sorted in various orders. Compare ranking by (1) frequency divided by expected frequency (where "expected" means on a unigram model) and (2) frequency times the log of (frequency/expected frequency). Explain why the second is interesting and the first isn't. Explain both in terms of theory and in terms of what you find empirically in your results. (A bigram-scoring sketch appears after this list.)
  7. Choose two 19th-century authors (whose works are no longer copyrighted), compute the frequencies of words in some of their works, then choose one additional work from each and see whether a maximum-likelihood computation will "predict" the authorship of the two additional works. In effect, use these counts to determine authorship. You can make the word-frequency lists with the WordFreq.pl programs, and you can use a Perl script for the scoring: maxlikelihoodwords.pl. (A sketch of the scoring step appears after this list.) Then do the same for names: get a list of proper names (say, in the U.S.: the Census Bureau has posted a list of American names at http://www.census.gov/genealogy/names/). Look at them and, based on your intuitions, select 100 names from 4 or 5 different linguistic origins, and put them in separate groups. Count bigram frequencies for each "corpus," and then, using maximum-likelihood statistics, test another 100 names: determine which set of bigram frequencies best matches each of them.
  8. Take a large corpus and find the set of abbreviations contained in it. Find the set of proper names in it. Estimate how well you have done on this (evaluate quantitatively -- a precision/recall sketch appears after this list).
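
For exercise 2, the heart of the problem is deciding which periods actually end a sentence. Below is a rough sketch written for this page (not the scripts linked above); the abbreviation list is a toy placeholder, and tokens such as "U.S." will still fool it -- exactly the kind of problem the exercise asks you to discuss.

    #!/usr/bin/perl
    # Rough one-sentence-per-line splitter -- a sketch, not the course's script.
    # Usage: perl split_sentences.pl corpus.txt > sentences.txt
    use strict;
    use warnings;

    # Toy abbreviation list; the second linked script builds such a list automatically.
    my %abbrev = map { $_ => 1 } qw(mr mrs dr prof etc vs e.g i.e);

    local $/;                       # slurp the whole file at once
    my $text = <>;
    $text =~ s/\s+/ /g;             # flatten all whitespace to single spaces

    my $sentence = '';
    for my $tok (split / /, $text) {
        $sentence .= ($sentence eq '' ? '' : ' ') . $tok;
        my $boundary = 0;
        if ($tok =~ /[!?]['")]*$/) {              # ! and ? always end a sentence here
            $boundary = 1;
        } elsif ($tok =~ /^(.*?)\.['")]*$/) {     # final period: boundary unless abbreviation
            $boundary = 1 unless $abbrev{lc $1};
        }
        if ($boundary) {
            print "$sentence\n";
            $sentence = '';
        }
    }
    print "$sentence\n" if $sentence ne '';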
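
For exercise 3: Zipf's law says frequency falls off as a power of rank, f(r) ~ C / r^s, so log f is approximately linear in log r, and the exponent s can be estimated as the negated slope of a least-squares line. The sketch below (written for this page) reads a "count TAB word" list like the one the word-frequency sketch above prints.

    #!/usr/bin/perl
    # Estimate the Zipf exponent by least squares on (log rank, log frequency).
    # Usage: perl zipf_sketch.pl freqlist.txt
    use strict;
    use warnings;

    my @freqs;
    while (<>) {
        my ($count) = split /\t/;
        push @freqs, $count if defined $count && $count =~ /^\d+$/ && $count > 0;
    }
    @freqs = sort { $b <=> $a } @freqs;          # rank 1 = most frequent word
    die "need at least two frequencies\n" if @freqs < 2;

    my ($n, $sx, $sy, $sxx, $sxy) = (0, 0, 0, 0, 0);
    for my $rank (1 .. scalar @freqs) {
        my $x = log($rank);
        my $y = log($freqs[$rank - 1]);
        $n++; $sx += $x; $sy += $y; $sxx += $x * $x; $sxy += $x * $y;
    }
    my $slope = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);
    printf "Estimated Zipf exponent: %.3f\n", -$slope;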
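
For exercise 4: Zellig.pl is the course's script; what follows is only a sketch of the core idea usually attributed to Harris, successor variety: count, for every prefix attested in the corpus, how many distinct letters can follow it, and look for peaks as candidate morpheme boundaries. Zellig.pl's actual procedure may differ in its details.

    #!/usr/bin/perl
    # Successor-variety sketch (the idea behind Harris-style morpheme finding).
    # Usage: perl harris_sketch.pl corpus.txt word-to-analyze
    use strict;
    use warnings;

    die "Usage: perl harris_sketch.pl corpus.txt word\n" if @ARGV < 2;
    my $target = lc pop @ARGV;          # last argument is the word to analyze

    my %next;                           # prefix => set of letters seen after it
    while (my $line = <>) {
        for my $word (map { lc } $line =~ /[a-z]+/gi) {
            for my $i (1 .. length($word) - 1) {
                $next{ substr($word, 0, $i) }{ substr($word, $i, 1) } = 1;
            }
        }
    }

    # Peaks in the successor count suggest morpheme boundaries.
    for my $i (1 .. length($target) - 1) {
        my $prefix  = substr($target, 0, $i);
        my $variety = keys %{ $next{$prefix} || {} };
        printf "%-20s successors: %d\n", $prefix, $variety;
    }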
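
For exercise 6, both rankings can be computed side by side. With unigram counts f(w) over N tokens, the expected count of a bigram under independence is f(w1) * f(w2) / N. Measure (1) is observed/expected; measure (2) is observed * log(observed/expected), which is, up to a constant, one term's contribution to the log-likelihood (KL-divergence) statistic, so it weighs surprise by how often the pair actually occurs. A sketch written for this page (not Ngrams.pl):

    #!/usr/bin/perl
    # Bigram scoring sketch for exercise 6 -- not the course's Ngrams.pl.
    # Usage: perl bigram_scores.pl corpus.txt
    use strict;
    use warnings;

    my (%uni, %bi);
    my ($n, $prev) = (0, undef);
    while (my $line = <>) {
        for my $w (map { lc } $line =~ /\w+/g) {
            $uni{$w}++; $n++;
            $bi{"$prev $w"}++ if defined $prev;
            $prev = $w;
        }
    }

    # Measure 1: observed / expected.  Measure 2: observed * log(observed / expected).
    my (%m1, %m2);
    for my $pair (keys %bi) {
        my ($w1, $w2) = split / /, $pair;
        my $expected  = $uni{$w1} * $uni{$w2} / $n;
        $m1{$pair} = $bi{$pair} / $expected;
        $m2{$pair} = $bi{$pair} * log($bi{$pair} / $expected);
    }

    for my $m ([ 'observed/expected', \%m1 ], [ 'obs * log(obs/exp)', \%m2 ]) {
        my ($label, $score) = @$m;
        print "Top 20 by $label:\n";
        my $shown = 0;
        for my $pair (sort { $score->{$b} <=> $score->{$a} } keys %$score) {
            printf "  %-30s %10.2f\n", $pair, $score->{$pair};
            last if ++$shown == 20;
        }
        print "\n";
    }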
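
For exercise 7, maxlikelihoodwords.pl is the course's script; the sketch below shows only the scoring step: train a unigram model on each author's training text, then, over the disputed text, sum count(w) * log P(w | author), with add-one smoothing so unseen words do not drive the score to minus infinity. All file names are illustrative.

    #!/usr/bin/perl
    # Maximum-likelihood authorship sketch -- not the course's maxlikelihoodwords.pl.
    # Usage: perl ml_author.pl authorA.txt authorB.txt disputed.txt
    use strict;
    use warnings;

    die "Usage: perl ml_author.pl authorA.txt authorB.txt disputed.txt\n" if @ARGV != 3;

    sub counts {
        my ($file) = @_;
        my (%c, $n);
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            for my $w (map { lc } /\w+/g) { $c{$w}++; $n++; }
        }
        return (\%c, $n || 0);
    }

    my ($cA, $nA) = counts($ARGV[0]);
    my ($cB, $nB) = counts($ARGV[1]);
    my ($cT)      = counts($ARGV[2]);

    my %vocab = map { $_ => 1 } (keys %$cA, keys %$cB, keys %$cT);
    my $v = keys %vocab;                 # vocabulary size, for add-one smoothing

    my ($llA, $llB) = (0, 0);
    for my $w (keys %$cT) {
        $llA += $cT->{$w} * log( (($cA->{$w} || 0) + 1) / ($nA + $v) );
        $llB += $cT->{$w} * log( (($cB->{$w} || 0) + 1) / ($nB + $v) );
    }
    printf "log P(text | %s) = %.1f\nlog P(text | %s) = %.1f\n",
           $ARGV[0], $llA, $ARGV[1], $llB;
    print  "Best guess: ", ($llA > $llB ? $ARGV[0] : $ARGV[1]), "\n";

The same scoring carries over to the names task: replace word counts with letter-bigram counts for each group of names.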
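
For exercise 8, "evaluate quantitatively" usually means precision (what fraction of the items you found are correct) and recall (what fraction of the correct items you found), with F1 as their harmonic mean. A sketch comparing two one-item-per-line files; the gold list is one you prepare by hand, and the file names are illustrative.

    #!/usr/bin/perl
    # Precision/recall sketch for comparing a system list against a gold list.
    # Usage: perl eval_lists.pl gold.txt system.txt
    use strict;
    use warnings;

    die "Usage: perl eval_lists.pl gold.txt system.txt\n" if @ARGV != 2;

    sub read_set {
        my ($file) = @_;
        open my $fh, '<', $file or die "$file: $!";
        my %set;
        while (<$fh>) { chomp; $set{lc $_} = 1 if /\S/; }
        return \%set;
    }

    my ($gold, $sys) = map { read_set($_) } @ARGV;
    my $tp = grep { $gold->{$_} } keys %$sys;    # true positives

    my $precision = keys(%$sys)  ? $tp / keys(%$sys)  : 0;
    my $recall    = keys(%$gold) ? $tp / keys(%$gold) : 0;
    my $f1 = ($precision + $recall) ? 2 * $precision * $recall / ($precision + $recall) : 0;
    printf "precision %.3f   recall %.3f   F1 %.3f\n", $precision, $recall, $f1;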

Look up an entry in the TLFi:


Look up an entry in the XMLittré: