MKCORPUS PROJECT

MkCorpus/CorpusPlusBuilder...

Module: Webxref

S. Fleury

01/07/2001

_______________________________________________

1. Launching the webxref program (036 and 038):

This program takes several parameters as arguments (see below for a description of the options).

a. Running on a directory containing a single site (webxref-Read One Local Web Dir)

The program's launch window asks for:

Dir Input: the path of the directory that contains the locally mirrored web site directory (must end with /)

Site Name: the name of the directory containing the site to process

Dir Output: the path of the output directory (must end with /)

b. Running on a directory containing several site directories (webxref-Read One Dir of Local Web Dir)

The program's launch window asks for:

Dir Input: the path of the directory that contains the locally mirrored web site directories (must end with /)

Dir Output: the path of the output directory (must end with /)

_______________________________________________

2. Launching the complementary programs (036):

_______________________________________________

3. Options available in the Typweb menu:

WEBXREF 036

1. Read One Local Web Dir (036): runs webxref on a directory containing one locally mirrored web site

2. Read Index File (036): runs webxref on a directory containing one locally mirrored web site; the target site is the one associated with the file previously loaded in the mkCorpus editing window

WEBXREF 038

1. Read One Local Web Dir (038): runs webxref on a directory containing one locally mirrored web site

2. Read One Dir of Local Web Dir (038): runs webxref on a directory containing several locally mirrored web sites

3. Read Index File (038): runs webxref on a directory containing one locally mirrored web site; the target site is the one associated with the file previously loaded in the mkCorpus editing window

4. Read One Local Web Dir (038 Unix Lynx): same as (1), with Lynx output

5. Read One Dir of Local Web Dir (038 Unix Lynx): same as (2), with Lynx output

6. Read Index File (038 Unix Lynx): same as (3), with Lynx output

7. Read Local Web Dir (038-homolWin32): same as (1)

8. Read Index File (038-homolWin32): same as (3)

9. Read Local Web Dir (038-homolUnix Lynx): same as (1), with Lynx output

10. Read Index File (038-homolUnix Lynx): same as (3), with Lynx output

WEBXREF DOC 038 :

-----------------

Usage: webxref -options file.html || webxref -options site/

then navigate to "working_directory/res[](site)/"

webxref -options site1 site2

then see "res[](site1)" and "res[](site2)"

webxref -at "path" file.html

then navigate to "path/res[](site)/"

 

Options: -help/-h -noxref -xref/-x -onexref -fluff -htmlonly

-rep -norep -at -http -delay seconds

-silent/-s -verbose/-v -errors/-e -noint

-spell -html -del -lynx -brief -fullpath

-islocal <address> -avoid/-a <regexp>

-one/-1 -depth <depth> -rappspec <number>

-date <yymmdd> -time <hhmmss> -before -after

-find <string> -findexpr <regexp>

-replace <string> -replaceexpr <regexp> -by <string/expr>

=========================================

Which parameters to use for what purpose:

=========================================

Webxref checks the given file and follows the links in that file. While

working it lets you know it's alive by printing to STDOUT verbose messages.

It also prints in the report file a '+' for each file checked ok, and a '-'

for each file with a problem.

By default, webxref produces, for each file found on your local disks, a report

on its headers, tag elements (with attribute values) and links. After parsing

the file as a string, the routine DissectFile displays data from the HTML syntax

tree. The routine was inspired by the htmlscript "dissectsite.hts" found at

"http://worldwidemart.com/scripts/htmlscript/dissect/". Specify -norep to

disable it, or see the main section for configuration.

A webxref run can take some time. You can, however, interrupt webxref with

ctrl-c (Unix). Webxref will report only the files it has inspected up to that

moment and exit. (*New!*) (Note: this is not reliable! webxref is not interruptible

at any time, due to the C libraries not being re-entrant. (This probably does not

interest you at all, but it's not the author's fault.)) Specify -noint if you don't

want webxref to try to generate output after an interrupt.

When the whole site has been searched, all links have been inspected and

all .html and .htm files found have been dissected, webxref prints a report.

The current default is a long report in .txt form. The -html option switches

the report to .html form. The -at option lets you choose a directory

on your disk to put the results in. See also the examples.

If you want more information while webxref is working specify -verbose to get

messages on every file or -errors to see only files with problems. With -silent

webxref prints few messages while working.

Webxref keeps track of which html-documents are being linked to from other

documents. This is called cross-referencing, hence webxref's name. If you are

not interested in this, specify -noxref; you will then not be told where things

have failed and will probably have to run webxref again. If you're just interested

in one location where a file is referenced, specify -onexref. This saves memory too.
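
The cross-reference bookkeeping described above can be pictured with a small Python sketch. It is an illustration only, not webxref's own code: the function name build_xref and the (source, target) pair format are assumptions made for the example, and the one_only switch only mimics what -onexref is described as doing.

```python
from collections import defaultdict


def build_xref(links, one_only=False):
    """Map each link target to the pages that reference it.

    links    -- iterable of (source_page, target) pairs (hypothetical format)
    one_only -- keep a single referrer per target, as -onexref is described
    """
    xref = defaultdict(list)
    for source, target in links:
        referrers = xref[target]
        if one_only and referrers:
            continue  # one location is enough; this is what saves memory
        if source not in referrers:
            referrers.append(source)
    return dict(xref)


links = [
    ("index.html", "a.html"),
    ("b.html", "a.html"),
    ("index.html", "b.html"),
]
full = build_xref(links)
single = build_xref(links, one_only=True)
print(full["a.html"])    # every page that links to a.html
print(single["a.html"])  # only the first page seen linking to a.html
```

With the full table a broken target can be traced back to every page that links to it; with one_only only one such page is remembered.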

If you need to know if there are files and/or directories in your site that

are not referenced at all by any pages in your site specify -fluff.
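
One way to think about -fluff is as a set difference between the files present on disk and the files reachable by following links from the start page. The Python sketch below makes that idea concrete under stated assumptions: the link graph is given as a plain dictionary, and the names reachable and find_fluff are hypothetical, not part of webxref.

```python
def reachable(start, links):
    """Return every page reachable from start by following links.

    links -- {page: [targets it links to]} (hypothetical input format)
    """
    seen = set()
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        stack.extend(links.get(page, []))
    return seen


def find_fluff(files_on_disk, start, links):
    """Files that exist on disk but are never referenced from start."""
    return sorted(set(files_on_disk) - reachable(start, links))


links = {"index.html": ["a.html", "me.gif"], "a.html": ["index.html"]}
on_disk = ["index.html", "a.html", "me.gif", "old/draft.html"]
fluff = find_fluff(on_disk, "index.html", links)
print(fluff)
```

Here only old/draft.html is reported, because nothing reachable from index.html links to it.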

If you want to inspect only files that really have the .html or .htm extension,

specify -htmlonly.

If you specify -fullpath, you'll get the full paths for files. By default, file

names are abbreviated: /u/people/rick/www/a.html is printed as "a.html"

(when webxref is called from ~/rick/www).

If you use full URLs in your site referring to your own site, say "www.sara.nl" is

your www-address and you use links like <a href=www.sara.nl/rick/index.html> then

tell webxref that "www.sara.nl" actually can be found on the local machine with:

-islocal 'www.sara.nl'
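
The effect of -islocal can be sketched in Python as a rewrite of links whose host is declared local into paths on disk. This is a rough illustration, not webxref's implementation: the function to_local_path, the site_root argument and the /var/www directory are assumptions introduced for the example.

```python
from urllib.parse import urlsplit


def to_local_path(href, local_hosts, site_root):
    """Return a local path for href if its host is declared local, else None.

    Links written without a scheme (www.sara.nl/rick/index.html) are
    accepted as well as full http:// URLs. Ordinary relative links return
    None here because they need no rewriting.
    """
    parts = urlsplit(href if "://" in href else "//" + href)
    if parts.netloc.lower() in local_hosts:
        return site_root.rstrip("/") + parts.path
    return None


local_hosts = {"www.sara.nl"}
print(to_local_path("www.sara.nl/rick/index.html", local_hosts, "/var/www"))
print(to_local_path("http://www.sara.nl/rick/index.html", local_hosts, "/var/www"))
print(to_local_path("http://example.org/x.html", local_hosts, "/var/www"))
```

The first two calls resolve to the same local file; the third is left alone because example.org was not declared local.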

If you want to avoid certain files use the -avoid parameter to specify which

files to avoid.
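
To see what an -avoid expression would exclude before running webxref, the pattern can be tried against a list of file names. The Python sketch below assumes that the expression may match anywhere in the name and is case-sensitive; webxref itself uses Perl regular expressions, so details may differ. The helper name filter_avoided and the sample paths are hypothetical.

```python
import re


def filter_avoided(paths, avoid):
    """Drop every path whose name matches the avoid expression."""
    rx = re.compile(avoid)
    return [p for p in paths if not rx.search(p)]


paths = [
    "index.html",
    "Archive/1996.html",
    "Distribution/pkg.html",
    "docs/a.html",
]
kept_one = filter_avoided(paths, r".*Archive.*")
kept_two = filter_avoided(paths, r".*Archive.*|.*Distribution.*")
print(kept_one)
print(kept_two)
```

The second pattern uses alternation to avoid two families of files in one run.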

If you want to limit the number of files webxref inspects you may want to limit

the scan to 1 or 2 directories deep in the file system. If you specify -depth 0

only files in the current directory are inspected.
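
The depth counting behind -depth can be illustrated with a depth-limited directory walk in Python. This sketch only shows how nesting levels are counted (0 for the starting directory, 1 for its immediate subdirectories); it walks the file system instead of following links as webxref does, and the function name walk_to_depth is hypothetical.

```python
import os
import tempfile


def walk_to_depth(root, max_depth):
    """List files at most max_depth directory levels below root."""
    root = os.path.abspath(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Build a tiny throwaway site to try it on.
with tempfile.TemporaryDirectory() as site:
    os.makedirs(os.path.join(site, "sub", "deep"))
    for rel_name in (
        "index.html",
        os.path.join("sub", "a.html"),
        os.path.join("sub", "deep", "b.html"),
    ):
        open(os.path.join(site, rel_name), "w").close()
    depth0 = walk_to_depth(site, 0)
    depth1 = walk_to_depth(site, 1)

print(depth0)  # files in the starting directory only
print(depth1)  # plus files one directory down
```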

If you just want to check if links in a file are valid specify -one (or -1). Only

the links present in the file are tested, but no more. Use this with -files

to specify a collection of files to just check those files.

Specify -http if you want webxref to check whether the http:// links work. These

checks run after all local files have been inspected and may be time-consuming.

To avoid overloading a webserver there is a delay of 1 second between checks. If

you want longer or shorter delays, specify the number of seconds with -delay.

(Longer delays may be necessary if a lot of links refer to the same webserver.)
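
The pacing described for -http and -delay amounts to a pause between consecutive checks. The Python sketch below shows that pattern with the actual check and the sleep function passed in as parameters, so it can be tried without network access. All names here (check_urls, the stub checker, the example.org URLs) are assumptions for illustration, not webxref internals.

```python
import time


def check_urls(urls, check, delay=1.0, sleep=time.sleep):
    """Check each URL in turn, pausing delay seconds between checks.

    check -- callable taking a URL and returning True if the link works
    sleep -- injectable for testing; defaults to time.sleep
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # be polite to the web server
        results[url] = check(url)
    return results


# Demonstration with a stub checker and a recorded (not real) sleep.
pauses = []
demo = check_urls(
    ["http://example.org/a", "http://example.org/b", "http://example.org/c"],
    check=lambda url: not url.endswith("/b"),
    delay=2,
    sleep=pauses.append,
)
print(demo)
print(pauses)  # one pause between each pair of checks
```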

To see if you have files or directories that were modified last before or after

a certain date/time use: -before/-after -date yymmdd -time hhmmss. If -before

is given files are reported that were modified before the date given, with -after

files last modified after the date given are reported.
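
The date and time arguments can be read as a single cut-off moment against which each file's modification time is compared. The Python sketch below shows one plausible reading, with two labeled assumptions: a missing -time means midnight, and a four-digit time such as 1200 is padded with zero seconds. The function names and sample data are hypothetical.

```python
from datetime import datetime


def parse_cutoff(date, time=""):
    """Turn yymmdd and hhmm[ss] strings into a datetime."""
    time = time.ljust(6, "0")  # "" -> "000000", "1200" -> "120000"
    return datetime.strptime(date + time, "%y%m%d%H%M%S")


def select_by_mtime(mtimes, cutoff, before=True):
    """Names modified before (or after) the cut-off.

    mtimes -- {file name: datetime of last modification}
    """
    if before:
        return sorted(name for name, m in mtimes.items() if m < cutoff)
    return sorted(name for name, m in mtimes.items() if m > cutoff)


cutoff = parse_cutoff("970823", "1200")
mtimes = {
    "old.html": datetime(1997, 1, 1),
    "new.html": datetime(1998, 1, 1),
}
older = select_by_mtime(mtimes, cutoff, before=True)
newer = select_by_mtime(mtimes, cutoff, before=False)
print(cutoff, older, newer)
```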

By default, simply list the files or directories at the end of the command. To tell

webxref which files to inspect, use -files or -f. Webxref generates separate

results directories only if the files given as arguments are from different sites.

Webxref can search and even search-replace text, see later.

=======================

What the parameters do:

=======================

While checking, webxref prints messages to STDOUT according to:

-silent/-s      Few messages, list problems at the end of the run.

-verbose/-v     Print information while checking files.

-errors/-e      Print errors when they occur, even when -silent.

Webxref generates a report according to:

-noint          Do not generate output on interrupt.

-norep          Disable the per-file element report (the DissectFile routine).

-spell          Check html files for syntax errors.

-brief          List just problems.

-xref/-x        List which files reference files (cross-references).

-noxref         Do not list which files reference files (default).

-html           Print report in .html form.

-del            Delete HTML report files.

-lynx           Generate a Lynx dump of each html file (for the XML corpus).

-rappspec nb    The number nb is used in the name of the report directory.

-at             Lets you choose a directory to put the results in.

Webxref inspects files/directories according to:

-fluff          List which files/directories are never used.

-htmlonly       Only inspect files with the .html/.htm extension.

-fullpath       Print full-length filenames.

-islocal url    'www.mymachine.nl' is actually a local file reference.

-avoid regexp   Avoid files with names matching regexp for inspection.

-depth number   The maximum directory nesting level.
                0 means: current directory only,
                1 means: directories from the current directory.
                100 probably means there is no restriction in
                how deep webxref is allowed to find files.

-one/-1         Specify -one if you just want to check the links
                from the given file(s) and no further link following.

-http           Check external URLs via the network.

-delay seconds  Wait the specified number of seconds between HTTP checks.

-date -time     Date [yymm<dd>], time [hhmm<ss>].

-before -after  List files that are modified before or after
                the date/time given with -date and -time.

=================

Find/replacement: ** EXPERT ONLY **

=================

Webxref can scan your site for files containing certain text. To find fixed

text use -find. To find text using e.g. wildcards use -findexpr. The Perl

expression is matched with the text of the file under test. Take care to not

have the shell interpret '*' and '/' by using appropriate quoting. Search is

always case-insensitive. Webxref does search/replace beyond end-of-line. I.e.

newlines are matched, and can even be inserted (use \n).
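
The search behaviour described above (case-insensitive, with matches allowed to run across line ends) can be approximated in Python as follows. This is a sketch of the semantics only: webxref uses Perl expressions, and treating "newlines are matched" as "the dot also matches a newline" is an assumption. The function names are hypothetical.

```python
import re


def contains_fixed(text, needle):
    """Fixed-text search in the spirit of -find (case-insensitive)."""
    return needle.lower() in text.lower()


def contains_expr(text, pattern):
    """Expression search in the spirit of -findexpr.

    IGNORECASE makes the match case-insensitive; DOTALL lets '.' cross
    line ends, which is how the note on newlines is interpreted here.
    """
    return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None


page = '<IMG alt="portrait"\n     src="me.GIF">'
print(contains_fixed(page, "me.gif"))
print(contains_expr(page, r"<img .*\.gif"))
print(contains_expr(page, r"<img .*\.png"))
```

The image tag is found even though it is written in upper case and split over two lines.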

To replace text with something else use -replace and -replaceexpr and -by. The

string or expression you specify with -replace or -replaceexpr is replaced by

the string you specify with -by. In case of editing, a backup file with a random

numeric extension is placed next to the resulting file. E.g. when index.html is

edited there'll be a file "index.html.1234" or something similar. (DISCLAIMER:

the author cannot be held responsible for any damage resulting from using the

edit- or any other functions of webxref or indeed any software, hardware, chemical

substance, imagined or real (or imagined to be real) effects or by-effects of

anything, at all, whatsoever.)
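
The edit-with-backup behaviour can be sketched in Python as below. It is an illustration under stated assumptions, not webxref's code: the replacement given with -by is treated as literal text, matching is case-insensitive as for searching, and the four-digit backup suffix merely imitates the "index.html.1234" example. The function name replace_in_file is hypothetical.

```python
import os
import random
import re
import shutil
import tempfile


def replace_in_file(path, old, new, use_regex=False):
    """Replace old by new in the file; return the backup path or None."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    pattern = old if use_regex else re.escape(old)
    # A function as replacement keeps `new` literal (no backreferences).
    updated, count = re.subn(
        pattern, lambda m: new, text, flags=re.IGNORECASE | re.DOTALL
    )
    if count == 0:
        return None  # nothing to edit, so no backup is written
    backup = f"{path}.{random.randint(1000, 9999)}"
    shutil.copy2(path, backup)  # keep the original next to the result
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(updated)
    return backup


with tempfile.TemporaryDirectory() as tmp:
    page = os.path.join(tmp, "index.html")
    with open(page, "w", encoding="utf-8") as fh:
        fh.write("Call me.\nme too")
    backup = replace_in_file(page, "me", "you")
    with open(page, encoding="utf-8") as fh:
        edited = fh.read()
    with open(backup, encoding="utf-8") as fh:
        original = fh.read()

print(repr(edited))
print(os.path.basename(backup))
```

Writing the backup before touching the original means an interrupted run still leaves a recoverable copy on disk.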

-find string          Report files containing the given string.

-findexpr regexp      Report files containing the given expression.

-replace string       *REPLACE* string by the string given with -by.

-replaceexpr regexp   *REPLACE* regexp by the string given with -by.

-by string            Replacement string (or regexp).

-nobackup             Not implemented on purpose.

========

Examples

========

webxref file.htm(l) or webxref site/

Lists every file encountered in directories, reports problems,

dissects .html and .htm files, and writes the list of the reports in

"/res(site)[]/analysis_results.html[txt]".

webxref site1/ site2/

Analyses site1 (a directory), then site2.

webxref -at path file.htm(l)

Lets you choose a directory on your disk where to put the results

webxref -norep file.html

lists files encountered in directories and reports problems

webxref -html index.html

lets you get the reports in .html form

webxref -one index.html

just check the links in index.html, don't follow the links

webxref -one *.html

Check only the links in the html-files in the current dir.

webxref -depth 0 index.html

Check index.html, but don't check files in directories

that are deeper in the file system.

webxref -http file.html

Checks file.html and external URLs

webxref -htmlonly file.html

Checks file.html, but only inspects files with the .html/.htm extension

webxref -avoid '.*Archive.*' file.html

Checks file.html but avoids files with names containing

'Archive'

webxref -avoid '.*Archive.*|.*Distribution.*' file.html

Same as above, but also avoids files with names containing 'Distribution'

webxref -islocal www.sara.nl

Treat things like '<a href=http://www.sara.nl/rick' as a

local reference, as if it would have been '<a href=/rick'

webxref -fluff index.html

Checks index.html and reports files in the directories

encountered that were not referenced by index.html or any

file linked to from there.

webxref -silent index.html

Just report problems at the end of the run. This may take

a while with a big website.

webxref -silent -errors index.html

Prints only problems while scanning, and the final report.

webxref -verbose index.html

Prints a message for every file under test.

webxref -brief -silent index.html

Does not print messages while scanning, and generates a

short report, i.e. lists just problems.

webxref -before -date 970823 -time 1200 index.html

Reports files last modified before August 23rd 1997, 12:00

webxref -find 'me.gif' index.html

Reports a list of pages containing the text 'me.gif'

webxref -findexpr '<img .*\.gif' index.html

Reports files containing links to gif files.

webxref -replace 'me' -by 'you' -one index.html

Replaces 'me' by 'you' in index.html only.