Spelling checker and corrector
Allan Hetzel
University of Kentucky

University of Kentucky Spelling Checker & Corrector
December 1995 Distribution

The UK Spelling Checker & Corrector is a set of Execs and Modules that
can be used to check and correct spelling in files containing English
text. The easiest use is with the Spellfix command under Xedit. Best
performance is obtained by placing the Spelling Checker (with its rather
large internal lexicon) in a DCSS.

This is an attempt to put together a more-or-less complete package of
our CMS Spelling Checker. It contains the actual Exec and Module files
we are using in production. The only changes since 1991 have been to
allow the DCSS loader to run in an ESA mode machine and some
improvements in error handling. We are currently running this code
under CMS 8 and VM/ESA 2.2 on a 3090.

These are the files needed to run the Spelling Checker (we keep them on
our Y-disk):

   SPELL    EXEC
   SPELLDEF EXEC
   SPELLFIX EXEC
   SPELLLEX EXEC
   SPELLCHK MODULE
   SPELLMOD MODULE
   SPELL    XEDIT
   SPELLFIX XEDIT

The usual lack of warranty applies.

Dave Elbon
University of Kentucky
sysdave@ukcc.uky.edu
606 257-2230

Here are the instructions from the 1990 VM Workshop Tools Tape:

-------------

M E M O R A N D U M

June 11, 1990

To:      Those Interested
From:    Allan Hetzel (sysal@ukcc.uky.edu)
Subject: University of Kentucky Spelling checker

New For 1990:

* The Spelling Checker has been tested under CMS Rel 5.5 on VM XA/SP
  R2.0. The module which is distributed on the tape works on an XA
  system. Users of VM/SP or HPO systems will need to reassemble SPELLMOD
  and SCANLEX with the non-XA version of the UKSLIB library. The GENLMOD
  or GENSEG exec should be run (with the non-XA version of UKSLIB) to
  create a module or segment. (UKSLIB libraries are found elsewhere on
  the tape.)
* The lexicon files currently contain over 97,000 unexpurgated words.
  Thanks to the efforts of Kent Fiala, the spelling of over 300 words
  has been corrected and more than 1,000 new words have been added.
* A change to SPELLMOD allows for input records longer than 255
  characters.
* Our discontiguous shared segment (DCSS) size is 1,024K. About 510K of
  that is being used by the present checker. A 512K segment should still
  work but it's getting real tight.

Summary:

Approximately 5000 1K blocks are needed to restore all the files. Some
of the EXECs and XEDIT macros may need modification to work properly at
other installations. All EXECs are currently written in EXEC2 or REXX.
Hopefully everything needed is on the tape. A brief summary of the
files follows:

LEXICON files:

   These 27 LEXICONs are built into the MODULE:

   x LEXICON        contains words starting with the letter "x".
   COMMON LEXICON   The file which was used to create the common (lex-c)
                    LEXICON used by SPELLCHK. File is in frequency of
                    use order as determined by the Brown Corpus.

   The following files are loaded at execution time if called in with
   the USER option (see the HELP files). Depending on the filetype of
   the file being SPELLFIXed, some of these may be included by the EXEC
   without the user having to specify them. For instance, if the
   filetype is SCRIPT then the SCRIPT LEXICON would be loaded.

   JARGON   LEXICON
   SCRIPT   LEXICON
   ABBR     LEXICON
   $EXEC    LEXICON
   $XEDIT   LEXICON
   ASSEMBLE LEXICON
   EXEC     LEXICON
   HELPCMS  LEXICON

ASSEMBLE files:

   The source files are liberally sprinkled with useful comments (I
   hope). By applying the update files a complete source file of the
   current version of the Spelling Checker can be generated.

   LEXGEN     reads LEXICON files and produces TEXT files which are to
              become part of SPELLMOD MODULE. A large number of
              unsupported flag messages are generated when running this
              program. This is normal. Called by the GENLEX EXEC.
   SPELLMOD   main part of the code for the spelling checker.
   SCANLEX    routines which look up words in the two internal lexicons,
              common (lex-c) and other (lex-o), and the optional user
              lexicon (lex-u).
   DCSSLOAD   a general purpose transient command which will load a
              shared segment or a load module and branch to it. This is
              used to produce the SPELLCHK (production) and XPELLCHK
              (testing) modules.
   TABLEXO1   tables of addresses for the other lexicon (lex-o). There
   TABLEXO2   are three files because of the number of external
   TABLEXO3   references involved.
   TRYTABLE   Spelling guessing substitution table.

   Update     These are files for updating the various source files.
   AUX        They need to be applied to make the source current with
   CNTRL      the TEXT decks.

MODULEs:

   By loading all the EXEC, XEDIT, and MODULE files along with the user
   lexicons and HELP files a working copy of the Spelling Checker should
   be generated.

   SPELLMOD
   XPELLMOD
   SPELLCHK
   XPELLCHK
   LEXGEN

Generation EXECs:

   GENALL     calls GENLEX EXEC once for each letter.
   GENLEX     calls LEXGEN and generates TEXT files from LEXICON files.
              Then updates the SPELLCHK TXTLIB with the new files.
   GENLMOD    generate the SPELLMOD and XPELLMOD modules.
   GENXMOD
   GENSEG     generate the production and test segments.
   GENXSEG
   GENSPE     generate the SPELLCHK and XPELLCHK modules.
   GENSPEX

SPELLCHK TXTLIB:

   Contains the TEXT files produced by LEXGEN.

xxxxx TEXT:

   Assorted text files: SPELLMOD, SCANLEX, TABLEXO1, TABLEXO2, TABLEXO3,
   TRYTABLE, DCSSLOAD, and LEXGEN.

HELP files:

   Most of the HELP files have been preformatted for the IBM Release 4
   HELP system. Their filetypes start with HELP...

Support files:

   UKSLIB MACLIB   needed for assembling. This library is found in
                   another file of the University of Kentucky
                   contributions. Use correct version, either XA or
                   non-XA.
   UKSLIB TXTLIB   needed for loading. This library is found in another
                   file of the University of Kentucky contributions. Use
                   correct version, either XA or non-XA.

EXECs and XEDIT macros:

   SPELL      These first two are most useful to users.
   SPELLFIX   See the HELP files for function.
   SPELLDEF   support EXEC.

How to implement:

Restore the files to disk with at least 5000 1K blocks.
At this point, you have all the files (and more) that you need for testing the program. For production, the Spelling Checker should really be kept in a DCSS since performance suffers greatly otherwise. If you choose to modify the LEXICONs you're pretty much on your own, although just about everything you need should be on this tape. Both the source code for the checker and the EXECs and XEDIT macros are sprinkled with helpful comments.
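
For readers who want a feel for the lookup that SCANLEX is described as doing (two internal lexicons, lex-c and lex-o, plus an optional user lexicon, lex-u), here is a minimal sketch in Python. It is not the SCANLEX code, which is assembler source on the tape. The function names, the use of in-memory sets, the case folding, and the order in which the three lexicons are consulted are all assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Optional


def build_lexicon(words: Iterable[str]) -> FrozenSet[str]:
    """Fold words to lower case and store them for fast membership tests.

    Hypothetical helper: the real checker builds its lexicons into the
    module with LEXGEN, not into a Python set.
    """
    return frozenset(w.lower() for w in words)


def find_word(
    word: str,
    common: FrozenSet[str],
    other: FrozenSet[str],
    user: Optional[FrozenSet[str]] = None,
) -> Optional[str]:
    """Return the label of the first lexicon that contains the word.

    Assumed order: common (lex-c), then other (lex-o), then the optional
    user lexicon (lex-u). Returns None when no lexicon has the word.
    """
    key = word.lower()
    if key in common:
        return "lex-c"
    if key in other:
        return "lex-o"
    if user is not None and key in user:
        return "lex-u"
    return None


if __name__ == "__main__":
    common = build_lexicon(["the", "of", "and"])
    other = build_lexicon(["lexicon", "segment"])
    user = build_lexicon(["xedit"])
    for candidate in ["The", "segment", "xedit", "speling"]:
        print(candidate, find_word(candidate, common, other, user))
```

Checking the small, frequently used lexicon first is one plausible reason for keeping a separate common lexicon in frequency-of-use order, but the memo does not say how the real routines order their searches.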