Metadata-Version: 2.4
Name: readability
Version: 0.3.1
Summary: Measure the readability of a given text using surface characteristics
Home-page: https://github.com/andreasvc/readability/
Author: Andreas van Cranenburgh
Author-email: A.W.vanCranenburgh@uva.nl
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Cython
Classifier: Topic :: Text Processing :: Linguistic
License-File: LICENSE.txt
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: license-file
Dynamic: summary

Readability
===========

An implementation of traditional readability measures based on simple surface
characteristics. These measures are basically linear regressions based on the
number of words, syllables, and sentences.
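
As a rough illustration of what such a formula looks like, the sketch below
evaluates the standard published Flesch Reading Ease and Flesch-Kincaid Grade
Level formulas on raw counts. It is not taken from this package's source; the
function names are made up for this example, and the counts are copied from
the ``Lord Jim`` example output in the Usage section::

    from __future__ import division  # true division on Python 2.7 as well


    def flesch_reading_ease(words, sentences, syllables):
        """Standard Flesch Reading Ease formula (higher means easier)."""
        return (206.835
                - 1.015 * (words / sentences)
                - 84.6 * (syllables / words))


    def flesch_kincaid_grade(words, sentences, syllables):
        """Standard Flesch-Kincaid Grade Level formula (US school grade)."""
        return (0.39 * (words / sentences)
                + 11.8 * (syllables / words)
                - 15.59)


    # Counts copied from the example output in the Usage section.
    words, sentences, syllables = 132211, 8087, 164205

    print('%.2f' % flesch_reading_ease(words, sentences, syllables))
    print('%.2f' % flesch_kincaid_grade(words, sentences, syllables))

With these counts the two values round to 85.17 and 5.44, which agrees with
the ``FleschReadingEase`` and ``Kincaid`` lines of that example output.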

The functionality is modeled after the UNIX ``style(1)`` command. Compared to the
implementation as part of `GNU diction <http://www.moria.de/~michael/diction/>`_,
this version supports UTF-8 encoded text, but expects sentence-segmented and
tokenized text. The syllabification and word type recognition are based on
simple heuristics and provide only a rough measure. The supported languages
are English, German, and Dutch. Adding support for a new language involves the
addition of heuristics for the aforementioned syllabification and word type
recognition; see ``langdata.py``.

NB: all readability formulas were developed for English, so the scales of the
outcomes are only meaningful for English texts. The Dale-Chall measure uses the
original word list for English; for Dutch and German, it uses lists of frequent
words that were not specifically selected for recognizability by school
children.

Installation
------------
::

    $ pip install https://github.com/andreasvc/readability/tarball/master

Usage
-----
From Python::

    >>> import readability
    >>> text = ('This is an example sentence .\n'
    ...         'Note that tokens are separated by spaces and sentences by newlines .\n')
    >>> results = readability.getmeasures(text, lang='en')
    >>> print(results['readability grades']['FleschReadingEase'])
    55.95250000000002
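
If a tokenizer returns sentences as lists of tokens, the expected input format
can be assembled by hand. A minimal sketch, where ``to_readability_input`` is
a hypothetical helper and not part of this package::

    def to_readability_input(sentences):
        """Join tokens with spaces and end every sentence with a newline."""
        return ''.join(' '.join(tokens) + '\n' for tokens in sentences)


    # Two sentences that have already been split into tokens.
    sentences = [
        ['This', 'is', 'an', 'example', 'sentence', '.'],
        ['Note', 'that', 'tokens', 'are', 'separated', 'by', 'spaces',
         'and', 'sentences', 'by', 'newlines', '.'],
    ]
    text = to_readability_input(sentences)

The resulting string is identical to ``text`` in the example above and can be
passed to ``readability.getmeasures`` in the same way.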

Command line usage::

    $ readability --help
    Simple readability measures.

    Usage: readability [--lang=<x>] [FILE]
    or: readability [--lang=<x>] --csv FILES...

    By default, input is read from standard input.
    Text should be encoded with UTF-8,
    one sentence per line, tokens space-separated.

    Options:
      -L, --lang=<x>   Set language (available: de, nl, en).
      --csv            Produce a table in comma separated value format on
                       standard output given one or more filenames.
      --tokenizer=<x>  Specify a tokenizer including options that will be given
                       each text on stdin and should return tokenized output on
                       stdout. Not applicable when reading from stdin.

For proper results, the text should be tokenized.

- For English, I recommend "tokenizer",
  cf. http://moin.delph-in.net/WeSearch/DocumentParsing
- For Dutch, I recommend the tokenizer that is part of the Alpino parser:
  http://www.let.rug.nl/vannoord/alp/Alpino/.
- ``ucto`` is a general multilingual tokenizer: http://ilk.uvt.nl/ucto

Example using ``ucto``::

    $ ucto -L en -n -s '' "CONRAD, Joseph - Lord Jim.txt" | readability
    [...]
    readability grades:
        Kincaid:                          5.44
        ARI:                              6.39
        Coleman-Liau:                     6.91
        FleschReadingEase:               85.17
        GunningFogIndex:                  9.86
        LIX:                             31.98
        SMOGIndex:                        9.39
        RIX:                              2.56
        DaleChallIndex:                   8.02
    sentence info:
        characters_per_word:              4.17
        syll_per_word:                    1.24
        words_per_sentence:              16.35
        sentences_per_paragraph:         11.5
        type_token_ratio:                 0.09
        characters:                  551335
        syllables:                   164205
        words:                       132211
        wordtypes:                    12071
        sentences:                     8087
        paragraphs:                     703
        long_words:                   20670
        complex_words:                10990
        complex_words_dc:             29908
    word usage:
        tobeverb:                      3907
        auxverb:                       1630
        conjunction:                   4398
        pronoun:                      18092
        preposition:                  19290
        nominalization:                1167
    sentence beginnings:
        pronoun:                       2578
        interrogative:                  217
        article:                        629
        subordination:                  120
        conjunction:                    236
        preposition:                    397
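
Several of the figures in this output follow directly from the raw counts.
The sketch below recomputes a few of them using the standard published
definitions of ARI, LIX, and RIX; these formulas are an assumption about what
the package computes, not a copy of its code::

    from __future__ import division  # true division on Python 2.7 as well

    # Raw counts copied from the example output above.
    characters = 551335
    words = 132211
    wordtypes = 12071
    sentences = 8087
    long_words = 20670

    # Ratios reported under "sentence info".
    characters_per_word = characters / words
    words_per_sentence = words / sentences
    type_token_ratio = wordtypes / words

    # Standard published definitions (assumed, not read from this package).
    ari = 4.71 * characters_per_word + 0.5 * words_per_sentence - 21.43
    lix = words_per_sentence + 100 * long_words / words
    rix = long_words / sentences

    for name, value in [('ARI', ari), ('LIX', lix), ('RIX', rix),
                        ('type_token_ratio', type_token_ratio)]:
        print('%-20s %6.2f' % (name + ':', value))

Rounded to two decimals, the results are 6.39, 31.98, 2.56, and 0.09, matching
the corresponding lines of the output above.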

The option ``--csv`` collects readability measures for a number of texts in
a table. To tokenize documents on the fly when using this option, use
the ``--tokenizer`` option. Example with the "tokenizer" tool::

    $ readability --csv --tokenizer='tokenizer -L en-u8 -P -S -E "" -N' */*.txt >readabilitymeasures.csv

References
----------
The following readability metrics are included:

1. http://en.wikipedia.org/wiki/Automated_Readability_Index
2. http://en.wikipedia.org/wiki/SMOG
3. http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_Grade_Level#Flesch.E2.80.93Kincaid_Grade_Level
4. http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test#Flesch_Reading_Ease
5. http://en.wikipedia.org/wiki/Coleman-Liau_Index
6. http://en.wikipedia.org/wiki/Gunning-Fog_Index
7. https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula

For better readability measures, consider the following:

- Collins-Thompson & Callan (2004). A language modeling approach to predicting reading difficulty.
  In Proc. of HLT/NAACL, pp. 193-200. http://aclweb.org/anthology/N04-1025.pdf
- Schwarm & Ostendorf (2005). Reading level assessment using SVM and statistical language models.
  Proc. of ACL, pp. 523-530. http://www.aclweb.org/anthology/P05-1065.pdf
- The Lexile framework for reading. http://www.lexile.com
- Coh-Metrix. http://cohmetrix.memphis.edu/
- Stylene: http://www.clips.ua.ac.be/category/projects/stylene
- T-Scan: http://languagelink.let.uu.nl/tscan

Acknowledgments
---------------
The code is based on https://github.com/mmautner/readability,
which in turn was based on
https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability
