Napkin is a simple tool to produce statistical analysis of a text
Find a file
2020-10-11 10:45:36 +02:00
bin new: [option] --analysis to limit the output to a specific analysis 2020-10-09 23:23:36 +02:00
logo chg: [logo] added 2020-10-09 21:53:08 +02:00
samples add: [sample] french text - Alice in Wonderland 2020-10-11 10:45:36 +02:00
.gitignore Initial commit 2020-08-18 16:49:24 +02:00
LICENSE Initial commit 2020-08-18 16:49:24 +02:00
README.md new: [option] --analysis to limit the output to a specific analysis 2020-10-09 23:23:36 +02:00
requirements.txt new: [requirement] requirement file added 2020-10-09 20:50:40 +02:00

napkin-text-analysis

napkin text analysis - logo

Napkin is a Python tool to produce statistical analysis of a text.

Analysis features are :

  • Verbs frequency
  • Nouns frequency
  • Digit frequency
  • Labels frequency such as (Person, organisation, product, location) as defined in spacy.io named entities
  • URL frequency
  • Email frequency
  • Mention frequency (everything prefixed with an @ symbol)
  • Out-Of-Vocabulary (OOV) word frequency meaning any words outside English dictionary

Verbs and nouns are in their lemmatized form by default but the option --verbatim allows to keep the original inflection.

Intermediate results are stored in a Redis database to allow the analysis of multiple text files.

requirements

  • Python >= 3.6
  • spacy.io
  • redis (a redis server running on port 6380 is required)
  • pycld3
  • tabulate

how to use napkin

usage: napkin.py [-h] [-v V] [-f F] [-t T] [-s] [-o O] [-l L] [--verbatim]
                 [--no-flushdb] [--binary] [--analysis ANALYSIS]

Extract statistical analysis of text

optional arguments:
  -h, --help           show this help message and exit
  -v V                 verbose output
  -f F                 file to analyse
  -t T                 maximum value for the top list (default is 100) -1 is
                       no limit
  -s                   display the overall statistics (default is False)
  -o O                 output format (default is csv), json, readable
  -l L                 language used for the analysis (default is en)
  --verbatim           Don't use the lemmatized form, use verbatim. (default
                       is the lematized form)
  --no-flushdb         Don't flush the redisdb, useful when you want to
                       process multiple files and aggregate the results. (by
                       default the redis database is flushed at each run)
  --binary             set output in binary instead of UTF-8 (default)
  --analysis ANALYSIS  Limit output to a specific analysis (verb, noun,
                       hashtag, mention, digit, url, oov, labels, punct).
                       (Default is all analysis are displayed)

example usage of napkin

A sample file "The Prince, by Nicoló Machiavelli" is included to test napkin.

python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 4

Example output:

╒═════════════════╕
│ Top 4 of verb   │
╞═════════════════╡
│ 116 occurences  │
├─────────────────┤
│ make            │
├─────────────────┤
│ 106 occurences  │
├─────────────────┤
│ may             │
├─────────────────┤
│ 102 occurences  │
├─────────────────┤
│ would           │
╘═════════════════╛
╒═════════════════╕
│ Top 4 of noun   │
╞═════════════════╡
│ 108 occurences  │
├─────────────────┤
│ state           │
├─────────────────┤
│ 90 occurences   │
├─────────────────┤
│ people          │
├─────────────────┤
│ one             │
╘═════════════════╛
╒════════════════════╕
│ Top 4 of hashtag   │
╞════════════════════╡
╘════════════════════╛
╒════════════════════╕
│ Top 4 of mention   │
╞════════════════════╡
╘════════════════════╛
╒══════════════════╕
│   Top 4 of digit │
╞══════════════════╡
│           750175 │
├──────────────────┤
│          6221541 │
├──────────────────┤
│            57037 │
╘══════════════════╛
╒═════════════════════════════════════════╕
│ Top 4 of url                            │
╞═════════════════════════════════════════╡
│ 1 occurences                            │
├─────────────────────────────────────────┤
│ www.gutenberg.org/license               │
├─────────────────────────────────────────┤
│ www.gutenberg.org/contact               │
├─────────────────────────────────────────┤
│ http://www.gutenberg.org/5/7/0/3/57037/ │
╘═════════════════════════════════════════╛
╒════════════════╕
│ Top 4 of oov   │
╞════════════════╡
│ 6 occurences   │
├────────────────┤
│ Vitelli        │
├────────────────┤
│ Pertinax       │
├────────────────┤
│ Orsinis        │
╘════════════════╛
╒═══════════════════╕
│ Top 4 of labels   │
╞═══════════════════╡
│ 197 occurences    │
├───────────────────┤
│ CARDINAL          │
├───────────────────┤
│ 189 occurences    │
├───────────────────┤
│ ORG               │
├───────────────────┤
│ 131 occurences    │
├───────────────────┤
│ NORP              │
╘═══════════════════╛

what about the name?

The name 'napkin' came after a first sketch of the idea on a napkin. The goal was also to provide a simple text analysis tool which can be run on the corner of table in a kitchen.

LICENSE

napkin is free software under the AGPLv3 license.

Copyright (C) 2020 Alexandre Dulaunoy
Copyright (C) 2020 Pauline Bourmeau