This script extracts proper nouns from an English text using NLTK.
Install dependencies
--------------------
* Install NLTK according to your OS (pkg install nltk on FreeBSD, for example)
* Install numpy (pkg install py27-numpy)
* Download the needed NLTK resources with nltk.download():
** averaged_perceptron_tagger
** maxent_treebank_pos_tagger
** punkt
** treebank
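
The download step above can be scripted. What follows is a minimal sketch, not part of this repository: the helper name download_resources is hypothetical, and it relies only on nltk.download() accepting a resource name and returning a true value on success.

```python
# Resource names copied from the list above.
RESOURCES = [
    "averaged_perceptron_tagger",
    "maxent_treebank_pos_tagger",
    "punkt",
    "treebank",
]


def download_resources(download=None):
    """Fetch each resource and return the names whose download failed.

    `download` is any callable taking a resource name and returning a
    true value on success; it defaults to nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so this file loads without NLTK
        download = nltk.download
    failed = []
    for name in RESOURCES:
        if not download(name):
            failed.append(name)
    return failed
```

Calling download_resources() with no argument needs network access; an empty list as the result means every resource was fetched.
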
Source text
-----------
You need a copy of the text you want to extract from as plain text.
Source English word list
------------------------
The expected format is a lowercase list with one substantive word per line.
The filename should be wordsEn.txt; to use another name, change it in the eliminate-common-nouns script.
Such a file was available at [SIL](http://web.archive.org/web/20141122213941/http://www-01.sil.org/linguistics/wordlists/english/).
Usage
-----
./extract-proper-nouns source.txt > nouns.txt
To sort them and eliminate duplicates:
./extract-proper-nouns source.txt | sort | uniq > nouns.txt
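
For readers curious about what the extraction step involves, here is a rough sketch of one way to do it with NLTK. It is not the code of extract-proper-nouns itself: the function names are hypothetical, and it assumes proper nouns are the tokens tagged NNP or NNPS (the Penn Treebank proper-noun tags) by nltk.pos_tag.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}  # Penn Treebank tags for proper nouns


def proper_nouns(tagged_tokens):
    """Keep the words whose tag is a proper-noun tag, in original order."""
    return [word for word, tag in tagged_tokens if tag in PROPER_NOUN_TAGS]


def extract_from_text(text):
    """Tokenize and tag `text` with NLTK, then keep the proper nouns."""
    import nltk  # imported lazily; needs the resources listed earlier
    tokens = nltk.word_tokenize(text)
    return proper_nouns(nltk.pos_tag(tokens))


def unique_sorted(words):
    """Rough equivalent of `sort | uniq`: drop duplicates, then sort."""
    return sorted(set(words))
```

Note that Python sorts strings by code point, whereas sort(1) follows the current locale, so the two orderings can differ for mixed-case or accented words.
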
To discard known English words:
./eliminate-common-nouns nouns.txt
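
The filtering idea can be sketched as below. This is an illustration under stated assumptions, not the eliminate-common-nouns script: the function names are hypothetical, and comparing each noun in lowercase against the word list is an assumption based on the list being lowercase.

```python
def load_word_list(lines):
    """Build a set of lowercase words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def discard_known_words(nouns, known_words):
    """Keep only the nouns that do not appear in the word list.

    The comparison is case-insensitive (an assumption), so "Bill" is
    dropped when the list contains "bill".
    """
    return [noun for noun in nouns if noun.lower() not in known_words]


# Possible usage, assuming wordsEn.txt and nouns.txt exist:
#   with open("wordsEn.txt") as word_file, open("nouns.txt") as noun_file:
#       known = load_word_list(word_file)
#       nouns = [line.strip() for line in noun_file if line.strip()]
#       print("\n".join(discard_known_words(nouns, known)))
```

A set is used for the word list so that each lookup stays cheap even when the list holds a large number of entries.
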
Acknowledgment
--------------
Thank you to Rama for the NLTK suggestion and some brief guidance.
The original code idea is from Alvations and can be seen at http://stackoverflow.com/a/17672491/1930997.