Custom Named Entity Recognition (NER) model with spaCy.
Python program to train an English NER model for detecting people's names using spaCy.
- conll.txt is the raw dataset used to train the model (CoNLL-2013).
- preprocess.py contains a helper function to load, parse, and preprocess the raw dataset.
- ner.py is the main program and does the following:
  - Create the training data from the raw dataset.
  - Train the model according to the specified hyperparameters (EPOCHS, BATCH_SIZE, DROPOUT).
  - Evaluate the model on a subset of the training data whose size is given by TEST_SIZE.
  - Save the model to disk (optional).
  - Use the new model or a pre-trained one for inference (optional).
- model contains a model trained with the following hyperparameters:
  - Epochs: 100
  - Drop-out rate: 0.1
  - Mini-batch size: 32
- harry.txt contains a sample of text with a few named entities to test the model.
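The data-preparation step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in preprocess.py: it assumes the raw file has one token per line with the NER tag in the last whitespace-separated column, blank lines between sentences, IOB-style tags such as B-PER and I-PER, and that the goal is spaCy's character-offset format `(text, {"entities": [(start, end, label)]})`. The names `sentence_to_example`, `parse_conll`, and `split_train_test` are hypothetical, and treating TEST_SIZE as a held-out fraction is also an assumption.

```python
import random


def sentence_to_example(tokens, tags, label="PER"):
    """Join tokens with spaces and collect character spans for one label."""
    entities = []
    offset = 0
    start = end = None
    for token, tag in zip(tokens, tags):
        is_target = tag.endswith("-" + label)
        # Close the open span when the entity ends or a new one begins.
        if start is not None and (not is_target or tag.startswith("B-")):
            entities.append((start, end, label))
            start = None
        if is_target:
            if start is None:
                start = offset
            end = offset + len(token)
        offset += len(token) + 1  # +1 for the joining space
    if start is not None:
        entities.append((start, end, label))
    return " ".join(tokens), {"entities": entities}


def parse_conll(lines, label="PER"):
    """Group token lines into sentences and convert each to an example."""
    examples, tokens, tags = [], [], []
    for line in lines:
        fields = line.split()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not fields or fields[0] == "-DOCSTART-":
            if tokens:
                examples.append(sentence_to_example(tokens, tags, label))
                tokens, tags = [], []
            continue
        tokens.append(fields[0])
        tags.append(fields[-1])
    if tokens:
        examples.append(sentence_to_example(tokens, tags, label))
    return examples


def split_train_test(examples, test_size=0.2, seed=0):
    """Shuffle a copy and hold out a fraction (assumed meaning of TEST_SIZE)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Tiny hand-written sample in the assumed format (not taken from conll.txt).
sample = [
    "-DOCSTART- -X- O O",
    "",
    "Harry NNP B-NP B-PER",
    "Potter NNP I-NP I-PER",
    "met VBD B-VP O",
    "Hermione NNP B-NP B-PER",
    ". . O O",
    "",
    "She PRP B-NP O",
    "lives VBZ B-VP O",
    "in IN B-PP O",
    "London NNP B-NP I-LOC",
    ". . O O",
]
examples = parse_conll(sample)
train, test = split_train_test(examples, test_size=0.5)
```

Only person tags are kept, so the second sentence yields an empty entity list; examples like this still tell the model what is not a name.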
The following libraries are required to run ner.py:
ner.py accepts the following command-line arguments:
- --model: directory containing a pre-trained model (if using a pre-trained model for inference).
- --data: path to the raw dataset (if training a fresh model).
- --save: directory to save the new model to.
- --file: path to a text file containing named entities (if running inference).
- --output: output path to save the model predictions (if running inference).
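One way these options could be wired up is with argparse from the standard library. This is a sketch, not the parser actually used in ner.py: the flag names come from the list above, while `build_parser`, `parse_args`, the defaults of None, and the rule that at least one of --data or --model must be given are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the five flags described above; all are optional strings."""
    parser = argparse.ArgumentParser(
        description="Train or run a spaCy NER model for people's names."
    )
    parser.add_argument("--model", help="directory containing a pre-trained model")
    parser.add_argument("--data", help="path to the raw dataset")
    parser.add_argument("--save", help="directory to save the new model to")
    parser.add_argument("--file", help="text file to run inference on")
    parser.add_argument("--output", help="path to save the model predictions")
    return parser


def parse_args(argv=None):
    """Parse argv (or sys.argv when None) and apply the assumed sanity check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: we need a dataset to train on or a saved model to load.
    if args.data is None and args.model is None:
        parser.error("provide --data to train a model or --model to load one")
    return args


# Equivalent to the inference invocation of ner.py.
args = parse_args(
    ["--model", "./model", "--file", "harry.txt", "--output", "harry_output.txt"]
)
```

Passing an explicit list to `parse_args` keeps the example independent of the real command line; calling it with no argument reads `sys.argv` as usual.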
- Train a new model and save it:
    python ner.py --data conll.txt --save .
- Run inference using a pre-trained model:
    python ner.py --model ./model --file harry.txt --output harry_output.txt
Time spent: 3 hours in total.
- 1h reviewing spaCy documentation;
- 1.5h coding the solution;
- 0.5h debugging.