onlyuser/parse-english


parse-english

Copyright (C) 2011-2017 <onlyuser@gmail.com>

About

parse-english is a minimum viable English parser implemented in Lex/Yacc. It parses, in parallel, every interpretation of an English sentence that its grammar accepts, and it generates an abstract syntax tree for each successful parse. The algorithm is completely deterministic and requires no training data.

See old version here: NatLang

A Motivating Example

input:

the quick brown fox jumps over the lazy dog.

output:

[image: parse tree for the input sentence]

Usage

cd ./demo/0_parse-english_full_nlp
./demo.sh "the quick brown fox jumps over the lazy dog"

Switch        Description
-e SENTENCE   input sentence
-l            Lisp mode
-g            graph mode (slow for deep trees)
-d            dot mode
-x            extract ontology mode
-q            quiet mode
-m            memory debug
-n            indent Lisp output

Requirements

Unix tools and 3rd party components (accessible from $PATH):

gcc flex bison

Supported Features

  • Parallel reentrant parsing
  • Lisp / graph / dot output (multiple trees)

Supported Grammar Syntaxes

  • Present tense
  • Progressive tense
  • Future tense
  • Past tense
  • Past perfect tense
  • Passive voice
  • Questions
  • Conditionals
  • Imperative mood
  • Comparisons

Limitations

  • Hard-coded grammar and vocabulary.
  • A brute force algorithm tries all supported interpretations of a sentence. This is slow for long sentences.
  • BNF rules are suitable for specifying constituent-based phrase structure grammars, but are a poor fit for expressing non-local dependencies.

Make Targets

target   action
all      make binaries
test     all + run tests
pure     test + use valgrind to check for memory leaks
dot      test + generate .png graph for tests
lint     use cppcheck to perform static analysis on .cpp files
doc      use doxygen to generate documentation
xml      test + generate .xml for tests
import   test + use ticpp to serialize-to/deserialize-from xml
clean    remove all intermediate files

References

"Part-of-speech tagging"
http://en.wikipedia.org/wiki/Part-of-speech_tagging
"Princeton WordNet"
http://wordnet.princeton.edu/
"Syntactic Theory: A Unified Approach"
ISBN: 0340706104
"Enju - A fast, accurate, and deep parser for English"
http://www.nactem.ac.uk/enju/

Keywords

Natural Language Processing, English parser, Yacc, BNF