jiri_vinarek edited this page Mar 17, 2015 · 1 revision

Linguistics explanation, tools and algorithms.

General resources

Linguistics tools

Suggested and used tools for linguistic analysis. We need to obtain the sentence structure and the base form of every word. This process consists of four parts, each described below.

Tokenizer

A basic tool for string splitting. The tokenizer separates numbers from letters, separates words from each other, and handles punctuation.

Input Natural language sentence; negation detection
Output Array of tokens (e.g. words, numbers)
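
The input/output contract above can be illustrated with a minimal regex-based sketch. It is not how any of the tools listed below work internally; the pattern and the function name `tokenize` are assumptions made for this example, and the pattern only knows ASCII letters.

```python
import re

# Assumed pattern (illustration only): a run of ASCII letters with an optional
# internal apostrophe part, or a run of digits, or one punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\sA-Za-z\d]")


def tokenize(sentence):
    """Split a sentence into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


if __name__ == "__main__":
    print(tokenize("Bug42 wasn't fixed, was it?"))
```

Because letters and digits are matched by separate alternatives, a string such as Bug42 comes out as two tokens, which is the "separates numbers from letters" behaviour described above.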

Tools

  • The EGYPT toolkit - Perl tokenizer

Part of a statistical machine translation toolkit. http://www.clsp.jhu.edu/ws99/projects/mt/

  • LingPipe - tokenizer (added in version 4.0.1)

A good-looking Java toolkit for processing text using computational linguistics.

  • Simple Java tokenizer for regular expressions

http://introcs.cs.princeton.edu/72regular/Tokenizer.java.html

Tagger

Usually a tool just for part-of-speech (POS) tagging. It identifies the basic linguistic category of each word.

Input Array of tokens
Output Array of POS-tagged tokens (e.g. adjectives, verbs)
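
To make the tagger's input and output concrete, here is a toy sketch that assigns Penn Treebank tags from a tiny lexicon plus suffix rules. This is a deliberately simple lookup approach, not the maximum entropy or log-linear models used by the tools listed below; the lexicon contents and the function name `tag` are assumptions for illustration.

```python
# Tiny hand-made lexicon (an assumption for this example, not real training data).
LEXICON = {
    "the": "DT", "a": "DT", "an": "DT",
    "is": "VBZ", "was": "VBD", "not": "RB",
    ",": ",", ".": ".",
}


def tag(tokens):
    """Return (token, Penn Treebank tag) pairs using lookup and suffix fallbacks."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            pos = LEXICON[lower]
        elif token.isdigit():
            pos = "CD"   # cardinal number
        elif lower.endswith("ly"):
            pos = "RB"   # adverb
        elif lower.endswith("ing"):
            pos = "VBG"  # gerund or present participle
        elif lower.endswith("ed"):
            pos = "VBD"  # past tense verb
        else:
            pos = "NN"   # default guess: singular noun
        tagged.append((token, pos))
    return tagged


if __name__ == "__main__":
    print(tag(["The", "parser", "quickly", "failed", "."]))
```

A statistical tagger replaces the hand-written rules with probabilities learned from an annotated corpus, but the shape of the input and output stays the same.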

Tools

  • A Maximum Entropy Model for Part-Of-Speech Tagging

Java implementation of this tagger.

Another wrapper, part of the basic tools (/util) at the Center for the Study of Language and Information at Stanford University: http://godel.stanford.edu/public/doc-versions/util/doc/api/csli/util/nlp/postag/MXPOST.html

Penn Treebank Tags - explanation of all tags: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html#RB

  • Stanford Log-linear Part-Of-Speech Tagger

Java implementation of the log-linear part-of-speech taggers.

Parser

Mainly a statistical parser. This tool discovers the structure of a sentence, usually written as a syntactic tree. It is part of syntactic analysis.

Input Array of POS-tagged tokens
Output Parse tree for each sentence
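
As a rough illustration of turning POS-tagged tokens into a tree, the sketch below does shallow noun-phrase chunking and prints the result in bracket notation. It is a rule-based simplification, not a statistical parser; the function names `chunk` and `to_brackets` and the single NP rule (optional determiner, adjectives, then nouns) are assumptions for this example.

```python
def chunk(tagged):
    """Build a shallow tree: NP nodes for DT? JJ* NN+ runs, other tokens as leaves."""
    tree = ["S"]
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in ("NN", "NNS"):  # nouns
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a noun phrase.
            tree.append(["NP"] + [tuple(t) for t in tagged[i:k]])
            i = k
        else:
            # No noun phrase starts here; keep the token as a plain leaf.
            tree.append(tuple(tagged[i]))
            i += 1
    return tree


def to_brackets(node):
    """Render a tree (nested lists with tuple leaves) in bracket notation."""
    if isinstance(node, tuple):
        word, pos = node
        return f"({pos} {word})"
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


if __name__ == "__main__":
    tagged = [("The", "DT"), ("old", "JJ"), ("parser", "NN"), ("failed", "VBD")]
    print(to_brackets(chunk(tagged)))
```

A full parser would also build verb phrases and nest constituents recursively; the nested-list representation used here extends to that case unchanged.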

Tools

Lemmatizer

Analysis and description of the structure of morphemes. For our purposes it is sufficient to obtain lemmas (base forms), e.g. have for had. Lemmatizers find only lemmas, whereas stemmers find stems (e.g. bug for debugging).

Input Parse trees of each sentence
Output Lemma for each word
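
The difference between a lemmatizer and a stemmer can be sketched as follows. Both functions are toy versions: the irregular-form table, the suffix list and the names `lemmatize` and `stem` are assumptions for illustration, and the sketch works on single words rather than on parse trees. Since it only strips suffixes, this crude stemmer reduces debugging to debug, not to bug.

```python
# Assumed table of irregular forms (illustration only).
IRREGULAR = {"had": "have", "has": "have", "was": "be", "were": "be", "went": "go"}


def lemmatize(word):
    """Dictionary lookup: map an inflected form to its base form if it is known."""
    lower = word.lower()
    return IRREGULAR.get(lower, lower)


def stem(word):
    """Crude suffix stripping: cut a common ending, with no dictionary involved."""
    lower = word.lower()
    for suffix in ("ing", "ed", "s"):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            base = lower[: -len(suffix)]
            # Undo consonant doubling, e.g. "debugg" -> "debug".
            if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiouls":
                base = base[:-1]
            return base
    return lower


if __name__ == "__main__":
    for w in ("had", "debugging", "words"):
        print(w, lemmatize(w), stem(w))
```

The lemmatizer resolves had to have because it consults a dictionary, while the stemmer leaves had unchanged because no suffix rule applies.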

Tools

General tools

  • Core NLP

Stanford suite of Core NLP tools (almost all operations).

  • XTAG

Tool for operations with Tree Adjoining Grammars. http://www.cis.upenn.edu/~xtag/

Sentence negation

Keywords: negative-positive conversion
  1. Identification of negation in a sentence
  2. Dealing with the negation
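
The two steps above can be sketched with a simple cue-word approach: first find negation cues in the token list, then mark every token in the cue's scope (here, up to the next clause-ending punctuation mark) with a NOT_ prefix. This is a common baseline, not the method of any paper referenced on this page; the cue list, the scope rule and the function names are assumptions for illustration.

```python
# Assumed cue words and clause boundaries (illustration only).
NEGATION_CUES = {"not", "no", "never", "cannot", "without"}
CLAUSE_END = {".", ",", ";", "!", "?"}


def is_cue(token):
    """A token is a negation cue if it is a listed word or ends in n't."""
    lower = token.lower()
    return lower in NEGATION_CUES or lower.endswith("n't")


def find_negations(tokens):
    """Step 1: return the positions of negation cues in the sentence."""
    return [i for i, token in enumerate(tokens) if is_cue(token)]


def mark_scope(tokens):
    """Step 2: prefix tokens between a cue and the next clause end with NOT_."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in CLAUSE_END:
            in_scope = False          # punctuation closes the negation scope
            marked.append(token)
        elif is_cue(token):
            in_scope = True           # a cue opens a new scope
            marked.append(token)
        elif in_scope:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
    return marked


if __name__ == "__main__":
    tokens = ["The", "parser", "does", "not", "handle", "negation", ",", "sadly", "."]
    print(find_negations(tokens))
    print(mark_scope(tokens))
```

Using the parse tree instead of punctuation to delimit the scope would be more precise, at the cost of requiring the parser step first.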

Reference

Tools

Sentence negation

  • Sanda Harabagiu, Andrew Hickl and Finley Lacatusu: Negation, Contrast and Contradiction in Text Processing

  • Sergey Goryachev, Margarita Sordo, Qing T. Zeng, Long Ngo: Implementation and Evaluation of Four Different Methods of Negation Detection https://www.i2b2.org/software/projects/hitex/negation.pdf

  • Katsura, Y., Matsumoto, K., Ren, F.: Flexible English writing support based on negative-positive conversion method

  • Pradeep G. Mutalik, MD, Aniruddha Deshpande, MD, and Prakash M. Nadkarni, MD: Use of General-purpose Negation Detection to Augment Concept Indexing of Medical Documents http://www.ncbi.nlm.nih.gov/pmc/articles/PMC130070/
