Relationship names usually correspond in natural language to what type of grammatical form?


J Am Med Inform Assoc. 2011 Sep-Oct; 18(5): 544–551.

Natural language processing: an introduction

Prakash M Nadkarni

1Yale University School of Medicine, New Haven, Connecticut, USA

Lucila Ohno-Machado

2University of California, San Diego School of Medicine, Division of Biomedical Informatics, La Jolla, California, USA

Wendy W Chapman

2University of California, San Diego School of Medicine, Division of Biomedical Informatics, La Jolla, California, USA

Received 2011 Jul 4; Accepted 2011 Jul 6.

Abstract

Objectives

To provide an overview and tutorial of natural language processing (NLP) and modern NLP-system design.

Target audience

This tutorial targets the medical informatics generalist who has limited acquaintance with the principles behind NLP and/or limited knowledge of the current state of the art.

Scope

We describe the historical evolution of NLP, and summarize common NLP sub-problems in this extensive field. We then provide a synopsis of selected highlights of medical NLP efforts. After providing a brief description of common machine-learning approaches that are being used for diverse NLP sub-problems, we discuss how modern NLP architectures are designed, with a summary of the Apache Foundation's Unstructured Information Management Architecture. We finally consider possible future directions for NLP, and reflect on the possible impact of IBM Watson on the medical field.

Keywords: Natural language processing, Introduction, clinical NLP, knowledge bases, machine learning, predictive modeling, statistical learning, privacy technology

Introduction

This tutorial provides an overview of natural language processing (NLP) and lays a foundation for the JAMIA reader to better appreciate the articles in this issue.

NLP began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently: Manning et al 1 provide an excellent introduction to IR. With time, however, NLP and IR have converged somewhat. Currently, NLP borrows from several, very diverse fields, requiring today's NLP researchers and developers to broaden their mental knowledge-base significantly.

Early simplistic approaches, for example, word-for-word Russian-to-English machine translation,2 were defeated by homographs—identically spelled words with multiple meanings—and metaphor, leading to the apocryphal story of the Biblical 'the spirit is willing, but the flesh is weak' being translated to 'the vodka is agreeable, but the meat is spoiled.'

Chomsky's 1956 theoretical analysis of language grammars3 provided an estimate of the problem's difficulty, influencing the creation (1963) of Backus-Naur Form (BNF) notation.4 BNF is used to specify a 'context-free grammar'5 (CFG), and is commonly used to represent programming-language syntax. A language's BNF specification is a set of derivation rules that collectively validate program code syntactically. ('Rules' here are absolute constraints, not expert systems' heuristics.) Chomsky also identified still more restrictive 'regular' grammars, the basis of the regular expressions 6 used to specify text-search patterns. Regular expression syntax, defined by Kleene7 (1956), was first supported by Ken Thompson's grep utility8 on UNIX.
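To make these notions concrete, the sketch below pairs a regular expression (a 'regular' grammar, for text-search patterns) with a small BNF-style context-free grammar written in NLTK's grammar notation; both grammars and the example strings are invented for illustration, and the NLTK toolkit is assumed to be installed.

```python
import re
import nltk

# A regular expression (a 'regular' grammar) specifying a text-search pattern.
dose = re.compile(r"\d+(\.\d+)?\s*(mg|g|mcg)(/day)?")
print(bool(dose.fullmatch("10 mg/day")))   # True

# BNF-style context-free rules: each rule derives the left-hand symbol
# from the right-hand sequence; parsing validates a sentence syntactically.
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N
VP  -> V NP
Det -> 'the'
N   -> 'patient' | 'aspirin'
V   -> 'received'
""")
for tree in nltk.ChartParser(grammar).parse("the patient received the aspirin".split()):
    print(tree)
```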

Later (1970s), lexical-analyzer (lexer) generators and parser generators such as the lex/yacc combination9 utilized grammars. A lexer transforms text into tokens; a parser validates a token sequence. Lexer/parser generators simplify programming-language implementation greatly by taking regular-expression and BNF specifications, respectively, as input, and generating code and lookup tables that determine lexing/parsing decisions.
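In the spirit of lex, a miniature lexer can be generated from a regular-expression specification alone; the token set below is an invented example, not a real lex input file.

```python
import re

# Token specification as (name, regex) pairs -- the moral equivalent of a lex file.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(\.\d+)?"),
    ("WORD",   r"[A-Za-z]+"),
    ("SLASH",  r"/"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    """Transform text into a sequence of (token_type, value) pairs."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]

print(lex("10 mg/day"))
# [('NUMBER', '10'), ('WORD', 'mg'), ('SLASH', '/'), ('WORD', 'day')]
```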

While CFGs are theoretically inadequate for natural language,10 they are often employed for NLP in practice. Programming languages are typically designed deliberately with a restrictive CFG variant, an LALR(1) grammar (LALR, Look-Ahead parser with Left-to-right processing and Rightmost (bottom-up) derivation),4 to simplify implementation. An LALR(1) parser scans text left-to-right, operates bottom-up (ie, it builds compound constructs from simpler ones), and uses a look-ahead of a single token to make parsing decisions.

The Prolog language11 was originally invented (1970) for NLP applications. Its syntax is especially suited for writing grammars, although, in the easiest implementation mode (top-down parsing), rules must be phrased differently (ie, right-recursively12) from those intended for a yacc-style parser. Top-down parsers are easier to implement than bottom-up parsers (they don't need generators), but are much slower.
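The right-recursion point can be seen in a toy top-down (recursive-descent) parser: a right-recursive rule maps directly onto a self-call, whereas its left-recursive counterpart would recurse forever. The grammar and input below are invented for illustration.

```python
# Right-recursive rule: LIST -> item ',' LIST | item
# (The left-recursive variant, LIST -> LIST ',' item, would never terminate here.)
def parse_list(tokens, i=0):
    if i >= len(tokens) or tokens[i] == ",":
        raise SyntaxError("expected an item")
    if i + 1 < len(tokens) and tokens[i + 1] == ",":
        return [tokens[i]] + parse_list(tokens, i + 2)  # self-call: item ',' LIST
    return [tokens[i]]                                  # base case: item

print(parse_list(["aspirin", ",", "heparin", ",", "insulin"]))
# ['aspirin', 'heparin', 'insulin']
```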

The limitations of hand-written rules: the rise of statistical NLP

Natural language's vastly large size, unrestrictive nature, and ambiguity led to two problems when using standard parsing approaches that relied purely on symbolic, hand-crafted rules:

  • NLP must ultimately extract meaning ('semantics') from text: formal grammars that specify relationships between text units—parts of speech such as nouns, verbs, and adjectives—address syntax primarily. One can extend grammars to address natural-language semantics by greatly expanding sub-categorization, with additional rules/constraints (eg, 'eat' applies only to ingestible-item nouns). Unfortunately, the rules may now become unmanageably numerous, often interacting unpredictably, with more frequent ambiguous parses (multiple interpretations of a word sequence are possible). (Puns—ambiguous parses used for humorous effect—long predate NLP.)

  • Handwritten rules handle 'ungrammatical' spoken prose and (in medical contexts) the highly telegraphic prose of in-hospital progress notes very poorly, although such prose is human-comprehensible.

The 1980s resulted in a fundamental reorientation, summarized by Klein13:

  • Simple, robust approximations replaced deep analysis.

  • Evaluation became more rigorous.

  • Machine-learning methods that used probabilities became prominent. (Chomsky's book, Syntactic Structures14 (1957), had been skeptical about the usefulness of probabilistic language models.)

  • Large, annotated bodies of text (corpora) were employed to train machine-learning algorithms—the annotation contains the correct answers—and provided gold standards for evaluation.

This reorientation resulted in the birth of statistical NLP. For example, statistical parsing addresses parsing-rule proliferation through probabilistic CFGs15: individual rules have associated probabilities, determined through machine-learning on annotated corpora. Thus, fewer, broader rules replace numerous detailed rules, with statistical-frequency information looked up to disambiguate. Other approaches build probabilistic 'rules' from annotated data, similar to machine-learning algorithms such as C4.5,16 which build decision trees from feature-vector data. In any case, a statistical parser determines the most likely parse of a sentence/phrase. 'Most likely' is context-dependent: for example, the Stanford Statistical Parser,17 trained with the Penn TreeBank18—annotated Wall Street Journal articles, plus telephone-operator conversations—may be unreliable for clinical text. Manning and Schuetze's text provides an excellent introduction to statistical NLP.19
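The sketch below shows the idea of a probabilistic CFG using NLTK (assumed installed); the grammar, probabilities, and sentence are invented, whereas a real system would estimate the probabilities from an annotated corpus such as the Penn TreeBank.

```python
import nltk

# Each production carries a probability; those for a given left-hand side sum to 1.
pcfg = nltk.PCFG.fromstring("""
S   -> NP VP              [1.0]
NP  -> Det N [0.7] | 'scans' [0.3]
VP  -> V NP               [1.0]
Det -> 'the'              [1.0]
N   -> 'patient' [0.5] | 'scans' [0.5]
V   -> 'reviewed'         [1.0]
""")
# The Viterbi parser returns the single most probable parse.
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the patient reviewed the scans".split()):
    print(tree, tree.prob())
```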

Statistical approaches give good results in practice simply because, by learning with copious real data, they exploit the most common cases: the more abundant and representative the data, the better they get. They also degrade more gracefully with unfamiliar/erroneous input. This issue's articles make clear, however, that handwritten-rule-based and statistical approaches are complementary.

NLP sub-problems: application to clinical text

We enumerate common sub-problems in NLP: Jurafsky and Martin's text20 provides additional details. The solutions to some sub-problems have become workable and affordable, if imperfect—for example, speech synthesis (desktop operating systems' accessibility features) and connected-speech recognition (several commercial systems). Others, such as question answering, remain hard.

In the account below, we mention clinical-context issues that complicate certain sub-problems, citing recent biomedical NLP work for each where appropriate. (We do not cover the history of medical NLP, which has been applied rather than basic/theoretical; Spyns21 reviews pre-1996 medical NLP efforts.)

Low-level NLP tasks include the following (a brief illustrative sketch follows the list):

  1. Sentence boundary detection: abbreviations and titles ('m.g.,' 'Dr.') complicate this task, as do items in a list or templated utterances (eg, 'MI [x], SOB []').

  2. Tokenization: identifying individual tokens (word, punctuation) within a sentence. A lexer plays a core role for this task and the previous one. In biomedical text, tokens often contain characters typically used as token boundaries, for example, hyphens, forward slashes ('10 mg/day,' 'N-acetylcysteine').

  3. Part-of-speech assignment to individual words ('POS tagging'): in English, homographs ('set') and gerunds (verbs ending in 'ing' that are used as nouns) complicate this task.

  4. Morphological decomposition of compound words: many medical terms, for example, 'nasogastric,' need decomposition to comprehend them. A useful sub-task is lemmatization—conversion of a word to a root by removing suffixes. Non-English clinical NLP emphasizes decomposition; in highly synthetic languages (eg, German, Hungarian), newly coined compound words may replace entire phrases.22 Spell-checking applications and preparation of text for indexing/searching (in IR) also use morphological analysis.

  5. Shallow parsing (chunking): identifying phrases from constituent part-of-speech tagged tokens. For example, a noun phrase may comprise an adjective sequence followed by a noun.

  6. Problem-specific segmentation: segmenting text into meaningful groups, such as sections, including Chief Complaint, Past Medical History, HEENT, etc.23
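As a concrete illustration of tasks 1–3 and 5, here is a minimal sketch using the open-source NLTK toolkit; note that NLTK's models are trained on general English, so (per the next paragraph) they may work less well on clinical narrative, and the example sentence is invented.

```python
import nltk
# One-time model downloads, if not already present:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "The patient denies chest pain. Neuro exam was normal."

sentences = nltk.sent_tokenize(text)           # 1. sentence boundary detection
tokens    = nltk.word_tokenize(sentences[0])   # 2. tokenization
tagged    = nltk.pos_tag(tokens)               # 3. POS tagging

# 5. shallow parsing: chunk noun phrases (optional determiner, adjectives, nouns)
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))
```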

Haas24 lists publicly available NLP modules for such tasks: most modules, with the exception of cTAKES (clinical Text Analysis and Knowledge Extraction System),25 have been developed for non-clinical text and often work less well for clinical narrative.

Higher-level tasks build on low-level tasks and are usually problem-specific. They include:

  • 1. Spelling/grammatical error identification and recovery: this task is generally interactive because, as word-processing users know, it is far from perfect. Highly synthetic phrases predispose to false positives (correct words flagged as errors), and incorrectly used homophones (identically sounding, differently spelled words, eg, sole/soul, their/there) to false negatives.

  • 2. Named entity recognition (NER) 26 27: identifying specific words or phrases ('entities') and categorizing them—for example, as persons, locations, diseases, genes, or medication. A common NER task is mapping named entities to concepts in a vocabulary. This task often leverages shallow parsing for candidate entities (eg, the noun phrase 'chest tenderness'); however, sometimes the concept is divided across multiple phrases (eg, 'chest wall shows slight tenderness on pressure …').

The following issues make NER challenging:

  • Word/phrase order variation: for example, perforated duodenal ulcer versus duodenal ulcer, perforated.

  • Derivation: for example, suffixes transform one part of speech to another (eg, 'mediastinum' (noun) → 'mediastinal' (adjective)).

  • Inflection: for example, changes in number (eg, 'opacity/opacities'), tense (eg, 'cough(ed)'), comparative/superlative forms (eg, 'bigger/biggest').

  • Synonymy is abundant in biomedicine, for example, liver/hepatic, Addison's disease/adrenocortical insufficiency.

  • Homographs: polysemy refers to homographs with related meanings, for example, 'direct bilirubin' can refer to a substance, laboratory procedure, or result. Homographic abbreviations are increasingly numerous28: 'APC' has 12 expansions, including 'activated protein C' and 'adenomatous polyposis coli.'

  • 3. Word sense disambiguation (WSD) 29–31: determining a homograph's correct meaning.

  • 4. Negation and uncertainty identification 32–34: inferring whether a named entity is present or absent, and quantifying that inference's uncertainty. Around half of all symptoms, diagnoses, and findings in clinical reports are estimated to be negated.35 Negation can be explicit, for example, 'Patient denies chest pain,' or implied—for example, 'Lungs are clear upon auscultation' implies absence of abnormal lung sounds. Negated/affirmed concepts can be expressed with uncertainty ('hedging'), as in 'the ill-defined density suggests pneumonia.' Uncertainty that represents reasoning processes is difficult to capture: 'The patient probably has a left-sided cerebrovascular accident; post-convulsive state is less likely.' Negation, uncertainty, and affirmation form a continuum. Uncertainty detection was the focus of a recent NLP contest.36 (A minimal sketch of trigger-based negation detection appears after this list.)

  • 5. Relationship extraction: determining relationships between entities or events, such as 'treats,' 'causes,' and 'occurs with.' Lookup of problem-specific information—for example, thesauri, databases—facilitates relationship extraction.

Anaphora reference resolution 37 is a sub-task that determines relationships between 'hierarchically related' entities: such relationships include:

  • Identity: one entity—for example, a pronoun like 's/he,' 'hers/his,' or an abbreviation—refers to a previously mentioned named entity;

  • Part/whole: for example, city within state;

  • Superset/subset: for example, antibiotic/penicillin.

  • 6. Temporal inferences/relationship extraction 38 39: making inferences from temporal expressions and temporal relations—for example, inferring that something has occurred in the past or may occur in the future, and ordering events within a narrative (eg, medication X was prescribed after symptoms began).

  • 7. Information extraction (IE): the identification of problem-specific information and its transformation into (problem-specific) structured form. Tasks 1–6 are often part of the larger IE task. For example, extracting a patient's current diagnoses involves NER, WSD, negation detection, temporal inference, and anaphoric resolution. Numerous modern clinical IE systems exist,40–44 with some available as open source.25 44 45 IE and relationship extraction have been themes of several i2b2/VA NLP challenges.46–49 Other problem areas include phenotype characterization,50–52 biosurveillance,53 54 and adverse-drug-reaction recognition.55
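To make item 4 concrete, here is a minimal, NegEx-flavored sketch: a negation trigger negates a target concept appearing within a few words after it. The trigger set, window size, and sentences are illustrative assumptions, not the published algorithm,32 which also handles multi-word triggers, scope terminators, and hedge terms.

```python
import re

TRIGGERS = {"no", "denies", "without", "negative"}   # toy negation triggers

def is_negated(sentence, concept, window=5):
    """True if `concept` occurs within `window` words after a negation trigger."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if concept not in words:
        return False
    idx = words.index(concept)
    return any(w in TRIGGERS for w in words[max(0, idx - window):idx])

print(is_negated("Patient denies chest pain", "pain"))    # True
print(is_negated("Patient reports chest pain", "pain"))   # False
```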

The National Library of Medicine (NLM) provides several well-known 'knowledge infrastructure' resources that apply to multiple NLP and IR tasks. The UMLS Metathesaurus,56 which records synonyms and categories of biomedical concepts from numerous biomedical terminologies, is useful in clinical NER. The NLM's Specialist Lexicon57 is a database of common English and medical terms that includes part-of-speech and inflection data; it is accompanied by a set of NLP tools.58 The NLM also provides a test collection for word sense disambiguation.59

Some data-driven approaches: an overview

Statistical and machine learning involve development (or use) of algorithms that allow a program to infer patterns about example ('training') data, which in turn allows it to 'generalize'—make predictions about new data. During the learning phase, numerical parameters that characterize a given algorithm's underlying model are computed by optimizing a numerical measure, typically through an iterative process.

In general, learning can be supervised—each item in the training data is labeled with the correct answer—or unsupervised, where it is not, and the learning process tries to recognize patterns automatically (as in cluster and factor analysis). One pitfall in any learning approach is the potential for over-fitting: the model may fit the example data almost perfectly, but makes poor predictions for new, previously unseen cases. This is because it may learn the random noise in the training data rather than just its essential, desired features. Over-fitting risk is minimized by techniques such as cross-validation, which partition the example data randomly into training and test sets to internally validate the model's predictions. This process of data partitioning, training, and validation is repeated over several rounds, and the validation results are then averaged across rounds.
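A minimal cross-validation sketch using scikit-learn (the dataset and classifier are arbitrary choices for demonstration): the data are split into five train/test partitions, and the held-out scores are averaged.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: train on 4/5 of the data, score on the held-out 1/5,
# rotating the held-out fold; averaging the scores estimates generalization.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores.mean(), scores.std())
```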

Machine-learning models can be broadly classified as either generative or discriminative. Generative methods seek to create rich models of probability distributions, and are so called because, with such models, one can 'generate' synthetic data. Discriminative methods are more utilitarian, directly estimating posterior probabilities based on observations. Srihari60 explains the difference with an analogy: to identify an unknown speaker's language, generative approaches would apply deep knowledge of numerous languages to perform the match; discriminative methods would rely on a less knowledge-intensive approach of using differences between languages to find the closest match. Compared to generative models, which can become intractable when many features are used, discriminative models typically allow use of more features.61 Logistic regression and conditional random fields (CRFs) are examples of discriminative methods, while Naive Bayes classifiers and hidden Markov models (HMMs) are examples of generative methods.

Some common machine-learning methods used in NLP tasks, and utilized by several articles in this issue, are summarized below.

Support vector machines (SVMs)

SVMs, a discriminative learning approach, classify inputs (eg, words) into categories (eg, parts of speech) based on a feature set. The input may be transformed mathematically using a 'kernel function' to allow linear separation of the data points from different categories. That is, in the simplest two-feature case, a straight line would separate them in an X–Y plot: in the general N-feature case, the separator will be an (N−1)-dimensional hyperplane. The commonest kernel function used is a Gaussian (the basis of the 'normal distribution' in statistics). The separation process selects a subset of the training data (the 'support vectors'—data points closest to the hyperplane) that best differentiates the categories. The separating hyperplane maximizes the distance to support vectors from each category (see figure 1).


Figure 1 Support vector machines: a simple 2-D case is illustrated. The data points, shown as categories A (circles) and B (diamonds), can be separated by a straight line X–Y. The algorithm that determines X–Y identifies the data points ('support vectors') from each category that are closest to the other category (a1, a2, a3 and b1, b2, b3) and computes X–Y such that the margin that separates the categories on either side is maximized. In the general N-dimensional case, the separator will be an (N−1)-dimensional hyperplane, and the raw data will sometimes need to be mathematically transformed so that linear separation is achievable.

A tutorial by Hearst et al 62 and the DTREG online documentation63 provide accessible introductions to SVMs. Fradkin and Muchnik64 provide a more technical overview.
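The toy sketch below, using scikit-learn's SVC (an illustrative choice, not a tool discussed above), shows the basic fit/score cycle with a Gaussian (RBF) kernel on synthetic two-feature data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-feature, two-category data.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF (Gaussian) kernel: permits linear separation after implicit transformation.
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)
```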

Hidden Markov models (HMMs)

An HMM is a system where a variable can switch (with varying probabilities) between several states, generating one of several possible output symbols with each switch (also with varying probabilities). The sets of possible states and unique symbols may be large, but finite and known (see figure 2). We can observe the outputs, but the system's internals (ie, state-switch probabilities and output probabilities) are 'hidden.' The problems to be solved are:


Figure 2 Hidden Markov models. The small circles S1, S2 and S3 represent states. Boxes O1 and O2 represent output values. (In practical cases, hundreds of states/output values may occur.) The solid lines/arcs connecting states represent state switches; the arrow represents the switch's direction. (A state may switch back to itself.) Each line/arc label (not shown) is the switch probability, a decimal number. A dashed line/arc connecting a state to an output value indicates the 'output probability': the probability of that output value being generated from the particular state. If a particular switch/output probability is zero, the line/arc is not drawn. The sum of the switch probabilities leaving a given state (and the similar sum of output probabilities) is equal to 1. The sequential or temporal aspect of an HMM is shown in figure 3.

  A. Inference: given a particular sequence of output symbols, compute the probabilities of one or more candidate state-switch sequences.

  B. Pattern matching: find the state-switch sequence most likely to have generated a particular output-symbol sequence.

  C. Training: given examples of output-symbol sequence (training) data, compute the state-switch/output probabilities (ie, the system internals) that fit this data best.

B and C are actually Naive Bayesian reasoning extended to sequences; therefore, HMMs use a generative model. To solve these problems, an HMM uses two simplifying assumptions (which are true of numerous real-life phenomena):

  1. The probability of switching to a new state (or back to the same state) depends on the previous N states. In the simplest 'first-order' case (N=1), this probability is determined by the current state alone. (First-order HMMs are thus useful to model events whose likelihood depends on what happened last.)

  2. The probability of generating a particular output in a particular state depends only on that state.

These assumptions allow the probability of a given state-switch sequence (and a corresponding observed-output sequence) to be computed by simple multiplication of the individual probabilities. Several algorithms exist to solve these problems.65 66 The highly efficient Viterbi algorithm, which addresses problem B, finds applications in signal processing, for example, cell-phone technology.
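A bare-bones Viterbi decoder for problem B follows directly from the two assumptions above: extend each surviving path by one state, multiplying switch and output probabilities, and keep only the best path into each state. The two-state part-of-speech example is invented for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[s] = (probability, path) of the most likely state sequence ending in s.
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans_p[prev][s] * emit_p[s][o], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

states = ("Noun", "Verb")
start  = {"Noun": 0.6, "Verb": 0.4}
trans  = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit   = {"Noun": {"dogs": 0.7, "bark": 0.3}, "Verb": {"dogs": 0.1, "bark": 0.9}}

prob, path = viterbi(["dogs", "bark"], states, start, trans, emit)
print(path, prob)   # ['Noun', 'Verb'] 0.2646
```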

Theoretically, HMMs could be extended to a multivariate scenario,67 but the training problem can then become intractable. In practice, multiple-variable applications of HMMs (eg, NER68) use single, artificial variables that are uniquely determined composites of existing categorical variables: such approaches require much more training data.

HMMs are widely used for speech recognition, where a spoken word's waveform (the output sequence) is matched to the sequence of individual phonemes (the 'states') that most likely produced it. (Frederick Jelinek, a statistical-NLP advocate who pioneered HMMs at IBM's Speech Recognition Group, reportedly joked, 'every time a linguist leaves my group, the speech recognizer's performance improves.'20) HMMs also address several bioinformatics problems, for example, multiple sequence alignment69 and gene prediction.70 Eddy71 provides a lucid bioinformatics-oriented introduction to HMMs, while Rabiner72 (speech recognition) provides a more detailed introduction.

Commercial HMM-based speech-to-text is now robust enough to have essentially killed off academic research efforts, with dictation systems for specialized areas—eg, radiology and pathology—providing structured data entry. Phrase recognition is paradoxically more reliable for polysyllabic medical terms than for ordinary English: few word sequences sound like 'angina pectoris,' while common English has numerous homophones (eg, two/too/to).

Conditional random fields (CRFs)

CRFs are a family of discriminative models first proposed by Lafferty et al.73 An accessible reference is Culotta et al 74; Sutton and McCallum75 is more mathematical. The commonest (linear-chain) CRFs resemble HMMs in that the next state depends on the current state (hence the 'linear chain' of dependency).

CRFs generalize logistic regression to sequential data in the same way that HMMs generalize Naive Bayes (see figure 3). CRFs are used to predict the state variables ('Ys') based on the observed variables ('Xs'). For example, when applied to NER, the state variables are the categories of the named entities: we want to predict a sequence of named-entity categories within a passage. The observed variables might be the word itself, prefixes/suffixes, capitalization, embedded numbers, hyphenation, and so on. The linear-chain paradigm fits NER well: for example, if the previous entity is 'Salutation' (eg, 'Mr/Ms'), the succeeding entity must be a person.


Figure 3 The relationship between Naive Bayes, logistic regression, hidden Markov models (HMMs) and conditional random fields (CRFs). Logistic regression is the discriminative-model counterpart of Naive Bayes, which is a generative model. HMMs and CRFs extend Naive Bayes and logistic regression, respectively, to sequential data (adapted from Sutton and McCallum75). In the generative models, the arrows indicate the direction of dependency. Thus, for the HMM, the state Y2 depends on the previous state Y1, while the output X1 depends on Y1.

CRFs are better suited to sequential multivariate data than HMMs: the training problem, while requiring more example data than a univariate HMM, is still tractable.
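A linear-chain CRF sketch for NER-style tagging, assuming the third-party sklearn-crfsuite package (pip install sklearn-crfsuite); the features, tag set, and single training sentence are toy assumptions. Each token is described by a dictionary of observed variables; the labels are the state variables.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    return {
        "word":      tokens[i].lower(),
        "isupper":   tokens[i][0].isupper(),
        "suffix3":   tokens[i][-3:],
        "prev_word": tokens[i - 1].lower() if i else "<start>",
    }

train_sents  = [["Mr", "Smith", "denies", "angina"]]
train_labels = [["SALUTATION", "PERSON", "O", "PROBLEM"]]   # hypothetical tags
X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X))
```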

N-grams

An 'N-gram'19 is a sequence of N items—letters, words, or phonemes. We know that certain item pairs (or triplets, quadruplets, etc) are likely to occur much more frequently than others. For example, in English words, U always follows Q, and an initial T is never followed by K (though it may be in Ukrainian). In Portuguese, a Ç is always followed by a vowel (except E and I). Given sufficient data, we can compute frequency-distribution data for all N-grams occurring in that data. Because the permutations increase dramatically with N—for example, English has 26^2 possible letter pairs, 26^3 triplets, and so on—N is restricted to a modest number. Google has computed word N-gram data (N≤5) from its web data and from the Google Books project, and made it available freely.76

N-grams are a kind of multi-order Markov model: the probability of a particular item at the Nth position depends on the previous N−1 items, and can be computed from data. Once computed, N-gram data can be used for several purposes:

  • Suggested auto-completion of words and phrases to the user during search, as seen in Google's own interface.

  • Spelling correction: a misspelled word in a phrase may be flagged and a correct spelling suggested based on the correctly spelled neighboring words, as Google does.

  • Speech recognition: homophones ('two' vs 'too') can be disambiguated probabilistically based on correctly recognized neighboring words.

  • Word disambiguation: if we build 'word-meaning' N-grams from an annotated corpus where homographs are tagged with their correct meanings, we can use the non-ambiguous neighboring words to guess the correct meaning of a homograph in a test document.

N-gram data are voluminous—Google's N-gram database requires 28 GB—but this has become less of an issue as storage becomes cheap. Special data structures, called N-gram indexes, speed up search of such data. N-gram-based classifiers leverage raw training text without explicit linguistic/domain knowledge; while yielding good performance, they leave room for improvement, and are therefore complemented with other approaches.
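The core mechanics are easy to sketch: count adjacent word pairs (bigrams, N=2), then use the counts to estimate the most probable successor of a word, as in auto-completion. The corpus below is an invented toy.

```python
from collections import Counter, defaultdict

corpus = "the patient denies chest pain the patient reports chest tightness".split()

# Count bigram (N=2) frequencies.
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def most_likely_next(word):
    """Most frequent successor of `word` and its estimated probability."""
    counts = bigrams[word]
    successor, n = counts.most_common(1)[0]
    return successor, n / sum(counts.values())

print(most_likely_next("chest"))   # ('pain', 0.5): 'pain' and 'tightness' tie
```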

Chaining NLP analytical tasks: pipelines

Any practical NLP task must perform several sub-tasks. For example, all the low-level tasks of the 'NLP sub-problems' section must execute sequentially before higher-level tasks can commence. Since different algorithms may be used for a given task, a modular, pipelined system design—the output of one analytical module becomes the input to the next—allows 'mixing-and-matching.' Thus, a CRF-based POS tagger could be combined with rule-based medical named-entity recognition. This design improves system robustness: one could replace one module with another (possibly superior) module, with minimal changes to the rest of the system.
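A pipeline in miniature, with NLTK components standing in for the analytical modules: each stage is a black box consuming its predecessor's output, so any stage can be swapped independently. This is a sketch of the pipeline idea only, not of UIMA's actual CAS machinery described below.

```python
import nltk

def sentence_splitter(doc):   return nltk.sent_tokenize(doc)
def tokenizer(sents):         return [nltk.word_tokenize(s) for s in sents]
def pos_tagger(token_lists):  return [nltk.pos_tag(t) for t in token_lists]

PIPELINE = [sentence_splitter, tokenizer, pos_tagger]

def run(doc, pipeline=PIPELINE):
    result = doc
    for stage in pipeline:    # replace any element of PIPELINE to 'mix and match'
        result = stage(result)
    return result

print(run("The patient denies chest pain. Vitals stable."))
```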

This is the intention behind pipelined NLP frameworks such as GATE77 and IBM's (now Apache) Unstructured Information Management Architecture (UIMA).78 UIMA's scope goes beyond NLP: one could integrate structured-format databases, images, multi-media, and any arbitrary technology. In UIMA, each analytical task transforms (a copy of) its input by adding XML-based markup and/or reading/writing external data. A task operates on a Common Analysis Structure (CAS), which contains the data (possibly in multiple formats, eg, audio, HTML), a schema describing the analysis structure (ie, the details of the markup/external formats), the analysis results, and links (indexes) to the portions of the source data that they refer to. UIMA does not dictate the design of the analytical tasks themselves: they interact with the UIMA pipeline only through the CAS, and can be treated as black boxes: thus, different tasks could be written in different programming languages.

The schema for a particular CAS is developer-defined because it is usually problem-specific. (Currently, no standard schemas exist for tasks such as POS tagging, although this may change.) Definition is performed using XMI (XML Metadata Interchange), the XML-interchange equivalent of the Unified Modeling Language (UML). XMI, however, is 'programmer-hostile': it is easier to use a commercial UML tool to design a UML model visually and then generate XMI from it.79

In practice, a pure pipeline design may not be optimal for all solutions. In many cases, a higher-level process needs to provide feedback to a lower-level process to improve the latter's accuracy. (All supervised machine-learning algorithms, for example, ultimately rely on feedback.) Implementing feedback across analytical tasks is complicated: it involves modifying the code of communicating tasks—one outputting data that constitute the feedback, the other checking for the existence of such data, and accepting them if available (see figure 4). New approaches based on active learning may help select cases for manual labeling for construction of training sets.80 81


Figure 4 A UIMA pipeline. An input item is sequentially put through a series of tasks, with intermediate results at each step and final output at the end. Generally, the output of a task is the input of its successor, but exceptionally, a particular task may provide feedback to a previous one (as in task 4 providing input to task 1). Intermediate results (eg, successive transformations of the original data) are read from/written to the CAS, which contains metadata defining the formats of the data required at every step, the intermediate results, and annotations that link to these results.

Also, given that no NLP task achieves perfect accuracy, errors in any one process in a pipeline will propagate to the next, and so on, with accuracy degrading at each step. This problem, however, applies to NLP in general: it would occur even if the individual tasks were all combined into a single body of code. One way to address it (adopted in some commercial systems) is to apply alternative algorithms (in multiple or branching pipelines) and contrast the final results obtained. This allows tuning the output to trade-offs (high precision versus high recall, etc).

A look into the future

Recent advances in artificial intelligence (eg, computer chess) have shown that effective approaches utilize the strengths of electronic circuitry—high speed and large memory/disk capacity, problem-specific data-compression techniques and evaluation functions, highly efficient search—rather than trying to mimic human neural function. Similarly, statistical-NLP methods correspond minimally to human thought processes.

By comparison with IR, we now consider what it may take for multi-purpose NLP technology to become mainstream. While always important to library science, IR achieved major prominence with the web, notably after Google's scientific and financial success: the limelight also caused a corresponding IR research and toolset boom. The question is whether NLP has a similar quantum application in the wings. One candidate is IBM Watson, which attracted much attention within the biomedical informatics community (eg, the ACMI Discussion newsgroup and the AMIA NLP working group discussion list) after its 'Jeopardy' performance. Watson appears to address the admittedly difficult problem of question answering successfully. Although the Watson effort is impressive in many ways, its discernible limitations highlight ongoing NLP challenges.

IBM Watson: a wait-and-see viewpoint

Watson, which employs UIMA,82 is a systems-engineering triumph, using highly parallel hardware with 2880 CPUs and 16 TB of RAM. All its lookup of reference content (encyclopedias, dictionaries, etc) and analytical operations use structures optimized for in-memory manipulation. (By contrast, most pipelined NLP architectures on ordinary hardware are disk-I/O-bound.) It integrates several software technologies: IR, NLP, parallel database search, ontologies, and knowledge representation.

A Prolog parser extracts key elements such as the relationships between entities and task-specific answers. In a recent public display, the task was to compete for the fastest correct answer in a series of questions against two human contestants in the popular US-based television show 'Jeopardy.' During training with a Jeopardy question-databank, NLP is also used to pre-process online reference text (eg, encyclopedias, dictionaries) into a structure that provides evidence for candidate answers, including whether the relationships between entities in the question match those in the evidence.83 The search, and ranking of candidate answers, utilize IR approaches.

A challenge in porting Watson's technology to other domains, such as medical question answering, will be the degree to which Watson's design is generalizable.

  • Watson built its lead in the contest with straightforward direct questions whose answers many of the audience (and the skilled human contestants) clearly knew—and which an amateur human armed with Google may have been able to retrieve using keywords alone (albeit more slowly). As pointed out by Libresco84 and Jennings,85 Watson was simply faster on the buzzer—electronics beats human reaction time. For non-game-playing, real-world question-answering scenarios, however, split-second reaction time may not constitute a competitive advantage.

  • For harder questions, Watson's limitations became clearer. Computing the correct response to the question about which US city (Chicago) has two airports, one named after a World War II battle (Midway), the other after a World War II hero (O'Hare), involves three set intersections (eg, the first operation would cross names of airports in US cities against a list of World War II battles). Watson lacked a higher-level strategy to answer such complex questions.

  • Watson's Prolog parser and search, and especially the entire reference content, were tuned/structured for playing Jeopardy, in which the questions and answers are one sentence long (and the answer is of the form 'what/who is/are X?'). Such an approach runs the risk of 'over-fitting' the system to a particular problem, so that it may require significant effort to alter it for even a slightly different problem.

IBM recently conducted a medical-diagnosis demonstration of Watson, which is reported in an Associated Press article.86 Demonstrations eventually need to be followed by evaluations. Earlier medical diagnosis advice software underwent evaluations that were rigorous for their time, for example, Berner et al 87 and Friedman et al,88 and today's evaluations would need to be even more stringent. The articles by Miller and Masarie89 and Miller90 are excellent starting points for learning about the numerous pitfalls in the automated medical diagnosis domain, and IBM may rediscover these:

  • Medico-legal liability: ultimately the provider, not software, is responsible for the patient.

  • Reference-content reliability: determining the reliability of a given unit of evidence is challenging. Even some recent recommendations by 'authorities' have become tainted (eg, in psychiatry) with subsequent revelations of undisclosed conflicts of interest.

  • The limited role of NLP and unstructured text in medical diagnosis: it is unclear that accurate medical diagnosis/advice mandates front-end NLP technology: structured data entry with thesaurus/N-gram-assisted pick-lists or word/phrase completion might suffice. Similarly, diagnostic systems have used structured, curated data rather than unstructured text for prioritizing diagnoses. Even this data requires tailoring for local prevalence rates, and continual maintenance. Unstructured text, in the form of citations, is used mainly to support the structured data.

To be fair to IBM, NLP technology may conceivably augment web-crawler technologies that search for specific information and alert curators about new data that may require them to update their databases. Electronic IE technologies might save curation time, but given the medico-legal consequences, and the lack of 100% accuracy, such data would need to be verified by humans.

From an optimistic perspective, the Watson phenomenon may have the beneficial side effect of focusing attention not only on NLP, but also on the need to integrate it effectively with other technologies.

Will NLP software become a commodity?

The post-Google interest in IR has led to IR commoditization: a proliferation of IR tools and the incorporation of IR technology into relational database engines. Earlier, statistical packages and, later, data-mining tools also became commoditized. Commodity analytical software is characterized by:

  • Availability of several tools within a package: the user can often set up a pipeline without programming, using a graphical metaphor.

  • High user-friendliness and ease of learning: online documentation/tutorials are highly accessible for the non-specialist, focusing on when and how to use a particular tool rather than on its underlying mathematical principles.

  • High value in relation to cost: some offerings may even be freeware.

By contrast, NLP toolkits and UIMA are still oriented toward the advanced developer, and commercial offerings are expensive. General-purpose NLP is perhaps overdue for commoditization: if this happens, best-of-breed solutions are more likely to rise to the top. Again, analytics vendors are likely to lead the way, following the steps of biomedical informatics researchers in devising innovative solutions to the challenge of processing complex biomedical language in the diverse settings where it is employed.

Footnotes

Funding: This work is funded in part by grants from the National Institutes of Health (R01LM009520, U54HL108460, and UL1RR031980).

Competing interests: None.

Provenance and peer review: Commissioned; internally peer reviewed.

References

1. Manning C, Raghavan P, Schuetze H. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008

3. Chomsky N. Three models for the description of language. IRE Trans Inf Theory 1956;2:113–24

4. Aho AV, Sethi R, Ullman JD. Compilers: Principles, Techniques, Tools. Reading, MA: Addison-Wesley, 1988

5. Chomsky N. On certain formal properties of grammars. Inform Contr 1959;2:137–67

6. Friedl JEF. Mastering Regular Expressions. Sebastopol, CA: O'Reilly & Associates, Inc., 1997

7. Kleene SC. Representation of events in nerve nets and finite automata. In: Shannon C, McCarthy J, eds. Automata Studies. Princeton, NJ: Princeton University Press, 1956

8. Kernighan B, Pike R. The UNIX Programming Environment. Englewood Cliffs, NJ: Prentice-Hall, 1989

9. Levine JR, Mason T, Brown D. Lex & Yacc. Sebastopol, CA: O'Reilly & Associates, Inc., 1992

10. Joshi A, Vijay-Shanker K, Weir D. The convergence of mildly context-sensitive grammar formalisms. In: Sells P, Shieber S, Wasow T, eds. Foundational Issues in Natural Language Processing. Cambridge, MA: MIT Press, 1991:31–81

11. Clocksin WF, Mellish CS. Programming in Prolog: Using the ISO Standard. 5th edn. New York: Springer, 2003

14. Chomsky N. Syntactic Structures. The Hague, Netherlands: Mouton and Co, 1957

16. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 1993

19. Manning C, Schuetze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999

20. Jurafsky D, Martin JH. Speech and Language Processing. 2nd edn. Englewood Cliffs, NJ: Prentice-Hall, 2008

21. Spyns P. Natural language processing in medicine: an overview. Methods Inf Med 1996;35:285–301

22. Deleger L, Namer F, Zweigenbaum P. Morphosemantic parsing of medical compound words: transferring a French analyzer to English. Int J Med Inform 2009;78(Suppl 1):S48–55

23. Denny JC, Spickard A 3rd, Johnson KB, et al. Evaluation of a method to identify and categorize section headers in clinical documents. J Am Med Inform Assoc 2009;16:806–15

25. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13

26. Aronson A. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001:17–21

27. Zou Q, Chu WW, Morioka C, et al. IndexFinder: a method of extracting key concepts from clinical texts for indexing. AMIA Annu Symp Proc 2003:763–7

28. Liu H, Aronson A, Friedman C. A study of abbreviations in MEDLINE abstracts. Proc AMIA Symp 2002:464–8

29. Rindflesch TC, Aronson AR. Ambiguity resolution while mapping free text to the UMLS Metathesaurus. Proc Annu Symp Comput Appl Med Care 1994:240–4

31. Weeber M, Mork JG, Aronson AR. Developing a test collection for biomedical word sense disambiguation. Proc AMIA Symp 2001:746–50

32. Chapman W, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10

33. Mutalik P, Deshpande A, Nadkarni P. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc 2001;8:598–609

34. Huang Y, Lowe HJ. A novel hybrid approach to automated negation detection in clinical radiology reports. J Am Med Inform Assoc 2007;14:304–11

35. Chapman W, Bridewell W, Hanbury P, et al. Evaluation of negation phrases in narrative clinical reports. Proceedings of the AMIA Fall Symposium; 2001. Washington DC: Hanley & Belfus, Philadelphia, 2001:105–9

37. Savova GK, Chapman WW, Zheng J, et al. Anaphoric relations in the clinical narrative: corpus creation. J Am Med Inform Assoc 2011;18:459–65

38. Tao C, Solbrig H, Deepak S, et al. Time-oriented question answering from clinical narratives using semantic-web techniques. Berlin: Springer, Lecture Notes in Computer Science, 2011:6496. http://www.springerlink.com/content/67623p256743wv4u/ (accessed 20 Jul 2011)

39. Hripcsak G, Elhadad N, Chen YH, et al. Using empiric semantic correlation to interpret temporal assertions in clinical texts. J Am Med Inform Assoc 2009;16:220–7

40. Taira RK, Johnson DB, Bhushan V, et al. A concept-based retrieval system for thoracic radiology. J Digit Imaging 1996;9:25–36

41. Sager N, Lyman M, Nhan N, et al. Medical language processing: applications to patient data representation and automated encoding. Meth Inform Med 1995;34:140–6

42. Haug PJ, Ranum DL, Frederick PR. Computerized extraction of coded findings from free-text radiologic reports. Radiology 1990;174:543–8

43. Christensen L, Haug PJ, Fiszman M. MPLUS: a probabilistic medical language understanding system. Philadelphia, PA: IEEE. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain 2002:29–36. http://acl.ldc.upenn.edu/W/W02/W02-0305.pdf (accessed 20 Jul 2011)

44. Xu H, Friedman C, Stetson PD. Methods for building sense inventories of abbreviations in clinical notes. AMIA Annu Symp Proc 2008:819

45. Christensen L, Harkema H, Irwin J, et al. ONYX: a system for the semantic analysis of clinical text. Philadelphia, PA: IEEE. Proceedings of the BioNLP2009 Workshop of the ACL Conference 2009. http://www.aclweb.org/anthology/W/W09/W09-1303.pdf (accessed 20 Jul 2011)

46. Uzuner O, South B, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6

47. Uzuner O, Goldstein I, Luo Y, et al. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc 2008;15:14–24

48. Uzuner O. Recognizing obesity and comorbidities in sparse data. J Am Med Inform Assoc 2009;16:561–70

49. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc 2010;17:514–18

50. Chute C. The horizontal and vertical nature of patient phenotype retrieval: new directions for clinical text processing. Proc AMIA Symp; 2002. Washington DC: American Medical Informatics Association, 2002:165–9

51. Chen L, Friedman C. Extracting phenotypic information from the literature via natural language processing. Stud Health Technol Inform 2004;107:758–62

52. Wang X, Hripcsak G, Friedman C. Characterizing environmental and phenotypic associations using information theory and electronic health records. BMC Bioinformatics 2009;10:S13

53. Chapman WW, Fiszman M, Dowling JN, et al. Identifying respiratory findings in emergency department reports for biosurveillance using MetaMap. Stud Health Technol Inform 2004;107:487–91

54. Chapman WW, Dowling JN, Wagner MM. Fever detection from free-text clinical records for biosurveillance. J Biomed Inform 2004;37:120–7

55. Wang X, Hripcsak G, Markatou M, et al. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc 2009;16:328–37

56. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Meth Inform Med 1993;32:281–91

58. Divita G, Browne AC, Rindflesch TC. Evaluating lexical variant generation to improve information retrieval. Proc AMIA Symp 1998:775–9

62. Hearst MA, Dumais ST, Osman E, et al. Support vector machines. IEEE Intell Syst Appl 1998;13:18–28

64. Fradkin D, Muchnik I. Support vector machines for classification. In: Abello J, Cormode G, eds. Discrete Methods in Epidemiology. Piscataway, NJ: Rutgers State University of New Jersey, 2006:13–20

65. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inform Theor 1967;13:260–9

66. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc 1977;39:1–38

67. Hasegawa-Johnson M. Multivariate-state hidden Markov models for simultaneous transcription of phones and formants. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2000. Istanbul, Turkey: 2000:1323–26. http://www.isle.illinois.edu/pubs/2000/hasegawa-johnson00icassp.pdf (accessed 20 Jul 2011)

68. Zhang J, Shen D, Zhou G, et al. Exploring deep knowledge resources in biomedical name recognition. J Biomed Inform 2004;37:411–22

69. Sonnhammer ELL, Eddy SR, Birney E, et al. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 1998;26:320–2

70. Lukashin A, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 1998;26:1107–15

71. Eddy SR. What is a hidden Markov model? Nat Biotechnol 2004;22:1315–16

72. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989;77:257–86

73. Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proc 18th International Conf on Machine Learning; 2001:282–9. http://www.cis.upenn.edu/~pereira/papers/crf.pdf (accessed 20 Jul 2011)

75. Sutton C, McCallum A. An Introduction to Conditional Random Fields for Relational Learning. Amherst, MA: University of Massachusetts, 2004

77. University of Sheffield Natural Language Group. Information Extraction: the GATE pipeline. 2011. http://www.gate.ac.uk/ie (accessed 1 Jun 2011)

87. Berner ES, Webster GD, Shugerman AA, et al. Performance of four computer-based diagnostic systems. N Engl J Med 1994;330:1792–6

88. Friedman CP, Elstein AS, Wolf FM, et al. Enhancement of clinicians' diagnostic reasoning by computer-based consultation: a multisite study of 2 systems. JAMA 1999;282:1851–6

89. Miller R, Masarie F. The demise of the Greek oracle model of diagnostic decision support systems. Meth Inform Med 1990;29:1–8

90. Miller R. Medical diagnostic decision support systems—past, present, and future: a threaded bibliography and brief commentary. J Am Med Inform Assoc 1994;1:8–27

