Named Entity Recognition (NER) Annotation for Clinical NLP

Well-annotated, gold-standard clinical text data for training and developing clinical NLP models to build the next version of the client's Healthcare API

The importance of clinical Natural Language Processing (NLP) has been increasingly recognized in recent years, and the field has produced transformative advances. Clinical NLP allows computers to interpret the rich meaning in a doctor’s written analysis of a patient. Its use cases range from population health analytics and clinical documentation improvement to speech recognition and clinical trial matching.

Developing and training clinical NLP models requires accurate, unbiased, and well-annotated datasets in large volumes. Gold-standard, diverse data helps improve the precision and recall of NLP engines.
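As a rough illustration of the two metrics mentioned above, the sketch below scores predicted entities against gold-standard annotations using exact span-and-label matching. The function name, the tuple layout (start, end, label), and the sample spans are assumptions made for this example, not part of the client's tooling.

```python
def precision_recall(gold, predicted):
    """Entity-level precision and recall with exact span-and-label matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical annotations: (start offset, end offset, entity label)
gold = [
    (0, 12, "Medical Condition"),
    (20, 29, "Medicine"),
    (35, 40, "Anatomical Structure"),
]
predicted = [
    (0, 12, "Medical Condition"),  # correct span and label
    (20, 29, "Medical Device"),    # correct span, wrong label
]

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact matching is the strictest choice; a partial-overlap criterion would score the same predictions more leniently.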

Challenges

The client wanted to train and develop their Natural Language Processing (NLP) platform with new entity types and to identify the relationships among those types. They were also evaluating vendors that offered high accuracy, complied with local laws, and had the medical knowledge required to annotate a large dataset.

The task was to label and annotate up to 20,000 records: up to 15,000 from inpatient and outpatient electronic health record (EHR) data and up to 5,000 from transcribed medical dictations, equally distributed across (1) geographical provenances and (2) available medical specialties.
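The arithmetic behind these volumes can be sketched as follows. The record counts come from the task description; the list of specialties and the helper name split_evenly are hypothetical, since the actual specialties are not listed.

```python
def split_evenly(total, groups):
    """Distribute `total` records across `groups` as evenly as possible."""
    base, extra = divmod(total, len(groups))
    # The first `extra` groups receive one additional record.
    return {g: base + (1 if i < extra else 0) for i, g in enumerate(groups)}


TOTAL_RECORDS = 20_000
EHR_RECORDS = 15_000
DICTATION_RECORDS = 5_000

ehr_share = EHR_RECORDS / TOTAL_RECORDS
dictation_share = DICTATION_RECORDS / TOTAL_RECORDS

# Hypothetical specialties, used only to show the even distribution.
specialties = ["cardiology", "oncology", "orthopedics"]
dictation_quota = split_evenly(DICTATION_RECORDS, specialties)

print(f"EHR share: {ehr_share:.0%}, dictation share: {dictation_share:.0%}")
print(dictation_quota)
```

The same helper could be applied to geographical provenances, or to each specialty within a provenance, depending on how the two distribution criteria are combined.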

To summarize, the challenges were to:

Organize heterogeneous clinical data to train the NLP platform
Identify the relationships between different entities to derive critical information
Bring the ability and expertise to label and annotate a broad set of complex clinical documents
Keep costs under control while labeling and annotating a large volume of data within the stipulated time frame
Annotate entities in a clinical dataset consisting of 75% EHR and 25% dictation records
De-identify the data at the time of delivery

Other Challenges in Natural Language Understanding

Ambiguity

A single word can have different meanings depending on the context, which creates ambiguity at the lexical, syntactic, and semantic levels.

Synonymy

The same idea can be expressed with different terms that are synonyms: "big" and "large" mean the same thing when describing an object.

Coreference

Different expressions in a text can refer to the same entity. The process of finding all such expressions is called coreference resolution.

Intention, Emotions

Depending on the speaker's personality, intentions, and emotions, the same idea might be expressed differently.

Solution

A large volume of medical data and knowledge is available in the form of medical documents, but it is mainly unstructured. With Medical Entity Annotation / Named Entity Recognition (NER) Annotation, Insights AI converted unstructured data into a structured format by annotating useful information in diverse types of clinical records. Once the entities were identified, the relationships among them were also mapped to surface critical information.

Scope of Work: Healthcare Entity Mention Annotation

9 Entity Types

Medical Condition
Medical Procedure
Anatomical Structure
Medicine
Medical Device
Body Measurement
Substance Abuse
Laboratory Data
Body Function

17 Modifiers

Medication Modifiers: Strength, Unit, Dose, Form, Frequency, Route, Duration, Status
Body Measurement Modifiers: Value, Unit, Result
Procedure Modifiers: Method
Laboratory Data Modifiers: Lab Value, Lab Unit, Lab Result
Severity
Procedure Result
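The entity types and modifiers listed above can be captured in a small schema that an annotation tool might validate against. This is a minimal sketch under stated assumptions: the names ENTITY_TYPES, MODIFIERS, and EntityMention are hypothetical, the medication modifier is spelled Form on the assumption that it refers to dosage form, and Severity and Procedure Result are placed under a hypothetical "General" group because the list does not say which entity types they attach to.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = [
    "Medical Condition", "Medical Procedure", "Anatomical Structure",
    "Medicine", "Medical Device", "Body Measurement",
    "Substance Abuse", "Laboratory Data", "Body Function",
]

MODIFIERS = {
    "Medicine": ["Strength", "Unit", "Dose", "Form",
                 "Frequency", "Route", "Duration", "Status"],
    "Body Measurement": ["Value", "Unit", "Result"],
    "Medical Procedure": ["Method"],
    "Laboratory Data": ["Lab Value", "Lab Unit", "Lab Result"],
    # Assumption: these two are not tied to one entity type in the list.
    "General": ["Severity", "Procedure Result"],
}


@dataclass
class EntityMention:
    """One annotated span of clinical text (character offsets, end exclusive)."""
    start: int
    end: int
    text: str
    entity_type: str
    modifiers: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        allowed = set(MODIFIERS.get(self.entity_type, [])) | set(MODIFIERS["General"])
        unknown = set(self.modifiers) - allowed
        if unknown:
            raise ValueError(f"modifiers not allowed here: {sorted(unknown)}")


# Hypothetical example sentence, not taken from the project data.
note = "Metformin 500 mg twice daily"
mention = EntityMention(
    start=0, end=9, text="Metformin", entity_type="Medicine",
    modifiers={"Strength": "500", "Unit": "mg", "Frequency": "twice daily"},
)

modifier_count = sum(len(names) for names in MODIFIERS.values())
print(len(ENTITY_TYPES), modifier_count, note[mention.start:mention.end])
```

Counting the entries is a quick consistency check against the stated totals of 9 entity types and 17 modifiers.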

27 Relationships & Patient Status

Outcome

The annotated data would be used to develop and train the client’s clinical NLP platform, which would be incorporated into the next version of their Healthcare API. The client derived the following benefits:

The labeled and annotated data met the client’s standard data annotation guidelines.
Heterogeneous datasets were used to train the NLP platform for greater accuracy.
Relationships between different entities (Anatomical body structure <> Medical Device, Medical Condition <> Medical Device, Medical Condition <> Medication, and Medical Condition <> Procedure) were identified to derive critical medical information.
The broad set of labeled and annotated data was also de-identified at the time of delivery.
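One way to represent the entity relationships named above is as a set of unordered type pairs that a validation step checks before a link is accepted. This is only a sketch: ALLOWED_RELATIONS and can_relate are hypothetical names, the pair labels are mapped onto the entity type names from the scope of work (Medicine for Medication, Medical Procedure for Procedure, Anatomical Structure for Anatomical body structure), and only the four pairs named above are included rather than the full set of relationships.

```python
# Unordered pairs, since "<>" suggests the relationship holds in either direction.
ALLOWED_RELATIONS = {
    frozenset({"Anatomical Structure", "Medical Device"}),
    frozenset({"Medical Condition", "Medical Device"}),
    frozenset({"Medical Condition", "Medicine"}),
    frozenset({"Medical Condition", "Medical Procedure"}),
}


def can_relate(type_a, type_b):
    """Return True if the two entity types may be linked by a relationship."""
    return frozenset({type_a, type_b}) in ALLOWED_RELATIONS


# Hypothetical annotated entities: id -> entity type
entities = {
    "e1": "Medical Condition",
    "e2": "Medicine",
    "e3": "Medical Device",
}

# Candidate links proposed by an annotator (pairs of entity ids).
candidates = [("e1", "e2"), ("e2", "e3"), ("e3", "e1")]
accepted = [
    (a, b) for a, b in candidates
    if can_relate(entities[a], entities[b])
]
print(accepted)
```

Using frozenset keys keeps the check symmetric, so the order in which an annotator selects the two entities does not matter.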

Our collaboration with Insights AI significantly advanced our project in Ambient Technology and Conversational AI within healthcare. Their expertise in creating and transcribing synthetic healthcare dialogues provided a solid foundation, showcasing the potential of synthetic data in overcoming regulatory challenges. With Insights AI, we navigated these hurdles and are now a step closer to realizing our vision of intuitive healthcare solutions.