The unbridled growth of electronic health records (EHRs; Dutch: EPD) over the past decade has produced an abundance of clinical text that is largely unstructured and remains unused. A complicating factor is that EHRs in Dutch hospitals are managed by only three software vendors. According to the Dutch academic hospitals, this near-monopoly has left the interoperability of the EHR ---linking multiple information systems--- leaving much to be desired.
Nevertheless, this enormous volume of clinical text data ---big data--- lends itself to information extraction and text-mining techniques based on Artificial Intelligence (AI) models within the Natural Language Processing (NLP) application domain.
Speech-to-Text (STT), n-gram analysis, Named Entity Recognition (NER) and Relationship Extraction (RE) are key components of NLP information-extraction tasks that exploit terminological systems ---ontologies--- for healthcare, such as SNOMED.
Before this data-driven innovation becomes possible, one must have access to collections of written or spoken language ---corpora--- containing words that reflect the use of language within a specific application domain (field), such as allied healthcare in the Netherlands ---clinical psychologists, occupational therapists and physiotherapists---.
A major difficulty in enabling NLP of Dutch clinical narratives ---free texts--- is the lack of AI models: a wide range of models is available only in English. The Hugging Face Transformers framework offers a multitude of English Transformer models and variations of Bidirectional Encoder Representations from Transformers (BERT). Notably, the Hugging Face Hub hosts the Flair framework, which offers Dutch biomedical support by means of the BERTje Transformer model.
In conclusion, automated encoding of free-text clinical narratives using concepts from NLP is widely performed. However, the majority of the open-source NLP tools ---e.g. spaCy--- and terminological systems ---e.g. SNOMED--- involved are written in the English language (Cornet et al., 2012).
This project aims [1] to share practical knowledge about how to apply NLP techniques and [2] to create a custom, domain-specific ---medical--- corpus derived from clinical narratives ---allied-healthcare medical case studies--- through the use of data engineering (DE) and data science (DS) techniques and standards such as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
Corpora are collections of related documents that contain natural language. A corpus can be large or small, though corpora generally consist of dozens or even hundreds of gigabytes of data spread across thousands of documents. Corpora can be annotated, meaning that the texts or documents are labeled with the correct responses for supervised learning algorithms (e.g., to build a filter that detects spam email), or unannotated, making them candidates for topic modeling and document clustering (e.g., to explore shifts in latent themes within messages over time).
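The distinction between annotated and unannotated corpora can be sketched in a few lines of Python; the texts and labels below are invented purely for illustration:

```python
# A tiny annotated corpus: each document is paired with a supervised label.
# Texts and label names are hypothetical examples, not project data.
annotated_corpus = [
    ("Patient reports lower back pain after lifting.", "physiotherapy"),
    ("Client shows reduced fine motor control.", "occupational_therapy"),
    ("Mood has improved since the last session.", "clinical_psychology"),
]

# An unannotated corpus is just the raw documents, suitable for
# unsupervised tasks such as topic modeling or document clustering.
unannotated_corpus = [text for text, _ in annotated_corpus]

for text, label in annotated_corpus:
    print(f"{label}: {text}")
```

A supervised classifier would train on the (text, label) pairs, whereas clustering algorithms would consume only the raw document list.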
The end product should take the form of a well-documented digital protocol that can be readily employed by allied healthcare professionals to perform semantic and/or pragmatic NLP techniques, such as Named Entity Recognition (NER) and Relationship Extraction (RE), on Dutch clinical narratives.
That is, ultimately making clinical data freely exchangeable between the various professionals within the bachelor IVG and other educational or research institutes of Rotterdam University of Applied Sciences (RUAS).
This practice-based research project focuses on improving allied healthcare by applying state-of-the-art AI technologies. It is a highly transdisciplinary collaboration between IVG, the CMI Minor Data Engineering and the Prometheus Data-Lab of Rotterdam University of Applied Sciences ---RUAS---. Support is given by the RUAS Program for AI & Ethics, the Digital Competence Centre (DCC) for Practice-based Research ---DCC SURF pilot project--- and the RUAS Data Supported Healthcare team ---Zorgtech010 data-science unit---.
The raw data will be stored on Research Drive, an EU GDPR-compliant service provided by SURF. A data steward is responsible for creating and managing folder structures, user access, and quotas. Research Drive enables the use of Jupyter Notebooks.
Natural Language Processing (NLP) is a hybrid AI discipline that developed from linguistics and computer science to make human language intelligible to machines. The availability of computers in the 1960s gave rise to NLP applications known as computational linguistics. The structure of language is hierarchical, comprising seven levels, each of which constrains computational analysis.
Level (top-to-bottom) | Structure | Refers to |
---|---|---|
[1] | Phonology | Elementary sounds |
[2] | Morphology | Elementary combinations of letters and sounds, called Morphemes |
[3] | Lexical | Individual words formed of Morphemes, called Lexemes |
[4] | Syntax | Combination of words, grammatical structure of a sentence |
[5] | Semantics | Rules used to convey meaning using the lower levels |
[6] | Pragmatics | Behavioral constraints on the use of a specific language |
[7] | Discourse | Multiple sentences together, rules about how they should relate to each other |
Syntactic ---parsing--- and semantic ---semiotics--- analysis of text and speech determine the meaning of a sentence. Syntax refers to the grammatical structure of a sentence, while semantics refers to its intended meaning. By allowing computers to automatically analyze massive sets of data, NLP can find meaningful information in just milliseconds.
Natural Language Understanding (NLU): NLU is considered a "hard AI problem". The ambiguity and creativity of human language are just two of the characteristics that make NLP a demanding area to work in. The goal is to resolve ambiguities, obtain context and understand the meaning of what is being said; in particular, NLU tackles the complexities of language beyond the basic sentence structure. NLU is commonly used in text mining to understand consumer attitudes. Sentiment analysis, for example, enables brands to monitor their customer feedback more closely, allowing them to cluster positive and negative social-media comments and track net promoter scores. NLU can also establish a relevant ontology: a data structure that specifies the relationships between words and phrases. While humans do this naturally in conversation, a machine requires the combination of these analyses to understand the intended meaning of different texts.
Natural Language Generation (NLG): While NLU focuses on enabling computers to comprehend human language, NLG enables computers to write. Initially, NLG systems used templates to generate text: based on some data or query, an NLG system would fill in the blanks, like a game of Mad Libs. Over time, natural language generation systems have evolved with the application of hidden Markov chains, recurrent neural networks, and Transformers, enabling more dynamic text generation in real time. Given an internal representation, this involves selecting the right words and forming phrases and sentences. Sentences need to be ordered so that information is conveyed correctly. NLG produces a human-language text response based on some data input; this text can also be converted into a speech format through text-to-speech services.
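Early template-based NLG amounts to slot filling, which can be sketched in a few lines; the template and patient record below are hypothetical examples:

```python
# Template-based NLG: fill slots in a fixed sentence skeleton with data.
template = "Patient {name} attended {n} sessions and reports {trend} pain."

# Hypothetical structured input, e.g. from a clinical database.
record = {"name": "A. Jansen", "n": 4, "trend": "decreasing"}

sentence = template.format(**record)
print(sentence)
```

Modern neural NLG replaces this rigid slot filling with learned text generation, but the input/output contract ---structured data in, fluent text out--- is the same.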
NLU is about analysis ---understanding---, while NLG is about synthesis ---generating---. Sentiment analysis and semantic search are examples of NLU. Captioning an image or video is mainly an NLG task, since this type of input is not textual. Text summarization and chatbots are applications that involve both NLU and NLG: summarization, for instance, must understand the input documents and then generate summaries that maintain the integrity of the information.
As mentioned earlier, NLP software typically analyzes text by breaking it up into words (tokens) and sentences. Hence, any NLP pipeline has to start with a reliable system to split the text into sentences (sentence segmentation) and further split a sentence into words (word tokenization). On the surface, these seem like simple tasks, and you may wonder why they need special treatment.
NLP software typically works at the sentence level and expects a separation of words at the minimum. So, we need some way to split a text into words and sentences before proceeding further in a processing pipeline. Sometimes, we need to remove special characters and digits, and sometimes, we don’t care whether a word is in upper or lowercase and want everything in lowercase. Many more decisions like this are made while processing text. Such decisions are addressed during the pre-processing step of the NLP pipeline.
To harness NLP capabilities, high-quality open-source NLP tools are available that allow developers to discover valuable insights in unstructured texts, i.e., to deal with text-analysis problems such as classification, word ambiguity and sentiment analysis.
The inventory below covers state-of-the-art ---Python-based--- open-source natural language processing (NLP) tools and software: suites of libraries, frameworks, and applications for symbolic and statistical natural-language and speech processing.
Tool | NLP tasks | Distinctive features | Neural networks | Best for | Not suitable for |
---|---|---|---|---|---|
NLTK | Classification, tokenization, stemming, tagging, parsing, semantic reasoning | Over 50 corpora; package for chatbots; multilingual support | No | Training, education, research | Complex projects with large datasets
Gensim | Text similarity, text summarization, SOTA topic modeling | Scalability and high performance; unsupervised training | No | Converting words and documents into vectors | Supervised text modeling; full NLP pipeline
SpaCy | Tokenization, CNN tagging, parsing, named entity recognition, classification, sentiment analysis | Tokenization for 50+ languages (including Dutch); easy to learn and use | Yes | Business production | Teaching and research
Textacy | Tokenization, part-of-speech tagging, dependency parsing | High-performance library built on spaCy | No | Accessing and extending spaCy's core functionality | Beginners
Stanford CoreNLP Python Interface | Tokenization, multi-word-token expansion, lemmatization, POS tagging, dependency parsing | Different usage models; multilingual | Yes | Fully functional NLP systems, co-reference resolution | Beginners
TextBlob | POS tagging, noun-phrase extraction, sentiment analysis, classification, translation, spelling correction, etc. | Translation and spelling correction | No | NLP prototyping | Large-scale production
PyTorch-NLP | Word2Vec encoding, dataset sampling | Pre-trained neural-network embeddings | Yes | Rapid prototyping, research | Beginners
AllenNLP | High-level configuration language to implement many common approaches in NLP, such as Transformer experiments, multi-task training, vision+language tasks, fairness, and interpretability | Solving natural language processing tasks in PyTorch | Yes | Experimentation | Development has stopped
FlairNLP | Text extraction, word embeddings, named entity recognition, part-of-speech tagging, text classification | Word-sense disambiguation and classification; sentiment analysis | Yes | Biomedical datasets | Business production
Spark-NLP | NLP library for use with Apache Spark | Scales natively by extending Apache Spark | Yes | SOTA Transformers such as BERT and ELMo at scale | Beginners
# Example: generate word n-grams from a sentence with NLTK.
from nltk import ngrams

sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))

# ngrams() yields every run of n consecutive tokens as a tuple.
n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)
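For non-interactive n-gram analysis ---counting how often each word sequence occurs--- the standard library suffices; the sentence below is a made-up example:

```python
from collections import Counter

def ngram_counts(words, n):
    # Slide a window of size n over the word list and count each tuple.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Hypothetical sentence with a repeated bigram for demonstration.
words = "pain in the back pain in the knee".split()
bigrams = ngram_counts(words, 2)
print(bigrams.most_common(2))
```

Frequency counts like these are the basis of the n-gram analysis mentioned earlier, e.g. for spotting recurring phrases in clinical narratives.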
- NLP reference documentation: https://miro.com/app/board/uXjVOa_6fiQ=/?share_link_id=647822840290
- https://robfvdw.medium.com/a-generic-approach-to-data-driven-activities-d85ad558b5fa
- https://nictiz.nl/publicaties/snomed-ct-meer-dan-een-terminologiestelsel/
- https://www.zorgvisie.nl/content/uploads/sites/2/2018/04/Epd-overzicht2018.pdf
- Cornet et al. (2012). Inventory of Tools for Dutch Clinical Language Processing.