The unbridled growth of electronic health records (EHRs; Dutch: EPD) over the past decade has produced an abundance of clinical text that is largely unstructured and remains unused. A complicating factor is that EHRs in Dutch hospitals are managed by only three software vendors. According to the Dutch academic hospitals, this near-monopoly has left the interoperability of the EHR ---linking multiple information systems--- leaving much to be desired.
Nevertheless, this enormous volume of clinical text data ---big data--- lends itself to information extraction and text-mining techniques based on Artificial Intelligence (AI) models within the Natural Language Processing (NLP) application domain.
Speech-to-Text (STT), n-gram analysis, Named Entity Recognition (NER) and Relationship Extraction (RE) are key components of NLP information-extraction tasks that exploit terminological systems ---ontologies--- for healthcare, such as SNOMED.
Before this data-driven innovation becomes possible, one must have access to collections of written or spoken language ---corpora--- containing words that reflect the use of language within a specific application domain (field), such as allied healthcare in the Netherlands ---clinical psychologists, occupational therapists and physiotherapists---.
A major difficulty in enabling NLP of Dutch clinical narratives ---free texts--- is the lack of AI models: a wide range of models is available only in English. The Hugging Face Transformers framework offers a multitude of English Transformer models and variations of Bidirectional Encoder Representations from Transformers (BERT). Notably, the Hugging Face Hub hosts the Flair framework, which offers Dutch biomedical support by means of the BERTje Transformer model.
In conclusion, automated encoding of free-text clinical narratives using concepts from NLP is widely performed. However, the majority of the open-source NLP tools ---e.g. spaCy--- and terminological systems ---e.g. SNOMED--- involved are written in the English language (Cornet et al., 2012).
This project aims [1] to share practical knowledge about how to apply NLP techniques and [2] to create a custom, domain-specific ---medical--- corpus derived from clinical narratives ---allied-healthcare medical case studies--- through the use of data engineering (DE) and data science (DS) techniques and standards such as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
Corpora are collections of related documents that contain natural language. A corpus can be large or small, though corpora generally consist of dozens or even hundreds of gigabytes of data spread across thousands of documents. Corpora can be annotated, meaning that the texts or documents are labeled with the correct responses for supervised learning algorithms (e.g., to build a filter that detects spam email), or unannotated, making them candidates for topic modeling and document clustering (e.g., to explore shifts in latent themes within messages over time).
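The distinction between annotated and unannotated corpora can be sketched in a few lines of Python; the texts and labels below are invented purely for illustration:

```python
# A tiny annotated corpus: each document is paired with a supervised label.
# Texts and label names are hypothetical examples, not project data.
annotated_corpus = [
    ("Patient reports lower back pain after lifting.", "physiotherapy"),
    ("Client shows reduced fine motor control.", "occupational_therapy"),
    ("Mood has improved since the last session.", "clinical_psychology"),
]

# An unannotated corpus is just the raw documents, suitable for
# unsupervised tasks such as topic modeling or document clustering.
unannotated_corpus = [text for text, _ in annotated_corpus]

for text, label in annotated_corpus:
    print(f"{label}: {text}")
```

A supervised classifier would train on the (text, label) pairs, whereas clustering algorithms would consume only the raw document list.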
The end product should take the form of a well-documented digital protocol that can be readily employed by allied healthcare professionals to perform semantic and/or pragmatic NLP techniques, such as Named Entity Recognition (NER) and Relationship Extraction (RE), on Dutch clinical narratives.
That is, ultimately making clinical data freely exchangeable between the various professionals within the bachelor IVG and other educational or research institutes of Rotterdam University of Applied Sciences (RUAS).
This practice-based research project focuses on improving allied healthcare by applying state-of-the-art AI technologies. It is a highly transdisciplinary collaboration between IVG, the CMI Minor Data Engineering and the Prometheus Data-Lab of Rotterdam University of Applied Sciences ---RUAS---. Support is given by the RUAS Program for AI & Ethics, the Digital Competence Centre (DCC) for Practice-based Research ---DCC SURF pilot project--- and the RUAS Data Supported Healthcare team ---Zorgtech010 data-science unit---.
The raw data will be stored on Research Drive, an EU GDPR-compliant service provided by SURF. A data steward is responsible for creating and managing folder structures, user access, and quotas. Research Drive enables the use of Jupyter Notebooks.
Natural Language Processing (NLP) is a hybrid AI discipline that developed from linguistics and computer science to make human language intelligible to machines. The availability of computers in the 1960s gave rise to NLP applications known as computational linguistics. The structure of language is hierarchical, comprising seven levels, each of which constrains computational analysis.
Level (top-to-bottom) | Structure | Refers to |
---|---|---|
[1] | Phonology | Elementary sounds |
[2] | Morphology | Elementary combinations of letters and sounds, called Morphemes |
[3] | Lexical | Individual words formed of Morphemes, called Lexemes |
[4] | Syntax | Combination of words, grammatical structure of a sentence |
[5] | Semantics | Rules used to convey meaning using the lower levels |
[6] | Pragmatics | Behavioral constraints on the use of a specific language |
[7] | Discourse | Multiple sentences together, rules about how they should relate to each other |
Syntactic ---parsing--- and semantic ---semiotics--- analysis of text and speech determine the meaning of a sentence. Syntax refers to the grammatical structure of a sentence, while semantics refers to its intended meaning. By allowing computers to automatically analyze massive sets of data, NLP can find meaningful information in just milliseconds.
Natural Language Understanding (NLU): NLU is considered a "hard AI problem". The ambiguity and creativity of human language are just two of the characteristics that make NLP a demanding area to work in. The goal is to resolve ambiguities, obtain context and understand the meaning of what is being said; in particular, NLU tackles the complexities of language beyond the basic sentence structure. NLU is commonly used in text mining to understand consumer attitudes. Sentiment analysis, for example, enables brands to monitor their customer feedback more closely, allowing them to cluster positive and negative social-media comments and track net promoter scores. NLU can also establish a relevant ontology: a data structure that specifies the relationships between words and phrases. While humans do this naturally in conversation, a machine requires the combination of these analyses to understand the intended meaning of different texts.
Natural Language Generation (NLG): While NLU focuses on enabling computers to comprehend human language, NLG enables computers to write. Initially, NLG systems used templates to generate text: based on some data or query, an NLG system would fill in the blanks, like a game of Mad Libs. Over time, natural language generation systems have evolved with the application of hidden Markov chains, recurrent neural networks, and Transformers, enabling more dynamic text generation in real time. Given an internal representation, this involves selecting the right words and forming phrases and sentences. Sentences need to be ordered so that information is conveyed correctly. NLG produces a human-language text response based on some data input; this text can also be converted into a speech format through text-to-speech services.
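Early template-based NLG amounts to slot filling, which can be sketched in a few lines; the template and patient record below are hypothetical examples:

```python
# Template-based NLG: fill slots in a fixed sentence skeleton with data.
template = "Patient {name} attended {n} sessions and reports {trend} pain."

# Hypothetical structured input, e.g. from a clinical database.
record = {"name": "A. Jansen", "n": 4, "trend": "decreasing"}

sentence = template.format(**record)
print(sentence)
```

Modern neural NLG replaces this rigid slot filling with learned text generation, but the input/output contract ---structured data in, fluent text out--- is the same.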
NLU is about analysis ---understanding---, while NLG is about synthesis ---generating---. Sentiment analysis and semantic search are examples of NLU. Captioning an image or video is mainly an NLG task, since this type of input is not textual. Text summarization and chatbots are applications that involve both NLU and NLG: summarization, for instance, must understand the input documents and then generate summaries that maintain the integrity of the information.
As mentioned earlier, NLP software typically analyzes text by breaking it up into words (tokens) and sentences. Hence, any NLP pipeline has to start with a reliable system to split the text into sentences (sentence segmentation) and further split a sentence into words (word tokenization). On the surface, these seem like simple tasks, and you may wonder why they need special treatment.
NLP software typically works at the sentence level and expects a separation of words at the minimum. So, we need some way to split a text into words and sentences before proceeding further in a processing pipeline. Sometimes, we need to remove special characters and digits, and sometimes, we don’t care whether a word is in upper or lowercase and want everything in lowercase. Many more decisions like this are made while processing text. Such decisions are addressed during the pre-processing step of the NLP pipeline.
To harness NLP capabilities, high-quality open-source NLP tools are available that allow developers to discover valuable insights in unstructured texts, i.e., to deal with text-analysis problems such as classification, word ambiguity and sentiment analysis.
The inventory below covers state-of-the-art ---Python-based--- open-source natural language processing (NLP) tools and software: suites of libraries, frameworks, and applications for symbolic and statistical natural-language and speech processing.
Tool | NLP tasks | Distinctive features | Neural networks | Best for | Not suitable for |
---|---|---|---|---|---|
NLTK | Classification, tokenization, stemming, tagging, parsing, semantic reasoning | Over 50 corpora; package for chatbots; multilingual support | No | Training, education, research | Complex projects with large datasets
Gensim | Text similarity, text summarization, SOTA topic modeling | Scalability and high performance; unsupervised training | No | Converting words and documents into vectors | Supervised text modeling; full NLP pipeline
SpaCy | Tokenization, CNN tagging, parsing, named entity recognition, classification, sentiment analysis | Tokenization for 50+ languages (including Dutch); easy to learn and use | Yes | Business production | Teaching and research
Textacy | Tokenization, part-of-speech tagging, dependency parsing | High-performance library built on spaCy | No | Accessing and extending spaCy's core functionality | Beginners
Stanford CoreNLP Python Interface | Tokenization, multi-word-token expansion, lemmatization, POS tagging, dependency parsing | Different usage models; multilingual | Yes | Fully functional NLP systems, co-reference resolution | Beginners
TextBlob | POS tagging, noun-phrase extraction, sentiment analysis, classification, translation, spelling correction, etc. | Translation and spelling correction | No | NLP prototyping | Large-scale production
PyTorch-NLP | Word2Vec encoding, dataset sampling | Pre-trained neural-network embeddings | Yes | Rapid prototyping, research | Beginners
AllenNLP | High-level configuration language to implement many common approaches in NLP, such as Transformer experiments, multi-task training, vision+language tasks, fairness, and interpretability | Solving natural language processing tasks in PyTorch | Yes | Experimentation | Development has stopped
FlairNLP | Text extraction, word embeddings, named entity recognition, part-of-speech tagging, text classification | Word-sense disambiguation and classification; sentiment analysis | Yes | Biomedical datasets | Business production
Spark-NLP | NLP library for use with Apache Spark | Scales natively by extending Apache Spark | Yes | SOTA Transformers such as BERT and ELMo at scale | Beginners
# Example: generate word n-grams from a sentence with NLTK.
from nltk import ngrams

sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))

# ngrams() yields every run of n consecutive tokens as a tuple.
n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)
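For non-interactive n-gram analysis ---counting how often each word sequence occurs--- the standard library suffices; the sentence below is a made-up example:

```python
from collections import Counter

def ngram_counts(words, n):
    # Slide a window of size n over the word list and count each tuple.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Hypothetical sentence with a repeated bigram for demonstration.
words = "pain in the back pain in the knee".split()
bigrams = ngram_counts(words, 2)
print(bigrams.most_common(2))
```

Frequency counts like these are the basis of the n-gram analysis mentioned earlier, e.g. for spotting recurring phrases in clinical narratives.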
- NLP reference documentation: https://miro.com/app/board/uXjVOa_6fiQ=/?share_link_id=647822840290
- https://robfvdw.medium.com/a-generic-approach-to-data-driven-activities-d85ad558b5fa
- https://nictiz.nl/publicaties/snomed-ct-meer-dan-een-terminologiestelsel/
- https://www.zorgvisie.nl/content/uploads/sites/2/2018/04/Epd-overzicht2018.pdf
- Cornet et al. (2012). Inventory of Tools for Dutch Clinical Language Processing.