A sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. A Sentence Detector can use different methods to detect a sentence. As shown in the picture below, CLAMP-Cancer provides three different models to detect a sentence:
Each model is described in detail in the following sections.
DF_CLAMP_Sentence_Detector is the default sentence detector in CLAMP-Cancer. It is designed specifically for clinical notes and takes into account the distinctive characteristics of sentences found in clinical texts. To configure the DF_CLAMP_Sentence_Detector, click on its config file. A pop-up window opens where you can modify two parameters: Medical Abbreviation and Max Sentence Length.
Some medical abbreviations have a punctuation mark at the beginning (e.g., ".NO2"), while others have one at the end (e.g., "spec."). Providing a list of such abbreviations helps the detector avoid treating these punctuation marks as sentence boundaries, so it identifies sentences more accurately. By default, CLAMP-Cancer provides a comprehensive list of medical abbreviations, which can be found in this file: defaultAbbrs.txt
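To illustrate why an abbreviation list matters, here is a minimal Python sketch of abbreviation-aware splitting. It is not CLAMP-Cancer's implementation: the function name `split_sentences`, the token-based rule, and the two-entry abbreviation list are assumptions made for the example.

```python
def split_sentences(text, abbreviations):
    """Split text on sentence-final punctuation, skipping known abbreviations.

    Hypothetical helper: a whitespace-delimited token closes a sentence
    when it ends with '.', '!' or '?' and is not a listed abbreviation.
    """
    known = {abbr.lower() for abbr in abbreviations}
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_mark = token[-1] in ".!?"
        if ends_with_mark and token.lower() not in known:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# Illustrative list only; the real defaults live in defaultAbbrs.txt.
abbrs = ["spec.", ".NO2"]
note = "Sent spec. to lab for review. Patient stable on .NO2 overnight."
print(split_sentences(note, abbrs))
```

With the list supplied, the note is split into two sentences. With an empty list, the same function also breaks after "spec.", which is the kind of error the Medical Abbreviation parameter is meant to prevent.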
Checking the "Break long sentences or not?" checkbox lets users break long sentences into segments of the number of words specified in the input textbox. Please refer to the following picture for more information.
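The effect of the Max Sentence Length setting can be sketched as follows. How CLAMP-Cancer chooses the break points is not described here, so this example assumes fixed-size chunks counted in whitespace-delimited words; the function name `break_long_sentence` is hypothetical.

```python
def break_long_sentence(sentence, max_words):
    """Break a sentence into pieces of at most max_words words.

    Assumption: words are whitespace-delimited and pieces are cut at
    fixed intervals, regardless of grammar or punctuation.
    """
    if max_words < 1:
        raise ValueError("max_words must be a positive integer")
    words = sentence.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


long_sentence = "pt seen today with cough fever chills and mild chest pain"
for piece in break_long_sentence(long_sentence, 4):
    print(piece)
```

A sentence that is already within the limit comes back as a single piece, so enabling the option only affects sentences longer than the specified value.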
This detector will identify new sentences using the line breaks in the file, i.e., each line in the file is treated as a single sentence.
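As a rough illustration of line-based detection, the sketch below treats every line of a text as one sentence. Trimming surrounding whitespace follows the sentence definition given earlier; skipping blank lines is an assumption of this example, and `sentences_by_newline` is a hypothetical name.

```python
def sentences_by_newline(text):
    """Return one sentence per non-empty line of the input text."""
    sentences = []
    for line in text.splitlines():
        trimmed = line.strip()
        if trimmed:  # assumption: blank lines do not produce sentences
            sentences.append(trimmed)
    return sentences


report = "Chief complaint: cough\n\nNo fever. No chills.\n  Follow up in 2 weeks  \n"
print(sentences_by_newline(report))
```

Note that the line "No fever. No chills." stays a single sentence, because punctuation inside a line is ignored by this method.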
This is an OpenNLP sentence detector. Advanced users can change its default model through its config.conf file.