DESCRIPTION (provided by applicant):
As electronic health records (EHRs) continue their expansion into clinical settings, there has been a corresponding increase in interest in mining the data they contain, both for research as well as for clinical decision support. Informaticists are increasingly studying ways to mine EHR textual content. This is an important trend, because there is a wealth of information contained in clinical text not represented anywhere else in the EHR. There is a low level text-as-data issue which presents a significant obstacle to the widespread use of available medical NLP systems: hand-typed clinical narratives in EHRs are usually ungrammatical; short or telegraphic in style; full of abbreviations, acronyms, and misspellings; formatted in a templated or pseudo-tabular form; and contain embedded non-text such as a list of laboratory values cut-and-pasted from elsewhere in the EHR. As we show in the Preliminary Studies Section, this makes high-level processing by popular tools like MedLEE and MetaMap effectively useless for all but a few "clean" document types like discharge summaries or consult reports (e.g., pathology or radiology reports). This in turn explains why there is so little published about what is certainly the preponderance of clinical texts, those that are not as well-behaved lexically and syntactically as a discharge summary.
In this application we distinguish clinical narratives (e.g., a progress note) from biomedical narratives (e.g., a PubMed abstract). We are interested in texts that arise in the clinical or research setting; texts that are composed by clinicians and researchers directly into a computer system. We propose to build and publish a tool called POET (Parsable Output Extracted from Text). POET will be designed to accept unstructured textual documents and return structured, linguistic equivalents that are, to the extent possible, parsable by higher-level NLP engines. POET will have an architecture is modular, extensible, and based on open-source platforms and sources (e.g., Java, Perl, UMLS, NegEx, the Stanford Parser, HL7 Clinical Document Architecture, caGRID, etc.). To implement POET, we will collect, program, and evaluate published as well as novel algorithms for: acronym/abbreviation resolution; spelling correction; template and pseudo-table re-writing; and removal of embedded non-text. To test POET we will use a large corpus of cross-discipline (e.g., medical, nursing, pharmacy, etc.) clinical note types, as well as the clinical research texts MedWatch reports and IRB adverse event reports. The development of POET will combine the best practices found in the literature and new research efforts as part of the project. To validate the fidelity of POET processing we plan a formal analysis of information loss and information gain pre- and post-process. To ensure broad access to the tools, POET will be released under an open-source license. Finally, we plan to assess the feasibility of offering POET as a Web service for remote processing.