DESCRIPTION (provided by applicant):
Biomedical language processing, the application of computational techniques to human-generated texts in biomedicine, is an increasingly important enabling technology for basic and applied biomedical research. The exponential growth of the peer-reviewed literature and the breakdown of disciplinary boundaries associated with high-throughput techniques have increased the importance of automated tools for keeping scientists abreast of all of the published material relevant to their work. However, despite decades of research, the performance of state-of-the-art tools for basic language processing tasks like information extraction and document retrieval remain below the level necessary for adequate utility and widespread adoption of this technology. The development, performance and evaluation of text mining systems depend crucially on the availability of appropriate corpora: collections of representative documents that have been annotated with human judgments relevant to a language-processing task. Corpora play two roles in the development of this technology: first, they act as "gold standards" by which alternative automated methods can be fairly compared, and second, they provide data for the training of statistical and machine learning systems that create empirical models of patterns in language use. The conventional view is that corpora are neutral, random samples of the domain of interest. Our preliminary work suggests that the restrictions in size, quality, genre, and representational schema of the small number of existing corpora are themselves a critical limiting factor for near-term breakthroughs in biomedical text processing technology. Therefore, we propose to test the following hypothesis: Creation of large, high-quality, biomedical corpora from multiple genres will lead to significant improvements in the performance of biomedical text mining systems and the creation of new approaches to text mining tasks. Specific aims include constructing several large corpora covering a range of genres and incorporating a rich knowledge representation; identifying factors that affect differential performance on full text versus abstracts; and developing new methods for language processing, especially of full text. Because improvements in the ability to automatically extract information from many textual genres will assist scientists and clinicians in the crucial task of keeping up with the burgeoning biomedical literature, the potential public health impact is quite large.