Behavior of Various Machine Learning Models in the Face of Noisy Data

Michael D. Blechner, M.D.

MIT HST.951 Final project

Fall 2005

Abstract

Although a great deal of attention has been focused on the future potential of molecular-based cancer diagnosis, histologic examination of tissue specimens remains the mainstay of diagnosis. Histologic diagnosis entails identifying visual features on a slide and then recognizing the feature pattern to which the case belongs. The combination of image analysis and machine learning imitates this process and, in certain circumstances, may be able to aid the pathologist. However, there is a great deal of variability and noise inherent in such an approach. A classification model developed from data at one institution is therefore likely to perform acceptably at other institutions only if it can handle such variability. This paper compares the performance of machine learning models based on fuzzy rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and logistic regression (LR), and examines how these models behave in the face of noisy and variant data. Results suggest that FDT models may be more resistant to data noise.
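As an illustration of what "noisy data" can mean in such a comparison, the sketch below perturbs a feature matrix with zero-mean Gaussian noise scaled to each feature's standard deviation. This is an assumption made for illustration only: the abstract does not state how noise was introduced in the project, the function name add_gaussian_noise and its noise_level parameter are hypothetical, and the code is Python rather than the R environment used for the project.

```python
import random
import statistics


def add_gaussian_noise(rows, noise_level, seed=0):
    """Return a noisy copy of a feature matrix (a list of equal-length rows).

    ASSUMPTION: noise is zero-mean Gaussian, and its standard deviation for
    each column is noise_level times that column's standard deviation.
    The original project may have used a different noise model.
    """
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    columns = list(zip(*rows))
    # One noise scale per feature, so features with large ranges are not
    # perturbed less (in relative terms) than features with small ranges.
    scales = [noise_level * statistics.pstdev(col) for col in columns]
    return [
        [value + rng.gauss(0.0, scale) for value, scale in zip(row, scales)]
        for row in rows
    ]
```

A model trained on the clean features could then be scored on add_gaussian_noise(features, level) for increasing values of level, and the drop in accuracy compared across model types.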

Data & Software

Wisconsin Diagnostic Breast Cancer (WDBC) dataset, housed in the UCI Machine Learning Repository

1. Data - http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/wdbc.data

2. Documentation - http://www.ics.uci.edu/~mlearn/databases/breast-cancer-wisconsin/wdbc.names
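For readers who want to load the data file listed above, here is a minimal parsing sketch in Python (not the R code used for the project). It relies on the layout described in the dataset documentation: no header row, and each record holds an ID, a diagnosis code (M for malignant, B for benign) and 30 real-valued features. The function name parse_wdbc and the 0/1 label encoding are hypothetical choices made for this example.

```python
import csv
import io


def parse_wdbc(text):
    """Parse WDBC-formatted text into (ids, labels, features).

    Each record is expected to have 32 comma-separated fields:
    ID, diagnosis ("M" or "B"), then 30 numeric features.
    Labels are encoded as 1 for malignant and 0 for benign (an
    illustrative choice, not something the dataset prescribes).
    """
    ids, labels, features = [], [], []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines, e.g. a trailing newline
        if len(record) != 32:
            raise ValueError(f"expected 32 fields, got {len(record)}")
        if record[1] not in ("M", "B"):
            raise ValueError(f"unknown diagnosis code: {record[1]!r}")
        ids.append(record[0])
        labels.append(1 if record[1] == "M" else 0)
        features.append([float(value) for value in record[2:]])
    return ids, labels, features
```

With the file saved locally, a call such as parse_wdbc(open("wdbc.data").read()) would return the three lists; the local file name is an assumption.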

R statistical computing environment - http://www.r-project.org/

Ripley’s VR bundle for R, version 7.2-23. Original S development by Venables & Ripley. R port by Brian Ripley. http://cran.r-project.org/src/contrib/Archive/VR/VR_7.2-23.tar.gz