The rest of the paper is organized as follows. Section 2 specifies the challenges in the studied data set. In Section 3, the proposed BPCM clustering algorithm is elaborated in detail, including its optimization procedure and a number of favorable properties in missing attribute imputation. Section 4 presents the bagging and fuzzy rule ensemble modules for classification with completed data. The experimental results on both the missing attribute estimation and the final classification, using the proposed approach and the existing methods, are presented in Section 5. Section 6 draws the conclusion.
2. Challenges and attempts
2.1. The missing attribute problem
The major problem in cervical cancer screening lies in the risk factor collection process. Since questionnaires concerning cervical cancer factors often involve queries on private information such as “number of sexual partners”, “pregnancy status” and other gynecological diseases, very few participants are willing to provide all the related information. Visual information has been found useful in cervical cancer screening; however, informative imaging methods might not always be available in developing countries, and thus a screening framework that does not rely on image data has a practical advantage. For example, in the dataset [17,18], only 6% of participants provided complete data and most of the data lack at least two components. In the worst case, some participants provide almost no informative component. Hence, missing data overwhelm the data set of cervical-cancer-related risk factors provided by Fernandes et al. Among all the thirty-two features, the most frequently missing attributes are the “time since the first/last sexually transmitted disease (STD) diagnosis”, which were ignored by over 90% of surveyed subjects. Around 12% of the subjects refused to provide any information about their STD situation, which caused ten to thirteen missing attributes in their corresponding feature vectors, leaving little information to be exploited. Table 1 lists some risk factors for several patients.
Table 1. An example of missing attributes in the risk factor data set, where “N/A” denotes the missing attributes.
Patient Age Number of Smoking status Hormonal Number of STDs Time since first STD Time since last STD Number
To deal with the missing attributes, three single-value imputation methods can be used:
1. Mean imputation/substitution: This kind of approach uses the average value of all the valid data of a specific attribute to fill the missing entries;
2. Regression imputation: These approaches assume that the data follow a linear/polynomial pattern. However, in cervical cancer screening, many attributes are Boolean-valued, which hinders this kind of solution;
3. Hot-deck imputation: By assuming a distance metric or a generative distribution over the data set, this family of approaches estimates each missing attribute by assigning the most probable value based on the inherent data distribution obtained from the complete data. Various kinds of clustering approaches have been widely used in hot-deck imputation, including hard/crisp C-means (HCM) clustering, fuzzy C-means (FCM) clustering, etc. A brief review of the clustering approaches is provided in Appendix A.
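As a point of reference for the first method in the list above, mean imputation can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the studied pipeline: the function name `mean_impute`, the use of NaN to encode a missing entry, and the toy matrix are all assumptions made here.

```python
import numpy as np


def mean_impute(X):
    """Fill each NaN with the mean of the observed values in its column.

    Assumption: missing entries are encoded as NaN. A column with no
    observed value at all stays NaN (np.nanmean also warns in that case).
    """
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)      # per-attribute mean over valid data
    missing = np.isnan(X)
    filled = X.copy()
    # np.where(missing)[1] gives the column index of every missing entry,
    # in the same row-major order used by boolean-mask assignment.
    filled[missing] = np.take(col_means, np.where(missing)[1])
    return filled


# Hypothetical toy data: 3 subjects, 3 attributes, NaN marks a missing answer.
X = np.array([
    [18.0, 1.0, np.nan],
    [34.0, np.nan, 3.0],
    [52.0, 0.0, 5.0],
])
X_filled = mean_impute(X)
```

Because every missing entry of an attribute receives the same value, this baseline ignores any relationship between attributes.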
In the cervical cancer screening task, the hot-deck imputation approaches are more popular because they can both estimate the missing value and provide informative knowledge about the inherent data distribution.
2.2. Missing attribute estimation based on data clustering
Fixing missing attributes by data clustering usually consists of the following two steps: First, perform clustering on the complete data set and adopt the converged cluster centroids as the representative patterns that depict the inherent structure of the entire data set. Second, for any data with missing attributes, the closest centroid is found based on the known attributes, then the missing values are filled with the corresponding components from that centroid. Generally speaking, the missing value estimation accuracy depends on the performance of the clustering approach.
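The two steps can be sketched as follows, using plain hard C-means (HCM) as the clustering step. This is a minimal sketch under stated assumptions rather than the proposed BPCM algorithm: the function names, the NaN encoding of missing entries, the Euclidean distance, the deterministic farthest-point initialization, and the toy data are all choices made here for illustration.

```python
import numpy as np


def hcm_centroids(X_complete, n_clusters, n_iter=100):
    """Step 1: hard C-means on the fully observed rows; returns centroids."""
    X_complete = np.asarray(X_complete, dtype=float)
    # Assumed initialization: start from the first row, then repeatedly add
    # the row farthest from the centroids chosen so far (deterministic).
    chosen = [X_complete[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(X_complete - c, axis=1) for c in chosen], axis=0)
        chosen.append(X_complete[d.argmax()])
    centroids = np.array(chosen, dtype=float)

    for _ in range(n_iter):
        # Assign every row to its nearest centroid.
        dists = np.linalg.norm(X_complete[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        updated = centroids.copy()
        for k in range(n_clusters):
            members = X_complete[labels == k]
            if len(members) > 0:
                updated[k] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids


def impute_from_centroids(X, centroids):
    """Step 2: fill NaNs with the components of the nearest centroid,
    where "nearest" is measured on the known attributes only."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        known = ~missing
        # If no attribute is known, all distances are 0 and centroid 0 is used.
        d = np.linalg.norm(centroids[:, known] - row[known], axis=1)
        filled[i, missing] = centroids[d.argmin(), missing]
    return filled


# Hypothetical toy data: two well-separated groups, NaN marks a missing entry.
X = np.array([
    [1.0, 1.0, 1.0],
    [1.2, 0.8, 1.0],
    [9.0, 9.0, 9.0],
    [9.2, 8.8, 9.0],
    [1.1, np.nan, 1.0],
    [np.nan, 9.1, 9.0],
])
complete_rows = X[~np.isnan(X).any(axis=1)]
centroids = hcm_centroids(complete_rows, n_clusters=2)
X_filled = impute_from_centroids(X, centroids)
```

In this sketch the quality of the filled values is bounded by how well the centroids describe the data, which mirrors the dependence on clustering performance noted above. When only a small fraction of rows is complete, as in the studied data set, the centroids are estimated from very few samples.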