Data Mining Report

First, we preprocess the Iris data set and repair missing data. We then implement the K-Means algorithm in Java, obtaining a high accuracy rate and high efficiency. Next, we use the Weka toolkit to compare K-Means with other algorithms, including decision trees. Finally, we compare our results with other published results; the comparison shows that our algorithm has very high accuracy and effectiveness.

1.0 Introduction

We implement clustering through the K-Means algorithm and then validate the implementation using the online Iris data set. When we analyze the final clustering result, we find that its accuracy is very high. This result shows that the K-Means algorithm achieves high accuracy and efficiency on this data set. To put the clustering performance in better perspective, we found a report that studies algorithm performance on the same data set. After comparing with it, we find that the K-Means model is not lower in accuracy than the other optimized classification algorithms, while also running the fastest among these optimized clustering algorithms. Therefore, for a real system, K-Means is undoubtedly worth considering first among classification algorithms when weighing accuracy against efficiency.

2.0 Related Works

2.1 Clustering analysis

Clustering divides the samples of a data set into different categories according to a certain standard (often cosine similarity), so that samples within a class are as similar as possible while the degree of correlation between classes is low. Viewed through this computation, each resulting cluster is a relatively dense region of similar samples, with relatively sparse regions between clusters.

Clustering algorithms fall into several categories: partition-based methods, density-based methods, the Kohonen (neural-network) method, and hierarchical clustering. Hierarchical clustering organizes data samples into different levels, giving them an obvious stratified structure; the resulting tree structure reflects the running state of the algorithm well. Partition (flat) clustering repeatedly optimizes a loss-function value to divide the data set into substructures. Density-based clustering considers the neighborhood of each instance, judging changes in density within that neighborhood to obtain a density-based partition.

The Kohonen method is a clustering method based on neural networks, while K-Means clustering is easy to understand and accurate.

2.2 K-Means algorithm

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult; however, efficient heuristic algorithms are commonly employed that converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both employ an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes. The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1956.

The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn’t published outside of Bell Labs until 1982. In 1965, E. W.

Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy. A more efficient version was proposed, and published in Fortran, by Hartigan and Wong in 1975/1979.

The process of the K-Means algorithm is:

1. Choose k objects from the n data points as the initial cluster centers.
2. According to the cluster centers, calculate the distance from each element to every center, select the minimum distance, and assign the element to the corresponding cluster.
3. Recalculate every cluster center (the mean of the elements in the cluster).
4. Repeat steps (2) and (3); when no cluster center changes any longer, the algorithm stops.

Given input k, K-Means clustering produces k clusters: the m data objects are divided into k clusters such that the similarity of objects within the same cluster is high and the similarity of objects in different clusters is low. Here similarity is measured through the element means of a cluster, namely by calculating the cluster center point.

K-Means algorithm description: first, randomly choose k objects as the cluster centers; every other element in the data collection is assigned to the cluster whose center lies at the minimum distance from it. Then recompute each center point (cluster center) as the mean of the elements in its cluster, and repeat this process until the cluster centers no longer change, or change by less than a predefined threshold. The corresponding formula is as follows:

E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|²

where E is the sum of squared errors over all objects in the data set, p is a point in the data set, and m_i is the mean of cluster C_i. For each object in a cluster, first calculate the squared distance from the object to the cluster center, then sum these distances over all objects. Minimizing this sum makes the generated k clusters as separated and compact as possible. Take k items from the data set D as the initial cluster centers, where k is the number of final classes. Compute the distance between every remaining element and each cluster center, and assign each element to a cluster by the minimum-distance rule. Then recalculate each cluster center, that is, the mean value of the elements in the cluster.
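The steps above can be sketched in Java. This is a minimal illustration, not the report's actual implementation: the class and method names are our own, and for simplicity we take the first k rows as initial centers instead of choosing them randomly.

```java
public class KMeans {
    // Run K-Means on the rows of 'data'; returns the cluster index of each row.
    public static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        double[][] centers = new double[k][];
        // Step 1: choose k objects as the initial cluster centers
        // (here simply the first k rows, for determinism).
        for (int c = 0; c < k; c++) centers[c] = data[c].clone();
        int[] label = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Step 2: assign each element to the nearest cluster center.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = dist2(data[i], centers[0]);
                for (int c = 1; c < k; c++) {
                    double dd = dist2(data[i], centers[c]);
                    if (dd < bestDist) { bestDist = dd; best = c; }
                }
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            // Step 3: recompute each center as the mean of its cluster's elements.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[label[i]]++;
                for (int j = 0; j < d; j++) sum[label[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0)
                    for (int j = 0; j < d; j++) centers[c][j] = sum[c][j] / count[c];
            // Step 4: stop once no assignment changes.
            if (!changed) break;
        }
        return label;
    }

    // Squared Euclidean distance between two points (the |p - m_i|^2 term).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }
}
```

Using the squared distance for assignment gives the same result as the Euclidean distance while avoiding the square root, which matches the sum-of-squared-errors objective E above.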

Finally, determine whether any cluster center changed, or whether its variation is less than a threshold value; if neither, finish.

Fig. 1: Flowchart of the K-Means algorithm.

The K-Means algorithm has the following characteristics:

1. It only finds ball-shaped (spherical) clusters.

2. It can only handle numeric data types.

3. It can effectively handle large data sets.

The K-Means algorithm can be formally described by its implementation process:

Input:
Matrix U: the input dataset.

M: the number of records in the dataset.
N: the number of attributes in each record.
K: the number of clusters (the number of cluster centers).

Output:
Matrix P: the dataset with cluster numbers.

1. Choose "CSV (*.csv)" as the save type and click Save; we then get the "total_data.csv" file.
2. Next, open the Explorer of Weka, click the "Open file" button, and open the "total_data" file we just got; click the "Save" button, enter "Iris" as the file name in the following dialogue box, and choose "Arff data files (*.arff)" as the file type. The data file obtained this way is "total_data.arff".

3.2.2 Data conversion (z-score)

This method uses the mean value and standard deviation of the original data to standardize the data: the original value x of attribute A is normalized to x' by the z-score. Z-score standardization is suitable when the maximum and minimum values of attribute A are unknown, or when outliers lie beyond the range of the data.

New data = (original data − mean value) / standard deviation.

SPSS's default standardization is the same as the z-score method, as follows:

1. For each variable (index), calculate the arithmetic mean (mathematical expectation) x̄_i and the standard deviation s_i.
2. Standardize: x'_i = (x_i − x̄_i) / s_i, where x'_i is the normalized value of the variable and x_i is its actual value.

3.
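The two steps above can be sketched in Java as follows. This is a minimal sketch with a class name of our own choosing; note that it uses the population standard deviation (dividing by n), whereas SPSS's default standardization typically uses the sample form (dividing by n − 1).

```java
public class ZScore {
    // Standardize one column of values: z = (x - mean) / stddev.
    public static double[] normalize(double[] x) {
        int n = x.length;
        // Step 1: compute the arithmetic mean and standard deviation.
        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / n);   // population standard deviation
        // Step 2: standardize each value.
        double[] z = new double[n];
        for (int i = 0; i < n; i++) z[i] = (x[i] - mean) / sd;
        return z;
    }
}
```

For a data set such as the Iris attributes, this would be applied to each attribute column separately, so that every attribute contributes on a comparable scale to the distance computation in K-Means.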