Mini Project Report
Design and implementation of a classification system based on soft computing and statistical approaches
Ashish Kumar Agrawal(2001114)
This project, developed as part of the MHRD research project "Designing an Intelligent Robot for Explosive Detection and Decontamination" (funded by MHRD, Govt. of India), explores the design and development of a classifier, based on statistical methods and soft-computing approaches, that can distinguish mines from non-mines using various clustering, classification and rule-establishment algorithms, and compares those algorithms on the basis of complexity and accuracy. Designing such a classifier is a significant challenge: the data is not linearly separable and its features overlap, so a classifier with 100% accuracy is not possible. This project uses PVC tubes, wood pieces and copper cylinders as non-mine data, in addition to data for various mines. The classification rests on one guiding fact: it is safe to predict non-mine data as a mine, but not to predict mine data as non-mine. Accordingly, the unsupervised ART algorithm divides the data into several clusters, which are then merged on the basis of this fact. A genetic algorithm enhances the results by establishing rules with negation in the antecedent part. In addition to these approaches, fuzzy approaches give membership values for each class, which visualize the class of a data point more informatively.
I hereby declare that the work presented in this project, titled "Design and Implementation of a Classification System Based on Soft Computing and Statistical Approaches", is submitted towards completion of the mini-project in the sixth semester of B.Tech (IT) at the Indian Institute of Information Technology (IIIT), Allahabad. It is an authentic record of my original work pursued under the guidance of Dr. G. C. Nandi, Associate Professor, IIIT, Allahabad.
I have not submitted the matter embodied in this project for the award of any other degree.
(Ashish Kumar Agrawal)
Place: Allahabad Date: 17-5-2004
This is to certify that the above declaration made by the candidate is correct to the best of my knowledge and belief.
(Dr G.C. Nandi)
Place: Allahabad IIIT (Deemed University)
Date: May, 2004 Deoghat, Jhalwa, Allahabad.
First and foremost, I would like to express my sincere gratitude to my project guide, Dr. G. C. Nandi. I was privileged to experience a sustained, enthusiastic and involved interest from his side, which fueled my enthusiasm even further and encouraged me to step boldly into what was a totally dark and unexplored expanse before me. I would also like to thank my seniors, who were ready with a positive comment at all times, whether an off-hand remark to encourage me or a constructive piece of criticism, and special thanks to the JRC database providers, who arranged a good database of mines. Last but not least, I would like to thank the IIIT-A staff members and the institute in general for extending a helping hand at every juncture of need.
Ashish Kumar Agrawal (2001114)
Table Of Contents
List Of Figures VII
Introduction and Statement Of Problem 1
1.1 Introduction 1
1.2 Problem Statement 1
Challenges In This Field 3
2.1 Features extraction 3
2.2 Selection of an algorithm 3
Approaches in This Direction 4
3.1 Statistical Approaches 4
3.1.1 Clustering algorithm K-mean 4
3.1.2 K-nearest neighbour 4
3.2 Softcomputing 5
3.2.1 Genetic algorithm 5
3.2.2 Adaptive resonance theory (ART) 6
3.2.3 Fuzzy C-mean 7
3.2.4 Gustafson-Kessel algorithm 8
3.2.5 Gath-Geva algorithm 8
3.2.6 Kohonen SOM 9
System Architecture 10
4.1 Data Source Name and login 10
4.2 Algorithm and table selection 10
Results And Conclusions 12
5.1 Results 12
5.2 Conclusion 15
5.3 Future Extensions 15
5.3.1 Improvement in the genetic algorithm 16
5.3.2 Distributed computing environment 16
5.3.3 Dealing with various platforms and formats 16
- Research Papers 17
List of Figures
Fig 1: Flow of information 11
Fig 2: Main frame of algorithm 11
Fig 3: Result of genetic algorithm 12
Fig 4: Result of ART 12
Fig 5: Result of Fuzzy c-mean and Gustafson-Kessel algorithm
Fig 6: Result of k-mean algorithm 14
Fig 7: Result of k-nearest neighbour algorithm 14
Fig 8: Result of Kohonen SOM 15
Introduction & Statement of Problem
"If we already know about an upcoming hazard, it is very easy to find a way to abolish it."
Here, this idea is applied in the context of landmine detection and decontamination. The objective is to predict whether a particular point of the working area is occupied by a mine or not, with some confidence parameter; the robot is designed to move toward these predicted areas to decontaminate the mines. The mine-occupied areas can be known before the robot starts moving or can be predicted dynamically, and designing an obstacle-free path for the robot is another aspect, beyond the domain of this module.
To tackle this problem, a classification toolkit has been designed using statistical and soft-computing based approaches to cluster the data, predict the possible class of incoming data, and generate rules with a confidence parameter. The data may be given in image form or in tabular form with all-numeric or categorized attributes.
It is impossible to design a classifier with 100% correct classification, because it is not easy to differentiate between the data of metallic debris, PVC tubes and actual mines. On the basis of these predictions, path designers develop an obstacle-free path to decontaminate the mines.
1.2 Statement of Problem
Anti-personnel landmines are a significant barrier to economic and social development in a number of countries, so we need a classification system that can differentiate a mine from metallic debris on the basis of given data. This data is generated by highly accurate sensors.
Challenges in This Field
In the field of classification and rule establishment, the basic problems are feature extraction (the building block of the algorithms) and the selection of good algorithms that can generate results with a high certainty value.
2.1 Features Extraction
The initial problem is feature extraction. Generally the image data contains a blurred image of an object, so it is very difficult to extract the exact boundary of the object. Various features could serve as the raw material of the system; here blob size, blob aspect ratio and blob intensity have been chosen. The given data may contain images of PVC tubes, metallic debris and mines.
The data can also be given in a tabular format with numerical or categorized attribute values, which is more suitable for the algorithms.
2.2 Selection of an Algorithm
The second problem is to choose an algorithm that interprets the problem in the best way. The algorithms can be categorized in two parts:
(1) statistical approaches (2) soft-computing based approaches
Three types of algorithms can be applied here: classification, clustering and rule establishment with some certainty factor, so the best way is to implement various algorithms and then check their efficiency and accuracy.
APPROACHES IN THIS DIRECTION
In this section, the various algorithms used to achieve the objective are discussed.
3.1 Statistical Approaches
Two algorithms have been used: one for clustering (the K-mean algorithm) and another for predicting the class of incoming data (K-nearest neighbour).
3.1.1 Clustering Algorithm: Kmean
Clustering partitions a data set into groups (clusters) such that the samples within a cluster are more similar to one another than to the samples in other clusters.
If the number of data points is less than the number of clusters, each data point is assigned as the centroid of a cluster, and each centroid is given a cluster number. Otherwise, for each data point we calculate the distance to every centroid and take the minimum; the data point is said to belong to the cluster whose centroid is nearest. Since we are not sure about the location of the centroids, we adjust each centroid location based on the currently assigned data and then reassign all the data to the new centroids. This process is repeated until no data point moves to another cluster any more; mathematically, this loop can be proved to converge. Since there are only two classes, mine and non-mine, the number of clusters is given as 2 along with the dataset. [B1]
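The loop above can be sketched in a few lines of Python (illustrative only; the module itself was implemented in Java). The 2-D sample data and the first-k-points initialisation are assumptions chosen for the demonstration:

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    # initialise centroids from the first k points (an assumption here)
    centroids = [tuple(points[i]) for i in range(k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            best = d.index(min(d))
            if assignment[i] != best:
                assignment[i] = best
                changed = True
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
        if not changed:  # no point changed cluster: converged
            break
    return assignment, centroids

# two well-separated groups standing in for mine / non-mine features
data = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
        (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
labels, centers = kmeans(data, k=2)
```

With k = 2, the loop settles with the first three points in one cluster and the last three in the other.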
3.1.2 K-nearest Neighbour
The K-nearest neighbour technique is used to predict the class of incoming data on the basis of given training data, together with a density estimator (k-NN) that estimates the confidence of the incoming sample for each class. The class with the highest estimate is predicted.
Density estimator: qc(x) = (number of neighbours of class c) / K
The neighbours are the K closest points to the given sample; their mutual distances are calculated using the city-block distance. [B2]
The problem of choosing K still remains, but a general rule of thumb is to use K ≈ √N, where N is the number of learning samples.
A disadvantage of this method is that it is computationally intensive for large data sets.
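This predictor can be sketched as follows; the toy training set is hypothetical, while the city-block distance, the qc(x) = kc/K estimator and the K ≈ √N rule of thumb are the ones mentioned above (the module itself was written in Java):

```python
import math

def cityblock(a, b):
    """City-block (Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, x, k):
    """train: list of (feature_vector, label) pairs.
    Returns (predicted label, confidence q_c(x) = k_c / K)."""
    neighbours = sorted(train, key=lambda t: cityblock(t[0], x))[:k]
    votes = {}
    for _, label in neighbours:
        votes[label] = votes.get(label, 0) + 1
    best = max(votes, key=votes.get)
    return best, votes[best] / k

# hypothetical training samples: (blob features, class)
train = [((0.1, 0.2), "non-mine"), ((0.2, 0.1), "non-mine"),
         ((0.15, 0.3), "non-mine"), ((0.9, 0.8), "mine"),
         ((0.8, 0.9), "mine"), ((0.95, 0.85), "mine")]
k = round(math.sqrt(len(train)))      # rule of thumb: K ~ sqrt(N)
label, conf = knn_predict(train, (0.85, 0.9), k)   # -> ("mine", 1.0)
```

Because every prediction sorts the whole training set, the cost per query grows with N, which is the computational burden noted above.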
3.2 Softcomputing approaches
Softcomputing approaches can be classified into several categories:
1.) Neural approaches
2.) Fuzzy clustering
3.) Adaptive resonance theory
4.) Kohonen SOM
5.) Genetic algorithm
3.2.1 Genetic algorithm to establish rules
To establish rules between the attributes of the data, association rule mining is used; but association rule mining cannot generate the complete set of rules, i.e. rules that have negation on attributes cannot be discovered. To overcome that disadvantage, a Genetic Algorithm (GA) has been used. First, association rule mining is applied with support and confidence values entered by the user to generate some base rules, and these rules are sent to the genetic algorithm as input, which evolves new rules having negation on attributes. The three basic parts of the genetic algorithm are as follows:
(a) Selection: the roulette-wheel technique is used to select the two parents [R1].
(b) Crossover: a random point (the crossover point) is generated, and the segments to the left of this point in the first and the second parent are interchanged.
(c) Mutation: a mutation point is generated randomly and the bit value at this point is toggled.
After some iterations we obtain rules that follow the above properties and have a high fitness value, which can be calculated either from the confidence value or from the confusion matrix.
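The three operators can be sketched in Python as below. This is illustrative only: the chromosome here is a plain bit string and the fitness is a stand-in count of 1-bits, whereas the module encodes rules and scores them by confidence or confusion matrix:

```python
import random

def roulette(population, fitness):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its fitness."""
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return individual
    return population[-1]

def crossover(p1, p2):
    """Single-point crossover: swap the segments to the left of a
    random crossover point between the two parents."""
    point = random.randint(1, len(p1) - 1)
    return p2[:point] + p1[point:], p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    """Point mutation: toggle each bit with a small probability."""
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

random.seed(1)
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(4)]
fit = [sum(c) for c in pop]           # stand-in fitness: count of 1-bits
a, b = roulette(pop, fit), roulette(pop, fit)
child1, child2 = crossover(a, b)
child1 = mutate(child1)
```

Iterating select/crossover/mutate while keeping the fittest individuals is what gradually surfaces high-fitness rules.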
3.2.2 Adaptive resonance theory (ART)
As we know, a backpropagation network is very powerful in the sense that it can simulate any continuous function, given a certain number of hidden neurons and certain forms of activation function. But once a backpropagation network is trained, the number of hidden neurons and the weights are fixed; the network cannot learn from new patterns unless it is re-trained from scratch, so there is no plasticity. [R2]
ART is a newer neural-network technique that solves this problem.
Our ultimate objective is to cluster the data in several chunks.
Samples from the data are presented one by one as input neurons, and an activation value is calculated for each existing output neuron; the neuron with the highest value is chosen. If this value is higher than the threshold, the weights of that connection are updated; otherwise a new output neuron is added. After a certain number of iterations the proper clusters are found; in our application the data does not have more than two classes (mine and non-mine). Another fact is that predicting non-mine data as a mine is acceptable, but the reverse is not, because it may be dangerous; so, among all the clusters, the cluster whose center is farthest from the mine-data center is classified as non-mine, and the rest of the clusters are classified as mine. Here the activation value is calculated as the city-block distance between the incoming normalized data and the connection weights.

3.2.3 Fuzzy C-mean
In classical clustering algorithms we have crisp membership of a class (either one or zero), but while classifying mine data it is not easy to differentiate between mine and non-mine. So we need a method that can give the membership of the data in each class; if this membership is intermediate, we treat the sample as special data and classify it in the mine class (as mines are dangerous!). [R3]

Membership update: u_ij = 1 / Σ_{k=1..p} (d_ij / d_kj)^(2/(m-1))

Mean center prototype: c_i = Σ_{j=1..|X|} (u_ij)^m x_j / Σ_{j=1..|X|} (u_ij)^m

where |X| is the number of feature vectors, p is the number of classes (p = 2 in our case), d_ij is the distance of sample x_j from center c_i and m > 1 is the fuzzifier. If the difference between the new and the previous membership values is less than a threshold, the algorithm terminates, yielding a membership value for each class.
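A compact sketch of this update loop, assuming m = 2, Euclidean distances and a deterministic initialisation (the toy data is made up; the module's own implementation is in Java). The middle point ends up with an intermediate membership in both clusters, which is the "special data" case just discussed:

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fuzzy_c_means(points, c=2, m=2.0, eps=1e-4, max_iter=100):
    """Returns (u, centers), where u[i][j] is the membership of
    point j in cluster i; each column of u sums to 1."""
    n, dim = len(points), len(points[0])
    # deterministic initial memberships (each column sums to 1)
    u = [[0.9 if j % c == i else 0.1 / (c - 1) for j in range(n)]
         for i in range(c)]
    centers = []
    for _ in range(max_iter):
        # mean center prototype: c_i = sum_j u_ij^m x_j / sum_j u_ij^m
        centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            centers.append(tuple(
                sum(wj * p[d] for wj, p in zip(w, points)) / sum(w)
                for d in range(dim)))
        # membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        new_u = [[0.0] * n for _ in range(c)]
        for j, p in enumerate(points):
            d = [max(dist(p, centers[i]), 1e-12) for i in range(c)]
            for i in range(c):
                new_u[i][j] = 1.0 / sum(
                    (d[i] / d[k]) ** (2 / (m - 1)) for k in range(c))
        change = max(abs(new_u[i][j] - u[i][j])
                     for i in range(c) for j in range(n))
        u = new_u
        if change < eps:     # memberships stopped moving: terminate
            break
    return u, centers

data = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.5, 0.5)]
u, centers = fuzzy_c_means(data)
```

Under the report's policy, a sample like (0.5, 0.5), whose memberships are close to 0.5 in both clusters, would be assigned to the mine class to stay safe.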
3.2.4 Gustafson-Kessel Algorithm
This algorithm is an improvement on fuzzy C-mean clustering: C-mean does not consider the correlation between the data. The distance formula is redefined as a Mahalanobis-type norm [R3]:

d^2_ij = (x_j - c_i)^T A_i (x_j - c_i),    with A_i = (rho_i det(F_i))^(1/n) F_i^(-1)

where c_i is the mean center prototype, x_j the sample, n the number of features and rho_i the cluster volume (usually 1). The fuzzy covariance matrix is calculated as:

F_i = Σ_{j=1..|X|} (u_ij)^m (x_j - c_i)(x_j - c_i)^T / Σ_{j=1..|X|} (u_ij)^m

The membership update and the mean center prototype are computed exactly as in fuzzy C-mean, but with this distance.
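The two GK-specific quantities, the fuzzy covariance matrix and the norm-inducing distance, can be illustrated for the 2-D case. The elongated cluster and the crisp memberships below are assumptions chosen purely to show the effect (rho_i = 1):

```python
def fuzzy_covariance(points, u_i, center, m=2.0):
    """Fuzzy covariance of one cluster (2-D case):
    F_i = sum_j u_ij^m (x_j - c_i)(x_j - c_i)^T / sum_j u_ij^m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    wsum = 0.0
    for p, w in zip(points, u_i):
        w = w ** m
        dx, dy = p[0] - center[0], p[1] - center[1]
        s[0][0] += w * dx * dx
        s[0][1] += w * dx * dy
        s[1][0] += w * dy * dx
        s[1][1] += w * dy * dy
        wsum += w
    return [[v / wsum for v in row] for row in s]

def gk_distance_sq(x, center, F, rho=1.0):
    """Squared GK distance (x-c)^T A (x-c), with
    A = (rho det F)^(1/n) F^(-1) and n = 2 features."""
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    inv = [[F[1][1] / det, -F[0][1] / det],
           [-F[1][0] / det, F[0][0] / det]]
    scale = (rho * det) ** 0.5          # (rho det F)^(1/2) for n = 2
    dx, dy = x[0] - center[0], x[1] - center[1]
    return scale * (dx * (inv[0][0] * dx + inv[0][1] * dy)
                    + dy * (inv[1][0] * dx + inv[1][1] * dy))

# hypothetical cluster stretched along the x-axis, crisp memberships
points = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.05), (4.0, 0.0)]
u_i = [1.0] * len(points)
center = (sum(p[0] for p in points) / 5, sum(p[1] for p in points) / 5)
F = fuzzy_covariance(points, u_i, center)
```

For two test points at the same Euclidean distance from the center, the one lying along the cluster's long axis gets a much smaller GK distance than the one lying across it, which is exactly the correlation effect that plain C-mean ignores.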
3.2.5 Gath-Geva Algorithm
This algorithm assumes that the data is normally distributed [R3]. The distance incorporates an exponential (Gaussian) term:

d^2_ij = sqrt(det(F_i)) / alpha_i * exp( (x_j - c_i)^T F_i^(-1) (x_j - c_i) / 2 )

where alpha_i is the a-priori probability of data belonging to cluster i, and the mean center prototype and the remaining symbols have the same meaning as above. Before applying this algorithm it is advisable to check whether the data is normally distributed or not.

3.2.6 Kohonen SOM
A competitive network learns to categorize the input vectors presented to it. If a neural network just needs to learn to categorize its input vectors, then a competitive network will do. Competitive networks also learn the distribution of inputs by dedicating more neurons to classifying parts of the input space with higher densities of input.[B3]
A self-organizing map learns to categorize input vectors. It also learns the distribution of input vectors. Feature maps allocate more neurons to recognize parts of the input space where many input vectors occur and allocate fewer neurons to parts of the input space where few input vectors occur.
Self-organizing maps also learn the topology of their input vectors. Neurons next to each other in the network learn to respond to similar vectors. The layer of neurons can be imagined to be a rubber net that is stretched over the regions in the input space where input vectors occur.
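These ideas can be illustrated with a tiny 1-D map; the lattice size, the decay schedules and the toy data below are assumptions for the sketch, not taken from the module:

```python
import math
import random

def winner(weights, x):
    """Index of the unit whose weight vector is closest to input x."""
    d = [sum((w - xi) ** 2 for w, xi in zip(wv, x)) for wv in weights]
    return d.index(min(d))

def train_som(data, n_units=4, epochs=60, lr0=0.5, radius0=1.0):
    """Tiny 1-D self-organising map: the winning unit and its lattice
    neighbours are pulled toward each input; the learning rate and the
    neighbourhood radius decay over the epochs."""
    random.seed(0)
    weights = [[random.random(), random.random()] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.01)
        for x in data:
            win = winner(weights, x)
            for i, wv in enumerate(weights):
                # Gaussian neighbourhood on the 1-D unit lattice
                h = math.exp(-((i - win) ** 2) / (2 * radius ** 2))
                for k in range(len(x)):
                    wv[k] += lr * h * (x[k] - wv[k])
    return weights

data = [(0.1, 0.1), (0.15, 0.2), (0.8, 0.9), (0.9, 0.85)]
weights = train_som(data)
```

After training, inputs from the two groups activate different units, while neighbouring units on the lattice respond to similar inputs.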
Self-organizing maps allow the neurons that are neighbours of the winning neuron to output values, so the transition between output vectors is much smoother than that obtained with competitive layers, where only one neuron has an output at a time.
Now that we have some brief knowledge of the algorithms that have been implemented, I will discuss the architectural design of the classification system, followed by the results.
System Architecture

As already discussed, the input can be in image form or tabular form; Matlab has been used to extract the features from the input images. This yields a table of numerical attributes together with an entry stating whether each sample is mine or non-mine. For the genetic algorithm a categorized table is required, so the data is categorized into three categories, Low, Medium and High, with the class value simply mine or non-mine.
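The Low/Medium/High categorization could be done, for example, by equal-width binning; the report does not specify the exact scheme used, so this sketch (with made-up blob-size values) is only one plausible choice:

```python
def categorize(values, labels=("Low", "Medium", "High")):
    """Equal-width binning of a numeric attribute into named
    categories, as needed before rule mining on the categorized table."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1.0   # guard the constant case
    out = []
    for v in values:
        # clamp the top value into the last bin
        idx = min(int((v - lo) / width), len(labels) - 1)
        out.append(labels[idx])
    return out

blobsize = [12.0, 48.5, 75.2, 30.1, 90.0]   # made-up sample values
print(categorize(blobsize))   # -> ['Low', 'Medium', 'High', 'Low', 'High']
```

Each numeric column is converted this way, while the class column stays simply mine or non-mine.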
4.1 Data Source Name and Login
Data is maintained in MS Access. The user is free to enter any data, but the database must be configured first (Control Panel -> Administrative Tools -> Data Sources (ODBC) -> System DSN -> Configure). After configuration a DSN name is assigned; this DSN name is requested when the application is initialized, together with a username and password that can be obtained from the help (because this system is designed for demonstration, the username and password are given in the help). When the Connect button is pressed, if the username and password are correct and the entered DSN exists, a new page opens offering all the algorithms and a table-selection facility.
4.2 Algorithm and Table Selection
Any table existing in the input database can be selected along with an algorithm (select the categorized table if you apply the genetic algorithm); the algorithm-specific results will then be displayed. An algorithm may ask for input parameters; for example, a clustering algorithm may ask for the number of clusters. The interface is self-explanatory, with proper help. Java has been used at the front end, Microsoft Access XP as the database at the back end, and the JDBC-ODBC bridge to communicate between the algorithms and the database.
Fig 1: Flow of information
Chapter 5 Results and Conclusions
This module has been implemented successfully. The ultimate objective of this module is to compare the various algorithms and differentiate between them on the basis of their accuracy and results.
Fig 3: Result of genetic algorithm
This snapshot displays the results of both association rule mining and the genetic rules. It is clear that the genetic algorithm has generated rules with negated attribute values in the antecedent part, so this algorithm is very useful for establishing rules.
The ART algorithm also gives good results: ART produces the multi-cluster distribution of the given data. Because predicting a non-mine as a mine is not as dangerous as predicting a mine as a non-mine, all the clusters at a greater distance from the non-mine center have been assigned the mine class.
The fuzzy C-mean algorithm gives results with a membership value in each class, so it is easy to identify data that cannot be clearly classified as mine or non-mine; such data is put into the mine class to avoid danger.
The Gustafson-Kessel algorithm removes a drawback of fuzzy C-mean because it considers the correlation between the data.
Fig 5: Result of fuzzy C-mean and Gustafson-Kessel algorithms
The K-mean algorithm is non-adaptive and time-consuming, and gives an accuracy of 65%.
Fig 6: Result of K-mean algorithm
The K-nearest neighbour algorithm is useful when someone wants to know the class of a given data point. First, the training data must be given, along with the input and the number of nearest neighbours; on the basis of the classes of the nearest neighbours, the algorithm predicts the possible class of the input data. The algorithm gives good results when K is large, which also makes it very time-consuming.
Fig 7: Result of K-nearest neighbour algorithm
The Kohonen SOM algorithm is also used for clustering; it is quite fast, being based on the 'winner takes all' strategy, and it differentiates mine from non-mine with up to 80% accuracy.
Fig 8: Result of Kohonen SOM algorithm
All eight algorithms have been implemented so that their results can be compared. The classifier gives results with 80% accuracy. The best results are given by ART and the genetic algorithm; fuzzy C-mean and Gustafson-Kessel are also good because of the membership values they provide for each class. The module can differentiate between PVC tubes, wood pieces, brass tubes and copper cylinders (non-mine data) and the mine data obtained from the JRC (http://apl-database.jrc.it).
5.3 Future Extensions
We contemplate the following future features, which could be incorporated into this project:
5.3.1 Improvement in the genetic algorithm: the genetic algorithm implemented in this module incorporates only point mutation, so other types of mutation, such as deletion, insertion and segment mutation, could also be tried, and the crossover and mutation probabilities could be tuned to obtain better results.
5.3.2 Distributed computing environment: in practice we have to deal with large databases, because on the basis of a 100-tuple database it is very hard to predict the exact class of the data. In real-life applications we have several GB of data; to operate on this much data we need distributed databases and distributed computing.
5.3.3 Dealing with various platforms and formats: the data may come in various formats and database systems, so the system should be flexible enough to handle various formats and DBMSs (Oracle, MySQL, etc.).
B.1 Earl Gose, Richard Johnsonbaugh and Steve Jost, Pattern Recognition and Image Analysis, Prentice Hall, June 1996, ISBN 0132364158.
B.2 Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification (2nd edition), Wiley, New York, 2001, ISBN 0471056693.
B.3 Valluru B. Rao, C++ Neural Networks and Fuzzy Logic (2nd edition).
R.1 J. A. Vasconcelos, J. A. Ramirez, R. H. C. Takahashi and R. R. Saldanha, "Improvements in Genetic Algorithms", IEEE Transactions on Magnetics, Vol. 37, No. 5, September 2001.
R.2 Gail A. Carpenter, Marin N. Gjaja, Sucharita Gopal and Curtis E. Woodcock, "ART Neural Networks for Remote Sensing: Vegetation Classification from Landsat TM and Terrain Data".
R.3 J. C. Bezdek and S. K. Pal, Fuzzy Models for Pattern Recognition, IEEE Press, New York, 1992.