Civil-Comp Proceedings
ISSN 1759-3433
CCP: 82
PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON THE APPLICATION OF ARTIFICIAL INTELLIGENCE TO CIVIL, STRUCTURAL AND ENVIRONMENTAL ENGINEERING
Edited by: B.H.V. Topping
Paper 17

Data Mining Techniques for Analysing Geotechnical Data

I.E.G. Davey-Wilson

Department of Computing, School of Technology, Oxford Brookes University, Oxford, United Kingdom

Full Bibliographic Reference for this paper
I.E.G. Davey-Wilson, "Data Mining Techniques for Analysing Geotechnical Data", in B.H.V. Topping, (Editor), "Proceedings of the Eighth International Conference on the Application of Artificial Intelligence to Civil, Structural and Environmental Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 17, 2005. doi:10.4203/ccp.82.17
Keywords: data mining, missing data, neural network, decision trees, nearest neighbour, naïve Bayes, algorithm analysis, geotechnical data.

Summary
As geotechnical materials are naturally occurring, a database of parameters derived from geotechnical test results will be noisy and will contain missing parameter values. Relationships exist between some of the parameters in the data, which can be generated from empirical mathematical correlations [1] or recognised by statistical methods. These relationships can then be exploited to make limited predictions of other parameters [2] from a few known values. However, as such databases are multidimensional and at least one of the parameters is categorical rather than numerical (i.e. a class), some hidden relationships within the data can only be retrieved using techniques such as data mining.
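As an illustration of the kind of empirical correlation collected in [1], the sketch below evaluates the widely quoted Terzaghi and Peck relation estimating the compression index Cc of a normally consolidated clay from its liquid limit. The paper does not state which correlations were used, so both the choice of relation and the sample values are assumptions made here for illustration.

```python
# Hypothetical example of an empirical geotechnical correlation:
# the Terzaghi & Peck relation Cc = 0.009 * (LL - 10), estimating the
# compression index of a normally consolidated clay from its liquid
# limit LL (in %). Illustrative only; not taken from the paper.

def compression_index(liquid_limit_percent: float) -> float:
    """Estimate compression index Cc from liquid limit (%)."""
    return 0.009 * (liquid_limit_percent - 10.0)

if __name__ == "__main__":
    for ll in (30.0, 50.0, 80.0):
        print(f"LL = {ll:5.1f} %  ->  Cc ~ {compression_index(ll):.3f}")
```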

Data mining employs algorithms that are a mixture of statistics, logic, mathematics and artificial intelligence. There is a large number of algorithms (described in [3]) that seek relationships within datasets, from which rules of some kind can be derived and subsequently used for prediction, classification or other functions; however, selecting the most effective algorithm is not an intuitive process. The algorithms fall into a number of groups of methods, of which four of the most widely used are neural networks, decision trees, nearest neighbour and Bayesian logic. Many of the algorithms have been refined and augmented to show improvements over the originals, e.g. [4] and [5], but the improvements are often marginal. This work has centred on experiments with algorithms from these four groups of methods; in particular, algorithms were chosen that are amongst the simplest examples of each group, namely the multilayer perceptron, J48, IBk and naïve Bayes respectively. The work concentrates on analysing algorithm performance with respect to varying degrees of missing data, and consequently derives methodologies for selecting the most appropriate data mining technique for data sets with large amounts of missing data.
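The algorithm names J48 and IBk correspond to the toolkit described in [3] (Weka); the sketch below is a minimal stand-in using common scikit-learn analogues of the four chosen algorithms, trained and scored on a synthetic dataset. The dataset, the parameter settings and the use of scikit-learn rather than Weka are all assumptions made here for illustration.

```python
# A sketch of the four algorithm families compared in the paper, using
# scikit-learn analogues of the Weka classifiers named in the text:
# multilayer perceptron, J48 (a C4.5-based tree), IBk (nearest
# neighbour) and naive Bayes. The dataset is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier        # multilayer perceptron
from sklearn.tree import DecisionTreeClassifier         # stand-in for J48
from sklearn.neighbors import KNeighborsClassifier      # stand-in for IBk
from sklearn.naive_bayes import GaussianNB              # naive Bayes

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

classifiers = {
    "neural network":    MLPClassifier(max_iter=2000, random_state=0),
    "decision tree":     DecisionTreeClassifier(random_state=0),
    "nearest neighbour": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes":       GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name:17s} test accuracy: {clf.score(X_test, y_test):.3f}")
```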

The particular nature of geotechnical data dictates that there are numerous gaps in any set of test data, for a variety of reasons. To simulate real data performance, a series of synthetic geotechnical data sets was created (geoSynth1 to geoSynth4). These were created initially from a complete data set, i.e. with zero percent missing, and experiments were carried out on the data with increasing percentages of data missing, up to 50%. The distribution of missing records can be totally random throughout the dataset, or can display a bias with more values missing for particular attributes or for particular records. This variability was mirrored in the geoSynth sets, where some attributes could be weighted to show more missing data than others. Within the original data set a test set was put aside, and the rules derived from the remaining training set were used to predict values in the test set. The effectiveness of an algorithm is measured by its ability to establish rules on the training set and apply them to correctly predict values in the test set.

Experiments with each data mining algorithm, for many combinations of missing records, produced results that demonstrated the competence of the algorithms as the percentage of missing data was increased from zero to 50%. The results indicated that the algorithms were effective to differing degrees at various levels of missing data. For example, the neural network algorithm generally showed the best result at zero percent missing, but also showed the highest rate of decline in performance as the percentage increased. Conversely, the naïve Bayes algorithm was only moderately effective at zero percent missing, but showed very little decline in effectiveness with increasing percentage missing and was generally the best performer at 50% missing. A subsidiary investigation found that missing data could be categorised in relation to proportions of attribute and record weighting, and that the nature of the missing data has some effect on the performance of the algorithms.
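A rough sketch of this experimental protocol is given below: starting from a complete synthetic training set, an increasing fraction of attribute values is deleted at random, weighted so that some attributes lose more values than others, and test-set accuracy is tracked as the fraction grows from zero to 50%. The geoSynth sets are not reproduced here, the per-attribute weights are hypothetical, and mean imputation is a stand-in assumption (the paper's algorithms handle missing values directly), so the numbers are indicative only. The paper repeated this kind of run for each of the four algorithms; a single naive Bayes classifier stands in for them here.

```python
# Sketch of the missing-data experiment: delete an increasing fraction
# of training values at random (biased towards some attributes) and
# track how test accuracy degrades. Mean imputation is an assumption
# made here; it is not the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Per-attribute weights > 1 make some attributes lose more values,
# mimicking the biased-missingness case described above.
attr_weights = np.ones(X.shape[1])
attr_weights[:2] = 3.0                      # first two attributes lose 3x more
cell_p = attr_weights / attr_weights.sum()  # probability mass per attribute

for pct_missing in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    X_miss = X_train.copy()
    # Draw cells to blank out (duplicate draws make the realised
    # fraction slightly below the nominal one).
    n_cells = int(pct_missing * X_miss.size)
    rows = rng.integers(0, X_miss.shape[0], size=n_cells)
    cols = rng.choice(X_miss.shape[1], size=n_cells, p=cell_p)
    X_miss[rows, cols] = np.nan

    imputer = SimpleImputer(strategy="mean")
    clf = GaussianNB()
    clf.fit(imputer.fit_transform(X_miss), y_train)
    acc = clf.score(imputer.transform(X_test), y_test)
    print(f"{pct_missing:4.0%} missing -> test accuracy {acc:.3f}")
```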

References
[1] Carter, M., Bentley, S.P., "Correlations of Soil Properties", Pentech Press, London, 1991.
[2] Davey-Wilson, I.E.G., "Analysis of Missing Data in a Geotechnical Database", in B.H.V. Topping, (Editor), "Proceedings of the Seventh International Conference on the Application of Artificial Intelligence to Civil and Structural Engineering", Civil-Comp Press, Stirling, United Kingdom, 2003. doi:10.4203/ccp.78.15
[3] Witten, I.H., Frank, E., "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", Morgan Kaufmann, 2000.
[4] Quinlan, J.R., "Improved Use of Continuous Attributes in C4.5", Journal of Artificial Intelligence Research, 4, 77-90, 1996.
[5] Hunt, E.B., Marin, J., Stone, P.J., "Experiments in Induction", Academic Press, New York, 1966.
