Computational Technology Resources - CCP

Keywords: missing data, nearest neighbour, geotechnical parameters, database.

Summary

Data sets derived from parameter testing of engineering materials can have gaps in the data. Data can be missing for various reasons including specimen limitations, testing inadequacy or project scope. Using a database containing missing data for further analysis involving predictive estimation can be problematic.

Standard methods of missing data analysis usually assume that the data follow a Gaussian distribution in some form and that the missing values can be estimated using a likelihood-based method such as EM (expectation maximisation) or MI (multiple imputation) [1]. In multi-dimensional parameter space, clusters within data sets may well correspond to a Gaussian form. However, naturally occurring materials, like engineering soils, have disparate parameters that do not necessarily conform to a predictable distribution. The parameters measured in soil testing belong to groups depending on their geotechnical class, but any one soil within a class may not conform to the class distribution: the soil properties can belong to an almost unlimited number of irregular sub-groups [2]. This highly irregular nature of soils and other naturally occurring materials makes a traditional approach to parameter analysis and estimation problematic. A similarity analysis approach has therefore been adopted here, for which a database of over 800 rows has been compiled from geotechnical laboratory analyses from various sources [3].

The focus of any analysis would be to estimate missing data in a new input data set. This paper describes a method of estimating missing parameters, Filtered Similarity Estimation (FSE), using the similarity of known parameters in a comparison data set to their corresponding nearest neighbours in a large database. A system (cBase) has been built to undertake the analysis.

Data is held in the database as rows, one per soil sample. Each row lists 14 parameters, but some of the values are missing. The soil class of each specimen is also held in the database but not in the user data set. The user data set (one row of 14 parameters relating to one soil sample, with some values missing) is matched to each row in the database using a similarity function. This function applies a user-set weighting to each parameter to derive an individual parameter similarity number; these numbers are then totalled to produce a similarity value for each row in the database. Missing data is ignored and the similarity value is normalised to account for this. The rows that best match the user data set are thus those with the highest similarity values derived from a nearest neighbour algorithm.
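The matching step can be illustrated with a minimal Python sketch. The summary does not give the form of the similarity function used in cBase, so the per-parameter measure here (one minus the absolute difference divided by an assumed parameter range) is an illustrative assumption, as are the names row_similarity, most_similar, weights and ranges. Only the overall shape follows the description: weighted per-parameter similarities are totalled, missing values are skipped, and the total is normalised by the weights actually used.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value


def row_similarity(user: Row, db_row: Row, weights: Sequence[float],
                   ranges: Sequence[float]) -> float:
    """Weighted similarity in [0, 1] between a user row and one database row.

    Assumed per-parameter form: 1 - |difference| / range, clipped at 0.
    Parameters missing on either side are skipped, and the total is divided
    by the sum of the weights actually used (the normalisation step).
    """
    total = 0.0
    used_weight = 0.0
    for u, d, w, r in zip(user, db_row, weights, ranges):
        if u is None or d is None or r <= 0:
            continue  # ignore missing data (and degenerate ranges)
        total += w * max(0.0, 1.0 - abs(u - d) / r)
        used_weight += w
    return total / used_weight if used_weight > 0 else 0.0


def most_similar(user: Row, database: Sequence[Row], weights: Sequence[float],
                 ranges: Sequence[float], k: int = 10) -> List[Tuple[float, int]]:
    """Return (similarity, row_index) pairs for the k best-matching rows."""
    scored = [(row_similarity(user, row, weights, ranges), i)
              for i, row in enumerate(database)]
    # Highest similarity first; ties broken by row index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]
```

Dividing by the weights actually used keeps rows with many missing values comparable to complete rows, although a row matched on a single parameter can then score as highly as one matched on all 14; a practical system might also require a minimum number of shared parameters.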

Estimates of the missing data are derived from the 10 most similar rows in the database by taking the means of the best matching parameters. The soil class is predicted in a similar way by analysing the classes of the most similar data rows. The database class is split into 6 elements: primary and descriptor, secondary and descriptor, tertiary and descriptor. For each class element, the similarity values of the top 10 rows are summed over rows whose class elements are similar; the class element with the highest score is taken as the class estimate for that element.
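The two estimation steps can be sketched as follows. This is a minimal illustration, not the cBase implementation: it assumes a plain (unweighted) mean over the known values in the neighbouring rows, and it treats class elements as "similar" only when their labels are identical. The function names and the example labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value


def estimate_missing(user: Row, neighbours: Sequence[Row]) -> List[Optional[float]]:
    """Fill each missing user value with the mean of the neighbours' known values.

    A value stays None when no neighbour has that parameter.
    """
    estimates = list(user)
    for j, value in enumerate(user):
        if value is not None:
            continue  # known values are kept unchanged
        known = [row[j] for row in neighbours if row[j] is not None]
        if known:
            estimates[j] = sum(known) / len(known)
    return estimates


def vote_class_element(similarities: Sequence[float],
                       labels: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Sum the similarity values per label and return the highest-scoring label."""
    scores = defaultdict(float)
    for s, label in zip(similarities, labels):
        if label is not None:
            scores[label] += s
    return max(scores, key=scores.get) if scores else None
```

For example, with similarities [0.9, 0.5, 0.5] and primary class labels ["CLAY", "SAND", "SAND"], the summed score for "SAND" (1.0) exceeds that for "CLAY" (0.9), so "SAND" is returned. The vote would be run once for each of the 6 class elements.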

A second mechanism is used to forecast missing data parameters in a user data set. The forecast mechanism works on the whole database by finding the covariance and correlation between every database parameter and every other parameter. A linear forecast function is derived from the correlation and regression between any two parameters in the database. A known user data set parameter value is used in the corresponding forecast function to estimate an unknown parameter. A filtering mechanism restricts the forecasts to correlations above a certain threshold. The same technique is applied to derive estimates solely from the 10 most similar data rows.
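A minimal sketch of the forecast mechanism is given below, under stated assumptions. The summary does not say how forecasts from several known parameters are combined, nor what the threshold value is; here the forecasts that pass the filter are simply averaged, and the default threshold of 0.7 on the absolute correlation is an arbitrary illustrative choice. The names fit_pair, forecast and min_abs_r are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value


def fit_pair(xs: Row, ys: Row) -> Optional[Tuple[float, float, float]]:
    """Least-squares line y = a + b*x over rows where both values are known.

    Returns (a, b, r), where r is the Pearson correlation, or None when there
    are fewer than two complete pairs or either variable has zero variance.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    b = sxy / sxx
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


def forecast(database: Sequence[Row], user: Row, target: int,
             min_abs_r: float = 0.7) -> Optional[float]:
    """Estimate user[target] from each known user parameter whose correlation
    with the target column passes the filter, then average the forecasts."""
    target_col = [row[target] for row in database]
    predictions = []
    for j, x in enumerate(user):
        if j == target or x is None:
            continue
        fit = fit_pair([row[j] for row in database], target_col)
        if fit is None:
            continue
        a, b, r = fit
        if abs(r) >= min_abs_r:  # filtering step: drop weak correlations
            predictions.append(a + b * x)
    return sum(predictions) / len(predictions) if predictions else None
```

Passing only the 10 most similar rows as the database argument gives the variant that derives estimates solely from the best matching rows, although with so few rows the fitted correlations are likely to be much less stable.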

Results from analyses of this large geotechnical data set indicate that estimates based on means of the most similar rows or of the whole database were the most successful. The results also indicate that class can be estimated by a similarity-based statistical approach where data is limited.

References

1: Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data, Chapman & Hall, Boca Raton, USA.
2: Terzaghi, K. and Peck, R.B. (1967) Soil Mechanics in Engineering Practice, John Wiley & Sons, New York, pp. 72-73.
3: Davey-Wilson, I.E.G. (2001) Geotechnical Parameter Prediction from Large Data Sets, in Proceedings of the Eighth International Conference on Civil and Structural Engineering Computing, B.H.V. Topping (Editor), Civil-Comp Press, Stirling, United Kingdom, paper 104. doi:10.4203/ccp.73.104


Civil-Comp Proceedings, ISSN 1759-3433
CCP: 78
Proceedings of the Seventh International Conference on the Application of Artificial Intelligence to Civil and Structural Engineering
Edited by: B.H.V. Topping
Paper 15

Analysis of Missing Data in a Geotechnical Database
I.E.G. Davey-Wilson
Department of Computing, School of Technology, Oxford Brookes University, Oxford, United Kingdom
doi:10.4203/ccp.78.15

Full Bibliographic Reference for this paper
I.E.G. Davey-Wilson, "Analysis of Missing Data in a Geotechnical Database", in B.H.V. Topping, (Editor), "Proceedings of the Seventh International Conference on the Application of Artificial Intelligence to Civil and Structural Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 15, 2003. doi:10.4203/ccp.78.15

©Civil-Comp Limited 2023