Comparative Analysis of Missing Values Imputation Methods: A Case Study in Financial Series (S&P500 and Bitcoin Value Data Sets)

Goldani, Mahdi

doi:10.61186/ijf.2024.414027.1427

Document Type : Original Article

Author

Mahdi Goldani

Assistant Prof., Department of Finance, Hakim Sabzevari University, Sabzevar, Iran.

Abstract

Accurate imputation of missing values in time series data is essential for maintaining the integrity and reliability of analyses and predictions. This article investigates the efficacy of various missing-value imputation methods, encompassing well-known machine learning and statistical techniques. The methods are applied to two daily financial time series, the S&P 500 and Bitcoin markets, spanning 2016 to 2023. Starting from complete datasets, controlled missingness was introduced by randomly removing 45 data points. The imputation methods were then used to estimate and substitute these missing values. Experimental evaluation yielded insightful findings regarding the performance of the different methods. The examined machine learning methods, including k-Nearest Neighbors (k-NN), Random Forest, Deep Learning, and Decision Trees, consistently outperformed their statistical counterparts, such as Mean Imputation, Regression Imputation, Hot-Deck Imputation, and Expectation-Maximization Imputation. Notably, Random Forest emerged as the most effective method, showing superior accuracy and robustness. Conversely, Mean Imputation produced comparatively inferior results, suggesting its limited suitability for financial time series data. This research contributes to the ongoing discourse on data integrity within finance analytics and serves as a comprehensive guide for practitioners seeking optimal missing-value imputation methods. The empirical evidence provided here advances the understanding of the relative performance of imputation techniques and their application to financial data, supporting better decision-making and more reliable predictions.
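
The evaluation protocol described in the abstract (start from a complete series, remove 45 points at random, impute, and compare against the held-out truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the series is a synthetic random walk rather than the S&P 500 or Bitcoin data, the helper names (`make_series`, `remove_points`, `mean_impute`, `knn_time_impute`, `rmse`) are hypothetical, the k-NN variant uses only the time index as its distance, and RMSE is assumed as the error measure because the abstract does not name one.

```python
import math
import random

def make_series(n=2000, seed=0):
    """Synthetic random-walk 'price' series (a stand-in, not the real market data)."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * math.exp(rng.gauss(0.0, 0.01)))
    return prices

def remove_points(series, n_missing=45, seed=1):
    """Copy the series, set n_missing randomly chosen values to None, return both."""
    rng = random.Random(seed)
    missing = sorted(rng.sample(range(len(series)), n_missing))
    holed = list(series)
    for i in missing:
        holed[i] = None
    return holed, missing

def mean_impute(holed):
    """Fill every gap with the mean of all observed values."""
    observed = [v for v in holed if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in holed]

def knn_time_impute(holed, k=4):
    """Fill each gap with the mean of the k observed values closest in time."""
    observed = [(i, v) for i, v in enumerate(holed) if v is not None]
    filled = list(holed)
    for i, v in enumerate(holed):
        if v is None:
            nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
            filled[i] = sum(val for _, val in nearest) / len(nearest)
    return filled

def rmse(truth, filled, idx):
    """Root mean squared error, computed only at the artificially removed positions."""
    return math.sqrt(sum((truth[i] - filled[i]) ** 2 for i in idx) / len(idx))

series = make_series()
holed, missing = remove_points(series)
rmse_mean = rmse(series, mean_impute(holed), missing)
rmse_knn = rmse(series, knn_time_impute(holed), missing)
print(f"mean imputation RMSE:   {rmse_mean:.3f}")
print(f"k-NN (time index) RMSE: {rmse_knn:.3f}")
```

Because the true values at the removed positions are known, any imputation method can be scored the same way by passing its output to `rmse`. On a drifting series like this one, the global mean is typically far from local price levels, which is consistent with the abstract's finding that Mean Imputation is poorly suited to financial time series.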

Keywords

Missing Values Imputation

Machine Learning

Statistical Methods

Finance Data

S&P 500

Bitcoin

Time Series Analysis
