Document Type: Original Article

Author

Assistant Prof., Department of Finance, Hakim Sabzevari University, Sabzevar, Iran.

10.30699/ijf.2024.414027.1427

Abstract

Accurate imputation of missing values in time series data is essential for maintaining the integrity and reliability of analyses and predictions. This article investigates the efficacy of various missing-value imputation methods, encompassing well-known machine learning and statistical techniques. To evaluate them in practice, the methods were applied to two financial time series, the S&P 500 and Bitcoin markets, at a daily frequency from 2016 to 2023. Starting from complete datasets, controlled missingness was introduced by randomly removing 45 data points. Multiple imputation strategies were then applied to estimate and substitute these missing values. In the experimental evaluation, the machine learning methods examined, including k-Nearest Neighbors (k-NN), Random Forest, Deep Learning, and Decision Trees, consistently outperformed their statistical counterparts, such as Mean Imputation, Regression Imputation, Hot-Deck Imputation, and Expectation-Maximization Imputation. Notably, Random Forest emerged as the most effective method, showing superior accuracy and robustness. Conversely, Mean Imputation performed comparatively poorly, suggesting limited suitability for financial time series data. This research contributes to the ongoing discourse on data integrity in financial analytics and serves as a comprehensive guide for practitioners seeking optimal missing-value imputation methods. The empirical evidence advances the understanding of the relative performance of imputation techniques and their application to financial data, supporting better decision-making and more reliable predictions.
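The evaluation protocol described in the abstract (remove 45 randomly chosen points from a complete series, impute them, and score the estimates against the held-out true values) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the article's implementation: the series is a synthetic random walk rather than S&P 500 or Bitcoin data, RMSE is used as the score, and `impute_knn_time` is a simplified nearest-neighbor imputer that uses only the time index as its feature. The function names are hypothetical.

```python
import numpy as np

def make_series(n=2000, seed=0):
    """Synthetic random-walk 'price' series (a stand-in, not real market data)."""
    rng = np.random.default_rng(seed)
    return 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=n))

def remove_points(series, n_missing=45, seed=1):
    """Copy the series and set n_missing randomly chosen values to NaN."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(series.size, size=n_missing, replace=False)
    masked = series.copy()
    masked[idx] = np.nan
    return masked, np.sort(idx)

def impute_mean(masked):
    """Replace every NaN with the mean of the observed values."""
    filled = masked.copy()
    filled[np.isnan(masked)] = np.nanmean(masked)
    return filled

def impute_knn_time(masked, k=4):
    """Replace each NaN with the average of the k observed values closest in time.

    Simplified k-NN: distance is measured on the time index only (an assumption).
    """
    filled = masked.copy()
    observed = np.flatnonzero(~np.isnan(masked))
    for i in np.flatnonzero(np.isnan(masked)):
        order = np.argsort(np.abs(observed - i), kind="stable")
        nearest = observed[order[:k]]
        filled[i] = masked[nearest].mean()
    return filled

def rmse(truth, filled, idx):
    """Root-mean-square error evaluated only at the removed positions."""
    return float(np.sqrt(np.mean((truth[idx] - filled[idx]) ** 2)))

truth = make_series()
masked, idx = remove_points(truth)
scores = {
    "mean": rmse(truth, impute_mean(masked), idx),
    "knn_time": rmse(truth, impute_knn_time(masked), idx),
}
print(scores)
```

Because the true values at the removed positions are known, any imputation method can be added to `scores` and compared on the same footing. The numbers produced by this sketch come from synthetic data and say nothing about the article's reported results.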
