The Comparison of Distributive Semantics Models Applied to the Task of Short Job Requirements Clustering for the Russian Labor Market

Ivan Nikolaev; Ivan Ryazanov; Dmitry Botov

Information Technologies for Intelligent Decision Making Support (Информационные технологии интеллектуальной поддержки принятия решений), 2020

Modified: 2025-02-20

Abstract

In this article we compare several vector models (tf-idf, word2vec, fastText, LDA, LSI, ARTM) on a short-text clustering task, using a dataset of job vacancy descriptions in Russian. We propose a two-stage experiment to determine the best model and its hyperparameters based on the quality of the resulting short-text clusters. In the first stage, we investigate how the hyperparameters of each model affect the clusters produced by training a K-means model on each vector representation. In particular, we examine in detail how the size of the output vector in each model influences the quality of the final clusters. We also provide an extensive analysis of how various regularization options affect the clusters learned from the vectors produced by the ARTM algorithm. In the second stage, the models that showed the best results in the first stage (word2vec and fastText) are analyzed in greater detail. We compare the effectiveness of these models on datasets of different sizes, as well as with different structures of the source fragments (partial elements or full texts of vacancy descriptions). In our experiments, the highest cluster quality (evaluated using the ARI metric) was achieved by word2vec, closely followed by the fastText model. Finally, we perform a topic analysis of each of the resulting clusters and evaluate their homogeneity.
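The abstract scores clusters with the ARI (Adjusted Rand Index) metric. As a minimal illustration of that metric only, the sketch below computes ARI from two label lists using the standard contingency-table formula. The function name `adjusted_rand_index` and the toy labels are assumptions made for this example; the code is not taken from the paper and does not reproduce its data or pipeline.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat labelings of the same items.

    Illustrative helper (hypothetical name), based on the standard
    pair-counting formula over the contingency table.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Contingency cells and the marginal sizes of each partition.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in both, or in each, partition.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # upper bound on agreement
    if max_index == expected:
        return 1.0  # both partitions are trivial (all-in-one or all singletons)
    return (sum_cells - expected) / (max_index - expected)


# Toy usage: hypothetical reference categories vs. cluster ids.
reference = ["dev", "dev", "qa", "qa"]
clusters = [1, 1, 0, 0]
print(adjusted_rand_index(reference, clusters))
```

ARI depends only on which items are grouped together, so renaming cluster ids does not change the score. A value of 1 means the partitions are identical, and values near 0 indicate agreement at chance level.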

Keywords

clustering; vector models; short texts; job vacancies; labour market
