Augmentasi Data Berbasis IndoBERT untuk Pengklasifikasian Teks Bahasa Indonesia
- FUAD MUFTIE
- 14210197
ABSTRACT
Name : Fuad Muftie
Student ID (NIM) : 14210197
Study Program : Computer Science
Faculty : Information Technology
Level : Master's Degree (S2)
Concentration : Data Mining
Title : Augmentasi Data Berbasis IndoBERT untuk Pengklasifikasian Teks Bahasa Indonesia (IndoBERT-Based Data Augmentation for Indonesian Text Classification)
Text data augmentation has been shown to improve the performance of models and algorithms for text classification and sentiment analysis. Rule-based augmentation techniques can be applied easily to any language, whereas language-model-based augmentation techniques still leave room for development on Indonesian-language data. This thesis proposes a processing pipeline for Indonesian text datasets with a limited number of samples, consisting of text preprocessing and a data augmentation technique that selectively inserts words using IndoBERT. This IndoBERT-based augmentation produces samples that remain meaningful and preserve the sentiment of the original data. In experiments on a Twitter text dataset, the proposed augmentation technique increased classification accuracy and outperformed the Random Insert augmentation technique.
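The insertion step described in the abstract can be illustrated with a short sketch. The following Python snippet is a minimal, assumed illustration (not the author's exact implementation) of contextual word insertion with an IndoBERT masked language model through the Hugging Face transformers fill-mask pipeline; the checkpoint name "indolem/indobert-base-uncased" and the helper function augment_insert are assumptions for illustration. The thesis's selective filtering of candidate words, which keeps the sentence meaning and sentiment label intact, is not shown here.

```python
# Minimal sketch of IndoBERT-based word insertion (assumed, illustrative only):
# place a [MASK] token at a random position and let the masked language model
# propose a contextually fitting word for that position.
import random
from transformers import pipeline

# Assumed IndoBERT checkpoint; any Indonesian masked LM could be substituted.
fill_mask = pipeline("fill-mask", model="indolem/indobert-base-uncased")

def augment_insert(text: str, n_insert: int = 1) -> str:
    """Insert n_insert contextually predicted words into an Indonesian sentence."""
    tokens = text.split()
    for _ in range(n_insert):
        pos = random.randint(1, max(1, len(tokens) - 1))            # insertion point
        masked = tokens[:pos] + [fill_mask.tokenizer.mask_token] + tokens[pos:]
        best = fill_mask(" ".join(masked), top_k=1)[0]              # top MLM prediction
        tokens = masked[:pos] + [best["token_str"].strip()] + masked[pos + 1:]
    return " ".join(tokens)

# Example: augment one short, tweet-like Indonesian sentence.
print(augment_insert("film ini sangat bagus dan menghibur"))
```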
KEYWORDS
Augmentation, Bi-LSTM, CNN, IndoBERT, Language Model
Record Details
This thesis was written by:
- Name : FUAD MUFTIE
- NIM : 14210197
- Study Program : Computer Science
- Campus : Margonda
- Year : 2023
- Period : I
- Supervisor : Dr. Muhammad Haris, M.Eng
- Assistant :
- Code : 0002.S2.IK.TESIS.I.2023