Hybrid Search with BM25 and Fine-Tuned SBERT to Improve Search Relevance for the Law on General Provisions and Tax Procedures (Undang-Undang Ketentuan Umum dan Tata Cara Perpajakan)
- Wan Ahmad Gazali Kodri
- 14220002
ABSTRACT
This study develops a hybrid search system to improve search relevance on a dataset of the General Provisions and Tax Procedures (KUP) law. The system integrates lexical retrieval (BM25) with semantic search using a fine-tuned Sentence-BERT (SBERT) model. This hybrid approach aims to overcome the limitations of each individual method and to improve retrieval accuracy in the context of complex tax-law documents. The methodology comprises several key stages: generating synthetic data with Large Language Models for SBERT fine-tuning, implementing query normalization and data preprocessing, building the hybrid search system with Reciprocal Rank Fusion (RRF), and comprehensively evaluating system performance. The results show that the hybrid model consistently outperforms the individual retrieval methods. Query normalization and optimal preprocessing (lowercasing) improve performance significantly. An analysis of the number of retrieved documents (k) reveals a trade-off between Precision and Recall. The fine-tuned hybrid model with query normalization and lowercase preprocessing performs best, reaching a Precision@N of 66.021%. The study makes a theoretical contribution to hybrid-search methodology for legal documents, and a practical contribution in the form of a more effective search system for the KUP dataset. These findings have the potential to improve the accessibility of tax information, administrative efficiency, and taxpayer compliance.
Keywords: Hybrid Search, BM25, SBERT, KUP, RRF
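The pipeline summarized in the abstract can be sketched in a few lines: two ranked lists (one lexical, one semantic) merged with Reciprocal Rank Fusion, plus lowercase query normalization and a Precision@k check. This is a minimal illustration under stated assumptions: the document IDs, ranked lists, and relevance set are toy values standing in for real BM25 and fine-tuned SBERT output, and the `rrf_k = 60` constant is the common default from the RRF literature, not necessarily the thesis's configuration.

```python
# Minimal sketch of the hybrid retrieval steps described in the abstract:
# lowercase query normalization, Reciprocal Rank Fusion (RRF) over two
# ranked lists (stand-ins for BM25 and SBERT results), and Precision@k.

def normalize(text: str) -> str:
    """Query normalization: lowercase and trim (the best-performing preprocessing)."""
    return text.lower().strip()

def rrf_fuse(rankings, rrf_k: int = 60):
    """RRF: score(d) = sum over lists of 1 / (rrf_k + rank of d); higher is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def precision_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

# Toy ranked lists, best-first (stand-ins for BM25 and fine-tuned SBERT output):
bm25_ranking = ["d1", "d3", "d2", "d4"]
sbert_ranking = ["d3", "d2", "d1", "d5"]

fused = rrf_fuse([bm25_ranking, sbert_ranking])
print(fused[:3])                                   # documents favored by both lists rise
print(precision_at_k(fused, relevant={"d2", "d3"}, k=3))
```

Increasing k retrieves more of the relevant set (Recall rises) while Precision@k tends to fall, which is exactly the trade-off the evaluation reports.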
Record Details
This thesis was written by:
- Name : Wan Ahmad Gazali Kodri
- Student ID (NIM) : 14220002
- Program : Computer Science
- Campus : Margonda
- Year : 2024
- Period : I
- Supervisor : Dr. Muhammad Haris, S.Kom, M.Eng
- Assistant :
- Code : 0009.S2.IK.TESIS.I.2024
- Entered by : SGM
- Last updated : 16 February 2025
- Views : 66