Sensitive Information Filtering in Retrieval Augmented Generation-Based Chatbot Systems Using Named Entity Recognition
Keywords:
Retrieval Augmented Generation, Chatbot, Large Language Model, Named Entity Recognition, sensitive information
Abstract
Retrieval Augmented Generation (RAG) has emerged as an innovative approach to chatbots, combining Large Language Models (LLMs) with external knowledge sources. However, it carries a significant risk of leaking sensitive information, particularly in applications that require privacy protection. This study develops a RAG-based chatbot that filters sensitive information using Named Entity Recognition (NER). A DistilBERT model fine-tuned for the NER task on a synthetic dataset is used to recognize sensitive entities such as names, addresses, and identity numbers. The research process covers development of the RAG pipeline, integration of the NER model, and performance evaluation using precision, recall, and f-measure. The fine-tuned DistilBERT model performs strongly, with a precision of 0.965, a recall of 0.965, and an f-measure of 0.965 under weighted-average evaluation. The RAG pipeline scores lower, with a precision of 0.71, a recall of 0.92, and an f-measure of 0.79, but still shows adequate ability to filter sensitive information. These results indicate the potential for RAG-based chatbot systems that are safer and more efficient at protecting user data privacy.
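As a rough illustration of the filtering step described above, the following minimal sketch masks detected entity spans in retrieved passages before they would reach the language model. It is not the study's implementation: the study uses a fine-tuned DistilBERT model, whereas `toy_detector` here is a hypothetical regex stand-in, and the names `redact`, `filter_passages`, the `ID_NUMBER` label, and the choice to filter retrieved passages (rather than generated answers) are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# (start, end, label) character offsets, end exclusive, as an NER model might emit.
Span = Tuple[int, int, str]


def toy_detector(text: str) -> List[Span]:
    """Hypothetical stand-in for the NER model: flags 16-digit numbers as ID_NUMBER."""
    return [(m.start(), m.end(), "ID_NUMBER") for m in re.finditer(r"\b\d{16}\b", text)]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def filter_passages(passages: List[str], detect: Callable[[str], List[Span]]) -> List[str]:
    """Run the detector over each retrieved passage and mask what it finds."""
    return [redact(p, detect(p)) for p in passages]


if __name__ == "__main__":
    docs = [
        "Budi's identity number is 3201234567890123.",
        "Office hours are 9 to 5.",
    ]
    for line in filter_passages(docs, toy_detector):
        print(line)
```

Passing the detector in as a function keeps the masking logic independent of the model, so a real token-classification model could replace `toy_detector` as long as it returns character-offset spans.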
License
Copyright (c) 2025 Jurnal Pengembangan Teknologi Informasi dan Ilmu Komputer

This article is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.