Text Similarity Detection Between Documents Using Case Based Reasoning Method with Cosine Similarity Measure (Case Study SIMNG LPPM Universitas Sriwijaya)

Nabila Febriyanti, Dian Palupi Rini, Osvari Arsalan

Abstract

LPPM Universitas Sriwijaya is the institution that coordinates academic research and community service at Universitas Sriwijaya. As part of this duty, LPPM assesses the originality of every proposal, a task that will become impossible to do manually as the volume of data grows. Automating the originality check is therefore necessary. This research uses the Case Based Reasoning method because it allows the system to reuse previously obtained information to find documents that are similar to the test document. The data is represented in the Vector Space Model, with a weight assigned to each part of the tested documents, and Cosine Similarity is used to measure document-to-document similarity. Four term-weighting formulas from previous research are applied and their final results are compared. The process begins by extracting the data and separating the parts of each document, then computes the similarity of the test document to the case base using the Cosine Similarity measure, filters the results against a threshold, summarizes the calculation results, and finally retains the results for reuse in subsequent calculations. The results indicate that text similarity detection between documents was carried out successfully with the proposed method, with configuration II achieving both the best sensitivity and the fastest computation time.
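The retrieve, filter, and retain steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the four term-weighting formulas or the per-part document weights, so a smoothed TF-IDF weighting is substituted here as an assumption, and the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `retrieve`), the sample texts, and the threshold of 0.5 are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents.

    Assumption: smoothed IDF, not one of the paper's four weighting formulas.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(case_base, query, threshold=0.5):
    """Return (index, score) pairs for stored cases scoring at or above threshold."""
    docs = [tokenize(t) for t in case_base] + [tokenize(query)]
    vectors = tf_idf_vectors(docs)
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors[:-1])]
    hits = [(i, s) for i, s in scores if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical case base and test document.
case_base = [
    "text similarity detection using cosine similarity",
    "rice pesticide selection with case based reasoning",
]
query = "text similarity detection using cosine similarity"

hits = retrieve(case_base, query, threshold=0.5)
case_base.append(query)  # retain: the test document joins the case base
print(hits)
```

In this sketch the test document is appended to the case base after scoring, which mirrors the retain step: later queries are compared against it as well.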


