Mining Common Syntactic Patterns used by Java Programmers

Alvaro Losada; Guillermo Facundo; Miguel Garcia; Francisco Ortin

Authors

Alvaro Losada University of Oviedo https://orcid.org/0000-0002-1757-6415
Guillermo Facundo University of Oviedo https://orcid.org/0000-0003-1283-2763
Miguel Garcia University of Oviedo https://orcid.org/0000-0002-3150-2826
Francisco Ortin University of Oviedo https://orcid.org/0000-0003-1199-8649

Keywords:

Syntactic patterns, rule mining, Abstract Syntax Trees, association rules, Java

Abstract

Open source code repositories provide massive data as programs that have been used to develop different tools. These kinds of works have been included in the active Big Code and Mining Software Repositories research fields. Although different machine learning works already classify the syntactic constructs used by programmers, there are no reports about the most common syntactic patterns used by Java programmers. In this article, we describe a system we build to provide such a report. Our system retrieves the syntactic patterns used by Java programmers, distinguishing those utilized by experts and beginners. We also present the anomalies found in the usage of different syntactic constructs. We modify the OpenJDK compiler to double the syntactic information included in its Abstract Syntax Tree (AST), define a mechanism to translate ASTs into n-dimensional vectors, combine the information of different syntax constructs to build heterogeneous patterns, and apply the Frequent Pattern Growth algorithm to mine the syntactic patterns as association rules. The mined patterns allow expressing hierarchical subpatterns connected to one another, providing a high level of expressiveness.

Downloads

Download data is not yet available.

Author Biographies

Alvaro Losada, University of Oviedo

Alvaro Losada es Graduado en Ingeniería Informática por la Universidad de León desde 2018. Actualmente es estudiante del Máster de Investigación en Ingeniería Web en la Universidad de Oviedo. Sus intereses de investigación incluyen machine learning, big data y razonamiento probabilístico.

Guillermo Facundo, University of Oviedo

Guillermo Facundo se graduó en Ingeniería Informática del Software por la Universidad de Oviedo en 2020. Actualmente es estudiante del Máster de Investigación en Ingeniería Web en la Universidad de Oviedo. Sus intereses de investigación incluyen web semántica, desarrollo de software y lenguajes de programación.

Miguel Garcia, University of Oviedo

Miguel Garcia es Profesor Contratado Doctor de la Universidad de Oviedo. Ingeniero Técnico en Informática (2005), Máster en Ingeniería Web (2008) y Máster en Investigación en Ciencias de la Computación (2010). Es doctor en Informática por la Universidad de Oviedo desde 2013. Sus intereses de investigación incluyen big data, desarrollo de software, machine learning y lenguajes de programación.

Francisco Ortin, University of Oviedo

Francisco Ortin Catedrático de Universidad del Departamento de Informática de la Universidad de Oviedo y Profesor Adjunto del Munster Technological University (Irlanda). Director del grupo de investigación oficial Computational Reflection centrado en lenguajes de programación y desarrollo de software.

References

F. Ortin, J. Escalada, and O. Rodriguez-Prieto, “Big Code: New Opportunities for Improving Software Construction,” J. Softw., vol. 11, pp. 1008–1083, Nov. 2016, DOI: 10.17706/JSW.11.11.1083-1088.

K. Aggarwal, M. Salameh, and A. Hindle. “Using machine translation for converting Python 2 to Python 3 code”. PeerJ PrePrints, Oct. 29, 2015. [Online] DOI: 10.7287/PEERJ.PREPRINTS.1459V1, Accessed on: Jun. 10, 2021.

A. V. M. Barone, and R. Sennrich. “A parallel corpus of Python functions and documentation strings for automated code documentation and code generation,” 2017. [Online]. Available: arXiv:1707.02275v1 [cs.CL].

A. Bhoopchand, T. Rocktäschel, E. Barr, and S. Riedel. “Learning Python Code Suggestion with a Sparse Pointer Network,” 2016. [Online]. Available: arXiv:1611.08307v1 [cs.NE].

S. Bhatia and R. Singh, “Automated correction for syntax errors in programming assignments using Recurrent neural networks,” 2016. [Online]. Available: arXiv:1603.06129v1 [cs.PL].

A. W. Appel, and J. Palsberg, Modern Compiler Implementation in Java, 2nd ed., Cambridge, UK: Cambridge University Press, 2002.

F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized Vulnerability Extrapolation using Abstract Syntax Trees,” in ACM Int. Conf. Proc. Series, 2012, pp. 359–368, DOI: 10.1145/2420950.2421003.

J. Escalada and F. Ortin, “An Adaptable Infrastructure to Generate Training Datasets for Decompilation Issues,” in Advances in Intelligent Systems and Computing, Cham: Springer International Publishing, 2014, pp. 85–94. DOI: 10.1007/978-3-319-05948-8_9.

F. Ortin, O. Rodriguez-Prieto, N. Pascual, and M. Garcia, “Heterogeneous tree structure classification to label Java programmers according to their expertise level,” Futur. Gener. Comput. Syst., vol. 105, pp. 380–394, 2020, DOI: 10.1016/J.FUTURE.2019.12.016.

V. Iyer, and C. Zilles. “Pattern Census: A Characterization of Pattern Usage in Early Programming Courses,” in Proc. of the 52nd ACM Technical Symp. on Comp. Science Education, 2021, pp. 45–51.

T. Crow, A. Luxton-Reilly, and B. Wuensche. “Intelligent tutoring systems for programming education: a systematic review,” in Proc. of the 20th Australasian Comp. Education Conf., 2018, pp. 53–62.

Oracle, Oracle JDK 9 Documentation - Java Platform, Standard Edition Tools Reference. Accessed on: Jun. 10, 2021. [Online]. Available: https://docs.oracle.com/javase/9/tools/javac.htm#JSWOR627

M. Allamanis, and C. Sutton, “Mining Idioms from Source Code,” in Proc. of the 22nd ACM SIGSOFT Int. Symp. on Foundations of Software Engineering, 2014, pp. 472–483, DOI: 10.1145/2635868.2635901.

I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone detection using abstract syntax trees,” in Proc. of the Int. Conf. on Software Maintenance, 1998, pp. 368–377, DOI: 10.1109/ICSM.1998.738528.

D. Perez, and S. Chiba, “Cross-language clone detection by learning over abstract syntax trees,” in Int. Conf. on Mining Software Repositories, 2019, pp. 518–528, DOI: 10.1109/MSR.2019.00078.

J. Zeng, K. Ben, X. Li, and X. Zhang, “Fast Code Clone Detection Based on Weighted Recursive Autoencoders,” IEEE Access, vol. 7, pp. 125062–125078, 2019.

L. Büch, and A. Andrzejak, “Learning-Based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection,” in Proc. of the IEEE Int. Conf. on Software Analysis, Evolution and Reengineering, pp. 95–104, DOI: 10.1109/SANER.2019.8668039.

Z. Lubsen, A. Zaidman, and M. Pinzger, “Using association rules to study the co-evolution of production test code,” in 6th IEEE Int. Working Conf. on Mining Software Repositories, 2009, pp. 151–154, DOI: 10.1109/MSR.2009.5069493.

P. Bian, B. Liang, W. Shi, J. Huang, and Y. Cai, “NAR-Miner: Discovering Negative Association Rules from Code for Bug Detection,” in Proc. of the 2018 26th ACM Joint Meeting on European Software Engineering Conf. and Symp. on the Foundations of Software Engineering, 2018, pp. 411–422, DOI: 10.1145/3236024.3236032.

A. Losada, G. Facundo, M. Garcia, and f. Ortin, “Mining Common Syntactic Patterns used by Java Programmers (support material website),” 2021. [Online]. Available: http://www.reflection.uniovi.es/bigcode/download/2021/ieee-lat.

M. Joglekar, H. Garcia-Molina, and A. Parameswaran, “Interactive Data Exploration with Smart Drill-Down,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 1, pp. 46–60, 2019.

J. W. Tukey, Exploratory Data Analysis. Upper Saddle River, NJ: Pearson, 1977.

R. Agrawal, and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” in Proc. of the 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487–499.

R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. of the 1993 ACM SIGMOD Int. Conf. on Manage. of Data - SIGMOD ’93, 1993.

J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Oxford, England: Morgan Kaufmann, 2012.

O. Rodriguez-Prieto, A. Mycroft, and F. Ortin, “An efficient and scalable platform for Java source code analysis using overlaid graph representations,” IEEE Access, vol. 8, pp. 72239–72260, 2020.

Mining Common Syntactic Patterns used by Java Programmers

Authors

Keywords:

Abstract

Downloads

Author Biographies

Alvaro Losada, University of Oviedo

Guillermo Facundo, University of Oviedo

Miguel Garcia, University of Oviedo

Francisco Ortin, University of Oviedo

References

Downloads

Published

How to Cite

Issue

Section

Make a Submission

Information