SoCeRC++: Source Code Recommendation For C++

Preface

This post covers my senior-year capstone project, which I completed with five teammates. It includes some of the documentation we produced; feel free to download it to see our design and requirement specifications, as well as our conclusion (all of which are discussed briefly throughout this post). Below are the documents:

All final code can be viewed here.

Introduction

Motivated by the idea of reusing existing source code from completed projects within a software company, a source code recommendation technique (“SoCeR”) was developed to help programmers find relevant sample code based on software requirement specifications using natural language queries.

Reusing functions already in a company’s database allows for a faster development cycle that requires fewer resources, ultimately resulting in lower cost.

Our final deliverable is a web application that draws on a trusted code base for reliable code reuse and ships with an initial code base.

Below is an overview of our project inception:

An overview of subsystems implemented in our app

Functional Requirements

SoCeRC++ shall be able to extract valid functions from C++ files.
SoCeRC++ shall generate descriptors for valid C++ functions.
SoCeRC++ shall maintain a database of C++ functions and shall store each function's descriptor alongside it.
SoCeRC++ shall return stored C++ functions in response to user queries.
SoCeRC++ shall notify the user if a file was successfully uploaded.
SoCeRC++ shall notify users if a file failed to upload.
SoCeRC++ shall allow users to download a function’s original file.
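
To give a feel for what the first requirement involves, here is a rough Python sketch of pulling function definitions out of C++ source text. It is not the project's actual extractor: the `SIGNATURE` pattern and the `extract_functions` name are assumptions made for illustration, and the pattern ignores templates, operators, macros, and braces inside string literals, which a real C++ parser would have to handle.

```python
import re

# Hypothetical, simplified pattern: return type, name, parameter list, opening brace.
SIGNATURE = re.compile(
    r"^[ \t]*(?P<ret>[\w:<>*&]+(?:[ \t]+[\w:<>*&]+)*)[ \t]+"
    r"(?P<name>[A-Za-z_]\w*)[ \t]*\((?P<params>[^;{}()]*)\)\s*(?:const\s*)?\{",
    re.MULTILINE,
)
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch", "return"}


def extract_functions(source: str) -> list[dict]:
    """Return name, parameters, and full text of each function definition found."""
    functions = []
    last_end = 0
    for match in SIGNATURE.finditer(source):
        # Skip anything that starts inside a function we already captured.
        if match.start() < last_end or match.group("name") in CONTROL_KEYWORDS:
            continue
        # Walk forward from the opening brace until the braces balance.
        depth = 0
        end = None
        for i in range(match.end() - 1, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            continue  # unbalanced braces: treat as not a valid function
        functions.append({
            "name": match.group("name"),
            "params": match.group("params").strip(),
            "body": source[match.start():end],
        })
        last_end = end
    return functions


SAMPLE = """
// Adds two integers.
int add(int a, int b) {
    if (a > 0) {
        return a + b;
    }
    return b;
}

std::string greet(const std::string& name) {
    return "Hello, " + name;
}

int declared_only(int x);
"""

for function in extract_functions(SAMPLE):
    print(function["name"], "(", function["params"], ")")
```

Declarations without a body, such as `declared_only` in the sample, are skipped because the pattern requires an opening brace.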

Nonfunctional Requirements

SoCeRC++ shall have a pre-populated database of at least 100 functions.
SoCeRC++ shall return functions in response to a search query within 30 seconds.
Functions uploaded to the SoCeRC++ application shall be available to be searched for within 24 hours of being uploaded.

Project Organization

We managed the project with agile scrum, dividing the work into six two-week sprints with a new scrum master for each sprint.

We used a Trello board to break down our user stories, ensuring that each functional and nonfunctional requirement was met. Trello let us see where we stood on each task and gave us an overview of each sprint (such as the scrum master and the estimated and actual velocity). You can see a breakdown of our requirements and board here:

Trello board overview (note: all user stories are completed here)

An example of a user story; we only allowed two team members to a story, and no more than one person to a task.

Project Architecture

Our team settled on a layered architecture for this project, which allowed us to break subsystems down into components that could be worked on independently. Given our time constraints, we decided that this architecture best fit our needs.

We used several state diagrams to visualize the flow within different components. An example is shown below:

Simple state diagrams for three components

We also used sequence diagrams to see interactions and flow amongst components:

Project Design

As mentioned earlier, the final deliverable is a web application. Below is a breakdown of the website architecture:

The website comprises two primary pages, a search page and an upload page:

Project Results

Final project features are seen below:

SoCeRC++ summarizes C++ code into sentences or phrases to match them against user queries.

SoCeRC++ extracts and analyzes the content of the code (such as variables, functions, docstrings, and comments) to generate a code summary for each function, which is then mapped to that function.
It also allows users to upload new code to enrich the code base with tested code.
All functional/nonfunctional requirements met
Additional features: pre-populated codebase
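
To make the summarization feature more concrete, here is a small Python sketch of turning a function's identifiers and comments into a list of descriptive words. The helper names (`split_identifier`, `build_descriptor`), the stop-word list, and the keyword list are all illustrative assumptions; the project's own descriptor generator is not reproduced here and may work quite differently.

```python
import re

# Assumed, deliberately short word lists for the sketch.
STOP_WORDS = {"the", "an", "of", "to", "and", "in", "is", "for"}
CPP_KEYWORDS = {"int", "double", "float", "char", "void", "bool", "const",
                "return", "if", "else", "for", "while", "std"}


def split_identifier(identifier: str) -> list[str]:
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in identifier.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def build_descriptor(function_source: str) -> list[str]:
    """Collect unique words from identifiers and comments, in order of appearance."""
    line_comments = re.findall(r"//([^\n]*)", function_source)
    block_comments = re.findall(r"/\*(.*?)\*/", function_source, flags=re.DOTALL)
    comment_text = " ".join(line_comments + block_comments)
    # Remove comments so their words are not mistaken for identifiers.
    code_only = re.sub(r"//[^\n]*|/\*.*?\*/", " ", function_source, flags=re.DOTALL)

    words = []
    for identifier in re.findall(r"[A-Za-z_]\w*", code_only):
        if identifier not in CPP_KEYWORDS:
            words.extend(split_identifier(identifier))
    words.extend(word.lower() for word in re.findall(r"[A-Za-z]+", comment_text))

    descriptor, seen = [], set()
    for word in words:
        # Drop single letters, stop words, and repeats.
        if len(word) < 2 or word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        descriptor.append(word)
    return descriptor


EXAMPLE = """
// Computes the average of a list of scores.
double computeAverage(const std::vector<double>& scores) {
    double total_sum = 0;
    for (double s : scores) total_sum += s;
    return total_sum / scores.size();
}
"""

print(build_descriptor(EXAMPLE))
```

A descriptor like this can then be stored next to the function and compared against the words of a user's query.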

The final web application is a Spring Boot app. To run it on a local machine, run the packaged .war file; the application is then available at localhost:8080:

A breakdown of the project directory (left). Running the server from a terminal (middle). A live endpoint running on local machine (right).

Throughout the development process, we ran many tests against our software product. To ensure that it met its requirements and had no unidentified or overlooked bugs, our team also obtained additional test cases and parameters from a third party. An example test is seen below:

Conclusion

What we have achieved in SoCeRC++ is an elementary search engine and a successful implementation of natural language quantification that is optimized for use in conjunction with MySQL. There are, as there always will be, more optimizations that can be made in the future.

The stemming algorithms we looked into had great potential. With enough time, a robust stemmer could do much to improve the relevance of the results returned for search queries. Identifying the right one for SoCeRC++ and implementing it would compound the work already done by our term frequency-inverse document frequency (TF-IDF) algorithm and yield a very effective search engine. It remains to be determined whether stemming would be the single largest improvement we could make to our search engine, but our research suggests that a good stemmer implementation would be a strong candidate.
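
As a rough illustration of why stemming and TF-IDF reinforce each other, here is a minimal Python sketch. Everything in it is an assumption for demonstration purposes: `naive_stem` is a toy suffix stripper rather than any of the stemmers we evaluated, the TF-IDF weighting (term count divided by document length, times `log(N / df) + 1`) is one common variant and not necessarily the one used in SoCeRC++, and the sample descriptors are made up.

```python
import math
import re
from collections import Counter


def naive_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str, stem: bool) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(word) for word in words] if stem else words


def rank(query: str, documents: dict[str, str], stem: bool = True) -> list[tuple[str, float]]:
    """Score every document against the query with TF-IDF, best match first."""
    doc_tokens = {name: tokenize(text, stem) for name, text in documents.items()}
    n_docs = len(doc_tokens)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in doc_tokens.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in doc_tokens.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query, stem):
            if term in counts:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up descriptors keyed by hypothetical function names.
DESCRIPTORS = {
    "sortArray": "sorts an array of integers in ascending order",
    "reverseString": "reverses the characters of a string",
    "mergeLists": "merges two sorted lists",
}

print(rank("sorting arrays", DESCRIPTORS, stem=True))
print(rank("sorting arrays", DESCRIPTORS, stem=False))
```

With stemming turned off, the query words "sorting" and "arrays" match none of the stored words exactly, so every score is zero; with stemming on, "sorting", "sorts", and "sorted" all reduce to the same term and TF-IDF has something to weigh.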

As demonstrated in our performance evaluation, the core requirement of parsing natural language has been met. Other than adding a stemmer as discussed above, the main avenue left for us is to add further optimizations over time. Freed from our time constraint, we would continue to improve the accuracy of our search results by refining our relevance engine. There are also runtime improvements to be made, such as reducing database calls by requesting information from the database only after a successful insertion or deletion. Aside from relevance and runtime, we could also standardize how search results are displayed on the website.
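
The database-call optimization mentioned above could look something like the following Python sketch: results are cached in memory and re-read only after a successful insertion or deletion. This is an assumption-laden illustration, not our implementation. It uses an in-memory SQLite database instead of MySQL to stay self-contained, and the `FunctionStore` class, its table layout, and the `db_reads` counter are hypothetical.

```python
import sqlite3


class FunctionStore:
    """Caches the function list and re-queries only after a successful write."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE functions (name TEXT PRIMARY KEY, descriptor TEXT)"
        )
        self._cache = None
        self.db_reads = 0  # counts SELECTs, only to make the effect visible

    def all_functions(self) -> list[tuple[str, str]]:
        if self._cache is None:
            self.db_reads += 1
            self._cache = self.conn.execute(
                "SELECT name, descriptor FROM functions ORDER BY name"
            ).fetchall()
        return self._cache

    def insert(self, name: str, descriptor: str) -> bool:
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT INTO functions VALUES (?, ?)", (name, descriptor)
                )
        except sqlite3.IntegrityError:
            return False  # failed insert: the cached data is still valid
        self._cache = None  # successful insert: force a fresh read next time
        return True

    def delete(self, name: str) -> bool:
        with self.conn:
            cursor = self.conn.execute(
                "DELETE FROM functions WHERE name = ?", (name,)
            )
        if cursor.rowcount == 0:
            return False  # nothing was deleted: keep the cache
        self._cache = None
        return True


store = FunctionStore()
store.insert("add", "adds two integers")
store.all_functions()
store.all_functions()
print("database reads so far:", store.db_reads)
```

Repeated searches between writes are then served from memory, and a failed upload or a delete that matches nothing does not trigger a new query.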

Because natural language is inherently subjective, there will always be improvements that can be made to parsing it. That is exactly why it remains a major area of study for computer scientists to this day, and why this section on future improvements could be a paper all its own. The runtime optimizations, however, are finite and will eventually no longer be an issue.