Multi-Level AI-Driven Analysis of Software Repository Similarities

Honglin Zhang, Leyu Zhang, Lei Fang, Rosa Filgueira*

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

Abstract / Description of output

This paper introduces significant enhancements to RepoSim4Py and RepoSnipy, advanced semantic tools for deep analysis of software repositories. The RepoSim4Py command-line toolbox now supports multi-level embedding, encompassing code, documentation, requirements, README, and comprehensive repository analysis, which enables the understanding of repository dynamics. Concurrently, the RepoSnipy web-based search engine facilitates sophisticated repository similarity searches and introduces clustering based on both repository tags (topic cluster) and code embeddings (code cluster). We also introduce SimilarityCal, a novel binary classification model trained on these clusters, to predict and quantify repository similarities with high accuracy. These developments provide researchers and developers with powerful tools to navigate the complex landscape of software repositories, improving efficiency in software development and fostering innovation through better reuse of existing resources.
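As a rough illustration of what comparing repositories at multiple embedding levels could look like, the sketch below scores two repositories level by level. It is not the authors' implementation: the abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the names `LEVELS`, `cosine_similarity`, and `multi_level_similarity` as well as the toy vectors are hypothetical. In the real tools the vectors would come from pretrained language models, not hand-written lists.

```python
import math

# Hypothetical level names, mirroring the levels listed in the abstract.
LEVELS = ("code", "documentation", "requirements", "readme", "repository")


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def multi_level_similarity(repo_a, repo_b, levels=LEVELS):
    """Return one similarity score per level that both repositories provide."""
    return {
        level: cosine_similarity(repo_a[level], repo_b[level])
        for level in levels
        if level in repo_a and level in repo_b
    }


# Toy embeddings for illustration only.
repo_a = {"code": [1.0, 0.0, 1.0], "readme": [0.5, 0.5, 0.0]}
repo_b = {"code": [1.0, 0.0, 1.0], "readme": [0.0, 0.5, 0.5]}

scores = multi_level_similarity(repo_a, repo_b)
for level, score in scores.items():
    print(f"{level}: {score:.2f}")
```

Keeping the scores separate per level, instead of collapsing them into one number, lets a reader see when two repositories share code but differ in documentation, or the reverse.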
Original language: English
Number of pages: 10
Publication status: Published - 23 Sept 2024
Event: IEEE eScience 2024 - Senri Life Science Center, Osaka, Japan
Duration: 16 Sept 2024 to 20 Sept 2024
https://www.escience-conference.org/2024/

Conference

Conference: IEEE eScience 2024
Abbreviated title: eScience 2024
Country/Territory: Japan
City: Osaka
Period: 16/09/24 to 20/09/24

Keywords / Materials (for Non-textual outputs)

  • repository similarity
  • semantic analysis
  • repository clustering
  • code understanding
  • multi-level embeddings
  • pretrained language models
  • GitHub
  • mining software repositories
