Mining business topics in source code using Latent Dirichlet Allocation

Girish Maskeri*, Santonu Sarkar, Kenneth Heafield

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people with no prior knowledge of the application find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in large corpora of text documents. However, its applicability to extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. The method has been applied to a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.

Original language: English
Title of host publication: Proceedings of the 2008 1st India Software Engineering Conference, ISEC'08
Number of pages: 8
ISBN (Electronic): 9781595939173
Publication status: Published - 19 Feb 2008
Event: 2008 1st India Software Engineering Conference, ISEC'08 - Hyderabad, India
Duration: 19 Feb 2008 to 22 Feb 2008


Conference: 2008 1st India Software Engineering Conference, ISEC'08

Keywords / Materials (for Non-textual outputs)

  • LDA
  • Maintenance
  • Program comprehension

