Git-Pandas is a Python library that turns Git repository data into pandas DataFrames, making it easy to analyze and visualize a codebase's history, contributors, and development patterns. Built on top of GitPython, it provides a simple interface for extracting insights from your Git repositories.
- Easy to Use: Simple API that converts Git data into familiar pandas DataFrames
- Comprehensive Analysis: From basic commit history to complex metrics like bus factor
- Flexible: Works with single repositories or entire project directories
- Visualization Ready: Built-in plotting utilities for common Git analytics
- Performance Optimized: Optional caching support for memory-intensive operations
The Repository class provides a wrapper around a single Git repository, offering methods to:
- Extract commit history with filtering by extension and directory
- Analyze file changes and blame information
- Track branch and tag information
- Generate cumulative blame statistics
- Calculate file ownership and contribution patterns
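To make the idea of approximate file ownership concrete, here is a minimal standalone sketch. It does not call Git-Pandas. It assumes blame data has already been reduced to hypothetical `(file, author, line_count)` rows, and it assumes that a file's "owner" is the author with the most blamed lines. The library's own approximation may differ.

```python
from collections import Counter, defaultdict


def approximate_ownership(blame_rows):
    """Return {file: (owner, share)} from (file, author, line_count) rows.

    Assumption for this sketch: the owner of a file is the author with the
    most blamed lines, and share is that author's fraction of the file.
    """
    lines_by_file = defaultdict(Counter)
    for path, author, line_count in blame_rows:
        lines_by_file[path][author] += line_count

    ownership = {}
    for path, counts in lines_by_file.items():
        owner, owned = counts.most_common(1)[0]
        ownership[path] = (owner, owned / sum(counts.values()))
    return ownership


# Hypothetical blame data: (file, author, lines attributed by blame)
rows = [
    ("core.py", "alice", 120),
    ("core.py", "bob", 40),
    ("utils.py", "bob", 30),
    ("utils.py", "alice", 10),
    ("utils.py", "bob", 20),
]
ownership = approximate_ownership(rows)
```

With real data, the rows would come from a blame DataFrame rather than a hand-written list.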
The ProjectDirectory class enables analysis across multiple repositories:
- Automatically discovers and analyzes nested Git repositories
- Aggregates metrics across multiple repositories
- Provides project-level insights and statistics
- Calculates cross-repository metrics like total development time
- Commit History: Track changes with extension and directory filtering
- File Analysis: Monitor edited files and blame information
- Branch & Tag Management: Access repository structure information
- Cumulative Blame: Generate time-series data of code ownership
- File Ownership: Approximate file ownership and contribution patterns
- Bus Factor: Calculate project sustainability metrics
- Development Time: Estimate hours spent per project or author
- Contributor Analysis: Track individual and team contributions
- Project Health: Generate comprehensive project information tables
- Profile Analysis: Analyze GitHub.com profiles via the GitHubProfile object
- Repository Metrics: Extract repository-specific insights
- Contributor Insights: Track external contributions and collaborations
- Plotting Helpers: Built-in utilities for common Git analytics
- Punchcard Analysis: Generate and visualize commit patterns
- Blame Visualization: Create cumulative blame charts
- Time Series Analysis: Track changes and patterns over time
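Two of the metrics listed above, bus factor and development time, can be illustrated with a short standalone sketch. It does not use the Git-Pandas API. The definitions are common ones, assumed here for illustration: bus factor is the smallest number of authors who together account for at least half of the blamed lines, and hours are estimated by summing gaps between commits that fall within a grouping window. The function names, the 50% threshold, the 30-minute window, and the 0.5-hour allowance per isolated commit are all assumptions of this example.

```python
from datetime import datetime, timedelta


def bus_factor(lines_by_author, threshold=0.5):
    """Smallest number of authors whose blamed lines reach `threshold` of the total."""
    total = sum(lines_by_author.values())
    if total == 0:
        return 0
    covered = 0
    ranked = sorted(lines_by_author.values(), reverse=True)
    for n_authors, count in enumerate(ranked, start=1):
        covered += count
        if covered >= threshold * total:
            return n_authors
    return len(ranked)


def estimate_hours(commit_times, grouping_window=timedelta(minutes=30),
                   single_commit_hours=0.5):
    """Estimate hours worked from commit timestamps.

    Gaps no longer than `grouping_window` count as working time; each commit
    that starts a new session adds a flat `single_commit_hours`.
    """
    times = sorted(commit_times)
    if not times:
        return 0.0
    hours = single_commit_hours  # allowance for the first commit
    for previous, current in zip(times, times[1:]):
        gap = current - previous
        if gap <= grouping_window:
            hours += gap.total_seconds() / 3600
        else:
            hours += single_commit_hours
    return hours


# Hypothetical inputs
factor = bus_factor({"alice": 400, "bob": 300, "carol": 200, "dave": 100})
day = datetime(2024, 1, 15)
estimate = estimate_hours([
    day.replace(hour=9, minute=0),
    day.replace(hour=9, minute=20),
    day.replace(hour=9, minute=45),
    day.replace(hour=14, minute=0),
])
```

In this example alice and bob together hold 700 of 1,000 lines, so the bus factor is 2. The three morning commits form one session and the afternoon commit starts another.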
Git-Pandas supports Python 2.7+ and 3.3+. Install using pip:
pip install git-pandas
from gitpandas import Repository
# Analyze a single repository
repo = Repository('/path/to/repo')
commits_df = repo.commit_history()
blame_df = repo.blame()
# Analyze multiple repositories
from gitpandas import ProjectDirectory
project = ProjectDirectory('/path/to/project')
project_info = project.general_information()
Comprehensive documentation is available at http://wdm0006.github.io/git-pandas/
To speed up expensive, memory-intensive operations, Git-Pandas supports optional caching:
- Memory-based caching
- Redis-based caching
- Configurable cache durations
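The idea behind a memory cache with a configurable duration can be sketched in a few lines of plain Python. This is not Git-Pandas's cache implementation or API. The `ttl_cache` decorator and the `expensive_blame` function are hypothetical names used only for illustration.

```python
import time
from functools import wraps


def ttl_cache(max_age_seconds, clock=time.monotonic):
    """Cache results in memory and recompute once an entry is too old."""
    def decorator(func):
        store = {}  # maps args -> (timestamp, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < max_age_seconds:
                return hit[1]
            value = func(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator


calls = []  # records how often the expensive function really runs


@ttl_cache(max_age_seconds=60)
def expensive_blame(path):
    # Stand-in for a slow repository operation (hypothetical result).
    calls.append(path)
    return {"path": path, "lines": 1234}


first = expensive_blame("/path/to/repo")
second = expensive_blame("/path/to/repo")  # served from the cache
```

A Redis-backed cache follows the same pattern, with the in-process dictionary replaced by an external store so that results survive across processes.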
- GitNOC: Network of Code analysis tool
- Commit Opener: Commit analysis and visualization tool
We welcome contributions! Please review our Contributing Guidelines for details on:
- Code of Conduct
- Development Setup
- Pull Request Process
- Starter Issues
This project is BSD licensed (see LICENSE.md).