The EthOSS Analysis Tool is an end-to-end framework designed to assess and report on ethical behavior within open-source software (OSS) communities, leveraging data extracted from GitHub repositories. By analyzing interactions in GitHub issues and comments, EthOSS classifies behaviors according to ethical standards defined by the Contributor Covenant. The resulting insights help maintain healthy, inclusive, and productive open-source environments.
EthOSS includes the following main stages:
- Data Extraction: Collects and cleans issue and comment data from GitHub repositories.
- Metadata Extraction: Retrieves detailed repository information and contributor statistics.
- Comments Classification: Uses OpenAI's GPT-4o mini model to categorize each comment according to ethical flags.
- Report Generation: Creates interactive HTML reports that summarize ethical analysis, repository metadata, and community engagement patterns.
EthOSS uses specific ethical categories, known as "flags," derived from the Contributor Covenant, to evaluate interactions within OSS projects. These flags represent clearly defined positive and negative behaviors, facilitating structured and objective analysis.
| Flag | Description | Type |
|------|-------------|------|
| F1 | Empathy and kindness toward other community members. | Positive |
| F2 | Respect for differing opinions, viewpoints, and experiences. | Positive |
| F3 | Constructive feedback, substantial contributions, and helpful recommendations. | Positive |
| F4 | Accepting responsibility, apologizing, and learning from mistakes. | Positive |
| F5 | Prioritizing actions beneficial for the entire community. | Positive |
| F6 | Sexualized language or unwanted sexual attention. | Negative |
| F7 | Insulting, derogatory, or trolling comments. | Negative |
| F8 | Public harassment, intimidation, or threats. | Negative |
| F9 | Publishing private information without consent. | Negative |
| F11 | Comments that do not exhibit any of the specific ethical behaviors outlined above. | Neutral |
These categories provide comprehensive visibility into community dynamics and ethical behaviors within GitHub projects, enabling actionable insights to foster healthy and inclusive environments.
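Downstream stages need a machine-readable version of these flags. As an illustration only (not EthOSS's internal representation), they can be expressed as a simple lookup table:

```python
# Contributor Covenant-derived flags (illustrative representation).
# Keys mirror the table above; F10 is not defined in this document,
# so it is omitted here as well.
FLAGS = {
    "F1": ("Empathy and kindness toward other community members", "positive"),
    "F2": ("Respect for differing opinions, viewpoints, and experiences", "positive"),
    "F3": ("Constructive feedback and helpful recommendations", "positive"),
    "F4": ("Accepting responsibility and learning from mistakes", "positive"),
    "F5": ("Prioritizing actions beneficial for the entire community", "positive"),
    "F6": ("Sexualized language or unwanted sexual attention", "negative"),
    "F7": ("Insulting, derogatory, or trolling comments", "negative"),
    "F8": ("Public harassment, intimidation, or threats", "negative"),
    "F9": ("Publishing private information without consent", "negative"),
    "F11": ("No specific ethical behavior exhibited", "neutral"),
}
```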
Report Generation: This stage compiles all previous analyses into comprehensive, interactive HTML reports, providing a complete ethical and statistical evaluation of repository interactions:
Main Operations:
- Data Integration: Combines processed comments, issue data, repository metadata, and classification results into structured datasets.
- Interactive Visualizations: Generates interactive charts and tables using Plotly, illustrating ethical behaviors, comment activity, issue trends, and contributor engagement.
- Repository Metadata: Includes detailed metadata about the repository, such as contributors, bots, license details, popularity metrics (stars, forks, watchers), and the presence of a code of conduct.
- HTML Reports: Produces interactive HTML reports with multiple sections:
- Ethical Analysis: Distribution and frequency of ethical flags over time.
- Data Analysis: Statistical summaries of issues and comments.
- Repository Overview: Comprehensive repository metadata and contributor statistics.
Output:
- HTML reports are generated and stored in an organized directory structure (`data/reports/{language}/{repository}.html`).
Configuration:
- Dependencies: Python libraries including `pandas`, `plotly`, `json`, `base64`, `yaml`, `logging`, and `os`.
Report Sections:
- Introduction: Overview of dataset activity, summarizing total issues, comments, authors, and engagement trends.
- Issue Analysis: Visualizations of issue creation patterns, label usage, and contributor activity.
- Comment Analysis: Monthly distribution of comments and identification of top commenters.
- Ethical Analysis: Detailed breakdown of ethical flags and their frequency, highlighting community interactions and behavioral trends.
Example report views include the Ethical Analysis Overview and the Monthly Flag Distribution chart.
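As an example of how such a view can be produced, here is a minimal sketch of a monthly flag distribution chart built with Plotly. The column names (`comment_created_at`, `flags`) are assumptions, not EthOSS's actual schema:

```python
import pandas as pd
import plotly.express as px

# Illustrative sketch: render a monthly distribution of ethical flags
# as an embeddable Plotly HTML fragment.
def monthly_flag_chart(df: pd.DataFrame) -> str:
    df = df.copy()
    # Bucket comments by calendar month of creation.
    df["month"] = pd.to_datetime(df["comment_created_at"]).dt.to_period("M").astype(str)
    counts = df.groupby(["month", "flags"]).size().reset_index(name="count")
    fig = px.bar(counts, x="month", y="count", color="flags",
                 title="Monthly Flag Distribution")
    return fig.to_html(full_html=False, include_plotlyjs="cdn")
```

The returned fragment can be embedded into the final HTML report alongside the other sections.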
This structured approach ensures comprehensive visibility into the ethical and collaborative health of open-source repositories.
Clone the repository and install the dependencies:

```bash
git clone https://github.com/your-username/EthOSS.git
cd EthOSS
pip install -r requirements.txt
```
Set up the required environment variables for GitHub and OpenAI:
Linux / macOS:

```bash
export CRAWLER_ETHOSS="your_github_token"
export OPENAI_API_KEY="your_openai_api_key"
```

Windows:

```bat
set CRAWLER_ETHOSS=your_github_token
set OPENAI_API_KEY=your_openai_api_key
```
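As a quick sanity check before a run, a snippet like the following (illustrative, not part of EthOSS) verifies that both variables are set:

```python
import os

# Fail fast if either required token is missing from the environment.
for var in ("CRAWLER_ETHOSS", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"Missing required environment variable: {var}")
```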
Edit `config/config.json` to specify repositories and extraction date ranges:
```json
{
  "repos": [
    {
      "owner": "owner_name",
      "name": "repo_name",
      "language": "language"
    }
  ],
  "start_date": "YYYY-MM-DD",
  "end_date": "YYYY-MM-DD"
}
```
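For reference, a minimal sketch of how this configuration might be read and iterated; the loading code is illustrative, as EthOSS's own modules handle this internally:

```python
import json

# Read the repository list and extraction window from config/config.json.
with open("config/config.json", encoding="utf-8") as f:
    config = json.load(f)

for repo in config["repos"]:
    print(f"{repo['owner']}/{repo['name']} ({repo['language']})")
print(f"Extracting issues from {config['start_date']} to {config['end_date']}")
```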
Execute the complete analysis pipeline from the main directory:
```bash
python src/main.py
```
After running the pipeline, you'll find the following files and folders generated under the `data` directory:
```
data
├── comments
│   ├── raw_comments.csv                    # All extracted comments from processed issues
│   ├── cleaned_raw_comments.csv            # Comments cleaned (duplicates and bots removed)
│   ├── classified_comments.csv             # Comments classified with ethical flags (raw format)
│   └── processed_classified_comments.csv   # Classified comments structured for analysis
├── issues
│   ├── extracted_issues
│   │   └── {language}
│   │       └── owner_repo_issues.json      # Raw issue data from GitHub extraction
│   └── processed_issues
│       └── {language}
│           └── owner_repo_issues.json      # Issues with IDs and filtered comments
├── metadata
│   └── {language}
│       └── owner_repo_metadata.json        # Repository metadata from GitHub
├── ...                                     # Log file for extraction and processing
└── reports
    └── {language}
        └── owner_repo.html                 # Interactive HTML analysis report
```
- CSV Files: Data structured for easy analysis.
- JSON Files: Raw and processed issue data and repository metadata.
- HTML Reports: Interactive, visual summaries of repository analysis.
EthOSS performs a structured extraction and processing workflow to prepare GitHub repository data for ethical analysis. The workflow proceeds through the following stages:
Data Extraction: This stage extracts detailed issue data along with associated comments from the specified GitHub repositories, followed by a processing phase that cleans and structures the data:
Main Operations:
- API Interaction: Utilizes the GitHub GraphQL API to efficiently fetch issues and comments.
- Configuration Driven: Reads repository details (owner, repository name, language) and date range from a configuration file (`config/config.json`).
- Rate Limit Management: Automatically handles GitHub API rate limits to ensure continuous data collection.
- Data Filtering: Retrieves issues created within a specified date range.
- Structured Storage: Saves extracted data into organized JSON files (`data/issues/extracted_issues/{language}/{owner}_{repo}_issues.json`).
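A minimal sketch of this extraction step, assuming the token is exposed via `CRAWLER_ETHOSS` as described above. The query shape, page size, and field selection are illustrative, and the rate-limit handling and created-at date filter that EthOSS applies are omitted for brevity:

```python
import os
import requests

# GraphQL query fetching issues with their comments, paginated by cursor.
QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 50, after: $cursor,
           orderBy: {field: CREATED_AT, direction: ASC}) {
      pageInfo { hasNextPage endCursor }
      nodes {
        title
        createdAt
        comments(first: 50) {
          nodes { author { login } body createdAt }
        }
      }
    }
  }
}
"""

def fetch_issues(owner: str, name: str) -> list:
    headers = {"Authorization": f"Bearer {os.environ['CRAWLER_ETHOSS']}"}
    issues, cursor = [], None
    while True:
        resp = requests.post(
            "https://api.github.com/graphql",
            json={"query": QUERY,
                  "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers=headers,
        )
        resp.raise_for_status()
        page = resp.json()["data"]["repository"]["issues"]
        issues.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return issues
        cursor = page["pageInfo"]["endCursor"]
```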
The extracted issues then pass through the processing phase:
Main Operations:
- Data Cleaning: Filters comments according to a defined cutoff date (specified in `config/extraction.yaml`).
- Unique ID Generation: Adds universally unique identifiers (UUIDs) to each issue and comment to facilitate traceability and further analysis.
- Output Storage: Saves cleaned and structured data into a dedicated directory (`data/issues/processed_issues/{language}/{owner}_{repo}_issues.json`).
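A sketch of this processing step under the same assumptions; the field names follow the GraphQL sketch above, and the cutoff value comes from `config/extraction.yaml`:

```python
import uuid
from datetime import datetime, timezone

# Assign UUIDs to issues and comments, and drop comments created after
# the configured cutoff date (e.g. "2024-01-01").
def process_issues(raw_issues: list, cutoff: str) -> list:
    cutoff_dt = datetime.fromisoformat(cutoff).replace(tzinfo=timezone.utc)
    for issue in raw_issues:
        issue["issue_id"] = str(uuid.uuid4())
        kept = []
        for comment in issue["comments"]["nodes"]:
            created = datetime.fromisoformat(
                comment["createdAt"].replace("Z", "+00:00"))
            if created <= cutoff_dt:
                comment["comment_id"] = str(uuid.uuid4())
                kept.append(comment)
        issue["comments"]["nodes"] = kept
    return raw_issues
```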
Configuration:
- Files: `config/extraction.yaml`, `config/config.json`
- Dependencies: Python libraries including `requests`, `yaml`, `json`, `uuid`, `datetime`, and `logging`.
Metadata Extraction: This complementary process enhances the extracted data by providing comprehensive repository-level metadata:
Main Operations:
- API Interaction: Leverages the GitHub REST API to obtain detailed metadata.
- Metadata Collection: Retrieves repository information (stars, forks, license, description, topics, and code-of-conduct presence).
- Contributor Analysis: Determines total contributors and contribution diversity, identifying the top five contributors and their percentage of contributions.
- Bot Detection: Identifies commonly used moderation bots (e.g., `dependabot[bot]`).
- Output Storage: Stores enriched metadata into structured JSON files (`data/metadata/{language}/{owner}_{repo}_metadata.json`).
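A condensed sketch of this step using documented GitHub REST endpoints; only the first page of contributors is fetched, and error handling and the code-of-conduct lookup are omitted:

```python
import os
import requests

# Fetch repository info and contributor statistics via the REST API.
def fetch_metadata(owner: str, repo: str) -> dict:
    headers = {"Authorization": f"Bearer {os.environ['CRAWLER_ETHOSS']}"}
    base = f"https://api.github.com/repos/{owner}/{repo}"
    info = requests.get(base, headers=headers).json()
    contributors = requests.get(f"{base}/contributors",
                                params={"per_page": 100}, headers=headers).json()
    total = sum(c["contributions"] for c in contributors) or 1
    return {
        "stars": info["stargazers_count"],
        "forks": info["forks_count"],
        "watchers": info["subscribers_count"],
        "license": (info.get("license") or {}).get("spdx_id"),
        "topics": info.get("topics", []),
        "total_contributors": len(contributors),
        # Top five contributors and their share of all contributions.
        "top_contributors": [
            {"login": c["login"],
             "share_pct": round(100 * c["contributions"] / total, 1)}
            for c in contributors[:5]
        ],
        # GitHub marks bot accounts with type "Bot".
        "bots": [c["login"] for c in contributors if c.get("type") == "Bot"],
    }
```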
Configuration:
- Files: `config/extraction.yaml`, `config/config.json`
- Dependencies: Python libraries including `requests`, `yaml`, `json`, and `logging`.
This structure supports flexible addition of multiple repositories for broader analyses.
Comments Processing: This stage processes the extracted issues and comments, transforming them into a structured and cleaned dataset suitable for ethical analysis:
Main Operations:
- JSON Data Extraction: Extracts detailed comments from previously processed JSON files.
- Metadata Enrichment: Associates each comment with relevant issue and repository metadata (e.g., repository name, language, issue details, author).
- CSV Export: Compiles the enriched comments into a unified CSV file for further analysis (`data/comments/raw_comments.csv`).
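A simplified sketch of this flattening step; the directory layout matches the paths above, while the CSV column names are assumptions based on fields referenced later in this document:

```python
import csv
import json
from pathlib import Path

# Flatten processed issue JSON files into one CSV of comment rows
# enriched with issue and repository metadata.
def export_comments(processed_dir: str, out_csv: str) -> None:
    fieldnames = ["repo", "language", "issue_title", "issue_id",
                  "comment_id", "comment_author", "comment_body",
                  "comment_created_at"]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # Files are stored as {processed_dir}/{language}/{owner}_{repo}_issues.json.
        for path in Path(processed_dir).glob("*/*.json"):
            issues = json.loads(path.read_text(encoding="utf-8"))
            for issue in issues:
                for c in issue["comments"]["nodes"]:
                    writer.writerow({
                        "repo": path.stem.replace("_issues", ""),
                        "language": path.parent.name,
                        "issue_title": issue["title"],
                        "issue_id": issue["issue_id"],
                        "comment_id": c["comment_id"],
                        "comment_author": (c.get("author") or {}).get("login", ""),
                        "comment_body": c["body"],
                        "comment_created_at": c["createdAt"],
                    })
```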
Configuration:
- Files: `config/data_pipeline.yaml`
- Dependencies: Python libraries including `json`, `yaml`, `csv`, `os`, `logging`, and `pandas`.
A cleaning step then refines the raw comments:
Main Operations:
- Removal of Incomplete Entries: Eliminates rows with missing or empty `comment_author` or `comment_body` fields.
- Bot Comment Filtering: Removes comments authored by bots identified through standard naming conventions (e.g., names containing '-bot') and a predefined list of known bots.
- Duplicate Removal: Eliminates duplicated comments based on identical comment content (`comment_body`).
- Cleaned CSV Output: Saves the refined data into a cleaned CSV file (`data/comments/cleaned_raw_comments.csv`).
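The cleaning logic can be sketched in a few lines of pandas; the bot list mirrors the `classic_bots` entries shown in the configuration below:

```python
import pandas as pd

# Known moderation bots, matching the classic_bots list in data_pipeline.yaml.
KNOWN_BOTS = {"probot", "stale", "dependabot", "github-actions",
              "mergify", "cla-assistant", "danger"}

def clean_comments(in_csv: str, out_csv: str) -> None:
    df = pd.read_csv(in_csv)
    # Drop rows with a missing or empty author or body.
    df = df.dropna(subset=["comment_author", "comment_body"])
    df = df[df["comment_body"].astype(str).str.strip() != ""]
    # Remove bot-authored comments by naming convention or known-bot list.
    author = df["comment_author"].astype(str).str.lower()
    df = df[~(author.str.contains("-bot") | author.isin(KNOWN_BOTS))]
    # Deduplicate on identical comment content.
    df = df.drop_duplicates(subset=["comment_body"])
    df.to_csv(out_csv, index=False)
```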
Configuration:
- Files: `config/data_pipeline.yaml`
- Dependencies: Python libraries including `pandas`, `yaml`, `os`, and `logging`.
Example `config/data_pipeline.yaml`:

```yaml
directories:
  issues_cleaned_dir: "data/issues/processed_issues"
files:
  raw_issues_file: "data/comments/raw_comments.csv"
  cleaned_raw_issues_file: "data/comments/cleaned_raw_comments.csv"
bots:
  classic_bots:
    - "probot"
    - "stale"
    - "dependabot"
    - "github-actions"
    - "mergify"
    - "cla-assistant"
    - "danger"
```
This setup enables flexible and efficient data processing for comprehensive ethical analyses.
Comments Classification: This stage classifies comments from GitHub issues to detect ethical behaviors, using prompt engineering with OpenAI's GPT-4o mini model:
Main Operations:
- Prompt Engineering: Uses GPT-4o mini to analyze each comment and classify it according to predefined ethical flags, considering context from the issue title.
- Classification Categories: Identifies behaviors such as empathy, constructive feedback, respect for differing opinions, harassment, insults, and other defined positive and negative behaviors.
- JSON Output: Each comment receives a structured JSON output specifying the identified flags and reasons for classification.
- CSV Output: Saves the classification results to a CSV file (`data/comments/classified_comments.csv`).
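A minimal sketch of the per-comment classification call; the prompt wording and the model identifier are assumptions (presumably `gpt-4o-mini`), and batching and retry logic are omitted:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; EthOSS's actual prompt is defined in its classifier.
SYSTEM_PROMPT = (
    "You label GitHub issue comments with the Contributor Covenant-derived "
    "ethical flags F1-F9, or F11 if none apply. Respond with JSON: "
    '{"flags": ["..."], "reason": "..."}'
)

def classify_comment(issue_title: str, comment_body: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Issue: {issue_title}\nComment: {comment_body}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```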
Configuration:
- Files: `config/classifier.yaml`
- Dependencies: Python libraries including `openai`, `pandas`, `yaml`, `json`, `logging`, and `os`.
The classifier's raw output is then post-processed for analysis:
Main Operations:
- JSON Parsing and Validation: Extracts ethical flags and corresponding reasons from the JSON outputs generated during classification.
- Data Enhancement: Integrates parsed flags and reasons into the original dataset, creating structured columns for easy analysis and reporting.
- Final CSV Generation: Outputs an enhanced and easily analyzable CSV file (`data/comments/processed_classified_comments.csv`).
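A sketch of this parsing step; the `classification` column name is a hypothetical placeholder for wherever the raw model output is stored:

```python
import json
import re
import pandas as pd

# Pull flags and reasons out of the raw JSON strings produced by the
# classifier and attach them as structured columns.
def process_classifications(in_csv: str, out_csv: str) -> None:
    df = pd.read_csv(in_csv)

    def parse(raw: str) -> pd.Series:
        # Tolerate extra text around the JSON object (e.g., code fences).
        match = re.search(r"\{.*\}", str(raw), re.DOTALL)
        try:
            obj = json.loads(match.group(0)) if match else {}
        except json.JSONDecodeError:
            obj = {}
        return pd.Series({"flags": ",".join(obj.get("flags", [])),
                          "reason": obj.get("reason", "")})

    df[["flags", "reason"]] = df["classification"].apply(parse)
    df.to_csv(out_csv, index=False)
```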
Configuration:
- Files: `config/classifier.yaml`
- Dependencies: Python libraries including `pandas`, `yaml`, `json`, `re`, and `logging`.
Example `config/classifier.yaml`:

```yaml
files:
  raw_comments_file: "data/comments/cleaned_raw_comments.csv"
  classifier_output_csv: "data/comments/classified_comments.csv"
  processed_classification_output: "data/comments/processed_classified_comments.csv"
openai:
  api_key_env: "OPENAI_API_KEY"
```
This configuration ensures streamlined integration and flexible execution of the classification process.