The EthOSS Analysis Tool is an end-to-end framework designed to assess and report on ethical behavior within open-source software (OSS) communities, leveraging data extracted from GitHub repositories. By analyzing interactions in GitHub issues and comments, EthOSS classifies behaviors according to ethical standards defined by the Contributor Covenant. The resulting insights help maintain healthy, inclusive, and productive open-source environments.
EthOSS includes the following main stages:
- Data Extraction: Collects and cleans issue and comment data from GitHub repositories.
- Metadata Extraction: Retrieves detailed repository information and contributor statistics.
- Comments Classification: Uses OpenAI's GPT-4o mini model to categorize each comment according to ethical flags.
- Report Generation: Creates interactive HTML reports that summarize ethical analysis, repository metadata, and community engagement patterns.
EthOSS uses specific ethical categories, known as "flags," derived from the Contributor Covenant, to evaluate interactions within OSS projects. These flags represent clearly defined positive and negative behaviors, facilitating structured and objective analysis.
| Flag | Description | Type |
|------|-------------|------|
| F1 | Empathy and kindness toward other community members. | Positive |
| F2 | Respect for differing opinions, viewpoints, and experiences. | Positive |
| F3 | Constructive feedback, substantial contributions, and helpful recommendations. | Positive |
| F4 | Accepting responsibility, apologizing, and learning from mistakes. | Positive |
| F5 | Prioritizing actions beneficial for the entire community. | Positive |
| F6 | Sexualized language or unwanted sexual attention. | Negative |
| F7 | Insulting, derogatory, or trolling comments. | Negative |
| F8 | Public harassment, intimidation, or threats. | Negative |
| F9 | Publishing private information without consent. | Negative |
| F11 | Comments that do not exhibit any of the specific ethical behaviors outlined above. | Neutral |
These categories provide comprehensive visibility into community dynamics and ethical behaviors within GitHub projects, enabling actionable insights to foster healthy and inclusive environments.
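Downstream stages need a machine-readable version of these flags. As an illustration only (not EthOSS's internal representation), they can be expressed as a simple lookup table:

```python
# Contributor Covenant-derived flags (illustrative representation).
# Keys mirror the table above; F10 is not defined in this document,
# so it is omitted here as well.
FLAGS = {
    "F1": ("Empathy and kindness toward other community members", "positive"),
    "F2": ("Respect for differing opinions, viewpoints, and experiences", "positive"),
    "F3": ("Constructive feedback and helpful recommendations", "positive"),
    "F4": ("Accepting responsibility and learning from mistakes", "positive"),
    "F5": ("Prioritizing actions beneficial for the entire community", "positive"),
    "F6": ("Sexualized language or unwanted sexual attention", "negative"),
    "F7": ("Insulting, derogatory, or trolling comments", "negative"),
    "F8": ("Public harassment, intimidation, or threats", "negative"),
    "F9": ("Publishing private information without consent", "negative"),
    "F11": ("No specific ethical behavior exhibited", "neutral"),
}
```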
Report Generation: This stage compiles all previous analyses into comprehensive, interactive HTML reports, providing a complete ethical and statistical evaluation of repository interactions:
Main Operations:
- Data Integration: Combines processed comments, issue data, repository metadata, and classification results into structured datasets.
- Interactive Visualizations: Generates interactive charts and tables using Plotly, illustrating ethical behaviors, comment activity, issue trends, and contributor engagement.
- Repository Metadata: Includes detailed metadata about the repository, such as contributors, bots, license details, popularity metrics (stars, forks, watchers), and the presence of a code of conduct.
- HTML Reports: Produces interactive HTML reports with multiple sections:
- Ethical Analysis: Distribution and frequency of ethical flags over time.
- Data Analysis: Statistical summaries of issues and comments.
- Repository Overview: Comprehensive repository metadata and contributor statistics.
Output:
- HTML reports are generated and stored in an organized directory structure (`data/reports/{language}/{repository}.html`).
Configuration:
- Dependencies: Python libraries including `pandas`, `plotly`, `json`, `base64`, `yaml`, `logging`, and `os`.
Report Sections:
- Introduction: Overview of dataset activity, summarizing total issues, comments, authors, and engagement trends.
- Issue Analysis: Visualizations of issue creation patterns, label usage, and contributor activity.
- Comment Analysis: Monthly distribution of comments and identification of top commenters.
- Ethical Analysis: Detailed breakdown of ethical flags and their frequency, highlighting community interactions and behavioral trends.
Example report views include the Ethical Analysis Overview and the Monthly Flag Distribution chart.
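As an example of how such a view can be produced, here is a minimal sketch of a monthly flag distribution chart built with Plotly. The column names (`comment_created_at`, `flags`) are assumptions, not EthOSS's actual schema:

```python
import pandas as pd
import plotly.express as px

# Illustrative sketch: render a monthly distribution of ethical flags
# as an embeddable Plotly HTML fragment.
def monthly_flag_chart(df: pd.DataFrame) -> str:
    df = df.copy()
    # Bucket comments by calendar month of creation.
    df["month"] = pd.to_datetime(df["comment_created_at"]).dt.to_period("M").astype(str)
    counts = df.groupby(["month", "flags"]).size().reset_index(name="count")
    fig = px.bar(counts, x="month", y="count", color="flags",
                 title="Monthly Flag Distribution")
    return fig.to_html(full_html=False, include_plotlyjs="cdn")
```

The returned fragment can be embedded into the final HTML report alongside the other sections.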
This structured approach ensures comprehensive visibility into the ethical and collaborative health of open-source repositories.
Clone the repository and install the dependencies:

```bash
git clone https://github.com/your-username/EthOSS.git
cd EthOSS
pip install -r requirements.txt
```
Set up the required environment variables for GitHub and OpenAI:
Linux / macOS:

```bash
export CRAWLER_ETHOSS="your_github_token"
export OPENAI_API_KEY="your_openai_api_key"
```

Windows:

```bat
set CRAWLER_ETHOSS=your_github_token
set OPENAI_API_KEY=your_openai_api_key
```
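As a quick sanity check before a run, a snippet like the following (illustrative, not part of EthOSS) verifies that both variables are set:

```python
import os

# Fail fast if either required token is missing from the environment.
for var in ("CRAWLER_ETHOSS", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"Missing required environment variable: {var}")
```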
Edit `config/config.json` to specify repositories and extraction date ranges:
```json
{
  "repos": [
    {
      "owner": "owner_name",
      "name": "repo_name",
      "language": "language"
    }
  ],
  "start_date": "YYYY-MM-DD",
  "end_date": "YYYY-MM-DD"
}
```
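For reference, a minimal sketch of how this configuration might be read and iterated; the loading code is illustrative, as EthOSS's own modules handle this internally:

```python
import json

# Read the repository list and extraction window from config/config.json.
with open("config/config.json", encoding="utf-8") as f:
    config = json.load(f)

for repo in config["repos"]:
    print(f"{repo['owner']}/{repo['name']} ({repo['language']})")
print(f"Extracting issues from {config['start_date']} to {config['end_date']}")
```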
Execute the complete analysis pipeline from the main directory:
```bash
python src/main.py
```
After running the pipeline, you'll find the following files and folders generated under the `data` directory:
```
data
├── comments
│   ├── raw_comments.csv                    # All extracted comments from processed issues
│   ├── cleaned_raw_comments.csv            # Comments cleaned (duplicates and bots removed)
│   ├── classified_comments.csv             # Comments classified with ethical flags (raw format)
│   └── processed_classified_comments.csv   # Classified comments structured for analysis
├── issues
│   ├── extracted_issues
│   │   └── {language}
│   │       └── owner_repo_issues.json      # Raw issue data from GitHub extraction
│   └── processed_issues
│       └── {language}
│           └── owner_repo_issues.json      # Issues with IDs and filtered comments
├── metadata
│   └── {language}
│       └── owner_repo_metadata.json        # Repository metadata from GitHub
├── ...                                     # Log file for extraction and processing
└── reports
    └── {language}
        └── owner_repo.html                 # Interactive HTML analysis report
```
- CSV Files: Data structured for easy analysis.
- JSON Files: Raw and processed issue data and repository metadata.
- HTML Reports: Interactive, visual summaries of repository analysis.
EthOSS performs a structured extraction and processing workflow to prepare GitHub repository data for ethical analysis. The workflow proceeds through the following stages:
Data Extraction: This stage extracts detailed issue data along with associated comments from the specified GitHub repositories, followed by a processing phase that cleans and structures the data:
Main Operations:
- API Interaction: Utilizes the GitHub GraphQL API to efficiently fetch issues and comments.
- Configuration Driven: Reads repository details (owner, repository name, language) and date range from a configuration file (`config/config.json`).
- Rate Limit Management: Automatically handles GitHub API rate limits to ensure continuous data collection.
- Data Filtering: Retrieves issues created within a specified date range.
- Structured Storage: Saves extracted data into organized JSON files (`data/issues/extracted_issues/{language}/{owner}_{repo}_issues.json`).
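A minimal sketch of this extraction step, assuming the token is exposed via `CRAWLER_ETHOSS` as described above. The query shape, page size, and field selection are illustrative, and the rate-limit handling and created-at date filter that EthOSS applies are omitted for brevity:

```python
import os
import requests

# GraphQL query fetching issues with their comments, paginated by cursor.
QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 50, after: $cursor,
           orderBy: {field: CREATED_AT, direction: ASC}) {
      pageInfo { hasNextPage endCursor }
      nodes {
        title
        createdAt
        comments(first: 50) {
          nodes { author { login } body createdAt }
        }
      }
    }
  }
}
"""

def fetch_issues(owner: str, name: str) -> list:
    headers = {"Authorization": f"Bearer {os.environ['CRAWLER_ETHOSS']}"}
    issues, cursor = [], None
    while True:
        resp = requests.post(
            "https://api.github.com/graphql",
            json={"query": QUERY,
                  "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers=headers,
        )
        resp.raise_for_status()
        page = resp.json()["data"]["repository"]["issues"]
        issues.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return issues
        cursor = page["pageInfo"]["endCursor"]
```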
The extracted issues then pass through the processing phase:
Main Operations:
- Data Cleaning: Filters comments according to a defined cutoff date (specified in `config/extraction.yaml`).
- Unique ID Generation: Adds universally unique identifiers (UUIDs) to each issue and comment to facilitate traceability and further analysis.
- Output Storage: Saves cleaned and structured data into a dedicated directory (`data/issues/processed_issues/{language}/{owner}_{repo}_issues.json`).
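A sketch of this processing step under the same assumptions; the field names follow the GraphQL sketch above, and the cutoff value comes from `config/extraction.yaml`:

```python
import uuid
from datetime import datetime, timezone

# Assign UUIDs to issues and comments, and drop comments created after
# the configured cutoff date (e.g. "2024-01-01").
def process_issues(raw_issues: list, cutoff: str) -> list:
    cutoff_dt = datetime.fromisoformat(cutoff).replace(tzinfo=timezone.utc)
    for issue in raw_issues:
        issue["issue_id"] = str(uuid.uuid4())
        kept = []
        for comment in issue["comments"]["nodes"]:
            created = datetime.fromisoformat(
                comment["createdAt"].replace("Z", "+00:00"))
            if created <= cutoff_dt:
                comment["comment_id"] = str(uuid.uuid4())
                kept.append(comment)
        issue["comments"]["nodes"] = kept
    return raw_issues
```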
Configuration:
- Files: `config/extraction.yaml`, `config/config.json`
- Dependencies: Python libraries including `requests`, `yaml`, `json`, `uuid`, `datetime`, and `logging`.
Metadata Extraction: This complementary process enhances the extracted data by providing comprehensive repository-level metadata:
Main Operations:
- API Interaction: Leverages the GitHub REST API to obtain detailed metadata.
- Metadata Collection: Retrieves repository information (stars, forks, license, description, topics, and code-of-conduct presence).
- Contributor Analysis: Determines total contributors and contribution diversity, identifying the top five contributors and their percentage of contributions.
- Bot Detection: Identifies commonly used moderation bots (e.g., `dependabot[bot]`).
- Output Storage: Stores enriched metadata into structured JSON files (`data/metadata/{language}/{owner}_{repo}_metadata.json`).
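A condensed sketch of this step using documented GitHub REST endpoints; only the first page of contributors is fetched, and error handling and the code-of-conduct lookup are omitted:

```python
import os
import requests

# Fetch repository info and contributor statistics via the REST API.
def fetch_metadata(owner: str, repo: str) -> dict:
    headers = {"Authorization": f"Bearer {os.environ['CRAWLER_ETHOSS']}"}
    base = f"https://api.github.com/repos/{owner}/{repo}"
    info = requests.get(base, headers=headers).json()
    contributors = requests.get(f"{base}/contributors",
                                params={"per_page": 100}, headers=headers).json()
    total = sum(c["contributions"] for c in contributors) or 1
    return {
        "stars": info["stargazers_count"],
        "forks": info["forks_count"],
        "watchers": info["subscribers_count"],
        "license": (info.get("license") or {}).get("spdx_id"),
        "topics": info.get("topics", []),
        "total_contributors": len(contributors),
        # Top five contributors and their share of all contributions.
        "top_contributors": [
            {"login": c["login"],
             "share_pct": round(100 * c["contributions"] / total, 1)}
            for c in contributors[:5]
        ],
        # GitHub marks bot accounts with type "Bot".
        "bots": [c["login"] for c in contributors if c.get("type") == "Bot"],
    }
```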
Configuration:
- Files: `config/extraction.yaml`, `config/config.json`
- Dependencies: Python libraries including `requests`, `yaml`, `json`, and `logging`.
This structure supports flexible addition of multiple repositories for broader analyses.
Comments Processing: This stage processes the extracted issues and comments, transforming them into a structured and cleaned dataset suitable for ethical analysis:
Main Operations:
- JSON Data Extraction: Extracts detailed comments from previously processed JSON files.
- Metadata Enrichment: Associates each comment with relevant issue and repository metadata (e.g., repository name, language, issue details, author).
- CSV Export: Compiles the enriched comments into a unified CSV file for further analysis (`data/comments/raw_comments.csv`).
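A simplified sketch of this flattening step; the directory layout matches the paths above, while the CSV column names are assumptions based on fields referenced later in this document:

```python
import csv
import json
from pathlib import Path

# Flatten processed issue JSON files into one CSV of comment rows
# enriched with issue and repository metadata.
def export_comments(processed_dir: str, out_csv: str) -> None:
    fieldnames = ["repo", "language", "issue_title", "issue_id",
                  "comment_id", "comment_author", "comment_body",
                  "comment_created_at"]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # Files are stored as {processed_dir}/{language}/{owner}_{repo}_issues.json.
        for path in Path(processed_dir).glob("*/*.json"):
            issues = json.loads(path.read_text(encoding="utf-8"))
            for issue in issues:
                for c in issue["comments"]["nodes"]:
                    writer.writerow({
                        "repo": path.stem.replace("_issues", ""),
                        "language": path.parent.name,
                        "issue_title": issue["title"],
                        "issue_id": issue["issue_id"],
                        "comment_id": c["comment_id"],
                        "comment_author": (c.get("author") or {}).get("login", ""),
                        "comment_body": c["body"],
                        "comment_created_at": c["createdAt"],
                    })
```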
Configuration:
- Files: `config/data_pipeline.yaml`
- Dependencies: Python libraries including `json`, `yaml`, `csv`, `os`, `logging`, and `pandas`.
A cleaning step then refines the raw comments:
Main Operations:
- Removal of Incomplete Entries: Eliminates rows with missing or empty `comment_author` or `comment_body` fields.
- Bot Comment Filtering: Removes comments authored by bots identified through standard naming conventions (e.g., names containing '-bot') and a predefined list of known bots.
- Duplicate Removal: Eliminates duplicated comments based on identical comment content (`comment_body`).
- Cleaned CSV Output: Saves the refined data into a cleaned CSV file (`data/comments/cleaned_raw_comments.csv`).
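The cleaning logic can be sketched in a few lines of pandas; the bot list mirrors the `classic_bots` entries shown in the configuration below:

```python
import pandas as pd

# Known moderation bots, matching the classic_bots list in data_pipeline.yaml.
KNOWN_BOTS = {"probot", "stale", "dependabot", "github-actions",
              "mergify", "cla-assistant", "danger"}

def clean_comments(in_csv: str, out_csv: str) -> None:
    df = pd.read_csv(in_csv)
    # Drop rows with a missing or empty author or body.
    df = df.dropna(subset=["comment_author", "comment_body"])
    df = df[df["comment_body"].astype(str).str.strip() != ""]
    # Remove bot-authored comments by naming convention or known-bot list.
    author = df["comment_author"].astype(str).str.lower()
    df = df[~(author.str.contains("-bot") | author.isin(KNOWN_BOTS))]
    # Deduplicate on identical comment content.
    df = df.drop_duplicates(subset=["comment_body"])
    df.to_csv(out_csv, index=False)
```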
Configuration:
- Files: `config/data_pipeline.yaml`
- Dependencies: Python libraries including `pandas`, `yaml`, `os`, and `logging`.
Example `config/data_pipeline.yaml`:

```yaml
directories:
  issues_cleaned_dir: "data/issues/processed_issues"
files:
  raw_issues_file: "data/comments/raw_comments.csv"
  cleaned_raw_issues_file: "data/comments/cleaned_raw_comments.csv"
bots:
  classic_bots:
    - "probot"
    - "stale"
    - "dependabot"
    - "github-actions"
    - "mergify"
    - "cla-assistant"
    - "danger"
```
This setup enables flexible and efficient data processing for comprehensive ethical analyses.
Comments Classification: This stage classifies comments from GitHub issues to detect ethical behaviors, using prompt engineering with OpenAI's GPT-4o mini model:
Main Operations:
- Prompt Engineering: Uses GPT-4o mini to analyze each comment and classify it according to predefined ethical flags, considering context from the issue title.
- Classification Categories: Identifies behaviors such as empathy, constructive feedback, respect for differing opinions, harassment, insults, and other defined positive and negative behaviors.
- JSON Output: Each comment receives a structured JSON output specifying the identified flags and reasons for classification.
- CSV Output: Saves the classification results to a CSV file (`data/comments/classified_comments.csv`).
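A minimal sketch of the per-comment classification call; the prompt wording and the model identifier are assumptions (presumably `gpt-4o-mini`), and batching and retry logic are omitted:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; EthOSS's actual prompt is defined in its classifier.
SYSTEM_PROMPT = (
    "You label GitHub issue comments with the Contributor Covenant-derived "
    "ethical flags F1-F9, or F11 if none apply. Respond with JSON: "
    '{"flags": ["..."], "reason": "..."}'
)

def classify_comment(issue_title: str, comment_body: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Issue: {issue_title}\nComment: {comment_body}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```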
Configuration:
- Files: `config/classifier.yaml`
- Dependencies: Python libraries including `openai`, `pandas`, `yaml`, `json`, `logging`, and `os`.
The classifier's raw output is then post-processed for analysis:
Main Operations:
- JSON Parsing and Validation: Extracts ethical flags and corresponding reasons from the JSON outputs generated during classification.
- Data Enhancement: Integrates parsed flags and reasons into the original dataset, creating structured columns for easy analysis and reporting.
- Final CSV Generation: Outputs an enhanced and easily analyzable CSV file (`data/comments/processed_classified_comments.csv`).
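A sketch of this parsing step; the `classification` column name is a hypothetical placeholder for wherever the raw model output is stored:

```python
import json
import re
import pandas as pd

# Pull flags and reasons out of the raw JSON strings produced by the
# classifier and attach them as structured columns.
def process_classifications(in_csv: str, out_csv: str) -> None:
    df = pd.read_csv(in_csv)

    def parse(raw: str) -> pd.Series:
        # Tolerate extra text around the JSON object (e.g., code fences).
        match = re.search(r"\{.*\}", str(raw), re.DOTALL)
        try:
            obj = json.loads(match.group(0)) if match else {}
        except json.JSONDecodeError:
            obj = {}
        return pd.Series({"flags": ",".join(obj.get("flags", [])),
                          "reason": obj.get("reason", "")})

    df[["flags", "reason"]] = df["classification"].apply(parse)
    df.to_csv(out_csv, index=False)
```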
Configuration:
- Files: `config/classifier.yaml`
- Dependencies: Python libraries including `pandas`, `yaml`, `json`, `re`, and `logging`.
Example `config/classifier.yaml`:

```yaml
files:
  raw_comments_file: "data/comments/cleaned_raw_comments.csv"
  classifier_output_csv: "data/comments/classified_comments.csv"
  processed_classification_output: "data/comments/processed_classified_comments.csv"
openai:
  api_key_env: "OPENAI_API_KEY"
```
This configuration ensures streamlined integration and flexible execution of the classification process.