SOM-Research/EthOSS

EthOSS: Analyze and report ethical behaviors in Open-Source communities using GitHub data and NLP classification.
EthOSS Analysis Tool

πŸ“Œ Overview

The EthOSS Analysis Tool is an end-to-end framework designed to assess and report on ethical behavior within open-source software (OSS) communities, leveraging data extracted from GitHub repositories. By analyzing interactions in GitHub issues and comments, EthOSS classifies behaviors according to ethical standards defined by the Contributor Covenant. The resulting insights help maintain healthy, inclusive, and productive open-source environments.

EthOSS includes the following main stages:

  • Data Extraction: Collects and cleans issue and comment data from GitHub repositories.
  • Metadata Extraction: Retrieves detailed repository information and contributor statistics.
  • Comments Classification: Uses an LLM (GPT-4 Mini) to categorize comments according to the ethical flags.
  • Report Generation: Creates interactive HTML reports that summarize ethical analysis, repository metadata, and community engagement patterns.

πŸ“‹ Ethical Categories (Flags)

EthOSS uses specific ethical categories, known as "flags," derived from the Contributor Covenant, to evaluate interactions within OSS projects. These flags represent clearly defined positive and negative behaviors, facilitating structured and objective analysis.

| Flag | Description | Type |
|------|-------------|------|
| F1 | Empathy and kindness toward other community members. | Positive |
| F2 | Respect for differing opinions, viewpoints, and experiences. | Positive |
| F3 | Constructive feedback, substantial contributions, and helpful recommendations. | Positive |
| F4 | Accepting responsibility, apologizing, and learning from mistakes. | Positive |
| F5 | Prioritizing actions beneficial for the entire community. | Positive |
| F6 | Sexualized language or unwanted sexual attention. | Negative |
| F7 | Insulting, derogatory, or trolling comments. | Negative |
| F8 | Public harassment, intimidation, or threats. | Negative |
| F9 | Publishing private information without consent. | Negative |
| F11 | Comments that do not exhibit any specific ethical behaviors outlined above. | Neutral |

These categories provide comprehensive visibility into community dynamics and ethical behaviors within GitHub projects, enabling actionable insights to foster healthy and inclusive environments.


πŸ“Š Report Generation (report_generation.py)

This stage compiles all previous analyses into comprehensive, interactive HTML reports, providing a complete ethical and statistical evaluation of repository interactions:

Main Operations:

  • Data Integration: Combines processed comments, issue data, repository metadata, and classification results into structured datasets.
  • Interactive Visualizations: Generates interactive charts and tables using Plotly, illustrating ethical behaviors, comment activity, issue trends, and contributor engagement.
  • Repository Metadata: Includes detailed metadata about the repository, such as contributors, bots, license details, popularity metrics (stars, forks, watchers), and the presence of a code of conduct.
  • HTML Reports: Produces interactive HTML reports with multiple sections:
    • Ethical Analysis: Distribution and frequency of ethical flags over time.
    • Data Analysis: Statistical summaries of issues and comments.
    • Repository Overview: Comprehensive repository metadata and contributor statistics.
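As a sketch of the aggregation feeding those visualizations, the per-flag counts could be computed as below (the column name flag is an assumption about the processed dataset; the counts would then go into plotly.express.bar to produce the interactive chart):

```python
import pandas as pd

def flag_counts(comments: pd.DataFrame) -> pd.DataFrame:
    """Count how often each ethical flag was assigned, sorted by flag label."""
    counts = comments["flag"].value_counts().sort_index().reset_index()
    counts.columns = ["flag", "count"]
    return counts

# plotly.express.bar(flag_counts(df), x="flag", y="count") would render the
# interactive bar chart that gets embedded in the HTML report.
```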

Output:

  • HTML reports are generated and stored in an organized directory structure (data/reports/{language}/{repository}.html).

Configuration:

  • Dependencies: Python libraries including pandas, plotly, json, base64, yaml, logging, and os.

πŸ“‘ HTML Report Sections:

  • Introduction: Overview of dataset activity, summarizing total issues, comments, authors, and engagement trends.
  • Issue Analysis: Visualizations of issue creation patterns, label usage, and contributor activity.
  • Comment Analysis: Monthly distribution of comments and identification of top commenters.
  • Ethical Analysis: Detailed breakdown of ethical flags and their frequency, highlighting community interactions and behavioral trends.

πŸ“ˆ Sample Visualizations:

  • Ethical Analysis Overview (figure)
  • Monthly Flag Distribution (figure)

This structured approach ensures comprehensive visibility into the ethical and collaborative health of open-source repositories.


πŸ”§ Setup

1️⃣ Clone the Repository:

git clone https://github.com/your-username/EthOSS.git
cd EthOSS

2️⃣ Install Requirements:

pip install -r requirements.txt

πŸ”‘ Environment Variables

Set up the required environment variables for GitHub and OpenAI:

Linux / macOS:

export CRAWLER_ETHOSS="your_github_token"
export OPENAI_API_KEY="your_openai_api_key"

Windows:

set CRAWLER_ETHOSS=your_github_token
set OPENAI_API_KEY=your_openai_api_key

πŸ’» Usage

βš™οΈ Configure Repositories:

Edit config/config.json to specify repositories and extraction date ranges:

{
  "repos": [
    {
      "owner": "owner_name",
      "name": "repo_name",
      "language": "language"
    }
  ],
  "start_date": "YYYY-MM-DD",
  "end_date": "YYYY-MM-DD"
}

▢️ Run the Pipeline:

Execute the complete analysis pipeline from the main directory:

python src/main.py

πŸ“ Output Structure

After running the pipeline, you'll find the following files and folders generated under the data directory:

data
β”œβ”€β”€ comments
β”‚   β”œβ”€β”€ raw_comments.csv                  # All extracted comments from processed issues
β”‚   β”œβ”€β”€ cleaned_raw_comments.csv          # Comments cleaned (duplicates and bots removed)
β”‚   β”œβ”€β”€ classified_comments.csv           # Comments classified with ethical flags (raw format)
β”‚   └── processed_classified_comments.csv # Classified comments structured for analysis
β”œβ”€β”€ issues
β”‚   β”œβ”€β”€ extracted_issues
β”‚   β”‚   └── {language}
β”‚   β”‚       └── owner_repo_issues.json    # Raw issue data from GitHub extraction
β”‚   └── processed_issues
β”‚       └── {language}
β”‚           └── owner_repo_issues.json    # Issues with IDs and filtered comments
β”œβ”€β”€ metadata
β”‚   └── {language}
β”‚       └── owner_repo_metadata.json      # Repository metadata from GitHub
β”‚                                         # Log file for extraction and processing
└── reports
    └── {language}
        └── owner_repo.html               # Interactive HTML analysis report

  • CSV Files: Data structured for easy analysis.
  • JSON Files: Raw and processed issue data and repository metadata.
  • HTML Reports: Interactive, visual summaries of repository analysis.

πŸ—‚οΈ Data Extraction

EthOSS performs a structured extraction and processing workflow to prepare GitHub repository data for ethical analysis. This workflow includes clear stages:

πŸ“ Issues and Comments Extraction

This stage involves extracting detailed issue data along with associated comments from specified GitHub repositories, followed by a processing phase to clean and structure the data:

a. Extraction (issue_comments_extractor.py)

Main Operations:

  • API Interaction: Utilizes the GitHub GraphQL API to efficiently fetch issues and comments.
  • Configuration Driven: Reads repository details (owner, repository name, language) and date range from a configuration file (config/config.json).
  • Rate Limit Management: Automatically handles GitHub API rate limits to ensure continuous data collection.
  • Data Filtering: Retrieves issues created within a specified date range.
  • Structured Storage: Saves extracted data into organized JSON files (data/issues/extracted_issues/{language}/{owner}_{repo}_issues.json).
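A condensed sketch of the paging loop against the GitHub GraphQL API (the query fields are abbreviated, and this sketch omits the extractor's date-range filtering and rate-limit backoff):

```python
import requests

GITHUB_GRAPHQL = "https://api.github.com/graphql"

ISSUES_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor,
           orderBy: {field: CREATED_AT, direction: ASC}) {
      pageInfo { hasNextPage endCursor }
      nodes {
        title
        createdAt
        comments(first: 100) {
          nodes { author { login } body createdAt }
        }
      }
    }
  }
}
"""

def fetch_issues(owner: str, name: str, token: str) -> list:
    """Page through a repository's issues, following endCursor until exhausted."""
    issues, cursor = [], None
    while True:
        resp = requests.post(
            GITHUB_GRAPHQL,
            json={"query": ISSUES_QUERY,
                  "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()["data"]["repository"]["issues"]
        issues.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return issues
        cursor = page["pageInfo"]["endCursor"]
```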

b. JSON Processing and Cleaning (process_raw_jsons.py)

Main Operations:

  • Data Cleaning: Filters comments according to a defined cutoff date (specified in config/extraction.yaml).
  • Unique ID Generation: Adds universally unique identifiers (UUIDs) to each issue and comment to facilitate traceability and further analysis.
  • Output Storage: Saves cleaned and structured data into a dedicated directory (data/issues/processed_issues/{language}/{owner}_{repo}_issues.json).
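The cutoff filtering and UUID assignment can be sketched as follows, assuming a flattened issue shape with a comments list and ISO-8601 createdAt timestamps (the field names are assumptions about the intermediate JSON):

```python
import uuid
from datetime import datetime

def _parse(ts: str) -> datetime:
    """Parse GitHub's ISO-8601 timestamps (trailing 'Z' means UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def process_issues(issues: list, cutoff_iso: str) -> list:
    """Drop comments created after the cutoff and give every item a UUID."""
    cutoff = _parse(cutoff_iso)
    processed = []
    for issue in issues:
        kept = [dict(c, uuid=str(uuid.uuid4()))
                for c in issue.get("comments", [])
                if _parse(c["createdAt"]) <= cutoff]
        processed.append(dict(issue, uuid=str(uuid.uuid4()), comments=kept))
    return processed
```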

Configuration:

  • Files: config/extraction.yaml, config/config.json
  • Dependencies: Python libraries including requests, yaml, json, uuid, datetime, and logging.

πŸ”– Repository Metadata Extraction (repositories_metadata.py)

This complementary process enhances the extracted data by providing comprehensive repository-level metadata:

Main Operations:

  • API Interaction: Leverages the GitHub REST API to obtain detailed metadata.
  • Metadata Collection: Retrieves repository information (stars, forks, license, description, topics, and code-of-conduct presence).
  • Contributor Analysis: Determines total contributors and contribution diversity, identifying the top five contributors and their percentage of contributions.
  • Bot Detection: Identifies commonly used moderation bots (e.g., dependabot[bot]).
  • Output Storage: Stores enriched metadata into structured JSON files (data/metadata/{language}/{owner}_{repo_name}_metadata.json).
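A sketch of the metadata collection against the GitHub REST API; the contributor_shares helper and the exact output keys are illustrative, not the script's actual structure:

```python
import requests

API = "https://api.github.com"

def contributor_shares(contributors: list, top_n: int = 5) -> list:
    """Percentage of total contributions held by each of the top contributors."""
    total = sum(u["contributions"] for u in contributors) or 1
    return [{"login": u["login"],
             "share": round(100 * u["contributions"] / total, 1)}
            for u in contributors[:top_n]]

def fetch_metadata(owner: str, repo: str, token: str) -> dict:
    """Collect repository info and contributor statistics via the REST API."""
    headers = {"Authorization": f"Bearer {token}"}
    info = requests.get(f"{API}/repos/{owner}/{repo}",
                        headers=headers, timeout=30).json()
    contributors = requests.get(f"{API}/repos/{owner}/{repo}/contributors",
                                params={"per_page": 100},
                                headers=headers, timeout=30).json()
    return {
        "stars": info["stargazers_count"],
        "forks": info["forks_count"],
        "watchers": info["subscribers_count"],
        "license": (info.get("license") or {}).get("spdx_id"),
        "topics": info.get("topics", []),
        "total_contributors": len(contributors),
        "top_contributors": contributor_shares(contributors),
        "bots": [u["login"] for u in contributors if u.get("type") == "Bot"],
    }
```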

Configuration:

  • Files: config/extraction.yaml, config/config.json
  • Dependencies: Python libraries including requests, yaml, json, and logging.

This structure supports flexible addition of multiple repositories for broader analyses.


πŸ› οΈ Comments Data Generation

This stage processes extracted issues and comments, transforming them into a structured and cleaned dataset suitable for ethical analysis:

πŸ“‘ Raw Comments Generation (generate_raw_comments.py)

Main Operations:

  • JSON Data Extraction: Extracts detailed comments from previously processed JSON files.
  • Metadata Enrichment: Associates each comment with relevant issue and repository metadata (e.g., repository name, language, issue details, author).
  • CSV Export: Compiles the enriched comments into a unified CSV file for further analysis (data/comments/raw_comments.csv).

Configuration:

  • Files: config/data_pipeline.yaml
  • Dependencies: Python libraries including json, yaml, csv, os, logging, and pandas.

🧹 Comments Cleaning (process_raw_comments.py)

Main Operations:

  • Removal of Incomplete Entries: Eliminates rows with missing or empty comment_author or comment_body fields.
  • Bot Comment Filtering: Removes comments authored by bots identified through standard naming conventions (e.g., names containing '-bot') and a predefined list of known bots.
  • Duplicate Removal: Eliminates duplicated comments based on identical comment content (comment_body).
  • Cleaned CSV Output: Saves the refined data into a cleaned CSV file (data/comments/cleaned_raw_comments.csv).
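The cleaning steps above can be sketched in pandas (the bot-name pattern is an assumption based on the '-bot' convention described here):

```python
import pandas as pd

# Mirrors the classic_bots list in config/data_pipeline.yaml.
KNOWN_BOTS = {"probot", "stale", "dependabot", "github-actions",
              "mergify", "cla-assistant", "danger"}

def clean_comments(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, bot-authored comments, and duplicate bodies."""
    df = df.dropna(subset=["comment_author", "comment_body"])
    df = df[df["comment_body"].str.strip() != ""]
    author = df["comment_author"].str.lower()
    is_bot = author.str.contains(r"-bot|\[bot\]") | author.isin(KNOWN_BOTS)
    df = df[~is_bot]
    return df.drop_duplicates(subset=["comment_body"])
```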

Configuration:

  • Files: config/data_pipeline.yaml
  • Dependencies: Python libraries including pandas, yaml, os, and logging.

Configuration File Example (data_pipeline.yaml):

directories:
  issues_cleaned_dir: "data/issues/processed_issues"

files:
  raw_issues_file: "data/comments/raw_comments.csv"
  cleaned_raw_issues_file: "data/comments/cleaned_raw_comments.csv"

bots:
  classic_bots:
    - "probot"
    - "stale"
    - "dependabot"
    - "github-actions"
    - "mergify"
    - "cla-assistant"
    - "danger"

This setup enables flexible and efficient data processing for comprehensive ethical analyses.


πŸ€– Comments Classification

This stage involves classifying comments from GitHub issues to detect ethical behaviors, utilizing prompt engineering with OpenAI's GPT-4 Mini:

Main Operations:

  • Prompt Engineering: Uses GPT-4 Mini to analyze each comment and classify it according to predefined ethical flags, considering context from the issue title.
  • Classification Categories: Identifies behaviors such as empathy, constructive feedback, respect for differing opinions, harassment, insults, and other defined positive and negative behaviors.
  • JSON Output: Each comment receives a structured JSON output specifying the identified flags and reasons for classification.
  • CSV Output: Saves the classification results to a CSV file (data/comments/classified_comments.csv).
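A sketch of a single classification call, assuming the openai Python client (≥1.0); the model identifier and prompt wording here are illustrative, not the project's exact prompt:

```python
import json
import os

PROMPT = """Classify this GitHub comment against the ethical flags F1-F9 and F11.
Issue title: {title}
Comment: {comment}
Answer with JSON only: {{"flags": ["F..."], "reasons": ["..."]}}"""

def classify_comment(title: str, comment: str, model: str = "gpt-4o-mini") -> dict:
    """Return the model's flag assignment for a single comment."""
    from openai import OpenAI  # imported lazily so the module loads without it
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(title=title, comment=comment)}],
        response_format={"type": "json_object"},  # force a parseable JSON reply
    )
    return json.loads(resp.choices[0].message.content)
```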

Configuration:

  • Files: config/classifier.yaml
  • Dependencies: Python libraries including openai, pandas, yaml, json, logging, and os.

βš™οΈ Process Classified Comments (process_classified_comments.py)

Main Operations:

  • JSON Parsing and Validation: Extracts ethical flags and corresponding reasons from the JSON outputs generated during classification.
  • Data Enhancement: Integrates parsed flags and reasons into the original dataset, creating structured columns for easy analysis and reporting.
  • Final CSV Generation: Outputs an enhanced and easily analyzable CSV file (data/comments/processed_classified_comments.csv).
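The parsing and validation step can be sketched like this (the column name classification is an assumption about the intermediate CSV):

```python
import json
import re
import pandas as pd

def parse_classification(raw: str) -> tuple:
    """Pull flags and reasons out of a raw model reply, tolerating extra text."""
    match = re.search(r"\{.*\}", str(raw), re.DOTALL)
    if not match:
        return [], []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], []
    return data.get("flags", []), data.get("reasons", [])

def expand_classification(df: pd.DataFrame) -> pd.DataFrame:
    """Split raw JSON replies into dedicated 'flags' and 'reasons' columns."""
    parsed = df["classification"].map(parse_classification)
    out = df.copy()
    out["flags"] = parsed.map(lambda t: t[0])
    out["reasons"] = parsed.map(lambda t: t[1])
    return out
```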

Configuration:

  • Files: config/classifier.yaml
  • Dependencies: Python libraries including pandas, yaml, json, re, and logging.

Configuration Example (classifier.yaml):

files:
  raw_comments_file: "data/comments/cleaned_raw_comments.csv"
  classifier_output_csv: "data/comments/classified_comments.csv"
  processed_classification_output: "data/comments/processed_classified_comments.csv"

openai:
  api_key_env: "OPENAI_API_KEY"

This configuration ensures streamlined integration and flexible execution of the classification process.
