Extracting the organisation information from READMEs

This pipeline takes the content of software READMEs as input, runs Named Entity Recognition (NER) on them to extract organisation names, and then uses ROR's affiliation matching to map those names to ROR IDs.

Extracting READMEs

As input, we used READMEs extracted from The Stack:

Ingest The Stack into the BigQuery table thestack.files (already done at CSET).
Filter to READMEs:

SELECT
  DISTINCT hexsha,
  size,
  ext,
  lang,
  max_stars_repo_path AS file_path,
  max_stars_repo_name AS repo_name,
  max_stars_count AS star_count,
  content,
  max_line_length,
  alphanum_fraction
FROM
  thestack.files
WHERE
  REGEXP_CONTAINS(max_stars_repo_path, r"(?i)^read\.?me\b")
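
To see which paths the pattern above accepts, the same regular expression can be tried locally. This is a minimal sketch, assuming Python's `re` module treats the pattern the same way BigQuery does for ASCII paths; the helper name `is_readme` and the sample paths are hypothetical and not part of the pipeline.

```python
import re

# Same pattern as in the BigQuery filter above.
README_PATTERN = re.compile(r"(?i)^read\.?me\b")


def is_readme(path: str) -> bool:
    """Return True if the path would pass the README filter (hypothetical helper)."""
    return README_PATTERN.search(path) is not None


# Hypothetical sample paths, chosen to show the edge cases.
for path in ["README.md", "readme", "Read.me", "docs/README.md",
             "README_zh.md", "readmefirst.txt"]:
    print(f"{path!r}: {is_readme(path)}")
```

Because the pattern is anchored with `^`, only paths that begin with the README name match, so a README in a subdirectory such as `docs/README.md` is excluded. The `\b` also rejects names where a word character follows directly, such as `README_zh.md`.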

Filter to READMEs with relevant keywords (222,408 rows):

create or replace table czi_hackathon.contains_institution as
select
  repo_name,
  file_path,
  content
from thestack.readmes
where lower(content) like "%institute%"
  or lower(content) like "%university%"
  or lower(content) like "%school%"
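
The keyword condition in this query is a case-insensitive substring test. A rough Python equivalent is sketched below for checking individual files locally; the names `KEYWORDS` and `mentions_institution` are hypothetical and do not come from the pipeline.

```python
# Keywords taken from the query above.
KEYWORDS = ("institute", "university", "school")


def mentions_institution(content: str) -> bool:
    """Case-insensitive substring test, mirroring lower(content) like "%keyword%"."""
    lowered = content.lower()
    return any(keyword in lowered for keyword in KEYWORDS)
```

As with `like "%...%"`, this matches substrings rather than whole words, so text containing "Institutes" or "preschool" also passes the filter.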

A sample of 5K READMEs is available here.

Extracting and mapping organisation names

Download the NER model from Hugging Face.
Run the pipeline.py script to extract organisation names and map them to ROR IDs:

python pipeline.py --input <stack READMEs dir> --model <NER model dir> --output <output file> [--threads <number of threads>] [--chunk <size of the imap chunk>]
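
The `--threads` and `--chunk` options suggest that READMEs are processed in parallel with an `imap` call. The sketch below shows how such options could map onto `multiprocessing.pool.ThreadPool`. It is an assumption about the general technique, not the actual code of pipeline.py, and `process_readme` and `run` are hypothetical placeholders.

```python
from multiprocessing.pool import ThreadPool


def process_readme(text: str) -> int:
    # Hypothetical stand-in for the per-README work (NER plus ROR matching).
    return len(text.split())


def run(readmes, threads=4, chunk=2):
    """Process READMEs in parallel, returning results in input order."""
    with ThreadPool(threads) as pool:
        # imap yields results lazily and in order; chunksize sets how many
        # items are handed to a worker at a time.
        return list(pool.imap(process_readme, readmes, chunksize=chunk))


results = run(["MIT License", "Developed at Example University", ""],
              threads=2, chunk=2)
print(results)  # [2, 4, 0]
```

A larger chunk size reduces scheduling overhead when there are many small READMEs, while a smaller one balances the load better when processing times vary widely.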

The NER model is responsible for extracting organisation names from text. It was adapted from this tool.
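
As an illustration of this step, the sketch below collects organisation names from NER output. It assumes the model is a Hugging Face token-classification model that labels organisations as `ORG`, loaded through the `transformers` pipeline with `aggregation_strategy="simple"`. The label name, the `min_score` threshold, and both function names are assumptions, not details taken from pipeline.py.

```python
def load_ner(model_dir: str):
    """Load a token-classification pipeline from a local model directory."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline
    return pipeline("token-classification", model=model_dir,
                    aggregation_strategy="simple")


def extract_org_names(entities, min_score=0.5):
    """Collect unique organisation names from aggregated NER output.

    `entities` is a list of dicts with "entity_group", "score" and "word" keys,
    which is the shape returned when aggregation_strategy is set.
    """
    names = []
    for entity in entities:
        # "ORG" and min_score are assumptions; adjust them to the model's labels.
        if entity["entity_group"] == "ORG" and entity["score"] >= min_score:
            name = entity["word"].strip()
            if name and name not in names:
                names.append(name)
    return names
```

With a model downloaded to a local directory, usage would look like `extract_org_names(load_ner(model_dir)(readme_text))`.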

The script uses ROR's affiliation matching service to map extracted organisation names to ROR IDs.
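
A minimal sketch of such a lookup is shown below. It assumes ROR's public REST API, where an `affiliation` query parameter returns candidate matches in an `items` list and the confident match is flagged with `chosen`. The base URL and API version are assumptions to check against the current ROR documentation, and the function names are hypothetical rather than taken from pipeline.py.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify the current API version in the ROR documentation.
ROR_API = "https://api.ror.org/v2/organizations"


def build_affiliation_url(name: str) -> str:
    """Build the affiliation-matching URL for one organisation name."""
    return f"{ROR_API}?{urlencode({'affiliation': name})}"


def chosen_ror_id(response: dict):
    """Return the ROR ID of the candidate marked as chosen, or None."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None


def match_organisation(name: str, timeout: float = 10.0):
    """Query the service for one name (requires network access)."""
    with urlopen(build_affiliation_url(name), timeout=timeout) as resp:
        return chosen_ror_id(json.load(resp))
```

Returning `None` when no candidate is marked as chosen keeps low-confidence matches out of the output instead of guessing among them.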
