This pipeline takes the content of software READMEs as input, runs Named Entity Recognition (NER) on them to extract organisation names, and then uses ROR's affiliation matching to map those names to ROR IDs.
As input, we used READMEs extracted from The Stack:
- Ingest The Stack into BigQuery as thestack.files (already done at CSET).
- Filter to READMEs (the next step reads the result from thestack.readmes):
SELECT
DISTINCT hexsha,
size,
ext,
lang,
max_stars_repo_path AS file_path,
max_stars_repo_name AS repo_name,
max_stars_count AS star_count,
content,
max_line_length,
alphanum_fraction
FROM
thestack.files
WHERE
REGEXP_CONTAINS(max_stars_repo_path, r"(?i)^read\.?me\b")
- Filter to READMEs with relevant keywords (222,408 rows):
create or replace table czi_hackathon.contains_institution as
select
repo_name,
file_path,
content
from thestack.readmes
where lower(content) like "%institute%"
  or lower(content) like "%university%"
  or lower(content) like "%school%"
A sample of 5K READMEs is available here.
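For readers who want to reproduce the two filters outside BigQuery, the following is a minimal Python sketch of the same logic. The function names `is_readme` and `mentions_institution` are hypothetical and not part of this repository. It assumes file paths are relative to the repository root, in which case the `^` anchor means only top-level READMEs match.

```python
import re

# Python counterpart of the BigQuery pattern r"(?i)^read\.?me\b".
# re.ASCII keeps \b close to RE2's ASCII word-boundary behaviour.
README_PATH = re.compile(r"^read\.?me\b", re.IGNORECASE | re.ASCII)

# Same keywords as the LIKE clauses in the second query.
KEYWORDS = ("institute", "university", "school")


def is_readme(path: str) -> bool:
    """True if the path starts with 'readme' or 'read.me', ignoring case."""
    return README_PATH.search(path) is not None


def mentions_institution(content: str) -> bool:
    """True if the lowercased content contains any keyword as a substring."""
    lowered = content.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Hypothetical sample rows: (file_path, content).
rows = [
    ("README.md", "Developed at Example University."),
    ("docs/README.md", "Maintained by the Example Institute."),
    ("readme_old.md", "Written at a summer school."),
    ("README", "A small command-line tool."),
]
kept = [path for path, content in rows
        if is_readme(path) and mentions_institution(content)]
print(kept)
```

Note that `docs/README.md` is rejected because the pattern is anchored at the start of the path, and `readme_old.md` is rejected because `_` is a word character, so `\b` does not match after `readme`.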
- Download the NER model from Hugging Face.
- Run the pipeline.py script to extract organisation names and map them to ROR IDs:
python pipeline.py --input <stack READMEs dir> --model <NER model dir> --output <output file> [--threads <number of threads>] [--chunk <size of the imap chunk>]
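The `--threads` and `--chunk` options suggest that the READMEs are distributed over a worker pool with `imap`. The sketch below illustrates how such options typically map onto Python's `multiprocessing.pool.ThreadPool`; it is an assumption about the design, not a copy of pipeline.py, and `process_readme` and `run` are hypothetical names. Whether the script uses threads or processes is not stated here.

```python
from multiprocessing.pool import ThreadPool


def process_readme(text: str) -> int:
    # Hypothetical stand-in for the per-README work (NER plus ROR matching).
    return len(text.split())


def run(texts, threads=4, chunk=2):
    """Process texts with a pool of workers, handing them out in chunks.

    imap yields results in input order; chunksize controls how many
    items each worker takes from the queue at a time.
    """
    with ThreadPool(threads) as pool:
        return list(pool.imap(process_readme, texts, chunksize=chunk))


results = run(["one two", "three", "four five six"], threads=2, chunk=2)
print(results)
```

A larger chunk size reduces scheduling overhead when individual READMEs are cheap to process; a smaller one balances load better when processing times vary widely.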
The NER model is responsible for extracting organisation names from text. It was adapted from this tool.
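As an illustration of this step, the sketch below loads a token-classification model with the Hugging Face `transformers` library and keeps only organisation entities. It assumes the downloaded model works with the standard `pipeline` API and labels organisations as `ORG`; both are assumptions, and `load_ner` and `organisation_names` are hypothetical helpers rather than functions from pipeline.py.

```python
def load_ner(model_dir: str):
    """Load a token-classification pipeline from a local model directory."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities,
    # each reported with "entity_group", "score", "word", "start", "end".
    return pipeline("token-classification", model=model_dir,
                    aggregation_strategy="simple")


def organisation_names(entities, label="ORG", min_score=0.5):
    """Return unique organisation names, in order of first appearance.

    The "ORG" label and the 0.5 threshold are assumptions; adjust them
    to whatever the model actually emits.
    """
    names = []
    for entity in entities:
        if entity["entity_group"] == label and entity["score"] >= min_score:
            name = entity["word"].strip()
            if name and name not in names:
                names.append(name)
    return names


# Hand-written entities in the aggregated output format, for illustration.
sample = [
    {"entity_group": "ORG", "score": 0.98, "word": "Example University"},
    {"entity_group": "PER", "score": 0.99, "word": "Jane Doe"},
    {"entity_group": "ORG", "score": 0.30, "word": "Lab"},
    {"entity_group": "ORG", "score": 0.95, "word": "Example University"},
]
print(organisation_names(sample))
```

With a real model, the call would look like `organisation_names(load_ner("<NER model dir>")(readme_text))`.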
The script uses ROR's affiliation matching service to map extracted organisation names to ROR IDs.
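A minimal sketch of such a lookup against the public ROR API follows. It assumes the v2 endpoint and its `affiliation` query parameter, whose response lists candidate matches under `items`, each with a `chosen` flag and an `organization` record. Whether pipeline.py calls the public API or another deployment is not stated here, and `query_ror` and `chosen_ror_id` are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public ROR API, version 2.
ROR_API = "https://api.ror.org/v2/organizations"


def query_ror(affiliation: str) -> dict:
    """Send an affiliation string to ROR and return the parsed JSON response."""
    url = ROR_API + "?" + urllib.parse.urlencode({"affiliation": affiliation})
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def chosen_ror_id(response: dict):
    """Return the ROR ID of the candidate marked as chosen, or None."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None


# Hand-written response in the expected shape; the ID is a placeholder.
sample_response = {
    "items": [
        {"chosen": False, "score": 0.6,
         "organization": {"id": "https://ror.org/placeholder-a"}},
        {"chosen": True, "score": 1.0,
         "organization": {"id": "https://ror.org/placeholder-b"}},
    ]
}
print(chosen_ror_id(sample_response))
```

Returning only the candidate flagged as `chosen` keeps the mapping conservative: names for which ROR is not confident enough to pick a match yield `None` instead of a low-scoring guess.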