PUGG: KBQA, MRC, IR dataset for Polish

This repository contains the code used in the research paper "Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction" by Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, and Tomasz Kajdanowicz. The paper was accepted to the Findings of ACL 2024.

Paper

Citation

@misc{sawczyn2024developingpuggpolishmodern,
      title={Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction}, 
      author={Albert Sawczyn and Katsiaryna Viarenich and Konrad Wojtasik and Aleksandra Domogała and Marcin Oleksy and Maciej Piasecki and Tomasz Kajdanowicz},
      year={2024},
      eprint={2408.02337},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2408.02337}, 
}

PUGG

The PUGG dataset is available in the following repositories:

  • General - contains all tasks (KBQA, MRC, IR*)

For more straightforward usage, the tasks are also available in separate repositories:

The knowledge graph for the KBQA task is available in the following repository:

Note: If you want to utilize the IR task in the BEIR format (qrels in .tsv format), please download the IR repository.

Getting Started

Prerequisites

  • Configured Python 3.10 environment.
  • Installed Poetry.
  • Installed Docker (required for the search_results_acquisition stage and the rerank stage).

Installing

To install all dependencies, run:

poetry install 

Downloading data

The repository uses DVC to manage the data. To download the data, run:

dvc pull

Reproducing

DVC

The repository uses DVC to manage the dataset construction pipeline.

  • dvc.yaml contains all of the stages (except run_search_results_acquisition.py).
  • Data that is external or generated by external tools (e.g., Inforex or a spreadsheet) is tracked by *.dvc files stored in the data directory.

To reproduce all DVC stages, run:

dvc repro

The search_results_acquisition stage

The run_search_results_acquisition.py script acquires data from the Google Search API. Because it uses a database, it should be run with Docker rather than DVC. In a full reproduction, run it after the acquire_suggestions stage. Pass the credentials through the following environment variables:

CUSTOM_SEARCH_ID="..."
GOOGLE_API_KEYS='["...", "..."]'
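As a rough illustration of the format above, the sketch below reads both variables and decodes GOOGLE_API_KEYS as a JSON list. The function name load_search_credentials is hypothetical, and treating the value as JSON is an assumption based on its shape; the repository's own script may read the variables differently.

```python
import json
import os


def load_search_credentials(env=None):
    """Return (custom_search_id, api_keys) read from environment variables.

    Hypothetical helper: assumes GOOGLE_API_KEYS holds a JSON list of strings.
    """
    env = os.environ if env is None else env
    search_id = env["CUSTOM_SEARCH_ID"]
    raw_keys = env["GOOGLE_API_KEYS"].strip()
    # The value may arrive with its surrounding single quotes intact,
    # so drop one matching pair before decoding.
    if len(raw_keys) >= 2 and raw_keys[0] == raw_keys[-1] == "'":
        raw_keys = raw_keys[1:-1]
    api_keys = json.loads(raw_keys)
    if not isinstance(api_keys, list) or not all(isinstance(k, str) for k in api_keys):
        raise ValueError("GOOGLE_API_KEYS must be a JSON list of strings")
    return search_id, api_keys
```

Calling load_search_credentials() with no argument reads from the current process environment; passing a plain dict makes the parsing easy to check without touching real credentials.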

To run the script:

docker build -f docker/search_results_acquisition_runner/dockerfile -t search_results_acquisition_runner . 
docker run -v "$(pwd)"/data:/google-query-qa-dataset/data --env-file credentials.env search_results_acquisition_runner
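The ordering described in this section (DVC up to acquire_suggestions, then the Docker-based acquisition, then the rest of the pipeline) could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the stage is assumed to be named acquire_suggestions in dvc.yaml, and the final dvc repro is assumed to pick up the remaining stages once the acquisition output is in the data directory.

```python
import os
import subprocess


def full_reproduction_commands(data_dir):
    """List the commands for a full reproduction, in the order described above."""
    image = "search_results_acquisition_runner"
    return [
        # 1. Run the DVC pipeline up to and including acquire_suggestions.
        ["dvc", "repro", "acquire_suggestions"],
        # 2. Build the runner image and acquire search results outside DVC.
        ["docker", "build", "-f",
         "docker/search_results_acquisition_runner/dockerfile", "-t", image, "."],
        ["docker", "run", "-v", f"{data_dir}:/google-query-qa-dataset/data",
         "--env-file", "credentials.env", image],
        # 3. Run the remaining DVC stages (assumed to skip up-to-date ones).
        ["dvc", "repro"],
    ]


def run_full_reproduction():
    """Execute the commands from the repository root, stopping on the first failure."""
    data_dir = os.path.join(os.getcwd(), "data")
    for command in full_reproduction_commands(data_dir):
        subprocess.run(command, check=True)
```

Keeping the command list separate from its execution makes it possible to inspect or print the plan before anything is run.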

Project structure

Below is an overview of the project structure along with descriptions of the most important modules, directories and files.

  • gqqd/ - a Python module that contains the code for creating the KBQA (natural), MRC and IR datasets.

  • sqqd/ - a Python module that contains the code for creating the KBQA (template-based) dataset.

  • tools/ - contains some tools that were used in the project but not integrated directly into the main codebase.

  • baselines/ - contains implementations of baseline models that are used for evaluation on the constructed datasets.

  • data/ - contains the data used in the project. It includes input data, intermediate data, and the final datasets.

  • tests/ - contains unit tests for the codebase.

  • .gitignore, .dockerignore, .dvcignore - specify patterns for files or directories that should be ignored by Git, Docker, or DVC, respectively.

  • .env, credentials.env - contain the environment variables and credentials required by the project. They are not tracked by Git because they contain sensitive information.

    If you want to reproduce the whole pipeline, you need to create the files with the following content:

    • .env
    SPARQL_USER_AGENT= # user agent for SPARQL queries
    • credentials.env
    CUSTOM_SEARCH_ID= # Google custom search ID
    GOOGLE_API_KEYS='["key1", "key2"]' # list of Google API keys to use custom search 
    OPENAI_API_KEY= # OpenAI API key
  • dvc.yaml, dvc.lock - related to DVC. The dvc.yaml file contains all the stages, specifying the data dependencies, while dvc.lock locks the exact versions of the data files.

  • pyproject.toml, poetry.lock - related to Poetry.
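Before starting a full reproduction, it may help to confirm that .env and credentials.env define the variables listed above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the parser only handles simple KEY=VALUE lines, treating a value that is just a trailing comment as empty.

```python
from pathlib import Path

# Variables each file is expected to define, taken from the listing above.
REQUIRED_VARIABLES = {
    ".env": ["SPARQL_USER_AGENT"],
    "credentials.env": ["CUSTOM_SEARCH_ID", "GOOGLE_API_KEYS", "OPENAI_API_KEY"],
}


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and full-line comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # "KEY= # comment" is an unfilled template line, not a real value.
        if value.startswith("#"):
            value = ""
        values[key.strip()] = value
    return values


def missing_variables(root="."):
    """Return {file name: [variables that are absent or empty]}."""
    problems = {}
    for name, required in REQUIRED_VARIABLES.items():
        path = Path(root) / name
        defined = read_env_file(path) if path.is_file() else {}
        missing = [key for key in required if not defined.get(key)]
        if missing:
            problems[name] = missing
    return problems
```

An empty dictionary from missing_variables() means both files exist and every listed variable has a non-empty value; it does not verify that the credentials themselves are valid.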

Other readme documents in the repository