arekit-ss [AREkit double "s"] -- an object-pair context sampler for datasources, powered by AREkit.
NOTE: For custom text sampling, please follow the ARElight project.
Install dependencies:
pip install git+https://github.com/nicolay-r/arekit-ss.git@0.25.0
Download resources:
python -m arekit_ss.download_data
Example of composing prompts:
python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
--prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
--dest_lang en --docs_limit 1
Mind the case (issue #18): switching to another language may affect the amount of extracted data because of the
terms_per_context parameter, which crops the context to a fixed, predefined number of words.
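The effect described in the note can be illustrated with a minimal Python sketch. It is not the arekit-ss implementation: the helper names (terms_between, fits_context), the whitespace tokenization, and the rule "keep the pair only when the number of terms between SOURCE and TARGET fits the limit" are assumptions made for illustration, as are the two example sentences.

```python
def terms_between(terms, source, target):
    """Count the terms strictly between the first occurrences of two objects."""
    s_ind, t_ind = terms.index(source), terms.index(target)
    return abs(s_ind - t_ind) - 1


def fits_context(text, source, target, terms_per_context):
    # Hypothetical rule (an assumption, not the library's code):
    # the pair is kept only if the gap between the objects fits the limit.
    terms = text.split()
    return terms_between(terms, source, target) <= terms_per_context


# The same statement in two languages; the English version uses more words.
ru = "Россия резко раскритиковала США"
en = "Russia has sharply criticized the USA"

print(fits_context(ru, "Россия", "США", terms_per_context=2))  # True
print(fits_context(en, "Russia", "USA", terms_per_context=2))  # False
```

Under these assumptions, the same pair fits a limit of 2 terms in the original text but not in its translation, which is one way the number of extracted samples may change with the language.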
- source -- source name from the list of the supported sources.
    - terms_per_context -- amount of words (terms) in between SOURCE and TARGET objects.
    - object-source-types -- filter specific source object types
    - object-target-types -- filter specific target object types
    - relation_types -- list of types, in which items are separated with the | char; all by default
    - splits -- manual selection of the data-types related splits that should be chosen for the sampling process; types should be separated by the ':' sign; for example: 'train:test'
- sampler -- list of the supported samplers:
    - nn -- CNN/LSTM architecture related, including frames annotation from RuSentiFrames.
        - no-vectorize -- flag is applicable only for nn, and denotes no need to generate embeddings for features
    - bert -- BERT-based, single-input sequence.
    - prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
        - prompt -- text of the prompt which includes the following parameters:
            - {text} is an original text of the sample
            - {s_val} and {t_val} values of the source and target of the pairs respectively
            - {label_val} value of the label
- writer -- the output format of samples.
- mask_entities -- mask entity mode.
- Text translation parameters:
    - src_lang -- original language of the text.
    - dest_lang -- target language of the text.
- output_dir -- target directory for samples storing
- Limiting the amount of documents from source:
    - docs_limit -- amount of documents to be considered for sampling from the whole source.
    - doc_ids -- list of the document IDs.
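To show how the prompt placeholders combine into a single sample, here is a minimal Python sketch. It only mimics the substitution with the standard str.format method; the fill_prompt helper, the sample text, and the label value "negative" are hypothetical and not taken from arekit-ss, whose prompt sampler may format samples differently.

```python
# Template copied from the command-line example in this README.
PROMPT = ("For text: '{text}', the attitude between "
          "'{s_val}' and '{t_val}' is: '{label_val}'")


def fill_prompt(template, text, s_val, t_val, label_val):
    """Substitute the four placeholders described in the parameter list."""
    return template.format(text=text, s_val=s_val, t_val=t_val,
                           label_val=label_val)


# Illustrative values only (assumptions, not real dataset content).
sample = fill_prompt(
    PROMPT,
    text="Russia has sharply criticized the USA",
    s_val="Russia",
    t_val="USA",
    label_val="negative",
)
print(sample)
```

Keeping the template as a plain string with named placeholders makes it possible to check a prompt locally before passing it to the sampler.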

