hplt-project/release3_inspection


release3_inspection

Purpose of inspection

We want to get a rough idea of the actual content of the cleaned version of the 3rd data release. More specifically, for each language L we want to estimate the proportion of documents that:

  1. are not in the language L,
  2. contain undesirable artifacts,
  3. are fully undesirable because they are mostly unnatural,
  4. are undesirable porn texts.
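
Since these proportions are estimated from a sample, it can help to see how rough the estimate is. The sketch below computes a Wilson score interval for a proportion; it is only an illustration and not part of the inspection procedure. The function name `wilson_interval` and the example counts are made up.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion k / n.

    k: number of documents with the label of interest (illustrative)
    n: number of labeled documents
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 10 documents flagged out of 200 labeled ones.
low, high = wilson_interval(10, 200)
print(f"estimated proportion 0.05, interval [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the proportion is close to 0, which may well be the case for some of the labels.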

Data for round 1 of HPLT 3.0 (cleaned) inspection:

  • samples stratified by language,
  • 5 batches of random documents per language,
  • 200 documents per batch,
  • full text for texts shorter than 1500 characters, otherwise the first 500 characters, the last 500 characters and 500 characters from the middle of the text.
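
The truncation rule in the last item can be sketched as follows. This is a minimal illustration, not the script that produced the samples: the function name `make_preview`, the exact position of the middle chunk, and the spacing around the skip marker are assumptions.

```python
def make_preview(text: str, chunk: int = 500) -> str:
    """Return the full text if it is shorter than 3 * chunk characters,
    otherwise the first, a middle, and the last `chunk` characters,
    joined by markers stating how many characters were skipped."""
    if len(text) < 3 * chunk:
        return text

    # Assumption: the middle chunk is centered in the text.
    mid_start = (len(text) - chunk) // 2
    head = text[:chunk]
    middle = text[mid_start:mid_start + chunk]
    tail = text[-chunk:]

    skipped_before = mid_start - chunk
    skipped_after = len(text) - chunk - (mid_start + chunk)
    return (
        f"{head} ... ({skipped_before} chars skipped) ... "
        f"{middle} ... ({skipped_after} chars skipped) ... {tail}"
    )


# A 2500-character document: 500 characters are skipped on each side
# of the middle chunk.
doc = "a" * 500 + "b" * 500 + "c" * 500 + "d" * 500 + "e" * 500
preview = make_preview(doc)
print(len(preview))
```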

Inspection

Please select one or more batches for a language you want to inspect. "Reserve" the batch(es) by filling in your name in the spreadsheet. Fill in the labels and push the updated files back to this repository, or create a pull request if you don't have write permission. Finally, please write down your observations here.

We ask you to provide 4 binary labels for each example:

  • porn? empty/1: if the text looks like porn put 1, otherwise leave empty;
  • text artifacts? empty/1: if the text contains some artifacts that should not appear in a running text and should not be generated by LLMs (e.g. traces of menus or markup not removed by text extraction, headers and list items without proper delimitation, truncated text and snippet markers) put 1, otherwise leave empty;
  • unnatural? empty/1: if most of the text looks unnatural (e.g. word lists for SEO, mostly boilerplate, unnatural-looking machine translation, etc.) put 1, otherwise leave empty;
  • lang correct? 0/1: always fill this field (otherwise we will not distinguish labeled and unlabeled examples), put 0 if most of the text is not in the target language, otherwise put 1.
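
As an illustration of how these labels translate into the proportions we are after, here is a small sketch that summarizes a labeled batch. It assumes the batch has been exported to CSV and that the column headers match the label names above (`porn?`, `text artifacts?`, `unnatural?`, `lang correct?`); the actual file format and headers may differ, and the function name `summarize` is made up.

```python
import csv
import io

# Assumed column headers; adjust them to the actual batch files.
FLAG_COLUMNS = ("porn?", "text artifacts?", "unnatural?")
LANG_COLUMN = "lang correct?"


def summarize(rows):
    """Compute label proportions over the labeled rows only.

    A row counts as labeled when its `lang correct?` field is 0 or 1,
    which is why that field must always be filled in.
    """
    labeled = [r for r in rows if (r.get(LANG_COLUMN) or "").strip() in ("0", "1")]
    n = len(labeled)
    summary = {"n_labeled": n}
    if n == 0:
        return summary
    summary["wrong language"] = sum(r[LANG_COLUMN].strip() == "0" for r in labeled) / n
    for col in FLAG_COLUMNS:
        # An empty cell means "no"; only an explicit 1 counts as "yes".
        summary[col] = sum((r.get(col) or "").strip() == "1" for r in labeled) / n
    return summary


# Tiny made-up batch: the last row is not labeled yet and is ignored.
sample = """text,porn?,text artifacts?,unnatural?,lang correct?
doc one,,1,,1
doc two,,,,0
doc three,,,1,1
doc four,,,,
"""
summary = summarize(csv.DictReader(io.StringIO(sample)))
print(summary)
```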

NB: "... (N chars skipped) ..." is shown in place of the skipped text. It is not part of the original text and should not be annotated as a text artifact or as unnaturalness.

Advice on inspection

One way to annotate is using LibreOffice Calc. For convenience:

  • make the text preview area larger;
  • optimize column width (select the first 5 columns, select Format -> Columns -> Optimal Width...);
  • optimize row height (select the whole table with e.g. ctrl+a / cmd+a, select Format -> Rows -> Height..., enter a reasonable value e.g. 0.30);
  • freeze the first 5 columns (select them and click View -> Freeze Rows and Columns) and the header row (View -> Freeze Cells -> Freeze First Row).

You should get an interface similar to this: [image]
