Add guidelines on word boundaries #170

Open
xfq wants to merge 1 commit into gh-pages from xfq/issue-169

xfq (Member) commented Feb 6, 2026

Fix #169.


Preview | Diff

xfq requested review from aphillips, jsahleen and r12a February 6, 2026 04:06
netlify bot commented Feb 6, 2026

Deploy Preview for bp-i18n-specdev ready!

Name Link
🔨 Latest commit 7045ce0
🔍 Latest deploy log https://app.netlify.com/projects/bp-i18n-specdev/deploys/698568c74382fd0008d91abd
😎 Deploy Preview https://deploy-preview-170--bp-i18n-specdev.netlify.app

aphillips (Contributor) left a comment

This is a good start.

I think in general you should be careful to be more consistent about saying 'script' or 'language'. I would try to focus on what spec authors are probably trying to do with "words". If we want a deeper explanation, we should write an article 😄

I think a problem here is that we say what not to do, but we don't provide any positive guidance. That's harder to do.


<p>There are many situations where a software process needs to access a substring or to point within a string and does so by the use of indices, i.e. numeric &quot;positions&quot; within a string. Where such indices are exchanged between components of the Web, there is a need for an agreed-upon definition of string indexing in order to ensure consistent behavior. The two main questions that arise are: &quot;What is the unit of counting?&quot; and &quot;Do we start counting at 0 or 1?&quot;.</p>

<p>The concept of a &quot;word&quot; is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.</p>
Contributor comment:

I think this paragraph is problematic. I would take a different approach:

Suggested change
<p>The concept of a &quot;word&quot; is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.</p>
<p>In addition to user-perceived characters, [=natural language=] text is often divided into larger units: paragraphs, lines, sentences, and words. <dfn>Boundary analysis</dfn> is the general term for the process of computing these units. A common requirement for specifications or implementations is the division of text into words.</p>

Also, you should use markup <q> or <em> instead of &quot; I think?
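
For reference, the idea of boundary analysis producing different units from the same text can be sketched with `Intl.Segmenter` from the ECMAScript Internationalization API. This is only an illustration, not something the pull request proposes: the helper name `countSegments` is hypothetical, and the API covers graphemes, words, and sentences but not lines or paragraphs.

```javascript
// Sample sentence reused from the review comments in this thread.
const sample = "The quick brown fox jumps over the lazy dog";

// Hypothetical helper: count the segments produced at a given granularity.
function countSegments(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(text)).length;
}

const graphemes = countSegments(sample, "grapheme");
const wordSegments = countSegments(sample, "word");
const sentences = countSegments(sample, "sentence");

console.log(graphemes);    // 43 (one per character in this ASCII string)
console.log(wordSegments); // 17 (9 words plus 8 space segments)
console.log(sentences);    // 1
```

The same string yields three different sets of boundaries, which is why a specification needs to say which unit it means.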


<p>The concept of a &quot;word&quot; is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.</p>

<p class="advisement">Specifications SHOULD NOT assume that words are always separated by spaces.</p>
Contributor comment:

SHOULD NOT => MUST NOT

You need <div class="req" id="xxxx"> around every bit of new mustard.


<p class="advisement">Specifications SHOULD NOT assume that words are always separated by spaces.</p>

<p>Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like &quot;and&quot; (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means &quot;universities and colleges&quot;, but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.</p>
Contributor comment:

Mark up the Arabic with spans (translate=no lang=ar)!

Spell out e.g. as "for example"

I think this is a good example, but comes too fast. It is a quirk rather than the rule in Arabic. Start with the big, obvious examples: most scripts use spaces, so a lot of software developers think this works:

const text = "The quick brown fox jumps over the lazy dog";
const words = text.split(" ");

console.log(words);

So something like:

Suggested change
<p>Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like &quot;and&quot; (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means &quot;universities and colleges&quot;, but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.</p>
A lot of languages use scripts that divide words using whitespace, but some scripts do not.

And then give the <ul> list you have just below.
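
To make the failure mode concrete, here is a minimal sketch of what the space-splitting assumption does to text that does not use spaces. The function name `naiveWords` is hypothetical, and the Japanese sentence is the sample used in a suggested change elsewhere in this review.

```javascript
// Naive approach: assume every word boundary is a U+0020 SPACE.
function naiveWords(text) {
  return text.split(" ");
}

const english = "The quick brown fox jumps over the lazy dog";
// Japanese sample sentence taken from a suggested change in this review.
const japanese = "素早い茶色のキツネは怠け者の犬を飛び越える。";

const englishWords = naiveWords(english);
const japaneseWords = naiveWords(japanese);

console.log(englishWords.length);  // 9
console.log(japaneseWords.length); // 1 (the whole sentence comes back as one "word")
```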


<p>Many scripts have different spacing conventions:</p>
<ul>
<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
Contributor comment:

"definition of a word is subjective" is arguable. Grammatical word boundaries are not aligned with writing conventions: a speaker of these languages would easily identify grammatical words. (I'm glossing over some other linguistic quirks. You're not wrong, but I don't think the details are necessary here.) I would lean back to "boundary analysis".

Suggested change
<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
<li>Scripts such as Thai, Balinese, Batak, Tai Lue, and Khmer do not have word separators. Spaces sometimes appear in these scripts, but more commonly as phrase separators. Word boundary analysis in these scripts often requires language-specific dictionaries. Example: <span lang="th" translate="no">สุนัขสีน้ำตาลตัวนี้วิ่งเร็ว</span></li>
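
As a rough illustration of dictionary-based word boundary analysis, the Thai example above can be passed to `Intl.Segmenter` with word granularity. This is a sketch under the assumption that the runtime ships full ICU data (the default in current browsers and Node.js builds). The exact segments depend on that data, so no expected word list is shown.

```javascript
// Thai sample taken from the suggested change above.
const thaiSample = "สุนัขสีน้ำตาลตัวนี้วิ่งเร็ว";

// Word-granularity segmenter for Thai; implementations typically back this
// with a dictionary because the script has no word separators.
const thaiSegmenter = new Intl.Segmenter("th", { granularity: "word" });

// Keep only word-like segments (drops any whitespace or punctuation).
const thaiWords = Array.from(thaiSegmenter.segment(thaiSample))
  .filter((part) => part.isWordLike)
  .map((part) => part.segment);

console.log(thaiSample.split(" ").length); // 1 (no spaces to split on)
console.log(thaiWords.length);             // more than 1 when dictionary data is available
console.log(thaiWords);
```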


<p>Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like &quot;and&quot; (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means &quot;universities and colleges&quot;, but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.</p>

<p>Many scripts have different spacing conventions:</p>
Contributor comment:

spacing conventions => word boundary handling

Put CJK first, as these are the most familiar non-spacing languages and have the largest populations of users.

<p>Many scripts have different spacing conventions:</p>
<ul>
<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
<li>In Vietnamese (written with the Latin alphabet) and Fraser script, spaces are used to separate syllables, not words.</li>
Contributor comment:

alphabet => script
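
The Vietnamese bullet can be illustrated with a small sketch. The sample string is an assumption added for illustration and does not come from the pull request: "sinh viên" ("student") is one word written as two space-separated syllables, so splitting on spaces returns syllables rather than words.

```javascript
// Hypothetical sample (not from the pull request): a single Vietnamese word,
// "sinh viên" ("student"), written as two syllables separated by a space.
const vietnameseWord = "sinh viên";

// Splitting on spaces breaks the one word into its syllables.
const pieces = vietnameseWord.split(" ");

console.log(pieces);        // [ 'sinh', 'viên' ]
console.log(pieces.length); // 2, although the input is one word
```

Here the space-splitting assumption fails in the opposite direction from Chinese, Japanese, or Thai: it produces too many "words" rather than too few.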

<ul>
<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
<li>In Vietnamese (written with the Latin alphabet) and Fraser script, spaces are used to separate syllables, not words.</li>
<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>
Contributor comment:

"have no spaces at all" is overstating and the exception is an unusual case that probably doesn't bear mentioning. All of the bullet items in this section should be focused on word boundaries in any event. I would probably do this more like:

Suggested change
<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>
<li>The scripts used for writing Chinese and Japanese generally do not use spaces, particularly between words. Example: <span lang="ja" translate="no">素早い茶色のキツネは怠け者の犬を飛び越える。</span></li>

<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>
</ul>

<p>When specifications refer to 'words', it is typically in the context of performing some operation on text, such as line-breaking, text spacing, double-clicking, etc. For these operations, the appropriate boundary or location to support the operation must be identified. These boundaries and locations vary from script to script, and may also vary from language to language and depending on the operation being performed. They may or may not be marked by visible separators. For many scripts, these points of interest are indicated by space characters, but for many others they involve more complex heuristics.</p>
Contributor comment:

As noted, I would say this sooner.

Development

Successfully merging this pull request may close these issues.

Choosing a definition of 'word'

2 participants