Add guidelines on word boundaries by xfq · Pull Request #170 · w3c/bp-i18n-specdev

xfq · 2026-02-06T04:06:28Z

netlify · 2026-02-06T04:06:34Z

✅ Deploy Preview for bp-i18n-specdev ready!

Name	Link
🔨 Latest commit	`7045ce0`
🔍 Latest deploy log	https://app.netlify.com/projects/bp-i18n-specdev/deploys/698568c74382fd0008d91abd
😎 Deploy Preview	https://deploy-preview-170--bp-i18n-specdev.netlify.app

aphillips

This is a good start.

I think in general you should be careful to be more consistent about saying 'script' or 'language'. I would try to focus on what spec authors are probably trying to do with "words". If we want a deeper explanation, we should write an article 😄

I think a problem here is that we say what not to do, but we don't provide any positive guidance. That's harder to do.

aphillips · 2026-02-06T16:10:34Z

index.html


 <p>There are many situations where a software process needs to access a substring or to point within a string and does so by the use of indices, i.e. numeric &quot;positions&quot; within a string. Where such indices are exchanged between components of the Web, there is a need for an agreed-upon definition of string indexing in order to ensure consistent behavior. The two main questions that arise are: &quot;What is the unit of counting?&quot; and &quot;Do we start counting at 0 or 1?&quot;.</p>

+<p>The concept of a &quot;word&quot; is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.</p>


I think this paragraph is problematic. I would take a different approach:

Suggested change

The concept of a "word" is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.

In addition to user-perceived characters, [=natural language=] text is often divided in various larger units of text, that is, dividing text into paragraphs, lines, sentences, and words. <dfn>Boundary analysis</dfn> is the general term for the process of computing these different units of text. A common requirement for specifications or implementations is the division of text into words.

Also, you should use markup <q> or  instead of " I think?

aphillips · 2026-02-06T16:11:45Z

index.html


+<p>The concept of a &quot;word&quot; is difficult to define across languages and scripts. While it usually refers to a grammatical unit smaller than a phrase and containing one or more syllables, the boundaries and separators used to delimit words vary significantly across languages and writing systems.</p>
+
+<p class="advisement">Specifications SHOULD NOT assume that words are always separated by spaces.</p>


SHOULD NOT => MUST NOT

You need <div class="req" id="xxxx"> around every bit of new mustard.

aphillips · 2026-02-06T16:21:03Z

index.html

+
+<p class="advisement">Specifications SHOULD NOT assume that words are always separated by spaces.</p>
+
+<p>Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like &quot;and&quot; (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means &quot;universities and colleges&quot;, but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.</p>


Mark up the Arabic with spans (translate=no lang=ar)!

Spell out e.g. as "for example"

I think this is a good example, but comes too fast. It is a quirk rather than the rule in Arabic. Start with the big, obvious examples: most scripts use spaces, so a lot of software developers think this works:

const text = "The quick brown fox jumps over the lazy dog";
const words = text.split(" ");
console.log(words);

So something like:

Suggested change

Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like "and" (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means "universities and colleges", but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.

A lot of languages use scripts that divide words using whitespace, but some scripts do not.

And then give the <ul> list you have just below.
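
To make the failure mode concrete, here is a minimal sketch of the space-splitting approach applied to two inputs. The helper name `splitOnSpaces` is hypothetical, and the Japanese sentence is the sample used later in this review; nothing here is proposed text for the document itself.

```javascript
// Naive word splitting: assumes every word boundary is a U+0020 SPACE.
// `splitOnSpaces` is a hypothetical helper used only for illustration.
function splitOnSpaces(text) {
  return text.split(" ");
}

const english = "The quick brown fox jumps over the lazy dog";
const japanese = "素早い茶色のキツネは怠け者の犬を飛び越える。";

console.log(splitOnSpaces(english).length);  // 9
console.log(splitOnSpaces(japanese).length); // 1 (the whole sentence comes back as one "word")
```

The second result is the point: the code does not fail loudly, it just silently treats an entire sentence as a single word.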

aphillips · 2026-02-06T16:26:05Z

index.html

+
+<p>Many scripts have different spacing conventions:</p>
+<ul>
+<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>


"definition of a word is subjective" is arguable. Grammatical word boundaries are not aligned with writing conventions: a speaker of these languages would easily identify grammatical words. (I'm glossing over some other linguistic quirks--you're not wrong, but I don't think the details are necessary here.) I would lean back to "boundary analysis".

Suggested change

<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>

<li>Scripts such as Thai, Balinese, Batak, Tai Lue, and Khmer do not have word separators. Spaces sometimes appear in these scripts, but more commonly as phrase separators. Word boundary analysis in these scripts often requires language-specific dictionaries. Example: <span lang="th" translate="no">สุนัขสีน้ำตาลตัวนี้วิ่งเร็ว</span></li>
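
As a rough illustration of dictionary-based word boundary analysis, the sketch below runs the Thai sample through the standard ECMAScript `Intl.Segmenter` API. The variable names are assumptions made for this example, and the exact boundaries returned depend on the dictionary data shipped with the JavaScript engine, so no expected word list is shown.

```javascript
// Thai sample sentence from the suggestion above; it contains no spaces.
const thai = "สุนัขสีน้ำตาลตัวนี้วิ่งเร็ว";

// Intl.Segmenter is the built-in ECMAScript API for boundary analysis.
const thaiSegmenter = new Intl.Segmenter("th", { granularity: "word" });

// Keep only the segments the engine classifies as word-like
// (this skips punctuation and whitespace segments, if any).
const thaiWords = Array.from(thaiSegmenter.segment(thai))
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(thai.split(" ").length); // 1: there are no spaces to split on
console.log(thaiWords);              // boundaries depend on the engine's dictionary data
```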

aphillips · 2026-02-06T16:27:46Z

index.html

+
+<p>Word separators differ across languages, and even for the same language, ancient and modern usage may differ. For example, in Arabic, short words like &quot;and&quot; (و) can be written directly next to the preceding word without a space (e.g., الجامعات والكليات means &quot;universities and colleges&quot;, but there is only one space between the two words). In typesetting, these words can be treated as part of the word they are attached to.</p>
+
+<p>Many scripts have different spacing conventions:</p>


spacing conventions => word boundary handling

Put CJK first as the most familiar and largest population non-spacing languages.

aphillips · 2026-02-06T16:28:23Z

index.html

+<p>Many scripts have different spacing conventions:</p>
+<ul>
+<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
+<li>In Vietnamese (written with the Latin alphabet) and Fraser script, spaces are used to separate syllables, not words.</li>


alphabet => script
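
A small sketch of why counting spaces miscounts words in both directions. The Vietnamese string is an illustrative example that is not from the PR (`học sinh`, "student", is one word written as two syllables); the Arabic string is the one quoted in the diff. The helper name `countSpaceChunks` is hypothetical.

```javascript
// Count space-delimited chunks. This is NOT a word count.
// `countSpaceChunks` is a hypothetical helper used only for illustration.
function countSpaceChunks(text) {
  return text.split(" ").length;
}

// Vietnamese: a single word ("student") written as two space-separated syllables.
const vietnamese = "học sinh";

// Arabic: "universities and colleges" is three words, but the conjunction
// is written attached to the word that follows it, so there is only one space.
const arabic = "الجامعات والكليات";

console.log(countSpaceChunks(vietnamese)); // 2 chunks for 1 word
console.log(countSpaceChunks(arabic));     // 2 chunks for 3 words
```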

aphillips · 2026-02-06T16:35:01Z

index.html

+<ul>
+<li>Scripts such as Balinese, Batak, Tai Lue, and Khmer do not have word separators, and the definition of a word is subjective. Spaces may appear in these languages, but they may be phrase separators rather than word separators.</li>
+<li>In Vietnamese (written with the Latin alphabet) and Fraser script, spaces are used to separate syllables, not words.</li>
+<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>


"have no spaces at all" is overstating and the exception is an unusual case that probably doesn't bear mentioning. All of the bullet items in this section should be focused on word boundaries in any event. I would probably do this more like:

Suggested change

<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>

<li>The scripts used for writing Chinese and Japanese generally do not use spaces, particularly between words. Example: <span lang="ja" translate="no">素早い茶色のキツネは怠け者の犬を飛び越える。</span></li>

aphillips · 2026-02-06T16:36:04Z

index.html

+<li>Writing systems like Chinese and Japanese have no spaces at all (except for a few exceptions, such as textbooks for foreigners). Although Japanese and Chinese don't use spaces, there are occasions where groups of characters are kept together for operations such as line-breaking (especially in headlines) or double-click selection.</li>
+</ul>
+
+<p>When specifications refer to 'words', it is typically in the context of performing some operation on text, such as line-breaking, text spacing, double-clicking, etc. For these operations, the appropriate boundary or location to support the operation must be identified. These boundaries and locations vary from script to script, and may also vary from language to language and depending on the operation being performed. They may or may not be marked by visible separators. For many scripts, these points of interest are indicated by space characters, but for many others they involve more complex heuristics.</p>


As noted, I would say this sooner.
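
To illustrate the point that the right boundary depends on the operation being performed, here is a minimal sketch that segments the same string at the three granularities offered by `Intl.Segmenter`. The helper `segmentsOf` and the sample string are assumptions made for this example, not content from the PR.

```javascript
// Return the segments of `text` at the requested granularity.
// `segmentsOf` is a hypothetical helper used only for illustration.
function segmentsOf(text, locale, granularity) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return [...segmenter.segment(text)];
}

const sample = "Hello world. How are you?";

// Same text, three different answers depending on the operation.
const graphemes = segmentsOf(sample, "en", "grapheme");
const words = segmentsOf(sample, "en", "word").filter((s) => s.isWordLike);
const sentences = segmentsOf(sample, "en", "sentence");

console.log(graphemes.length); // 25
console.log(words.length);     // 5
console.log(sentences.length); // 2
```

Note that `Intl.Segmenter` has no line-break granularity: line breaking follows separate rules, so a specification that needs line-break opportunities cannot simply reuse word boundaries.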

Add guidelines on word boundaries

7045ce0

Fix #169.

xfq requested review from aphillips, jsahleen and r12a February 6, 2026 04:06

aphillips requested changes Feb 6, 2026

xfq wants to merge 1 commit into gh-pages from xfq/issue-169

