Skip to content

Conversation

@stloyd
Copy link
Member

@stloyd stloyd commented Oct 29, 2025

Resolves: #1914

Change Log


Added

  • Add a new `HTMLEntry`

Fixed

Changed

Removed

Deprecated

Security

@stloyd stloyd requested a review from norberttech as a code owner October 29, 2025 18:04
@stloyd stloyd force-pushed the task/1914-html-entry branch from 21c7f77 to 253cd95 Compare October 29, 2025 18:05
@codecov
Copy link

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 92.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.72%. Comparing base (9e7c5f7) to head (d18f6af).
⚠️ Report is 2 commits behind head on 1.x.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1940      +/-   ##
==========================================
+ Coverage   79.67%   79.72%   +0.05%     
==========================================
  Files         832      833       +1     
  Lines       24748    24771      +23     
==========================================
+ Hits        19718    19749      +31     
+ Misses       5030     5022       -8     
Components Coverage Δ
etl 88.84% <90.24%> (+0.02%) ⬆️
cli 85.96% <ø> (ø)
lib-array-dot 95.00% <ø> (ø)
lib-azure-sdk 60.39% <ø> (ø)
lib-doctrine-dbal-bulk 95.14% <ø> (ø)
lib-filesystem 80.68% <100.00%> (ø)
lib-types 88.88% <100.00%> (+0.54%) ⬆️
lib-parquet 68.35% <ø> (ø)
lib-parquet-viewer 83.04% <ø> (ø)
lib-snappy 90.18% <ø> (ø)
bridge-filesystem-async-aws 91.00% <ø> (ø)
bridge-filesystem-azure 89.47% <ø> (ø)
bridge-monolog-http 96.91% <ø> (ø)
bridge-openapi-specification 94.50% <ø> (ø)
symfony-http-foundation 73.17% <ø> (ø)
adapter-chartjs 86.36% <ø> (ø)
adapter-csv 89.08% <ø> (ø)
adapter-doctrine 90.97% <ø> (ø)
adapter-elasticsearch 97.17% <ø> (ø)
adapter-google-sheet 91.40% <ø> (ø)
adapter-http 60.56% <ø> (ø)
adapter-json 89.80% <ø> (ø)
adapter-logger 83.33% <ø> (ø)
adapter-meilisearch 97.87% <ø> (ø)
adapter-parquet 78.91% <ø> (ø)
adapter-text 88.09% <ø> (ø)
adapter-xml 83.07% <ø> (ø)
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@stloyd stloyd force-pushed the task/1914-html-entry branch 2 times, most recently from 5dba5b6 to 5561f51 Compare October 31, 2025 08:52
@norberttech
Copy link
Member

there is a bug in https://github.com/flow-php/flow/blob/1.x/phpunit.xml.dist#L75C1-L80C17 (those should have lib- prefix) which prevents types tests from being executed on the CI/CD 🤦‍♂️ I noticed because on my local machine I have overwritten phpunjt.xml.dist and noticed that html type related tests are failing

@norberttech
Copy link
Member

(it was automatically closed since #1944 mentioned this as something that should be resolved)

@stloyd stloyd force-pushed the task/1914-html-entry branch 2 times, most recently from d56e6d4 to ef8d7bf Compare October 31, 2025 13:26
@norberttech
Copy link
Member

I checked out locally to that branch (to figure out why infection is complaining) and when I execute:

 nix-shell --arg php-version 8.2 --run "composer test:lib:types"

I'm getting:

1) Flow\Types\Tests\Unit\Value\HTMLDocumentTest::test_create_with_invalid_html_on_old
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'<p>invalid</p>'
+''

/Users/norbert/Workspace/flow-php/flow/src/lib/types/tests/Flow/Types/Tests/Unit/Value/HTMLDocumentTest.php:48

Which makes sense, that test is expecting that DOMDocument is going to wrap value with <p></p> but you added \LIBXML_HTML_NOIMPLIED that should prevent that and I have no idea why it's passing on GitHub Actions :|

Different version of libxml perhaps? 🤔

@norberttech
Copy link
Member

As for CodeCove / Infection - this PR just does not have enough code coverage (I thought it's something else related to tools config).

https://app.codecov.io/gh/flow-php/flow/pull/1940?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=flow-php

It adds some lines of code, but around 27% of HTMLEntry is not covered by any test

@norberttech
Copy link
Member

Different version of libxml perhaps? 🤔

Yes, here is the result of the investigation around this issue:

Breaking change: libxml2 2.14.0 (March 27, 2025)
The LIBXML_HTML_NOIMPLIED behavior changed fundamentally in libxml2 2.14.0 when the HTML parser was rewritten to use an HTML5-compliant tokenizer.
Key changes:
Before 2.14.0:

HTML parser based on HTML4 spec (underspecified)
LIBXML_HTML_NOIMPLIED worked leniently with plain text and fragments
Plain text like 'body' was tolerated even without proper HTML tags

From 2.14.0 onwards:

HTML tokenizer fully conforms to HTML5 specification
Tree construction still uses custom algorithm (not fully HTML5)
Stricter handling of invalid markup
LIBXML_HTML_NOIMPLIED now requires actual HTML elements - plain text without tags results in empty output

Copy link
Member

@norberttech norberttech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTMLEntry needs better coverage

@norberttech
Copy link
Member

After taking another look at \Flow\Types\Value\HTMLDocument I think there is a bigger problem with the implementation.

\Flow\Types\Value\Uuid - when you pass string that is not Uuid or uuid string it will simply throw an exception.

    public function test_construct_with_invalid_string_uuid_throws_exception() : void
    {
        $this->expectException(InvalidArgumentException::class);
        new Uuid('invalid-uuid-string');
    }

\Flow\Types\Value\HTMLDocument is making some weird flips

    #[RequiresPhp('>= 8.4')]
    public function test_create_with_proper_html_on_newer() : void
    {
        $html = '<html><body><div><span>bar</span></div></body></html>';
        $document = new HTMLDocument($html);

        self::assertSame($html, (string) $document);
    }

So what needs to happen is following:

HTMLDocument needs to validate the input and check if passed value is either:

\DOM\HTMLDocument - and extract string from it
\DOMDocument - and extract string from it
string - nothing to extract

Once values are extracted it should validate if given string is a valid HTML5 by checking if:

  • starts with <!DOCTYPE html ...>
  • root element is named html
  • root element has exactly one body
  • root element has exactly one head

So now the test that is failing because of different versions of libxml should simply throw an exception.

@norberttech norberttech closed this Nov 2, 2025
@norberttech norberttech reopened this Nov 2, 2025
@stloyd stloyd force-pushed the task/1914-html-entry branch from 8764b22 to 937d7ed Compare November 2, 2025 14:49
@stloyd stloyd requested a review from norberttech November 2, 2025 15:21
@stloyd stloyd force-pushed the task/1914-html-entry branch from 254aacf to d18f6af Compare November 2, 2025 17:18
@stloyd stloyd requested a review from norberttech November 2, 2025 17:19
@norberttech norberttech merged commit 7327e12 into flow-php:1.x Nov 2, 2025
17 of 18 checks passed
@stloyd stloyd deleted the task/1914-html-entry branch November 2, 2025 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Task]: HTML Entry

2 participants