Skip to content
This repository was archived by the owner on Jul 3, 2023. It is now read-only.

Conversation

@JulioCCBUcuenca
Copy link
Contributor

@JulioCCBUcuenca JulioCCBUcuenca commented Jul 19, 2018

Integration of librdfa on Any23 through the Rio API of RDF4J. For that, I have created a new parser.

In order to try out the PR, you should:

  1. Install librdfa
  2. Compile Any23

@JulioCCBUcuenca
Copy link
Contributor Author

@lewismc just as a remainder, this will be the PR that I will be using in the last stage of GSoC.

Copy link
Member

@lewismc lewismc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JulioCCBUcuenca this is looking excellent as a first pass. Please see my comments.

# Allows to decide which RDFa Extractor to enable.
# If 'on' will be activated the programmatic RDFa 1.1 Extractor
# (org.deri.any23.extractor.rdfa.RDFa11Extractor) otherwise will be
# registered the RDFa 1.0 legacy one (org.deri.any23.extractor.rdfa.RDFaExtractor).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no mention of org.deri.any23 in this file. It should be org.apache.any23, can you please correct this. Thanks


# Allows to enable Librdfa Extractor.
# If 'on' will override the extractors with the programmatic option,
# RDFa 1.1 Extractor (org.deri.any23.extractor.rdfa.RDFa11Extractor) and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace deri with apache

core/pom.xml Outdated

<!-- BEGIN: Librdfa-RDF4J, loading from repository -->
<dependency>
<groupId>${project.groupId}</groupId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 2 space indents the same as the rest of the file.

core/pom.xml Outdated
<dependency>
<groupId>${project.groupId}</groupId>
<artifactId>apache-any23-librdfa</artifactId>
<version>1.0.0</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that the versioning is consistent with the rest of the codebase e.g. 2.3-SNAPSHOT or whatever it is currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should I include librdfa-rdf4j as part of the build life-cycle?

core/pom.xml Outdated
<repositories>
<repository>
<id>librdfa-rdf4j</id>
<url>https://raw.github.com/JulioCCBUcuenca/librdfa-java/repository/</url>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit hacky... what is meant to reside at this URL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we haven't published the librdfa-rdf4j parser in maven central, I added the jars there. I added that because if while compiling librdfa is not installed, the build will fail. In that URL lives the jars of librdfa-rdf4j.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, librdfa-rdf4j is a standalone module that is not part of any23. Thus, at compiling time, any23 doesn't know about librdfa-rdf4j. If we integrate this with any, we need to deal with the requirement that librdfa-rdfa needs librdfa.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

librdfa-rdf4j should definitely be a separate module of Any23, please make sure that it is and that the require JAR dependencies are located at build and compile/runtime such.

caller.setCallback(callback);

caller.parse();
//rdfa.set_rdfa_parser(caller);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not in use, please remove.


@Override
public void default_graph(String subject, String predicate, String object, int object_type, String datatype, String language) {
System.out.println("default_graph(...)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using System calls is never particularly safe... are all of these calls necessary?

@@ -0,0 +1,40 @@
%module(directors="1") rdfa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing license header.

@@ -0,0 +1 @@
org.apache.any23.rdf.rdfa.LibrdfaRDFaParserFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License header if possible.

limitations under the License.
-->
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a strange addition. Can you explain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lang tag is used in HTML files, and the language of XHTML files is provided using xml:lang. So, since this is a HTML page, it needs to use the lang tag. Current XML parsers can use both lang or xml:lang, but since librdfa uses an old library for parsing XML it generates an error since it cannot identify the language.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats fine thanks for the explanation.

@JulioCCBUcuenca
Copy link
Contributor Author

Thanks @lewismc for the feedback. I will work on all your suggestions.

@lewismc
Copy link
Member

lewismc commented Aug 8, 2018

@JulioCCBUcuenca Please address the issue of making librdfa-rdf4j a module within Any23 then push your changes. At that stage I will most likely approve the PR and we can test more thoroughly. Thanks.

@HansBrende
Copy link
Member

Current XML parsers can use both lang or xml:lang, but since librdfa uses an old library for parsing XML it generates an error since it cannot identify the language.

That seems worrisome to me... considering then, that most html pages will break the librdfa parser.

@lewismc should we not test more thoroughly before merging to master? Maybe a separate branch instead?

Also, I would think that all of our current rdfa parsing tests should be duplicated for the librdfa parser to ensure that it is at least as stable as our current rdfa parser. TBH, I'd be in favor of adding this into version 2.4 rather than 2.3 so we have more time to thoroughly test the module.

@lewismc
Copy link
Member

lewismc commented Aug 8, 2018

@lewismc should we not test more thoroughly before merging to master? Maybe a separate branch instead?

Yes absolutely. We should also definitely extend the integration tests as well such that we can make more confident comparisons regarding runtime execution.

@lewismc
Copy link
Member

lewismc commented Aug 8, 2018

TBH, I'd be in favor of adding this into version 2.4 rather than 2.3 so we have more time to thoroughly test the module.

That is fine with me...

@JulioCCBUcuenca
Copy link
Contributor Author

JulioCCBUcuenca commented Aug 8, 2018

Thanks for your comments @lewismc. Will do it! 😄

@HansBrende librdfa supports RDFa 1.0 and 1.1, so I made sure that librdfa extractor works on all tests that Semargl extractor runs for both RDFa 1.0 and 1.1; additionally, the librdfa extractor is tested on the base suitcase (AbstractRDFaExtractorTestCase ).

I agree that librdfa functionalities should go to other release. I am mostly concerned for the requirement that librdfa-rdf4j needs, which is that librdfa (the C library) needs to be installed. Remember, that not only the libraries that librdfa uses are old but the librdfa project itself has not being active in a while.

If we are moving to other branch, could someone create a new branch, so I can point the PR to the new branch.

I will be glad to keep contributing after GSoC. Honestly, I'd like the idea of becoming a commiter, so I have to keep working to earn it. 😃

@lewismc
Copy link
Member

lewismc commented Aug 8, 2018

Remember, that not only the libraries that librdfa uses are old but the librdfa project itself has not being active in a while.

Yes but it is compliant with the RDFa test suite which is exactly what we want. It can be assumed to be somewhat complete...

If we are moving to other branch, could someone create a new branch, so I can point the PR to the new branch.

@HansBrende can you please create a new branch called ANY23-295, I do not have access to a terminal right now. Thank you in advance,

I will be glad to keep contributing after GSoC. Honestly, I'd like the idea of becoming a commiter, so I have to keep working to earn it.

👍

@HansBrende
Copy link
Member

@JulioCCBUcuenca Great to hear it passes the semargl test suites! However, AbstractRDFaExtractorTestCase contains only a fraction of the tests we run on RDFa. You should also test against the RDFaExtractorTest and RDFa11ExtractorTest classes.

@JulioCCBUcuenca
Copy link
Contributor Author

JulioCCBUcuenca commented Aug 8, 2018

You should also test against the RDFaExtractorTest and RDFa11ExtractorTest classes.

@HansBrende it is testing those main tests as well. It is testing the tests of Semargl RDFa 1.0 (RDFaExtractorTest), Semargl RDFa 1.1 (RDFa11ExtractorTest), and base suitcase AbstractRDFaExtractorTestCase.

@HansBrende
Copy link
Member

@JulioCCBUcuenca Alright, sounds great then!

@HansBrende
Copy link
Member

@lewismc I attempted to create a new branch using:

git checkout -b ANY23-295
git push canonical ANY23-295

to which git responded with:

To https://git-wip-us.apache.org/repos/asf/any23.git
 * [new branch]        ANY23-295 -> ANY23-295

However, it isn't showing up under "Branches". Not sure why. It's showing up under my own "Branches" when I pushed to origin.

@HansBrende
Copy link
Member

@HansBrende
Copy link
Member

Ok, apparently due to some quirkiness of the way mirroring works, new branches in git-wip will not show up on github until an actual change is made to the branch.

@HansBrende
Copy link
Member

Ok, I added some whitespace to a package-info.java file, should be showing up now.

@JulioCCBUcuenca JulioCCBUcuenca changed the base branch from master to ANY23-295 August 8, 2018 21:41
@JulioCCBUcuenca
Copy link
Contributor Author

JulioCCBUcuenca commented Aug 8, 2018

I changed the base branch. Thanks so much @HansBrende

<parent>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23</artifactId>
<version>2.3-SNAPSHOT</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JulioCCBUcuenca can you please update this to 2.4-SNAPSHOT so it is synchronized with master branch. Thank you

@HansBrende
Copy link
Member

@lewismc do we have any benchmarks on this version vs our current rdfa parser?

@lewismc
Copy link
Member

lewismc commented Sep 12, 2019

@HansBrende

https://cwiki.apache.org/confluence/display/ANY23/Librdfa-rdf4j+documentation#Librdfa-rdf4jdocumentation-Performance

The current implementation does not look particularly favorable for librdfa

@HansBrende
Copy link
Member

HansBrende commented Sep 13, 2019

@lewismc My first thought is: if the performance of this module is not as good as that of our current implementation, then in its current form, what is the added value?

My second thought is: the benchmarks do not test the Any23 Extractor wrappers around these rdf4j parsers, only the underlying parsers themselves. However, in Any23's BaseRDFExtractor, due to a lot of bugs in the semargl html parser, we had to preprocess the input stream using jsoup before passing "clean html" into the underlying parser. I am curious as to whether or not the librdfa parser would have any of those same html parsing bugs. If not, if I can take the preprocessing logic out of BaseRDFExtractor and move it to the semargl parser specifically, and if the librdfa parser can still pass the entire test suite without using the jsoup-preprocessed stream, then there would be a much better case for including it (as its performance would then likely eclipse our current rdfa performance without the preprocessing overhead).

@HansBrende
Copy link
Member

I created an issue for this. We should resolve that issue first and then make sure the librdfa test suite still passes before adding the librdfa module.

@HansBrende
Copy link
Member

@lewismc I've resolved that issue, moving the semargl-specific bugfixes into the semargl extractors.

@JulioCCBUcuenca a couple recommendations for this branch whenever you get a chance:

(1) Synchronize this branch with master
(2) Rerun test suite, making sure all tests still pass
(3) If possible, benchmark the Extractor wrappers (not just underlying rdf4j parsers). Doing this may give the librdfa extractor a performance edge, as the semargl parser requires the html stream to be preprocessed and transformed into strictly conforming XML.

@lewismc
Copy link
Member

lewismc commented Sep 18, 2019

@HansBrende

My first thought is: if the performance of this module is not as good as that of our current implementation, then in its current form, what is the added value?

The project was started and completed as a research effort to see what results we could come up with. Thank you so much for the additional investigation on your part... I agree with your statement. I'll see if I can find some time to update the PR. Thanks

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants