Any23 295: Implement ability to use librdfa #104
base: ANY23-295
Conversation
Signed-off-by: Julio Caguano
…d xml:lang is used to identify in xml files.
|
@lewismc just as a reminder, this will be the PR that I will be using in the last stage of GSoC.
lewismc
left a comment
@JulioCCBUcuenca this is looking excellent as a first pass. Please see my comments.
| # Allows to decide which RDFa Extractor to enable. | ||
| # If 'on' will be activated the programmatic RDFa 1.1 Extractor | ||
| # (org.deri.any23.extractor.rdfa.RDFa11Extractor) otherwise will be | ||
| # registered the RDFa 1.0 legacy one (org.deri.any23.extractor.rdfa.RDFaExtractor). |
There should be no mention of org.deri.any23 in this file; it should be org.apache.any23. Can you please correct this? Thanks.
|
|
||
| # Allows to enable Librdfa Extractor. | ||
| # If 'on' will override the extractors with the programmatic option, | ||
| # RDFa 1.1 Extractor (org.deri.any23.extractor.rdfa.RDFa11Extractor) and |
Replace deri with apache
core/pom.xml
Outdated
|
|
||
| <!-- BEGIN: Librdfa-RDF4J, loading from repository --> | ||
| <dependency> | ||
| <groupId>${project.groupId}</groupId> |
Please use 2 space indents the same as the rest of the file.
core/pom.xml
Outdated
| <dependency> | ||
| <groupId>${project.groupId}</groupId> | ||
| <artifactId>apache-any23-librdfa</artifactId> | ||
| <version>1.0.0</version> |
Ensure that the versioning is consistent with the rest of the codebase e.g. 2.3-SNAPSHOT or whatever it is currently.
So, should I include librdfa-rdf4j as part of the build life-cycle?
core/pom.xml
Outdated
| <repositories> | ||
| <repository> | ||
| <id>librdfa-rdf4j</id> | ||
| <url>https://raw.github.com/JulioCCBUcuenca/librdfa-java/repository/</url> |
This is a bit hacky... what is meant to reside at this URL?
Since we haven't published the librdfa-rdf4j parser to Maven Central, I added the JARs there. I added the repository because the build fails if librdfa is not installed at compile time. The librdfa-rdf4j JARs live at that URL.
Currently, librdfa-rdf4j is a standalone module that is not part of Any23. Thus, at compile time, Any23 doesn't know about librdfa-rdf4j. If we integrate it with Any23, we need to deal with the requirement that librdfa-rdf4j needs librdfa.
librdfa-rdf4j should definitely be a separate module of Any23. Please make sure that it is, and that the required JAR dependencies can be located at build time and at compile/runtime.
librdfa-rdf4j/src/main/c/main.java
Outdated
| caller.setCallback(callback); | ||
|
|
||
| caller.parse(); | ||
| //rdfa.set_rdfa_parser(caller); |
If not in use, please remove.
librdfa-rdf4j/src/main/c/main.java
Outdated
|
|
||
| @Override | ||
| public void default_graph(String subject, String predicate, String object, int object_type, String datatype, String language) { | ||
| System.out.println("default_graph(...)"); |
Using System calls is never particularly safe... are all of these calls necessary?
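For illustration, here is a minimal sketch of how the `System.out.println` calls in the callback could be swapped for a logger. Only the `default_graph` signature comes from the diff above; the `TripleCallback` interface, the class name, and the use of `java.util.logging` are assumptions made to keep the example self-contained (the project may well prefer a different logging API).

```java
import java.util.logging.Logger;

final class LoggingCallbackSketch {

    // Hypothetical stand-in for the SWIG-generated callback type;
    // only the default_graph signature is taken from the PR diff.
    interface TripleCallback {
        void default_graph(String subject, String predicate, String object,
                           int object_type, String datatype, String language);
    }

    private static final Logger LOG =
            Logger.getLogger(LoggingCallbackSketch.class.getName());

    // Builds the message separately so it can be checked without a log handler.
    static String describe(String subject, String predicate, String object) {
        return "default_graph(" + subject + ", " + predicate + ", " + object + ")";
    }

    // Returns a callback that logs at FINE level instead of printing to stdout.
    // The Supplier overload defers string building until FINE is enabled.
    static TripleCallback loggingCallback() {
        return (s, p, o, type, datatype, lang) -> LOG.fine(() -> describe(s, p, o));
    }

    public static void main(String[] args) {
        loggingCallback().default_graph(
                "http://example.org/s", "http://example.org/p", "o", 0, null, "en");
    }
}
```

With the default `java.util.logging` configuration, FINE messages are not shown, so the callback stays silent unless debugging output is switched on.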
| @@ -0,0 +1,40 @@ | |||
| %module(directors="1") rdfa | |||
Missing license header.
| @@ -0,0 +1 @@ | |||
| org.apache.any23.rdf.rdfa.LibrdfaRDFaParserFactory | |||
License header if possible.
| limitations under the License. | ||
| --> | ||
| <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"> | ||
| <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"> |
Seems like a strange addition. Can you explain?
The lang attribute is used in HTML files, whereas the language of XHTML files is given with xml:lang. Since this is an HTML page, it needs the lang attribute. Current XML parsers accept either lang or xml:lang, but librdfa uses an old XML parsing library, which raises an error because it cannot identify the language.
That's fine, thanks for the explanation.
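To make the `lang` / `xml:lang` distinction concrete, here is a small sketch of a consumer that accepts either attribute on the root element, using only the JDK's DOM parser. It does not reproduce librdfa's behavior; the class and method names are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

final class LangAttributeSketch {

    // Returns the root element's language, preferring lang over xml:lang.
    // Returns an empty string when neither attribute is present.
    static String rootLanguage(String xhtml) {
        try {
            // The factory is namespace-unaware by default, so "xml:lang"
            // is looked up by its literal qualified name.
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            String lang = root.getAttribute("lang");
            return lang.isEmpty() ? root.getAttribute("xml:lang") : lang;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String before = "<html xml:lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        String after = "<html lang=\"en\" xml:lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println(rootLanguage(before));
        System.out.println(rootLanguage(after));
    }
}
```

A parser that only understands one of the two attributes would need the test page to carry that attribute explicitly, which is what the change to the `<html>` element does.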
|
Thanks @lewismc for the feedback. I will work on all your suggestions. |
|
@JulioCCBUcuenca Please address the issue of making librdfa-rdf4j a module within Any23 then push your changes. At that stage I will most likely approve the PR and we can test more thoroughly. Thanks. |
|
That seems worrisome to me, considering that most HTML pages would then break the librdfa parser. @lewismc should we not test more thoroughly before merging to master? Maybe a separate branch instead? Also, I would think that all of our current RDFa parsing tests should be duplicated for the librdfa parser to ensure that it is at least as stable as our current RDFa parser. TBH, I'd be in favor of adding this in version 2.4 rather than 2.3 so we have more time to thoroughly test the module.
Yes absolutely. We should also definitely extend the integration tests as well such that we can make more confident comparisons regarding runtime execution. |
That is fine with me... |
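One way to picture the "duplicate the tests for both parsers" idea is a parity check that feeds the same inputs to a reference and a candidate implementation and reports where they disagree. The sketch below is only an assumption about how that could look: the two toy functions stand in for the existing RDFa extractor and the librdfa one, and statements are modeled as plain strings rather than RDF4J objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class ParserParitySketch {

    // Returns the inputs for which the candidate's output differs from the reference's.
    static List<String> mismatches(List<String> inputs,
                                   Function<String, Set<String>> reference,
                                   Function<String, Set<String>> candidate) {
        List<String> failing = new ArrayList<>();
        for (String input : inputs) {
            if (!reference.apply(input).equals(candidate.apply(input))) {
                failing.add(input);
            }
        }
        return failing;
    }

    // Toy reference "parser": every whitespace-separated token is a statement.
    static Set<String> referenceStatements(String input) {
        return new HashSet<>(Arrays.asList(input.trim().split("\\s+")));
    }

    // Toy candidate "parser": same as the reference, but drops tokens starting
    // with "xml:" to simulate an implementation that misses some input.
    static Set<String> candidateStatements(String input) {
        Set<String> statements = referenceStatements(input);
        statements.removeIf(token -> token.startsWith("xml:"));
        return statements;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a b", "a xml:b");
        System.out.println(mismatches(inputs,
                ParserParitySketch::referenceStatements,
                ParserParitySketch::candidateStatements));
    }
}
```

Comparing sets rather than lists keeps the check independent of the order in which each parser emits statements.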
|
Thanks for your comments @lewismc. Will do it! 😄 @HansBrende librdfa supports RDFa 1.0 and 1.1, so I made sure that the librdfa extractor works on all tests that the Semargl extractor runs for both RDFa 1.0 and 1.1; additionally, the librdfa extractor is tested on the base suitcase ( I agree that the librdfa functionality should go into another release. I am mostly concerned about the requirement that librdfa-rdf4j has, which is that librdfa (the C library) needs to be installed. Remember that not only are the libraries that librdfa uses old, but the librdfa project itself has not been active in a while. If we are moving to another branch, could someone create a new branch so I can point the PR to it? I will be glad to keep contributing after GSoC. Honestly, I like the idea of becoming a committer, so I have to keep working to earn it. 😃
Yes but it is compliant with the RDFa test suite which is exactly what we want. It can be assumed to be somewhat complete...
@HansBrende can you please create a new branch called ANY23-295, I do not have access to a terminal right now. Thank you in advance,
👍 |
|
@JulioCCBUcuenca Great to hear it passes the semargl test suites! However, |
@HansBrende it is testing those main tests as well. It is testing the tests of Semargl RDFa 1.0 ( |
|
@JulioCCBUcuenca Alright, sounds great then! |
|
@lewismc I attempted to create a new branch using: to which git responded with: However, it isn't showing up under "Branches". Not sure why. It's showing up under my own "Branches" when I pushed to |
|
It is showing up here: https://git-wip-us.apache.org/repos/asf?p=any23.git But not here: https://github.com/apache/any23/branches |
|
Ok, apparently due to some quirkiness of the way mirroring works, new branches in git-wip will not show up on github until an actual change is made to the branch. |
|
Ok, I added some whitespace to a package-info.java file, should be showing up now. |
|
I changed the base branch. Thanks so much @HansBrende |
| <parent> | ||
| <groupId>org.apache.any23</groupId> | ||
| <artifactId>apache-any23</artifactId> | ||
| <version>2.3-SNAPSHOT</version> |
@JulioCCBUcuenca can you please update this to 2.4-SNAPSHOT so it is synchronized with master branch. Thank you
|
@lewismc do we have any benchmarks on this version vs our current rdfa parser? |
|
The current implementation does not look particularly favorable for librdfa |
|
@lewismc My first thought is: if the performance of this module is not as good as that of our current implementation, then in its current form, what is the added value? My second thought is: the benchmarks do not test the Any23 |
|
I created an issue for this. We should resolve that issue first and then make sure the librdfa test suite still passes before adding the librdfa module. |
|
@lewismc I've resolved that issue, moving the semargl-specific bugfixes into the semargl extractors. @JulioCCBUcuenca a couple recommendations for this branch whenever you get a chance: (1) Synchronize this branch with master |
The project was started and completed as a research effort to see what results we could come up with. Thank you so much for the additional investigation on your part... I agree with your statement. I'll see if I can find some time to update the PR. Thanks |
Integration of librdfa into Any23 through the Rio API of RDF4J. For that, I have created a new parser.
In order to try out the PR, you should: