-
Notifications
You must be signed in to change notification settings - Fork 0
Lexer! #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@codex Please review this PR and provide feedback on design, C++ best practices, and potential bugs in the lexer :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
| #include <cctype> | ||
| #include <functional> | ||
| #include <stdexcept> | ||
| #include <string> | ||
| #include <string_view> | ||
| #include <unordered_set> | ||
| #include <vector> | ||
|
|
||
| #include "handlers/CharHandler.hpp" | ||
| #include "handlers/DefaultHandler.hpp" | ||
| #include "handlers/Handler.hpp" | ||
| #include "handlers/IdentifierHandler.hpp" | ||
| #include "handlers/NewlineHandler.hpp" | ||
| #include "handlers/NumberHandler.hpp" | ||
| #include "handlers/OperatorHandler.hpp" | ||
| #include "handlers/PunctHandler.hpp" | ||
| #include "handlers/SlashHandler.hpp" | ||
| #include "handlers/StringHandler.hpp" | ||
| #include "handlers/WhitespaceHandler.hpp" | ||
|
|
||
| #include "tokens/TokenFactory.hpp" | ||
|
|
||
| class Lexer { | ||
| public: | ||
| explicit Lexer(std::string_view src, bool keep_comments = false); | ||
|
|
||
| std::vector<TokenPtr> tokenize(); | ||
|
|
||
| bool is_at_end() const noexcept; | ||
| char peek(size_t offset = 0) const noexcept; | ||
| char current_char() const noexcept; | ||
| char advance(); | ||
| void retreat_one(); | ||
| void consume_while(std::string &out, const std::function<bool(char)> &pred); | ||
| std::string raw_lexeme() const; | ||
|
|
||
| int line() const noexcept { | ||
| return line_; | ||
| } | ||
| int token_col() const noexcept { | ||
| return token_col_; | ||
| } | ||
| bool keep_comments() const noexcept { | ||
| return keep_comments_; | ||
| } | ||
| bool is_keyword(std::string_view s) const; | ||
| bool is_multiop(std::string_view s) const; | ||
|
|
||
| void set_handler(unsigned char c, std::unique_ptr<Handler> handler) { | ||
| handlers_[c] = std::move(handler); | ||
| } | ||
| void set_default_handler(std::unique_ptr<Handler> handler) { | ||
| default_handler_ = std::move(handler); | ||
| } | ||
|
|
||
| private: | ||
| void register_defaults(); | ||
|
|
||
| std::string_view src_; | ||
| bool keep_comments_; | ||
|
|
||
| size_t start_{0}; | ||
| size_t current_{0}; | ||
| int line_{1}; | ||
| int col_{1}; | ||
| int token_col_{1}; | ||
|
|
||
| std::array<std::unique_ptr<Handler>, 256> handlers_; | ||
| std::unique_ptr<Handler> default_handler_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The header declares std::array<std::unique_ptr<Handler>, 256> handlers_; but the file only includes <vector> and other headers, not <array>. Any TU including this header will fail to compile because std::array is undefined. Add #include <array> above the class definition.
Useful? React with 👍 / 👎.
| #include "DefaultHandler.hpp" | ||
|
|
||
| OptToken DefaultHandler::scan(Lexer &lx) { | ||
| char c = lx.current_char(); | ||
| throw LexerError(std::string("Unexpected character: ") + c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add LexerError declaration to handler sources
This translation unit throws LexerError but none of the included headers declare it (DefaultHandler.hpp only pulls in Lexer.hpp). As a result the file will not compile. Include lib/lexer/LexerError.hpp here (and in other handlers that throw LexerError).
Useful? React with 👍 / 👎.
| while (!lx.is_at_end()) { | ||
| char c = lx.advance(); | ||
| raw.push_back(c); | ||
| if (c == '"') | ||
| break; | ||
| if (c == '\\') { | ||
| char e = lx.advance(); | ||
| raw.push_back(e); | ||
| switch (e) { | ||
| case 'n': | ||
| out.push_back('\n'); | ||
| break; | ||
| case 't': | ||
| out.push_back('\t'); | ||
| break; | ||
| case 'r': | ||
| out.push_back('\r'); | ||
| break; | ||
| case '\\': | ||
| out.push_back('\\'); | ||
| break; | ||
| case '"': | ||
| out.push_back('"'); | ||
| break; | ||
| default: | ||
| out.push_back(e); | ||
| break; | ||
| } | ||
| } else { | ||
| if (c == '\n') | ||
| throw LexerError("Unterminated string literal (newline inside)"); | ||
| out.push_back(c); | ||
| } | ||
| } | ||
| return std::make_optional( | ||
| TokenFactory::make_string_literal(std::move(raw), std::move(out), lx.line(), lx.token_col())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detect EOF before returning unterminated string tokens
The string handler breaks only when a closing " is read, but the loop exits silently when the source ends. For an input like "abc at EOF, the handler returns a string token with no error instead of signalling an unterminated literal, leaving the lexer in an inconsistent state. Check after the loop that the last character was a quote (or throw when is_at_end() becomes true) before constructing the token.
Useful? React with 👍 / 👎.
| for (unsigned char d = '0'; d <= '9'; ++d) | ||
| set_handler(d, std::make_unique<NumberHandler>()); | ||
| set_handler((unsigned char) '.', std::make_unique<NumberHandler>()); | ||
|
|
||
| set_handler((unsigned char) '"', std::make_unique<StringHandler>()); | ||
| set_handler((unsigned char) '\'', std::make_unique<CharHandler>()); | ||
|
|
||
| set_handler((unsigned char) '/', std::make_unique<SlashHandler>()); | ||
|
|
||
| const std::string opchars = "+-*/%<>=!&|^~?:."; | ||
| for (unsigned char c : opchars) | ||
| set_handler(c, std::make_unique<OperatorHandler>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not overwrite '.' number handler
In register_defaults you assign NumberHandler to '.' (line 115) but immediately afterwards iterate over opchars which also contains '.' (lines 122‑124). The latter assignment replaces the numeric handler, so literals like .5 are tokenized as an operator '.' followed by 5 rather than a float. Either remove '.' from opchars or re‑apply the numeric handler after operator registration.
Useful? React with 👍 / 👎.
|
Unfortunately, I have to close this PR. This repository is for more high-level logic of |
This PR introduces a Lexer module for the OvumLanguage compiler, responsible for lexical analysis and tokenization of source code. The implementation focuses on modularity, polymorphism, and a table-driven approach to reduce branching in character handling.