Skip to content

Conversation

@biqiboqi
Copy link

This PR introduces a Lexer module for the OvumLanguage compiler, responsible for lexical analysis and tokenization of source code. The implementation focuses on modularity, polymorphism, and a table-driven approach to reduce branching in character handling.

@biqiboqi
Copy link
Author

@codex Please review this PR and provide feedback on design, C++ best practices, and potential bugs in the lexer :)

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Comment on lines +4 to +72
#include <cctype>
#include <functional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

#include "handlers/CharHandler.hpp"
#include "handlers/DefaultHandler.hpp"
#include "handlers/Handler.hpp"
#include "handlers/IdentifierHandler.hpp"
#include "handlers/NewlineHandler.hpp"
#include "handlers/NumberHandler.hpp"
#include "handlers/OperatorHandler.hpp"
#include "handlers/PunctHandler.hpp"
#include "handlers/SlashHandler.hpp"
#include "handlers/StringHandler.hpp"
#include "handlers/WhitespaceHandler.hpp"

#include "tokens/TokenFactory.hpp"

class Lexer {
public:
explicit Lexer(std::string_view src, bool keep_comments = false);

std::vector<TokenPtr> tokenize();

bool is_at_end() const noexcept;
char peek(size_t offset = 0) const noexcept;
char current_char() const noexcept;
char advance();
void retreat_one();
void consume_while(std::string &out, const std::function<bool(char)> &pred);
std::string raw_lexeme() const;

int line() const noexcept {
return line_;
}
int token_col() const noexcept {
return token_col_;
}
bool keep_comments() const noexcept {
return keep_comments_;
}
bool is_keyword(std::string_view s) const;
bool is_multiop(std::string_view s) const;

void set_handler(unsigned char c, std::unique_ptr<Handler> handler) {
handlers_[c] = std::move(handler);
}
void set_default_handler(std::unique_ptr<Handler> handler) {
default_handler_ = std::move(handler);
}

private:
void register_defaults();

std::string_view src_;
bool keep_comments_;

size_t start_{0};
size_t current_{0};
int line_{1};
int col_{1};
int token_col_{1};

std::array<std::unique_ptr<Handler>, 256> handlers_;
std::unique_ptr<Handler> default_handler_;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Include for handlers_ member

The header declares std::array<std::unique_ptr<Handler>, 256> handlers_; but the file only includes <vector> and other headers, not <array>. Any TU including this header will fail to compile because std::array is undefined. Add #include <array> above the class definition.

Useful? React with 👍 / 👎.

Comment on lines +1 to +5
#include "DefaultHandler.hpp"

OptToken DefaultHandler::scan(Lexer &lx) {
char c = lx.current_char();
throw LexerError(std::string("Unexpected character: ") + c);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Add LexerError declaration to handler sources

This translation unit throws LexerError but none of the included headers declare it (DefaultHandler.hpp only pulls in Lexer.hpp). As a result the file will not compile. Include lib/lexer/LexerError.hpp here (and in other handlers that throw LexerError).

Useful? React with 👍 / 👎.

Comment on lines +7 to +42
while (!lx.is_at_end()) {
char c = lx.advance();
raw.push_back(c);
if (c == '"')
break;
if (c == '\\') {
char e = lx.advance();
raw.push_back(e);
switch (e) {
case 'n':
out.push_back('\n');
break;
case 't':
out.push_back('\t');
break;
case 'r':
out.push_back('\r');
break;
case '\\':
out.push_back('\\');
break;
case '"':
out.push_back('"');
break;
default:
out.push_back(e);
break;
}
} else {
if (c == '\n')
throw LexerError("Unterminated string literal (newline inside)");
out.push_back(c);
}
}
return std::make_optional(
TokenFactory::make_string_literal(std::move(raw), std::move(out), lx.line(), lx.token_col()));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect EOF before returning unterminated string tokens

The string handler breaks only when a closing " is read, but the loop exits silently when the source ends. For an input like "abc at EOF, the handler returns a string token with no error instead of signalling an unterminated literal, leaving the lexer in an inconsistent state. Check after the loop that the last character was a quote (or throw when is_at_end() becomes true) before constructing the token.

Useful? React with 👍 / 👎.

Comment on lines +113 to +124
for (unsigned char d = '0'; d <= '9'; ++d)
set_handler(d, std::make_unique<NumberHandler>());
set_handler((unsigned char) '.', std::make_unique<NumberHandler>());

set_handler((unsigned char) '"', std::make_unique<StringHandler>());
set_handler((unsigned char) '\'', std::make_unique<CharHandler>());

set_handler((unsigned char) '/', std::make_unique<SlashHandler>());

const std::string opchars = "+-*/%<>=!&|^~?:.";
for (unsigned char c : opchars)
set_handler(c, std::make_unique<OperatorHandler>());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not overwrite '.' number handler

In register_defaults you assign NumberHandler to '.' (line 115) but immediately afterwards iterate over opchars which also contains '.' (lines 122‑124). The latter assignment replaces the numeric handler, so literals like .5 are tokenized as an operator '.' followed by 5 rather than a float. Either remove '.' from opchars or re‑apply the numeric handler after operator registration.

Useful? React with 👍 / 👎.

@bialger
Copy link
Contributor

bialger commented Oct 14, 2025

Unfortunately, I have to close this PR. This repository is for more high-level logic of ovum executable. Please commit your changes to https://github.com/Ovum-Programming-Language/OvumCompiler.

@bialger bialger closed this Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants