Quickstart 2 - A better word counter using Spirit.Lex


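People who know Flex will probably complain that the example in Quickstart 1 - A word counter using Spirit.Lex is overly complex and does not take advantage of what the tool offers. In particular, the previous example did not use lexer actions directly to count lines, words, and characters. The example in this step of the tutorial therefore shows how to use semantic actions in Spirit.Lex. Although it still counts textual elements, its purpose is to introduce new concepts and configuration options along the way (for the full example code see word_count_lexer.cpp).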

Prerequisites

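In addition to the single #include required for Spirit.Lex itself, this example needs a few header files from the Boost.Phoenix library. The example shows how to attach function objects to token definitions. Any C++ technique that produces a callable object would work for this; using Boost.Phoenix simplifies the task and avoids a dependency on yet another library (Spirit already uses Boost.Phoenix).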

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/phoenix/operator.hpp>
#include <boost/phoenix/statement.hpp>
#include <boost/phoenix/stl/algorithm.hpp>
#include <boost/phoenix/core.hpp>

To make the code below more readable, we introduce the following namespace alias.

namespace lex = boost::spirit::lex;

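To give you an idea of what this example is about, here is the Flex program that served as the starting point. The useful code sits directly inside the actions associated with each token definition.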

%{
    #include <stdio.h>
    int c = 0, w = 0, l = 0;   /* characters, words, lines */
%}
%%
[^ \t\n]+  { ++w; c += yyleng; }
\n         { ++c; ++l; }
.          { ++c; }
%%
int main()
{
    yylex();
    printf("%d %d %d\n", l, w, c);
    return 0;
}
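For readers who do not use Flex, the three rules can be written out by hand. The following is only a sketch in plain C++ of what the Flex program computes; the names `counts` and `count_text` are made up for this illustration and are not part of Flex or Spirit.Lex. It relies on the Flex matching rule that the longest match wins and that, on a tie, the rule listed first wins, so `.` ends up matching only blanks and tabs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct counts
{
    std::size_t chars = 0, words = 0, lines = 0;
};

// Applies the three rules of the Flex program by hand.
inline counts count_text(std::string const& text)
{
    counts n;
    std::size_t i = 0;
    while (i < text.size()) {
        char const ch = text[i];
        if (ch == '\n') {                     // rule: \n
            ++n.chars;
            ++n.lines;
            ++i;
        }
        else if (ch == ' ' || ch == '\t') {   // rule: .  (only blanks and tabs get here)
            ++n.chars;
            ++i;
        }
        else {                                // rule: [^ \t\n]+  (longest match)
            std::size_t const start = i;
            while (i < text.size() && text[i] != ' '
                   && text[i] != '\t' && text[i] != '\n')
                ++i;
            ++n.words;
            n.chars += i - start;             // corresponds to c += yyleng
        }
    }
    return n;
}
```

Calling `count_text("hello world\nfoo\tbar\n")` reports 20 characters, 4 words, and 2 lines.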

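Semantic Actions in Spirit.Lex

Spirit.Lex associates actions with token definitions in a very similar way (which should look familiar to anybody who knows Spirit): the operations to execute are specified inside a pair of [] brackets. To make it possible to attach semantic actions, each token definition is declared as an instance of token_def<>.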

template <typename Lexer>
struct word_count_tokens : lex::lexer<Lexer>
{
    word_count_tokens()
      : c(0), w(0), l(0)
      , word("[^ \t\n]+")     // define tokens
      , eol("\n")
      , any(".")
    {
        using boost::spirit::lex::_start;
        using boost::spirit::lex::_end;
        using boost::phoenix::ref;

        // associate tokens with the lexer
        this->self
            =   word  [++ref(w), ref(c) += distance(_start, _end)]
            |   eol   [++ref(c), ++ref(l)]
            |   any   [++ref(c)]
            ;
    }

    std::size_t c, w, l;
    lex::token_def<> word, eol, any;
};

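The semantics of the code shown are as follows. The code inside the [] brackets is executed whenever the lexical analyzer matches the corresponding token. This is very similar to Flex, where the action code associated with a token definition is executed after a matching input sequence has been recognized. The code above uses function objects built with Boost.Phoenix, but any C++ function or function object can be plugged in as long as it exposes the proper interface. For more details, please refer to the section Lexer Semantic Actions.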
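The idea that an action is just a callable invoked on every match can be modelled without Spirit.Lex at all. The sketch below is a toy, not the library's implementation: `toy_token_def`, `toy_tokenize`, `add_length`, and `toy_word_count` are hypothetical names, token definitions are character predicates instead of regular expressions, and the action interface is simplified to a `(start, end)` pair. The interface Spirit.Lex actually expects is described in the section Lexer Semantic Actions.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a token definition: a character class plus an action.
struct toy_token_def
{
    std::function<bool(char)> accepts;   // which characters belong to the token
    bool greedy;                         // true: match a run, false: one character
    std::function<void(char const*, char const*)> action;  // gets [start, end)
};

// Tries the definitions in order at each position and runs the action of the
// first one that matches. Returns false if no definition matches.
inline bool toy_tokenize(std::string const& input,
                         std::vector<toy_token_def> const& defs)
{
    char const* p = input.data();
    char const* const last = p + input.size();
    while (p != last) {
        bool matched = false;
        for (toy_token_def const& d : defs) {
            if (!d.accepts(*p))
                continue;
            char const* q = p + 1;
            while (d.greedy && q != last && d.accepts(*q))
                ++q;
            if (d.action)
                d.action(p, q);          // the "semantic action"
            p = q;
            matched = true;
            break;
        }
        if (!matched)
            return false;
    }
    return true;
}

// A hand-written function object; usable wherever a lambda would be.
struct add_length
{
    std::size_t& total;
    void operator()(char const* start, char const* end) const
    {
        total += static_cast<std::size_t>(end - start);
    }
};

// The three rules of the tutorial, mixing lambdas and the function object.
inline bool toy_word_count(std::string const& input,
                           std::size_t& c, std::size_t& w, std::size_t& l)
{
    std::vector<toy_token_def> const defs = {
        { [](char ch) { return ch != ' ' && ch != '\t' && ch != '\n'; }, true,
          [&](char const* s, char const* e) {
              ++w;
              c += static_cast<std::size_t>(e - s);
          } },
        { [](char ch) { return ch == '\n'; }, false,
          [&](char const*, char const*) { ++c; ++l; } },
        { [](char) { return true; }, false, add_length{c} }
    };
    return toy_tokenize(input, defs);
}
```

With all three counters starting at zero, `toy_word_count("one two\n", c, w, l)` leaves `c == 8`, `w == 2`, and `l == 1`. The point of the sketch is only that the tokenizer does not care how a callable was produced, which is the property that allows Phoenix expressions, plain functions, and function objects to be used interchangeably.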

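Associating Token Definitions with the Lexer

If you compare this code with the code from Quickstart 1 - A word counter using Spirit.Lex, you will notice that a different syntax is used to associate the token definitions with the lexer. The previous example used the self.add() style of the API, whereas here the token definitions are assigned directly to self and combined with the | operator. Here is the code snippet again: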

this->self
    =   word  [++ref(w), ref(c) += distance(_start, _end)]
    |   eol   [++ref(c), ++ref(l)]
    |   any   [++ref(c)]
    ;

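This gives us a powerful and natural way of building the lexical analyzer. Read aloud, the statement says: the lexical analyzer will recognize ('=') tokens as defined by any of ('|') the token definitions word, eol, and any.

A second difference from the previous example is that we do not explicitly specify token ids for the individual tokens. Because the semantic actions do the useful work, we no longer need to define them. To ensure that every token still gets an id, the Spirit.Lex library internally assigns unique numbers to the token definitions, starting with the constant boost::spirit::lex::min_token_id.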
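The automatic numbering can be pictured with a small model. This is a sketch under stated assumptions, not the library's code: `toy_def`, `assign_ids`, and `toy_min_token_id` are hypothetical names, the starting value is chosen arbitrarily for the illustration, and an id of 0 is taken to mean "not chosen by the user". The sketch also does not guard against a user-chosen id colliding with an automatic one.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumption for this sketch: an arbitrary starting value. Spirit.Lex uses
// its own constant, boost::spirit::lex::min_token_id.
constexpr std::size_t toy_min_token_id = 0x10000;

// Toy token definition: a pattern and an optional user-chosen id.
struct toy_def
{
    explicit toy_def(std::string p, std::size_t i = 0)
      : pattern(std::move(p)), id(i) {}

    std::string pattern;
    std::size_t id;         // 0 means "no id chosen by the user"
};

// Gives every definition without an explicit id the next free number,
// starting at toy_min_token_id.
inline void assign_ids(std::vector<toy_def>& defs)
{
    std::size_t next = toy_min_token_id;
    for (toy_def& d : defs)
        if (d.id == 0)
            d.id = next++;
}
```

Definitions that already carry an id keep it; the others receive consecutive numbers in the order in which they were added.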

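Pulling Everything Together

To execute the code defined above, we still need to instantiate the lexer type, feed it an input sequence, and create a pair of iterators for walking over the token sequence the lexer produces. The following code shows how to do this: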

int main(int argc, char* argv[])
{
    // Token type without attribute and without lexer state: it stores
    // only the token id.
    typedef
        lex::lexertl::token<char const*, lex::omit, boost::mpl::false_>
    token_type;

    // Lexer type to use; actor_lexer supports semantic actions.
    typedef lex::lexertl::actor_lexer<token_type> lexer_type;

    // Lexer object instance that performs the lexical analysis.
    word_count_tokens<lexer_type> word_count_lexer;

    // Read the input; read_from_file() is defined outside this excerpt.
    std::string str (read_from_file(1 == argc ? "word_count.input" : argv[1]));
    char const* first = str.c_str();
    char const* last = &first[str.size()];

    // Pair of iterators over the generated token sequence.
    lexer_type::iterator_type iter = word_count_lexer.begin(first, last);
    lexer_type::iterator_type end = word_count_lexer.end();

    // Iterate over all tokens; stop if the lexer returns an invalid token.
    while (iter != end && token_is_valid(*iter))
        ++iter;

    if (iter == end) {
        std::cout << "lines: " << word_count_lexer.l
                  << ", words: " << word_count_lexer.w
                  << ", characters: " << word_count_lexer.c
                  << "\n";
    }
    else {
        std::string rest(first, last);
        std::cout << "Lexical analysis failed\n" << "stopped at: \""
                  << rest << "\"\n";
    }
    return 0;
}

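Notes on the code above, in order of appearance:

token_type: Specifying `omit` as the token attribute type generates a token class that holds no token attribute at all (not even the iterator range of the matched input sequence), which allows the token, the lexer, and possibly the parser implementation to be optimized as far as possible. Specifying `mpl::false_` as the third template parameter generates a token type and an iterator that do not hold the lexer state either, allowing for even more aggressive optimizations. As a result, the token instances contain the token id as their only data member.
lexer_type: This defines the lexer type to use.
word_count_lexer: Create the lexer object instance needed to invoke the lexical analysis.
str, first, last: Read the input from the given file. All of the input is tokenized, and all generated tokens are discarded.
iter, end: Create a pair of iterators that return the sequence of generated tokens.
while loop: Here we simply iterate over all tokens, making sure to break out of the loop if the lexer returns an invalid token.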
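The effect of leaving out the attribute and the lexer state can be illustrated with two plain structs. These are hypothetical layouts made up for this sketch, not the actual members of `lex::lexertl::token`; they only show why a token reduced to its id is cheaper to create and copy.

```cpp
#include <cstddef>

// Toy token that carries everything: id, lexer state, and the matched range.
struct full_toy_token
{
    std::size_t id;
    std::size_t state;
    char const* first;
    char const* last;
};

// Toy token in the spirit of token<char const*, omit, mpl::false_>:
// the id is the only data member.
struct minimal_toy_token
{
    std::size_t id;
};

static_assert(sizeof(minimal_toy_token) < sizeof(full_toy_token),
              "the stripped-down token is smaller than the full one");
```

Because the word counter does all of its work in semantic actions and never inspects the tokens afterwards, nothing is lost by using the smaller type.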
