Introduction to Buckets and Brigades
The Apache 2 Filter Architecture is the major innovation that sets it apart from other webservers, including Apache 1.x, as a uniquely powerful and versatile applications platform. But this power comes at a price: there is a bit of a learning curve to harnessing it. Apart from understanding the architecture itself, the crux of the matter is to get to grips with Buckets and Brigades, the building blocks of a filter.
In this article, we introduce buckets and brigades, taking you to the point where you should have a basic working knowledge. In the process, we develop a simple but useful filter module that works by manipulating buckets and brigades directly.
This direct manipulation is the lowest-level API for working with buckets and brigades, and probably the hardest to use. But because it is low level, it serves to demonstrate what's going on. In other articles we will discuss related subjects including debugging, resource management, and alternative ways to work with the data.
Basic Concepts
The basic concepts we are dealing with are the bucket and the brigade. Let us first introduce them, before moving on to why and how to use them.
Buckets
A bucket is a container for data. Buckets can contain any type of data. Although the most common case is a block of memory, a bucket may instead contain a file on disc, or even be fed a data stream from a dynamic source such as a separate program. Different bucket types exist to hold different kinds of data and the methods for handling it. In OOP terms, the apr_bucket is an abstract base class from which actual bucket types are derived.
There are several different types of data bucket, as well as metadata buckets. We will describe these at the end of this article.
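The "abstract base class" idea can be sketched in plain C with a table of function pointers. What follows is a toy model under stated assumptions, not APR's actual definitions: the names toy_bucket, toy_bucket_type, toy_bucket_read and the two example types are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: these names are hypothetical, not APR's definitions. */
typedef struct toy_bucket toy_bucket;

/* The "abstract base class": operations every bucket type must supply. */
typedef struct {
    const char *name;
    /* Expose the bucket's contents as a plain buffer and a length. */
    int (*read)(const toy_bucket *b, const char **buf, size_t *len);
} toy_bucket_type;

struct toy_bucket {
    const toy_bucket_type *type;  /* which "derived class" this bucket is */
    const void *data;             /* type-specific payload */
    size_t length;
};

/* A memory-backed type: the payload already is the buffer. */
static int mem_read(const toy_bucket *b, const char **buf, size_t *len) {
    *buf = (const char *)b->data;
    *len = b->length;
    return 0;
}

/* A metadata type (in the spirit of end-of-stream): it carries no data. */
static int eos_read(const toy_bucket *b, const char **buf, size_t *len) {
    (void)b;
    *buf = NULL;
    *len = 0;
    return 0;
}

static const toy_bucket_type toy_type_mem = { "MEM", mem_read };
static const toy_bucket_type toy_type_eos = { "EOS", eos_read };

/* Callers use one interface, whatever the concrete type. */
static int toy_bucket_read(const toy_bucket *b, const char **buf, size_t *len) {
    assert(b != NULL && b->type != NULL);
    return b->type->read(b, buf, len);
}
```

A caller that only ever goes through toy_bucket_read never needs to know whether the data lives in memory, in a file, or nowhere at all. That is the property the real bucket types provide.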
Brigades
In normal use, there is no such thing as a freestanding bucket: they are contained in bucket brigades. A brigade is a container that may hold any number of buckets in a ring structure. The brigade serves to enable flexible and efficient manipulation of data, and is the unit that gets passed to and from your filter.
Motivation
So, why do we need buckets and brigades? Can't we just keep it simple and pass simple blocks of data? Maybe a void* with a length, or a C++ string?
Well, the first part of the answer we've seen already: buckets are more than just data. They are an abstraction that unifies fundamentally different types of data. But even so, how do they justify the additional complexity over simple buffers and ad-hoc use of other data sources in exceptional cases?
The second motivation for buckets and brigades is that they enable efficient manipulation of blocks of memory, typical of many filtering applications. We will demonstrate a simple but typical example of this: a filter to display plain text documents prettified as an HTML page, with header and footer supplied by the webmaster in the manner of a fancy directory listing.
Now, HTML can of course include blocks of plain text, enclosing them in <pre> to preserve spacing and formatting. So the main task of a text->html filter is to pass the text straight through. But certain special characters need to be escaped. To be safe both with the HTML spec and browsers, we will escape the four characters <, >, &, and " as &lt;, &gt;, &amp;, and &quot; respectively.
Because the replacement &lt; is longer by three bytes than the original, we cannot just replace the character. If we are using a simple buffer, we either have to extend it with realloc() or equivalent, or copy the whole thing interpolating the replacement. Repeat this a few times and it rapidly gets very inefficient. A better solution is a two-pass scan of the buffer: the first pass simply computes the length of the new buffer, after which we allocate the memory and copy the data with the required replacements. But even that is by no means efficient.
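For comparison, the two-pass approach on a simple buffer might look like the sketch below. It is plain C with no APR involved, and the helper names entity_for and escape_two_pass are made up for this illustration. Note that every byte of the input is copied, however few characters actually need escaping.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the entity for c, or NULL if c needs no escaping. */
static const char *entity_for(char c) {
    switch (c) {
    case '<': return "&lt;";
    case '>': return "&gt;";
    case '&': return "&amp;";
    case '"': return "&quot;";
    default:  return NULL;
    }
}

/* Two-pass escape of src[0..len). Returns a malloc'd, NUL-terminated
   string that the caller must free, or NULL if allocation fails. */
static char *escape_two_pass(const char *src, size_t len) {
    size_t out_len = 0;
    size_t i;
    char *out, *q;

    assert(src != NULL);

    /* Pass 1: compute the length of the escaped output. */
    for (i = 0; i < len; i++) {
        const char *e = entity_for(src[i]);
        out_len += e ? strlen(e) : 1;
    }

    out = malloc(out_len + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Pass 2: copy everything, substituting entities as we go. */
    q = out;
    for (i = 0; i < len; i++) {
        const char *e = entity_for(src[i]);
        if (e) {
            size_t n = strlen(e);
            memcpy(q, e, n);
            q += n;
        } else {
            *q++ = src[i];
        }
    }
    *q = '\0';
    return out;
}
```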
By using buckets and brigades in place of a simple buffer, we can simply replace the characters in situ, without allocating or copying any big blocks of memory. Provided the number of characters replaced is small in comparison to the total document size, this is much more efficient. In outline:
- We encounter a character that needs replacing in the bucket
- We split the bucket before and after the character. Now we have three buckets: the character itself, and all data before and after it.
- We drop the character, leaving the before and after buckets.
- We create a new bucket containing the replacement, and insert it where the character was.
Now instead of moving/copying big blocks of data, we are just manipulating pointers into an existing block. The only actual data to change are the single character removed and the few bytes that replace it.
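The steps above can be modelled without APR at all. The sketch below is a toy, not the bucket API: a "segment" is just a pointer and a length into memory it does not own, and the names seg, seg_split, seg_replace_char and seg_flatten are hypothetical. It is meant only to show that splitting and inserting touch list links, never the bulk data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: a segment refers to bytes it does not own. */
typedef struct seg {
    const char *data;
    size_t len;
    struct seg *next;
} seg;

static seg *seg_new(const char *data, size_t len, seg *next) {
    seg *s = malloc(sizeof *s);
    if (s == NULL) {
        abort();
    }
    s->data = data;
    s->len = len;
    s->next = next;
    return s;
}

/* Split s at offset: s keeps [0, offset) and a new segment after it
   gets the rest. No data bytes are copied. */
static void seg_split(seg *s, size_t offset) {
    assert(offset <= s->len);
    s->next = seg_new(s->data + offset, s->len - offset, s->next);
    s->len = offset;
}

/* Replace the single byte at offset within s by the string repl.
   Returns the segment holding whatever followed that byte. */
static seg *seg_replace_char(seg *s, size_t offset, const char *repl) {
    seg *victim, *after;
    assert(offset < s->len);
    seg_split(s, offset);      /* s = before; s->next = char + after */
    victim = s->next;
    seg_split(victim, 1);      /* victim = the char; its next = after */
    after = victim->next;
    s->next = seg_new(repl, strlen(repl), after);  /* insert replacement */
    free(victim);              /* drop the one-character segment */
    return after;
}

/* Flatten the list into out, for inspection only. */
static size_t seg_flatten(const seg *s, char *out, size_t cap) {
    size_t n = 0;
    for (; s != NULL; s = s->next) {
        assert(n + s->len < cap);
        memcpy(out + n, s->data, s->len);
        n += s->len;
    }
    out[n] = '\0';
    return n;
}

static void seg_free_all(seg *s) {
    while (s != NULL) {
        seg *next = s->next;
        free(s);
        s = next;
    }
}
```

Replacing the `<` in "a < b" this way leaves three segments: "a " and " b" still point into the original string, and only the four bytes of the replacement are new.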
A Real example: mod_txt
mod_txt is a simple output filter module to display plain text files as HTML (or XHTML) with a header and footer. When a text file is requested, it escapes the text as required for HTML, and displays it between the header and the footer.
It works by direct manipulation of buckets (the lowest-level API), and demonstrates both insertion of file data and substitution of characters, without any allocation or moving of big blocks.
Bucket functions
Firstly we introduce two functions to deal with the data insertions: one for the files, one for the simple entity replacements:
Creating a File bucket requires an open filehandle and a byte range within the file. Since we're transmitting the entire file, we just stat its size to set the byte range. We open it with a shared lock and with sendfile enabled for maximum performance.
static apr_bucket* txt_file_bucket(request_rec* r, const char* fname) {
    apr_file_t* file = NULL ;
    apr_finfo_t finfo ;
    if ( apr_stat(&finfo, fname, APR_FINFO_SIZE, r->pool) != APR_SUCCESS ) {
        return NULL ;
    }
    if ( apr_file_open(&file, fname, APR_READ|APR_SHARELOCK|APR_SENDFILE_ENABLED,
                       APR_OS_DEFAULT, r->pool ) != APR_SUCCESS ) {
        return NULL ;
    }
    if ( ! file ) {
        return NULL ;
    }
    return apr_bucket_file_create(file, 0, finfo.size, r->pool,
                                  r->connection->bucket_alloc) ;
}
To create the simple text replacements, we can just make a bucket from an inline string. The appropriate bucket type for such data is transient:
static apr_bucket* txt_esc(char c, apr_bucket_alloc_t* alloc ) {
    switch (c) {
        case '<': return apr_bucket_transient_create("&lt;", 4, alloc) ;
        case '>': return apr_bucket_transient_create("&gt;", 4, alloc) ;
        case '&': return apr_bucket_transient_create("&amp;", 5, alloc) ;
        case '"': return apr_bucket_transient_create("&quot;", 6, alloc) ;
        default: return NULL ; /* shut compilers up */
    }
}
Actually this is not the most efficient way to do this. We discuss an alternative formulation below, under Simple in-memory buckets.
The Filter
Now the main filter itself is broadly straightforward, but there are a number of interesting and unexpected points to consider. Since this is a little longer than the above utility functions, we'll comment it inline instead. Note that the Header and Footer file buckets are set in a filter_init function (omitted for brevity).
static int txt_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
    apr_bucket* b ;
    txt_ctxt* ctxt = (txt_ctxt*)f->ctx ;
    if ( ctxt == NULL ) {
        txt_filter_init(f) ;
        ctxt = f->ctx ;
    }
Main Loop: This construct is typical for iterating over the incoming data
    for ( b = APR_BRIGADE_FIRST(bb);
          b != APR_BRIGADE_SENTINEL(bb);
          b = APR_BUCKET_NEXT(b) ) {
        const char* buf ;
        size_t bytes ;
As in any filter, we need to check for EOS. When we encounter it,
we insert the footer in front of it. We shouldn't get more than
one EOS, but just in case we do we'll note having inserted the
footer. That means we're being error-tolerant.
        if ( APR_BUCKET_IS_EOS(b) ) {
            /* end of input file - insert footer if any */
            if ( ctxt->foot && ! (ctxt->state & TXT_FOOT ) ) {
                ctxt->state |= TXT_FOOT ;
                APR_BUCKET_INSERT_BEFORE(b, ctxt->foot);
            }
The main case is a bucket containing data. We can get it as a simple
buffer with its size in bytes:
        } else if ( apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ)
                    == APR_SUCCESS ) {
            /* We have a bucket full of text. Just escape it where necessary */
            size_t count = 0 ;
            const char* p = buf ;
Now we can search for characters that need replacing, and replace them
            while ( count < bytes ) {
                /* NB: strcspn() expects NUL-terminated data, which a bucket
                   buffer does not guarantee */
                size_t sz = strcspn(p, "<>&\"") ;
                count += sz ;
Here comes the tricky bit: replacing a single character inline.
                if ( count < bytes ) {
                    apr_bucket_split(b, sz) ;   /* Split off before buffer */
                    b = APR_BUCKET_NEXT(b) ;    /* Skip over before buffer */
                    /* Insert the replacement */
                    APR_BUCKET_INSERT_BEFORE(b, txt_esc(p[sz],
                            f->r->connection->bucket_alloc)) ;
                    apr_bucket_split(b, 1) ;    /* Split off the char to remove */
                    APR_BUCKET_REMOVE(b) ;      /* ... and remove it */
                    b = APR_BUCKET_NEXT(b) ;    /* Move cursor on to what-remains
                                                   so that it stays in sequence
                                                   with our main loop */
                    count += 1 ;
                    p += sz + 1 ;
                }
            }
        }
    }
Now we insert the Header if it hasn't already been inserted.
Note:
(a) This has to come after the main loop, to avoid the header itself
getting into the parse.
(b) It works because we can insert a bucket anywhere in the brigade,
and in this case put it at the head.
(c) As with the footer, we save state to avoid inserting it more than once.
    if ( ctxt->head && ! (ctxt->state & TXT_HEAD ) ) {
        ctxt->state |= TXT_HEAD ;
        APR_BRIGADE_INSERT_HEAD(bb, ctxt->head);
    }
Now that we've finished manipulating data, we just pass it down the filter chain.
    return ap_pass_brigade(f->next, bb) ;
}
Note that we created a new bucket every time we replaced a character. Couldn't we have prepared four buckets in advance - one for each of the characters to be replaced - and then re-used them whenever the character occurred?
The problem here is that each bucket is linked to its neighbours. So if we re-use the same bucket, we lose the links, so that the brigade now jumps over any data between the two instances of it. Hence we do need a new bucket every time. That means this technique becomes inefficient when a high proportion of input data has to be changed. We will show alternative techniques for such cases in other articles.
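The link problem is easy to reproduce in a toy ring. The sketch below does not use APR's ring macros. The names node, ring_init, ring_insert_before and ring_count are hypothetical, and the ring is a minimal doubly-linked list with a sentinel, which is assumed here to be close enough to a brigade to show the effect.

```c
#include <assert.h>
#include <stddef.h>

/* Toy ring element: like a bucket, it carries its own neighbour links. */
typedef struct node {
    const char *label;
    struct node *prev, *next;
} node;

/* An empty ring: the sentinel points at itself. */
static void ring_init(node *sentinel) {
    sentinel->label = NULL;
    sentinel->prev = sentinel->next = sentinel;
}

/* Insert n before pos. This overwrites n's own links, which is exactly
   what goes wrong if n is already somewhere in the ring. */
static void ring_insert_before(node *pos, node *n) {
    assert(n != pos);
    n->prev = pos->prev;
    n->next = pos;
    pos->prev->next = n;
    pos->prev = n;
}

/* Count the nodes reached by walking forward from the sentinel. */
static int ring_count(const node *sentinel) {
    int count = 0;
    const node *p;
    for (p = sentinel->next; p != sentinel; p = p->next) {
        count++;
    }
    return count;
}
```

Build the ring A, X, B and then insert the same X again at the end. A forward walk now visits only A and X: the second insertion rewrote X's next pointer to the sentinel, so B is skipped.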
Bucket Types
In the above, we used two data bucket types: file and transient, and the eos metadata bucket type. There are several other bucket types suitable for different kinds of data and metadata.
Simple in-memory buckets
When we created transient buckets above, we were inserting a chunk of memory in the output stream. But we noted that this bucket was not the most efficient way to escape a character. The reason for this is that the transient memory has to be copied internally to prevent it going out of scope. We could instead have used memory that's guaranteed never to go out of scope, by replacing
    case '<': return apr_bucket_transient_create("&lt;", 4, alloc) ;
with
    static const char* lt = "&lt;" ;
    ...
    case '<': return apr_bucket_immortal_create(lt, 4, alloc) ;
When we create an immortal bucket, we guarantee that the memory won't go out of scope during the lifetime of the bucket, so the APR never needs to copy it internally.
A third variant on the same principle is the pool bucket. This refers to memory allocated on a pool, and will have to be copied internally if and only if the pool is destroyed within the lifetime of the bucket.
The Heap bucket
The heap bucket is another form of in-memory bucket. But its usage is rather different from any of the above. We rarely if ever need to create a heap bucket explicitly: rather they are managed internally when we use the stdio-like API to write data to the next filter. This API is discussed in other articles.
External data buckets
The File bucket, as we have already seen, enables us to insert a file (or part of a file) into the data stream. Although we had to stat it to find its length, we didn't have to read it. If sendfile is enabled, the operating system (through the APR) can optimise sending the file.
The mmap bucket type is similar, and is appropriate to mmapped files. APR may convert file buckets to mmap internally if we (or a later filter) read the data.
Two other more unusual bucket types are the pipe and the socket, which enable us to insert data from an external source via IPC.
Metadata buckets
In addition to data buckets, there are two metadata types. The EOS bucket is crucial: it must be sent at the end of a data stream, and it serves to signal the end of incoming data. The other metadata type is the rarely-used FLUSH bucket, which may occasionally be required, but is not guaranteed to be propagated by every filter.
Further reading
Cliff Woolley gave a talk on buckets and brigades at ApacheCon 2002. His notes are very readable, and go into more depth than this article.
http://www.cs.virginia.edu/~jcw5q/talks/