PostgreSQL源码解读(232)-查询#125(NOTIN实现#3)
本节介绍了PostgreSQL含有NOT IN查询语句实现ExecMaterial函数中部分依赖的函数。
我们提供的服务有:做网站、成都网站设计、微信公众号开发、网站优化、网站认证、万柏林ssl等。为超过千家企事业单位解决了网站和推广的问题。提供周到的售前咨询和贴心的售后服务,是有科学管理、有技术的万柏林网站制作公司
一、数据结构
SubPlanState
子计划运行期状态
/* ----------------
* SubPlanState node
* ----------------
*/
typedef struct SubPlanState
{
NodeTag type;
SubPlan *subplan; /* expression plan node */
struct PlanState *planstate; /* subselect plan's state tree */
struct PlanState *parent; /* parent plan node's state tree */
ExprState *testexpr; /* 组合表达式状态;state of combining expression */
List *args; /* 参数表达式状态;states of argument expression(s) */
HeapTuple curTuple; /* subplan最近的元组;copy of most recent tuple from subplan */
Datum curArray; /* most recent array from ARRAY() subplan */
/* these are used when hashing the subselect's output: */
TupleDesc descRight; /* 投影后的子查询描述符;subselect desc after projection */
ProjectionInfo *projLeft; /* for projecting lefthand exprs */
ProjectionInfo *projRight; /* for projecting subselect output */
TupleHashTable hashtable; /* hash table for no-nulls subselect rows */
TupleHashTable hashnulls; /* hash table for rows with null(s) */
bool havehashrows; /* true if hashtable is not empty */
bool havenullrows; /* true if hashnulls is not empty */
MemoryContext hashtablecxt; /* memory context containing hash tables */
MemoryContext hashtempcxt; /* temp memory context for hash tables */
ExprContext *innerecontext; /* econtext for computing inner tuples */
AttrNumber *keyColIdx; /* control data for hash tables */
Oid *tab_eq_funcoids; /* equality func oids for table
* datatype(s) */
Oid *tab_collations; /* collations for hash and comparison */
FmgrInfo *tab_hash_funcs; /* hash functions for table datatype(s) */
FmgrInfo *tab_eq_funcs; /* equality functions for table datatype(s) */
FmgrInfo *lhs_hash_funcs; /* hash functions for lefthand datatype(s) */
FmgrInfo *cur_eq_funcs; /* equality functions for LHS vs. table */
ExprState *cur_eq_comp; /* equality comparator for LHS vs. table */
} SubPlanState;
SubPlan
子查询计划
/*
* SubPlan - executable expression node for a subplan (sub-SELECT)
*
* The planner replaces SubLink nodes in expression trees with SubPlan
* nodes after it has finished planning the subquery. SubPlan references
* a sub-plantree stored in the subplans list of the toplevel PlannedStmt.
* (We avoid a direct link to make it easier to copy expression trees
* without causing multiple processing of the subplan.)
* 查询规划器在完成子查询的规划后使用SubPlan节点替换表达式树中的SubLink节点。
* SubPlan引用了存储在高层PlannedStmt中的subplans链表中的sub-plantree。
* (避免使用直接链接,从而使得拷贝表达式树相对比较简单)
*
* In an ordinary subplan, testexpr points to an executable expression
* (OpExpr, an AND/OR tree of OpExprs, or RowCompareExpr) for the combining
* operator(s); the left-hand arguments are the original lefthand expressions,
* and the right-hand arguments are PARAM_EXEC Param nodes representing the
* outputs of the sub-select. (NOTE: runtime coercion functions may be
* inserted as well.) This is just the same expression tree as testexpr in
* the original SubLink node, but the PARAM_SUBLINK nodes are replaced by
* suitably numbered PARAM_EXEC nodes.
* 常规情况下,testexpr指向用于组合操作的可执行表达式(OpExpr、OpExprs的AND/OR树或者RowCompareExpr);
* 左参数是原始的左表达式,右参数是PARAM_EXEC参数节点用以表示子查询的输出。
* 与原始SubLink节点的testexpr具有相同的表达式树,但PARAM_SUBLINK节点则使用合适的已编号PARAM_EXEC节点替代。
*
* If the sub-select becomes an initplan rather than a subplan, the executable
* expression is part of the outer plan's expression tree (and the SubPlan
* node itself is not, but rather is found in the outer plan's initPlan
* list). In this case testexpr is NULL to avoid duplication.
* 如果子查询成了initplan而不是subplan,可执行的表达式是外层plan表达式树的一部分。
* 这种情况下,testexpr为NULL以避免重复。
*
* The planner also derives lists of the values that need to be passed into
* and out of the subplan. Input values are represented as a list "args" of
* expressions to be evaluated in the outer-query context (currently these
* args are always just Vars, but in principle they could be any expression).
* The values are assigned to the global PARAM_EXEC params indexed by parParam
* (the parParam and args lists must have the same ordering). setParam is a
* list of the PARAM_EXEC params that are computed by the sub-select, if it
* is an initplan; they are listed in order by sub-select output column
* position. (parParam and setParam are integer Lists, not Bitmapsets,
* because their ordering is significant.)
* 规划器还派生了需要传入和传出子计划的值的链表。
* 输入值标识位表达式的“args”链表,在外层查询上下文中进行解析。
* (这些args通常是Vars,但原则上它们可以是任意表达式)
* 这些值以parParam为索引给全局PARAM_EXEC参数赋值。
* setParam是PARAM_EXEC参数链表,通过子查询(如为initplan)计算所得。
* 它们按子查询输出列的位置进行排序组织为链表形式。
* (parParam和setParam是整型链表,而不是Bitmapsets链表)
*
* Also, the planner computes startup and per-call costs for use of the
* SubPlan. Note that these include the cost of the subquery proper,
* evaluation of the testexpr if any, and any hashtable management overhead.
* 同时,规划器计算SubPlan启动和每次调用的成本。注意:包括子查询正常解析testexpr的成本以及哈希表管理成本。
*/
typedef struct SubPlan
{
Expr xpr;//表达式
/* Fields copied from original SubLink: */
//从SubLink中拷贝而来
SubLinkType subLinkType; /* see above */
/* The combining operators, transformed to an executable expression: */
//组合操作符,转换为可执行的表达式
Node *testexpr; /* OpExpr or RowCompareExpr expression tree */
List *paramIds; /* 参数IDs;IDs of Params embedded in the above */
/* Identification of the Plan tree to use: */
//Plan tree标识
int plan_id; /* Index (from 1) in PlannedStmt.subplans */
/* Identification of the SubPlan for EXPLAIN and debugging purposes: */
//EXPLAIN和debug目的的SubPlan标识
char *plan_name; /* A name assigned during planning */
/* Extra data useful for determining subplan's output type: */
//用于确定subplan输出类型的额外信息
Oid firstColType; /* subplan结果的第一个列类型;Type of first column of subplan result */
int32 firstColTypmod; /* 第一列的Typmod;Typmod of first column of subplan result */
Oid firstColCollation; /* 第一列的Collation;Collation of first column of subplan
* result */
/* Information about execution strategy: */
//执行阶段的相关信息
bool useHashTable; /* 是否使用哈希表存储子查询输出;true to store subselect output in a hash
* table (implies we are doing "IN") */
bool unknownEqFalse; /* 如OK为T,如为未知则为F;快速处理null值;true if it's okay to return FALSE when the
* spec result is UNKNOWN; this allows much
* simpler handling of null values */
bool parallel_safe; /* 是否并行安全?is the subplan parallel-safe? */
/* Note: parallel_safe does not consider contents of testexpr or args */
/* Information for passing params into and out of the subselect: */
//用于给子查询传入和传出参数的信息
/* setParam and parParam are lists of integers (param IDs) */
//setParam和parParam是整型链表(param IDs)
List *setParam; /* initplan subqueries have to set these
* Params for parent plan */
List *parParam; /* indices of input Params from parent plan */
List *args; /* 以parParam值进行传递的表达式;exprs to pass as parParam values */
/* Estimated execution costs: */
//估算执行成本
Cost startup_cost; /* one-time setup cost */
Cost per_call_cost; /* cost for each subplan evaluation */
} SubPlan;
SubLinkType
SubLink类型
/*
* SubLink
*
* A SubLink represents a subselect appearing in an expression, and in some
* cases also the combining operator(s) just above it. The subLinkType
* indicates the form of the expression represented:
* EXISTS_SUBLINK EXISTS(SELECT ...)
* ALL_SUBLINK (lefthand) op ALL (SELECT ...)
* ANY_SUBLINK (lefthand) op ANY (SELECT ...)
* ROWCOMPARE_SUBLINK (lefthand) op (SELECT ...)
* EXPR_SUBLINK (SELECT with single targetlist item ...)
* MULTIEXPR_SUBLINK (SELECT with multiple targetlist items ...)
* ARRAY_SUBLINK ARRAY(SELECT with single targetlist item ...)
* CTE_SUBLINK WITH query (never actually part of an expression)
* 我们使用SubLink表示在表达式中出现的子查询,在某些情况下组合操作符会出现在SubLink之上。
* subLinkType表示表达式的形式:
* EXISTS_SUBLINK EXISTS(SELECT ...)
* ALL_SUBLINK (lefthand) op ALL (SELECT ...)
* ANY_SUBLINK (lefthand) op ANY (SELECT ...)
* ROWCOMPARE_SUBLINK (lefthand) op (SELECT ...)
* EXPR_SUBLINK (SELECT with single targetlist item ...)
* MULTIEXPR_SUBLINK (SELECT with multiple targetlist items ...)
* ARRAY_SUBLINK ARRAY(SELECT with single targetlist item ...)
* CTE_SUBLINK WITH query (never actually part of an expression)
*
* For ALL, ANY, and ROWCOMPARE, the lefthand is a list of expressions of the
* same length as the subselect's targetlist. ROWCOMPARE will *always* have
* a list with more than one entry; if the subselect has just one target
* then the parser will create an EXPR_SUBLINK instead (and any operator
* above the subselect will be represented separately).
* ROWCOMPARE, EXPR, and MULTIEXPR require the subselect to deliver at most
* one row (if it returns no rows, the result is NULL).
* ALL, ANY, and ROWCOMPARE require the combining operators to deliver boolean
* results. ALL and ANY combine the per-row results using AND and OR
* semantics respectively.
* ARRAY requires just one target column, and creates an array of the target
* column's type using any number of rows resulting from the subselect.
* 对于ALL,ANY和ROWCOMPARE,左操作符是与子查询目标链表长度一致的表达式链表。
* ROWCOMPARE通常有超过一个条目的链表;如果子查询刚好只有一个目标列,那么解析器会创建EXPR_SUBLINK
* (同时所有在子查询之上的操作符会单独表示)
* ROWCOMPARE, EXPR, 和MULTIEXPR要求子查询至少输出一行(如返回0行,则结果为NULL)。
* ALL,ANY和ROWCOMPARE要求组合操作符输出布尔型结果。
* ALL/ANY使用AND/OR语义来组合每一行的结果。
*
* SubLink is classed as an Expr node, but it is not actually executable;
* it must be replaced in the expression tree by a SubPlan node during
* planning.
* SubLink归类为Expr节点,但实际上并不是可执行的,必须在计划阶段通过SubPlan替代。
*
* NOTE: in the raw output of gram.y, testexpr contains just the raw form
* of the lefthand expression (if any), and operName is the String name of
* the combining operator. Also, subselect is a raw parsetree. During parse
* analysis, the parser transforms testexpr into a complete boolean expression
* that compares the lefthand value(s) to PARAM_SUBLINK nodes representing the
* output columns of the subselect. And subselect is transformed to a Query.
* This is the representation seen in saved rules and in the rewriter.
* 注意:在gram.y的裸输出中,testexpr只包含左表达式的裸形式,operName是组合操作符的字符串名称。
* 同时,子查询是裸parsetree。在解析分析期间,
* 解析器转换testexpr为完整的布尔表达式用于比较左操作符值与PARAM_SUBLINK节点所代表的子查询输出列值。
* 子查询会转换为Query结构体。
* 在已存储的规则和重写时可见的表示形式。
*
* In EXISTS, EXPR, MULTIEXPR, and ARRAY SubLinks, testexpr and operName
* are unused and are always null.
* 在EXISTS/EXPR/MULTEXPR/ARRAY SubLinks中,testexpr和operName不再使用通常是NULL值。
*
* subLinkId is currently used only for MULTIEXPR SubLinks, and is zero in
* other SubLinks. This number identifies different multiple-assignment
* subqueries within an UPDATE statement's SET list. It is unique only
* within a particular targetlist. The output column(s) of the MULTIEXPR
* are referenced by PARAM_MULTIEXPR Params appearing elsewhere in the tlist.
* subLinkId当前只用于MULTIEXPR,在其他SubLinks中取值为0.
* 该数字标识了在UPDATE语句SET链表中不同的多个赋值子查询。
* 只有在特定的targetlist内是唯一的。
* 出现在tlist其他地方的PARAM_MULTIEXPR参数依赖于MULTIEXPR的输出列。
*
* The CTE_SUBLINK case never occurs in actual SubLink nodes, but it is used
* in SubPlans generated for WITH subqueries.
* CTE_SUBLINK不会出现在实际的SubLink节点中,但用于WITH子查询所产生的SubPlans中。
*/
typedef enum SubLinkType
{
EXISTS_SUBLINK,
ALL_SUBLINK,
ANY_SUBLINK,
ROWCOMPARE_SUBLINK,
EXPR_SUBLINK,
MULTIEXPR_SUBLINK,
ARRAY_SUBLINK,
CTE_SUBLINK /* 仅用于SubPlans中;for SubPlans only */
} SubLinkType;
SubLink
SubLink结构体
typedef struct SubLink
{
Expr xpr;
SubLinkType subLinkType; /* see above */
int subLinkId; /* ID (1..n); 0 if not MULTIEXPR */
Node *testexpr; /* outer-query test for ALL/ANY/ROWCOMPARE */
List *operName; /* originally specified operator name */
Node *subselect; /* subselect as Query* or raw parsetree */
int location; /* token location, or -1 if unknown */
} SubLink;
MaterialState
Material状态
/* ----------------
* MaterialState information
*
* materialize nodes are used to materialize the results
* of a subplan into a temporary file.
* materialize节点用于物化subplan的结果为临时文件。
*
* ss.ss_ScanTupleSlot refers to output of underlying plan.
* ss.ss_ScanTupleSlot指向underlyling plan的输出(subplan)
* ----------------
*/
typedef struct MaterialState
{
ScanState ss; /* its first field is NodeTag */
int eflags; /* 传递给tuplestore的capability标记;capability flags to pass to tuplestore */
bool eof_underlying; /* 已经到达underlying plan的末尾?reached end of underlying plan? */
Tuplestorestate *tuplestorestate;
} MaterialState;
二、源码解读
ExecMaterial
执行物化操作。
/* ----------------------------------------------------------------
* ExecMaterial
*
* As long as we are at the end of the data collected in the tuplestore,
* we collect one new row from the subplan on each call, and stash it
* aside in the tuplestore before returning it. The tuplestore is
* only read if we are asked to scan backwards, rescan, or mark/restore.
* 只要在tuplestore中数据收集结束时,就会在每次调用时从subplan中收集一条新行,
* 并在返回之前将其保存在tuplestore中。
* 只要在往后扫描、重新扫描或标记/恢复时tuplestore才会读取。
*
* ----------------------------------------------------------------
*/
static TupleTableSlot * /* 从subplan中返回的结果;result tuple from subplan */
ExecMaterial(PlanState *pstate)
{
MaterialState *node = castNode(MaterialState, pstate);//物化节点
EState *estate;//运行期状态
ScanDirection dir;//扫描方向
bool forward;//是否往前扫描
Tuplestorestate *tuplestorestate;//Tuplestorestate结构体指针
bool eof_tuplestore;//是否完成?
TupleTableSlot *slot;//存储元组的slot
CHECK_FOR_INTERRUPTS();
/*
* get state info from node
* 从物化节点中获取相关信息
*/
estate = node->ss.ps.state;
dir = estate->es_direction;//方向
forward = ScanDirectionIsForward(dir);//是否往前扫描
tuplestorestate = node->tuplestorestate;
/*
* If first time through, and we need a tuplestore, initialize it.
* 第一次,需要tuplestore并初始化
*/
if (tuplestorestate == NULL && node->eflags != 0)
{
tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
tuplestore_set_eflags(tuplestorestate, node->eflags);
if (node->eflags & EXEC_FLAG_MARK)
{
/*
* Allocate a second read pointer to serve as the mark. We know it
* must have index 1, so needn't store that.
* 分配用于mark的读指针
*/
int ptrno PG_USED_FOR_ASSERTS_ONLY;
ptrno = tuplestore_alloc_read_pointer(tuplestorestate,
node->eflags);
Assert(ptrno == 1);
}
node->tuplestorestate = tuplestorestate;
}
/*
* If we are not at the end of the tuplestore, or are going backwards, try
* to fetch a tuple from tuplestore.
* 如果不在tuplestore的末尾或者正在往后扫描,尝试从tuplestore中提取一个元组
*/
eof_tuplestore = (tuplestorestate == NULL) ||
tuplestore_ateof(tuplestorestate);
if (!forward && eof_tuplestore)
{
if (!node->eof_underlying)
{
/*
* When reversing direction at tuplestore EOF, the first
* gettupleslot call will fetch the last-added tuple; but we want
* to return the one before that, if possible. So do an extra
* fetch.
* 在EOF处反转方向,第一次的gettupleslot调用会提取最后添加的元组;
* 但如可能,希望返回在此之前的元组,执行额外的提取操作。
*/
if (!tuplestore_advance(tuplestorestate, forward))
return NULL; /* the tuplestore must be empty */
}
eof_tuplestore = false;
}
/*
* If we can fetch another tuple from the tuplestore, return it.
* 如能从tuplestore中提取另外一个tuple,返回
*/
slot = node->ss.ps.ps_ResultTupleSlot;
if (!eof_tuplestore)
{
if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
return slot;
if (forward)
eof_tuplestore = true;
}
/*
* If necessary, try to fetch another row from the subplan.
* 如需要(tuplestore末尾),尝试从subplan中提取另外一行
*
* Note: the eof_underlying state variable exists to short-circuit further
* subplan calls. It's not optional, unfortunately, because some plan
* node types are not robust about being called again when they've already
* returned NULL.
*/
if (eof_tuplestore && !node->eof_underlying)
{
PlanState *outerNode;
TupleTableSlot *outerslot;
/*
* We can only get here with forward==true, so no need to worry about
* which direction the subplan will go.
*/
outerNode = outerPlanState(node);
outerslot = ExecProcNode(outerNode);
if (TupIsNull(outerslot))
{
node->eof_underlying = true;
return NULL;
}
/*
* Append a copy of the returned tuple to tuplestore. NOTE: because
* the tuplestore is certainly in EOF state, its read position will
* move forward over the added tuple. This is what we want.
* 追加返回的元组到tuplestore中。
* 注意:因为tuplestore当前处于EOF状态,读取的位置会前移至已添加的tuple前面,这是我们希望看到的。
*/
if (tuplestorestate)
tuplestore_puttupleslot(tuplestorestate, outerslot);
ExecCopySlot(slot, outerslot);
return slot;
}
/*
* Nothing left ...
*/
return ExecClearTuple(slot);
}
tuplestore_begin_heap
初始化tuplestore
/*
* tuplestore_begin_heap
*
* Create a new tuplestore; other types of tuple stores (other than
* "heap" tuple stores, for heap tuples) are possible, but not presently
* implemented.
* 创建新的tuplestore:目前仅实现了heap tuples。
*
* randomAccess: if true, both forward and backward accesses to the
* tuple store are allowed.
* randomAccess : 如为T,支持往前和往后访问。
*
* interXact: if true, the files used for on-disk storage persist beyond the
* end of the current transaction. NOTE: It's the caller's responsibility to
* create such a tuplestore in a memory context and resource owner that will
* also survive transaction boundaries, and to ensure the tuplestore is closed
* when it's no longer wanted.
* interXact : 如为T,磁盘上的存储文件在当前事务结束后也会一直保持。
* 注意:调用者有责任在事务边界内存活的内存上下文和资源拥有者中创建tuplestore并确保不再使用时销毁tuplestore。
*
* maxKBytes: how much data to store in memory (any data beyond this
* amount is paged to disk). When in doubt, use work_mem.
* maxKBytes:有多少数据需要存储到内存中(超长此大小的会分页到磁盘上)。
* 如存在问题,则使用work_mem。
*/
Tuplestorestate *
tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
{
Tuplestorestate *state;
int eflags;
/*
* This interpretation of the meaning of randomAccess is compatible with
* the pre-8.3 behavior of tuplestores.
*/
eflags = randomAccess ?
(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
(EXEC_FLAG_REWIND);
state = tuplestore_begin_common(eflags, interXact, maxKBytes);
state->copytup = copytup_heap;
state->writetup = writetup_heap;
state->readtup = readtup_heap;
return state;
}
/*
* tuplestore_begin_xxx
*
* Initialize for a tuple store operation.
* 初始化tuplestore
*/
static Tuplestorestate *
tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
{
Tuplestorestate *state;
state = (Tuplestorestate *) palloc0(sizeof(Tuplestorestate));
state->status = TSS_INMEM;
state->eflags = eflags;
state->interXact = interXact;
state->truncated = false;
state->allowedMem = maxKBytes * 1024L;
state->availMem = state->allowedMem;
state->myfile = NULL;
state->context = CurrentMemoryContext;
state->resowner = CurrentResourceOwner;
state->memtupdeleted = 0;
state->memtupcount = 0;
state->tuples = 0;
/*
* Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
* see comments in grow_memtuples().
*/
state->memtupsize = Max(16384 / sizeof(void *),
ALLOCSET_SEPARATE_THRESHOLD / sizeof(void *) + 1);
state->growmemtuples = true;
state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
state->activeptr = 0;
state->readptrcount = 1;
state->readptrsize = 8; /* arbitrary */
state->readptrs = (TSReadPointer *)
palloc(state->readptrsize * sizeof(TSReadPointer));
state->readptrs[0].eflags = eflags;
state->readptrs[0].eof_reached = false;
state->readptrs[0].current = 0;
return state;
}
tuplestore_advance
从tuplestore前进一行
/*
* tuplestore_advance - exported function to adjust position without fetching
*
* We could optimize this case to avoid palloc/pfree overhead, but for the
* moment it doesn't seem worthwhile.
*/
bool
tuplestore_advance(Tuplestorestate *state, bool forward)
{
void *tuple;
bool should_free;
tuple = tuplestore_gettuple(state, forward, &should_free);
if (tuple)
{
if (should_free)
pfree(tuple);
return true;
}
else
{
return false;
}
}
tuplestore_gettupleslot
获取slot
/*
* tuplestore_gettupleslot - exported function to fetch a MinimalTuple
* 提取MinimalTuple
*
* If successful, put tuple in slot and return true; else, clear the slot
* and return false.
* 如成功,则把元组塞进slot中并返回T,否则清空slot返回F
*
* If copy is true, the slot receives a copied tuple (allocated in current
* memory context) that will stay valid regardless of future manipulations of
* the tuplestore's state. If copy is false, the slot may just receive a
* pointer to a tuple held within the tuplestore. The latter is more
* efficient but the slot contents may be corrupted if additional writes to
* the tuplestore occur. (If using tuplestore_trim, see comments therein.)
* 如copy为T,则slot会接收拷贝之后的元组,独立于tuplestore的状态.
* 如copy为F,则slot可能接收到tuplestore中的元组指针.
*/
bool
tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
bool copy, TupleTableSlot *slot)
{
MinimalTuple tuple;
bool should_free;
tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
if (tuple)
{
if (copy && !should_free)
{
tuple = heap_copy_minimal_tuple(tuple);
should_free = true;
}
ExecStoreMinimalTuple(tuple, slot, should_free);
return true;
}
else
{
ExecClearTuple(slot);
return false;
}
}
tuplestore_gettuple
返回下一个元组
/*
* Fetch the next tuple in either forward or back direction.
* Returns NULL if no more tuples. If should_free is set, the
* caller must pfree the returned tuple when done with it.
* 往前/后返回下一个元组。
* 如无更多元组,返回NULL。如should_free有值,调用者必须在处理完毕后释放返回的元组
*
* Backward scan is only allowed if randomAccess was set true or
* EXEC_FLAG_BACKWARD was specified to tuplestore_set_eflags().
* 在randomAccess设置为T或者指定EXEC_FLAG_BACKWARD时才允许。
*/
static void *
tuplestore_gettuple(Tuplestorestate *state, bool forward,
bool *should_free)
{
TSReadPointer *readptr = &state->readptrs[state->activeptr];//读取指针
unsigned int tuplen;
void *tup;
Assert(forward || (readptr->eflags & EXEC_FLAG_BACKWARD));
switch (state->status)
{
case TSS_INMEM://内存中
*should_free = false;
if (forward)
{
if (readptr->eof_reached)
return NULL;
if (readptr->current < state->memtupcount)
{
/* We have another tuple, so return it */
return state->memtuples[readptr->current++];
}
readptr->eof_reached = true;
return NULL;
}
else
{
/*
* if all tuples are fetched already then we return last
* tuple, else tuple before last returned.
*/
if (readptr->eof_reached)
{
readptr->current = state->memtupcount;
readptr->eof_reached = false;
}
else
{
if (readptr->current <= state->memtupdeleted)
{
Assert(!state->truncated);
return NULL;
}
readptr->current--; /* last returned tuple */
}
if (readptr->current <= state->memtupdeleted)
{
Assert(!state->truncated);
return NULL;
}
return state->memtuples[readptr->current - 1];
}
break;
case TSS_WRITEFILE://写文件
/* Skip state change if we'll just return NULL */
//如只需要返回NULL则跳过状态变换
if (readptr->eof_reached && forward)
return NULL;
/*
* Switch from writing to reading.
* 从写切换至读
*/
BufFileTell(state->myfile,
&state->writepos_file, &state->writepos_offset);
if (!readptr->eof_reached)
if (BufFileSeek(state->myfile,
readptr->file, readptr->offset,
SEEK_SET) != 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not seek in tuplestore temporary file: %m")));
state->status = TSS_READFILE;
/* FALLTHROUGH */
//进入读文件状态的处理逻辑
case TSS_READFILE:
*should_free = true;
if (forward)
{
//往前读
if ((tuplen = getlen(state, true)) != 0)
{
tup = READTUP(state, tuplen);
return tup;
}
else
{
readptr->eof_reached = true;
return NULL;
}
}
/*
* Backward.
* 往后读
*
* if all tuples are fetched already then we return last tuple,
* else tuple before last returned.
* 如果所有元组时已提取,则返回最后一个元组,否则返回先前最后返回的元组
*
* Back up to fetch previously-returned tuple's ending length
* word. If seek fails, assume we are at start of file.
* 往回向上提取先前已返回的元组结束长度字,如检索失败,假定处于文件的开始位置.
*/
if (BufFileSeek(state->myfile, 0, -(long) sizeof(unsigned int),
SEEK_CUR) != 0)
{
/* even a failed backwards fetch gets you out of eof state */
readptr->eof_reached = false;
Assert(!state->truncated);
return NULL;
}
tuplen = getlen(state, false);
if (readptr->eof_reached)
{
readptr->eof_reached = false;
/* We will return the tuple returned before returning NULL */
//在返回NULL前返回先前已返回的元组
}
else
{
/*
* Back up to get ending length word of tuple before it.
* 获取结束长度字
*/
if (BufFileSeek(state->myfile, 0,
-(long) (tuplen + 2 * sizeof(unsigned int)),
SEEK_CUR) != 0)
{
/*
* If that fails, presumably the prev tuple is the first
* in the file. Back up so that it becomes next to read
* in forward direction (not obviously right, but that is
* what in-memory case does).
*/
if (BufFileSeek(state->myfile, 0,
-(long) (tuplen + sizeof(unsigned int)),
SEEK_CUR) != 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not seek in tuplestore temporary file: %m")));
Assert(!state->truncated);
return NULL;
}
tuplen = getlen(state, false);
}
/*
* Now we have the length of the prior tuple, back up and read it.
* Note: READTUP expects we are positioned after the initial
* length word of the tuple, so back up to that point.
* 已获得优先元组的长度,读取之.
*/
if (BufFileSeek(state->myfile, 0,
-(long) tuplen,
SEEK_CUR) != 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not seek in tuplestore temporary file: %m")));
tup = READTUP(state, tuplen);
return tup;
default:
elog(ERROR, "invalid tuplestore state");
return NULL; /* keep compiler quiet */
}
}
三、跟踪分析
执行SQL:
[pg12@localhost ~]$ psql -d testdb
Timing is on.
Expanded display is used automatically.
psql (12.0)
Type "help" for help.
[local]:5432 pg12@testdb=#
[local]:5432 pg12@testdb=# select * from tbl;
id | value
----+-------
1 | 2
(1 row)
Time: 2.678 ms
[local]:5432 pg12@testdb=# select count(*) from t_big_null;
count
----------
10000001
(1 row)
Time: 679.972 ms
[local]:5432 pg12@testdb=# analyze tbl;
ANALYZE
Time: 64.442 ms
[local]:5432 pg12@testdb=# analyze t_big_null;
ANALYZE
Time: 434.702 ms
[local]:5432 pg12@testdb=#
[local]:5432 pg12@testdb=# select pg_backend_pid();
pg_backend_pid
----------------
18758
(1 row)
Time: 1.990 ms
[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b);
启动gdb跟踪
(gdb) b ExecMaterial
Breakpoint 1 at 0x720edb: file nodeMaterial.c, line 41.
(gdb) c
Continuing.
Breakpoint 1, ExecMaterial (pstate=0x1230128) at nodeMaterial.c:41
41 MaterialState *node = castNode(MaterialState, pstate);
(gdb)
单步调试
(gdb) n
49 CHECK_FOR_INTERRUPTS();
(gdb)
54 estate = node->ss.ps.state;
(gdb)
55 dir = estate->es_direction;
(gdb)
56 forward = ScanDirectionIsForward(dir);
(gdb)
57 tuplestorestate = node->tuplestorestate;
(gdb)
62 if (tuplestorestate == NULL && node->eflags != 0)
(gdb)
64 tuplestorestate = tuplestore_begin_heap(true, false, work_mem);
(gdb)
65 tuplestore_set_eflags(tuplestorestate, node->eflags);
(gdb)
66 if (node->eflags & EXEC_FLAG_MARK)
(gdb)
78 node->tuplestorestate = tuplestorestate;
(gdb)
85 eof_tuplestore = (tuplestorestate == NULL) ||
(gdb)
86 tuplestore_ateof(tuplestorestate);
(gdb)
85 eof_tuplestore = (tuplestorestate == NULL) ||
(gdb)
88 if (!forward && eof_tuplestore)
(gdb) p eof_tuplestore
$1 = false
(gdb)
进入tuplestore_gettupleslot
(gdb) n
107 slot = node->ss.ps.ps_ResultTupleSlot;
(gdb)
108 if (!eof_tuplestore)
(gdb)
110 if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
(gdb) step
tuplestore_gettupleslot (state=0x3069c18, forward=true, copy=false, slot=0x30687a8)
at tuplestore.c:1084
1084 tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
(gdb)
进入tuplestore_gettuple
(gdb) step
tuplestore_gettuple (state=0x3069c18, forward=true, should_free=0x7ffd18474ff7)
at tuplestore.c:906
906 TSReadPointer *readptr = &state->readptrs[state->activeptr];
(gdb)
tuplestore_gettuple->文件读写指针信息
(gdb) n
910 Assert(forward || (readptr->eflags & EXEC_FLAG_BACKWARD));
(gdb) p *readptr
$2 = {eflags = 2, eof_reached = false, current = 0, file = 2139062143,
offset = 9187201950435737471}
tuplestore_gettuple->当前状态为TSS_INMEM
(gdb) n
912 switch (state->status)
(gdb) p *state
$3 = {status = TSS_INMEM, eflags = 2, backward = false, interXact = false,
truncated = false, availMem = 4177896, allowedMem = 4194304, tuples = 0, myfile = 0x0,
context = 0x3067da0, resowner = 0x2fa62c8, copytup = 0xaba7bd ,
writetup = 0xaba811 , readtup = 0xaba9d9 ,
memtuples = 0x3051e90, memtupdeleted = 0, memtupcount = 0, memtupsize = 2048,
growmemtuples = true, readptrs = 0x3077f70, activeptr = 0, readptrcount = 1,
readptrsize = 8, writepos_file = 0, writepos_offset = 0}
(gdb) p state->status
$4 = TSS_INMEM
(gdb)
tuplestore_gettuple->返回NULL
(gdb) n
915 *should_free = false;
(gdb) n
916 if (forward)
(gdb)
918 if (readptr->eof_reached)
(gdb)
920 if (readptr->current < state->memtupcount)
(gdb) p readptr->current
$5 = 0
(gdb) p state->memtupcount
$6 = 0
(gdb) n
925 readptr->eof_reached = true;
(gdb)
926 return NULL;
(gdb)
1062 }
(gdb)
tuplestore_gettupleslot->返回false
(gdb) n
tuplestore_gettupleslot (state=0x3069c18, forward=true, copy=false, slot=0x30687a8)
at tuplestore.c:1086
1086 if (tuple)
(gdb)
1098 ExecClearTuple(slot);
(gdb)
1099 return false;
(gdb)
回到ExecMaterial
(gdb) n
1101 }
(gdb)
ExecMaterial (pstate=0x3068158) at nodeMaterial.c:112
112 if (forward)
(gdb)
113 eof_tuplestore = true;
(gdb)
从outerPlan中获取一行(即从t_big_null中获取一行)
(gdb) n
124 if (eof_tuplestore && !node->eof_underlying)
(gdb) p node->eof_underlying
$7 = false
(gdb) n
133 outerNode = outerPlanState(node);
(gdb)
###
#define innerPlanState(node) (((PlanState *)(node))->righttree)
#define outerPlanState(node) (((PlanState *)(node))->lefttree)
###
134 outerslot = ExecProcNode(outerNode);
(gdb) p outerNode
$8 = (PlanState *) 0x3068270
(gdb) p *outerNode
$9 = {type = T_SeqScanState, plan = 0x3037628, state = 0x3067eb8,
ExecProcNode = 0x6f802a , ExecProcNodeReal = 0x72b904 ,
instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0,
lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0,
ps_ResultTupleDesc = 0x3068578, ps_ResultTupleSlot = 0x0, ps_ExprContext = 0x3068388,
ps_ProjInfo = 0x0, scandesc = 0x7fab449cae98,
scanops = 0xc3e780 , outerops = 0x0, innerops = 0x0,
resultops = 0xc3e780 , scanopsfixed = true,
outeropsfixed = false, inneropsfixed = false, resultopsfixed = true, scanopsset = true,
outeropsset = false, inneropsset = false, resultopsset = true}
(gdb) p *outerNode->state
$10 = {type = T_EState, es_direction = ForwardScanDirection, es_snapshot = 0x2f9cd10,
es_crosscheck_snapshot = 0x0, es_range_table = 0x3042130,
es_range_table_array = 0x3068108, es_range_table_size = 2, es_relations = 0x3068130,
es_rowmarks = 0x0, es_plannedstmt = 0x3042438,
es_sourceText = 0x2f74d88 "select * from tbl a where a.id not in (select b.id from t_big_null b);", es_junkFilter = 0x0, es_output_cid = 0, es_result_relations = 0x0,
es_num_result_relations = 0, es_result_relation_info = 0x0,
es_root_result_relations = 0x0, es_num_root_result_relations = 0,
es_partition_directory = 0x0, es_tuple_routing_result_relations = 0x0,
es_trig_target_relations = 0x0, es_param_list_info = 0x0,
es_param_exec_vals = 0x30680d0, es_queryEnv = 0x0, es_query_cxt = 0x3067da0,
es_tupleTable = 0x3068540, es_processed = 0, es_top_eflags = 16, es_instrument = 0,
es_finished = false, es_exprcontexts = 0x3068448, es_subplanstates = 0x3068950,
es_auxmodifytables = 0x0, es_per_tuple_exprcontext = 0x0, es_epq_active = 0x0,
es_use_parallel_mode = false, es_query_dsa = 0x0, es_jit_flags = 25, es_jit = 0x0,
es_jit_worker_instr = 0x0}
(gdb) p ((PlanState *)node)->righttree
$21 = (struct PlanState *) 0x0
(gdb)
回过头来看执行计划,Materialize Node的lefttree是Seq Scan on public.t_big_null b,righttree为NULL。
[local]:5432 pg12@testdb=# explain verbose select * from tbl a where a.id not in (select b.id from t_big_null b);
QUERY PLAN
-------------------------------------------------------------------------------------------
-
Seq Scan on public.tbl a (cost=0.00..129156.33 rows=1 width=8)
Output: a.id, a.value
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..233310.68 rows=9999979 width=4)
Output: b.id
-> Seq Scan on public.t_big_null b (cost=0.00..144247.79 rows=9999979 width=4)
Output: b.id
(8 rows)
Time: 7.681 ms
获取outerslot
(gdb) n
135 if (TupIsNull(outerslot))
(gdb) p *outerslot
$16 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 0,
tts_ops = 0xc3e780 , tts_tupleDescriptor = 0x7fab449cae98,
tts_values = 0x30684f0, tts_isnull = 0x30684f8, tts_mcxt = 0x3067da0, tts_tid = {
ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1}, tts_tableOid = 49155}
(gdb) p *outerslot->tts_values
$17 = 0
(gdb) p outerslot->tts_values[1]
$18 = 0
(gdb) p outerslot->tts_values[0]
$19 = 0
(gdb) p *outerslot->tts_tupleDescriptor
$20 = {natts = 1, tdtypeid = 49157, tdtypmod = -1, tdrefcount = 2, constr = 0x0,
attrs = 0x7fab449caeb0}
获取outerslot后,put到tuplestore中
(gdb) p *node
$22 = {ss = {ps = {type = T_MaterialState, plan = 0x3040a60, state = 0x3067eb8,
ExecProcNode = 0x720ecf , ExecProcNodeReal = 0x720ecf ,
instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0, qual = 0x0,
lefttree = 0x3068270, righttree = 0x0, initPlan = 0x0, subPlan = 0x0,
chgParam = 0x0, ps_ResultTupleDesc = 0x3068690, ps_ResultTupleSlot = 0x30687a8,
ps_ExprContext = 0x0, ps_ProjInfo = 0x0, scandesc = 0x3068578,
scanops = 0xc3e720 , outerops = 0x0, innerops = 0x0,
resultops = 0xc3e720 , scanopsfixed = true,
outeropsfixed = false, inneropsfixed = false, resultopsfixed = true,
scanopsset = true, outeropsset = false, inneropsset = false, resultopsset = true},
ss_currentRelation = 0x0, ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x3068868},
eflags = 2, eof_underlying = false, tuplestorestate = 0x3069c18}
(gdb) n
146 if (tuplestorestate)
(gdb)
147 tuplestore_puttupleslot(tuplestorestate, outerslot);
(gdb) p outerslot->tts_values[0]
$23 = 0
(gdb) n
149 ExecCopySlot(slot, outerslot);
(gdb) p outerslot->tts_values[0]
$24 = 0
(gdb) n
150 return slot;
(gdb) p outerslot->tts_values[0]
$25 = 0
(gdb) p slot->tts_values[0]
$26 = 0
(gdb) n
157 }
(gdb)
继续“物化”
(gdb) n
ExecProcNodeFirst (node=0x3068158) at execProcnode.c:446
446 }
(gdb) c
Continuing.
Breakpoint 1, ExecMaterial (pstate=0x3068158) at nodeMaterial.c:41
41 MaterialState *node = castNode(MaterialState, pstate);
(gdb) n
49 CHECK_FOR_INTERRUPTS();
(gdb)
54 estate = node->ss.ps.state;
(gdb)
55 dir = estate->es_direction;
(gdb)
56 forward = ScanDirectionIsForward(dir);
(gdb)
57 tuplestorestate = node->tuplestorestate;
(gdb)
62 if (tuplestorestate == NULL && node->eflags != 0)
(gdb)
85 eof_tuplestore = (tuplestorestate == NULL) ||
(gdb)
86 tuplestore_ateof(tuplestorestate);
(gdb)
85 eof_tuplestore = (tuplestorestate == NULL) ||
(gdb)
88 if (!forward && eof_tuplestore)
(gdb)
107 slot = node->ss.ps.ps_ResultTupleSlot;
(gdb)
108 if (!eof_tuplestore)
(gdb)
124 if (eof_tuplestore && !node->eof_underlying)
(gdb)
133 outerNode = outerPlanState(node);
(gdb) p eof_tuplestore
$27 = true
(gdb) n
134 outerslot = ExecProcNode(outerNode);
(gdb)
135 if (TupIsNull(outerslot))
(gdb)
146 if (tuplestorestate)
(gdb)
147 tuplestore_puttupleslot(tuplestorestate, outerslot);
(gdb)
149 ExecCopySlot(slot, outerslot);
(gdb)
150 return slot;
(gdb) p slot->tts_values[0]
$28 = 2
(gdb)
第一次执行时间较久,第二次相对快2个数量级,需要继续研究。
[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b);
id | value
----+-------
(0 rows)
Time: 3633462.666 ms (01:00:33.463) --> 包括了debug的时间,实际时间是5s左右
[local]:5432 pg12@testdb=#
[local]:5432 pg12@testdb=# select * from tbl a where a.id not in (select b.id from t_big_null b);
id | value
----+-------
(0 rows)
Time: 6.480 ms --> 第2+次就快很多
[local]:5432 pg12@testdb=#
DONE
四、参考资料
N/A
分享文章:PostgreSQL源码解读(232)-查询#125(NOTIN实现#3)
本文地址:http://scyanting.com/article/gcesed.html