md`
A recent developer mailing list inquiry caught our attention a few days ago.
The author was wondering, _"[can] anyone ... give me a recommendation for 'how big a sane query can get'?"_
His use case was to apply a [SPARQL](http://www.w3.org/TR/sparql11-query/)
query against the
[LOD dataset](http://lod-cloud.net) in order to test for
the presence of "patterns for human associations".
There was mention of subgraph sizes in excess of 400K triples, but his efforts appeared
to be thwarted by much lower limits which his query processor
imposed on the number of statement patterns in a [BGP](http://www.w3.org/TR/sparql11-query/#BasicGraphPatterns).
The case made me wonder where our limits lie and what accounts for them.
To this end, I used a generator for simple count queries:
\`\`\`lisp
* (defun generate-bgp (count)
    \`(spocq.a:|select| (spocq.a:|bgp| ,@(loop for i from 0 below count
                                               collect '(spocq.a:|triple| ?::|s| ?::|p| ?::|o|)))
       ((?::|count| (spocq:|count| SPOCQ.S:*)))))
GENERATE-BGP
\`\`\`
which generates trivial queries, such as:
\`\`\`lisp
* (generate-bgp 4)
(spocq:|select|
 (spocq:|bgp|
  (spocq:|triple| ?::|s| ?::|p| ?::|o|)
  (spocq:|triple| ?::|s| ?::|p| ?::|o|)
  (spocq:|triple| ?::|s| ?::|p| ?::|o|)
  (spocq:|triple| ?::|s| ?::|p| ?::|o|))
 ((?::|count| (spocq:|count| SPOCQ.S:*))))
\`\`\`
\`\`\`lisp
* (defun run-test-bgp (count)
    (run-sparql (generate-bgp count) :repository-id "james/bgp-tests"))
RUN-TEST-BGP
\`\`\`
with a single-statement repository, just to exercise the query invocation, without
any concern for solution field size,
\`\`\`lisp
* (run-test-bgp 4)
((1))
(?::|count|)
#<QUERY [E7DA6E00-E215-11E4-90B0-000000000000/NIL, select@COMPLETE, james/bgp-tests] {100AFD9199}>
\`\`\`
In order to gauge the resource use in terms of execution time, it sufficed to
simply profile the significant operators:
\`\`\`lisp
* (sb-profile:profile rdfcache:match rdfcache:count spocq-compile run-agp-thread)
* (loop for count = 2 then (* count 2) until (> count (* 32 1024))
        do (let ((start (get-internal-run-time)))
             (sb-profile:reset)
             (run-test-bgp count)
             (format *trace-output* "~%~%~a ~ams" count (- (get-internal-run-time) start))
             (sb-profile:report)
             (format *trace-output* "~%")))
\`\`\`
The result is a transcript of the elapsed times, function call counts, and cumulative run times for
the respective operators, in this case on a six-core CPU.
For example:
\`\`\`
2 101ms
  seconds  |     gc     |   consed   | calls |  sec/call  |  name
-------------------------------------------------------------------
    0.091  |    0.000   | 11,168,640 |     3 |  0.030333  | SPOCQ-COMPILE
    0.001  |    0.000   |     32,576 |     1 |  0.001000  | RDFCACHE:MATCH
    0.000  |    0.000   |          0 |     1 |  0.000000  | RDFCACHE:COUNT
    0.000  |    0.000   |          0 |     1 |  0.000000  | RUN-AGP-THREAD
-------------------------------------------------------------------
    0.092  |    0.000   | 11,201,216 |     6 |            | Total
\`\`\`
For this query form, that is, a single BGP compiled into one
monolithic pattern-matching function, the results are as follows:
<table style="text-align: right">
<tr><th>pattern count </th><th>milliseconds </th><th>compile </th><th>count </th><th>match</th></tr>
<tr><td> 2 </td><td> 101 </td><td> 91 </td><td> 1 </td><td> 0 </td></tr>
<tr><td> 4 </td><td> 37 </td><td> 23 </td><td> 1 </td><td> 1 </td></tr>
<tr><td> 8 </td><td> 125 </td><td> 108 </td><td> 1 </td><td> 1 </td></tr>
<tr><td> 16 </td><td> 147 </td><td> 137 </td><td> 5 </td><td> 0 </td></tr>
<tr><td> 32 </td><td> 228 </td><td> 212 </td><td> 2 </td><td> 2 </td></tr>
<tr><td> 64 </td><td> 447 </td><td> 420 </td><td> 7 </td><td> 1 </td></tr>
<tr><td> 128 </td><td> 1043 </td><td> 1018 </td><td> 5 </td><td> 2 </td></tr>
<tr><td> 256 </td><td> 3726 </td><td> 3678 </td><td> 15 </td><td> 1 </td></tr>
<tr><td> 512 </td><td> 27592 </td><td> 27496 </td><td> 17 </td><td> 4 </td></tr>
<tr><td> 1024 </td><td> 153774 </td><td> 153615 </td><td> 37 </td><td> 5 </td></tr>
<tr><td> 1152 </td><td> 216749 </td><td> 216714 </td><td> 26 </td><td> 7 </td></tr>
</table>
Above 1K patterns, however, the run time reaches a compilation limit.
The compiler exhausts its dynamic binding stack either when precompiling the BGP
or, if the query processor is set to interpret rather than compile queries, when interpreting
the pattern's source form.
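To make that growth concrete, here is a minimal, generic sketch in Python of what a monolithic BGP matcher does: it recurses over the statement patterns, binding variables against a toy in-memory triple list, so each additional pattern contributes another level of nesting. The triple encoding and names here are illustrative, not spocq's.

\`\`\`python
def match_bgp(patterns, triples, bindings=None):
    """Yield every variable binding that satisfies all statement patterns."""
    if bindings is None:
        bindings = {}
    if not patterns:
        yield dict(bindings)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(bindings)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):                   # variable: bind or check
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:                        # constant: must match
                ok = False
                break
        if ok:
            yield from match_bgp(rest, triples, new)

triples = [("a", "knows", "b"), ("b", "knows", "c")]
# two chained patterns: ?x knows ?y . ?y knows ?z
solutions = list(match_bgp([("?x", "knows", "?y"), ("?y", "knows", "?z")], triples))
print(solutions)  # [{'?x': 'a', '?y': 'b', '?z': 'c'}]
\`\`\`

A compiler that emits one such function per BGP, with one nesting level per pattern, will hit stack limits as the pattern count climbs into the thousands.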
As an alternative, one could limit each BGP to a single statement.
\`\`\`lisp
* (defun generate-singleton-bgp (count)
    (let ((body (loop with bgp = '(spocq.a:|bgp| (spocq.a:|triple| ?::|s| ?::|p| ?::|o|))
                      for form = bgp then \`(spocq.a:|join| ,bgp ,form)
                      for i from 0 below count
                      finally (return form))))
      \`(spocq.a:|select| ,body ((?::|count| (spocq:|count| SPOCQ.S:*))))))
GENERATE-SINGLETON-BGP
\`\`\`
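For illustration, the shape of the operator tree this generator yields can be mirrored in Python; the tuple encoding is hypothetical, but the structure, count single-pattern BGPs strung along count - 1 nested JOINs, is what the construction produces.

\`\`\`python
# Illustrative encoding only: a BGP is ("bgp", pattern), a join is
# ("join", left, right); spocq's actual s-expressions differ.
PATTERN = ("?s", "?p", "?o")

def singleton_join_tree(count):
    """Fold count single-pattern BGPs into a right-nested JOIN tree."""
    form = ("bgp", PATTERN)
    for _ in range(count - 1):
        form = ("join", ("bgp", PATTERN), form)
    return form

def join_count(form):
    """Count JOIN operators along the right spine (iterative, no stack limit)."""
    n = 0
    while form[0] == "join":
        n += 1
        form = form[2]
    return n

print(join_count(singleton_join_tree(16384)))  # 16383
\`\`\`

A tree of 16K BGPs thus carries just short of 16K joins, which is why the thread count during execution is roughly double the BGP count.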
An attempt with queries of this form progressed up to 16K:
<table style="text-align: right">
<tr><th>pattern count</th><th>milliseconds</th></tr>
<tr><td> 1 </td><td> 42 </td></tr>
<tr><td> 2 </td><td> 36 </td></tr>
<tr><td> 4 </td><td> 49 </td></tr>
<tr><td> 8 </td><td> 60 </td></tr>
<tr><td> 16 </td><td> 93 </td></tr>
<tr><td> 32 </td><td> 136 </td></tr>
<tr><td> 64 </td><td> 286 </td></tr>
<tr><td> 128 </td><td> 540 </td></tr>
<tr><td> 256 </td><td> 1237 </td></tr>
<tr><td> 512 </td><td> 2191 </td></tr>
<tr><td> 1024</td><td> 5670 </td></tr>
<tr><td> 2048</td><td> 12538 </td></tr>
<tr><td> 4096</td><td> 64937 </td></tr>
<tr><td> 8192</td><td> 138365 </td></tr>
<tr><td> 16384</td><td> 567760 </td></tr>
</table>
but then also ran into limits.
This time, the obstacle was system limits on connections to the store: the
16K single-pattern BGPs required that many connections, all active in parallel,
with twice that many active threads, one for each BGP and one for each
consequent JOIN operator.
Even reaching this limit required some reconfiguration: once to raise the page map limit to permit that many simultaneous
store connections, and once to raise the compiler's stack space to permit it to compile a query with
that degree of join nesting.
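The arithmetic behind those limits is easy to check; this is just the accounting stated above, one connection and one thread per BGP plus one thread per join, not a measurement.

\`\`\`python
def resources(n_bgps):
    """Resource demand for a query of n single-pattern BGPs joined pairwise."""
    connections = n_bgps              # one store connection per BGP
    joins = n_bgps - 1                # pairwise joins over n BGPs
    threads = n_bgps + joins          # one thread per BGP, one per JOIN
    return connections, threads

print(resources(16 * 1024))  # (16384, 32767): ~16K connections, ~32K threads
\`\`\`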
In any event, given these two limits (16K parallel BGPs and 1K statements per BGP), it seemed possible
to achieve rather large statement counts.
As indeed it was.
With a configuration option to limit the BGP statement pattern count to 256, the effective limits were
quite high:
<table style="text-align: right">
<tr><th>count </th><th> milliseconds </th><th> compile </th><th> count </th><th> match (net gc)</th></tr>
<tr><td> 2 </td><td>39 </td><td>30 </td><td>1 </td><td>0 </td></tr>
<tr><td> 4 </td><td> 109 </td><td> 97 </td><td> 0 </td><td> 0 </td></tr>
<tr><td> 8 </td><td> 114 </td><td> 108 </td><td> 1 </td><td> 1 </td></tr>
<tr><td> 16 </td><td> 144 </td><td> 136 </td><td> 1 </td><td> 1 </td></tr>
<tr><td> 32 </td><td> 231 </td><td> 218 </td><td> 8 </td><td> 0 </td></tr>
<tr><td> 64 </td><td> 432 </td><td> 412 </td><td> 6 </td><td> 0 </td></tr>
<tr><td> 128 </td><td> 41 </td><td> 32 </td><td> 20 </td><td> 0 </td></tr>
<tr><td> 256 </td><td> 47 </td><td> 32 </td><td> 23 </td><td> 1 </td></tr>
<tr><td> 512 </td><td> 59 </td><td> 41 </td><td> 45 </td><td> 0 </td></tr>
<tr><td> 1024 </td><td> 99 </td><td> 61 </td><td> 133 </td><td> 2 </td></tr>
<tr><td> 2048 </td><td> 155 </td><td> 70 </td><td> 419 </td><td> 2 </td></tr>
<tr><td> 4096 </td><td> 313 </td><td> 95 </td><td> 1636 </td><td> 2 </td></tr>
<tr><td> 8192 </td><td> 672 </td><td> 405 </td><td> 6645 </td><td> 73 </td></tr>
<tr><td> 16384 </td><td> 1965 </td><td> 216 </td><td> 43239 </td><td> 524 </td></tr>
<tr><td> 32768 </td><td> 8125 </td><td> 369 </td><td> 366529 </td><td> 25771 </td></tr>
<tr><td> 65536 </td><td> 7217 </td><td> 709 </td><td> 2077912 </td><td> 49093 </td></tr>
<tr><td> 131072 </td><td> 20522 </td><td> 2066 </td><td> 23031994 </td><td> 62938 </td></tr>
<tr><td> 262144 </td><td> 62569 </td><td> 4128 </td><td> 98516780 </td><td> 665346 </td></tr>
<tr><td>524288 </td><td> 291660 </td><td> 13022 </td><td> 1380142800 </td><td> 2563251 </td></tr>
</table>
<br />
The degenerate case of a single BGP with thousands of statement patterns turns out not to be
all that interesting.
On one hand, if it is implemented with value propagation,
only one thread is involved and, depending on the dataset statistics, it would likely execute
half of the statement patterns, one match at a time.
On the other, for larger sizes it fails early due to stack limits in the compiler.
The path to testing large pattern counts is to force the partitioning into joined subexpressions.
With that, even within the bounds set by
- stack space limits in the query processor run time, which constrain operator depth,
- memory mapping limits, which constrain the number of simultaneous connections to the store,
- stack space limits within the compiler, which limit both the BGP statement count and the operator depth,
- file descriptor limits, which constrain connections to the store, and
- eventual thread count limits, which constrain the total operator and BGP count,

it is possible to evaluate queries with 512K statement patterns.
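That partitioning strategy can be sketched in Python with an illustrative operator encoding: split the statement patterns into BGPs of at most 256 patterns, the configured limit above, and fold the chunks into a join tree, so that each BGP stays within the compiler's bounds and the BGP count stays within the connection and thread bounds.

\`\`\`python
MAX_BGP_SIZE = 256  # the configured per-BGP statement pattern limit

def partition_query(patterns):
    """Split patterns into BGPs of at most MAX_BGP_SIZE and join them."""
    chunks = [patterns[i:i + MAX_BGP_SIZE]
              for i in range(0, len(patterns), MAX_BGP_SIZE)]
    form = ("bgp", chunks[0])
    for chunk in chunks[1:]:
        form = ("join", ("bgp", chunk), form)
    return form, len(chunks)

# 512K statement patterns yield 2048 BGPs, well under the 16K parallel bound.
patterns = [("?s", "?p", "?o")] * (512 * 1024)
form, bgp_count = partition_query(patterns)
print(bgp_count)  # 2048
\`\`\`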
The [progression of elapsed times](https://docs.google.com/spreadsheets/d/1Gup83gVu4w72xqaJd7HE4Az-uacto-Or9oU7fuvcVo0/pubhtml?gid=626408870&single=true)
does indicate, however, that significant contention occurs
as the count increases,
which suggests the results would improve if the processor were to limit the number
of BGPs which it executes in parallel.
`