Using User Access Patterns for
Semantic Query Caching
Qingsong Yao and Aijun An
{qingsong,ann}@cs.yorku.ca
Department of Computer Science
York University
Toronto, Canada
2003.08
Topics of Discussion
User Access Patterns
Semantic Query Caching Problem and Solutions
Algorithms
Implementation and Experiments
Conclusion and Future Work
Background
All SQL queries submitted by a client or a user have a specific meaning, and
the query execution order follows certain business logic or rules.
Applications are written with particular programming tools. The
embedded business logic ensures that the submitted queries have certain
formats and follow certain rules.
In a database-driven web site, each dynamic web page corresponds to a set
of queries. Web visitors show certain navigation patterns, so the
queries arrive in certain orders.
User access patterns describe how a group of users or a client application
accesses the data of a database. They include:
– a collection of user access events, which represent the formats of the queries;
– a collection of frequent user access graphs, which describe the query
execution orders.
User access patterns can be mined from the database workload or the business
logic, and they can help to:
– rewrite certain SQL queries for faster execution;
– tune the database system;
– anticipate and pre-fetch incoming queries;
– support semantic query caching.
User Access Event and User Access Graph
A user access event represents a set of similar queries. It contains a SQL
template and a set of parameters.
– SQL template: each value in the SQL query is replaced by a
wildcard character (%).
– Parameters: the corresponding values of the query, which can be
constants or variables.
For example, the event ("select name from customer where id=%", 101)
retrieves the name of customer 101, and the event ("select name from
customer where id=%", cid) represents the set of queries that retrieve
the name of a given customer.
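The split of a query into a SQL template and its parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_event` is hypothetical, and it assumes that values are either single-quoted strings or standalone numeric literals.

```python
import re

# Assumed value syntax: a single-quoted string or a standalone number.
# Digits inside identifiers such as r1 or t1 are not matched.
_VALUE = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def to_event(sql):
    """Return (template, parameters) for one SQL query (hypothetical helper)."""
    params = []

    def replace_value(match):
        token = match.group(0)
        if token.startswith("'"):
            params.append(token[1:-1].replace("''", "'"))
        elif "." in token:
            params.append(float(token))
        else:
            params.append(int(token))
        return "%"  # the wildcard used in the slides

    template = _VALUE.sub(replace_value, sql)
    return template, tuple(params)


event_101 = to_event("select name from customer where id=101")
event_202 = to_event("select name from customer where id=202")
```

Both queries map to the same template, so they belong to the same user access event and differ only in their parameter.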
A user access graph is a directed dependency graph that represents
the query execution order:
– each node is a user access event or a user access graph;
– each edge is associated with a confidence value;
– the graph contains a set of global variables shared by its nodes;
– some nodes are associated with actions that change the values of global
variables.
Semantic relationships exist between the events of a graph.
An Example of User Access Graph
[Figure: user access graph P(g_cid, g_date) with nodes P1 to P5 (legend: start node, node, end node), and user access graph P3(g_cid, g_date, g_tid) with nodes V31(g_cid), V32(g_cid), V33(g_tid), and V34(g_tid). Edge confidence values shown in the figure: 0.7, 0.6, 1.0, 0.8.]
P1: Login customer.
P2: Retrieve customer's profile.
P3: Retrieve treatment history.
P4: Retrieve treatment schedule.
P5: Logout customer.
V31: ("select count(*) from treatment where customer_id=%", g_cid)
V32: ("select t_id, t_date from treatment where customer_id=%", g_cid)
V33: ("select * from treatment_details where treatment_id=%", l_tid)
V34: ("select * from treatment_payment where treatment_id=%", g_tid)
Actions: V33: g_tid = l_tid
V31, V32, and V34 are determined events. V34 is parameter-determined by V33.
V31 and V32 have the same query predicate.
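The way V34 becomes predictable once V33 has been submitted can be sketched as below. This is an illustrative reading of the example, not SQL-Relay code: the `Event` class and the functions `observe` and `anticipate` are hypothetical names, and the concrete values 7 and 42 are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    template: str          # SQL template with one % wildcard
    param: str             # variable bound to the wildcard
    action: dict = field(default_factory=dict)  # global variable <- local variable


V33 = Event("V33", "select * from treatment_details where treatment_id=%",
            "l_tid", action={"g_tid": "l_tid"})
V34 = Event("V34", "select * from treatment_payment where treatment_id=%",
            "g_tid")


def observe(event, value, global_vars):
    """Record a submitted query and run the event's actions."""
    local_vars = {event.param: value}
    for global_name, local_name in event.action.items():
        global_vars[global_name] = local_vars[local_name]


def anticipate(event, global_vars):
    """Return the concrete SQL of an event if its parameter is known."""
    if event.param not in global_vars:
        return None
    return event.template.replace("%", str(global_vars[event.param]))


session = {"g_cid": 7}                  # hypothetical customer id
before = anticipate(V34, session)       # g_tid is not known yet
observe(V33, 42, session)               # user requests treatment 42
after = anticipate(V34, session)
```

Before V33 is seen, V34 cannot be instantiated; after the action `g_tid = l_tid` runs, the full text of V34 is known and the query could be pre-fetched.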
Problem and Solutions
Semantic Query Caching:
– Previous query results are cached at clients or mediators.
– Each cache is associated with a formula that describes its content.
– By finding the semantic relationship between the cache formula
and the query predicate, cached query results can help to evaluate
incoming queries.
– To answer a query, a probe query is performed on the
cached data, and a remainder query is performed on the
server to retrieve the data that is not in the caches.
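The probe/remainder split can be illustrated for the simplest case, in which both the cache formula and the query predicate are closed ranges over one integer attribute. That restriction, and the function name `split_query`, are assumptions made for this sketch only.

```python
def split_query(cached, wanted):
    """Split a range query into a probe part and remainder parts.

    cached, wanted: closed integer intervals (low, high) on one attribute.
    Returns (probe, remainders): probe is the interval answerable from the
    cache (or None), remainders are the intervals to fetch from the server.
    """
    cached_low, cached_high = cached
    wanted_low, wanted_high = wanted
    low = max(cached_low, wanted_low)
    high = min(cached_high, wanted_high)
    if low > high:                      # no overlap: everything goes to the server
        return None, [wanted]
    remainders = []
    if wanted_low < low:                # part of the query below the cached range
        remainders.append((wanted_low, low - 1))
    if high < wanted_high:              # part of the query above the cached range
        remainders.append((high + 1, wanted_high))
    return (low, high), remainders


probe, remainders = split_query(cached=(1, 3), wanted=(2, 5))
```

With a cache covering 1 to 3 and a query asking for 2 to 5, the probe query reads 2 to 3 from the cache and the remainder query fetches 4 to 5 from the server.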
Problem:
– There are no good cache selection and replacement policies.
– Cache matching time cannot be ignored.
Solution:
– Anticipate future queries according to the request sequence and user
access graphs.
– Rewrite the current query to answer future queries.
Problem and Solutions (2)
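Given two consecutive parameterized queries u and v, three
kinds of solutions are proposed: SEQ, UNI, and PR.
[Figure: the results of u and v drawn as regions over rows and columns for each solution: SEQ (u and v), UNI (u ∪ v), and PR (u' and v').]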
Costs:
response time, network traffic, and server processing time and costs.
Factors:
1. The semantic relationship between u and v.
2. The result sizes of u and v.
3. The probability that v follows u.
Semantic Relationship
[Figure: cases 1 to 5, each showing the results of u and v as regions over rows and columns.]
1. u contains v: either pre-fetch v or submit query u'.
2. u is contained by v: all solutions are applicable.
3. u horizontal-matches v: columns need to be added or removed.
4. u is disjoint from v: the union solution is still possible.
5. u partial-matches v: all solutions are applicable, but the
relationship between rows, as well as columns, must be analyzed.
6. u and v are irrelevant.
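One way to read the six cases is as a comparison of the row sets and column sets of the two results. The sketch below encodes that reading; the precise definitions (for instance, treating queries on different tables as irrelevant, and horizontal-match as "same rows, different columns") are assumptions of this illustration, and the `Result` class and `relation` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    table: str
    rows: frozenset   # identifiers of the rows the query selects
    cols: frozenset   # names of the columns the query returns


def relation(u, v):
    """Classify the relationship between two query results (assumed definitions)."""
    if u.table != v.table:
        return "irrelevant"
    if u.rows >= v.rows and u.cols >= v.cols:
        return "u contains v"
    if u.rows <= v.rows and u.cols <= v.cols:
        return "u is contained by v"
    if u.rows == v.rows:
        return "horizontal-match"
    if u.rows.isdisjoint(v.rows):
        return "disjoint"
    return "partial-match"


# Same customer row, different columns.
q_count = Result("customer", frozenset({"1074"}), frozenset({"num"}))
q_names = Result("customer", frozenset({"1074"}),
                 frozenset({"contact_last", "contact_first"}))

# Overlapping row ranges, same column.
u = Result("r1", frozenset({0, 1, 2, 3}), frozenset({"a1"}))
v = Result("r1", frozenset({1, 2, 3, 4, 5}), frozenset({"a1"}))
```

Here `relation(q_count, q_names)` yields a horizontal match and `relation(u, v)` a partial match. In practice the row sets are not known in advance, so the relationship has to be derived from the query predicates rather than from materialized rows.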
Rewriting Algorithm
Assumption: the query predicates of u and v are both conjunctions of basic
predicate units of the form var1 op var2 + cons, where op is in {=, <, >, >=, <=}.
1. solutions = {};
2. if v is undetermined, solutions = {};
3. else if v is result-determined by u, solutions = {SEQ};
4. else if u and v are irrelevant, solutions = {SEQ};
5. else Gu = weighted_directed_graph(u);
6. Gv = weighted_directed_graph(v);
7. comm_diff(Gu, Gv, comm, diff);
8. rel = relation(comm, diff);
9. choose solutions according to rel;
10. generate rewriting queries;
11. select a solution based on the overall costs.
Example
u1: select a1 from r1 where a2=1 and a3<=3
u1 (normalized): select a1 from r1 where a2<=1 and 0<=a2-1 and a3<=3
u2: select a1 from r1 where a2=1 and a3>=1
[Figure: weighted directed graphs Gu1 and Gu2 over the nodes 0, a2, and a3.]
comm:= {(a2,0,1),(0,a2,-1)}
diff:={<(a3,0,3),null>, <null,(0,a3,-1)>,
<(a3,a2,2),null>,<null,(a2,a3,0)>}
Relationship: partial-match
Rewriting queries:
u1 ∪ u2: select a1,a3 from r1 where a2=1
u1’: select a1,a3 from r1 where a2=1 and a3 <= 3
u2’: select a1 from r1 where a2=1 and a3>3
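The comm and diff sets of this example can be reproduced with a small sketch that treats each triple (x, y, c) as the constraint x - y <= c and derives the implied constraints with an all-pairs shortest-path closure. Whether the authors compute the graphs exactly this way is an assumption; the function names and signatures are illustrative, and unsatisfiable predicates (negative cycles) are not handled.

```python
from itertools import product

INF = float("inf")


def closure(constraints):
    """Tightest bounds implied by constraints (x, y, c), each meaning x - y <= c."""
    nodes = sorted({n for x, y, _ in constraints for n in (x, y)})
    dist = {(x, y): (0 if x == y else INF) for x, y in product(nodes, repeat=2)}
    for x, y, c in constraints:
        dist[x, y] = min(dist[x, y], c)
    # Floyd-Warshall: k is the outermost loop variable.
    for k, i, j in product(nodes, repeat=3):
        via_k = dist[i, k] + dist[k, j]
        if via_k < dist[i, j]:
            dist[i, j] = via_k
    return {(x, y, c) for (x, y), c in dist.items() if x != y and c != INF}


def comm_diff(gu, gv):
    """Return (common edges, edges only in u, edges only in v)."""
    cu, cv = closure(gu), closure(gv)
    return cu & cv, cu - cv, cv - cu


# u1: a2 = 1 and a3 <= 3;  u2: a2 = 1 and a3 >= 1
gu1 = [("a2", "0", 1), ("0", "a2", -1), ("a3", "0", 3)]
gu2 = [("a2", "0", 1), ("0", "a2", -1), ("0", "a3", -1)]
comm, only_u1, only_u2 = comm_diff(gu1, gu2)
```

The derived edge (a3, a2, 2) comes from combining a3 <= 3 with a2 >= 1, and (a2, a3, 0) from combining a2 <= 1 with a3 >= 1, which matches the diff entries listed in the example.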
Implementation - Architecture
SQL-Relay is an event-driven, rule-based
database gateway:
– Each connected user corresponds to a state machine that
contains a set of states, variables, and the user's request sequence.
– Each incoming query is one kind of user access event.
– Each event is associated with a set of pre-defined execution
rules.
– SQL-Relay contains a set of standard routines to process
given execution rules.
– Previous query results are cached for answering incoming
queries.
– There are two kinds of caches: a global cache and a local cache.
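A two-level lookup with one local cache per connected client and one global cache can be sketched as follows. This is not SQL-Relay's design: the class name, the exact-match lookup keyed by SQL text, and the `shareable` flag that decides where a result is stored are all assumptions made for illustration.

```python
class SqlRelaySketch:
    """Toy gateway: local cache per client, one global cache, then the server."""

    def __init__(self, server):
        self.server = server        # callable: sql text -> result (stands in for the DB)
        self.global_cache = {}      # shared by all clients
        self.local_caches = {}      # client id -> that client's private cache
        self.server_calls = 0

    def query(self, client, sql, shareable=False):
        local = self.local_caches.setdefault(client, {})
        if sql in local:
            return local[sql]
        if sql in self.global_cache:
            return self.global_cache[sql]
        self.server_calls += 1
        result = self.server(sql)
        # Hypothetical policy: the caller says whether the result may be shared.
        target = self.global_cache if shareable else local
        target[sql] = result
        return result


relay = SqlRelaySketch(server=lambda sql: [("result of", sql)])
relay.query("client-1", "select 1", shareable=True)
relay.query("client-2", "select 1")      # answered from the global cache
relay.query("client-1", "select 2")
relay.query("client-1", "select 2")      # answered from client-1's local cache
relay.query("client-2", "select 2")      # not visible to client-2: goes to the server
```

In this run only three of the five requests reach the server.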
[Figure: request flow without SQL-Relay (Client / Server, steps 1 and 2) and with SQL-Relay (Client / SQL-Relay / Server, steps 1 to 4).]
Experiment Result (1)
Mining results for a client/server application from one day's database query log:
– 12 instances of the application and 9,344 SQL queries.
– 190 user access events.
– 718 user request sequences that belong to 21 frequent user access graphs (support > 10).
An instance of user access graph P1
q1. select authority
from employee
where employee_id = '1025'
q2. select count(*) as num
from customer
where cust_num = '1074'
q3. select card_name
from customer t1,member_card t2
where t1.cust_num = '1074' and t1.card_id = t2.card_id
q4. select contact_last,contact_first
from customer
where cust_num = '1074'
q5. …
High-level user access graph
q2, q3, and q4 have a horizontal-match
relationship and can be rewritten as:
select count(*) as num, card_id,
contact_last, contact_first
from customer
where cust_num = '1074'
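How the answers to q2 and q4 can be read off the single rewritten query is sketched below with Python's built-in sqlite3 module. The table layout and the sample row are invented for the illustration, and the sketch relies on SQLite accepting non-aggregated columns next to count(*); other database systems may reject that form or need a different rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, only the columns used by q2 to q4.
conn.execute(
    "create table customer (cust_num text primary key, card_id integer, "
    "contact_last text, contact_first text)"
)
conn.execute("insert into customer values ('1074', 3, 'Doe', 'Jane')")


def merged(cust_num):
    """Run the rewritten query once and return its single result row."""
    return conn.execute(
        "select count(*) as num, card_id, contact_last, contact_first "
        "from customer where cust_num = ?",
        (cust_num,),
    ).fetchone()


num, card_id, contact_last, contact_first = merged("1074")
q2_answer = num                                   # q2: select count(*) as num ...
q4_answer = (contact_last, contact_first)         # q4: select contact_last, contact_first ...

direct_q2 = conn.execute(
    "select count(*) from customer where cust_num = '1074'").fetchone()[0]
direct_q4 = conn.execute(
    "select contact_last, contact_first from customer "
    "where cust_num = '1074'").fetchone()
```

The rewritten query returns card_id rather than card_name, so q3 would still need a lookup in member_card, for example from a cached copy of that table.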
Experiment Result (3)
The client program simulates user request sequences based on the user access
graphs. The database server is MySQL version 4, configured with a
128 KB server cache. SQL-Relay is implemented in Java
and has a 128 KB global cache and a 4 KB local cache per connected client.
Comparison of cache performance under the following conditions:
1. executing queries without a cache
2. executing queries with a 128 KB server cache
3. pre-fetching queries based on user access graphs
4. integrating query pre-fetching and query rewriting rules
Our solution performs better than the others:
– its cache hit frequency is higher than that of the server-side cache;
– it saves network traffic by retrieving less data from the server;
– the rewritten queries incur fewer server I/Os.
Experiment Result (4) – TPC-W Benchmark
Pseudo code for order display web interaction:
q1: select c_id from customer where c_uname=@c_uname and c_passwd=@c_passwd
q2: select max(o_id) from orders where o_c_id=@c_id
q3: select customer.*, orders.*, address.*, country.* from customer,address,country,orders
where o_id=@o_id and c_id=@c_id and o_bill_addr=addr_id and addr_co_id=co_id
q4: select address.*, country.* from address, country
where addr_id=@a_ship_id and addr_co_id=co_id
q5: select * from order_line,item where ol_i_id=i_id and ol_o_id=@o_id
1. split join query q3 into two queries:
• q3_1: select * from customer where c_id=@c_id
• q3_2: select orders.*, address.*, country.* from address,country,orders
where o_id=@o_id and o_bill_addr=addr_id and addr_co_id=co_id
2. rewrite q1 to include the answer of q1 and q3_1:
• q1’: select * from customer where c_uname=@c_uname and c_passwd=@c_passwd
• reason: c_uname and c_passwd form a key.
• benefit: access table customer only once, and avoid joining customer with other tables.
• disadvantage: retrieve more data when the customer has no orders.
3. question: can we rewrite q3_2 to include the answer of q3_2 and q4?
• select orders.*, a1.*, a2.*, c1.*, c2.* from address a1, address a2, country c1, country c2, orders
where o_id=@o_id and o_bill_addr=a1.addr_id and a1.addr_co_id=c1.co_id
and o_ship_addr=a2.addr_id and a2.addr_co_id=c2.co_id
• reason: o_ship_addr and addr_co_id are foreign keys.
• benefit: make use of foreign key constraints.
• disadvantage: introduce new joins.
Experiment Result (5) – TPC-C Benchmark
Pseudo code for order status transaction:
q1: EXEC SQL select count(c_id) INTO :namecnt
    from customer where c_last=:c_last AND c_d_id=:d_id AND c_w_id=:w_id
q2: EXEC SQL DECLARE c_name CURSOR FOR
    select c_balance, c_first, c_middle, c_id
    from customer where c_last=:c_last AND c_d_id=:d_id AND c_w_id=:w_id
    order by c_first;
    EXEC SQL OPEN c_name;
q3: if (namecnt%2) namecnt++; // Locate midpoint customer
    for (n=0; n<namecnt/2; n++)
q4:     EXEC SQL FETCH c_name INTO :c_balance, :c_first, :c_middle, :c_id;
q5: EXEC SQL CLOSE c_name;
Solution:
• Execute q2 instead of q1 when q1 is submitted.
• q1 can be answered by retrieving the number of rows returned by the cursor.
Reason:
1. q2 always follows q1 (i.e., the confidence is 1.0).
2. q1 and q2 have similar query predicates (except for the order by clause).
3. q2 contains q1 (i.e., q1 can be answered from the result of q2).
Benefit:
1. Only one query is submitted to the server.
2. The base relation is accessed only once.
3. Performance improves when no customer meets the condition, since the original solution
searches the database twice.
Disadvantage:
Not a general solution: every DB has its own function to retrieve the number of rows of a cursor.
More:
1. The DB server can take advantage of such relationships.
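The rewritten order status lookup can be sketched with Python's sqlite3 module, where the row count is taken from the fetched result instead of a separate count query. The function name, the table contents used to try it out, and the choice to fetch all rows at once (rather than using a database-specific cursor row count) are assumptions of this illustration.

```python
import sqlite3


def order_status_midpoint(conn, c_last, d_id, w_id):
    """Run only q2; derive q1's count and the midpoint customer from its rows."""
    rows = conn.execute(
        "select c_balance, c_first, c_middle, c_id from customer "
        "where c_last = ? and c_d_id = ? and c_w_id = ? order by c_first",
        (c_last, d_id, w_id),
    ).fetchall()
    namecnt = len(rows)                 # answers q1 without a second query
    if namecnt == 0:
        return 0, None
    # Same midpoint as the loop in the pseudo code: round the count up to
    # an even number, halve it, and take the last row fetched.
    midpoint = rows[(namecnt + 1) // 2 - 1]
    return namecnt, midpoint


# Hypothetical data for trying the function out.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table customer (c_id integer, c_d_id integer, c_w_id integer, "
    "c_first text, c_middle text, c_last text, c_balance real)"
)
conn.executemany(
    "insert into customer values (?, 1, 1, ?, 'OE', 'SMITH', ?)",
    [(1, "Carol", 30.0), (2, "Alice", 10.0), (3, "Bob", 20.0)],
)
count, customer = order_status_midpoint(conn, "SMITH", 1, 1)
```

With three matching customers ordered by first name, the midpoint is the second one, and a last name with no matches costs a single query instead of two.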
Conclusion and Future Work
Our solution has several advantages:
– Our caching algorithms are based on the query execution
orders and the semantic relationships between queries,
which work better than selection policies based on
global query reference statistics.
– It separates the global cache from the local caches, which
results in a better cache hit ratio.
– Our SQL-Relay application is flexible and extensible:
various caching and rewriting rules can be added
and tested.
Future work:
– Finding user access patterns from the database workload.
– Exploring more ways to use user access patterns.
The End.
Thanks……