Queries validated by running on test data and manually checking results



Yüklə 491 b.
tarix19.11.2017
ölçüsü491 b.
#11315



Queries validated by running on test data and manually checking results

  • Queries validated by running on test data and manually checking results

    • Test data typically generated manually and/or in query independent way
    • May not catch all errors
  • The XData system (Shah et al. ICDE 2011) provides a principled approach to automatically generating test data

    • Based on “killing” query “mutations”


Mutation: single change to a given query

  • Mutation: single change to a given query

    • Mutations model common programming errors, e.g.
      • Join vs. left/right/full outerjoin
      • Selection/join condition: e.g.
      • Wrong aggregate: e.g.
      • select vs. select distinct
      • And many more ..
    • Note: Mutant may be the intended query


Traditional use of mutation testing is to check coverage of dataset

  • Traditional use of mutation testing is to check coverage of dataset

    • Generate mutants of the original query Q by modifying Q in a controlled manner
    • A dataset kills a mutant Q’, if Q and Q’ give different results on the dataset
    • A suite of datasets is considered complete if each non-equivalent mutant of the given query is killed by at least one of the datasets
  • E.g. To kill below mutation, need a department without any high-salary instructors



Goal of XData system: generate datasets for testing SQL queries

  • Goal of XData system: generate datasets for testing SQL queries

    • Testing queries: Each test dataset + query result is shown to human tester for verification.
    • Testing SQL assignments: Run correct query and student query on each of the datasets, flag if results differ
    • Note that we do not actually generate and execute mutants
  • XData [ICDE 2011] addresses data generation

    • Considers a subclass of SQL queries
    • Number of datasets is linear in size of the query
    • Completeness results for a subclass of mutations
      • Mutations between joins and outerjoin across all join orders
      • Mutations between {=,<>,<,<=,>,>=}
      • Mutations between {min, max}, and distinct/non-distinct versions of sum, count, avg (with no constraints on aggregate values)


Extends XData [ICDE 2011] to handle many commonly used SQL features

  • Extends XData [ICDE 2011] to handle many commonly used SQL features

    • NULL values
    • Constrained aggregation
    • String constraints,
    • Subqueries
    • Set operations
    • Date and time
    • Views
    • Insert/delete/update queries
  • Studies effectiveness of techniques for grading real SQL assignments



Tuya and Suarez-Cabal [IST07], Chan et al. [QSIC05]

  • Tuya and Suarez-Cabal [IST07], Chan et al. [QSIC05]

    • Describe a number of SQL query mutations
    • But do not address test data generation
  • Olston et al (SIGMOD09], Qex [LPAR10]

    • Generate data for the original query
    • But do not generate data to kill mutations
  • de la Riva et al [AST10]

    • Address data generation using constraints, with the Alloy solver
    • Consider only join/selection queries
    • Do not consider alternative join orders


Create template datasets with constraints

  • Create template datasets with constraints

    • Each attribute of each tuple is a variable
    • Add constraints on variables to model
      • Primary/foreign key constraints
        • E.g. ASSERT FORALL i EXISTS j (O_CRSE[i].dept_name= O_DEPT[j].dept_name);
      • Query constraints
      • Constraints to kill a specific mutation
        • This part is different for each template dataset generated
        • E.g. ASSERT NOT EXISTS (i: INT) (O_DEPT[1].dept_name=O_CRSE[i].dept_name) ;
  • Generate data for each template dataset

    • Using constraint solver, currently CVC3


Datatype declarations for each attribute

  • Datatype declarations for each attribute

    • DATATYPE Course_id= BIO101 | BIO301 | BIO399 | CS101 | END; Credits : TYPE = SUBTYPE (LAMBDA (x: INT) : x > 1 AND x < 5);
  • Array of tuples per relation

    • CRSE_TupleType: TYPE = [course_id, dept_name, credits];
    • O_CRSE: ARRAY INT OF CRSE_TupleType;
    • Each attribute of each tuple is a variable
    • whose value will be assigned by the solver
    • E.g. each of O_CRSE[1].course_id, O_CRSE[1].dept_name, O_CRSE[2].course_id, …, is a variable


Constraints for primary and foreign keys

  • Constraints for primary and foreign keys

    • f.k. from crse.dept_name to dept.dept_name
      • ASSERT FORALL i EXISTS j (O_CRSE[i].dept_name= O_DEPT[j].dept_name);
    • p.k. on COURSE_ID
      • ASSERT FORALL i FORALL j (O_CRSE[i].course_id = O_CRSE[j].course_id) => “all other attrs equal”
  • Constraints from join conditions

    • E.g. to ensure first tuples of course and department match
      • ASSERT (O_CRSE[1].dept_name= O_DEPT[1].dept_name) ;
    • To kill join to right-outer-join mutation, generate another dataset with above condition modified to:
      • ASSERT NOT EXISTS (i: INT) (O_CRSE[i].dept_name=O_DEPT[1].dept_name ) ;
  • Constraints from other query selection conditions

    • E.g. O_CRSE[1].credits < 5;
    • To kill mutations between {=, <, <=, ..}, generate datasets with above modified to = 5, and > 5 respectively.


Extensions to handle several SQL constructs/features

  • Extensions to handle several SQL constructs/features



CVC3 does not understand NULL values

  • CVC3 does not understand NULL values

  • Define values which act as placeholder NULL values, for e.g.,

    • DATATYPE COURSE_ID = BIO101 | BIO301 | BIO399 | CS101 | CS190 | NULL_COURSE_ID_1 | NULL_COURSE_ID_2
  • Add CVC3 constraints which distinguish between NULL and non-NULL values

  • CVC3 constraint to enforce foreign keys modified to allow null values

  • Generate constraints assigning null values to some variables, to kill mutations

    • Different placeholder NULL values assigned to different variables
  • Can now handle

    • Mutation of count(attr) to count (*), and
    • Mutations of IS NULL to NOT IS NULL


We consider following string constraints

  • We consider following string constraints

    • s1 relop const/s2, where relop in {=, <, <=, >, >=, <>} and s op const where op in {~=, LIKE, ILIKE, NOT LIKE, NOT ILIKE}
    • strlen(s) relop const
    • Constraints on upper(s) and lower(s)
  • CVC3 does not support string data type

  • We built our own string constraint solver

    • Other string solvers (Hampi, Kaluza, Rex) didn’t offer features we needed
  • For a query containing string and non-string constraints, we

    • First solve string constraints independently
    • Then solve non-string constraints using CVC3
    • We do not allow constraints between string and non-string variables


Collect all string constraints; reduce the number of constraints by propagating constants

  • Collect all string constraints; reduce the number of constraints by propagating constants

  • Construct a graph

    • With string variables as vertices
    • With edges derived from constraints of the form Vi < Vj
  • Traverse vertices Vi in increasing order of < constraint

    • For each such vertex Vi, find the lexicographically smallest string that satisfies all constraints for Vi
      • Done by
        • translating each constraint (LIKE const, < const, ..) on Vi into an automaton
        • intersecting automatons for the different constraints
        • and solving to get lexicographically smallest solution


For killing mutations between string comparisons {=, <>, <, <=, >, >=}

  • For killing mutations between string comparisons {=, <>, <, <=, >, >=}

    • Datasets generated for S1 = S2, S1 > S2, S1 < S2
  • For killing mutations between {LIKE, ILIKE, NOT LIKE, NOT ILIKE}

    • Dataset 1 satisfying S1 LIKE pattern
    • Dataset 2 satisfying S1 ILIKE pattern but not S1 LIKE pattern
    • Dataset 3 not satisfying both LIKE and ILIKE conditions


Example:

  • Example:

    • select r.A, sum(s.C) from r, s where r.B = s.B and s.C < 5 group by r.A having sum(s.C) > 20
  • Major Issue: CVC3 does not support relation types.

  • To specify aggregation constraints like SUM(s.C) > 20 in CVC3, we have to pre-decide the number of tuples in the input to the aggregation operation.

  • Before generating CVC3 constraints we need to estimate

    • Number of tuples required to satisfy the aggregation
    • Translate this number to appropriate number of tuples for each base relation
    • E.g. for above query
      • At least 5 tuples in join result, for a given r.A value
      • If r.A is a primary key, need 5 s tuples matching a given r tuple
      • Else need 5 r tuples, each with a matching s tuple


For each aggregated attribute A, we collect

  • For each aggregated attribute A, we collect

    • All aggregation constraints, domain constraints and non-aggregate constraints on A like A < 15
    • Invariant constraints on aggregation operators like
      • AVG * COUNT = SUM, MIN <= MAX, ..
  • Solve above constraints using CVC3 to find value for COUNT

    • Gives number of input tuples n required to satisfy aggregation constraint
  • If input to the aggregation is a join, we decide number of tuples of each base relation

    • Heuristic based on group by attributes, the joining attributes, and unique/p.k. constraints on joining attributes
    • Again using CVC3 to generate values
    • Details in paper
  • Create final set of constraints for each dataset based on above solution.



Nested subqueries in WHERE clause

  • Nested subqueries in WHERE clause

    • Some details in paper
  • Set Operations:

    • Generate dataset to kill mutations between UNION (ALL), INTERSECT (ALL), EXCEPT (ALL)
  • Handling Parameterized Queries

    • assign a variable to each parameter; solution given by CVC3 contains the value for the parameter
  • Date and Time

    • converted to integers
  • Handling Insert/Delete/Update Queries:

    • convert into SELECT queries and data is generated to catch mutations of the resultant SELECT queries


Used XData to grade SQL query assignments in an undergrad database course at IITB

  • Used XData to grade SQL query assignments in an undergrad database course at IITB

    • University schema from textbook (Database System Concepts, 6th ed, Silberschatz et al)
    • 15 assignment queries chosen, each with about 70 submissions
  • Ubuntu system with Intel i5, 3.30 GHz CPU and 8 GB memory

  • Time to generate datasets ranged from 6.7 to 49 seconds with on

    • Most datasets had <= 2 tuples per relation, those with aggregates had at most 5.
    • Average of approx. 6 datasets per query
  • Compared with two sample University datasets provided with textbook, and with result of manual correction by TAs





Query 8: SELECT DISTINCT course_id, title FROM course NATURAL JOIN section WHERE section.semester =‘Spring’ AND section.year =‘2010’ AND course_id NOT IN (SELECT course_id FROM prereq)

  • Query 8: SELECT DISTINCT course_id, title FROM course NATURAL JOIN section WHERE section.semester =‘Spring’ AND section.year =‘2010’ AND course_id NOT IN (SELECT course_id FROM prereq)

  • # Queries : 79



Query 14: SELECT DISTINCT * FROM takes T WHERE T.grade IS NOT NULL AND (T.grade <> ‘F” OR NOT EXISTS (SELECT id, course_id FROM takes S WHERE S.grade <> ‘F’ AND T.id = S.id AND T.course_id = S.course_id))

  • Query 14: SELECT DISTINCT * FROM takes T WHERE T.grade IS NOT NULL AND (T.grade <> ‘F” OR NOT EXISTS (SELECT id, course_id FROM takes S WHERE S.grade <> ‘F’ AND T.id = S.id AND T.course_id = S.course_id))

  • # Queries : 67



Extended XData to handle a number of widely used constructs

  • Extended XData to handle a number of widely used constructs

    • These were important for grading student SQL assignments
    • Performance study shows benefits over using fixed datasets, and manual correction
  • Currently working on further extensions

    • disjunctions, extra/missing group by attribute, unintended equating of attributes in natural join
    • handling multiple queries in an application
  • Also working on a tool for grading SQL assignments







teaches instructor is equivalent to teaches instructor if there is a foreign key from teaches.ID to instructor.ID

  • teaches instructor is equivalent to teaches instructor if there is a foreign key from teaches.ID to instructor.ID

  • BUT: teaches σ dept=CS(instructor) is not equivalent to teaches σ dept=CS(instructor)

  • Key idea: have a teaches tuple with an instructor not from CS

  • Selections and joins can be used to kill mutations



Space of join-type mutants: includes mutations of join operator of a single node for all trees equivalent to given query tree

  • Space of join-type mutants: includes mutations of join operator of a single node for all trees equivalent to given query tree

  • Datasets should kill mutants across all such trees



Whether query conditions written as

  • Whether query conditions written as

    • A.x = B.x AND B.x = C.x or as
    • A.x = B.x AND A.x = C.x
  • should not affect set of mutants generated

  • Solution: Equivalence classes of attributes





For a class of mutations:

  • For a class of mutations:

    • Join to outerjoin
    • selection condition, unconstrained aggregation
  • Algorithm to generate test data to kill all non-equivalent mutants in above class, for a (fairly large) subset of SQL.

    • Under some simplifying assumptions
      • Only primary and foreign key constraints
      • Single block SQL queries; no nested subqueries
      • Expr/functions: Only arithmetic exprs
      • Join/selection predictates : conjunctions of {expr relop expr}


Schema: r(A,B), s(C,D), t(E)

    • Schema: r(A,B), s(C,D), t(E)
    • To kill this mutant we must ensure that for an r tuple there is no matching s tuple, but there is a matching t tuple
    • Generated test case: r(A,B)={(1,2)}; s(C,D)={}; t(E)={(2)}




Query 5: SELECT DISTINCT course.dept_name FROM course NATURAL JOIN section WHERE section.semester =‘Spring’ AND section.year =‘2010’

  • Query 5: SELECT DISTINCT course.dept_name FROM course NATURAL JOIN section WHERE section.semester =‘Spring’ AND section.year =‘2010’

  • # Queries : 72



Nested subqueries in the WHERE clause connected by IN, NOT IN, EXISTS, NOT EXISTS, ALL or ANY clause

  • Nested subqueries in the WHERE clause connected by IN, NOT IN, EXISTS, NOT EXISTS, ALL or ANY clause

  • We make some assumptions

    • Correlation conditions involve a single relation in the outer query
    • IN connectives involve attributes from only one outer relation
    • Nesting depth is at most 1
  • We first generate constraints to create tuples for the outer query result ignoring the subquery and then generate constraints for the subquery



Dataset generated for original query will kill mutations between IN and NOT IN, EXISTS and NOT EXISTS

  • Dataset generated for original query will kill mutations between IN and NOT IN, EXISTS and NOT EXISTS

  • For conditions of the form r.A relop scalar subquery, we generate data to kill mutations between different relops

  • To kill mutation of ALL versus ANY, we generate two tuples for the inner query such that one satisfies the comparison operator and other does not

  • Subquery Mutations : generate constraints to kill mutations of the subquery and then add constraints of the outer query



Yüklə 491 b.

Dostları ilə paylaş:




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə