DbLoadgen: A Scalable Solution for Generating Transactional Load Against a Database
DbLoadgen is scalable solution for generating transactional loads against a database, perhaps must useful for testing CDC tools such as Qlik Replicate. Focus is on generating a random series of INSERT, UPDATE, and DELETE statements across the tables that make up the dataset being tested, not in generating a true OLTP-load as one might see in an implementation of a benchmark such as TPC-C.
DbLoadgen is somewhat different from other open source tools that can be found in that it is designed to be extremely flexible. In particular:
-
DbLoadgen is metadata-driven, meaning that all the data structures associated with managing the workload and defining the tables to be used in testing are created from yaml-based files.
-
All SQL is generated on the fly from the meta-data files rather than relying on canned scripts for creating the tables used in testing. This makes adding support for additional target databases and/or additional datasets for testing much easier.
-
As with any of the load generation and benchmarking tools that are available on the internet, DbLoadgen generates random values to use in populating table rows. DbLoadgen is different in that it allows specification of custom initializers for each column, allowing the data that is generated to have a more meaningful appearance when the data in the database is queried. For example, integer values can be limited to a particular range on a column-by-column basis, and patterns may be specified to generate values that look like real-life content such as social security numbers, credit card numbers, etc.
DbLoadgen Architecture
DbLoadgen is written in Java an provide several mechanisms for configuring and initiating workloads:
- Controlling a workload directly from the command line
- Controlling a workload using REST API calls (still to be developed using SpringBoot).
- Starting and controlling a workload using an HTML- and Javascript-based GUI (still to be developed using SpringBoot).
This image represents a high-level view of the object model for the workload manager.
This image represents a the DbLoadgen process architecture as it relates to interacting with a target database.
Regarding SpringBoot
Spring Boot provides a backend server that can support interactions with many users. DbLoadgen could in theory do this as a result, but its primary intent is for there to be a single user running a single workload at any point in time. The reason for this is that the process of preloading a database and generating transactions makes use of a configurable pool of threads; more threads translate to more transaction volume. It would be easy to overwhelm a server if multiple high-volume workloads were running at one time.
Supported Databases
As mentioned earlier, DbLoadgen was designed to support a variety of database targets:
- H2: An in memory database useful for low volume testing of configurations, etc.
- MySQL
- SQL Server
- PostgreSQL
- Oracle - Oracle 12c and up
Other databases can be added without much difficulty. All that is required is creating a class that extends the base class qlikpe.dbloadgen.model.database.Database
. This class is responsible for generating the appropriate SQL dialect for the target database.
DbLoadgen Configuration
Workload Metadata
A workload metadata file is used to configure the execution of a test. It defines items such as the number of worker threads, the dataset and associated tables to use, the duration of the test, and more. The metadata is expressed in YAML format as shown in the following example. Note that all properties are required. The tables section identifies a list of tables that should be part of the test.
!!qlikpe.dbloadgen.model.workload.WorkloadConfig
dataset: test
schema: myschema
duration: 1
workerThreads: 2
preloadThreads: 2
preloadBatchSize: 10
workerBatchSize: 5
workerBatchSleep: 50
workerThreadSleep: 0
allTablesAllThreads: true
operationPct:
insert: 25
update: 50
delete: 25
tables:
- name: table1
preload: data
updateColumns: "G,W,L"
- name: table2
preload: data
updateColumns: "active,NAassoc"
- name: table3
preload: 21
updateColumns: "active"
- dataset: the name of the subdirectory where the table metadata can be found
- schema: the name of the schema in the target database where the tables should be created.
- duration: the number of minutes that the workload should run after initialization.
- workerThreads: the number of independent threads to spin up for generating load during the test.
- preloadThreads: the number of threads to have running in parallel during database initialization. A thread is responsible for loading a table. When a table has been initialized, the thread will terminate and another thread will take its place to load the next table and so on until all tables have been initialized.
- preloadBatchSize: the size of each batch of inserts to a table during the preload phase.
- workerBatchSize: the number of random operations (insert/update/delete) that should be generated before each worker thread commits a transaction.
- workerBatchSleep: specifies the length of time in milliseconds to sleep after applying each batch of operations to the database.
- workerThreadSleep: specifies the length of time in milliseconds to sleep between each operation that is generated. Specifying a non-zero value here can significantly reduce the number of operations generated. This can be quite useful for demos and functional tests where high volume testing is not necessarily required.
- allTablesAllThreads:
- a value of
true
specifies that all tables should be accessed and updated from each worker thread. In this mode, transaction volume against each table will be somewhat higher, but the possibility of transaction deadlock / operation collisions will exist. When they occur, these errors are caught and handled by DbLoadgen. - A value of
false
will cause each table to be allocated to only a single worker thread.
- a value of
- operationPct: specifies the distribution of randomly generated INSERT, UPDATE, and DELETE operations.
- tables: this is a YAML "list" that identifies the specific tables from the dataset subdirectory that we want to be a part of this test. This list might include only a subset of tables from the directory, or all of the tables.
- name: the name of the table
- preload: identifies the source of the data to use to preload each table. The value may be data, which indicates that the initialization data can be found in the data section of the table metadata, or it may be an integer which will indicate the number of rows to generate automatically. If the value is an integer then the data section in the table metadata is ignored even if it is present.
- updateColumns: the name(s) of the column(s) that should be modified as a part of any update operations. New values for these columns will be randomly generated. Note that if updateColumns is empty, then the table will be loaded, but omitted from participation in random generation of CDC load.
Table Metadata
Table metadata files are part of a "dataset" and are located in a subdirectory with the same name as the dataset that was specified in the workload configuration file. A table metadata file is used to define the structure of a table, the initializer type that should be used to generate values for each column, columns that should make up the primary key, etc. The metadata is specified in YAML format as shown in the following example. Note that all properties are required.
!!qlikpe.dbloadgen.model.database.Table
name: table1
columns:
- name: yearID
type: int(11)
nullable: false
initializer: "SignedInteger,1890,2021"
- name: lgID
type: varchar(64)
nullable: false
initializer: "RandomString,ALPHA,10"
- name: teamID
type: varchar(64)
nullable: false
initializer: "RandomString,ALPHA,10"
- name: HalfID
type: int(11)
nullable: false
initializer: "SignedInteger"
- name: divID
type: varchar(64)
nullable: false
initializer: "RandomString,ALPHA,10"
- name: DivWin
type: varchar(64)
nullable: true
initializer: "RandomString,ALPHA,10"
- name: Rank
type: int(11)
nullable: true
initializer: "SignedInteger,0,100"
- name: G
type: int(11)
nullable: true
initializer: "SignedInteger"
- name: W
type: int(11)
nullable: true
initializer: "SignedInteger"
- name: L
type: int(11)
nullable: true
initializer: "SignedInteger"
keyColumns:
- yearID
- lgID
- teamID
- HalfID
- divID
data:
- "'1981', 'NL', 'ATL', '1', 'W', 'N', '4', '54', '25', '29'"
- "'1981', 'NL', 'ATL', '2', 'W', 'N', '5', '52', '25', '27'"
- "'1981', 'AL', 'BAL', '1', 'E', 'N', '2', '54', '31', '23'"
- "'1981', 'AL', 'BAL', '2', 'E', 'N', '4', '51', '28', '23'"
- "'1981', 'AL', 'BOS', '1', 'E', 'N', '5', '56', '30', '26'"
- "'1981', 'AL', 'BOS', '2', 'E', 'N', '2', '52', '29', '23'"
- "'1981', 'AL', 'CAL', '1', 'W', 'N', '4', '60', '31', '29'"
- "'1981', 'AL', 'CAL', '2', 'W', 'N', '6', '50', '20', '30'"
- "'1981', 'AL', 'CHA', '1', 'W', 'N', '3', '53', '31', '22'"
- "'1981', 'AL', 'CHA', '2', 'W', 'N', '6', '53', '23', '30'"
- "'1981', 'NL', 'CHN', '1', 'E', 'N', '6', '52', '15', '37'"
- "'1981', 'NL', 'CHN', '2', 'E', 'N', '5', '51', '23', '28'"
- "'1981', 'NL', 'CIN', '1', 'W', 'N', '2', '56', '35', '21'"
- "'1981', 'NL', 'CIN', '2', 'W', 'N', '2', '52', '31', '21'"
- "'1981', 'AL', 'CLE', '1', 'E', 'N', '6', '50', '26', '24'"
- "'1981', 'AL', 'CLE', '2', 'E', 'N', '5', '53', '26', '27'"
- "'1981', 'AL', 'DET', '1', 'E', 'N', '4', '57', '31', '26'"
- "'1981', 'AL', 'DET', '2', 'E', 'N', '2', '52', '29', '23'"
- "'1981', 'NL', 'HOU', '1', 'W', 'N', '3', '57', '28', '29'"
- "'1981', 'NL', 'HOU', '2', 'W', 'N', '1', '53', '33', '20'"
- "'1981', 'AL', 'KCA', '1', 'W', 'N', '5', '50', '20', '30'"
- "'1981', 'AL', 'KCA', '2', 'W', 'N', '1', '53', '30', '23'"
- "'1981', 'NL', 'LAN', '1', 'W', 'N', '1', '57', '36', '21'"
- "'1981', 'NL', 'LAN', '2', 'W', 'N', '4', '53', '27', '26'"
- "'1981', 'AL', 'MIN', '1', 'W', 'N', '7', '56', '17', '39'"
- name: the name of the table.
- columns: a YAML list of column-related information
- name: the column name. Column names are obviously required.
- type: the data type for the column (see Column Data Types below). A column data type must be specified.
- nullable: a Boolean (true/false) indicating whether the column should be nullable. This field is optional.
- initializer: the type of initializer that should be associated with this column. Initializers generate values to assign to the columns when rows are inserted, or when update operations are executed. See Column Initializers below). Specification of an initializer for a column is optional. If you do not specify one, DbLoadgen will use a default initializer specific to that column's data type.
- keyColumns: a YAML list of column names that will make up the primary key for the table. If no columns are specified or if keyColumns is absent, then an auto incrementing column named id_ will be generated in the DDL.
- data: an array of comma-delimited YAML strings that indicate the values that should be used to initialize the table. Pay attention to the format here ... each line is surrounded by double quotes, and each value by single quotes. There must be a value for each column in the table. Specification of this field is optional.
Column Data Types
Since data types that are supported can vary by database, DbLoadgen uses an abstracted set of "generic" data types that it then maps tables to the target database. Specify these types in the table.yml files that define a table's metadata. Not every database supports the idea of an unsigned integer ... in those cases, unsigned generic types are mapped to their signed counterparts.
- CHAR(length): A fixed length character string.
- NCHAR(length): A fixed length multibyte character string.
- VARCHAR(length): A variable length character string.
- NVARCHAR(length): A variable length multibyte character string.
- BOOLEAN: a Boolean column (true/false).
- BYTES(length): A fixed length array of binary data.
- VARBYTES(length): A variable length array of binary data.
- DATE: A date in the form of YYYY-MM-DD
- TIME: A time value of the form HH:mm:ss
- DATETIME: A timestamp-like value of the form YYYY-MM-DD HH:mm:ss
- INT1: A single byte signed integer (a MySQL "tinyint").
- INT2: A double byte signed integer (a MySQL "smallint").
- INT3: A 3-byte signed integer (a MySQL "mediumint").
- INT4: A 4-byte signed integer (a MySQL "int").
- INT8: An 8-byte signed integer (a MySQL "bigint").
- NUMERIC(precision,scale): A fixed decimal number of the specified precision and scale.
- FLOAT: A single precision binary floating point number.
- DOUBLE: A double precision binary floating point number.
- UINT1: A single byte unsigned integer (a MySQL "unsigned tinyint").
- UINT2: A double byte unsigned integer (a MySQL "unsigned smallint").
- UINT3: A 3-byte unsigned integer (a MySQL "unsigned mediumint").
- UINT4: A 4-byte unsigned integer (a MySQL "unsigned int").
- UINT8: An 8-byte unsigned integer (a MySQL "unsigned bigint").
- BLOB: A binary large object. Generated BLOBs are 4k by default unless overridden by a column initializer.
- CLOB: A character large object. Generated CLOBs are 4k by default unless overridden by a column initializer.
- NCLOB: A multibyte character large object. Generated NCLOBs are 4k by default unless overridden by a column initializer.
Column Initializers
There are a number of initializers that may be used to generate column values during insert and update operations. Initializers and their associated arguments that define their behavior as specified as YAML strings containing comma-delimited values in the table metadata.
For example: initializer: "SignedInteger,1890,2021"
says to generate a random integer with a value between 1890 and 2021. Note that initializer names are case-sensitive.
- RandomString: return a random string. All characters in the string are randomly selected. This is suitable for use in key columns.
- Usage: RandomString - generate an alphanumeric string of length 10.
- RandomString,type,length - generate a string of the specified length where type is one of
- FixedString: generate a string of fixed length. Relatively efficient but not terribly random. Not suitable for use in a key column.
- Usage: FixedString - generate a string of the default length of 10
- Usage: FixedString,length - generate a string of the specified length
- RandomBytes: generate an array of bytes to assign to a binary column value.
Relatively efficient but not terribly random. Not suitable for use in a key column.- Usage: RandomBytes - generate a string of the default length of 10
- usage: RandomBytes,length - generate a string of the specified length
- Pattern: generate a random value based on the specified template. A template is a string that describes how the data should appear. For example, a credit card number might be defined as
####-####-####-####
. Subpatterns are defined as follows, and if a character is used in sequence as in the credit card number example, they are treated as a single subpattern:- Usage: Pattern,template:
- #: a numeric value.
- @: a lower case alphabetic character.
- ^: an upper case alphabetic character.
- %: a random alphanumeric character.
- *: any random character.
- any other characters are treated as a literal.
- the characters that are used to define subpatterns (#,@,^,%,*) may be escaped with a double backslash ("\\") to force them to be taken literally.
- Usage: Pattern,template:
- SSN: return a random string formatted like a US social security number. This is an example of implementing a specialized type derived from Pattern.
- Usage: SSN
- PaddedInteger: return a random integer padded on the left with zeros if necessary.
- Usage: PaddedInteger,length UPPER, LOWER, ALPHA, NUMERIC, ALPHANUMERIC, or ALL
- UnsignedInteger: return a random integer value >= zero
- Usage: UnsignedInteger - any random integer >= zero
- Usage: UnsignedInteger,min,max - a random integer in the specified range where min is inclusive, max is exclusive and min must be greater than or equal to zero
- SignedInteger: return a random integer
- Usage: SignedInteger - any random integer
- Usage: SignedInteger,min,max - a random integer in the specified range where min is inclusive, max is exclusive.
- SignedLong: return a random long
- Usage: SignedLong - any random long
- Usage: SignedLong,min,max - a random long in the specified range where min is inclusive, max is exclusive and min must be greater than or equal to zero
- RandomDecimal: return a random decimal number
- Usage: RandomDecimal - defaults to precision(8), scale(2), min(0), max(max value)
- Usage: RandomDecimal,precision,scale - defaults to min(0), max(max value)
- Usage:RandomDecimal,precision,scale,min,max
- Literal: always return the specified literal
- Usage: Literal,string - the string to return
- LiteralArray: return a random value from the specified array
- Usage: LiteralArray,value 1,value 2,...,value N
- RandomDate: return a random date from the last 10 years of the form YYYY-MM-DD.
- Usage: RandomDate
- RandomTime: return a random time of day of the form HH:mm:ss.
- Usage: RandomTime
- DateTime: return a timestamp value of the form YYYY-MM-DD HH:mm:ss.
- Usage: DateTime
More initializers can be created if the need arises.
Specifying database connections
You can specify connections to the various databases you might want to connect to in a file called connections.yml in your datasets directory. You can then reference these connections by name from the command line.
!!qlikpe.dbloadgen.model.workload.DbLoadgenConnectionList
connections:
- name: h2mem
databaseType: h2
jdbcDriver: org.h2.Driver
url: jdbc:h2:mem:testdb
username: sa
password: sa
- name: mysqltest
databaseType: mysql
jdbcDriver: com.mysql.cj.jdbc.Driver
url: jdbc:mysql://localhost:3306
username: root
password: attunity
- name: postgrestest
databaseType: postgres
jdbcDriver: org.postgresql.Driver
url: jdbc:postgresql://localhost:5432/qlikdb
username: qlik
password: attunity
- name: sqltest
databaseType: sqlserver
jdbcDriver: com.microsoft.sqlserver.jdbc.SQLServerDriver
url: jdbc:sqlserver://localhost:1433
username: sa
password: AttunityAdmin123
- name: oratest
databaseType: oracle
jdbcDriver: oracle.jdbc.driver.OracleDriver
url: jdbc:oracle:thin:@localhost:1521/XEPDB1
username: c##attunity
password: attunity
All fields are required:
- name: is the name of the connection as referred to on the command line
- databaseType: specifies the database dialect to use.
- jdbcDriver: specifies the jdbc driver entry point for this database.
- url: this is the database connection URL. It is used verbatim.
- username: the database username to connect with. This user must have sufficient privileges to create schemas and tables, as well as read and write them.
- password: the password of the database user.
Command Line Execution
There are many command line options when executing DbLoadgen:
[dbloadgen]$ ./dbloadgen.sh -h
usage: DbLoadgenCLI [-h] [--dataset-dir DATASET_DIRECTORY] [--workload-config DATASET]
[--connection-list CONNECTION_LIST] [--connection-name CONNECTION_NAME]
[--user USER_NAME] [--password PASSWORD] [--url JDBC_URL]
[--jdbc-driver JDBC_DRIVER] [--database-type DATABASE_TYPE]
[--properties-file PROPERTIES_FILE] COMMAND
This utility generates user-configurable loads against a database.
positional arguments:
COMMAND The runtime command to execute: [test-connection, preload, help-commands,
end-to-end, help-initializers, cleanup, list-connections, reset, list-
datasets, run, init]
named arguments:
-h, --help show this help message and exit
--dataset-dir DATASET_DIRECTORY
The directory where the source dataset(s) are stored
--workload-config DATASET
The name of the runtime workload configuration yaml file to use. This
should be located in the DATA_DIRECTORY.
--connection-list CONNECTION_LIST
The name of yaml file that contains a list of database connections that
have been defiled. This should be located in the DATA_DIRECTORY.
--connection-name CONNECTION_NAME
The name of the connection from the connection yaml to use.
--user USER_NAME The database user name.
--password PASSWORD The password for the database user
--url JDBC_URL The database JDBC connection URL.
--jdbc-driver JDBC_DRIVER
The fully qualified database driver class
--database-type DATABASE_TYPE
The type of database: [h2, mysql, postgres, oracle, sqlserver]
--properties-file PROPERTIES_FILE
The path to a properties file where these options may also be set
COMMAND reference:
test-connection - test database connectivity
preload - preload the tables for this workload
help-commands - print this brief description of COMMAND options
end-to-end - run an end-to-end test: cleanup, init, preload, and run
help-initializers - print a brief description of the available column initializers
cleanup - remove the workload's tables and associated schema from the database
list-connections - list database connections that have been defined
reset - perform a `cleanup', followed by an 'init' and 'preload'
list-datasets - list datasets that are available to the app.
run - run the CDC load against the database
init - create the schema and tables for this workload
[dbloadgen]$
Running the DbLoadgen Server
In addition to command line-based execution of workloads, workloads can also be configured and executed using the DbLoadgen Server.
Connecting via a web browser
TBD - to be completed once implemented.
Controlling a workload using REST API calls
DbLoadgen supports interfacing via REST API calls to initialize and execute workloads.
TBD - to be completed once implemented.
Using Docker
The DbLoadgen server is naturally well-suited to being executed in a Docker container. Specifics for this will be provided once the DbLoadgen server has been implemented.
From the command line
TBD
Using Docker Compose
TBD
Public Image
You can find a public image here:
Building DbLoadgen
Building DbLoadgen requires OpenJDK 11 or later. Apache Maven is used as the package manager, so Maven is required as well.
- Clean up from prior builds:
mvn clean
. - Build the common library that holds DbLoadgen's core functionality, which is shared across both the CLI and Server:
mvn -pl common install
. - Build DbLoadgen server and CLI modules:
mvn package
. Note that the common library must be built prior to executing the package goal in order for any changes made there to be picked up in the CLI and Server.
Being a little old school, we have also included a Makefile
. Assuming that you have gmake
installed in your environment, you can execute:
make
ormake rebuild
to perform a full clean and rebuild of the code.make clean
to clean remnants of prior builds.make common
to build the common jar file that contains DbLoadgen's core functionality.make package
to build the CLI and Server modules.make release
copies all relevant files into a directory and creates a zip bundle to copy / share.