minisql/DESIGN.md
2024-01-28 18:40:34 +01:00


MiniSQL

Official Description

MiniSQL server

Create a simple SQL server that supports SELECT (including column selection), INSERT and DELETE. The SELECT and DELETE statements support a WHERE clause with a single column. It is also possible to create indexes (hashes).

The database works with persistent storage: it can be turned off and on without data loss. Create an interface for working with the database, either a CLI or a desktop application.

Resources

DB internals

Parsing

Parser Combinator libraries

TCP socket programming

Scope

  • Primarily in-memory db.
  • Occasionally saves its state to disk.
  • Almost non-existent concurrency control?
  • Inspiration from SQLite, but not the server part. For the server part take a look at Postgres.
  • Can create custom column indexes.
  • What about Joins?
  • What about composite queries (i.e. instead of a table name in a select expression we include another select expression)?
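For the custom column indexes mentioned in this list, one possible shape is a hash map from a column value to the ids of the rows that hold it, so that a `WHERE column = value` lookup does not have to scan every row. The following is only a sketch under assumptions: `HashIndex` and its methods are hypothetical names, rows are simplified to `Vec<String>`, and values are indexed by their text form.

```rust
use std::collections::{BTreeMap, HashMap};

type Id = u32;

/// Hypothetical hash index over one column: value -> ids of rows holding it.
#[derive(Default)]
struct HashIndex {
    entries: HashMap<String, Vec<Id>>,
}

impl HashIndex {
    /// Builds the index for column number `column` from existing rows.
    /// Assumes every row has at least `column + 1` values.
    fn build(rows: &BTreeMap<Id, Vec<String>>, column: usize) -> Self {
        let mut index = HashIndex::default();
        for (id, row) in rows {
            index.insert(&row[column], *id);
        }
        index
    }

    /// Must be called on every INSERT to keep the index in sync.
    fn insert(&mut self, value: &str, id: Id) {
        self.entries.entry(value.to_string()).or_default().push(id);
    }

    /// Must be called on every DELETE to keep the index in sync.
    fn remove(&mut self, value: &str, id: Id) {
        let now_empty = match self.entries.get_mut(value) {
            Some(ids) => {
                ids.retain(|other| *other != id);
                ids.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.entries.remove(value);
        }
    }

    /// Ids of all rows whose indexed column equals `value`.
    fn lookup(&self, value: &str) -> &[Id] {
        self.entries.get(value).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

The cost of this design is that every insert and delete has to update each index on the table, which is the usual trade-off for faster equality lookups.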

Building a Minimal Viable Product (MVP)

Possible usage:

  • You start the db server with ./minisql server start --db path/to/db/my-db.db --port 1433 which will store the database as a file path/to/db/my-db.db and open a TCP server on port 1433
  • Then, possibly on a different machine, you run ./minisql client connect server_ip_address:1433 to start a client. This opens a REPL through which you can send queries and db management commands
  • TODO: We should also consider writing a Rust library that allows you to spin up a client that connects to the server. What would the interface look like?
  use minisql::{DB, DBConnection};

  // `ConnectionConfig` is a placeholder name for the options struct
  let maybe_conn: Result<DBConnection> = DB::connect(ConnectionConfig {
      port: 1433,
      db_name: "db-name",
      username: "bojack",
      password: "12345",
  });
  let conn: DBConnection = maybe_conn?;

  // then we can execute queries
  conn.perform("SELECT id, name, title, salary FROM employees"); // TODO: what should this return?
  • Then with the client CLI we can request creation of tables, insertion of data, etc.
> CREATE TABLE persons(id u32 PRIMARY KEY, name String, salary Float);

> INSERT 1, "Alice", 20.0 INTO persons;
> INSERT 2, "Bob", 30 INTO persons;
> INSERT 3, "Claire", 15 INTO persons;

  • What should the SQL AST look like? For example, when the server parses SELECT id, name FROM persons;, what will the parsing output look like? Consider something like
// TODO: Parser has access to all table metadata

// Could also be called `SQLAbstractSyntaxTree`
enum Operation {
  Select(TableName, ColumnSelection, Option<Condition>),
  // TODO: values may have to stay raw Strings until they are checked against the table metadata
  Insert(TableName, Vec<(ColumnName, DbValue)>),
  Delete(TableName, Option<Condition>),
  // Update(...),
}

enum ColumnSelection {
    All,
    Columns(Vec<ColumnName>),
}

enum Condition {
    // Recursive variants need indirection, hence Box
    // And(Box<Condition>, Box<Condition>),
    // Or(Box<Condition>, Box<Condition>),
    // Not(Box<Condition>),

    Eq(ColumnName, DbValue),
    // LessOrEqual(ColumnName, DbValue),
    // Less(ColumnName, DbValue),

    // StringCondition(StringCondition),
}

enum StringCondition {
    Prefix(ColumnName, String),
    Substring(ColumnName, String),
}
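To make the question about the parsing output concrete, here is a minimal hand-written sketch that turns a SELECT string into an `Operation` value. It is not the planned parser (the resources above point at parser combinators); it assumes a crude whitespace tokenizer, supports only `SELECT ... FROM ... [WHERE col = value]`, and keeps the condition value as a raw `String` because typing it would need table metadata. The types are simplified copies of the ones above, and `parse_select` and `expect_keyword` are hypothetical names.

```rust
type TableName = String;
type ColumnName = String;

#[derive(Debug, PartialEq)]
enum ColumnSelection {
    All,
    Columns(Vec<ColumnName>),
}

#[derive(Debug, PartialEq)]
enum Condition {
    // Assumption: the value stays untyped until checked against metadata.
    Eq(ColumnName, String),
}

#[derive(Debug, PartialEq)]
enum Operation {
    Select(TableName, ColumnSelection, Option<Condition>),
}

/// Parses `SELECT <cols|*> FROM <table> [WHERE <col> = <value>][;]`.
/// Limitation: quoted values containing spaces, commas or `=` are not handled.
fn parse_select(input: &str) -> Result<Operation, String> {
    // Crude tokenizer: pad `,` and `=` with spaces, then split on whitespace.
    let spaced = input
        .trim()
        .trim_end_matches(';')
        .replace(',', " , ")
        .replace('=', " = ");
    let tokens: Vec<&str> = spaced.split_whitespace().collect();
    let mut pos = 0;

    expect_keyword(&tokens, &mut pos, "SELECT")?;

    // Column list: either `*` or comma-separated names.
    let selection = if tokens.get(pos) == Some(&"*") {
        pos += 1;
        ColumnSelection::All
    } else {
        let mut columns = Vec::new();
        loop {
            let name = tokens.get(pos).ok_or("expected a column name")?;
            columns.push(name.to_string());
            pos += 1;
            if tokens.get(pos) == Some(&",") {
                pos += 1;
            } else {
                break;
            }
        }
        ColumnSelection::Columns(columns)
    };

    expect_keyword(&tokens, &mut pos, "FROM")?;
    let table = tokens.get(pos).ok_or("expected a table name")?.to_string();
    pos += 1;

    // Optional `WHERE <col> = <value>`; nothing may follow it.
    let condition = if pos < tokens.len() {
        expect_keyword(&tokens, &mut pos, "WHERE")?;
        let column = tokens.get(pos).ok_or("expected a column name")?.to_string();
        pos += 1;
        if tokens.get(pos) != Some(&"=") {
            return Err("expected `=`".to_string());
        }
        pos += 1;
        let value = tokens.get(pos).ok_or("expected a value")?.to_string();
        pos += 1;
        if pos != tokens.len() {
            return Err("unexpected trailing input".to_string());
        }
        Some(Condition::Eq(column, value))
    } else {
        None
    };

    Ok(Operation::Select(table, selection, condition))
}

/// Consumes one token if it matches `keyword` (case-insensitive).
fn expect_keyword(tokens: &[&str], pos: &mut usize, keyword: &str) -> Result<(), String> {
    match tokens.get(*pos) {
        Some(tok) if tok.eq_ignore_ascii_case(keyword) => {
            *pos += 1;
            Ok(())
        }
        other => Err(format!("expected {keyword}, found {other:?}")),
    }
}
```

With this sketch, `SELECT id, name FROM persons;` becomes `Select("persons", Columns(["id", "name"]), None)`. A real parser would also validate the table and column names against the metadata.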



INSERT 123
  • We also have to write an interpreter for these operations. How will the db-state be represented in memory? For example how can we implement a table?
// Not exactly efficient, but how could we do better?
enum DbValue {
  DbString(String),
  DbNumber(f64),
  DbUUID(u32),
}

// We also need a type of db-types
enum DbType {
    TString,
    TNumber,
    TId,
}

fn value_to_type(db_val: &DbValue) -> DbType


// table-metadata and data

type TableName = String;

// Note that it is nice to split metadata from the data because
// then you can give the metadata to the parser without giving it the data.
struct TableMetaData {
    name: TableName, // TODO: Is this really necessary? probably not
    columns: Vec<(ColumnName, DbType, ColumnPosition)>
}

fn column(meta: &TableMetaData, name: &ColumnName) -> ColumnPosition

struct Table {
    meta: TableMetaData,
    rows: Rows, // defined below
    // TODO: Consider generalizing ColumnName to something that would also apply to a pair of ColumnNames etc.
    indexes: BTreeMap<ColumnName, Index>,
}

type Tables = HashMap<TableName, Table>;

// We also need a function that for a given value computes its type (for validation)


type ColumnName = String;
type ColumnPosition = u32;

// The below type is a type of a table row
// Option 1: a row is a dictionary from column name to value
// (or some other appropriate dictionary type)
type Row = HashMap<ColumnName, DbValue>;

// e.g. { "id": 1, "name": "Alice", "salary": 20.0 }

type Rows = BTreeMap<Id, Row>;

// Option 2 (possible optimization): keep a mapping
//   column names ~> positions
// in the table metadata, so that rows can be represented as
//   type Row = Vec<DbValue>;

// How to represent a table? Candidates:
//   HashMap<Id, Row>
//   Vec<(Id, Row)>
//   Vec<Vec<DbValue>>

// Suppose the row corresponds to 'INSERT 1, "Alice", 20.0 INTO persons;'. Then
//   Row ~> vec![DbUUID(1), DbString("Alice"), DbNumber(20.0)]
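As a sketch of how the positional `Vec<DbValue>` row layout and the metadata could fit together, the block below validates a row against the column types before storing it. It assumes `f64` for numbers, assumes the first column is the primary key of type `TId`, and drops the separate `ColumnPosition` field by using the position in the `columns` vector instead; `Table::new` and `Table::insert` are hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum DbValue {
    DbString(String),
    DbNumber(f64),
    DbUUID(u32),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DbType {
    TString,
    TNumber,
    TId,
}

/// Computes the type of a value; used for validation on insert.
fn value_to_type(value: &DbValue) -> DbType {
    match value {
        DbValue::DbString(_) => DbType::TString,
        DbValue::DbNumber(_) => DbType::TNumber,
        DbValue::DbUUID(_) => DbType::TId,
    }
}

type ColumnName = String;
type Id = u32;
type Row = Vec<DbValue>;

struct TableMetaData {
    // The position in this Vec doubles as the column position in a Row.
    columns: Vec<(ColumnName, DbType)>,
}

struct Table {
    meta: TableMetaData,
    rows: BTreeMap<Id, Row>,
}

impl Table {
    fn new(columns: Vec<(ColumnName, DbType)>) -> Self {
        Table {
            meta: TableMetaData { columns },
            rows: BTreeMap::new(),
        }
    }

    /// Checks arity and column types, then stores the row under its id.
    /// Assumption: column 0 is the primary key and has type TId.
    fn insert(&mut self, row: Row) -> Result<(), String> {
        if row.len() != self.meta.columns.len() {
            return Err(format!(
                "expected {} values, got {}",
                self.meta.columns.len(),
                row.len()
            ));
        }
        for (value, (name, expected)) in row.iter().zip(&self.meta.columns) {
            if value_to_type(value) != *expected {
                return Err(format!("column `{name}` expects {expected:?}"));
            }
        }
        let id = match row.first() {
            Some(DbValue::DbUUID(id)) => *id,
            _ => return Err("first column must be an id".to_string()),
        };
        if self.rows.contains_key(&id) {
            return Err(format!("duplicate id {id}"));
        }
        self.rows.insert(id, row);
        Ok(())
    }
}
```

Keeping rows in a `BTreeMap` keyed by id gives ordered iteration and cheap primary-key lookups; whether that beats a plain `Vec` or a `HashMap` for this workload is still an open question in this design.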

  • Interpreter
trait SqlConsumer {
    // TODO: what should this interface be?
}

fn interpret<T: SqlConsumer>(operation: Operation, tables: &mut Tables, consumer: T) {
    // TODO: lock stuff
    match operation {
        Operation::Select(table_name, column_selection, maybe_condition) => {
            let table: &Table = ...; // look up `table_name` in `tables`
            // TODO: Wrap this into a response
            select(table, column_selection, maybe_condition, consumer)
        }
        Operation::Insert(table_name, values) => {
            // TODO: look up the table and validate `values` against its metadata
            insert(table, /* ??? */)
        }
        Operation::Delete(table_name, maybe_condition) => {
            // TODO
        }
    }
}

  response = interpret(...)
  knows_how_to_respond(response, client)


enum Response {
    Selected(impl Iter<???>), // TODO: How to do this? Some reference to an iterator somehow... slice..?
    Inserted(???),
    Deleted(usize), // how many were deleted
} 

fn select(table: Table, ColumnName



  • TODO: Consider streaming the response to the client and not just dumping 10K rows at once.
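One way to approach the streaming TODO is to let the interpreter push rows into a consumer one at a time instead of building the whole result first. The sketch below gives the open `SqlConsumer` trait a single hypothetical method, `on_row`, and shows a `select` that filters on one column and projects the requested columns. The `Table` here is a simplified stand-in with positional rows, and `Collect` is a test consumer; a server-side consumer would presumably write each row to the client's socket.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
enum DbValue {
    DbString(String),
    DbNumber(f64),
    DbUUID(u32),
}

type Row = Vec<DbValue>;

/// Hypothetical shape for `SqlConsumer`: receives one row at a time.
trait SqlConsumer {
    fn on_row(&mut self, row: Row);
}

/// A consumer that just collects rows in memory.
struct Collect(Vec<Row>);

impl SqlConsumer for Collect {
    fn on_row(&mut self, row: Row) {
        self.0.push(row);
    }
}

/// Simplified table: column names plus positional rows keyed by id.
struct Table {
    columns: Vec<String>,
    rows: BTreeMap<u32, Row>,
}

/// Resolves a column name to its position in a row.
fn position(columns: &[String], name: &str) -> Result<usize, String> {
    columns
        .iter()
        .position(|c| c.as_str() == name)
        .ok_or_else(|| format!("unknown column `{name}`"))
}

/// Streams every row matching the optional `column = value` condition,
/// projected to `selected`, and returns how many rows were sent.
fn select<C: SqlConsumer>(
    table: &Table,
    selected: &[&str],
    condition: Option<(&str, &DbValue)>,
    consumer: &mut C,
) -> Result<usize, String> {
    // Resolve all names up front so errors surface before any row is sent.
    let mut projection = Vec::new();
    for name in selected {
        projection.push(position(&table.columns, name)?);
    }
    let filter = match condition {
        Some((name, value)) => Some((position(&table.columns, name)?, value)),
        None => None,
    };

    let mut sent = 0;
    for row in table.rows.values() {
        if let Some((pos, value)) = filter {
            if &row[pos] != value {
                continue;
            }
        }
        consumer.on_row(projection.iter().map(|&p| row[p].clone()).collect());
        sent += 1;
    }
    Ok(sent)
}
```

Because the consumer sees rows as they are produced, memory use on the server stays proportional to one row rather than to the full result set.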

Server

  1. Client input parsing/validation: parse string input from the client into an Abstract Syntax Tree (AST) that represents the SQL query.
  2. Code gen (not necessary for MVP): generate bytecode for a more low-level VM from the SQL AST.
  3. VM (not necessary for MVP): implement a low-level VM that governs the in-memory db.
  4. Persistence: serialize the in-memory db state to a file. What format should it have? At first perhaps just a dumb JSON serialization? Deserialization has to be implemented as well.
  5. Client response: stream selected rows/status/error messages back to the client. What should the protocol look like? Take a look at Tabular Data Stream, or just respond with JSON.
  6. Concurrency control: consider what happens when two clients simultaneously wish to update the same row. There has to be some minimal mutual exclusion/locking.
  7. Security: should we worry about secure communication over TCP?
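For the concurrency control item, the simplest form of mutual exclusion is probably one coarse lock around the whole database state. The sketch below uses `Arc<RwLock<...>>` from the standard library so that reads can run in parallel while each update holds the lock exclusively. The `Salaries` map is a simplified stand-in for `Tables`, and `give_raise`, `salary_of` and `simulate_clients` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Simplified stand-in for the `Tables` type: id -> salary.
type Salaries = HashMap<u32, f64>;

/// One coarse lock around the entire database state.
type SharedDb = Arc<RwLock<Salaries>>;

/// Writers take the exclusive lock, so the read-modify-write of a row
/// cannot interleave with another client's update.
fn give_raise(db: &SharedDb, id: u32, amount: f64) -> bool {
    let mut guard = db.write().expect("lock poisoned");
    match guard.get_mut(&id) {
        Some(salary) => {
            *salary += amount;
            true
        }
        None => false,
    }
}

/// Readers take the shared lock and may run in parallel with each other.
fn salary_of(db: &SharedDb, id: u32) -> Option<f64> {
    db.read().expect("lock poisoned").get(&id).copied()
}

/// Simulates `clients` connections, each applying `updates` raises of 1.0
/// to the same row, and waits for all of them to finish.
fn simulate_clients(db: &SharedDb, id: u32, clients: usize, updates: usize) {
    let handles: Vec<_> = (0..clients)
        .map(|_| {
            let db = Arc::clone(db);
            thread::spawn(move || {
                for _ in 0..updates {
                    give_raise(&db, id, 1.0);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("client thread panicked");
    }
}
```

A single global lock serializes all writers, which is likely acceptable for an MVP; per-table or per-row locking would be a later refinement.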

Client

  1. Opens a TCP connection to the server.
  2. REPL: has to provide a basic REPL interface.
  3. Server response decoder: has to properly format the data/status/error messages from the server. Does the client have to parse the queries? Or is that only the server's responsibility?
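Before the client can decode a response, both sides need to agree on where one message ends and the next begins on the TCP stream. One common option, shown here only as an assumption and not as a decided protocol, is length-prefixed framing: a 4-byte big-endian length followed by a UTF-8 payload. `write_frame`, `read_frame` and the `MAX_FRAME` limit are hypothetical; the functions are generic over `Read`/`Write`, so they work with `std::net::TcpStream` as well as in-memory buffers.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on a single message, to avoid huge allocations.
const MAX_FRAME: usize = 16 * 1024 * 1024;

/// Writes one message as a 4-byte big-endian length followed by the payload.
fn write_frame<W: Write>(out: &mut W, payload: &str) -> io::Result<()> {
    if payload.len() > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    out.write_all(&(payload.len() as u32).to_be_bytes())?;
    out.write_all(payload.as_bytes())?;
    out.flush()
}

/// Reads one frame. Returns `Ok(None)` if the stream ends before a full
/// length header arrives, which this sketch treats as the peer hanging up.
fn read_frame<R: Read>(input: &mut R) -> io::Result<Option<String>> {
    let mut header = [0u8; 4];
    match input.read_exact(&mut header) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u32::from_be_bytes(header) as usize;
    if len > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut body = vec![0u8; len];
    input.read_exact(&mut body)?;
    String::from_utf8(body)
        .map(Some)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

The REPL could then send each query with `write_frame` and loop on `read_frame` until the server signals the end of the result; what goes inside each payload (JSON, a TDS-like format, plain text) is independent of this framing.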