# MiniSQL

## Official Description

MiniSQL server

Create a simple SQL server that supports SELECT (including column selection), INSERT and DELETE. The SELECT and DELETE statements support a WHERE clause with a single column. It is also possible to create indexes (hashes). The database works with persistent storage: it can be turned off and on without data loss. Create an interface for working with the database, either a CLI or a desktop application.

## Resources

### DB internals

* [CMU intro to Database Systems](https://www.youtube.com/playlist?list=PLSE8ODhjZXjaKScG3l0nuOiDTTqpfnWFf)
* Tutorial on a sqlite clone in C: [Let's build a Simple Database. How Does a Database work?](https://cstack.github.io/db_tutorial/).
* [sqlite opcodes](https://www.sqlite.org/opcode.html)

### Parsing

Parser combinator libraries

* [nom](https://github.com/rust-bakery/nom)
* [parser_combinators](https://docs.rs/parser-combinators/latest/parser_combinators/)

### TCP socket programming

* See the [simple http server](https://youtu.be/hzSsOV2F7-s) implementation with TCP sockets

## Scope

* Primarily an in-memory db.
* Occasionally saves its state to disk.
* Almost non-existent concurrency control?
* Inspiration from SQLite, but not for the server part. For the server part take a look at Postgres.
* Can create custom column indexes.
* What about joins?
* What about composite queries (i.e. instead of a table name in a select expression we include another select expression)?

# Building a Minimum Viable Product (MVP)

Possible usage:

* You start the db server with `./minisql server start --db path/to/db/my-db.db --port 1433`, which will store the database as the file `path/to/db/my-db.db` and open a TCP server on port `1433`.
* Then, possibly on a different machine, you run `./minisql client connect server_ip_address:1433` to start a client.
  This will open a REPL with which you can send queries/db management commands.
* TODO: We should also consider writing a Rust library that allows you to spin up a client that connects to the server. What would the interface look like?

```
use minisql::{DB, DBConnection};

let maybe_conn: Result<DBConnection, _> = DB.connect({ port: 1433, db_name: "db-name", username: "bojack", password: "12345" });
let conn: DBConnection = maybe_conn?;
// then we can execute queries
conn.perform("SELECT id, name, title, salary FROM employees"): ???
```

* Then with the client CLI we can request creation of tables, insertion of data, etc.

```
> CREATE TABLE persons(id u32 PRIMARY KEY, name String, salary Float);
> INSERT 1, "Alice", 20.0 INTO persons;
> INSERT 2, "Bob", 30 INTO persons;
> INSERT 3, "Claire", 15 INTO persons;
```

* What should the SQL AST look like? For example, when the server parses `SELECT id, name FROM persons;`, what will the parsing output look like? Consider something like

```
// TODO: Parser has access to all table metadata
// Could also be called `SQLAbstractSyntaxTree`
enum Operation {
    Select(TableName, ColumnSelection, Option<Condition>),
    Insert(TableName, Vec<(ColumnName, DbValue)>), // String because we don't yet know which type of value this is for sure
    Delete(TableName, Option<Condition>),
    // Update(...),
}

enum ColumnSelection {
    All,
    Columns(Vec<ColumnName>),
}

enum Condition {
    // And(Condition, Condition),
    // Or(Condition, Condition),
    // Not(Condition),
    Eq(ColumnName, DbValue),
    // LessOrEqual(ColumnName, DbValue)
    // Less(ColumnName, DbValue)
    // StringCondition(StringCondition)
}

enum StringCondition {
    Prefix(ColumnName, String),
    Substring(ColumnName, String),
}

// INSERT 123
```

* We also have to write an interpreter for these operations. How will the db-state be represented in memory? For example, how can we implement a table?

```
// Not exactly efficient, but how could we do better?
enum DbValue {
    DbString(String),
    DbNumber(f64),
    DbUUID(u32),
}

// We also need a type of db-types
enum DbType {
    TString,
    TNumber,
    TId,
}

// We also need a function that for a given value computes its type (for validation)
fn value_to_type(db_val: DbValue) -> DbType

// table-metadata and data
type TableName = String
type ColumnName = String
type ColumnPosition = u32

// Note that it is nice to split metadata from the data because
// then you can give the metadata to the parser without giving it the data.
struct TableMetaData {
    name: TableName, // TODO: Is this really necessary? probably not
    columns: Vec<(ColumnName, DbType, ColumnPosition)>,
}

fn column(TableMetaData, ColumnName) -> ColumnPosition

struct Table {
    meta: TableMetaData,
    rows: Rows, // defined below
    indexes: BTree, // TODO: Consider generalizing ColumnName to something that would also apply to a pair of ColumnNames etc
}

type Tables = HashMap<TableName, Table>

// The below type is a type of a table row
type Row = HashMap<ColumnName, DbValue> // Or you know... some appropriate Dictionary Type
HashMap::make![("id", 1), ("name", "Alice"), ("salary", 20.0)] : Row

type Rows = BTree

// possible optimization: have a mapping
//   column names ~> indexes
// so that we could represent rows as
type Row = Vec<DbValue>

// How to represent a table?
table : HashMap Vec<(Id, Row)>.

// suppose the row corresponds to 'INSERT 1, "Alice", 20.0 INTO persons;'
Row ~> Vec<DbValue>
e.g. Row ~> vec![DbUUID 1, DbString "Alice"]
Vec<Vec<DbValue>>
```

* Interpreter

```
trait SqlConsumer {
    // TODO: ???
}

fn interpret<T: SqlConsumer>(operation: Operation, tables: &mut Tables, consumer: T) -> () {
    // TODO: lock stuff
    match operation {
        Select(table_name, column_selection, maybe_condition) => {
            let table: Table = ...
            // TODO: Wrap this into a response
            select(table, column_selection, maybe_condition, consumer)
        },
        Insert(table_name, values) => {
            insert(table, ???)
        }
        Delete(table_name, maybe_condition) => {
        }
    }
}

response = interpret(...)
knows_how_to_respond(response, client)

enum Response {
    Selected(impl Iterator<Item = Row>), // TODO: How to do this? Some reference to an iterator somehow... slice..?
    Inserted(???),
    Deleted(usize), // how many were deleted
}

fn select(table: Table, ColumnName
```

* TODO: Consider streaming the response to the client and not just dumping 10K rows at once.

## Server

1. **Client input parsing/validation** Turn the string input from the client into an Abstract Syntax Tree (AST) that represents the SQL query.
2. **Code gen** (Not necessary for MVP) Generate bytecode for a more low-level VM from the SQL AST.
3. **VM** (Not necessary for MVP) Implement a low-level VM that governs the in-memory db.
4. **Persistence** Serialize the in-memory db state to a file. What format should it have? At first perhaps just a dumb JSON serialization? You also have to implement the deserialization.
5. **Client response** Stream selected rows/status/error messages back to the client. What should the protocol look like? Take a look at [Tabular Data Stream](https://en.wikipedia.org/wiki/Tabular_Data_Stream) or just respond with JSON.
6. **Concurrency control** Consider what happens when two clients simultaneously wish to update the same row. There has to be some minimal mutual exclusion/locking.
7. **Security** Should we worry about secure communication over TCP?

## Client

1. **Opens TCP connection to server**
2. **REPL** Has to provide a basic REPL interface.
3. **Server response decoder** Has to properly format the data/status/error messages from the server.

Does the client have to parse the queries? Or is that only the server's responsibility?
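
To make the AST and interpreter notes from the MVP section more concrete, here is a minimal sketch of how they could fit together for in-memory tables. It is one possible shape under stated assumptions, not a settled design: rows are stored positionally as `Vec<DbValue>` (the "possible optimization" from the notes), `Insert` takes a full positional row, `Response::Selected` holds a plain `Vec` instead of a streaming iterator, errors are plain `String`s, and the helpers `table_mut`, `resolve` and `matches` are hypothetical names introduced only for this example. Indexes, locking and persistence are left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum DbValue { DbString(String), DbNumber(f64), DbUUID(u32) }

pub type TableName = String;
pub type ColumnName = String;
/// Assumption: positional rows, values stored in the table's column order.
pub type Row = Vec<DbValue>;

pub enum ColumnSelection { All, Columns(Vec<ColumnName>) }
/// Only equality on a single column, as in the official description.
pub enum Condition { Eq(ColumnName, DbValue) }

pub enum Operation {
    Select(TableName, ColumnSelection, Option<Condition>),
    Insert(TableName, Row),
    Delete(TableName, Option<Condition>),
}

#[derive(Debug, PartialEq)]
pub enum Response { Selected(Vec<Row>), Inserted(usize), Deleted(usize) }

pub struct Table { pub columns: Vec<ColumnName>, pub rows: Vec<Row> }
pub type Tables = HashMap<TableName, Table>;

impl Table {
    /// Maps a column name to its position in a row.
    fn position(&self, column: &str) -> Result<usize, String> {
        self.columns.iter().position(|c| c == column)
            .ok_or_else(|| format!("unknown column: {column}"))
    }

    /// Resolves a condition to (column position, expected value), validating the column.
    fn resolve(&self, cond: Option<Condition>) -> Result<Option<(usize, DbValue)>, String> {
        match cond {
            Some(Condition::Eq(col, val)) => Ok(Some((self.position(&col)?, val))),
            None => Ok(None),
        }
    }
}

/// A row matches when there is no condition or the target column equals the value.
fn matches(row: &Row, target: &Option<(usize, DbValue)>) -> bool {
    match target {
        Some((i, v)) => &row[*i] == v,
        None => true,
    }
}

fn table_mut<'a>(tables: &'a mut Tables, name: &str) -> Result<&'a mut Table, String> {
    tables.get_mut(name).ok_or_else(|| format!("unknown table: {name}"))
}

pub fn interpret(op: Operation, tables: &mut Tables) -> Result<Response, String> {
    match op {
        Operation::Insert(name, row) => {
            let table = table_mut(tables, &name)?;
            if row.len() != table.columns.len() {
                return Err(format!("expected {} values, got {}", table.columns.len(), row.len()));
            }
            table.rows.push(row);
            Ok(Response::Inserted(1))
        }
        Operation::Delete(name, cond) => {
            let table = table_mut(tables, &name)?;
            let target = table.resolve(cond)?;
            let before = table.rows.len();
            table.rows.retain(|row| !matches(row, &target));
            Ok(Response::Deleted(before - table.rows.len()))
        }
        Operation::Select(name, selection, cond) => {
            let table = table_mut(tables, &name)?;
            let target = table.resolve(cond)?;
            // Positions of the requested columns, in the requested order.
            let positions: Vec<usize> = match selection {
                ColumnSelection::All => (0..table.columns.len()).collect(),
                ColumnSelection::Columns(cols) => cols.iter()
                    .map(|c| table.position(c))
                    .collect::<Result<_, _>>()?,
            };
            let rows: Vec<Row> = table.rows.iter()
                .filter(|row| matches(row, &target))
                .map(|row| positions.iter().map(|&i| row[i].clone()).collect::<Row>())
                .collect();
            Ok(Response::Selected(rows))
        }
    }
}
```

With a `persons` table holding the three rows from the REPL example, `Select("persons", Columns(["name"]), Some(Eq("id", DbUUID(2))))` would yield a single row containing `DbString("Bob")`. Returning an owned `Vec<Row>` sidesteps the open question about iterators in `Response`, at the cost of copying every selected value; a streaming design would need a different shape.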