Database schema for project management app
As a side project, I'm developing a desktop app for project management, especially for students, self-employed, freelancers, etc. as a general purpose productivity tool, not from any business or industrial perspective. The idea is to create projects/goals/missions and break them down into individual milestones and tasks, and then keep tracking them as they are getting done.
If it takes off, I have other features planned too like embedded notes, tree view for brainstorming items, MCQ puzzles, quote of the day, etc. But for the core project-milestone tracking, this is what I've come up with so far in sqlite
:
drop table if exists projects;
drop table if exists milestones;
drop table if exists tasks;
drop table if exists timesheet;
-- drop table if exists notes;
-- drop table if exists mindmaps;
-- drop table if exists mcq;
create table projects (
id integer primary key,
status text, -- (Pending/WIP/Complete)
name text, -- Learn to use Vim Editor
category text, -- (Work/Study/Other)
tags text, -- (Tech/PHP/Python/Emacs/Vim)
start_date datetime,
end_date datetime,
notes text
);
create table milestones (
id integer primary key,
project_id int references projects(id),
name text, -- Vim first steps
status text,
tags text
);
create table tasks (
id integer primary key,
milestone_id int references milestones(id),
name text, -- hands on with vimtutor
status text,
tags text,
weekly_hrs int, -- no. of hours to be spend on this task each week.
notes text
);
create table timesheet (
id integer primary key,
task_id int references tasks(id),
fdate datetime,
tdate datetime, -- tracked 60 minutes
notes text
);
-- @todo mcq/mindmaps/notes
The tracking will be done by the frontend desktop app I will code separately in Python using tkinter library. Can you suggest me any improvements in this?
Indexes are part of the schema! Don't forget them.
Hi! DBA here! You've got a solid foundation. I'm not going to create indexes here in the interest of time, but as @teaearlgraycold mentioned, they are super critical once your project scales beyond trivial sizes. Here's my other thoughts:
I highly recommend liquibase for managing schema versioning. It can be back-ported easily onto existing systems, so while you're in the early phases, I wouldn't necessarily bother if this isn't in your wheelhouse.
Tags are going to be a nightmare to query as-is as @winther mentioned. You're going to want something like below. While it's much more complicated SQL, it will be much faster.
You may want to consider formalizing your category and status columns in the database as well using a similar method. If you're gonna strongly type a schema, strongly type it. It'll save many edge cases and minor bugs later.
I prefer recording start_date, end_date for timesheet stuff as well, and maybe having a computed diff inserted as well if you need the performance gain for your use case. Either that or go super-simple and just record a raw integer for recorded seconds to make the math simpler. So many bugs come across from datetime computations.
Edit: Oh I completely forgot a major one:
Row locking is your enemy. If this is single-user, single device, you'll probably be fine. If there may be multiple users or devices editing the same data, it may be worthwhile implementing a versioning scheme so that if a row is updated while you're editing, your changes are rejected.
That's the traditional and most normalized way to do it, but there are some alternatives I'll mention:
It would be slightly simpler to not give each tag a separate id, so you need one less table. It gives up the possibility of renaming a tag easily, since you'd need to do a search and replace on projects_tag. Sometimes that's an acceptable tradeoff.
For SQLite, there is a JSON datatype, which will hold a list of strings. However, I went looking for a way to index it and then asked ChatGPT, and it seems that you'd need a trigger to update another table anyway.
Tildes uses Postgres, which gives you access to fancy datatypes. For Tildes, it looks like tags are stored in a column with the Postgres ltree datatype (it stands for "label tree"), which seems to be a way to store multiple hierarchical labels, and that's probably why you can search on part of a dotted path on Tildes. Looks like it supports special indexes, too.
Call me old fashioned, but generally I adhere to 'normalize till it hurts.' Denormalizing is OK, but you should know why you're doing it. Bear in mind that it's going to be much less space to store 100's of rows of tag names with IDs and millions of rows of integers than to store millions of rows of arbitrary-length strings. I know that's not neccessarily relevant to this particular use case, but that's why its my default.
Strong data types without type inferance go a long way to preventing shooting yourself in the foot.
User is searching by tag? With a tag table, you can provide feedback immediately if it doesn't exist, without having to query every object.
I try to avoid fancy stuff unless there's a real good reason to use it. Indexing arbitrary JSON just makes your life much more complicated if there is a straightforward way to normalize your data.
Makes sense.
With a tag table, you could have tags that exist but have no records attached, which is a different sort of existence, and possibly something that needs garbage collecting using a trigger. (Using a reference count that could get out of date?) Or maybe not, if it's a controlled vocabulary and you don't want tags disappearing.
From a performance point of view, it seems like an index on a tag column and a tag table are fairly similar in some ways and might have similar performance, depending on the database implementation? An index could be seen as a sort of table that automatically updates without needing to write triggers.
(But I don't know how well that works in the real world; haven't worked with databases in quite a while.)
Thank you, these are some great ideas! I also prefer the strong typing approach, yes.
I'm still trying to make up my mind on how to treat the extra items like mcq, quiz and arbitrary notes. Do they need to be associated to a project or stay separately? I'm also thinking what other useful features a user can have for project/goal tracking. For now, it is single-user only and desktop based.
In the regular usage, I can see only the timesheet table having a large number of rows over a period of time (depending on how many times a user starts/stops the milestone/task tracking). Nevertheless, having indexes on all tables regardless should help with performance, right?
If you want this to be used and not just a personal learning project, it is essential to slow down and spend enough time doing deeper market research first, before you start designing the schema and writing code.
There’s a lot of time/project/goal solutions already available. Find out what features the current offerings have in common and what differentiates them. Figure out a way to talk your target audience directly. Find out the breakdown of features that real users believe are essential (dealbreakers if omitted), non-essential but nice to have, disliked, hated, and missing.
Just some quick thoughts.
No matter how good the schema is, you will likely need to do some schema migrations at some point. Could be a good idea to get that setup from the start. Since you mention Python, so maybe look at Alembic?
Having tags as simply text irks me a bit. Would be hard to manage individual tags if it is just a string with commas. At the very least could be an array type or a separate table for more flexibility.
Good names are super important. So what is fdate and tdate? That is not super obvious.
That's what I was thinking too. Have tags as a separate table and maybe a join table for a many-to-many relationship. That way you could resue existing tags or add new ones. Then you could format the tags after you query them however you want.
Other folks have already raised a lot of the things I see glancing at your schema definition:
Additionally:
DATETIME
isn't a real SQL column type. You probably want those to beTIMESTAMP WITH TIME ZONE
. (Though my personal experience is that using any database time type sets you up for pain down the line as the database mangles your data, and you should sacrifice type safety for more certainty in how it will be handled: either anINTEGER
epoch timestamp, or a consistent stringification in aTEXT
column, depending on usage and preference.)UPDATE
orDELETE
. The main tradeoff is having to keep schemas in sync and query a separate table for history, but it greatly simplifies queries over the "latest" table and saves you from the DB admin hassle of dealing with very large tables some of which are hot but most of which are very, very cold.I want to caveat this with I don't have a ton of experience with database schemas, but have a bit of web dev experience.