Database schema for project management app - ~comp

teaearlgraycold

June 26, 2024

Link

Indexes are part of the schema! Don't forget them.

15 votes

[6]

vord

June 26, 2024 (edited June 27, 2024)

Link

Hi! DBA here! You've got a solid foundation. I'm not going to create indexes here in the interest of time, but as @teaearlgraycold mentioned, they are super critical once your project scales...

Hi! DBA here! You've got a solid foundation. I'm not going to create indexes here in the interest of time, but as @teaearlgraycold mentioned, they are super critical once your project scales beyond trivial sizes. Here's my other thoughts:

I highly recommend liquibase for managing schema versioning. It can be back-ported easily onto existing systems, so while you're in the early phases, I wouldn't necessarily bother if this isn't in your wheelhouse.

Tags are going to be a nightmare to query as-is as @winther mentioned. You're going to want something like below. While it's much more complicated SQL, it will be much faster.

create table tag (
	id integer primary key,
	name text,
);

-- Create one of these per things that use tags
create table projects_tag (
	project_id int,
	tag_id int,
);

-- To get project names based on a single tag
select p.name from projects p
where id in (select project_id from projects_tag
             inner join tags on tag.id = projects_tag.tag_id
             where tag.name = :searched_tag)

You may want to consider formalizing your category and status columns in the database as well using a similar method. If you're gonna strongly type a schema, strongly type it. It'll save many edge cases and minor bugs later.

I prefer recording start_date, end_date for timesheet stuff as well, and maybe having a computed diff inserted as well if you need the performance gain for your use case. Either that or go super-simple and just record a raw integer for recorded seconds to make the math simpler. So many bugs come across from datetime computations.

Edit: Oh I completely forgot a major one:

Row locking is your enemy. If this is single-user, single device, you'll probably be fine. If there may be multiple users or devices editing the same data, it may be worthwhile implementing a versioning scheme so that if a row is updated while you're editing, your changes are rejected.

11 votes

[3]
skybrian
June 27, 2024
Link Parent
That's the traditional and most normalized way to do it, but there are some alternatives I'll mention: It would be slightly simpler to not give each tag a separate id, so you need one less table....

That's the traditional and most normalized way to do it, but there are some alternatives I'll mention:

It would be slightly simpler to not give each tag a separate id, so you need one less table. It gives up the possibility of renaming a tag easily, since you'd need to do a search and replace on projects_tag. Sometimes that's an acceptable tradeoff.

For SQLite, there is a JSON datatype, which will hold a list of strings. However, I went looking for a way to index it and then asked ChatGPT, and it seems that you'd need a trigger to update another table anyway.

Tildes uses Postgres, which gives you access to fancy datatypes. For Tildes, it looks like tags are stored in a column with the Postgres ltree datatype (it stands for "label tree"), which seems to be a way to store multiple hierarchical labels, and that's probably why you can search on part of a dotted path on Tildes. Looks like it supports special indexes, too.

4 votes
1. [2]
  vord
  June 27, 2024
  Link Parent
  Call me old fashioned, but generally I adhere to 'normalize till it hurts.' Denormalizing is OK, but you should know why you're doing it. Bear in mind that it's going to be much less space to...
  
  Call me old fashioned, but generally I adhere to 'normalize till it hurts.' Denormalizing is OK, but you should know why you're doing it. Bear in mind that it's going to be much less space to store 100's of rows of tag names with IDs and millions of rows of integers than to store millions of rows of arbitrary-length strings. I know that's not neccessarily relevant to this particular use case, but that's why its my default.
  
  Strong data types without type inferance go a long way to preventing shooting yourself in the foot.
  
  User is searching by tag? With a tag table, you can provide feedback immediately if it doesn't exist, without having to query every object.
  
  I try to avoid fancy stuff unless there's a real good reason to use it. Indexing arbitrary JSON just makes your life much more complicated if there is a straightforward way to normalize your data.
  
  7 votes
  1. skybrian
    June 27, 2024
    Link Parent
    Makes sense. With a tag table, you could have tags that exist but have no records attached, which is a different sort of existence, and possibly something that needs garbage collecting using a...
    
    Makes sense.
    
    With a tag table, you could have tags that exist but have no records attached, which is a different sort of existence, and possibly something that needs garbage collecting using a trigger. (Using a reference count that could get out of date?) Or maybe not, if it's a controlled vocabulary and you don't want tags disappearing.
    
    From a performance point of view, it seems like an index on a tag column and a tag table are fairly similar in some ways and might have similar performance, depending on the database implementation? An index could be seen as a sort of table that automatically updates without needing to write triggers.
    
    (But I don't know how well that works in the real world; haven't worked with databases in quite a while.)
    
    3 votes
[2]
pyeri
June 27, 2024
Link Parent
Thank you, these are some great ideas! I also prefer the strong typing approach, yes. I'm still trying to make up my mind on how to treat the extra items like mcq, quiz and arbitrary notes. Do...

Thank you, these are some great ideas! I also prefer the strong typing approach, yes.

I'm still trying to make up my mind on how to treat the extra items like mcq, quiz and arbitrary notes. Do they need to be associated to a project or stay separately? I'm also thinking what other useful features a user can have for project/goal tracking. For now, it is single-user only and desktop based.

In the regular usage, I can see only the timesheet table having a large number of rows over a period of time (depending on how many times a user starts/stops the milestone/task tracking). Nevertheless, having indexes on all tables regardless should help with performance, right?

1 vote
1. DrStone
  June 27, 2024
  Link Parent
  If you want this to be used and not just a personal learning project, it is essential to slow down and spend enough time doing deeper market research first, before you start designing the schema...
  
  I'm also thinking what other useful features a user can have for project/goal tracking
  
  If you want this to be used and not just a personal learning project, it is essential to slow down and spend enough time doing deeper market research first, before you start designing the schema and writing code.
  
  There’s a lot of time/project/goal solutions already available. Find out what features the current offerings have in common and what differentiates them. Figure out a way to talk your target audience directly. Find out the breakdown of features that real users believe are essential (dealbreakers if omitted), non-essential but nice to have, disliked, hated, and missing.
  
  1 vote

[2]

winther

June 26, 2024

Link

Just some quick thoughts. No matter how good the schema is, you will likely need to do some schema migrations at some point. Could be a good idea to get that setup from the start. Since you...

Just some quick thoughts.

No matter how good the schema is, you will likely need to do some schema migrations at some point. Could be a good idea to get that setup from the start. Since you mention Python, so maybe look at Alembic?

Having tags as simply text irks me a bit. Would be hard to manage individual tags if it is just a string with commas. At the very least could be an array type or a separate table for more flexibility.

Good names are super important. So what is fdate and tdate? That is not super obvious.

6 votes

TheRTV
June 26, 2024
Link Parent
That's what I was thinking too. Have tags as a separate table and maybe a join table for a many-to-many relationship. That way you could resue existing tags or add new ones. Then you could format...

Having tags as simply text irks me a bit. Would be hard to manage individual tags if it is just a string with commas. At the very least could be an array type or a separate table for more flexibility.

That's what I was thinking too. Have tags as a separate table and maybe a join table for a many-to-many relationship. That way you could resue existing tags or add new ones. Then you could format the tags after you query them however you want.

6 votes

whbboyd

June 27, 2024

Link

Other folks have already raised a lot of the things I see glancing at your schema definition: Everything which is its own "thing" should have its own table. Definitely tags, as discussed, but...

Other folks have already raised a lot of the things I see glancing at your schema definition:

Everything which is its own "thing" should have its own table. Definitely tags, as discussed, but probably also category and maybe status. ("Represent enum as scalar versus foreign key to an enum table" is definitely not a settled discussion; do whatever feels best to you.)

Additionally:

SQLite's support for types in schemas is abysmal, and so it hasn't warned you that DATETIME isn't a real SQL column type. You probably want those to be TIMESTAMP WITH TIME ZONE. (Though my personal experience is that using any database time type sets you up for pain down the line as the database mangles your data, and you should sacrifice type safety for more certainty in how it will be handled: either an INTEGER epoch timestamp, or a consistent stringification in a TEXT column, depending on usage and preference.)
Speaking of which, it's not immediately clear whether you intend your date columns to be dates ("2024-02-29") or instants in time ("2024-02-29T06:30-5"). The specificity of instants seems greater than necessary for the "project" level (and you've named the date columns "date"), and if you round them to the date level, the date will change based on timezone. Maybe that's okay for your use cases! But it's definitely something you should consider. (Fortunately, since you're just recording past times or projecting out a single fixed end time, you don't have to deal with any of the really weird time stuff, like "what do I do with a repeating event at 2:30AM when DST changes and that time happens zero or two times" and "when DST rules change, how do I make sure things are still at the right times".)
@vord mentioned row locking and versioning. (The keywords you're looking for for more details on that are "optimistic locking".) I'm going to hit a related point: you're almost certainly eventually going to want to record history. (n.b. your schema includes time, so this will make your data model bitemporal, something you'll want to embrace rather than try to work around.) There are a wide variety of ways to do this (tombstones, lifetime windows, etc.); the approach I've found to work best is to store history on a separate table and establish triggers to copy the old version of a row out on UPDATE or DELETE. The main tradeoff is having to keep schemas in sync and query a separate table for history, but it greatly simplifies queries over the "latest" table and saves you from the DB admin hassle of dealing with very large tables some of which are hot but most of which are very, very cold.

5 votes

searover

June 26, 2024

Link

I want to caveat this with I don't have a ton of experience with database schemas, but have a bit of web dev experience. Have a think about how you would search for tags. Would it be best in a...

I want to caveat this with I don't have a ton of experience with database schemas, but have a bit of web dev experience.

Have a think about how you would search for tags. Would it be best in a separate table with tag and id? And then you could search that table for all tags and then load the IDs.
Same thing for categories.
I don't know how you're statuses are going to work but if they are going to change or be configuravke you could store an id and have a table with rows for each status name and id. Like an enum.

3 votes