Ask Tildes: Design practices for retrieving dozens (or hundreds) of related records over a RESTful API
I'm looking for some feedback on a feasible mechanism for structuring a few API endpoints where a purely RFC-spec compliant REST API wouldn't suffice.
I have an endpoint which returns `child` entries for a `parent` resource; let's call it `/api/parent/:parentId/children`. There could be anywhere from a dozen to several hundred children returned from this call. From here, a `child` entity is related to a single `userOrganization`, which is itself a pivot entity on a single `user`. The relationship between a `child` and a `user` is not strictly transitive, but each `child` has only one `userOrganization`, which in turn has only one `user`, so it is trivial to reach a `user` from a `child`.
Given this, the data I need for this particular request involves retrieving all `user`s for a `parent`. The obvious, and incorrect, solution to the problem is to make the request mentioned above, then iterate through the results and make an API request to retrieve each `user`. This is far from ideal, as it would obviously amount to up to several hundred API calls.
There are a few more scalable solutions that could solve this problem, so any input on these ideas is welcome; but if you have a better proposal that also works, I'm keen to explore that!
**Return the `user` relationships in the call by default**
This certainly solves the problem, but it also pumps down a load of data I don't necessarily need. It would probably double the number of bytes travelling along the wire, and in 8 out of 10 calls that extra data isn't needed.
**Have a separate `/api/parent/:parentId/users` endpoint**
Another option, which only partially solves the issue: I need data from both the `child` and the `user` to format this view, so I'd still need to make the initial call I documented earlier. Semantically, it also feels a bit odd to have this as a resource, because I don't consider a `user` to be nested under a `parent` in terms of database topology.
**Keep the original call, but add a query parameter to fetch the extra data**
This comes across as objectively the 'least worst' idea in terms of flexibility and design: through the addition of a query parameter, you could optionally retrieve the relationship's data. It seems brittle, though, and doesn't scale well to other endpoints where it could be useful.
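As a rough sketch of the opt-in query-parameter idea (all names and the in-memory data standing in for the database are my own assumptions, not anything from a real codebase), the handler could batch-resolve the related `user` records only when the caller asks for them:

```python
# Hypothetical in-memory data: parent_id -> children, each tied to exactly
# one userOrganization, which points at exactly one user.
CHILDREN = {
    1: [
        {"id": 10, "user_organization": {"id": 100, "user_id": 1000}},
        {"id": 11, "user_organization": {"id": 101, "user_id": 1001}},
    ]
}
USERS = {1000: {"id": 1000, "name": "Ana"}, 1001: {"id": 1001, "name": "Ben"}}

def get_children(parent_id, include_users=False):
    """Return a parent's children; embed each related user only on request,
    resolving them via the child -> userOrganization -> user chain."""
    children = [dict(c) for c in CHILDREN.get(parent_id, [])]
    if include_users:
        for c in children:
            c["user"] = USERS[c["user_organization"]["user_id"]]
    return children
```

The default response stays lean for the 8-out-of-10 calls that don't need the extra data, and a single `?include=users`-style flag switches on the embedded relation for the view that does.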
**Utilize a Stripe `expands`-style query parameter**
Stripe implements the ability to retrieve all related records from an API endpoint by specifying the relations as strings. This is essentially the same as the above answer, but scaled to all available API endpoints. I love this idea, but implementing it in a secure way seems fraught with disaster. For example, this is a multi-tenanted application, and it would be trivial to request `userOrganization.user.organizations.users`, which would retrieve all of the user's other organisations, and their users! This is because my implementation of `expands` simply utilises the ORM of my choice to perform a database join, and of course the database has no knowledge of application tenancy!
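One common mitigation for the tenancy-escape problem described above is to validate requested expand paths against an explicit allowlist before they ever reach the ORM, so the join can only follow relations you have deliberately exposed. A minimal sketch, with assumed path names:

```python
# Per-endpoint allowlist of expandable relation paths. Anything not listed
# here is rejected outright, so a caller cannot walk the object graph past
# the tenancy boundary (e.g. userOrganization.user.organizations.users).
ALLOWED_EXPANDS = {
    "userOrganization",
    "userOrganization.user",
}

def validate_expands(requested):
    """Raise ValueError for any expand path not explicitly allowlisted."""
    bad = [path for path in requested if path not in ALLOWED_EXPANDS]
    if bad:
        raise ValueError(f"illegal expand path(s): {bad}")
    return requested
```

An allowlist is deliberately dumb: it doesn't try to reason about tenancy, it just refuses anything you haven't audited, which is why it stays safe even though the database itself knows nothing about tenants.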
Now, I do realise this problem could easily be solved by implementing a GraphQL API server, which I have done in the past, but unfortunately time and workload constraints dictate that a GraphQL-based solution is infeasible. As much as I like GraphQL, I'm not as proficient in that area as I am at implementing high-quality traditional APIs, and the applications I'm working on at the moment focus on choosing boring technology, not spending excessive innovation tokens.
Furthermore, I consider the concepts around REST APIs to be more of an aspirational sliding scale than a well-defined physical entity. Let's face it: the majority of popular APIs today aren't REST-compliant (even Stripe's isn't), and it's usually both financially healthier and more feature-rich to choose a development path that results in a rough product that can be refined later, rather than aiming for a perfect initial release. All this said, I don't mind proposals or solutions to my problem that are merely "good enough", as long as they aren't too hacky! :)
I was API lead for a company that used the `expands` style in an unrestricted way, and can confirm that it was operationally super complicated to deal with. Identifying slow queries made by third parties, and then dealing with them, was super challenging.
Is this API also for third-party consumption, or only first-party? That could inform how tolerant to later change any choice you make here will be. I.e. if you chose the `expands` style but later determined it to be a Bad Idea, it may be more tractable to undo.
One other middle ground I’ve seen some APIs provide is a response “style”, like “full”, “thin”, “ids”, etc. This can be more constrained but could still allow you to embed users in a full response type and could be more generalizable to other endpoints you make. I honestly don’t love it because it’s somewhat opaque but it is a middle ground.
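The response-"style" middle ground can be sketched as a single serializer that picks a depth per request; the field names and style labels here are illustrative assumptions, not from any particular API:

```python
def serialize_child(child, style="thin"):
    """Serialize a child record at one of three assumed response styles:
    'ids' (identifier only), 'thin' (default fields), 'full' (embeds the
    related user for callers that need the whole view)."""
    if style == "ids":
        return {"id": child["id"]}
    if style == "thin":
        return {"id": child["id"], "name": child["name"]}
    if style == "full":
        return {"id": child["id"], "name": child["name"], "user": child["user"]}
    raise ValueError(f"unknown style: {style!r}")
```

Compared with free-form `expands`, the server controls exactly which shapes exist, which is what makes the approach more constrained but also more opaque to clients.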
If I were you I’d probably go with the one off solution for now if this is the first time you are running into the problem. Maybe down the line if this is a repeated issue you could look into GraphQL.
Good question—I should have included that in my post, it’s for first-party consumption only, third parties shouldn’t ever need this information.
And thanks for sharing; it doesn't surprise me that expanding arbitrary parameters is a complicated API design. It appears simple from a 10,000ft view, but when you throw in authorisation, multi-tenancy, and conditional resource access, it's quite complicated and prone to error.
I like the idea of `expands`, fundamentally, but it requires cautious design regardless. You're right that it may be worth implementing cheaply now and rolling back later if needed.
From what I can glean from your post, I would probably argue in favor of using `/api/parent/:parentId/users`, but I'm not sure I'm understanding the relationships well enough. Perhaps a diagram might help?
Have you measured this approach and seen that it is actually problematic? I believe HTTP/2 alleviates a lot of the overhead involved (QUIC goes even further), so multiple requests might not be as much of an issue...
HTTP/2 solves the issue of transport inefficiency, but does nothing to alleviate service inefficiency. Each HTTP request probably triggers several DB queries, and the flood of requests will fill up the application's request queue, possibly preventing other consumers of the API from getting responses.
In general for information retrieval it's not a good idea to scale the number of database queries linearly with the number of items in the request.
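The scaling point above can be made concrete: resolve all N related records in one batched query (a single SQL `IN` list) rather than N per-item queries. A minimal sketch using the standard-library `sqlite3` module, with an illustrative schema:

```python
import sqlite3

# Toy users table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cal');
""")

def fetch_users_batched(user_ids):
    """One round trip for all ids, instead of len(user_ids) queries."""
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    return dict(rows)
```

Whether the endpoint embeds users or exposes them separately, this is the shape the data access should take server-side: query count stays constant as the number of children grows.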
True, those could be issues. Measuring the specific use case could yield some insight into whether providing a more optimized endpoint is warranted. I think there is something in the spec for pushing extra data from the server to the client that the server thinks the client will need, e.g. typically some webpage resources. I'm not sure whether that has been used in REST API design to push extra data following a request, though... I'm a bit fuzzy on the details around this :)
Yeah, as @blitz expounded, it just doesn't scale well in that direction. Even with HTTP/2 there'd still be a significant time and service overhead—and it would tax the database non-trivially. It's easy to implement as a patch for now like this, but it needs re-architecting in a more performant and semantic way.
In my opinion, the most REST-ful solution is to have smart clients that do heavy caching of intermediary resources and only hit the API when the cache expires. This, coupled with an out-of-band cache-warming mechanism, would ensure that request numbers stay relatively low. Of course, the lifetimes of your resources need to be long enough for this to be feasible.
This post popped up on HN which might be of interest: https://evertpot.com/h2-parallelism/
It has some benchmarks related to using HTTP2 for what looks to be a similar scenario.