ubr's recent activity
-
Comment on Show Tildes: we built the world's first legal AI API in ~tech
ubr (edited)
This looks really great. For my own part, I am happy to see anyone on tildes posting on what they are doing, so long as its interesting.
Thank you, much appreciated :)
How can you assure privacy, confidentiality, and/or security with uploaded contracts? I can imagine that such documents might be sensitive and a client might want to know that use of the tool won't constitute a liability.
Should users anonymize or scrub their documents before uploading?
Some good questions. In terms of security, we use industry-standard security practices like encrypting and hashing data, rotating API keys, following the principle of least privilege, etc. to protect our services.
In terms of privacy and confidentiality, our policy is to retain data only for as long as necessary to ensure there is no abuse of our services going on before we delete it. We also don't use private user data sent to our API to train our models without users' consent, except for, in the case of abuse of our services, improving our internal abuse detection capabilities.
For highly sensitive workloads though, we actually also offer self-hosting, where users can run our models on their own hardware, including inside air-gapped environments, and ensure that no data ever leaves their hands.
I read the statement at the bottom of the linked webpage and thought you might want to contact Chris Kavelin, as it would seem that your interests might dovetail.
Thanks for the tip!
I work in scientific research and would love to have a screening tool for scientific papers where I can do something similar to, say, grabbing a ton of papers on ArXiv and then querying them for specific information using something like IQL. Is this a pipe-dream or an existing capability?
Actually yes, that would be possible. Theoretically, IQL can be applied to the extraction and classification of any arbitrary data using any arbitrary text extraction or text classification model.
For the short- and medium-term at least though, our focus is on building specifically legal AI models.
The Kanon Universal Classifier could be used for what you're looking for but it may not give you the best accuracy possible.
We are open to the possibility of licensing our technology to others to build their own domain-equivalent tools.
-
Comment on Show Tildes: we built the world's first legal AI API in ~tech
ubr
For a product, it does seem to be very technical still. Requiring prompts via API/command line, and even using a DSL. I suspect a frontend UI - be it web-based or otherwise - would really make things more accessible for non-programmers.
You're right and indeed we're planning on releasing an API playground next, hopefully followed by, later in the year, some apps.
We thought we'd start with the API first because, right now, our team doesn't have a lot of frontend expertise (though we've learned quite a bit through building our platform ourselves).
We're hoping to use the API to quickly ship more models and thereby generate enough traction to expand our team.
Maybe with a micro-LLM for converting English prompts to your DSL. Is that on the roadmap?
Bingo! We're currently exploring how we can use LLMs to both perform prompt optimization (instead of us having to tune them ourselves by hand) as well as translate natural language queries into our DSL for less technical users.
We're also working on sharing a prompt that users can feed to their own agents to then have those agents be able to write their own queries, which should still save them money, time and often improve accuracy when running through large volumes of lengthy legal documents.
-
Comment on Show Tildes: we built the world's first legal AI API in ~tech
ubr (edited)
Sorry about that! I tend to be more of a lurker on Tildes. If the mods think it's too much, I'm happy to take this down for now. Certainly did not intend to cause offence!
-
Comment on Show Tildes: we built the world's first legal AI API in ~tech
ubr
Hey ~tech,
Over the past couple months, we, a team of Aussie legal and AI experts, have been working on building a new type of legal AI company — a company that, instead of trying to automate legal jobs, is trying to automate legal tasks.
We want to make lawyers’ lives easier, not replace them.
We’ve been laser-focused on building small and efficient yet still highly accurate, specialized models for some of the most time-consuming and mundane legal tasks lawyers have to perform. Stuff like running through a thousand contracts just to locate any clauses that would allow you to get out early.
We just finished training our first set of models, focused on document and clause classification, probably the most common problem we see come up. Our benchmarks show our models to be far more accurate and almost more efficient than their closest general-purpose competitors.
Today, we’re making those models publicly available via the Isaacus API, the world’s first legal AI API.
Our models don’t require any finetuning because they’re zero-shot classifiers — you give them a description of what you’re looking for (for example, “This is a confidentiality clause.”) and they pop out a classification score.
Because our models are so small, which they have to be to be able to process reams of legal data at scale, they can sometimes be a bit sensitive to prompts. To help with that, however, we’ve preoptimized an entire library of prompts, including what we call universal templates, which let you plug in your own arbitrary descriptions of what you’re looking for.
We’ve made our prompt library available via the Isaacus Query Language or IQL. Another world first — it’s a brand-new AI query language designed specifically for using AI models to analyze documents.
You can invoke query templates using the format {IS <query_template_name>}. You can also chain queries together using Boolean and mathematical operators, like so: {This is a confidentiality clause.} AND {IS unilateral clause}.
We think our API is pretty neat and we hope you will too.
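To make the IQL syntax described above a little more concrete, here is a minimal Python sketch that only assembles query strings in the two forms shown in this post. The helper names (template, statement, all_of) are hypothetical and are not part of any Isaacus SDK; the sketch does not call the API or evaluate anything.

```python
def template(name: str) -> str:
    """Wrap a query template name in the {IS <query_template_name>} form."""
    return "{IS " + name + "}"


def statement(text: str) -> str:
    """Wrap a free-form description in braces."""
    return "{" + text + "}"


def all_of(*queries: str) -> str:
    """Chain queries together with the Boolean AND operator."""
    return " AND ".join(queries)


# Rebuilds the chained example from the post (hypothetical helpers, string-building only).
query = all_of(
    statement("This is a confidentiality clause."),
    template("unilateral clause"),
)
print(query)
```

Building queries from small helpers like these is just one way to keep braces balanced when composing longer expressions; writing the query string by hand works equally well.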
This is just the beginning for us — over the course of this year, we’re planning on releasing text extraction and embedding models as well as a second generation of our Kanon legal foundational model.
Here are some quick links for your convenience:
-
Show Tildes: we built the world's first legal AI API
22 votes -
Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp
ubr
I actually didn't use BERT for topic modelling; the embeddings were created by a far more advanced specialised model called BAAI/bge-small-en-v1.5, although it is still a transformer. A more rudimentary algorithm, HDBSCAN, was used for clustering, but then I manually reviewed all 507 clusters and reduced them to 19 branches of law. I do mention that I modelled my process off BERTopic, which is probably where the confusion came from. I'm not too sure why it's still called that when it doesn't even use BERT anymore 😅.
Thank you for the positive feedback, it is much appreciated. And apologies if the article seems a little too complex. The underlying process is 'pretty simple', but my implementation may be more involved because working with over 200k long-form documents meant having to build much more efficient code from scratch than what is available in the BERTopic Python library.
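As a rough illustration of the manual review step mentioned above (collapsing many clusters into a few branches of law), here is a small Python sketch. The cluster IDs, the mapping and the "Unassigned" bucket are all made up for the example; the only outside fact it relies on is that HDBSCAN labels noise points as -1. It is not the code from the article.

```python
from collections import Counter

# Hypothetical hand-made mapping from cluster label to branch of law.
CLUSTER_TO_BRANCH = {
    0: "Migration law",
    1: "Migration law",
    2: "Family law",
    3: "Criminal law",
}


def assign_branches(cluster_labels, mapping, unassigned="Unassigned"):
    """Map each document's cluster label to a branch.

    Labels missing from the mapping (including HDBSCAN's noise label, -1)
    fall into the `unassigned` bucket.
    """
    return [mapping.get(label, unassigned) for label in cluster_labels]


# Toy cluster labels for six documents; -1 marks a noise point.
labels = [0, 1, 2, -1, 3, 1]
branches = assign_branches(labels, CLUSTER_TO_BRANCH)
counts = Counter(branches)
print(counts)
```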
-
Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp
ubr
Hey Tildes,
After months of hard work, I am excited to share the first ever semantic map of Australian law.
My map represents the first attempt to map Australian laws, cases and regulations across the Commonwealth, States and Territories semantically, that is, by their underlying meaning.
Each point on the map is a unique document in the Open Australian Legal Corpus, the largest open database of Australian law (which, full disclosure, I created). The closer any two points are on the map, the more similar they are in underlying meaning.
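For readers wondering what "similar in underlying meaning" can look like numerically, here is a toy Python sketch that compares embedding vectors with cosine similarity. This is an assumption for illustration only: the post does not say which similarity measure or projection the map uses, and the three-dimensional vectors below are invented stand-ins for real document embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings": doc_a and doc_b point in nearly the same
# direction, while doc_c is orthogonal to both.
doc_a = [1.0, 0.0, 1.0]
doc_b = [1.0, 0.0, 0.9]
doc_c = [0.0, 1.0, 0.0]

print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))
```

Under this measure, documents whose vectors point in similar directions score close to 1, and unrelated ones score close to 0.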
As I cover in my article, there’s a lot you can learn by mapping Australian law. Some of the most interesting insights to come out of this initiative are that:
⦁ Migration, family and substantive criminal law are the most isolated branches of case law on the map;
⦁ Migration, family and substantive criminal law are the most distant branches of case law from legislation on the map;
⦁ Development law is the closest branch of case law to legislation on the map;
⦁ Case law is more of a continuum than a rigidly defined structure and the borders between branches of case law can often be quite porous; and
⦁ The map does not reveal any noticeable distinctions between Australian state and federal law, whether it be in style, principles of interpretation or general jurisprudence.
If you’re interested in learning more about what the map has to teach us about Australian law or if you’d like to find out how you can create semantic maps of your own, check out the full article on my blog, which provides a detailed analysis of my map and also covers the finer details of how I built it, with code examples offered along the way.
-
Show Tildes: mapping almost every law, regulation and case in Australia
14 votes -
Comment on Do you have or know of fun domain names? Do you think it's worth having them? in ~tech
ubr
No, it wouldn’t. I haven’t gotten around to doing that but it’s a good idea.
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
Ever seen the UK's digital systems? We get a lot of grief for our bureaucratic nature. but .gov.uk is an absolute goldmine of everything the government offers. We've even got a legislation bit which touches on what you've made here. GDS (Government Digital Services) got absolutely hammered in the early years of austerity by our government and we've kind of dropped the ball since... and consultants have taken to stealing all the limelight for 'solving the problem.'
I have actually. Seeing how far ahead other jurisdictions like the UK were helped push me to decide this was something worth pursuing. Even the US was doing better, not because of their bureaucracy but because there was a motivation for researchers to do the work themselves. I'm hoping that this database will help progress the legal tech and open data scene here in Australia further.
This kind of thing is invaluable though. You might want to see if the big wigs would be interested in using it? Someone, Somewhere in the Aus-Gov has to be doing something similar.
Good idea. I might try reaching out to people in the government to see if they'd be interested in helping me expand the database and perhaps even maintain it (because obviously, in the long run, it will require work to keep up with changes to the various data sources' systems). Thanks for the encouragement :)
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
Have you begun training your LLM yet, and if so, what are the results like?
I haven’t got around to training an LLM on the Corpus yet. However, I did build a prototype AI assistant at work where I feed GPT-4 documents from the Corpus (RAG essentially, with some hierarchical question decomposition and graph of thoughts mixed in) and it has performed really well, beating Bing, Bard and Claude 2. So I’m keen to try training an LLM on the data and seeing how much better that works.
The source/citation fields being required by the prompt would help mitigate the hallucination problem (as I'm sure was your intention).
Interestingly enough, I did find that forcing GPT-4 to cite its claims, particularly to individual sections of laws, really improved the accuracy of its responses. Wild how citations alone can make it more truthful.
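A rough Python sketch of the idea of requiring citations in a RAG prompt is below. It is not the prompt used in the prototype; the function name, the dictionary keys ("citation", "text"), the wording of the instructions and the placeholder document are all assumptions made for illustration, and no model is actually called.

```python
def build_cited_prompt(question, documents):
    """Assemble a prompt that asks the model to cite a source for every claim.

    `documents` is a list of dicts with hypothetical 'citation' and 'text' keys.
    """
    sources = "\n\n".join(
        f"[{i}] {doc['citation']}\n{doc['text']}"
        for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite the source number and section for every claim.\n"
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Placeholder retrieved document, not a real law.
docs = [{"citation": "Example Act s 1", "text": "Placeholder text of section 1."}]
prompt = build_cited_prompt("What does section 1 say?", docs)
print(prompt)
```

Numbering the sources in the prompt makes it possible to check each cited claim against the passage it points to after the model responds.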
The project sounds like the kind of modernizing that governments should be doing themselves.
…
I'm sure they could be adapted to other countries' legal systems as well.
I got quite lucky as this project was some really low hanging fruit, I was surprised no one had bothered doing it before. I am actually already thinking about how I can do this for other countries. I have an Australian degree in law so it could require a bit of learning depending on the jurisdiction, but for places like the UK, Canada, New Zealand, it shouldn’t be too difficult.
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
Thanks mate, really appreciate the positive feedback. We really do have a long ways to go in terms of tech, but particularly in the public service. Their data management is all over the place.
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
Hey Tildes,
Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.
In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.
My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!
You can find my database on HuggingFace and the code used to create it on GitHub.
-
Show Tildes: how I built the largest open database of Australian law
28 votes -
Comment on Jina AI releases first open source 8k embedding model in ~comp
ubr
The model ranks 17th on the Massive Text Embedding Benchmark (MTEB) Leaderboard, making it a great option for those looking for a FOSS alternative to ada-002 that can handle just as many tokens. Naturally, for those dealing with smaller context windows, BGE still reigns supreme.
-
Jina AI releases first open source 8k embedding model
8 votes -
Comment on GPT-4 understands in ~tech
ubr
It depends on the use case and domain. I’ve been using GPT-4 for Australian legal QA (via RAG) and it honestly wouldn’t be possible if I only had access to GPT-3.5-Turbo. For me, GPT-4’s ability to “understand” Australian law and reason through problems makes it a far more powerful tool than GPT-3.5-Turbo. At the same time, for other domains and use cases, the gains can be more moderate. The training sets used may have a role to play in this.
-
Comment on Teaching LLMs to divide and conquer problems with hierarchical question decomposition in ~comp
I take your point (and that of @zestier) and will increase my engagement in the future. Apologies again.