ubr's recent activity
-
Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp
-
Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp
ubr Hey Tildes,
After months of hard work, I am excited to share the first ever semantic map of Australian law.
My map represents the first attempt to map Australian laws, cases and regulations across the Commonwealth, States and Territories semantically, that is, by their underlying meaning.
Each point on the map is a unique document in the Open Australian Legal Corpus, the largest open database of Australian law (which, full disclosure, I created). The closer any two points are on the map, the more similar they are in underlying meaning.
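To make "closer points are more similar in meaning" concrete, here is a minimal sketch of comparing documents by the cosine similarity of their embeddings. The document names and three-dimensional vectors are invented for illustration; a real map would start from embeddings produced by a model, and nothing here is taken from the article's own code.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: real ones have hundreds of dimensions).
toy_embeddings = {
    "migration_case": [0.9, 0.1, 0.0],
    "family_case": [0.1, 0.9, 0.1],
    "planning_case": [0.2, 0.2, 0.9],
    "planning_act": [0.3, 0.1, 0.9],
}

def nearest(name, embeddings):
    """Return the name of the document most similar to `name`."""
    query = embeddings[name]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in embeddings.items()
        if other != name
    ]
    return max(scored)[1]

print(nearest("planning_case", toy_embeddings))
```

In this toy data the planning case sits closest to the planning act, which is the kind of relationship the map would show as two nearby points.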
As I cover in my article, there's a lot you can learn by mapping Australian law. Some of the most interesting insights to come out of this initiative are that:
• Migration, family and substantive criminal law are the most isolated branches of case law on the map;
• Migration, family and substantive criminal law are the most distant branches of case law from legislation on the map;
• Development law is the closest branch of case law to legislation on the map;
• Case law is more of a continuum than a rigidly defined structure and the borders between branches of case law can often be quite porous; and
• The map does not reveal any noticeable distinctions between Australian state and federal law, whether it be in style, principles of interpretation or general jurisprudence.
If you're interested in learning more about what the map has to teach us about Australian law or if you'd like to find out how you can create semantic maps of your own, check out the full article on my blog, which provides a detailed analysis of my map and also covers the finer details of how I built it, with code examples offered along the way.
-
Show Tildes: mapping almost every law, regulation and case in Australia
14 votes -
Comment on Do you have or know of fun domain names? Do you think it's worth having them? in ~tech
ubr No it wouldn't. I haven't gotten around to doing that but it's a good idea.
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
> Ever seen the UK's digital systems? We get a lot of grief for our bureaucratic nature, but .gov.uk is an absolute goldmine of everything the government offers. We've even got a legislation bit which touches on what you've made here. GDS (Government Digital Services) got absolutely hammered in the early years of austerity by our government and we've kind of dropped the ball since... and consultants have taken to stealing all the limelight for 'solving the problem.'
I have actually. Seeing how far ahead other jurisdictions like the UK were helped push me to decide this was something worth pursuing. Even the US was doing better, not because of their bureaucracy but because there was a motivation for researchers to do the work themselves. I'm hoping that this database will help progress the legal tech and open data scene here in Australia further.
> This kind of thing is invaluable though. You might want to see if the big wigs would be interested in using it? Someone, somewhere in the Aus-Gov has to be doing something similar.
Good idea. I might try reaching out to people in the government to see if they'd be interested in helping me expand the database and perhaps even maintain it (because obviously, in the long run, it will require work to keep up with changes to the various data sources' systems). Thanks for the encouragement :)
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr
> Have you begun training your LLM yet, and if so, what are the results like?
I haven't got around to training an LLM on the Corpus yet; however, I did build a prototype AI assistant at work where I feed GPT-4 documents from the Corpus (RAG essentially, with some hierarchical question decomposition and graph of thoughts mixed in) and it has performed really well, beating Bing, Bard and Claude 2. So I'm keen to try training an LLM on the data and seeing how much better that works.
> The source/citation fields being required by the prompt would help mitigate the hallucination problem (as I'm sure was your intention).
Interestingly enough, I did find that forcing GPT-4 to cite its claims, particularly to individual sections of laws, really improved the accuracy of its responses. Wild how citations alone can make it more truthful.
> The project sounds like the kind of modernizing that governments should be doing themselves.
> …
> I'm sure they could be adapted to other countries' legal systems as well.
I got quite lucky, as this project was really low-hanging fruit; I was surprised no one had bothered doing it before. I am actually already thinking about how I can do this for other countries. I have an Australian degree in law so it could require a bit of learning depending on the jurisdiction, but for places like the UK, Canada and New Zealand, it shouldn't be too difficult.
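As a rough illustration of the "force it to cite its claims" idea mentioned above, the sketch below builds a prompt from numbered sources and then checks a model's answer for citations that point at sources that were never supplied, or sentences with no citation at all. Everything here is hypothetical: the `[S1]` ID format, the function names and the placeholder source texts are assumptions, and no real LLM API is called.

```python
import re

# Hypothetical citation format: source IDs such as [S1], [S2] in square brackets.
CITATION = re.compile(r"\[(S\d+)\]")

def build_prompt(question, sources):
    """Assemble a prompt that lists the sources and demands a citation per claim."""
    lines = [
        "Answer the question using only the sources below.",
        "Cite a source ID in square brackets, e.g. [S1], in every sentence.",
        "",
    ]
    for source_id, text in sources.items():
        lines.append(f"[{source_id}] {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

def check_citations(answer, sources):
    """Report cited IDs that were never supplied and sentences lacking a citation."""
    cited = set(CITATION.findall(answer))
    unknown_ids = sorted(cited - set(sources))
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {"unknown_ids": unknown_ids, "uncited_sentences": uncited}

sources = {"S1": "Placeholder text of section 1.", "S2": "Placeholder text of section 2."}
prompt = build_prompt("What does section 1 say?", sources)
# A made-up model answer, used only to exercise the checker.
answer = "Section 1 sets out the rule [S1]. An appeal is possible [S3]. This is always fair."
report = check_citations(answer, sources)
print(report)
```

A report with a non-empty `unknown_ids` or `uncited_sentences` list could be used to reject the answer or ask the model to try again; whether that matches the prototype described above is not something the comment confirms.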
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr Thanks mate, really appreciate the positive feedback. We really do have a long ways to go in terms of tech, but particularly in the public service. Their data management is all over the place.
-
Comment on Show Tildes: how I built the largest open database of Australian law in ~comp
ubr Hey Tildes,
Over the past year, I've been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.
In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.
My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!
You can find my database on HuggingFace and the code used to create it on GitHub.
-
Show Tildes: how I built the largest open database of Australian law
28 votes -
Comment on Jina AI releases first open source 8k embedding model in ~comp
ubr The model ranks 17th on the Massive Text Embedding Benchmark (MTEB) Leaderboard, making it a great option for those looking for a FOSS alternative to ada-002 that can handle just as many tokens. Naturally, for those dealing with smaller context windows, BGE still reigns supreme.
-
Jina AI releases first open source 8k embedding model
8 votes -
Comment on GPT-4 understands in ~tech
-
Comment on GPT-4 understands in ~tech
ubr It depends on the use case and domain. I've been using GPT-4 for Australian legal QA (via RAG) and it honestly wouldn't be possible if I only had access to GPT-3.5-Turbo. For me, GPT-4's ability to "understand" Australian law and reason through problems makes it a far more powerful tool than GPT-3.5-Turbo. At the same time, for other domains and use cases, the gains can be more moderate. The training sets used may have a role to play in this.
-
Comment on Teaching LLMs to divide and conquer problems with hierarchical question decomposition in ~comp
-
Comment on Teaching LLMs to divide and conquer problems with hierarchical question decomposition in ~comp
ubr @cfabbro Thanks for the gentle heads up. It certainly wasn't my intention to contravene your code of conduct. I'll keep this in mind going forward.
-
Teaching LLMs to divide and conquer problems with hierarchical question decomposition
8 votes
I actually didn't use BERT for topic modelling; the embeddings were created by a far more advanced, specialised model called BAAI/bge-small-en-v1.5, although it is still a transformer. A more rudimentary algorithm, HDBSCAN, was used for clustering, but then I manually reviewed all 507 clusters and reduced them to 19 branches of law. I do mention that I modelled my process off BERTopic, which is probably where the confusion came from. I'm not too sure why it's still called that when it doesn't even use BERT anymore.
Thank you for the positive feedback, it is much appreciated. And apologies if the article seems a little too complex. The underlying process is 'pretty simple', but my implementation may be more involved because working with over 200k long-form documents meant having to build much more efficient code from scratch than what is available in the BERTopic Python library.
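The manual review step described above, where many machine-found clusters are folded into a handful of branches of law, can be pictured as a hand-written lookup table applied to cluster labels. The sketch below is only an illustration under stated assumptions: the cluster IDs and the mapping are made up, the function name is hypothetical, and the only convention borrowed from HDBSCAN is that unclustered (noise) points are labelled -1.

```python
from collections import Counter

# Hypothetical hand-curated mapping from cluster ID to branch of law.
# The real review covered 507 clusters and 19 branches; these IDs are invented.
cluster_to_branch = {
    0: "Migration law",
    1: "Migration law",
    2: "Family law",
    3: "Development law",
}

def assign_branches(cluster_labels, mapping, noise_label=-1):
    """Translate per-document cluster labels into branch names."""
    branches = []
    for label in cluster_labels:
        if label == noise_label:
            # Noise points belong to no cluster, so they get no branch.
            branches.append("Unassigned")
        else:
            # Clusters missing from the mapping are flagged for manual review.
            branches.append(mapping.get(label, "Unreviewed"))
    return branches

# Example labels as a clustering step might emit them, one per document.
labels = [0, 1, 2, -1, 3, 7]
branches = assign_branches(labels, cluster_to_branch)
branch_counts = Counter(branches)
print(branch_counts)
```

Keeping the mapping as plain data makes the manual decisions easy to audit and change without rerunning the clustering.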