28 votes

Show Tildes: how I built the largest open database of Australian law

Posted October 29, 2023 by ubr

Tags: open source, databases, data science, machine learning, law, australia, original content, author.umar butler, source.umarbutler

https://umarbutler.com/how-i-built-the-largest-open-database-of-australian-law/

Link information

Published: Oct 27 2023
Word count: 1051 words

8 comments

ubr (OP)
October 29, 2023

Hey Tildes,
Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM.

In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!

You can find my database on HuggingFace and the code used to create it on GitHub.

13 votes
[4]
Pioneer
October 29, 2023

Well done mate, that's really tough work. Australia seems to be so woefully equipped in the digital space at the moment, it's insane.

I worked for a firm that got bought up about five years ago that was way ahead on the LLM fad that's going around now. Having these big, chunky, open databases is going to be so much better than endless scraping of inane documentation to 'assume' and 'interpolate' data and language for this kind of work.

Genuinely. 10/10.

7 votes
1. [3]
  ubr (OP)
  October 29, 2023
  
  Thanks mate, really appreciate the positive feedback. We really do have a long way to go in terms of tech, particularly in the public service. Their data management is all over the place.
  
  4 votes
  1. [2]
    Pioneer
    October 29, 2023
    
    Their data management is all over the place.
    
    Ever seen the UK's digital systems? We get a lot of grief for our bureaucratic nature, but .gov.uk is an absolute goldmine of everything the government offers. We've even got a legislation bit which touches on what you've made here. GDS (Government Digital Service) got absolutely hammered in the early years of austerity by our government and we've kind of dropped the ball since... and consultants have taken to stealing all the limelight for 'solving the problem.'
    
    This kind of thing is invaluable though. You might want to see if the bigwigs would be interested in using it? Someone, somewhere in the Aus-Gov has to be doing something similar.
    
    2 votes
    
    ubr (OP)
    October 30, 2023
    
    Ever seen the UK's digital systems? We get a lot of grief for our bureaucratic nature, but .gov.uk is an absolute goldmine of everything the government offers. We've even got a legislation bit which touches on what you've made here. GDS (Government Digital Service) got absolutely hammered in the early years of austerity by our government and we've kind of dropped the ball since... and consultants have taken to stealing all the limelight for 'solving the problem.'
    
    I have actually. Seeing how far ahead other jurisdictions like the UK were helped push me to decide this was something worth pursuing. Even the US was doing better, not because of their bureaucracy but because there was a motivation for researchers to do the work themselves. I'm hoping that this database will help progress the legal tech and open data scene here in Australia further.
    
    This kind of thing is invaluable though. You might want to see if the bigwigs would be interested in using it? Someone, somewhere in the Aus-Gov has to be doing something similar.
    
    Good idea. I might try reaching out to people in the government to see if they'd be interested in helping me expand the database and perhaps even maintain it (because obviously, in the long run, it will require work to keep up with changes to the various data sources' systems). Thanks for the encouragement :)
    
    1 vote
vord
October 29, 2023

Your story is one of the definitive examples of why all legal documents should be in plain text, and I applaud you.

4 votes
[2]
Wes
October 29, 2023

Very cool. Have you begun training your LLM yet, and if so, what are the results like? I feel like if you only trained it on this legal corpus, you'd get a very difficult-to-understand agent. So would it be better to take an existing LLM like llama2 and finetune it on this corpus instead? The source/citation fields being required by the prompt would help mitigate the hallucination problem (as I'm sure was your intention).

The project sounds like the kind of modernizing that governments should be doing themselves. Especially as such an easily queryable tool would be a boon to the lawyers and lawmakers that seem so reluctant to help in the first place.

I glanced at the scraper code and it looks very neatly put together. I'm sure they could be adapted to other countries' legal systems as well. Really excellent project you've created.

2 votes
1. ubr (OP)
  October 29, 2023
  
  Have you begun training your LLM yet, and if so, what are the results like?
  
  I haven’t got around to training an LLM on the Corpus yet. However, I did build a prototype AI assistant at work where I feed GPT-4 documents from the Corpus (RAG essentially, with some hierarchical question decomposition and graph of thoughts mixed in) and it has performed really well, beating Bing, Bard and Claude 2. So I’m keen to try training an LLM on the data and seeing how much better that works.
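
For anyone unfamiliar with the RAG part, here is a rough sketch of the general idea, not the actual assistant: pick the documents that best match the question, then paste them into the prompt. The `citation` and `text` keys, the sample documents, and the word-overlap scoring are all assumptions made up for illustration; a real system would more likely use embeddings and would send the prompt to a model.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def score(question, document_text):
    # Count how many question words also appear in the document.
    q = Counter(tokenize(question))
    d = Counter(tokenize(document_text))
    return sum(min(q[t], d[t]) for t in q)

def retrieve(question, documents, k=2):
    # Return the k documents with the highest word overlap.
    ranked = sorted(documents, key=lambda doc: score(question, doc["text"]), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=2):
    # Put the retrieved documents in front of the question.
    context = "\n\n".join(
        f"[{doc['citation']}]\n{doc['text']}" for doc in retrieve(question, documents, k)
    )
    return f"Answer using only the sources below.\n\n{context}\n\nQuestion: {question}"

# Hypothetical documents; these are not real entries from the Corpus.
DOCS = [
    {"citation": "Example Act 2000 s 1", "text": "A person must not drive a vehicle without a licence."},
    {"citation": "Example Act 2000 s 2", "text": "A licence may be suspended by the registrar."},
    {"citation": "Sample Regulation 2010 r 5", "text": "Fishing permits expire after twelve months."},
]

prompt = build_prompt("Can I drive without a licence?", DOCS, k=2)
```

The resulting `prompt` string is what would be handed to the model; the question decomposition and graph-of-thoughts steps are left out entirely.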
  
  The source/citation fields being required by the prompt would help mitigate the hallucination problem (as I'm sure was your intention).
  
  Interestingly enough, I did find that forcing GPT-4 to cite its claims, particularly to individual sections of laws, really improved the accuracy of its responses. Wild how citations alone can make it more truthful.
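
One simple way to enforce this, sketched under my own assumptions rather than taken from the prototype: number the sources in the prompt, ask for one claim per line ending in a marker like `[1]`, and reject any answer with an uncited line or a marker that points at a source that was never provided. The function names and the `[n]` marker format are hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def build_cited_prompt(question, sources):
    # Number each source so the model can refer to it as [1], [2], ...
    listing = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return (
        "Answer the question using only the numbered sources.\n"
        "Write one claim per line and end each line with its source, e.g. [1].\n\n"
        f"{listing}\n\nQuestion: {question}"
    )

def check_citations(answer, num_sources):
    # Return a list of problems; an empty list means every line is cited.
    problems = []
    for line in answer.splitlines():
        line = line.strip()
        if not line:
            continue
        cited = [int(n) for n in CITATION.findall(line)]
        if not cited:
            problems.append(f"no citation: {line}")
        elif any(n < 1 or n > num_sources for n in cited):
            problems.append(f"unknown source: {line}")
    return problems

# A made-up answer where the second line has no citation.
answer = "Driving requires a licence [1].\nPermits last a year."
problems = check_citations(answer, num_sources=2)
```

This only checks the form of the answer. It does not confirm that the cited section actually supports the claim, so it is a first filter at best.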
  
  The project sounds like the kind of modernizing that governments should be doing themselves.
  …
  I'm sure they could be adapted to other countries' legal systems as well.
  
  I got quite lucky as this project was some really low-hanging fruit; I was surprised no one had bothered doing it before. I am actually already thinking about how I can do this for other countries. I have an Australian degree in law so it could require a bit of learning depending on the jurisdiction, but for places like the UK, Canada, New Zealand, it shouldn’t be too difficult.
  
  4 votes