ubr's recent activity

  1. Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp

    ubr
    Link Parent

    I actually didn't use BERT for topic modelling; the embeddings were created by a far more advanced, specialised model called BAAI/bge-small-en-v1.5, although it is still a transformer. A more rudimentary algorithm, HDBSCAN, was used for clustering, but I then manually reviewed all 507 clusters and reduced them to 19 branches of law. I do mention that I modelled my process on BERTopic, which is probably where the confusion came from. I'm not too sure why it's still called that when it doesn't even use BERT anymore 😅.

    Thank you for the positive feedback, it is much appreciated. And apologies if the article seems a little too complex. The underlying process is 'pretty simple', but my implementation may be more involved because working with over 200k long-form documents meant having to build much more efficient code from scratch than what is available in the BERTopic Python library.
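
    The manual reduction step described above (many HDBSCAN clusters merged into a handful of branches of law) can be sketched roughly as follows. This is only an illustration: the cluster ids, branch names and the `assign_branches` helper are hypothetical, and the embedding and clustering steps themselves are left out. The one library fact relied on is that HDBSCAN labels outlier points `-1`.

```python
from collections import Counter

# Hypothetical, hand-written mapping from cluster id to branch of law.
# In the process described above this would cover 507 clusters and 19 branches.
CLUSTER_TO_BRANCH = {
    0: "Migration law",
    1: "Migration law",
    2: "Family law",
    3: "Criminal law",
}

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def assign_branches(cluster_labels, mapping, fallback="Unassigned"):
    """Translate per-document cluster labels into branch names."""
    branches = []
    for label in cluster_labels:
        if label == NOISE_LABEL:
            branches.append(fallback)
        else:
            # Clusters missing from the manual mapping also fall back.
            branches.append(mapping.get(label, fallback))
    return branches


# Example cluster labels for six documents (made up for illustration).
labels = [0, 1, 2, -1, 3, 1]
branches = assign_branches(labels, CLUSTER_TO_BRANCH)
branch_counts = Counter(branches)
```

    Keeping the mapping as plain data means the manual review can be revised without re-running the clustering.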

    3 votes
  2. Comment on Show Tildes: mapping almost every law, regulation and case in Australia in ~comp

    ubr
    Link

    Hey Tildes,

    After months of hard work, I am excited to share the first ever semantic map of Australian law.

    My map represents the first attempt to map Australian laws, cases and regulations across the Commonwealth, States and Territories semantically, that is, by their underlying meaning.

    Each point on the map is a unique document in the Open Australian Legal Corpus, the largest open database of Australian law (which, full disclosure, I created). The closer any two points are on the map, the more similar they are in underlying meaning.
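
    As a rough illustration of what "similar in underlying meaning" can mean for embedding vectors, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and document names are made up, and cosine similarity is an assumption here; the map itself may rely on a different distance measure or a 2D projection.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (real embedding vectors have hundreds of dimensions).
migration_case = [0.9, 0.1, 0.0]
visa_regulation = [0.8, 0.2, 0.1]
planning_act = [0.1, 0.2, 0.9]

sim_close = cosine_similarity(migration_case, visa_regulation)
sim_far = cosine_similarity(migration_case, planning_act)
```

    With these toy values, `sim_close` comes out higher than `sim_far`, which is the sense in which the first two documents would sit nearer each other.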

    As I cover in my article, there's a lot you can learn by mapping Australian law. Some of the most interesting insights to come out of this initiative are that:
    • Migration, family and substantive criminal law are the most isolated branches of case law on the map;
    • Migration, family and substantive criminal law are the most distant branches of case law from legislation on the map;
    • Development law is the closest branch of case law to legislation on the map;
    • Case law is more of a continuum than a rigidly defined structure and the borders between branches of case law can often be quite porous; and
    • The map does not reveal any noticeable distinctions between Australian state and federal law, whether it be in style, principles of interpretation or general jurisprudence.

    If you're interested in learning more about what the map has to teach us about Australian law or if you'd like to find out how you can create semantic maps of your own, check out the full article on my blog, which provides a detailed analysis of my map and also covers the finer details of how I built it, with code examples offered along the way.

    8 votes
  3. Comment on Do you have or know of fun domain names? Do you think it's worth having them? in ~tech

    ubr
    Link Parent

    No it wouldn't. I haven't gotten around to doing that but it's a good idea.

  4. Comment on Show Tildes: how I built the largest open database of Australian law in ~comp

    ubr
    Link Parent

    Ever seen the UK's digital systems? We get a lot of grief for our bureaucratic nature, but .gov.uk is an absolute goldmine of everything the government offers. We've even got a legislation bit which touches on what you've made here. GDS (Government Digital Services) got absolutely hammered in the early years of austerity by our government and we've kind of dropped the ball since... and consultants have taken to stealing all the limelight for 'solving the problem.'

    I have actually. Seeing how far ahead other jurisdictions like the UK were helped push me to decide this was something worth pursuing. Even the US was doing better, not because of their bureaucracy but because there was a motivation for researchers to do the work themselves. I'm hoping that this database will help progress the legal tech and open data scene here in Australia further.

    This kind of thing is invaluable though. You might want to see if the big wigs would be interested in using it? Someone, Somewhere in the Aus-Gov has to be doing something similar.

    Good idea. I might try reaching out to people in the government to see if they'd be interested in helping me expand the database and perhaps even maintain it (because obviously, in the long run, it will require work to keep up with changes to the various data sources' systems). Thanks for the encouragement :)

    1 vote
  5. Comment on Show Tildes: how I built the largest open database of Australian law in ~comp

    ubr
    Link Parent

    Have you begun training your LLM yet, and if so, what are the results like?

    I haven't got around to training an LLM on the Corpus yet. However, I did build a prototype AI assistant at work where I feed GPT-4 documents from the Corpus (RAG essentially, with some hierarchical question decomposition and graph of thoughts mixed in) and it has performed really well, beating Bing, Bard and Claude 2. So I'm keen to try training an LLM on the data and seeing how much better that works.

    The source/citation fields being required by the prompt would help mitigate the hallucination problem (as I'm sure was your intention).

    Interestingly enough, I did find that forcing GPT-4 to cite its claims, particularly to individual sections of laws, really improved the accuracy of its responses. Wild how citations alone can make it more truthful.
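
    A minimal sketch of what a citation-forcing prompt for a RAG setup could look like is below. It is not the prompt actually used; the `build_cited_prompt` helper, the instruction wording and the example Act are all hypothetical, and the call to the model is left out.

```python
def build_cited_prompt(question, passages):
    """Build a prompt that asks the model to cite a source for every claim.

    `passages` is a list of (citation, text) pairs retrieved for the question.
    """
    # Number each retrieved passage so the model can refer to it as [1], [2], ...
    sources = "\n\n".join(
        f"[{i}] {citation}\n{text}"
        for i, (citation, text) in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "After every claim, cite the supporting source number and section "
        "in square brackets, e.g. [1]. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved passages; the Act and its wording are invented.
passages = [
    ("Example Act 2000 (Cth) s 5", "Placeholder text of section 5."),
    ("Example Act 2000 (Cth) s 6", "Placeholder text of section 6."),
]
prompt = build_cited_prompt("What does section 5 require?", passages)
```

    Numbering the passages gives the model a short, checkable handle for each source, so a cited answer can be verified against the retrieved text.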

    The project sounds like the kind of modernizing that governments should be doing themselves.
    …
    I'm sure they could be adapted to other countries' legal systems as well.

    I got quite lucky, as this project was some really low-hanging fruit; I was surprised no one had bothered doing it before. I am actually already thinking about how I can do this for other countries. I have an Australian degree in law so it could require a bit of learning depending on the jurisdiction, but for places like the UK, Canada and New Zealand, it shouldn't be too difficult.

    4 votes
  6. Comment on Show Tildes: how I built the largest open database of Australian law in ~comp

    ubr
    Link Parent

    Thanks mate, really appreciate the positive feedback. We really do have a long ways to go in terms of tech, but particularly in the public service. Their data management is all over the place.

    4 votes
  7. Comment on Show Tildes: how I built the largest open database of Australian law in ~comp

    ubr
    Link

    Hey Tildes,
    Over the past year, I've been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.

    In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

    My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!

    You can find my database on HuggingFace and the code used to create it on GitHub.

    13 votes
  8. Comment on Jina AI releases first open source 8k embedding model in ~comp

    ubr
    Link

    The model ranks 17th on the Massive Text Embedding Benchmark (MTEB) Leaderboard, making it a great option for those looking for a FOSS alternative to ada-002 that can handle just as many tokens. Naturally, for those dealing with smaller context windows, BGE still reigns supreme.

    5 votes
  9. Comment on GPT-4 understands in ~tech

    ubr
    Link Parent

    I've built something on my own.

    2 votes
  10. Comment on GPT-4 understands in ~tech

    ubr
    Link Parent

    It depends on the use case and domain. I've been using GPT-4 for Australian legal QA (via RAG) and it honestly wouldn't be possible if I only had access to GPT-3.5-Turbo. For me, GPT-4's ability to “understand” Australian law and reason through problems makes it a far more powerful tool than GPT-3.5-Turbo. At the same time, for other domains and use cases, the gains can be more moderate. The training sets used may have a role to play in this.

    6 votes
  11. Comment on Teaching LLMs to divide and conquer problems with hierarchical question decomposition in ~comp

    ubr
    Link Parent

    At the moment, legal QA.

    4 votes
  12. Comment on Teaching LLMs to divide and conquer problems with hierarchical question decomposition in ~comp

    ubr
    Link Parent

    @cfabbro Thanks for the gentle heads up. It certainly wasn't my intention to contravene your code of conduct. I'll keep this in mind going forward.

    3 votes