24 votes

Communal answering machine: please leave a message after the beep

6 comments

  1. [6]
    cfabbro
    (edited)
    Link
    If you don't mind my asking, where have you been finding all these neat old-school style sites you've been posting lately, @xk3? They've been super interesting, this one included!

    10 votes
    1. [5]
      xk3
      Link Parent
      • Exemplary

      There are three components to my setup:

      1. When I find a list of interesting links I save them to a links database:

        $ cb | library links-add ~/mc/links.db --skip-extract -

      This isn't anything special, but the command tells me how many links were new vs. previously saved. That's useful when the Copy All Links Firefox extension doesn't work: the counts show me that I need to try copying the links again, etc.
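The new-vs-previously-saved bookkeeping can be sketched with a SQLite table and a uniqueness constraint. This is a toy illustration of the idea, not the actual `library links-add` implementation or schema:

```python
# Toy sketch: count new vs. already-saved links via a PRIMARY KEY constraint.
# Not the real links.db schema -- table and column names here are made up.
import sqlite3

def add_links(con, links):
    con.execute("CREATE TABLE IF NOT EXISTS media (path TEXT PRIMARY KEY)")
    new = dup = 0
    for link in links:
        try:
            con.execute("INSERT INTO media (path) VALUES (?)", (link,))
            new += 1
        except sqlite3.IntegrityError:  # link already saved
            dup += 1
    return new, dup

con = sqlite3.connect(":memory:")
first = add_links(con, ["https://a.example", "https://b.example"])
second = add_links(con, ["https://a.example", "https://c.example"])
print(first, second)  # (2, 0) (1, 1)
```

A second run over an overlapping link list reports one new and one duplicate, which is the signal that a partial clipboard copy happened.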

      If links have useful descriptions as their link text, I'll do something like this to capture that as well:

      $ library linksdb ~/mc/tv.db --local-html (cb -t text/html | psub)
      

      Right now I have 110,672 links... which will take me 34 years, 16 days and 8 hours to get through at my current rate. But that number keeps growing so... probably won't ever read some of these.
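The "34 years" figure above implies a reading rate; inverting it (my arithmetic, not from the comment) gives roughly 8.9 links per day:

```python
# Back-of-the-envelope check of the backlog math above.
backlog = 110_672                        # links currently saved
days = 34 * 365.25 + 16 + 8 / 24         # "34 years, 16 days and 8 hours"
rate = backlog / days
print(round(rate, 1))                    # implied pace, links per day
```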

      2. Regex-sort. This is a bit technical, but I feel like it's a big reason why I'm able to find very long-tail things. Having additional metadata like the page title helps here, as that is fed in as additional text.

      Basically, it's a text processor where all the words in each line (or dict values, in the case of links) are enriched (e.g. compared against the corpus as a whole) and sorted within the line, before all the lines themselves are sorted.
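The three-stage shape (line_splitter, word_sorter, line_sorter) can be sketched in a few lines. This is a toy version with one hard-coded strategy per stage, loosely mimicking the `-dup, count, ..., alpha` defaults; the real regex-sort supports many more:

```python
# Toy split -> word-sort -> line-sort pipeline, not the real implementation.
import re
from collections import Counter

def regex_sort(lines, pattern=r"\b\w\w+\b"):
    split = [re.findall(pattern, line) for line in lines]   # line_splitter
    corpus = Counter(w for words in split for w in words)   # corpus stats
    # word_sorter: corpus-duplicate words first, then by corpus count, then alpha
    keyed = [
        sorted(words, key=lambda w: (-(corpus[w] > 1), -corpus[w], w))
        for words in split
    ]
    # line_sorter: order lines by their enriched-and-sorted word tuples
    order = sorted(range(len(lines)), key=lambda i: keyed[i])
    return [lines[i] for i in order]

print(regex_sort(["beta gamma", "alpha beta", "delta epsilon"]))
```

Because duplicated-in-corpus words get pulled to the front of each line, lines that share a rare-but-repeated word end up adjacent after the final sort, which is what makes long-tail clusters visible.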

      details
      library regex-sort -h
      
      usage: library regex-sort [input_path | stdin] [output_path | stdout]
      
          regex-sort is effectively a text-processing pipeline with the following steps:
      
          line_splitter -- split lines into "words" (--regex)
          word_sorter -- sort words within each line (--word-sort)
          line_sorter -- sort lines (--line-sort)
      
          Examples:
      
              If your data has a lot of repeating rows it will help to sort by dup count:
                  --line-sort dup,natsort
      
              You can use any matching regex to produce sorting words:
                  --regex \b\w\w+\b  # word boundaries (default)
                  --regex \b\d+\b    # digits
                  --regex '.{3}'     # char counts
      
                  --regex '.{3}' --line-sort dup,natsort -v
                  (0, ((' Ja',), ('Sva',), ('aye',), ('lba',), ('n M',), ('rd ',)))  # Svalbard and Jan Mayen
                  (0, ((' La',), ('Sri',), ('nka',)))  # Sri Lanka
                  (0, ((' Ma',), ('San',), ('rin',)))  # San Marino
                  (0, ((' Ri',), ('Pue',), ('rto',)))  # Puerto Rico
                  (0, (('And',), ('orr',)))  # Andorra
                  (0, (('Arm',), ('eni',)))  # Armenia
      
              You can use 'mcda' as a strategy for ranking multiple sort score criteria:
                  --word-sorts '-dup, mcda, count, -len, -lastindex, alpha' \\  # count, -len, -lastindex, alpha are weighted by entropy
                  --line-sorts '-allunique, alpha, mcda, alldup, dupmode, line'  # alldup, dupmode, line are weighted by entropy
      
      ...
      
      Regex-sort options:
        --regexs (-re)         Default eliminates common URL chars like `-,_.` and use word breaks to build words
          STRING
        --word-sorts (-wu)            Specify the word sorting strategy to use within each line
          STRING
                                      Choose ONE OR MORE of the following options:
                                        skip       skip word sorting
                                        len        length of word
                                        unique     word is a unique in corpus (boolean)
                                        dup        word is a duplicate in corpus (boolean)
                                        count      count of same word in corpus
                                        linecount  count of same word in line
                                        index      index of word in line (first occurrence)
                                        lastindex  index of word in line (last occurrence)
                                        alpha      python alphabetic sorting
      
                                        natural    natsort default sorting (numbers as integers)
                                        signed     natsort signed numbers sorting (for negative numbers)
                                        path       natsort path sorting
                                      (https://natsort.readthedocs.io/en/stable/api.html#the-ns-enum)
                                        locale     natsort system locale sorting
                                        os         natsort OS File Explorer sorting. To improve non-alphanumeric sorting
                                      on Mac OS X and Linux it is necessary to install pyicu (perhaps via python3-icu --
                                      https://gitlab.pyicu.org/main/pyicu#installing-pyicu)
      
                                        mcda       all line_sort arguments after "mcda" will be consumed by MCDA and
                                      sorted by equal-weight
      
                                      (default: -dup, count, -len, -lastindex, alpha)
        --line-sorts (-lu)            Specify the line sorting strategy to use on the text-processed words (after regex,
          STRING                      word-sort, etc)
      
                                      Choose ONE OR MORE of the following options:
                                        skip       skip line sorting
                                        line       the original line (python alphabetic sorting)
                                        len        length of line
                                        count      count of words in line
      
                                        dup        count of duplicate in corpus words (sum of boolean)
                                        unique     count of unique in corpus words (sum of boolean)
                                        alldup     all line-words are duplicate in corpus words (boolean)
                                        allunique  all line-words are unique in corpus words (boolean)
      
                                        sum        count of all uses of line-words (within corpus)
                                        dupmax     highest line-word corpus usage
                                        dupmin     lowest line-word corpus usage
                                        dupavg     average line-word corpus usage
                                        dupmedian  median line-word corpus usage
                                        dupmode    mode (most repeated value) line-word corpus usage
      
                                        alpha    python alphabetic sorting
                                        natural  natsort default sorting (numbers as integers)
                                        ...      the other natsort options specified in --word-sort are also allowed
      
                                        mcda       all line_sort arguments after "mcda" will be consumed by MCDA and
                                      sorted by equal-weight
      
                                       (default: -allunique, alpha, alldup, dupmode, line)
      
      3. Every day I will open at least 7 links. If I see anything worth discussing or sharing, I'll post it here :)

        $ library openlinks -w 'play_count=0' --regex-sort --max-same-domain 2 --browser ~/mc/links.db
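A `--max-same-domain`-style cap can be sketched as a greedy filter over candidate URLs. This assumes a plain list of URLs rather than the actual links.db query, and the function name is my own:

```python
# Rough sketch of capping results per domain, as --max-same-domain 2 does.
from collections import Counter
from urllib.parse import urlparse

def cap_per_domain(urls, max_same_domain=2):
    seen = Counter()
    picked = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if seen[host] < max_same_domain:
            seen[host] += 1
            picked.append(url)
    return picked

urls = [
    "https://hilobrow.com/a",
    "https://hilobrow.com/b",
    "https://hilobrow.com/c",
    "https://special.fish/d",
]
print(cap_per_domain(urls))  # drops the third hilobrow.com link
```

This keeps a daily batch from being swallowed by one heavily-represented domain like github.com.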

      19 votes
      1. RheingoldRiver
        Link Parent
        this is worth its own top-level post!

        5 votes
      2. cfabbro
        Link Parent
        Damn, that's actually a super cool link tracking/discovery system you have developed for yourself. Thanks for sharing it (and all the interesting links you've been finding in your database too)!

        4 votes
      3. [2]
        carrotflowerr
        Link Parent
        > find a list of interesting links

        interesting! where do you find lists like this?

        2 votes
        1. xk3
          Link Parent

          Sometimes I'll find them on digital gardens. Other times category pages of wikis or websites.

          For example, most recently I came across the site HiLobrow.com. It's a weird domain name. I guess it stands for Highbrow/Lowbrow... Anyway. From the couple of articles that I read, I like the content on the site. But it is way too much for me to read in one sitting! So I went to the sitemap.xml and then looked at the different types of URLs on that page. The page category didn't seem very interesting so I just grabbed all 10,000 or so of the post URLs.
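Pulling post URLs out of a sitemap.xml looks roughly like this. The element names follow the standard sitemap schema; the example domain and the path filter are assumptions for illustration:

```python
# Hypothetical sketch: extract post URLs from a sitemap, skip /page/ listings.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def post_urls(sitemap_xml, path_hint="/20"):  # e.g. dated post permalinks
    root = ET.fromstring(sitemap_xml)
    locs = [el.text for el in root.iterfind(".//sm:loc", NS)]
    return [u for u in locs if path_hint in u]

sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2021/some-post/</loc></url>
  <url><loc>https://example.com/page/2/</loc></url>
</urlset>"""
print(post_urls(sitemap))  # keeps the post URL, skips the /page/ URL
```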

          To give you an idea of the scope of different domains: it looks like I have 19,985 unique second-level domains. Here are my top ones:

              500 cyborganthropology.com
              500 gitstar-ranking.com
              502 astralcodexten.com
              520 medium.com
              553 reasonstobecheerful.world
              600 coptr.digipres.org
              615 techdirt.com
              617 lesswrong.com
              621 net.au
              870 special.fish
             1169 slatestarcodex.com
             1300 inconsolation.wordpress.com
             2848 theguardian.com
             2876 youtube.com
             2973 web.archive.org
             3654 en.wikipedia.org
             4264 reddit.com
             5406 theconversation.com
             6904 man7.org
            10370 hilobrow.com
            16827 github.com
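A tally like the one above can be produced by grouping URLs per host. This sketch assumes URLs in a plain list (the real data lives in links.db) and counts hostnames; counting true second-level domains like `net.au` would additionally need a public-suffix list:

```python
# Sketch: per-domain counts in the same "count  host" layout as above.
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    hosts = (urlparse(u).hostname or "" for u in urls)
    # strip a leading "www." so www.example.com and example.com merge
    return Counter(h.removeprefix("www.") for h in hosts)

counts = domain_counts([
    "https://github.com/x", "https://www.github.com/y", "https://man7.org/z",
])
for host, n in counts.most_common():
    print(f"{n:>6} {host}")
```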
          
          6 votes