  • Showing only topics in ~comp with the tag "data".
    1. Looking for help scraping and deleting a Reddit account

      I have a couple of old Reddit accounts I’d like to delete as fully as possible. However one of them dates back to my teenage years and it’s some of the only writings I have from that time. Any recommendations on good simple ways to scrape all the comments off of it and save them? Then what’s the best way to completely erase a Reddit footprint these days?

      I'm looking for as simple a solution as possible. I’m not tech-illiterate by any means, but it’s also not a real strong suit for me.

      18 votes
    2. Help me decide what technology I should use for this project

      I’m a solo freelance programmer and want to write an app for internal project management: somewhere I can add projects, milestones, tasks, etc. and track them as I work on them. It should occasionally remind me of things like taking a break or lunch time, and over time let me track how many hours I worked in each category.

      I can't decide whether to build this as a web app or a Windows desktop app. I’m considering the latter because it can run efficiently on my laptop in the system tray, using minimal memory and resources. A web-based app, on the other hand, would force me to keep an Apache server running too, which is extra overhead (unless I host it on Google Cloud or someplace similar, which might be an option?).

      The only reason I'm considering web-based is that I’m eventually planning to make this tool open source, and a web-based version could be useful to many others too (including OSX/Linux users). At that point, I may consider expanding its schema to include multi-user connectivity, client login, etc., but that’s going too far at this point!

      The idea is that this tool should be useful not just for me but also for other freelancers, students, etc. who might be in my shoes. From that perspective, what do you think is the right technology to use? Web-based or Windows-based?

      (I’ve worked extensively on C#/WinForms projects before, and I’m thinking of Visual Studio Express for desktop development. If web-based, it’ll be PHP/MySQL-based.)

      5 votes
    3. How to design a database?

      I'm working on an application that allows a user to view playlists belonging to a particular radio show and stream/download/favourite the tracks in them. It has 4 core entities: User, Show, Playlist and Track.

      • Each show has multiple playlists (one-to-many)
      • Each playlist has multiple tracks (one-to-many)

      To be able to reference a playlist belonging to a particular show, I gave those playlists the same uuid as the show they belong to. A few questions though.

      1. Is this the right/best way to associate data?
      2. As a track could potentially belong to multiple playlists, I can't take the same approach as I do for show/playlist. How would it be best to handle this? Ideally I would like to have a single "Track" table containing all tracks for all playlists.

      For any experienced database designers out there, how would you structure this data? What would you consider in designing the schema and why? If I did go with 4 tables only, presumably there would be performance implications given the potential amount of data in any one of those tables, particularly tracks. If that is the case, how best to structure this kind of thing with performance in mind? Thanks in advance for any help :)

      For reference, in case it's of importance, I'm using sqlite3.
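
      For illustration, here is a minimal sketch of one common way to model these relationships in sqlite3: a foreign key from playlist to show, and a junction table between playlists and tracks. The table and column names (`shows`, `playlists`, `tracks`, `playlist_tracks`, `position`) and the sample rows are assumptions made up for this example, not part of the application described above.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE shows (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE playlists (
    id      INTEGER PRIMARY KEY,
    show_id INTEGER NOT NULL REFERENCES shows(id),  -- one show, many playlists
    title   TEXT NOT NULL
);
CREATE TABLE tracks (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
-- Junction table: a track can appear in many playlists and vice versa.
CREATE TABLE playlist_tracks (
    playlist_id INTEGER NOT NULL REFERENCES playlists(id),
    track_id    INTEGER NOT NULL REFERENCES tracks(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (playlist_id, track_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(SCHEMA)

# Made-up sample data: one show, two playlists, one track shared by both.
conn.execute("INSERT INTO shows (id, name) VALUES (1, 'Morning Show')")
conn.executemany(
    "INSERT INTO playlists (id, show_id, title) VALUES (?, 1, ?)",
    [(1, "Monday"), (2, "Tuesday")],
)
conn.execute("INSERT INTO tracks (id, title) VALUES (1, 'Some Track')")
conn.executemany("INSERT INTO playlist_tracks VALUES (?, 1, 1)", [(1,), (2,)])

# Which playlists contain track 1?
rows = conn.execute(
    """
    SELECT p.title
    FROM playlists AS p
    JOIN playlist_tracks AS pt ON pt.playlist_id = p.id
    WHERE pt.track_id = 1
    ORDER BY p.id
    """
).fetchall()
print(rows)
```

      With this layout each playlist keeps its own id and points at its show through `show_id`, so the two never need to share a uuid. The same track row is reused by every playlist that contains it. Favourites could presumably be handled with a similar junction table between users and tracks.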

      5 votes
    4. Where would a beginner start with data compression? What are some good books for it?

      Mostly the title. I have experience with Python, and I was thinking of learning more about data compression. How should I proceed? And what are some good books I could read, both about the specifics and about the more abstract side of data compression, data management, and data in general?

      15 votes
    5. XML Data Munging Problem

      Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.

      Your input is some XML such as this:

      <DOC>
      <TEXT PARTNO="000">
      <TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
      </TEXT>
      <TEXT PARTNO="001">
      *FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
      </TEXT>
      <TEXT PARTNO="002">
      In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
      </TEXT>
      <TEXT PARTNO="003">*BAR*-1
      <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
      </TEXT>
      <TEXT PARTNO="004">
      Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
      </TEXT>
      </DOC>
      

      The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.

      Your goal is to convert the in-line XML annotations to so-called stand-off annotations where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).
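
      To make the stand-off idea concrete, here is a minimal Python sketch. The text and offsets come from the first part of the example (with junk already removed). The names `mentions` and `mention_text` are made up for illustration.

```python
# Plain text of the first TEXT part of the example, with junk already removed.
text = "\nThis is some data .\n"

# Stand-off annotations: each one points into the text by character offsets
# (start inclusive, end exclusive, like a Python slice).
mentions = [
    {"id": 3, "start": 1, "end": 5},
    {"id": 0, "start": 9, "end": 18},
]

def mention_text(text, mention):
    """Recover the annotated span by slicing the text with the stored offsets."""
    return text[mention["start"]:mention["end"]]

for m in mentions:
    print(m["id"], repr(mention_text(text, m)))
```

      The text itself never changes. Adding, removing or correcting an annotation only touches the list of offsets.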

      The challenge, then, is to convert to a stand-off JSON format that includes the plain text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain text of the document. The plain text is defined as the character data in the XML document, ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two asterisks *, optionally followed by a dash - and one or more digits [0-9]+.
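
      The junk definition maps fairly directly onto a regular expression. A minimal sketch, where the names `JUNK` and `strip_junk` are made up for illustration:

```python
import re

# Junk as defined above: uppercase ASCII letters between two asterisks,
# optionally followed by a dash and one or more digits.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?")

def strip_junk(text):
    """Return text with every junk token removed (surrounding whitespace is kept)."""
    return JUNK.sub("", text)

print(repr(strip_junk("some *JUNK* data")))
print(repr(strip_junk("Note that *this* is just emphasized")))
print(repr(strip_junk("digits . *JUNK*-123")))
```

      Lowercase text between asterisks, such as *this*, is left alone. Note that the sample output appears to also drop the whitespace right after a junk token ("some data" has a single space), which this pattern alone does not do.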

      Here is the desired JSON output for the above example to test your solution:

      {
        "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
        "entities": [
          {
            "id": 0,
            "mentions": [
              {
                "start": 9,
                "end": 18,
                "id": 0,
                "text": "some data"
              },
              {
                "start": 41,
                "end": 49,
                "id": 0,
                "text": "the data"
              }
            ]
          },
          {
            "id": 1,
            "mentions": [
              {
                "start": 33,
                "end": 60,
                "id": 1,
                "text": "tags in the data are nested"
              },
              {
                "start": 80,
                "end": 91,
                "id": 1,
                "text": "nested tags"
              }
            ]
          },
          {
            "id": 2,
            "mentions": [
              {
                "start": 118,
                "end": 122,
                "id": 2,
                "text": "junk"
              },
              {
                "start": 144,
                "end": 148,
                "id": 2,
                "text": "Junk"
              },
              {
                "start": 326,
                "end": 330,
                "id": 2,
                "text": "junk"
              }
            ]
          },
          {
            "id": 3,
            "mentions": [
              {
                "start": 1,
                "end": 5,
                "id": 3,
                "text": "This"
              }
            ]
          },
          {
            "id": 4,
            "mentions": [
              {
                "start": 289,
                "end": 295,
                "id": 4,
                "text": "*this*"
              }
            ]
          }
        ]
      }
      

      Python 3 solution here.

      If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
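
      One event-based parser that ships with Python is `xml.parsers.expat`. Below is a rough sketch of how it could be applied here. It is not the linked solution. The function name `to_standoff` is made up, and swallowing the whitespace after a junk token is an assumption inferred from the sample output. It has not been checked against the full expected output, so whitespace handling around <DOC> may need adjusting.

```python
import re
from collections import defaultdict
from xml.parsers import expat

# Junk per the definition above. The trailing \s* is an assumption inferred
# from the sample output, where whitespace after junk also disappears.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?\s*")

def to_standoff(xml_text):
    """Convert in-line <TAG> annotations to stand-off form (hypothetical helper)."""
    chunks = []                    # cleaned text pieces, joined at the end
    length = 0                     # length of the cleaned text emitted so far
    open_tags = []                 # stack of (id, start offset) for open TAGs
    mentions = defaultdict(list)   # tag id -> list of mention dicts

    def start(name, attrs):
        if name == "TAG":
            open_tags.append((int(attrs["ID"]), length))

    def end(name):
        if name == "TAG":
            tag_id, begin = open_tags.pop()
            mentions[tag_id].append({"start": begin, "end": length, "id": tag_id})

    def chars(data):
        nonlocal length
        clean = JUNK.sub("", data)
        chunks.append(clean)
        length += len(clean)

    parser = expat.ParserCreate()
    parser.buffer_text = True      # deliver each text run in one callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)

    text = "".join(chunks)
    for group in mentions.values():
        for m in group:
            m["text"] = text[m["start"]:m["end"]]
    entities = [{"id": k, "mentions": v} for k, v in sorted(mentions.items())]
    return {"data": text, "entities": entities}

sample = '<TEXT><TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .</TEXT>'
result = to_standoff(sample)
print(result["data"])
print(result["entities"])
```

      The stack of open tags is what makes nesting work: an inner <TAG> closes first and pops its own start offset, while the outer one keeps waiting for its end event.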

      4 votes