• Activity
  • Votes
  • Comments
  • New
  • All activity
  • Showing only topics in ~comp with the tag "json". Back to normal view / Search all groups
    1. Input from a text file, pull from multiple APIs, formatting output, etc. in Python

      I don't need answers so much as an idea of where to start. Essentially, I have a Google Sheet that uses importjson.gs to pull from the following APIs OMDB (IMDB) TheMovieDB TVMaze I also use...

      I don't need answers so much as an idea of where to start.

      Essentially, I have a Google Sheet that uses importjson.gs to pull from the following APIs

      • OMDB (IMDB)
      • TheMovieDB
      • TVMaze

      I also use another script to scrape Letterboxd for ratings.

      This works well, but sometimes it'll time out or I'll hit urlFetch limits that Google has in place.

      Basically, I'd like to have a text file (input.txt) where I pop in a bunch of titles and year or IMDB IDs, then the script runs and pulls set endpoints from all of these, outputting everything on one line (a pipe as a delimiter.)

      My thinking is that I can then pull that info a sheet and run all of the formatting, basic math, and whatever else so it suits my Sheet.

      I have a feeling I'll be using requests for the JSON and beautifulsoup for letterboxd -- or maybe a module.

      Can anyone point me in the right direction? I don't think it'll be too difficult and should work well for a first python project.

      7 votes
    2. XML Data Munging Problem

      Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML. Your input is some XML such as this:...

      Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.

      Your input is some XML such as this:

      <DOC>
      <TEXT PARTNO="000">
      <TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
      </TEXT>
      <TEXT PARTNO="001">
      *FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
      </TEXT>
      <TEXT PARTNO="002">
      In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
      </TEXT>
      <TEXT PARTNO="003">*BAR*-1
      <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
      </TEXT>
      <TEXT PARTNO="004">
      Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
      </TEXT>
      </DOC>
      

      The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.

      Your goal is to convert the in-line XML annotations to so-called stand-off annotations where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).

      The challenge, then, is to convert to a stand-off JSON format that includes the plain-text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain-text of the document. The plain-text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two *, and optionally a trailing dash - followed by any number of digits [0-9]+.

      Here is the desired JSON output for the above example to test your solution:

      {
        "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
        "entities": [
          {
            "id": 0,
            "mentions": [
              {
                "start": 9,
                "end": 18,
                "id": 0,
                "text": "some data"
              },
              {
                "start": 41,
                "end": 49,
                "id": 0,
                "text": "the data"
              }
            ]
          },
          {
            "id": 1,
            "mentions": [
              {
                "start": 33,
                "end": 60,
                "id": 1,
                "text": "tags in the data are nested"
              },
              {
                "start": 80,
                "end": 91,
                "id": 1,
                "text": "nested tags"
              }
            ]
          },
          {
            "id": 2,
            "mentions": [
              {
                "start": 118,
                "end": 122,
                "id": 2,
                "text": "junk"
              },
              {
                "start": 144,
                "end": 148,
                "id": 2,
                "text": "Junk"
              },
              {
                "start": 326,
                "end": 330,
                "id": 2,
                "text": "junk"
              }
            ]
          },
          {
            "id": 3,
            "mentions": [
              {
                "start": 1,
                "end": 5,
                "id": 3,
                "text": "This"
              }
            ]
          },
          {
            "id": 4,
            "mentions": [
              {
                "start": 289,
                "end": 295,
                "id": 4,
                "text": "*this*"
              }
            ]
          }
        ]
      }
      

      Python 3 solution here.

      If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).

      4 votes