How Telltale Games went broke in one week
12 votes -
The impact of gratitude on adolescent materialism and generosity
10 votes -
Not real news: The Associated Press reports on this week's most shared fake news
19 votes -
The suffocation of American democracy
8 votes -
XML Data Munging Problem
Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.
Your input is some XML such as this:
<DOC>
<TEXT PARTNO="000"> <TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> . </TEXT>
<TEXT PARTNO="001"> *FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> . </TEXT>
<TEXT PARTNO="002"> In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore . </TEXT>
<TEXT PARTNO="003">*BAR*-1 <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123 </TEXT>
<TEXT PARTNO="004"> Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> ! </TEXT>
</DOC>
The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.
Your goal is to convert the in-line XML annotations to so-called stand-off annotations where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).
The challenge, then, is to convert to a stand-off JSON format that includes the plain-text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain-text of the document. The plain-text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two * characters, optionally followed by a trailing dash - and one or more digits [0-9]+.
Here is the desired JSON output for the above example to test your solution:
{
  "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
  "entities": [
    { "id": 0, "mentions": [
      { "start": 9, "end": 18, "id": 0, "text": "some data" },
      { "start": 41, "end": 49, "id": 0, "text": "the data" } ] },
    { "id": 1, "mentions": [
      { "start": 33, "end": 60, "id": 1, "text": "tags in the data are nested" },
      { "start": 80, "end": 91, "id": 1, "text": "nested tags" } ] },
    { "id": 2, "mentions": [
      { "start": 118, "end": 122, "id": 2, "text": "junk" },
      { "start": 144, "end": 148, "id": 2, "text": "Junk" },
      { "start": 326, "end": 330, "id": 2, "text": "junk" } ] },
    { "id": 3, "mentions": [
      { "start": 1, "end": 5, "id": 3, "text": "This" } ] },
    { "id": 4, "mentions": [
      { "start": 289, "end": 295, "id": 4, "text": "*this*" } ] }
  ]
}
Python 3 solution here.
If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
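For anyone who wants a starting point, here is a minimal sketch of the event-based approach using Python's standard xml.parsers.expat module. It is not the solution linked above, only one way the idea could look. The function name inline_to_standoff and the JUNK pattern are my own assumptions, and the sketch simply deletes junk as matched by that pattern, so it does not try to reproduce the exact whitespace handling of the expected output. It also assumes junk never straddles a tag boundary.

```python
import re
import xml.parsers.expat
from collections import defaultdict

# Assumed junk pattern: *UPPERCASE*, optionally followed by -digits.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?")


def inline_to_standoff(xml_text):
    """Convert in-line <TAG ID="n"> annotations to stand-off offsets (sketch)."""
    chunks = []     # raw character data, in document order
    raw_len = 0     # length of the raw text seen so far
    open_tags = []  # stack of (id, raw_start) for <TAG>s not yet closed
    spans = []      # finished (id, raw_start, raw_end) triples

    def on_start(name, attrs):
        if name == "TAG":
            open_tags.append((int(attrs["ID"]), raw_len))

    def on_end(name):
        if name == "TAG":
            tag_id, start = open_tags.pop()
            spans.append((tag_id, start, raw_len))

    def on_text(data):
        nonlocal raw_len
        chunks.append(data)
        raw_len += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)

    # Remove junk only after parsing, so junk that the parser happened to
    # split across several text callbacks is still matched as one piece.
    raw = "".join(chunks)
    junk = [m.span() for m in JUNK.finditer(raw)]
    data = JUNK.sub("", raw)

    def to_clean(pos):
        # Shift a raw offset left by the junk characters that precede it.
        removed = sum(max(0, min(pos, end) - start) for start, end in junk)
        return pos - removed

    mentions = defaultdict(list)
    for tag_id, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        s, e = to_clean(start), to_clean(end)
        mentions[tag_id].append(
            {"start": s, "end": e, "id": tag_id, "text": data[s:e]}
        )
    entities = [{"id": i, "mentions": mentions[i]} for i in sorted(mentions)]
    return {"data": data, "entities": entities}
```

Recording offsets against the raw text first and remapping them afterwards keeps the parser callbacks trivial, at the cost of a second pass over the junk spans for each offset.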
4 votes -
Endless Depth - A Fractal Piece
7 votes -
Knuckle Puck - Double Helix (2017)
5 votes -
What I’ve learnt about parenting a queer teen
9 votes -
Funimation, Crunchyroll end content-sharing partnership
13 votes -
The LibriVox free audiobook collection
12 votes -
Jony Ive on the Apple Watch and Big Tech’s responsibilities
5 votes -
Extreme botany: The precarious science of endangered rare plants
7 votes -
Gorky Park - Bang (1989)
4 votes -
100 Websites That Shaped the Internet as We Know It
9 votes -
What are some current examples of "the emperor's new clothes?"
For those unfamiliar with the story, "The Emperor's New Clothes" is about an emperor who parades around naked, but nobody will point out the obvious for fear of being seen as ignorant. Idiomatically, it refers to something seen as true or widely praised, simply because nobody is willing to speak out against it.
I saw a rant about "blockchains" being the new overhyped hotness for tech companies, and it made me wonder what other "new clothes" are out there right now. What's something you have a strong takedown for that everybody else seems to love/support?
38 votes -
VC folks talk about social media, community, and the failings - includes ex-product head of YouTube
3 votes -
Discord just added a forced-arbitration clause to their Terms of Service (Discord staff response in comments)
39 votes -
The bad behavior of the richest: what I learned from wealth managers
16 votes -
Bomb the Music Industry! - Struggler (2010)
6 votes -
Your Real Biological Clock is You’re Going to Die
9 votes -
Japan's Hometown Tax
10 votes -
Apple CEO Tim Cook is calling for Bloomberg to retract its Chinese spy chip story
13 votes -
What will be left of the people who make our games?
20 votes -
Halfbakery (is back)
13 votes -
Steam developers speak: Maximum profits for Valve, minimum responsibilities
10 votes -
Google responds to EU by adding a fee to Play Services
18 votes -
How alt-right is Friedrich Nietzsche really?
6 votes -
Six red carnations and one severed ram’s head: Deadly threats sent to Russian independent newspaper
6 votes -
What Maniac does (and doesn't) get right about the Bible and the Gnostics
5 votes -
Ubuntu 18.10 released
28 votes -
What have you been listening to this week?
What have you been listening to this week? You don't need to do a 6000 word review if you don't want to, but please write something!
Feel free to give recs or discuss anything about each other's listening habits.
You can make a chart if you use last.fm:
http://www.tapmusic.net/lastfm/
Remember that a direct link to your image will update with your future listening, so make sure to reupload it to somewhere like imgur if you'd like it to stay as it was at the time of posting.
16 votes -
Deutschland 86: Spy drama rebooted, braced for communism's collapse
4 votes -
The people who moved to Chernobyl
8 votes -
Why aren't most women represented in the last names of their children?
14 votes -
Sydney Anglicans to ban same-sex marriage, yoga on all church property
3 votes -
Julia Holter - Words I Heard (2018)
3 votes -
Thanks for doing this, guys
I just read over the Technical Goals page, and I agree with everything you're doing. I feel like I'm the only one who's annoyed by the recent trends of web design and development that involve heavy use of icon-based navigation and JavaScript. I'm stoked to see what Tildes will become given this direction.
I know I'm not nearly as skilled and experienced as you -- Deimos and the Tildes team -- at web development, but if there's any way that I -- a sysadmin with experience in AWS -- can help, I would be happy to be a part of it. Otherwise, I can continue to lurk, occasionally post, and sing praise of Tildes! Cheers!
26 votes -
How an unlikely family history website transformed cold case investigations
6 votes -
How Facebook’s Chaotic Push Into Video Cost Hundreds of Journalists Their Jobs
11 votes -
Brad forages for porcini mushrooms | It's Alive
5 votes -
Ray Tracing Is No New Thing
12 votes -
Pat the Bunny - I'm Going Home (2014)
8 votes -
"Queer people are allowed to exist – but only as long as they’re of a certain stock": 'The Wound' star Nakhane
5 votes -
Sydney Anglicans set to ban gay weddings and pro-LGBTI advocacy on church property
2 votes -
Why Pyramid?
This is mostly a question for @Deimos, just out of curiosity: is there a particular reason for the choice of Pyramid as the framework for Tildes? Is it familiarity, or clear advantages over something like Django or Flask?
(Edit: actually I'd welcome comparisons favouring one or another from anyone too, related or not to Tildes itself.)
20 votes -
What are the strings in String Theory?
7 votes -
New York Attorney General launches probe into MoviePass parent company for allegedly misleading investors
9 votes -
Devoted fans pay thousands for this fermented tea, whose decades-old vintages are treated like wine
13 votes -
Disrupting cyberwar with open source intelligence
5 votes