-
7 votes
-
Turning old maps into 3D digital models of lost neighborhoods
9 votes -
These 3D printers print 3D printers! Touring inside Prusa Research's factory to see how they make their 3d printers (using their 3d printers!) and their filament.
10 votes -
Ancient Sahul's submerged landscapes reveal a mosaic of human habitation
17 votes -
Can you recreate Spirited Away in Blender? Should you? I gave it a try with three scenes.
22 votes -
A Spanish agency became so sick of models and influencers that they created their own with AI — and she’s raking in up to $11,000 a month
27 votes -
I lost my finger.. so I made a new one
4 votes -
Miss Ukraine makes powerful fashion statement amid nation's ongoing conflict with Russia
8 votes -
Visualizing the layers of the TCP/IP model
3 votes -
The strangest explanation of VLANs you've never heard
5 votes -
Nirvana attorneys seek dismissal of ‘Nevermind’ ‘child porn’ lawsuit, calling it too late and too ‘absurd’
9 votes -
Quannah Chasinghorse is on a mission - The 19-year-old model is a warrior for her culture and the land her people have inhabited for thousands of years
5 votes -
Lelush: How a sulky Russian model became China's slacker icon while trapped in a TV show against his will
10 votes -
3D CAD software provider Dassault Systèmes announces Solidworks for Makers and Solidworks for Students
7 votes -
Emily Ratajkowski - Owning my image
26 votes -
POTS: protective optimization technologies
5 votes -
South African lawyer is first albino model on Vogue cover: ‘The way I look is enough’
5 votes -
'Esquire' criticized for cover story on 'what it’s like to grow up white, middle class, and male'
10 votes -
I watched D&G’s China show fall apart from the inside
14 votes -
How to build an amazing seaside diorama – Realistic scenery Vol.14
4 votes -
XML Data Munging Problem
Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.
Your input is some XML such as this:
<DOC>
<TEXT PARTNO="000">
<TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
</TEXT>
<TEXT PARTNO="001">
*FOO*
Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
</TEXT>
<TEXT PARTNO="002">
In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
</TEXT>
<TEXT PARTNO="003">*BAR*-1 <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
</TEXT>
<TEXT PARTNO="004">
Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
</TEXT>
</DOC>
The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.
Your goal is to convert the in-line XML annotations to so-called stand-off annotations, where the text is separated from the annotations and each annotation refers to the text by starting and ending character offsets, as if slicing into the text as a character array. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and they can be modified without changing the document content itself (the text is immutable).
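To make the slicing idea concrete, here is a minimal Python sketch. The text and the two offsets are copied from the start of the expected output given further down in this post; the helper name `mention_text` is made up for illustration. It assumes that `end` is exclusive, like a Python slice, which is what the expected offsets suggest.

```python
# Start of the plain text from the expected output in this post.
data = "\nThis is some data .\n"

# Two stand-off mentions: they point into `data` instead of wrapping it.
mentions = [
    {"start": 1, "end": 5, "id": 3},
    {"start": 9, "end": 18, "id": 0},
]


def mention_text(text, mention):
    """Recover the annotated span by slicing (hypothetical helper)."""
    return text[mention["start"]:mention["end"]]


for m in mentions:
    print(m["id"], repr(mention_text(data, m)))
```

The annotations can be added, removed, or re-labelled without touching `data` at all, which is the point of the stand-off format.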
The challenge, then, is to convert to a stand-off JSON format that includes the plain text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain text of the document. The plain text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two asterisks *, optionally followed by a dash - and one or more digits [0-9]+.
Here is the desired JSON output for the above example to test your solution:
{
  "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
  "entities": [
    {
      "id": 0,
      "mentions": [
        { "start": 9, "end": 18, "id": 0, "text": "some data" },
        { "start": 41, "end": 49, "id": 0, "text": "the data" }
      ]
    },
    {
      "id": 1,
      "mentions": [
        { "start": 33, "end": 60, "id": 1, "text": "tags in the data are nested" },
        { "start": 80, "end": 91, "id": 1, "text": "nested tags" }
      ]
    },
    {
      "id": 2,
      "mentions": [
        { "start": 118, "end": 122, "id": 2, "text": "junk" },
        { "start": 144, "end": 148, "id": 2, "text": "Junk" },
        { "start": 326, "end": 330, "id": 2, "text": "junk" }
      ]
    },
    {
      "id": 3,
      "mentions": [
        { "start": 1, "end": 5, "id": 3, "text": "This" }
      ]
    },
    {
      "id": 4,
      "mentions": [
        { "start": 289, "end": 295, "id": 4, "text": "*this*" }
      ]
    }
  ]
}
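The junk definition given earlier might be sketched as a regular expression like the one below. The names `JUNK` and `strip_junk` are made up for illustration. The trailing `\s?` is an assumption, not part of the stated definition: the expected output reads "some data" with a single space, which suggests the reference solution also drops one whitespace character after each piece of junk.

```python
import re

# Uppercase ASCII letters between two asterisks, optionally followed by
# a dash and one or more digits. The final \s? is an assumption inferred
# from the expected output, not part of the stated junk definition.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?\s?")


def strip_junk(text):
    """Remove junk markers from a piece of character data (hypothetical helper)."""
    return JUNK.sub("", text)


print(repr(strip_junk("some *JUNK* data")))
print(repr(strip_junk("*BAR*-1 Junk")))
print(repr(strip_junk("Note that *this* is just emphasized")))
```

Lowercase text between asterisks, such as *this*, does not match and is left alone.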
Python 3 solution here.
If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
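As an illustration of the hint, here is a rough sketch of offset tracking with Python's event-based `xml.parsers.expat` parser. It is not the linked solution and it is deliberately incomplete: it does no junk removal and counts all character data, so it will not reproduce the expected output above as-is. The function name `standoff` and the small test input are made up for illustration.

```python
import xml.parsers.expat
from collections import defaultdict


def standoff(xml_text):
    """Turn in-line <TAG ID="..."> annotations into stand-off offsets.

    Sketch only: junk is NOT stripped here.
    """
    chunks = []                   # pieces of plain text seen so far
    length = 0                    # running length of the plain text
    open_tags = []                # stack of (id, start offset) for open <TAG>s
    mentions = defaultdict(list)  # tag id -> list of mentions

    def start_element(name, attrs):
        if name == "TAG":
            open_tags.append((int(attrs["ID"]), length))

    def end_element(name):
        if name == "TAG":
            tag_id, start = open_tags.pop()
            mentions[tag_id].append({"start": start, "end": length, "id": tag_id})

    def char_data(data):
        nonlocal length
        # Junk stripping would go here, before the length is updated.
        chunks.append(data)
        length += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True     # coalesce adjacent character data
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)

    text = "".join(chunks)
    for group in mentions.values():
        for m in group:
            m["text"] = text[m["start"]:m["end"]]
    entities = [{"id": i, "mentions": ms} for i, ms in sorted(mentions.items())]
    return {"data": text, "entities": entities}


sample = ('<DOC><TEXT>Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG>'
          ' are nested</TAG> .</TEXT></DOC>')
print(standoff(sample))
```

The stack is what makes nested tags work: the inner tag closes first and pops its own start offset, while the outer tag's start stays on the stack until its end event arrives.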
4 votes