22 votes -
Get all Megascans for free
12 votes -
Scientists and archivists worry Epic Games' control of the 3D model market will 'destroy' cultural heritage
35 votes -
Trees and land absorbed almost no CO2 last year
35 votes -
Inside the companies that set sports gambling odds
8 votes -
Modeling shows that reconductoring can quickly beef up grids
6 votes -
Maelstrom under Greenland's glaciers could slow future sea level rise – pioneering mission into mysterious and violent world may reveal ‘speed bumps’ on the way to global coastal inundation
3 votes -
The Sydler π/4 polyhedron. The shape that should be impossible.
15 votes -
See the most detailed map of human brain matter ever created
14 votes -
Inflation in times of overlapping emergencies: Systemically significant prices from an input–output perspective
7 votes -
Turning old maps into 3D digital models of lost neighborhoods
9 votes -
Atlantic Ocean circulation nearing ‘devastating’ tipping point, study finds
45 votes -
These 3D printers print 3D printers! Touring inside Prusa Research's factory to see how they make their 3d printers (using their 3d printers!) and their filament.
10 votes -
Ancient Sahul's submerged landscapes reveal a mosaic of human habitation
17 votes -
Can you recreate Spirited Away in Blender? Should you? I gave it a try with three scenes.
22 votes -
Greenland's massive ice sheet could experience runaway melting if the world overshoots climate targets – but even then quick action could stabilize it
6 votes -
Daily tides stoked with increasingly warmer water ate a huge hole at the bottom of one of Greenland's major glaciers in the last couple of years
4 votes -
I lost my finger.. so I made a new one
4 votes -
Visualizing the layers of the TCP/IP model
3 votes -
The strangest explanation of VLANs you've never heard
5 votes -
A centuries-old concept in soil science has recently been thrown out. Yet it remains a key ingredient in everything from climate models to advanced carbon-capture projects.
17 votes -
Name don'ts
14 votes -
DeepMind worked with UK weather forecasters to create a model that was better at making short term predictions than existing systems
9 votes -
CliMA collaboration aims to reinvent Earth system modeling
6 votes -
3D CAD software provider Dassault Systèmes announces Solidworks for Makers and Solidworks for Students
7 votes -
Climate change has likely already affected global food production
5 votes -
The South Pole is warming fast. Very fast.
10 votes -
Climate change: The South Pole feels the heat, as warming over the Antarctic continent took place at three times the global rate since 1989
5 votes -
Climate worst-case scenarios may not go far enough, cloud data shows
7 votes -
New state-level model from Imperial College London suggests that epidemic is not under control in most American states, predicts major surge in cases and deaths over next two months
25 votes -
The need for software testing: Neil Ferguson's unstable epidemiologic model
10 votes -
Predictability: Can the turning point and end of an expanding epidemic be precisely forecast?
7 votes -
Scientists confirm dramatic melting of Greenland ice sheet – loss largely due to high pressure zone not taken into account by climate models
5 votes -
Why it’s so hard to make a good COVID-19 model
8 votes -
POTS: protective optimization technologies
5 votes -
The long-awaited upgrade to the US weather forecast model is here
7 votes -
How to build an amazing seaside diorama – Realistic scenery Vol.14
4 votes -
XML Data Munging Problem
Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.
Your input is some XML such as this:
<DOC>
<TEXT PARTNO="000">
<TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
</TEXT>
<TEXT PARTNO="001">
*FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
</TEXT>
<TEXT PARTNO="002">
In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
</TEXT>
<TEXT PARTNO="003">*BAR*-1 <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
</TEXT>
<TEXT PARTNO="004">
Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
</TEXT>
</DOC>
The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.
Your goal is to convert the in-line XML annotations to so-called stand-off annotations, where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).
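As a minimal illustration of the stand-off idea, the sketch below keeps a plain-text string and a separate list of offset-based annotations. The names `text`, `annotations`, and `mention_text` are made up for this example, and the offsets are for this short string only, not for the full challenge input.

```python
# Plain text is stored once and never modified.
text = "This is some data ."

# Stand-off annotations: each one points into the text by character offsets.
annotations = [
    {"id": 3, "start": 0, "end": 4},
    {"id": 0, "start": 8, "end": 17},
]


def mention_text(text, annotation):
    """Recover the annotated span by slicing the text as a character array."""
    return text[annotation["start"]:annotation["end"]]


for annotation in annotations:
    print(annotation["id"], repr(mention_text(text, annotation)))
```

Changing or deleting an entry in `annotations` leaves `text` untouched, which is the property the stand-off format is meant to provide.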
The challenge, then, is to convert to a stand-off JSON format that includes the plain-text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain-text of the document. The plain-text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two * characters, optionally followed by a trailing dash - and one or more digits [0-9]+.
Here is the desired JSON output for the above example to test your solution:
{
  "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
  "entities": [
    {
      "id": 0,
      "mentions": [
        { "start": 9, "end": 18, "id": 0, "text": "some data" },
        { "start": 41, "end": 49, "id": 0, "text": "the data" }
      ]
    },
    {
      "id": 1,
      "mentions": [
        { "start": 33, "end": 60, "id": 1, "text": "tags in the data are nested" },
        { "start": 80, "end": 91, "id": 1, "text": "nested tags" }
      ]
    },
    {
      "id": 2,
      "mentions": [
        { "start": 118, "end": 122, "id": 2, "text": "junk" },
        { "start": 144, "end": 148, "id": 2, "text": "Junk" },
        { "start": 326, "end": 330, "id": 2, "text": "junk" }
      ]
    },
    {
      "id": 3,
      "mentions": [
        { "start": 1, "end": 5, "id": 3, "text": "This" }
      ]
    },
    {
      "id": 4,
      "mentions": [
        { "start": 289, "end": 295, "id": 4, "text": "*this*" }
      ]
    }
  ]
}
Python 3 solution here.
If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
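For anyone who wants to see the event-based approach from the hint in code, here is one possible sketch using Python's standard `xml.parsers.expat` module. It is not the linked solution. The names `to_standoff` and `JUNK` are made up for this example, and the trailing `\s?` in the pattern is an assumption inferred from the expected output, where "some *JUNK* data" becomes "some data". It also keeps every character of text it sees, so whitespace handling may need adjusting to match the expected output exactly.

```python
import re
from collections import defaultdict
from xml.parsers import expat

# Junk: "*UPPERCASE*", optionally "-digits". The trailing "\s?" (swallow one
# whitespace character after the junk) is an assumption, not part of the spec.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?\s?")


def to_standoff(xml_text):
    pieces = []                   # cleaned character data, in document order
    offset = 0                    # length of the plain text emitted so far
    open_tags = []                # stack of (id, start offset) for open <TAG>s
    mentions = defaultdict(list)  # tag id -> list of mention dicts

    def on_start(name, attrs):
        if name == "TAG":
            open_tags.append((int(attrs["ID"]), offset))

    def on_end(name):
        if name == "TAG":
            tag_id, start = open_tags.pop()
            mentions[tag_id].append({"start": start, "end": offset, "id": tag_id})

    def on_text(chunk):
        nonlocal offset
        clean = JUNK.sub("", chunk)
        pieces.append(clean)
        offset += len(clean)

    parser = expat.ParserCreate()
    # Deliver contiguous text as one chunk so junk is not split across
    # callbacks (holds for small inputs; very long runs may still be split).
    parser.buffer_text = True
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)

    data = "".join(pieces)
    entities = []
    for tag_id in sorted(mentions):
        found = sorted(mentions[tag_id], key=lambda m: m["start"])
        for mention in found:
            mention["text"] = data[mention["start"]:mention["end"]]
        entities.append({"id": tag_id, "mentions": found})
    return {"data": data, "entities": entities}


sample = ('<DOC><TEXT PARTNO="000"><TAG ID="3">This</TAG> is '
          '<TAG ID="0">some *JUNK* data</TAG> .</TEXT></DOC>')
result = to_standoff(sample)
print(result["data"])
```

The stack handles nested tags: an inner `<TAG>` closes before its parent, so popping on each end event pairs every end offset with the right start offset. Junk that straddles a tag boundary is not handled by this sketch.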
4 votes -
These stunning 3D models are transforming scientists’ raw data
4 votes