Going between XML (or JSON) and table-like formats

Are there any good SQL or other table-based methods to support the variability in XML/JSON data?
Is there a standard way to flatten this kind of data for tabular work?

As a simple illustration, NCBI databases are accessible via a RESTful API.
Background for PubMed's API: The E-utilities In-Depth: Parameters, Syntax and More - Entrez Programming Utilities Help - NCBI Bookshelf
First you query with esearch and get a list of PubMed IDs (PMIDs) that match.
Then you request the information for those PMIDs, and PubMed will return an XML dump of all the data for each ID.

So initially you do a series of calls, one for each unique search, collecting IDs:

search[terms1] = PMID_list1
search[terms2] = PMID_list2
. . .

where each PMID_list holds the PMIDs corresponding to that search. There is a strong PMID standard, but the number of PMIDs returned varies from 0 to quite large (thousands).
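To make that step concrete, here is a minimal Python sketch of the esearch calls. It uses the standard E-utilities esearch endpoint with retmode=json; the search terms and retmax value are placeholders, and for real use you would add your API key and rate limiting.

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmids_for(term, retmax=500):
    """Return the list of PMIDs matching a single search term."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    resp = requests.get(ESEARCH, params=params)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# Keep the search -> PMID_list mapping; it is needed again later for filtering.
searches = ["aspirin AND headache", "covid vaccine efficacy"]  # placeholder terms
search_to_pmids = {term: pmids_for(term) for term in searches}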

Then you can take the unique set of PMIDs, request the data, and parse it out into your favorite layout, filtering on the search -> PMID mapping as needed (you did remember to keep it, right?), including tabularly convenient layouts like dataframes or a key-value map of PMID -> row contents.
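Continuing the sketch above, a rough version of the fetch-and-parse step might look like this. The efetch endpoint and its db/id/retmode parameters are standard E-utilities usage, but the XML element names (MedlineCitation, PMID, Article/ArticleTitle, Journal/Title) are from memory of the PubMed format, so verify them against a real response before relying on them.

import requests
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_records(pmids):
    """Fetch PubMed XML for a batch of PMIDs and return one flat dict per article."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    resp = requests.get(EFETCH, params=params)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    rows = []
    for article in root.iter("PubmedArticle"):
        cit = article.find("MedlineCitation")
        rows.append({
            "pmid": cit.findtext("PMID"),
            "title": cit.findtext("Article/ArticleTitle"),
            "journal": cit.findtext("Article/Journal/Title"),
        })
    return rows

# Unique PMIDs across all searches; the search -> PMID map is kept for later filtering.
unique_pmids = sorted({p for ids in search_to_pmids.values() for p in ids})
records = fetch_records(unique_pmids[:200])  # batch/slice as appropriate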

For an individual comfortable enough with a scripting language to do the above, one can argue that this is enough. Often, though, for one reason or another, there are situations where being purely table-based (CSV or SQL-like) has added value.

I’m wondering if others in the community have alternative ideas for working with such data: methods that improve this process, or simply additional approaches.

Hi! I might not fully understand the question, but to me it reads like a discussion of whether you should use a SQL or NoSQL pattern for your database. When delving deeper into that topic, one sees that each pattern has pros and cons for each situation…

… but at the end of the day you have to accept what your data provider initially gives you. If that schema is XML/JSON (which has the power of representing nested relations without repeating data), then you have to process it.

For most of my projects I use relational SQL databases, and the input most of the time comes from XML/JSON endpoints, so I need to process them. I do it with separate bash scripts, and my favorite tool is jq (https://stedolan.github.io/jq/); jq is like sed for JSON data.

In larger projects I always advise adding a separate "adapter layer" between data sources/providers and the actual databases or handling of the data. These are pieces of software/scripts that simply transform the data from one layout to the other.
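To make the idea concrete, here is a minimal Python sketch of such an adapter: nested record in, flat row out, with anything that doesn't fit the fixed columns stashed into a JSON "extras" column. The column names and the sample record are made up for illustration, not a recommendation.

import json

COLUMNS = ("pmid", "title", "journal")  # the fields that fit a flat table

def to_row(record):
    """Adapter: nested dict in, flat row out; leftovers go into a JSON blob column."""
    row = {col: record.get(col) for col in COLUMNS}
    extras = {k: v for k, v in record.items() if k not in COLUMNS}
    row["extras_json"] = json.dumps(extras) if extras else None
    return row

# Example nested record (made up) -> one flat row ready for CSV/SQL.
record = {"pmid": "12345", "title": "Example", "journal": "Some Journal",
          "authors": [{"last": "Smith"}, {"last": "Jones"}]}
flat = to_row(record)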

Hi Claude,
Thanks for your reply.
jq is great (have you seen yq for YAML?).

It sounds like you're working in the same context. I think I may not have been clear enough about the specifics of the question.

Some of the content in the adapter layer doesn't seem to fit easily into SQL or an 'easy to read' table.
As a result, some of the intermediates are not stored, so we end up rerunning from the beginning every time the question varies, even if the needed content would have been in the intermediates.

I'm being optimistic that I'm just unaware of a method/option for tabularizing it that others can then use for filters, joins, etc. Any ideas (or follow-up questions)?

Thanks again.

If you just want to flatten the results, there are packages like json_normalize (pandas) and flatten: https://github.com/amirziai/flatten
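For example, a minimal pandas sketch (the record and field names are invented for illustration): json_normalize can explode a nested list into one row per element while carrying the parent fields along.

import pandas as pd

record = {
    "pmid": "12345",
    "title": "Example article",
    "authors": [{"last": "Smith", "first": "A"}, {"last": "Jones", "first": "B"}],
}

# One row per author, with the parent pmid/title repeated on each row.
df = pd.json_normalize([record], record_path="authors", meta=["pmid", "title"])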

Otherwise you could create a database and use foreign key relations to connect your entries. Use this as a cache: once you do a search or retrieve entries from PubMed, store them in your database and you'll have everything for later. You can also download the PubMed database in pieces if you have a large drive.
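As a sketch of the cache idea (the table and column names are just one possible layout), SQLite is enough to keep searches, articles, and the many-to-many relation between them, so repeat questions can be answered from the cache instead of re-querying PubMed.

import sqlite3

conn = sqlite3.connect("pubmed_cache.db")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE IF NOT EXISTS articles (
    pmid     INTEGER PRIMARY KEY,
    title    TEXT,
    raw_xml  TEXT              -- keep the original record so nothing is lost
);
CREATE TABLE IF NOT EXISTS searches (
    search_id INTEGER PRIMARY KEY AUTOINCREMENT,
    terms     TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS search_results (   -- which search returned which PMID
    search_id INTEGER REFERENCES searches(search_id),
    pmid      INTEGER REFERENCES articles(pmid),
    PRIMARY KEY (search_id, pmid)
);
""")
conn.commit()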