While trying to discover latent topics for an assignment at work, I ran into the field of information extraction. A simple data model for information extraction is RDF (Resource Description Framework). RDF relates entities in a subject-predicate-object format, where the predicate describes how the subject and the object are related. This triple is a minimal representation of a piece of information.
Here are some examples of simple relations in subject-predicate-object format:
- Houston – is located in – Texas
- Ted – is the son of – Steve
- Elvis – is buried in – Graceland
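To make the format concrete, here is a minimal Python sketch of how the three example triples could be stored and queried. The `Triple` class and the `objects_of` helper are names I made up for illustration; they are not part of RDF itself or of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


# The three example relations from the list above.
triples = [
    Triple("Houston", "is located in", "Texas"),
    Triple("Ted", "is the son of", "Steve"),
    Triple("Elvis", "is buried in", "Graceland"),
]


def objects_of(subject, predicate, facts):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in facts
            if t.subject == subject and t.predicate == predicate]


print(objects_of("Elvis", "is buried in", triples))
```

Keeping each fact as a flat three-field record is what makes triples easy to filter and combine: any question about the data reduces to matching on one or two of the fields.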
This triple format can be used to pull information from any sentence. To help with the extraction, I found a paper that explains the algorithm for extracting the triplet in great detail. The first step was to download the Stanford Parser and use Python's great NLTK package to parse the sentence into an NLTK-readable format. Once the sentence was parsed, I implemented the algorithms from the paper.
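To give a rough idea of what extracting a triplet from a parse tree can look like, here is a simplified sketch. It is not the paper's algorithm and not my actual implementation: the parse tree is written by hand as nested tuples instead of coming from the Stanford Parser, and the heuristic (first noun under the NP, first verb and first noun under the VP) is an assumption chosen to keep the example short. All function names are hypothetical.

```python
from collections import deque

# Hand-written constituency parse: (label, child, child, ...).
# Leaves are (part-of-speech tag, word). A real parser would produce this.
TREE = ("S",
        ("NP", ("DT", "The"), ("NN", "man")),
        ("VP", ("VBD", "stood"),
         ("ADVP", ("RB", "next"),
          ("PP", ("TO", "to"),
           ("NP", ("DT", "the"), ("NN", "refrigerator"))))))


def is_leaf(node):
    """A leaf is a (tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def first_word(node, tag_prefix):
    """Breadth-first search for the first word whose tag starts with tag_prefix."""
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if is_leaf(current):
            if current[0].startswith(tag_prefix):
                return current[1]
        else:
            queue.extend(current[1:])
    return None


def child_with_label(node, label):
    """Return the first direct child carrying the given label, or None."""
    return next((c for c in node[1:] if c[0] == label), None)


def extract_triple(sentence_tree):
    """Simplified heuristic: noun in NP, then verb and noun in VP."""
    np = child_with_label(sentence_tree, "NP")
    vp = child_with_label(sentence_tree, "VP")
    subject = first_word(np, "NN") if np else None
    predicate = first_word(vp, "VB") if vp else None
    obj = first_word(vp, "NN") if vp else None
    return subject, predicate, obj


print(extract_triple(TREE))
```

A heuristic this small breaks on passive voice, conjunctions, and nested clauses, which is why a detailed algorithm like the one in the paper is needed for real sentences.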
The output is a subject, a predicate, and an object, along with attributes for each item in the triplet. I formatted the results as a JSON object and made them readily available for anyone to use at the following URL, my very first API.
To access the triple, use this URL and enter your sentence: http://www.newventify.com/rdf?sentence="your sentence here"
Here's an example: http://www.newventify.com/rdf?sentence=%22The%20man%20stood%20next%20to%20the%20refrigerator%22
The code for this miniproject can be found here on GitHub.
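If you would rather build the request URL in code than percent-encode it by hand, a small sketch using only Python's standard library could look like this. The function name `build_rdf_url` is made up for illustration, and the sketch only constructs the URL; it does not call the API or assume anything about the shape of the JSON that comes back.

```python
from urllib.parse import quote

BASE_URL = "http://www.newventify.com/rdf"


def build_rdf_url(sentence):
    """Wrap the sentence in double quotes and percent-encode it for the query string."""
    quoted_sentence = '"' + sentence + '"'
    # safe="" makes quote() encode every reserved character, including "/".
    return BASE_URL + "?sentence=" + quote(quoted_sentence, safe="")


print(build_rdf_url("The man stood next to the refrigerator"))
```

For the sentence above, this produces the same URL as the example: spaces become %20 and the surrounding double quotes become %22.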