How to Manage Electronic Library Databases with Intellexer API

Managing the development and delivery of electronic library services is one of the major challenges universities face. Information professionals who work in higher education institutions encounter a number of problems while managing data and addressing the needs of those who use their electronic libraries.

Let’s first understand what e-libraries need.

  • Material categorization;
  • Advanced search;
  • Annotation creation.

Providers of custom business intelligence solutions offer a number of tools that can help address these challenges. Let’s have a look at some of the linguistic tools incorporated in the Intellexer API.

1) Categorizer addresses the need for material categorization; it is available as a desktop application or as a component of the Intellexer SDK. Its algorithms are based on machine learning, so classification runs in two stages: the training phase and the prediction stage.

At the training phase, Categorizer builds a classifier by learning from a set of model documents (books) for each category. Its algorithm uses a wide range of semantic features extracted from document texts.
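The training idea can be illustrated with a minimal sketch. It is not the Categorizer implementation: it assumes plain word frequencies stand in for the semantic features, and the names Tokenize, Profile and TrainCategory are hypothetical and not part of the Intellexer SDK.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: lowercase a text and split it into alphabetic words.
std::vector<std::string> Tokenize(const std::string& text)
{
        std::vector<std::string> words;
        std::string current;
        for (unsigned char ch : text)
        {
                if (std::isalpha(ch))
                {
                        current += static_cast<char>(std::tolower(ch));
                }
                else if (!current.empty())
                {
                        words.push_back(current);
                        current.clear();
                }
        }
        if (!current.empty())
                words.push_back(current);
        return words;
}

// A category profile maps each word to its relative frequency
// across all model documents of that category (an assumed, simplified feature set).
using Profile = std::map<std::string, double>;

// "Training": count words over the model documents and normalize the counts.
Profile TrainCategory(const std::vector<std::string>& modelDocuments)
{
        Profile profile;
        double total = 0.0;
        for (const std::string& doc : modelDocuments)
        {
                for (const std::string& word : Tokenize(doc))
                {
                        profile[word] += 1.0;
                        total += 1.0;
                }
        }
        if (total > 0.0)
        {
                for (auto& entry : profile)
                        entry.second /= total;
        }
        return profile;
}
```

Calling TrainCategory once per category, with that category's model documents, would yield one profile per category that a later prediction step can compare new texts against.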

At the prediction stage, to assign a document to a category, the tool uses the vector space model (an algebraic model that represents text documents as vectors of identifiers). How does the classification process work? The text entered by the user is compared with the semantic features of each model category, and the tool calculates the degree of proximity between them. The document is then assigned to the user-created category with the highest relevance value.
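A common way to measure proximity in a vector space model is cosine similarity. The sketch below assumes that measure and a simple word-to-weight map as the vector; the source does not say which measure Categorizer actually uses, and TermVector, CosineSimilarity and Classify are hypothetical names.

```cpp
#include <cmath>
#include <map>
#include <string>

// Assumed representation: a sparse vector mapping a term to its weight.
using TermVector = std::map<std::string, double>;

// Cosine of the angle between two sparse vectors; 0 if either one is empty.
double CosineSimilarity(const TermVector& a, const TermVector& b)
{
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (const auto& entry : a)
        {
                normA += entry.second * entry.second;
                auto it = b.find(entry.first);
                if (it != b.end())
                        dot += entry.second * it->second;
        }
        for (const auto& entry : b)
                normB += entry.second * entry.second;
        if (normA == 0.0 || normB == 0.0)
                return 0.0;
        return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Return the name of the category closest to the document,
// or an empty string if no category shares any term with it.
std::string Classify(const TermVector& document,
                     const std::map<std::string, TermVector>& categories)
{
        std::string best;
        double bestScore = 0.0;
        for (const auto& category : categories)
        {
                const double score = CosineSimilarity(document, category.second);
                if (score > bestScore)
                {
                        bestScore = score;
                        best = category.first;
                }
        }
        return best;
}
```

For instance, a document vector weighted toward "cell" and "gene" would be assigned to a "biology" category built from those terms rather than to a "finance" category built from "stock" and "bond".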

2) The Summarizer tool allows users to extract vital concepts from analyzed documents and create different types of summaries/annotations such as the following ones:

  • A theme-oriented summary delivers the information that is relevant to a certain topic (for instance, politics, culture, technology, or science);
  • A structure-oriented summary organizes the content according to the structure of the input document (it may be a news article, a patent, etc.);
  • A concept-oriented summary contains the sentences related to user-defined concepts.
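To make the concept-oriented type more concrete, here is a rough sketch that keeps only the sentences mentioning a user-defined concept. It assumes naive sentence splitting and case-insensitive substring matching, which is far simpler than the semantic analysis Summarizer performs; ToLower, SplitSentences and ConceptSummary are hypothetical names, not SDK functions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Return a lowercase copy of the text.
std::string ToLower(std::string text)
{
        std::transform(text.begin(), text.end(), text.begin(),
                       [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
        return text;
}

// Naive splitter: a sentence ends at '.', '!' or '?'. Leading whitespace is trimmed.
std::vector<std::string> SplitSentences(const std::string& text)
{
        std::vector<std::string> sentences;
        std::string current;
        auto flush = [&sentences, &current]()
        {
                const std::size_t start = current.find_first_not_of(" \t\r\n");
                if (start != std::string::npos)
                        sentences.push_back(current.substr(start));
                current.clear();
        };
        for (char ch : text)
        {
                current += ch;
                if (ch == '.' || ch == '!' || ch == '?')
                        flush();
        }
        flush(); // keep a trailing fragment without end punctuation
        return sentences;
}

// Keep the sentences that mention at least one of the given terms, in document order.
std::vector<std::string> ConceptSummary(const std::string& text,
                                        const std::vector<std::string>& terms)
{
        std::vector<std::string> summary;
        for (const std::string& sentence : SplitSentences(text))
        {
                const std::string lowered = ToLower(sentence);
                for (const std::string& term : terms)
                {
                        if (!term.empty() && lowered.find(ToLower(term)) != std::string::npos)
                        {
                                summary.push_back(sentence);
                                break;
                        }
                }
        }
        return summary;
}
```

Substring matching is used here only to keep the sketch short; it would also match a term inside a longer word, which a real linguistic tool would avoid.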

These summaries can be of any size: the user sets the required number of characters or sentences. This is especially useful for those who access e-libraries from mobile phones, where screens are small and pages are usually zoomed automatically.
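Size limits of this kind can be sketched as a simple selection over already weighted sentences. The example assumes each sentence arrives with a weight from some earlier ranking step; RankedSentence and TrimSummary are hypothetical names, and the greedy strategy is an illustration rather than the method Summarizer uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Assumed input: a sentence with its position in the document and a relevance weight.
struct RankedSentence
{
        std::size_t position;
        double weight;
        std::string text;
};

// Greedily keep the highest-weighted sentences that fit both limits,
// then restore the original document order.
std::vector<RankedSentence> TrimSummary(std::vector<RankedSentence> sentences,
                                        std::size_t maxSentences,
                                        std::size_t maxChars)
{
        std::stable_sort(sentences.begin(), sentences.end(),
                         [](const RankedSentence& a, const RankedSentence& b)
                         { return a.weight > b.weight; });

        std::vector<RankedSentence> kept;
        std::size_t usedChars = 0;
        for (const RankedSentence& sentence : sentences)
        {
                if (kept.size() >= maxSentences)
                        break;
                if (usedChars + sentence.text.size() > maxChars)
                        continue; // too long for the remaining budget, try the next one
                usedChars += sentence.text.size();
                kept.push_back(sentence);
        }

        std::sort(kept.begin(), kept.end(),
                  [](const RankedSentence& a, const RankedSentence& b)
                  { return a.position < b.position; });
        return kept;
}
```

A mobile client could pass a small character budget while a desktop client passes a larger one, and both would receive the most relevant sentences that fit.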

Intellexer Summarizer can be integrated using C/C++ and C#. Here’s a C++ example:

#include <cstring>   // strlen, strcmp
#include <iostream>
#include <string>
#include <SumCore.h>
#include <LPXml.h>

using std::cout;
using std::cerr;
using std::endl;
using std::string;

using namespace NsSemSDK;

/// Print tree branch.
void PrintTree(ISumTreeNode* pNode, int nLevel)
{
        int i;
        const char* pszText = pNode->GetText();

        if (strlen(pszText) > 0)
        {
                for (i = 0; i < nLevel - 1; ++i)
                         cout << '\t';
                cout << pszText << endl;
        }
        for (i = 0; i < pNode->GetChildCount(); ++i)
                PrintTree(pNode->GetChild(i), nLevel + 1);
}

/// Print several most significant concepts, sentences and tree branches.
void PrintSummary(ISummary* pSummary)
{
        int nSize;
        int i;

        // Set current summary type to concept list
        pSummary->SetCurrentItem(ESumItemTypeRelation);
       
        // Set desired summary size to 5 concepts
        nSize = pSummary->RestrictSummary(ESumRestrictionItem, 5);
       
        // Get and print summary concepts with their weights
        cout << "List of most informative concepts:" << endl;
        for (i = 0; i < nSize; ++i)
        {
                ISumItem* pItem = pSummary->GetItem(i);
                cout << pItem->GetWeight() << '\t' << pItem->GetText() << endl;
        }
        cout << endl;
       
        // Set current summary type to sentence list
        pSummary->SetCurrentItem(ESumItemTypeSentence);
       
        // Print total number of sentences in document
        cout << "Total number of sentences: " << pSummary->GetTotalItemCount() << endl;
       
        // Set desired summary size to 3 percent
        pSummary->RestrictSummary(ESumRestrictionPercent, 3);
        nSize = pSummary->GetSummarySize();
        cout << "List of most informative sentences:" << endl;
        for (i = 0; i < nSize; ++i)
        {
                ISumItem* pItem = pSummary->GetItem(i);
                cout << pItem->GetRank() << '\t' << pItem->GetText() << endl;
        }
        cout << endl;
       
        // Get tree of document concepts
        ISumTree* pTree = pSummary->GetTree();
       
        // Concept text will contain text from parent tree nodes
        pTree->SetFullText(true);
       
        // Set maximum count of nodes on each tree level   
        pTree->SetChildViewBound(3);
       
        // print summary concept tree
        cout << "Document concept tree (top part):" << endl;
        PrintTree(pTree->GetTreeRoot(), 0);
        cout << endl;
}

/// Find given concept in tree branch.
ISumTreeNode* FindConcept(ISumTreeNode* pNode, const char* pszConcept)
{
        if (strcmp(pNode->GetText(), pszConcept) == 0)
                return pNode;
        for (int i = 0; i < pNode->GetChildCount(); ++i)
        {
                ISumTreeNode* pChild = FindConcept(pNode->GetChild(i), pszConcept);
                if (pChild != NULL)
                         return pChild;
        }
        return NULL;
}

int main(int argc, char* argv[])
{
        string sFileName("../Data/ForSummarizer.htm"); // path to source document
        if (argc > 1)
        {
                sFileName = argv[1]; // path to source document
        }
        try
        {
                string sDBPath("../../LDB");                //path to ldb
                string sLPluginsPath("../../LPlugins");    //path to plugins
                string sConceptPos("company");
                string sConceptQuery("ingredient");
                if (argc == 4)
                {
                         sDBPath = argv[2];
                         sLPluginsPath = argv[3];
                }
               
                // provide path to license file
                SetSumLicensePath("../../ISDK_License.xml");
                SetLPXMLLicensePath("../../ISDK_License.xml");
                cerr << "Initializing\t...";
               
                // create summarizer database interface
                CInterfacePtr<ISumDB> pSummarizerDB(CreateSummarizerDB());
               
                // create summarizer interface
                CInterfacePtr<ISummarizer> pSummarizer(CreateSummarizer());
                ISummary* pSummary = NULL;
               
                //initialize summarizer database interface
                pSummarizerDB->Setup(sDBPath.c_str(), sLPluginsPath.c_str());
               
                //initialize summarizer interface          
                pSummarizer->Setup(pSummarizerDB.Get());
                cerr << "Done" << endl;
               
                // summarize source document
                pSummarizer->Summarize(sFileName.c_str());
               
                //get summary
                pSummary = pSummarizer->GetSummary();
                cout << "Current summary\n" << endl;
                PrintSummary(pSummary);
               
                // Set big limit on tree to be able to find concept in it
                pSummary->GetTree()->SetChildViewBound(1000);
                cout << "Positions and precise wording for concept \'" << sConceptPos << "\'\n" << endl;
                ISumTreeNode* pConcept = FindConcept(pSummary->GetTree()->GetTreeRoot(), sConceptPos.c_str());
                if (pConcept != NULL)
                {
                         // Get and print array of document phrases containing given concept
                         int i;
                         CInterfacePtr<ISumPhraseContainer> pPhrases(pConcept->GetPhrases(pSummary));
                         cout << "Concept phrases: ";
                         for (i = 0; i < pPhrases->GetCount(); ++i)
                                  cout << pPhrases->GetItem(i) << ", ";
                         cout << endl;
                        
                         // Get and print array of items describing positions of given concept in current summary
                         const ISumLocation* pLocation;
                         CInterfacePtr<ISumLocationContainerRef> pLocations(pConcept->GetLocations(pSummary));
                         cout << "Concept locations:" << endl;
                         pLocations->Reset();
                         while (pLocations->Next(pLocation))
                                  cout << "\tsentence " << pLocation->GetIndex() << " offset ("
                                           << pLocation->GetStartOffset() << ", " << pLocation->GetEndOffset() << ")" << endl;
                         cout << endl;
                }
                cout << "Reorder summary with query: " << sConceptQuery << "\n\n";
                pConcept = FindConcept(pSummary->GetTree()->GetTreeRoot(), sConceptQuery.c_str());
                if (pConcept != NULL)
                {
                         // mark found concept and all its subconcepts as selected
                         pConcept->SetStatus(ESumItStSelected, true);
                         pSummary->Reorder(ESumOrMoQuery);
                         PrintSummary(pSummary);
                }
        }
        catch (const CSemBaseException& x)
        {
                // Handle exceptions.
                cerr << x.what() << endl;
                return 1; // report failure to the caller
        }
        return 0;
}

3) The Question Answering System aims to make database search more efficient. It handles users’ queries and returns relevant answers in natural language, rather than complete documents or best-matching paragraphs.

Yana Yelina graduated from Minsk State Linguistic University with a bachelor’s degree in Translation/Interpretation (English, Spanish, and Italian) and Public Relations. Since then, she has worked as a copywriter/journalist for a number of Belarusian companies. At EffectiveSoft, a custom software development company, Yana holds the position of Tech Journalist and writes about modern technologies, covering software development practices in a broad array of business domains: trading and finance, e-commerce, education, healthcare, logistics, etc. Yana can be reached online at contact@effectivesoft.com.