SAP HANA Advanced Data Processing
HANA Advanced Data Processing brings together all the information, tools, and methods a professional needs to use text mining applications and statistical analysis efficiently. For example, you can specify your own entity types and names for use with text analysis, which may be critical for particular industries or data domains.
A custom dictionary can apply to all languages or to a single language. Custom dictionaries reside in the HANA repository and benefit from its life cycle management.
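To make the idea of a custom dictionary concrete, the following is a minimal Python sketch of dictionary-based entity matching. It is not HANA's dictionary format or its extraction engine: the entity types, names, and variants in CUSTOM_DICTIONARY and the helper functions build_lookup and extract_entities are hypothetical and exist only for illustration.

```python
import re

# Hypothetical custom dictionary: entity type -> standard form -> variants.
# These entries are illustrative assumptions, not shipped HANA content.
CUSTOM_DICTIONARY = {
    "PRODUCT": {
        "SAP HANA": ["HANA", "SAP HANA"],
    },
    "DEPARTMENT": {
        "Human Resources": ["HR", "Human Resources"],
    },
}


def build_lookup(dictionary):
    """Flatten the dictionary into {lowercased variant: (entity type, standard form)}."""
    lookup = {}
    for entity_type, names in dictionary.items():
        for standard_form, variants in names.items():
            for variant in variants:
                lookup[variant.lower()] = (entity_type, standard_form)
    return lookup


def extract_entities(text, dictionary):
    """Return (matched text, entity type, standard form) tuples in order of appearance."""
    lookup = build_lookup(dictionary)
    # Try longer variants first so that "SAP HANA" wins over "HANA".
    variants = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0),) + lookup[m.group(0).lower()] for m in pattern.finditer(text)]
```

Calling extract_entities("HR rolled out SAP HANA last year.", CUSTOM_DICTIONARY) maps each matched variant back to its entity type and standard form, which is the role a custom dictionary plays in text analysis.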
Features and Capabilities
Public Sector Extraction: Augments the predefined entity types for core extraction with a number of entity, event, and relation types targeting public sector needs. The following major fact types are classified:
- Action: information about action and travel events
- Military Units: information about teams, wings, and squadrons
- Organizational Information: information about organizations
- Person - Alias: information about a person's possible aliases
- Person - Appearance: information about a person's appearance
- Person - Attributes: information about a person's non-appearance attributes
- Person - Relationships: information about a person's relationships
- Spatial References: distances, cardinal directions, or locations ...
Document filters in the NLP engine automatically detect and extract text content and metadata from almost any binary file format, such as PPT, XLS, and PDF.
Improved Text Analysis Features
- Custom dictionaries
- Custom configurations
- Indexing throughput
Text mining works at the document level, making semantic determinations about the overall content of documents relative to other documents. Text analysis, by contrast, performs linguistic analysis and extracts information embedded within each document.
Functions based on Vector Space Model
- Identify similar documents
- Identify key terms of a document
- Identify related terms
- Categorize new documents based on a training corpus
- Highlight the key terms when viewing a patent document
- Identify similar incidents for faster problem solving
- Categorize new scientific papers along a hierarchy of topics
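The vector space model behind these functions can be sketched in a few lines of Python. This is a simplified illustration using TF-IDF weights and cosine similarity, not the weighting or API that HANA text mining actually uses; the function names tf_idf_vectors, cosine_similarity, key_terms, and most_similar are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def key_terms(vector, top_n=3):
    """Highest-weighted terms of one document; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


def most_similar(index, vectors):
    """Index of the document most similar to vectors[index]."""
    scores = [
        (cosine_similarity(vectors[index], vector), i)
        for i, vector in enumerate(vectors)
        if i != index
    ]
    return max(scores)[1]


# Tiny illustrative corpus of incident descriptions (made-up data).
incidents = [
    "engine overheats after long drive".split(),
    "engine overheats when idle".split(),
    "invoice total is wrong".split(),
]
vectors = tf_idf_vectors(incidents)
```

Here most_similar(0, vectors) picks the second incident, because it is the only other document sharing terms with the first. Terms that appear in every document get a weight of zero, which is why key_terms favors words that distinguish a document from the rest of the corpus.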
Greater Indexing Throughput
Language Coverage: Social Media extraction, Numerical extraction, Core extraction, Voice of Customer, etc.
Language Identification: Text analysis automatically detects the language of the input text in order to apply the appropriate linguistic rules.
Improved scalability of the following pre-processing steps:
- File filtering: converting binary document formats to text/HTML
- Tokenization: decomposing a word sequence into tokens, e.g. "the quick brown fox" -> "the", "quick", "brown", "fox"
- Stemming: reducing tokens to their linguistic base form, e.g. houses -> house; ran -> run
- Linguistic analysis: part-of-speech identification, e.g. quick: Adjective; houses: Plural Noun
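The tokenization, stemming, and linguistic analysis steps above can be sketched as a small Python pipeline. This is a toy illustration only: the lookup tables BASE_FORMS and PARTS_OF_SPEECH are assumptions standing in for real linguistic resources, and the functions tokenize, stem, and analyze are hypothetical names, not part of HANA.

```python
import re

# Toy lookup tables standing in for real linguistic resources (assumptions).
BASE_FORMS = {"houses": "house", "ran": "run"}
PARTS_OF_SPEECH = {"quick": "Adjective", "houses": "Plural Noun"}


def tokenize(text):
    """Decompose a word sequence into individual lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Reduce a token to its base form; unknown tokens are returned unchanged."""
    return BASE_FORMS.get(token, token)


def analyze(text):
    """Run tokenization, stemming, and part-of-speech lookup on a text."""
    return [
        {"token": token, "stem": stem(token), "pos": PARTS_OF_SPEECH.get(token, "Unknown")}
        for token in tokenize(text)
    ]
```

For example, analyze("quick houses") yields one record per token with its base form and part of speech. A production engine would use language-specific dictionaries and rules instead of the two hard-coded tables.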
Enterprise Extraction: Provides rules for the extraction of entities and facts of particular interest to the enterprise domain. The following major fact types are classified:
- Membership Information: information about a person's affiliations
- Management Changes: information about management changes
- Product Releases: information about product releases
- Mergers & Acquisitions: information about mergers and acquisitions
- Organizational Information: founder, location or contact information
The text mining index is an optional data structure that is built from the results of linguistic analysis. It is bound to the full-text indexing and text analysis process.