April 2018 Published in Forbes
As an Artificial Intelligence think tank we develop a variety of products that can be of value to our clients. We aim to build independent modules but when these are interfaced together appropriately they can produce a complete AI solution. This blog explains how a few of our offerings can be customised to best serve the interests and needs of our clients and their consumers.
Please note that our products are constantly being improved, and therefore, many of the modules described here are the subject of research and specifications may well change during the development, or customisation, process.
Content curation refers to the process of importing existing knowledge, resources and facts into a database, and then transforming this information into a suitable digital format. To be used effectively, this data needs to be catalogued and stored in a searchable archive so that it can be easily accessed whenever required.
The content curation modules are categorised as follows:
1. Optical Character Recognition. A module, usually interfaced with hardware accessories such as a scanner or a camera, that encodes lines of text and converts them to a digital format that computers can use, such as a simple text document. An everyday example would be scanning printed text and converting it into a searchable PDF document.
2. Language detection. A module that scans digital text and detects its language. Our products are intended to work in many languages, so translation modules will be added as and when required.
3. Subject Classification. A module that classifies given content into a subject category (e.g. history, geography, cookery, banking etc). A known method of doing this is by sorting keywords relating to known subjects and using statistical measures of significance to attribute categories. (e.g. the word wavelength = physics).
4. Entity Identification. A module that assigns ‘entities’ to words found within a text document. These entities are then used as references during content curation to find additional information about each entity, after which ‘entity resolution’ ensures that data is assigned to the correct entity. A very simple example might be the sentence “John has visited Paris”. A possible output might be “Person: John, Name: John, Location: Paris, Action: visited”.
5. Entity resolution. A module that uses various algorithms such as semantic graphs (see further below) to ‘match’, ‘un-match’ and correspond data, identifying relationships between entities. It would draw upon a set of digital archives and a set of named entities, and then produce an annotation/correspondence to provide real world ‘accuracy’ and ‘context’. For example, a named entity might be “Michael Jordan” taken from the digital archives of sports news articles. The module would group together the documents related to Michael Jordan the basketball player, but would NOT include within that group any documents related to someone named Michael living in the country of Jordan.
6. Curation. One or more modules that are used to populate a database. By using entity identification and entity resolution, further content can be assigned with a subject and filed accordingly.
7. Subject Level Determination Engine. A module that is used to assess curated content for its perceived level of cognitive difficulty. This enables the platform to provide appropriate output to each user.
8. Breadth/Depth Engine. A module that is used to separate curated content into digestible material. This engine has to be able to analyse the content of a piece of knowledge and understand whether it should form separate material in a different subject while being part of the same topic (defined as ‘breadth, horizontal enrichment’), OR further knowledge within the same topic (defined as ‘depth, vertical enrichment’).
9. Summarisation. A module that takes as input a piece of digital text and summarises it for brevity whilst retaining key content. It may also interface with other modules such as the content representation module (see below) or the paraphrasing module (see content delivery).
10. Content representation (aka knowledge objects). A module that obtains curated content and converts it to a format that can be used in a lesson. This module may interface with other modules such as the entity resolution module and breadth/depth engine and the summarisation engine, in order to be able to personalise lessons.
11. Curation resources. A database that lists resources where software can find more content for a given search term. The database is also used for subject categorisation verification.
12. Confidence Score Source. A module that creates or updates a score associated with a given resource, (directly affecting the score given to the data from that resource, discussed immediately below) representing confidence in its accuracy and provenance. In short, a ‘trust-worthiness’ scoring system of resources.
13. Confidence Score Data. A module that provides the score for the content imported into a database from the resources discussed immediately above. In short, a ‘trust-worthiness’ scoring system of data from previously ‘scored’ resources.
14. Knowledge Base. The given software package’s database of knowledge. It defines how a given piece of data is stored and how it is retrieved/accessed.
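As a rough illustration of the keyword approach mentioned under Subject Classification (item 3), the sketch below counts how often each subject's keywords appear in a text and picks the subject with the highest count. The keyword lists and the function name `classify_subject` are hypothetical, and raw counts stand in for the statistical measures of significance a production module would use.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would curate or learn these.
SUBJECT_KEYWORDS = {
    "physics": {"wavelength", "energy", "particle", "frequency"},
    "cookery": {"recipe", "oven", "simmer", "flour"},
    "banking": {"loan", "interest", "deposit", "account"},
}


def classify_subject(text):
    """Return the subject whose keywords occur most often, or None if none occur."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        subject: sum(words[keyword] for keyword in keywords)
        for subject, keywords in SUBJECT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_subject("The wavelength and frequency of light determine its energy."))
```

Text that contains none of the listed keywords returns `None`, which a curation pipeline could treat as "unclassified" and route for manual review.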
Content delivery refers to the collection of methods that are required to present knowledge and content to a software user. In a hyper-personalised context, content delivery methods differ for every individual’s requirements. As such, the following modules described below might be used in the context of personalisation and could also be referred to as ‘personalisation modules’.
15. Text-To-Speech. A module that ‘reads’ digital text and converts it to voice audio for users to hear.
16. Speech-To-Text. A module that converts a stream of audio input into digital text that can be processed by a computer. For example, digital text can be used with other AI modules that fall under the Natural Language Processing category.
17. ‘Look-alike’ engine. A module that compares software users’ profiles to find suitable ‘potential matches’ (e.g. for dating or recruitment purposes) or to suggest material to a user based on what other (similar) users have accessed.
18. Content recommendation. A module used to recommend additional content to a given user dependent upon their response to it. Content recommendation may also utilise the look alike engine discussed immediately above.
19. Paraphrasing. A module that takes as input pieces of digital text of whatever length and produces output in which the meaning remains the same but the wording differs. This might interface with other modules such as the entity identification module described previously. For example, if the input is: “The weather is ghastly today”, the resulting output might be: “The weather is not great today”.
20. Onboarding. A module that is used to create an account with a platform or application. The onboarding session could be interactive, to include use of speech-to-text, text-to-speech, natural language processing tools, biometrics etc.
21. Biometric user recognition. A module used to identify a software user by obtaining and analysing their biometric data such as facial recognition, voice recognition, fingerprint etc. The aim is to provide a seamless, ultra-secure login experience and to create a human-like relationship with the app or platform in question every time it is used.
22. Users database. A database that will hold users’ details. Each entry contains personal information and platform relevant information such as preferences, recommended content, user engagement metrics etc.
23. Profile engine updater. A module used to obtain and update a user profile after each interaction. For example, it may record metrics that quantify interest in presented content, thereby capturing user engagement and user preference.
24. Personalisation Engine. A module used to personalise content according to a user’s preferences. Having a user’s profile helps understand preferences and present relevant content. Our personalisation engine can adapt dynamically and also build a sense of time. For example, a user may prefer viewing different content on a Friday night when compared to a Monday morning. To adapt content, this module may interface with other modules such as the users database, the profile engine updater and the user engagement modules presented below.
25. Sentiment engine. A module that understands user sentiment, thereby monitoring engagement with software.
26. Emotion engine. A module that is able to gauge a user’s emotional state, similarly monitoring engagement.
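One common way to compare user profiles, as the ‘look-alike’ engine (item 17) does, is cosine similarity between numeric profile vectors. The sketch below is a minimal illustration under that assumption; the profile values, the interest categories and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_user(target, profiles):
    """Return the name of the user whose profile is closest to the target's."""
    others = {name: vec for name, vec in profiles.items() if name != target}
    return max(others, key=lambda name: cosine_similarity(profiles[target], others[name]))


# Hypothetical interest scores per user: [sport, music, cooking, travel]
profiles = {
    "alice": [5, 1, 0, 4],
    "bob": [4, 0, 1, 5],
    "carol": [0, 5, 4, 1],
}
print(most_similar_user("alice", profiles))
```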
The AI is distributed via algorithms all over the system so it doesn’t exactly ‘exist’ in one location or folder as such. The tasks/processes that are relevant to any product can be grouped into one or more of the following AI categories:
Natural Language Understanding (NLU) – Understanding context from text
Natural Language Generation (NLG) – Stringing sentences together
Recommender systems – Methods that use structured data to make recommendations to users about things/content they would like
Classification – Identifying objects when presented
Clustering – Grouping similar objects and/or data together
Reinforcement methods – Methods that enable an AI process to optimise its own output over time
These modules can be seen as ‘building blocks’ that can be interfaced in many ways. For example, to be able to strike up a conversation with a platform, both NLU and NLG tools are required.
The examples and explanations that follow are not exhaustive. Any project by nature is dynamic and development is always ongoing so various avenues can be explored.
Natural Language Processing (NLP) is a collection of various algorithms that enable computers to ‘understand’ language. They may also convert text to other forms of processable data.
NLP is further subdivided into two main categories, Natural Language Understanding (NLU) which deals with deriving human understandable context/meaning from digital text, and Natural Language Generation (NLG) which deals with text composition into a human understandable form.
Non-limiting examples might be:
As explained above, platforms may need to be able to obtain knowledge and then communicate with users. To do so they must be able to ‘understand’ language; therefore a digital representation of that language is first required. From the previously mentioned modules, this is done via optical character recognition, and speech-to-text (STT).
Once the knowledge is in the appropriate digital format, then understanding can be gained from such tools as language detection, entity detection/relation extraction and entity resolution. The result is the storage of ‘knowledge objects’. One such example might be a semantic graph.
Supervised learning is a set of algorithms that learn predictive and/or classification functions from data. For purposes of model creation and optimisation (the computer learning to improve), the software requires training data that has been ‘labelled’ with the definitively ‘true’ answer.
An example of supervised learning could be that of classifying various subjects and user interactions into their respective categories. For example, when given a piece of text, a platform should be able to index new knowledge into a subject classification. Simple yet appropriate techniques for this include pre-processing steps such as tokenisation and lemmatisation, a bag-of-words representation, and classifiers such as logistic regression and naïve Bayes. More complex options are neural networks and autoencoders that learn word embeddings. In either case, dictionaries that contain words used in each subject/topic are required. The above methods take as input the words that appear in a document and attempt to assign the document to relevant subject topics.
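To make the subject classification example concrete, here is a minimal sketch of a bag-of-words naïve Bayes classifier written with the standard library only. The class name, the tiny training set and the use of add-one smoothing are illustrative assumptions; lemmatisation is omitted for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenise(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenise(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled training data.
training_texts = [
    "The wavelength of light affects its energy",
    "Particles gain energy as frequency rises",
    "Simmer the sauce and preheat the oven",
    "Add flour and bake in the oven",
]
training_labels = ["physics", "physics", "cookery", "cookery"]

model = NaiveBayesClassifier().fit(training_texts, training_labels)
print(model.predict("What wavelength has this energy?"))
```

The labelled pairs play the role of the ‘true’ training data described above; with more data, the same interface could be backed by logistic regression or a neural model.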
Another example of supervised computer learning is the analysis of sentiment/emotion of a user to assess if they are responsive under certain emotional conditions. The result of that analysis can enable the appropriate adjustment of content delivery.
Any software that provides a personalised experience needs a recommender system to recommend content that might be beneficial to the user. There are various algorithms that allow this functionality and one example is collaborative filtering. The algorithm’s objective is to predict (filter) the choices of an individual user/entity for particular items based on the preferences expressed by other, similar users.
Reinforcement learning algorithms can be used to create personalised content and presentation styles for individual users. Sentiment and emotion can be used to gauge whether certain presented content is positively received by the user, and the content can then be adjusted accordingly.
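A minimal sketch of user-based collaborative filtering follows. It predicts a user's rating for an unseen item as a similarity-weighted average of other users' ratings. The rating data, the item names and the choice of cosine similarity over shared items are assumptions made for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"article_a": 5, "article_b": 4, "article_c": 1},
    "bob": {"article_a": 5, "article_b": 5, "article_d": 4},
    "carol": {"article_a": 1, "article_c": 5, "article_d": 2},
}


def similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(user, other)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


print(predict_rating("alice", "article_d"))
```

Because alice's past ratings resemble bob's more than carol's, the prediction for `article_d` lands closer to bob's rating than to carol's.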
Unsupervised learning by a computer is a set of algorithms that can be used to group, categorise or find patterns in a dataset, without the data being associated/preassigned with a set of labels. That is, the algorithms do not require a data set that is associated with a ground truth.
One example, say on a dating site, would be identifying groups of people who have similar interests and creating pairs or groups of ‘potential matches’, using clustering algorithms such as ‘k-means’ and ‘hierarchical clustering’.
Clustering Algorithms. A set of algorithms that receive a set of data points and assign each data point to a group. Examples of clustering algorithms include ‘k-means’ and ‘hierarchical clustering’.
Hierarchical Selection (hierarchical clustering). The algorithm receives a data set and produces a hierarchy of possible groupings. The algorithm can use a ‘bottom-up’ or a ‘top-down’ approach when separating the data into groups. In the bottom-up approach the algorithm starts by considering each individual point in the dataset as a separate cluster. In each iteration the algorithm merges the two closest clusters into one cluster, and it stops once all data belongs to the same cluster. The resulting hierarchy is then presented to the user as optimally separated groups. In the top-down approach all the data starts as a single cluster and the algorithm separates the data into smaller clusters until each point is its own cluster.
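The bottom-up approach can be sketched as follows for one-dimensional data. The function name is hypothetical, and measuring the distance between two clusters by their closest members (single linkage) is an assumption; other linkage rules are equally valid.

```python
def agglomerative_clustering(points):
    """Bottom-up clustering of 1-D points using single linkage.

    Returns the grouping produced at each step, from every point in its own
    cluster to a single cluster containing everything.
    """
    clusters = [[p] for p in sorted(points)]
    hierarchy = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        clusters.sort()
        hierarchy.append([list(c) for c in clusters])
    return hierarchy


for level in agglomerative_clustering([1, 2, 9, 10, 25]):
    print(level)
```

Each printed level is one row of the hierarchy; choosing which level to present to the user is a separate decision that this sketch leaves open.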
k-means. The algorithm receives a set of data points and a required number of groups and assigns each point to one of the groups.
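A minimal sketch of k-means (the standard alternating assign-and-update procedure) for two-dimensional points is given below. The function name, the fixed number of iterations and the sample points are assumptions for illustration.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            assignments[idx] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```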
Logistic Regression. The software receives a set of numeric values and produces a probability between zero and one, which is usually thresholded into a binary output indicating whether the input belongs to a predefined group or not, using one for yes and zero for no. An example might be weather prediction, where humidity, cloud coverage, wind force, wind direction and other factors are used as inputs to predict rain. When the factors are aggregated and analysed, an output of one would predict rain and an output of zero would mean dry weather.
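The weather example can be sketched as a weighted sum of the inputs passed through a sigmoid function. The weights and bias below are hypothetical, hand-picked values; in practice they would be learned from labelled historical data, and the feature names and scaling (values between 0 and 1) are likewise assumptions.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical, hand-picked parameters; normally learned from labelled data.
WEIGHTS = {"humidity": 4.0, "cloud_cover": 3.0, "wind_force": 0.5}
BIAS = -5.0


def rain_probability(features):
    """Weighted sum of the inputs passed through the sigmoid."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return sigmoid(z)


def predict_rain(features, threshold=0.5):
    """Return 1 (rain) if the probability reaches the threshold, else 0 (dry)."""
    return 1 if rain_probability(features) >= threshold else 0


humid_day = {"humidity": 0.9, "cloud_cover": 0.8, "wind_force": 0.4}
clear_day = {"humidity": 0.3, "cloud_cover": 0.1, "wind_force": 0.2}
print(predict_rain(humid_day), predict_rain(clear_day))
```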
(Deep) Neural Networks. The model receives a set of numeric values of length X, and produces as output a set of (binary) numeric values of length Y, where typically Y < X. The output corresponds to a classification. For example, an ‘animal image classifier’ receives a set of numeric values that describe the pixel intensities of an image portraying an animal and classifies the image as illustrating a cat, a dog or a parrot. The model is referred to as deep because the input data passes through several intermediate stages of processing, referred to as layers, before the final result is produced.
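The flow of data through layers can be illustrated with a tiny forward pass: four input values, one hidden layer and two output classes. All weights here are hypothetical fixed numbers chosen for the example (a real network learns them during training), and the two-class setup is a simplification of the cat/dog/parrot example.

```python
import math


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to one."""
    exps = [math.exp(v - max(values)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


# Hypothetical weights: 4 inputs -> 3 hidden units -> 2 output classes.
HIDDEN_W = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, 0.2, -0.1], [0.2, 0.1, -0.5, 0.3]]
HIDDEN_B = [0.0, 0.1, -0.1]
OUTPUT_W = [[1.0, -1.0, 0.5], [-0.5, 1.2, 0.3]]
OUTPUT_B = [0.0, 0.0]
CLASSES = ["cat", "dog"]


def classify(pixels):
    """Run the input through both layers and return (label, probabilities)."""
    hidden = relu(dense(pixels, HIDDEN_W, HIDDEN_B))
    probabilities = softmax(dense(hidden, OUTPUT_W, OUTPUT_B))
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best], probabilities


label, probabilities = classify([0.9, 0.1, 0.4, 0.7])
print(label, probabilities)
```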
Probability Based Classification. Statistical models describe numeric data in terms of a probability distribution. The distribution, defined by its associated measures (for example its mean and variance), can be used to assign a probability, or belief, as to whether an unseen entity is likely to come from the same distribution (i.e. is associated with the already known entities). There are various statistical models; examples include outlier detection, naïve Bayes classifiers, Bayesian networks, hidden Markov models and others. A simple example of outlier detection might use data about the height of adult males, from which an average height and height variability can be estimated. These estimates can be used to design a threshold scheme and/or probability belief that decides, when a new height value is encountered, whether it is plausible for an adult male.
Reinforcement Learning. Reinforcement learning is a term describing a set of algorithms that are used to create an AI process that adjusts its actions to self-improve over time. This is achieved through a series of ‘rewards’ given to the agent that predicts/classifies content. For example, the prediction is compared to a set of data that can be real or constructed by a separate virtual ‘trying’ agent. The ‘trying’ agent keeps offering varying combinations of data, and learns the optimal way of being correct by receiving, or not receiving, rewards as a result.
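The height example can be sketched as a simple threshold scheme: estimate the mean and standard deviation from known heights, then flag any new value that lies too many standard deviations from the mean. The sample heights and the cut-off of three standard deviations are assumptions chosen for illustration.

```python
import statistics

# Hypothetical sample of adult male heights in centimetres.
heights_cm = [172, 178, 169, 181, 175, 177, 171, 180, 174, 176]

mean_height = statistics.mean(heights_cm)
std_height = statistics.stdev(heights_cm)


def is_plausible_height(value, max_z=3.0):
    """True if the value lies within max_z standard deviations of the mean."""
    z_score = abs(value - mean_height) / std_height
    return z_score <= max_z


print(is_plausible_height(178), is_plausible_height(120))
```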
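One of the simplest reward-driven procedures is an epsilon-greedy ‘multi-armed bandit’, sketched below as an illustration rather than as the method used in any particular product. The agent repeatedly chooses between options (for instance, hypothetical presentation styles), receives a reward of one or zero, and gradually favours the option with the best average reward. The reward rates are invented for the simulation.

```python
import random


def run_bandit(true_reward_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent; returns (estimated reward per action, pull counts)."""
    rng = random.Random(seed)
    n_actions = len(true_reward_rates)
    counts = [0] * n_actions
    estimates = [0.0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random action
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # Simulated environment: reward of 1 with the action's true probability.
        reward = 1.0 if rng.random() < true_reward_rates[action] else 0.0
        counts[action] += 1
        # Incremental update of the running average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical chance that a user responds positively to each of three styles.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(counts)
```

The agent is never told the true rates; it discovers the best option purely from the rewards it receives, which is the self-improvement described above.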
Semantic Graph. A graph representation is composed of two types of entities, a node and an edge. A relation between nodes is indicated by connecting the two nodes by an edge. The edge may have additional properties that describe the type of relation between the nodes. The internal representation of the graph in a computer may differ from the visual representation that is human understandable. One example might be: Input: “Joe has a daughter named Eve”. Output: “JOE → EVE”. In this example JOE and EVE are nodes, and the arrow is the edge that connects the two, indicating a relation between the two nodes. The arrow’s additional properties could be a textual description, such as the word “daughter” or “parent”.
Support Vector Machines (SVMs). SVMs receive a set of numeric values and produce a binary value (zero or one) as to whether the input data belongs to one of two predefined classes (although extensions can be made to accommodate extra classes). In their basic, linear form, SVMs can only produce accurate predictions when the two classes are separable by a linear boundary or surface; kernel functions extend them to non-linear boundaries. An example of an SVM can be one that receives the geographical coordinates of a house and decides whether the house belongs north or south of a given boundary.
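One possible internal representation of such a graph is a mapping from each node to its labelled outgoing edges. The class name, method names and relation labels below are hypothetical, chosen only to mirror the Joe and Eve example.

```python
from collections import defaultdict


class SemanticGraph:
    """Directed graph whose edges carry a relation label."""

    def __init__(self):
        # source node -> list of (relation, target node) pairs
        self.edges = defaultdict(list)

    def add_relation(self, source, relation, target):
        self.edges[source].append((relation, target))

    def related(self, source, relation):
        """Return all nodes linked from the source by the given relation."""
        return [t for r, t in self.edges.get(source, []) if r == relation]


graph = SemanticGraph()
graph.add_relation("JOE", "parent_of", "EVE")
graph.add_relation("EVE", "daughter_of", "JOE")
print(graph.related("JOE", "parent_of"))
```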
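The north/south example can be sketched with a linear SVM trained by sub-gradient descent on the regularised hinge loss. This is a simplified training procedure chosen for brevity, not a production solver. The house positions are hypothetical and are expressed as (kilometres north, kilometres east) of a reference point on the boundary; labels are +1 for north and -1 for south.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, regularisation=0.01):
    """Sub-gradient descent on the regularised hinge loss; labels are +1 or -1."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The regularisation term shrinks the weights on every step.
            weights = [w - learning_rate * regularisation * w for w in weights]
            if margin < 1:
                # Point is misclassified or inside the margin: push the boundary away.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict_side(weights, bias, point):
    """Return 'north' or 'south' depending on the sign of the decision function."""
    score = sum(w * xi for w, xi in zip(weights, point)) + bias
    return "north" if score >= 0 else "south"


# Hypothetical houses: (km north of the boundary, km east of a reference point).
houses = [(2.0, 1.0), (3.0, -2.0), (1.5, 0.5), (4.0, 3.0),
          (-2.0, 1.5), (-3.0, -1.0), (-1.5, 2.0), (-4.0, -3.0)]
sides = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_linear_svm(houses, sides)
print(predict_side(weights, bias, (5.0, 0.0)), predict_side(weights, bias, (-5.0, 0.0)))
```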