In this work I focused on the problem of robustly identifying the objective of a customer query. I based my work on the given example, where a fixed set of keywords is used for tagging. The drawback of this method, which I aimed to improve, is that determining the most indicative set of keywords is not trivial: there may be additional keywords the writer is not aware of, and some set phrases may be better indicators than a single word.
Ideally, the keywords and their relative weights would be learned automatically by the tagging algorithm. That is, given a large volume of relevant text, a data mining algorithm would deduce both the topic and its indicative keywords. However, despite a sincere attempt, this goal remained beyond the scope of this project in the given time frame.
Instead, I worked toward obtaining a sample set of manageable size, tagging it manually, and using it as ground truth for a machine learning algorithm. The design goals and assumptions for the algorithm were:
- Work with short, natural sentences, so that minimal post-processing over the text is required.
- Match one sentence with one tag. Analyzing multiple objectives at once is out of scope.
- Return a confidence score, so that an application using it can decide when not to present the result to the user, and ask for input instead.
- Support adding tagged samples, so that users' input may be incorporated into the system.
- If possible, the set of tags should not be limited, allowing more flexible user input.
As the main source of data, I downloaded mobile phone spec pages from Phone Arena. See for example this Samsung Galaxy Express page. The scripts file lists the Bash code snippets used for fetching the pages and extracting data from them.
After downloading 7456 phone pages, I extracted their pros and cons lists, which appear in the middle-left panel. These were meant to serve as a representative body of text for what customers say about mobile phones. Since there were only 37 set phrases used as pros and cons, I tagged them manually according to the objective appearing in the specs: size, weight, resolution, pixel density, etc. The result is the tagged samples file.
Since 37 phrases may be too small a corpus, I also extracted each phone description into a descriptions file. I then compared the descriptions' vocabulary with baseline English, using the Brown corpus (see NLTK data item 5) and Conditional Frequency Distribution analysis (see Chapter 2 of the NLTK book, section 2.2). The desc.py script includes the code for this analysis (it also requires additional downloads).
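To illustrate the idea behind this comparison, the following toy sketch contrasts relative word frequencies in two tiny hand-made word lists. The real analysis used NLTK's ConditionalFreqDist over the Brown corpus; the word lists and the doubling threshold below are illustrative assumptions only.

```python
# Toy sketch of the frequency comparison behind the keyword extraction.
# Two tiny hand-made word lists stand in for the phone descriptions and
# the baseline corpus, so the numbers are illustrative only.
from collections import Counter

phone_words = ("the phone has a qwerty keyboard and a selfie camera "
               "the camera shoots megapixels the battery lasts").split()
baseline_words = ("the house stood by the river and the children "
                  "played in the garden by the water").split()

def rel_freq(words):
    # Map each word to its relative frequency in the list.
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

phone_freq = rel_freq(phone_words)
base_freq = rel_freq(baseline_words)

# Keep words noticeably more frequent in the phone text than in the
# baseline (the factor of 2 is an arbitrary cutoff for illustration).
domain_keywords = sorted(
    w for w, f in phone_freq.items()
    if f > 2 * base_freq.get(w, 0.0))
print(domain_keywords)
```

On real corpora the same relative-frequency contrast surfaces domain terms such as "qwerty" and "megapixels" while filtering out common English words.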
The above analysis resulted in a set of 200 keywords that are more common in the domain of mobile phones than in the news or romance categories of baseline English. Such words include selfies, megapixels and qwerty keyboards. I selected the more unique ones and appended them to the tagged samples set.
To meet the above design goals, I implemented a K-Nearest-Neighbors tagger in which document similarity is the distance metric, and tags from the most similar samples are weighted by their similarity. The tagged samples are tokenized, stemmed, and transformed with TF-IDF into the data matrix. Queries are likewise tokenized and stemmed. The TF-IDF similarity score is computed against each sample, and the top K "vote" for the selected tag. The tagger module implements this algorithm, with an interface similar to scikit-learn classifiers.
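The algorithm can be sketched in plain Python as follows. The class and method names are illustrative, not the actual tagger module's API, and the sketch skips stemming for brevity:

```python
# Minimal sketch of the KNN tagger: TF-IDF vectors, cosine similarity
# as the distance metric, and similarity-weighted voting among the
# top-K most similar samples.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return [w.lower().strip(".,!?") for w in text.split()]

class KNNTagger:
    def __init__(self, k=3):
        self.k = k
        self.docs = []   # token counts per sample
        self.tags = []
        self.idf = {}

    def _refit(self):
        # Recompute inverse document frequencies over all samples.
        n = len(self.docs)
        df = Counter()
        for doc in self.docs:
            df.update(set(doc))
        self.idf = {t: math.log(n / df[t]) for t in df}

    def fit(self, samples, tags):
        self.docs = [Counter(tokenize(s)) for s in samples]
        self.tags = list(tags)
        self._refit()

    def add(self, sample, tag):
        # New samples can shift term frequencies significantly on a
        # small data set, so the IDF table is recomputed each time.
        self.docs.append(Counter(tokenize(sample)))
        self.tags.append(tag)
        self._refit()

    def _vector(self, counts):
        return {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}

    def _cosine(self, a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def predict_proba(self, query):
        # Score the query against every sample; the top K vote for
        # the tag, weighted by their similarity.
        q = self._vector(Counter(tokenize(query)))
        top = sorted(
            ((self._cosine(q, self._vector(d)), tag)
             for d, tag in zip(self.docs, self.tags)),
            reverse=True)[: self.k]
        votes = defaultdict(float)
        for sim, tag in top:
            votes[tag] += sim
        total = sum(votes.values())
        if not total:
            return None, 0.0
        tag = max(votes, key=votes.get)
        return tag, votes[tag] / total

tagger = KNNTagger(k=2)
tagger.fit(["light and thin phone", "heavy brick", "sharp bright screen"],
           ["weight", "weight", "resolution"])
print(tagger.predict_proba("very heavy phone"))  # a ("weight", ...) pair
```

Because KNN keeps the raw samples around, `add` only appends one row and refreshes the IDF table, which is what makes incremental additions cheap compared to retraining a parametric model.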
The bot module exposes the predict_proba method as a REST API POST operation using Klein.
One advantage of KNN is that samples can be added without expensive re-training. This way, when the probability of the prediction is too low, the application may decide to ask the user for additional input.
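The fallback decision might look like the following sketch; the threshold value and the reply strings are illustrative assumptions, not part of the bot module:

```python
def respond(tag, confidence, threshold=0.5):
    # Below the threshold, fall back to asking the user for more
    # input instead of guessing; 0.5 is an illustrative cutoff.
    if confidence < threshold:
        return "Could you tell me more about what you're looking for?"
    return f"Understood: you care about {tag}."

print(respond("weight", 0.9))
print(respond("weight", 0.2))
```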
The tagger's add method takes a document and a tag and adds them to its data set.
Since in the current setting the number of samples may be small, each new sentence may affect the term frequencies significantly. Therefore, the TF-IDF matrix is recalculated on every addition. This may not be necessary once the data set is large enough.
The bot module exposes this method as a REST API PUT operation.
Due to the short time frame, I did not pursue the following possible approaches:
- I did not attempt to parse or match sentences using NLP rules. Adding Part-of-Speech tags may have improved the algorithm, but due to time considerations this was left out.
- As mentioned before, ideally the tags would be learned from the text as well, making manual tagging redundant. This is not impossible, since, for example, Phone Arena pages include specs for each phone. By comparing these specs, an algorithm may be able to determine that a phone is, e.g., heavier than others, and therefore its reviewers are likely to complain about its weight.
- I did not use the full text of phone reviews, although it is richer than the succinct pros-and-cons lists and descriptions.
- KNN is one possible algorithm. Other algorithms, e.g. neural networks, may yield better accuracy and performance. However, the small data set that I had the time to extract and analyze was not sufficient for training large models.
- I did not implement a tagger for the constraint attached to the objective. A very similar approach may be used to the same end, i.e. a tagger with a sample set of positive ("I want...") and negative ("I don't want...") short samples.