[Dclib-devel] CNN for Text Classification

Hi dlib-users,

I guess this question relates to both MITIE and dlib: I would really
like to contribute an example for text classification with a CNN. I'm
working from the MNIST example code and I've got a few questions:

In MITIE I can use pretrained word embeddings from the English model
(total_word_feature_extractor.dat). For example, the wordrep tool shows
the feature vector for a word with --test <Word>.

What is the correct way to get a sentence representation? At the
moment I'm using the following code:

matrix<float,0,1> sentence_matrix;

for (auto &word : sentence) {
  matrix<float,0,1> feats;
  fe.get_feature_vector(word,feats);

  join_rows(sentence_matrix, feats);
}
testing_set.emplace_back(sentence_matrix);
testing_labels.emplace_back(1);

Of course I have to pad the sentence, but would this code create a
correct sentence representation which I could later use to train the
network? The intention is to create an n x k representation for the
sentence (n = number of words in the sentence, k = length of the
feature vector from the word embedding). To concatenate the word
vectors I found the join_rows method, or should I use something else?
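
To make the intended layout concrete, here is a minimal sketch of an n x k
sentence matrix with zero padding. It deliberately uses plain std::vector
instead of dlib::matrix so that it stands alone. SentenceMatrix,
toy_embedding and build_sentence_matrix are hypothetical names, and
toy_embedding is only a stand-in for the real lookup
(fe.get_feature_vector). As far as I understand the dlib docs,
join_rows(a, b) returns the side-by-side matrix [a b] rather than modifying
a, so its result has to be assigned, and stacking k x 1 column vectors that
way gives k x n. For n x k, something like join_cols with trans(feats) may
be closer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Row-major n x k sentence matrix: one row per word, one column per feature.
struct SentenceMatrix {
    std::size_t rows = 0;     // n: number of (padded) word slots
    std::size_t cols = 0;     // k: embedding dimension
    std::vector<float> data;  // rows * cols values

    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// ASSUMPTION: hypothetical stand-in for an embedding lookup such as
// fe.get_feature_vector(word, feats). It derives k deterministic numbers
// from the word length so the example needs no model file.
std::vector<float> toy_embedding(const std::string& word, std::size_t k) {
    std::vector<float> v(k);
    for (std::size_t i = 0; i < k; ++i)
        v[i] = static_cast<float>(word.size() + i);
    return v;
}

// Stack word vectors as rows, truncating or zero-padding to max_len rows.
SentenceMatrix build_sentence_matrix(const std::vector<std::string>& sentence,
                                     std::size_t max_len, std::size_t k) {
    SentenceMatrix m;
    m.rows = max_len;
    m.cols = k;
    m.data.assign(max_len * k, 0.0f);  // all-zero rows act as padding

    for (std::size_t r = 0; r < sentence.size() && r < max_len; ++r) {
        const std::vector<float> feats = toy_embedding(sentence[r], k);
        assert(feats.size() == k);  // every word must yield exactly k features
        for (std::size_t c = 0; c < k; ++c)
            m.data[r * k + c] = feats[c];
    }
    return m;
}
```

Calling build_sentence_matrix({"hi", "there"}, 4, 3) fills the first two
rows with the word vectors and leaves the last two rows as zero padding, so
every sentence ends up with the same shape regardless of its length.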

Thanks in advance + regards,

Stefan