Hetero-Labeled LDA: A partially supervised topic model with heterogeneous label information

Dongyeop Kang1, Youngja Park2, Suresh Chari2

1. IT Convergence Laboratory, KAIST Institute, Korea

2. IBM T.J. Watson Research Center, NY, USA

Topic Discovery - Supervised

Topic classification

–Learn decision boundaries of classes by learning from data with labels

– Accurate topic classification for general domains

→ Very hard to build a model for business applications due to the data bottleneck

Topic Discovery – Unsupervised

Probabilistic topic modeling

–Learn a topic distribution for each class from data without label information, and assign new data the topic with the most similar topic distribution

–e.g., Latent Dirichlet Allocation (LDA)

Not sufficiently accurate or interpretable

Topic Discovery – Semi-supervised

Supervised topic modeling methods

–Supervised LDA [Blei & McAuliffe, 2007], Labeled LDA [Ramage, 2009]: document labels provided

Semi-supervised topic modeling methods

–Seeded LDA [Jagarlamudi, 2012], zLDA [Andrzejewski, 2009]: word labels/constraints provided

Limitations

1. Only one kind of domain knowledge is supported

2. The labels must cover the entire topic space, |L| = |T|

3. All documents must be labeled in the training data, Dunlabeled = ∅

Partially Semi-supervised Topic Modeling with Heterogeneous Labels

Generation of labeled training samples is much more challenging for real-world applications

In most large companies, data are generated and managed independently by many different divisions

Different types of domain knowledge are available in different divisions

Can we discover accurate and meaningful topics with a small amount of various types of domain knowledge?

Hetero-Labeled LDA: Main Contributions

Heterogeneity

–Domain knowledge (labels) comes in different forms

– e.g., document labels, topic-indicative features, a partial taxonomy

Partialness

–Only a small number of labels are given

–We address two kinds of partialness

•Partially labeled documents: |L| << |T|

•Partially labeled corpus: |Dlabeled| << |Dunlabeled|

Three levels of domain information

–Group Information:

–Label Information:

–Topic Distribution:

Challenges

Document labels (Ld)

Feature labels (Lw)

{trade, billion, dollar, export, bank, finance}

{grain, wheat, corn, oil, oilseed, sugar, tonn}

{game, team, player, hit, dont, run, pitch}

{god, christian, jesus, bible, church, christ}

?????

Hetero-Labeled LDA: Heterogeneity

Λd

Document Labels

Λw

Word Labels

Hetero-Labeled LDA: Partialness

Λw

Λd

Kd << K

Kw << K

Hetero-Labeled LDA: Heterogeneity+Partialness

Λd

Λw

Hybrid Constraint

Hetero-Labeled LDA: Generative Process

Hetero-Labeled LDA: Inference & Learning

Gibbs-Sampling
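The slides name Gibbs sampling but do not show the update equations. As a rough illustration only, the sketch below implements the standard collapsed Gibbs sampler for plain LDA, with one added assumption: each document may carry a list of allowed topics, in the spirit of Labeled LDA's document-label restriction. The function name `gibbs_lda`, the `allowed` argument, and the hyperparameter values are hypothetical and are not taken from the HLLDA work; the actual HLLDA sampler also handles word labels and is not reproduced here.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, allowed=None,
              alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs:    list of documents, each a list of integer word ids.
    allowed: optional list with one entry per document; each entry is a
             list of permitted topic ids, or None for "all topics".
             (Hypothetical stand-in for document-label constraints.)
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    all_topics = list(range(num_topics))
    if allowed is None:
        allowed = [None] * len(docs)
    # Candidate topics per document: the label set if given, else every topic.
    cand = [a if a else all_topics for a in allowed]

    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic

    # Random initial assignment within each document's candidate set.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(cand[d])
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional p(z = t | everything else),
                # evaluated only over the document's candidate topics.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in cand[d]
                ]
                k = rng.choices(cand[d], weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw


# Toy corpus: two labeled documents and one unlabeled document.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
allowed = [[0], [1], None]
n_dk, n_kw = gibbs_lda(docs, num_topics=2, vocab_size=4, allowed=allowed)
```

In this toy run the two labeled documents can only ever use their own topic, while the unlabeled third document is free to mix both, which is the behavior a partially labeled corpus calls for.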

Experiments

Datasets:

Algorithms:

–Baseline: LDA, LLDA, zLDA

–Proposed: HLLDA (L=T), HLLDA (L<T)

Evaluation metric:

– Prediction Accuracy: the higher the better

– Clustering F-measure: the higher the better

– Variation of Information: the lower the better
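The three metrics can be sketched in plain Python as follows. The slides do not say which variant of the clustering F-measure was used, so the pairwise definition here is an assumption; the function names are hypothetical. Accuracy assumes predictions live in the same label space as the ground truth, whereas the two clustering metrics are invariant to how cluster ids are named.

```python
import math
from collections import Counter
from itertools import combinations


def accuracy(true, pred):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def pairwise_f_measure(true, pred):
    """F-measure over item pairs (assumed variant): a pair is a true
    positive when both clusterings place the two items together."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true)), 2):
        same_true = true[i] == true[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def variation_of_information(true, pred):
    """VI = H(true) + H(pred) - 2 * I(true; pred), in nats."""
    n = len(true)
    joint = Counter(zip(true, pred))
    c_true = Counter(true)
    c_pred = Counter(pred)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        c / n * math.log(c * n / (c_true[t] * c_pred[p]))
        for (t, p), c in joint.items()
    )
    return entropy(c_true) + entropy(c_pred) - 2 * mutual_info


true = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, cluster ids swapped
merged = [0, 0, 0, 0]      # everything in one cluster

f_relabeled = pairwise_f_measure(true, relabeled)
vi_relabeled = variation_of_information(true, relabeled)
f_merged = pairwise_f_measure(true, merged)
vi_merged = variation_of_information(true, merged)
```

The relabeled clustering scores a perfect F-measure and zero VI despite zero accuracy, which shows why the clustering metrics are reported separately from prediction accuracy.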

Data set
Reuters         21,073     32,848
News20          19,997     82,780
Delicious.5K     5,000    890,429

Experiment: Questions

Q1. How does a mixture of heterogeneous label information improve the performance of classification and clustering?

Multi-class Prediction Accuracy

Clustering F-Measure

Experiment: Questions

Q2. How does HLLDA improve performance on partially labeled documents?

–Partially labeled corpus: |Dlabeled| << |Dunlabeled|

–Partially labeled document: |L| << |T|

For a document, the provided label set covers a subset of all the topics the document belongs to. Our goal is to predict the full set of topics for each document.
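One way to simulate this setting is to hide part of each document's true label set before training and then score the predicted sets against the full sets. The sketch below is an assumption about how such an evaluation could be set up, not the protocol used in the experiments: `hide_labels`, `micro_f1`, the `keep_fraction` parameter, and the choice of micro-averaged F1 are all hypothetical.

```python
import random


def hide_labels(label_sets, keep_fraction, seed=0):
    """Keep a random subset of each document's labels.

    At least one label is kept for every document that has any labels.
    """
    rng = random.Random(seed)
    observed = []
    for labels in label_sets:
        ordered = sorted(labels)  # fixed order so the seed is reproducible
        keep = min(len(ordered), max(1, round(keep_fraction * len(ordered))))
        observed.append(set(rng.sample(ordered, keep)))
    return observed


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 between true and predicted label sets."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


full = [
    {"trade", "grain"},
    {"sport"},
    {"trade", "grain", "oil", "sugar"},
]
observed = hide_labels(full, keep_fraction=0.5)

# Scoring the observed (partial) labels against the full sets gives the
# level a model scores if it only echoes the labels it was given.
baseline = micro_f1(full, observed)
```

A model that recovers hidden labels should score above `baseline`; one that only repeats the provided labels scores exactly at it.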