CVSC 2014

paper:Investigating the Contribution of Distributional Semantic Information for Dialogue Act Classification
download:milajevs-purver14cvsc.pdf
authors:Dmitrijs Milajevs, Matthew Purver
workshop:https://sites.google.com/site/cvscworkshop2014/
@InProceedings{milajevs-purver:2014:CVSC,
  author    = {Milajevs, Dmitrijs  and  Purver, Matthew},
  title     = {Investigating the Contribution of Distributional Semantic Information for Dialogue Act Classification},
  booktitle = {Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)},
  month     = {April},
  year      = {2014},
  address   = {Gothenburg, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {40--47},
  url       = {http://www.aclweb.org/anthology/W14-1505}
}

This page provides the data used in the experiments, describes the software used, and gives instructions on how to re-run the experiments.

Experiment data

Co-occurrence matrix

The co-occurrence matrix used in the experiments was extracted from the English (20120701) version of Google Books Ngrams. You can access the data in three different ways.

in CSV format

This is the most universal way. To access the raw co-occurrence counts, download these three files:

  • cvsc14_targets.csv the mapping of target words to the matrix rows:

    $ head cvsc14_targets.csv
    ngram,id
    .,0
    I,1
    the,2
    and,3
    you,4
    that,5
    it,6
    to,7
    a,8
    
  • cvsc14_contexts.csv the mapping of context words to the matrix columns:

    $ head cvsc14_contexts.csv
    ngram,id
    very,0
    between,1
    then,2
    over,3
    But,4
    your,5
    like,6
    did,7
    must,8
    
  • cvsc14_matrix.csv.gz the nonzero values of the matrix:

    $ zcat cvsc14_matrix.csv.gz | head
    id_target,id_context,count
    0,0,2293330
    0,1,274533
    0,2,838342
    0,3,1662726
    0,4,21622
    0,5,7247041
    0,6,3017870
    0,7,1393590
    0,8,509590
    

    The first record means that . co-occurred with very 2293330 times. The second record means that . co-occurred with between 274533 times.
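
These three files can be loaded with pandas and assembled into a SciPy sparse matrix. The snippet below is a sketch that uses small inline stand-ins with the same schema as the files above; in practice you would pass the downloaded file names to read_csv (with compression='gzip' for cvsc14_matrix.csv.gz).

```python
import io

import pandas as pd
from scipy.sparse import csr_matrix

# Inline stand-ins for the downloaded files, using the first records
# shown above; replace with the real file names in practice.
targets_csv = io.StringIO("ngram,id\n.,0\nI,1\nthe,2\n")
contexts_csv = io.StringIO("ngram,id\nvery,0\nbetween,1\nthen,2\n")
matrix_csv = io.StringIO(
    "id_target,id_context,count\n"
    "0,0,2293330\n0,1,274533\n0,2,838342\n"
)

targets = pd.read_csv(targets_csv, index_col='ngram')
contexts = pd.read_csv(contexts_csv, index_col='ngram')
counts = pd.read_csv(matrix_csv)

# Rows are target words, columns are context words.
space = csr_matrix(
    (counts['count'], (counts['id_target'], counts['id_context'])),
    shape=(len(targets), len(contexts)),
)

# How many times did '.' co-occur with 'very'?
print(space[targets.loc['.', 'id'], contexts.loc['very', 'id']])  # 2293330
```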

in HDF5 using Pandas

If you use Python and are familiar with Pandas, you can also get the co-occurrence matrix as a SciPy sparse matrix. Download cvsc14_matrix.h5. Here is an example of how to access the data:

>>> import pandas as pd
>>> from scipy.sparse import csr_matrix

# Read targets, context and the co-occurrence frequencies
>>> with pd.get_store('cvsc14_matrix.h5', mode='r') as store:
...    targets = store['targets']
...    context = store['context']
...    matrix = store['matrix'].reset_index()
...

# Access the data
>>> targets
       id
ngram
.       0
I       1
the     2
and     3
you     4
      ...

[23585 rows x 1 columns]
>>> context
         id
ngram
very      0
between   1
then      2
over      3
But       4
        ...

[2900 rows x 1 columns]
>>> matrix
   id_target  id_context    count
0          0           0  2293330
1          0           1   274533
2          0           2   838342
3          0           3  1662726
4          0           4    21622
         ...         ...      ...

[11867396 rows x 3 columns]

# Build a sparse matrix
>>> space = csr_matrix(
...    (
...        matrix['count'].values,
...        matrix[['id_target','id_context']].values.T,
...    ),
... )
>>> space
<23585x2900 sparse matrix of type '<class 'numpy.uint64'>'
    with 11867396 stored elements in Compressed Sparse Row format>

# Get a vector for `country`
>>> space[targets.loc['country'].id]
<1x2900 sparse matrix of type '<class 'numpy.uint64'>'
    with 2643 stored elements in Compressed Sparse Row format>
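
With a row vector per target, distributional similarity reduces to a vector operation on matrix rows. A minimal illustration (not part of the original pipeline) using a tiny stand-in matrix; with the real space you would pick rows via targets.loc[word].id as shown above:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for the 23585x2900 co-occurrence space.
space = csr_matrix(np.array([
    [3, 0, 1],   # row 0
    [6, 0, 2],   # row 1: proportional to row 0
    [0, 5, 0],   # row 2: disjoint contexts
]))

def cosine(space, i, j):
    """Cosine similarity between rows i and j of a sparse matrix."""
    u = space[i].toarray().ravel()
    v = space[j].toarray().ravel()
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(space, 0, 1), 6))  # 1.0
print(round(cosine(space, 0, 2), 6))  # 0.0
```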

in HDF5 using fowler.corpora

Finally, you can use fowler.corpora to do the job for you:

>>> from fowler.corpora.models import read_space_from_file

>>> space = read_space_from_file('cvsc14_matrix.h5')
>>> space['country']
<1x2900 sparse matrix of type '<class 'numpy.uint64'>'
        with 2643 stored elements in Compressed Sparse Row format>

CC This work is licensed under a Creative Commons Attribution 4.0 International License.

Predicted tags

y_test.csv contains the true tags of the testing data. y_predicted.csv contains the predicted tags.
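
The two files can be compared to get the overall accuracy. A sketch with pandas; the single tag column and the inline data are assumptions for illustration, not the actual file layout:

```python
import io

import pandas as pd

# Inline stand-ins for y_test.csv and y_predicted.csv; in practice,
# pass the file names to read_csv and adjust the column name.
y_test = pd.read_csv(io.StringIO("tag\nsd\nb\nsv\nsd\n"))
y_predicted = pd.read_csv(io.StringIO("tag\nsd\nb\nsd\nsd\n"))

# Fraction of utterances whose predicted tag matches the true tag.
accuracy = (y_test['tag'] == y_predicted['tag']).mean()
print(round(accuracy, 3))  # 0.75
```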

Switchboard split

ws97-train-convs.list.txt is the list of training conversations. ws97-test-convs.list.txt is the list of test conversations.

Software

fowler.corpora was developed to run the experiments. At the time of writing, the package is still at an early development stage. If you are interested in using it, contact d.milajevs@qmul.ac.uk or dimazest@gmail.com. Check out the cvsc14 tag.

Check out or view the IPython notebook of the experiment.

NumPy, SciPy and scikit-learn performed the computations. The Google Ngram downloader was used to obtain the ngram data. Finally, pandas handled IO and data management.

Re-running the experiments

A preconfigured virtual machine image is available in VM Depot. Please refer to the VM Depot and Microsoft Azure documentation on how to run the image in the cloud. I used the A7 (8 cores, 56 GB memory) configuration:

$ openssl req -x509 -key ~/.ssh/id_rsa -nodes -days 365 \
-newkey rsa:2048 -out cert.pem
$ azure vm create $DNS_NAME -o vmdepot-35178-1-32 \
-l "West Europe" azureuser -z a7 --ssh -t cert.pem -P

Once the machine is running, ssh to it (all the data is stored in /home/azureuser/):

$ ssh azureuser@$DNS_NAME.cloudapp.net -i cert.pem
Dialogue act tagging.

This is an isolated environment to run dialogue act tagging experiments.
For more details, see http://www.eecs.qmul.ac.uk/~dm303/cvsc14.html

Check README for further instructions.
Last login: Tue Apr 15 17:48:46 2014 from 127.0.0.1

$ head README -n 11
Investigating the Contribution of Distributional Semantic Information for
Dialogue Act Classification.

Run the following commands to reproduce the experiments described in the paper.
Append the -v flag to write logs to /tmp/fowler.log. Use --limit NUMBER to limit the
training data set. Use the -j NUMBER option to parallelize the computation among
several executors.

1. Bag of unigrams

   tools/bin/corpora serafin03 plain-lsa

$ tools/bin/corpora serafin03 plain-lsa -j 8
:paper: Serafin et al. 2003
:accuracy: 0.604
:command: tools/bin/corpora serafin03 plain-lsa -j 8

==================== ========== ========== ========== ==========
                 tag  precision     recall   f1-score    support
==================== ========== ========== ========== ==========
                   %      0.526      0.742      0.615        360
                  ^2      0.118      0.105      0.111         19
                  ^h      0.143      0.143      0.143          7
                  ^q      0.000      0.000      0.000         17
                  aa      0.329      0.505      0.398        208
              aap_am      0.000      0.000      0.000          7
                  ad      0.125      0.037      0.057         27
                  ar      0.000      0.000      0.000          3
              arp_nd      0.000      0.000      0.000          3
                   b      0.803      0.763      0.783        765
                 b^m      0.000      0.000      0.000         21
                  ba      0.571      0.737      0.644         76
                  bd      1.000      1.000      1.000          1
                  bf      0.000      0.000      0.000         23
                  bh      0.500      0.571      0.533         21
                  bk      0.364      0.429      0.393         28
                  br      0.625      0.556      0.588          9
                  fa      1.000      0.500      0.667          2
                  fc      0.648      0.432      0.519         81
     fo_o_fw_"_by_bc      0.333      0.062      0.105         16
                  fp      0.250      0.200      0.222          5
                  ft      0.000      0.000      0.000          7
                   h      0.588      0.435      0.500         23
                  na      0.000      0.000      0.000         10
                  ng      0.000      0.000      0.000          6
                  nn      0.500      0.923      0.649         26
                  no      0.000      0.000      0.000          6
                  ny      0.278      0.068      0.110         73
                  qh      0.250      0.083      0.125         12
                  qo      0.550      0.688      0.611         16
                 qrr      0.200      0.500      0.286          2
                  qw      0.629      0.400      0.489         55
                qw^d      0.000      0.000      0.000          1
                  qy      0.429      0.429      0.429         84
                qy^d      0.286      0.111      0.160         36
                  sd      0.624      0.801      0.702       1317
                  sv      0.593      0.253      0.355        718
                  t1      0.000      0.000      0.000          1
                   x      0.887      1.000      0.940         94
-------------------- ---------- ---------- ---------- ----------
  weighted avg/total      0.593      0.604      0.577       4186
==================== ========== ========== ========== ==========

The model is trained on the full development set.
The scores are computed on the full evaluation set.
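
The weighted avg/total row in the table above averages the per-tag scores weighted by support, i.e. the number of true instances of each tag. A small pure-Python illustration with toy tags (not the Switchboard data):

```python
from collections import Counter

# Toy true and predicted dialogue act tags.
y_true = ['sd', 'sd', 'sd', 'b', 'b', 'sv']
y_pred = ['sd', 'sd', 'b',  'b', 'sd', 'sv']

def scores(tag):
    """Precision, recall and F1 for a single tag."""
    tp = sum(t == p == tag for t, p in zip(y_true, y_pred))
    precision = tp / sum(p == tag for p in y_pred) if tag in y_pred else 0.0
    recall = tp / sum(t == tag for t in y_true)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Weight each tag's F1 by its support (count of true instances).
support = Counter(y_true)
n = len(y_true)
weighted_f1 = sum(support[tag] / n * scores(tag)[2] for tag in support)
print(round(weighted_f1, 3))  # 0.667
```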

Comments