DARPA Communicator Testbed


Log Standard Proposal (v3)


Introduction

This document is intended to establish standards for logfile contents and format. We will try to determine the smallest set of data that is sufficient to re-run a system and that also supports meaningful metrics. This set may vary depending on how much of the system is to be re-run and on what we would like to measure. In the process we will attempt to establish a standard format to which all logfiles can be converted (or in which they can be generated, although we foresee that at least a minimal amount of inferencing might be required to render the logs in this form). A goal of this document is to provide a standard that is flexible and general enough to be used in different domains.

To accomplish this goal, we propose an XML DTD that records the basic events in a Communicator-compliant system. Elements in the log can be annotated with type information indicating that a data element is "significant" from the point of view of annotators (and annotation tools).

To clarify, we will consider the following (term definitions are by no means final and are open to suggestion):

The definition of "turn" requires special attention. In some accounts, a turn is an exchange between user and system. In a robust dialogue context, this definition fails to be adequate when the user or system barges in with follow-up information, etc., or when the dialogue involves more than two parties (a situation which we shouldn't rule out). We propose that the term "turn" in the context of these log files be reserved for the processing of a single participant's utterance (either user or system). This definition is not without its problems. For instance, it's not clear whether a call to the backend belongs at the end of the processing of a user's utterance (because it's the presentation of the utterance to the backend) or the beginning of the processing of the system's utterance (because it's the source of the system's response). We can currently think of nothing that this decision hinges on in the data analysis, and recommend that either interpretation be recognized at the moment.


Content

Here we will try to discuss the granularity of data to be logged in an end-to-end system. The contents of these bullets were derived mainly from the information needed by MITRE to do its own internal evaluation and will probably change as the perspectives of other sites are incorporated. Every log should contain enough information to determine the following (here input refers to the user sending information to the system and output refers to the system sending information to the user). Ideally, all this information should be extractable from the log file without any site-specific analysis. In this table, we describe the data to be logged, whether it's optional or obligatory, and how we propose to standardize access to the data:
 
Data | Obligatory | Standard access
Duration of session | yes | readable directly off the XML representation proposed below
Duration of turn (input or output) | yes | readable directly off the XML representation proposed below
Duration of generation of output (in a phone demo, the time the synthesizer takes to generate the audio file) | yes | see 1
Duration of display of output (in a phone demo, how long it takes to play the audio file) | yes | see 2
Duration of recognition of input (in a phone demo, how long it takes the recognizer to produce its hypotheses) | yes | see 3
Duration of arbitrary operations | no | readable directly off the XML representation proposed below
Number of turns within a session | yes | readable directly off the XML representation proposed below
Number of sessions (in our current model each session is its own logfile) | yes | readable directly off the XML representation proposed below
The audio files corresponding to the user input and system output and their formats. The audio files should be stored and distributed with the logs, and the pathnames of these files should be relative to the log. | yes | accessed given an arbitrary search of the logged data (see the "audio_input" and "audio_output" values for the type attribute of the GC_DATA tag, as well as the "mime_type" attribute)
The text of the user input chosen by the system | yes | accessed given an arbitrary search of the logged data (see the "text_input" values for the type attribute of the GC_DATA tag)
The text of the system output | yes | accessed given an arbitrary search of the logged data (see the "text_output" value for the type attribute of the GC_DATA tag)
All possible input sentences (from the recognizer) up to a certain limit (TBD) (N/A to systems that use a word lattice) | no | accessed given an arbitrary search of the logged data (see the "text_input_hypothesis" value for the type attribute of the GC_DATA tag)
Indication of whether the parse succeeded | no | see 4
The full input interpretation | no | accessed given an arbitrary search of the logged data
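
As a rough illustration of the two access modes named in the table ("readable directly" and "arbitrary search"), the following Python sketch uses only the standard library. The sample log is invented for illustration: it follows the element and attribute names proposed in the Format section, the key ":input_string" is a hypothetical name, and times are assumed to be written as decimal numbers as in the examples in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature log, invented for illustration only.
SAMPLE_LOG = """
<GC_LOG>
  <GC_SESSION id="129.10.2.200:1010:3" stime="930254422.72" etime="930254434.79">
    <GC_TURN id="000" stime="930254422.72" etime="930254424.79">
      <GC_DATA key=":input_string" type="text_input">I want to fly to Boston</GC_DATA>
    </GC_TURN>
    <GC_TURN id="001" stime="930254424.79" etime="930254434.79">
      <GC_DATA key=":reply_string" type="text_output">What date will you be travelling?</GC_DATA>
    </GC_TURN>
  </GC_SESSION>
</GC_LOG>
"""

def duration(element):
    """Difference between etime and stime, in whatever unit the log uses."""
    return float(element.get("etime")) - float(element.get("stime"))

def data_of_type(root, wanted):
    """Search the whole log for GC_DATA elements whose type attribute lists `wanted`."""
    return [
        (data.text or "").strip()
        for data in root.iter("GC_DATA")
        if wanted in (data.get("type") or "").split()
    ]

root = ET.fromstring(SAMPLE_LOG)
session = root.find("GC_SESSION")

session_duration = duration(session)           # read directly off the session element
turn_count = len(session.findall("GC_TURN"))   # read directly off the structure
user_text = data_of_type(root, "text_input")   # arbitrary search of the logged data
```

Nothing in this sketch depends on site-specific analysis, which is the property the table is meant to guarantee.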

The elements which may pose minor complications carry a numbered reference in place of a standard access method. Here we make tentative proposals for each of these:

  1. Duration of output generation. In a system where there is a single, obvious call to the synthesizer, this is simply the duration of that operation, but this is only one possible configuration. We propose that the "type" attribute be added to the GC_OPERATION element and that a "virtual" operation be generated by a postprocess phase with a distinguished type (say, "synthesis_duration"); alternatively, we could introduce a new XML element (say, GC_EVENT) reserved for these "virtual" events.
  2. Duration of output presentation. In the MIT system, this is an inference from notifications posted by the audio server (playing_has_begun, playing_has_ended; see the Communicator documentation for the MIT audio server). This could be handled similarly to output generation, or we could add optional start and end time attributes to the GC_DATA element which contains the audio file.
  3. Duration of recognition. Again, we propose to handle this similarly to output generation.
  4. Indication of whether the parse succeeded. Again, this is frequently an inference. We can insert a distinguished GC_DATA element (say, with a type of "input_parse_successful").
We believe that this sort of proposal will allow sites to gather data in the form they prefer, and augment it with sharable semantics in such a way that individual sites' data will retain its site-specific integrity.
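
The postprocess phase suggested in items 1 and 4 might look roughly like the sketch below. It is only one possible reading of the proposal. The server name "synth", the operation names, the port, and the "yes"/"no" encoding of the parse flag are all assumptions made for illustration; a site would substitute whatever identifies its own synthesizer calls and parse results.

```python
import xml.etree.ElementTree as ET

# Hypothetical turn: the server name "synth" and the operation names are assumptions.
TURN_XML = """
<GC_TURN id="001" stime="10.0" etime="14.0">
  <GC_OPERATION server="synth" turnid="001" location="localhost:12000"
                name="prepare" stime="10.5" etime="11.0"/>
  <GC_OPERATION server="synth" turnid="001" location="localhost:12000"
                name="synthesize" stime="11.0" etime="12.25"/>
</GC_TURN>
"""

def add_virtual_operation(turn, server, op_type):
    """Append one GC_OPERATION spanning every operation run by `server` in this turn."""
    ops = [op for op in turn.findall("GC_OPERATION") if op.get("server") == server]
    if not ops:
        return None
    # Compare numerically, but copy the original strings so no precision is lost.
    start = min(ops, key=lambda op: float(op.get("stime"))).get("stime")
    end = max(ops, key=lambda op: float(op.get("etime"))).get("etime")
    return ET.SubElement(turn, "GC_OPERATION", {
        "type": op_type, "turnid": turn.get("id"), "server": server,
        "location": ops[0].get("location"), "name": op_type,
        "stime": start, "etime": end,
    })

def mark_parse_result(turn, succeeded):
    """Insert the distinguished GC_DATA element suggested in item 4."""
    data = ET.SubElement(turn, "GC_DATA", {
        "key": "input_parse_successful",   # hypothetical key name
        "type": "input_parse_successful",
        "turnid": turn.get("id"),
    })
    data.text = "yes" if succeeded else "no"   # the value encoding is an assumption
    return data

turn = ET.fromstring(TURN_XML)
virtual = add_virtual_operation(turn, "synth", "synthesis_duration")
flag = mark_parse_result(turn, True)
```

The original operations are left untouched, so the site-specific record survives alongside the shared annotation.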
 


Format

We believe that XML is a good candidate language for this format for many reasons, among them the growing supply of viewers and editors and the variety of parsers available in many programming languages.

We propose that operations should be logged as single XML elements. For example:

<GC_OPERATION
    server="nl"
    turnid="-01"
    location="localhost:11000"
    name="paraphrase_reply"
    stime="930254422.720000"
    etime="930254422.790000"
>
    <GC_DATA
        type="input"
        key="tidx"
    >
    3813
    </GC_DATA>
    <GC_DATA
        type="output"
        key=":reply_string"
    >
    Hi! Welcome to MITRE's Travel demonstration. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
    </GC_DATA>
</GC_OPERATION>


Since in our distributed architecture messages are sent asynchronously, and many events may occur before the completion of an operation, some caching (or post processing) will be necessary to log operations as single elements.
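
One way to picture the caching step is a small table of pending operations, keyed by an identifier, that is filled when a request is seen and emptied when the matching reply arrives. The Python sketch below is an illustration under stated assumptions: the `start`/`finish` interface and the operation identifiers are hypothetical and do not correspond to actual Hub messages, and the "synth" server and its port are invented.

```python
import xml.etree.ElementTree as ET

class OperationCache:
    """Hold the start of each operation until its reply arrives, then emit one element."""

    def __init__(self):
        self.pending = {}    # op_id -> attributes and input data seen so far
        self.completed = []  # finished GC_OPERATION elements, in completion order

    def start(self, op_id, time, server, location, name, turnid, inputs):
        self.pending[op_id] = {
            "attrs": {"server": server, "turnid": turnid, "location": location,
                      "name": name, "stime": time},
            "data": [("input", key, value) for key, value in inputs.items()],
        }

    def finish(self, op_id, time, outputs):
        record = self.pending.pop(op_id)
        element = ET.Element("GC_OPERATION", dict(record["attrs"], etime=time))
        entries = record["data"] + [("output", k, v) for k, v in outputs.items()]
        for kind, key, value in entries:
            data = ET.SubElement(element, "GC_DATA", {"type": kind, "key": key})
            data.text = value
        self.completed.append(element)
        return element

cache = OperationCache()
# Two operations overlap: "a" starts first but finishes last.
cache.start("a", "930254422.720000", "nl", "localhost:11000",
            "paraphrase_reply", "-01", {"tidx": "3813"})
cache.start("b", "930254422.730000", "synth", "localhost:12000",
            "synthesize", "-01", {})
cache.finish("b", "930254422.760000", {})
first = cache.finish("a", "930254422.790000", {":reply_string": "Hi!"})
```

Elements are emitted in completion order here; a postprocess could equally sort them by stime before writing the log.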

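Next we will try to define the main entities in the logfile and their formats. A DTD that defines these terms and their relations is also available. We will assume that all time values are measured from a standard base time known as "the epoch" (January 1, 1970, 00:00:00 GMT). The tables below give the unit as milliseconds; note, however, that the sample values in this document (for example, 930254422.720000) are written as seconds since the epoch with a fractional part.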
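
The sample time values used in this document (for example, 930254422.720000) fall on June 24, 1999 when read as seconds since the epoch, which agrees with the date embedded in the sample audio pathname. The Python sketch below converts such values under that assumption; a log that really stores milliseconds would need the value divided by 1000 first.

```python
from datetime import datetime, timezone

def parse_log_time(value):
    """Interpret a logged time as seconds since the epoch (an assumption based on
    the sample values in this document) and return a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)

started = parse_log_time("930254422.720000")
ended = parse_log_time("930254422.790000")

# Duration of the operation shown in the example above, in seconds.
elapsed_seconds = (ended - started).total_seconds()
```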
 

GC_SESSION

A session represents an interaction of a user with the system. In our current demo, a session is equivalent to a phone call. The elements in this table refer to the XML DTD.
 
Name | Description | Type | Required
id | We should attempt to determine a unique identifier for sessions. MIT's solution uses the format IP:process id:session counter. Process ids might not be trivial to obtain in every programming language and operating system; however, "equivalent" data are usually available. | string | yes
stime | time when session started | milliseconds | yes
etime | time when session finished | milliseconds | yes
GC_TURN | see GC_TURN | GC_TURN | no

 

Example:

<GC_SESSION
    id="129.10.2.200:1010:3"
    stime="930254422.720000"
    etime="930254434.790000"
>
    ...
</GC_SESSION>
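
A session identifier in the IP:process id:session counter format could be produced along the following lines. This is a minimal Python sketch, not MIT's implementation: the function names are hypothetical, and the fallback to the loopback address when the host name cannot be resolved is our own assumption.

```python
import itertools
import os
import socket

# Counts sessions started by this process, beginning at 1.
_session_counter = itertools.count(1)

def local_ip():
    """Best-effort lookup of this host's address; falls back to loopback."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"

def make_session_id(ip=None, pid=None, counter=None):
    """Build an identifier of the form IP:process id:session counter."""
    ip = local_ip() if ip is None else ip
    pid = os.getpid() if pid is None else pid
    counter = next(_session_counter) if counter is None else counter
    return f"{ip}:{pid}:{counter}"

# Reproduces the identifier used in the example above.
example_id = make_session_id("129.10.2.200", 1010, 3)
```

On platforms where a process id is awkward to obtain, any value that is unique per running instance could be passed as `pid` instead.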

GC_TURN

A turn covers the processing of a single participant's utterance (either user or system), as discussed in the introduction. The elements in this table refer to the XML DTD.
 
Name | Description | Type | Required
id | A unique identifier within each session | number | yes
stime | time when turn started | milliseconds | yes
etime | time when turn ended | milliseconds | yes
GC_DATA | see GC_DATA | GC_DATA | no
GC_OPERATION | see GC_OPERATION | GC_OPERATION | no
GC_FRAME | see GC_FRAME | GC_FRAME | no

 

Example:

<GC_TURN
    id="-01"
    stime="930254422.720000"
    etime="930254424.790000"
>
    ...
</GC_TURN>

GC_OPERATION

An operation is any command executed by the system within a turn. All operations can send and receive data, frames or audio files. The elements in this table refer to the XML DTD.
 
Name | Description | Type | Required
type | the type of operation being executed (specific values TBD) | string | no
turnid | the turn id that this operation was executed under | number | yes
stime | time when operation started | milliseconds | yes
etime | time when operation ended | milliseconds | yes
server | the name (according to the program file) of the server that executed the operation | string | yes
location | the server (real server name or IP address) and its port (server_name:port_number) | string | yes
name | the name of the operation | string | yes
GC_DATA | see GC_DATA | GC_DATA | no
GC_FRAME | see GC_FRAME | GC_FRAME | no

 

Example:

<GC_OPERATION
    server="nl"
    turnid="-01"
    location="localhost:11000"
    name="paraphrase_reply"
    stime="930254422.720000"
    etime="930254422.790000"
>
    <GC_DATA
        type="input"
        key="tidx"
    >
    3813
    </GC_DATA>
    <GC_DATA
        type="output"
        key=":reply_string"
    >
    Hi! Welcome to MITRE's Travel demonstration. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
    </GC_DATA>
</GC_OPERATION>

GC_DATA

A key/value pair. This datatype can be used to display the information involved in an operation, as well as to display the contents of a GC_FRAME. The elements in this table refer to the XML DTD.
 
Name | Description | Type | Required
key | the name of this data point | string | yes
turnid | the turn id that this data point was logged under | number | no
time | time stamp for this data point | milliseconds | no
type | valid values of type include audio_input, audio_output, text_input, text_output, text_input_hypothesis, and concept. See the Content section. | string | no
mime_type | the mime type of the data | string | no

 

Examples:

<GC_DATA
key=":synth_log_filename"
turnid="-01"
type="audio_output"
mime_type="audio/wav"
>
/home/communicator/Travel-demo/../logs/travel_cfone/19990624/006/travel_cfone-19990624-006-synth--01.wav
</GC_DATA>

<GC_DATA
key=":listening_has_begun"
turnid="000"
time="930254422.790000"
>
</GC_DATA>
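
Since the Content section asks that audio pathnames be relative to the log, a reader of the log needs to resolve them against the log's own location. The Python sketch below shows one way this might be done; the log location and the relative file name are hypothetical, and POSIX-style paths are assumed.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

AUDIO_TYPES = ("audio_input", "audio_output")

def audio_files(root, log_path):
    """Return (type, mime_type, resolved path) for each audio GC_DATA element."""
    log_dir = PurePosixPath(log_path).parent
    found = []
    for data in root.iter("GC_DATA"):
        # The DTD declares type as NMTOKENS, so it may hold several values.
        for kind in (data.get("type") or "").split():
            if kind in AUDIO_TYPES:
                relative = (data.text or "").strip()
                found.append((kind, data.get("mime_type"), str(log_dir / relative)))
    return found

# Hypothetical turn containing one audio reference relative to the log.
root = ET.fromstring(
    '<GC_TURN id="-01" stime="0" etime="1">'
    '<GC_DATA key=":synth_log_filename" turnid="-01" type="audio_output" '
    'mime_type="audio/wav">audio/synth--01.wav</GC_DATA>'
    '</GC_TURN>'
)
files = audio_files(root, "logs/19990624/006/session.xml")
```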

GC_FRAME

This structure allows frames to be recorded. The elements in this table refer to the XML DTD.
 
Name | Description | Type | Required
frame_type | Galaxy frame type | string | no
name | the name of the frame | string | no
turnid | the turn id that this frame was logged under | number | no
GC_DATA | see GC_DATA | GC_DATA | no

Example:

<GC_FRAME
    turnid="000"
    name="scores"
    frame_type="c"
>
<GC_DATA
    key=":total_score"
>
-1408.9955
</GC_DATA>
<GC_DATA
    key=":acoustic_score"
>
-1367.4408
</GC_DATA>
<GC_DATA
    key=":ngram_score"
>
-15.5547
</GC_DATA>
<GC_DATA
    key=":nphones"
>
58
</GC_DATA>
<GC_DATA
    key=":nwords"
>
13
</GC_DATA>
</GC_FRAME>
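
Because the DTD allows a GC_FRAME to contain further GC_FRAME elements, a consumer of the log will usually walk frames recursively. The Python sketch below turns a frame into a nested dictionary. The nested "detail" frame is invented for illustration, and storing a nested frame under its name attribute is our own choice (the attribute is optional, so unnamed frames would need different handling).

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the example above, plus a hypothetical nested frame.
FRAME_XML = """
<GC_FRAME turnid="000" name="scores" frame_type="c">
  <GC_DATA key=":total_score">-1408.9955</GC_DATA>
  <GC_DATA key=":nwords">13</GC_DATA>
  <GC_FRAME name="detail" frame_type="c">
    <GC_DATA key=":nphones">58</GC_DATA>
  </GC_FRAME>
</GC_FRAME>
"""

def frame_to_dict(frame):
    """Convert a GC_FRAME into a dict; nested frames are stored under their name."""
    result = {}
    for child in frame:
        if child.tag == "GC_DATA":
            # Values are kept as strings; the log carries no numeric typing.
            result[child.get("key")] = (child.text or "").strip()
        elif child.tag == "GC_FRAME":
            result[child.get("name")] = frame_to_dict(child)
    return result

scores = frame_to_dict(ET.fromstring(FRAME_XML))
```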

Code support

MITRE volunteers to work with sites to produce the appropriate conversion tools from MIT logfiles to the proposed logfile standard. If more appropriate, we will produce a new logging module for the Hub which will simplify this process; however, we do not expect this to be necessary.


Document Type Definition (DTD)

Below we provide an XML DTD to define the above types.
 

<?xml version="1.0"?>

<!ELEMENT GC_LOG (GC_SESSION)*>

<!ELEMENT GC_SESSION (GC_TURN)*>
<!ATTLIST GC_SESSION id NMTOKEN #REQUIRED>
<!-- time could be defined as CDATA if we chose to use a non millisecond format -->
<!ATTLIST GC_SESSION stime NMTOKEN #REQUIRED>
<!ATTLIST GC_SESSION etime NMTOKEN #REQUIRED>

<!ELEMENT GC_TURN ( GC_OPERATION | GC_DATA | GC_FRAME )*>
<!ATTLIST GC_TURN id NMTOKEN #REQUIRED>
<!ATTLIST GC_TURN stime NMTOKEN #REQUIRED>
<!ATTLIST GC_TURN etime NMTOKEN #REQUIRED>

<!ELEMENT GC_OPERATION ( GC_DATA | GC_FRAME )*>
<!ATTLIST GC_OPERATION type NMTOKENS #IMPLIED>
<!ATTLIST GC_OPERATION turnid NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION server CDATA #REQUIRED>
<!ATTLIST GC_OPERATION location NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION name CDATA #REQUIRED>
<!ATTLIST GC_OPERATION stime NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION etime NMTOKEN #REQUIRED>

<!ELEMENT GC_DATA ANY>
<!ATTLIST GC_DATA key NMTOKEN #REQUIRED>
<!ATTLIST GC_DATA type NMTOKENS #IMPLIED>
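<!-- mime_type is CDATA because values such as "audio/wav" contain "/", which is not allowed in an NMTOKEN -->
<!ATTLIST GC_DATA mime_type CDATA #IMPLIED>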
<!ATTLIST GC_DATA time NMTOKEN #IMPLIED>
<!ATTLIST GC_DATA turnid NMTOKEN #IMPLIED>

<!ELEMENT GC_FRAME ( GC_DATA | GC_FRAME )*>
<!ATTLIST GC_FRAME frame_type NMTOKEN #IMPLIED>
<!ATTLIST GC_FRAME name CDATA #IMPLIED>
<!ATTLIST GC_FRAME turnid NMTOKEN #IMPLIED>
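
Python's standard xml.etree.ElementTree module does not validate documents against a DTD, so a converter written with it may want a lightweight check of its own. The sketch below tests only the #REQUIRED attributes declared above and is not a substitute for a validating parser. The table of required attributes is transcribed by hand from the DTD, and the sample log is invented and deliberately incomplete.

```python
import xml.etree.ElementTree as ET

# Required attributes, transcribed from the #REQUIRED declarations in the DTD above.
REQUIRED = {
    "GC_SESSION": {"id", "stime", "etime"},
    "GC_TURN": {"id", "stime", "etime"},
    "GC_OPERATION": {"turnid", "server", "location", "name", "stime", "etime"},
    "GC_DATA": {"key"},
}

def missing_attributes(root):
    """Return (tag, sorted missing attribute names) for every element lacking one."""
    problems = []
    for element in root.iter():  # document order, starting at the root
        missing = REQUIRED.get(element.tag, set()) - set(element.attrib)
        if missing:
            problems.append((element.tag, sorted(missing)))
    return problems

# Invented log: the turn lacks etime and the data element lacks key.
log = ET.fromstring(
    '<GC_LOG><GC_SESSION id="s1" stime="0" etime="9">'
    '<GC_TURN id="000" stime="0">'
    '<GC_DATA type="text_input">hello</GC_DATA>'
    '</GC_TURN></GC_SESSION></GC_LOG>'
)
problems = missing_attributes(log)
```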