Quantitative Evaluation Results
Metrics
VCUA:
87.55% LLMNode prediction accuracy for several vocal commands based on the transcribed vocal instructions from the SRNode. < 3%Â confusion
NSR:
86.27% accuracy in the REM’s ability to abstract the high-level understanding from the LLMNode to the actual robot’s navigation actions.
OIA:
69.09% CLIPNode accuracy in identifying and localizing objects within the robot’s task environment.
Confusion Matrix:
strong predictive accuracy for several vocal commands, e.g., vocal instructions completely in English sentences (phrase or clause).
struggles with some vocal commands, e.g., vocal instructions containing action descriptions specified in non-English words. "Workshop" and "Offices" contain the majority of the action patterns described with non-English words in the task dictionary.
Commands Sent and Received Time
Robot's Average Response Time
ART:
on average, the robot takes less than a second from receiving a vocal chat command to initiating the robot's actual physical action, which suggests a relatively quick response time for our framework.