Training n-gram Language Models on Congressional Speech Data
Recently, I’ve been working with a dataset that contains the public statements of congresspeople from the past 10 years.
The data was scraped by a colleague from Vote Smart’s public statements collection.
We have a longer-term project running that involves this dataset, but this past weekend, I wondered what kind of text an n-gram model would generate from the public statements (hilarious text, no doubt).
Then, I realized that I’d never actually implemented an n-gram language model that could generate text before.
I set to work, and a little while later (plus some tweaks over the next week or so), I made this.
The Markov model code is general enough to let the user choose the length of the n-grams to search for, which is nice. It's also general enough that it doesn't have to be run on word data; any sequence of tokens will do. A rough sketch of the idea is below.
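To make that concrete, here's a minimal sketch of what such an order-n Markov generator might look like. The names here (`MarkovModel`, `train`, `generate`) are my own illustrative choices, not the actual code from the project:

```python
# A minimal sketch of an order-n Markov text generator.
# Illustrative only; not the post's actual implementation.
import random
from collections import defaultdict

class MarkovModel:
    def __init__(self, n):
        # n is the length of the context window used to pick the next token
        self.n = n
        self.transitions = defaultdict(list)

    def train(self, tokens):
        # Record, for every n-token window, the token that follows it.
        # tokens can be words, characters, or any other hashable items.
        for i in range(len(tokens) - self.n):
            context = tuple(tokens[i:i + self.n])
            self.transitions[context].append(tokens[i + self.n])

    def generate(self, length):
        # Start from a random observed context and walk the chain,
        # repeatedly sampling a follower of the last n tokens.
        context = random.choice(list(self.transitions))
        output = list(context)
        for _ in range(length - self.n):
            followers = self.transitions.get(tuple(output[-self.n:]))
            if not followers:
                break  # dead end: this context never appeared in training
            output.append(random.choice(followers))
        return output
```

Trained over word tokens, something like `" ".join(model.generate(12))` produces snippets like the ones below; nothing in the code assumes the tokens are words, so training over characters would work just as well.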
An example from Marco Rubio:
```
significant interests in ensuring that no matter what your parents
```
And another one, because this one had Wiz Khalifa in it and I can’t let that go to waste:
```
institution and i was born with these threats by focusing solely
```