Training n-gram Language Models on Congressional Speech Data
Recently, I’ve been working with a dataset that contains the public statements of congresspeople from the past 10 years.
The data was scraped by a colleague from Vote Smart’s public statements collection.
We have a longer-term project running that involves this dataset, but this past weekend, I wondered what kind of text an n-gram model would generate from the public statements (hilarious text, no doubt).
Then, I realized that I’d never actually implemented an n-gram language model that could generate text before.
I set to work, and a little while later (plus some tweaks over the next week or so), I made this.
The Markov model code is general enough to let the user choose the length of the n-grams to search for, which is nice. It's also general enough that it doesn't have to be run on word data; any sequence of tokens will do. A rough sketch of the idea is below.
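To make that concrete, here's a minimal sketch of what such an order-n Markov generator might look like. The names here (`MarkovModel`, `train`, `generate`) are my own illustrative choices, not the actual code from the project:

```python
# A minimal sketch of an order-n Markov text generator.
# Illustrative only; not the post's actual implementation.
import random
from collections import defaultdict

class MarkovModel:
    def __init__(self, n):
        # n is the length of the context window used to pick the next token
        self.n = n
        self.transitions = defaultdict(list)

    def train(self, tokens):
        # Record, for every n-token window, the token that follows it.
        # tokens can be words, characters, or any other hashable items.
        for i in range(len(tokens) - self.n):
            context = tuple(tokens[i:i + self.n])
            self.transitions[context].append(tokens[i + self.n])

    def generate(self, length):
        # Start from a random observed context and walk the chain,
        # repeatedly sampling a follower of the last n tokens.
        context = random.choice(list(self.transitions))
        output = list(context)
        for _ in range(length - self.n):
            followers = self.transitions.get(tuple(output[-self.n:]))
            if not followers:
                break  # dead end: this context never appeared in training
            output.append(random.choice(followers))
        return output
```

Trained over word tokens, something like `" ".join(model.generate(12))` produces snippets like the ones below; nothing in the code assumes the tokens are words, so training over characters would work just as well.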
An example from Marco Rubio:
```
significant interests in ensuring that no matter what your parents
```
And another one, because this one had Wiz Khalifa in it and I can’t let that go to waste:
```
institution and i was born with these threats by focusing solely
```