I am starting a series of posts looking at a variety of data processing patterns used to build real-time stream processing applications, the use cases that the patterns relate to, and how you would go about implementing within Wallaroo.
These posts will help you understand the data processing use cases that Wallaroo is best designed to handle and how you can go about building them right away.
I will be looking at the Wallaroo application builder, the part of your application that hooks into the Wallaroo framework, and some the business logic of the pattern.
Preprocessing involves the transformation of the messages in your data pipeline.
A variety of operations can occur, including:
- Removing attributes from a message
- Adding or enhancing attributes in a messaging
- Filtering out entire messages from a pipeline
- Splitting messages to be processed by multiple pipelines
- Combining multiple pipelines into a new pipeline
It’s not unusual to see the preprocessing pattern used in many use cases and combined with other patterns.
One excellent example use case is removing stop words for sentiment analysis.
Sentiment analysis is used by data scientists to look at a piece of text and determine whether the underlying sentiment is positive or negative.
Let’s say you are monitoring tweets that mention Nike. You want to know how people are feeling about Nike. Are the tweets and comments generally positive or negative? “Nike has some great new looks this season” would be a positive sentiment. Each piece of text is given a score, and you can add up the score over some time, say the last hour, to get an overall sentiment score.
We will assume that upstream from our sentiment processor, so that we only have Nike related Tweets. The next step before doing the actual sentiment analysis is to remove stop words. Stop words are words that are meaningless to the underlying sentiment analysis and therefore can and should be removed before the sentiment algorithm runs.
Stop word lists are customized to the specifics of the use case but would include words like “are, aren’t, as, at, be, because, been, has, I’m, it’s, on, some, the, this.”
In our example message above “@Nike has some great new looks this season” would become “@Nike great new looks season” after stop word preprocessing occurred. This isn’t the easiest text for humans to read, but it is just how our sentiment algorithm wants to see it!
Wallaroo Application Builder
ab.new_pipeline( "Sentiment Analysis", wallaroo.TCPSourceConfig(order_host, order_port, order_decoder) ) ab.to_parallel(remove_stop_words) ab.to_stateful(sum_sentiment, Sentiment, "Sentiment") ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encoder)) return ab.build()
ab.new_pipeline( "Sentiment Analysis", wallaroo.TCPSourceConfig(order_host, order_port, order_decoder) )
Definition of the Wallaroo pipeline that includes the pipeline name, “Sentiment Analysis” and the source of the data, in this example we are receiving tweet messages over TCP.
Our first processing step is to call a function called “remove_stop_words.” This function would look at the text of the tweet and strip out any instances of the stop words from our list, then send the processed message to the next step in the pipeline.
In this example, I am using the Wallaroo
to_parallel method to add a computation to the pipeline. The
remove_stop_words function does not need to save anything to state, and the computation can run in parallel across all available workers.
ab.to_stateful(sum_sentiment, Sentiment, "Sentiment")
to_stateful is a non-parallel method that contains a stateful computation. The
sum_sentiment function tallies up the sentiment scores and stores the totals in the “Sentiment” state object. You could use a Python library like NLTK to accomplish the sentiment analysis.
ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encoder)) return ab.build()
to _sink specifies the endpoint of the pipleline and where the messages are sent.
build() specifies to Wallaroo that there are no more steps for this application.
The preprocessing pattern is one of the most commonly used patterns that you will come across when building your streaming application. As you can see, Wallaroo’s lightweight API gives you the ability to construct your data processing pipeline and run whatever application logic you need to power your application.
Did you enjoy this post?
Subscribe to our blog and be updated each time we add a new post.