Tweet Manager
The Tweet Manager is responsible for launching the proper number of queries based on the GPS coordinates and the translated keywords. It selects an appropriate update rate in order to avoid being rate-limited by the Twitter API.
Following some discussion and after some tests, we decided to avoid dynamic reallocation of searchers: once a search on a list of queries has started, the list of queries cannot be changed without stopping the whole search. This is a design choice to avoid heavy load on the Akka Scheduler and to avoid duplicated data as much as possible. Moreover, due to the design choices of the clustering, dynamic reallocation is not required.
Source code:
The TweetManager can receive the following messages (see the CommunicationModels):
- AddQueries: Adds a list of queries zipped with their corresponding listener actors, but does not start them.
- Start: Starts all the queries previously added. Note that once the Manager is started, no more queries can be added.
- Stop: Stops the search (no more tweets will be requested from Twitter). The Manager is put back in its initial state, which allows a new search to be started: the list of queries is wiped, and new queries can be added with the AddQueries message.
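The lifecycle above can be modelled as a small state machine. The sketch below (in Java, for illustration only; the real TweetManager is an Akka actor in the project source, and all names here are assumptions) shows the intended behaviour: queries are only accepted before Start, and Stop wipes the list and returns to the initial state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, non-Akka model of the TweetManager protocol described above.
class TweetManagerModel {
    private final List<String> queries = new ArrayList<>();
    private boolean running = false;

    // AddQueries: only accepted before Start, as described above.
    void addQueries(List<String> qs) {
        if (!running) queries.addAll(qs);
    }

    // Start: freezes the query list and starts the search.
    void start() { running = true; }

    // Stop: back to the initial state; the query list is wiped.
    void stop() {
        running = false;
        queries.clear();
    }

    int queryCount() { return queries.size(); }
    boolean isRunning() { return running; }
}
```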
Using the Search API, the Tweet Searcher is responsible for fetching the latest tweets. It polls the API periodically in order to retrieve the newly indexed tweets.
The Twitter Search API is a great choice since it allows narrow filtering based on location (delimiting search squares), as well as logical operators inside the query string. After some tests, it appeared that the API keeps track of the previous request sent, so that subsequent calls deliver only the newly indexed tweets. This is very convenient, since we do not need to keep track of tweet IDs to avoid duplicates.
The API also offers a way to get more tweets by looking further back into the past (one call to the API returns at most 100 tweets, but it is possible, in a second request, to ask for older data). Since we do not want to go too far into the past, and since we assumed we were already getting enough data without this option, we discarded it for now. Some older versions of the code did contain this option, though.
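The incremental behaviour described above corresponds to the Search API's real `since_id` parameter: each request passes the highest tweet ID seen so far, and only newer tweets come back. A minimal sketch of that polling pattern (the fetch function is a hypothetical stand-in for the actual HTTP call):

```java
import java.util.List;
import java.util.function.LongFunction;

// Sketch of incremental polling; `fetch` stands in for the real HTTP request.
class IncrementalPoller {
    private long sinceId = 0; // highest tweet id seen so far

    // fetch takes a since_id and returns the ids of newly indexed tweets.
    List<Long> poll(LongFunction<List<Long>> fetch) {
        List<Long> fresh = fetch.apply(sinceId);
        for (long id : fresh) sinceId = Math.max(sinceId, id);
        return fresh;
    }
}
```

Looking further back into the past, as mentioned above, would use the companion `max_id` parameter instead.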
Source code:
The maximum size of a request is 1000 characters, which already allows some narrow filtering.
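A request combining keywords and a search square might be assembled as follows. The parameter names `q`, `geocode`, and `count` are Twitter Search API v1.1 parameters; the builder class itself is illustrative and the 1000-character limit is the one mentioned above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative builder for a Search API request URL.
class SearchQueryBuilder {
    static final int MAX_QUERY_LENGTH = 1000; // limit mentioned above

    // Join keywords with OR and attach a circular geocode filter.
    static String build(List<String> keywords, double lat, double lon, String radius) {
        String q = String.join(" OR ", keywords);
        if (q.length() > MAX_QUERY_LENGTH)
            throw new IllegalArgumentException("query exceeds " + MAX_QUERY_LENGTH + " characters");
        return "https://api.twitter.com/1.1/search/tweets.json"
             + "?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
             + "&geocode=" + lat + "," + lon + "," + radius
             + "&count=100"; // one call returns at most 100 tweets
    }
}
```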
Using the Streaming API, the Tweet Streamer is responsible for fetching brand-new tweets that are not yet indexed by the Search API. Due to the 1% limit, the number of tweets found by both searches is small, but it still has to be considered. The amount of interesting data from the Streamer may be smaller, because the Streaming API cannot filter on both keywords and location at once, unlike the Search API, and only a few streams are allowed per connection.

To collect interesting tweets, we therefore launch the Tweet Streamer with all the location parameters. For each received tweet, we look at its location and the words it contains, find the correct listener from these, and forward the tweet. This yields a lot of data, including many tweets that are in the desired location but do not contain the keywords we are interested in. To discard non-relevant tweets quickly, each time the Tweet Streamer receives a ping, it discards all the non-relevant tweets in its InputStream until it finds a correct one to return. A correct tweet is one containing at least one of the desired keywords, since we already know it lies in one of the desired squares.
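The routing step described above can be sketched as a keyword lookup: the stream is already filtered by location upstream, so a tweet only needs to match one keyword list to find its listener. All names below are illustrative, not from the project source.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative router from streamed tweets to their keyword listeners.
class StreamRouter {
    // listener name -> its keyword list
    private final Map<String, List<String>> listeners;

    StreamRouter(Map<String, List<String>> listeners) { this.listeners = listeners; }

    // Return the first listener whose keywords appear in the tweet text,
    // or empty if the tweet is in the right area but off-topic (discarded).
    Optional<String> route(String tweetText) {
        String lower = tweetText.toLowerCase();
        return listeners.entrySet().stream()
            .filter(e -> e.getValue().stream().anyMatch(k -> lower.contains(k.toLowerCase())))
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```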
To ensure that the TweetManager works under heavy load, a "scaler test" was written. It launches about 400 Akka actors querying Twitter in turn. During the test, we collected about 100'000 tweets in 12 hours mentioning Switzerland, banks, and Swiss banks, all over the USA. This is not "big data" as such, but it is enough for the clustering and the data display. Moreover, this test only used the Searcher; using the Streaming API as well could significantly increase the number of tweets, and searching on more trending topics could yield far better results.
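Covering the USA with a grid of search squares, as the scaler test does, amounts to splitting a bounding box into rows and columns and searching from each cell's centre. A minimal sketch, assuming an approximate bounding box for the continental U.S. (the real coordinates used by the test are in the project source):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative grid construction over a geographic bounding box.
class GridBuilder {
    // Return the (lat, lon) centres of a rows x cols grid over the box.
    static List<double[]> centers(double south, double west, double north, double east,
                                  int rows, int cols) {
        List<double[]> out = new ArrayList<>();
        double dLat = (north - south) / rows, dLon = (east - west) / cols;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                out.add(new double[] { south + (r + 0.5) * dLat, west + (c + 0.5) * dLon });
        return out;
    }
}
```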
Source code:
Detailed outcome
- Three keyword lists: one about banks, one about Switzerland, one about Swiss banks.
- Each request covered the U.S. with a grid of 144 squares (12 rows, 12 columns).
- There was therefore a total of 432 actors running in parallel, using about 400 MB of memory.
- In about 12 hours:
  - The run on banks got 51'000 tweets.
  - The run on Switzerland got 23'500 tweets.
  - The run on Swiss banks got 5'700 tweets.
  - Total: about 80'200 tweets.
- Since a tweet is small, all of this represents a total of about 600 MB. In terms of size it is not that big, but 80'000 tweets is not a small sample either. Moreover, we could get much bigger result sets by using more (and more trending) topics and by running for more than 12 hours. Since 400 MB of memory is modest and the CPU consumption was low, we could definitely scale this up.
- The maximum number of tweets we could get with this technique, on only three lists of keywords and over the same time period, is 2'160'000. This figure is highly improbable, since it assumes that every search on a square for a specific keyword list returns the full 100 tweets.
- With the Bank / Swiss bank / Switzerland keywords, we obtained about 3.7% of that maximum.
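The upper bound above follows from simple arithmetic: 432 searchers times at most 100 tweets per request times the number of requests per searcher over the run. Matching the stated total of 2'160'000 implies 50 requests per searcher in 12 hours (one request roughly every 14.4 minutes); that request count is inferred here, not stated in the original.

```java
// Back-of-the-envelope check of the stated upper bound.
class UpperBound {
    static long maxTweets(int searchers, int tweetsPerRequest, int requests) {
        return (long) searchers * tweetsPerRequest * requests;
    }
}
```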