VGIsink

This is a proof-of-concept implementation of the approach presented by the VGIscience Privacy Project. It shows how to read data from a social media service (Twitter, in this setup) and store it in a local database using the cardinality estimator HyperLogLog, so that distinct counts are kept rather than the raw posts.
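To illustrate the estimator mentioned above, here is a minimal, self-contained HyperLogLog sketch in Python. It is not the project's implementation; the class name `HyperLogLogSketch`, the use of SHA-1 as the hash, and the default precision are assumptions made for this example only.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative HyperLogLog with 2**precision registers (hypothetical helper)."""

    def __init__(self, precision: int = 12) -> None:
        if not 4 <= precision <= 16:
            raise ValueError("precision must be between 4 and 16")
        self.precision = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.precision
        index = h >> tail_bits                    # leading bits select a register
        tail = h & ((1 << tail_bits) - 1)         # remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = tail_bits - tail.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLogSketch") -> None:
        # The union of two sketches is the register-wise maximum.
        if other.precision != self.precision:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        m = self.m
        alpha = {16: 0.673, 32: 0.697, 64: 0.709}.get(m, 0.7213 / (1 + 1.079 / m))
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

Adding the same item twice leaves the registers unchanged, so only distinct items influence the estimate, and the sketch never stores the items themselves. Two sketches of equal precision can be merged without access to the original data.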

The core part is a FastAPI-based Python application ("sink-API") that provides a RESTful API for managing stream rules and receiving posts; example requests are listed in the Copypaste section.

API-Documentation: Swagger UI, ReDoc

Requirements

Usage

This project includes a Compose file that can run the sink-API, the sinkdb database, a pgAdmin instance and a sinkmap visualization tool.

  1. Clone this project
  2. Copy and adjust .env.example to .env
  3. Run docker compose up; the database container will auto-create tables for existing stream rules
  4. Create (more) rules to curate a filtered stream of Twitter posts
  5. Use the included stream reader script stream-twitter.sh to read that filtered stream and post it to the sink-API
  6. Go to http://localhost:8081 to see a map with the cardinality visualization

Advanced usage

To run this in a production environment, you have to create reverse proxies to the published container ports on your machine, e.g. using Apache or nginx. Don't forget to adjust .env accordingly.

You can also run the stream reader script as a systemd service unit:

Development

The recommended development environment is VS Code with its Python extension. Dependency management with Poetry is experimental.

Copypaste

sink-API:

http get localhost:8888

http get localhost:8888/rules
http post localhost:8888/rules value="tag_$((999 + RANDOM % 8999))" tag="test" precision="4"
http delete localhost:8888/rules/1430817260

http delete localhost:8888/rules/$(http post localhost:8888/rules value="tag_$RANDOM" tag="test" | jq -r '.data[0].id')

for id in $(http get localhost:8888/rules | jq -r '.data[].id' | grep 1438); do http delete "localhost:8888/rules/$id"; done

for t in flood fire storm; do http post localhost:8888/rules value="$t has:geo" tag="$t disaster"; done
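For scripting the same rule creation outside the shell, the loop above could be sketched in Python as follows. This assumes the sink-API accepts a JSON body with `value` and `tag`, which is what the HTTPie `key=value` arguments send; `build_rule_request` is a hypothetical helper, and the request is only built here, not sent.

```python
import json
import urllib.request

SINK_API = "http://localhost:8888"  # base URL used in the commands above


def build_rule_request(value: str, tag: str, base_url: str = SINK_API) -> urllib.request.Request:
    """Build (but do not send) a POST /rules request with a JSON body."""
    body = json.dumps({"value": value, "tag": tag}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rules",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One request per topic, mirroring the shell loop.
rule_requests = [
    build_rule_request(f"{topic} has:geo", f"{topic} disaster")
    for topic in ("flood", "fire", "storm")
]
# To actually send one: urllib.request.urlopen(rule_requests[0])
```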

# get number of areas for each rule
for id in $(http localhost:8888/rules | jq -r '.data[].id'); do echo -n "$id "; http "localhost:8888/rules/$id" | jq '.features | length '; done

Twitter stream:

BEARER_TOKEN=$(gopass show -o www/twitter.com/mlvgi/bearer_token)

http --stream GET "https://api.twitter.com/2/tweets/search/stream?tweet.fields=author_id,created_at,geo,id,text&place.fields=full_name,geo,id,name,place_type" "Authorization: Bearer $BEARER_TOKEN" | jq

http --stream get "https://api.twitter.com/2/tweets/search/stream?tweet.fields=author_id,created_at,geo,id,text&place.fields=full_name,geo,id,name,place_type" "Authorization: Bearer $BEARER_TOKEN" | while IFS= read -r line; do printf '%s\n' "$line" | http post localhost:8888/lbsn/posts; done

( "POST /lbsn/posts HTTP/1.1" 422 Unprocessable Entity )

http --stream get "https://api.twitter.com/2/tweets/search/stream?tweet.fields=author_id,created_at,geo,id,text&place.fields=full_name,geo,id,name,place_type" "Authorization: Bearer $BEARER_TOKEN" | jq -c --unbuffered '{id: .data.id, geo: .data.geo, rules: .matching_rules}' | while IFS= read -r line; do printf '%s\n' "$line" | http post localhost:8888/lbsn/posts; done

( https://stackoverflow.com/questions/69364553 )
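The reshaping done by the jq filter above (keep only the post id, its geo object, and the matching rules) could also be written in Python. The following is a sketch under the assumption that each stream line is one JSON object with `data` and `matching_rules` keys, as the filter implies; `to_sink_payload` is a hypothetical name, and skipping blank lines is a precaution against keep-alive newlines in the stream.

```python
import json
from typing import Optional


def to_sink_payload(line: str) -> Optional[dict]:
    """Reduce one raw stream line to the fields posted to /lbsn/posts."""
    line = line.strip()
    if not line:
        # Blank line (e.g. a keep-alive): nothing to post.
        return None
    tweet = json.loads(line)
    data = tweet.get("data") or {}
    return {
        "id": data.get("id"),
        "geo": data.get("geo"),
        "rules": tweet.get("matching_rules"),
    }
```

Missing fields become `None`, which serializes to JSON `null`, matching what the jq filter emits for absent keys.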

Database connection:

source ~/projects/ml/vgisink/.env && PGPASSWORD="$SINKDB_PASS" psql -U "$SINKDB_USER" -h "$SINKDB_HOST" -d "$SINKDB_NAME" -p "$SINKDB_PORT"

Example post:

http post localhost:8888/posts <~/playground/example-tweet.json