December 22, 2018

Semi-Structured Data Models


Learn about data modeling with streaming data in this article by James Lee, a passionate software wizard working at one of the top Silicon Valley-based start-ups specializing in big data analysis.

Data can be broadly classified as structured, semi-structured, or unstructured. In this article, we'll discuss semi-structured data. The World Wide Web (WWW) is the largest information source today, and the data model behind the web is best described as semi-structured. Most semi-structured data is tree-structured.

Let’s take the example of a web page:

 
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
<ul>
<li>List Item 1</li>
<li>List Item 2</li>
<li>List Item 3</li>
</ul>
<footer><center>Copyright: Hands-on exercise</center></footer>
</body>
</html>

Here, an HTML document must be wrapped inside the <html> tag, and all the visible content goes inside the <body> tag. The preceding snippet renders a complete HTML page. Notice that every element is delimited by a begin and end marker: <html> and </html>, <body> and </body>, <head> and </head>, <li> and </li>, and so on. The second thing to notice is that, unlike in a relational structure, there can be multiple list items and multiple paragraphs, and any single document may contain a different number of them. This means that while the data object has some structure, the structure is flexible. This is the hallmark of a semi-structured data model.
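To make this tree structure visible, here is a minimal sketch using only Python's standard library html.parser: it walks a fragment of the page above and prints each tag indented by its nesting depth. The TreePrinter class and the shortened page string are illustrative, not part of the original example.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Records every opening tag, indented by its depth in the tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

# A shortened version of the page from the snippet above
page = """<html><head><title>Page Title</title></head>
<body><h1>This is a Heading</h1>
<ul><li>List Item 1</li><li>List Item 2</li></ul></body></html>"""

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

The output shows the nesting directly: html at the root, head and body as its children, and the two li elements as repeated children of ul, which is exactly the flexible, tree-shaped structure described above.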

eXtensible Markup Language (XML) is another well-known standard for representing data. XML can be seen as a generalization of HTML, in which the elements (the begin and end markers within angle brackets) can be any string. Let's take an example of an XML document:

 
<?xml version="1.0" encoding="UTF-8"?>
<depression_patients>
  <patient>
    <name>John Doe</name>
    <bill>$1115.95</bill>
    <session>2</session>
    <level>6</level>
    <telecom>
      <system value="phone"/>
      <value value="(03) 5555 6473"/>
      <use value="work"/>
      <rank value="1"/>
    </telecom>
  </patient>
  <patient>
    <name>Ola Nordmann</name>
    <bill>$7000.95</bill>
    <session>3</session>
    <level>9</level>
    <telecom>
      <system value="phone"/>
      <value value="(03) 5555 6473"/>
      <use value="work"/>
      <rank value="1"/>
    </telecom>
  </patient>
  <patient>
    <name>Gummy Bear</name>
    <bill>$43.95</bill>
    <session>4</session>
    <level>90</level>
    <telecom>
      <system value="phone"/>
      <value value="(03) 5555 6473"/>
      <use value="work"/>
      <rank value="1"/>
    </telecom>
  </patient>
  <patient>
    <name>Reshika Adhikari</name>
    <bill>$4343.50</bill>
    <session>6</session>
    <level>3</level>
    <telecom>
      <system value="phone"/>
      <value value="(03) 5555 6473"/>
      <use value="work"/>
      <rank value="1"/>
    </telecom>
  </patient>
  <patient>
    <name>Yoshmi Mukhiya</name>
    <bill>$634.95</bill>
    <session>7</session>
    <level>0</level>
    <telecom>
      <system value="phone"/>
      <value value="(03) 5555 6473"/>
      <use value="work"/>
      <rank value="1"/>
    </telecom>
  </patient>
</depression_patients>
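Documents like this can be parsed with Python's standard library xml.etree.ElementTree. The following is a minimal sketch over a shortened fragment of the document above; note that the second patient deliberately omits the telecom block to illustrate that, in a semi-structured model, sibling records need not share the same children.

```python
import xml.etree.ElementTree as ET

# A shortened fragment of the patient document above; the second
# patient has no <telecom> child, which is legal in this model.
xml_doc = """<depression_patients>
  <patient>
    <name>John Doe</name>
    <bill>$1115.95</bill>
    <telecom><system value="phone"/><use value="work"/></telecom>
  </patient>
  <patient>
    <name>Ola Nordmann</name>
    <bill>$7000.95</bill>
  </patient>
</depression_patients>"""

root = ET.fromstring(xml_doc)
for patient in root.findall("patient"):
    name = patient.findtext("name")
    has_telecom = patient.find("telecom") is not None
    print(name, "telecom:", has_telecom)
```

Each patient element is a subtree, and querying it with paths such as patient/telecom/system mirrors walking the tree from the root.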

Another popular format, used by services such as Facebook and Twitter, is JavaScript Object Notation (JSON). Let's consider the following example, which is exactly the same data represented previously as XML:

 
{
  "depression_patients": {
    "patient": [
      {
        "name": "John Doe",
        "bill": "$1115.95",
        "session": "2",
        "level": "6",
        "telecom": {
          "system": {
            "_value": "phone"
          },
          "value": {
            "_value": "(03) 5555 6473"
          },
          "use": {
            "_value": "work"
          },
          "rank": {
            "_value": "1"
          }
        }
      },
      {
        "name": "Ola Nordmann",
        "bill": "$7000.95",
        "session": "3",
        "level": "9",
        "telecom": {
          "system": {
            "_value": "phone"
          },
          "value": {
            "_value": "(03) 5555 6473"
          },
          "use": {
            "_value": "work"
          },
          "rank": {
            "_value": "1"
          }
        }
      },
      {
        "name": "Gummy Bear",
        "bill": "$43.95",
        "session": "4",
        "level": "90",
        "telecom": {
          "system": {
            "_value": "phone"
          },
          "value": {
            "_value": "(03) 5555 6473"
          },
          "use": {
            "_value": "work"
          },
          "rank": {
            "_value": "1"
          }
        }
      },
      {
        "name": "Reshika Adhikari",
        "bill": "$4343.50",
        "session": "6",
        "level": "3",
        "telecom": {
          "system": {
            "_value": "phone"
          },
          "value": {
            "_value": "(03) 5555 6473"
          },
          "use": {
            "_value": "work"
          },
          "rank": {
            "_value": "1"
          }
        }
      },
      {
        "name": "Yoshmi Mukhiya",
        "bill": "$634.95",
        "session": "7",
        "level": "0",
        "telecom": {
          "system": {
            "_value": "phone"
          },
          "value": {
            "_value": "(03) 5555 6473"
          },
          "use": {
            "_value": "work"
          },
          "rank": {
            "_value": "1"
          }
        }
      }
    ]
  }
}

JSON is plain text, which makes it easy to send and receive over any server; hence it is used as a data-interchange format by many programming languages. In the preceding snippet, we have a similar nested structure: lists contain objects, which in turn consist of key-value pairs, where the keys are atomic property names and the values are their data. One way to generalize all these different forms of semi-structured data is to model them as trees:


Tree data structure of a simple web page showing semi-structured data
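The tree view also describes how we navigate JSON in code: each path from the root is a branch. The following minimal sketch loads a shortened fragment of the JSON document above with Python's standard library json module and walks one branch per patient.

```python
import json

# A shortened fragment of the JSON document shown above
doc = json.loads("""{
  "depression_patients": {
    "patient": [
      {"name": "John Doe", "bill": "$1115.95",
       "telecom": {"system": {"_value": "phone"}}},
      {"name": "Ola Nordmann", "bill": "$7000.95",
       "telecom": {"system": {"_value": "phone"}}}
    ]
  }
}""")

# Each chain of keys traces one branch of the tree from the root.
for patient in doc["depression_patients"]["patient"]:
    print(patient["name"], patient["telecom"]["system"]["_value"])
```

Accessing doc["depression_patients"]["patient"][0]["name"] is a root-to-leaf walk, which is why the tree model generalizes HTML, XML, and JSON alike.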

Exploring the semi-structured data model of JSON data

Let's consume the Twitter API (https://apps.twitter.com/) to download some tweets and construct a semi-structured data model. We'll use the Tweepy library (https://www.tweepy.org/) to download the tweets.

Installing Python and the Tweepy library

Start up your virtual machine and open the Terminal. You should have pip installed by now; if you do not, please follow the tutorial at https://pip.pypa.io/en/latest/installing/. Then install Tweepy by running the following command:


$ pip install tweepy

Once you have that installed, the next step is getting set up with the Twitter API. 

Getting authorization credentials to access the Twitter API

Authorization credentials can be obtained by creating a new app on the Twitter developer platform (https://apps.twitter.com/). Twitter permits downloading a user's most recent 3,200 tweets (https://developer.twitter.com/en/docs/api-reference-index) in JSON format. The script to download the tweets can be found at https://github.com/PacktPublishing/Hands-On-Big-Data-Modeling. Run it with python tweet.py.

After creating an app on the site, you should be able to get access to keys and tokens similar to the following screenshots: 


Twitter app credentials page


The Python script uses the REST API provided by Twitter to download the data and save it to our destination. You just need to populate the script with your own keys and run it:

 
#!/usr/bin/env python
# encoding: utf-8

import tweepy
import json

# Twitter API credentials
consumer_key = "CONSUMER_KEY_GOES_HERE"
consumer_secret = "CONSUMER_SECRET_GOES_HERE"
access_key = "ACCESS_KEY_GOES_HERE"
access_secret = "ACCESS_SECRET_GOES_HERE"

def get_all_tweets_by_handle(screen_name):

    # Twitter only allows access to a user's most recent 3,200 tweets
    # with this method

    # authorize with Twitter and initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make the initial request for the most recent tweets
    # (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save the most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet, less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are none left to grab
    while len(new_tweets) > 0:

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200,
                                       max_id=oldest)

        # save the most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet, less one
        oldest = alltweets[-1].id - 1

        print("...%s tweets downloaded so far" % len(alltweets))

    # write the tweet objects to JSON
    print("Writing tweet objects to JSON, please wait...")
    with open('tweet.json', 'w') as outfile:
        for status in alltweets:
            json.dump(status._json, outfile, sort_keys=True, indent=4)

    print("Done")

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets_by_handle("@IBM")

Make sure to replace the credential placeholders with your own keys' values. Also, set the username of the account whose tweets you want to download. In this case, we download 3,200 tweets from IBM:

 
if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets_by_handle("@IBM")

You can run the script using the following command:

 
python tweet.py

Once you run the command, you will be able to see the following output:


Running the Python script to download the tweets

Here’s an example response obtained by the script:

 
{
"contributors":null,
"coordinates":null,
"created_at":"Fri May 11 03:47:28 +0000 2018",
"entities":{
"hashtags":[
{
"indices":[
41,
46
],
"text":"IBMQ"
},
{
"indices":[
93,
101
],
"text":"quantum"
}
],
"symbols":[
 
],
"urls":[
{
"display_url":"bitly.com/2G5pSDD",
"expanded_url":"http://bitly.com/2G5pSDD",
"indices":[
69,
92
],
"url":"https://t.co/JBSmlMqj5C"
}
],
"user_mentions":[
{
"id":487624974,
"id_str":"487624974",
"indices":[
1,
9
],
"name":"NC State University",
"screen_name":"NCState"
}
]
},
"favorite_count":86,
"favorited":false,
"geo":null,
"id":994786038236286976,
"id_str":"994786038236286976",
"in_reply_to_screen_name":null,
"in_reply_to_status_id":null,
"in_reply_to_status_id_str":null,
"in_reply_to_user_id":null,
"in_reply_to_user_id_str":null,
"is_quote_status":false,
"lang":"en",
"place":null,
"possibly_sensitive":false,
"retweet_count":37,
"retweeted":false,
"source":"<a href=\"https://www.sprinklr.com\" rel=\"nofollow\">Sprinklr</a>",
"text":".@NCState to become 1st university-based #IBMQ hub in North America: https://t.co/JBSmlMqj5C #quantum",
"truncated":false,
"user":{
"contributors_enabled":false,
"created_at":"Wed Jan 14 20:41:57 +0000 2009",
"default_profile":false,
"default_profile_image":false,
"description":"Official IBM Twitter account. Follows the IBM Social Computing Guidelines.",
"entities":{
"description":{
"urls":[
 
]
},
"url":{
"urls":[
{
"display_url":"ibm.com",
"expanded_url":"https://www.ibm.com",
"indices":[
0,
23
],
"url":"https://t.co/4ZyG9FgkYe"
}
]
}
},
"favourites_count":3496,
"follow_request_sent":false,
"followers_count":443681,
"following":false,
"friends_count":6364,
"geo_enabled":false,
"has_extended_profile":false,
"id":18994444,
"id_str":"18994444",
"is_translation_enabled":false,
"is_translator":false,
"lang":"en",
"listed_count":5432,
"location":"Armonk, New York",
"name":"IBM",
"notifications":false,
"profile_background_color":"FFFFFF",
"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000152426467/Viwc1IvP.jpeg",
"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000152426467/Viwc1IvP.jpeg",
"profile_background_tile":false,
"profile_banner_url":"https://pbs.twimg.com/profile_banners/18994444/1516234232",
"profile_image_url":"http://pbs.twimg.com/profile_images/925460050994270208/2IXQLOut_normal.jpg",
"profile_image_url_https":"https://pbs.twimg.com/profile_images/925460050994270208/2IXQLOut_normal.jpg",
"profile_link_color":"2FC2EF",
"profile_sidebar_border_color":"000000",
"profile_sidebar_fill_color":"252429",
"profile_text_color":"666666",
"profile_use_background_image":false,
"protected":false,
"screen_name":"IBM",
"statuses_count":10821,
"time_zone":"Eastern Time (US & Canada)",
"translator_type":"none",
"url":"https://t.co/4ZyG9FgkYe",
"utc_offset":-14400,
"verified":true
}
}
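Once the tweets are on disk, we can navigate them exactly like the earlier JSON examples. The following minimal sketch pulls a few fields out of a tweet object shaped like the response above; only fields that actually appear in that response are assumed, and the shortened tweet_json string is illustrative.

```python
import json

# A shortened tweet object, keeping only a few fields from the
# response shown above
tweet_json = """{
  "created_at": "Fri May 11 03:47:28 +0000 2018",
  "favorite_count": 86,
  "retweet_count": 37,
  "entities": {"hashtags": [{"text": "IBMQ"}, {"text": "quantum"}]},
  "user": {"screen_name": "IBM", "followers_count": 443681}
}"""

tweet = json.loads(tweet_json)

# Hashtags live two levels down the tree, inside a repeated list --
# the same flexible nesting we saw in the XML and JSON examples.
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
print(tweet["user"]["screen_name"], hashtags, tweet["favorite_count"])
```

Note how the tweet mixes scalar fields (favorite_count), nested objects (user), and variable-length lists (entities.hashtags) in one record, which is precisely what makes it semi-structured.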

Let’s examine the semi-structured data from the code base. From the GitHub link, open Ch05/JSON/twitter.json. Follow these steps: 

Step-1. Open a Terminal shell by clicking on the square black box at the top-left of the screen.

Step-2. Change into the directory where the Twitter data was downloaded, assuming you ran the preceding scripts and have the twitter.json file in the data folder inside Downloads:

 
cd Downloads/data

Step-3. To look at the JSON file, you can use the more command:

 
more twitter.json

Step-4. The JSON file is quite long, and only a part of it is shown here.

The contents of the file are difficult to read because everything is packed together. In this section, we will write a Python script to print the schema of the JSON file:

 
#!/usr/bin/env python

# prints the schema of a JSON file
# usage: json_schema.py jsonfilename

import sys
import json

filename = sys.argv[1]
data = []
i = 0
try:
    with open(filename) as fname:
        for line in fname:
            i = i + 1
            if i % 2 == 1:  # skip every other line, since it is empty
                data.append(json.loads(line))
except Exception:
    print("Error while reading file:", filename)
    print("Check if the file complies with the JSON format")
    print("\nUsage: json_schema.py jsonfilename")
    sys.exit()

# inner depth-first search: recurse into nested dictionaries,
# indenting each level
def dfs_inner(x, indent):
    try:
        for key, value in x.items():
            print(indent + key)
            try:
                dfs_inner(value, indent + "....")
            except Exception:
                pass
    except Exception:
        pass

# outer depth-first search over the top-level keys
indent = " "
for key, value in data[0].items():
    print(key)
    dfs_inner(value, indent + "....")

Save the snippet into a file named json_schema.py. We can then print the schema of the JSON file using the following command:

 
./json_schema.py tweet.json | more

You should get the following result: 


Schema of the JSON file

If you found this article interesting, you can explore Hands-On Big Data Modeling to learn how to create efficient data models for big data problems. The book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements.

James Lee
James Lee is a passionate software wizard working at one of the top Silicon Valley-based startups specializing in big data analysis. In the past, he has worked at big companies such as Google and Amazon. In his day job, he works with big data technologies such as Cassandra and Elasticsearch, and he is an absolute Docker technology geek and IntelliJ IDEA lover with a strong focus on efficiency and simplicity.
