Powerup your Notion Safari Extension

May 4, 2020 - 7 minute read - Category: Tech - Tags: Automation , Productivity

In the previous post, we’ve discussed how to build a customized safari-to-notion Automator. It can save the page title and URL to the designated notion page in an automatic fashion. We’ve learned how to write AppleScripts, use Mac automation, and use Notion-py API to achieve this goal.

However, there are two limitations: 1) we could not save the page content in Notion, and 2) we could not process multiple pages in batch, which could significantly improve our productivity in some cases. And we are trying to solve these two challenges in this article.

Saving web contents

The AppleScript API

As you may recall, in the last post, I shared how to use the Script Editor Library to inspect the provided APIs of an Application, and the name attribute was used to fetch the title of the web page. Similarly, in this case, you may also find the text attribute for a webpage that stores the text on that web page.

Library

Let’s modify the AppleScript code and obtain the text value:

tell application "Safari"
		set theURL to URL of current tab of window 1
		set theTitle to name of current tab of window 1
		set theText to text of current tab of window 1 # NEW! store the text 
	end tell
return {theChoice, theTitle, theURL, theText}

And you can also have access to the running results in the Run AppleScript action in Automator:

Show-output

Finally, don’t forget to change our Python code as well to save the information into Notion.

from notion.client import NotionClient 
import sys 
from notion.block import * # NEW! import all block classes 

page_selection = sys.argv[1] 
tab_name = sys.argv[2]
url_string = sys.argv[3]
text_string = sys.argv[4] # NEW! save the fourth parameter as text 

# ... skipped ... 
row = cv.collection.add_row() 						
row.set_property("Name", tab_name)  
row.set_property("URL", url_string)
row.children.add_new(TextBlock, title=text_string) # NEW! add the text string 

You might notice the syntax for the last line is a bit different. This is because we are not setting the attribute for that row, but adding content to the associated notion pages.

The Newspaper3k API

However, the used text attribute from AppleScript will include unnecessary information like page header and footer which we don’t need. Most of the time, you only need the main text body on the website. How to achieve that?

Parsing the webpage source HTML could be a solution. However, it’s time-consuming to develop an algorithm that can robustly handle various webpages. Luckily, the Newspaper3k library in Python provides us with an elegant and neat solution. It can automatically extract the main text from the given webpage. The following exemplar code fetches the textual content in the url, and can produce a variety of meta-information:

from newspaper import Article

url = 'xx'
article = Article(url)
article.download()
article.parse()

article.text
article.top_image
article.authors
article.publish_date

It even support simple NLP analysis with the help of Natural Language Toolkit(NLTK):

article.nlp()
article.keywords
article.summary

Let’s use an exemplar article to check the actual performance. As shown in the figure below, the newspaper3k API does not only provide us with better text extraction results, but also generates some helpful meta-information that could be useful for future reference.

Compare

article = Article(url_string)
article.download(); article.parse(); article.nlp()

row.set_property("keywords", ','.join(article.keywords))
row.set_property("Summary", article.summary)
row.children.add_new(TextBlock, title=article.text)
# You can also try using the following code and compare the 
# difference between 
# for text in article.text.split("\n"):
#    row.children.add_new(TextBlock, title=text)

Batch Processing for all tabs

To further improve productivity, you may want to save and process multiple pages in batch. Let’s try with processing all tabs in a Safari window in one click.

The for loop in AppleScript could be helpful in the scenario. In the beginning, we set up a new list called returnList to store the returned variables. The repeat with command iterates through all the tabs in the current window (window 1) and store the required data. At the end of each iteration, we append the fetched data to the returnList. And finally, we return the returnList to the python script.

# Selection part is the same

set returnList to {theChoice}
tell application "Safari"
    set tabCount to number of tabs in window 1
    repeat with x from 1 to tabCount
        set theURL to URL of tab x of window 1
        set theTitle to name of tab x of window 1
        set theText to text of tab x of window 1
        set returnList to returnList & {theTitle, theURL, theText}
    end repeat
end tell
return returnList

In the Python script, the modification is similar. Instead of hard-coded values, we use a for loop to iterate all the passed variables:

for idx in range(1, len(sys.argv), 3): 
    #iterate three elements at a time
    tab_name = sys.argv[idx+1]
    url_string = sys.argv[idx+2]
    text_string = sys.argv[idx+3]
		
    # Processing the page 

The final AppleScript and Python script

Congratulations! You’ve built a better version of the Notion Safari Extension. It’s even more powerful than the official extension in Chrome or Firefox. And I believe you can create more powerful functions with the tools I’ve shared with you.

on run {input, parameters}
	
	set pageChoices to {"PageA", "PageB"}
	set theChoice to (choose from list pageChoices with prompt "Please select the target page" default items {"PageA"}) as string
	
	set returnList to {theChoice}
	tell application "Safari"
		set tabCount to number of tabs in window 1
		repeat with x from 1 to tabCount
			set theURL to URL of tab x of window 1
			set theTitle to name of tab x of window 1
			set theText to text of tab x of window 1
			set returnList to returnList & {theTitle, theURL, theText}
		end repeat
	end tell
	
	return returnList
end run

# safari2notion.py 
from notion.client import NotionClient 
import sys 
from notion.block import *
from newspaper import Article

# Parsing inputs from the command line
page_selection = sys.argv[1] 

page_selection_lookup_table = {
  'PageA': 'https://www.notion.so/xxx',
  'PageB': 'https://www.notion.so/xxx',
}

if __name__ == "__main__":
    token = 'your token'
    client = NotionClient(token_v2=token)
    for idx in range(1, len(sys.argv), 3): 
        #iterate three elements at a time
        tab_name = sys.argv[idx+1]
        url_string = sys.argv[idx+2]
        text_string = sys.argv[idx+3]
        
        page_link = page_selection_lookup_table.get(page_selection, '')
            # It is more robust if the page_selection is not a key in 
            # the dictionary.        
        
        article = Article(url_string)
        article.download(); article.parse(); article.nlp()
        
        if page_link != '':
            cv = client.get_collection_view(page_link) 
            row = cv.collection.add_row() 						
            row.set_property("Name", tab_name)  
            row.set_property("URL", url)
            # Remember to configure your database to add the Name and URL 
            property 
            row.set_property("keywords", ','.join(article.keywords))
            row.set_property("Summary", article.summary)
            for text in article.text.split("\n"):
               row.children.add_new(TextBlock, title=text)

Reference

Newspaper3k: Article scraping & curation

Acknowledgement

Thanks to Rosen for precious advice during editing.