NLP – DXC Tech Talks

What about YouTube? Can I import the transcripts from the CTO Tech Talks?

YouTube generates closed captions (CC) for all videos, unless the person uploading has disabled them.

The caption transcript can be accessed via the Chrome developer console:


  1. Open the required video on YouTube and check that it has CC enabled
  2. Open the Chrome console (F12)
  3. Paste the following into the console
    if(yt.config_.TTS_URL.length) window.location.href=yt.config_.TTS_URL+"&kind=asr&fmt=srv1&lang=en"
  4. The YouTube tab reloads and shows only the CC transcript, for example:
    <text start="2.179" dur="7.84">
    hello everyone welcome<font color="#E5E5E5"> to the DXE</font>
    <text start="7.2" dur="6.99">
    <font color="#E5E5E5">technology Tech Talk really excited</font>
    <text start="10.019" dur="7.081">
    <font color="#E5E5E5">today</font><font color="#CCCCCC"> I think we've got a great agenda</font>
    <text start="14.19" dur="5.19">
    item today which<font color="#CCCCCC"> is what why is digital</font>
  5. The output contains a lot of extra information: start times, durations and font colours. To remove it, use Notepad++ to replace anything within < > with a space
    1. Ctrl+H
    2. Find what: <[^>]+>
    3. Replace with: a single space
    4. Ensure Regular expression is checked
    5. Replace All
    6. Next, remove the blank lines and line breaks: Ctrl+H
    7. Find what: \n
    8. Replace All
    9. Find what: \r
    10. Replace All
    11. Finally, replace double spaces with single spaces, repeating until none remain. You will now have a single line of text
  6. Create a “paper” node within the graph to pin this transcript against
    CREATE (n:paper { name: 'DXCTechTalkMay122017', title: 'DXC TechTalk May 12 2017' });
  7. Paste the cleaned text into the standard import query and UPDATE THE PAPER NAME
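
The Notepad++ clean-up in step 5 can also be scripted. The following is a minimal Python sketch, not part of the original workflow: it applies the same `<[^>]+>` pattern, then collapses line breaks and repeated spaces into single spaces. The function name `clean_transcript` is hypothetical, and decoding HTML entities with `html.unescape` is an added assumption about what the caption text may contain.

```python
import html
import re

# Same pattern as the Notepad++ "Find what" field in step 5.
TAG_RE = re.compile(r"<[^>]+>")


def clean_transcript(raw: str) -> str:
    """Strip caption markup and collapse the result to one line of text."""
    no_tags = TAG_RE.sub(" ", raw)       # replace every <...> with a space
    decoded = html.unescape(no_tags)     # assumption: captions may contain entities such as &#39;
    return " ".join(decoded.split())     # drops \r, \n and repeated spaces


# First lines of the sample output shown in step 4.
sample = (
    '<text start="2.179" dur="7.84">\n'
    'hello everyone welcome<font color="#E5E5E5"> to the DXE</font>\n'
    '<text start="7.2" dur="6.99">\n'
    '<font color="#E5E5E5">technology Tech Talk really excited</font>\n'
)

print(clean_transcript(sample))
# hello everyone welcome to the DXE technology Tech Talk really excited
```

The resulting single line is what gets pasted into the import query in step 7.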

View the results

Top words

(warning) With the initial import query, this top words list is calculated across all imported transcripts, not just the transcript of interest.

“business” 48
“information” 46
“new” 46
“digital” 42
“users” 36
“user” 32
“must” 30
“organizations” 30
“experience” 28

Adjacency graph

match (p:paper{name:"DXCTechTalkMay122017"})--(w:Word)
return p, w
order by w.count desc limit 25


(warning) I discovered a limitation in the current import query: the word count also needs to be recorded in the context of the individual transcript. The current import query adds +1 to the overall usage of the word across all imported transcripts. That is useful, but I need to be able to query both counts to understand the overall usage AND the usage within an individual transcript.
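
To make the distinction concrete, here is a small Python sketch of the two kinds of count, outside of Neo4j. It is only an illustration of the idea: the transcript names and word lists are made up, and `add_transcript` is a hypothetical helper, not something from the import query.

```python
from collections import Counter, defaultdict

overall = Counter()                    # word -> count across all transcripts
per_transcript = defaultdict(Counter)  # transcript name -> word -> count


def add_transcript(name: str, words: list[str]) -> None:
    """Record each word against both the overall and the per-transcript count."""
    for word in words:
        overall[word] += 1
        per_transcript[name][word] += 1


# Hypothetical data, for illustration only.
add_transcript("keynote", ["business", "digital", "business"])
add_transcript("techtalk", ["business", "users", "users"])

print(overall["business"])                     # 3
print(per_transcript["techtalk"]["business"])  # 1
```

In the graph, the overall count corresponds to a property on the Word node, and the per-transcript count to a property on the relationship between the paper node and the Word node.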


Updated query

WITH split(tolower("TEXT HERE"), " ") AS words
WITH [w in words WHERE NOT w IN ["1","2","3","4","5","6","7","8","9","0","?",".",",","a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"]] AS text
UNWIND range(0, size(text)-2) AS i
MERGE (w1:Word {name: text[i]})
    ON CREATE SET w1.count = 1 ON MATCH SET w1.count = w1.count +1
MERGE (w2:Word {name: text[i+1]})
    ON CREATE SET w2.count = 1 ON MATCH SET w2.count = w2.count +1
MERGE (w1)-[r:NEXT]->(w2)
    ON CREATE SET r.count = 1
    ON MATCH SET r.count = r.count+1
//Create a relationship to the paper node (assumes the node exists)
//AND add the relationship count for this transcript.
WITH w1,w2
MATCH (p:paper) WHERE p.name = "DXCTechTalkMay122017"
MERGE (p)-[r1:INCLUDED]->(w1)
    ON CREATE SET r1.count = 1
    ON MATCH SET r1.count = r1.count+1
MERGE (p)-[r2:INCLUDED]->(w2)
    ON CREATE SET r2.count = 1
    ON MATCH SET r2.count = r2.count+1;
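
As a reading aid, the pair walk in the query above can be mirrored in a few lines of Python. This is a sketch under assumptions, not a replacement for the Cypher: `STOP_WORDS` is a shortened, hypothetical subset of the stop-word list, and `walk_pairs` is a made-up name. It follows the same steps: lower-case the text, split on single spaces, drop stop words, then visit each adjacent pair and increment both words and the NEXT relationship.

```python
from collections import Counter

# Shortened, hypothetical subset of the stop-word list used in the query.
STOP_WORDS = {"a", "and", "is", "the", "to"}


def walk_pairs(text: str) -> tuple[Counter, Counter]:
    """Mirror the UNWIND loop: count both words of every adjacent pair."""
    words = [w for w in text.lower().split(" ") if w not in STOP_WORDS]
    word_counts = Counter()  # plays the role of Word.count
    next_counts = Counter()  # plays the role of the NEXT relationship count
    # Cypher's range(0, size(text)-2) is inclusive, so this visits the same indexes.
    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i + 1]
        word_counts[w1] += 1
        word_counts[w2] += 1
        next_counts[(w1, w2)] += 1
    return word_counts, next_counts


counts, pairs = walk_pairs("the digital users change")
print(dict(counts))  # {'digital': 1, 'users': 2, 'change': 1}
print(dict(pairs))   # {('digital', 'users'): 1, ('users', 'change'): 1}
```

Read this way, a word in the middle of the text is incremented twice per occurrence, once as the second word of a pair and once as the first. That may be why the counts reported in this post are all even numbers, so they are probably best treated as relative rankings rather than exact occurrence counts.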


The top words query is updated to use the word count from this transcript only, not the full word dataset, which might include multiple transcripts.

//Added the r variable to the relationship pattern so that properties on the relationship can be queried, returned or used in the ORDER BY
MATCH (p:paper{name:"DXCTechTalkMay122017"})-[r]-(w:Word)
RETURN p.name, w.name, r.count
ORDER BY r.count DESC


Updated top words

“DXCTechTalkMay122017” “users” 36
“DXCTechTalkMay122017” “user” 32
“DXCTechTalkMay122017” “organizations” 30
“DXCTechTalkMay122017” “business” 26 (was 48 in the original count, because the dataset also includes the LC keynote)
“DXCTechTalkMay122017” “workers” 24
“DXCTechTalkMay122017” “new” 24
“DXCTechTalkMay122017” “information” 24
“DXCTechTalkMay122017” “workplace” 22
“DXCTechTalkMay122017” “work” 22
“DXCTechTalkMay122017” “experience” 22
“DXCTechTalkMay122017” “systems” 22
“DXCTechTalkMay122017” “change” 20
“DXCTechTalkMay122017” “task” 20

